When Slower Isn’t Truer:
Inverse Scaling Law of Truthfulness in Multimodal Reasoning

TL;DR: We introduce TruthfulVQA, the first large-scale multimodal truthfulness benchmark built with rigorous human-in-the-loop verification, and use it to evaluate how truthfully multimodal large language models (MLLMs) answer visual questions.

Overview

Reasoning models have attracted increasing attention for their ability to tackle complex tasks, embodying the System II (slow thinking) paradigm in contrast to System I (fast, intuitive responses). Yet a key question remains: Does slower reasoning necessarily lead to more truthful answers?

Our findings suggest otherwise. We conduct the first systematic study of the inverse scaling law in slow-thinking paradigms for multimodal reasoning. We find that when confronted with incomplete or misleading visual inputs, slow-thinking models are more prone to fabricating plausible yet false details in order to justify their flawed lines of reasoning.

To analyze this behavior, we construct a 5,000-sample hierarchical prompt dataset annotated by 50 human participants. The prompts progressively increase in complexity, revealing a consistent pattern: slower reasoning models tend to follow depth-first search (DFS) thinking, persistently exploring flawed premises, while faster chat models favor breadth-first search (BFS) inference, showing greater caution under uncertainty.

These findings reveal a critical vulnerability of reasoning models: although they excel in structured domains such as math, their DFS-style reasoning becomes fragile when confronted with ambiguous, multimodal inputs.
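To make the DFS/BFS analogy concrete, here is a toy sketch (ours, not from the paper): depth-first exploration commits to the first premise it encounters and keeps elaborating it, whereas breadth-first exploration surveys all sibling premises before going deeper. The hypothesis tree is invented for illustration.

```python
from collections import deque

# Toy hypothesis tree for an ambiguous image: each node is a candidate
# premise, and children are elaborations of that premise.
tree = {
    "image": ["premise A (flawed)", "premise B", "premise C"],
    "premise A (flawed)": ["detail A1", "detail A2"],
    "detail A1": ["detail A1a"],
}

def dfs_trace(root):
    """Slow-thinking style: dives deep into the first premise found."""
    stack, order = [root], []
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(reversed(tree.get(node, [])))
    return order

def bfs_trace(root):
    """Fast-thinking style: enumerates all sibling premises first."""
    queue, order = deque([root]), []
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(tree.get(node, []))
    return order

print(dfs_trace("image"))  # elaborates the flawed premise A fully before B, C
print(bfs_trace("image"))  # surveys A, B, C before elaborating any of them
```

If premise A is flawed, the DFS trace spends most of its budget justifying it, mirroring the fabrication behavior we observe in slow-thinking models.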

Contributions:

  • First and foremost, human-in-the-loop. We introduce TruthfulVQA, the first large-scale multimodal truthfulness benchmark built with rigorous human-in-the-loop verification. Over 5k visually misleading images were collected and annotated by a team of 50 professional annotators, and, critically, each sample was independently reviewed by five of them, ensuring evaluation robustness beyond automated metrics.
  • Hierarchical prompt design for deep truthfulness evaluation. We propose a three-tier hierarchy of human-written prompts that systematically probes models across increasing levels of reasoning complexity, enabling finer-grained diagnosis of dishonesty and misinformation vulnerabilities in MLLMs.
  • Revealing slow vs. fast thinking pitfalls in multimodal reasoning. We conduct the first comprehensive analysis comparing depth-first (slow thinking) reasoning models and breadth-first (fast thinking) chat models under adversarial visual conditions. Our findings show that reasoning models, despite their strengths in math and code, are significantly more prone to factual dishonesty in complex visual tasks, as evidenced by Figure 1.
  • TruthfulJudge: a reliable, human-centric evaluation pipeline. We design TruthfulJudge, a reliable evaluation pipeline that mitigates the pitfalls of AI-as-judge setups. Our methodology emphasizes in-depth human involvement to prevent feedback loops of bias and error, ensuring faithful assessment of multimodal model truthfulness. The resulting specialized judge model is well-calibrated (ECE = 0.12), self-consistent, agrees strongly with human annotators (Cohen's κ = 0.79), and achieves 88.4% judge accuracy.
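For readers unfamiliar with the two judge-quality metrics above, here is a minimal, generic sketch of how Expected Calibration Error (ECE) and Cohen's κ are computed. This is not the paper's evaluation code; the inputs are illustrative (per-sample judge confidences with correctness flags, and two raters' labels).

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard ECE: bin predictions by confidence, compare each bin's
    mean confidence to its empirical accuracy, weight by bin size."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece

def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for chance agreement."""
    a, b = np.asarray(rater_a), np.asarray(rater_b)
    p_o = (a == b).mean()  # observed agreement
    p_e = sum((a == lab).mean() * (b == lab).mean()  # chance agreement
              for lab in np.union1d(a, b))
    return (p_o - p_e) / (1 - p_e)
```

Applied to a judge's verdicts against human labels, metrics of this kind are what the ECE = 0.12 and κ = 0.79 figures above refer to.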

Dataset

Dataset Composition

Each entry of TruthfulVQA undergoes rigorous multi-stage quality assurance, verified by at least five independent annotators. The dataset construction involves the following core components:

  • Human Annotation and Quality Assurance Team
    We collaborated with a professional annotation team of 50 members, implementing a multi-stage quality assurance protocol to ensure data quality and consistency.
  • Web-sourced and Generated Images
    The dataset includes 5,000 images: 4,500 manually curated from webpages to contain misleading or factually incorrect content, and 500 produced by image-generation models. Each image was accepted only after independent confirmation by five annotators.
  • Hierarchical Prompt Evaluation
    Each image is paired with three levels of prompts (Levels 1, 2, and 3) of increasing informational depth, containing ambiguous, deceptive, or subtly manipulated content. This structure enables fine-grained evaluation of a model's ability to resist dishonesty and maintain factual accuracy.
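To make the composition concrete, one TruthfulVQA entry might be represented roughly as in the sketch below. The field names and example prompts are our own hypothetical illustration, not the released data format.

```python
from dataclasses import dataclass

@dataclass
class TruthfulVQASample:
    """Hypothetical schema for one benchmark entry (illustrative only)."""
    image_path: str          # web-sourced or model-generated misleading image
    category: str            # one of the S1-S8 deception categories below
    prompts: dict[str, str]  # three prompt levels of increasing depth
    reference_answer: str    # ground truth confirmed by five annotators

sample = TruthfulVQASample(
    image_path="images/optical_illusion_0001.jpg",
    category="S1. Eye Illusion",
    prompts={
        "L1": "What do you see in this image?",
        "L2": "The two lines clearly differ in length. Which is longer?",
        "L3": "Explain the physical mechanism that makes one line longer.",
    },
    reference_answer="The two lines are the same length; it is an illusion.",
)
```

The category field ranges over the eight-part deception taxonomy that follows.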

S1. Eye Illusion

  • Perceptual Multiplicity
  • Optical Illusions

S2. Perspective Restriction

  • Cropped or Partial Observation
  • Unconventional Shooting Angles
  • Shape Distortion Caused by Natural Phenomena

S3. Contextual Bias

  • Background Interference
  • Manipulation of Emotional Atmosphere

S4. Information Hiding

  • Visual Information Distortion
  • Blurring / Low-Resolution Processing
  • Concealed Features and Information Masking

S5. Feature Forgery

  • Physical Feature Manipulation
  • Natural Feature Confusion
  • Insertion of Fake Objects or Elements

S6. Fictional Information

  • Fabricated Flags and Maps
  • Imaginary Species

S7. Imitative Falsehood

  • Misapplied Reasoning Transfer
  • Reinforcement of Semantic Bias
  • Inheritance of False Information

S8. Information Forgery

  • Factual Fabrication
  • Image Manipulation
  • False Reasoning

Leaderboard

Columns: overall score, CAI, score at each prompt level (L1–L3), and score on each taxonomy category (S1–S8).

Rank MLLMs Overall CAI L1 L2 L3 S1 S2 S3 S4 S5 S6 S7 S8
1 InternVL-2.5-38B 77.97% 0.2552 91.18% 76.48% 70.96% 81.06% 85.80% 79.22% 70.08% 85.69% 81.75% 80.22% 71.52%
2 Qwen2.5-VL-72B 77.33% 0.2537 88.24% 76.48% 67.27% 79.59% 84.59% 77.68% 69.96% 83.53% 79.91% 74.23% 69.25%
3 InternVL-2.5-78B 76.48% 0.3129 91.16% 73.70% 64.75% 79.43% 85.80% 76.64% 67.49% 82.10% 79.59% 68.19% 67.21%
4 Claude-3.7-Sonnet-Thinking 76.38% 0.1459 82.58% 75.60% 70.96% 77.08% 82.38% 75.48% 69.96% 78.52% 78.08% 80.22% 66.34%
5 InternVL-3-38B 75.60% 0.3553 91.67% 73.19% 61.94% 76.24% 82.52% 76.75% 65.82% 81.66% 78.73% 68.14% 70.59%
6 GPT-4.1 75.22% 0.3045 89.10% 72.60% 63.94% 79.28% 83.92% 74.93% 68.41% 77.99% 77.92% 67.87% 67.97%
7 Gemini-2.5-Pro 75.04% 0.3228 89.85% 72.19% 63.08% 76.50% 83.00% 74.93% 65.02% 74.56% 76.30% 74.82% 72.92%
8 InternVL-3-14B 74.14% 0.3532 89.97% 71.52% 60.92% 75.25% 81.51% 75.04% 64.90% 80.47% 78.51% 65.88% 66.92%
9 GPT-4o 73.79% 0.1680 80.07% 74.11% 67.18% 81.06% 77.23% 72.73% 67.43% 81.21% 81.75% 63.40% 60.69%
10 o4-mini 73.45% 0.2716 85.68% 70.82% 63.86% 77.39% 81.32% 70.64% 65.95% 78.76% 78.67% 69.33% 60.86%
11 Qwen2-VL-72B 72.77% 0.3823 88.59% 71.76% 57.96% 77.39% 84.93% 74.55% 61.88% 78.15% 75.70% 61.89% 62.26%
12 Claude-3.5-Sonnet 72.53% 0.2879 83.64% 72.68% 61.27% 76.30% 77.23% 66.41% 69.52% 76.60% 69.98% 76.87% 64.24%
13 InternVL-2.5-8B 71.53% 0.3498 87.62% 67.58% 59.39% 76.24% 79.35% 73.50% 60.64% 76.80% 75.00% 61.51% 64.59%
14 Gemini-2.0-Flash 71.32% 0.3928 90.10% 66.11% 57.75% 75.09% 79.39% 71.47% 62.99% 76.84% 73.97% 64.85% 61.33%
15 Qwen2.5-VL-7B 70.22% 0.4132 87.40% 68.25% 55.00% 73.21% 78.86% 72.68% 59.59% 73.13% 72.89% 62.16% 65.52%
16 Gemini-2.0-Flash-Thinking 70.06% 0.3997 88.48% 65.39% 56.31% 70.91% 78.62% 69.98% 56.69% 71.42% 69.11% 71.27% 69.25%
17 InternVL-3-8B 68.82% 0.4552 88.71% 64.95% 52.78% 73.21% 78.72% 72.68% 61.81% 76.03% 67.55% 57.20% 58.12%
18 Gemma-3-12B 68.56% 0.3287 80.80% 68.46% 56.41% 71.74% 75.54% 72.68% 60.21% 74.19% 65.28% 55.69% 69.48%
19 Claude-3.7-Sonnet 68.42% 0.3375 82.27% 66.21% 56.78% 69.07% 74.05% 68.61% 62.31% 73.26% 65.12% 70.03% 61.39%
20 Llama-4-Maverick 67.74% 0.4382 88.28% 61.68% 53.24% 72.00% 75.25% 68.33% 56.26% 72.60% 67.12% 66.58% 59.06%
21 LLaVA-v1.6-vicuna-13B 66.80% 0.3791 80.27% 67.48% 52.65% 65.99% 77.13% 69.05% 60.39% 74.15% 69.44% 53.15% 60.28%
22 InternVL-3-9B 65.92% 0.5171 88.87% 60.37% 48.51% 69.18% 74.43% 66.36% 57.50% 70.40% 71.87% 54.39% 59.11%
23 Llama-4-Scout 65.25% 0.5680 88.38% 62.07% 45.29% 69.02% 71.26% 63.50% 56.57% 66.00% 67.49% 67.82% 57.60%
24 LLaVA-v1.6-mistral-7B 64.60% 0.4824 81.83% 64.94% 47.02% 65.62% 72.08% 68.11% 54.78% 71.38% 72.19% 47.71% 60.34%
25 Gemma-3-27B 63.62% 0.4617 82.68% 59.47% 48.71% 67.92% 71.88% 69.93% 52.87% 66.16% 60.80% 50.84% 65.52%
26 Skywork-R1V-38B 61.84% 0.4071 79.76% 55.88% 49.86% 64.99% 69.04% 64.32% 51.26% 65.96% 61.99% 55.90% 57.31%
27 Mulberry-Qwen 60.25% 0.4659 78.88% 55.74% 46.12% 62.32% 68.08% 65.70% 57.06% 55.81% 63.39% 57.04% 52.13%
28 QVQ-72B 57.14% 0.6188 82.11% 50.59% 38.71% 61.43% 69.43% 59.70% 50.34% 53.12% 57.72% 54.07% 49.62%
29 InternVL-2.5-4B 56.16% 0.6965 87.60% 45.26% 35.61% 60.86% 66.30% 56.51% 50.77% 51.57% 61.34% 54.02% 46.65%
30 LlamaV-o1 55.68% 0.7045 83.32% 49.04% 34.67% 58.35% 64.66% 60.86% 47.87% 60.42% 51.40% 46.04% 52.01%
31 Qwen2-VL-2B-GRPO-8k 55.39% 0.6084 81.89% 46.08% 38.20% 59.50% 57.63% 56.29% 50.03% 57.52% 58.21% 51.48% 50.38%
32 Llama-3.2-CoT 55.15% 0.6964 83.60% 46.98% 34.84% 58.45% 65.53% 57.12% 48.61% 58.09% 51.40% 51.91% 46.30%
33 Kimi-VL-A3B-Instruct 54.71% 0.6888 78.46% 47.73% 37.92% 61.17% 64.61% 48.38% 47.93% 51.28% 64.09% 58.44% 39.37%
34 Qwen2.5-VL-3B 54.47% 0.7767 83.11% 49.22% 31.06% 59.50% 67.02% 54.48% 50.40% 47.74% 54.54% 49.97% 51.89%
35 Qwen2-VL-7B 53.95% 0.5808 70.98% 55.49% 35.37% 60.02% 66.35% 53.82% 45.28% 55.24% 52.43% 44.37% 50.67%
36 LLaVA-v1.6-vicuna-7B 53.45% 0.6641 80.27% 45.16% 34.92% 56.78% 65.29% 56.84% 50.52% 52.87% 54.32% 43.67% 45.08%
37 Llama-3.2-Vision 53.30% 0.7970 83.82% 43.17% 29.70% 51.90% 63.76% 51.51% 49.26% 57.91% 53.87% 47.47% 46.95%
38 InternVL-3-2B 53.00% 0.7885 86.44% 42.06% 30.49% 58.77% 63.70% 53.11% 48.30% 47.57% 55.67% 50.08% 46.01%
39 Qwen2-VL-2B 52.82% 0.6367 78.99% 43.98% 35.47% 55.78% 57.87% 53.49% 48.92% 55.77% 53.40% 45.71% 49.21%
40 LLaVA-1.5 51.92% 0.5647 75.72% 42.79% 37.24% 59.92% 62.49% 55.03% 50.22% 52.87% 52.43% 37.90% 41.76%
41 Kimi-VL-A3B-Thinking 50.45% 1.0908 72.62% 41.92% 36.78% 56.57% 56.62% 48.32% 43.00% 42.11% 58.10% 60.49% 38.26%
42 Gemma-3-4B 48.80% 0.7148 72.78% 43.69% 29.92% 53.69% 58.26% 51.29% 43.18% 50.10% 47.68% 39.08% 44.44%
43 InternVL-2.5-2B 46.69% 0.9077 84.30% 32.71% 23.04% 50.60% 58.06% 46.29% 45.40% 41.79% 46.27% 44.80% 39.72%
44 Mulberry-Llama 46.59% 0.3669 55.77% 46.84% 37.16% 51.65% 53.92% 51.07% 46.14% 41.95% 49.14% 43.61% 34.89%
45 InternVL-3-1B 43.79% 0.9901 82.13% 29.91% 19.31% 49.03% 54.98% 44.04% 44.23% 36.81% 39.90% 41.56% 40.30%
46 InternVL-2.5-1B 39.84% 1.0777 81.25% 23.40% 14.84% 45.11% 50.89% 36.72% 40.59% 34.57% 34.13% 40.59% 36.05%

* Results may differ slightly from those reported in the paper due to re-evaluation.
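The inverse-scaling pattern from the Overview is visible directly in the table: every model's score drops from L1 to L3. A minimal sketch quantifying that drop, with values copied from ranks 1 and 4 above:

```python
# L1 -> L3 score drop for two leaderboard rows (values from the table above).
rows = {
    "InternVL-2.5-38B": {"L1": 91.18, "L2": 76.48, "L3": 70.96},
    "Claude-3.7-Sonnet-Thinking": {"L1": 82.58, "L2": 75.60, "L3": 70.96},
}

for model, acc in rows.items():
    print(f"{model}: L1->L3 drop = {acc['L1'] - acc['L3']:.2f} points")
# InternVL-2.5-38B: L1->L3 drop = 20.22 points
# Claude-3.7-Sonnet-Thinking: L1->L3 drop = 11.62 points
```

Claude-3.7-Sonnet-Thinking combines the smaller drop with the lowest CAI among the top entries, consistent with CAI tracking degradation across prompt levels (lower is better); we defer the exact definition to the paper.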