Overview
Reasoning models have attracted increasing attention for their ability to tackle complex tasks, embodying the System II (slow thinking) paradigm in contrast to System I (fast, intuitive responses). Yet a key question remains: Does slower reasoning necessarily lead to more truthful answers?
Our findings suggest otherwise. We conduct the first systematic study of the inverse scaling law in slow-thinking paradigms for multimodal reasoning. We find that when confronted with incomplete or misleading visual inputs, slow-thinking models are more prone to fabricating plausible yet false details to justify dishonest reasoning.
To analyze this behavior, we construct a 5,000-sample hierarchical prompt dataset annotated by 50 human participants. The prompts progressively increase in complexity, revealing a consistent pattern: slower reasoning models tend to follow depth-first search (DFS) thinking, persistently exploring flawed premises, while faster chat models favor breadth-first search (BFS) inference, showing greater caution under uncertainty.
These findings reveal a critical vulnerability of reasoning models: while effective in structured domains such as math, their DFS-style reasoning becomes fragile when confronted with ambiguous, multimodal inputs.
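To make the search analogy concrete, the toy sketch below contrasts depth-first and breadth-first exploration over a small hypothesis tree of candidate image interpretations. The tree, node labels, and traversal code are illustrative assumptions for exposition only; they are not part of our evaluation pipeline.

```python
from collections import deque

# Toy "hypothesis tree": the root question branches into candidate interpretations,
# and each interpretation can be elaborated with further (possibly fabricated) detail.
# All labels are illustrative only.
HYPOTHESIS_TREE = {
    "what does the image show?": [
        "commit to interpretation A (a flawed premise)",
        "consider interpretation B",
        "acknowledge uncertainty / insufficient visual evidence",
    ],
    "commit to interpretation A (a flawed premise)": [
        "add supporting detail A1",
        "add supporting detail A2",
    ],
    "add supporting detail A1": ["fabricate detail A1a to stay consistent"],
}

def dfs_trace(root):
    """DFS-style (slow thinking): elaborate one branch deeply before visiting siblings."""
    stack, order = [root], []
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(reversed(HYPOTHESIS_TREE.get(node, [])))
    return order

def bfs_trace(root):
    """BFS-style (fast thinking): survey all sibling interpretations before going deeper."""
    queue, order = deque([root]), []
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(HYPOTHESIS_TREE.get(node, []))
    return order

root = "what does the image show?"
print("DFS order:", dfs_trace(root))  # dives into the flawed premise; the "uncertainty" option comes last
print("BFS order:", bfs_trace(root))  # reaches the "uncertainty" option before any deep elaboration
```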
Contributions:
- Human-in-the-loop, first and foremost. We introduce TruthfulVQA, the first large-scale multimodal truthfulness benchmark built with rigorous human-in-the-loop verification. Over 5,000 visually misleading images were collected and annotated by a team of 50 professional annotators, and, critically, each sample was independently reviewed by five of them, ensuring evaluation robustness beyond automated metrics.
- Hierarchical prompt design for deep truthfulness evaluation. We propose a three-tier hierarchy of human-written prompts that systematically probes models across increasing levels of reasoning complexity, enabling finer-grained diagnosis of dishonesty and misinformation vulnerabilities in MLLMs.
- Revealing slow vs. fast thinking pitfalls in multimodal reasoning. We conduct the first comprehensive analysis comparing depth-first (slow thinking) reasoning models and breadth-first (fast thinking) chat models under adversarial visual conditions. Our findings show that reasoning models, despite their strengths in math and code, are significantly more prone to factual dishonesty in complex visual tasks, as evidenced by Figure 1.
- TruthfulJudge: a reliable human-centric evaluation pipeline. We design TruthfulJudge, an evaluation pipeline that mitigates the pitfalls of AI-as-judge setups. Our methodology emphasizes in-depth human involvement to prevent feedback loops of bias and error, ensuring faithful assessment of multimodal model truthfulness. The resulting judge model is well calibrated (ECE = 0.12), self-consistent, and achieves high inter-annotator agreement (Cohen's κ = 0.79), reaching 88.4% judge accuracy.
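For reference, the reported judge statistics follow their standard definitions. The minimal sketch below shows how expected calibration error, Cohen's κ, and judge accuracy could be computed from judge outputs; the array names and toy values are assumptions for illustration, not data from our pipeline.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: sample-weighted gap between accuracy and mean confidence per confidence bin."""
    confidences = np.asarray(confidences, float)
    correct = np.asarray(correct, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of samples in this bin
    return ece

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    a, b = np.asarray(labels_a), np.asarray(labels_b)
    p_o = (a == b).mean()  # observed agreement
    p_e = sum((a == c).mean() * (b == c).mean() for c in np.union1d(a, b))  # chance agreement
    return (p_o - p_e) / (1.0 - p_e)

# Hypothetical judge outputs: verdict confidences, whether each verdict matched the
# human gold label, and two annotators' labels for the same model responses.
judge_conf  = [0.9, 0.7, 0.85, 0.6]
judge_right = [1, 1, 0, 1]
annotator_1 = ["truthful", "dishonest", "truthful", "truthful"]
annotator_2 = ["truthful", "dishonest", "dishonest", "truthful"]

print("ECE:", expected_calibration_error(judge_conf, judge_right))
print("Cohen's kappa:", cohens_kappa(annotator_1, annotator_2))
print("Judge accuracy:", np.mean(judge_right))
```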
Dataset
Dataset Composition
Each entry of TruthfulVQA undergoes rigorous multi-stage quality assurance, verified by at least five independent annotators. The dataset construction involves the following core components:
- Human Annotation and Quality Assurance Team: We collaborated with a professional annotation team of 50 members, implementing a multi-stage quality assurance protocol to ensure data quality and consistency.
- Human-crafted Images from Webpages: The dataset includes 5,000 images: 4,500 manually curated from the web to contain misleading or factually incorrect content, and 500 generated by image-generation models. Each image was accepted only after independent confirmation by five annotators.
- Hierarchical Prompts Evaluation: Each image is paired with three levels of prompts (Level 1, 2, and 3), designed to offer increasing informational depth and to contain ambiguous, deceptive, or subtly manipulated content. This structure enables fine-grained evaluation of a model's ability to resist dishonesty and maintain factual accuracy (an illustrative sample-entry sketch follows the scenario taxonomy below).
Scenario Taxonomy
S1. Eye Illusion
- Perceptual Multiplicity
- Optical Illusions
S2. Perspective Restriction
- Cropped or Partial Observation
- Unconventional Shooting Angles
- Shape Distortion Caused by Natural Phenomena
S3. Contextual Bias
- Background Interference
- Manipulation of Emotional Atmosphere
S4. Information Hiding
- Visual Information Distortion
- Blurring / Low-Resolution Processing
- Concealed Features and Information Masking
S5. Feature Forgery
- Physical Feature Manipulation
- Natural Feature Confusion
- Insertion of Fake Objects or Elements
S6. Fictional Information
- Fabricated Flags and Maps
- Imaginary Species
S7. Imitative Falsehood
- Misapplied Reasoning Transfer
- Reinforcement of Semantic Bias
- Inheritance of False Information
S8. Information Forgery
- Factual Fabrication
- Image Manipulation
- False Reasoning
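For concreteness, one benchmark entry pairs a single image with a category from the taxonomy above and the three hierarchical prompt levels. The field names and prompt wording in the sketch below are an illustrative assumption about the schema, not the released data format.

```python
# Illustrative TruthfulVQA-style entry (field names and wording are hypothetical).
sample_entry = {
    "image_id": "tvqa_000123",                 # hypothetical identifier
    "image_path": "images/tvqa_000123.jpg",    # hypothetical path
    "category": "S2. Perspective Restriction", # one of the S1-S8 categories above
    "subcategory": "Cropped or Partial Observation",
    "prompts": {
        "level_1": "Describe what you see in this image.",
        "level_2": "The object at the edge of the frame is partly cut off. What is it?",
        "level_3": "The image clearly shows <misleading premise>; explain in detail why this is the case.",
    },
    "quality_assurance": {
        "independent_reviewers": 5,            # each entry confirmed by five annotators
    },
}

# A model answers all three prompt levels for the same image, so its responses
# can be scored per level (L1-L3) and per scenario category (S1-S8).
for level, prompt in sample_entry["prompts"].items():
    print(level, "->", prompt)
```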
Leaderboard
| Rank | MLLMs | Overall | CAI | L1 | L2 | L3 | S1 | S2 | S3 | S4 | S5 | S6 | S7 | S8 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | InternVL-2.5-38B | 77.97% | 0.2552 | 91.18% | 76.48% | 70.96% | 81.06% | 85.80% | 79.22% | 70.08% | 85.69% | 81.75% | 80.22% | 71.52% |
| 2 | Qwen2.5-VL-72B | 77.33% | 0.2537 | 88.24% | 76.48% | 67.27% | 79.59% | 84.59% | 77.68% | 69.96% | 83.53% | 79.91% | 74.23% | 69.25% |
| 3 | InternVL-2.5-78B | 76.48% | 0.3129 | 91.16% | 73.70% | 64.75% | 79.43% | 85.80% | 76.64% | 67.49% | 82.10% | 79.59% | 68.19% | 67.21% |
| 4 | Claude-3.7-Sonnet-Thinking | 76.38% | 0.1459 | 82.58% | 75.60% | 70.96% | 77.08% | 82.38% | 75.48% | 69.96% | 78.52% | 78.08% | 80.22% | 66.34% |
| 5 | InternVL-3-38B | 75.60% | 0.3553 | 91.67% | 73.19% | 61.94% | 76.24% | 82.52% | 76.75% | 65.82% | 81.66% | 78.73% | 68.14% | 70.59% |
| 6 | GPT-4.1 | 75.22% | 0.3045 | 89.10% | 72.60% | 63.94% | 79.28% | 83.92% | 74.93% | 68.41% | 77.99% | 77.92% | 67.87% | 67.97% |
| 7 | Gemini-2.5-Pro | 75.04% | 0.3228 | 89.85% | 72.19% | 63.08% | 76.50% | 83.00% | 74.93% | 65.02% | 74.56% | 76.30% | 74.82% | 72.92% |
| 8 | InternVL-3-14B | 74.14% | 0.3532 | 89.97% | 71.52% | 60.92% | 75.25% | 81.51% | 75.04% | 64.90% | 80.47% | 78.51% | 65.88% | 66.92% |
| 9 | GPT-4o | 73.79% | 0.1680 | 80.07% | 74.11% | 67.18% | 81.06% | 77.23% | 72.73% | 67.43% | 81.21% | 81.75% | 63.40% | 60.69% |
| 10 | o4-mini | 73.45% | 0.2716 | 85.68% | 70.82% | 63.86% | 77.39% | 81.32% | 70.64% | 65.95% | 78.76% | 78.67% | 69.33% | 60.86% |
| 11 | Qwen2-VL-72B | 72.77% | 0.3823 | 88.59% | 71.76% | 57.96% | 77.39% | 84.93% | 74.55% | 61.88% | 78.15% | 75.70% | 61.89% | 62.26% |
| 12 | Claude-3.5-Sonnet | 72.53% | 0.2879 | 83.64% | 72.68% | 61.27% | 76.30% | 77.23% | 66.41% | 69.52% | 76.60% | 69.98% | 76.87% | 64.24% |
| 13 | InternVL-2.5-8B | 71.53% | 0.3498 | 87.62% | 67.58% | 59.39% | 76.24% | 79.35% | 73.50% | 60.64% | 76.80% | 75.00% | 61.51% | 64.59% |
| 14 | Gemini-2.0-Flash | 71.32% | 0.3928 | 90.10% | 66.11% | 57.75% | 75.09% | 79.39% | 71.47% | 62.99% | 76.84% | 73.97% | 64.85% | 61.33% |
| 15 | Qwen2.5-VL-7B | 70.22% | 0.4132 | 87.40% | 68.25% | 55.00% | 73.21% | 78.86% | 72.68% | 59.59% | 73.13% | 72.89% | 62.16% | 65.52% |
| 16 | Gemini-2.0-Flash-Thinking | 70.06% | 0.3997 | 88.48% | 65.39% | 56.31% | 70.91% | 78.62% | 69.98% | 56.69% | 71.42% | 69.11% | 71.27% | 69.25% |
| 17 | InternVL-3-8B | 68.82% | 0.4552 | 88.71% | 64.95% | 52.78% | 73.21% | 78.72% | 72.68% | 61.81% | 76.03% | 67.55% | 57.20% | 58.12% |
| 18 | Gemma-3-12B | 68.56% | 0.3287 | 80.80% | 68.46% | 56.41% | 71.74% | 75.54% | 72.68% | 60.21% | 74.19% | 65.28% | 55.69% | 69.48% |
| 19 | Claude-3.7-Sonnet | 68.42% | 0.3375 | 82.27% | 66.21% | 56.78% | 69.07% | 74.05% | 68.61% | 62.31% | 73.26% | 65.12% | 70.03% | 61.39% |
| 20 | Llama-4-Maverick | 67.74% | 0.4382 | 88.28% | 61.68% | 53.24% | 72.00% | 75.25% | 68.33% | 56.26% | 72.60% | 67.12% | 66.58% | 59.06% |
| 21 | LLaVA-v1.6-vicuna-13B | 66.80% | 0.3791 | 80.27% | 67.48% | 52.65% | 65.99% | 77.13% | 69.05% | 60.39% | 74.15% | 69.44% | 53.15% | 60.28% |
| 22 | InternVL-3-9B | 65.92% | 0.5171 | 88.87% | 60.37% | 48.51% | 69.18% | 74.43% | 66.36% | 57.50% | 70.40% | 71.87% | 54.39% | 59.11% |
| 23 | Llama-4-Scout | 65.25% | 0.5680 | 88.38% | 62.07% | 45.29% | 69.02% | 71.26% | 63.50% | 56.57% | 66.00% | 67.49% | 67.82% | 57.60% |
| 24 | LLaVA-v1.6-mistral-7B | 64.60% | 0.4824 | 81.83% | 64.94% | 47.02% | 65.62% | 72.08% | 68.11% | 54.78% | 71.38% | 72.19% | 47.71% | 60.34% |
| 25 | Gemma-3-27B | 63.62% | 0.4617 | 82.68% | 59.47% | 48.71% | 67.92% | 71.88% | 69.93% | 52.87% | 66.16% | 60.80% | 50.84% | 65.52% |
| 26 | Skywork-R1V-38B | 61.84% | 0.4071 | 79.76% | 55.88% | 49.86% | 64.99% | 69.04% | 64.32% | 51.26% | 65.96% | 61.99% | 55.90% | 57.31% |
| 27 | Mulberry-Qwen | 60.25% | 0.4659 | 78.88% | 55.74% | 46.12% | 62.32% | 68.08% | 65.70% | 57.06% | 55.81% | 63.39% | 57.04% | 52.13% |
| 28 | QVQ-72B | 57.14% | 0.6188 | 82.11% | 50.59% | 38.71% | 61.43% | 69.43% | 59.70% | 50.34% | 53.12% | 57.72% | 54.07% | 49.62% |
| 29 | InternVL-2.5-4B | 56.16% | 0.6965 | 87.60% | 45.26% | 35.61% | 60.86% | 66.30% | 56.51% | 50.77% | 51.57% | 61.34% | 54.02% | 46.65% |
| 30 | LlamaV-o1 | 55.68% | 0.7045 | 83.32% | 49.04% | 34.67% | 58.35% | 64.66% | 60.86% | 47.87% | 60.42% | 51.40% | 46.04% | 52.01% |
| 31 | Qwen2-VL-2B-GRPO-8k | 55.39% | 0.6084 | 81.89% | 46.08% | 38.20% | 59.50% | 57.63% | 56.29% | 50.03% | 57.52% | 58.21% | 51.48% | 50.38% |
| 32 | Llama-3.2-CoT | 55.15% | 0.6964 | 83.60% | 46.98% | 34.84% | 58.45% | 65.53% | 57.12% | 48.61% | 58.09% | 51.40% | 51.91% | 46.30% |
| 33 | Kimi-VL-A3B-Instruct | 54.71% | 0.6888 | 78.46% | 47.73% | 37.92% | 61.17% | 64.61% | 48.38% | 47.93% | 51.28% | 64.09% | 58.44% | 39.37% |
| 34 | Qwen2.5-VL-3B | 54.47% | 0.7767 | 83.11% | 49.22% | 31.06% | 59.50% | 67.02% | 54.48% | 50.40% | 47.74% | 54.54% | 49.97% | 51.89% |
| 35 | Qwen2-VL-7B | 53.95% | 0.5808 | 70.98% | 55.49% | 35.37% | 60.02% | 66.35% | 53.82% | 45.28% | 55.24% | 52.43% | 44.37% | 50.67% |
| 36 | LLaVA-v1.6-vicuna-7B | 53.45% | 0.6641 | 80.27% | 45.16% | 34.92% | 56.78% | 65.29% | 56.84% | 50.52% | 52.87% | 54.32% | 43.67% | 45.08% |
| 37 | Llama-3.2-Vision | 53.30% | 0.7970 | 83.82% | 43.17% | 29.70% | 51.90% | 63.76% | 51.51% | 49.26% | 57.91% | 53.87% | 47.47% | 46.95% |
| 38 | InternVL-3-2B | 53.00% | 0.7885 | 86.44% | 42.06% | 30.49% | 58.77% | 63.70% | 53.11% | 48.30% | 47.57% | 55.67% | 50.08% | 46.01% |
| 39 | Qwen2-VL-2B | 52.82% | 0.6367 | 78.99% | 43.98% | 35.47% | 55.78% | 57.87% | 53.49% | 48.92% | 55.77% | 53.40% | 45.71% | 49.21% |
| 40 | LLaVA-1.5 | 51.92% | 0.5647 | 75.72% | 42.79% | 37.24% | 59.92% | 62.49% | 55.03% | 50.22% | 52.87% | 52.43% | 37.90% | 41.76% |
| 41 | Kimi-VL-A3B-Thinking | 50.45% | 1.0908 | 72.62% | 41.92% | 36.78% | 56.57% | 56.62% | 48.32% | 43.00% | 42.11% | 58.10% | 60.49% | 38.26% |
| 42 | Gemma-3-4B | 48.80% | 0.7148 | 72.78% | 43.69% | 29.92% | 53.69% | 58.26% | 51.29% | 43.18% | 50.10% | 47.68% | 39.08% | 44.44% |
| 43 | InternVL-2.5-2B | 46.69% | 0.9077 | 84.30% | 32.71% | 23.04% | 50.60% | 58.06% | 46.29% | 45.40% | 41.79% | 46.27% | 44.80% | 39.72% |
| 44 | Mulberry-Llama | 46.59% | 0.3669 | 55.77% | 46.84% | 37.16% | 51.65% | 53.92% | 51.07% | 46.14% | 41.95% | 49.14% | 43.61% | 34.89% |
| 45 | InternVL-3-1B | 43.79% | 0.9901 | 82.13% | 29.91% | 19.31% | 49.03% | 54.98% | 44.04% | 44.23% | 36.81% | 39.90% | 41.56% | 40.30% |
| 46 | InternVL-2.5-1B | 39.84% | 1.0777 | 81.25% | 23.40% | 14.84% | 45.11% | 50.89% | 36.72% | 40.59% | 34.57% | 34.13% | 40.59% | 36.05% |
* Results may differ slightly from those reported in the paper due to re-evaluation. L1–L3 denote the three prompt levels and S1–S8 the scenario categories listed above.