When Slower Isn't Truer: Inverse Scaling Law of Truthfulness in Multimodal Reasoning

TL;DR

We introduce TruthfulVQA, the first large-scale multimodal truthfulness benchmark built with rigorous human-in-the-loop verification, and use it to evaluate how truthfully multimodal large language models (MLLMs) answer visual questions.

5,000+ Annotated Samples
46 Models Evaluated
50 Human Annotators
88.4% Judge Accuracy

Overview

Slower Thinking, Less Truthful?

Reasoning models have attracted increasing attention for their ability to tackle complex tasks, embodying the System II (slow thinking) paradigm in contrast to System I (fast, intuitive responses). Yet a key question remains: Does slower reasoning necessarily lead to more truthful answers?

Our findings suggest otherwise. We conduct the first systematic study of the inverse scaling law in slow-thinking paradigms for multimodal reasoning. We find that when confronted with incomplete or misleading visual inputs, slow-thinking models are more prone to fabricating plausible yet false details to justify dishonest reasoning.

To analyze this behavior, we construct a 5,000-sample hierarchical prompt dataset annotated by 50 human participants. The prompts progressively increase in complexity, revealing a consistent pattern: slower reasoning models tend to follow depth-first search (DFS) thinking, persistently exploring flawed premises, while faster chat models favor breadth-first search (BFS) inference, showing greater caution under uncertainty.

DFS — Slow Thinking

Reasoning models persistently explore flawed premises, doubling down on incorrect initial assumptions. They are more vulnerable to fabricating plausible-sounding justifications for wrong answers.

BFS — Fast Thinking

Chat models show greater caution under uncertainty, quickly abandoning unpromising lines of reasoning. They are more likely to admit uncertainty than to confabulate.

Average Accuracy by Prompt Complexity Level

Level 1 (Basic): 83.7%
Level 2 (Intermediate): 57.8%
Level 3 (Advanced): 44.6%
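As a rough illustration of how per-level averages like these can be computed and compared, here is a minimal sketch. The function name `accuracy_by_level`, the `(level, is_correct)` record layout, and the toy records are assumptions made for illustration and are not part of the TruthfulVQA release; only the three reported percentages come from this page.

```python
from collections import defaultdict


def accuracy_by_level(records):
    """Return {level: accuracy in percent} from (level, is_correct) pairs.

    The record layout is a hypothetical one chosen for this sketch.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for level, is_correct in records:
        totals[level] += 1
        hits[level] += int(is_correct)
    return {level: 100.0 * hits[level] / totals[level] for level in sorted(totals)}


# Toy judged responses (made up, not benchmark data).
toy_records = [
    (1, True), (1, True), (1, False), (1, True),
    (2, True), (2, False),
    (3, False), (3, True), (3, False), (3, False),
]
toy_accuracy = accuracy_by_level(toy_records)

# Averages reported on this page, used to quantify the Level 1 -> Level 3 drop.
reported = {1: 83.7, 2: 57.8, 3: 44.6}
absolute_drop = round(reported[1] - reported[3], 1)  # percentage points
relative_drop = round(100.0 * (reported[1] - reported[3]) / reported[1], 1)  # percent
```

With the reported numbers, the drop from Level 1 to Level 3 is 39.1 percentage points, or about 46.7% of the Level 1 accuracy.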

Contributions:

First and foremost, human-in-the-loop. We introduce TruthfulVQA, the first large-scale multimodal truthfulness benchmark built with rigorous human-in-the-loop verification. More than 5,000 visually misleading images were collected and annotated by a team of 50 professional annotators, and each image was independently reviewed by five annotators on a case-by-case basis.
Hierarchical prompt design for deep truthfulness evaluation. We propose a three-tier, human-written prompt design that systematically probes models across increasing levels of reasoning complexity, enabling finer-grained diagnosis of dishonesty and misinformation vulnerabilities.
Revealing slow vs. fast thinking pitfalls in multimodal reasoning. We conduct the first comprehensive analysis comparing depth-first (slow-thinking) reasoning models and breadth-first (fast-thinking) chat models under adversarial visual conditions. Reasoning models are significantly more prone to factual dishonesty in complex visual tasks.
TruthfulJudge: a reliable human-centric evaluation pipeline. We design TruthfulJudge, a reliable evaluation pipeline with a specialized judge model (ECE=0.12, Cohen's κ=0.79, 88.4% judge accuracy).
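For readers unfamiliar with the two judge metrics quoted above, the sketch below shows the textbook definitions of expected calibration error (ECE, with equal-width confidence bins) and Cohen's κ. It is not the TruthfulJudge implementation: the function names, the choice of 10 bins, and the toy inputs are assumptions, and the page does not state how its own values were computed.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard binned ECE: weighted mean of |accuracy - confidence| per bin.

    n_bins=10 is an assumed default, not a setting taken from the benchmark.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two raters on the same items.

    Undefined (division by zero) when chance agreement p_e equals 1.
    """
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    p_e = sum((labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories)
    return (p_o - p_e) / (1 - p_e)


# Toy inputs (made up): judge confidences vs. whether the judge matched the human label.
toy_ece = expected_calibration_error([0.9, 0.9, 0.6, 0.6], [True, True, True, False])
# Toy inputs (made up): judge verdicts vs. human verdicts on four items.
toy_kappa = cohens_kappa([1, 1, 0, 0], [1, 1, 0, 1])
```

Lower ECE means the judge's stated confidence tracks its actual accuracy more closely, while κ measures agreement with human labels beyond what chance alone would produce.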

Dataset

TruthfulVQA Dataset

Each entry of TruthfulVQA undergoes rigorous multi-stage quality assurance, verified by at least five independent annotators. The dataset construction involves the following core components:

Human Annotation and Quality Assurance Team. We collaborated with a professional annotation team of 50 members, implementing a multi-stage quality assurance protocol.
Human-crafted Images from Webpages. The dataset includes 5,000 images: 4,500 manually curated from the web for their misleading content and 500 generated by image-generation models. Each image was accepted only after independent confirmation by five annotators.
Hierarchical Prompts Evaluation. Each image is paired with three levels of prompts (Levels 1, 2, and 3) of increasing informational depth, built around ambiguous, deceptive, or subtly manipulated content.
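The components above can be pictured as a simple record with an acceptance rule. The sketch below is illustrative only: the class name `TruthfulVQAEntry`, its field names, and the `"web"` / `"generated"` source labels are hypothetical and do not describe the released data format. The two facts taken from this page are the five-annotator confirmation requirement and the three prompt levels.

```python
from dataclasses import dataclass, field

REQUIRED_CONFIRMATIONS = 5   # five independent annotators, as described above
PROMPT_LEVELS = {1, 2, 3}    # Level 1, 2, and 3 prompts


@dataclass
class TruthfulVQAEntry:
    """Hypothetical record layout for one image and its prompts."""
    image_id: str
    source: str                      # assumed labels: "web" or "generated"
    prompts: dict                    # {level: prompt text}
    confirmations: list = field(default_factory=list)  # IDs of confirming annotators

    def is_accepted(self):
        """Accept only with all three prompt levels and enough distinct annotators."""
        has_all_levels = set(self.prompts) == PROMPT_LEVELS
        enough_votes = len(set(self.confirmations)) >= REQUIRED_CONFIRMATIONS
        return has_all_levels and enough_votes


entry = TruthfulVQAEntry(
    image_id="img-0001",             # made-up identifier
    source="web",
    prompts={1: "basic question", 2: "intermediate question", 3: "advanced question"},
    confirmations=["a1", "a2", "a3", "a4"],
)
accepted_with_four = entry.is_accepted()
entry.confirmations.append("a4")     # a repeat vote from the same annotator does not count
accepted_with_repeat = entry.is_accepted()
entry.confirmations.append("a5")
accepted_with_five = entry.is_accepted()
```

Counting distinct annotator IDs rather than raw votes is a design choice of this sketch, meant to reflect the "independent" part of the confirmation requirement.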

S1. Eye Illusion

Perceptual Multiplicity
Optical Illusions

S2. Perspective Restriction

Cropped or Partial Observation
Unconventional Shooting Angles
Shape Distortion (Natural Phenomena)

S3. Contextual Bias

Background Interference
Manipulation of Emotional Atmosphere

S4. Information Hiding

Visual Information Distortion
Blurring / Low-Resolution Processing
Concealed Features & Info Masking

S5. Feature Forgery

Physical Feature Manipulation
Natural Feature Confusion
Insertion of Fake Objects/Elements

S6. Fictional Information

Fabricated Flags and Maps
Imaginary Species
Fake Historical/Cultural Artifacts

S7. Imitative Falsehood

Misapplied Reasoning Transfer
Reinforcement of Semantic Bias
Inheritance of False Information

S8. Information Forgery

Factual Fabrication
Image Manipulation
False Reasoning
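The eight categories above (S1 to S8) also label the per-category columns of the leaderboard. For analysis scripts it can be handy to hold the taxonomy as a plain mapping; the sketch below does that using only the names listed on this page. The variable name `TAXONOMY` and the helper `category_name` are assumptions introduced for illustration.

```python
# Category code -> (category name, subcategories), transcribed from the list above.
TAXONOMY = {
    "S1": ("Eye Illusion", ["Perceptual Multiplicity", "Optical Illusions"]),
    "S2": ("Perspective Restriction", [
        "Cropped or Partial Observation",
        "Unconventional Shooting Angles",
        "Shape Distortion (Natural Phenomena)",
    ]),
    "S3": ("Contextual Bias", [
        "Background Interference",
        "Manipulation of Emotional Atmosphere",
    ]),
    "S4": ("Information Hiding", [
        "Visual Information Distortion",
        "Blurring / Low-Resolution Processing",
        "Concealed Features & Info Masking",
    ]),
    "S5": ("Feature Forgery", [
        "Physical Feature Manipulation",
        "Natural Feature Confusion",
        "Insertion of Fake Objects/Elements",
    ]),
    "S6": ("Fictional Information", [
        "Fabricated Flags and Maps",
        "Imaginary Species",
        "Fake Historical/Cultural Artifacts",
    ]),
    "S7": ("Imitative Falsehood", [
        "Misapplied Reasoning Transfer",
        "Reinforcement of Semantic Bias",
        "Inheritance of False Information",
    ]),
    "S8": ("Information Forgery", [
        "Factual Fabrication",
        "Image Manipulation",
        "False Reasoning",
    ]),
}


def category_name(code):
    """Translate a leaderboard column code such as 'S4' into its category name."""
    return TAXONOMY[code][0]


n_categories = len(TAXONOMY)
n_subcategories = sum(len(subs) for _, subs in TAXONOMY.values())
```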

Benchmark Results

Leaderboard

#	Model	Overall	CAI	L1	L2	L3	S1	S2	S3	S4	S5	S6	S7	S8

R: Reasoning, C: Chat

* Results might differ slightly from the paper due to re-evaluation.