When Slower Isn't Truer: Inverse Scaling Law of Truthfulness in Multimodal Reasoning

We introduce TruthfulVQA, the first large-scale multimodal truthfulness benchmark with rigorous human-in-the-loop verification, revealing that slow-thinking reasoning models are more prone to fabricating false details.

Sitong Fang1, Wenjing Cao1, Jiahao Li1, Xuyao Wang1, Chi-Min Chan2, Sirui Han2, Juntao Dai1, Yike Guo2, Yaodong Yang†1, Jiaming Ji†1
1Institute for AI, Peking University 2Hong Kong University of Science and Technology
TL;DR

We introduce TruthfulVQA, the first large-scale multimodal truthfulness benchmark built with rigorous human-in-the-loop verification. We use it to evaluate how truthfully multimodal large language models (MLLMs) answer visual questions.

5,000+ Annotated Samples
46 Models Evaluated
50 Human Annotators
88.4% Judge Accuracy

Overview

Slower Thinking, Less Truthful?

Reasoning models have attracted increasing attention for their ability to tackle complex tasks, embodying the System II (slow thinking) paradigm in contrast to System I (fast, intuitive responses). Yet a key question remains: Does slower reasoning necessarily lead to more truthful answers?

Our findings suggest otherwise. We conduct the first systematic study of the inverse scaling law in slow-thinking paradigms for multimodal reasoning. We find that when confronted with incomplete or misleading visual inputs, slow-thinking models are more prone to fabricating plausible yet false details to justify dishonest reasoning.

To analyze this behavior, we construct a 5,000-sample hierarchical prompt dataset annotated by 50 human participants. The prompts progressively increase in complexity, revealing a consistent pattern: slower reasoning models tend to follow depth-first search (DFS) thinking, persistently exploring flawed premises, while faster chat models favor breadth-first search (BFS) inference, showing greater caution under uncertainty.

DFS — Slow Thinking

Reasoning models persistently explore flawed premises, doubling down on incorrect initial assumptions. They are more vulnerable to fabricating plausible-sounding justifications for wrong answers.

BFS — Fast Thinking

Chat models show greater caution under uncertainty, quickly abandoning unpromising lines of reasoning. They are more likely to admit uncertainty than to confabulate.
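
The DFS/BFS contrast borrows its vocabulary from graph traversal. As a minimal sketch of that analogy only, the toy "reasoning tree" below is hypothetical (the node names are illustrative and not taken from TruthfulVQA), and the code shows standard traversal orders rather than how any model actually reasons. Depth-first order exhausts everything under the first premise before considering the alternative, while breadth-first order looks at both premises before elaborating either.

```python
from collections import deque

# Hypothetical toy tree: each key is a reasoning step, each value lists
# the follow-up steps it leads to. Names are illustrative assumptions.
TREE = {
    "question": ["premise_A", "premise_B"],
    "premise_A": ["A_detail_1"],
    "A_detail_1": ["A_detail_2"],
    "A_detail_2": [],
    "premise_B": ["B_detail_1"],
    "B_detail_1": [],
}


def dfs_order(tree, root):
    """Visit nodes depth-first: follow one branch to its end before backtracking."""
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        # Push children reversed so the first child is explored first.
        stack.extend(reversed(tree[node]))
    return order


def bfs_order(tree, root):
    """Visit nodes breadth-first: look at every sibling before going deeper."""
    order, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(tree[node])
    return order


if __name__ == "__main__":
    print("DFS:", dfs_order(TREE, "question"))
    print("BFS:", bfs_order(TREE, "question"))
```

In this toy tree, depth-first order reaches `A_detail_2` before it ever visits `premise_B`. That mirrors the described tendency to keep elaborating a premise that may already be flawed.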

Average Accuracy by Prompt Complexity Level
Level 1 (Basic): 83.7%
Level 2 (Intermediate): 57.8%
Level 3 (Advanced): 44.6%
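
To make the size of the decline concrete, the short sketch below computes the absolute and relative drop in average accuracy between consecutive complexity levels. It uses only the three averages reported above (83.7%, 57.8%, 44.6%). The function and variable names are illustrative, not part of any released TruthfulVQA code.

```python
# Average accuracy (%) per prompt complexity level, as reported on this page.
ACCURACY = {
    "Level 1 (Basic)": 83.7,
    "Level 2 (Intermediate)": 57.8,
    "Level 3 (Advanced)": 44.6,
}


def level_drops(accuracy):
    """Return (from_level, to_level, absolute_drop, relative_drop_pct) per step.

    absolute_drop is in percentage points; relative_drop_pct is the drop
    expressed as a percentage of the earlier level's accuracy.
    """
    levels = list(accuracy)
    drops = []
    for prev, cur in zip(levels, levels[1:]):
        diff = accuracy[prev] - accuracy[cur]
        drops.append((prev, cur, round(diff, 1), round(100 * diff / accuracy[prev], 1)))
    return drops


if __name__ == "__main__":
    for prev, cur, absolute, relative in level_drops(ACCURACY):
        print(f"{prev} -> {cur}: -{absolute} points ({relative}% relative)")
```

By this arithmetic, average accuracy falls 25.9 points from Level 1 to Level 2 and a further 13.2 points from Level 2 to Level 3.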


Dataset

TruthfulVQA Dataset

Each entry of TruthfulVQA undergoes rigorous multi-stage quality assurance, verified by at least five independent annotators. The dataset construction involves the following core components:

S1. Eye Illusion

  • Perceptual Multiplicity
  • Optical Illusions

S2. Perspective Restriction

  • Cropped or Partial Observation
  • Unconventional Shooting Angles
  • Shape Distortion (Natural Phenomena)

S3. Contextual Bias

  • Background Interference
  • Manipulation of Emotional Atmosphere

S4. Information Hiding

  • Visual Information Distortion
  • Blurring / Low-Resolution Processing
  • Concealed Features & Info Masking

S5. Feature Forgery

  • Physical Feature Manipulation
  • Natural Feature Confusion
  • Insertion of Fake Objects/Elements

S6. Fictional Information

  • Fabricated Flags and Maps
  • Imaginary Species
  • Fake Historical/Cultural Artifacts

S7. Imitative Falsehood

  • Misapplied Reasoning Transfer
  • Reinforcement of Semantic Bias
  • Inheritance of False Information

S8. Information Forgery

  • Factual Fabrication
  • Image Manipulation
  • False Reasoning
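
For per-category analysis, the S1 to S8 taxonomy above can be held in a plain mapping. The sketch below is one possible encoding and makes no claim about the released dataset's schema: the category and subcategory names are copied from the list above, while the variable and function names are assumptions.

```python
# Category code -> (category name, subcategories), transcribed from the list above.
TAXONOMY = {
    "S1": ("Eye Illusion", ["Perceptual Multiplicity", "Optical Illusions"]),
    "S2": ("Perspective Restriction", [
        "Cropped or Partial Observation",
        "Unconventional Shooting Angles",
        "Shape Distortion (Natural Phenomena)",
    ]),
    "S3": ("Contextual Bias", [
        "Background Interference",
        "Manipulation of Emotional Atmosphere",
    ]),
    "S4": ("Information Hiding", [
        "Visual Information Distortion",
        "Blurring / Low-Resolution Processing",
        "Concealed Features & Info Masking",
    ]),
    "S5": ("Feature Forgery", [
        "Physical Feature Manipulation",
        "Natural Feature Confusion",
        "Insertion of Fake Objects/Elements",
    ]),
    "S6": ("Fictional Information", [
        "Fabricated Flags and Maps",
        "Imaginary Species",
        "Fake Historical/Cultural Artifacts",
    ]),
    "S7": ("Imitative Falsehood", [
        "Misapplied Reasoning Transfer",
        "Reinforcement of Semantic Bias",
        "Inheritance of False Information",
    ]),
    "S8": ("Information Forgery", [
        "Factual Fabrication",
        "Image Manipulation",
        "False Reasoning",
    ]),
}


def subcategory_count(taxonomy):
    """Total number of subcategories across all categories."""
    return sum(len(subs) for _, subs in taxonomy.values())


if __name__ == "__main__":
    for code, (name, subs) in TAXONOMY.items():
        print(f"{code}. {name}: {len(subs)} subcategories")
    print("Total subcategories:", subcategory_count(TAXONOMY))
```

The keys match the S1 to S8 column labels used in the leaderboard, so the same codes can index per-category scores.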

Benchmark Results

Leaderboard

Columns: #, Model, Overall, CAI, L1, L2, L3, S1, S2, S3, S4, S5, S6, S7, S8
Model types: R = Reasoning, C = Chat

* Results might differ slightly from the paper due to re-evaluation.