We introduce TruthfulVQA, the first large-scale multimodal truthfulness benchmark built with rigorous human-in-the-loop verification. We use it to evaluate how truthfully large language models (LLMs) answer visual questions.
Reasoning models have attracted increasing attention for their ability to tackle complex tasks, embodying the System II (slow thinking) paradigm in contrast to System I (fast, intuitive responses). Yet a key question remains: Does slower reasoning necessarily lead to more truthful answers?
Our findings suggest otherwise. We conduct the first systematic study of the inverse scaling law in slow-thinking paradigms for multimodal reasoning. We find that when confronted with incomplete or misleading visual inputs, slow-thinking models are more prone to fabricating plausible yet false details to justify dishonest reasoning.
To analyze this behavior, we construct a 5,000-sample hierarchical prompt dataset annotated by 50 human participants. The prompts progressively increase in complexity, revealing a consistent pattern: slower reasoning models tend to follow depth-first search (DFS) thinking, persistently exploring flawed premises, while faster chat models favor breadth-first search (BFS) inference, showing greater caution under uncertainty.
Reasoning models persistently explore flawed premises, doubling down on incorrect initial assumptions. They are more vulnerable to fabricating plausible-sounding justifications for wrong answers.
Chat models show greater caution under uncertainty, quickly abandoning unpromising lines of reasoning. They are more likely to admit uncertainty than to confabulate.
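The DFS/BFS contrast above is a metaphor for how a model moves through candidate lines of reasoning. The following is a minimal sketch of that metaphor only, not the analysis method used in the study: the tree `TREE` and its node names are hypothetical, chosen to show that a depth-first traversal exhausts one premise before considering an alternative, while a breadth-first traversal surveys all premises before committing to details.

```python
from collections import deque

# Hypothetical toy tree of reasoning steps (illustrative names, not from the dataset).
TREE = {
    "question": ["premise_A", "premise_B"],
    "premise_A": ["A_detail_1"],
    "A_detail_1": ["A_detail_2"],
    "A_detail_2": [],
    "premise_B": ["B_detail_1"],
    "B_detail_1": [],
}


def dfs_order(tree, root):
    """Depth-first: follow one premise to its end before trying another."""
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        # Push children in reverse so the first child is explored first.
        stack.extend(reversed(tree[node]))
    return order


def bfs_order(tree, root):
    """Breadth-first: look at every premise at one depth before going deeper."""
    order, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(tree[node])
    return order


if __name__ == "__main__":
    print("DFS:", dfs_order(TREE, "question"))
    print("BFS:", bfs_order(TREE, "question"))
```

In this toy tree, the depth-first order reaches `A_detail_2` before it ever visits `premise_B`. If `premise_A` were flawed, every detail built on it would inherit the flaw, which is the failure pattern the text attributes to slow-thinking models.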
Contributions:
Each entry of TruthfulVQA undergoes rigorous multi-stage quality assurance, verified by at least five independent annotators. The dataset construction involves the following core components:
| # | Model | Overall | CAI | L1 | L2 | L3 | S1 | S2 | S3 | S4 | S5 | S6 | S7 | S8 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
* Results might differ slightly from the paper due to re-evaluation.