Video-Holmes
Can MLLM Think like Holmes for Complex Video Reasoning?



Junhao Cheng1,2, Yuying Ge1, Teng Wang1, Yixiao Ge1, Jing Liao2, Ying Shan1
1ARC Lab, Tencent PCG, 2City University of Hong Kong

Abstract

Video-Holmes is a benchmark designed to evaluate the complex video reasoning capabilities of MLLMs.

Video-Holmes consists of 1,837 questions derived from 270 manually annotated suspense short films (each 1 to 5 minutes long), spanning seven carefully designed tasks. Each task is constructed by first identifying key events and causal relationships within the films, and then designing questions that require models to actively locate and connect multiple relevant visual clues scattered across different video segments.
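For illustration, the sketch below shows one plausible way a single Video-Holmes question could be stored as a record; the field names and values are hypothetical and only indicate the kind of information each question carries (source film, task type, options, ground-truth answer, and annotated explanation).

# Hypothetical structure of one Video-Holmes question record
# (field names are illustrative, not the released schema).
question_record = {
    "question_id": "q_0137",                # hypothetical unique identifier
    "video_id": "film_0042",                # source suspense short film (1-5 min)
    "task": "Multimodal Hint Reasoning",    # one of the seven task types
    "question": "Why does the protagonist hesitate before opening the door?",
    "options": {"A": "...", "B": "...", "C": "...", "D": "..."},
    "answer": "C",                          # ground-truth option
    "explanation": "Annotated rationale connecting the scattered visual clues.",
}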

⭐ Key Aspects of Video-Holmes:
  • One-Click Evaluation: Videos, audio, questions, and evaluation code are packaged on GitHub and Hugging Face (a minimal scoring sketch is given after this list).
  • High Reasoning Demand: Significant performance gap between reasoning models and non-reasoning models.
  • Reasoning Process Analysis: Clearly visualizes the reasons behind correct and incorrect model responses.
We hope that Video-Holmes can serve as a "Holmes-test" for multimodal reasoning, motivating models to reason more like humans and highlighting the ongoing challenges in this field.
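To illustrate the one-click evaluation, the sketch below scores model predictions against ground-truth answers and aggregates accuracy per task and overall. The file names, field names, and prediction format are assumptions; the evaluation code released on GitHub and Hugging Face is the reference implementation.

import json
from collections import defaultdict

# Minimal scoring sketch; file and field names are hypothetical.
def evaluate(questions_path="video_holmes_questions.json",
             predictions_path="model_predictions.json"):
    with open(questions_path) as f:
        questions = json.load(f)        # list of question records
    with open(predictions_path) as f:
        predictions = json.load(f)      # {question_id: chosen option, e.g. "C"}

    correct, total = defaultdict(int), defaultdict(int)
    for q in questions:
        task = q["task"]                # one of the seven task types
        total[task] += 1
        if predictions.get(q["question_id"]) == q["answer"]:
            correct[task] += 1

    per_task = {t: 100.0 * correct[t] / total[t] for t in total}
    overall = 100.0 * sum(correct.values()) / sum(total.values())
    return per_task, overall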

Figure: Overview of Video-Holmes.

LeaderBoard

Leaderboard of Video-Holmes, where SR means Social Reasoning; IMC means Intention and Motive Chaining; TCI means Temporal Causal Inference; TA means Timeline Analysis; MHR means Multimodal Hint Reasoning; PAR means Physical Anomaly Reasoning; CTI means Core Theme Inference. The Audio column indicates whether the model was additionally given the audio track of each film.
# Model Audio SR IMC TCI TA MHR PAR CTI Avg
1 Gemini-2.5-Pro 🥇 54.8 54.3 53.8 56.0 48.8 46.4 44.8 51.3
2 Gemini-2.0-Flash-Thinking 🥈 56.5 54.2 43.4 44.5 43.9 55.1 50.1 49.5
3 Gemini-1.5-Pro 🥉 59.6 54.7 37.4 33.5 40.4 47.4 44.4 45.7
4 Gemini-2.5-Pro 46.6 49.3 46.9 53.0 40.1 44.3 37.4 45.0
5 Gemini-2.0-Flash-Thinking 43.4 46.9 43.1 51.0 37.9 43.6 39.3 43.1
6 GPT-4o 50.0 49.6 38.8 30.0 44.0 39.2 37.0 42.0
7 Gemini-1.5-Pro 52.1 48.2 34.4 26.0 39.2 46.4 38.9 41.2
8 Claude 3.5 Sonnet 45.9 48.2 33.7 39.5 40.7 39.7 38.1 41.0
9 Claude 3.7 Sonnet 48.6 43.5 30.8 41.0 39.8 36.6 33.7 39.3
10 Qwen2.5-VL-32B 43.2 44.2 31.5 51.0 36.4 31.4 32.2 38.4
11 Video-R1 48.6 41.7 28.9 34.5 31.0 33.5 35.9 36.5
12 SpaceR 48.2 39.4 26.0 33.0 28.9 35.1 35.6 35.2
13 SEED-Bench-R1 42.8 35.1 25.6 40.5 29.2 29.9 32.6 33.5
14 VideoChat-R1 42.1 38.8 24.5 39.5 29.5 27.8 29.3 33.0
15 InternVL3-8B 29.5 40.7 37.9 35.1 24.6 38.9 24.1 32.3
16 Gemini-2.0-Flash 41.8 33.7 23.1 20.5 30.1 26.8 33.7 30.6
17 OpenAI o4-mini 36.3 31.2 20.5 34.0 30.1 30.9 27.4 29.9
18 Qwen2.5-VL-7B 38.4 34.8 17.6 30.0 27.1 18.6 25.2 27.8
19 Qwen2.5-Omni-7B 38.4 30.8 22.3 12.0 21.1 21.1 20.7 24.4
20 InternVL2.5-8B 27.8 32.1 21.2 7.6 25.4 23.6 22.4 23.6
21 Qwen2.5-Omni-7B 27.1 19.9 13.9 7.5 14.8 14.9 13.7 16.4

Construction and Evaluation Pipeline

We select 270 high-quality suspense short films for human annotation. Next, we design seven challenging tasks and employ DeepSeek to generate questions. Finally, we evaluate SOTA MLLMs and, optionally, use DeepSeek to analyze their responses.
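The optional response analysis can be driven by any strong LLM; the sketch below shows one plausible way to call DeepSeek through its OpenAI-compatible API to explain why a model's answer is correct or incorrect. The prompt wording and model name are assumptions and do not reproduce the exact prompts used by the benchmark.

from openai import OpenAI

# DeepSeek exposes an OpenAI-compatible endpoint; the prompt is illustrative only.
client = OpenAI(api_key="YOUR_DEEPSEEK_API_KEY", base_url="https://api.deepseek.com")

def analyze_response(question, ground_truth, model_answer, model_reasoning):
    prompt = (
        "You are given a video-reasoning question, the correct answer, and a model's "
        "answer with its reasoning. Explain which visual clues the model identified "
        "or missed, and why its conclusion is correct or incorrect.\n\n"
        f"Question: {question}\nCorrect answer: {ground_truth}\n"
        f"Model answer: {model_answer}\nModel reasoning: {model_reasoning}"
    )
    response = client.chat.completions.create(
        model="deepseek-chat",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content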

Figure: Construction and evaluation pipeline of Video-Holmes.

Question Types

Existing benchmarks primarily involve clue-given questions, where models depend on explicitly provided clues to derive answers. In contrast, Video-Holmes adopts an active seeking paradigm, requiring models to actively locate and connect multiple relevant visual clues scattered across different video segments.

Figure: Comparison between clue-given questions and the active seeking paradigm of Video-Holmes.

Examples

Examples from Video-Holmes, including questions, explanations, model answers, and analyses of the models' reasoning processes.

Citation

@article{cheng2025video,
  title={Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?},
  author={Cheng, Junhao and Ge, Yuying and Wang, Teng and Ge, Yixiao and Liao, Jing and Shan, Ying},
  journal={arXiv preprint arXiv:2505.21374},
  year={2025}
}