Video-Holmes

Abstract

Video-Holmes is a benchmark designed to evaluate the complex video reasoning capabilities of multimodal large language models (MLLMs).

Video-Holmes consists of 1,837 questions derived from 270 manually annotated suspense short films (ranging from 1 to 5 minutes), spanning seven carefully designed tasks. Each task is constructed by first identifying key events and causal relationships within the films, and then designing questions that require models to actively locate and connect multiple relevant visual clues scattered across different video segments.

⭐ Key Aspects of Video-Holmes:

One-Click Evaluation: Videos, audio, questions, and evaluation code are packaged on GitHub and Hugging Face.
High Reasoning Demand: Significant performance gap between reasoning models and non-reasoning models.
Reasoning Process Analysis: Clearly visualizes the reasons behind correct and incorrect model responses.

We hope that Video-Holmes can serve as a "Holmes-test" for multimodal reasoning, motivating models to reason more like humans and highlighting the ongoing challenges in this field.

Teaser Image

Leaderboard

Leaderboard of Video-Holmes, where SR means Social Reasoning; IMC means Intention and Motive Chaining; TCI means Temporal Causal Inference; TA means Timeline Analysis; MHR means Multimodal Hint Reasoning; PAR means Physical Anomaly Reasoning; CTI means Core Theme Inference.

#	Model	Audio	SR	IMC	TCI	TA	MHR	PAR	CTI	Avg
1	Gemini-2.5-Pro 🥇	✅	54.8	54.3	53.8	56.0	48.8	46.4	44.8	51.3
2	Gemini-2.0-Flash-Thinking 🥈	✅	56.5	54.2	43.4	44.5	43.9	55.1	50.1	49.5
3	Gemini-1.5-Pro 🥉	✅	59.6	54.7	37.4	33.5	40.4	47.4	44.4	45.7
4	Gemini-2.5-Pro	❌	46.6	49.3	46.9	53.0	40.1	44.3	37.4	45.0
5	Gemini-2.0-Flash-Thinking	❌	43.4	46.9	43.1	51.0	37.9	43.6	39.3	43.1
6	GPT-4o	❌	50.0	49.6	38.8	30.0	44.0	39.2	37.0	42.0
7	Gemini-1.5-Pro	❌	52.1	48.2	34.4	26.0	39.2	46.4	38.9	41.2
8	Claude 3.5 Sonnet	❌	45.9	48.2	33.7	39.5	40.7	39.7	38.1	41.0
9	Claude 3.7 Sonnet	❌	48.6	43.5	30.8	41.0	39.8	36.6	33.7	39.3
10	Qwen2.5-VL-32B	❌	43.2	44.2	31.5	51.0	36.4	31.4	32.2	38.4
11	Video-R1	❌	48.6	41.7	28.9	34.5	31.0	33.5	35.9	36.5
12	SpaceR	❌	48.2	39.4	26.0	33.0	28.9	35.1	35.6	35.2
13	SEED-Bench-R1	❌	42.8	35.1	25.6	40.5	29.2	29.9	32.6	33.5
14	VideoChat-R1	❌	42.1	38.8	24.5	39.5	29.5	27.8	29.3	33.0
15	InternVL3-8B	❌	29.5	40.7	37.9	35.1	24.6	38.9	24.1	32.3
16	Gemini-2.0-Flash	❌	41.8	33.7	23.1	20.5	30.1	26.8	33.7	30.6
17	OpenAI o4-mini	❌	36.3	31.2	20.5	34.0	30.1	30.9	27.4	29.9
18	Qwen2.5-VL-7B	❌	38.4	34.8	17.6	30.0	27.1	18.6	25.2	27.8
19	Qwen2.5-Omni-7B	✅	38.4	30.8	22.3	12.0	21.1	21.1	20.7	24.4
20	InternVL2.5-8B	❌	27.8	32.1	21.2	7.6	25.4	23.6	22.4	23.6
21	Qwen2.5-Omni-7B	❌	27.1	19.9	13.9	7.5	14.8	14.9	13.7	16.4
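
Four models appear in the table both with and without audio input. The short sketch below compares their Avg scores; the numbers are copied from the leaderboard, while the helper name `audio_gain` is only illustrative and not part of the released code.

```python
# Avg scores copied from the leaderboard: model -> (with audio, without audio)
AVG_SCORES = {
    "Gemini-2.5-Pro": (51.3, 45.0),
    "Gemini-2.0-Flash-Thinking": (49.5, 43.1),
    "Gemini-1.5-Pro": (45.7, 41.2),
    "Qwen2.5-Omni-7B": (24.4, 16.4),
}


def audio_gain(scores):
    """Return the Avg improvement (in points) from adding audio, per model."""
    return {
        model: round(with_audio - without_audio, 1)
        for model, (with_audio, without_audio) in scores.items()
    }


gains = audio_gain(AVG_SCORES)
for model, gain in gains.items():
    print(f"{model}: +{gain} points with audio")
```

On these four pairs, the Avg score is higher with audio in every case.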

Construction and Evaluation Pipeline

We select 270 high-quality suspense short films for human annotation. Next, we design 7 challenging tasks and employ DeepSeek to generate questions. Finally, we evaluate SOTA MLLMs and, optionally, use DeepSeek to analyze their responses.
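
The scoring part of the evaluation step can be sketched as per-task accuracy. This is a minimal illustration, not the official evaluation code: it assumes questions are multiple-choice and scored by exact match on the option letter, and the record fields `task`, `answer`, and `prediction` are hypothetical names chosen for this example.

```python
from collections import defaultdict

# Task abbreviations as used in the leaderboard
TASKS = ["SR", "IMC", "TCI", "TA", "MHR", "PAR", "CTI"]


def score(records):
    """Return (per-task accuracy, overall accuracy), both in percent."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec["task"]] += 1
        # Assumption: an answer is correct if the option letters match,
        # ignoring case and surrounding whitespace.
        if rec["prediction"].strip().upper() == rec["answer"].strip().upper():
            correct[rec["task"]] += 1
    per_task = {
        task: round(100.0 * correct[task] / total[task], 1)
        for task in TASKS
        if total[task]
    }
    n_questions = sum(total.values())
    overall = round(100.0 * sum(correct.values()) / n_questions, 1) if n_questions else 0.0
    return per_task, overall


# Hypothetical toy records, not taken from the benchmark
records = [
    {"task": "SR", "answer": "A", "prediction": "A"},
    {"task": "SR", "answer": "B", "prediction": "c"},
    {"task": "TA", "answer": "D", "prediction": " d "},
]
per_task, overall = score(records)
print(per_task, overall)
```

In this sketch the overall score is weighted by the number of questions per task, so it can differ from the unweighted mean of the seven task scores. Whether the leaderboard's Avg column is computed this way is not stated here.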

Question Types

Existing benchmarks primarily involve clue-given questions, where models depend on explicitly provided clues to derive answers. In contrast, Video-Holmes adopts an active seeking paradigm, requiring models to actively locate and connect multiple relevant visual clues scattered across different video segments.

Examples

Examples of Video-Holmes questions, explanations, model answers, and analyses of the models' reasoning processes.

Citation

@article{cheng2025video,
  title={Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?},
  author={Cheng, Junhao and Ge, Yuying and Wang, Teng and Ge, Yixiao and Liao, Jing and Shan, Ying},
  journal={arXiv preprint arXiv:2505.21374},
  year={2025}
}