Leaderboard of Video-Holmes, where
SR means Social Reasoning;
IMC means Intention and Motive Chaining;
TCI means Temporal Causal Inference;
TA means Timeline Analysis;
MHR means Multimodal Hint Reasoning;
PAR means Physical Anomaly Reasoning;
CTI means Core Theme Inference.

| # | Model | Audio | SR | IMC | TCI | TA | MHR | PAR | CTI | Avg |
|---|-------|-------|----|-----|-----|----|-----|-----|-----|-----|
| 1 | Gemini-2.5-Pro 🥇 | ✅ | 54.8 | 54.3 | 53.8 | 56.0 | 48.8 | 46.4 | 44.8 | 51.3 |
| 2 | Gemini-2.0-Flash-Thinking 🥈 | ✅ | 56.5 | 54.2 | 43.4 | 44.5 | 43.9 | 55.1 | 50.1 | 49.5 |
| 3 | Gemini-1.5-Pro 🥉 | ✅ | 59.6 | 54.7 | 37.4 | 33.5 | 40.4 | 47.4 | 44.4 | 45.7 |
| 4 | Gemini-2.5-Pro | ❌ | 46.6 | 49.3 | 46.9 | 53.0 | 40.1 | 44.3 | 37.4 | 45.0 |
| 5 | Gemini-2.0-Flash-Thinking | ❌ | 43.4 | 46.9 | 43.1 | 51.0 | 37.9 | 43.6 | 39.3 | 43.1 |
| 6 | GPT-4o | ❌ | 50.0 | 49.6 | 38.8 | 30.0 | 44.0 | 39.2 | 37.0 | 42.0 |
| 7 | Gemini-1.5-Pro | ❌ | 52.1 | 48.2 | 34.4 | 26.0 | 39.2 | 46.4 | 38.9 | 41.2 |
| 8 | Claude 3.5 Sonnet | ❌ | 45.9 | 48.2 | 33.7 | 39.5 | 40.7 | 39.7 | 38.1 | 41.0 |
| 9 | Claude 3.7 Sonnet | ❌ | 48.6 | 43.5 | 30.8 | 41.0 | 39.8 | 36.6 | 33.7 | 39.3 |
| 10 | Qwen2.5-VL-32B | ❌ | 43.2 | 44.2 | 31.5 | 51.0 | 36.4 | 31.4 | 32.2 | 38.4 |
| 11 | Video-R1 | ❌ | 48.6 | 41.7 | 28.9 | 34.5 | 31.0 | 33.5 | 35.9 | 36.5 |
| 12 | SpaceR | ❌ | 48.2 | 39.4 | 26.0 | 33.0 | 28.9 | 35.1 | 35.6 | 35.2 |
| 13 | SEED-Bench-R1 | ❌ | 42.8 | 35.1 | 25.6 | 40.5 | 29.2 | 29.9 | 32.6 | 33.5 |
| 14 | VideoChat-R1 | ❌ | 42.1 | 38.8 | 24.5 | 39.5 | 29.5 | 27.8 | 29.3 | 33.0 |
| 15 | InternVL3-8B | ❌ | 29.5 | 40.7 | 37.9 | 35.1 | 24.6 | 38.9 | 24.1 | 32.3 |
| 16 | Gemini-2.0-Flash | ❌ | 41.8 | 33.7 | 23.1 | 20.5 | 30.1 | 26.8 | 33.7 | 30.6 |
| 17 | OpenAI o4-mini | ❌ | 36.3 | 31.2 | 20.5 | 34.0 | 30.1 | 30.9 | 27.4 | 29.9 |
| 18 | Qwen2.5-VL-7B | ❌ | 38.4 | 34.8 | 17.6 | 30.0 | 27.1 | 18.6 | 25.2 | 27.8 |
| 19 | Qwen2.5-Omni-7B | ✅ | 38.4 | 30.8 | 22.3 | 12.0 | 21.1 | 21.1 | 20.7 | 24.4 |
| 20 | InternVL2.5-8B | ❌ | 27.8 | 32.1 | 21.2 | 7.6 | 25.4 | 23.6 | 22.4 | 23.6 |
| 21 | Qwen2.5-Omni-7B | ❌ | 27.1 | 19.9 | 13.9 | 7.5 | 14.8 | 14.9 | 13.7 | 16.4 |
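
Four models appear in the leaderboard both with and without audio input, so the effect of audio on the Avg score can be read off directly. The following is a minimal sketch of that comparison; the dictionary `AVG_BY_AUDIO` and the helper `audio_gain` are illustrative names, not part of the benchmark code, and the numbers are copied by hand from the Avg column above.

```python
# Avg scores copied from the leaderboard above:
# model -> (Avg with audio, Avg without audio).
# AVG_BY_AUDIO and audio_gain are hypothetical names used only for illustration.
AVG_BY_AUDIO = {
    "Gemini-2.5-Pro": (51.3, 45.0),
    "Gemini-2.0-Flash-Thinking": (49.5, 43.1),
    "Gemini-1.5-Pro": (45.7, 41.2),
    "Qwen2.5-Omni-7B": (24.4, 16.4),
}


def audio_gain(scores):
    """Return the Avg difference (with audio minus without) per model."""
    return {
        model: round(with_audio - without_audio, 1)
        for model, (with_audio, without_audio) in scores.items()
    }


gains = audio_gain(AVG_BY_AUDIO)

# Print models from largest to smallest gain.
for model, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model}: +{gain:.1f}")
```

Under these numbers, every model listed in both settings scores higher on Avg when audio is provided.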