Abstract
The rapid evolution of video generative models has shifted their focus from producing visually plausible outputs to tackling tasks requiring physical plausibility and logical consistency. However, despite recent breakthroughs such as Veo 3's chain-of-frames reasoning, it remains unclear whether these models can exhibit reasoning capabilities similar to large language models (LLMs). Existing benchmarks predominantly evaluate visual fidelity and temporal coherence, failing to capture higher-order reasoning abilities. To bridge this gap, we propose TiViBench, a hierarchical benchmark specifically designed to evaluate the reasoning capabilities of image-to-video (I2V) generation models. TiViBench systematically assesses reasoning across four dimensions: i) Structural Reasoning & Search, ii) Spatial & Visual Pattern Reasoning, iii) Symbolic & Logical Reasoning, and iv) Action Planning & Task Execution, spanning 24 diverse task scenarios across 3 difficulty levels. Through extensive evaluations, we show that commercial models (e.g., Sora 2, Veo 3.1) demonstrate stronger reasoning capabilities, while open-source models reveal untapped potential that remains hindered by limited training scale and data diversity. To further unlock this potential, we introduce VideoTPO, a simple yet effective test-time strategy inspired by preference optimization. By performing LLM self-analysis on generated candidates to identify their strengths and weaknesses, VideoTPO significantly enhances reasoning performance without requiring additional training, data, or reward models. Together, TiViBench and VideoTPO pave the way for evaluating and advancing reasoning in video generation models, laying a foundation for future research in this emerging field.
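To make the test-time idea concrete, below is a minimal sketch of a VideoTPO-style loop, based only on the description in the abstract: sample several candidate videos, have an LLM critique each candidate's strengths and weaknesses, and keep the preferred one. All function names (generate_video, analyze_candidate, pick_preferred) and their stub bodies are hypothetical placeholders, not the authors' actual API; the real method may differ (e.g., in how critiques are turned into a preference).

import random

# Hypothetical stand-ins, not the authors' API. In practice, generate_video
# would call an I2V model, and analyze_candidate / pick_preferred would
# query an LLM/VLM judge.
def generate_video(prompt, image):
    return f"candidate video for {prompt!r}"

def analyze_candidate(prompt, video):
    return f"strengths/weaknesses critique of {video}"

def pick_preferred(prompt, candidates, critiques):
    # Placeholder preference judgment; a real system would have the LLM
    # compare the critiques and return the index of the winning candidate.
    return random.randrange(len(candidates))

def video_tpo(prompt, image, n_candidates=4):
    # Sample several candidates, critique each, and keep the preferred one;
    # no extra training, data, or reward model is involved.
    candidates = [generate_video(prompt, image) for _ in range(n_candidates)]
    critiques = [analyze_candidate(prompt, v) for v in candidates]
    best = pick_preferred(prompt, candidates, critiques)
    return candidates[best]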
Overview
Overview of our proposed (Left) TiViBench benchmark and (Right) VideoTPO framework.
Overview of TiViBench's statistical distributions. (Left) Word distribution of prompt suites; (Middle) Data distribution across 24 tasks; and (Right) Data distribution across 3 difficulty levels.
Evaluation Results
Pass@1 performance overview of 3 commercial models and 4 open-source models on TiViBench.
Detailed pass@1 performance of both open-source and commercial models on TiViBench.
Detailed pass@5 performance of open-source models on TiViBench.
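For reference, pass@k is conventionally computed with the unbiased estimator of Chen et al. (2021): for n generations per task of which c pass, pass@k = 1 - C(n-c, k)/C(n, k), averaged over tasks. The sketch below implements that convention; whether TiViBench uses exactly this estimator is an assumption.

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimate of P(at least one of k sampled generations passes),
    # given n generations per task of which c passed.
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: with 5 generations and 2 passing,
# pass_at_k(5, 2, 1) == 0.4 and pass_at_k(5, 2, 5) == 1.0.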
Evaluation on TiViBench with VideoTPO.
Failure Case Demonstration
(Top) Performance of the best-performing models, i.e., Sora 2 and Veo 3.1, on TiViBench across 24 tasks; (Bottom) Case study of the lowest-performing tasks, i.e., maze solving (MS), temporal ordering (TO), odd-one-out (Odd), and sudoku completion (SC).
Case Demonstration: Structural Reasoning & Search
Case Demonstration: Spatial & Visual Pattern Reasoning
Case Demonstration: Symbolic & Logical Reasoning
Case Demonstration: Action Planning & Task Execution
BibTeX
@article{chen2025tivibench,
title={TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models},
author={Chen, Harold Haodong and Lan, Disen and Shu, Wen-Jie and Liu, Qingyang and Wang, Zihan and Chen, Sirui and Cheng, Wenkai and Chen, Kanghao and Zhang, Hongfei and Zhang, Zixin and Guo, Rongjin and Cheng, Yu and Chen, Ying-Cong},
journal={arXiv preprint arXiv:2511.13704},
year={2025}
}
The project page template is borrowed from DreamBooth.