Abstract
The rapid evolution of video generative models has shifted their focus from producing visually plausible outputs to tackling tasks requiring physical plausibility and logical consistency. However, despite recent breakthroughs such as Veo 3's chain-of-frames reasoning, it remains unclear whether these models can exhibit reasoning capabilities similar to large language models (LLMs). Existing benchmarks predominantly evaluate visual fidelity and temporal coherence, failing to capture higher-order reasoning abilities. To bridge this gap, we propose TiViBench, a hierarchical benchmark specifically designed to evaluate the reasoning capabilities of image-to-video (I2V) generation models. TiViBench systematically assesses reasoning across four dimensions: i) Structural Reasoning & Search, ii) Spatial & Visual Pattern Reasoning, iii) Symbolic & Logical Reasoning, and iv) Action Planning & Task Execution, spanning 24 diverse task scenarios across 3 difficulty levels. Through extensive evaluations, we show that commercial models (e.g., Sora 2, Veo 3.1) demonstrate stronger reasoning capabilities, while open-source models reveal untapped potential that remains hindered by limited training scale and data diversity. To further unlock this potential, we introduce VideoTPO, a simple yet effective test-time strategy inspired by preference optimization. By performing LLM self-analysis on generated candidates to identify their strengths and weaknesses, VideoTPO significantly enhances reasoning performance without requiring additional training, data, or reward models. Together, TiViBench and VideoTPO pave the way for evaluating and advancing reasoning in video generation models, laying a foundation for future research in this emerging field.
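To make the test-time idea concrete, below is a minimal sketch of a VideoTPO-style loop, based only on the description in the abstract: sample several candidate videos, have an LLM critique each candidate's strengths and weaknesses, and keep the preferred one. All function names (generate_video, analyze_candidate, pick_preferred) and their stub bodies are hypothetical placeholders, not the authors' actual API; the real method may differ (e.g., in how critiques are turned into a preference).

import random

# Hypothetical stand-ins, not the authors' API. In practice, generate_video
# would call an I2V model, and analyze_candidate / pick_preferred would
# query an LLM/VLM judge.
def generate_video(prompt, image):
    return f"candidate video for {prompt!r}"

def analyze_candidate(prompt, video):
    return f"strengths/weaknesses critique of {video}"

def pick_preferred(prompt, candidates, critiques):
    # Placeholder preference judgment; a real system would have the LLM
    # compare the critiques and return the index of the winning candidate.
    return random.randrange(len(candidates))

def video_tpo(prompt, image, n_candidates=4):
    # Sample several candidates, critique each, and keep the preferred one;
    # no extra training, data, or reward model is involved.
    candidates = [generate_video(prompt, image) for _ in range(n_candidates)]
    critiques = [analyze_candidate(prompt, v) for v in candidates]
    best = pick_preferred(prompt, candidates, critiques)
    return candidates[best]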
Overview
Overview of our proposed (Left) TiViBench benchmark and (Right) VideoTPO framework.
Overview of TiViBench's statistical distributions. (Left) Word distribution of prompt suites; (Middle) Data distribution across 24 tasks; and (Right) Data distribution across 3 difficulty levels.
Evaluation Results
Pass@1 performance overview of 3 commercial models and 4 open-source models on TiViBench.
Detailed pass@1 performance of both open-source and commercial models on TiViBench.
Detailed pass@5 performance of open-source models on TiViBench.
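For reference, pass@k is conventionally computed with the unbiased estimator of Chen et al. (2021): for n generations per task of which c pass, pass@k = 1 - C(n-c, k)/C(n, k), averaged over tasks. The sketch below implements that convention; whether TiViBench uses exactly this estimator is an assumption.

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimate of P(at least one of k sampled generations passes),
    # given n generations per task of which c passed.
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: with 5 generations and 2 passing,
# pass_at_k(5, 2, 1) == 0.4 and pass_at_k(5, 2, 5) == 1.0.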
Evaluation on TiViBench with VideoTPO.
Failure Case Demonstration
(Top) Performance of the best-performing models, i.e., Sora 2 and Veo 3.1, on TiViBench across 24 tasks; (Bottom) Case study of the lowest-performing tasks, i.e., maze solving (MS), temporal ordering (TO), odd-one-out (Odd), and sudoku completion (SC).
Case Demonstration: Structural Reasoning & Search
Case Demonstration: Spatial & Visual Pattern Reasoning
Case Demonstration: Symbolic & Logical Reasoning
Case Demonstration: Action Planning & Task Execution
BibTeX
@article{chen2025tivibench,
title={TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models},
author={Chen, Harold Haodong and Lan, Disen and Shu, Wen-Jie and Liu, Qingyang and Wang, Zihan and Chen, Sirui and Cheng, Wenkai and Chen, Kanghao and Zhang, Hongfei and Zhang, Zixin and Guo, Rongjin and Cheng, Yu and Chen, Ying-Cong},
journal={arXiv preprint arXiv:2511.13704},
year={2025}
}
The project page template is borrowed from DreamBooth.