Abstract
Recent advancements in video generation have enabled the creation of high-quality, visually compelling videos. However, generating videos that adhere to the laws of physics remains a critical challenge for applications requiring realism and accuracy. In this work, we propose PhysHPO, a novel framework for Hierarchical Cross-Modal Direct Preference Optimization, to tackle this challenge by enabling fine-grained preference alignment for physically plausible video generation. PhysHPO optimizes video alignment across four hierarchical granularities: a) Instance Level, aligning the overall video content with the input prompt; b) State Level, ensuring temporal consistency using boundary frames as anchors; c) Motion Level, modeling motion trajectories for realistic dynamics; and d) Semantic Level, maintaining logical consistency between narrative and visuals. Recognizing that real-world videos are the best reflections of physical phenomena, we further introduce an automated data selection pipeline to efficiently identify and utilize "good data" from existing large-scale text-video datasets, thereby eliminating the need for costly and time-intensive dataset construction. Extensive experiments on both physics-focused and general capability benchmarks demonstrate that PhysHPO significantly improves physical plausibility and overall video generation quality of advanced models. To the best of our knowledge, this is the first work to explore fine-grained preference alignment and data selection for video generation, paving the way for more realistic and human-preferred video generation paradigms.
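To make the idea of preference alignment at several granularities concrete, the sketch below applies the standard Direct Preference Optimization loss at each of the four levels and sums the results. It is an illustration only, under stated assumptions: the abstract does not give the PhysHPO objective, so the weighted sum, the `weights` and `beta` parameters, and the function names `dpo_loss` and `hierarchical_dpo_loss` are all hypothetical. Each level is assumed to supply scalar log-likelihoods for a preferred and a rejected sample under both the trained policy and a frozen reference model.

```python
import math

# The four granularities named in the abstract.
LEVELS = ("instance", "state", "motion", "semantic")


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair.

    logp_w / logp_l: policy log-likelihoods of the preferred / rejected sample.
    ref_logp_w / ref_logp_l: the same quantities under the frozen reference.
    beta: temperature on the implicit reward (value here is an assumption).
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def hierarchical_dpo_loss(scores, weights=None, beta=0.1):
    """Weighted sum of per-level DPO losses (an assumed combination rule).

    scores: dict mapping each level name to a tuple
            (logp_w, logp_l, ref_logp_w, ref_logp_l).
    weights: optional dict of per-level weights; defaults to 1.0 each.
    """
    if weights is None:
        weights = {level: 1.0 for level in LEVELS}
    return sum(
        weights[level] * dpo_loss(*scores[level], beta=beta)
        for level in LEVELS
    )


if __name__ == "__main__":
    # Toy numbers: the policy prefers the "good" video slightly more than
    # the reference does at every level.
    toy = {level: (-1.0, -2.0, -1.5, -1.5) for level in LEVELS}
    print(hierarchical_dpo_loss(toy))
```

In this sketch, a level where the policy already favors the preferred sample more strongly than the reference contributes a loss below log 2, and a level where it does not contributes more, so each granularity pushes the model separately.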
Overview
Evaluations on CogVideoX
A martial artist practicing slow and controlled Tai Chi movements.
A person opening a book and flipping through its pages in a library.
A runner sprinting through a forest trail during sunset.
A skateboarder performing a kickflip on a ramp in an urban skatepark.
A person swimming underwater in a clear blue pool.
A child flying a kite on a windy beach.
Honey diffusing into warm milk.
An apple falls into a vat of red wine.
Peeler peels an apple.
Yogurt merging with strawberry puree.
A butter knife spreads a layer of butter over bread.
Knife slices the tomato.
An ice cream scooper cuts through creamy vanilla ice cream.
A delicate, fragile egg is hurled with significant force towards a rugged, solid rock surface, where it collides upon impact.
A bucket scoops up sea water at the beach.
A vibrant, elastic tennis ball is thrown forcefully towards the ground, capturing its dynamic interaction with the surface upon impact.
A collection of prisms is arranged in a pattern with sunlight shining through them.
A magnifying glass is gradually moving closer to a coin, revealing the intricate details and textures of the embossed design as it approaches.
A swimmer glides through the calm ocean waves.
An airplane zooms through a patch of fluffy clouds.
A vibrant, elastic beach ball is thrown forcefully towards the ground, capturing its dynamic interaction with the surface upon impact.
A black pen is used to write on the smooth, white surface of a notebook, showcasing the interaction between the pen and the notebook surface.
Evaluations on HunyuanVideo
A bulldozer clears debris from a construction site, moving it into a dumpster.
Multiple candles of varying heights and widths are blown out simultaneously by a single breath, some flames extinguishing faster than others.
An apple submerging into water.
A volleyball is spiked, hitting the wooden floor of the court and producing a soundless bounce visible by its slight dip and subsequent rebound.
A group of friends dancing energetically at a party.
A person skiing down a snowy mountain slope with speed and control.
A person gracefully ice skating on a frozen lake.
A person doing push-ups in a gym with perfect form.
BibTeX
@article{chen2025hierarchical,
title={Hierarchical Fine-grained Preference Optimization for Physically Plausible Video Generation},
author={Chen, Harold Haodong and Huang, Haojian and Chen, Qifeng and Yang, Harry and Lim, Ser-Nam},
journal={arXiv preprint arXiv:2508.10858},
year={2025}
}