Hierarchical Fine-grained Preference Optimization
for Physically Plausible Video Generation

Harold Haodong Chen1, Haojian Huang2, Qifeng Chen1, Harry Yang1†, Ser-Nam Lim3†
1HKUST  ·  2HKU  ·  3UCF  ·  †Corresponding Author
Primary Contact: haroldchen328@gmail.com

Abstract

Recent advancements in video generation have enabled the creation of high-quality, visually compelling videos. However, generating videos that adhere to the laws of physics remains a critical challenge for applications requiring realism and accuracy. In this work, we propose PhysHPO, a novel framework for Hierarchical Cross-Modal Direct Preference Optimization, to tackle this challenge by enabling fine-grained preference alignment for physically plausible video generation. PhysHPO optimizes video alignment across four hierarchical granularities: a) Instance Level, aligning the overall video content with the input prompt; b) State Level, ensuring temporal consistency using boundary frames as anchors; c) Motion Level, modeling motion trajectories for realistic dynamics; and d) Semantic Level, maintaining logical consistency between narrative and visuals. Recognizing that real-world videos are the best reflections of physical phenomena, we further introduce an automated data selection pipeline to efficiently identify and utilize "good data" from existing large-scale text-video datasets, thereby eliminating the need for costly and time-intensive dataset construction. Extensive experiments on both physics-focused and general capability benchmarks demonstrate that PhysHPO significantly improves physical plausibility and overall video generation quality of advanced models. To the best of our knowledge, this is the first work to explore fine-grained preference alignment and data selection for video generation, paving the way for more realistic and human-preferred video generation paradigms.

Overview

PhysHPO pipeline overview

Evaluations on CogVideoX

A martial artist practicing slow and controlled Tai Chi movements.

CogVideoX

+ PhysHPO

A person opening a book and flipping through its pages in a library.

CogVideoX

+ PhysHPO

A runner sprinting through a forest trail during sunset.

CogVideoX

+ PhysHPO

A skateboarder performing a kickflip on a ramp in an urban skatepark.

CogVideoX

+ PhysHPO

A person swimming underwater in a clear blue pool.

CogVideoX

+ PhysHPO

A child flying a kite on a windy beach.

CogVideoX

+ PhysHPO

Honey diffusing into warm milk.

CogVideoX

+ PhysHPO

An apple falls into a vat of red wine.

CogVideoX

+ PhysHPO

Peeler peels an apple.

CogVideoX

+ PhysHPO

Yogurt merging with strawberry puree.

CogVideoX

+ PhysHPO

A butter knife spreads a layer of butter over bread.

CogVideoX

+ PhysHPO

Knife slices the tomato.

CogVideoX

+ PhysHPO

An ice cream scooper cuts through creamy vanilla ice cream.

CogVideoX

+ PhysHPO

A delicate egg is hurled with significant force towards a rugged rock surface, where it collides upon impact.

CogVideoX

+ PhysHPO

A bucket scoops up sea water at the beach.

CogVideoX

+ PhysHPO

A vibrant tennis ball is thrown forcefully towards the ground, capturing its dynamic interaction with the surface upon impact.

CogVideoX

+ PhysHPO

A collection of prisms is arranged in a pattern with sunlight shining through them.

CogVideoX

+ PhysHPO

A magnifying glass gradually moves closer to a coin, revealing the intricate details of the embossed design.

CogVideoX

+ PhysHPO

A swimmer glides through the calm ocean waves.

CogVideoX

+ PhysHPO

An airplane zooms through a patch of fluffy clouds.

CogVideoX

+ PhysHPO

Comparison with Baselines

A vibrant, elastic beach ball is thrown forcefully towards the ground, capturing its dynamic interaction with the surface upon impact.

CogVideoX

+ PhyT2V

+ Vanilla DPO

+ PhysHPO (Ours)

A black pen is used to write on the smooth, white surface of a notebook, showcasing the interaction between the pen and the notebook surface.

CogVideoX

+ PhyT2V

+ Vanilla DPO

+ PhysHPO (Ours)

A whisk mixes an egg in a bowl.

CogVideoX

+ PhyT2V

+ Vanilla DPO

+ PhysHPO (Ours)

Evaluations on HunyuanVideo

A bulldozer clears debris from a construction site, moving it into a dumpster.

HunyuanVideo

+ PhysHPO

Multiple candles of varying heights and widths are blown out simultaneously by a single breath, some flames extinguishing faster than others.

HunyuanVideo

+ PhysHPO

An apple submerging into water.

HunyuanVideo

+ PhysHPO

A volleyball is spiked, hitting the wooden floor of the court and producing a soundless bounce visible by its slight dip and subsequent rebound.

HunyuanVideo

+ PhysHPO

A group of friends dancing energetically at a party.

HunyuanVideo

+ PhysHPO

A person skiing down a snowy mountain slope with speed and control.

HunyuanVideo

+ PhysHPO

A person gracefully ice skating on a frozen lake.

HunyuanVideo

+ PhysHPO

A person doing push-ups in a gym with perfect form.

HunyuanVideo

+ PhysHPO

A boxer throwing punches and dodging during a training session.

HunyuanVideo

+ PhysHPO

A runner stretching their legs before starting a race.

HunyuanVideo

+ PhysHPO

BibTeX

@article{chen2025hierarchical,
  title   = {Hierarchical Fine-grained Preference Optimization for Physically Plausible Video Generation},
  author  = {Chen, Harold Haodong and Huang, Haojian and Chen, Qifeng and Yang, Harry and Lim, Ser-Nam},
  journal = {arXiv preprint arXiv:2508.10858},
  year    = {2025}
}