Hierarchical Fine-grained Preference Optimization for Physically Plausible Video Generation

Harold Haodong Chen¹, Haojian Huang², Qifeng Chen¹, Harry Yang¹, Ser-Nam Lim³
¹HKUST  ²HKU  ³UCF  Corresponding Author
Primary Contact: haroldchen328@gmail.com



Abstract

Recent advancements in video generation have enabled the creation of high-quality, visually compelling videos. However, generating videos that adhere to the laws of physics remains a critical challenge for applications requiring realism and accuracy. In this work, we propose PhysHPO, a novel framework for Hierarchical Cross-Modal Direct Preference Optimization, to tackle this challenge by enabling fine-grained preference alignment for physically plausible video generation. PhysHPO optimizes video alignment across four hierarchical granularities: a) Instance Level, aligning the overall video content with the input prompt; b) State Level, ensuring temporal consistency using boundary frames as anchors; c) Motion Level, modeling motion trajectories for realistic dynamics; and d) Semantic Level, maintaining logical consistency between narrative and visuals. Recognizing that real-world videos are the best reflections of physical phenomena, we further introduce an automated data selection pipeline to efficiently identify and utilize "good data" from existing large-scale text-video datasets, thereby eliminating the need for costly and time-intensive dataset construction. Extensive experiments on both physics-focused and general capability benchmarks demonstrate that PhysHPO significantly improves physical plausibility and overall video generation quality of advanced models. To the best of our knowledge, this is the first work to explore fine-grained preference alignment and data selection for video generation, paving the way for more realistic and human-preferred video generation paradigms.
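To make the four-level structure concrete, the sketch below combines one preference term per granularity into a single objective. It is a minimal illustration and not the paper's formulation. It assumes the standard DPO loss per level, scalar log-likelihood scores as stand-ins for whatever model-specific quantities a video generator would supply, and a plain weighted sum across levels. The names `dpo_loss`, `hierarchical_dpo_loss`, `LEVELS`, `beta`, and `weights` are hypothetical.

```python
import math

# Hypothetical level names taken from the abstract's four granularities.
LEVELS = ("instance", "state", "motion", "semantic")


def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair.

    Inputs are scalar scores from the policy and a frozen reference model.
    Using plain log-likelihoods here is an assumption made for illustration.
    """
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # -log(sigmoid(margin)) == softplus(-margin), computed without overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def hierarchical_dpo_loss(pairs, weights=None, beta=0.1):
    """Weighted sum of per-level DPO losses (assumed combination rule).

    pairs maps each level name to a 4-tuple:
    (logp_win, logp_lose, ref_logp_win, ref_logp_lose).
    """
    weights = weights or {level: 1.0 for level in LEVELS}
    return sum(
        weights[level] * dpo_loss(*pairs[level], beta=beta) for level in LEVELS
    )
```

When the policy and the reference model score both videos identically, every level contributes log 2. Each term falls below that value once the policy prefers the physically plausible video more strongly than the reference does.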

Overview

Evaluations on CogVideoX

Side-by-side video comparisons (CogVideoX | + PhysHPO) for each of the following prompts:

A martial artist practicing slow and controlled Tai Chi movements.

A person opening a book and flipping through its pages in a library.

A runner sprinting through a forest trail during sunset.

A skateboarder performing a kickflip on a ramp in an urban skatepark.

A person swimming underwater in a clear blue pool.

A child flying a kite on a windy beach.

Honey diffusing into warm milk.

An apple falls into a vat of red wine.

Peeler peels an apple.

Yogurt merging with strawberry puree.

A butter knife spreads a layer of butter over bread.

Knife slices the tomato.

An ice cream scooper cuts through creamy vanilla ice cream.

A delicate, fragile egg is hurled with significant force towards a rugged, solid rock surface, where it collides upon impact.

A bucket scoops up sea water at the beach.

A vibrant, elastic tennis ball is thrown forcefully towards the ground, capturing its dynamic interaction with the surface upon impact.

A collection of prisms is arranged in a pattern with sunlight shining through them.

A magnifying glass is gradually moving closer to a coin, revealing the intricate details and textures of the embossed design as it approaches.

A swimmer glides through the calm ocean waves.

An airplane zooms through a patch of fluffy clouds.

Baseline video comparisons (CogVideoX | + PhyT2V | + Vanilla DPO | + PhysHPO (Ours)) for each of the following prompts:

A vibrant, elastic beach ball is thrown forcefully towards the ground, capturing its dynamic interaction with the surface upon impact.

A black pen is used to write on the smooth, white surface of a notebook, showcasing the interaction between the pen and the notebook surface.

A whisk mixes an egg in a bowl.

Evaluations on HunyuanVideo

Side-by-side video comparisons (HunyuanVideo | + PhysHPO) for each of the following prompts:

A bulldozer clears debris from a construction site, moving it into a dumpster.

Multiple candles of varying heights and widths are blown out simultaneously by a single breath, some flames extinguishing faster than others.

An apple submerging into a water.

A volleyball is spiked, hitting the wooden floor of the court and producing a soundless bounce visible by its slight dip and subsequent rebound.

A group of friends dancing energetically at a party.

A person skiing down a snowy mountain slope with speed and control.

A person gracefully ice skating on a frozen lake.

A person doing push-ups in a gym with perfect form.

A boxer throwing punches and dodging during a training session.

A runner stretching their legs before starting a race.

BibTeX

@article{chen2025hierarchical,
    title={Hierarchical Fine-grained Preference Optimization for Physically Plausible Video Generation},
    author={Chen, Harold Haodong and Huang, Haojian and Chen, Qifeng and Yang, Harry and Lim, Ser-Nam},
    journal={arXiv preprint arXiv:2508.10858},
    year={2025}
}
