Precise Action-to-Video Generation Through Visual Action Prompts
Abstract
Visual action prompts, using visual skeletons, enable precise action control in video generation while maintaining cross-domain transferability.
We present visual action prompts, a unified action representation for action-to-video generation of complex high-DoF interactions that maintains transferable visual dynamics across domains. Action-driven video generation faces a precision-generality trade-off: existing methods using text, primitive actions, or coarse masks offer generality but lack precision, while agent-centric action signals provide precision at the cost of cross-domain transferability. To balance action precision and dynamic transferability, we propose to "render" actions into precise visual prompts as domain-agnostic representations that preserve both geometric precision and cross-domain adaptability for complex actions; specifically, we choose visual skeletons for their generality and accessibility. We propose robust pipelines to construct skeletons from two interaction-rich data sources - human-object interactions (HOI) and dexterous robotic manipulation - enabling cross-domain training of action-driven generative models. By integrating visual skeletons into pretrained video generation models via lightweight fine-tuning, we enable precise action control of complex interactions while preserving the learning of cross-domain dynamics. Experiments on EgoVid, RT-1, and DROID demonstrate the effectiveness of our proposed approach. Project page: https://zju3dv.github.io/VAP/.
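As a rough illustration of what "rendering" actions into visual prompts might look like in practice, the sketch below rasterizes per-frame 2D skeleton keypoints into RGB prompt frames and stacks them into a conditioning video. The joint topology, frame size, and drawing style here are illustrative assumptions, not the paper's actual pipeline; the resulting prompt frames would be fed alongside the usual inputs of a pretrained video generation model.

```python
# Hypothetical sketch of visual action prompts: rasterize per-frame 2D skeleton
# keypoints into RGB prompt frames. Joint topology and rendering style are
# assumptions for illustration, not the paper's implementation.
import numpy as np
import cv2

# Assumed skeleton topology: pairs of joint indices to connect with "bones".
SKELETON_EDGES = [(0, 1), (1, 2), (2, 3), (3, 4)]  # e.g. one finger chain

def render_skeleton_frame(keypoints_2d, image_size=(256, 256)):
    """Rasterize 2D keypoints of shape (J, 2), in pixel coordinates, into an RGB frame."""
    h, w = image_size
    canvas = np.zeros((h, w, 3), dtype=np.uint8)
    pts = keypoints_2d.round().astype(int)
    # Draw bones first, then joints on top.
    for i, j in SKELETON_EDGES:
        cv2.line(canvas, (int(pts[i, 0]), int(pts[i, 1])),
                 (int(pts[j, 0]), int(pts[j, 1])), color=(0, 255, 0), thickness=2)
    for x, y in pts:
        cv2.circle(canvas, (int(x), int(y)), radius=3, color=(255, 0, 0), thickness=-1)
    return canvas

def render_action_prompt_video(keypoint_sequence, image_size=(256, 256)):
    """Stack per-frame skeleton renderings into a (T, H, W, 3) conditioning video."""
    return np.stack([render_skeleton_frame(k, image_size) for k in keypoint_sequence])

# Usage: a random 16-frame trajectory of 5 joints, rendered as prompt frames.
traj = np.random.rand(16, 5, 2) * 256
prompt_video = render_action_prompt_video(traj)
print(prompt_video.shape)  # (16, 256, 256, 3)
```

Because the conditioning signal is an image sequence rather than an agent-specific action vector, the same rendering step can in principle be applied to human hand skeletons and robot kinematic chains alike, which is what enables cross-domain training.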