Latent Particle World Models: Self-supervised Object-centric Stochastic Dynamics Modeling
Abstract
LPWM autonomously discovers object representations from video data and models stochastic particle dynamics for decision-making applications.
We introduce the Latent Particle World Model (LPWM), a self-supervised object-centric world model scaled to real-world multi-object datasets and applicable to decision-making. LPWM autonomously discovers keypoints, bounding boxes, and object masks directly from video data, enabling it to learn rich scene decompositions without supervision. Our architecture is trained end-to-end purely from videos and supports flexible conditioning on actions, language, and image goals. LPWM models stochastic particle dynamics via a novel latent action module and achieves state-of-the-art results on diverse real-world and synthetic datasets. Beyond stochastic video modeling, LPWM is readily applicable to decision-making, including goal-conditioned imitation learning, as we demonstrate in the paper. Code, data, pre-trained models, and video rollouts are available at: https://taldatech.github.io/lpwm-web
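The abstract outlines a pipeline of unsupervised particle discovery, a stochastic latent action module, and particle dynamics prediction, but gives no implementation details. Below is a minimal, hypothetical PyTorch sketch of that pipeline for orientation only: every module, dimension, and design choice here (the CNN encoder, the Gaussian reparameterized latent action, the residual particle update) is our assumption, not the authors' architecture; the official code is linked above.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions -- the abstract does not specify these.
NUM_PARTICLES = 8   # K latent particles per frame
PART_DIM = 18       # per particle: 2-D keypoint + 16-D appearance feature
ACTION_DIM = 4      # dimensionality of the stochastic latent action


class ParticleEncoder(nn.Module):
    """Maps an RGB frame to K latent particles (a stand-in for LPWM's
    unsupervised keypoint / bounding-box / mask discovery)."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, NUM_PARTICLES * PART_DIM)

    def forward(self, frame):  # frame: (B, 3, H, W)
        return self.head(self.backbone(frame)).view(-1, NUM_PARTICLES, PART_DIM)


class LatentActionModule(nn.Module):
    """Infers a Gaussian latent action from consecutive particle states and
    samples it with the reparameterization trick, keeping transitions stochastic."""

    def __init__(self):
        super().__init__()
        self.net = nn.Linear(2 * NUM_PARTICLES * PART_DIM, 2 * ACTION_DIM)

    def forward(self, prev_p, curr_p):  # each: (B, K, PART_DIM)
        h = torch.cat([prev_p.flatten(1), curr_p.flatten(1)], dim=-1)
        mu, log_var = self.net(h).chunk(2, dim=-1)
        return mu + torch.randn_like(mu) * (0.5 * log_var).exp()  # z ~ N(mu, sigma^2)


class ParticleDynamics(nn.Module):
    """Predicts next-step particles from current particles and the latent action."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(PART_DIM + ACTION_DIM, 64), nn.ReLU(),
            nn.Linear(64, PART_DIM),
        )

    def forward(self, particles, z):  # particles: (B, K, PART_DIM), z: (B, ACTION_DIM)
        z_tiled = z.unsqueeze(1).expand(-1, NUM_PARTICLES, -1)
        return particles + self.net(torch.cat([particles, z_tiled], dim=-1))  # residual update


if __name__ == "__main__":
    enc, lam, dyn = ParticleEncoder(), LatentActionModule(), ParticleDynamics()
    frames = torch.randn(2, 3, 3, 64, 64)   # (B, T, C, H, W) toy video clip
    p_prev, p_curr = enc(frames[:, 0]), enc(frames[:, 1])
    z = lam(p_prev, p_curr)                 # stochastic latent action
    p_next = dyn(p_curr, z)                 # predicted particles for frame t+1
    print(p_next.shape)                     # torch.Size([2, 8, 18])
```

Unrolling the last three lines over time is the generic world-model loop this kind of architecture would use for stochastic video prediction or planning; conditioning on real actions, language, or image goals would replace or augment the inferred latent action.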
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Factored Latent Action World Models (2026)
- UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models (2026)
- BridgeV2W: Bridging Video Generation Models to Embodied World Models via Embodiment Masks (2026)
- VLA-JEPA: Enhancing Vision-Language-Action Model with Latent World Model (2026)
- Video Generation Models in Robotics - Applications, Research Challenges, Future Directions (2026)
- Chain of World: World Model Thinking in Latent Motion (2026)
- Causal World Modeling for Robot Control (2026)