EmbodiedOneVision: Interleaved Vision-Text-Action Pretraining for General Robot Control
Abstract
EO-Robotics, comprising the EO-1 model and the EO-Data1.5M dataset, advances multimodal embodied reasoning and robot control through interleaved vision-text-action pre-training.
The human ability to seamlessly perform multimodal reasoning and physical interaction in the open world is a core goal for general-purpose embodied intelligent systems. Recent vision-language-action (VLA) models, which are co-trained on large-scale robot and visual-text data, have demonstrated notable progress in general robot control. However, they still fail to achieve human-level flexibility in interleaved reasoning and interaction. In this work, we introduce EO-Robotics, which consists of the EO-1 model and the EO-Data1.5M dataset. EO-1 is a unified embodied foundation model that achieves superior performance in multimodal embodied reasoning and robot control through interleaved vision-text-action pre-training. The development of EO-1 is based on two key pillars: (i) a unified architecture that processes multimodal inputs indiscriminately (image, text, video, and action), and (ii) a massive, high-quality multimodal embodied reasoning dataset, EO-Data1.5M, which contains over 1.5 million samples with an emphasis on interleaved vision-text-action comprehension. EO-1 is trained through the synergy between auto-regressive decoding and flow matching denoising on EO-Data1.5M, enabling seamless robot action generation and multimodal embodied reasoning. Extensive experiments demonstrate the effectiveness of interleaved vision-text-action learning for open-world understanding and generalization, validated through a variety of long-horizon, dexterous manipulation tasks across multiple embodiments. This paper details the architecture of EO-1, the data construction strategy of EO-Data1.5M, and the training methodology, offering valuable insights for developing advanced embodied foundation models.
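To make the combined training objective concrete, the sketch below shows how auto-regressive decoding over text tokens and flow-matching denoising over continuous action chunks could be optimized in a single step. This is a minimal illustration, not the official EO-1 code; the model interface, batch fields, and loss weighting are assumptions.

```python
# Minimal sketch (not the official EO-1 implementation) of an interleaved training
# step combining the two objectives named in the abstract: auto-regressive decoding
# for text tokens and flow-matching denoising for continuous actions.
# All module names, batch fields, and hyperparameters are illustrative assumptions.
import torch
import torch.nn.functional as F

def interleaved_training_step(model, batch, action_loss_weight=1.0):
    """One training step over an interleaved vision-text-action batch.

    Assumed batch fields (hypothetical):
      input_embeds - [B, L, D] embedded interleaved tokens (image/text/action slots)
      text_labels  - [B, L] next-token ids, -100 where the target is not text
      actions      - [B, H, A] ground-truth action chunk (horizon H, action dim A)
      action_mask  - [B, L] marks which sequence positions are action slots
    """
    actions = batch["actions"]
    B = actions.size(0)

    # --- Flow matching on actions (rectified-flow style interpolant) ---
    t = torch.rand(B, 1, 1, device=actions.device)          # time in (0, 1)
    noise = torch.randn_like(actions)                        # x_0 ~ N(0, I)
    noisy_actions = (1.0 - t) * noise + t * actions          # linear interpolant x_t
    target_velocity = actions - noise                        # d x_t / d t

    # --- Single forward pass of the unified decoder-only transformer ---
    # The model is assumed to return text logits and a velocity prediction for the
    # action slots of the same interleaved sequence.
    text_logits, pred_velocity = model(
        input_embeds=batch["input_embeds"],
        noisy_actions=noisy_actions,
        timestep=t.squeeze(-1).squeeze(-1),
        action_mask=batch["action_mask"],
    )

    # Auto-regressive loss on text positions only (-100 labels are ignored).
    ar_loss = F.cross_entropy(
        text_logits[:, :-1].reshape(-1, text_logits.size(-1)),
        batch["text_labels"][:, 1:].reshape(-1),
        ignore_index=-100,
    )
    # Flow-matching loss: regress the velocity field at the action slots.
    fm_loss = F.mse_loss(pred_velocity, target_velocity)

    return ar_loss + action_loss_weight * fm_loss
```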
Community
EO: Open-source Unified Embodied Foundation Model Series
We introduce the EO-1 model, an open-source unified embodied foundation model comprising 3B parameters, trained on the carefully curated interleaved embodied dataset EO-Data1.5M, web multimodal data, and robot control data (AgiBotWorld, Open X-Embodiment, RoboMIND, SO100-Community, etc.). The EO-1 model adopts a single unified decoder-only transformer that integrates discrete auto-regressive decoding with continuous flow matching denoising for multimodal embodied reasoning and robot control, enabling seamless perception, planning, reasoning, and acting in a single model (see the action-sampling sketch after the feature list below). This work highlights the following features:
- Unified Architecture: a single decoder-only transformer integrating text, image, video, and action tokens.
- EO-Data1.5M Dataset: 1.5M high-quality interleaved samples spanning physical, reasoning, spatial, and control data.
- Interleaved Pretraining: seamless synergy between language and action via auto-regressive decoding plus flow matching.
- Reasoning-Enhanced Generalization: superior generalization through combined multimodal embodied reasoning and real robot control.
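As referenced above, the following sketch illustrates how a flow-matching action head could be sampled at inference time by Euler integration of the learned velocity field, conditioned on the interleaved vision-text context. It is a hedged illustration only; `predict_velocity` and its arguments are hypothetical names, not the released EO-1 API.

```python
# Illustrative sketch (not the released EO-1 API) of sampling an action chunk from
# a flow-matching head at inference time via simple Euler integration.
# Function and argument names are assumptions for illustration only.
import torch

@torch.no_grad()
def sample_action_chunk(model, context_embeds, horizon, action_dim, num_steps=10):
    """Integrate the learned velocity field from noise (t=0) to actions (t=1)."""
    B = context_embeds.size(0)
    device = context_embeds.device
    actions = torch.randn(B, horizon, action_dim, device=device)  # start from noise
    dt = 1.0 / num_steps

    for step in range(num_steps):
        t = torch.full((B,), step * dt, device=device)
        # Hypothetical forward: the same decoder-only transformer predicts the
        # velocity for the current noisy action slots, conditioned on the
        # interleaved vision-text context.
        velocity = model.predict_velocity(
            context_embeds=context_embeds,
            noisy_actions=actions,
            timestep=t,
        )
        actions = actions + dt * velocity  # Euler step along the flow

    return actions  # denoised action chunk to execute on the robot
```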
Links
- EO-Robotics Website: http://eo-robotics.ai/eo-1
- EO-Robotics Code: https://github.com/EO-Robotics/EO-1
- EO-Robotics Paper on arXiv: https://arxiv.org/abs/2508.21112
- EO-1 Model: https://huggingface.co/IPEC-COMMUNITY/EO-1-3B
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- MolmoAct: Action Reasoning Models that can Reason in Space (2025)
- Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation (2025)
- Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey (2025)
- Vision Language Action Models in Robotic Manipulation: A Systematic Review (2025)
- Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos (2025)
- Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation (2025)
- RynnEC: Bringing MLLMs into Embodied World (2025)