RynnEC: Bringing MLLMs into Embodied World
If our project helps you, please give us a star ⭐ on GitHub to support us.
📰 News
- [2025.08.17] 🤗 The RynnEC-7B model checkpoint has been released on Hugging Face.
- [2025.08.08] 🔥🔥 Released our RynnEC-2B model, RynnEC-Bench, and the training code.
Introduction
RynnEC is a video multi-modal large language model (MLLM) specifically designed for embodied cognition tasks.
Architecture
RynnEC can handle a variety of input types, including images, videos, visual prompts, and task instructions. Visual inputs are processed using a Vision Encoder equipped with an any-resolution strategy, while visual prompts are handled by a region encoder to extract fine-grained features. Textual inputs are seamlessly converted into a unified token stream through tokenization. For video segmentation tasks, a mask decoder is employed to transform the output segmentation embeddings into binary masks, ensuring precise and effective results.
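As a quick, illustrative starting point, the snippet below shows how such a checkpoint could be loaded with the Hugging Face `transformers` Auto classes. This is only a minimal sketch: it assumes the released checkpoint bundles custom model/processor code loadable via `trust_remote_code`, and the processor call, video path, and prompt are hypothetical placeholders rather than the official RynnEC inference API; please refer to the official codebase for the supported pipeline.

```python
# Minimal sketch: loading a RynnEC checkpoint through the transformers Auto classes.
# Assumptions: the checkpoint ships custom model/processor code usable via
# trust_remote_code, and the repo id matches the Model Zoo below. The processor
# call, video path, and prompt are illustrative, not the official RynnEC API.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_ID = "Alibaba-DAMO-Academy/RynnEC-2B"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # loads the model class bundled with the checkpoint
).eval().to("cuda")

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)

# Hypothetical preprocessing call: pack a video and a task instruction into model inputs.
inputs = processor(
    videos=["examples/kitchen_tour.mp4"],
    text="Which object on the counter is closest to the sink?",
    return_tensors="pt",
).to("cuda")

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128)

print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```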
Model Zoo
| Model | Base Model | HF Link |
|---|---|---|
| RynnEC-2B | Qwen2.5-1.5B-Instruct | [Alibaba-DAMO-Academy/RynnEC-2B](https://huggingface.co/Alibaba-DAMO-Academy/RynnEC-2B) |
| RynnEC-7B | Qwen2.5-7B-Instruct | [Alibaba-DAMO-Academy/RynnEC-7B](https://huggingface.co/Alibaba-DAMO-Academy/RynnEC-7B) |
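To fetch a checkpoint locally, the standard `huggingface_hub` client can be used as sketched below. The repo ids come from the table above; the local directory is just an example path.

```python
# Download a RynnEC checkpoint from the Hugging Face Hub.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="Alibaba-DAMO-Academy/RynnEC-2B",  # or "Alibaba-DAMO-Academy/RynnEC-7B"
    local_dir="./checkpoints/RynnEC-2B",       # example path, adjust as needed
)
print(f"Checkpoint downloaded to: {local_dir}")
```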
Main Results
We compare models on benchmarks covering both object cognition and spatial cognition. With a highly efficient 2B-parameter architecture, RynnEC-2B achieves state-of-the-art (SOTA) performance on complex spatial cognition tasks.
Citation
If you find RynnEC useful for your research and applications, please cite using this BibTeX:
@misc{dang2025rynnecbringingmllmsembodied,
title={RynnEC: Bringing MLLMs into Embodied World},
author={Ronghao Dang and Yuqian Yuan and Yunxuan Mao and Kehan Li and Jiangpin Liu and Zhikai Wang and Xin Li and Fan Wang and Deli Zhao},
year={2025},
eprint={2508.14160},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2508.14160},
}