metadata

license: cc-by-nc-4.0
library_name: transformers
pipeline_tag: video-text-to-text

R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning

This model utilizes Reinforcement Learning with Verifiable Reward (RLVR) to perform omni-multimodal emotion recognition. Built upon the HumanOmni-0.5B model, R1-Omni excels at understanding visual and audio cues for emotion identification, even in out-of-distribution scenarios.

📖 Introduction

R1-Omni is the first application of Reinforcement Learning with Verifiable Reward (RLVR) to an Omni-multimodal large language model. It focuses on emotion recognition, where visual and audio modalities play crucial roles. Key insights include:

Enhanced Reasoning Capability: R1-Omni demonstrates superior reasoning abilities, enabling a clearer understanding of how visual and audio information contribute to emotion recognition.
Improved Understanding Capability: Compared to SFT, RLVR significantly boosts performance on emotion recognition tasks (for example, 65.83 vs. 60.23 WAR on DFEW for R1-Omni and MAFW-DFEW-SFT, respectively).
Stronger Generalization Capability: RLVR models generalize markedly better, particularly in out-of-distribution scenarios (43.00 vs. 29.33 WAR on RAVDESS for R1-Omni and MAFW-DFEW-SFT, respectively).
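
To make the idea of a verifiable reward concrete, here is a minimal sketch of a rule-based reward for emotion recognition. This card does not spell out the reward, so everything here is an assumption for illustration: the `<think>`/`<answer>` output template, the function names, and the choice to sum an accuracy term and a format term. It is not the project's training code.

```python
import re

# Assumed output template (hypothetical): reasoning inside <think> tags,
# followed by the final emotion label inside <answer> tags.
_TEMPLATE = re.compile(
    r"\s*<think>(.*?)</think>\s*<answer>(.*?)</answer>\s*", re.DOTALL
)
_ANSWER = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def format_reward(response: str) -> float:
    """1.0 if the whole response follows the assumed template, else 0.0."""
    return 1.0 if _TEMPLATE.fullmatch(response) else 0.0


def accuracy_reward(response: str, label: str) -> float:
    """1.0 if the label in <answer> matches the ground truth, else 0.0."""
    match = _ANSWER.search(response)
    if match is None:
        return 0.0
    predicted = match.group(1).strip().lower()
    return 1.0 if predicted == label.strip().lower() else 0.0


def verifiable_reward(response: str, label: str) -> float:
    """Sum of the two rule-based terms; both can be checked without a learned reward model."""
    return accuracy_reward(response, label) + format_reward(response)


# Toy response, made up for illustration.
example = "<think>The person frowns and the voice trembles.</think><answer>sad</answer>"
score = verifiable_reward(example, "sad")
```

Because both terms are simple string checks against a ground-truth label, the reward can be verified directly, which is the property the RLVR name refers to.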

📦 Model Download

The model is based on the open-source HumanOmni-0.5B model. The following models are available: HumanOmni-0.5B, the cold-start model EMER-SFT, the model MAFW-DFEW-SFT fine-tuned directly on the MAFW and DFEW training sets, and the final model R1-Omni.

Model	HuggingFace	ModelScope
`HumanOmni-0.5B`
`EMER-SFT`
`MAFW-DFEW-SFT`
`R1-Omni`

🏆 Performance

Below is the performance on emotion recognition datasets, reported as weighted average recall (WAR) and unweighted average recall (UAR). We use symbols to indicate whether the data is in-distribution (⬤) or out-of-distribution (△).

Method	DFEW (WAR) ⬤	DFEW (UAR) ⬤	MAFW (WAR) ⬤	MAFW (UAR) ⬤	RAVDESS (WAR) △	RAVDESS (UAR) △
HumanOmni-0.5B	22.64	19.44	20.18	13.52	7.33	9.38
EMER-SFT	38.66	35.31	38.39	28.02	29.00	27.19
MAFW-DFEW-SFT	60.23	44.39	50.44	30.39	29.33	30.75
R1-Omni	65.83	56.27	57.68	40.04	43.00	44.69

Legend

⬤: Indicates in-distribution data (DFEW and MAFW).
△: Indicates out-of-distribution data (RAVDESS).
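
The table reports WAR (weighted average recall) and UAR (unweighted average recall). As a rough sketch of how the two differ, assuming the standard definitions (WAR is overall accuracy, UAR is the plain mean of per-class recall), the helper below computes both from label lists. The function name `war_uar` and the toy labels are illustrative and are not taken from the R1-Omni evaluation code.

```python
from collections import defaultdict


def war_uar(y_true, y_pred):
    """Return (WAR, UAR) as percentages for two equal-length label sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    total = defaultdict(int)    # samples per ground-truth class
    correct = defaultdict(int)  # correctly predicted samples per class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    # WAR: per-class recall weighted by class frequency, i.e. overall accuracy.
    war = 100.0 * sum(correct.values()) / len(y_true)
    # UAR: unweighted mean of per-class recall, so rare classes count equally.
    uar = 100.0 * sum(correct[c] / total[c] for c in total) / len(total)
    return war, uar


# Toy labels, made up for illustration (not from any dataset).
y_true = ["happy", "happy", "happy", "sad"]
y_pred = ["happy", "happy", "sad", "sad"]
war, uar = war_uar(y_true, y_pred)
```

On imbalanced emotion datasets the two numbers can diverge noticeably, which is why the table lists both.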

🛠️ Environment Setup

Our code is built on the R1-V framework. To set up the environment, please follow the installation instructions in the R1-V repository.

🔍 Inference

Our inference code is based on the implementation from HumanOmni.

📚 Citation

If you find our work helpful, feel free to cite us.

@article{zhao2025r1omniexplainableomnimultimodalemotion,
      title={R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning}, 
      author={Jiaxing Zhao and Xihan Wei and Liefeng Bo},
      journal={arXiv preprint arXiv:2503.05379},
      year={2025}
}