---
license: apache-2.0
---
# R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning
[GitHub](https://github.com/Jiaxing-star/R1-Omni)
[ModelScope](https://modelscope.cn/models/iic/R1-Omni-0.5B)
[arXiv](https://arxiv.org/abs/2503.05379)
## 📖 Introduction
**R1-Omni** is the industry's first application of Reinforcement Learning with Verifiable Reward (RLVR) to an Omni-multimodal large language model. We focus on emotion recognition, a task where both visual and audio modalities play crucial roles, to validate the potential of combining RLVR with Omni models. Our findings reveal several key insights:
1) **Enhanced Reasoning Capability**: R1-Omni demonstrates superior reasoning abilities, enabling a clearer understanding of how visual and audio information contribute to emotion recognition.
2) **Improved Understanding Capability**: Compared to SFT, RLVR significantly boosts performance on emotion recognition tasks.
3) **Stronger Generalization Capability**: RLVR models exhibit markedly better generalization capabilities, particularly excelling in out-of-distribution scenarios.
## 🏆 Performance
Below is the performance of each method on emotion recognition datasets. We use symbols to indicate whether the data is **in-distribution (⬤)** or **out-of-distribution (△)**.
| Method | DFEW (WAR) ⬤ | DFEW (UAR) ⬤ | MAFW (WAR) ⬤ | MAFW (UAR) ⬤ | RAVDESS (WAR) △ | RAVDESS (UAR) △ |
|----------------------------------|---------------|---------------|---------------|---------------|------------------|------------------|
| HumanOmni-0.5B | 22.64 | 19.44 | 20.18 | 13.52 | 7.33 | 9.38 |
| EMER-SFT | 38.66 | 35.31 | 38.39 | 28.02 | 29.00 | 27.19 |
| MAFW-DFEW-SFT | 60.23 | 44.39 | 50.44 | 30.39 | 29.33 | 30.75 |
| R1-Omni | 65.83 | 56.27 | 57.68 | 40.04 | 43.00 | 44.69 |
### Legend
- **⬤**: Indicates **in-distribution data** (DFEW and MAFW).
- **△**: Indicates **out-of-distribution data** (RAVDESS).
## 🛠️ Environment Setup
Our code is built on the R1-V framework. To set up the environment, please follow the installation instructions in the [R1-V repository](https://github.com/Deep-Agent/R1-V/).
## 🔍 Inference
Our inference code is based on the implementation from **HumanOmni**. To ensure the model runs inference smoothly, follow these steps:
1. **Download the Required Models**:
   - [siglip-224](https://huggingface.co/google/siglip-base-patch16-224)
   - [whisper-large-v3](https://huggingface.co/openai/whisper-large-v3)
2. **Update the Configuration File**:
   - In the directory where you downloaded the R1-Omni model, locate the `config.json` file.
   - Update the paths on line 23 and line 31 to point to the local folders where you saved the models.
#### Example: Updating config.json
If you saved the models to the following local paths:
- `/path/to/local/models/siglip-base-patch16-224`
- `/path/to/local/models/whisper-large-v3`
Update the relevant lines in `config.json` as follows:
```json
"mm_audio_tower": "/path/to/local/models/whisper-large-v3",
"mm_vision_tower": "/path/to/local/models/siglip-base-patch16-224"
```
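If you prefer to script these two steps, the sketch below downloads both towers with `huggingface_hub` and rewrites the two keys in `config.json`. This is our own convenience sketch, not part of the repo; it assumes the R1-Omni checkpoint already sits in `./R1-Omni-0.5B`, and the local paths are placeholders.
```python
# Convenience sketch (not from the repo): fetch both towers and point
# config.json at the downloaded folders. Paths below are placeholders.
import json
from pathlib import Path

from huggingface_hub import snapshot_download

MODEL_DIR = Path("./R1-Omni-0.5B")          # local R1-Omni checkpoint
LOCAL_ROOT = Path("/path/to/local/models")  # where the towers should live

# Download both towers from the Hugging Face Hub.
vision_dir = snapshot_download(
    "google/siglip-base-patch16-224",
    local_dir=LOCAL_ROOT / "siglip-base-patch16-224",
)
audio_dir = snapshot_download(
    "openai/whisper-large-v3",
    local_dir=LOCAL_ROOT / "whisper-large-v3",
)

# Rewrite the two tower paths in config.json.
config_path = MODEL_DIR / "config.json"
config = json.loads(config_path.read_text())
config["mm_vision_tower"] = str(vision_dir)
config["mm_audio_tower"] = str(audio_dir)
config_path.write_text(json.dumps(config, indent=2))
print(f"updated {config_path}")
```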
We provide `inference.py` for single-video inference.
```bash
python inference.py --modal video_audio \
--model_path ./R1-Omni-0.5B \
--video_path video.mp4 \
--instruct "As an emotional recognition expert; throughout the video, which emotion conveyed by the characters is the most obvious to you? Output the thinking process in <think> </think> and final emotion in <answer> </answer> tags."
```
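Because the prompt asks the model to wrap its reasoning in `<think>` tags and the final label in `<answer>` tags, downstream code usually needs to separate the two. A minimal parsing sketch (ours, not part of the repo):
```python
# Sketch: split an R1-Omni response into reasoning and the final emotion label.
import re

def parse_response(text: str) -> tuple[str, str]:
    """Return (thinking, answer); empty strings if a tag is missing."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return (
        think.group(1).strip() if think else "",
        answer.group(1).strip() if answer else "",
    )

thinking, emotion = parse_response(
    "<think>His furrowed brow and tense tone suggest anxiety.</think>\n"
    "<answer>anxious</answer>"
)
print(emotion)  # -> anxious
```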
## 🧠 Training
### Cold Start
We initialize the model by fine-tuning HumanOmni-0.5B on a combined dataset consisting of 232 samples from the [Explainable Multimodal Emotion Reasoning](https://github.com/zeroQiaoba/AffectGPT) dataset and 348 samples from the HumanOmni dataset.
An example JSON file of the training data:
```json
[
  {
    "video": "MER24/sample_00000967.mp4",
    "conversations": [
      {
        "from": "human",
        "value": "<video>\n<audio>\nAs an emotional recognition expert; throughout the video, which emotion conveyed by the characters is the most obvious to you? Output the thinking process in <think> </think> and final emotion in <answer> </answer> tags."
      },
      {
        "from": "gpt",
        "value": "<think>The video depicts a bright and tranquil indoor setting, where a man in a white Polo shirt stands by the window, engaged in a phone call. His furrowed brow and open mouth suggest he is experiencing tension and anxiety. According to the audio content of the video, his speech is fast-paced, and his tone is filled with confusion and stress. A comprehensive analysis reveals that the man is facing a moderate level of anxiety, closely linked to the challenging phone conversation he is having. Consequently, the entire emotional analysis report emphasizes his anxiety and nervousness in handling challenging situations.</think>\n<answer>anxious</answer>"
      }
    ]
  },
  ...
]
```
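Before fine-tuning, it can help to sanity-check that every sample follows this schema. The loader below is a hedged sketch based only on the fields shown above; the filename `cold_start.json` is a placeholder:
```python
# Sketch: load and sanity-check a cold-start JSON file in the format above.
import json

def load_training_data(path: str) -> list[dict]:
    with open(path) as f:
        data = json.load(f)
    for i, sample in enumerate(data):
        assert "video" in sample, f"sample {i}: missing 'video' path"
        turns = sample.get("conversations", [])
        assert len(turns) >= 2, f"sample {i}: expected a human/gpt turn pair"
        assert turns[0]["from"] == "human" and turns[1]["from"] == "gpt"
        # Cold-start targets should carry both reasoning and answer tags.
        assert "<think>" in turns[1]["value"] and "<answer>" in turns[1]["value"], \
            f"sample {i}: gpt turn missing <think>/<answer> tags"
    return data

data = load_training_data("cold_start.json")
print(f"{len(data)} samples OK")
```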
All of the cold-start data will be released as soon as possible.
### RLVR
In this stage, we use the training sets from the [MAFW](https://mafw-database.github.io/MAFW/) and [DFEW](https://dfew-dataset.github.io/) datasets, comprising 15,306 video samples in total.
An example JSON file of the training data:
```json
[
  {
    "video": "DFEW/videos/1.mp4",
    "conversations": [
      {
        "from": "human",
        "value": "<video>\n<audio>\nAs an emotional recognition expert; throughout the video, which emotion conveyed by the characters is the most obvious to you?"
      },
      {
        "from": "gpt",
        "value": "sad"
      }
    ]
  },
  ...
]
```
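For RLVR, the reward must be verifiable, i.e. computable directly from the ground-truth label without a judge model. The exact reward functions are defined by the R1-V framework; the sketch below only illustrates the two standard components (a format reward for the `<think>`/`<answer>` structure and an accuracy reward for the emotion label) under our own assumptions:
```python
# Illustrative sketch of a verifiable reward in the spirit of RLVR
# (not the exact reward implemented in R1-V / R1-Omni).
import re

def format_reward(completion: str) -> float:
    """1.0 if the completion follows the <think>...</think><answer>...</answer> template."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, completion.strip(), re.DOTALL) else 0.0

def accuracy_reward(completion: str, label: str) -> float:
    """1.0 if the predicted emotion exactly matches the ground-truth label."""
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    pred = m.group(1).strip().lower() if m else ""
    return 1.0 if pred == label.strip().lower() else 0.0

completion = "<think>Tears and a trembling voice point to sadness.</think>\n<answer>sad</answer>"
print(format_reward(completion), accuracy_reward(completion, "sad"))  # 1.0 1.0
```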
## 🤝 Related Work
- [R1-V](https://github.com/Deep-Agent/R1-V)
- [HumanOmni](https://github.com/HumanMLLM/HumanOmni)
- [DeepSeek-R1](https://github.com/deepseek-ai/DeepSeek-R1)
## 📚 Citation
If you find our work helpful, feel free to cite us:
```bibtex
@article{zhao2025r1omniexplainableomnimultimodalemotion,
  title={R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning},
  author={Jiaxing Zhao and Xihan Wei and Liefeng Bo},
  journal={arXiv preprint arXiv:2503.05379},
  year={2025}
}
```