---
license: apache-2.0
---
# R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning

[![GitHub](https://img.shields.io/badge/GitHub-R1Omni-green)](https://github.com/Jiaxing-star/R1-Omni)
[![ModelScope](https://img.shields.io/badge/ModelScope-R1Omni-blue)](https://modelscope.cn/models/iic/R1-Omni-0.5B)
[![arXiv](https://img.shields.io/badge/arXiv-2503.05379-red)](https://arxiv.org/abs/2503.05379)


## 📖 Introduction
**R1-Omni** is the industry’s first application of Reinforcement Learning with Verifiable Reward (RLVR) to an Omni-multimodal large language model. We focus on emotion recognition, a task where both visual and audio modalities play crucial roles, to validate the potential of combining RLVR with Omni models. Our findings reveal several key insights:
1) **Enhanced Reasoning Capability**: R1-Omni demonstrates superior reasoning abilities, enabling a clearer understanding of how visual and audio information contribute to emotion recognition.
2) **Improved Understanding Capability**: Compared to SFT, RLVR significantly boosts performance on emotion recognition tasks.
3) **Stronger Generalization Capability**: RLVR models exhibit markedly better generalization capabilities, particularly excelling in out-of-distribution scenarios.


## 🏆 Performance

Below are the results on emotion recognition datasets. We use symbols to indicate whether the data is **in-distribution (⬤)** or **out-of-distribution (△)**.

| Method                          | DFEW (WAR) ⬤ | DFEW (UAR) ⬤ | MAFW (WAR) ⬤ | MAFW (UAR) ⬤ | RAVDESS (WAR) △ | RAVDESS (UAR) △ |
|----------------------------------|---------------|---------------|---------------|---------------|------------------|------------------|
| HumanOmni-0.5B                   | 22.64         | 19.44         | 20.18         | 13.52         | 7.33             | 9.38             |
| EMER-SFT                         | 38.66       | 35.31         | 38.39         | 28.02         | 29.00            | 27.19            |
| MAFW-DFEW-SFT                    | 60.23         | 44.39         | 50.44         | 30.39         | 29.33            | 30.75            |
| R1-Omni                          | 65.83         | 56.27         | 57.68         | 40.04         | 43.00            | 44.69            |


### Legend
- **⬤**: Indicates **in-distribution data** (DFEW and MAFW).
- **△**: Indicates **out-of-distribution data** (RAVDESS).


## 🛠️ Environment Setup
Our code is built on the R1-V framework. To set up the environment, please follow the installation instructions in the [R1-V repository](https://github.com/Deep-Agent/R1-V/).

## 🔍 Inference
Our inference code is based on the implementation from **HumanOmni**. To ensure the model runs inference smoothly, follow these steps:

1. **Download the Required Models**
   - [siglip-224](https://huggingface.co/google/siglip-base-patch16-224)
   - [whisper-large-v3](https://huggingface.co/openai/whisper-large-v3)

2. **Update the Configuration File**
   - In the directory where you downloaded the R1-Omni model, locate the `config.json` file.
   - Update the paths on line 23 and line 31 to point to the local folders where you saved the models.


#### Example: Updating config.json
If you saved the models to the following local paths:
- `/path/to/local/models/siglip-base-patch16-224`
- `/path/to/local/models/whisper-large-v3`

Update the relevant lines in config.json as follows:
```json
 "mm_audio_tower": "/path/to/local/models/whisper-large-v3",
 "mm_vision_tower": "/path/to/local/models/siglip-base-patch16-224"
```
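If you prefer to patch the file programmatically, here is a minimal Python sketch that rewrites both keys with the standard `json` module; the checkpoint location (`./R1-Omni-0.5B`) and the model paths are placeholders to adapt to your setup:
```python
# Minimal sketch (not part of the released code): update the two tower paths
# in the R1-Omni config.json. Adjust the placeholder paths to your setup.
import json
from pathlib import Path

config_path = Path("./R1-Omni-0.5B/config.json")  # assumed checkpoint location
config = json.loads(config_path.read_text())

config["mm_audio_tower"] = "/path/to/local/models/whisper-large-v3"
config["mm_vision_tower"] = "/path/to/local/models/siglip-base-patch16-224"

config_path.write_text(json.dumps(config, indent=2))
print("Updated", config_path)
```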

We provide `inference.py` for single-video inference.
```
python inference.py --modal video_audio \
  --model_path ./R1-Omni-0.5B \
  --video_path video.mp4 \
  --instruct "As an emotional recognition expert; throughout the video, which emotion conveyed by the characters is the most obvious to you?  Output the thinking process in <think> </think> and final emotion in <answer> </answer> tags."
```
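To process several videos with the same prompt, a thin wrapper can invoke `inference.py` once per file. The sketch below assumes the CLI exactly as shown above and a hypothetical `./videos` input folder:
```python
# Hypothetical batch wrapper: calls the inference.py CLI shown above
# once for every .mp4 file in an assumed ./videos directory.
import subprocess
from pathlib import Path

INSTRUCT = (
    "As an emotional recognition expert; throughout the video, which emotion "
    "conveyed by the characters is the most obvious to you? Output the thinking "
    "process in <think> </think> and final emotion in <answer> </answer> tags."
)

for video in sorted(Path("./videos").glob("*.mp4")):
    subprocess.run(
        [
            "python", "inference.py",
            "--modal", "video_audio",
            "--model_path", "./R1-Omni-0.5B",
            "--video_path", str(video),
            "--instruct", INSTRUCT,
        ],
        check=True,  # stop on the first failing video
    )
```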



## 🧠 Training
### Cold Start
We initialize HumanOmni-0.5B by fine-tuning it on a combined dataset consisting of 232 samples from the [Explainable Multimodal Emotion Reasoning](https://github.com/zeroQiaoba/AffectGPT) dataset and 348 samples from the HumanOmni dataset.
An example JSON file of the training data:
```
[
    {
        "video": "MER24/sample_00000967.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\n<audio>\nAs an emotional recognition expert; throughout the video, which emotion conveyed by the characters is the most obvious to you? Output the thinking process in <think> </think> and final emotion in <answer> </answer> tags."
            },
            {
                "from": "gpt",
                "value": "<think>The video depicts a bright and tranquil indoor setting, where a man in a white Polo shirt stands by the window, engaged in a phone call. His furrowed brow and open mouth suggest he is experiencing tension and anxiety. According to the audio content of the video, his speech is fast-paced, and his tone is filled with confusion and stress. A comprehensive analysis reveals that the man is facing a moderate level of anxiety, closely linked to the challenging phone conversation he is having. Consequently, the entire emotional analysis report emphasizes his anxiety and nervousness in handling challenging situations.</think>\n<answer>anxious</answer>"
            }
        ]
    },
  ...
]
```
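Before fine-tuning, it can be useful to verify that every sample follows this structure. The following is a hypothetical sanity check (not part of the released code) that confirms each response keeps the `<think>`/`<answer>` tags; run it as, e.g., `python check_format.py cold_start.json`:
```python
# Hypothetical format check for the cold-start JSON shown above: every sample
# must have a video path and a model reply wrapped in <think>/<answer> tags.
import json
import re
import sys

TAGGED = re.compile(r"<think>.+</think>\s*<answer>.+</answer>", re.DOTALL)

with open(sys.argv[1]) as f:
    samples = json.load(f)

for i, sample in enumerate(samples):
    assert "video" in sample, f"sample {i} has no video path"
    reply = sample["conversations"][1]["value"]  # the 'gpt' turn
    assert TAGGED.search(reply), f"sample {i} is missing <think>/<answer> tags"

print(f"{len(samples)} samples passed the format check")
```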
All of the cold-start data will be released as soon as possible.

### RLVR
In this stage, we utilize the training sets from the [MAFW](https://mafw-database.github.io/MAFW/) and [DFEW](https://dfew-dataset.github.io/) datasets, comprising a total of 15,306 video samples.
An example JSON file of the training data:
```
[
    {
        "video": "DFEW/videos/1.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\n<audio>\nAs an emotional recognition expert; throughout the video, which emotion conveyed by the characters is the most obvious to you?"
            },
            {
                "from": "gpt",
                "value": "sad"
            }
        ]
    },
  ...
]
```
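RLVR replaces a learned reward model with a rule-based, verifiable reward computed from the ground-truth label. The exact reward used in R1-Omni is not reproduced here; the sketch below is only an illustrative accuracy-plus-format reward in that spirit, with the weighting chosen arbitrarily for the example:
```python
# Illustrative verifiable reward (an assumption, not the authors' exact rule):
# score 1.0 when the emotion inside <answer>...</answer> matches the label,
# plus a smaller bonus when the <think>/<answer> format is respected.
import re

ANSWER_RE = re.compile(r"<answer>\s*(.*?)\s*</answer>", re.DOTALL)
FORMAT_RE = re.compile(r"<think>.+</think>\s*<answer>.+</answer>", re.DOTALL)

def emotion_reward(completion: str, ground_truth: str) -> float:
    match = ANSWER_RE.search(completion)
    predicted = match.group(1).strip().lower() if match else ""
    accuracy = 1.0 if predicted == ground_truth.strip().lower() else 0.0
    format_bonus = 0.5 if FORMAT_RE.search(completion) else 0.0
    return accuracy + format_bonus

# Example: a well-formatted, correct completion scores 1.5.
print(emotion_reward("<think>He looks downcast.</think>\n<answer>sad</answer>", "sad"))
```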


## 🤝 Related Work
- [R1-V](https://github.com/Deep-Agent/R1-V)
- [HumanOmni](https://github.com/HumanMLLM/HumanOmni)
- [DeepSeek-R1](https://github.com/deepseek-ai/DeepSeek-R1)

## 📚 Citation
If you find our work helpful, feel free to cite us.
```
@article{zhao2025r1omniexplainableomnimultimodalemotion,
      title={R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning}, 
      author={Jiaxing Zhao and Xihan Wei and Liefeng Bo},
      journal={arXiv preprint arXiv:2503.05379},
      year={2025}
}
```