R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning

GitHub ModelScope arXiv

πŸ“– Introduction

R1-Omni is the industry’s first application of Reinforcement Learning with Verifiable Reward (RLVR) to an Omni-multimodal large language model. We focus on emotion recognition, a task where both visual and audio modalities play crucial roles, to validate the potential of combining RLVR with an Omni model. Our findings reveal several key insights:

  1. Enhanced Reasoning Capability: R1-Omni demonstrates superior reasoning abilities, enabling a clearer understanding of how visual and audio information contribute to emotion recognition.
  2. Improved Understanding Capability: Compared to supervised fine-tuning (SFT), RLVR significantly boosts performance on emotion recognition tasks.
  3. Stronger Generalization Capability: RLVR models exhibit markedly better generalization capabilities, particularly excelling in out-of-distribution scenarios.

πŸ† Performance

Below is the performance on emotion recognition datasets. We use symbols to indicate whether the data is in-distribution (⬀) or out-of-distribution (△).

| Method | DFEW (WAR) ⬀ | DFEW (UAR) ⬀ | MAFW (WAR) ⬀ | MAFW (UAR) ⬀ | RAVDESS (WAR) △ | RAVDESS (UAR) △ |
| --- | --- | --- | --- | --- | --- | --- |
| HumanOmni-0.5B | 22.64 | 19.44 | 20.18 | 13.52 | 7.33 | 9.38 |
| EMER-SFT | 38.66 | 35.31 | 38.39 | 28.02 | 29.00 | 27.19 |
| MAFW-DFEW-SFT | 60.23 | 44.39 | 50.44 | 30.39 | 29.33 | 30.75 |
| R1-Omni | 65.83 | 56.27 | 57.68 | 40.04 | 43.00 | 44.69 |

Legend

  • ⬀: Indicates in-distribution data (DFEW and MAFW).
  • △: Indicates out-of-distribution data (RAVDESS).
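
For reference, WAR (weighted average recall) is the overall sample-level accuracy, while UAR (unweighted average recall) averages recall over emotion classes so that rare classes count equally. Below is a minimal NumPy sketch of both metrics, for illustration only (not the official evaluation script):

```python
import numpy as np

def war_uar(y_true, y_pred):
    """Return (WAR, UAR) for two equal-length label sequences."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    # WAR: plain accuracy, so frequent classes dominate.
    war = float((y_true == y_pred).mean())
    # UAR: mean per-class recall, so every class counts equally.
    recalls = [float((y_pred[y_true == c] == c).mean()) for c in np.unique(y_true)]
    uar = float(np.mean(recalls))
    return war, uar

# Toy example with an imbalanced label distribution.
print(war_uar(["happy", "happy", "happy", "sad"],
              ["happy", "happy", "sad", "sad"]))  # -> (0.75, 0.8333...)
```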

πŸ› οΈ Environment Setup

Our code is built on the R1-V framework. To set up the environment, please follow the installation instructions in the R1-V repository.

πŸ” Inference

Our inference code is based on the implementation from HumanOmni. To ensure the model runs inference smoothly, follow these steps:

  1. Download the Required Models: the R1-Omni model itself, plus the siglip-base-patch16-224 vision encoder and the whisper-large-v3 audio encoder referenced in its config.

  2. Update the Configuration File:

    • In the directory where you downloaded the R1-Omni model, locate the config.json file.
    • Update the paths on line 23 and line 31 to point to the local folders where you saved the models.

Example: Updating config.json

If you saved the models to the following local paths:

  • /path/to/local/models/siglip-base-patch16-224
  • /path/to/local/models/whisper-large-v3

Update the relevant lines in config.json as follows:

 "mm_audio_tower": "/path/to/local/models/whisper-large-v3",
 "mm_vision_tower": "/path/to/local/models/siglip-base-patch16-224"

We provide inference.py for single-video inference.

python inference.py --modal video_audio \
  --model_path ./R1-Omni-0.5B \
  --video_path video.mp4 \
  --instruct "As an emotional recognition expert; throughout the video, which emotion conveyed by the characters is the most obvious to you?  Output the thinking process in <think> </think> and final emotion in <answer> </answer> tags."

🧠 Training

Cold Start

We initialize HumanOmni-0.5B by fine-tuning it on a combined dataset consisting of 232 samples from the Explainable Multimodal Emotion Reasoning (EMER) dataset and 348 samples from the HumanOmni dataset. An example JSON file of the training data:

[
    {
        "video": "MER24/sample_00000967.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\n<audio>\nAs an emotional recognition expert; throughout the video, which emotion conveyed by the characters is the most obvious to you? Output the thinking process in <think> </think> and final emotion in <answer> </answer> tags."
            },
            {
                "from": "gpt",
                "value": "<think>The video depicts a bright and tranquil indoor setting, where a man in a white Polo shirt stands by the window, engaged in a phone call. His furrowed brow and open mouth suggest he is experiencing tension and anxiety. According to the audio content of the video, his speech is fast-paced, and his tone is filled with confusion and stress. A comprehensive analysis reveals that the man is facing a moderate level of anxiety, closely linked to the challenging phone conversation he is having. Consequently, the entire emotional analysis report emphasizes his anxiety and nervousness in handling challenging situations.</think>\n<answer>anxious</answer>"
            }
        ]
    },
  ...
]

All of the cold-start data will be released as soon as possible.
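
Because the cold-start targets must pair a <think> reasoning trace with an <answer> label, a small format check can save a failed training run. A sketch, assuming the JSON above is saved locally as cold_start.json (a hypothetical filename):

```python
import json
import re

with open("cold_start.json") as f:  # hypothetical filename for the JSON shown above
    samples = json.load(f)

for i, sample in enumerate(samples):
    human, gpt = sample["conversations"]
    # The human turn should carry both modality placeholders and the prompt.
    assert human["from"] == "human", i
    assert "<video>" in human["value"] and "<audio>" in human["value"], i
    # The target should contain a reasoning trace followed by a final label.
    assert gpt["from"] == "gpt", i
    assert re.search(r"<think>.+?</think>", gpt["value"], re.S), f"sample {i}: missing <think>"
    assert re.search(r"<answer>.+?</answer>", gpt["value"], re.S), f"sample {i}: missing <answer>"

print(f"Checked {len(samples)} cold-start samples.")
```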

RLVR

In this stage, we utilize the training sets from the MAFW and DFEW (https://dfew-dataset.github.io/) datasets, comprising a total of 15,306 video samples. An example JSON file of the training data:

[
    {
        "video": "DFEW/videos/1.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\n<audio>\nAs an emotional recognition expert; throughout the video, which emotion conveyed by the characters is the most obvious to you?"
            },
            {
                "from": "gpt",
                "value": "sad"
            }
        ]
    },
  ...
]
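
The labels above are plain emotion words, which is what makes the reward verifiable: a rule-based check can score each rollout against the ground truth without a learned reward model. The repository does not spell the reward out here, so the sketch below is only an illustration of the usual RLVR recipe (a format term plus an exact-match accuracy term), not necessarily the exact function used in training:

```python
import re

def format_reward(response: str) -> float:
    """1.0 if the rollout follows the <think>...</think><answer>...</answer> template."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, response.strip(), re.S) else 0.0

def accuracy_reward(response: str, label: str) -> float:
    """1.0 if the emotion inside <answer> matches the ground-truth label (e.g. 'sad')."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.S)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip().lower() == label.strip().lower() else 0.0

def verifiable_reward(response: str, label: str) -> float:
    # Simple sum of the two terms; the equal weighting is an assumption for illustration.
    return accuracy_reward(response, label) + format_reward(response)

# Example rollout scored against a DFEW-style label.
rollout = "<think>The slumped posture and trembling voice suggest sadness.</think>\n<answer>sad</answer>"
print(verifiable_reward(rollout, "sad"))  # -> 2.0
```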

🀝 Related Work

  • R1-V: the RLVR training framework our training code is built on.
  • HumanOmni: the omni-multimodal base model and inference implementation that R1-Omni starts from.

πŸ“š Citation

If you find our work helpful, feel free to cite us:

@article{zhao2025r1omniexplainableomnimultimodalemotion,
      title={R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning}, 
      author={Jiaxing Zhao and Xihan Wei and Liefeng Bo},
      journal={arXiv preprint arXiv:2503.05379},
      year={2025}
}