license: cc-by-nc-4.0
library_name: transformers
pipeline_tag: video-text-to-text

R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning


This model uses Reinforcement Learning with Verifiable Reward (RLVR) for omni-multimodal emotion recognition. Built on the HumanOmni-0.5B model, R1-Omni draws on both visual and audio cues to identify emotions, including in out-of-distribution scenarios.

📖 Introduction

R1-Omni is the first application of Reinforcement Learning with Verifiable Reward (RLVR) to an omni-multimodal large language model. It focuses on emotion recognition, a task in which both the visual and audio modalities play crucial roles. Key insights include:

  1. Enhanced Reasoning Capability: R1-Omni demonstrates superior reasoning abilities, enabling a clearer understanding of how visual and audio information contribute to emotion recognition.
  2. Improved Understanding Capability: Compared to supervised fine-tuning (SFT), RLVR significantly boosts performance on emotion recognition tasks.
  3. Stronger Generalization Capability: RLVR models exhibit markedly better generalization capabilities, particularly excelling in out-of-distribution scenarios.
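
To make the idea of a "verifiable reward" concrete, here is a minimal sketch of a rule-based reward for emotion recognition. It is an illustration under stated assumptions, not the training code behind R1-Omni: the `<think>`/`<answer>` output template, the function names, and the choice to sum an accuracy term and a format term are all assumptions introduced here.

```python
import re

# Assumed output template (not taken from this model card):
# reasoning inside <think>...</think>, then the label inside <answer>...</answer>.
_TEMPLATE = re.compile(
    r"^\s*<think>(.*?)</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL
)
_ANSWER = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def format_reward(response: str) -> float:
    """Return 1.0 if the response follows the assumed template, else 0.0."""
    return 1.0 if _TEMPLATE.match(response) else 0.0


def accuracy_reward(response: str, label: str) -> float:
    """Return 1.0 if the text inside <answer> matches the ground-truth label."""
    match = _ANSWER.search(response)
    if match is None:
        return 0.0
    predicted = match.group(1).strip().lower()
    return 1.0 if predicted == label.strip().lower() else 0.0


def verifiable_reward(response: str, label: str) -> float:
    """Hypothetical combined reward: accuracy term plus format term."""
    return accuracy_reward(response, label) + format_reward(response)


if __name__ == "__main__":
    good = "<think>The speaker frowns and raises their voice.</think><answer>angry</answer>"
    print(verifiable_reward(good, "angry"))
    print(verifiable_reward("angry", "angry"))
```

Because the reward is computed by a fixed rule against a ground-truth label, it can be checked automatically without a learned reward model, which is the property the term "verifiable" refers to.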

📦 Model Download

The model is based on the open-source HumanOmni-0.5B model. The following models are available: HumanOmni-0.5B, the cold-start model EMER-SFT, the model MAFW-DFEW-SFT fine-tuned directly on the MAFW and DFEW training sets, and the final model R1-Omni.

| Model | HuggingFace | ModelScope |
|---|---|---|
| HumanOmni-0.5B | HF | MS |
| EMER-SFT | HF | MS |
| MAFW-DFEW-SFT | HF | MS |
| R1-Omni | HF | MS |

πŸ† Performance

The table below reports performance on emotion recognition datasets as weighted average recall (WAR) and unweighted average recall (UAR). Symbols indicate whether the data is in-distribution (⬤) or out-of-distribution (△).

| Method | DFEW (WAR) ⬤ | DFEW (UAR) ⬤ | MAFW (WAR) ⬤ | MAFW (UAR) ⬤ | RAVDESS (WAR) △ | RAVDESS (UAR) △ |
|---|---|---|---|---|---|---|
| HumanOmni-0.5B | 22.64 | 19.44 | 20.18 | 13.52 | 7.33 | 9.38 |
| EMER-SFT | 38.66 | 35.31 | 38.39 | 28.02 | 29.00 | 27.19 |
| MAFW-DFEW-SFT | 60.23 | 44.39 | 50.44 | 30.39 | 29.33 | 30.75 |
| R1-Omni | 65.83 | 56.27 | 57.68 | 40.04 | 43.00 | 44.69 |


Legend

  • ⬤: Indicates in-distribution data (DFEW and MAFW).
  • △: Indicates out-of-distribution data (RAVDESS).
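
For readers unfamiliar with the two metrics, the sketch below shows how WAR and UAR are conventionally computed from label lists: WAR is overall accuracy (each class weighted by its frequency), while UAR is the plain mean of per-class recall. The function name `war_uar` and the toy labels are illustrative only and are not part of the R1-Omni evaluation code.

```python
from collections import Counter


def war_uar(y_true, y_pred):
    """Return (WAR, UAR) in percent for two equal-length label sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    # Number of samples per ground-truth class.
    support = Counter(y_true)
    # Number of correctly predicted samples per ground-truth class.
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    # WAR: every sample counts equally, so frequent classes dominate.
    war = 100.0 * sum(correct.values()) / len(y_true)

    # UAR: every class counts equally, regardless of how many samples it has.
    recalls = [correct[c] / n for c, n in support.items()]
    uar = 100.0 * sum(recalls) / len(recalls)
    return war, uar


if __name__ == "__main__":
    # Toy, made-up labels: a classifier that always predicts "happy".
    y_true = ["happy", "happy", "happy", "happy", "sad"]
    y_pred = ["happy", "happy", "happy", "happy", "happy"]
    print(war_uar(y_true, y_pred))
```

On imbalanced data the two numbers can diverge sharply, as in the toy example where the minority class is never predicted; this is why both are reported side by side.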

πŸ› οΈ Environment Setup

Our code is built on the R1-V framework. To set up the environment, please follow the installation instructions in the R1-V repository.

πŸ” Inference

Our inference code is based on the implementation from HumanOmni.

📚 Citation

If you find our work helpful, feel free to cite us.

@article{zhao2025r1omniexplainableomnimultimodalemotion,
      title={R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning}, 
      author={Jiaxing Zhao and Xihan Wei and Liefeng Bo},
      journal={arXiv preprint arXiv:2503.05379},
      year={2025}
}