	---
	license: cc-by-nc-4.0
	library_name: transformers
	pipeline_tag: video-text-to-text
	---

	# R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning

	[![ModelScope](https://img.shields.io/badge/ModelScope-R1Omni-blue)](https://modelscope.cn/models/iic/R1-Omni-0.5B)
	[![Hugging Face](https://img.shields.io/badge/HuggingFace-R1Omni-yellow)](https://huggingface.co/StarJiaxing/R1-Omni-0.5B)
	[![arXiv](https://img.shields.io/badge/arXiv-2503.05379-red)](https://arxiv.org/abs/2503.05379)

	This model utilizes Reinforcement Learning with Verifiable Reward (RLVR) to perform omni-multimodal emotion recognition. Built upon the HumanOmni-0.5B model, R1-Omni excels at understanding visual and audio cues for emotion identification, even in out-of-distribution scenarios.

	## 📖 Introduction
	R1-Omni is the first application of Reinforcement Learning with Verifiable Reward (RLVR) to an omni-multimodal large language model. It focuses on emotion recognition, a task in which both the visual and audio modalities play crucial roles. Key insights include:

	1) Enhanced Reasoning Capability: R1-Omni demonstrates superior reasoning abilities, enabling a clearer understanding of how visual and audio information contribute to emotion recognition.
	2) Improved Understanding Capability: Compared to SFT, RLVR significantly boosts performance on emotion recognition tasks.
	3) Stronger Generalization Capability: RLVR models exhibit markedly better generalization capabilities, particularly excelling in out-of-distribution scenarios.
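
To make the idea of a "verifiable" reward concrete, here is a minimal sketch of a rule-based reward that can be checked programmatically, without a learned reward model. Everything specific in it is an assumption for illustration and is not stated in this card: the `<think>`/`<answer>` tag template, the equal weighting of the two terms, and the function names.

```python
import re

# Assumed output template: reasoning in <think>, final label in <answer>.
# The tag names are illustrative and are not taken from this model card.
_TEMPLATE = re.compile(
    r"^\s*<think>(.+?)</think>\s*<answer>(.+?)</answer>\s*$",
    re.DOTALL,
)


def format_reward(output: str) -> float:
    """Return 1.0 if the output follows the assumed template, else 0.0."""
    return 1.0 if _TEMPLATE.match(output) else 0.0


def accuracy_reward(output: str, label: str) -> float:
    """Return 1.0 if the predicted emotion equals the ground-truth label.

    In this sketch an output that does not match the template scores 0.0,
    because no answer can be extracted from it.
    """
    match = _TEMPLATE.match(output)
    if match is None:
        return 0.0
    predicted = match.group(2).strip().lower()
    return 1.0 if predicted == label.strip().lower() else 0.0


def verifiable_reward(output: str, label: str) -> float:
    """Sum of the accuracy and format terms (equal weights are an assumption)."""
    return accuracy_reward(output, label) + format_reward(output)
```

Because both terms are deterministic functions of the model output and the dataset label, the reward can be verified exactly for every sample.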

	## 📦 Model Download
	The model is based on the open-source HumanOmni-0.5B model. The following models are available: HumanOmni-0.5B, the cold-start model EMER-SFT, the model MAFW-DFEW-SFT fine-tuned directly on the MAFW and DFEW training sets, and the final model R1-Omni.

	<div align="center">

	| Model | HuggingFace | ModelScope |
	|-----------------|-------------|------------|
	| `HumanOmni-0.5B` | [![HF](https://img.shields.io/badge/🤗-Download-yellow)](https://hf.co/StarJiaxing/HumanOmni-0.5B) | [![MS](https://img.shields.io/badge/ModelScope-Download-blue)](https://modelscope.cn/models/iic/HumanOmni-0.5B) |
	| `EMER-SFT` | [![HF](https://img.shields.io/badge/🤗-Download-yellow)](https://hf.co/StarJiaxing/EMER-SFT-0.5B) | [![MS](https://img.shields.io/badge/ModelScope-Download-blue)](https://modelscope.cn/models/iic/EMER-SFT-0.5B) |
	| `MAFW-DFEW-SFT` | [![HF](https://img.shields.io/badge/🤗-Download-yellow)](https://huggingface.co/StarJiaxing/MAFW-DFEW-0.5B) | [![MS](https://img.shields.io/badge/ModelScope-Download-blue)](https://modelscope.cn/models/iic/MAFW-DFEW-0.5B) |
	| `R1-Omni` | [![HF](https://img.shields.io/badge/🤗-Download-yellow)](https://huggingface.co/StarJiaxing/R1-Omni-0.5B) | [![MS](https://img.shields.io/badge/ModelScope-Download-blue)](https://modelscope.cn/models/iic/R1-Omni-0.5B) |
	</div>
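
For scripted downloads, the sketch below fetches a checkpoint with `snapshot_download` from the `huggingface_hub` package. The repository IDs are copied from the Hugging Face links in the table; the `HF_REPOS` mapping and the `download_checkpoint` helper are hypothetical names introduced here, and the snippet assumes `huggingface_hub` is installed.

```python
from typing import Optional

# Repository IDs taken from the Hugging Face links in the table above.
HF_REPOS = {
    "HumanOmni-0.5B": "StarJiaxing/HumanOmni-0.5B",
    "EMER-SFT": "StarJiaxing/EMER-SFT-0.5B",
    "MAFW-DFEW-SFT": "StarJiaxing/MAFW-DFEW-0.5B",
    "R1-Omni": "StarJiaxing/R1-Omni-0.5B",
}


def download_checkpoint(name: str, local_dir: Optional[str] = None) -> str:
    """Download one of the listed checkpoints and return its local path.

    Hypothetical helper: the import is deferred so the mapping above can be
    used without huggingface_hub installed.
    """
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=HF_REPOS[name], local_dir=local_dir)


if __name__ == "__main__":
    # Requires network access; downloads the final R1-Omni weights.
    print(download_checkpoint("R1-Omni"))
```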

	## 🏆 Performance

	The table below reports performance on emotion recognition datasets, measured as weighted average recall (WAR) and unweighted average recall (UAR). Symbols indicate whether a dataset is in-distribution (⬤) or out-of-distribution (△).

	| Method | DFEW (WAR) ⬤ | DFEW (UAR) ⬤ | MAFW (WAR) ⬤ | MAFW (UAR) ⬤ | RAVDESS (WAR) △ | RAVDESS (UAR) △ |
	|----------------|-------|-------|-------|-------|-------|-------|
	| HumanOmni-0.5B | 22.64 | 19.44 | 20.18 | 13.52 | 7.33 | 9.38 |
	| EMER-SFT | 38.66 | 35.31 | 38.39 | 28.02 | 29.00 | 27.19 |
	| MAFW-DFEW-SFT | 60.23 | 44.39 | 50.44 | 30.39 | 29.33 | 30.75 |
	| R1-Omni | 65.83 | 56.27 | 57.68 | 40.04 | 43.00 | 44.69 |

	![image](https://github.com/user-attachments/assets/f0239753-8a70-4e8b-9088-35c420595978)

	### Legend
	- ⬤: Indicates in-distribution data (DFEW and MAFW).
	- △: Indicates out-of-distribution data (RAVDESS).
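
The two metrics in the table differ in how they treat class imbalance. WAR weights each class's recall by its number of samples, which makes it equal to overall accuracy; UAR is the plain mean of per-class recalls. The sketch below computes both from label lists. The `war_uar` function name and the toy labels are illustrative and are not part of the evaluation code behind the table.

```python
from collections import Counter
from typing import Sequence, Tuple


def war_uar(y_true: Sequence[str], y_pred: Sequence[str]) -> Tuple[float, float]:
    """Return (WAR, UAR) as fractions in [0, 1]; multiply by 100 for percentages."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    support = Counter(y_true)  # number of samples per true class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct per class

    # Support-weighted mean of per-class recall, i.e. overall accuracy.
    war = sum(hits.values()) / len(y_true)
    # Unweighted mean of per-class recall over the classes present in y_true.
    uar = sum(hits[c] / n for c, n in support.items()) / len(support)
    return war, uar


if __name__ == "__main__":
    # Toy imbalanced example: the minority class is always misclassified.
    y_true = ["happy"] * 4 + ["sad"]
    y_pred = ["happy"] * 5
    print(war_uar(y_true, y_pred))
```

In the toy example, four of five samples are correct, but the `sad` class has zero recall, so UAR comes out lower than WAR. The same effect explains why the UAR columns in the table sit below the WAR columns on imbalanced datasets.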

	## 🛠️ Environment Setup
	Our code is built on the R1-V framework. To set up the environment, please follow the installation instructions in the [R1-V repository](https://github.com/Deep-Agent/R1-V/).

	## 🔍 Inference
	Our inference code is based on the implementation from HumanOmni.

	## 📚 Citation
	If you find our work helpful, feel free to cite us.
	```
	@article{zhao2025r1omniexplainableomnimultimodalemotion,
	  title={R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning},
	  author={Jiaxing Zhao and Xihan Wei and Liefeng Bo},
	  journal={arXiv preprint arXiv:2503.05379},
	  year={2025}
	}
	```