Add pipeline tag and library name

#1
by nielsr (HF staff), opened

Files changed (1):
1. README.md (+13, -10)
README.md CHANGED
@@ -1,20 +1,21 @@
 ---
 license: apache-2.0
+library_name: transformers
+pipeline_tag: video-text-to-text
 ---
+
 # R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning
 
 [![GitHub](https://img.shields.io/badge/GitHub-R1Omni-green)](https://github.com/Jiaxing-star/R1-Omni)
 [![ModelScope](https://img.shields.io/badge/ModelScope-R1Omni-blue)](https://modelscope.cn/models/iic/R1-Omni-0.5B)
 [![arXiv](https://img.shields.io/badge/arXiv-2503.05379-red)](https://arxiv.org/abs/2503.05379)
 
-
 ## 📖 Introduction
 **R1-Omni** is the industry’s first application of Reinforcement Learning with Verifiable Reward (RLVR) to an Omni-multimodal large language model. We focus on emotion recognition, a task where both visual and audio modalities play crucial roles, to validate the potential of combining RLVR with Omni model. Our findings reveal several key insights:
 1) **Enhanced Reasoning Capability**: R1-Omni demonstrate superior reasoning abilities, enabling a clearer understanding of how visual and audio information contribute to emotion recognition.
 2) **Improved Understanding Capability**: Compared to SFT, RLVR significantly boosts performance on emotion recognition tasks.
 3) **Stronger Generalization Capability**: RLVR models exhibit markedly better generalization capabilities, particularly excelling in out-of-distribution scenarios.
 
-
 ## 🏆 Performance
 
 Below are the performance on emotion recognition datasets. We use symbols to indicate whether the data is **in-distribution (⬤)** or **out-of-distribution (△)**.
@@ -26,12 +27,10 @@ Below are the performance on emotion recognition datasets. We use symbols to ind
 | MAFW-DFEW-SFT | 60.23 | 44.39 | 50.44 | 30.39 | 29.33 | 30.75 |
 | R1-Omni | 65.83 | 56.27 | 57.68 | 40.04 | 43.00 | 44.69 |
 
-
 ### Legend
 - **⬤**: Indicates **in-distribution data** (DFEW and MAFW).
 - **△**: Indicates **out-of-distribution data** (RAVDESS).
 
-
 ## 🛠️ Environment Setup
 Our code is built on the R1-V framework. To set up the environment, please follow the installation instructions in the [R1-V repository](https://github.com/Deep-Agent/R1-V/)
 
@@ -46,7 +45,6 @@ Our inference code is based on the implementation from **HumanOmni**. To ensure
 - In the directory where you downloaded the R1-Omni model, locate the config.json file.
 - Update the paths on line 23 and line 31 to point to the local folders where you saved the models.
 
-
 #### Example: Updating config.json
 If you saved the models to the following local paths::
 - `/path/to/local/models/siglip-base-patch16-224`
@@ -66,8 +64,6 @@ python inference.py --modal video_audio \
   --instruct "As an emotional recognition expert; throughout the video, which emotion conveyed by the characters is the most obvious to you? Output the thinking process in <think> </think> and final emotion in <answer> </answer> tags."
 ```
 
-
-
 ## 🧠 Training
 ### Cold Start
 we initialize the HumanOmni-0.5B by fine-tuning it on a combined dataset consisting of 232 samples from the [Explainable Multimodal Emotion Reasoning](https://github.com/zeroQiaoba/AffectGPT) dataset and 348 samples from HumanOmni dataset.
@@ -79,11 +75,14 @@ An example json file of the training data:
   "conversations": [
     {
       "from": "human",
-      "value": "<video>\n<audio>\nAs an emotional recognition expert; throughout the video, which emotion conveyed by the characters is the most obvious to you? Output the thinking process in <think> </think> and final emotion in <answer> </answer> tags."
+      "value": "<video>
+<audio>
+As an emotional recognition expert; throughout the video, which emotion conveyed by the characters is the most obvious to you? Output the thinking process in <think> </think> and final emotion in <answer> </answer> tags."
     },
     {
       "from": "gpt",
-      "value": "<think>The video depicts a bright and tranquil indoor setting, where a man in a white Polo shirt stands by the window, engaged in a phone call. His furrowed brow and open mouth suggest he is experiencing tension and anxiety. According to the audio content of the video, his speech is fast-paced, and his tone is filled with confusion and stress. A comprehensive analysis reveals that the man is facing a moderate level of anxiety, closely linked to the challenging phone conversation he is having. Consequently, the entire emotional analysis report emphasizes his anxiety and nervousness in handling challenging situations.</think>\n<answer>anxious</answer>"
+      "value": "<think>The video depicts a bright and tranquil indoor setting, where a man in a white Polo shirt stands by the window, engaged in a phone call. His furrowed brow and open mouth suggest he is experiencing tension and anxiety. According to the audio content of the video, his speech is fast-paced, and his tone is filled with confusion and stress. A comprehensive analysis reveals that the man is facing a moderate level of anxiety, closely linked to the challenging phone conversation he is having. Consequently, the entire emotional analysis report emphasizes his anxiety and nervousness in handling challenging situations.</think>
+<answer>anxious</answer>"
     }
   ]
 },
@@ -102,7 +101,9 @@ An example json file of the training data:
   "conversations": [
     {
       "from": "human",
-      "value": "<video>\n<audio>\nAs an emotional recognition expert; throughout the video, which emotion conveyed by the characters is the most obvious to you?"
+      "value": "<video>
+<audio>
+As an emotional recognition expert; throughout the video, which emotion conveyed by the characters is the most obvious to you?"
     },
     {
       "from": "gpt",
@@ -114,6 +115,8 @@ An example json file of the training data:
   ]
 ```
 
+### wandb
+![image](https://github.com/user-attachments/assets/3395bafa-aaba-4212-902d-91067a1cd19a)
 
 ## 🤝 Related Work
 - [R1-V](https://github.com/Deep-Agent/R1-V)