Audio-Text-to-Text · Transformers · Safetensors · qwen2_audio · text2text-generation · Inference Endpoints
franken GrantL10 committed on
Commit c0e9ff4 · verified · 1 Parent(s): 2d1002f

Update README.md (#1)


- Update README.md (c52630455adf9c1e269516ed020d8bdad543c16e)


Co-authored-by: Grant Lee <[email protected]>
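The README change in this commit describes post-training `Qwen2-Audio-7B-Instruct` with group relative policy optimization (GRPO). As a minimal sketch (not the authors' training code — the sampling and reward details here are illustrative assumptions), the group-relative advantage at the core of GRPO normalizes each sampled response's reward against the group's mean and standard deviation, removing the need for a learned value function:

```python
# Hedged sketch of GRPO's group-relative advantage (illustrative only,
# not the R1-AQA training code). For each prompt, a group of responses
# is sampled; each response's reward is normalized within the group.
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize per-response rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical example: four sampled answers to one audio question,
# rewarded 1.0 when the final choice matches the reference, else 0.0.
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that beat the group average get positive advantages and are reinforced; the advantages of one group sum to (approximately) zero by construction.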

Files changed (1)
  1. README.md +25 -1
README.md CHANGED
@@ -10,9 +10,33 @@ tags: []
 
 ## Introduction
 
-R1-AQA extends `Qwen2-Audio-7B-Instruc` by integrating group relative policy optimization (GRPO). This adaptation enhances the model's capacity for temporal reasoning and contextual alignment in audio question answering (AQA) tasks.
+R1-AQA is an audio question answering (AQA) model based on `Qwen2-Audio-7B-Instruct`, optimized through reinforcement learning using the group relative policy optimization (GRPO) algorithm.
+This implementation achieves state-of-the-art performance on the MMAU *test-mini* benchmark with only 38k post-training samples.
 For more details, please refer to our [Github](https://github.com/xiaomi/r1-aqa) and [Report]().
 
+### Table: Accuracies (%) on the MMAU Test-mini benchmark
+| Model                             | Method              | Sound     | Music     | Speech    | Average   |
+|-----------------------------------|---------------------|-----------|-----------|-----------|-----------|
+| \                                 | Human\*             | 86.31     | 78.22     | 82.17     | 82.23     |
+| Gemini Pro 2.0 Flash              | Direct Inference\*  | 56.46     | 58.68     | 51.65     | 55.60     |
+| Audio Flamingo 2                  | Direct Inference\*  | 61.56     | **73.95** | 30.93     | 55.48     |
+| GPT4o + Strong Cap.               | Direct Inference\*  | 57.35     | 49.70     | **64.86** | 57.30     |
+| Llama-3-8B-Instruct + Strong Cap. | Direct Inference\*  | 50.75     | 48.93     | 55.25     | 52.10     |
+| Gemini Pro v1.5                   | Direct Inference\*  | 56.75     | 49.40     | 58.55     | 54.90     |
+| Qwen2-Audio-7B-Instruct           | Direct Inference\*  | 54.95     | 50.98     | 42.04     | 49.20     |
+| GPT4o + Weak Cap.                 | Direct Inference\*  | 39.33     | 41.90     | 58.25     | 45.70     |
+| Llama-3-8B-Instruct + Weak Cap.   | Direct Inference\*  | 34.23     | 38.02     | 54.05     | 42.10     |
+| SALMONN                           | Direct Inference\*  | 41.00     | 34.80     | 25.50     | 33.70     |
+| Qwen2-Audio-7B-Instruct           | CoTA \[1\]          | 60.06     | 64.30     | 60.70     | 61.71     |
+| Qwen2-Audio-7B-Instruct           | Zero-Shot-CoT \[2\] | 61.86     | 56.29     | 55.26     | 57.80     |
+| Qwen2-Audio-7B-Instruct           | **GRPO (Ours)**     | **69.37** | 66.77     | 57.36     | **64.50** |
+
+#### Notes:
+\* The data are sourced from the MMAU official website: [https://sakshi113.github.io/mmau_homepage/](https://sakshi113.github.io/mmau_homepage/)
+\[1\] Xie, Zhifei, et al. "Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models." arXiv preprint arXiv:2503.02318 (2025).
+\[2\] Ma, Ziyang, et al. "Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model." arXiv preprint arXiv:2501.07246 (2025).
+
+
 
 ## Inference
 ```python