Update README.md (#1)
- Update README.md (c52630455adf9c1e269516ed020d8bdad543c16e)
Co-authored-by: Grant Lee <[email protected]>
README.md (CHANGED)

## Introduction

R1-AQA is an audio question answering (AQA) model based on `Qwen2-Audio-7B-Instruct`, optimized through reinforcement learning using the group relative policy optimization (GRPO) algorithm.
This implementation achieves state-of-the-art performance on the MMAU *test-mini* benchmark with only 38k post-training samples.
For more details, please refer to our [Github](https://github.com/xiaomi/r1-aqa) and [Report]().
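
As a rough illustration of the group-relative idea behind GRPO (a minimal sketch, not this repository's training code; the function name, tensor shapes, and epsilon are assumptions for exposition): several answers are sampled per question, and each answer's reward is normalized by the mean and standard deviation of its own group, which removes the need for a learned value model.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Illustrative GRPO-style advantages: normalize each sampled answer's
    reward against the statistics of its own group.

    rewards: shape (num_questions, group_size), one reward per sampled answer.
    Returns advantages of the same shape; no value model is involved.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Toy example: one question, four sampled answers, reward 1.0 when the
# predicted choice matches the reference answer.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0]])
print(group_relative_advantages(rewards))  # correct answers get positive advantage
```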

### Table: Accuracies (%) on MMAU Test-mini benchmark

| Model                             | Method              | Sound     | Music     | Speech    | Average   |
|-----------------------------------|---------------------|-----------|-----------|-----------|-----------|
| \                                 | Human\*             | 86.31     | 78.22     | 82.17     | 82.23     |
| Gemini Pro 2.0 Flash              | Direct Inference\*  | 56.46     | 58.68     | 51.65     | 55.60     |
| Audio Flamingo 2                  | Direct Inference\*  | 61.56     | **73.95** | 30.93     | 55.48     |
| GPT4o + Strong Cap.               | Direct Inference\*  | 57.35     | 49.70     | **64.86** | 57.30     |
| Llama-3-8B-Instruct + Strong Cap. | Direct Inference\*  | 50.75     | 48.93     | 55.25     | 52.10     |
| Gemini Pro v1.5                   | Direct Inference\*  | 56.75     | 49.40     | 58.55     | 54.90     |
| Qwen2-Audio-7B-Instruct           | Direct Inference\*  | 54.95     | 50.98     | 42.04     | 49.20     |
| GPT4o + Weak Cap.                 | Direct Inference\*  | 39.33     | 41.90     | 58.25     | 45.70     |
| Llama-3-8B-Instruct + Weak Cap.   | Direct Inference\*  | 34.23     | 38.02     | 54.05     | 42.10     |
| SALMONN                           | Direct Inference\*  | 41.00     | 34.80     | 25.50     | 33.70     |
| Qwen2-Audio-7B-Instruct           | CoTA \[1\]          | 60.06     | 64.30     | 60.70     | 61.71     |
| Qwen2-Audio-7B-Instruct           | Zero-Shot-CoT \[2\] | 61.86     | 56.29     | 55.26     | 57.80     |
| Qwen2-Audio-7B-Instruct           | **GRPO (Ours)**     | **69.37** | 66.77     | 57.36     | **64.50** |

#### Notes:
\* The data are sourced from the MMAU official website: [https://sakshi113.github.io/mmau_homepage/](https://sakshi113.github.io/mmau_homepage/)

\[1\] Xie, Zhifei, et al. "Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models." arXiv preprint arXiv:2503.02318 (2025).

\[2\] Ma, Ziyang, et al. "Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model." arXiv preprint arXiv:2501.07246 (2025).

## Inference
```python
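# NOTE: the original snippet was truncated at this point, so the code below is
# an illustrative sketch, not the repository's official example. It assumes the
# checkpoint loads through transformers' Qwen2-Audio classes; the model id,
# audio path, and question are placeholder assumptions.
import torch
import librosa
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

model_id = "mispeech/r1-aqa"  # assumption: replace with the released checkpoint id
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2AudioForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Build a chat-style prompt containing one audio clip and one question.
conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "test.wav"},
        {"type": "text", "text": "Which instrument is playing? A. piano B. violin C. drums D. flute"},
    ]},
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)

# Load the waveform at the feature extractor's sampling rate (16 kHz for Qwen2-Audio).
audio, _ = librosa.load("test.wav", sr=processor.feature_extractor.sampling_rate)
inputs = processor(text=text, audios=[audio], return_tensors="pt").to(model.device)

# Generate, then decode only the newly produced tokens.
output_ids = model.generate(**inputs, max_new_tokens=256)
output_ids = output_ids[:, inputs.input_ids.shape[1]:]
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```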