Audio-Text-to-Text · Transformers · Safetensors · qwen2_audio · text2text-generation · Inference Endpoints
franken GrantL10 committed on
Commit c0e9ff4 · verified · 1 Parent(s): 2d1002f

Update README.md (#1)


- Update README.md (c52630455adf9c1e269516ed020d8bdad543c16e)


Co-authored-by: Grant Lee <[email protected]>
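The README change in this commit describes post-training `Qwen2-Audio-7B-Instruct` with group relative policy optimization (GRPO). As a minimal sketch (not the authors' training code — the sampling and reward details here are illustrative assumptions), the group-relative advantage at the core of GRPO normalizes each sampled response's reward against the group's mean and standard deviation, removing the need for a learned value function:

```python
# Hedged sketch of GRPO's group-relative advantage (illustrative only,
# not the R1-AQA training code). For each prompt, a group of responses
# is sampled; each response's reward is normalized within the group.
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize per-response rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical example: four sampled answers to one audio question,
# rewarded 1.0 when the final choice matches the reference, else 0.0.
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that beat the group average get positive advantages and are reinforced; the advantages of one group sum to (approximately) zero by construction.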

Files changed (1)
  1. README.md +25 -1
README.md CHANGED
@@ -10,9 +10,33 @@ tags: []
 
 ## Introduction
 
-R1-AQA extends `Qwen2-Audio-7B-Instruc` by integrating group relative policy optimization (GRPO). This adaptation enhances the model's capacity for temporal reasoning and contextual alignment in audio question answering (AQA) tasks.
+R1-AQA is an audio question answering (AQA) model based on `Qwen2-Audio-7B-Instruct`, optimized through reinforcement learning using the group relative policy optimization (GRPO) algorithm.
+This implementation achieves state-of-the-art performance on the MMAU *test-mini* benchmark with only 38k post-training samples.
 For more details, please refer to our [Github](https://github.com/xiaomi/r1-aqa) and [Report]().
 
+### Table: Accuracies (%) on the MMAU Test-mini benchmark
+| Model                             | Method              | Sound     | Music     | Speech    | Average   |
+|-----------------------------------|---------------------|-----------|-----------|-----------|-----------|
+| \                                 | Human\*             | 86.31     | 78.22     | 82.17     | 82.23     |
+| Gemini Pro 2.0 Flash              | Direct Inference\*  | 56.46     | 58.68     | 51.65     | 55.60     |
+| Audio Flamingo 2                  | Direct Inference\*  | 61.56     | **73.95** | 30.93     | 55.48     |
+| GPT4o + Strong Cap.               | Direct Inference\*  | 57.35     | 49.70     | **64.86** | 57.30     |
+| Llama-3-8B-Instruct + Strong Cap. | Direct Inference\*  | 50.75     | 48.93     | 55.25     | 52.10     |
+| Gemini Pro v1.5                   | Direct Inference\*  | 56.75     | 49.40     | 58.55     | 54.90     |
+| Qwen2-Audio-7B-Instruct           | Direct Inference\*  | 54.95     | 50.98     | 42.04     | 49.20     |
+| GPT4o + Weak Cap.                 | Direct Inference\*  | 39.33     | 41.90     | 58.25     | 45.70     |
+| Llama-3-8B-Instruct + Weak Cap.   | Direct Inference\*  | 34.23     | 38.02     | 54.05     | 42.10     |
+| SALMONN                           | Direct Inference\*  | 41.00     | 34.80     | 25.50     | 33.70     |
+| Qwen2-Audio-7B-Instruct           | CoTA \[1\]          | 60.06     | 64.30     | 60.70     | 61.71     |
+| Qwen2-Audio-7B-Instruct           | Zero-Shot-CoT \[2\] | 61.86     | 56.29     | 55.26     | 57.80     |
+| Qwen2-Audio-7B-Instruct           | **GRPO (Ours)**     | **69.37** | 66.77     | 57.36     | **64.50** |
+
+#### Notes:
+\* The data are sourced from the MMAU official website: [https://sakshi113.github.io/mmau_homepage/](https://sakshi113.github.io/mmau_homepage/)
+\[1\] Xie, Zhifei, et al. "Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models." arXiv preprint arXiv:2503.02318 (2025).
+\[2\] Ma, Ziyang, et al. "Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model." arXiv preprint arXiv:2501.07246 (2025).
+
+
 
 ## Inference
 ```python