---
license: mit
---


# Audio-Reasoner

We implement inference scaling in **Audio-Reasoner**, a large audio language model that enables **deepthink** and **structured chain-of-thought (CoT) reasoning** for multimodal understanding and reasoning. To support this, we constructed **CoTA**, a high-quality dataset of **1.2M reasoning-rich samples** built with structured CoT techniques. Audio-Reasoner achieves state-of-the-art results on the **MMAU-mini (+25.42%)** and **AIR-Bench-Chat (+14.57%)** benchmarks.

<p align="center">
Audio-Reasoner-7B <a href="https://huggingface.co/zhifeixie/Audio-Reasoner/tree/main">🤗</a> | CoTA Dataset 🤗 (coming soon)<br>
Paper <a href="https://arxiv.org/abs/2503.02318">📑</a> | WeChat <a href="https://github.com/xzf-thu/Audio-Reasoner/blob/main/assets/wechat.jpg">💭</a> | Code <a href="https://github.com/xzf-thu/Audio-Reasoner">⚙️</a>
<br>
<a href="#demo">Demo</a> | <a href="#install">Install</a> | <a href="#quick-start">Quick Start</a> | <a href="#faq">FAQ</a> | <a href="#contact">Contact us</a><br>
<br>
If you like our work, please give us a star ⭐!
</p>



## Main Results

Audio-Reasoner achieves state-of-the-art results on MMAU-mini and AIR-Bench-Chat; see our <a href="https://arxiv.org/abs/2503.02318">paper</a> for the full benchmark tables.


## News and Updates
- **2025.03.05:** **Audio-Reasoner-7B checkpoint is released on Hugging Face <a href="https://huggingface.co/zhifeixie/Audio-Reasoner/tree/main">🤗</a>!**
- **2025.03.05:** **The Audio-Reasoner paper is available on arXiv <a href="https://arxiv.org/abs/2503.02318">📑</a>.**
- **2025.03.04:** **Demos, inference code, and evaluation results are released.**
- **2025.03.04:** **Created this repo.**

## Roadmap
- **2025.03:** 🔜 **Upload the CoTA dataset to Hugging Face 🤗.**

- **2025.04:** 🔜 **Open-source the data synthesis pipeline and training code.**

## Demo
<p align="center" width="80%">
<video controls src="https://github.com/user-attachments/assets/d50f75e7-288b-454b-92a3-c6f058be231b" title="v" width="100%"></video>
</p>

## Features
✅ Audio-Reasoner enables **deep reasoning and inference scaling** in audio-based tasks, built on Qwen2-Audio-Instruct with structured CoT training.

✅ CoTA offers **1.2M** high-quality captions and QA pairs across domains for structured reasoning and enhanced pretraining. 

✅ The pretrained model and dataset cover diverse audio types, including sound, music, and speech, and achieve state-of-the-art results across multiple benchmarks. Refer to our <a href="https://arxiv.org/abs/2503.02318">paper</a> for details; an illustrative response skeleton is shown below.
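
For reference, the structured CoT output follows the tag layout requested by the system prompt in the Quick Start below. An illustrative (not model-generated) response skeleton looks like this:

```text
<THINK>
  <PLANNING> ... how the model plans to approach the question ... </PLANNING>
  <CAPTION> ... a description of the audio content ... </CAPTION>
  <REASONING> ... step-by-step reasoning over the caption ... </REASONING>
  <SUMMARY> ... a short synthesis of the reasoning ... </SUMMARY>
</THINK>
<RESPONSE> ... the final answer to the user's question ... </RESPONSE>
```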


## Install

**Clone and install**

- Clone the repo
```sh
git clone https://github.com/xzf-thu/Audio-Reasoner.git

cd Audio-Reasoner
```

- Install the required packages
```sh
conda create -n Audio-Reasoner python=3.10
conda activate Audio-Reasoner

pip install -r requirements.txt
pip install transformers==4.49.1
```

## Quick Start

**Chat using ms-swift**
```python
from swift.llm import InferEngine, InferRequest, PtEngine, RequestConfig
from swift.plugin import InferStats


def infer_stream(engine: 'InferEngine', infer_request: 'InferRequest'):
    # Stream the response token-by-token and collect generation metrics.
    request_config = RequestConfig(max_tokens=2048, temperature=0, stream=True)
    metric = InferStats()
    gen = engine.infer([infer_request], request_config, metrics=[metric])
    query = infer_request.messages[0]['content']
    output = ""
    print(f'query: {query}\nresponse: ', end='')
    for resp_list in gen:
        if resp_list[0] is None:
            continue
        print(resp_list[0].choices[0].delta.content, end='', flush=True)
        output += resp_list[0].choices[0].delta.content
    print()
    print(f'metric: {metric.compute()}')
    return output


def get_message(audiopath, prompt):
    # Build a multimodal chat message: system prompt + audio file + text question.
    messages = [
        {'role': 'system', 'content': system},
        {
            'role': 'user',
            'content': [
                {'type': 'audio', 'audio': audiopath},
                {'type': 'text', 'text': prompt},
            ],
        },
    ]
    return messages

system = 'You are an audio deep-thinking model. Upon receiving a question, please respond in two parts: <THINK> and <RESPONSE>. The <THINK> section should be further divided into four parts: <PLANNING>, <CAPTION>, <REASONING>, and <SUMMARY>.'
infer_backend = 'pt'
model = 'qwen2_audio'
last_model_checkpoint = ""  # Replace with the path to the downloaded Audio-Reasoner checkpoint
engine = PtEngine(last_model_checkpoint, max_batch_size=64, model_type=model)

def audioreasoner_gen(audiopath, prompt):
    return infer_stream(engine, InferRequest(messages=get_message(audiopath, prompt)))

def main():
    # Replace with your test audio file
    audiopath = "assets/test.wav"
    # Replace with your question about the test audio
    prompt = "Which of the following best describes the rhythmic feel and time signature of the song?"
    audioreasoner_gen(audiopath, prompt)

if __name__ == '__main__':
    main()
```
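
The model's reply arrives as a single tagged string. As a convenience, here is a minimal sketch (our addition, not part of the official codebase) that splits a response into its CoT sections using the tag names from the system prompt above; `parse_cot` is a hypothetical helper name, and the closing-tag format is an assumption:

```python
import re

def parse_cot(response: str) -> dict:
    """Extract each tagged section from an Audio-Reasoner response.

    Assumes the <PLANNING>/<CAPTION>/<REASONING>/<SUMMARY>/<RESPONSE>
    layout requested by the system prompt, with matching closing tags;
    a section missing from the output maps to None.
    """
    sections = {}
    for tag in ('PLANNING', 'CAPTION', 'REASONING', 'SUMMARY', 'RESPONSE'):
        match = re.search(rf'<{tag}>(.*?)</{tag}>', response, re.DOTALL)
        sections[tag.lower()] = match.group(1).strip() if match else None
    return sections

# Example usage with the streamed output from audioreasoner_gen:
# output = audioreasoner_gen("assets/test.wav", "What instrument is playing?")
# print(parse_cot(output)['response'])
```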

**Local test**

```sh
conda activate Audio-Reasoner
cd Audio-Reasoner
# Test-run the preset audio samples and questions
python inference.py 
```

## FAQ

**1. What kinds of audio can Audio-Reasoner understand, and what kind of thinking does it perform?**
Audio-Reasoner can understand various types of audio, including sound, music, and speech. It conducts in-depth thinking in four parts: **planning, caption, reasoning, and summary**.

**2. Why is `transformers` installed after `ms-swift` in the environment configuration?**
The `transformers` version has a significant impact on model performance; in our tests, `transformers==4.49.1` is one of the suitable versions. Installing ms-swift first provides a more stable base for the subsequent transformers install and avoids version conflicts that could degrade the model's performance.
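
To confirm the pin took effect after installation, a quick sanity check (a minimal sketch, assuming the version pinned in the Install section above) is:

```python
# Verify that the transformers version pinned in this README is active.
import transformers

expected = "4.49.1"  # version pinned in the Install section above
if transformers.__version__ != expected:
    raise RuntimeError(
        f"Expected transformers=={expected}, found {transformers.__version__}; "
        "re-run `pip install transformers==4.49.1`."
    )
print("transformers version OK:", transformers.__version__)
```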

##  Contact 

If you have any questions, please feel free to contact us via `[email protected]`.

##  Citation 
Please cite our paper if you find our model and dataset useful. Thanks!
```bibtex
@misc{xie2025audioreasonerimprovingreasoningcapability,
      title={Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models}, 
      author={Zhifei Xie and Mingbao Lin and Zihang Liu and Pengcheng Wu and Shuicheng Yan and Chunyan Miao},
      year={2025},
      eprint={2503.02318},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2503.02318}, 
}
```