---
license: mit
---
# Audio-Reasoner
We implemented inference scaling on **Audio-Reasoner**, a large audio language model, enabling **deepthink** and **structured chain-of-thought (CoT) reasoning** for multimodal understanding and reasoning. To support this, we constructed **CoTA**, a high-quality dataset of **1.2M reasoning-rich samples** built with structured CoT techniques. Audio-Reasoner achieves state-of-the-art results on the **MMAU-mini (+25.42%)** and **AIR-Bench-Chat (+14.57%)** benchmarks.
<p align="center">
Audio-Reasoner-7B <a href="https://huggingface.co/zhifeixie/Audio-Reasoner/tree/main">🤗</a> | CoTA Dataset <a href="https://huggingface.co">🤗</a> (coming soon)<br>
Paper <a href="https://arxiv.org/abs/2503.02318"> 📑</a> | Wechat <a href="https://github.com/xzf-thu/Audio-Reasoner/blob/main/assets/wechat.jpg">💭</a> | Code <a href="https://github.com/xzf-thu/Audio-Reasoner"> ⚙️</a>
<br>
<a href="#demo"> Demo</a> • <a href="#install">Install</a> • <a href="#quick-start">Quick Start</a> • <a href="#faq">FAQ</a> • <a href="#contact">Contact us</a><br>
<br>
If you like our work, please give us a star ⭐!
</p>
## Main Results
Audio-Reasoner achieves state-of-the-art results on MMAU-mini and AIR-Bench-Chat; see our <a href="https://arxiv.org/abs/2503.02318">paper</a> for the full benchmark tables.
## News and Updates
- **2025.03.05:** ✅ **Audio-Reasoner-7B checkpoint is released on HuggingFace <a href="https://huggingface.co/zhifeixie/Audio-Reasoner/tree/main">🤗</a>!**
- **2025.03.05:** ✅ **Audio-Reasoner paper is uploaded to arXiv <a href="https://arxiv.org/abs/2503.02318">📑</a>.**
- **2025.03.04:** ✅ **Demos, inference code, and evaluation results have been released.**
- **2025.03.04:** ✅ **Created this repo.**
## Roadmap
- **2025.03:** **🔜 Upload the CoTA dataset to HuggingFace 🤗.**
- **2025.04:** **🔜 Open-source the data synthesis pipeline and training code.**
## Demo
<p align="center" width="80%">
<video controls src="https://github.com/user-attachments/assets/d50f75e7-288b-454b-92a3-c6f058be231b" title="v" width="100%"></video>
</p>
## Features
✅ Audio-Reasoner enables **deep reasoning and inference scaling** in audio-based tasks; it is built on Qwen2-Audio-Instruct with structured CoT training (the output format is sketched below).
✅ CoTA offers **1.2M** high-quality captions and QA pairs across domains for structured reasoning and enhanced pretraining.
✅ The pretrained model and dataset cover various types of audio, including sound, music, and speech, and achieve state-of-the-art results across multiple benchmarks. Refer to our <a href="https://arxiv.org/abs/2503.02318">paper</a> for details.
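To make the structured CoT concrete, here is an illustrative sketch of the tagged output format that the Quick Start system prompt asks the model to produce. The tag structure comes from that prompt; the placeholder text inside each tag is our own illustration.
```
<THINK>
  <PLANNING> Decide which audio cues are relevant to the question. </PLANNING>
  <CAPTION> Describe the audio: instruments, tempo, speech content, ... </CAPTION>
  <REASONING> Connect the caption to the question step by step. </REASONING>
  <SUMMARY> Condense the reasoning into a short conclusion. </SUMMARY>
</THINK>
<RESPONSE> The final answer shown to the user. </RESPONSE>
```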
## Install
**Clone and install**
- Clone the repo
```sh
git clone https://github.com/xzf-thu/Audio-Reasoner.git
cd Audio-Reasoner
```
- Install the required packages
```sh
conda create -n Audio-Reasoner python=3.10
conda activate Audio-Reasoner
pip install -r requirements.txt
pip install transformers==4.49.1
```
## Quick Start
**Chat using ms-swift**
```python
from swift.llm import InferEngine, InferRequest, PtEngine, RequestConfig
from swift.plugin import InferStats

def infer_stream(engine: 'InferEngine', infer_request: 'InferRequest'):
    # Stream tokens from the engine and collect them into a single string.
    request_config = RequestConfig(max_tokens=2048, temperature=0, stream=True)
    metric = InferStats()
    gen = engine.infer([infer_request], request_config, metrics=[metric])
    query = infer_request.messages[0]['content']
    output = ""
    print(f'query: {query}\nresponse: ', end='')
    for resp_list in gen:
        if resp_list[0] is None:
            continue
        print(resp_list[0].choices[0].delta.content, end='', flush=True)
        output += resp_list[0].choices[0].delta.content
    print()
    print(f'metric: {metric.compute()}')
    return output

def get_message(audiopath, prompt):
    # Build a multimodal chat request: system prompt, then audio + text from the user.
    messages = [
        {'role': 'system', 'content': system},
        {
            'role': 'user',
            'content': [
                {'type': 'audio', 'audio': audiopath},
                {'type': 'text', 'text': prompt},
            ]
        }
    ]
    return messages

system = 'You are an audio deep-thinking model. Upon receiving a question, please respond in two parts: <THINK> and <RESPONSE>. The <THINK> section should be further divided into four parts: <PLANNING>, <CAPTION>, <REASONING>, and <SUMMARY>.'
infer_backend = 'pt'
model = 'qwen2_audio'
last_model_checkpoint = ""  # Please replace this with the path to the checkpoint
engine = PtEngine(last_model_checkpoint, max_batch_size=64, model_type=model)

def audioreasoner_gen(audiopath, prompt):
    return infer_stream(engine, InferRequest(messages=get_message(audiopath, prompt)))

def main():
    # Please replace this with your test audio
    audiopath = "assets/test.wav"
    # Please replace this with your question about the test audio
    prompt = "Which of the following best describes the rhythmic feel and time signature of the song?"
    audioreasoner_gen(audiopath, prompt)

if __name__ == '__main__':
    main()
```
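Since the model answers in the tagged format requested by the system prompt above, you may want to separate the final answer from the reasoning trace. Below is a minimal post-processing sketch; the `extract_response` helper is our own illustration (not part of the repo) and assumes the output contains a `<RESPONSE>...</RESPONSE>` span as the prompt requests.
```python
import re

def extract_response(output: str) -> str:
    """Return the text inside <RESPONSE>...</RESPONSE>, or the raw output as a fallback."""
    match = re.search(r'<RESPONSE>(.*?)</RESPONSE>', output, flags=re.DOTALL)
    return match.group(1).strip() if match else output.strip()

# Example usage with the Quick Start script above:
# answer = extract_response(audioreasoner_gen(audiopath, prompt))
```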
**Local test**
```sh
conda activate Audio-Reasoner
cd Audio-Reasoner
# Test-run the preset audio samples and questions
python inference.py
```
## FAQ
**1. What kinds of audio can Audio-Reasoner understand, and what kind of thinking does it perform?**
Audio-Reasoner can understand various types of audio, including sound, music, and speech. It conducts in-depth thinking in four parts: **planning, caption, reasoning, and summary**.
**2. Why is transformers installed after 'ms-swift' in the environment configuration?**
The transformers version has a significant impact on model performance; in our tests, `transformers==4.49.1` is one of the suitable versions. Installing ms-swift first and then pinning transformers helps ensure a stable environment and avoids version conflicts that could degrade the model's performance.
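As a quick sanity check (a minimal sketch of our own, not part of the repo), you can verify the pinned version at runtime before loading the model:
```python
import transformers

# Fail fast if the environment has drifted from the tested transformers version.
assert transformers.__version__ == "4.49.1", (
    f"Expected transformers==4.49.1, found {transformers.__version__}"
)
```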
## Contact
If you have any questions, please feel free to contact us via `[email protected]`.
## Citation
Please cite our paper if you find our model and dataset useful. Thanks!
```bibtex
@misc{xie2025audioreasonerimprovingreasoningcapability,
title={Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models},
author={Zhifei Xie and Mingbao Lin and Zihang Liu and Pengcheng Wu and Shuicheng Yan and Chunyan Miao},
year={2025},
eprint={2503.02318},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2503.02318},
}
```