---
license: apache-2.0
---

# NTPP: Generative Speech Language Modeling for Dual-Channel Spoken Dialogue via Next-Token-Pair Prediction

> **Authors: Qichao Wang\*, Ziqiao Meng\*, Wenqian Cui, Yifei Zhang, Pengcheng Wu, Bingzhe Wu, Irwin King, Liang Chen, Peilin Zhao†**

[Paper](https://arxiv.org/abs/2506.00975) · [Code](https://github.com/Chaos96/NTPP) · [Model](https://huggingface.co/aigc-x/NTPP) · [Demo](https://audio-3059.pages.dev/)

<!-- <embed src="assert/audio-introduction.pdf" width="620" height="500" type="application/pdf"> -->

Key features:

- Pre-training: Single-channel audio is converted into discrete tokens, and the model is trained on them with next-token prediction.
- SFT: Supervised fine-tuning uses a novel "next-token-pair prediction" objective on dual-channel dialogue, so the model learns to follow natural conversation.
- Result: More natural and fluid spoken interaction than baseline approaches.
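
The pairing idea behind the fine-tuning objective can be illustrated with toy data. The sketch below is only an illustration of the data layout, not the repository's implementation: it assumes each speaker's channel has already been tokenized into an equal-length list of integer IDs, and the function names (`make_token_pairs`, `next_pair_examples`) and token values are hypothetical.

```python
from typing import List, Tuple

Pair = Tuple[int, int]


def make_token_pairs(channel_a: List[int], channel_b: List[int]) -> List[Pair]:
    """Align two per-speaker token streams into (a_t, b_t) pairs, one per time step."""
    if len(channel_a) != len(channel_b):
        raise ValueError("both channels must cover the same number of time steps")
    return list(zip(channel_a, channel_b))


def next_pair_examples(pairs: List[Pair]) -> List[Tuple[List[Pair], Pair]]:
    """Build (history, target) examples: the history is pairs[:t], the target is pairs[t]."""
    return [(pairs[:t], pairs[t]) for t in range(1, len(pairs))]


# Toy token IDs (assumption); real IDs would come from an audio tokenizer.
speaker_a = [11, 12, 13, 14]
speaker_b = [21, 22, 23, 24]

pairs = make_token_pairs(speaker_a, speaker_b)
examples = next_pair_examples(pairs)
```

Each example asks the model to predict the next token of both channels at once, given the joint history of both speakers, in contrast to next-token prediction over a single stream.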

<img src="https://pub-ad90b2169561455ea151c5176b67b638.r2.dev/2025/07/20250707172902461.png" alt="Parrot" width="500"/>

## Installation

```bash
git clone https://github.com/Chaos96/NTPP.git
cd NTPP  # git clone creates a directory named after the repository
python -m venv venv
source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
pip install -r requirements.txt
```

## Usage

1. Prepare audio data: single-channel audio for pre-training and dual-channel dialogue audio for fine-tuning.
2. Pre-train: `python pretrain.py --input_data path/to/single_channel_data`
3. Fine-tune: `python finetune.py --input_data path/to/double_channel_data`
4. Inference: `python inference.py --input_audio path/to/input.wav`
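
For the data preparation step, it can help to inspect the two channels of a dialogue recording separately. The sketch below assumes the dual-channel data is stored as 2-channel PCM WAV files with one speaker per channel; the on-disk format the scripts above expect is not documented here, so treat this as an illustration only. The helper `split_stereo_wav` is hypothetical and uses only the Python standard library.

```python
import wave


def split_stereo_wav(src_path: str, left_path: str, right_path: str) -> int:
    """Write each channel of a 2-channel PCM WAV file to its own mono file.

    Returns the number of frames per channel.
    """
    with wave.open(src_path, "rb") as src:
        if src.getnchannels() != 2:
            raise ValueError("expected a 2-channel (stereo) WAV file")
        width = src.getsampwidth()   # bytes per sample
        rate = src.getframerate()    # samples per second
        n_frames = src.getnframes()
        frames = src.readframes(n_frames)

    # Samples are interleaved: one frame holds one sample from each channel.
    frame_size = 2 * width
    left = bytearray()
    right = bytearray()
    for i in range(0, len(frames), frame_size):
        left += frames[i:i + width]
        right += frames[i + width:i + frame_size]

    for path, data in ((left_path, left), (right_path, right)):
        with wave.open(path, "wb") as dst:
            dst.setnchannels(1)
            dst.setsampwidth(width)
            dst.setframerate(rate)
            dst.writeframes(bytes(data))
    return n_frames
```

Calling `split_stereo_wav("dialogue.wav", "speaker_a.wav", "speaker_b.wav")` (hypothetical file names) would produce one mono file per speaker, which can then be listened to or checked for channel bleed before fine-tuning.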
|