---
license: apache-2.0
pipeline_tag: text-to-speech
---

This repository contains the model described in [LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM](https://hf.co/papers/2503.04724).

For more information, check out the project page at https://mbzuai-oryx.github.io/LLMVoX/ and the code at https://github.com/mbzuai-oryx/LLMVoX.

# LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM

<div>
<a href="https://mbzuai-oryx.github.io/LLMVoX/"><img src="https://img.shields.io/badge/Project-Page-blue" alt="Project Page"></a>
<a href="https://arxiv.org/abs/2503.04724"><img src="https://img.shields.io/badge/arXiv-2503.04724-b31b1b.svg" alt="arXiv"></a>
<a href="https://github.com/mbzuai-oryx/LLMVoX/"><img src="https://img.shields.io/badge/GitHub-LLMVoX-black?logo=github" alt="GitHub Repository"></a>
<a href="https://github.com/mbzuai-oryx/LLMVoX/blob/main/LICENSE"><img src="https://img.shields.io/badge/License-MIT-yellow.svg" alt="License: MIT"></a>
</div>

**Authors:**  
**[Sambal Shikhar](https://github.com/mbzuai-oryx/LLMVoX?tab=readme-ov-file)**, **[Mohammed Irfan K](https://scholar.google.com/citations?user=GJp0keYAAAAJ&hl=en)**, **[Sahal Shaji Mullappilly](https://scholar.google.com/citations?user=LJWxVpUAAAAJ&hl=en)**, **[Fahad Khan](https://sites.google.com/view/fahadkhans/home)**, **[Jean Lahoud](https://scholar.google.com/citations?user=LsivLPoAAAAJ&hl=en)**, **[Rao Muhammad Anwer](https://scholar.google.com/citations?hl=en&authuser=1&user=_KlvMVoAAAAJ)**, **[Salman Khan](https://salman-h-khan.github.io/)**, **[Hisham Cholakkal](https://scholar.google.com/citations?hl=en&user=bZ3YBRcAAAAJ)**

**Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI), UAE**

<p align="center">
    <img src="assets/arch_diagram.svg" alt="LLMVoX Architecture" width="800px">
</p>

<video src="https://github.com/user-attachments/assets/6d305563-3c62-4f14-a8aa-acedf2143f76" width="500" controls></video>

## Overview

LLMVoX is a lightweight (30M-parameter), LLM-agnostic, autoregressive streaming text-to-speech (TTS) system that converts the text output of any Large Language Model into high-fidelity streaming speech with low latency.

Key features:
- 🚀 **Lightweight & Fast**: Only 30M parameters, with end-to-end latency as low as 300 ms
- 🔌 **LLM-Agnostic**: Works with any LLM or Vision-Language Model without fine-tuning
- 🌊 **Multi-Queue Streaming**: Enables continuous, low-latency speech generation
- 🌐 **Multilingual Support**: Extends to new languages through dataset adaptation alone

## Quick Start

### Installation

```bash
# Requirements: CUDA 11.7+, Flash Attention 2.0+ compatible GPU

git clone https://github.com/mbzuai-oryx/LLMVoX.git
cd LLMVoX

conda create -n llmvox python=3.9
conda activate llmvox

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install flash-attn --no-build-isolation
pip install -r requirements.txt

# Download checkpoints from Hugging Face
# https://huggingface.co/MBZUAI/LLMVoX/tree/main
mkdir -p CHECKPOINTS
# Download wavtokenizer_large_speech_320_24k.ckpt and ckpt_english_tiny.pt into CHECKPOINTS/
```
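
Before launching a server, it can be worth verifying that PyTorch sees the GPU and that Flash Attention built correctly. The one-liner below is a minimal sanity check using the standard module names for these packages:

```bash
# Sanity check: confirm CUDA is visible and flash-attn imports cleanly.
python -c "import torch, flash_attn; print(torch.__version__, torch.cuda.is_available())"
```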

### Voice Chat

```bash
# Basic usage
python streaming_server.py --chat_type voice --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct"

# With multiple GPUs
python streaming_server.py --chat_type voice --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct" \
  --llm_device "cuda:0" --tts_device_1 1 --tts_device_2 2

# Balance latency/quality
python streaming_server.py --chat_type voice --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct" \
  --initial_dump_size_1 10 --initial_dump_size_2 160 --max_dump_size 1280
```

### Text Chat & Visual Speech

```bash
# Text-to-Speech
python streaming_server.py --chat_type text --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct"

# Visual Speech (Speech + Image → Speech)
python streaming_server.py --chat_type visual_speech --llm_checkpoint "Qwen/Qwen2.5-VL-7B-Instruct" \
  --eos_token "<|im_end|>"

# Multimodal (support for models like Phi-4)
python streaming_server.py --chat_type multimodal --llm_checkpoint "microsoft/Phi-4-multimodal-instruct" \
  --eos_token "<|end|>"
```

## API Reference

| Endpoint | Purpose | Required Parameters |
|----------|---------|---------------------|
| `/tts` | Text-to-speech | `text`: String to convert |
| `/voicechat` | Voice conversations | `audio_base64`, `source_language`, `target_language` |
| `/multimodalchat` | Voice + multiple images | `audio_base64`, `image_list` |
| `/vlmschat` | Voice + single image | `audio_base64`, `image_base64`, `source_language`, `target_language` |
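
As a quick smoke test, the endpoints can be exercised with `curl` once a server is running. The sketch below is illustrative only: the payload shape beyond the listed parameters, the response being raw audio bytes, and the `PORT` value are assumptions, so check `streaming_server.py` for the exact request and response contract.

```bash
# Hypothetical smoke test for /tts (payload/response details assumed; see streaming_server.py).
curl -X POST "http://localhost:PORT/tts" \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello from LLMVoX!"}' \
  --output reply.wav

# /voicechat takes base64-encoded audio plus language fields.
# `base64 -w 0` is the GNU coreutils flag; on macOS use `base64 -i input.wav`.
AUDIO_B64=$(base64 -w 0 input.wav)
curl -X POST "http://localhost:PORT/voicechat" \
  -H "Content-Type: application/json" \
  -d "{\"audio_base64\": \"$AUDIO_B64\", \"source_language\": \"English\", \"target_language\": \"English\"}" \
  --output response.wav
```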

## Local UI Demo

<p align="center">
    <img src="assets/ui.png" alt="Demo UI" width="800px">
</p>

```bash
# Start server
python streaming_server.py --chat_type voice --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct" --api_port PORT

# Launch UI
python run_ui.py --ip STREAMING_SERVER_IP --port PORT
```

## Citation

```bibtex
@article{shikhar2025llmvox,
  title={LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM},
  author={Shikhar, Sambal and Kurpath, Mohammed Irfan and Mullappilly, Sahal Shaji and Lahoud, Jean and Khan, Fahad and Anwer, Rao Muhammad and Khan, Salman and Cholakkal, Hisham},
  journal={arXiv preprint arXiv:2503.04724},
  year={2025}
}
```

## Acknowledgments

- [nanoGPT](https://github.com/karpathy/nanoGPT) by Andrej Karpathy
- [WavTokenizer](https://github.com/jishengpeng/WavTokenizer)
- [Whisper](https://github.com/openai/whisper)
- [Neural G2P](https://github.com/lingjzhu/CharsiuG2P)

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.