---
license: apache-2.0
pipeline_tag: text-to-speech
---
This repository contains the model as described in [LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM](https://hf.co/papers/2503.04724).
For more information, check out the project page at https://mbzuai-oryx.github.io/LLMVoX/ and the code at https://github.com/mbzuai-oryx/LLMVoX.
# LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM
<div>
<a href="https://mbzuai-oryx.github.io/LLMVoX/"><img src="https://img.shields.io/badge/Project-Page-blue" alt="Project Page"></a>
<a href="https://arxiv.org/abs/2503.04724"><img src="https://img.shields.io/badge/arXiv-2503.04724-b31b1b.svg" alt="arXiv"></a>
<a href="https://github.com/mbzuai-oryx/LLMVoX/"><img src="https://img.shields.io/badge/GitHub-LLMVoX-black?logo=github" alt="GitHub Repository"></a>
<a href="https://github.com/mbzuai-oryx/LLMVoX/blob/main/LICENSE"><img src="https://img.shields.io/badge/License-MIT-yellow.svg" alt="License: MIT"></a>
</div>
**Authors:**
**[Sambal Shikhar](https://github.com/mbzuai-oryx/LLMVoX?tab=readme-ov-file)**, **[Mohammed Irfan K](https://scholar.google.com/citations?user=GJp0keYAAAAJ&hl=en)**, **[Sahal Shaji Mullappilly](https://scholar.google.com/citations?user=LJWxVpUAAAAJ&hl=en)**, **[Fahad Khan](https://sites.google.com/view/fahadkhans/home)**, **[Jean Lahoud](https://scholar.google.com/citations?user=LsivLPoAAAAJ&hl=en)**, **[Rao Muhammad Anwer](https://scholar.google.com/citations?hl=en&authuser=1&user=_KlvMVoAAAAJ)**, **[Salman Khan](https://salman-h-khan.github.io/)**, **[Hisham Cholakkal](https://scholar.google.com/citations?hl=en&user=bZ3YBRcAAAAJ)**
**Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI), UAE**
<p align="center">
<img src="assets/arch_diagram.svg" alt="LLMVoX Architecture" width="800px">
</p>
<video src="https://github.com/user-attachments/assets/6d305563-3c62-4f14-a8aa-acedf2143f76" width="500" controls></video>
## Overview
LLMVoX is a lightweight 30M-parameter, LLM-agnostic, autoregressive streaming Text-to-Speech (TTS) system designed to convert text outputs from Large Language Models into high-fidelity streaming speech with low latency.
Key features:
- **Lightweight & Fast**: Only 30M parameters with end-to-end latency as low as 300ms
- **LLM-Agnostic**: Works with any LLM and Vision-Language Model without fine-tuning
- **Multi-Queue Streaming**: Enables continuous, low-latency speech generation
- **Multilingual Support**: Adaptable to new languages through dataset adaptation
## Quick Start
### Installation
```bash
# Requirements: CUDA 11.7+, Flash Attention 2.0+ compatible GPU
git clone https://github.com/mbzuai-oryx/LLMVoX.git
cd LLMVoX
conda create -n llmvox python=3.9
conda activate llmvox
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install flash-attn --no-build-isolation
pip install -r requirements.txt
# Download checkpoints from Hugging Face
# https://huggingface.co/MBZUAI/LLMVoX/tree/main
mkdir -p CHECKPOINTS
# Download wavtokenizer_large_speech_320_24k.ckpt and ckpt_english_tiny.pt
```
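The two checkpoints listed above can also be pulled with the `huggingface-cli` tool from the `huggingface_hub` package. The command below is a minimal sketch; it assumes both files sit at the top level of the `MBZUAI/LLMVoX` repository, so adjust the paths if the repository layout differs:

```bash
# Fetch both checkpoints into ./CHECKPOINTS with the Hugging Face CLI
pip install -U "huggingface_hub[cli]"
huggingface-cli download MBZUAI/LLMVoX wavtokenizer_large_speech_320_24k.ckpt ckpt_english_tiny.pt \
  --local-dir CHECKPOINTS
```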
### Voice Chat
```bash
# Basic usage
python streaming_server.py --chat_type voice --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct"
# With multiple GPUs
python streaming_server.py --chat_type voice --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct" \
--llm_device "cuda:0" --tts_device_1 1 --tts_device_2 2
# Balance latency/quality
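# (smaller initial dump sizes reduce time-to-first-audio; larger ones favor audio quality)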
python streaming_server.py --chat_type voice --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct" \
--initial_dump_size_1 10 --initial_dump_size_2 160 --max_dump_size 1280
```
### Text Chat & Visual Speech
```bash
# Text-to-Speech
python streaming_server.py --chat_type text --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct"
# Visual Speech (Speech + Image → Speech)
python streaming_server.py --chat_type visual_speech --llm_checkpoint "Qwen/Qwen2.5-VL-7B-Instruct" \
--eos_token "<|im_end|>"
# Multimodal (support for models like Phi-4)
python streaming_server.py --chat_type multimodal --llm_checkpoint "microsoft/Phi-4-multimodal-instruct" \
--eos_token "<|end|>"
```
## API Reference
| Endpoint | Purpose | Required Parameters |
|----------|---------|---------------------|
| `/tts` | Text-to-speech | `text`: String to convert |
| `/voicechat` | Voice conversations | `audio_base64`, `source_language`, `target_language` |
| `/multimodalchat` | Voice + multiple images | `audio_base64`, `image_list` |
| `/vlmschat` | Voice + single image | `audio_base64`, `image_base64`, `source_language`, `target_language` |
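As a rough illustration of calling the `/tts` endpoint once the streaming server is running, the `curl` sketch below assumes the server listens on `localhost:5000` and returns the synthesized audio as raw bytes; the actual host, port, and response format depend on how `streaming_server.py` is launched, so treat this as a template rather than a definitive client.

```bash
# Hypothetical /tts request: adjust host/port to match --api_port,
# and check streaming_server.py for the exact response format.
curl -X POST http://localhost:5000/tts \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello from LLMVoX!"}' \
  --output output.wav
```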
## Local UI Demo
<p align="center">
<img src="assets/ui.png" alt="Demo UI" width="800px">
</p>
```bash
# Start server
python streaming_server.py --chat_type voice --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct" --api_port PORT
# Launch UI
python run_ui.py --ip STREAMING_SERVER_IP --port PORT
```
## Citation
```bibtex
@article{shikhar2025llmvox,
title={LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM},
author={Shikhar, Sambal and Kurpath, Mohammed Irfan and Mullappilly, Sahal Shaji and Lahoud, Jean and Khan, Fahad and Anwer, Rao Muhammad and Khan, Salman and Cholakkal, Hisham},
journal={arXiv preprint arXiv:2503.04724},
year={2025}
}
```
## Acknowledgments
- [Andrej Karpathy's NanoGPT](https://github.com/karpathy/nanoGPT)
- [WavTokenizer](https://github.com/jishengpeng/WavTokenizer)
- [Whisper](https://github.com/openai/whisper)
- [Neural G2P](https://github.com/lingjzhu/CharsiuG2P)
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.