FLM-Audio
FLM-Audio is the audio-language sub-version of RoboEgo/FLM-Ego, an omnimodal model with native full duplexity. It simultaneously listens, speaks, and composes an internal monologue, delivering low-latency, duplex conversational responses in both English and Chinese. FLM-Audio is robust to noise and user interruptions, prioritizing responsiveness and naturalness.
Model Card
- Language(s): Chinese; English
Technical Report
Motivation & Survey: Toward Embodied AGI: A Review of Embodied AI and the Road Ahead
System Card: RoboEgo System Card: An Omnimodal Model with Native Full Duplexity
Bias, Risks, and Limitations
Despite extensive data cleaning, FLM-Audio may still produce undesired content (e.g., biased or offensive language). Users should not disseminate unsafe outputs. The project authors are not responsible for misuse or harmful consequences.
Quick Start
Please refer to the FLM-Audio server repository to interact with FLM-Audio via the WebUI.
Usage Notice
This project is intended for research use only, in compliance with applicable laws. For commercial use, please contact us.
Training Details
Overview
We initialize the FLM-Audio backbone with a pre-trained language model. This initialization strategy significantly reduces computational cost while remaining effective for validating the core concepts of omnimodality and full duplexity. The training process of FLM-Audio consists of two stages: post-training and fine-tuning.
1. Post-training
In post-training, we introduce audio-oriented capabilities to the backbone model using a large-scale corpus of audio data, while preserving the language modeling abilities of the pre-trained foundation model. This stage encompasses a broad spectrum of speech-related tasks, including automatic speech recognition (ASR) and text-to-speech synthesis (TTS).
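As a rough illustration of how such multi-task post-training data can be organized (this is not the actual FLM-Audio pipeline; the task tags, field names, and placeholder token ids below are hypothetical), each sample can be flattened into a single sequence with the input modality first and the target second:

```python
# Hypothetical sketch: packing ASR and TTS samples into unified token
# sequences for audio-text post-training.
from dataclasses import dataclass
from typing import List

@dataclass
class Sample:
    task: str                # "asr" or "tts"
    audio_tokens: List[int]  # discrete audio codec tokens (assumed representation)
    text_tokens: List[int]   # text tokenizer ids (assumed representation)

def pack(sample: Sample, bos: int = 1, sep: int = 2, eos: int = 3) -> List[int]:
    """Flatten a sample into one sequence: input modality first, target second."""
    if sample.task == "asr":    # speech -> transcript
        return [bos, *sample.audio_tokens, sep, *sample.text_tokens, eos]
    if sample.task == "tts":    # text -> speech
        return [bos, *sample.text_tokens, sep, *sample.audio_tokens, eos]
    raise ValueError(f"unknown task: {sample.task}")

# Toy usage with placeholder token ids
print(pack(Sample("asr", audio_tokens=[101, 102, 103], text_tokens=[7, 8])))
```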
2. Supervised Fine-tuning (SFT)
In this stage, we fine-tune FLM-Audio to function as a general-purpose, full-duplex audio-language chatbot. To this end, we primarily utilize synthesized multi-turn speech dialogues. This dataset is further augmented to support full-duplex interruption handling and to enhance robustness against environmental noise; a conceptual sketch of this kind of augmentation follows.
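The sketch below illustrates the general idea of such augmentation (the SNR values, mixing logic, and barge-in simulation are illustrative assumptions, not the exact recipe used for FLM-Audio): background noise is mixed into the dialogue audio, and a user turn is made to start while the assistant is still speaking.

```python
# Illustrative augmentation: add background noise and simulate a user
# barge-in that truncates the assistant's speech mid-turn.
from typing import Optional, Tuple
import numpy as np

def add_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float = 10.0) -> np.ndarray:
    """Mix noise into speech at a target signal-to-noise ratio."""
    noise = np.resize(noise, speech.shape)
    speech_power = np.mean(speech ** 2) + 1e-8
    noise_power = np.mean(noise ** 2) + 1e-8
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

def simulate_barge_in(assistant: np.ndarray, user: np.ndarray, sample_rate: int = 16000,
                      rng: Optional[np.random.Generator] = None) -> Tuple[np.ndarray, int]:
    """Cut the assistant turn at a random point and start the user turn there."""
    rng = rng or np.random.default_rng()
    cut = int(rng.integers(sample_rate, len(assistant)))  # interrupt after >= 1 s
    mixed = np.concatenate([assistant[:cut], user])
    return mixed, cut

# Toy usage with random waveforms standing in for real audio
rng = np.random.default_rng(0)
assistant = rng.standard_normal(3 * 16000)
user = rng.standard_normal(2 * 16000)
noisy = add_noise(assistant, rng.standard_normal(16000))
clip, cut_point = simulate_barge_in(noisy, user, rng=rng)
```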
Model Architecture
To handle real-time language and audio, FLM-Audio features a 7B-parameter LLM backbone, enhanced by an audio encoder that embeds incoming speech into semantic and acoustic tokens, and an audio decoder that generates audio tokens. Listening, speaking, and internal monologue are interleaved in synchronized timesteps, with improved stream organization compared to related work (e.g., Moshi).
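The interleaving can be pictured roughly as follows. This is a conceptual sketch of synchronized full-duplex timesteps; the channel names, token ids, and flattening order are illustrative assumptions, not the exact FLM-Audio stream format.

```python
# Conceptual sketch of full-duplex stream interleaving: at every timestep the
# model consumes listened audio tokens and emits monologue text plus spoken
# audio tokens, so input and output streams advance in lockstep.
from dataclasses import dataclass
from typing import List

@dataclass
class Timestep:
    listen: List[int]     # audio tokens encoded from the incoming speech
    monologue: List[int]  # internal text tokens composed by the model
    speak: List[int]      # audio tokens to be decoded into outgoing speech

def interleave(steps: List[Timestep]) -> List[int]:
    """Flatten synchronized channels into one sequence, timestep by timestep."""
    sequence: List[int] = []
    for step in steps:
        sequence.extend(step.listen)     # what the model just heard
        sequence.extend(step.monologue)  # what it is thinking (text)
        sequence.extend(step.speak)      # what it says next (audio)
    return sequence

# Toy example: two timesteps with placeholder token ids
steps = [Timestep([11, 12], [501], [901, 902]),
         Timestep([13, 14], [502], [903, 904])]
print(interleave(steps))
```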
Evaluation
Audio Understanding and Generation
FLM-Audio performs comparably to strong audio-language models, most of which lack native full duplexity.
Model | ASR-zh (Fleurs-zh) ↓ | ASR-en (LibriSpeech-clean) ↓ | TTS-zh (Seed-tts-zh) ↓ | TTS-en (Seed-tts-en) ↓ |
---|---|---|---|---|
GPT-4o | 5.4 | - | - | - |
MinMo | 3.0 | 1.7 | 2.48 | 2.90 |
GLM-4-Voice | - | 2.8 | 2.10 | 2.91 |
Moshi | - | 5.7 | - | - |
Qwen-2.5-omni | 3.0 | 1.8 | 1.70 | 2.72 |
FLM-Audio | 5.4 | 3.2 | 2.10 | 2.95 |
Chat
In terms of chat experience, FLM-Audio demonstrates advantages in speech naturalness and responsiveness. The table below reports LLM judge scores on audio chat scenarios similar to Alpaca-Eval, along with human evaluation of video-grounded omnimodal chat. The human scores for Naturalness and Responsiveness reflect the contribution of the same audio-oriented training used for FLM-Audio.
Model | LLM score ↑ | Helpfulness ↑ | Naturalness ↑ | Responsiveness ↑ | Robustness ↑ |
---|---|---|---|---|---|
Qwen-2.5-omni | 6.36 | 7.4 | 7.9 | 8.1 | 7.7 |
FLM-Audio | 6.58 | 7.2 | 8.2 | 8.8 | 8.0 |
Acknowledgements
This work is supported by the National Science and Technology Major Project (No. 2022ZD0116314).
Citation
If you find our work helpful, please consider citing the following papers.
@article{embodied-agi,
title={Toward Embodied AGI: A Review of Embodied AI and the Road Ahead},
author={Wang, Yequan and Sun, Aixin},
journal={arXiv preprint arXiv:2505.14235},
year={2025}
}
@article{roboego,
title={RoboEgo System Card: An Omnimodal Model with Native Full Duplexity},
author={Yao, Yiqun and Li, Xiang and Jiang, Xin and Fang, Xuezhi and Yu, Naitong and Sun, Aixin and Wang, Yequan},
journal={arXiv preprint arXiv:2506.01934},
year={2025}
}