FLM-Audio
FLM-Audio is the audio-language sub-version of RoboEgo/FLM-Ego, an omnimodal model with native full duplexity. It simultaneously listens, speaks, and composes an internal monologue, delivering low-latency, duplex conversational responses in both English and Chinese. FLM-Audio is robust to noise and user interruptions, prioritizing responsiveness and naturalness.
Model Card
- Language(s): Chinese; English
Technical Report
Motivation & Survey: Toward Embodied AGI: A Review of Embodied AI and the Road Ahead
System Card: RoboEgo System Card: An Omnimodal Model with Native Full Duplexity
Bias, Risks, and Limitations
Despite extensive data cleaning, FLM-Audio may still produce undesired content (e.g., biased or offensive language). Users should not disseminate unsafe outputs. The project authors are not responsible for misuse or harmful consequences.
Quick Start
Please refer to the FLM-Audio server repository to interact with FLM-Audio via the WebUI.
Usage Notice
This project is intended for research use only, in compliance with applicable laws. For commercial use, please contact us.
Training Details
Overview
We initialize the FLM-Audio backbone with a pre-trained language model. This initialization strategy significantly reduces computational cost while remaining effective for validating the core concepts of omnimodality and full duplexity. The training process of FLM-Audio consists of two stages: post-training and fine-tuning.
1. Post-training
In post-training, we introduce audio-oriented capabilities to the backbone model using a large-scale corpus of audio data, while preserving the language modeling abilities of the pre-trained foundation model. This stage encompasses a broad spectrum of speech-related tasks, including automatic speech recognition (ASR) and text-to-speech synthesis (TTS).
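As a rough illustration of how such multi-task post-training data can be organized (this is not the actual FLM-Audio pipeline; the task tags, field names, and placeholder token ids below are hypothetical), each sample can be flattened into a single sequence with the input modality first and the target second:

```python
# Hypothetical sketch: packing ASR and TTS samples into unified token
# sequences for audio-text post-training.
from dataclasses import dataclass
from typing import List

@dataclass
class Sample:
    task: str                # "asr" or "tts"
    audio_tokens: List[int]  # discrete audio codec tokens (assumed representation)
    text_tokens: List[int]   # text tokenizer ids (assumed representation)

def pack(sample: Sample, bos: int = 1, sep: int = 2, eos: int = 3) -> List[int]:
    """Flatten a sample into one sequence: input modality first, target second."""
    if sample.task == "asr":    # speech -> transcript
        return [bos, *sample.audio_tokens, sep, *sample.text_tokens, eos]
    if sample.task == "tts":    # text -> speech
        return [bos, *sample.text_tokens, sep, *sample.audio_tokens, eos]
    raise ValueError(f"unknown task: {sample.task}")

# Toy usage with placeholder token ids
print(pack(Sample("asr", audio_tokens=[101, 102, 103], text_tokens=[7, 8])))
```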
2. Supervised Fine-tuning (SFT)
In this stage, we fine-tune FLM-Audio to function as a general-purpose, full-duplex audio-language chatbot. To this end, we primarily utilize synthesized multi-turn speech dialogues. This dataset is further augmented to support full-duplex interruption handling and to enhance robustness against environmental noise; a conceptual sketch of this kind of augmentation follows.
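The sketch below illustrates the general idea of such augmentation (the SNR values, mixing logic, and barge-in simulation are illustrative assumptions, not the exact recipe used for FLM-Audio): background noise is mixed into the dialogue audio, and a user turn is made to start while the assistant is still speaking.

```python
# Illustrative augmentation: add background noise and simulate a user
# barge-in that truncates the assistant's speech mid-turn.
from typing import Optional, Tuple
import numpy as np

def add_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float = 10.0) -> np.ndarray:
    """Mix noise into speech at a target signal-to-noise ratio."""
    noise = np.resize(noise, speech.shape)
    speech_power = np.mean(speech ** 2) + 1e-8
    noise_power = np.mean(noise ** 2) + 1e-8
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

def simulate_barge_in(assistant: np.ndarray, user: np.ndarray, sample_rate: int = 16000,
                      rng: Optional[np.random.Generator] = None) -> Tuple[np.ndarray, int]:
    """Cut the assistant turn at a random point and start the user turn there."""
    rng = rng or np.random.default_rng()
    cut = int(rng.integers(sample_rate, len(assistant)))  # interrupt after >= 1 s
    mixed = np.concatenate([assistant[:cut], user])
    return mixed, cut

# Toy usage with random waveforms standing in for real audio
rng = np.random.default_rng(0)
assistant = rng.standard_normal(3 * 16000)
user = rng.standard_normal(2 * 16000)
noisy = add_noise(assistant, rng.standard_normal(16000))
clip, cut_point = simulate_barge_in(noisy, user, rng=rng)
```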
Model Architecture
To handle real-time language and audio, FLM-Audio features a 7B-parameter LLM backbone, enhanced by an audio encoder that embeds incoming speech into semantic and acoustic tokens, and an audio decoder that generates audio tokens. Listening, speaking, and internal monologue are interleaved in synchronized timesteps, with improved stream organization compared to related work (e.g., Moshi).
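The interleaving can be pictured roughly as follows. This is a conceptual sketch of synchronized full-duplex timesteps; the channel names, token ids, and flattening order are illustrative assumptions, not the exact FLM-Audio stream format.

```python
# Conceptual sketch of full-duplex stream interleaving: at every timestep the
# model consumes listened audio tokens and emits monologue text plus spoken
# audio tokens, so input and output streams advance in lockstep.
from dataclasses import dataclass
from typing import List

@dataclass
class Timestep:
    listen: List[int]     # audio tokens encoded from the incoming speech
    monologue: List[int]  # internal text tokens composed by the model
    speak: List[int]      # audio tokens to be decoded into outgoing speech

def interleave(steps: List[Timestep]) -> List[int]:
    """Flatten synchronized channels into one sequence, timestep by timestep."""
    sequence: List[int] = []
    for step in steps:
        sequence.extend(step.listen)     # what the model just heard
        sequence.extend(step.monologue)  # what it is thinking (text)
        sequence.extend(step.speak)      # what it says next (audio)
    return sequence

# Toy example: two timesteps with placeholder token ids
steps = [Timestep([11, 12], [501], [901, 902]),
         Timestep([13, 14], [502], [903, 904])]
print(interleave(steps))
```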
Evaluation
Audio Understanding and Generation
FLM-Audio performs comparably to strong audio-language models, most of which lack native full duplexity.
Model | ASR-zh (Fleurs-zh) ↓ | ASR-en (LibriSpeech-clean) ↓ | TTS-zh (Seed-tts-zh) ↓ | TTS-en (Seed-tts-en) ↓ |
---|---|---|---|---|
GPT-4o | 5.4 | - | - | - |
MinMo | 3.0 | 1.7 | 2.48 | 2.90 |
GLM-4-Voice | - | 2.8 | 2.10 | 2.91 |
Moshi | - | 5.7 | - | - |
Qwen-2.5-omni | 3.0 | 1.8 | 1.70 | 2.72 |
FLM-Audio | 5.4 | 3.2 | 2.10 | 2.95 |
Chat
In terms of chat experience, FLM-Audio demonstrates advantages in speech naturalness and responsiveness. The table below reports LLM judge scores on audio chat scenarios similar to Alpaca-Eval, along with human evaluation of video-grounded omnimodal chat. The human scores for Naturalness and Responsiveness reflect the contribution of the same audio-oriented training used for FLM-Audio.
Model | LLM score ↑ | Helpfulness ↑ | Naturalness ↑ | Responsiveness ↑ | Robustness ↑ |
---|---|---|---|---|---|
Qwen-2.5-omni | 6.36 | 7.4 | 7.9 | 8.1 | 7.7 |
FLM-Audio | 6.58 | 7.2 | 8.2 | 8.8 | 8.0 |
Acknowledgements
This work is supported by the National Science and Technology Major Project (No. 2022ZD0116314).
Citation
If you find our work helpful, please consider citing the following papers.
@article{embodied-agi,
title={Toward Embodied AGI: A Review of Embodied AI and the Road Ahead},
author={Wang, Yequan and Sun, Aixin},
journal={arXiv preprint arXiv:2505.14235},
year={2025}
}
@article{roboego,
title={RoboEgo System Card: An Omnimodal Model with Native Full Duplexity},
author={Yao, Yiqun and Li, Xiang and Jiang, Xin and Fang, Xuezhi and Yu, Naitong and Sun, Aixin and Wang, Yequan},
journal={arXiv preprint arXiv:2506.01934},
year={2025}
}