---
license: mit
datasets:
- openslr/librispeech_asr
language:
- en
base_model:
- HuggingFaceTB/SmolLM2-360M-Instruct
tags:
- audio
- speech
- tts
- asr
- unified_model
pipeline_tag: any-to-any
library_name: transformers
---
|
|
|
## 1. Introduction
|
|
|
This work introduces MonoSpeech, a novel approach that integrates autoregression and flow matching within a transformer-based framework for unified speech understanding and generation. MonoSpeech achieves both speech comprehension and generation with a single model trained in a single stage. Our experiments demonstrate that MonoSpeech delivers strong performance on both automatic speech recognition and zero-shot speech synthesis. By combining autoregression and flow matching, MonoSpeech lays a foundation for extending this paradigm to additional audio understanding and generation tasks.
|
|
|
[**Github Repository**](https://github.com/gwh22/MonoSpeech)
|
|
|
<div align="center">

<img alt="MonoSpeech architecture" src="assets/MonoSpeech.jpg" style="width:90%;">

</div>
|
|
|
|
|
|
|
## 2. Quick Start
|
|
|
For installation and usage instructions, please refer to the [**Github Repository**](https://github.com/gwh22/Univoice).
|
|
|
|
|
|
|
|