---
license: mit
datasets:
- openslr/librispeech_asr
language:
- en
base_model:
- HuggingFaceTB/SmolLM2-360M-Instruct
tags:
- audio
- speech
- tts
- asr
- unified_model
pipeline_tag: any-to-any
library_name: transformers
---
|
|
|
## 1. Introduction
|
|
|
This work introduces MonoSpeech, a novel approach that integrates autoregression and flow matching within a transformer-based framework for unified speech understanding and generation. MonoSpeech achieves both speech comprehension and generation with a single model trained in a single stage. Our experiments demonstrate that MonoSpeech delivers strong performance on both automatic speech recognition and zero-shot speech synthesis. By combining autoregression and flow matching, MonoSpeech lays a foundation for extending this paradigm to additional audio understanding and generation tasks.
|
|
|
[**Github Repository**](https://github.com/gwh22/MonoSpeech)
|
|
|
<div align="center">

<img alt="MonoSpeech architecture" src="assets/MonoSpeech.jpg" style="width:90%;">

</div>
|
|
|
|
|
|
|
## 2. Quick Start
|
|
|
For installation and usage instructions, please refer to the [**Github Repository**](https://github.com/gwh22/Univoice).
|
|
|
|
|
|
|
|