arxiv:2506.11160

S2ST-Omni: An Efficient Multilingual Speech-to-Speech Translation Framework via Seamless Speech-Text Alignment and Progressive Fine-tuning

Published on Jun 11

Authors:

Abstract

S2ST-Omni, a framework for multilingual speech-to-speech translation, decomposes the task into speech-to-text and text-to-speech components, using Whisper and Qwen models, and achieves high translation quality.

AI-generated summary

Despite recent advances in multilingual speech-to-speech translation (S2ST), several critical challenges persist: 1) achieving high-quality translation remains a major hurdle, and 2) most existing methods heavily rely on large-scale parallel speech corpora, which are costly and difficult to obtain. To address these issues, we propose S2ST-Omni, an efficient and scalable framework for multilingual S2ST. Specifically, we decompose the S2ST task into speech-to-text translation (S2TT) and text-to-speech synthesis (TTS). For S2TT, we propose an effective speech language model that integrates the pretrained Whisper encoder for robust audio understanding and Qwen 3.0 for advanced text comprehension. A lightweight speech adapter is employed to bridge the modality gap between speech and text representations. To further facilitate the multimodal knowledge learning, a two-stage fine-tuning strategy is introduced. In the TTS stage, we adopt a streaming autoregressive generation approach to produce natural and fluent target speech. Experiments on the CVSS benchmark show that S2ST-Omni consistently outperforms existing state-of-the-art S2ST systems in translation quality, highlighting its effectiveness and superiority.

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2506.11160 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2506.11160 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2506.11160 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.