From TOWER to SPIRE: Adding the Speech Modality to a Text-Only LLM
Abstract
Large language models (LLMs) have shown remarkable performance and generalization capabilities across multiple languages and tasks, making them attractive targets for multi-modality integration (e.g., images or speech). In this work, we extend an existing LLM to the speech modality via speech discretization and continued pre-training. In particular, we are interested in multilingual LLMs, such as TOWER, as their pre-training setting allows us to treat discretized speech input as an additional translation language. The resulting open-source model, SPIRE, is able to transcribe and translate English speech input while maintaining TOWER's original performance on translation-related tasks, showing that integrating discretized speech input as an additional language is feasible during LLM adaptation. We make our code and models available to the community.
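As a rough illustration of what "speech discretization" can look like, the sketch below maps frame-level feature vectors to discrete unit IDs and renders them as text-like tokens that an LLM vocabulary could be extended with. The abstract does not specify the method, so everything here is an assumption: nearest-centroid assignment (as in k-means clustering of speech features), collapsing of consecutive repeated units, and the hypothetical `<unit_N>` token naming. The toy centroids and frames are made up for the example.

```python
from itertools import groupby


def nearest_centroid(frame, centroids):
    """Return the index of the centroid closest to `frame` (squared Euclidean distance)."""
    def sq_dist(centroid):
        return sum((x - y) ** 2 for x, y in zip(frame, centroid))
    return min(range(len(centroids)), key=lambda i: sq_dist(centroids[i]))


def discretize(frames, centroids, dedup=True):
    """Map each feature frame to a unit ID; optionally collapse consecutive repeats.

    Assumption: centroids were learned beforehand (e.g. by k-means) on
    features from some speech encoder. Neither step is shown here.
    """
    units = [nearest_centroid(frame, centroids) for frame in frames]
    if dedup:
        units = [unit for unit, _ in groupby(units)]
    return units


def units_to_tokens(units, template="<unit_{}>"):
    """Render unit IDs as strings. The token format is hypothetical."""
    return "".join(template.format(u) for u in units)


# Toy data: three 2-dimensional centroids and five feature frames.
centroids = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
frames = [[0.1, -0.1], [0.2, 0.0], [0.9, 1.1], [0.1, 1.8], [0.0, 2.1]]

units = discretize(frames, centroids)
speech_tokens = units_to_tokens(units)
print(units)          # [0, 1, 2]
print(speech_tokens)  # <unit_0><unit_1><unit_2>
```

Once speech is a sequence of such tokens, it can in principle be placed in a prompt where source-language text would normally go, which is the sense in which discretized speech can be treated as an additional translation language.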
Community
Spire extends Tower to speech through discrete audio representations and shows strong results on ASR and speech translation without losing any of Tower's performance on text translation.