Introducing Marvis TTS: Real-Time Streaming Speech Synthesis

Published August 27, 2025


We’re thrilled to announce the release of Marvis TTS v0.1, a conversational speech model that pushes the boundaries of what’s possible in text-to-speech (TTS) technology. Designed for efficiency, accessibility, and real-time performance, Marvis enables seamless streaming audio generation right on consumer devices such as Apple Silicon Macs, iPhones, and iPads. Whether you’re building voice assistants, content creation tools, or accessibility apps, Marvis brings high-quality, natural-sounding speech to the edge without compromising on speed.

You can find the model on Hugging Face at Marvis-AI/marvis-tts-250m-v0.1, complete with code examples, datasets, and everything you need to get started. Check out the GitHub repo for more details: Marvis TTS on GitHub.

What Makes Marvis Stand Out?

In a world where traditional TTS models often demand full text inputs or sacrifice real-time capabilities, Marvis flips the script. It streams audio chunks as text is processed, creating a truly conversational experience. No more awkward pauses or unnatural breaks—Marvis handles the entire text context intelligently to deliver coherent, expressive speech.

Built on the shoulders of innovative open-source foundations like Sesame’s CSM-1B and Kyutai’s Mimi codec, Marvis uses a multimodal transformer architecture that processes interleaved text and audio tokens. This results in a compact model that’s only 414MB when quantized, making it perfect for on-device inference without needing massive cloud resources.

Key Features

  • Real-Time Streaming: Generate and stream audio on-the-fly for natural, interactive dialogues.
  • Compact and Efficient: Runs smoothly on edge devices, with optimizations for mobile deployment (e.g., iOS and Android).
  • Natural Prosody: Processes full text contexts to avoid chunking artifacts, ensuring smooth intonation and flow.
  • Multimodal Design: Handles text and audio seamlessly for advanced speech-to-speech (STS) scenarios.
  • Expressive Synthesis: Tuned for emotional and varied speech, with an expressiveness setting of 0.5 in post-training.
  • Voice Adaptation: Supports basic voice customization with short audio references.

Currently optimized for English, Marvis delivers top-notch expressive synthesis, with support for languages like German, Portuguese, French, and Mandarin on the horizon.

Under the Hood: Model Architecture

Marvis leverages a dual-transformer setup for optimal performance:

  • Multimodal Backbone (250M parameters): A core transformer that models the semantic layer using the zeroth codebook level of Residual Vector Quantization (RVQ) tokens. It processes interleaved text and audio for deep contextual understanding.
  • Audio Decoder (60M parameters): A lightweight transformer that reconstructs the full 32-level RVQ codes into high-fidelity speech using Kyutai’s Mimi codec.

This end-to-end approach ensures low-latency generation while maintaining quality, without relying on regex-based text chunking like some older models. The base model is Marvis-AI/marvis-tts-250m-v0.1-base, fine-tuned from Sesame’s CSM.
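
To make this two-stage flow concrete, here is a minimal, hypothetical sketch of a streaming generation loop in Python. The function names (backbone_step, decoder_step, mimi_decode) and all token shapes are illustrative placeholders rather than the actual Marvis API; only the overall structure follows the description above.

import torch

# Illustrative stubs, NOT the real Marvis API; codebook sizes and
# frame lengths are placeholders chosen only to make the sketch run.
def backbone_step(context: torch.Tensor) -> torch.Tensor:
    # 250M multimodal backbone: from the interleaved text/audio context,
    # predict the zeroth (semantic) RVQ token for the next audio frame.
    return torch.randint(0, 2048, (1,))

def decoder_step(semantic_token: torch.Tensor) -> torch.Tensor:
    # 60M audio decoder: expand the semantic token into the full
    # 32-level RVQ code stack for this frame.
    return torch.randint(0, 2048, (32,))

def mimi_decode(frame_codes: torch.Tensor) -> torch.Tensor:
    # Kyutai's Mimi codec: RVQ codes back to 24 kHz waveform samples.
    return torch.zeros(1920)  # placeholder: one frame of audio

# Streaming loop: each frame is decoded and emitted as soon as it is
# ready, instead of waiting for the whole utterance to be synthesized.
context = torch.zeros(1, 0, dtype=torch.long)  # interleaved token context
for _ in range(10):  # one iteration per audio frame
    semantic = backbone_step(context)
    codes = decoder_step(semantic)
    chunk = mimi_decode(codes)  # a small audio chunk, playable immediately
    context = torch.cat([context, semantic.view(1, 1)], dim=1)

Because the heavy backbone only predicts one semantic token per frame and the small decoder fills in the residual levels, audio can be emitted frame by frame with low latency, which is what makes streaming possible without text chunking.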

Training Journey

Marvis was trained in phases to balance efficiency and expressiveness:

  • Pretraining: On the Amphion/Emilia-Dataset (Emilia-YODAS subset), with 2M steps on a single NVIDIA GH200 (96GB). Used bfloat16 precision, a 3e-4 learning rate, and a batch size of 64.
  • Post-Training: Fine-tuned on expressive speech data for 200K steps, with an expressiveness factor of 0.5. Same hardware setup but with a 1e-4 learning rate.
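
For quick reference, here is the reported setup collected into a single config. The values come straight from the list above; anything not stated there (the post-training batch size and precision) is marked as an assumption in the comments.

# Reported Marvis training setup, collected for reference.
TRAINING_PHASES = {
    "pretraining": {
        "dataset": "Amphion/Emilia-Dataset (Emilia-YODAS subset)",
        "steps": 2_000_000,
        "hardware": "1x NVIDIA GH200 (96GB)",
        "precision": "bfloat16",
        "learning_rate": 3e-4,
        "batch_size": 64,
    },
    "post_training": {
        "dataset": "expressive speech data",
        "steps": 200_000,
        "hardware": "1x NVIDIA GH200 (96GB)",  # "same hardware setup"
        "learning_rate": 1e-4,
        "expressiveness_factor": 0.5,
        # Batch size and precision were not stated for this phase;
        # assumed to match pretraining.
    },
}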

Total training cost? Around $2,000, split across pretraining/fine-tuning ($246.69 on GH200), data generation for post-training ($167.94 on RTX6000 Ada), and experiments (~$1,500 on various GPUs via Prime-Intellect and Jarvis-Labs). This frugal approach keeps Marvis accessible for indie developers and researchers.

Real-World Use Cases

Marvis isn’t just a tech demo—it’s built for impact:

  • Voice Assistants: Create real-time voices for smarter chatbots and virtual companions.
  • Content Creation: Automate voiceovers, narrations, and podcasts with expressive tones.
  • Accessibility: Empower communication aids with tailored speech synthesis for those with speech impairments.
  • Interactive Apps: Enable consistent voices in games, education, or customer service bots.
  • Media Production: Generate automated audio for news, audiobooks, or social media content.

Deploy Anywhere, Scale Everywhere

Local First: Run locally with just 2GB RAM using our ultra-compact 414MB quantized model, or use 4GB VRAM for full precision (GPU recommended for optimal speed).

Universal Reach: Works seamlessly across iOS, Windows, macOS, and Linux platforms.

Scale Seamlessly: Begin with local deployment for maximum privacy, control, and instant response times, then tap into cloud APIs when you need access to larger, more capable models that exceed your local hardware limits.

Getting Started with Marvis

You can get started with Marvis using MLX for edge-optimized inference or Hugging Face Transformers for broader compatibility. Here’s how to generate your first audio sample.

MLX

For fast, streaming TTS on devices like Apple Silicon:

pip install -U mlx-audio
python -m mlx_audio.tts.generate --model Marvis-AI/marvis-tts-250m-v0.1 --stream \
 --text "Marvis TTS is a new text-to-speech model that provides fast streaming on edge devices."

This command streams the audio output in real time, showcasing Marvis’s low-latency prowess. You can also choose between the two expressive voices we created by passing the --voice argument with either conversational_a (female) or conversational_b (male), as shown below.
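
For example, here is the same command with the male voice selected:

python -m mlx_audio.tts.generate --model Marvis-AI/marvis-tts-250m-v0.1 --stream \
 --voice conversational_b \
 --text "Marvis TTS is a new text-to-speech model that provides fast streaming on edge devices."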

🤗 Transformers (Basic Synthesis)

For a simple Python setup:

import torch
from transformers import AutoProcessor, CsmForConditionalGeneration
import soundfile as sf

model_id = "Marvis-AI/marvis-tts-250m-v0.1-transformers"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the model and processor
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

# Prepare inputs; the leading "[0]" selects the default speaker
text = "[0]Marvis TTS is a new text-to-speech model that provides fast streaming on edge devices."
inputs = processor(text, add_special_tokens=True, return_tensors="pt").to(device)
inputs.pop("token_type_ids")  # the model's generate() does not accept these

# Generate audio and save it as a 24 kHz, 16-bit WAV file
audio = model.generate(**inputs, output_audio=True)
sf.write("example_without_context.wav", audio[0].cpu().numpy(), samplerate=24_000, subtype="PCM_16")

Advanced features like voice cloning are coming soon to MLX and Transformers after we release the base model.

Limitations and Considerations

While powerful, Marvis has some boundaries:

  • Language Support: Currently optimized primarily for English; performance on other languages may be suboptimal.
  • Audio Quality Dependency: Voice cloning quality depends on the clarity and quality of the ~10-second reference audio.
  • Background Noise: Performance degrades with noisy reference audio or noisy inference environments.
  • Potential Hallucinations: The model may add or mispronounce words, especially on short or novel inputs.

On the ethical front, we’re committed to responsible AI. Users must comply with laws on voice synthesis, avoid unauthorized use, and respect privacy/IP rights. Always obtain permissions where necessary.

License and Citation

Marvis is released under the Apache 2.0 license, encouraging open innovation.

If you’re using Marvis in your work, please cite us:

@misc{marvis-tts-2025,
  title={Marvis-TTS: Efficient Real-time Voice Cloning with Streaming Speech Synthesis},
  author={Prince Canuma and Lucas Newman},
  year={2025}
}

Acknowledgments and Next Steps

A huge shoutout to the teams at Sesame and Kyutai for their foundational models, and to the open-source community for making this possible. We’re grateful for platforms like Hugging Face, which make sharing models like Marvis effortless.

This is just v0.1—stay tuned for multilingual support, improved robustness, and more integrations. We’d love your feedback: star the repo, try the model, and let us know what you build!

For the full model card, head to Hugging Face. Let’s make conversational AI more accessible together. 🚀

