Voxtral-Mini-3B-2507-FP8-dynamic

Model Overview

  • Model Architecture: VoxtralForConditionalGeneration
    • Input: Audio-Text
    • Output: Text
  • Model Optimizations:
    • Weight quantization: FP8
    • Activation quantization: FP8
  • Intended Use Cases: Voxtral builds upon Ministral-3B with powerful audio understanding capabilities.
    • Dedicated transcription mode: Voxtral can operate in a pure speech transcription mode to maximize performance. By default, Voxtral automatically predicts the source audio language and transcribes the text accordingly
    • Long-form context: With a 32k token context length, Voxtral handles audio up to 30 minutes for transcription, or 40 minutes for understanding
    • Built-in Q&A and summarization: Supports asking questions directly through audio. Analyze audio and generate structured summaries without the need for separate ASR and language models
    • Natively multilingual: Automatic language detection and state-of-the-art performance in the world's most widely used languages (English, Spanish, French, Portuguese, Hindi, German, Dutch, Italian)
    • Function-calling straight from voice: Enables direct triggering of backend functions, workflows, or API calls based on spoken user intents
    • Highly capable at text: Retains the text understanding capabilities of its language model backbone, Ministral-3B
  • Release Date: 08/21/2025
  • Version: 1.0
  • Model Developers: Red Hat

Quantized version of Voxtral-Mini-3B-2507.

Model Optimizations

This model was obtained by quantizing the weights and activations of Voxtral-Mini-3B-2507 to the FP8 data type. This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x). Weight quantization also reduces disk size requirements by approximately 50%.
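
As a rough back-of-the-envelope check of the ~50% figure (the parameter count below is an assumption for illustration only; in practice only the language-model linear layers are quantized, as described next):

# Illustrative memory estimate for a ~3B-parameter language-model backbone.
params = 3.0e9               # assumed parameter count, for illustration only
bf16_gb = params * 2 / 1e9   # 16-bit weights: 2 bytes per parameter
fp8_gb = params * 1 / 1e9    # 8-bit weights: 1 byte per parameter
print(f"BF16: ~{bf16_gb:.0f} GB, FP8: ~{fp8_gb:.0f} GB ({fp8_gb / bf16_gb:.0%} of the original)")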

Only the weights and activations of the linear operators within the transformer blocks of the language model are quantized. Weights are quantized with a symmetric static per-channel scheme, whereas activations are quantized with a symmetric dynamic per-token scheme. The llm-compressor library is used for quantization.
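
The following is a minimal, illustrative sketch of what those two schemes mean, not the kernels that llm-compressor or vLLM actually use; the ±448 clipping range corresponds to the FP8 E4M3 format.

import torch

FP8_E4M3_MAX = 448.0  # largest magnitude representable in FP8 E4M3

def quantize_weight_per_channel(w: torch.Tensor):
    # Symmetric static per-channel: one scale per output channel (row of W),
    # computed once from the weights themselves.
    scale = w.abs().amax(dim=1, keepdim=True) / FP8_E4M3_MAX
    q = torch.clamp(w / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)
    return q, scale

def quantize_activation_per_token(x: torch.Tensor):
    # Symmetric dynamic per-token: one scale per token (row of X),
    # recomputed at runtime on every forward pass.
    scale = x.abs().amax(dim=-1, keepdim=True) / FP8_E4M3_MAX
    q = torch.clamp(x / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)
    return q, scale

w = torch.randn(4096, 4096)   # a linear layer's weight
x = torch.randn(8, 4096)      # activations for 8 tokens
qw, w_scale = quantize_weight_per_channel(w)
qx, x_scale = quantize_activation_per_token(x)

# Dequantizing and multiplying approximates the original full-precision matmul.
approx = (qx.float() * x_scale) @ (qw.float() * w_scale).T
print((approx - x @ w.T).abs().max())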

Deployment

Use with vLLM

  1. Initialize the vLLM server:
vllm serve RedHatAI/Voxtral-Mini-3B-2507-FP8-dynamic --tokenizer_mode mistral --config_format mistral --load_format mistral
  2. Send requests to the server according to the use case. See the following examples.
Audio Instruct
from mistral_common.protocol.instruct.messages import TextChunk, AudioChunk, UserMessage, AssistantMessage, RawAudio
from mistral_common.audio import Audio
from huggingface_hub import hf_hub_download

from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://<your-server-host>:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

obama_file = hf_hub_download("patrickvonplaten/audio_samples", "obama.mp3", repo_type="dataset")
bcn_file = hf_hub_download("patrickvonplaten/audio_samples", "bcn_weather.mp3", repo_type="dataset")

def file_to_chunk(file: str) -> AudioChunk:
    audio = Audio.from_file(file, strict=False)
    return AudioChunk.from_audio(audio)

text_chunk = TextChunk(text="Which speaker is more inspiring? Why? How are they different from each other?")
user_msg = UserMessage(content=[file_to_chunk(obama_file), file_to_chunk(bcn_file), text_chunk]).to_openai()

print(30 * "=" + "USER 1" + 30 * "=")
print(text_chunk.text)
print("\n\n")

response = client.chat.completions.create(
    model=model,
    messages=[user_msg],
    temperature=0.2,
    top_p=0.95,
)
content = response.choices[0].message.content

print(30 * "=" + "BOT 1" + 30 * "=")
print(content)
print("\n\n")
# The speaker who is more inspiring is the one who delivered the farewell address, as they express
# gratitude, optimism, and a strong commitment to the nation and its citizens. They emphasize the importance of
# self-government and active citizenship, encouraging everyone to participate in the democratic process. In contrast,
# the other speaker provides a factual update on the weather in Barcelona, which is less inspiring as it
# lacks the emotional and motivational content of the farewell address.

# **Differences:**
# - The farewell address speaker focuses on the values and responsibilities of citizenship, encouraging active participation in democracy.
# - The weather update speaker provides factual information about the temperature in Barcelona, without any emotional or motivational content.


messages = [
    user_msg,
    AssistantMessage(content=content).to_openai(),
    UserMessage(content="Ok, now please summarize the content of the first audio.").to_openai()
]
print(30 * "=" + "USER 2" + 30 * "=")
print(messages[-1]["content"])
print("\n\n")

response = client.chat.completions.create(
    model=model,
    messages=messages,
    temperature=0.2,
    top_p=0.95,
)
content = response.choices[0].message.content
print(30 * "=" + "BOT 2" + 30 * "=")
print(content)
Transcription
from mistral_common.protocol.transcription.request import TranscriptionRequest
from mistral_common.protocol.instruct.messages import RawAudio
from mistral_common.audio import Audio
from huggingface_hub import hf_hub_download

from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://<your-server-host>:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

obama_file = hf_hub_download("patrickvonplaten/audio_samples", "obama.mp3", repo_type="dataset")
audio = Audio.from_file(obama_file, strict=False)

audio = RawAudio.from_audio(audio)
req = TranscriptionRequest(model=model, audio=audio, language="en", temperature=0.0).to_openai(exclude=("top_p", "seed"))

response = client.audio.transcriptions.create(**req)
print(response)

Creation

This model was quantized using the llm-compressor library as shown below.

Creation details
import torch
from transformers import VoxtralForConditionalGeneration, AutoProcessor
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Select model and load it.
MODEL_ID = "mistralai/Voxtral-Mini-3B-2507"

model = VoxtralForConditionalGeneration.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Recipe
recipe = QuantizationModifier(
    targets="Linear", 
    scheme="FP8_DYNAMIC", 
    ignore=["language_model.lm_head", "re:audio_tower.*", "re:multi_modal_projector.*"],
)

# Apply algorithms.
oneshot(
    model=model,
    recipe=recipe,
)

SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-FP8-dynamic"
model.save_pretrained(SAVE_DIR, save_compressed=True)
processor.save_pretrained(SAVE_DIR)
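
As an optional sanity check (not part of the recipe above), the saved checkpoint can be inspected for quantization metadata; this assumes the standard config.json layout written by llm-compressor / compressed-tensors:

import json, os

SAVE_DIR = "Voxtral-Mini-3B-2507-FP8-dynamic"  # directory written by the script above

# Compressed checkpoints record their quantization scheme in config.json.
with open(os.path.join(SAVE_DIR, "config.json")) as f:
    config = json.load(f)
print(json.dumps(config.get("quantization_config", {}), indent=2))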

After quantization, the model can be converted back into the mistralai format using the convert_voxtral_hf_to_mistral.py script included with the model.

Evaluation

The model was evaluated on the Fleurs transcription task. Recovery is computed with respect to word accuracy, the complement of the word error rate (i.e., 1 - WER).

| Benchmark | Language | Voxtral-Mini-3B-2507 | Voxtral-Mini-3B-2507-FP8-dynamic (this model) | Recovery |
|---|---|---|---|---|
| Fleurs (WER) | English | 3.89% | 3.95% | 99.9% |
| Fleurs (WER) | French | 5.07% | 4.86% | 100.2% |
| Fleurs (WER) | Spanish | 3.63% | 3.55% | 100.1% |
| Fleurs (WER) | German | 5.00% | 5.01% | 100.0% |
| Fleurs (WER) | Italian | 2.54% | 2.57% | 100.0% |
| Fleurs (WER) | Portuguese | 3.85% | 4.03% | 99.8% |
| Fleurs (WER) | Dutch | 7.01% | 7.20% | 99.8% |
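
As a worked check of how the recovery column is computed (using the English row from the table above):

# Recovery = word accuracy of the quantized model / word accuracy of the baseline,
# where word accuracy = 1 - WER.
wer_baseline = 0.0389   # Voxtral-Mini-3B-2507, English Fleurs WER
wer_quantized = 0.0395  # this model, English Fleurs WER

recovery = (1 - wer_quantized) / (1 - wer_baseline)
print(f"{recovery:.1%}")  # ~99.9%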