InternVL3.5 38B FP8

This is an FP8 dynamically quantized (W8A8) version of OpenGVLab/InternVL3_5-38B, optimized for high-performance inference with vLLM.

The quantization process uses a specialized recipe that preserves the model's core visual understanding capabilities while reducing the memory footprint by nearly 40%.

Just Run It (vLLM serve)

You can serve the model using vLLM's OpenAI-compatible API server.

vllm serve brandonbeiler/InternVL3_5-38B-FP8-Dynamic \
    --quantization compressed-tensors \
    --served-model-name internvl3_5-38b \
    --reasoning-parser qwen3 \
    --trust-remote-code \
    --max-model-len 32768 \
    --tensor-parallel-size 1 # Adjust based on your GPU setup
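
Once the server is running, you can query it with any OpenAI-compatible client. Below is a minimal sketch using the openai Python package; the base URL, API key, and image URL are placeholders to adjust for your deployment:

from openai import OpenAI

# Point the client at the local vLLM server (placeholder address).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="internvl3_5-38b",  # matches --served-model-name above
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
            ],
        }
    ],
    temperature=0.6,
)
print(response.choices[0].message.content)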

Notes

  • 32k max context length
  • Reasoning parser ready to go; thinking mode requires a system prompt
  • Tool calling is still under investigation

Key Features

  • Calibration-Free FP8: Dynamic W8A8 quantization. Weights are pre-quantized, and activations are quantized on the fly (see the sketch after this list).
  • Vision-Language Optimized: The vision tower, embeddings, and the first MLP layer are preserved in full precision to maintain high performance on vision-language tasks.
  • vLLM Ready: Designed for seamless integration with vLLM for high-throughput serving.
  • Memory Efficient: ~40% memory reduction compared to the original FP16 model.
  • Performance Boost: Accelerated inference on FP8-compatible hardware (e.g., NVIDIA H100, L40S).
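
As a concrete illustration of what "dynamic" means here, this is a minimal PyTorch sketch of per-tensor FP8 E4M3 quantization with a runtime-computed scale (illustrative only, not the LLM Compressor implementation):

import torch

def fp8_dynamic_quantize(x: torch.Tensor):
    """Quantize a tensor to FP8 E4M3 with a scale computed on the fly."""
    finfo = torch.finfo(torch.float8_e4m3fn)
    # Map the largest magnitude in the tensor to FP8's max representable value (448).
    scale = x.abs().max().clamp(min=1e-12) / finfo.max
    x_fp8 = (x / scale).clamp(finfo.min, finfo.max).to(torch.float8_e4m3fn)
    return x_fp8, scale

# Dequantize with: x_fp8.to(torch.float32) * scale
# Weights are quantized once offline; activations pass through this at every step.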

Model Details

| Attribute | Value |
|---|---|
| Original Model | OpenGVLab/InternVL3_5-38B |
| Quantized Model | brandonbeiler/InternVL3_5-38B-FP8-Dynamic |
| Quantization Method | FP8 Dynamic (W8A8) |
| Quantization Library | LLM Compressor v0.7.1 |
| Model Size | 38.4B params |
| Tensor Types | BF16, F8_E4M3 |
| Quantized By | brandonbeiler |

Usage with vLLM in Python

The following snippet demonstrates inference using the vLLM library.

from vllm import LLM, SamplingParams

# Load the quantized model.
# trust_remote_code is required to load the custom InternVL architecture.
model = LLM(
    model="brandonbeiler/InternVL3_5-38B-FP8-Dynamic",
    trust_remote_code=True,
    max_model_len=32768,     # InternVL3.5 supports a 32k context length
    tensor_parallel_size=1,  # Adjust for your hardware setup
)

# Set sampling parameters.
# A temperature of 0.6 is recommended for this model.
sampling_params = SamplingParams(temperature=0.6, max_tokens=512)

# Generate a response.
# Note: replace "<image>" with your image input (see the multimodal example below).
prompt = "Describe this image: <image>"
response = model.generate(prompt, sampling_params)

print(response[0].outputs[0].text)
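
To pass an actual image, vLLM accepts multimodal inputs as a dict alongside the prompt. Continuing from the snippet above, here is a minimal sketch assuming a local file example.jpg (the path is a placeholder):

from PIL import Image

image = Image.open("example.jpg")  # placeholder path
outputs = model.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    sampling_params,
)
print(outputs[0].outputs[0].text)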

Technical Specifications

Hardware Requirements

  • Base VRAM: ~47GB for model weights (a rough breakdown follows this list)
  • Context VRAM:
    • + ~1.3GB for 10k token context
    • + ~2GB for 32k token context with FP8 KV cache
  • Recommended GPUs: NVIDIA H100, L40S
  • Supported GPUs: NVIDIA A100 (80GB), 2x RTX 4090 (with tensor parallelism), latest AMD GPUs.
  • Optimal Performance: NVIDIA GPUs with Compute Capability >= 9.0 (Hopper, Blackwell).
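
A rough back-of-envelope check on the base VRAM figure, assuming the model splits into a ~32B-parameter language model (FP8, 1 byte/param) and a ~6B-parameter vision tower kept in BF16 (2 bytes/param). The component sizes here are assumptions, not from the card:

llm_gb = 32e9 * 1 / 1e9     # ~32B LLM params at 1 byte (FP8)        -> 32 GB
vision_gb = 6e9 * 2 / 1e9   # ~6B preserved params at 2 bytes (BF16) -> 12 GB
print(llm_gb + vision_gb)   # ~44 GB of weights; runtime buffers bring it to ~47 GB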

Quantization Details

  • Weights: FP8 E4M3 with per-tensor scales.
  • Activations: Dynamically quantized to FP8 E4M3 with per-tensor scales.
  • Preserved Modules (Full Precision): Vision tower, embeddings, and the first MLP layer (mlp1). A sample recipe sketch follows this list.
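
For reference, a recipe along these lines would reproduce the setup described above. This is a sketch assuming LLM Compressor's FP8_DYNAMIC scheme; the ignore patterns are illustrative guesses at the preserved module names, not the exact recipe used:

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# FP8 dynamic scheme: static FP8 weights, dynamic FP8 activations.
# The ignore patterns below are illustrative guesses at the preserved modules.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["re:.*vision_model.*", "re:.*mlp1.*", "lm_head"],
)

oneshot(
    model="OpenGVLab/InternVL3_5-38B",
    recipe=recipe,
    output_dir="InternVL3_5-38B-FP8-Dynamic",
)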

Package Versions

This model was quantized using the following environment:

llmcompressor==0.7.1
compressed-tensors==0.10.2
transformers==4.55.0
torch==2.7.1
vllm==0.10.1.1

Quantized with ❤️ using LLM Compressor for the open-source community.
