InternVL3.5 38B FP8

This is an FP8 dynamically quantized (W8A8) version of OpenGVLab/InternVL3_5-38B, optimized for high-performance inference with vLLM.

The quantization process uses a specialized recipe that preserves the model's core visual understanding capabilities while reducing the memory footprint by nearly 40%.

Just Run It (vLLM serve)

You can serve the model using vLLM's OpenAI-compatible API server.

vllm serve brandonbeiler/InternVL3_5-38B-FP8-Dynamic \
    --quantization compressed-tensors \
    --served-model-name internvl3_5-38b \
    --reasoning-parser qwen3 \
    --trust-remote-code \
    --max-model-len 32768 \
    --tensor-parallel-size 1 # Adjust based on your GPU setup
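
Once the server is running, you can query it with any OpenAI-compatible client. Below is a minimal sketch using the openai Python package; the base URL, API key, and image URL are placeholders to adjust for your deployment:

from openai import OpenAI

# Point the client at the local vLLM server (placeholder address).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="internvl3_5-38b",  # matches --served-model-name above
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
            ],
        }
    ],
    temperature=0.6,
)
print(response.choices[0].message.content)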

Notes

  • 32k max context length
  • Reasoning parser ready to go; thinking mode requires a system prompt
  • Tool calling is still under investigation

Key Features

  • Calibration-Free FP8: Dynamic W8A8 quantization. Weights are pre-quantized, and activations are quantized on the fly (see the sketch after this list).
  • Vision-Language Optimized: The vision tower, embeddings, and the first MLP layer are preserved in full precision to maintain high performance on vision-language tasks.
  • vLLM Ready: Designed for seamless integration with vLLM for high-throughput serving.
  • Memory Efficient: ~40% memory reduction compared to the original FP16 model.
  • Performance Boost: Accelerated inference on FP8-compatible hardware (e.g., NVIDIA H100, L40S).
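
As a concrete illustration of what "dynamic" means here, this is a minimal PyTorch sketch of per-tensor FP8 E4M3 quantization with a runtime-computed scale (illustrative only, not the LLM Compressor implementation):

import torch

def fp8_dynamic_quantize(x: torch.Tensor):
    """Quantize a tensor to FP8 E4M3 with a scale computed on the fly."""
    finfo = torch.finfo(torch.float8_e4m3fn)
    # Map the largest magnitude in the tensor to FP8's max representable value (448).
    scale = x.abs().max().clamp(min=1e-12) / finfo.max
    x_fp8 = (x / scale).clamp(finfo.min, finfo.max).to(torch.float8_e4m3fn)
    return x_fp8, scale

# Dequantize with: x_fp8.to(torch.float32) * scale
# Weights are quantized once offline; activations pass through this at every step.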

Model Details

| Attribute | Value |
|---|---|
| Original Model | OpenGVLab/InternVL3_5-38B |
| Quantized Model | brandonbeiler/InternVL3_5-38B-FP8-Dynamic |
| Quantization Method | FP8 Dynamic (W8A8) |
| Quantization Library | LLM Compressor v0.7.1 |
| Model Size | 38.4B params |
| Tensor Types | BF16, F8_E4M3 |
| Quantized By | brandonbeiler |

Usage with vLLM in Python

The following snippet demonstrates inference using the vLLM library.

from vllm import LLM, SamplingParams

# Load the quantized model.
# trust_remote_code is required to load the custom InternVL architecture.
model = LLM(
    model="brandonbeiler/InternVL3_5-38B-FP8-Dynamic",
    trust_remote_code=True,
    max_model_len=32768,     # InternVL3.5 supports a 32k context length
    tensor_parallel_size=1,  # Adjust for your hardware setup
)

# Set sampling parameters.
# A temperature of 0.6 is recommended for this model.
sampling_params = SamplingParams(temperature=0.6, max_tokens=512)

# Generate a response.
# Note: replace "<image>" with your image input (see the multimodal example below).
prompt = "Describe this image: <image>"
response = model.generate(prompt, sampling_params)

print(response[0].outputs[0].text)
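
To pass an actual image, vLLM accepts multimodal inputs as a dict alongside the prompt. Continuing from the snippet above, here is a minimal sketch assuming a local file example.jpg (the path is a placeholder):

from PIL import Image

image = Image.open("example.jpg")  # placeholder path
outputs = model.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    sampling_params,
)
print(outputs[0].outputs[0].text)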

Technical Specifications

Hardware Requirements

  • Base VRAM: ~47GB for model weights (a rough breakdown follows this list)
  • Context VRAM:
    • + ~1.3GB for 10k token context
    • + ~2GB for 32k token context with FP8 KV cache
  • Recommended GPUs: NVIDIA H100, L40S
  • Supported GPUs: NVIDIA A100 (80GB), 2x RTX 4090 (with tensor parallelism), latest AMD GPUs.
  • Optimal Performance: NVIDIA GPUs with Compute Capability >= 9.0 (Hopper, Blackwell).
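
A rough back-of-envelope check on the base VRAM figure, assuming the model splits into a ~32B-parameter language model (FP8, 1 byte/param) and a ~6B-parameter vision tower kept in BF16 (2 bytes/param). The component sizes here are assumptions, not from the card:

llm_gb = 32e9 * 1 / 1e9     # ~32B LLM params at 1 byte (FP8)        -> 32 GB
vision_gb = 6e9 * 2 / 1e9   # ~6B preserved params at 2 bytes (BF16) -> 12 GB
print(llm_gb + vision_gb)   # ~44 GB of weights; runtime buffers bring it to ~47 GB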

Quantization Details

  • Weights: FP8 E4M3 with per-tensor scales.
  • Activations: Dynamically quantized to FP8 E4M3 with per-tensor scales.
  • Preserved Modules (Full Precision): Vision tower, embeddings, and the first MLP layer (mlp1). A sample recipe sketch follows this list.
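
For reference, a recipe along these lines would reproduce the setup described above. This is a sketch assuming LLM Compressor's FP8_DYNAMIC scheme; the ignore patterns are illustrative guesses at the preserved module names, not the exact recipe used:

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# FP8 dynamic scheme: static FP8 weights, dynamic FP8 activations.
# The ignore patterns below are illustrative guesses at the preserved modules.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["re:.*vision_model.*", "re:.*mlp1.*", "lm_head"],
)

oneshot(
    model="OpenGVLab/InternVL3_5-38B",
    recipe=recipe,
    output_dir="InternVL3_5-38B-FP8-Dynamic",
)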

Package Versions

This model was quantized using the following environment:

llmcompressor==0.7.1
compressed-tensors==0.10.2
transformers==4.55.0
torch==2.7.1
vllm==0.10.1.1

Quantized with ❤️ using LLM Compressor for the open-source community.
