🔥 InternVL3_5-GPT-OSS-20B-A4B-Preview-FP8-Dynamic 🔥

This is an FP8-dynamic (w8a8) version of OpenGVLab/InternVL3_5-GPT-OSS-20B-A4B-Preview, optimized for high-performance inference and deployment with vLLM. FP8 dynamic quantization requires no calibration data, so the checkpoint is ready to serve as-is.

Just Run It (vLLM serve)

You can serve the model using vLLM's OpenAI-compatible API server.

Warning: this model uses GPT-OSS as the base language model and currently has some issues running in vLLM. Still digging in.

```bash
vllm serve brandonbeiler/InternVL3_5-GPT-OSS-20B-A4B-Preview-FP8-Dynamic \
    --quantization compressed-tensors \
    --served-model-name internvl3_5-gpt-oss-20b \
    --reasoning-parser qwen3 \
    --trust-remote-code \
    --max-model-len 32768 \
    --tensor-parallel-size 1  # adjust based on your GPU setup
```
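
Once the server is up, you can query it through the OpenAI-compatible API. A minimal sketch, assuming the server runs on localhost:8000 with the served model name above; the image URL is a placeholder:

```python
# Minimal chat-completion sketch against the vLLM OpenAI-compatible server.
# Assumes the serve command above is running on localhost:8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="internvl3_5-gpt-oss-20b",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                # Placeholder URL; swap in your own image.
                {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
            ],
        }
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)
```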

Notes

  • 32k maximum context length
  • Reasoning parser is ready to go; thinking mode requires a system prompt (see the sketch below)
  • Tool calling is still under investigation (please comment if you have found a solution)
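
A minimal sketch of running in thinking mode. The system prompt wording here is an assumption, not the official InternVL3.5 thinking prompt; with `--reasoning-parser qwen3`, vLLM returns the thinking trace in a separate `reasoning_content` field alongside the final answer:

```python
# Thinking-mode sketch: the system prompt content below is an assumption.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="internvl3_5-gpt-oss-20b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant. Think step by step inside <think> tags before answering."},
        {"role": "user", "content": "How many prime numbers are there below 30?"},
    ],
)
msg = response.choices[0].message
# reasoning_content is populated by the reasoning parser when thinking mode is active.
print("reasoning:", getattr(msg, "reasoning_content", None))
print("answer:", msg.content)
```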

🚀 Key Features

  • FP8 Dynamic Quantization: No calibration required, ready to use immediately
  • Vision-Language Optimized: Specialized quantization recipe that preserves visual understanding
  • vLLM Ready: Seamless integration with vLLM for production deployment
  • Memory Efficient: ~50% memory reduction compared to FP16 original
  • Performance Boost: Significantly faster inference on H100/L40S GPUs

📊 Model Details

  • Base model: OpenGVLab/InternVL3_5-GPT-OSS-20B-A4B-Preview
  • Model size: 21.2B params
  • Tensor types: BF16 (preserved components) and F8_E4M3 (quantized layers)

🏗️ Technical Specifications

Hardware Requirements

  • Inference: roughly 22 GB VRAM for weights (21.2B params, mostly 1 byte each in FP8, with preserved components in BF16), plus VRAM for context/KV cache
  • Supported GPUs: H100, L40S, A100 (80GB), RTX 4090 (2x for tensor parallelism)
  • GPU Architecture: NVIDIA Ada Lovelace, Hopper, or newer, plus recent AMD GPUs; NVIDIA compute capability >= 9.0 (Hopper, Blackwell) is recommended (see the snippet below)
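
A quick way to check your GPU's compute capability, assuming a CUDA build of PyTorch is installed:

```python
# Probe the local GPU for native FP8 support.
import torch

major, minor = torch.cuda.get_device_capability()
print(f"compute capability: {major}.{minor}")
# >= (8, 9) means Ada Lovelace or newer (native FP8 tensor cores);
# >= (9, 0) means Hopper/Blackwell, the recommended targets.
print("native FP8 support:", (major, minor) >= (8, 9))
```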

Quantization Details

  • Weights: FP8 E4M3 with static per-channel scales
  • Activations: FP8 E4M3 with dynamic per-token scales
  • Preserved Components (kept in BF16): vision tower, embeddings, mlp1
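
For reference, a minimal sketch of how an FP8-dynamic checkpoint like this is typically produced with llmcompressor 0.7.x. The loading class and the ignore patterns are assumptions inferred from the preserved components listed above; the exact module names depend on the model definition:

```python
# Sketch: data-free FP8-dynamic (w8a8) quantization with llmcompressor.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from transformers import AutoModel, AutoTokenizer

model_id = "OpenGVLab/InternVL3_5-GPT-OSS-20B-A4B-Preview"
# Assumption: InternVL checkpoints load via AutoModel with trust_remote_code.
model = AutoModel.from_pretrained(model_id, torch_dtype="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# FP8_DYNAMIC: static per-channel FP8 weight scales, dynamic per-token
# activation scales -- no calibration dataset needed.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    # Assumed regex patterns for the preserved components (vision tower,
    # embeddings, mlp1) plus the LM head.
    ignore=["re:.*vision_model.*", "re:.*mlp1.*", "lm_head"],
)

oneshot(model=model, recipe=recipe)

save_dir = "InternVL3_5-GPT-OSS-20B-A4B-Preview-FP8-Dynamic"
model.save_pretrained(save_dir, save_compressed=True)
tokenizer.save_pretrained(save_dir)
```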

🔬 Package Versions

This model was created using:

```
llmcompressor==0.7.1
compressed-tensors==latest
transformers==4.55.0
torch==2.7.1
vllm==0.10.1.1
```

Quantized with ❤️ using LLM Compressor for the open-source community
