InternVL3.5 FP8
OpenGVLab's InternVL3.5 models quantized to FP8
This is an FP8 dynamic (W8A8) quantized version of OpenGVLab/InternVL3_5-GPT-OSS-20B-A4B-Preview, optimized for high-performance inference with vLLM. Weights are stored in FP8, and activations are quantized to FP8 dynamically at runtime using per-tensor scales, so no calibration dataset is required.
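To illustrate the "dynamic" half of W8A8, here is a minimal pure-Python sketch of runtime activation scaling into the FP8 E4M3 range. It only shows the scale computation and clamping step (real kernels additionally round each value to the nearest representable FP8 number and use hardware FP8 types); all names here are illustrative, not part of any library.

```python
# Sketch of dynamic FP8 (E4M3) activation scaling: the scale is derived
# from the tensor's runtime max, not from an offline calibration pass.
FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in E4M3

def quantize_dynamic(xs):
    """Scale and clamp values into the FP8 range with a runtime scale."""
    amax = max(abs(x) for x in xs)
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    q = [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x / scale)) for x in xs]
    return q, scale

def dequantize(q, scale):
    """Recover approximate original values from scaled representation."""
    return [v * scale for v in q]

acts = [0.5, -3.2, 1.7, 896.0]  # hypothetical activation values
q, s = quantize_dynamic(acts)
recon = dequantize(q, s)
```

Because weight scales are fixed at quantization time while activation scales follow the data, this scheme keeps accuracy reasonable without calibration data.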
You can serve the model using vLLM's OpenAI-compatible API server.
Warning: this model uses GPT-OSS as the base language model and currently appears to have issues running in vLLM; investigation is ongoing.
vllm serve brandonbeiler/InternVL3_5-GPT-OSS-20B-A4B-Preview-FP8-Dynamic \
--quantization compressed-tensors \
--served-model-name internvl3_5-gpt-oss-20b \
--reasoning-parser qwen3 \
--trust-remote-code \
--max-model-len 32768 \
--tensor-parallel-size 1 # Adjust based on your GPU setup
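Once the server above is running, it exposes the standard OpenAI-compatible chat completions API, and InternVL accepts image inputs via image_url content parts. A minimal sketch of building such a request follows; the base URL, image URL, and helper name are hypothetical, and the model name matches the --served-model-name flag above.

```python
import json

# Hypothetical endpoint details for the vLLM server started above.
BASE_URL = "http://localhost:8000/v1"
MODEL = "internvl3_5-gpt-oss-20b"  # matches --served-model-name

def build_chat_request(prompt, image_url=None, max_tokens=256):
    """Build an OpenAI-style chat.completions payload, optionally with an image."""
    content = [{"type": "text", "text": prompt}]
    if image_url:
        content.append({"type": "image_url", "image_url": {"url": image_url}})
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
    }

payload = build_chat_request("Describe this image.",
                             "https://example.com/cat.jpg")
body = json.dumps(payload)
# POST `body` to f"{BASE_URL}/chat/completions" with the HTTP client of
# your choice, or point the official openai client at BASE_URL.
```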
Notes
This model was created using:
llmcompressor==0.7.1
compressed-tensors==latest
transformers==4.55.0
torch==2.7.1
vllm==0.10.1.1
Quantized with ❤️ using LLM Compressor for the open-source community