Model Overview

Model Architecture: DeepSeek-R1-0528
- Input: Text
- Output: Text
Supported Hardware Microarchitecture: AMD MI350/MI355
ROCm: 7.0
PyTorch: 2.8.0
Transformers: 4.53.0
Operating System(s): Linux
Inference Engine: SGLang/vLLM
Model Optimizer: AMD-Quark (V0.10)
- Weight quantization: OCP MXFP4, Static
- Activation quantization: OCP MXFP4, Dynamic
Calibration Dataset: Pile

This model was built from deepseek-ai/DeepSeek-R1-0528 by applying AMD-Quark for MXFP4 quantization.

Model Quantization

The model was quantized from deepseek-ai/DeepSeek-R1-0528 using AMD-Quark. Both weights and activations were quantized to MXFP4 format.
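As a rough illustration of what MXFP4 means for storage: in the OCP Microscaling format, each value is stored as a 4-bit float and each block of values shares one 8-bit scale. The sketch below computes the effective bits per value, assuming the 32-element block size that corresponds to `--group_size 32` in the quantization command in this section. The function name `bits_per_value` is illustrative only.

```shell
# Effective storage cost per quantized value: element bits plus the shared
# scale amortized over the block.
# $1 = element bits, $2 = shared scale bits, $3 = block size
bits_per_value() {
    awk -v e="$1" -v s="$2" -v b="$3" 'BEGIN { printf "%.2f\n", e + s / b }'
}

bits_per_value 4 8 32   # MXFP4: 4-bit elements, one 8-bit scale per 32 values -> 4.25
```

Relative to 16-bit BFloat16 this is roughly a 3.8x reduction for the quantized layers. Layers listed in `exclude_layers` are not quantized and do not shrink.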

Preprocessing requirement:

Before executing the quantization script below, the original FP8 model must first be dequantized to BFloat16. You can either perform the dequantization manually using this conversion script, or use the pre-converted BFloat16 model available at unsloth/DeepSeek-R1-0528-BF16.
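Since pointing the script at the original FP8 checkpoint is an easy mistake, a small pre-flight check may help. The sketch below assumes that the FP8 checkpoint's config.json declares `"quant_method": "fp8"` and that a BFloat16 conversion no longer carries that entry. Both are assumptions about the checkpoint layout, and the helper names `is_fp8_checkpoint` and `check_model_dir` are hypothetical.

```shell
# Succeeds if config.json in directory $1 still declares FP8 quantization.
# A missing config.json is treated as "no FP8 marker".
is_fp8_checkpoint() {
    grep -q '"quant_method"[[:space:]]*:[[:space:]]*"fp8"' "$1/config.json" 2>/dev/null
}

# Prints a short verdict and returns non-zero for an FP8 checkpoint.
check_model_dir() {
    if is_fp8_checkpoint "$1"; then
        echo "FP8 checkpoint: dequantize to BFloat16 first"
        return 1
    fi
    echo "no FP8 marker found"
}
```

A possible use is `check_model_dir "$MODEL_DIR" || exit 1` just before calling the quantization script.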

Quantization scripts:

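cd Quark/examples/torch/language_modeling/llm_ptq/
# MODEL_DIR must point to the BFloat16 checkpoint (see the preprocessing requirement above).
# Layers matching these patterns are excluded from quantization.
exclude_layers="*self_attn* *mlp.gate.* *lm_head"
# $exclude_layers is left unquoted so that each pattern is passed as a separate argument.
python3 quantize_quark.py --model_dir "$MODEL_DIR" \
                          --quant_scheme w_mxfp4_a_mxfp4 \
                          --group_size 32 \
                          --num_calib_data 128 \
                          --exclude_layers $exclude_layers \
                          --skip_evaluation \
                          --multi_gpu \
                          --model_export hf_format \
                          --output_dir amd/DeepSeek-R1-0528-MXFP4-Preview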

Deployment

This model can be deployed efficiently using the SGLang and vLLM backends.
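As a starting point, the helper below prints a launch command for either backend so that it can be reviewed before running. This is a sketch, not a tested recipe. `vllm serve` and `python3 -m sglang.launch_server` are the usual entry points of the two engines, the tensor-parallel size of 8 is borrowed from the evaluation settings in the Reproduction section, and `launch_cmd` is a hypothetical name. Whether a given vLLM or SGLang build supports MXFP4 on your hardware should be checked against that build.

```shell
MODEL="amd/DeepSeek-R1-0528-MXFP4-Preview"
TP=8   # assumption: same tensor-parallel degree as the evaluation run

# Print the launch command for the chosen backend ($1 = vllm | sglang).
launch_cmd() {
    case "$1" in
        vllm)   echo "vllm serve $MODEL --tensor-parallel-size $TP" ;;
        sglang) echo "python3 -m sglang.launch_server --model-path $MODEL --tp $TP" ;;
        *)      echo "unknown backend: $1" >&2; return 1 ;;
    esac
}

launch_cmd vllm
launch_cmd sglang
```

Once the printed command looks right for your environment, it can be run directly in the serving container.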

Evaluation

The model was evaluated on AIME24, GPQA Diamond, and MATH-500 benchmarks using the lighteval framework. Each benchmark was run 10 times with different random seeds for reliable performance estimation.
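The text does not say how the 10 runs are combined. A natural choice, assumed here, is to report the mean and look at the spread across seeds. The sketch below computes the mean and sample standard deviation from one score per line. The three input scores are placeholders for illustration, not measured results, and `mean_std` is a hypothetical helper.

```shell
# Mean and sample standard deviation of the scores read from stdin (one per line).
# Fails if fewer than two scores are given.
mean_std() {
    awk '{ n++; s += $1; ss += $1 * $1 }
         END {
             if (n < 2) exit 1
             m = s / n
             printf "mean=%.2f std=%.2f n=%d\n", m, sqrt((ss - n * m * m) / (n - 1)), n
         }'
}

# Placeholder scores, not real benchmark output.
printf '%s\n' 80 85 90 | mean_std   # -> mean=85.00 std=5.00 n=3
```

With real runs, the per-seed scores would be extracted from each seed's results and piped into the helper in the same way.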

Accuracy

Benchmark	DeepSeek-R1-0528	DeepSeek-R1-0528-MXFP4-Preview (this model)	Recovery
AIME24	88.00	85.00	96.59%
GPQA Diamond	79.90	79.34	99.31%
MATH-500	97.06	97.84	100.80%
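The Recovery column is consistent with the ratio of the quantized score to the baseline score, expressed as a percentage. The sketch below recomputes it from the table under that assumption; `recovery` is an illustrative helper name. Recomputing from the rounded table entries can differ from the listed value in the last digit (GPQA Diamond gives 99.30 here against 99.31 in the table), presumably because the table was derived from unrounded scores.

```shell
# Recovery (%) = quantized score / baseline score * 100
# $1 = quantized score, $2 = baseline score
recovery() {
    awk -v q="$1" -v b="$2" 'BEGIN { printf "%.2f\n", 100 * q / b }'
}

recovery 85.00 88.00   # AIME24       -> 96.59
recovery 79.34 79.90   # GPQA Diamond -> 99.30
recovery 97.84 97.06   # MATH-500     -> 100.80
```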

Reproduction

The AIME24, MATH-500, and GPQA Diamond results were obtained using a forked lighteval and the vLLM Docker image rocm/vllm-private:pytorch-vllm-gfx950-mxfp4-mxfp6-v3 (QDQ emulation).

# Set docker env
export VLLM_QUARK_F4F6_OFFLINE_DEQUANT_TMPENVVAR=1

# Set model args
OUTPUT_DIR="results/DeepSeek-R1-0528-MXFP4-Preview-Seed"
LOG="logs/deepseek_0528_mxfp4.log"

# Evaluate 10 rounds, each with a fresh random seed
mkdir -p "$(dirname "$LOG")"   # tee fails if the log directory does not exist
for i in $(seq 1 10); do
    # seed in [0, 2**30 - 1]
    SEED=$(shuf -i 0-1073741823 -n 1)
    MODEL_ARGS="model_name=amd/DeepSeek-R1-0528-MXFP4-Preview,dtype=bfloat16,tensor_parallel_size=8,max_model_length=71536,max_num_batched_tokens=32768,gpu_memory_utilization=0.85,generation_parameters={max_new_tokens:65536,temperature:0.6,top_p:0.95,seed:$SEED}"

    lighteval vllm "$MODEL_ARGS" "custom|aime24_single|0|0,custom|math_500_single|0|0,custom|gpqa:diamond_single|0|0" \
        --use-chat-template \
        --output-dir "$OUTPUT_DIR/seed_$SEED" \
        2>&1 | tee -a "$LOG"
done