---
base_model:
  - meta-llama/Llama-3.2-90B-Vision-Instruct
license: llama3.2
---

# Llama-3.2-90B-Vision-Instruct-FP8-KV

## Introduction

This model was created by applying Quark with calibration samples from the Pile dataset.
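For reference, the sketch below shows one way such a calibration set could be assembled. It is illustrative only: the specific Pile mirror (`NeelNanda/pile-10k`), the sequence length, and the tokenization details are assumptions, not the exact settings used by the quantization script; only the sample count (128) matches the `--num_calib_data` value used later in this card.

```python
# Hypothetical calibration-data preparation; dataset mirror and sequence length
# are illustrative assumptions, not Quark's actual defaults.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-90B-Vision-Instruct")
pile = load_dataset("NeelNanda/pile-10k", split="train")  # small public Pile subset

calib_batches = [
    tokenizer(sample["text"], return_tensors="pt",
              truncation=True, max_length=512).input_ids
    for sample in pile.select(range(128))  # 128 samples, matching --num_calib_data
]
```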
## Quantization Strategy

- **Weight**: FP8 symmetric per-tensor
- **Activation**: FP8 symmetric per-tensor
- **KV Cache**: FP8 symmetric per-tensor
- **Note**: Llama-3.2-90B-Vision-Instruct consists of two parts: the language model (`MllamaForCausalLM`) and the vision model (`MllamaVisionModel`). Only the `MllamaForCausalLM` part is quantized here; a minimal sketch of the per-tensor scheme follows this list.
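To make "symmetric per-tensor" concrete, here is a minimal PyTorch sketch of FP8 quantize/dequantize with a single scale per tensor. It assumes the OCP E4M3 format (maximum finite value 448) and is meant to illustrate the idea, not to reproduce Quark's internal code path.

```python
import torch

FP8_E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn

def fp8_per_tensor(x: torch.Tensor):
    """Symmetric per-tensor FP8: one scale shared by every element of the tensor."""
    scale = x.abs().max().clamp(min=1e-12) / FP8_E4M3_MAX
    q = (x / scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

x = torch.randn(4096, 4096)
q, scale = fp8_per_tensor(x)
print("scale:", scale.item(), "max abs error:", (dequantize(q, scale) - x).abs().max().item())
```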
## Quick Start

1. Download and install Quark.
2. Run the quantization script in the example folder using the following command line:
```bash
export MODEL_DIR="[local model checkpoint folder]"  # or meta-llama/Llama-3.2-90B-Vision-Instruct

# single GPU
python3 quantize_quark.py \
        --model_dir $MODEL_DIR \
        --output_dir Llama-3.2-90B-Vision-Instruct-FP8-KV \
        --quant_scheme w_fp8_a_fp8 \
        --kv_cache_dtype fp8 \
        --num_calib_data 128

# If the model is too large for a single GPU, use multiple GPUs instead.
python3 quantize_quark.py \
        --model_dir $MODEL_DIR \
        --output_dir Llama-3.2-90B-Vision-Instruct-FP8-KV \
        --quant_scheme w_fp8_a_fp8 \
        --kv_cache_dtype fp8 \
        --num_calib_data 128
```
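Once the script finishes, the output folder can be inspected to confirm that FP8 weights and per-tensor scales were written. The sketch below is a rough check only: the exact file layout and tensor names produced by Quark's export are assumptions here, not documented behavior.

```python
# Hypothetical inspection of the exported checkpoint; file layout and
# "scale" tensor naming are assumed for illustration.
from pathlib import Path
from safetensors import safe_open

out_dir = Path("Llama-3.2-90B-Vision-Instruct-FP8-KV")
for shard in sorted(out_dir.glob("*.safetensors")):
    with safe_open(shard, framework="pt") as f:
        for name in f.keys():
            if "scale" in name:  # per-tensor weight / activation / KV-cache scales
                print(shard.name, name)
```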
## Evaluation

| Benchmark | Llama-3.2-90B-Vision-Instruct | Llama-3.2-90B-Vision-Instruct-FP8-KV (this model) |
| --- | --- | --- |
| Perplexity-wikitext2 | 3.7805 | 3.8570 |
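The FP8-KV model's wikitext2 perplexity is about 0.08 points (roughly 2%) higher than the unquantized baseline.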