metadata

language:
  - ca
  - hr
  - da
  - nl
  - en
  - fi
  - fr
  - de
  - he
  - hu
  - is
  - id
  - it
  - ja
  - ko
  - ms
  - 'no'
  - pl
  - pt
  - ro
  - ru
  - sr
  - zh
  - sk
  - sl
  - es
  - sv
  - th
  - tr
  - uk
  - vi
base_model:
  - google/gemma-3n-E4B-it
pipeline_tag: text-generation
tags:
  - gemma
  - gemma3
  - gemma3n
  - fp8
  - quantized
  - multimodal
  - conversational
  - text-generation-inference
  - automatic-speech-recognition
  - automatic-speech-translation
  - audio-text-to-text
  - video-text-to-text
license: gemma
license_name: gemma
name: RedHatAI/gemma-3n-E4B-it-FP8-dynamic
description: >-
  This model was obtained by quantizing the weights and activations of
  google/gemma-3n-E4B-it to FP8 data type.
readme: https://huggingface.co/RedHatAI/gemma-3n-E4B-it-FP8-dynamic/main/README.md
tasks:
  - text-to-text
  - image-to-text
  - video-to-text
  - audio-to-text
provider: Google
license_link: https://ai.google.dev/gemma/terms
validated_on:
  - RHOAI 2.24
  - RHAIIS 3.2.1

gemma-3n-E4B-it-FP8-Dynamic

Model Overview

Model Architecture: gemma-3n-E4B-it
- Input: Audio-Vision-Text
- Output: Text
Model Optimizations:
- Weight quantization: FP8
- Activation quantization: FP8
Release Date: 08/01/2025
Version: 1.0
Validated on: RHOAI 2.24, RHAIIS 3.2.1
Model Developers: RedHatAI

Quantized version of google/gemma-3n-E4B-it.

Model Optimizations

This model was obtained by quantizing the weights and activations of google/gemma-3n-E4B-it to the FP8 data type. It is ready for inference with vLLM >= 0.10.0.
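To make the optimization more concrete: the recipe in the Creation section uses the FP8_DYNAMIC scheme, which is generally understood to keep precomputed scales for the weights and to derive activation scales at inference time from the values themselves. The sketch below simulates only that dynamic step in plain Python. It assumes the E4M3 variant of FP8 (largest finite value 448) and a simplified rounding rule; the helper names are illustrative and are not part of vLLM or llm-compressor.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite magnitude in FP8 E4M3 (assumed format)


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest value on a simplified E4M3 grid (3 mantissa bits)."""
    if x == 0.0:
        return 0.0
    x = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x))  # saturate instead of overflowing
    _, exponent = math.frexp(x)    # x = m * 2**exponent with 0.5 <= |m| < 1
    exponent = max(exponent, -5)   # below 2**-6 the grid spacing stops shrinking
    step = 2.0 ** (exponent - 4)   # 4 significant bits: 1 implicit + 3 stored
    return round(x / step) * step


def quantize_dynamic(values):
    """Pick the scale from the data at call time, then round onto the FP8 grid."""
    amax = max(abs(v) for v in values)
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    return [round_to_e4m3(v / scale) for v in values], scale


def dequantize(quantized, scale):
    """Map FP8 grid values back to the original range."""
    return [q * scale for q in quantized]


activations = [0.1, -2.5, 7.0, 0.03]  # made-up values for one token
quantized, scale = quantize_dynamic(activations)
restored = dequantize(quantized, scale)
for original, approx in zip(activations, restored):
    print(f"{original:+.4f} -> {approx:+.4f}")
```

The largest magnitude maps onto 448 and is recovered exactly; the remaining values come back within a few percent of the originals, which is the precision trade-off FP8 makes in exchange for smaller tensors.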

Deployment

Use with vLLM

This model can be deployed efficiently using the vLLM backend, as shown in the example below.

from vllm.assets.image import ImageAsset
from vllm import LLM, SamplingParams

# prepare model
llm = LLM(
    model="RedHatAI/gemma-3n-E4B-it-FP8-dynamic",
    trust_remote_code=True,
    max_model_len=4096,
    max_num_seqs=2,
)

# prepare inputs
question = "What is the content of this image?"
inputs = {
    # Gemma chat format; <start_of_image> marks where the image is inserted
    "prompt": f"<start_of_turn>user\n<start_of_image>{question}<end_of_turn>\n<start_of_turn>model\n",
    "multi_modal_data": {
        "image": ImageAsset("cherry_blossom").pil_image.convert("RGB")
    },
}

# generate response
print("========== SAMPLE GENERATION ==============")
outputs = llm.generate(inputs, SamplingParams(temperature=0.2, max_tokens=64))
print(f"PROMPT  : {outputs[0].prompt}")
print(f"RESPONSE: {outputs[0].outputs[0].text}")
print("==========================================")

vLLM also supports OpenAI-compatible serving. See the documentation for more details.
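As a rough illustration of the OpenAI-compatible route, the sketch below builds a chat-completions request using only the Python standard library. It assumes a server was started separately (for example with `vllm serve RedHatAI/gemma-3n-E4B-it-FP8-dynamic`) and is listening on http://localhost:8000; the URL, helper names, and prompt are placeholders rather than values taken from this model card.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, max_tokens=64):
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.2,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=60):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


request = build_chat_request(
    base_url="http://localhost:8000",  # assumed local server address
    model="RedHatAI/gemma-3n-E4B-it-FP8-dynamic",
    prompt="Describe FP8 quantization in one sentence.",
)
print(request.full_url)
# Uncomment once a server is running:
# print(send_chat_request(request))
```

Keeping request construction separate from sending makes the payload easy to inspect before any network call is made.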

Deploy on Red Hat AI Inference Server

podman run --rm -it --device nvidia.com/gpu=all -p 8000:8000 \
  --ipc=host \
  --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
  --env "HF_HUB_OFFLINE=0" \
  -v ~/.cache/vllm:/home/vllm/.cache \
  --name=vllm \
  registry.access.redhat.com/rhaiis/rh-vllm-cuda \
  vllm serve \
  --tensor-parallel-size 8 \
  --max-model-len 32768 \
  --enforce-eager \
  --model RedHatAI/gemma-3n-E4B-it-FP8-dynamic

Deploy on Red Hat OpenShift AI

# Setting up vllm server with ServingRuntime
# Save as: vllm-servingruntime.yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
 name: vllm-cuda-runtime # OPTIONAL CHANGE: set a unique name
 annotations:
   openshift.io/display-name: vLLM NVIDIA GPU ServingRuntime for KServe
   opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'
 labels:
   opendatahub.io/dashboard: 'true'
spec:
 annotations:
   prometheus.io/port: '8080'
   prometheus.io/path: '/metrics'
 multiModel: false
 supportedModelFormats:
   - autoSelect: true
     name: vLLM
 containers:
   - name: kserve-container
     image: quay.io/modh/vllm:rhoai-2.24-cuda # CHANGE if needed. If AMD: quay.io/modh/vllm:rhoai-2.24-rocm
     command:
       - python
       - -m
       - vllm.entrypoints.openai.api_server
     args:
       - "--port=8080"
       - "--model=/mnt/models"
       - "--served-model-name={{.Name}}"
     env:
       - name: HF_HOME
         value: /tmp/hf_home
     ports:
       - containerPort: 8080
         protocol: TCP

# Attach model to vllm server. This is an NVIDIA template
# Save as: inferenceservice.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    openshift.io/display-name: gemma-3n-E4B-it-FP8-dynamic # OPTIONAL CHANGE
    serving.kserve.io/deploymentMode: RawDeployment
  name: gemma-3n-E4B-it-FP8-dynamic          # specify model name. This value will be used to invoke the model in the payload
  labels:
    opendatahub.io/dashboard: 'true'
spec:
  predictor:
    maxReplicas: 1
    minReplicas: 1
    model:
      modelFormat:
        name: vLLM
      name: ''
      resources:
        limits:
          cpu: '2'                # this is model specific
          memory: 8Gi             # this is model specific
          nvidia.com/gpu: '1'     # this is accelerator specific
        requests:                 # same comment for this block
          cpu: '1'
          memory: 4Gi
          nvidia.com/gpu: '1'
      runtime: vllm-cuda-runtime  # must match the ServingRuntime name above
      storageUri: oci://registry.redhat.io/rhelai1/modelcar-gemma-3n-e4b-it-fp8-dynamic:1.5
    tolerations:
    - effect: NoSchedule
      key: nvidia.com/gpu
      operator: Exists

# First make sure you are in the project where you want to deploy the model
# oc project <project-name>

# Apply both resources to run the model

# Apply the ServingRuntime
oc apply -f vllm-servingruntime.yaml

# Apply the InferenceService
oc apply -f inferenceservice.yaml

# Replace <inference-service-name> and <cluster-ingress-domain> below:
# - Run `oc get inferenceservice` to find your URL if unsure.

# Call the server using curl:
curl https://<inference-service-name>-predictor-default.<cluster-ingress-domain>/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
    "model": "gemma-3n-E4B-it-FP8-dynamic",
    "stream": true,
    "stream_options": {
        "include_usage": true
    },
    "max_tokens": 1,
    "messages": [
        {
            "role": "user",
            "content": "How can a bee fly when its wings are so small?"
        }
    ]
}'

See the Red Hat OpenShift AI documentation for more details.
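The curl call above sets "stream": true, so the reply arrives as server-sent events rather than as a single JSON object. The sketch below shows one way to reassemble such a stream in Python. It assumes the OpenAI-style chunk layout (lines prefixed with `data:`, text under choices[0].delta.content, a final `[DONE]` marker, and a usage-only chunk when include_usage is set); the sample lines are invented for illustration and were not captured from a real deployment.

```python
import json


def collect_stream(lines):
    """Join streamed text deltas and return (text, usage) from SSE lines."""
    text_parts = []
    usage = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue                    # skip blank keep-alive lines and comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break                       # end-of-stream marker
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            if delta.get("content"):
                text_parts.append(delta["content"])
        if chunk.get("usage"):
            usage = chunk["usage"]      # sent once when include_usage is true
    return "".join(text_parts), usage


# Invented sample stream for illustration only.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant", "content": ""}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Bees"}}]}',
    'data: {"choices": [{"delta": {"content": " beat their wings fast."}}]}',
    'data: {"choices": [], "usage": {"prompt_tokens": 21, "completion_tokens": 6, "total_tokens": 27}}',
    "data: [DONE]",
]
text, usage = collect_stream(sample)
print(text)
print(usage)
```

With "max_tokens": 1, as in the curl example, a real stream would carry at most one content delta before the usage chunk.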

Creation

This model was created with llm-compressor by running the code snippet below.

Model Creation Code

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from transformers import AutoProcessor, Gemma3nForConditionalGeneration

# Load model.
model_id = "google/gemma-3n-E4B-it"
model = Gemma3nForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Recipe
recipe = [
    QuantizationModifier(
        targets="Linear",
        scheme="FP8_DYNAMIC",
        ignore=[
            "re:.*embed_audio.*",
            "re:.*embed_vision.*",
            "re:.*audio_tower.*",
            "re:.*vision_tower.*",
            "re:.*altup.*",
            "re:.*lm_head.*",
            "re:.*laurel.*",
            # raw strings keep the regex backslashes from being read as string escapes
            r"re:model\.language_model\.layers\.\d+\.per_layer_input_gate",
            r"re:model\.language_model\.layers\.\d+\.per_layer_projection",
            "model.language_model.per_layer_model_projection",
        ],
    ),
]

SAVE_DIR = f"{model_id.split('/')[1]}-{recipe[0].scheme}"

# Perform oneshot
oneshot(
    model=model,
    tokenizer=model_id,
    recipe=recipe,
    trust_remote_code_model=True,
    tie_word_embeddings=True,
    output_dir=SAVE_DIR,
)

# Save to disk compressed.
model.save_pretrained(SAVE_DIR, save_compressed=True)
processor.save_pretrained(SAVE_DIR)
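The ignore list in the recipe mixes two kinds of entries: patterns prefixed with re: and one plain module name. The sketch below illustrates how such a list could decide which Linear layers stay unquantized. It assumes that a re: entry is a regular expression matched from the start of the module name and that an unprefixed entry must equal the name exactly. This follows how llm-compressor's matching is usually described but is not its actual code, and the module names are illustrative examples rather than a dump of the real model.

```python
import re

IGNORE = [
    "re:.*embed_audio.*",
    "re:.*embed_vision.*",
    "re:.*audio_tower.*",
    "re:.*vision_tower.*",
    "re:.*altup.*",
    "re:.*lm_head.*",
    "re:.*laurel.*",
    r"re:model\.language_model\.layers\.\d+\.per_layer_input_gate",
    r"re:model\.language_model\.layers\.\d+\.per_layer_projection",
    "model.language_model.per_layer_model_projection",
]


def is_ignored(module_name, ignore=IGNORE):
    """Return True if module_name matches any entry of the ignore list."""
    for entry in ignore:
        if entry.startswith("re:"):
            if re.match(entry[3:], module_name):  # regex, anchored at the start
                return True
        elif entry == module_name:                # plain entry: exact match
            return True
    return False


# Illustrative module names (assumed, not read from the checkpoint).
examples = [
    "model.language_model.layers.0.self_attn.q_proj",
    "model.language_model.layers.0.mlp.down_proj",
    "model.language_model.layers.3.per_layer_input_gate",
    "model.language_model.layers.3.altup.correction_coefs",
    "model.vision_tower.blocks.0.conv",
    "lm_head",
]
for name in examples:
    status = "skip (kept in original precision)" if is_ignored(name) else "quantize to FP8"
    print(f"{name}: {status}")
```

Under these assumptions, the attention and MLP projections of the language model are quantized, while the audio and vision towers, the embedding projections, and the output head are left untouched.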

Evaluation

The model was evaluated with lm-evaluation-harness on the OpenLLM V1 and Leaderboard V2 text-based benchmarks, using the following commands:

Evaluation Commands

OpenLLM V1

lm_eval \
  --model vllm \
  --model_args pretrained="<model_name>",dtype=auto,add_bos_token=false,max_model_len=4096,gpu_memory_utilization=0.8,enable_chunked_prefill=True,enforce_eager=True,trust_remote_code=True \
  --tasks openllm \
  --batch_size auto \
  --apply_chat_template \
  --fewshot_as_multiturn

Leaderboard V2

lm_eval \
  --model vllm \
  --model_args pretrained="<model_name>",dtype=auto,add_bos_token=false,max_model_len=15000,gpu_memory_utilization=0.5,enable_chunked_prefill=True,enforce_eager=True,trust_remote_code=True \
  --tasks leaderboard \
  --batch_size auto \
  --apply_chat_template \
  --fewshot_as_multiturn

Accuracy

Category	Metric	google/gemma-3n-E4B-it	FP8 Dynamic	Recovery (%)
OpenLLM V1	arc_challenge	60.24	59.04	98.01%
	gsm8k	60.12	70.81	117.79%
	hellaswag	74.94	73.28	97.79%
	mmlu	64.14	64.82	101.06%
	truthfulqa_mc2	54.87	54.61	99.53%
	winogrande	68.35	67.72	99.08%
	Average	63.78	65.05	101.99%
Leaderboard	bbh	55.46	55.20	99.53%
	mmlu_pro	34.38	34.28	99.71%
	musr	33.20	34.26	103.19%
	ifeval	84.41	83.93	99.43%
	gpqa	30.87	31.38	101.65%
	math_hard	45.54	46.60	102.33%
	Average	47.31	47.61	100.63%
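
Recovery in the table is the quantized score expressed as a percentage of the baseline score. The short sketch below reproduces that arithmetic for the OpenLLM V1 rows; the function and variable names are illustrative. Recomputing from the rounded table entries can differ from the published figure in the last digit, presumably because the published values were derived from unrounded scores.

```python
def recovery(baseline, quantized):
    """Quantized score as a percentage of the baseline score."""
    return quantized / baseline * 100.0


# (baseline, FP8 dynamic) pairs copied from the OpenLLM V1 rows of the table
openllm_v1 = {
    "arc_challenge": (60.24, 59.04),
    "gsm8k": (60.12, 70.81),
    "hellaswag": (74.94, 73.28),
    "mmlu": (64.14, 64.82),
    "truthfulqa_mc2": (54.87, 54.61),
    "winogrande": (68.35, 67.72),
}

for task, (base, quant) in openllm_v1.items():
    print(f"{task:15s} {recovery(base, quant):7.2f}%")

avg_base = sum(b for b, _ in openllm_v1.values()) / len(openllm_v1)
avg_quant = sum(q for _, q in openllm_v1.values()) / len(openllm_v1)
print(f"{'Average':15s} {recovery(avg_base, avg_quant):7.2f}%")
```

A recovery above 100% means the quantized model scored higher than the baseline on that task, which is why the averages sit slightly over 100% here.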