---
license: gemma
base_model: google/Gemma-3-12B-IT
pipeline_tag: text-generation
tags:
- chat
extra_gated_heading: Access Gemma3-12B-IT on Hugging Face
extra_gated_prompt: >-
  To access Gemma3-12B-IT on Hugging Face, you are required to review and agree to
  the gemma license. To do this, please ensure you are logged in to Hugging Face
  and click below. Requests are processed immediately.
extra_gated_button_content: Acknowledge license
---

# litert-community/Gemma3-12B-IT

This model provides a few variants of [google/Gemma-3-12B-IT](https://huggingface.co/google/Gemma-3-12B-IT) that are ready for deployment on Web using the [MediaPipe LLM Inference API](https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference).

### Web

* Build and run our [sample web app](https://github.com/google-ai-edge/mediapipe-samples/blob/main/examples/llm_inference/js/README.md).

To add the model to your web app, please follow the instructions in our [documentation](https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference/web_js). A minimal loading sketch is also included below the performance table.

## Performance

### Web

Note that all benchmark stats are from a MacBook Pro 2024 (Apple M4 Max chip) with 1280 KV cache size, 1024 tokens prefill, and 256 tokens decode, running in Chrome.
| Precision (activations) | Precision (weights) | Backend | Prefill (tokens/sec) | Decode (tokens/sec) | Time-to-first-token (sec) | GPU Memory | CPU Memory | Model size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| F16 | int8 | GPU | 382 | 17 | 5.51 | 12.3 GB | 1.1 GB | 11.79 GB |
| F32 | int8 | GPU | 226 | 17 | 5.47 | 13.0 GB | 1.1 GB | 11.79 GB |
| F16 | int4 | GPU | 384 | 23 | 3.63 | 8.4 GB | 1.1 GB | 7.55 GB |
| F32 | int4 | GPU | 229 | 23 | 3.58 | 9.0 GB | 1.1 GB | 7.55 GB |
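
As a convenience, here is a minimal sketch of loading one of these bundles with the MediaPipe LLM Inference Web API (`@mediapipe/tasks-genai`). The model file name and serving path are illustrative assumptions; point `modelAssetPath` at whichever variant you downloaded from this repo. `maxTokens` is set to match the 1280-token KV cache used in the benchmarks above.

```typescript
// Minimal sketch of running a Gemma3-12B-IT variant in the browser with the
// MediaPipe LLM Inference API. The .task file name below is a placeholder;
// use the variant you fetched from this repo.
import { FilesetResolver, LlmInference } from '@mediapipe/tasks-genai';

async function main() {
  // Load the WASM assets that back the GenAI tasks.
  const genaiFileset = await FilesetResolver.forGenAiTasks(
    'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai/wasm'
  );

  // Create the LLM task from a downloaded .task bundle. maxTokens mirrors
  // the 1280-token KV cache size from the benchmark configuration.
  const llm = await LlmInference.createFromOptions(genaiFileset, {
    baseOptions: {
      // Hypothetical local path; serve the file you downloaded from this repo.
      modelAssetPath: '/models/gemma3-12b-it-int4-web.task',
    },
    maxTokens: 1280,
    topK: 40,
    temperature: 0.8,
  });

  // Stream partial results as tokens are decoded.
  llm.generateResponse(
    'Explain the difference between int4 and int8 quantization in one paragraph.',
    (partialResult, done) => {
      document.body.append(partialResult);
      if (done) console.log('generation complete');
    }
  );
}

main();
```

The progress-listener form of `generateResponse` shown here streams tokens as they are produced, which matters at these model sizes: with a ~5 s time-to-first-token, streaming keeps the UI responsive instead of blocking until the full response is decoded.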