---
license: gemma
base_model: google/Gemma-3-12B-IT
pipeline_tag: text-generation
tags:
- chat
extra_gated_heading: Access Gemma3-12B-IT on Hugging Face
extra_gated_prompt: >-
  To access Gemma3-12B-IT on Hugging Face, you are required to review and
  agree to the gemma license. To do this, please ensure you are logged in to
  Hugging Face and click below. Requests are processed immediately.
extra_gated_button_content: Acknowledge license
---

# litert-community/Gemma3-12B-IT

This repository provides several variants of [google/Gemma-3-12B-IT](https://huggingface.co/google/Gemma-3-12B-IT) that are ready for deployment on the web using the [MediaPipe LLM Inference API](https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference).

### Web

* Build and run our [sample web app](https://github.com/google-ai-edge/mediapipe-samples/blob/main/examples/llm_inference/js/README.md).

To add the model to your web app, please follow the instructions in our [documentation](https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference/web_js).

## Performance

### Web

All benchmark stats were measured on a MacBook Pro 2024 (Apple M4 Max chip) running Chrome, with a KV cache size of 1280, 1024 prefill tokens, and 256 decode tokens.

| | Precision | Backend | Prefill (tokens/sec) | Decode (tokens/sec) | Time-to-first-token (sec) | GPU Memory | CPU Memory | Model size |
|---|---|---|---|---|---|---|---|---|
| F16 | int8 | GPU | 382 tk/s | 17 tk/s | 5.51 s | 12.3 GB | 1.1 GB | 11.79 GB |
| F32 | int8 | GPU | 226 tk/s | 17 tk/s | 5.47 s | 13.0 GB | 1.1 GB | 11.79 GB |
| F16 | int4 | GPU | 384 tk/s | 23 tk/s | 3.63 s | 8.4 GB | 1.1 GB | 7.55 GB |
| F32 | int4 | GPU | 229 tk/s | 23 tk/s | 3.58 s | 9.0 GB | 1.1 GB | 7.55 GB |

* Model size: measured by the size of the .tflite flatbuffer (serialization format for LiteRT models).
* int8: quantized model with int8 weights and float activations.
* int4: quantized model with int4 weights and float activations.
* GPU memory: measured by "GPU Process" memory for all of Chrome while running. Chrome was measured as using 130-530 MB before any model loading took place.
* CPU memory: measured for the entire tab while running. Tab was measured as using 30-60 MB before any model loading took place.
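
The table figures can be combined into a rough end-to-end estimate for the benchmark workload. The sketch below is illustrative only: `BENCHMARKS` and `estimated_total_seconds` are hypothetical names, the numbers are copied from the table above, and the formula assumes that decode throughput stays constant and that total time is simply time-to-first-token plus decode time. Neither assumption is stated by the benchmark itself.

```python
# Figures copied from the benchmark table (MacBook Pro 2024, M4 Max, Chrome).
# Key: (row label, weight precision). Value: (decode tokens/sec, time-to-first-token in sec).
BENCHMARKS = {
    ("F16", "int8"): (17, 5.51),
    ("F32", "int8"): (17, 5.47),
    ("F16", "int4"): (23, 3.63),
    ("F32", "int4"): (23, 3.58),
}

DECODE_TOKENS = 256  # decode length used in the benchmark


def estimated_total_seconds(variant, decode_tokens=DECODE_TOKENS):
    """Rough wall-clock estimate for one response.

    Assumption (not from the model card): total time is approximated as
    time-to-first-token plus decode_tokens divided by the decode rate.
    """
    decode_rate, ttft = BENCHMARKS[variant]
    return ttft + decode_tokens / decode_rate


if __name__ == "__main__":
    for variant in BENCHMARKS:
        label = "/".join(variant)
        print(f"{label}: ~{estimated_total_seconds(variant):.1f} s for {DECODE_TOKENS} tokens")
```

Under these assumptions the F16/int8 variant takes roughly 20.6 s to produce 256 tokens and the F16/int4 variant roughly 14.8 s. Treat these as order-of-magnitude guides rather than measured results.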