---
license: gemma
base_model: google/Gemma-3-27B-IT
pipeline_tag: text-generation
tags:
- chat
extra_gated_heading: Access Gemma3-27B-IT on Hugging Face
extra_gated_prompt: >-
  To access Gemma3-27B-IT on Hugging Face, you are required to review and agree
  to the Gemma license. To do this, please ensure you are logged in to
  Hugging Face and click below. Requests are processed immediately.
extra_gated_button_content: Acknowledge license
---

# litert-community/Gemma3-27B-IT

This model provides a few variants of
[google/Gemma-3-27B-IT](https://huggingface.co/google/Gemma-3-27B-IT) that are ready for
deployment on the web using the
[MediaPipe LLM Inference API](https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference).

### Web

* Build and run our [sample web app](https://github.com/google-ai-edge/mediapipe-samples/blob/main/examples/llm_inference/js/README.md).

To add the model to your own web app, follow the instructions in our [documentation](https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference/web_js).

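As a minimal sketch of what that integration looks like: the snippet below loads one of the `.task` bundles with the MediaPipe LLM Inference API and generates a response. The asset path and the generation parameters (`maxTokens`, `topK`, `temperature`) are illustrative placeholders; the documentation linked above is authoritative.

```javascript
// Sketch: running a Gemma3-27B-IT .task bundle in the browser with the
// MediaPipe LLM Inference API (@mediapipe/tasks-genai).
import {FilesetResolver, LlmInference} from '@mediapipe/tasks-genai';

// Resolve the WASM assets that back the GenAI tasks.
const genai = await FilesetResolver.forGenAiTasks(
    'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai/wasm');

// Create the inference task from a locally hosted model bundle.
// The path below is a placeholder for wherever you serve the
// downloaded gemma3-27b-it-int8-web.task file.
const llmInference = await LlmInference.createFromOptions(genai, {
  baseOptions: {
    modelAssetPath: '/assets/gemma3-27b-it-int8-web.task',
  },
  maxTokens: 1280,  // total context (prefill + decode) budget
  topK: 40,
  temperature: 0.8,
});

// Single-shot generation; a streaming callback variant also exists.
const response = await llmInference.generateResponse(
    'Explain KV caches in one paragraph.');
console.log(response);
```

Note that the model must be served from (or fetched into) the page's origin; browsers will not load a multi-gigabyte bundle cross-origin without the appropriate CORS headers.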
## Performance

### Web

All benchmark stats below were collected on a MacBook Pro 2024 (Apple M4 Max chip) running Chrome, with a 1280-token KV cache, 1024 prefill tokens, and 256 decode tokens.

<table border="1">
  <tr>
    <th>Activations</th>
    <th>Precision</th>
    <th>Backend</th>
    <th>Prefill (tokens/sec)</th>
    <th>Decode (tokens/sec)</th>
    <th>Time-to-first-token (sec)</th>
    <th>GPU Memory</th>
    <th>CPU Memory</th>
    <th>Model size</th>
    <th></th>
  </tr>
  <tr>
    <td><p style="text-align: left">F16</p></td>
    <td><p style="text-align: left">int8</p></td>
    <td><p style="text-align: left">GPU</p></td>
    <td><p style="text-align: right">166 tk/s</p></td>
    <td><p style="text-align: right">8 tk/s</p></td>
    <td><p style="text-align: right">15.0 s</p></td>
    <td><p style="text-align: right">26.8 GB</p></td>
    <td><p style="text-align: right">1.5 GB</p></td>
    <td><p style="text-align: right">27.05 GB</p></td>
    <td><p style="text-align: left"><a style="text-decoration: none" href="https://huggingface.co/litert-community/Gemma3-27B-IT/resolve/main/gemma3-27b-it-int8-web.task">🔗</a></p></td>
  </tr>
  <tr>
    <td><p style="text-align: left">F32</p></td>
    <td><p style="text-align: left">int8</p></td>
    <td><p style="text-align: left">GPU</p></td>
    <td><p style="text-align: right">98 tk/s</p></td>
    <td><p style="text-align: right">8 tk/s</p></td>
    <td><p style="text-align: right">15.0 s</p></td>
    <td><p style="text-align: right">27.8 GB</p></td>
    <td><p style="text-align: right">1.5 GB</p></td>
    <td><p style="text-align: right">27.05 GB</p></td>
    <td><p style="text-align: left"><a style="text-decoration: none" href="https://huggingface.co/litert-community/Gemma3-27B-IT/resolve/main/gemma3-27b-it-int8-web.task">🔗</a></p></td>
  </tr>
</table>

71 |
+
|
* Model size: measured by the size of the .tflite flatbuffer (the serialization format for LiteRT models).
* int8: quantized model with int8 weights and float activations.
* GPU memory: measured as the "GPU Process" memory for all of Chrome while running. Before any model loading, Chrome used 130-530 MB.
* CPU memory: measured for the entire tab while running. Before any model loading, the tab used 30-60 MB.
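The 🔗 links in the table point at the `.task` bundles in this repository. Because the repo is gated, direct downloads require an authenticated Hugging Face session; one way to fetch a bundle locally (assuming the `huggingface-cli` tool from the `huggingface_hub` package, and a token that has accepted the Gemma license) is:

```shell
# Authenticate once with a token whose account has accepted the Gemma license.
huggingface-cli login

# Download the int8 web bundle from this repository into ./models/.
huggingface-cli download litert-community/Gemma3-27B-IT \
    gemma3-27b-it-int8-web.task --local-dir ./models
```

The bundle is roughly 27 GB, so plan for disk space and download time accordingly.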