---
license: gemma
base_model: google/Gemma-3-27B-IT
pipeline_tag: text-generation
tags:
- chat
extra_gated_heading: Access Gemma3-27B-IT on Hugging Face
extra_gated_prompt: >-
  To access Gemma3-27B-IT on Hugging Face, you are required to review and agree
  to the Gemma license. To do this, please ensure you are logged in to
  Hugging Face and click below. Requests are processed immediately.
extra_gated_button_content: Acknowledge license
---

# litert-community/Gemma3-27B-IT

This model provides a few variants of
[google/Gemma-3-27B-IT](https://huggingface.co/google/Gemma-3-27B-IT) that are ready for
deployment on the web using the
[MediaPipe LLM Inference API](https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference).

### Web

* Build and run our [sample web app](https://github.com/google-ai-edge/mediapipe-samples/blob/main/examples/llm_inference/js/README.md).

To add the model to your own web app, follow the instructions in our [documentation](https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference/web_js).

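As a minimal sketch of what that integration looks like: the snippet below loads one of the `.task` bundles with the MediaPipe LLM Inference API and generates a response. The asset path and the generation parameters (`maxTokens`, `topK`, `temperature`) are illustrative placeholders; the documentation linked above is authoritative.

```javascript
// Sketch: running a Gemma3-27B-IT .task bundle in the browser with the
// MediaPipe LLM Inference API (@mediapipe/tasks-genai).
import {FilesetResolver, LlmInference} from '@mediapipe/tasks-genai';

// Resolve the WASM assets that back the GenAI tasks.
const genai = await FilesetResolver.forGenAiTasks(
    'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai/wasm');

// Create the inference task from a locally hosted model bundle.
// The path below is a placeholder for wherever you serve the
// downloaded gemma3-27b-it-int8-web.task file.
const llmInference = await LlmInference.createFromOptions(genai, {
  baseOptions: {
    modelAssetPath: '/assets/gemma3-27b-it-int8-web.task',
  },
  maxTokens: 1280,  // total context (prefill + decode) budget
  topK: 40,
  temperature: 0.8,
});

// Single-shot generation; a streaming callback variant also exists.
const response = await llmInference.generateResponse(
    'Explain KV caches in one paragraph.');
console.log(response);
```

Note that the model must be served from (or fetched into) the page's origin; browsers will not load a multi-gigabyte bundle cross-origin without the appropriate CORS headers.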
## Performance

### Web

All benchmark stats below were collected on a MacBook Pro 2024 (Apple M4 Max chip) running Chrome, with a 1280-token KV cache, 1024 prefill tokens, and 256 decode tokens.

<table border="1">
  <tr>
    <th>Activations</th>
    <th>Precision</th>
    <th>Backend</th>
    <th>Prefill (tokens/sec)</th>
    <th>Decode (tokens/sec)</th>
    <th>Time-to-first-token (sec)</th>
    <th>GPU Memory</th>
    <th>CPU Memory</th>
    <th>Model size</th>
    <th></th>
  </tr>
  <tr>
    <td><p style="text-align: left">F16</p></td>
    <td><p style="text-align: left">int8</p></td>
    <td><p style="text-align: left">GPU</p></td>
    <td><p style="text-align: right">166 tk/s</p></td>
    <td><p style="text-align: right">8 tk/s</p></td>
    <td><p style="text-align: right">15.0 s</p></td>
    <td><p style="text-align: right">26.8 GB</p></td>
    <td><p style="text-align: right">1.5 GB</p></td>
    <td><p style="text-align: right">27.05 GB</p></td>
    <td><p style="text-align: left"><a style="text-decoration: none" href="https://huggingface.co/litert-community/Gemma3-27B-IT/resolve/main/gemma3-27b-it-int8-web.task">🔗</a></p></td>
  </tr>
  <tr>
    <td><p style="text-align: left">F32</p></td>
    <td><p style="text-align: left">int8</p></td>
    <td><p style="text-align: left">GPU</p></td>
    <td><p style="text-align: right">98 tk/s</p></td>
    <td><p style="text-align: right">8 tk/s</p></td>
    <td><p style="text-align: right">15.0 s</p></td>
    <td><p style="text-align: right">27.8 GB</p></td>
    <td><p style="text-align: right">1.5 GB</p></td>
    <td><p style="text-align: right">27.05 GB</p></td>
    <td><p style="text-align: left"><a style="text-decoration: none" href="https://huggingface.co/litert-community/Gemma3-27B-IT/resolve/main/gemma3-27b-it-int8-web.task">🔗</a></p></td>
  </tr>
</table>

71 |
+
|
* Model size: measured by the size of the .tflite flatbuffer (the serialization format for LiteRT models).
* int8: quantized model with int8 weights and float activations.
* GPU memory: measured as the "GPU Process" memory for all of Chrome while running. Before any model loading, Chrome used 130-530 MB.
* CPU memory: measured for the entire tab while running. Before any model loading, the tab used 30-60 MB.
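The 🔗 links in the table point at the `.task` bundles in this repository. Because the repo is gated, direct downloads require an authenticated Hugging Face session; one way to fetch a bundle locally (assuming the `huggingface-cli` tool from the `huggingface_hub` package, and a token that has accepted the Gemma license) is:

```shell
# Authenticate once with a token whose account has accepted the Gemma license.
huggingface-cli login

# Download the int8 web bundle from this repository into ./models/.
huggingface-cli download litert-community/Gemma3-27B-IT \
    gemma3-27b-it-int8-web.task --local-dir ./models
```

The bundle is roughly 27 GB, so plan for disk space and download time accordingly.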