Add files using upload-large-folder tool
README.md CHANGED
@@ -56,19 +56,19 @@ Note that all benchmark stats are from a Samsung S24 Ultra with
       <tr>
         <td>fp32 (baseline)</td>
         <td>cpu</td>
-        <td><p style="text-align: right">
-        <td><p style="text-align: right">
-        <td><p style="text-align: right">
-        <td><p style="text-align: right">
+        <td><p style="text-align: right">576.58 tk/s</p></td>
+        <td><p style="text-align: right">52.23 tk/s</p></td>
+        <td><p style="text-align: right">0.73 s</p></td>
+        <td><p style="text-align: right">927 MB</p></td>
         <td><p style="text-align: right">527 MB</p></td>
       </tr>
       <tr>
         <td>dynamic_int8</td>
         <td>cpu</td>
-        <td><p style="text-align: right">
-        <td><p style="text-align: right">
-        <td><p style="text-align: right">0.
-        <td><p style="text-align: right">
+        <td><p style="text-align: right">1142.86 tk/s</p></td>
+        <td><p style="text-align: right">96.65 tk/s</p></td>
+        <td><p style="text-align: right">0.45 s</p></td>
+        <td><p style="text-align: right">567 MB</p></td>
         <td><p style="text-align: right">159 MB</p></td>
       </tr>

@@ -79,5 +79,7 @@ Note that all benchmark stats are from a Samsung S24 Ultra with
 * Memory: indicator of peak RAM usage
 * The inference on CPU is accelerated via the LiteRT
   [XNNPACK](https://github.com/google/XNNPACK) delegate with 4 threads
-* Benchmark is
+* Benchmark is run with cache enabled and initialized. During the first run,
+  the time to first token may differ.
+* dynamic_int4: quantized model with int4 weights and float activations.
 * dynamic_int8: quantized model with int8 weights and float activations.
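The dynamic_int8 scheme named in the notes above (int8 weights, float activations) can be illustrated with a minimal, self-contained sketch of per-tensor dynamic-range quantization. This is only an illustration of the general idea, not LiteRT's actual implementation; the function names and the per-tensor scaling choice are assumptions for the example.

```python
# Minimal sketch of dynamic-range int8 weight quantization:
# weights are stored as int8 plus one float scale per tensor,
# and dequantized back to float at inference time.
# Illustrative only -- not the LiteRT/XNNPACK implementation.

def quantize_int8(weights):
    """Map float weights to int8 with a per-tensor scale."""
    # Guard against an all-zero tensor (scale would be 0).
    scale = max(abs(w) for w in weights) / 127.0 or 1.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 values."""
    return [v * scale for v in q]

weights = [0.5, -1.2, 0.03, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# int8 storage is roughly 4x smaller than fp32, which is why the
# quantized model above shrinks from 527 MB to 159 MB; the values
# are recovered only approximately (error at most scale / 2).
```

Activations stay in float, so only the weight matrices pay the rounding cost; that is what distinguishes this "dynamic" scheme from full integer quantization.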