Add files using upload-large-folder tool
README.md CHANGED
@@ -56,19 +56,19 @@ Note that all benchmark stats are from a Samsung S24 Ultra with
       <tr>
         <td>fp32 (baseline)</td>
         <td>cpu</td>
-        <td><p style="text-align: right">
-        <td><p style="text-align: right">
-        <td><p style="text-align: right">
-        <td><p style="text-align: right">
+        <td><p style="text-align: right">576.58 tk/s</p></td>
+        <td><p style="text-align: right">52.23 tk/s</p></td>
+        <td><p style="text-align: right">0.73 s</p></td>
+        <td><p style="text-align: right">927 MB</p></td>
         <td><p style="text-align: right">527 MB</p></td>
       </tr>
       <tr>
         <td>dynamic_int8</td>
         <td>cpu</td>
-        <td><p style="text-align: right">
-        <td><p style="text-align: right">
-        <td><p style="text-align: right">0.
-        <td><p style="text-align: right">
+        <td><p style="text-align: right">1142.86 tk/s</p></td>
+        <td><p style="text-align: right">96.65 tk/s</p></td>
+        <td><p style="text-align: right">0.45 s</p></td>
+        <td><p style="text-align: right">567 MB</p></td>
         <td><p style="text-align: right">159 MB</p></td>
       </tr>

@@ -79,5 +79,7 @@ Note that all benchmark stats are from a Samsung S24 Ultra with
 * Memory: indicator of peak RAM usage
 * The inference on CPU is accelerated via the LiteRT
   [XNNPACK](https://github.com/google/XNNPACK) delegate with 4 threads
-* Benchmark is
+* Benchmark is run with cache enabled and initialized. During the first run,
+  the time to first token may differ.
+* dynamic_int4: quantized model with int4 weights and float activations.
 * dynamic_int8: quantized model with int8 weights and float activations.
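The dynamic_int8 scheme named in the notes above (int8 weights, float activations) can be illustrated with a minimal, self-contained sketch of per-tensor dynamic-range quantization. This is only an illustration of the general idea, not LiteRT's actual implementation; the function names and the per-tensor scaling choice are assumptions for the example.

```python
# Minimal sketch of dynamic-range int8 weight quantization:
# weights are stored as int8 plus one float scale per tensor,
# and dequantized back to float at inference time.
# Illustrative only -- not the LiteRT/XNNPACK implementation.

def quantize_int8(weights):
    """Map float weights to int8 with a per-tensor scale."""
    # Guard against an all-zero tensor (scale would be 0).
    scale = max(abs(w) for w in weights) / 127.0 or 1.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 values."""
    return [v * scale for v in q]

weights = [0.5, -1.2, 0.03, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# int8 storage is roughly 4x smaller than fp32, which is why the
# quantized model above shrinks from 527 MB to 159 MB; the values
# are recovered only approximately (error at most scale / 2).
```

Activations stay in float, so only the weight matrices pay the rounding cost; that is what distinguishes this "dynamic" scheme from full integer quantization.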