Commit 72e0314 (verified) by Lumoslulula · 1 parent: d73e4db

Update README.md

Files changed (1): README.md (+95 lines, −3 lines)
README.md (updated):

---
license: apache-2.0
language:
- en
---

# litert-community/Gecko-110m-en

This model provides a few variants of the embedding model published in the [Gecko paper](https://arxiv.org/abs/2403.20327), ready for deployment on Android or iOS using the [LiteRT stack](https://ai.google.dev/edge/litert) or the [Google AI Edge RAG SDK](https://ai.google.dev/edge/mediapipe/solutions/genai/rag).

## Use the models

### Android

* Try out the Gecko embedding model with the [Google AI Edge RAG SDK](https://ai.google.dev/edge/mediapipe/solutions/genai/rag).
* Follow the instructions in the [Android guide](https://ai.google.dev/edge/mediapipe/solutions/genai/rag/android). A minimal inference sketch follows this list.

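For readers who want to drive the `.tflite` embedder directly rather than through the RAG SDK, the sketch below shows one plausible way to do it with the LiteRT (TensorFlow Lite) `Interpreter` API in Kotlin. The model path, sequence length, embedding width, and the pre-tokenized input are all assumptions made for illustration; the RAG SDK handles tokenization and inference for you.

```kotlin
import org.tensorflow.lite.Interpreter
import java.io.File
import java.nio.ByteBuffer
import java.nio.ByteOrder

// Hypothetical sketch: runs the Gecko .tflite embedder directly through the
// LiteRT (TensorFlow Lite) Interpreter API. The model path, sequence length,
// and embedding width below are assumptions, not values shipped with the model.
const val MODEL_PATH = "/data/local/tmp/gecko.tflite"  // assumed location
const val MAX_SEQ_LEN = 256      // matches the 256-token variant benchmarked below
const val EMBEDDING_DIM = 768    // assumed output width

fun loadModel(path: String): ByteBuffer {
    // The Interpreter expects the flatbuffer in a direct, native-ordered buffer.
    val bytes = File(path).readBytes()
    return ByteBuffer.allocateDirect(bytes.size).order(ByteOrder.nativeOrder())
        .put(bytes).apply { rewind() }
}

fun embed(tokenIds: IntArray): FloatArray {
    val options = Interpreter.Options().setNumThreads(4)  // CPU path, as benchmarked
    val interpreter = Interpreter(loadModel(MODEL_PATH), options)
    // Pad (or truncate) the token IDs from a SentencePiece tokenizer (not shown)
    // to the model's fixed sequence length.
    val ids = IntArray(MAX_SEQ_LEN) { i -> if (i < tokenIds.size) tokenIds[i] else 0 }
    val output = Array(1) { FloatArray(EMBEDDING_DIM) }
    interpreter.run(arrayOf(ids), output)  // input [1, MAX_SEQ_LEN], output [1, EMBEDDING_DIM]
    interpreter.close()
    return output[0]
}
```

Producing the token IDs requires the SentencePiece tokenizer that accompanies the model; wiring that up is exactly what the RAG SDK does for you, so prefer the SDK unless you need low-level control.
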
## Performance

### Android

Note that all benchmark stats are from a Samsung S23 Ultra.

| Variant | Backend | Max sequence length | Init time (ms) | Inference time (ms) | Memory (RSS, MB) | Model size (MB) |
| :--- | :--- | ---: | ---: | ---: | ---: | ---: |
| dynamic_int8 | GPU | 256 | 1306.06 | 76.2 | 604.5 | 114 |
| dynamic_int8 | GPU | 512 | 1363.38 | 173.2 | 604.6 | 120 |
| dynamic_int8 | GPU | 1024 | 1419.87 | 397 | 871.1 | 145 |
| dynamic_int8 | CPU | 256 | 11.03 | 147.6 | 126.3 | 114 |
| dynamic_int8 | CPU | 512 | 30.04 | 353.1 | 225.6 | 120 |
| dynamic_int8 | CPU | 1024 | 79.17 | 954 | 619.5 | 145 |

* Model size: measured as the size of the .tflite flatbuffer (the serialization format for LiteRT models).
* Memory: an indicator of peak RAM usage.
* CPU inference is accelerated via the LiteRT [XNNPACK](https://github.com/google/XNNPACK) delegate with 4 threads.
* GPU inference is accelerated via the LiteRT GPU delegate (a configuration sketch follows this list).
* Benchmarks were run with the XNNPACK cache enabled.
* dynamic_int8: quantized model with int8 weights and float activations.
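
To make the two benchmark configurations concrete, here is a hedged Kotlin sketch of how an app might select between the CPU path (XNNPACK, 4 threads) and the GPU delegate. It assumes the standard LiteRT/TFLite Kotlin API and the `tensorflow-lite-gpu` artifact; it is a sketch of the setup, not the benchmark harness itself.

```kotlin
import org.tensorflow.lite.Interpreter
import org.tensorflow.lite.gpu.GpuDelegate
import java.nio.ByteBuffer

// Sketch of the two configurations benchmarked above; details such as
// delegate options vary by LiteRT release.
fun makeInterpreter(model: ByteBuffer, useGpu: Boolean): Interpreter {
    val options = Interpreter.Options()
    if (useGpu) {
        // GPU path: higher one-time init cost, faster inference at long sequences.
        options.addDelegate(GpuDelegate())
    } else {
        // CPU path: XNNPACK with 4 threads, matching the table above.
        options.setNumThreads(4)
        options.setUseXNNPACK(true)
    }
    return Interpreter(model, options)
}
```

As the table shows, the GPU path pays roughly a second of extra init time in exchange for markedly faster inference at longer sequence lengths, so the better backend depends on how many embeddings you compute per session.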