prithivida committed on
Commit 4f8d184 · verified · 1 Parent(s): b5fdbac

Update README.md

Files changed (1)
  1. README.md +12 -3
README.md CHANGED
@@ -177,7 +177,7 @@ The below numbers are with mDPR model, but miniDense_arabic_v1 should give a eve
 
 *Note: The MIRACL paper shows a different (higher) value for BM25 Arabic, so we take that value from the BGE-M3 paper; all the rest are from the MIRACL paper.*
 
- # MTEB numbers:
+ # MTEB Retrieval numbers:
 MTEB is a general-purpose embedding evaluation benchmark covering a wide range of tasks, but miniDense models (like BGE-M3) are predominantly tuned for retrieval tasks aimed at search & IR use cases.
 So it makes sense to evaluate our models on the retrieval slice of the MTEB benchmark.
 
@@ -185,13 +185,22 @@ So it makes sense to evaluate our models in retrieval slice of the MTEB benchmar
 
 Refer to the tables above.
 
+ #### Sadeem Question Retrieval
+
+ <center>
+ <img src="./ar_metrics_6.png" width=150%/>
+ <b><p>Table 3: Detailed Arabic retrieval performance on the SadeemQA eval set (measured by nDCG@10)</p></b>
+ </center>
+
+
+
 #### Long Document Retrieval
 
 This is a very ambitious eval because we have not trained for long context: max_len was 512 for all the models below except BGE-M3, which had an 8192 context and was finetuned for long documents.
 
 <center>
 <img src="./ar_metrics_4.png" width=150%/>
- <b><p>Table 3: Detailed Arabic retrieval performance on the MultiLongDoc dev set (measured by nDCG@10)</p></b>
+ <b><p>Table 4: Detailed Arabic retrieval performance on the MultiLongDoc dev set (measured by nDCG@10)</p></b>
 </center>
 
 
@@ -202,7 +211,7 @@ This explains it's overall competitive performance when compared to models that
 
 <center>
 <img src="./ar_metrics_5.png" width=120%/>
- <b><p>Table 4: Detailed Arabic retrieval performance on the 3 X-lingual test set (measured by nDCG@10)</p></b>
+ <b><p>Table 5: Detailed Arabic retrieval performance on the 3 X-lingual test set (measured by nDCG@10)</p></b>
 </center>
 
 <br/>
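
For readers who want to reproduce the retrieval-slice evaluation the changed README describes, the sketch below shows one way to run Arabic retrieval tasks through the `mteb` library with a `sentence-transformers` checkpoint. This is a minimal sketch, not part of the commit: the model id `prithivida/miniDense_arabic_v1` and the task names `SadeemQuestionRetrieval` and `MultiLongDocRetrieval` are assumptions inferred from the names used in the README, so verify them against the current MTEB task registry before running.

```python
# Minimal sketch, assuming the model is published as a sentence-transformers
# checkpoint and that the MTEB task names below match the current registry.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Hypothetical model id inferred from this repo; replace with the actual one.
model = SentenceTransformer("prithivida/miniDense_arabic_v1")

# Arabic retrieval tasks corresponding to the SadeemQA and MultiLongDoc tables;
# MTEB reports nDCG@10 as the primary retrieval metric, as quoted in the README.
evaluation = MTEB(tasks=["SadeemQuestionRetrieval", "MultiLongDocRetrieval"])
evaluation.run(model, output_folder="mteb_ar_results")
```

The per-task JSON files written to `mteb_ar_results` contain the nDCG@10 scores that the tables in the README summarize.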