prithivida committed on
Commit 4f8d184 · verified · 1 Parent(s): b5fdbac

Update README.md

Files changed (1)
  1. README.md +12 -3
README.md CHANGED
@@ -177,7 +177,7 @@ The below numbers are with mDPR model, but miniDense_arabic_v1 should give a eve
 
 *Note: The MIRACL paper shows a different (higher) value for BM25 Arabic, so we take that value from the BGE-M3 paper; all the rest are from the MIRACL paper.*
 
- # MTEB numbers:
+ # MTEB Retrieval numbers:
 MTEB is a general-purpose embedding evaluation benchmark covering a wide range of tasks, but miniDense models (like BGE-M3) are predominantly tuned for retrieval tasks aimed at search & IR use cases.
 So it makes sense to evaluate our models on the retrieval slice of the MTEB benchmark.
 
@@ -185,13 +185,22 @@ So it makes sense to evaluate our models in retrieval slice of the MTEB benchmar
 
 Refer to the tables above.
 
+ #### Sadeem Question Retrieval
+
+ <center>
+ <img src="./ar_metrics_6.png" width=150%/>
+ <b><p>Table 3: Detailed Arabic retrieval performance on the SadeemQA eval set (measured by nDCG@10)</p></b>
+ </center>
+
+
+
 #### Long Document Retrieval
 
 This is a very ambitious eval because we have not trained for long context: max_len was 512 for all the models below except BGE-M3, which had an 8192 context and was finetuned for long documents.
 
 <center>
 <img src="./ar_metrics_4.png" width=150%/>
- <b><p>Table 3: Detailed Arabic retrieval performance on the MultiLongDoc dev set (measured by nDCG@10)</p></b>
+ <b><p>Table 4: Detailed Arabic retrieval performance on the MultiLongDoc dev set (measured by nDCG@10)</p></b>
 </center>
 
 
@@ -202,7 +211,7 @@ This explains it's overall competitive performance when compared to models that
 
 <center>
 <img src="./ar_metrics_5.png" width=120%/>
- <b><p>Table 4: Detailed Arabic retrieval performance on the 3 X-lingual test set (measured by nDCG@10)</p></b>
+ <b><p>Table 5: Detailed Arabic retrieval performance on the 3 X-lingual test set (measured by nDCG@10)</p></b>
 </center>
 
 <br/>
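
For readers who want to reproduce the retrieval-slice evaluation the changed README describes, the sketch below shows one way to run Arabic retrieval tasks through the `mteb` library with a `sentence-transformers` checkpoint. This is a minimal sketch, not part of the commit: the model id `prithivida/miniDense_arabic_v1` and the task names `SadeemQuestionRetrieval` and `MultiLongDocRetrieval` are assumptions inferred from the names used in the README, so verify them against the current MTEB task registry before running.

```python
# Minimal sketch, assuming the model is published as a sentence-transformers
# checkpoint and that the MTEB task names below match the current registry.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Hypothetical model id inferred from this repo; replace with the actual one.
model = SentenceTransformer("prithivida/miniDense_arabic_v1")

# Arabic retrieval tasks corresponding to the SadeemQA and MultiLongDoc tables;
# MTEB reports nDCG@10 as the primary retrieval metric, as quoted in the README.
evaluation = MTEB(tasks=["SadeemQuestionRetrieval", "MultiLongDocRetrieval"])
evaluation.run(model, output_folder="mteb_ar_results")
```

The per-task JSON files written to `mteb_ar_results` contain the nDCG@10 scores that the tables in the README summarize.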