Cyrile committed
Commit 13b12ec · 1 Parent(s): 10ad6df

Update README.md

Files changed (1)
README.md +8 -4
README.md CHANGED
@@ -7,14 +7,17 @@ pipeline_tag: sentence-similarity
---

Bloomz-560m-retriever
+ ---------------------

Introducing Bloomz-560m-retriever, based on the Bloomz-560m-sft-chat model. It produces embedding representations of texts and queries for retrieval tasks, linking queries to documents. The model is cross-language, i.e. language-agnostic between English and French, and is well suited to Open Domain Question Answering (ODQA): queries and texts are projected into a shared embedding space whose algebraic structure brings matching pairs closer together.

Training
+ --------

It is a bi-encoder trained on a corpus of context/query pairs, 50% in English and 50% in French. The language distribution of the pairs is evenly split (1/4 French-French, 1/4 French-English, 1/4 English-French, 1/4 English-English). The learning objective is to bring the embedding representations of queries and their associated contexts closer together using a contrastive loss (an illustrative formulation of such a loss is sketched after this hunk).

Benchmark
+ ---------

Based on the SQuAD evaluation dataset (comprising 6000 queries distributed over 1200 contexts grouped into 35 themes), we compare performance in terms of the average rank at which the correct context is retrieved for a query (Top-mean), the standard deviation of that rank (Top-std), and the percentage of queries whose correct context appears in the top-1, top-5, and top-10. We compare a TF-IDF baseline trained on the SQuAD train subset, CamemBERT, Sentence-BERT, and our models. We report these performances in both monolingual and cross-language settings (query in French, context in English).
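As an illustration of the contrastive objective mentioned in the hunk above (the exact loss used for training is not given on this page, so the formulation below assumes the common in-batch InfoNCE form), let N be the number of query/context pairs in a batch, sim(·,·) a similarity score between embeddings (or a negative distance), and τ a temperature:

$$
\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp\big(\mathrm{sim}(q_i, c_i)/\tau\big)}{\sum_{j=1}^{N}\exp\big(\mathrm{sim}(q_i, c_j)/\tau\big)}
$$

Each query q_i is thus pulled toward its own context c_i and pushed away from the other contexts in the batch.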
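To make the benchmark columns concrete, here is a small illustrative sketch (not the authors' evaluation code; the function name and array shapes are assumptions) that computes Top-mean, Top-std and the Top-k percentages from a query/context similarity matrix:

```python
import numpy as np

def retrieval_metrics(similarity: np.ndarray, gold_idx: np.ndarray) -> dict:
    # similarity: (n_queries, n_contexts) scores; gold_idx[i] is the index of
    # the correct context for query i.
    order = np.argsort(-similarity, axis=1)                    # contexts sorted best-first per query
    ranks = np.argmax(order == gold_idx[:, None], axis=1) + 1  # 1-based rank of the correct context
    metrics = {"Top-mean": float(ranks.mean()), "Top-std": float(ranks.std())}
    for k in (1, 5, 10):
        metrics[f"Top-{k} (%)"] = 100.0 * float((ranks <= k).mean())
    return metrics
```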
 
@@ -26,16 +29,17 @@ Based on the SQuAD evaluation dataset (comprising 6000 queries distributed over
| Bloomz-560m-retriever | 10 | 47 | 51 | 78 | 86 |
| Bloomz-3b-retriever | 9 | 37 | 50 | 79 | 87 |

- Model (EN/FR) | Top-mean | Top-std | Top-1 (%) | Top-5 (%) | Top-10 (%) |
+ Model (EN/FR) | Top-mean | Top-std | Top-1 (%) | Top-5 (%) | Top-10 (%) |
|-----------------------------------------------------------------------------------------------------|----------|:-------:|-----------|-----------|------------|
- | TF-IDF | 607 | 334 | 0 | 0 | 0 |
+ | TF-IDF | 607 | 334 | 0 | 0 | 0 |
| [CamemBERT](https://huggingface.co/camembert/camembert-base) | 432 | 345 | 0 | 1 | 1 |
| [Sentence-BERT](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | 12 | 47 | 44 | 73 | 83 |
- | Bloomz-560m-retriever | 10 | 44 | 49 | 77 | 86 |
- | Bloomz-3b-retriever | 9 | 38 | 50 | 78 | 87 |
+ | [Bloomz-560m-retriever](https://huggingface.co/cmarkea/bloomz-560m-retriever) | 10 | 44 | 49 | 77 | 86 |
+ | [Bloomz-3b-retriever](https://huggingface.co/cmarkea/bloomz-3b-retriever) | 9 | 38 | 50 | 78 | 87 |


How to Use Bloomz-560m-retriever
+ --------------------------------

The following example uses the pipeline API of the Transformers library.
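The example itself is not shown in this diff; the snippet below is a minimal sketch of such usage, assuming the generic feature-extraction pipeline, mean pooling of token embeddings into a sentence vector (the pooling strategy is an assumption), and cosine similarity for ranking contexts against a query:

```python
import numpy as np
from transformers import pipeline

# Load the retriever through the generic feature-extraction pipeline.
retriever = pipeline("feature-extraction", model="cmarkea/bloomz-560m-retriever")

def embed(text: str) -> np.ndarray:
    # The pipeline returns one vector per token; mean-pool them into a single
    # sentence embedding (pooling choice is an assumption, not from the model card).
    token_vectors = np.array(retriever(text))[0]
    return token_vectors.mean(axis=0)

# Cross-language example: French query, English contexts.
query = "Quelle est la capitale de la France ?"
contexts = [
    "Paris is the capital and most populous city of France.",
    "The Loire is the longest river in France.",
]

q = embed(query)
ctx = np.stack([embed(c) for c in contexts])
scores = ctx @ q / (np.linalg.norm(ctx, axis=1) * np.linalg.norm(q))  # cosine similarity
print(contexts[int(np.argmax(scores))])  # best-matching context for the query
```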
 
 