---

Bloomz-560m-retriever
---------------------

Introducing Bloomz-560m-retriever, based on the Bloomz-560m-sft-chat model. This model produces embedding representations of queries and documents for retrieval tasks, linking each query to the documents that answer it. The model is designed to be cross-language, meaning it is language-agnostic (English/French). It is well suited to Open Domain Question Answering (ODQA), projecting queries and documents into a shared embedding space in which related pairs lie close together.

Training
--------

It is a bi-encoder trained on a corpus of context/query pairs, with 50% in English and 50% in French. The language distribution for queries and contexts is evenly split (1/4 French-French, 1/4 French-English, 1/4 English-French, 1/4 English-English). The learning objective is to bring the embedding representations of queries and their associated contexts closer together, using a contrastive method.
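
As an illustration only (the exact training loss is not reproduced in this card), a common in-batch contrastive objective of the InfoNCE family over a batch of \\(N\\) query/context pairs can be written as:

$$
\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp\left(\mathrm{sim}(q_i, c_i)/\tau\right)}{\sum_{j=1}^{N}\exp\left(\mathrm{sim}(q_i, c_j)/\tau\right)}
$$

where \\(q_i\\) and \\(c_i\\) are the embeddings of the i-th query and its associated context, \\(\mathrm{sim}\\) is a similarity function (cosine, for example), and \\(\tau\\) is a temperature hyperparameter. The other contexts in the batch act as negatives, pushing unrelated query/context pairs apart.
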

Benchmark
---------

Based on the SQuAD evaluation dataset (comprising 6000 queries distributed over 1200 contexts grouped into 35 themes), we compare performance in terms of the mean rank of the correct context for a query (Top-mean), the standard deviation of that rank (Top-std), and the percentage of queries whose correct context appears in the top-1, top-5, and top-10 results. We compare our models against a TF-IDF baseline trained on the SQuAD train split, CamemBERT, and Sentence-BERT. We report these performances in both the monolingual setting and the cross-language setting (queries in French, contexts in English).

| Model (FR/FR)         | Top-mean | Top-std | Top-1 (%) | Top-5 (%) | Top-10 (%) |
|-----------------------|----------|:-------:|-----------|-----------|------------|
| Bloomz-560m-retriever | 10       | 47      | 51        | 78        | 86         |
| Bloomz-3b-retriever   | 9        | 37      | 50        | 79        | 87         |

| Model (EN/FR)                                                                                       | Top-mean | Top-std | Top-1 (%) | Top-5 (%) | Top-10 (%) |
|-----------------------------------------------------------------------------------------------------|----------|:-------:|-----------|-----------|------------|
| TF-IDF                                                                                              | 607      | 334     | 0         | 0         | 0          |
| [CamemBERT](https://huggingface.co/camembert/camembert-base)                                        | 432      | 345     | 0         | 1         | 1          |
| [Sentence-BERT](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | 12       | 47      | 44        | 73        | 83         |
| [Bloomz-560m-retriever](https://huggingface.co/cmarkea/bloomz-560m-retriever)                       | 10       | 44      | 49        | 77        | 86         |
| [Bloomz-3b-retriever](https://huggingface.co/cmarkea/bloomz-3b-retriever)                           | 9        | 38      | 50        | 78        | 87         |
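
As a sketch of how such figures can be obtained (a hypothetical helper, not the evaluation code actually used), the Top-mean, Top-std, and Top-k percentages can be computed from the rank of the correct context returned for each query:

```python
import numpy as np

def retrieval_metrics(ranks):
    """Compute Top-mean, Top-std and Top-k accuracies from the rank
    (1 = best) of the correct context for each query."""
    ranks = np.asarray(ranks)
    return {
        "top_mean": float(ranks.mean()),
        "top_std": float(ranks.std()),
        "top_1": float((ranks <= 1).mean() * 100),
        "top_5": float((ranks <= 5).mean() * 100),
        "top_10": float((ranks <= 10).mean() * 100),
    }

# Hypothetical ranks for 5 queries:
print(retrieval_metrics([1, 3, 1, 12, 2]))
```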

How to Use Bloomz-560m-retriever
--------------------------------

The following example uses the pipeline API of the Transformers library.
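
A minimal sketch is shown below, using the `feature-extraction` pipeline to embed a query and candidate contexts, then ranking contexts by distance to the query. The mean pooling of token embeddings is an assumption for illustration; check the model card for the exact pooling strategy.

```python
import numpy as np

def pool(token_embeddings):
    # Mean-pool token embeddings into a single vector.
    # Assumption: mean pooling; the model may favor a different strategy.
    return np.asarray(token_embeddings).mean(axis=0)

def rank_contexts(query_emb, context_embs):
    # Rank candidate contexts by Euclidean distance to the query, closest first.
    dists = [float(np.linalg.norm(query_emb - c)) for c in context_embs]
    return sorted(range(len(dists)), key=dists.__getitem__)

if __name__ == "__main__":
    # Imported here so the helpers above stay dependency-free.
    from transformers import pipeline

    retriever = pipeline("feature-extraction",
                         model="cmarkea/bloomz-560m-retriever")

    query = "Quelle est la capitale de la France ?"
    contexts = [
        "Paris is the capital and most populous city of France.",
        "The Amazon rainforest spans much of the Amazon basin.",
    ]

    # The pipeline returns a [1, n_tokens, hidden_size] nested list per input.
    query_emb = pool(retriever(query)[0])
    context_embs = [pool(retriever(ctx)[0]) for ctx in contexts]

    # Index of the best-matching context comes first.
    print(rank_contexts(query_emb, context_embs))
```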