# DPR-Models
In [Dense Passage Retrieval for Open-Domain Question Answering](https://arxiv.org/abs/2004.04906) Karpukhin et al. trained models based on [Google's Natural Questions dataset](https://ai.google.com/research/NaturalQuestions):
- **facebook-dpr-ctx_encoder-single-nq-base**
- **facebook-dpr-question_encoder-single-nq-base**
They also trained models on the combination of Natural Questions, TriviaQA, WebQuestions, and CuratedTREC:
- **facebook-dpr-ctx_encoder-multiset-base**
- **facebook-dpr-question_encoder-multiset-base**
There is one model to encode passages and another to encode questions / queries.
## Usage
To encode paragraphs, you need to provide a title (e.g. the Wikipedia article title) and the text passage, separated by a `[SEP]` token. Paragraphs are encoded with the **ctx_encoder**.
Queries are encoded with the **question_encoder**:
```python
from sentence_transformers import SentenceTransformer, util

# The context encoder embeds "title [SEP] passage" strings
passage_encoder = SentenceTransformer('facebook-dpr-ctx_encoder-single-nq-base')

passages = [
    "London [SEP] London is the capital and largest city of England and the United Kingdom.",
    "Paris [SEP] Paris is the capital and most populous city of France.",
    "Berlin [SEP] Berlin is the capital and largest city of Germany by both area and population."
]
passage_embeddings = passage_encoder.encode(passages)

# The question encoder embeds the queries into the same vector space
query_encoder = SentenceTransformer('facebook-dpr-question_encoder-single-nq-base')

query = "What is the capital of England?"
query_embedding = query_encoder.encode(query)

# Important: You must use dot-product, not cosine similarity
scores = util.dot_score(query_embedding, passage_embeddings)
print("Scores:", scores)
```
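For retrieving the best matches from a larger passage collection, `util.semantic_search` returns the top-k hits per query. A minimal sketch, reusing `query_embedding`, `passage_embeddings`, and `passages` from the snippet above and passing `score_function=util.dot_score` to keep dot-product scoring (the default is cosine similarity):

```python
from sentence_transformers import util

# Rank the passages for the query; score_function=util.dot_score
# overrides the cosine-similarity default
hits = util.semantic_search(query_embedding, passage_embeddings,
                            top_k=3, score_function=util.dot_score)

# semantic_search returns one result list per query
for hit in hits[0]:
    print(f"{hit['score']:.2f}\t{passages[hit['corpus_id']]}")
```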
**Important note:** When you use these models, you have to score with dot-product (e.g. as implemented in `util.dot_score`) and not with cosine similarity: the models were trained with a dot-product objective, and their embeddings are not normalized, so cosine similarity can change the ranking.
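As a quick sanity check, you can compare both scoring functions side by side on the embeddings from the snippet above; the values differ, and the ranking may too:

```python
from sentence_transformers import util

# Cosine similarity normalizes the embeddings first, discarding the
# vector magnitudes that the dot-product training objective relies on
print("Dot-product:", util.dot_score(query_embedding, passage_embeddings))
print("Cosine:     ", util.cos_sim(query_embedding, passage_embeddings))
```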