# MSMARCO Models
[MS MARCO](https://microsoft.github.io/msmarco/) is a large-scale information retrieval corpus built from real user search queries issued to the Bing search engine. The provided models can be used for semantic search: given keywords, a search phrase, or a question, the model finds passages that are relevant to the search query.
The training data consists of over 500k examples, while the complete corpus consists of over 8.8 million passages.
## Version History
As we work on the topic, we will publish updated (and improved) models.
### v1
Version 1 models were trained on the training set of the MS MARCO passage retrieval task. They were trained with in-batch negative sampling via the MultipleNegativesRankingLoss, using a scaling factor of 20 and a batch size of 128.
They can be used like this:
```python
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('distilroberta-base-msmarco-v1')
# Queries are prefixed with '[QRY] ' and passages with '[DOC] ' before encoding.
query_embedding = model.encode('[QRY] ' + 'How big is London')
passage_embedding = model.encode('[DOC] ' + 'London has 9,787,426 inhabitants at the 2011 census')
# Cosine similarity between the query and passage embeddings.
print("Similarity:", util.pytorch_cos_sim(query_embedding, passage_embedding))
```
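To make the in-batch negative sampling objective mentioned above more concrete, here is a minimal pure-Python sketch of the idea. It is not the library's implementation: it assumes cosine similarity as the scoring function, and the helper names (`cosine`, `in_batch_negatives_loss`) and the toy 2-dimensional vectors are hypothetical. For each query, the passage at the same batch index is treated as the positive and all other passages in the batch as negatives. The similarities are multiplied by the scaling factor (20) and fed into a softmax cross-entropy.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def in_batch_negatives_loss(query_vecs, passage_vecs, scale=20.0):
    """Mean cross-entropy where passage i is the positive for query i
    and every other passage in the batch acts as a negative."""
    losses = []
    for i, q in enumerate(query_vecs):
        logits = [scale * cosine(q, p) for p in passage_vecs]
        # Numerically stable log-sum-exp over all passages in the batch.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

# Toy embeddings, made up for illustration only.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # passage i matches query i
swapped = [[0.1, 0.9], [0.9, 0.1]]   # passage i matches the other query

loss_aligned = in_batch_negatives_loss(queries, aligned)
loss_swapped = in_batch_negatives_loss(queries, swapped)
print("aligned:", loss_aligned, "swapped:", loss_swapped)
```

The loss is close to zero when each query is most similar to its own passage and grows large when the pairing is wrong. With a batch size of 128, each query would see 127 in-batch negatives under this scheme.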
**Models**:
- **distilroberta-base-msmarco-v1** - Performance on the MS MARCO dev dataset (queries.dev.small.tsv): MRR@10 of 23.28
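
For readers unfamiliar with the metric, MRR@10 (mean reciprocal rank at cutoff 10) can be sketched as follows. This is an illustrative implementation, not the official MS MARCO evaluation script. The function name `mrr_at_k` and the query and passage IDs are hypothetical. For each query, the score is 1 divided by the rank of the first relevant passage within the top 10 results, or 0 if none appears. The scores are then averaged over all queries.

```python
def mrr_at_k(ranked_ids_per_query, relevant_ids_per_query, k=10):
    """Mean reciprocal rank of the first relevant passage within the top k."""
    total = 0.0
    for qid, ranked in ranked_ids_per_query.items():
        relevant = relevant_ids_per_query.get(qid, set())
        for rank, pid in enumerate(ranked[:k], start=1):
            if pid in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_ids_per_query)

# Made-up rankings and relevance labels for three queries.
rankings = {"q1": ["p3", "p1", "p7"], "q2": ["p2", "p9"], "q3": ["p5", "p6"]}
relevant = {"q1": {"p1"}, "q2": {"p2"}, "q3": {"p8"}}

score = mrr_at_k(rankings, relevant)
print("MRR@10:", score)  # (1/2 + 1/1 + 0) / 3 = 0.5
```

This sketch returns a value between 0 and 1. The figure reported above is presumably the same quantity expressed on a 0 to 100 scale.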