# MSMARCO Models

[MS MARCO](https://microsoft.github.io/msmarco/) is a large-scale information retrieval corpus created from real user search queries issued to the Bing search engine. The provided models can be used for semantic search: given keywords, a search phrase, or a question, the model finds passages that are relevant for the search query.

The training data consists of over 500k examples, while the complete corpus contains over 8.8 million passages.

## Version History

As we work on the topic, we will publish updated (and improved) models.

### v1

Version 1 models were trained on the training set of the MS MARCO Passage Retrieval task. The models were trained with in-batch negative sampling via MultipleNegativesRankingLoss, using a scaling factor of 20 and a batch size of 128.

They can be used like this:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('distilroberta-base-msmarco-v1')

# Queries are prefixed with '[QRY] ', passages with '[DOC] '
query_embedding = model.encode('[QRY] ' + 'How big is London')
passage_embedding = model.encode('[DOC] ' + 'London has 9,787,426 inhabitants at the 2011 census')

print("Similarity:", util.pytorch_cos_sim(query_embedding, passage_embedding))
```

**Models**:
- **distilroberta-base-msmarco-v1** - Performance on the MS MARCO dev dataset (queries.dev.small.tsv): MRR@10: 23.28
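
For reference, a minimal sketch of the in-batch negative training setup described above, using MultipleNegativesRankingLoss from sentence-transformers with a scale of 20 and a batch size of 128. The training examples, base model name, number of epochs, and warmup steps below are illustrative assumptions, not the exact training script used for the released models:

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Hypothetical (query, relevant passage) pairs standing in for the MS MARCO training data.
# Queries use the '[QRY] ' prefix and passages the '[DOC] ' prefix, as in the usage example.
train_examples = [
    InputExample(texts=['[QRY] how big is london',
                        '[DOC] London has 9,787,426 inhabitants at the 2011 census.']),
    InputExample(texts=['[QRY] what is the capital of france',
                        '[DOC] Paris is the capital of France.']),
]

# Assumed base model; the released model started from a DistilRoBERTa checkpoint
model = SentenceTransformer('distilroberta-base')

# With MultipleNegativesRankingLoss, every other passage in the batch acts as a negative
# for a given query (in-batch negative sampling); scale=20 is the scaling factor from above
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=128)
train_loss = losses.MultipleNegativesRankingLoss(model, scale=20)

# Epochs and warmup steps are placeholder values
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=1000)
```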
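
For clarity, MRR@10 (the reported metric) is the mean, over all dev queries, of the reciprocal rank of the first relevant passage within the top-10 retrieved results. The function and argument names below are hypothetical and only illustrate the computation:

```python
def mrr_at_10(ranked_passage_ids, relevant_ids):
    """ranked_passage_ids: passage ids ordered by retrieval score (best first);
    relevant_ids: set of passage ids judged relevant for this query."""
    for rank, pid in enumerate(ranked_passage_ids[:10], start=1):
        if pid in relevant_ids:
            return 1.0 / rank
    return 0.0  # no relevant passage in the top 10

# Averaging this value over all queries in queries.dev.small.tsv yields the reported MRR@10.
```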