# Retrieve & Re-Rank Demo over Simple Wikipedia

This example demonstrates the Retrieve & Re-Rank setup and lets you search over [Simple Wikipedia](https://simple.wikipedia.org/wiki/Main_Page).

You can input a query or a question. The script then uses semantic search
to find relevant passages in Simple English Wikipedia (chosen because it is smaller than the full English Wikipedia and fits more easily in RAM).

For semantic search, we use `SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')` and retrieve
32 passages that potentially answer the input query.

Next, we use a more powerful CrossEncoder (`cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')`) that
scores each retrieved passage against the query for relevance. The cross-encoder further boosts performance,
especially when you search over a corpus that the bi-encoder was not trained on.
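
The two-stage idea can be illustrated without any models. The following is a minimal sketch in which `cheap_score` and `expensive_score` are hypothetical stand-ins for the bi-encoder and the cross-encoder (they are simple word-overlap heuristics, not the real models), and the four-passage corpus is made up for illustration. It only shows the control flow: a fast scorer narrows the corpus to `top_k` candidates, then a slower scorer re-orders just those candidates.

```python
import string
from typing import Callable, List, Tuple

Scorer = Callable[[str, str], float]


def tokens(text: str) -> set:
    """Lower-case, split on whitespace, strip surrounding punctuation."""
    words = (w.strip(string.punctuation) for w in text.lower().split())
    return {w for w in words if w}


def cheap_score(query: str, passage: str) -> float:
    """Hypothetical stand-in for the bi-encoder: number of shared words."""
    return float(len(tokens(query) & tokens(passage)))


def expensive_score(query: str, passage: str) -> float:
    """Hypothetical stand-in for the cross-encoder: looks at query and passage jointly."""
    if query.lower() in passage.lower():
        return 1.0  # the exact query phrase occurs in the passage
    q, p = tokens(query), tokens(passage)
    return len(q & p) / len(q | p)  # otherwise fall back to Jaccard overlap


def retrieve(query: str, corpus: List[str], scorer: Scorer, top_k: int) -> List[int]:
    """Stage 1: score every passage with the fast scorer, keep the best top_k ids."""
    ranked = sorted(range(len(corpus)), key=lambda i: scorer(query, corpus[i]), reverse=True)
    return ranked[:top_k]


def rerank(query: str, corpus: List[str], candidates: List[int], scorer: Scorer) -> List[Tuple[int, float]]:
    """Stage 2: score only the candidates with the slow scorer and sort by that score."""
    scored = [(i, scorer(query, corpus[i])) for i in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy corpus, invented for this sketch
corpus = [
    "Capital punishment is legal in many of the United States.",
    "Washington, D.C. is the capital of the United States.",
    "Ohio is one of the states in the United States.",
    "Cats usually live for 13 to 20 years.",
]
query = "capital of the United States"

candidates = retrieve(query, corpus, cheap_score, top_k=3)
reranked = rerank(query, corpus, candidates, expensive_score)

print("Stage 1 candidates:", candidates)
for corpus_id, score in reranked:
    print("{:.3f}\t{}".format(score, corpus[corpus_id]))
```

In this toy run the first stage cannot separate the first two passages (both share five words with the query), while the second stage moves the passage that actually answers the question to the top.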


In [1]:
!pip install -U sentence-transformers rank_bm25

Requirement already up-to-date: sentence-transformers in /opt/conda/lib/python3.8/site-packages (2.0.0)
Requirement already up-to-date: rank_bm25 in /opt/conda/lib/python3.8/site-packages (0.2.1)


In [2]:
import json
from sentence_transformers import SentenceTransformer, CrossEncoder, util
import gzip
import os
import torch

if not torch.cuda.is_available():
    print("Warning: No GPU found. Please add a GPU to your notebook")


#We use the Bi-Encoder to encode all passages, so that we can use it with semantic search
bi_encoder = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
bi_encoder.max_seq_length = 256     #Truncate long passages to 256 tokens
top_k = 32                          #Number of passages we want to retrieve with the bi-encoder

#The bi-encoder will retrieve top_k (32) passages. We use a cross-encoder to re-rank the results list to improve the quality
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

# As dataset, we use Simple English Wikipedia. Compared to the full English wikipedia, it has only
# about 170k articles. We split these articles into paragraphs and encode them with the bi-encoder

wikipedia_filepath = 'simplewiki-2020-11-01.jsonl.gz'

if not os.path.exists(wikipedia_filepath):
    util.http_get('http://sbert.net/datasets/simplewiki-2020-11-01.jsonl.gz', wikipedia_filepath)

passages = []
with gzip.open(wikipedia_filepath, 'rt', encoding='utf8') as fIn:
    for line in fIn:
        data = json.loads(line.strip())

        #Add all paragraphs
        #passages.extend(data['paragraphs'])

        #Only add the first paragraph
        passages.append(data['paragraphs'][0])

print("Passages:", len(passages))

# We encode all passages into our vector space. This takes about 5 minutes (depends on your GPU speed)
corpus_embeddings = bi_encoder.encode(passages, convert_to_tensor=True, show_progress_bar=True)


Passages: 169597
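
Once the passages are embedded, retrieval is a nearest-neighbour lookup in the vector space. The sketch below shows the idea with plain Python lists and cosine similarity; it is a conceptual illustration under that assumption, not the library's implementation. The function name `top_k_cosine` and the 2-dimensional toy vectors are made up for this example, and the returned dictionaries only mirror the `corpus_id` and `score` keys that the hits in this notebook carry.

```python
import math
from typing import Dict, List


def normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length so that a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k_cosine(query_vec: List[float], corpus_vecs: List[List[float]], k: int) -> List[Dict]:
    """Return the k corpus entries with the highest cosine similarity to the query."""
    q = normalize(query_vec)
    hits = []
    for idx, vec in enumerate(corpus_vecs):
        d = normalize(vec)
        score = sum(a * b for a, b in zip(q, d))
        hits.append({'corpus_id': idx, 'score': score})
    hits.sort(key=lambda hit: hit['score'], reverse=True)
    return hits[:k]


# Toy 2-dimensional "embeddings", invented for this sketch
toy_corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
toy_query = [2.0, 1.0]

toy_hits = top_k_cosine(toy_query, toy_corpus, k=2)
for hit in toy_hits:
    print("{:.3f}\t{}".format(hit['score'], hit['corpus_id']))
```

A real setup would do the same computation as one batched matrix product on the GPU instead of a Python loop.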



In [4]:
# We also compare the results to lexical search (keyword search). Here, we use 
# the BM25 algorithm which is implemented in the rank_bm25 package.

from rank_bm25 import BM25Okapi
from sklearn.feature_extraction import _stop_words
import string
from tqdm.autonotebook import tqdm
import numpy as np


# We lower case our text and remove stop-words from indexing
def bm25_tokenizer(text):
    tokenized_doc = []
    for token in text.lower().split():
        token = token.strip(string.punctuation)

        if len(token) > 0 and token not in _stop_words.ENGLISH_STOP_WORDS:
            tokenized_doc.append(token)
    return tokenized_doc


tokenized_corpus = []
for passage in tqdm(passages):
    tokenized_corpus.append(bm25_tokenizer(passage))

bm25 = BM25Okapi(tokenized_corpus)
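
To make the lexical baseline less of a black box, here is a minimal sketch of BM25-style scoring over an already tokenized corpus. It is an illustration only: the function name `bm25_sketch`, the parameter values `k1=1.5` and `b=0.75`, and the non-negative IDF variant are assumptions for this example and may differ from what `BM25Okapi` does internally. The three-document corpus is also made up.

```python
import math
from collections import Counter
from typing import List


def bm25_sketch(query_tokens: List[str], tokenized_docs: List[List[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every document against the query with a textbook BM25 formula."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))

    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            # Rare terms get a larger weight (assumed non-negative IDF variant)
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Term frequency saturates and is normalized by document length
            tf = term_freq[term]
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


# Toy tokenized corpus, invented for this sketch
toy_docs = [
    ["ohio", "capital", "columbus"],
    ["nevada", "capital", "carson", "city"],
    ["cats", "live", "years"],
]
toy_scores = bm25_sketch(["capital", "columbus"], toy_docs)
print(["{:.3f}".format(s) for s in toy_scores])
```

The first document matches both query terms, including the rarer one, so it scores highest; the third shares no terms with the query and scores zero. Because the score depends only on exact token matches, a query word used in a different sense can still rank a passage highly.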



In [5]:
# This function will search all wikipedia articles for passages that
# answer the query
def search(query):
    print("Input question:", query)

    ##### BM25 search (lexical search) #####
    bm25_scores = bm25.get_scores(bm25_tokenizer(query))
    top_n = np.argpartition(bm25_scores, -5)[-5:]
    bm25_hits = [{'corpus_id': idx, 'score': bm25_scores[idx]} for idx in top_n]
    bm25_hits = sorted(bm25_hits, key=lambda x: x['score'], reverse=True)

    print("Top-3 lexical search (BM25) hits")
    for hit in bm25_hits[0:3]:
        print("\t{:.3f}\t{}".format(hit['score'], passages[hit['corpus_id']].replace("\n", " ")))

    ##### Semantic Search #####
    # Encode the query using the bi-encoder and find potentially relevant passages
    question_embedding = bi_encoder.encode(query, convert_to_tensor=True)
    # Move the query to the same device as the corpus embeddings (works with or without a GPU)
    question_embedding = question_embedding.to(corpus_embeddings.device)
    hits = util.semantic_search(question_embedding, corpus_embeddings, top_k=top_k)
    hits = hits[0]  # Get the hits for the first query

    ##### Re-Ranking #####
    # Now, score all retrieved passages with the cross_encoder
    cross_inp = [[query, passages[hit['corpus_id']]] for hit in hits]
    cross_scores = cross_encoder.predict(cross_inp)

    # Attach the cross-encoder score to each hit
    for idx in range(len(cross_scores)):
        hits[idx]['cross-score'] = cross_scores[idx]

    # Output of top-3 hits from bi-encoder
    print("\n-------------------------\n")
    print("Top-3 Bi-Encoder Retrieval hits")
    hits = sorted(hits, key=lambda x: x['score'], reverse=True)
    for hit in hits[0:3]:
        print("\t{:.3f}\t{}".format(hit['score'], passages[hit['corpus_id']].replace("\n", " ")))

    # Output of top-3 hits from re-ranker
    print("\n-------------------------\n")
    print("Top-3 Cross-Encoder Re-ranker hits")
    hits = sorted(hits, key=lambda x: x['cross-score'], reverse=True)
    for hit in hits[0:3]:
        print("\t{:.3f}\t{}".format(hit['cross-score'], passages[hit['corpus_id']].replace("\n", " ")))
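
The BM25 branch of `search` picks its candidates with `np.argpartition(bm25_scores, -5)[-5:]` and only then sorts them. The short sketch below, using made-up scores and `n = 3`, shows why both steps are needed: `argpartition` finds the indices of the `n` largest values without fully sorting the array, but it returns them in no particular order.

```python
import numpy as np

# Toy scores, invented for this sketch; all values are distinct
scores = np.array([0.1, 2.5, 0.7, 3.2, 1.9, 0.0, 2.8])
n = 3

# Indices of the n largest scores, in arbitrary order
top_unsorted = np.argpartition(scores, -n)[-n:]

# Order just those n indices by descending score
top_sorted = top_unsorted[np.argsort(scores[top_unsorted])[::-1]]

print(sorted(top_unsorted.tolist()))  # which indices were selected
print(top_sorted.tolist())            # the same indices, best first
```

Sorting only the selected handful is cheaper than sorting all scores, which matters when there is one score per passage in the corpus.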


In [6]:
search(query = "What is the capital of the United States?")

Input question: What is the capital of the United States?
Top-3 lexical search (BM25) hits
	13.316	Capital punishment (the death penalty) has existed in the United States since before the United States was a country. As of 2017, capital punishment is legal in 30 of the 50 states. The federal government (including the United States military) also uses capital punishment.
	11.434	Ohio is one of the 50 states in the United States. Its capital is Columbus. Columbus also is the largest city in Ohio.
	11.179	Nevada is one of the United States' states. Its capital is Carson City. Other big cities are Las Vegas and Reno.

-------------------------

Top-3 Bi-Encoder Retrieval hits
	0.622	Cities in the United States:
	0.597	The United States Capitol is the building where the United States Congress meets. It is the center of the legislative branch of the U.S. federal government. It is in Washington, D.C., on top of Capitol Hill at the east end of the National Mall.
	0.596	In the United States:

-

In [7]:
search(query = "What is the best orchestra in the world?")

Input question: What is the best orchestra in the world?
Top-3 lexical search (BM25) hits
	15.328	The BBC Symphony Orchestra is the main orchestra of the British Broadcasting Corporation. It is one of the best orchestras in Britain.
	15.320	The NHK Symphony Orchestra is a Japanese orchestra based in Tokyo, Japan. In Japanese it is written: NHK交響楽団, pronounced: Enueichikei Kōkyō Gakudan. When the orchestra was started in 1926 it was called "New Symphony Orchestra". It was the first large professional orchestra in Japan. Later, it changed its name to "Japan Symphony Orchestra". In 1951 it started to get money from the Japanese radio station NHK (Nippon Hōsō Kyōkai), so it changed its name again to the name it has now. It is thought of as the best orchestra in Japan. They have played in many parts of the world, including at the BBC Proms in London.
	14.079	The Bamberger Symphoniker (Bamberg Symphony Orchestra) is a world-famous orchestra from the city of Bamberg, Germany. It was formed in

In [8]:
search(query = "Number countries Europe")

Input question: Number countries Europe
Top-3 lexical search (BM25) hits
	13.795	Amy MacDonald is a Scottish singer and songwriter. She became famous in 2007 with her first album "This Is The Life" and her first single "Poison Prince". She has become even more successful in Europe since her single "This Is The Life" charted at number 1 in many European countries.
	13.758	The Croatian language is spoken mainly throughout the countries of Croatia and Bosnia and Herzegovina and in the surrounding countries of Europe.
	13.019	Organization for Security and Co-operation in Europe (OSCE) is an international organization for peace and human rights. Presently, it has 57 countries as its members. Most of the member countries of the OSCE are from Europe, the Caucasus, Central Asia and North America.

-------------------------

Top-3 Bi-Encoder Retrieval hits
	0.538	The Council of Europe (, ) is an international organization of 47 member states in the European region. One of its first successes wa

In [9]:
search(query = "When did the cold war end?")

Input question: When did the cold war end?
Top-3 lexical search (BM25) hits
	17.374	The Cold War was the tense relationship between the United States (and its allies), and the Soviet Union (the USSR and its allies) between the end of World War II and the fall of the Soviet Union. It is called the "Cold" War because the US and the USSR never actually fought each other directly. Instead, they opposed each other in conflicts known as proxy wars, where each country chose a side to support.
	17.291	The Reagan Doctrine was a document by the United States under the Reagan Administration. It was about being against the global influence of the Soviet Union during the final years of the Cold War. The doctrine lasted for less than a decade, it was the most important document of United States foreign policy from the early 1980s until the end of the Cold War in 1991.
	15.420	Cold Norton is a village and civil parish in Maldon District, Essex, England. In 2001 there were 1103 people living in Cold N

In [10]:
search(query = "How long do cats live?")

Input question: How long do cats live?
Top-3 lexical search (BM25) hits
	22.997	Reliable information on the lifespans of house cats is hard to find. However, research has been done to get an estimate (an educated guess) on how long cats usually live. Cats usually live for 13 to 20 years. Sometimes cats can live for 22 to 30 years but there are claims of cats dying at ages of more than 30 years old.
	16.974	The sabertoothed cats or sabretooth cats are some of the best known and most popular extinct animals. They are among the most impressive carnivores that ever have lived. These cats had long canines and jaws which opened wider than modern cats. This suggests a different style of killing from modern felines.
	16.490	The Cyprus cat is a breed of cat. These cats are thought to have first come from ancient Egypt or Palestine. They were brought to the island of Cyprus by St. Helen. These are now common domestic cats that live in homes or outside. Many of these cats still live all over Cypr

In [11]:
search(query = "How many people live in Toronto?")

Input question: How many people live in Toronto?
Top-3 lexical search (BM25) hits
	15.978	Markham, Ontario is a city in Regional Municipality of York, in the Greater Toronto Area of Southern Ontario, Canada. There are twice as many people there as in 1990. 261,573 people live in Markham. It is the 4th largest town in the Greater Toronto Area, after Toronto, Mississauga, and Brampton.
	11.299	The Toronto Zoo is a zoo in Toronto, Ontario, Canada. With , the Toronto Zoo is the largest zoo in Canada.
	10.679	Denzil Minnan-Wong (; born ) is a Canadian politician. He is a Toronto city councillor. He is the person that represents Ward 16, an area of Toronto. He is a chairperson of the Employee and Labour Relations Committee in Toronto's municipal government and is also the deputy mayor of Toronto. He is also part of the board of the Toronto Transit Commission and the Toronto Hydro.

-------------------------

Top-3 Bi-Encoder Retrieval hits
	0.604	Vaughan is a city in Ontario, Canada, 335,000

In [12]:
search(query = "Oldest US president")

Input question: Oldest US president
Top-3 lexical search (BM25) hits
	11.010	Glafcos Ioannou Clerides (; 24 April 1919 – 15 November 2013) was a Greek-Cypriot politician. He was the fourth President of Cyprus. He was the oldest living former President of the Republic of Cyprus.
	9.237	José Celso de Mello Filho (Tatuí, November 1, 1945), is a Brazilian jurist. He is the oldest member of the Supreme Federal Court of Brazil. He was nominated by President José Sarney in 1989.
	8.872	USS "Constitution" is a wooden, three-masted heavy frigate of the United States Navy. Named by President George Washington after the Constitution of the United States of America, she is the world's oldest commissioned naval vessel afloat.

-------------------------

Top-3 Bi-Encoder Retrieval hits
	0.645	William Henry Harrison (February 9, 1773 – April 4, 1841) was the 9th President of the United States. His nickname was "Old Tippecanoe " and he was a well-respected war veteran. Harrison served the shortest ter

In [13]:
search(query = "Coldest place earth")

Input question: Coldest place earth
Top-3 lexical search (BM25) hits
	24.891	East Antarctica, also called Greater Antarctica, is the largest part (two-thirds) of the Antarctic continent. It is on the Indian Ocean side of the Transantarctic Mountains. It is the coldest, windiest, and driest part of Earth. East Antarctica holds the record as the coldest place on earth.
	12.650	Earth Day is a day that is supposed to inspire more awareness and appreciation for the Earth's natural environment. It takes place each year on April 22. It now takes place in more than 193 countries around the world. During Earth Day, the world encourages everyone to turn off all unwanted lights.
	12.172	Heinrich events occurred during the coldest point of "Bond Cycles" in which many icebergs were discharged into the North Atlantic and melted.

-------------------------

Top-3 Bi-Encoder Retrieval hits
	0.633	East Antarctica, also called Greater Antarctica, is the largest part (two-thirds) of the Antarctic contine

In [14]:
search(query = "Elon Musk year birth")

Input question: Elon Musk year birth
Top-3 lexical search (BM25) hits
	23.364	Tesla, Inc. is a company based in Palo Alto, California which makes electric cars. It was started in 2003 by Martin Eberhard, Dylan Stott, and Elon Musk (who also co-founded PayPal and SpaceX and is the CEO of SpaceX). Eberhard no longer works there. Today, Elon Musk is the Chief Executive Officer (CEO). It started selling its first car, the Roadster in 2008.
	19.943	The Boring Company is a tunnel boring company founded by Elon Musk, who earlier started SpaceX. It aims to reduce traffic congestion in urban areas. It is involved in the building of the Hyperloop in Los Angeles.
	18.392	Elon Reeve Musk (born June 28, 1971) is a businessman and philanthropist. He was born in South Africa. He moved to Canada and later became an American citizen. Musk is the current CEO & Chief Product Architect of Tesla Motors, a company that makes electric vehicles. He is also the CEO of Solar City, a company that makes solar pan

In [15]:
search(query = "Paris eiffel tower")

Input question: Paris eiffel tower
Top-3 lexical search (BM25) hits
	27.300	The Eiffel Tower (French: La Tour Eiffel, ], IPA pronunciation: "EYE-full" English; "eh-FEHL" French) is a landmark in Paris. It was built between 1887 and 1889 for the Exposition Universelle (World Fair). The Tower was the Exposition's main attraction.
	25.263	Paris is a city in the U.S. state of Texas. It is in Lamar County, Texas. It had a population of 25,171 in 2010. It has been called the "Second Largest Paris in the World". It has a replica of the Eiffel Tower.
	24.059	Paris is a city in the U.S. state of Tennessee. It had a population of 25,171 in 2010. It has been called the "World's Biggest Fish Fry". It has a 70-foot replica of the Eiffel Tower.

-------------------------

Top-3 Bi-Encoder Retrieval hits
	0.812	The Eiffel Tower (French: La Tour Eiffel, ], IPA pronunciation: "EYE-full" English; "eh-FEHL" French) is a landmark in Paris. It was built between 1887 and 1889 for the Exposition Universelle 

In [16]:
search(query = "Which US president was killed?")

Input question: Which US president was killed?
Top-3 lexical search (BM25) hits
	10.179	Lyndon Baines Johnson (August 27, 1908 – January 22, 1973) was a member of the Democratic Party and the 36th president of the United States serving from 1963 to 1969. Johnson took over as president when President Kennedy was killed in November 1963. He was then re-elected in the 1964 election.
	10.091	Lech Kaczyński, the fourth President of the Republic of Poland, died on 10 April 2010. He died in a plane crash outside of Smolensk, Russia. The plane was a Tu-154 belonging to the Polish Air Force. The crash killed all 96 on board. His wife, Maria Kaczyńska, was also among those killed.
	9.791	Jacobo Majluta Azar (October 9, 1934 – March 2, 1996) was a Dominican politician. He was Vice President of the Dominican Republic during the Antonio Guzmán Fernández presidency between 1978 to 1982. He became President of the Dominican Republic after Guzmán Fernández killed himself in 1982. He was president for 

In [17]:
search(query="When is Chinese New Year")

Input question: When is Chinese New Year
Top-3 lexical search (BM25) hits
	18.743	Chinese New Year, known in China as the SpringFestival and in Singapore as the LunarNewYear, is a holiday on and around the new moon on the first day of the year in the traditional Chinese calendar. This calendar is based on the changes in the moon and is only sometimes changed to fit the seasons of the year based on how the Earth moves around the sun. Because of this, Chinese New Year is never on January1. It moves around between January21 and February20.
	18.527	New Year in Japan is one of the most important festivals. Unlike the Chinese New Year, it is held on January 1.
	15.789	The CCTV New Year's Gala (Simplified Chinese: 中国中央电视台春节联欢晚会; Traditional Chinese: 中國中央電視台春節聯歡晚會; Pinyin: "Zhōngguó zhōngyāng diànshìtái chūnjié liánhuān wǎnhuì") is a Chinese New Year special produced by China Central Television. It was presented by Zhao Zhongxiang.

-------------------------

Top-3 Bi-Encoder Retrieval hits
	0