endre sukosd committed
Commit 1565c8a · 1 Parent(s): 3992084

Add precalculated embeddings data files tracked with git-lfs

Files changed (4)
  1. .gitattributes +0 -2
  2. .gitignore +0 -1
  3. README.md +2 -0
  4. src/app.py +6 -4
.gitattributes CHANGED
@@ -25,5 +25,3 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zstandard filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
-data/processed/shortened_abstracts_hu_2021_09_01.txt filter=lfs diff=lfs merge=lfs -text
-data/processed/shortened_abstracts_hu_2021_09_01_embedded.pt filter=lfs diff=lfs merge=lfs -text
.gitignore CHANGED
@@ -1,6 +1,5 @@
 # Custom
 hf_venv/
-data/
 *.DS_Store
 
 # Created by https://www.toptal.com/developers/gitignore/api/visualstudiocode,python,jupyternotebooks,venv
README.md CHANGED
@@ -61,6 +61,8 @@ Model facts:
 
 To reproduce the precalculated embeddings, use the notebook in `notebooks/QA_retrieval_precalculate_embeddings.ipynb`, with a GPU in Google Colab.
 
+Known bug: the precalculated embeddings contain an extra random tensor at the beginning, bringing the total size to 466529 (one more than the number of raw sentences). This is corrected by subtracting 1 from the index of the most similar embedding to find the corresponding raw sentence.
+
 ## Search top-k matches
 
 Finally, having all the precalculated embeddings, we can implement semantic search (dense retrieval): we encode the search query into vector space and retrieve the document embeddings that are closest in vector space (using cosine similarity). By default, the top 5 most similar Wikipedia abstracts are returned. See the main script `src/main_qa.py`.
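The index shift from the known-bug note, combined with the top-k retrieval described in the README, can be sketched as follows. This is a minimal illustration assuming torch tensors for the query and corpus embeddings; `top_k_matches` is a hypothetical helper name, not the repository's API.

```python
import torch
import torch.nn.functional as F

def top_k_matches(query_embedding: torch.Tensor,
                  corpus_embeddings: torch.Tensor,
                  sentences: list[str],
                  k: int = 5) -> list[dict]:
    # Cosine similarity between the query (1, dim) and every
    # precalculated abstract embedding (N, dim) -> (N,) scores.
    scores = F.cosine_similarity(query_embedding, corpus_embeddings)
    top = torch.topk(scores, k)
    results = []
    for score, idx in zip(top.values.tolist(), top.indices.tolist()):
        # Known-bug workaround from the README: the embedding matrix has
        # one extra leading row, so shift the index back by one to reach
        # the corresponding raw sentence.
        sentence_idx = idx - 1
        if 0 <= sentence_idx < len(sentences):
            results.append({'score': score, 'text': sentences[sentence_idx]})
    return results
```

Note that `src/app.py` (below) takes a simpler route in this commit: it guards the index range instead of shifting it.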
src/app.py CHANGED
@@ -26,7 +26,8 @@ def findTopKMostSimilar(query_embedding, embeddings, all_sentences, k):
     cosine_scores_list = cosine_scores.squeeze().tolist()
     pairs = []
     for idx,score in enumerate(cosine_scores_list):
-        pairs.append({'index': idx, 'score': score, 'text': all_sentences[idx]})
+        if idx < len(all_sentences):
+            pairs.append({'score': score, 'text': all_sentences[idx]})
     pairs = sorted(pairs, key=lambda x: x['score'], reverse=True)
     return pairs[0:k]
 
@@ -49,10 +50,11 @@ embeddings_file = 'data/processed/shortened_abstracts_hu_2021_09_01_embedded.pt'
 all_embeddings = load_embeddings(embeddings_file)
 
 
-st.text('Search Wikipedia abstracts in Hungarian - Input some search term and see the top-5 most similar wikipedia abstracts')
-st.text('Wikipedia absztrakt kereső - adjon meg egy tetszőleges kifejezést és a rendszer visszaadja az 5 hozzá legjobban hasonlító Wikipedia absztraktot')
+st.header('Wikipedia absztrakt kereső')
+st.subheader('Search Wikipedia abstracts in Hungarian')
+st.caption('Input some search term and see the top-5 most similar wikipedia abstracts')
 
-input_query = st.text_area("Hol élnek a bengali tigrisek?")
+input_query = st.text_area("Adjon meg egy tetszőleges kifejezést és a rendszer visszaadja az 5 hozzá legjobban hasonlító Wikipedia absztraktot", value='Hol élnek a bengali tigrisek?')
 
 if input_query:
     query_embedding = calculateEmbeddings([input_query],tokenizer,model)
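Taken together, the two hunks give roughly the following picture of the updated script. This is a reconstruction for readability, assuming a torch-based cosine-score line for the part not shown in the hunk; model loading and the query handling below the hunk are omitted, and it is not the verbatim file.

```python
import torch
import streamlit as st

def findTopKMostSimilar(query_embedding, embeddings, all_sentences, k):
    # Assumed implementation of the score line that sits above the hunk.
    cosine_scores = torch.nn.functional.cosine_similarity(query_embedding, embeddings)
    cosine_scores_list = cosine_scores.squeeze().tolist()
    pairs = []
    for idx, score in enumerate(cosine_scores_list):
        # New guard: the embedding tensor has one surplus row, so skip
        # any index that has no matching raw sentence.
        if idx < len(all_sentences):
            pairs.append({'score': score, 'text': all_sentences[idx]})
    pairs = sorted(pairs, key=lambda x: x['score'], reverse=True)
    return pairs[0:k]

# The two st.text() lines are replaced by a header/subheader/caption trio.
st.header('Wikipedia absztrakt kereső')  # "Wikipedia abstract search"
st.subheader('Search Wikipedia abstracts in Hungarian')
st.caption('Input some search term and see the top-5 most similar wikipedia abstracts')

# The Hungarian label reads "enter any phrase and the system returns the
# 5 most similar Wikipedia abstracts"; the old example question
# "Hol élnek a bengali tigrisek?" ("Where do Bengal tigers live?") moves
# from being the label to being the default value.
input_query = st.text_area(
    'Adjon meg egy tetszőleges kifejezést és a rendszer visszaadja az 5 '
    'hozzá legjobban hasonlító Wikipedia absztraktot',
    value='Hol élnek a bengali tigrisek?')
```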