metadata
inference: false
language: en
license:
  - cc0-1.0
library_name: txtai
tags:
  - sentence-similarity
datasets:
  - arxiv_dataset

arXiv txtai embeddings index

This is a txtai embeddings index for the arXiv dataset metadata.

txtai must be installed to use this model.

Example

This index can be loaded from the Hugging Face Hub with txtai as shown below.

from txtai.embeddings import Embeddings

# Load the index from the HF Hub
embeddings = Embeddings()
embeddings.load(provider="huggingface-hub", container="neuml/txtai-arxiv")

# Search for papers matching a query
embeddings.search("Survey of vector databases")

# Search for papers matching an abstract
embeddings.search("""
Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the
same architecture as Mistral 7B, with the difference that each layer is composed
of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router
network selects two experts to process the current state and combine their outputs.
""")

embeddings.search("""
Humanity has wondered whether we are alone for millennia. The discovery of life
elsewhere in the Universe, particularly intelligent life, would have profound effects,
comparable to those of recognizing that the Earth is not the center of the Universe
and that humans evolved from previous species.
""")

embeddings.search("""
The main objective of this paper is to investigate the extent to which the margin of
victory can be predicted solely by the rankings of the opposing teams in NCAA
Division I men's basketball games.
""")

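The search calls above return a list of results. In txtai, results are (id, score) tuples when an index does not store content, and dicts with id, text and score keys when it does. The sketch below normalizes both shapes so downstream code can treat them the same way. The helper name normalize_results and the sample rows are hypothetical and only for illustration; they are not part of txtai or of this index.

```python
def normalize_results(results):
    """Convert txtai-style search results into dicts with id, score and text keys."""
    normalized = []
    for result in results:
        if isinstance(result, dict):
            # Content-enabled index: rows are dicts
            normalized.append({
                "id": result.get("id"),
                "score": result.get("score"),
                "text": result.get("text"),
            })
        else:
            # Index without stored content: rows are (id, score) tuples
            uid, score = result
            normalized.append({"id": uid, "score": score, "text": None})
    return normalized


# Made-up rows standing in for the output of embeddings.search(...)
sample = [
    {"id": "example-1", "text": "Example paper title", "score": 0.9},
    ("example-2", 0.5),
]

for row in normalize_results(sample):
    print(row["id"], row["score"], row["text"])
```

With a loaded index, the same helper could be applied as normalize_results(embeddings.search("Survey of vector databases")).
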
Use Cases

An embeddings index generated by txtai is a fully encapsulated index format. It doesn't require a database server or dependencies outside of the Python install.

The arXiv index works well as a fact-based context source for retrieval-augmented generation (RAG). In other words, search results from this index can be inserted into LLM prompts as the context for answering questions.
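
As a rough illustration of the RAG use case, the sketch below turns search results into a prompt string. It assumes each result is a dict with a text key, which is how txtai returns rows from an index that stores content. The function name build_prompt, the prompt wording and the limit parameter are hypothetical choices, not something defined by txtai or this index.

```python
def build_prompt(question, results, limit=3):
    """Build an LLM prompt that uses the top search results as context."""
    # Assumption: each result is a dict with a "text" key
    context = "\n".join(f"- {row['text']}" for row in results[:limit])
    return (
        "Answer the following question using only the context below.\n\n"
        f"Question: {question}\n\n"
        f"Context:\n{context}"
    )


# Usage, assuming `embeddings` was loaded as in the Example section:
# question = "What is a sparse mixture of experts?"
# prompt = build_prompt(question, embeddings.search(question, 3))
```

The resulting string can be passed to whichever LLM is in use; keeping prompt construction separate from the search call makes it easy to swap either side.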

Additionally, this index can help identify articles to cite in research. Searching with a title and abstract as the query returns similar existing articles.
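
One possible way to shape a title and abstract into a query and trim the results is sketched below. Both helper names and the 0.5 score cutoff are hypothetical; a suitable threshold depends on the data and should be tuned. Results are assumed to be dicts with a score key.

```python
def citation_query(title, abstract):
    """Join a title and abstract into a single query string."""
    # Collapse line breaks and repeated whitespace in the abstract
    return f"{title.strip()}\n{' '.join(abstract.split())}"


def citation_candidates(results, min_score=0.5):
    """Keep results scoring at or above min_score, highest score first."""
    kept = [row for row in results if row["score"] >= min_score]
    return sorted(kept, key=lambda row: row["score"], reverse=True)


# Usage, assuming `embeddings` was loaded as in the Example section:
# query = citation_query("My paper title", "My paper abstract ...")
# candidates = citation_candidates(embeddings.search(query, 10))
```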

Build the index

The following steps show how to build this index.

  • Install required build dependencies
pip install txtchat datasets
  • Follow these instructions to download the dataset

  • Build txtai-arxiv index

python -m txtchat.data.arxiv.index \
       -d <path to directory with file downloaded in previous step> \
       -o txtai-arxiv

More information

See the following links for more information on the arXiv metadata dataset.