nomic-ai/nomic-embed-text-v1

@@ -2663,14 +2663,70 @@ Training data to train the models is released in its entirety. For more details,
 ## Usage
-Note `nomic-embed-text` *requires* prefixes! We support the prefixes `[search_query, search_document, classification, clustering]`.
-For retrieval applications, you should prepend `search_document` for all your documents and `search_query` for your queries.
-For example, you are building a RAG application over the top of Wikipedia. You would embed all Wikipedia articles with the prefix `search_document`
-and any questions you ask with `search_query`. For example:
 ```python
-queries = ["search_query: who is the first president of the united states?", "search_query: when was babe ruth born?"]
-documents = ["search_document: <article about US Presidents>", "search_document: <article about Babe Ruth>"]
 ```
 ### Sentence Transformers

 ## Usage
+**Important**: the text prompt *must* include a *task instruction prefix* that tells the model which task is being performed.
+For example, if you are implementing a RAG application, embed your documents as `search_document: <text here>` and your user queries as `search_query: <text here>`.
+## Task instruction prefixes
+### `search_document`
+#### Purpose: embed texts as documents from a dataset
+This prefix is used for embedding texts as documents, for example as documents for a RAG index.
+```python
+from sentence_transformers import SentenceTransformer
+model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)
+sentences = ['search_document: t-SNE is a dimensionality reduction algorithm created by Laurens van der Maaten']
+embeddings = model.encode(sentences)
+print(embeddings)
+```
+### `search_query`
+#### Purpose: embed texts as questions to answer
+This prefix is used for embedding texts as questions that documents from a dataset could answer, for example as queries to a RAG application.
 ```python
+from sentence_transformers import SentenceTransformer
+model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)
+sentences = ['search_query: Who is Laurens van der Maaten?']
+embeddings = model.encode(sentences)
+print(embeddings)
+```
+### `clustering`
+#### Purpose: embed texts to group them into clusters
+This prefix is used for embedding texts in order to group them into clusters, discover common topics, or remove semantic duplicates.
+```python
+from sentence_transformers import SentenceTransformer
+model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)
+sentences = ['clustering: the quick brown fox']
+embeddings = model.encode(sentences)
+print(embeddings)
+```
+### `classification`
+#### Purpose: embed texts to classify them
+This prefix is used for embedding texts into vectors that will be used as features for a classification model.
+```python
+from sentence_transformers import SentenceTransformer
+model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)
+sentences = ['classification: the quick brown fox']
+embeddings = model.encode(sentences)
+print(embeddings)
 ```
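
Because every input needs one of the four prefixes, it may help to apply them in one place rather than typing them into each string. The following is a minimal sketch and not part of the model card: `TASK_PREFIXES` and `add_task_prefix` are hypothetical names introduced here for illustration, and the `<prefix>: <text>` format is assumed from the examples in this diff.

```python
# Hypothetical helper (not part of nomic-embed-text or sentence-transformers).
# The four task names are the prefixes listed in this README.
TASK_PREFIXES = ("search_document", "search_query", "clustering", "classification")


def add_task_prefix(task, texts):
    """Return each text formatted as '<task>: <text>'."""
    if task not in TASK_PREFIXES:
        # Fail early instead of silently embedding text with an unknown prefix.
        raise ValueError(f"unknown task prefix {task!r}; expected one of {TASK_PREFIXES}")
    return [f"{task}: {text}" for text in texts]


queries = add_task_prefix("search_query", ["Who is Laurens van der Maaten?"])
documents = add_task_prefix(
    "search_document",
    ["t-SNE is a dimensionality reduction algorithm created by Laurens van der Maaten"],
)
print(queries)
print(documents)
```

The resulting lists can then be passed to `model.encode(...)` in the same way as the `sentences` lists in the examples above.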
 ### Sentence Transformers