Update README.md

README.md CHANGED

@@ -2602,10 +2602,51 @@ model-index:
value: 78.02957839746892
---

# nomic-embed-text-v1-unsupervised: A Reproducible Long Context (8192) Text Embedder

`nomic-embed-text-v1-unsupervised` is an 8192 context length text encoder. It is the checkpoint produced after the unsupervised contrastive pretraining stage of the multi-stage contrastive training behind the [final model](https://huggingface.co/nomic-ai/nomic-embed-text-v1). If you want to extract embeddings, we suggest using [nomic-embed-text-v1](https://huggingface.co/nomic-ai/nomic-embed-text-v1).

If you would like to finetune a model on more data, you can use this model as an initialization.
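
A minimal sketch of what using this checkpoint as an initialization might look like with Transformers is shown below. The optimizer and learning rate are illustrative placeholders, not the recipe used to train the released models; the actual multi-stage setup lives in the `contrastors` repository referenced under Training Details.

```python
# Sketch only: load the checkpoint as a starting point for further finetuning.
# The optimizer choice and learning rate below are placeholders, not Nomic's recipe.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('nomic-ai/nomic-embed-text-v1-unsupervised', trust_remote_code=True)
model.train()  # enable training mode (dropout, etc.)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
# From here, plug `model` into your own contrastive finetuning loop
# (see the Training Details section and the contrastors repository).
```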

## Training Details

We train our embedder using a multi-stage training pipeline. Starting from a long-context [BERT model](https://huggingface.co/nomic-ai/nomic-bert-2048), the first unsupervised contrastive stage trains on a dataset generated from weakly related text pairs, such as question-answer pairs from forums like StackExchange and Quora, title-body pairs from Amazon reviews, and summarizations from news articles.

In the second finetuning stage, higher quality labeled datasets such as search queries and answers from web searches are leveraged. Data curation and hard-example mining are crucial in this stage.

For more details, see the Nomic Embed [Technical Report](https://static.nomic.ai/reports/2024_Nomic_Embed_Text_Technical_Report.pdf).

The training data is released in its entirety; for more details, see the `contrastors` [repository](https://github.com/nomic-ai/contrastors).
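
As an illustration of the kind of objective behind these contrastive stages, the sketch below computes an InfoNCE-style loss with in-batch negatives over paired text embeddings. This is a schematic only: the exact loss formulation, temperature, batching, and hard-negative handling in `contrastors` may differ, and the `temperature` value here is an assumption.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor, doc_emb: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """Schematic InfoNCE-style loss with in-batch negatives.

    query_emb, doc_emb: (batch, dim) embeddings of paired texts. For each query,
    its paired document is the positive and every other document in the batch is
    a negative. The temperature is an illustrative placeholder.
    """
    query_emb = F.normalize(query_emb, p=2, dim=1)
    doc_emb = F.normalize(doc_emb, p=2, dim=1)
    logits = query_emb @ doc_emb.T / temperature               # (batch, batch) similarity matrix
    labels = torch.arange(query_emb.size(0), device=query_emb.device)
    return F.cross_entropy(logits, labels)                     # positives lie on the diagonal

# Toy usage with random embeddings:
loss = info_nce_loss(torch.randn(8, 768), torch.randn(8, 768))
```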

## Usage

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

def mean_pooling(model_output, attention_mask):
    # Average the token embeddings, ignoring padding positions.
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

sentences = ['What is TSNE?', 'Who is Laurens van der Maaten?']

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('nomic-ai/nomic-embed-text-v1-unsupervised', trust_remote_code=True)

encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    model_output = model(**encoded_input)

embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
embeddings = F.normalize(embeddings, p=2, dim=1)  # L2-normalize so dot products are cosine similarities
print(embeddings)
```
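
Because the embeddings are L2-normalized above, cosine similarities reduce to a matrix product. A small follow-up to the snippet (not part of the original card), continuing from the same variables:

```python
# Cosine similarity between the example sentences (embeddings are unit-normalized).
similarity = embeddings @ embeddings.T
print(similarity)
```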