usage at the top (#28)
Commit: 22a61f969963dc430753f8b0bdf3d8d25f5b9e5d
Co-authored-by: Max Cembalest <[email protected]>

README.md (CHANGED)

@@ -2609,63 +2609,8 @@ language:
 
 # nomic-embed-text-v1.5: Resizable Production Embeddings with Matryoshka Representation Learning
 
-`nomic-embed-text-v1.5` is an improvement upon [Nomic Embed](https://huggingface.co/nomic-ai/nomic-embed-text-v1) that utilizes [Matryoshka Representation Learning](https://arxiv.org/abs/2205.13147), which gives developers the flexibility to trade off embedding size for a negligible reduction in performance.
-
-| Name                  | SeqLen | Dimension | MTEB      |
-| :-------------------: | :----: | :-------: | :-------: |
-| nomic-embed-text-v1   | 8192   | 768       | **62.39** |
-| nomic-embed-text-v1.5 | 8192   | 768       | 62.28     |
-| nomic-embed-text-v1.5 | 8192   | 512       | 61.96     |
-| nomic-embed-text-v1.5 | 8192   | 256       | 61.04     |
-| nomic-embed-text-v1.5 | 8192   | 128       | 59.34     |
-| nomic-embed-text-v1.5 | 8192   | 64        | 56.10     |
-
-
-
 
 **Exciting Update!**: `nomic-embed-text-v1.5` is now multimodal! [nomic-embed-vision-v1.5](https://huggingface.co/nomic-ai/nomic-embed-vision-v1.5) is aligned to the embedding space of `nomic-embed-text-v1.5`, meaning any text embedding is multimodal!
 
-## Hosted Inference API
-
-The easiest way to get started with Nomic Embed is through the Nomic Embedding API.
-
-Generating embeddings with the `nomic` Python client is as easy as:
-
-```python
-from nomic import embed
-
-output = embed.text(
-    texts=['Nomic Embedding API', '#keepAIOpen'],
-    model='nomic-embed-text-v1.5',
-    task_type='search_document',
-    dimensionality=256,
-)
-
-print(output)
-```
-
-For more information, see the [API reference](https://docs.nomic.ai/reference/endpoints/nomic-embed-text).
-
-## Data Visualization
-
-Click the Nomic Atlas map below to visualize a 5M sample of our contrastive pretraining data!
-
-[Nomic Atlas map](https://atlas.nomic.ai/map/nomic-text-embed-v1-5m-sample)
-
-## Training Details
-
-We train our embedder using a multi-stage training pipeline. Starting from a long-context [BERT model](https://huggingface.co/nomic-ai/nomic-bert-2048),
-the first unsupervised contrastive stage trains on a dataset generated from weakly related text pairs, such as question-answer pairs from forums like StackExchange and Quora, title-body pairs from Amazon reviews, and summarizations from news articles.
-
-In the second finetuning stage, higher-quality labeled datasets such as search queries and answers from web searches are leveraged. Data curation and hard-example mining are crucial in this stage.
-
-For more details, see the Nomic Embed [Technical Report](https://static.nomic.ai/reports/2024_Nomic_Embed_Text_Technical_Report.pdf) and the corresponding [blog post](https://blog.nomic.ai/posts/nomic-embed-matryoshka).
-
-The training data is released in its entirety; for more details, see the `contrastors` [repository](https://github.com/nomic-ai/contrastors).
 
 ## Usage
 
 **Important**: the text prompt *must* include a *task instruction prefix*, instructing the model which task is being performed.
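For retrieval, the model card uses the prefixes `search_query:` for queries and `search_document:` for corpus text. A minimal sketch of applying them before embedding (the sample strings are illustrative, not from the source):

```python
# Task instruction prefixes are prepended directly to the raw text
# before it is passed to the model.
docs = ["Nomic Embed supports sequence lengths up to 8192 tokens."]
queries = ["what is the max context length of nomic embed?"]

prefixed_docs = [f"search_document: {d}" for d in docs]
prefixed_queries = [f"search_query: {q}" for q in queries]

print(prefixed_queries[0])
```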
@@ -2818,6 +2763,61 @@ embeddings = layer_norm(embeddings, [embeddings.dims[1]])
 console.log(embeddings.tolist());
 ```
+
+## Nomic API
+
+The easiest way to use Nomic Embed is through the Nomic Embedding API.
+
+Generating embeddings with the `nomic` Python client is as easy as:
+
+```python
+from nomic import embed
+
+output = embed.text(
+    texts=['Nomic Embedding API', '#keepAIOpen'],
+    model='nomic-embed-text-v1.5',
+    task_type='search_document',
+    dimensionality=256,
+)
+
+print(output)
+```
+
+For more information, see the [API reference](https://docs.nomic.ai/reference/endpoints/nomic-embed-text).
+
+## Adjusting Dimensionality
+
+`nomic-embed-text-v1.5` is an improvement upon [Nomic Embed](https://huggingface.co/nomic-ai/nomic-embed-text-v1) that utilizes [Matryoshka Representation Learning](https://arxiv.org/abs/2205.13147), which gives developers the flexibility to trade off embedding size for a negligible reduction in performance.
+
+| Name                  | SeqLen | Dimension | MTEB      |
+| :-------------------: | :----: | :-------: | :-------: |
+| nomic-embed-text-v1   | 8192   | 768       | **62.39** |
+| nomic-embed-text-v1.5 | 8192   | 768       | 62.28     |
+| nomic-embed-text-v1.5 | 8192   | 512       | 61.96     |
+| nomic-embed-text-v1.5 | 8192   | 256       | 61.04     |
+| nomic-embed-text-v1.5 | 8192   | 128       | 59.34     |
+| nomic-embed-text-v1.5 | 8192   | 64        | 56.10     |
+
+
+
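Resizing a Matryoshka embedding after the fact requires no extra model call: truncate to the leading dimensions, then L2-renormalize. A minimal NumPy sketch (the 768-d vector below is random stand-in data, not a real model output):

```python
import numpy as np

def resize_embedding(emb: np.ndarray, dim: int) -> np.ndarray:
    """Truncate a Matryoshka embedding to its leading `dim` dimensions
    and L2-renormalize so cosine similarities stay well-scaled."""
    truncated = emb[..., :dim]
    return truncated / np.linalg.norm(truncated, axis=-1, keepdims=True)

full = np.random.default_rng(0).standard_normal(768)  # stand-in embedding
small = resize_embedding(full, 256)
print(small.shape)  # (256,)
```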
+## Training
+
+Click the Nomic Atlas map below to visualize a 5M sample of our contrastive pretraining data!
+
+[Nomic Atlas map](https://atlas.nomic.ai/map/nomic-text-embed-v1-5m-sample)
+
+We train our embedder using a multi-stage training pipeline. Starting from a long-context [BERT model](https://huggingface.co/nomic-ai/nomic-bert-2048),
+the first unsupervised contrastive stage trains on a dataset generated from weakly related text pairs, such as question-answer pairs from forums like StackExchange and Quora, title-body pairs from Amazon reviews, and summarizations from news articles.
+
+In the second finetuning stage, higher-quality labeled datasets such as search queries and answers from web searches are leveraged. Data curation and hard-example mining are crucial in this stage.
+
+For more details, see the Nomic Embed [Technical Report](https://static.nomic.ai/reports/2024_Nomic_Embed_Text_Technical_Report.pdf) and the corresponding [blog post](https://blog.nomic.ai/posts/nomic-embed-matryoshka).
+
+The training data is released in its entirety; for more details, see the `contrastors` [repository](https://github.com/nomic-ai/contrastors).
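The contrastive objective described above can be illustrated with an in-batch InfoNCE loss, where each anchor's paired text is its positive and every other pair in the batch supplies negatives. This is a toy NumPy sketch on random stand-in vectors, not Nomic's actual training code (the `temperature` value and batch size are arbitrary assumptions):

```python
import numpy as np

def info_nce_loss(anchors: np.ndarray, positives: np.ndarray, temperature: float = 0.05) -> float:
    """In-batch InfoNCE: each anchor's positive is its row-mate; every
    other positive in the batch serves as a negative."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (a @ p.T) / temperature                 # (B, B) scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
queries = rng.standard_normal((8, 64))
docs = queries + 0.1 * rng.standard_normal((8, 64))  # weakly related pairs
print(info_nce_loss(queries, docs))
```

Weakly related pairs yield a much lower loss than unrelated texts, which is what the contrastive stage exploits.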
 
 # Join the Nomic Community
 
 - Nomic: [https://nomic.ai](https://nomic.ai)