---
base_model:
- nomic-ai/nomic-embed-text-v2-moe-unsupervised
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
license: apache-2.0
language:
- en
- es
- fr
- de
- it
- pt
- pl
- nl
- tr
- ja
- vi
- ru
- id
- ar
- cs
- ro
- sv
- el
- uk
- zh
- hu
- da
- 'no'
- hi
- fi
- bg
- ko
- sk
- th
- he
- ca
- lt
- fa
- ms
- sl
- lv
- mr
- bn
- sq
- cy
- be
- ml
- kn
- mk
- ur
- fy
- te
- eu
- sw
- so
- sd
- uz
- co
- hr
- gu
- ce
- eo
- jv
- la
- zu
- mn
- si
- ga
- ky
- tg
- my
- km
- mg
- pa
- sn
- ha
- ht
- su
- gd
- ny
- ps
- ku
- am
- ig
- lo
- mi
- nn
- sm
- yi
- st
- tl
- xh
- yo
- af
- ta
- tn
- ug
- az
- ba
- bs
- dv
- et
- gl
- gn
- gv
- hy
---

# nomic-embed-text-v2-moe: Multilingual Mixture of Experts Text Embeddings

## Model Overview

`nomic-embed-text-v2-moe` is a state-of-the-art multilingual mixture-of-experts (MoE) text embedding model that excels at multilingual retrieval:

- **High Performance**: SoTA multilingual performance among ~300M-parameter models, competitive with models 2x its size
- **Multilinguality**: Supports ~100 languages and is trained on over 1.6B pairs
- **Flexible Embedding Dimension**: Trained with [Matryoshka Embeddings](https://arxiv.org/abs/2205.13147), enabling a 3x reduction in storage cost with minimal performance degradation
- **Fully Open-Source**: Model weights, [code](https://github.com/nomic-ai/contrastors), and training data (see the code repo) are released

| Model | Params (M) | Emb Dim | BEIR | MIRACL | Pretrain Data | Finetune Data | Code |
|-------|------------|---------|------|--------|---------------|---------------|------|
| **Nomic Embed v2** | 305 | 768 | 52.86 | **65.80** | ✅ | ✅ | ✅ |
| mE5 Base | 278 | 768 | 48.88 | 62.30 | ❌ | ❌ | ❌ |
| mGTE Base | 305 | 768 | 51.10 | 63.40 | ❌ | ❌ | ❌ |
| Arctic Embed v2 Base | 305 | 768 | **55.40** | 59.90 | ❌ | ❌ | ❌ |
| BGE M3 | 568 | 1024 | 48.80 | **69.20** | ❌ | ✅ | ❌ |
| Arctic Embed v2 Large | 568 | 1024 | **55.65** | 66.00 | ❌ | ❌ | ❌ |
| mE5 Large | 560 | 1024 | 51.40 | 66.50 | ❌ | ❌ | ❌ |
## Model Architecture

- **Total Parameters**: 475M
- **Active Parameters During Inference**: 305M
- **Architecture Type**: Mixture of Experts (MoE)
- **MoE Configuration**: 8 experts with top-2 routing (a conceptual sketch of top-2 routing follows this list)
- **Embedding Dimensions**: Flexible dimensions from 768 down to 256 via Matryoshka representation learning
- **Maximum Sequence Length**: 512 tokens
- **Languages**: Supports ~100 languages (see the Performance section)
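
To make the routing idea concrete, here is a minimal, self-contained sketch of a top-2 MoE feed-forward layer. It is **not** the model's actual implementation (see the [contrastors](https://github.com/nomic-ai/contrastors) repository and megablocks for that); the expert MLP shape and dimensions below are illustrative placeholders.

```python
import torch
import torch.nn as nn

class Top2MoE(nn.Module):
    """Toy top-2 mixture-of-experts feed-forward layer (illustrative only)."""

    def __init__(self, hidden_dim=768, ffn_dim=3072, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden_dim, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden_dim, ffn_dim), nn.GELU(), nn.Linear(ffn_dim, hidden_dim))
            for _ in range(num_experts)
        ])

    def forward(self, x):                                   # x: (batch, seq, hidden_dim)
        scores = self.router(x)                             # (batch, seq, num_experts)
        weights, chosen = scores.topk(self.top_k, dim=-1)   # pick the 2 highest-scoring experts per token
        weights = weights.softmax(dim=-1)                    # normalize the two routing weights
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = chosen[..., slot] == e                # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(2, 16, 768)
print(Top2MoE()(tokens).shape)  # torch.Size([2, 16, 768])
```

Because only 2 of the 8 experts run for any given token, the parameters active during inference (305M) are well below the 475M total.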

## Usage Guide

### Installation

The model can be used through SentenceTransformers and Transformers.

For best performance on GPU, please install:

```bash
pip install torch transformers einops git+https://github.com/nomic-ai/megablocks.git
```

> [!IMPORTANT]
> The text prompt *must* include a *task instruction prefix*, instructing the model which task is being performed.

Please use `search_query: ` before your queries/questions and `search_document: ` before your documents.

### Transformers

If using Transformers, **make sure to prepend the task instruction prefix**.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("nomic-ai/nomic-embed-text-v2-moe")
model = AutoModel.from_pretrained("nomic-ai/nomic-embed-text-v2-moe", trust_remote_code=True)

sentences = ['search_document: Hello!', 'search_document: ¡Hola!']

def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
model.eval()
with torch.no_grad():
    model_output = model(**encoded_input)
embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
embeddings = F.normalize(embeddings, p=2, dim=1)
print(embeddings.shape)
# torch.Size([2, 768])

similarity = F.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(similarity)
# tensor(0.9118)
```

### SentenceTransformers

With SentenceTransformers, you can specify the `prompt_name` as either `"query"` or `"passage"`, and the task instruction will be included automatically.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v2-moe", trust_remote_code=True)
sentences = ["Hello!", "¡Hola!"]
embeddings = model.encode(sentences, prompt_name="passage")
print(embeddings.shape)
# (2, 768)

similarity = model.similarity(embeddings[0], embeddings[1])
print(similarity)
# tensor([[0.9118]])
```
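
For retrieval, the same interface embeds queries with the `"query"` prompt and documents with the `"passage"` prompt, then ranks documents by similarity. A minimal sketch (the query and documents here are made-up examples):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v2-moe", trust_remote_code=True)

# Hypothetical query and documents, purely for illustration
query = "Which city is the capital of France?"
documents = [
    "Paris is the capital and most populous city of France.",
    "The Rhine is one of the major European rivers.",
]

query_emb = model.encode([query], prompt_name="query")     # adds the search_query: prefix
doc_embs = model.encode(documents, prompt_name="passage")  # adds the search_document: prefix

scores = model.similarity(query_emb, doc_embs)             # (1, 2) similarity matrix
print(scores)                                              # the first document should score higher
```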

## Performance

nomic-embed-text-v2-moe performance on BEIR and MIRACL compared to other open-weights embedding models:



nomic-embed-text-v2-moe performance on BEIR at 768 dimensions and truncated to 256 dimensions:


## Best Practices

- Add the appropriate prefix to your text:
  - For queries: "search_query: "
  - For documents: "search_document: "
- Maximum input length is 512 tokens
- If storage or compute is a concern, consider using 256-dimension embeddings (see the truncation sketch after this list)
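
Because the model is trained with Matryoshka representation learning, the 768-dimensional embeddings can be truncated to their first 256 dimensions and re-normalized. Below is a minimal sketch of manual truncation (newer sentence-transformers releases also expose a `truncate_dim` argument that handles this for you); the sentences are made-up examples.

```python
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v2-moe", trust_remote_code=True)

sentences = ["search is fun", "buscar es divertido"]
full = model.encode(sentences, prompt_name="passage", convert_to_tensor=True)  # (2, 768)

# Matryoshka truncation: keep the leading 256 dimensions, then L2-normalize again
truncated = F.normalize(full[:, :256], p=2, dim=1)
print(truncated.shape)  # torch.Size([2, 256])
```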

## Limitations

- Performance may vary across languages
- Resource requirements may be higher than for traditional dense models due to the MoE architecture
- Must pass `trust_remote_code=True` when loading the model to use our custom architecture implementation
## Training Details



- Trained on 1.6 billion high-quality pairs across multiple languages
- Uses consistency filtering to ensure high-quality training data (a sketch of the idea follows this list)
- Incorporates Matryoshka representation learning for dimension flexibility
- Training includes both weakly-supervised contrastive pretraining and supervised finetuning
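
The consistency-filtering step can be illustrated roughly as follows: embed candidate (query, document) pairs with an existing embedding model and keep a pair only if its document ranks among the top-k nearest documents for its query. The embedding model, sample size, and `top_k` used for Nomic Embed v2 are not specified here; this is only a sketch of the general idea.

```python
import numpy as np

def consistency_filter(query_embs: np.ndarray, doc_embs: np.ndarray, top_k: int = 2) -> np.ndarray:
    """Keep pair i only if doc i is among the top_k most similar docs for query i.

    Both inputs are L2-normalized (N, d) arrays produced by an existing embedding model.
    """
    sims = query_embs @ doc_embs.T              # (N, N) cosine similarities
    top = np.argsort(-sims, axis=1)[:, :top_k]  # indices of the top_k docs per query
    return np.array([i in top[i] for i in range(len(sims))])

# Toy data: documents are noisy copies of their queries, so most pairs survive the filter
rng = np.random.default_rng(0)
q = rng.normal(size=(5, 8)); q /= np.linalg.norm(q, axis=1, keepdims=True)
d = q + 0.1 * rng.normal(size=(5, 8)); d /= np.linalg.norm(d, axis=1, keepdims=True)
print(consistency_filter(q, d))
```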

## Join the Nomic Community

- Nomic: [https://nomic.ai](https://nomic.ai)
- Discord: [https://discord.gg/myY5YDR8z8](https://discord.gg/myY5YDR8z8)
- Twitter: [https://twitter.com/nomic_ai](https://twitter.com/nomic_ai)