---
base_model:
- nomic-ai/nomic-embed-text-v2-moe-unsupervised
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
license: apache-2.0
language:
- en
- es
- fr
- de
- it
- pt
- pl
- nl
- tr
- ja
- vi
- ru
- id
- ar
- cs
- ro
- sv
- el
- uk
- zh
- hu
- da
- 'no'
- hi
- fi
- bg
- ko
- sk
- th
- he
- ca
- lt
- fa
- ms
- sl
- lv
- mr
- bn
- sq
- cy
- be
- ml
- kn
- mk
- ur
- fy
- te
- eu
- sw
- so
- sd
- uz
- co
- hr
- gu
- ce
- eo
- jv
- la
- zu
- mn
- si
- ga
- ky
- tg
- my
- km
- mg
- pa
- sn
- ha
- ht
- su
- gd
- ny
- ps
- ku
- am
- ig
- lo
- mi
- nn
- sm
- yi
- st
- tl
- xh
- yo
- af
- ta
- tn
- ug
- az
- ba
- bs
- dv
- et
- gl
- gn
- gv
- hy
---

# nomic-embed-text-v2-moe: Multilingual Mixture of Experts Text Embeddings

## Model Overview

`nomic-embed-text-v2-moe` is a state-of-the-art multilingual mixture-of-experts (MoE) text embedding model that excels at multilingual retrieval:

- **High Performance**: SoTA multilingual performance among ~300M-parameter models, competitive with models 2x its size
- **Multilinguality**: Supports ~100 languages and is trained on over 1.6B pairs
- **Flexible Embedding Dimension**: Trained with [Matryoshka Embeddings](https://arxiv.org/abs/2205.13147), enabling a 3x reduction in storage cost with minimal performance degradation
- **Fully Open-Source**: Model weights, [code](https://github.com/nomic-ai/contrastors), and training data (see the code repo) are released

| Model | Params (M) | Emb Dim | BEIR | MIRACL | Pretrain Data | Finetune Data | Code |
|-------|------------|---------|------|--------|---------------|---------------|------|
| **Nomic Embed v2** | 305 | 768 | 52.86 | **65.80** | ✅ | ✅ | ✅ |
| mE5 Base | 278 | 768 | 48.88 | 62.30 | ❌ | ❌ | ❌ |
| mGTE Base | 305 | 768 | 51.10 | 63.40 | ❌ | ❌ | ❌ |
| Arctic Embed v2 Base | 305 | 768 | **55.40** | 59.90 | ❌ | ❌ | ❌ |
| BGE M3 | 568 | 1024 | 48.80 | **69.20** | ❌ | ✅ | ❌ |
| Arctic Embed v2 Large | 568 | 1024 | **55.65** | 66.00 | ❌ | ❌ | ❌ |
| mE5 Large | 560 | 1024 | 51.40 | 66.50 | ❌ | ❌ | ❌ |
## Model Architecture

- **Total Parameters**: 475M
- **Active Parameters During Inference**: 305M
- **Architecture Type**: Mixture of Experts (MoE)
- **MoE Configuration**: 8 experts with top-2 routing (a conceptual sketch of top-2 routing follows this list)
- **Embedding Dimensions**: Flexible dimensions from 768 down to 256 via Matryoshka representation learning
- **Maximum Sequence Length**: 512 tokens
- **Languages**: Supports ~100 languages (see the Performance section)
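
To make the routing idea concrete, here is a minimal, self-contained sketch of a top-2 MoE feed-forward layer. It is **not** the model's actual implementation (see the [contrastors](https://github.com/nomic-ai/contrastors) repository and megablocks for that); the expert MLP shape and dimensions below are illustrative placeholders.

```python
import torch
import torch.nn as nn

class Top2MoE(nn.Module):
    """Toy top-2 mixture-of-experts feed-forward layer (illustrative only)."""

    def __init__(self, hidden_dim=768, ffn_dim=3072, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden_dim, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden_dim, ffn_dim), nn.GELU(), nn.Linear(ffn_dim, hidden_dim))
            for _ in range(num_experts)
        ])

    def forward(self, x):                                   # x: (batch, seq, hidden_dim)
        scores = self.router(x)                             # (batch, seq, num_experts)
        weights, chosen = scores.topk(self.top_k, dim=-1)   # pick the 2 highest-scoring experts per token
        weights = weights.softmax(dim=-1)                    # normalize the two routing weights
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = chosen[..., slot] == e                # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(2, 16, 768)
print(Top2MoE()(tokens).shape)  # torch.Size([2, 16, 768])
```

Because only 2 of the 8 experts run for any given token, the parameters active during inference (305M) are well below the 475M total.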

## Usage Guide

### Installation

The model can be used through SentenceTransformers and Transformers.

For best performance on GPU, please install:

```bash
pip install torch transformers einops git+https://github.com/nomic-ai/megablocks.git
```

> [!IMPORTANT]
> The text prompt *must* include a *task instruction prefix*, instructing the model which task is being performed.

Please use `search_query: ` before your queries/questions and `search_document: ` before your documents.

### Transformers

If using Transformers, **make sure to prepend the task instruction prefix**.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("nomic-ai/nomic-embed-text-v2-moe")
model = AutoModel.from_pretrained("nomic-ai/nomic-embed-text-v2-moe", trust_remote_code=True)

sentences = ['search_document: Hello!', 'search_document: ¡Hola!']

def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
model.eval()
with torch.no_grad():
    model_output = model(**encoded_input)
embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
embeddings = F.normalize(embeddings, p=2, dim=1)
print(embeddings.shape)
# torch.Size([2, 768])

similarity = F.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(similarity)
# tensor(0.9118)
```

### SentenceTransformers

With SentenceTransformers, you can specify the `prompt_name` as either `"query"` or `"passage"`, and the task instruction will be included automatically.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v2-moe", trust_remote_code=True)
sentences = ["Hello!", "¡Hola!"]
embeddings = model.encode(sentences, prompt_name="passage")
print(embeddings.shape)
# (2, 768)

similarity = model.similarity(embeddings[0], embeddings[1])
print(similarity)
# tensor([[0.9118]])
```
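
For retrieval, the same interface embeds queries with the `"query"` prompt and documents with the `"passage"` prompt, then ranks documents by similarity. A minimal sketch (the query and documents here are made-up examples):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v2-moe", trust_remote_code=True)

# Hypothetical query and documents, purely for illustration
query = "Which city is the capital of France?"
documents = [
    "Paris is the capital and most populous city of France.",
    "The Rhine is one of the major European rivers.",
]

query_emb = model.encode([query], prompt_name="query")     # adds the search_query: prefix
doc_embs = model.encode(documents, prompt_name="passage")  # adds the search_document: prefix

scores = model.similarity(query_emb, doc_embs)             # (1, 2) similarity matrix
print(scores)                                              # the first document should score higher
```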

## Performance

nomic-embed-text-v2-moe performance on BEIR and MIRACL compared to other open-weights embedding models:



nomic-embed-text-v2-moe performance on BEIR at 768 dimensions and truncated to 256 dimensions:


## Best Practices

- Add the appropriate prefix to your text:
  - For queries: "search_query: "
  - For documents: "search_document: "
- Maximum input length is 512 tokens
- If storage or compute is a concern, consider using 256-dimension embeddings (see the truncation sketch after this list)
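
Because the model is trained with Matryoshka representation learning, the 768-dimensional embeddings can be truncated to their first 256 dimensions and re-normalized. Below is a minimal sketch of manual truncation (newer sentence-transformers releases also expose a `truncate_dim` argument that handles this for you); the sentences are made-up examples.

```python
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v2-moe", trust_remote_code=True)

sentences = ["search is fun", "buscar es divertido"]
full = model.encode(sentences, prompt_name="passage", convert_to_tensor=True)  # (2, 768)

# Matryoshka truncation: keep the leading 256 dimensions, then L2-normalize again
truncated = F.normalize(full[:, :256], p=2, dim=1)
print(truncated.shape)  # torch.Size([2, 256])
```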

## Limitations

- Performance may vary across languages
- Resource requirements may be higher than for traditional dense models due to the MoE architecture
- Must pass `trust_remote_code=True` when loading the model to use our custom architecture implementation
## Training Details



- Trained on 1.6 billion high-quality pairs across multiple languages
- Uses consistency filtering to ensure high-quality training data (a sketch of the idea follows this list)
- Incorporates Matryoshka representation learning for dimension flexibility
- Training includes both weakly-supervised contrastive pretraining and supervised finetuning
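
The consistency-filtering step can be illustrated roughly as follows: embed candidate (query, document) pairs with an existing embedding model and keep a pair only if its document ranks among the top-k nearest documents for its query. The embedding model, sample size, and `top_k` used for Nomic Embed v2 are not specified here; this is only a sketch of the general idea.

```python
import numpy as np

def consistency_filter(query_embs: np.ndarray, doc_embs: np.ndarray, top_k: int = 2) -> np.ndarray:
    """Keep pair i only if doc i is among the top_k most similar docs for query i.

    Both inputs are L2-normalized (N, d) arrays produced by an existing embedding model.
    """
    sims = query_embs @ doc_embs.T              # (N, N) cosine similarities
    top = np.argsort(-sims, axis=1)[:, :top_k]  # indices of the top_k docs per query
    return np.array([i in top[i] for i in range(len(sims))])

# Toy data: documents are noisy copies of their queries, so most pairs survive the filter
rng = np.random.default_rng(0)
q = rng.normal(size=(5, 8)); q /= np.linalg.norm(q, axis=1, keepdims=True)
d = q + 0.1 * rng.normal(size=(5, 8)); d /= np.linalg.norm(d, axis=1, keepdims=True)
print(consistency_filter(q, d))
```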

## Join the Nomic Community

- Nomic: [https://nomic.ai](https://nomic.ai)
- Discord: [https://discord.gg/myY5YDR8z8](https://discord.gg/myY5YDR8z8)
- Twitter: [https://twitter.com/nomic_ai](https://twitter.com/nomic_ai)