---
base_model: nomic-ai/nomic-embed-text-v2-moe
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
license: apache-2.0
language:
- en
- es
- fr
- de
- it
- pt
- pl
- nl
- tr
- ja
- vi
- ru
- id
- ar
- cs
- ro
- sv
- el
- uk
- zh
- hu
- da
- 'no'
- hi
- fi
- bg
- ko
- sk
- th
- he
- ca
- lt
- fa
- ms
- sl
- lv
- mr
- bn
- sq
- cy
- be
- ml
- kn
- mk
- ur
- fy
- te
- eu
- sw
- so
- sd
- uz
- co
- hr
- gu
- ce
- eo
- jv
- la
- zu
- mn
- si
- ga
- ky
- tg
- my
- km
- mg
- pa
- sn
- ha
- ht
- su
- gd
- ny
- ps
- ku
- am
- ig
- lo
- mi
- nn
- sm
- yi
- st
- tl
- xh
- yo
---

# nomic-embed-text-v2-moe: Multilingual Mixture of Experts Text Embeddings

## Model Overview

`nomic-embed-text-v2-moe` is a state-of-the-art (SoTA) multilingual MoE text embedding model:

- **High Performance**: SoTA multilingual performance compared to ~300M parameter models, competitive with models twice its size
- **Multilinguality**: Supports 100+ languages; trained on 1.6B pairs
- **Flexible Embedding Dimension**: Trained with [Matryoshka Embeddings](https://arxiv.org/abs/2205.13147), allowing a 3x reduction in storage cost with minimal performance degradation
- **Fully Open Source**: Model weights, [code](https://github.com/nomic-ai/contrastors), and training data (see code repo) released

| Model | Params (M) | Emb Dim | BEIR | MIRACL | Pretrain Data | Finetune Data | Code |
|-------|------------|---------|------|--------|---------------|---------------|------|
| Nomic Embed v2 | 305 | 768 | 52.86 | **65.80** | ✅ | ✅ | ✅ |
| mE5 Base | 278 | 768 | 48.88 | 62.30 | ❌ | ❌ | ❌ |
| mGTE Base | 305 | 768 | 51.10 | 63.40 | ❌ | ❌ | ❌ |
| Arctic Embed v2 Base | 305 | 768 | **55.40** | 59.90 | ❌ | ❌ | ❌ |
| | | | | | | | |
| BGE M3 | 568 | 1024 | 48.80 | **69.20** | ❌ | ✅ | ❌ |
| Arctic Embed v2 Large | 568 | 1024 | **55.65** | 66.00 | ❌ | ❌ | ❌ |
| mE5 Large | 560 | 1024 | 51.40 | 66.50 | ❌ | ❌ | ❌ |

## Model Architecture

- **Total Parameters**: 475M
- **Active Parameters During Inference**: 305M
- **Architecture Type**: Mixture of Experts (MoE)
- **MoE Configuration**: 8 experts with top-2 routing
- **Embedding Dimensions**: Supports flexible dimensions from 768 down to 256 through Matryoshka representation learning
- **Maximum Sequence Length**: 512 tokens
- **Languages**: Supports 100+ languages (see Performance section)

## Usage Guide

### Installation

The model can be used through SentenceTransformers and Transformers.

**Important**: the text prompt *must* include a *task instruction prefix* that tells the model which task is being performed.

For queries/questions, use `search_query: `; for the corresponding documents, use `search_document: `.

**Transformers**

If using Transformers, **make sure to prepend the task instruction prefix** yourself:

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("nomic-ai/nomic-embed-text-v2-moe")
model = AutoModel.from_pretrained("nomic-ai/nomic-embed-text-v2-moe", trust_remote_code=True)

# Each input carries its task instruction prefix.
sentences = ['search_document: Hello!', 'search_document: ¡Hola!']

def mean_pooling(model_output, attention_mask):
    # Average token embeddings, ignoring padding positions.
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
model.eval()
with torch.no_grad():
    model_output = model(**encoded_input)

embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
# L2-normalize so that dot products equal cosine similarities.
embeddings = F.normalize(embeddings, p=2, dim=1)
```

**SentenceTransformers**

With SentenceTransformers, you can specify the `prompt_name` (`"query"` or `"passage"`) instead of adding the prefix by hand:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v2-moe", trust_remote_code=True)
sentences = ["Hello!", "¡Hola!"]
# prompt_name selects the prompt applied to each input.
embeddings = model.encode(sentences, prompt_name="passage")
```

## Performance

![image/png](https://cdn-uploads.huggingface.co/production/uploads/607997c83a565c15675055b3/xadjrezEIM0Q1jbgmjqO7.png)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/607997c83a565c15675055b3/8hmhWQ_TTmlrviZFIBSxo.png)

## Best Practices

- Add appropriate prefixes to your text:
  - For queries: `search_query: `
  - For documents: `search_document: `
- Maximum input length is 512 tokens
- If storage or compute is a concern, consider using the 256-dimension embeddings

## Limitations

- Performance may vary across different languages
- Resource requirements may be higher than traditional dense models due to the MoE architecture
- `trust_remote_code=True` must be set when loading the model

## Training Details

![image/png](https://cdn-uploads.huggingface.co/production/uploads/607997c83a565c15675055b3/F0lyAtV8wXMBmxSbtIgL4.png)

- Trained on 1.6 billion high-quality pairs across multiple languages
- Uses consistency filtering to ensure high-quality training data
- Incorporates Matryoshka representation learning for dimension flexibility
- Training includes both weakly-supervised contrastive pretraining and supervised finetuning

## Join the Nomic Community

- Nomic: [https://nomic.ai](https://nomic.ai)
- Discord: [https://discord.gg/myY5YDR8z8](https://discord.gg/myY5YDR8z8)
- Twitter: [https://twitter.com/nomic_ai](https://twitter.com/nomic_ai)
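
## Appendix: Prefix and Truncation Sketch

Two points from the Usage Guide and Best Practices, the task instruction prefixes and the 256-dimension Matryoshka option, can be illustrated without loading the model. The following is a minimal, dependency-free sketch, not the model's own API: the helper names `add_prefix` and `truncate_and_normalize` are hypothetical, the input vector is toy data standing in for a real 768-dimensional embedding, and it assumes that a shorter embedding is obtained by keeping the leading components and re-normalizing, which is the usual way Matryoshka embeddings are used.

```python
import math

# Prefix strings as given in this model card.
QUERY_PREFIX = "search_query: "
DOCUMENT_PREFIX = "search_document: "


def add_prefix(texts, prefix):
    """Prepend a task instruction prefix to each text (hypothetical helper)."""
    return [prefix + text for text in texts]


def truncate_and_normalize(vector, dim=256):
    """Keep the first `dim` components and rescale to unit L2 norm.

    Assumption: Matryoshka truncation is a plain slice of the leading
    components followed by re-normalization.
    """
    head = list(vector[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # all-zero vector: nothing to rescale
    return [x / norm for x in head]


# Toy 768-dimensional vector standing in for a real model output.
full_embedding = [math.sin(i + 1) for i in range(768)]
short_embedding = truncate_and_normalize(full_embedding, dim=256)

queries = add_prefix(["Hello!"], QUERY_PREFIX)
documents = add_prefix(["¡Hola!"], DOCUMENT_PREFIX)

print(queries)               # ['search_query: Hello!']
print(documents)             # ['search_document: ¡Hola!']
print(len(short_embedding))  # 256
```

With real model outputs, the same slice-then-normalize step would typically be applied to the embedding tensor before similarity search. Query and document embeddings need to be truncated to the same dimension for their similarities to be comparable.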