@@ -1,142 +1,216 @@
----
-base_model: nomic-ai/nomic-embed-text-v2-moe
-library_name: sentence-transformers
-pipeline_tag: sentence-similarity
-tags:
-- sentence-transformers
-- sentence-similarity
-- feature-extraction
----
-# SentenceTransformer based on nomic-ai/nomic-embed-text-v2-moe
-This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [nomic-ai/nomic-embed-text-v2-moe](https://huggingface.co/nomic-ai/nomic-embed-text-v2-moe). It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
-## Model Details
-### Model Description
-- **Model Type:** Sentence Transformer
-- **Base model:** [nomic-ai/nomic-embed-text-v2-moe](https://huggingface.co/nomic-ai/nomic-embed-text-v2-moe) <!-- at revision 8e109938f32da90ed146077b419bedd5cc6590b7 -->
-- **Maximum Sequence Length:** 512 tokens
-- **Output Dimensionality:** 768 dimensions
-- **Similarity Function:** Cosine Similarity
-<!-- - **Training Dataset:** Unknown -->
-<!-- - **Language:** Unknown -->
-<!-- - **License:** Unknown -->
-### Model Sources
-- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
-- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
-- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
-### Full Model Architecture
-```
-SentenceTransformer(
-  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: NomicBertModel
-  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
-  (2): Normalize()
-)
-```
-## Usage
-### Direct Usage (Sentence Transformers)
-First install the Sentence Transformers library:
-```bash
-pip install -U sentence-transformers
 ```
-Then you can load this model and run inference.
 ```python
 from sentence_transformers import SentenceTransformer
-# Download from the 🤗 Hub
-model = SentenceTransformer("nomic-ai/nomic-embed-text-v2-moe")
-# Run inference
-sentences = [
-    'The weather is lovely today.',
-    "It's so sunny outside!",
-    'He drove to the stadium.',
-]
-embeddings = model.encode(sentences)
-print(embeddings.shape)
-# [3, 768]
-# Get the similarity scores for the embeddings
-similarities = model.similarity(embeddings, embeddings)
-print(similarities.shape)
-# [3, 3]
 ```
-<!--
-### Direct Usage (Transformers)
-<details><summary>Click to see the direct usage in Transformers</summary>
-</details>
--->
-<!--
-### Downstream Usage (Sentence Transformers)
-You can finetune this model on your own dataset.
-<details><summary>Click to expand</summary>
-</details>
--->
-<!--
-### Out-of-Scope Use
-*List how the model may foreseeably be misused and address what users ought not to do with the model.*
--->
-<!--
-## Bias, Risks and Limitations
-*What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
--->
-<!--
-### Recommendations
-*What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
--->
 ## Training Details
-### Framework Versions
-- Python: 3.10.12
-- Sentence Transformers: 3.3.0
-- Transformers: 4.44.2
-- PyTorch: 2.4.1+cu121
-- Accelerate: 1.0.0
-- Datasets: 2.19.0
-- Tokenizers: 0.19.1
-## Citation
-### BibTeX
-<!--
-## Glossary
-*Clearly define terms in order to be accessible across audiences.*
--->
-<!--
-## Model Card Authors
-*Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
--->
-<!--
-## Model Card Contact
-*Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
--->

+---
+base_model: nomic-ai/nomic-embed-text-v2-moe
+library_name: sentence-transformers
+pipeline_tag: sentence-similarity
+tags:
+- sentence-transformers
+- sentence-similarity
+- feature-extraction
+license: apache-2.0
+language:
+- en
+- es
+- fr
+- de
+- it
+- pt
+- pl
+- nl
+- tr
+- ja
+- vi
+- ru
+- id
+- ar
+- cs
+- ro
+- sv
+- el
+- uk
+- zh
+- hu
+- da
+- 'no'
+- hi
+- fi
+- bg
+- ko
+- sk
+- th
+- he
+- ca
+- lt
+- fa
+- ms
+- sl
+- lv
+- mr
+- bn
+- sq
+- cy
+- be
+- ml
+- kn
+- mk
+- ur
+- fy
+- te
+- eu
+- sw
+- so
+- sd
+- uz
+- co
+- hr
+- gu
+- ce
+- eo
+- jv
+- la
+- zu
+- mn
+- si
+- ga
+- ky
+- tg
+- my
+- km
+- mg
+- pa
+- sn
+- ha
+- ht
+- su
+- gd
+- ny
+- ps
+- ku
+- am
+- ig
+- lo
+- mi
+- nn
+- sm
+- yi
+- st
+- tl
+- xh
+- yo
+---
+# nomic-embed-text-v2-moe: Multilingual Mixture of Experts Text Embeddings
+## Model Overview
+nomic-embed-text-v2-moe is a state-of-the-art (SoTA) multilingual MoE text embedding model:
+- **High Performance**: SoTA multilingual performance compared to ~300M parameter models, competitive with models 2x its size
+- **Multilinguality**: Supports 100+ languages and is trained on over 1.6B pairs
+- **Flexible Embedding Dimension**: Trained with [Matryoshka Embeddings](https://arxiv.org/abs/2205.13147), allowing a 3x reduction in storage cost with minimal performance degradation
+- **Fully Open Source**: Model weights, [code](https://github.com/nomic-ai/contrastors), and training data (see code repo) released
+| Model | Params (M) | Emb Dim | BEIR | MIRACL | Pretrain Data | Finetune Data | Code |
+|-------|------------|----------|------|---------|---------------|---------------|------|
+| Nomic Embed v2 | 305 | 768 | 52.86 | **65.80** | ✅ | ✅ | ✅ |
+| mE5 Base | 278 | 768 | 48.88 | 62.30 | ❌   | ❌   | ❌   |
+| mGTE Base | 305 | 768 | 51.10 | 63.40 | ❌ | ❌ | ❌ |
+| Arctic Embed v2 Base | 305 | 768 | **55.40** | 59.90 | ❌ | ❌ | ❌ |
+| | | | | | | | |
+| BGE M3 | 568 | 1024 | 48.80 | **69.20** | ❌ | ✅ | ❌ |
+| Arctic Embed v2 Large | 568 | 1024 | **55.65** | 66.00 | ❌ | ❌ | ❌ |
+| mE5 Large | 560 | 1024 | 51.40 | 66.50 | ❌ | ❌ | ❌ |
+## Model Architecture
+- **Total Parameters**: 475M
+- **Active Parameters During Inference**: 305M
+- **Architecture Type**: Mixture of Experts (MoE)
+- **MoE Configuration**: 8 experts with top-2 routing
+- **Embedding Dimensions**: Supports flexible dimensions from 768 down to 256 through Matryoshka representation learning
+- **Maximum Sequence Length**: 512 tokens
+- **Languages**: Supports dozens of languages (see Performance section)
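
To make the "8 experts with top-2 routing" entry concrete, here is a minimal pure-Python sketch of a top-2 MoE layer. It is an illustration under stated assumptions, not the model's actual implementation: the toy experts, the router logits, and the choice to renormalize the two selected weights are all hypothetical. It only shows why a fraction of the total parameters is active per token: each token runs through 2 of the 8 experts.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top2_route(router_logits):
    """Return [(expert_index, weight), ...] for the two highest-scoring experts.

    Assumption: the two selected probabilities are renormalized to sum to 1.
    """
    probs = softmax(router_logits)
    top2 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    norm = sum(probs[i] for i in top2)
    return [(i, probs[i] / norm) for i in top2]

def moe_layer(token, experts, router_logits):
    """Combine the outputs of the two selected experts, weighted by the router."""
    out = [0.0] * len(token)
    for idx, weight in top2_route(router_logits):
        expert_out = experts[idx](token)  # only the selected experts are evaluated
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out

# Eight toy experts (hypothetical): expert k scales its input by k + 1.
experts = [lambda x, k=k: [(k + 1) * v for v in x] for k in range(8)]
router_logits = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2]

routes = top2_route(router_logits)
output = moe_layer([1.0, -1.0], experts, router_logits)
print(routes)
print(output)
```

In a real model the router logits would come from a learned linear layer applied to each token's hidden state, and each expert would be a feed-forward network rather than a scalar multiplier.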
+## Usage Guide
+### Installation
+The model can be used through SentenceTransformers and Transformers.
+**Important**: the text prompt *must* include a *task instruction prefix* that tells the model which task is being performed.
+For queries/questions, use `search_query: `; for the corresponding documents, use `search_document: `.
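
As a small illustration of the prefix rule, the sketch below prepends the appropriate prefix to raw texts before they are passed to the model. The helper name `add_prefix` and the constant names are hypothetical; only the two prefix strings come from this model card.

```python
# Prefix strings as given in the model card; the names below are illustrative.
QUERY_PREFIX = "search_query: "
DOCUMENT_PREFIX = "search_document: "

def add_prefix(texts, prefix):
    """Prepend the task instruction prefix, skipping texts that already start with it."""
    return [t if t.startswith(prefix) else prefix + t for t in texts]

queries = add_prefix(["What is a mixture of experts?"], QUERY_PREFIX)
documents = add_prefix(
    ["search_document: MoE layers route each token to a few experts."],
    DOCUMENT_PREFIX,
)
print(queries)
print(documents)
```

Checking for an existing prefix avoids doubling it when some inputs were already prepared upstream.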
+**Transformers**
+If using Transformers, **make sure to prepend the task instruction prefix**:
+```python
+import torch
+import torch.nn.functional as F
+from transformers import AutoTokenizer, AutoModel
+# trust_remote_code=True is needed to load the model's custom code
+tokenizer = AutoTokenizer.from_pretrained("nomic-ai/nomic-embed-text-v2-moe")
+model = AutoModel.from_pretrained("nomic-ai/nomic-embed-text-v2-moe", trust_remote_code=True)
+# Each input carries its task instruction prefix
+sentences = ['search_document: Hello!', 'search_document: ¡Hola!']
+def mean_pooling(model_output, attention_mask):
+    # Average the token embeddings, ignoring padding positions
+    token_embeddings = model_output[0]
+    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
+    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
+encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
+model.eval()
+with torch.no_grad():
+    model_output = model(**encoded_input)
+embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
+# L2-normalize so that dot products equal cosine similarities
+embeddings = F.normalize(embeddings, p=2, dim=1)
 ```
+**SentenceTransformers**
+With SentenceTransformers, you can specify the `prompt_name` (`"query"` or `"passage"`):
 ```python
 from sentence_transformers import SentenceTransformer
+model = SentenceTransformer("nomic-ai/nomic-embed-text-v2-moe", trust_remote_code=True)
+sentences = ["Hello!", "¡Hola!"]
+embeddings = model.encode(sentences, prompt_name="passage")
 ```
+## Performance
+![image/png](https://cdn-uploads.huggingface.co/production/uploads/607997c83a565c15675055b3/xadjrezEIM0Q1jbgmjqO7.png)
+![image/png](https://cdn-uploads.huggingface.co/production/uploads/607997c83a565c15675055b3/8hmhWQ_TTmlrviZFIBSxo.png)
+## Best Practices
+- Add appropriate prefixes to your text:
+  - For queries: "search_query: "
+  - For documents: "search_document: "
+- Maximum input length is 512 tokens
+- For optimal efficiency, consider using the 256-dimension embeddings if storage/compute is a concern
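
The 256-dimension recommendation can be sketched as follows. This is a minimal example that assumes the reduced embedding is obtained by keeping the first 256 components and re-normalizing to unit length; the function name is hypothetical, the input vector is synthetic, and the exact recipe should be checked against the library you use. The storage arithmetic assumes 4-byte float32 values.

```python
import math

def truncate_and_normalize(embedding, dim=256):
    """Keep the first `dim` components and rescale the result to unit L2 norm."""
    head = embedding[:dim]
    norm = math.sqrt(sum(v * v for v in head))
    return [v / norm for v in head] if norm > 0 else head

# Synthetic 768-dimensional vector standing in for a real model output.
full = [math.sin(i) for i in range(768)]
small = truncate_and_normalize(full, dim=256)

# Storage per vector, assuming float32 (4 bytes per value).
bytes_full = 768 * 4
bytes_small = 256 * 4
print(len(small), bytes_full, bytes_small, bytes_full // bytes_small)
```

Re-normalizing after truncation keeps dot products equal to cosine similarities, which matters if the index or search code assumes unit-length vectors.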
+## Limitations
+- Performance may vary across different languages
+- Resource requirements may be higher than traditional dense models due to MoE architecture
+- Requires `trust_remote_code=True` when loading the model
 ## Training Details
+![image/png](https://cdn-uploads.huggingface.co/production/uploads/607997c83a565c15675055b3/F0lyAtV8wXMBmxSbtIgL4.png)
+- Trained on 1.6 billion high-quality pairs across multiple languages
+- Uses consistency filtering to ensure high-quality training data
+- Incorporates Matryoshka representation learning for dimension flexibility
+- Training includes both weakly-supervised contrastive pretraining and supervised finetuning
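
For readers unfamiliar with contrastive training on text pairs, the sketch below shows one common formulation: an InfoNCE-style loss with in-batch negatives. It is an assumption made for illustration, not a description of the actual training code (which lives in the linked code repository); the temperature value and the toy vectors are made up.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce_loss(queries, documents, temperature=0.05):
    """Mean cross-entropy where documents[i] is the positive for queries[i]
    and every other document in the batch acts as a negative."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, d) / temperature for d in documents]
        m = max(logits)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(queries)

# Toy unit vectors: in `aligned`, each query matches its own document.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = info_nce_loss(queries, aligned)
loss_swapped = info_nce_loss(queries, swapped)
print(loss_aligned, loss_swapped)
```

The loss is near zero when each query is closest to its paired document and grows when the pairing is wrong, which is the signal that pulls matching pairs together during training.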
+## Join the Nomic Community
+- Nomic: [https://nomic.ai](https://nomic.ai)
+- Discord: [https://discord.gg/myY5YDR8z8](https://discord.gg/myY5YDR8z8)
+- Twitter: [https://twitter.com/nomic_ai](https://twitter.com/nomic_ai)