zpn committed (verified)
Commit a69ccd7 · 1 Parent(s): c5fb980

Update README.md

Files changed (1): README.md (+194 -120)

README.md CHANGED

---
base_model: nomic-ai/nomic-embed-text-v2-moe
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
license: apache-2.0
language:
- en
- es
- fr
- de
- it
- pt
- pl
- nl
- tr
- ja
- vi
- ru
- id
- ar
- cs
- ro
- sv
- el
- uk
- zh
- hu
- da
- 'no'
- hi
- fi
- bg
- ko
- sk
- th
- he
- ca
- lt
- fa
- ms
- sl
- lv
- mr
- bn
- sq
- cy
- be
- ml
- kn
- mk
- ur
- fy
- te
- eu
- sw
- so
- sd
- uz
- co
- hr
- gu
- ce
- eo
- jv
- la
- zu
- mn
- si
- ga
- ky
- tg
- my
- km
- mg
- pa
- sn
- ha
- ht
- su
- gd
- ny
- ps
- ku
- am
- ig
- lo
- mi
- nn
- sm
- yi
- st
- tl
- xh
- yo
---

# nomic-embed-text-v2-moe: Multilingual Mixture of Experts Text Embeddings

## Model Overview
nomic-embed-text-v2-moe is a state-of-the-art multilingual Mixture of Experts (MoE) text embedding model:

- **High Performance**: SoTA multilingual performance among ~300M-parameter models, and competitive with models twice its size
- **Multilinguality**: Supports 100+ languages and is trained on over 1.6B pairs
- **Flexible Embedding Dimension**: Trained with [Matryoshka Embeddings](https://arxiv.org/abs/2205.13147), giving a 3x reduction in storage cost with minimal performance degradation
- **Fully Open Source**: Model weights, [code](https://github.com/nomic-ai/contrastors), and training data (see the code repo) are released

| Model | Params (M) | Emb Dim | BEIR | MIRACL | Pretrain Data | Finetune Data | Code |
|-------|------------|---------|------|--------|---------------|---------------|------|
| Nomic Embed v2 | 305 | 768 | 52.86 | **65.80** | ✅ | ✅ | ✅ |
| mE5 Base | 278 | 768 | 48.88 | 62.30 | ❌ | ❌ | ❌ |
| mGTE Base | 305 | 768 | 51.10 | 63.40 | ❌ | ❌ | ❌ |
| Arctic Embed v2 Base | 305 | 768 | **55.40** | 59.90 | ❌ | ❌ | ❌ |
| | | | | | | | |
| BGE M3 | 568 | 1024 | 48.80 | **69.20** | ❌ | ✅ | ❌ |
| Arctic Embed v2 Large | 568 | 1024 | **55.65** | 66.00 | ❌ | ❌ | ❌ |
| mE5 Large | 560 | 1024 | 51.40 | 66.50 | ❌ | ❌ | ❌ |

## Model Architecture
- **Total Parameters**: 475M
- **Active Parameters During Inference**: 305M (a quick parameter-count check is sketched after this list)
- **Architecture Type**: Mixture of Experts (MoE)
- **MoE Configuration**: 8 experts with top-2 routing
- **Embedding Dimensions**: Flexible dimensions from 768 down to 256 via Matryoshka representation learning
- **Maximum Sequence Length**: 512 tokens
- **Languages**: Supports 100+ languages (see the Performance section)
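
The total parameter count can be verified directly from the loaded checkpoint; a small sketch using standard Transformers APIs (the active count cannot be read off this way, since it depends on the top-2 routing applied at inference time):

```python
from transformers import AutoModel

# trust_remote_code is required because the MoE architecture ships as custom modeling code
model = AutoModel.from_pretrained("nomic-ai/nomic-embed-text-v2-moe", trust_remote_code=True)

total = sum(p.numel() for p in model.parameters())
print(f"total parameters: {total / 1e6:.0f}M")  # expected to land around the 475M figure above
```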

## Usage Guide

### Installation

The model can be used through both SentenceTransformers and Transformers. If needed, install the libraries first (e.g. `pip install -U sentence-transformers`, which also pulls in `transformers` and `torch`).

**Important**: the text prompt *must* include a *task instruction prefix* that tells the model which task is being performed.

Use `search_query: ` for queries/questions and `search_document: ` for the corresponding documents.

**Transformers**

If you use Transformers directly, **make sure to prepend the task instruction prefix** yourself, as in the example below.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("nomic-ai/nomic-embed-text-v2-moe")
model = AutoModel.from_pretrained("nomic-ai/nomic-embed-text-v2-moe", trust_remote_code=True)

# Note the task instruction prefix on every input
sentences = ['search_document: Hello!', 'search_document: ¡Hola!']

def mean_pooling(model_output, attention_mask):
    # Average the token embeddings, ignoring padding positions
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

model.eval()
with torch.no_grad():
    model_output = model(**encoded_input)

embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
embeddings = F.normalize(embeddings, p=2, dim=1)
```
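
Because the embeddings are L2-normalized at the end of the snippet, cosine similarities reduce to a plain matrix product. A minimal follow-up sketch, reusing the `embeddings` tensor from above:

```python
# Pairwise cosine similarities between the encoded sentences (shape [2, 2])
similarities = embeddings @ embeddings.T
print(similarities)
```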

**SentenceTransformers**

With SentenceTransformers, you can instead specify the `prompt_name` (`query` or `passage`) and the corresponding prefix is prepended for you:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v2-moe", trust_remote_code=True)
sentences = ["Hello!", "¡Hola!"]
embeddings = model.encode(sentences, prompt_name="passage")
```
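
For retrieval, encode queries and documents with their respective prompt names and score them with the built-in `similarity` helper. A minimal sketch; the example texts are illustrative:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v2-moe", trust_remote_code=True)

queries = ["What is the capital of France?"]
documents = [
    "Paris is the capital and largest city of France.",
    "The Nile is the longest river in Africa.",
]

# prompt_name selects the task prefix configured for the model
# ("query" -> search_query: , "passage" -> search_document: )
query_embeddings = model.encode(queries, prompt_name="query")
document_embeddings = model.encode(documents, prompt_name="passage")

# Cosine similarity matrix: one row per query, one column per document
scores = model.similarity(query_embeddings, document_embeddings)
print(scores)
```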

## Performance

![image/png](https://cdn-uploads.huggingface.co/production/uploads/607997c83a565c15675055b3/xadjrezEIM0Q1jbgmjqO7.png)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/607997c83a565c15675055b3/8hmhWQ_TTmlrviZFIBSxo.png)

## Best Practices
- Add the appropriate prefix to every input text:
  - For queries: `search_query: `
  - For documents: `search_document: `
- Maximum input length is 512 tokens
- If storage or compute is a concern, consider the 256-dimension embeddings (see the sketch below)
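
A minimal sketch of the 256-dimension option: truncate the Matryoshka embeddings and re-normalize before computing cosine similarities (recent sentence-transformers releases also expose a `truncate_dim` argument that handles this for you):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v2-moe", trust_remote_code=True)

docs = ["The weather is lovely today.", "Il fait beau aujourd'hui."]
full = model.encode(docs, prompt_name="passage")   # shape (2, 768)

# Keep the first 256 Matryoshka dimensions, then re-normalize for cosine similarity
small = full[:, :256]
small = small / np.linalg.norm(small, axis=1, keepdims=True)
print(small.shape)  # (2, 256)
```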

## Limitations
- Performance may vary across languages
- Resource requirements may be higher than for comparable dense models because of the MoE architecture
- The model must be loaded with `trust_remote_code=True`
## Training Details

![image/png](https://cdn-uploads.huggingface.co/production/uploads/607997c83a565c15675055b3/F0lyAtV8wXMBmxSbtIgL4.png)

- Trained on 1.6 billion high-quality pairs across multiple languages
- Uses consistency filtering to ensure high-quality training data
- Incorporates Matryoshka representation learning for flexible embedding dimensions
- Training combines weakly-supervised contrastive pretraining with supervised finetuning (a generic illustration of the contrastive objective follows below)
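
For intuition, contrastive pretraining of this kind typically optimizes an InfoNCE-style objective over (query, document) pairs with in-batch negatives. The sketch below is a generic illustration of that objective, not the exact training recipe; see the [contrastors](https://github.com/nomic-ai/contrastors) repository for the actual training code:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor, doc_emb: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """Generic in-batch-negatives contrastive loss for a batch of (query, document) pairs."""
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.T / temperature                        # [batch, batch] similarity matrix
    targets = torch.arange(q.size(0), device=q.device)    # the i-th document is the positive for the i-th query
    return F.cross_entropy(logits, targets)

# Toy usage with random "embeddings"
print(info_nce_loss(torch.randn(8, 768), torch.randn(8, 768)))
```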

## Join the Nomic Community

- Nomic: [https://nomic.ai](https://nomic.ai)
- Discord: [https://discord.gg/myY5YDR8z8](https://discord.gg/myY5YDR8z8)
- Twitter: [https://twitter.com/nomic_ai](https://twitter.com/nomic_ai)