ernestobs7 committed on
Commit 3a0af86 · verified · 1 Parent(s): 52e5c80

Add new SentenceTransformer model

1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "word_embedding_dimension": 1024,
+   "pooling_mode_cls_token": true,
+   "pooling_mode_mean_tokens": false,
+   "pooling_mode_max_tokens": false,
+   "pooling_mode_mean_sqrt_len_tokens": false,
+   "pooling_mode_weightedmean_tokens": false,
+   "pooling_mode_lasttoken": false,
+   "include_prompt": true
+ }
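
This pooling configuration selects CLS pooling: the sentence embedding is the hidden state of the `[CLS]` token, which the pipeline then L2-normalizes (see `modules.json` below). A minimal sketch of what that means when using the Hugging Face `transformers` API directly on the base model; the example text is a placeholder, and normal usage should go through `SentenceTransformer` instead:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Base model only, not this fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained("Snowflake/snowflake-arctic-embed-l")
model = AutoModel.from_pretrained("Snowflake/snowflake-arctic-embed-l")

batch = tokenizer(["an example passage"], padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    token_embeddings = model(**batch).last_hidden_state  # [batch, seq_len, 1024]

# pooling_mode_cls_token=true: keep only the first ([CLS]) token embedding
sentence_embedding = token_embeddings[:, 0]               # [batch, 1024]
# the Normalize() module then L2-normalizes the vector
sentence_embedding = torch.nn.functional.normalize(sentence_embedding, p=2, dim=1)
```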
README.md ADDED
@@ -0,0 +1,697 @@
+ ---
+ tags:
+ - sentence-transformers
+ - sentence-similarity
+ - feature-extraction
+ - generated_from_trainer
+ - dataset_size:156
+ - loss:MatryoshkaLoss
+ - loss:MultipleNegativesRankingLoss
+ base_model: Snowflake/snowflake-arctic-embed-l
+ widget:
+ - source_sentence: What are some of the tools that different systems can apply to
+   problems, as mentioned in the context?
+   sentences:
+   - Synthetic data as a substantial component of pretraining is becoming increasingly
+     common, and the Phi series of models has consistently emphasized the importance
+     of synthetic data. Rather than serving as a cheap substitute for organic data,
+     synthetic data has several direct advantages over organic data.
+   - 'The number of available systems has exploded. Different systems have different
+     tools they can apply to your problems—like Python and JavaScript and web search
+     and image generation and maybe even database lookups... so you’d better understand
+     what those tools are, what they can do and how to tell if the LLM used them or
+     not.
+
+     Did you know ChatGPT has two entirely different ways of running Python now?
+
+     Want to build a Claude Artifact that talks to an external API? You’d better understand
+     CSP and CORS HTTP headers first.'
+   - '29th: NotebookLM’s automatically generated podcasts are surprisingly effective
+
+
+     30th: Weeknotes: Three podcasts, two trips and a new plugin system
+
+
+
+
+     October
+
+
+     1st: OpenAI DevDay 2024 live blog
+
+
+     2nd: OpenAI DevDay: Let’s build developer tools, not digital God
+
+
+     15th: ChatGPT will happily write you a thinly disguised horoscope
+
+
+     17th: Video scraping: extracting JSON data from a 35 second screen capture for
+     less than 1/10th of a cent
+
+
+     18th: Experimenting with audio input and output for the OpenAI Chat Completion
+     API
+
+
+     19th: Running Llama 3.2 Vision and Phi-3.5 Vision on a Mac with mistral.rs
+
+
+     21st: Everything I built with Claude Artifacts this week
+
+
+     22nd: Initial explorations of Anthropic’s new Computer Use capability'
+ - source_sentence: What key themes and pivotal moments in the field of Large Language
+   Models were identified in 2024?
+   sentences:
+   - 'One way to think about these models is an extension of the chain-of-thought prompting
+     trick, first explored in the May 2022 paper Large Language Models are Zero-Shot
+     Reasoners.
+
+     This is that trick where, if you get a model to talk out loud about a problem
+     it’s solving, you often get a result which the model would not have achieved otherwise.
+
+     o1 takes this process and further bakes it into the model itself. The details
+     are somewhat obfuscated: o1 models spend “reasoning tokens” thinking through the
+     problem that are not directly visible to the user (though the ChatGPT UI shows
+     a summary of them), then outputs a final result.'
+   - 'Things we learned about LLMs in 2024
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+     Simon Willison’s Weblog
+
+     Subscribe
+
+
+
+
+
+
+
+     Things we learned about LLMs in 2024
+
+     31st December 2024
+
+     A lot has happened in the world of Large Language Models over the course of 2024.
+     Here’s a review of things we figured out about the field in the past twelve months,
+     plus my attempt at identifying key themes and pivotal moments.
+
+     This is a sequel to my review of 2023.
+
+     In this article:'
+   - 'The number of available systems has exploded. Different systems have different
+     tools they can apply to your problems—like Python and JavaScript and web search
+     and image generation and maybe even database lookups... so you’d better understand
+     what those tools are, what they can do and how to tell if the LLM used them or
+     not.
+
+     Did you know ChatGPT has two entirely different ways of running Python now?
+
+     Want to build a Claude Artifact that talks to an external API? You’d better understand
+     CSP and CORS HTTP headers first.'
+ - source_sentence: Which organizations have models that scored higher than GPT-4-0314?
+   sentences:
+   - 'This prompt-driven custom interface feature is so powerful and easy to build
+     (once you’ve figured out the gnarly details of browser sandboxing) that I expect
+     it to show up as a feature in a wide range of products in 2025.
+
+     Universal access to the best models lasted for just a few short months
+
+     For a few short months this year all three of the best available models—GPT-4o,
+     Claude 3.5 Sonnet and Gemini 1.5 Pro—were freely available to most of the world.'
+   - 'Then there’s the rest. If you browse the Chatbot Arena leaderboard today—still
+     the most useful single place to get a vibes-based evaluation of models—you’ll
+     see that GPT-4-0314 has fallen to around 70th place. The 18 organizations with
+     higher scoring models are Google, OpenAI, Alibaba, Anthropic, Meta, Reka AI, 01
+     AI, Amazon, Cohere, DeepSeek, Nvidia, Mistral, NexusFlow, Zhipu AI, xAI, AI21
+     Labs, Princeton and Tencent.
+
+     Training a GPT-4 beating model was a huge deal in 2023. In 2024 it’s an achievement
+     that isn’t even particularly notable, though I personally still celebrate any
+     time a new organization joins that list.
+
+     Some of those GPT-4 models run on my laptop'
+   - 'This remains astonishing to me. I thought a model with the capabilities and output
+     quality of GPT-4 needed a datacenter class server with one or more $40,000+ GPUs.
+
+     These models take up enough of my 64GB of RAM that I don’t run them often—they
+     don’t leave much room for anything else.
+
+     The fact that they run at all is a testament to the incredible training and inference
+     performance gains that we’ve figured out over the past year. It turns out there
+     was a lot of low-hanging fruit to be harvested in terms of model efficiency. I
+     expect there’s still more to come.'
+ - source_sentence: What does the term "slop" refer to in the context of generative
+   AI usage?
+   sentences:
+   - 'I think this means that, as individual users, we don’t need to feel any guilt
+     at all for the energy consumed by the vast majority of our prompts. The impact
+     is likely neglible compared to driving a car down the street or maybe even watching
+     a video on YouTube.
+
+     Likewise, training. DeepSeek v3 training for less than $6m is a fantastic sign
+     that training costs can and should continue to drop.
+
+     For less efficient models I find it useful to compare their energy usage to commercial
+     flights. The largest Llama 3 model cost about the same as a single digit number
+     of fully loaded passenger flights from New York to London. That’s certainly not
+     nothing, but once trained that model can be used by millions of people at no extra
+     training cost.'
+   - 'A lot of people absolutely hate this stuff. In some of the spaces I hang out
+     (Mastodon, Bluesky, Lobste.rs, even Hacker News on occasion) even suggesting that
+     “LLMs are useful” can be enough to kick off a huge fight.
+
+     I get it. There are plenty of reasons to dislike this technology—the environmental
+     impact, the (lack of) ethics of the training data, the lack of reliability, the
+     negative applications, the potential impact on people’s jobs.
+
+     LLMs absolutely warrant criticism. We need to be talking through these problems,
+     finding ways to mitigate them and helping people learn how to use these tools
+     responsibly in ways where the positive applications outweigh the negative.'
+   - 'I love the term “slop” because it so succinctly captures one of the ways we should
+     not be using generative AI!
+
+     Slop was even in the running for Oxford Word of the Year 2024, but it lost to
+     brain rot.
+
+     Synthetic training data works great
+
+     An idea that surprisingly seems to have stuck in the public consciousness is that
+     of “model collapse”. This was first described in the paper The Curse of Recursion:
+     Training on Generated Data Makes Models Forget in May 2023, and repeated in Nature
+     in July 2024 with the more eye-catching headline AI models collapse when trained
+     on recursively generated data.'
+ - source_sentence: What are the dates of the articles listed as more recent articles
+   in the context?
+   sentences:
+   - "Posted 31st December 2024 at 6:07 pm · Follow me on Mastodon or Twitter or subscribe\
+     \ to my newsletter\n\n\nMore recent articles\n\nRun LLMs on macOS using llm-mlx\
+     \ and Apple's MLX framework - 15th February 2025\nURL-addressable Pyodide Python\
+     \ environments - 13th February 2025\nUsing pip to install a Large Language Model\
+     \ that's under 100MB - 7th February 2025\n\n\n \n\n\nThis is Things we learned\
+     \ about LLMs in 2024 by Simon Willison, posted on 31st December 2024.\n\nPart\
+     \ of series LLMs annual review\n\nStuff we figured out about AI in 2023 - Dec.\
+     \ 31, 2023, 11:59 p.m. \nThings we learned about LLMs in 2024 - Dec. 31, 2024,\
+     \ 6:07 p.m. \n\n\n\n google\n 347\n\n\n ai\n\
+     \ 1098\n\n\n openai\n 255"
+   - 'OpenAI made GPT-4o free for all users in May, and Claude 3.5 Sonnet was freely
+     available from its launch in June. This was a momentus change, because for the
+     previous year free users had mostly been restricted to GPT-3.5 level models, meaning
+     new users got a very inaccurate mental model of what a capable LLM could actually
+     do.
+
+     That era appears to have ended, likely permanently, with OpenAI’s launch of ChatGPT
+     Pro. This $200/month subscription service is the only way to access their most
+     capable model, o1 Pro.
+
+     Since the trick behind the o1 series (and the future models it will undoubtedly
+     inspire) is to expend more compute time to get better results, I don’t think those
+     days of free access to the best available models are likely to return.'
+   - 'Against this photo of butterflies at the California Academy of Sciences:
+
+
+
+     A shallow dish, likely a hummingbird or butterfly feeder, is red. Pieces of orange
+     slices of fruit are visible inside the dish.
+
+     Two butterflies are positioned in the feeder, one is a dark brown/black butterfly
+     with white/cream-colored markings. The other is a large, brown butterfly with
+     patterns of lighter brown, beige, and black markings, including prominent eye
+     spots. The larger brown butterfly appears to be feeding on the fruit.'
+ pipeline_tag: sentence-similarity
+ library_name: sentence-transformers
+ metrics:
+ - cosine_accuracy@1
+ - cosine_accuracy@3
+ - cosine_accuracy@5
+ - cosine_accuracy@10
+ - cosine_precision@1
+ - cosine_precision@3
+ - cosine_precision@5
+ - cosine_precision@10
+ - cosine_recall@1
+ - cosine_recall@3
+ - cosine_recall@5
+ - cosine_recall@10
+ - cosine_ndcg@10
+ - cosine_mrr@10
+ - cosine_map@100
+ model-index:
+ - name: SentenceTransformer based on Snowflake/snowflake-arctic-embed-l
+   results:
+   - task:
+       type: information-retrieval
+       name: Information Retrieval
+     dataset:
+       name: Unknown
+       type: unknown
+     metrics:
+     - type: cosine_accuracy@1
+       value: 0.75
+       name: Cosine Accuracy@1
+     - type: cosine_accuracy@3
+       value: 1.0
+       name: Cosine Accuracy@3
+     - type: cosine_accuracy@5
+       value: 1.0
+       name: Cosine Accuracy@5
+     - type: cosine_accuracy@10
+       value: 1.0
+       name: Cosine Accuracy@10
+     - type: cosine_precision@1
+       value: 0.75
+       name: Cosine Precision@1
+     - type: cosine_precision@3
+       value: 0.3333333333333333
+       name: Cosine Precision@3
+     - type: cosine_precision@5
+       value: 0.20000000000000004
+       name: Cosine Precision@5
+     - type: cosine_precision@10
+       value: 0.10000000000000002
+       name: Cosine Precision@10
+     - type: cosine_recall@1
+       value: 0.75
+       name: Cosine Recall@1
+     - type: cosine_recall@3
+       value: 1.0
+       name: Cosine Recall@3
+     - type: cosine_recall@5
+       value: 1.0
+       name: Cosine Recall@5
+     - type: cosine_recall@10
+       value: 1.0
+       name: Cosine Recall@10
+     - type: cosine_ndcg@10
+       value: 0.8968216255952429
+       name: Cosine Ndcg@10
+     - type: cosine_mrr@10
+       value: 0.861111111111111
+       name: Cosine Mrr@10
+     - type: cosine_map@100
+       value: 0.861111111111111
+       name: Cosine Map@100
+ ---
+
+ # SentenceTransformer based on Snowflake/snowflake-arctic-embed-l
+
+ This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [Snowflake/snowflake-arctic-embed-l](https://huggingface.co/Snowflake/snowflake-arctic-embed-l). It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
+
+ ## Model Details
+
+ ### Model Description
+ - **Model Type:** Sentence Transformer
+ - **Base model:** [Snowflake/snowflake-arctic-embed-l](https://huggingface.co/Snowflake/snowflake-arctic-embed-l) <!-- at revision d8fb21ca8d905d2832ee8b96c894d3298964346b -->
+ - **Maximum Sequence Length:** 512 tokens
+ - **Output Dimensionality:** 1024 dimensions
+ - **Similarity Function:** Cosine Similarity
+ <!-- - **Training Dataset:** Unknown -->
+ <!-- - **Language:** Unknown -->
+ <!-- - **License:** Unknown -->
+
+ ### Model Sources
+
+ - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
+ - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
+ - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
+
+ ### Full Model Architecture
+
+ ```
+ SentenceTransformer(
+   (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
+   (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
+   (2): Normalize()
+ )
+ ```
+
+ ## Usage
+
+ ### Direct Usage (Sentence Transformers)
+
+ First install the Sentence Transformers library:
+
+ ```bash
+ pip install -U sentence-transformers
+ ```
+
+ Then you can load this model and run inference.
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ # Download from the 🤗 Hub
+ model = SentenceTransformer("ernestobs7/legal-ft-v0")
+ # Run inference
+ sentences = [
+     'What are the dates of the articles listed as more recent articles in the context?',
+     "Posted 31st December 2024 at 6:07 pm · Follow me on Mastodon or Twitter or subscribe to my newsletter\n\n\nMore recent articles\n\nRun LLMs on macOS using llm-mlx and Apple's MLX framework - 15th February 2025\nURL-addressable Pyodide Python environments - 13th February 2025\nUsing pip to install a Large Language Model that's under 100MB - 7th February 2025\n\n\n \n\n\nThis is Things we learned about LLMs in 2024 by Simon Willison, posted on 31st December 2024.\n\nPart of series LLMs annual review\n\nStuff we figured out about AI in 2023 - Dec. 31, 2023, 11:59 p.m. \nThings we learned about LLMs in 2024 - Dec. 31, 2024, 6:07 p.m. \n\n\n\n google\n 347\n\n\n ai\n 1098\n\n\n openai\n 255",
+     'Against this photo of butterflies at the California Academy of Sciences:\n\n\nA shallow dish, likely a hummingbird or butterfly feeder, is red. Pieces of orange slices of fruit are visible inside the dish.\nTwo butterflies are positioned in the feeder, one is a dark brown/black butterfly with white/cream-colored markings. The other is a large, brown butterfly with patterns of lighter brown, beige, and black markings, including prominent eye spots. The larger brown butterfly appears to be feeding on the fruit.',
+ ]
+ embeddings = model.encode(sentences)
+ print(embeddings.shape)
+ # [3, 1024]
+
+ # Get the similarity scores for the embeddings
+ similarities = model.similarity(embeddings, embeddings)
+ print(similarities.shape)
+ # [3, 3]
+ ```
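
Because the checkpoint ships a retrieval prompt (`config_sentence_transformers.json` below defines a `query` prompt), a more typical retrieval setup encodes queries with that prompt and passages without it. A minimal sketch, with placeholder passages:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("ernestobs7/legal-ft-v0")

query = "Which organizations have models that scored higher than GPT-4-0314?"
passages = [
    "Then there’s the rest. If you browse the Chatbot Arena leaderboard today ...",
    "Synthetic data as a substantial component of pretraining is becoming increasingly common ...",
]

# "query" is the prompt name defined in config_sentence_transformers.json; it
# prepends "Represent this sentence for searching relevant passages: " to the query
query_embedding = model.encode([query], prompt_name="query")
passage_embeddings = model.encode(passages)

# Cosine similarity (the configured similarity function); higher = more relevant
scores = model.similarity(query_embedding, passage_embeddings)
print(scores)  # shape [1, 2]
```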
+
+ <!--
+ ### Direct Usage (Transformers)
+
+ <details><summary>Click to see the direct usage in Transformers</summary>
+
+ </details>
+ -->
+
+ <!--
+ ### Downstream Usage (Sentence Transformers)
+
+ You can finetune this model on your own dataset.
+
+ <details><summary>Click to expand</summary>
+
+ </details>
+ -->
+
+ <!--
+ ### Out-of-Scope Use
+
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
+ -->
+
+ ## Evaluation
+
+ ### Metrics
+
+ #### Information Retrieval
+
+ * Evaluated with [<code>InformationRetrievalEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator)
+
+ | Metric              | Value      |
+ |:--------------------|:-----------|
+ | cosine_accuracy@1   | 0.75       |
+ | cosine_accuracy@3   | 1.0        |
+ | cosine_accuracy@5   | 1.0        |
+ | cosine_accuracy@10  | 1.0        |
+ | cosine_precision@1  | 0.75       |
+ | cosine_precision@3  | 0.3333     |
+ | cosine_precision@5  | 0.2        |
+ | cosine_precision@10 | 0.1        |
+ | cosine_recall@1     | 0.75       |
+ | cosine_recall@3     | 1.0        |
+ | cosine_recall@5     | 1.0        |
+ | cosine_recall@10    | 1.0        |
+ | **cosine_ndcg@10**  | **0.8968** |
+ | cosine_mrr@10       | 0.8611     |
+ | cosine_map@100      | 0.8611     |
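
A minimal sketch of how numbers like these can be produced with `InformationRetrievalEvaluator`; the queries, corpus, and relevance judgments below are placeholders rather than the actual held-out split behind the table above:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("ernestobs7/legal-ft-v0")

# Placeholder evaluation data: query id -> text, passage id -> text,
# and query id -> set of relevant passage ids.
queries = {"q1": "What does the term 'slop' refer to in the context of generative AI usage?"}
corpus = {
    "d1": "I love the term “slop” because it so succinctly captures one of the ways we should not be using generative AI!",
    "d2": "Against this photo of butterflies at the California Academy of Sciences: ...",
}
relevant_docs = {"q1": {"d1"}}

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs, name="example-eval")
results = evaluator(model)
print(results)  # includes cosine_accuracy@k, cosine_precision@k, cosine_ndcg@10, ...
```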
+
+ <!--
+ ## Bias, Risks and Limitations
+
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
+ -->
+
+ <!--
+ ### Recommendations
+
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
+ -->
+
+ ## Training Details
+
+ ### Training Dataset
+
+ #### Unnamed Dataset
+
+ * Size: 156 training samples
+ * Columns: <code>sentence_0</code> and <code>sentence_1</code>
+ * Approximate statistics based on the first 156 samples:
+   |         | sentence_0                                                                         | sentence_1                                                                           |
+   |:--------|:-----------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------|
+   | type    | string                                                                             | string                                                                               |
+   | details | <ul><li>min: 13 tokens</li><li>mean: 20.12 tokens</li><li>max: 33 tokens</li></ul> | <ul><li>min: 43 tokens</li><li>mean: 130.53 tokens</li><li>max: 204 tokens</li></ul> |
+ * Samples:
+   | sentence_0 | sentence_1 |
+   |:----------------------------------------------------------------------------------------------------------|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+   | <code>What are the hardware requirements mentioned for running models like GPT-4?</code> | <code>This remains astonishing to me. I thought a model with the capabilities and output quality of GPT-4 needed a datacenter class server with one or more $40,000+ GPUs.<br>These models take up enough of my 64GB of RAM that I don’t run them often—they don’t leave much room for anything else.<br>The fact that they run at all is a testament to the incredible training and inference performance gains that we’ve figured out over the past year. It turns out there was a lot of low-hanging fruit to be harvested in terms of model efficiency. I expect there’s still more to come.</code> |
+   | <code>What does the author attribute the ability to run these models on less powerful hardware to?</code> | <code>This remains astonishing to me. I thought a model with the capabilities and output quality of GPT-4 needed a datacenter class server with one or more $40,000+ GPUs.<br>These models take up enough of my 64GB of RAM that I don’t run them often—they don’t leave much room for anything else.<br>The fact that they run at all is a testament to the incredible training and inference performance gains that we’ve figured out over the past year. It turns out there was a lot of low-hanging fruit to be harvested in terms of model efficiency. I expect there’s still more to come.</code> |
+   | <code>What challenges are associated with using LLMs in 2024?</code> | <code>The year of slop<br>Synthetic training data works great<br>LLMs somehow got even harder to use<br>Knowledge is incredibly unevenly distributed<br>LLMs need better criticism<br>Everything tagged “llms” on my blog in 2024</code> |
+ * Loss: [<code>MatryoshkaLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#matryoshkaloss) with these parameters:
+   ```json
+   {
+       "loss": "MultipleNegativesRankingLoss",
+       "matryoshka_dims": [
+           768,
+           512,
+           256,
+           128,
+           64
+       ],
+       "matryoshka_weights": [
+           1,
+           1,
+           1,
+           1,
+           1
+       ],
+       "n_dims_per_step": -1
+   }
+   ```
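
A minimal sketch of how this loss configuration maps onto the Sentence Transformers API; the training pair below is a placeholder standing in for the 156 `(sentence_0, sentence_1)` samples above:

```python
from datasets import Dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

model = SentenceTransformer("Snowflake/snowflake-arctic-embed-l")

# Placeholder (anchor, positive) pair in the same two-column format as the dataset above
train_dataset = Dataset.from_dict({
    "sentence_0": ["What challenges are associated with using LLMs in 2024?"],
    "sentence_1": ["The year of slop. Synthetic training data works great. LLMs somehow got even harder to use."],
})

# MultipleNegativesRankingLoss treats other in-batch positives as negatives;
# MatryoshkaLoss additionally applies that objective at each truncated embedding size.
inner_loss = MultipleNegativesRankingLoss(model)
loss = MatryoshkaLoss(model, inner_loss, matryoshka_dims=[768, 512, 256, 128, 64])
```

The Matryoshka objective is what makes truncated embeddings usable at inference time, e.g. loading the model with `SentenceTransformer("ernestobs7/legal-ft-v0", truncate_dim=256)`.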
+
+ ### Training Hyperparameters
+ #### Non-Default Hyperparameters
+
+ - `eval_strategy`: steps
+ - `per_device_train_batch_size`: 10
+ - `per_device_eval_batch_size`: 10
+ - `num_train_epochs`: 10
+ - `multi_dataset_batch_sampler`: round_robin
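
A minimal sketch of how the non-default values above map onto `SentenceTransformerTrainingArguments` (the output directory is a placeholder):

```python
from sentence_transformers import SentenceTransformerTrainingArguments
from sentence_transformers.training_args import MultiDatasetBatchSamplers

args = SentenceTransformerTrainingArguments(
    output_dir="models/legal-ft-v0",  # placeholder path
    eval_strategy="steps",
    per_device_train_batch_size=10,
    per_device_eval_batch_size=10,
    num_train_epochs=10,
    multi_dataset_batch_sampler=MultiDatasetBatchSamplers.ROUND_ROBIN,
)
```

These arguments would be passed to a `SentenceTransformerTrainer` together with the dataset, loss, and evaluator sketched earlier.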
+
+ #### All Hyperparameters
+ <details><summary>Click to expand</summary>
+
+ - `overwrite_output_dir`: False
+ - `do_predict`: False
+ - `eval_strategy`: steps
+ - `prediction_loss_only`: True
+ - `per_device_train_batch_size`: 10
+ - `per_device_eval_batch_size`: 10
+ - `per_gpu_train_batch_size`: None
+ - `per_gpu_eval_batch_size`: None
+ - `gradient_accumulation_steps`: 1
+ - `eval_accumulation_steps`: None
+ - `torch_empty_cache_steps`: None
+ - `learning_rate`: 5e-05
+ - `weight_decay`: 0.0
+ - `adam_beta1`: 0.9
+ - `adam_beta2`: 0.999
+ - `adam_epsilon`: 1e-08
+ - `max_grad_norm`: 1
+ - `num_train_epochs`: 10
+ - `max_steps`: -1
+ - `lr_scheduler_type`: linear
+ - `lr_scheduler_kwargs`: {}
+ - `warmup_ratio`: 0.0
+ - `warmup_steps`: 0
+ - `log_level`: passive
+ - `log_level_replica`: warning
+ - `log_on_each_node`: True
+ - `logging_nan_inf_filter`: True
+ - `save_safetensors`: True
+ - `save_on_each_node`: False
+ - `save_only_model`: False
+ - `restore_callback_states_from_checkpoint`: False
+ - `no_cuda`: False
+ - `use_cpu`: False
+ - `use_mps_device`: False
+ - `seed`: 42
+ - `data_seed`: None
+ - `jit_mode_eval`: False
+ - `use_ipex`: False
+ - `bf16`: False
+ - `fp16`: False
+ - `fp16_opt_level`: O1
+ - `half_precision_backend`: auto
+ - `bf16_full_eval`: False
+ - `fp16_full_eval`: False
+ - `tf32`: None
+ - `local_rank`: 0
+ - `ddp_backend`: None
+ - `tpu_num_cores`: None
+ - `tpu_metrics_debug`: False
+ - `debug`: []
+ - `dataloader_drop_last`: False
+ - `dataloader_num_workers`: 0
+ - `dataloader_prefetch_factor`: None
+ - `past_index`: -1
+ - `disable_tqdm`: False
+ - `remove_unused_columns`: True
+ - `label_names`: None
+ - `load_best_model_at_end`: False
+ - `ignore_data_skip`: False
+ - `fsdp`: []
+ - `fsdp_min_num_params`: 0
+ - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
+ - `fsdp_transformer_layer_cls_to_wrap`: None
+ - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
+ - `deepspeed`: None
+ - `label_smoothing_factor`: 0.0
+ - `optim`: adamw_torch
+ - `optim_args`: None
+ - `adafactor`: False
+ - `group_by_length`: False
+ - `length_column_name`: length
+ - `ddp_find_unused_parameters`: None
+ - `ddp_bucket_cap_mb`: None
+ - `ddp_broadcast_buffers`: False
+ - `dataloader_pin_memory`: True
+ - `dataloader_persistent_workers`: False
+ - `skip_memory_metrics`: True
+ - `use_legacy_prediction_loop`: False
+ - `push_to_hub`: False
+ - `resume_from_checkpoint`: None
+ - `hub_model_id`: None
+ - `hub_strategy`: every_save
+ - `hub_private_repo`: None
+ - `hub_always_push`: False
+ - `gradient_checkpointing`: False
+ - `gradient_checkpointing_kwargs`: None
+ - `include_inputs_for_metrics`: False
+ - `include_for_metrics`: []
+ - `eval_do_concat_batches`: True
+ - `fp16_backend`: auto
+ - `push_to_hub_model_id`: None
+ - `push_to_hub_organization`: None
+ - `mp_parameters`: 
+ - `auto_find_batch_size`: False
+ - `full_determinism`: False
+ - `torchdynamo`: None
+ - `ray_scope`: last
+ - `ddp_timeout`: 1800
+ - `torch_compile`: False
+ - `torch_compile_backend`: None
+ - `torch_compile_mode`: None
+ - `dispatch_batches`: None
+ - `split_batches`: None
+ - `include_tokens_per_second`: False
+ - `include_num_input_tokens_seen`: False
+ - `neftune_noise_alpha`: None
+ - `optim_target_modules`: None
+ - `batch_eval_metrics`: False
+ - `eval_on_start`: False
+ - `use_liger_kernel`: False
+ - `eval_use_gather_object`: False
+ - `average_tokens_across_devices`: False
+ - `prompts`: None
+ - `batch_sampler`: batch_sampler
+ - `multi_dataset_batch_sampler`: round_robin
+
+ </details>
+
+ ### Training Logs
+ | Epoch | Step | cosine_ndcg@10 |
+ |:-----:|:----:|:--------------:|
+ | 1.0   | 16   | 0.8885         |
+ | 2.0   | 32   | 0.8939         |
+ | 3.0   | 48   | 0.8939         |
+ | 3.125 | 50   | 0.8994         |
+ | 4.0   | 64   | 0.8939         |
+ | 5.0   | 80   | 0.8939         |
+ | 6.0   | 96   | 0.8968         |
+ | 6.25  | 100  | 0.8968         |
+ | 7.0   | 112  | 0.8968         |
+ | 8.0   | 128  | 0.8968         |
+ | 9.0   | 144  | 0.8968         |
+ | 9.375 | 150  | 0.8968         |
+ | 10.0  | 160  | 0.8968         |
+
+
+ ### Framework Versions
+ - Python: 3.11.11
+ - Sentence Transformers: 3.4.1
+ - Transformers: 4.48.3
+ - PyTorch: 2.5.1+cu124
+ - Accelerate: 1.3.0
+ - Datasets: 3.3.0
+ - Tokenizers: 0.21.0
+
+ ## Citation
+
+ ### BibTeX
+
+ #### Sentence Transformers
+ ```bibtex
+ @inproceedings{reimers-2019-sentence-bert,
+     title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
+     author = "Reimers, Nils and Gurevych, Iryna",
+     booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
+     month = "11",
+     year = "2019",
+     publisher = "Association for Computational Linguistics",
+     url = "https://arxiv.org/abs/1908.10084",
+ }
+ ```
+
+ #### MatryoshkaLoss
+ ```bibtex
+ @misc{kusupati2024matryoshka,
+     title={Matryoshka Representation Learning},
+     author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
+     year={2024},
+     eprint={2205.13147},
+     archivePrefix={arXiv},
+     primaryClass={cs.LG}
+ }
+ ```
+
+ #### MultipleNegativesRankingLoss
+ ```bibtex
+ @misc{henderson2017efficient,
+     title={Efficient Natural Language Response Suggestion for Smart Reply},
+     author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
+     year={2017},
+     eprint={1705.00652},
+     archivePrefix={arXiv},
+     primaryClass={cs.CL}
+ }
+ ```
+
+ <!--
+ ## Glossary
+
+ *Clearly define terms in order to be accessible across audiences.*
+ -->
+
+ <!--
+ ## Model Card Authors
+
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
+ -->
+
+ <!--
+ ## Model Card Contact
+
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
+ -->
config.json ADDED
@@ -0,0 +1,25 @@
+ {
+   "_name_or_path": "Snowflake/snowflake-arctic-embed-l",
+   "architectures": [
+     "BertModel"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "classifier_dropout": null,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 1024,
+   "initializer_range": 0.02,
+   "intermediate_size": 4096,
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 16,
+   "num_hidden_layers": 24,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "torch_dtype": "float32",
+   "transformers_version": "4.48.3",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 30522
+ }
config_sentence_transformers.json ADDED
@@ -0,0 +1,12 @@
+ {
+   "__version__": {
+     "sentence_transformers": "3.4.1",
+     "transformers": "4.48.3",
+     "pytorch": "2.5.1+cu124"
+   },
+   "prompts": {
+     "query": "Represent this sentence for searching relevant passages: "
+   },
+   "default_prompt_name": null,
+   "similarity_fn_name": "cosine"
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4de5eb16e66196c8263aedb0c9332d143a4c34383fe42852de296d0e1e16432d
+ size 1336413848
modules.json ADDED
@@ -0,0 +1,20 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.models.Transformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_Pooling",
+     "type": "sentence_transformers.models.Pooling"
+   },
+   {
+     "idx": 2,
+     "name": "2",
+     "path": "2_Normalize",
+     "type": "sentence_transformers.models.Normalize"
+   }
+ ]
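
These three entries are the modules that `SentenceTransformer("ernestobs7/legal-ft-v0")` assembles automatically. A minimal sketch of building the same Transformer → Pooling (CLS) → Normalize pipeline by hand from the base model, using the `sentence_transformers.models` API:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.models import Normalize, Pooling, Transformer

# Mirrors modules.json: Transformer -> Pooling (CLS) -> Normalize
transformer = Transformer("Snowflake/snowflake-arctic-embed-l", max_seq_length=512)
pooling = Pooling(transformer.get_word_embedding_dimension(), pooling_mode="cls")
normalize = Normalize()

model = SentenceTransformer(modules=[transformer, pooling, normalize])
print(model)  # prints the same module stack as the Full Model Architecture section
```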
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "max_seq_length": 512,
+   "do_lower_case": false
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
+ {
+   "cls_token": {
+     "content": "[CLS]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "[MASK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "[PAD]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "[SEP]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "[UNK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,63 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "100": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "101": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "102": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "103": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "do_lower_case": true,
+   "extra_special_tokens": {},
+   "mask_token": "[MASK]",
+   "max_length": 512,
+   "model_max_length": 512,
+   "pad_to_multiple_of": null,
+   "pad_token": "[PAD]",
+   "pad_token_type_id": 0,
+   "padding_side": "right",
+   "sep_token": "[SEP]",
+   "stride": 0,
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "truncation_side": "right",
+   "truncation_strategy": "longest_first",
+   "unk_token": "[UNK]"
+ }
vocab.txt ADDED
The diff for this file is too large to render. See raw diff