njhaveri committed · verified
Commit 80915d9 · 1 Parent(s): c2b8428

Add new SentenceTransformer model
1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "word_embedding_dimension": 1024,
+   "pooling_mode_cls_token": true,
+   "pooling_mode_mean_tokens": false,
+   "pooling_mode_max_tokens": false,
+   "pooling_mode_mean_sqrt_len_tokens": false,
+   "pooling_mode_weightedmean_tokens": false,
+   "pooling_mode_lasttoken": false,
+   "include_prompt": true
+ }
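With `pooling_mode_cls_token` set to `true` and every other mode disabled, the sentence embedding is simply the transformer's hidden state for the first (`[CLS]`) token. A minimal sketch of that pooling step, using NumPy arrays as stand-ins for the real token hidden states (the array contents here are illustrative, not model outputs):

```python
import numpy as np

def cls_pooling(token_embeddings: np.ndarray) -> np.ndarray:
    """CLS pooling: the sentence embedding is the hidden state of the
    first token, per pooling_mode_cls_token in the config above."""
    # token_embeddings has shape (seq_len, hidden_dim)
    return token_embeddings[0]

# Toy example: 4 tokens, 1024-dim states (word_embedding_dimension above)
hidden = np.random.rand(4, 1024)
sentence_embedding = cls_pooling(hidden)
print(sentence_embedding.shape)  # (1024,)
```

The other flags (mean, max, last-token, etc.) are alternative reductions over the same `(seq_len, hidden_dim)` tensor; only one is active here.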
README.md ADDED
@@ -0,0 +1,750 @@
+ ---
+ tags:
+ - sentence-transformers
+ - sentence-similarity
+ - feature-extraction
+ - generated_from_trainer
+ - dataset_size:156
+ - loss:MatryoshkaLoss
+ - loss:MultipleNegativesRankingLoss
+ base_model: Snowflake/snowflake-arctic-embed-l
+ widget:
+ - source_sentence: What is the significance of Claude Artifacts in the context of
+     LLMs and application development?
+   sentences:
+   - 'The environmental impact got much, much worse
+
+     The much bigger problem here is the enormous competitive buildout of the infrastructure
+     that is imagined to be necessary for these models in the future.
+
+     Companies like Google, Meta, Microsoft and Amazon are all spending billions of
+     dollars rolling out new datacenters, with a very material impact on the electricity
+     grid and the environment. There’s even talk of spinning up new nuclear power stations,
+     but those can take decades.
+
+     Is this infrastructure necessary? DeepSeek v3’s $6m training cost and the continued
+     crash in LLM prices might hint that it’s not. But would you want to be the big
+     tech executive that argued NOT to build out this infrastructure only to be proven
+     wrong in a few years’ time?'
+   - 'We already knew LLMs were spookily good at writing code. If you prompt them right,
+     it turns out they can build you a full interactive application using HTML, CSS
+     and JavaScript (and tools like React if you wire up some extra supporting build
+     mechanisms)—often in a single prompt.
+
+     Anthropic kicked this idea into high gear when they released Claude Artifacts,
+     a groundbreaking new feature that was initially slightly lost in the noise due
+     to being described half way through their announcement of the incredible Claude
+     3.5 Sonnet.
+
+     With Artifacts, Claude can write you an on-demand interactive application and
+     then let you use it directly inside the Claude interface.
+
+     Here’s my Extract URLs app, entirely generated by Claude:'
+   - 'This prompt-driven custom interface feature is so powerful and easy to build
+     (once you’ve figured out the gnarly details of browser sandboxing) that I expect
+     it to show up as a feature in a wide range of products in 2025.
+
+     Universal access to the best models lasted for just a few short months
+
+     For a few short months this year all three of the best available models—GPT-4o,
+     Claude 3.5 Sonnet and Gemini 1.5 Pro—were freely available to most of the world.'
+ - source_sentence: What challenges are associated with using LLMs in the year of slop?
+   sentences:
+   - 'I also gave a bunch of talks and podcast appearances. I’ve started habitually
+     turning my talks into annotated presentations—here are my best from 2023:
+
+
+     Prompt injection explained, with video, slides, and a transcript
+
+     Catching up on the weird world of LLMs
+
+     Making Large Language Models work for you
+
+     Open questions for AI engineering
+
+     Embeddings: What they are and why they matter
+
+     Financial sustainability for open source projects at GitHub Universe
+
+
+     And in podcasts:
+
+
+
+     What AI can do for you on the Theory of Change
+
+
+     Working in public on Path to Citus Con
+
+
+     LLMs break the internet on the Changelog
+
+
+     Talking Large Language Models on Rooftop Ruby
+
+
+     Thoughts on the OpenAI board situation on Newsroom Robots'
+   - 'The year of slop
+
+     Synthetic training data works great
+
+     LLMs somehow got even harder to use
+
+     Knowledge is incredibly unevenly distributed
+
+     LLMs need better criticism
+
+     Everything tagged “llms” on my blog in 2024'
+   - 'The boring yet crucial secret behind good system prompts is test-driven development.
+     You don’t write down a system prompt and find ways to test it. You write down
+     tests and find a system prompt that passes them.
+
+
+     It’s become abundantly clear over the course of 2024 that writing good automated
+     evals for LLM-powered systems is the skill that’s most needed to build useful
+     applications on top of these models. If you have a strong eval suite you can adopt
+     new models faster, iterate better and build more reliable and useful product features
+     than your competition.
+
+     Vercel’s Malte Ubl:'
+ - source_sentence: What features did GitHub and Mistral Chat introduce in relation
+     to the author's findings?
+   sentences:
+   - 'Except... you can run generated code to see if it’s correct. And with patterns
+     like ChatGPT Code Interpreter the LLM can execute the code itself, process the
+     error message, then rewrite it and keep trying until it works!
+
+     So hallucination is a much lesser problem for code generation than for anything
+     else. If only we had the equivalent of Code Interpreter for fact-checking natural
+     language!
+
+     How should we feel about this as software engineers?
+
+     On the one hand, this feels like a threat: who needs a programmer if ChatGPT can
+     write code for you?'
+   - 'I’ve found myself using this a lot. I noticed how much I was relying on it in
+     October and wrote Everything I built with Claude Artifacts this week, describing
+     14 little tools I had put together in a seven day period.
+
+     Since then, a whole bunch of other teams have built similar systems. GitHub announced
+     their version of this—GitHub Spark—in October. Mistral Chat added it as a feature
+     called Canvas in November.
+
+     Steve Krouse from Val Town built a version of it against Cerebras, showcasing
+     how a 2,000 token/second LLM can iterate on an application with changes visible
+     in less than a second.'
+   - 'This remains astonishing to me. I thought a model with the capabilities and output
+     quality of GPT-4 needed a datacenter class server with one or more $40,000+ GPUs.
+
+     These models take up enough of my 64GB of RAM that I don’t run them often—they
+     don’t leave much room for anything else.
+
+     The fact that they run at all is a testament to the incredible training and inference
+     performance gains that we’ve figured out over the past year. It turns out there
+     was a lot of low-hanging fruit to be harvested in terms of model efficiency. I
+     expect there’s still more to come.'
+ - source_sentence: Why did the voice from the demo, named Skye, not make it to a production
+     product?
+   sentences:
+   - 'A lot of people are excited about AI agents—an infuriatingly vague term that
+     seems to be converging on “AI systems that can go away and act on your behalf”.
+     We’ve been talking about them all year, but I’ve seen few if any examples of them
+     running in production, despite lots of exciting prototypes.
+
+     I think this is because of gullibility.
+
+     Can we solve this? Honestly, I’m beginning to suspect that you can’t fully solve
+     gullibility without achieving AGI. So it may be quite a while before those agent
+     dreams can really start to come true!
+
+     Code may be the best application
+
+     Over the course of the year, it’s become increasingly clear that writing code
+     is one of the things LLMs are most capable of.'
+   - 'Embeddings: What they are and why they matter
+
+     61.7k
+
+     79.3k
+
+
+
+     Catching up on the weird world of LLMs
+
+     61.6k
+
+     85.9k
+
+
+
+     llamafile is the new best way to run an LLM on your own computer
+
+     52k
+
+     66k
+
+
+
+     Prompt injection explained, with video, slides, and a transcript
+
+     51k
+
+     61.9k
+
+
+
+     AI-enhanced development makes me more ambitious with my projects
+
+     49.6k
+
+     60.1k
+
+
+
+     Understanding GPT tokenizers
+
+     49.5k
+
+     61.1k
+
+
+
+     Exploring GPTs: ChatGPT in a trench coat?
+
+     46.4k
+
+     58.5k
+
+
+
+     Could you train a ChatGPT-beating model for $85,000 and run it in a browser?
+
+     40.5k
+
+     49.2k
+
+
+
+     How to implement Q&A against your documentation with GPT3, embeddings and Datasette
+
+     37.3k
+
+     44.9k
+
+
+
+     Lawyer cites fake cases invented by ChatGPT, judge is not amused
+
+     37.1k
+
+     47.4k'
+   - 'The May 13th announcement of GPT-4o included a demo of a brand new voice mode,
+     where the true multi-modal GPT-4o (the o is for “omni”) model could accept audio
+     input and output incredibly realistic sounding speech without needing separate
+     TTS or STT models.
+
+     The demo also sounded conspicuously similar to Scarlett Johansson... and after
+     she complained the voice from the demo, Skye, never made it to a production product.
+
+     The delay in releasing the new voice mode after the initial demo caused quite
+     a lot of confusion. I wrote about that in ChatGPT in “4o” mode is not running
+     the new features yet.'
+ - source_sentence: What are some of the new features introduced in multi-modal models
+     that enhance their capabilities beyond text?
+   sentences:
+   - 'I think people who complain that LLM improvement has slowed are often missing
+     the enormous advances in these multi-modal models. Being able to run prompts against
+     images (and audio and video) is a fascinating new way to apply these models.
+
+     Voice and live camera mode are science fiction come to life
+
+     The audio and live video modes that have started to emerge deserve a special mention.
+
+     The ability to talk to ChatGPT first arrived in September 2023, but it was mostly
+     an illusion: OpenAI used their excellent Whisper speech-to-text model and a new
+     text-to-speech model (creatively named tts-1) to enable conversations with the
+     ChatGPT mobile apps, but the actual model just saw text.'
+   - 'Then in February, Meta released Llama. And a few weeks later in March, Georgi
+     Gerganov released code that got it working on a MacBook.
+
+     I wrote about how Large language models are having their Stable Diffusion moment,
+     and with hindsight that was a very good call!
+
+     This unleashed a whirlwind of innovation, which was accelerated further in July
+     when Meta released Llama 2—an improved version which, crucially, included permission
+     for commercial use.
+
+     Today there are literally thousands of LLMs that can be run locally, on all manner
+     of different devices.'
+   - '260 input tokens, 92 output tokens. Cost approximately 0.0024 cents (that’s less
+     than a 400th of a cent).
+
+     This increase in efficiency and reduction in price is my single favourite trend
+     from 2024. I want the utility of LLMs at a fraction of the energy cost and it
+     looks like that’s what we’re getting.
+
+     Multimodal vision is common, audio and video are starting to emerge
+
+     My butterfly example above illustrates another key trend from 2024: the rise of
+     multi-modal LLMs.
+
+     A year ago the single most notable example of these was GPT-4 Vision, released
+     at OpenAI’s DevDay in November 2023. Google’s multi-modal Gemini 1.0 was announced
+     on December 7th 2023 so it also (just) makes it into the 2023 window.'
+ pipeline_tag: sentence-similarity
+ library_name: sentence-transformers
+ metrics:
+ - cosine_accuracy@1
+ - cosine_accuracy@3
+ - cosine_accuracy@5
+ - cosine_accuracy@10
+ - cosine_precision@1
+ - cosine_precision@3
+ - cosine_precision@5
+ - cosine_precision@10
+ - cosine_recall@1
+ - cosine_recall@3
+ - cosine_recall@5
+ - cosine_recall@10
+ - cosine_ndcg@10
+ - cosine_mrr@10
+ - cosine_map@100
+ model-index:
+ - name: SentenceTransformer based on Snowflake/snowflake-arctic-embed-l
+   results:
+   - task:
+       type: information-retrieval
+       name: Information Retrieval
+     dataset:
+       name: Unknown
+       type: unknown
+     metrics:
+     - type: cosine_accuracy@1
+       value: 0.9166666666666666
+       name: Cosine Accuracy@1
+     - type: cosine_accuracy@3
+       value: 1.0
+       name: Cosine Accuracy@3
+     - type: cosine_accuracy@5
+       value: 1.0
+       name: Cosine Accuracy@5
+     - type: cosine_accuracy@10
+       value: 1.0
+       name: Cosine Accuracy@10
+     - type: cosine_precision@1
+       value: 0.9166666666666666
+       name: Cosine Precision@1
+     - type: cosine_precision@3
+       value: 0.3333333333333333
+       name: Cosine Precision@3
+     - type: cosine_precision@5
+       value: 0.20000000000000004
+       name: Cosine Precision@5
+     - type: cosine_precision@10
+       value: 0.10000000000000002
+       name: Cosine Precision@10
+     - type: cosine_recall@1
+       value: 0.9166666666666666
+       name: Cosine Recall@1
+     - type: cosine_recall@3
+       value: 1.0
+       name: Cosine Recall@3
+     - type: cosine_recall@5
+       value: 1.0
+       name: Cosine Recall@5
+     - type: cosine_recall@10
+       value: 1.0
+       name: Cosine Recall@10
+     - type: cosine_ndcg@10
+       value: 0.9692441461309548
+       name: Cosine Ndcg@10
+     - type: cosine_mrr@10
+       value: 0.9583333333333334
+       name: Cosine Mrr@10
+     - type: cosine_map@100
+       value: 0.9583333333333334
+       name: Cosine Map@100
+ ---
+
+ # SentenceTransformer based on Snowflake/snowflake-arctic-embed-l
+
+ This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [Snowflake/snowflake-arctic-embed-l](https://huggingface.co/Snowflake/snowflake-arctic-embed-l). It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
+
+ ## Model Details
+
+ ### Model Description
+ - **Model Type:** Sentence Transformer
+ - **Base model:** [Snowflake/snowflake-arctic-embed-l](https://huggingface.co/Snowflake/snowflake-arctic-embed-l) <!-- at revision d8fb21ca8d905d2832ee8b96c894d3298964346b -->
+ - **Maximum Sequence Length:** 512 tokens
+ - **Output Dimensionality:** 1024 dimensions
+ - **Similarity Function:** Cosine Similarity
+ <!-- - **Training Dataset:** Unknown -->
+ <!-- - **Language:** Unknown -->
+ <!-- - **License:** Unknown -->
+
+ ### Model Sources
+
+ - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
+ - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
+ - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
+
+ ### Full Model Architecture
+
+ ```
+ SentenceTransformer(
+   (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
+   (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
+   (2): Normalize()
+ )
+ ```
+
+ ## Usage
+
+ ### Direct Usage (Sentence Transformers)
+
+ First install the Sentence Transformers library:
+
+ ```bash
+ pip install -U sentence-transformers
+ ```
+
+ Then you can load this model and run inference.
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ # Download from the 🤗 Hub
+ model = SentenceTransformer("njhaveri/legal-ft-2")
+ # Run inference
+ sentences = [
+     'What are some of the new features introduced in multi-modal models that enhance their capabilities beyond text?',
+     'I think people who complain that LLM improvement has slowed are often missing the enormous advances in these multi-modal models. Being able to run prompts against images (and audio and video) is a fascinating new way to apply these models.\nVoice and live camera mode are science fiction come to life\nThe audio and live video modes that have started to emerge deserve a special mention.\nThe ability to talk to ChatGPT first arrived in September 2023, but it was mostly an illusion: OpenAI used their excellent Whisper speech-to-text model and a new text-to-speech model (creatively named tts-1) to enable conversations with the ChatGPT mobile apps, but the actual model just saw text.',
+     '260 input tokens, 92 output tokens. Cost approximately 0.0024 cents (that’s less than a 400th of a cent).\nThis increase in efficiency and reduction in price is my single favourite trend from 2024. I want the utility of LLMs at a fraction of the energy cost and it looks like that’s what we’re getting.\nMultimodal vision is common, audio and video are starting to emerge\nMy butterfly example above illustrates another key trend from 2024: the rise of multi-modal LLMs.\nA year ago the single most notable example of these was GPT-4 Vision, released at OpenAI’s DevDay in November 2023. Google’s multi-modal Gemini 1.0 was announced on December 7th 2023 so it also (just) makes it into the 2023 window.',
+ ]
+ embeddings = model.encode(sentences)
+ print(embeddings.shape)
+ # [3, 1024]
+
+ # Get the similarity scores for the embeddings
+ similarities = model.similarity(embeddings, embeddings)
+ print(similarities.shape)
+ # [3, 3]
+ ```
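Because the architecture ends with a `Normalize()` module, the embeddings this model returns are unit length, so the cosine similarity computed by `model.similarity` reduces to a plain dot product. A small NumPy sketch of that equivalence, using random unit vectors as stand-ins for real embeddings (not an output of the model itself):

```python
import numpy as np

# Toy stand-ins for the normalized 1024-dim embeddings the model returns
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(3, 1024))
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)  # mimic Normalize()

# Cosine similarity of unit vectors is just a matrix product
similarities = embeddings @ embeddings.T
print(similarities.shape)  # (3, 3)
```

Each diagonal entry is 1.0 (every vector is maximally similar to itself), matching what `model.similarity(embeddings, embeddings)` would report for normalized outputs.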
+
+ <!--
+ ### Direct Usage (Transformers)
+
+ <details><summary>Click to see the direct usage in Transformers</summary>
+
+ </details>
+ -->
+
+ <!--
+ ### Downstream Usage (Sentence Transformers)
+
+ You can finetune this model on your own dataset.
+
+ <details><summary>Click to expand</summary>
+
+ </details>
+ -->
+
+ <!--
+ ### Out-of-Scope Use
+
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
+ -->
+
+ ## Evaluation
+
+ ### Metrics
+
+ #### Information Retrieval
+
+ * Evaluated with [<code>InformationRetrievalEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator)
+
+ | Metric              | Value      |
+ |:--------------------|:-----------|
+ | cosine_accuracy@1   | 0.9167     |
+ | cosine_accuracy@3   | 1.0        |
+ | cosine_accuracy@5   | 1.0        |
+ | cosine_accuracy@10  | 1.0        |
+ | cosine_precision@1  | 0.9167     |
+ | cosine_precision@3  | 0.3333     |
+ | cosine_precision@5  | 0.2        |
+ | cosine_precision@10 | 0.1        |
+ | cosine_recall@1     | 0.9167     |
+ | cosine_recall@3     | 1.0        |
+ | cosine_recall@5     | 1.0        |
+ | cosine_recall@10    | 1.0        |
+ | **cosine_ndcg@10**  | **0.9692** |
+ | cosine_mrr@10       | 0.9583     |
+ | cosine_map@100      | 0.9583     |
+
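The precision figures in the table follow directly from the recall figures when each query has exactly one relevant passage (as in question/passage pairs like this dataset appears to use): precision@k is then recall@k divided by k. A quick arithmetic check, reproducing the table's values:

```python
# Reported recall@k values from the evaluation table above
recall_at = {1: 0.9166666666666666, 3: 1.0, 5: 1.0, 10: 1.0}

for k, recall in recall_at.items():
    # With one relevant passage per query, the top-k list contains at most
    # one relevant hit, so precision@k = recall@k / k
    precision = recall / k
    print(f"precision@{k} = {precision:.4f}")
# precision@1 = 0.9167, precision@3 = 0.3333, precision@5 = 0.2000, precision@10 = 0.1000
```

This is why precision@10 being only 0.1 is not a weakness here; it is the ceiling imposed by having a single gold passage per query.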
+ <!--
+ ## Bias, Risks and Limitations
+
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
+ -->
+
+ <!--
+ ### Recommendations
+
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
+ -->
+
+ ## Training Details
+
+ ### Training Dataset
+
+ #### Unnamed Dataset
+
+ * Size: 156 training samples
+ * Columns: <code>sentence_0</code> and <code>sentence_1</code>
+ * Approximate statistics based on the first 156 samples:
+   |         | sentence_0                                                                         | sentence_1                                                                           |
+   |:--------|:-----------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------|
+   | type    | string                                                                             | string                                                                               |
+   | details | <ul><li>min: 12 tokens</li><li>mean: 20.29 tokens</li><li>max: 31 tokens</li></ul> | <ul><li>min: 43 tokens</li><li>mean: 135.13 tokens</li><li>max: 214 tokens</li></ul> |
+ * Samples:
+   | sentence_0 | sentence_1 |
+   |:-----------|:-----------|
+   | <code>Why is it important for language models to believe the information provided to them?</code> | <code>Language Models are gullible. They “believe” what we tell them—what’s in their training data, then what’s in the fine-tuning data, then what’s in the prompt.<br>In order to be useful tools for us, we need them to believe what we feed them!<br>But it turns out a lot of the things we want to build need them not to be gullible.<br>Everyone wants an AI personal assistant. If you hired a real-world personal assistant who believed everything that anyone told them, you would quickly find that their ability to positively impact your life was severely limited.</code> |
+   | <code>What are the potential drawbacks of having a language model that is overly gullible?</code> | <code>Language Models are gullible. They “believe” what we tell them—what’s in their training data, then what’s in the fine-tuning data, then what’s in the prompt.<br>In order to be useful tools for us, we need them to believe what we feed them!<br>But it turns out a lot of the things we want to build need them not to be gullible.<br>Everyone wants an AI personal assistant. If you hired a real-world personal assistant who believed everything that anyone told them, you would quickly find that their ability to positively impact your life was severely limited.</code> |
+   | <code>What significant change occurred in LLM pricing over the past twelve months?</code> | <code>Here’s the rest of the transcript. It’s bland and generic, but my phone can pitch bland and generic Christmas movies to Netflix now!<br>LLM prices crashed, thanks to competition and increased efficiency<br>The past twelve months have seen a dramatic collapse in the cost of running a prompt through the top tier hosted LLMs.<br>In December 2023 (here’s the Internet Archive for the OpenAI pricing page) OpenAI were charging $30/million input tokens for GPT-4, $10/mTok for the then-new GPT-4 Turbo and $1/mTok for GPT-3.5 Turbo.</code> |
+ * Loss: [<code>MatryoshkaLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#matryoshkaloss) with these parameters:
+   ```json
+   {
+       "loss": "MultipleNegativesRankingLoss",
+       "matryoshka_dims": [
+           768,
+           512,
+           256,
+           128,
+           64
+       ],
+       "matryoshka_weights": [
+           1,
+           1,
+           1,
+           1,
+           1
+       ],
+       "n_dims_per_step": -1
+   }
+   ```
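Because MatryoshkaLoss also supervises prefixes of the embedding at the dims listed above, the output can in principle be truncated to one of those sizes and re-normalized, trading accuracy for storage. A minimal NumPy sketch of that truncation step (illustrative only; the random array stands in for real model embeddings, and sentence-transformers can also do this for you at load time):

```python
import numpy as np

def truncate_and_renormalize(emb: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components of each embedding, then L2-normalize
    so cosine similarity remains meaningful at the reduced size."""
    truncated = emb[:, :dim]
    return truncated / np.linalg.norm(truncated, axis=1, keepdims=True)

full = np.random.default_rng(0).normal(size=(3, 1024))
for dim in (768, 512, 256, 128, 64):  # the matryoshka_dims trained above
    small = truncate_and_renormalize(full, dim)
    print(dim, small.shape)
```

Note the trained dims top out at 768 while the model emits 1024 dimensions, so quality at intermediate sizes should be verified against your own evaluation set before deploying truncated embeddings.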
+
+ ### Training Hyperparameters
+ #### Non-Default Hyperparameters
+
+ - `eval_strategy`: steps
+ - `per_device_train_batch_size`: 10
+ - `per_device_eval_batch_size`: 10
+ - `num_train_epochs`: 10
+ - `multi_dataset_batch_sampler`: round_robin
+
+ #### All Hyperparameters
+ <details><summary>Click to expand</summary>
+
+ - `overwrite_output_dir`: False
+ - `do_predict`: False
+ - `eval_strategy`: steps
+ - `prediction_loss_only`: True
+ - `per_device_train_batch_size`: 10
+ - `per_device_eval_batch_size`: 10
+ - `per_gpu_train_batch_size`: None
+ - `per_gpu_eval_batch_size`: None
+ - `gradient_accumulation_steps`: 1
+ - `eval_accumulation_steps`: None
+ - `torch_empty_cache_steps`: None
+ - `learning_rate`: 5e-05
+ - `weight_decay`: 0.0
+ - `adam_beta1`: 0.9
+ - `adam_beta2`: 0.999
+ - `adam_epsilon`: 1e-08
+ - `max_grad_norm`: 1
+ - `num_train_epochs`: 10
+ - `max_steps`: -1
+ - `lr_scheduler_type`: linear
+ - `lr_scheduler_kwargs`: {}
+ - `warmup_ratio`: 0.0
+ - `warmup_steps`: 0
+ - `log_level`: passive
+ - `log_level_replica`: warning
+ - `log_on_each_node`: True
+ - `logging_nan_inf_filter`: True
+ - `save_safetensors`: True
+ - `save_on_each_node`: False
+ - `save_only_model`: False
+ - `restore_callback_states_from_checkpoint`: False
+ - `no_cuda`: False
+ - `use_cpu`: False
+ - `use_mps_device`: False
+ - `seed`: 42
+ - `data_seed`: None
+ - `jit_mode_eval`: False
+ - `use_ipex`: False
+ - `bf16`: False
+ - `fp16`: False
+ - `fp16_opt_level`: O1
+ - `half_precision_backend`: auto
+ - `bf16_full_eval`: False
+ - `fp16_full_eval`: False
+ - `tf32`: None
+ - `local_rank`: 0
+ - `ddp_backend`: None
+ - `tpu_num_cores`: None
+ - `tpu_metrics_debug`: False
+ - `debug`: []
+ - `dataloader_drop_last`: False
+ - `dataloader_num_workers`: 0
+ - `dataloader_prefetch_factor`: None
+ - `past_index`: -1
+ - `disable_tqdm`: False
+ - `remove_unused_columns`: True
+ - `label_names`: None
+ - `load_best_model_at_end`: False
+ - `ignore_data_skip`: False
+ - `fsdp`: []
+ - `fsdp_min_num_params`: 0
+ - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
+ - `fsdp_transformer_layer_cls_to_wrap`: None
+ - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
+ - `deepspeed`: None
+ - `label_smoothing_factor`: 0.0
+ - `optim`: adamw_torch
+ - `optim_args`: None
+ - `adafactor`: False
+ - `group_by_length`: False
+ - `length_column_name`: length
+ - `ddp_find_unused_parameters`: None
+ - `ddp_bucket_cap_mb`: None
+ - `ddp_broadcast_buffers`: False
+ - `dataloader_pin_memory`: True
+ - `dataloader_persistent_workers`: False
+ - `skip_memory_metrics`: True
+ - `use_legacy_prediction_loop`: False
+ - `push_to_hub`: False
+ - `resume_from_checkpoint`: None
+ - `hub_model_id`: None
+ - `hub_strategy`: every_save
+ - `hub_private_repo`: None
+ - `hub_always_push`: False
+ - `gradient_checkpointing`: False
+ - `gradient_checkpointing_kwargs`: None
+ - `include_inputs_for_metrics`: False
+ - `include_for_metrics`: []
+ - `eval_do_concat_batches`: True
+ - `fp16_backend`: auto
+ - `push_to_hub_model_id`: None
+ - `push_to_hub_organization`: None
+ - `mp_parameters`: 
+ - `auto_find_batch_size`: False
+ - `full_determinism`: False
+ - `torchdynamo`: None
+ - `ray_scope`: last
+ - `ddp_timeout`: 1800
+ - `torch_compile`: False
+ - `torch_compile_backend`: None
+ - `torch_compile_mode`: None
+ - `dispatch_batches`: None
+ - `split_batches`: None
+ - `include_tokens_per_second`: False
+ - `include_num_input_tokens_seen`: False
+ - `neftune_noise_alpha`: None
+ - `optim_target_modules`: None
+ - `batch_eval_metrics`: False
+ - `eval_on_start`: False
+ - `use_liger_kernel`: False
+ - `eval_use_gather_object`: False
+ - `average_tokens_across_devices`: False
+ - `prompts`: None
+ - `batch_sampler`: batch_sampler
+ - `multi_dataset_batch_sampler`: round_robin
+
+ </details>
+
+ ### Training Logs
+ | Epoch | Step | cosine_ndcg@10 |
+ |:-----:|:----:|:--------------:|
+ | 1.0   | 16   | 0.9692         |
+ | 2.0   | 32   | 0.9692         |
+ | 3.0   | 48   | 0.9692         |
+ | 3.125 | 50   | 0.9692         |
+ | 4.0   | 64   | 0.9692         |
+ | 5.0   | 80   | 0.9692         |
+ | 6.0   | 96   | 0.9692         |
+ | 6.25  | 100  | 0.9692         |
+ | 7.0   | 112  | 0.9692         |
+ | 8.0   | 128  | 0.9692         |
+ | 9.0   | 144  | 0.9692         |
+ | 9.375 | 150  | 0.9692         |
+ | 10.0  | 160  | 0.9692         |
+
+
+ ### Framework Versions
+ - Python: 3.13.1
+ - Sentence Transformers: 3.4.1
+ - Transformers: 4.48.3
+ - PyTorch: 2.6.0
+ - Accelerate: 1.3.0
+ - Datasets: 3.2.0
+ - Tokenizers: 0.21.0
+
+ ## Citation
+
+ ### BibTeX
+
+ #### Sentence Transformers
+ ```bibtex
+ @inproceedings{reimers-2019-sentence-bert,
+     title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
+     author = "Reimers, Nils and Gurevych, Iryna",
+     booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
+     month = "11",
+     year = "2019",
+     publisher = "Association for Computational Linguistics",
+     url = "https://arxiv.org/abs/1908.10084",
+ }
+ ```
+
+ #### MatryoshkaLoss
+ ```bibtex
+ @misc{kusupati2024matryoshka,
+     title={Matryoshka Representation Learning},
+     author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
+     year={2024},
+     eprint={2205.13147},
+     archivePrefix={arXiv},
+     primaryClass={cs.LG}
+ }
+ ```
+
+ #### MultipleNegativesRankingLoss
+ ```bibtex
+ @misc{henderson2017efficient,
+     title={Efficient Natural Language Response Suggestion for Smart Reply},
+     author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
+     year={2017},
+     eprint={1705.00652},
+     archivePrefix={arXiv},
+     primaryClass={cs.CL}
+ }
+ ```
+
+ <!--
+ ## Glossary
+
+ *Clearly define terms in order to be accessible across audiences.*
+ -->
+
+ <!--
+ ## Model Card Authors
+
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
+ -->
+
+ <!--
+ ## Model Card Contact
+
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
+ -->
config.json ADDED
@@ -0,0 +1,25 @@
+ {
+ "_name_or_path": "Snowflake/snowflake-arctic-embed-l",
+ "architectures": [
+ "BertModel"
+ ],
+ "attention_probs_dropout_prob": 0.1,
+ "classifier_dropout": null,
+ "hidden_act": "gelu",
+ "hidden_dropout_prob": 0.1,
+ "hidden_size": 1024,
+ "initializer_range": 0.02,
+ "intermediate_size": 4096,
+ "layer_norm_eps": 1e-12,
+ "max_position_embeddings": 512,
+ "model_type": "bert",
+ "num_attention_heads": 16,
+ "num_hidden_layers": 24,
+ "pad_token_id": 0,
+ "position_embedding_type": "absolute",
+ "torch_dtype": "float32",
+ "transformers_version": "4.48.3",
+ "type_vocab_size": 2,
+ "use_cache": true,
+ "vocab_size": 30522
+ }
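A quick consistency check: the architecture numbers in this config predict the `model.safetensors` size below. The back-of-the-envelope count here is a sketch (it ignores any pooler head and the safetensors header overhead), but it lands within ~0.01% of the 1,336,413,848-byte float32 checkpoint:

```python
# Rough BERT-large parameter count from config.json above; a hedged sanity
# check, not the exact checkpoint layout.
hidden, layers, inter = 1024, 24, 4096
vocab, max_pos, type_vocab = 30522, 512, 2

# Token + position + token-type embeddings, plus the embedding LayerNorm.
embeddings = (vocab + max_pos + type_vocab) * hidden + 2 * hidden

per_layer = (
    4 * (hidden * hidden + hidden)   # Q, K, V and attention output projections
    + hidden * inter + inter         # FFN up-projection
    + inter * hidden + hidden        # FFN down-projection
    + 2 * 2 * hidden                 # two LayerNorms (weight + bias each)
)

total = embeddings + layers * per_layer
print(total)      # 334092288 -> ~334M parameters
print(total * 4)  # ~1.336 GB at float32, matching model.safetensors
```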
config_sentence_transformers.json ADDED
@@ -0,0 +1,12 @@
+ {
+ "__version__": {
+ "sentence_transformers": "3.4.1",
+ "transformers": "4.48.3",
+ "pytorch": "2.6.0"
+ },
+ "prompts": {
+ "query": "Represent this sentence for searching relevant passages: "
+ },
+ "default_prompt_name": null,
+ "similarity_fn_name": "cosine"
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cb94cbef3c8277c537a63d41c9b283284383ba91dd566d32ee25fc142446f278
+ size 1336413848
modules.json ADDED
@@ -0,0 +1,20 @@
+ [
+ {
+ "idx": 0,
+ "name": "0",
+ "path": "",
+ "type": "sentence_transformers.models.Transformer"
+ },
+ {
+ "idx": 1,
+ "name": "1",
+ "path": "1_Pooling",
+ "type": "sentence_transformers.models.Pooling"
+ },
+ {
+ "idx": 2,
+ "name": "2",
+ "path": "2_Normalize",
+ "type": "sentence_transformers.models.Normalize"
+ }
+ ]
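modules.json chains three stages: the Transformer yields per-token hidden states, Pooling keeps the [CLS] token (per `pooling_mode_cls_token` in 1_Pooling/config.json), and Normalize L2-normalizes the result so the configured cosine similarity reduces to a dot product. A toy-data sketch of the last two stages (tiny hidden size for readability; the real model uses 1024):

```python
import math

# Stand-in for Transformer output: one vector per token (seq_len x hidden).
token_embeddings = [
    [3.0, 4.0],   # position 0: the [CLS] token
    [1.0, 1.0],
    [2.0, 0.0],
]

# Pooling: pooling_mode_cls_token=true -> keep only position 0.
pooled = token_embeddings[0]

# Normalize: L2-normalize so cosine similarity becomes a plain dot product.
norm = math.sqrt(sum(x * x for x in pooled))
sentence_embedding = [x / norm for x in pooled]

print(sentence_embedding)  # [0.6, 0.8]
```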
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+ "max_seq_length": 512,
+ "do_lower_case": false
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
+ {
+ "cls_token": {
+ "content": "[CLS]",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ "mask_token": {
+ "content": "[MASK]",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ "pad_token": {
+ "content": "[PAD]",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ "sep_token": {
+ "content": "[SEP]",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ "unk_token": {
+ "content": "[UNK]",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,63 @@
+ {
+ "added_tokens_decoder": {
+ "0": {
+ "content": "[PAD]",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "100": {
+ "content": "[UNK]",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "101": {
+ "content": "[CLS]",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "102": {
+ "content": "[SEP]",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "103": {
+ "content": "[MASK]",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ }
+ },
+ "clean_up_tokenization_spaces": true,
+ "cls_token": "[CLS]",
+ "do_lower_case": true,
+ "extra_special_tokens": {},
+ "mask_token": "[MASK]",
+ "max_length": 512,
+ "model_max_length": 512,
+ "pad_to_multiple_of": null,
+ "pad_token": "[PAD]",
+ "pad_token_type_id": 0,
+ "padding_side": "right",
+ "sep_token": "[SEP]",
+ "stride": 0,
+ "strip_accents": null,
+ "tokenize_chinese_chars": true,
+ "tokenizer_class": "BertTokenizer",
+ "truncation_side": "right",
+ "truncation_strategy": "longest_first",
+ "unk_token": "[UNK]"
+ }
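The ids in `added_tokens_decoder` are the standard BERT specials ([PAD]=0, [UNK]=100, [CLS]=101, [SEP]=102, [MASK]=103). For a single text, the tokenizer wraps the wordpiece ids as `[CLS] … [SEP]`, truncates on the right, and right-pads up to the requested length. A sketch of that assembly (the wordpiece ids and the `build_input` helper are made up for illustration):

```python
# BERT-style input building with the special ids from tokenizer_config.json:
# [CLS]=101, [SEP]=102, [PAD]=0. Hypothetical helper, not the real tokenizer.
CLS, SEP, PAD = 101, 102, 0

def build_input(wordpiece_ids, max_len=8):
    seq = [CLS] + wordpiece_ids[: max_len - 2] + [SEP]   # truncation_side: right
    attention_mask = [1] * len(seq) + [0] * (max_len - len(seq))
    input_ids = seq + [PAD] * (max_len - len(seq))        # padding_side: right
    return input_ids, attention_mask

ids, mask = build_input([7592, 2088])
print(ids)   # [101, 7592, 2088, 102, 0, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0, 0, 0]
```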
vocab.txt ADDED
The diff for this file is too large to render. See raw diff