Commit fa3bc1d (verified) · llm-wizard committed · Parent: ac14915

Add new SentenceTransformer model
1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "word_embedding_dimension": 1024,
+   "pooling_mode_cls_token": true,
+   "pooling_mode_mean_tokens": false,
+   "pooling_mode_max_tokens": false,
+   "pooling_mode_mean_sqrt_len_tokens": false,
+   "pooling_mode_weightedmean_tokens": false,
+   "pooling_mode_lasttoken": false,
+   "include_prompt": true
+ }
README.md ADDED
@@ -0,0 +1,663 @@
+ ---
+ tags:
+ - sentence-transformers
+ - sentence-similarity
+ - feature-extraction
+ - generated_from_trainer
+ - dataset_size:156
+ - loss:MatryoshkaLoss
+ - loss:MultipleNegativesRankingLoss
+ base_model: Snowflake/snowflake-arctic-embed-l
+ widget:
+ - source_sentence: What is the estimated training cost of DeepSeek v3, and how does
+     it compare to the training hours used for Llama 3.1?
+   sentences:
+   - 'Your browser does not support the audio element.
+
+
+     OpenAI aren’t the only group with a multi-modal audio model. Google’s Gemini also
+     accepts audio input, and the Google Gemini apps can speak in a similar way to
+     ChatGPT now. Amazon also pre-announced voice mode for Amazon Nova, but that’s
+     meant to roll out in Q1 of 2025.
+
+     Google’s NotebookLM, released in September, took audio output to a new level by
+     producing spookily realistic conversations between two “podcast hosts” about anything
+     you fed into their tool. They later added custom instructions, so naturally I
+     turned them into pelicans:
+
+
+
+     Your browser does not support the audio element.'
+   - 'DeepSeek v3 is a huge 685B parameter model—one of the largest openly licensed
+     models currently available, significantly bigger than the largest of Meta’s Llama
+     series, Llama 3.1 405B.
+
+     Benchmarks put it up there with Claude 3.5 Sonnet. Vibe benchmarks (aka the Chatbot
+     Arena) currently rank it 7th, just behind the Gemini 2.0 and OpenAI 4o/o1 models.
+     This is by far the highest ranking openly licensed model.
+
+     The really impressive thing about DeepSeek v3 is the training cost. The model
+     was trained on 2,788,000 H800 GPU hours at an estimated cost of $5,576,000. Llama
+     3.1 405B trained 30,840,000 GPU hours—11x that used by DeepSeek v3, for a model
+     that benchmarks slightly worse.'
+   - 'Those US export regulations on GPUs to China seem to have inspired some very
+     effective training optimizations!
+
+     The environmental impact got better
+
+     A welcome result of the increased efficiency of the models—both the hosted ones
+     and the ones I can run locally—is that the energy usage and environmental impact
+     of running a prompt has dropped enormously over the past couple of years.
+
+     OpenAI themselves are charging 100x less for a prompt compared to the GPT-3 days.
+     I have it on good authority that neither Google Gemini nor Amazon Nova (two of
+     the least expensive model providers) are running prompts at a loss.'
+ - source_sentence: How does the launch of ChatGPT Pro impact access to OpenAI's most
+     capable model compared to previous offerings?
+   sentences:
+   - 'These abilities are just a few weeks old at this point, and I don’t think their
+     impact has been fully felt yet. If you haven’t tried them out yet you really should.
+
+     Both Gemini and OpenAI offer API access to these features as well. OpenAI started
+     with a WebSocket API that was quite challenging to use, but in December they announced
+     a new WebRTC API which is much easier to get started with. Building a web app
+     that a user can talk to via voice is easy now!
+
+     Prompt driven app generation is a commodity already
+
+     This was possible with GPT-4 in 2023, but the value it provides became evident
+     in 2024.'
+   - 'OpenAI made GPT-4o free for all users in May, and Claude 3.5 Sonnet was freely
+     available from its launch in June. This was a momentous change, because for the
+     previous year free users had mostly been restricted to GPT-3.5 level models, meaning
+     new users got a very inaccurate mental model of what a capable LLM could actually
+     do.
+
+     That era appears to have ended, likely permanently, with OpenAI’s launch of ChatGPT
+     Pro. This $200/month subscription service is the only way to access their most
+     capable model, o1 Pro.
+
+     Since the trick behind the o1 series (and the future models it will undoubtedly
+     inspire) is to expend more compute time to get better results, I don’t think those
+     days of free access to the best available models are likely to return.'
+   - 'Intuitively, one would expect that systems this powerful would take millions
+     of lines of complex code. Instead, it turns out a few hundred lines of Python
+     is genuinely enough to train a basic version!
+
+     What matters most is the training data. You need a lot of data to make these
+     things work, and the quantity and quality of the training data appears to be the
+     most important factor in how good the resulting model is.
+
+     If you can gather the right data, and afford to pay for the GPUs to train it,
+     you can build an LLM.'
+ - source_sentence: What are the implications of having a Code Interpreter equivalent
+     for fact-checking natural language?
+   sentences:
+   - 'Your browser does not support the audio element.
+
+
+     OpenAI aren’t the only group with a multi-modal audio model. Google’s Gemini also
+     accepts audio input, and the Google Gemini apps can speak in a similar way to
+     ChatGPT now. Amazon also pre-announced voice mode for Amazon Nova, but that’s
+     meant to roll out in Q1 of 2025.
+
+     Google’s NotebookLM, released in September, took audio output to a new level by
+     producing spookily realistic conversations between two “podcast hosts” about anything
+     you fed into their tool. They later added custom instructions, so naturally I
+     turned them into pelicans:
+
+
+
+     Your browser does not support the audio element.'
+   - 'Except... you can run generated code to see if it’s correct. And with patterns
+     like ChatGPT Code Interpreter the LLM can execute the code itself, process the
+     error message, then rewrite it and keep trying until it works!
+
+     So hallucination is a much lesser problem for code generation than for anything
+     else. If only we had the equivalent of Code Interpreter for fact-checking natural
+     language!
+
+     How should we feel about this as software engineers?
+
+     On the one hand, this feels like a threat: who needs a programmer if ChatGPT can
+     write code for you?'
+   - 'The biggest innovation here is that it opens up a new way to scale a model: instead
+     of improving model performance purely through additional compute at training time,
+     models can now take on harder problems by spending more compute on inference.
+
+     The sequel to o1, o3 (they skipped “o2” for European trademark reasons) was announced
+     on 20th December with an impressive result against the ARC-AGI benchmark, albeit
+     one that likely involved more than $1,000,000 of compute time expense!
+
+     o3 is expected to ship in January. I doubt many people have real-world problems
+     that would benefit from that level of compute expenditure—I certainly don’t!—but
+     it appears to be a genuine next step in LLM architecture for taking on much harder
+     problems.'
+ - source_sentence: What advantages does a 64GB Mac have for running models compared
+     to other machines?
+   sentences:
+   - 'My personal laptop is a 64GB M2 MacBook Pro from 2023. It’s a powerful machine,
+     but it’s also nearly two years old now—and crucially it’s the same laptop I’ve
+     been using ever since I first ran an LLM on my computer back in March 2023 (see
+     Large language models are having their Stable Diffusion moment).
+
+     That same laptop that could just about run a GPT-3-class model in March last year
+     has now run multiple GPT-4 class models! Some of my notes on that:'
+   - 'This prompt-driven custom interface feature is so powerful and easy to build
+     (once you’ve figured out the gnarly details of browser sandboxing) that I expect
+     it to show up as a feature in a wide range of products in 2025.
+
+     Universal access to the best models lasted for just a few short months
+
+     For a few short months this year all three of the best available models—GPT-4o,
+     Claude 3.5 Sonnet and Gemini 1.5 Pro—were freely available to most of the world.'
+   - 'On paper, a 64GB Mac should be a great machine for running models due to the
+     way the CPU and GPU can share the same memory. In practice, many models are released
+     as model weights and libraries that reward NVIDIA’s CUDA over other platforms.
+
+     The llama.cpp ecosystem helped a lot here, but the real breakthrough has been
+     Apple’s MLX library, “an array framework for Apple Silicon”. It’s fantastic.
+
+     Apple’s mlx-lm Python library supports running a wide range of MLX-compatible
+     models on my Mac, with excellent performance. mlx-community on Hugging Face offers
+     more than 1,000 models that have been converted to the necessary format.'
+ - source_sentence: How does Claude enable users to interact with applications generated
+     by its system?
+   sentences:
+   - 'We already knew LLMs were spookily good at writing code. If you prompt them right,
+     it turns out they can build you a full interactive application using HTML, CSS
+     and JavaScript (and tools like React if you wire up some extra supporting build
+     mechanisms)—often in a single prompt.
+
+     Anthropic kicked this idea into high gear when they released Claude Artifacts,
+     a groundbreaking new feature that was initially slightly lost in the noise due
+     to being described half way through their announcement of the incredible Claude
+     3.5 Sonnet.
+
+     With Artifacts, Claude can write you an on-demand interactive application and
+     then let you use it directly inside the Claude interface.
+
+     Here’s my Extract URLs app, entirely generated by Claude:'
+   - 'An interesting point of comparison here could be the way railways rolled out
+     around the world in the 1800s. Constructing these required enormous investments
+     and had a massive environmental impact, and many of the lines that were built
+     turned out to be unnecessary—sometimes multiple lines from different companies
+     serving the exact same routes!
+
+     The resulting bubbles contributed to several financial crashes, see Wikipedia
+     for Panic of 1873, Panic of 1893, Panic of 1901 and the UK’s Railway Mania. They
+     left us with a lot of useful infrastructure and a great deal of bankruptcies and
+     environmental damage.
+
+     The year of slop'
+   - 'We don’t yet know how to build GPT-4
+
+     Frustratingly, despite the enormous leaps ahead we’ve had this year, we are yet
+     to see an alternative model that’s better than GPT-4.
+
+     OpenAI released GPT-4 in March, though it later turned out we had a sneak peek
+     of it in February when Microsoft used it as part of the new Bing.
+
+     This may well change in the next few weeks: Google’s Gemini Ultra has big claims,
+     but isn’t yet available for us to try out.
+
+     The team behind Mistral are working to beat GPT-4 as well, and their track record
+     is already extremely strong considering their first public model only came out
+     in September, and they’ve released two significant improvements since then.'
+ pipeline_tag: sentence-similarity
+ library_name: sentence-transformers
+ metrics:
+ - cosine_accuracy@1
+ - cosine_accuracy@3
+ - cosine_accuracy@5
+ - cosine_accuracy@10
+ - cosine_precision@1
+ - cosine_precision@3
+ - cosine_precision@5
+ - cosine_precision@10
+ - cosine_recall@1
+ - cosine_recall@3
+ - cosine_recall@5
+ - cosine_recall@10
+ - cosine_ndcg@10
+ - cosine_mrr@10
+ - cosine_map@100
+ model-index:
+ - name: SentenceTransformer based on Snowflake/snowflake-arctic-embed-l
+   results:
+   - task:
+       type: information-retrieval
+       name: Information Retrieval
+     dataset:
+       name: Unknown
+       type: unknown
+     metrics:
+     - type: cosine_accuracy@1
+       value: 1.0
+       name: Cosine Accuracy@1
+     - type: cosine_accuracy@3
+       value: 1.0
+       name: Cosine Accuracy@3
+     - type: cosine_accuracy@5
+       value: 1.0
+       name: Cosine Accuracy@5
+     - type: cosine_accuracy@10
+       value: 1.0
+       name: Cosine Accuracy@10
+     - type: cosine_precision@1
+       value: 1.0
+       name: Cosine Precision@1
+     - type: cosine_precision@3
+       value: 0.3333333333333333
+       name: Cosine Precision@3
+     - type: cosine_precision@5
+       value: 0.20000000000000004
+       name: Cosine Precision@5
+     - type: cosine_precision@10
+       value: 0.10000000000000002
+       name: Cosine Precision@10
+     - type: cosine_recall@1
+       value: 1.0
+       name: Cosine Recall@1
+     - type: cosine_recall@3
+       value: 1.0
+       name: Cosine Recall@3
+     - type: cosine_recall@5
+       value: 1.0
+       name: Cosine Recall@5
+     - type: cosine_recall@10
+       value: 1.0
+       name: Cosine Recall@10
+     - type: cosine_ndcg@10
+       value: 1.0
+       name: Cosine Ndcg@10
+     - type: cosine_mrr@10
+       value: 1.0
+       name: Cosine Mrr@10
+     - type: cosine_map@100
+       value: 1.0
+       name: Cosine Map@100
+ ---
+
+ # SentenceTransformer based on Snowflake/snowflake-arctic-embed-l
+
+ This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [Snowflake/snowflake-arctic-embed-l](https://huggingface.co/Snowflake/snowflake-arctic-embed-l). It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
+
+ ## Model Details
+
+ ### Model Description
+ - **Model Type:** Sentence Transformer
+ - **Base model:** [Snowflake/snowflake-arctic-embed-l](https://huggingface.co/Snowflake/snowflake-arctic-embed-l) <!-- at revision d8fb21ca8d905d2832ee8b96c894d3298964346b -->
+ - **Maximum Sequence Length:** 512 tokens
+ - **Output Dimensionality:** 1024 dimensions
+ - **Similarity Function:** Cosine Similarity
+ <!-- - **Training Dataset:** Unknown -->
+ <!-- - **Language:** Unknown -->
+ <!-- - **License:** Unknown -->
+
+ ### Model Sources
+
+ - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
+ - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
+ - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
+
+ ### Full Model Architecture
+
+ ```
+ SentenceTransformer(
+   (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
+   (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
+   (2): Normalize()
+ )
+ ```
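+
+ The stack above is a BERT encoder, CLS-token pooling (`pooling_mode_cls_token: true`), and L2 normalization. As a rough equivalence sketch (not an official snippet from this card; the input sentence is a placeholder), the same embedding can be computed with plain `transformers`:
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+ from transformers import AutoModel, AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("llm-wizard/legal-ft-2")
+ model = AutoModel.from_pretrained("llm-wizard/legal-ft-2")
+
+ inputs = tokenizer(
+     ["What advantages does a 64GB Mac have for running models?"],
+     padding=True, truncation=True, max_length=512, return_tensors="pt",
+ )
+ with torch.no_grad():
+     last_hidden = model(**inputs).last_hidden_state
+
+ # CLS-token pooling followed by L2 normalization, mirroring the Pooling
+ # and Normalize modules listed in the architecture above.
+ embeddings = F.normalize(last_hidden[:, 0], p=2, dim=1)
+ print(embeddings.shape)  # torch.Size([1, 1024])
+ ```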
+
+ ## Usage
+
+ ### Direct Usage (Sentence Transformers)
+
+ First install the Sentence Transformers library:
+
+ ```bash
+ pip install -U sentence-transformers
+ ```
+
+ Then you can load this model and run inference.
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ # Download from the 🤗 Hub
+ model = SentenceTransformer("llm-wizard/legal-ft-2")
+ # Run inference
+ sentences = [
+     'How does Claude enable users to interact with applications generated by its system?',
+     'We already knew LLMs were spookily good at writing code. If you prompt them right, it turns out they can build you a full interactive application using HTML, CSS and JavaScript (and tools like React if you wire up some extra supporting build mechanisms)—often in a single prompt.\nAnthropic kicked this idea into high gear when they released Claude Artifacts, a groundbreaking new feature that was initially slightly lost in the noise due to being described half way through their announcement of the incredible Claude 3.5 Sonnet.\nWith Artifacts, Claude can write you an on-demand interactive application and then let you use it directly inside the Claude interface.\nHere’s my Extract URLs app, entirely generated by Claude:',
+     'We don’t yet know how to build GPT-4\nFrustratingly, despite the enormous leaps ahead we’ve had this year, we are yet to see an alternative model that’s better than GPT-4.\nOpenAI released GPT-4 in March, though it later turned out we had a sneak peek of it in February when Microsoft used it as part of the new Bing.\nThis may well change in the next few weeks: Google’s Gemini Ultra has big claims, but isn’t yet available for us to try out.\nThe team behind Mistral are working to beat GPT-4 as well, and their track record is already extremely strong considering their first public model only came out in September, and they’ve released two significant improvements since then.',
+ ]
+ embeddings = model.encode(sentences)
+ print(embeddings.shape)
+ # (3, 1024)
+
+ # Get the similarity scores for the embeddings
+ similarities = model.similarity(embeddings, embeddings)
+ print(similarities.shape)
+ # torch.Size([3, 3])
+ ```
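+
+ Since `config_sentence_transformers.json` in this commit defines a `query` prompt ("Represent this sentence for searching relevant passages: "), retrieval works best when queries are encoded with that prompt and passages without it. A minimal sketch (the query and passages below are placeholders):
+
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ model = SentenceTransformer("llm-wizard/legal-ft-2")
+
+ # Queries get the stored "query" prompt prepended; passages are encoded as-is.
+ query_embeddings = model.encode(
+     ["What is the estimated training cost of DeepSeek v3?"],
+     prompt_name="query",
+ )
+ passage_embeddings = model.encode([
+     "DeepSeek v3 was trained on 2,788,000 H800 GPU hours at an estimated cost of $5,576,000.",
+     "Google’s NotebookLM took audio output to a new level.",
+ ])
+
+ # Cosine similarity, the model's configured similarity function.
+ scores = model.similarity(query_embeddings, passage_embeddings)
+ print(scores)  # tensor of shape (1, 2); the first passage should score higher
+ ```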
+
+ <!--
+ ### Direct Usage (Transformers)
+
+ <details><summary>Click to see the direct usage in Transformers</summary>
+
+ </details>
+ -->
+
+ <!--
+ ### Downstream Usage (Sentence Transformers)
+
+ You can finetune this model on your own dataset.
+
+ <details><summary>Click to expand</summary>
+
+ </details>
+ -->
+
+ <!--
+ ### Out-of-Scope Use
+
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
+ -->
+
+ ## Evaluation
+
+ ### Metrics
+
+ #### Information Retrieval
+
+ * Evaluated with [<code>InformationRetrievalEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator)
+
+ | Metric              | Value   |
+ |:--------------------|:--------|
+ | cosine_accuracy@1   | 1.0     |
+ | cosine_accuracy@3   | 1.0     |
+ | cosine_accuracy@5   | 1.0     |
+ | cosine_accuracy@10  | 1.0     |
+ | cosine_precision@1  | 1.0     |
+ | cosine_precision@3  | 0.3333  |
+ | cosine_precision@5  | 0.2     |
+ | cosine_precision@10 | 0.1     |
+ | cosine_recall@1     | 1.0     |
+ | cosine_recall@3     | 1.0     |
+ | cosine_recall@5     | 1.0     |
+ | cosine_recall@10    | 1.0     |
+ | **cosine_ndcg@10**  | **1.0** |
+ | cosine_mrr@10       | 1.0     |
+ | cosine_map@100      | 1.0     |
+
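+ The uniform 1.0 scores (with precision@k equal to 1/k) indicate that each evaluation query has exactly one relevant passage and that it is always retrieved at rank 1, which is what you would expect from a very small held-out set. A minimal sketch of running the same evaluator on your own data (the ids and texts below are placeholders):
+
+ ```python
+ from sentence_transformers import SentenceTransformer
+ from sentence_transformers.evaluation import InformationRetrievalEvaluator
+
+ model = SentenceTransformer("llm-wizard/legal-ft-2")
+
+ queries = {"q1": "What advantages does a 64GB Mac have for running models?"}
+ corpus = {
+     "d1": "On paper, a 64GB Mac should be a great machine for running models...",
+     "d2": "An unrelated passage about railway construction in the 1800s.",
+ }
+ relevant_docs = {"q1": {"d1"}}  # each query id maps to its set of relevant doc ids
+
+ evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs)
+ results = evaluator(model)
+ print(results["cosine_ndcg@10"])
+ ```
+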
+ <!--
+ ## Bias, Risks and Limitations
+
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
+ -->
+
+ <!--
+ ### Recommendations
+
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
+ -->
+
+ ## Training Details
+
+ ### Training Dataset
+
+ #### Unnamed Dataset
+
+ * Size: 156 training samples
+ * Columns: <code>sentence_0</code> and <code>sentence_1</code>
+ * Approximate statistics based on the first 156 samples:
+   |         | sentence_0                                                                          | sentence_1                                                                            |
+   |:--------|:------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------|
+   | type    | string                                                                              | string                                                                                |
+   | details | <ul><li>min: 12 tokens</li><li>mean: 20.22 tokens</li><li>max: 33 tokens</li></ul> | <ul><li>min: 43 tokens</li><li>mean: 134.95 tokens</li><li>max: 214 tokens</li></ul> |
+ * Samples:
+   | sentence_0 | sentence_1 |
+   |:------------------------------------------------------------------------------------|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+   | <code>What topics were covered in the annotated presentations given in 2023?</code> | <code>I also gave a bunch of talks and podcast appearances. I’ve started habitually turning my talks into annotated presentations—here are my best from 2023:<br><br>Prompt injection explained, with video, slides, and a transcript<br>Catching up on the weird world of LLMs<br>Making Large Language Models work for you<br>Open questions for AI engineering<br>Embeddings: What they are and why they matter<br>Financial sustainability for open source projects at GitHub Universe<br><br>And in podcasts:<br><br><br>What AI can do for you on the Theory of Change<br><br>Working in public on Path to Citus Con<br><br>LLMs break the internet on the Changelog<br><br>Talking Large Language Models on Rooftop Ruby<br><br>Thoughts on the OpenAI board situation on Newsroom Robots</code> |
+   | <code>Which podcasts featured discussions about Large Language Models?</code> | <code>I also gave a bunch of talks and podcast appearances. I’ve started habitually turning my talks into annotated presentations—here are my best from 2023:<br><br>Prompt injection explained, with video, slides, and a transcript<br>Catching up on the weird world of LLMs<br>Making Large Language Models work for you<br>Open questions for AI engineering<br>Embeddings: What they are and why they matter<br>Financial sustainability for open source projects at GitHub Universe<br><br>And in podcasts:<br><br><br>What AI can do for you on the Theory of Change<br><br>Working in public on Path to Citus Con<br><br>LLMs break the internet on the Changelog<br><br>Talking Large Language Models on Rooftop Ruby<br><br>Thoughts on the OpenAI board situation on Newsroom Robots</code> |
+   | <code>When did Google release their gemini-2.0-flash-thinking-exp model?</code> | <code>OpenAI are not the only game in town here. Google released their first entrant in the category, gemini-2.0-flash-thinking-exp, on December 19th.<br>Alibaba’s Qwen team released their QwQ model on November 28th—under an Apache 2.0 license, and that one I could run on my own machine. They followed that up with a vision reasoning model called QvQ on December 24th, which I also ran locally.<br>DeepSeek made their DeepSeek-R1-Lite-Preview model available to try out through their chat interface on November 20th.<br>To understand more about inference scaling I recommend Is AI progress slowing down? by Arvind Narayanan and Sayash Kapoor.</code> |
+ * Loss: [<code>MatryoshkaLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#matryoshkaloss) with these parameters:
+   ```json
+   {
+       "loss": "MultipleNegativesRankingLoss",
+       "matryoshka_dims": [
+           768,
+           512,
+           256,
+           128,
+           64
+       ],
+       "matryoshka_weights": [
+           1,
+           1,
+           1,
+           1,
+           1
+       ],
+       "n_dims_per_step": -1
+   }
+   ```
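+
+ Because the model was trained with MatryoshkaLoss over nested dimensions, its embeddings can be truncated to any of the listed sizes (768, 512, 256, 128 or 64) for cheaper storage and search, at some cost in quality. A sketch using the Sentence Transformers `truncate_dim` argument (the input sentence is a placeholder):
+
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ # Load the model so that encode() returns 256-dimensional embeddings,
+ # one of the nested dimensions the MatryoshkaLoss above was trained on.
+ model = SentenceTransformer("llm-wizard/legal-ft-2", truncate_dim=256)
+ embeddings = model.encode(["A shorter, cheaper embedding"])
+ print(embeddings.shape)  # (1, 256)
+ ```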
+
+ ### Training Hyperparameters
+ #### Non-Default Hyperparameters
+
+ - `eval_strategy`: steps
+ - `per_device_train_batch_size`: 10
+ - `per_device_eval_batch_size`: 10
+ - `num_train_epochs`: 10
+ - `multi_dataset_batch_sampler`: round_robin
+
+ #### All Hyperparameters
+ <details><summary>Click to expand</summary>
+
+ - `overwrite_output_dir`: False
+ - `do_predict`: False
+ - `eval_strategy`: steps
+ - `prediction_loss_only`: True
+ - `per_device_train_batch_size`: 10
+ - `per_device_eval_batch_size`: 10
+ - `per_gpu_train_batch_size`: None
+ - `per_gpu_eval_batch_size`: None
+ - `gradient_accumulation_steps`: 1
+ - `eval_accumulation_steps`: None
+ - `torch_empty_cache_steps`: None
+ - `learning_rate`: 5e-05
+ - `weight_decay`: 0.0
+ - `adam_beta1`: 0.9
+ - `adam_beta2`: 0.999
+ - `adam_epsilon`: 1e-08
+ - `max_grad_norm`: 1
+ - `num_train_epochs`: 10
+ - `max_steps`: -1
+ - `lr_scheduler_type`: linear
+ - `lr_scheduler_kwargs`: {}
+ - `warmup_ratio`: 0.0
+ - `warmup_steps`: 0
+ - `log_level`: passive
+ - `log_level_replica`: warning
+ - `log_on_each_node`: True
+ - `logging_nan_inf_filter`: True
+ - `save_safetensors`: True
+ - `save_on_each_node`: False
+ - `save_only_model`: False
+ - `restore_callback_states_from_checkpoint`: False
+ - `no_cuda`: False
+ - `use_cpu`: False
+ - `use_mps_device`: False
+ - `seed`: 42
+ - `data_seed`: None
+ - `jit_mode_eval`: False
+ - `use_ipex`: False
+ - `bf16`: False
+ - `fp16`: False
+ - `fp16_opt_level`: O1
+ - `half_precision_backend`: auto
+ - `bf16_full_eval`: False
+ - `fp16_full_eval`: False
+ - `tf32`: None
+ - `local_rank`: 0
+ - `ddp_backend`: None
+ - `tpu_num_cores`: None
+ - `tpu_metrics_debug`: False
+ - `debug`: []
+ - `dataloader_drop_last`: False
+ - `dataloader_num_workers`: 0
+ - `dataloader_prefetch_factor`: None
+ - `past_index`: -1
+ - `disable_tqdm`: False
+ - `remove_unused_columns`: True
+ - `label_names`: None
+ - `load_best_model_at_end`: False
+ - `ignore_data_skip`: False
+ - `fsdp`: []
+ - `fsdp_min_num_params`: 0
+ - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
+ - `fsdp_transformer_layer_cls_to_wrap`: None
+ - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
+ - `deepspeed`: None
+ - `label_smoothing_factor`: 0.0
+ - `optim`: adamw_torch
+ - `optim_args`: None
+ - `adafactor`: False
+ - `group_by_length`: False
+ - `length_column_name`: length
+ - `ddp_find_unused_parameters`: None
+ - `ddp_bucket_cap_mb`: None
+ - `ddp_broadcast_buffers`: False
+ - `dataloader_pin_memory`: True
+ - `dataloader_persistent_workers`: False
+ - `skip_memory_metrics`: True
+ - `use_legacy_prediction_loop`: False
+ - `push_to_hub`: False
+ - `resume_from_checkpoint`: None
+ - `hub_model_id`: None
+ - `hub_strategy`: every_save
+ - `hub_private_repo`: None
+ - `hub_always_push`: False
+ - `gradient_checkpointing`: False
+ - `gradient_checkpointing_kwargs`: None
+ - `include_inputs_for_metrics`: False
+ - `include_for_metrics`: []
+ - `eval_do_concat_batches`: True
+ - `fp16_backend`: auto
+ - `push_to_hub_model_id`: None
+ - `push_to_hub_organization`: None
+ - `mp_parameters`:
+ - `auto_find_batch_size`: False
+ - `full_determinism`: False
+ - `torchdynamo`: None
+ - `ray_scope`: last
+ - `ddp_timeout`: 1800
+ - `torch_compile`: False
+ - `torch_compile_backend`: None
+ - `torch_compile_mode`: None
+ - `dispatch_batches`: None
+ - `split_batches`: None
+ - `include_tokens_per_second`: False
+ - `include_num_input_tokens_seen`: False
+ - `neftune_noise_alpha`: None
+ - `optim_target_modules`: None
+ - `batch_eval_metrics`: False
+ - `eval_on_start`: False
+ - `use_liger_kernel`: False
+ - `eval_use_gather_object`: False
+ - `average_tokens_across_devices`: False
+ - `prompts`: None
+ - `batch_sampler`: batch_sampler
+ - `multi_dataset_batch_sampler`: round_robin
+
+ </details>
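+
+ For reference, a hedged sketch of reproducing this setup with the `SentenceTransformerTrainer` API. The dataset rows and output directory below are placeholders, and the evaluation-related settings listed above are omitted for brevity:
+
+ ```python
+ from datasets import Dataset
+ from sentence_transformers import (
+     SentenceTransformer,
+     SentenceTransformerTrainer,
+     SentenceTransformerTrainingArguments,
+ )
+ from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss
+
+ model = SentenceTransformer("Snowflake/snowflake-arctic-embed-l")
+
+ # The real run used 156 (sentence_0, sentence_1) pairs; two placeholder rows here.
+ train_dataset = Dataset.from_dict({
+     "sentence_0": ["What is the training cost of DeepSeek v3?",
+                    "Which podcasts discussed Large Language Models?"],
+     "sentence_1": ["DeepSeek v3 was trained on 2,788,000 H800 GPU hours...",
+                    "LLMs break the internet on the Changelog..."],
+ })
+
+ # In-batch negatives loss, wrapped so that the first 768/512/256/128/64
+ # embedding dimensions are each trained to be useful on their own.
+ loss = MatryoshkaLoss(
+     model,
+     MultipleNegativesRankingLoss(model),
+     matryoshka_dims=[768, 512, 256, 128, 64],
+ )
+
+ args = SentenceTransformerTrainingArguments(
+     output_dir="models/legal-ft-2",  # hypothetical path
+     num_train_epochs=10,
+     per_device_train_batch_size=10,
+ )
+
+ trainer = SentenceTransformerTrainer(
+     model=model, args=args, train_dataset=train_dataset, loss=loss,
+ )
+ trainer.train()
+ ```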
+
+ ### Training Logs
+ | Epoch | Step | cosine_ndcg@10 |
+ |:-----:|:----:|:--------------:|
+ | 1.0   | 16   | 1.0            |
+ | 2.0   | 32   | 1.0            |
+ | 3.0   | 48   | 1.0            |
+ | 3.125 | 50   | 1.0            |
+ | 4.0   | 64   | 1.0            |
+ | 5.0   | 80   | 1.0            |
+ | 6.0   | 96   | 1.0            |
+ | 6.25  | 100  | 1.0            |
+ | 7.0   | 112  | 1.0            |
+ | 8.0   | 128  | 1.0            |
+ | 9.0   | 144  | 1.0            |
+ | 9.375 | 150  | 1.0            |
+ | 10.0  | 160  | 1.0            |
+
+
+ ### Framework Versions
+ - Python: 3.13.1
+ - Sentence Transformers: 3.4.1
+ - Transformers: 4.48.3
+ - PyTorch: 2.6.0+cu124
+ - Accelerate: 1.3.0
+ - Datasets: 3.2.0
+ - Tokenizers: 0.21.0
+
+ ## Citation
+
+ ### BibTeX
+
+ #### Sentence Transformers
+ ```bibtex
+ @inproceedings{reimers-2019-sentence-bert,
+     title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
+     author = "Reimers, Nils and Gurevych, Iryna",
+     booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
+     month = "11",
+     year = "2019",
+     publisher = "Association for Computational Linguistics",
+     url = "https://arxiv.org/abs/1908.10084",
+ }
+ ```
+
+ #### MatryoshkaLoss
+ ```bibtex
+ @misc{kusupati2024matryoshka,
+     title={Matryoshka Representation Learning},
+     author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
+     year={2024},
+     eprint={2205.13147},
+     archivePrefix={arXiv},
+     primaryClass={cs.LG}
+ }
+ ```
+
+ #### MultipleNegativesRankingLoss
+ ```bibtex
+ @misc{henderson2017efficient,
+     title={Efficient Natural Language Response Suggestion for Smart Reply},
+     author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
+     year={2017},
+     eprint={1705.00652},
+     archivePrefix={arXiv},
+     primaryClass={cs.CL}
+ }
+ ```
+
+ <!--
+ ## Glossary
+
+ *Clearly define terms in order to be accessible across audiences.*
+ -->
+
+ <!--
+ ## Model Card Authors
+
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
+ -->
+
+ <!--
+ ## Model Card Contact
+
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
+ -->
config.json ADDED
@@ -0,0 +1,25 @@
+ {
+   "_name_or_path": "Snowflake/snowflake-arctic-embed-l",
+   "architectures": [
+     "BertModel"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "classifier_dropout": null,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 1024,
+   "initializer_range": 0.02,
+   "intermediate_size": 4096,
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 16,
+   "num_hidden_layers": 24,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "torch_dtype": "float32",
+   "transformers_version": "4.48.3",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 30522
+ }
config_sentence_transformers.json ADDED
@@ -0,0 +1,12 @@
+ {
+   "__version__": {
+     "sentence_transformers": "3.4.1",
+     "transformers": "4.48.3",
+     "pytorch": "2.6.0+cu124"
+   },
+   "prompts": {
+     "query": "Represent this sentence for searching relevant passages: "
+   },
+   "default_prompt_name": null,
+   "similarity_fn_name": "cosine"
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:675c77bfbd9c7527de94a6f13bc029fc78de9b2c65bd419f154cf0422cf7554e
+ size 1336413848
modules.json ADDED
@@ -0,0 +1,20 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.models.Transformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_Pooling",
+     "type": "sentence_transformers.models.Pooling"
+   },
+   {
+     "idx": 2,
+     "name": "2",
+     "path": "2_Normalize",
+     "type": "sentence_transformers.models.Normalize"
+   }
+ ]
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "max_seq_length": 512,
+   "do_lower_case": false
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
+ {
+   "cls_token": {
+     "content": "[CLS]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "[MASK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "[PAD]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "[SEP]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "[UNK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,63 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "100": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "101": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "102": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "103": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "do_lower_case": true,
+   "extra_special_tokens": {},
+   "mask_token": "[MASK]",
+   "max_length": 512,
+   "model_max_length": 512,
+   "pad_to_multiple_of": null,
+   "pad_token": "[PAD]",
+   "pad_token_type_id": 0,
+   "padding_side": "right",
+   "sep_token": "[SEP]",
+   "stride": 0,
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "truncation_side": "right",
+   "truncation_strategy": "longest_first",
+   "unk_token": "[UNK]"
+ }
vocab.txt ADDED
The diff for this file is too large to render. See raw diff