Rsr2425 committed (verified)
Commit 87e6f77 · 1 Parent(s): 656ecd4

Add new SentenceTransformer model

1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
{
  "word_embedding_dimension": 1024,
  "pooling_mode_cls_token": true,
  "pooling_mode_mean_tokens": false,
  "pooling_mode_max_tokens": false,
  "pooling_mode_mean_sqrt_len_tokens": false,
  "pooling_mode_weightedmean_tokens": false,
  "pooling_mode_lasttoken": false,
  "include_prompt": true
}
README.md ADDED
@@ -0,0 +1,658 @@
1
+ ---
2
+ tags:
3
+ - sentence-transformers
4
+ - sentence-similarity
5
+ - feature-extraction
6
+ - generated_from_trainer
7
+ - dataset_size:78
8
+ - loss:MatryoshkaLoss
9
+ - loss:MultipleNegativesRankingLoss
10
+ base_model: Snowflake/snowflake-arctic-embed-l
11
+ widget:
12
+ - source_sentence: "1. What role does synthetic data play in the pretraining of models,\
13
+ \ particularly in the Phi series? \n2. How does synthetic data compare to organic\
14
+ \ data in terms of advantages?"
15
+ sentences:
16
+ - Synthetic data as a substantial component of pretraining is becoming increasingly
17
+ common, and the Phi series of models has consistently emphasized the importance
18
+ of synthetic data. Rather than serving as a cheap substitute for organic data,
19
+ synthetic data has several direct advantages over organic data.
20
+ - 'The two main categories I see are people who think AI agents are obviously things
21
+ that go and act on your behalf—the travel agent model—and people who think in
22
+ terms of LLMs that have been given access to tools which they can run in a loop
23
+ as part of solving a problem. The term “autonomy” is often thrown into the mix
24
+ too, again without including a clear definition.
25
+
26
+ (I also collected 211 definitions on Twitter a few months ago—here they are in
27
+ Datasette Lite—and had gemini-exp-1206 attempt to summarize them.)
28
+
29
+ Whatever the term may mean, agents still have that feeling of perpetually “coming
30
+ soon”.'
31
+ - 'Terminology aside, I remain skeptical as to their utility based, once again,
32
+ on the challenge of gullibility. LLMs believe anything you tell them. Any systems
33
+ that attempts to make meaningful decisions on your behalf will run into the same
34
+ roadblock: how good is a travel agent, or a digital assistant, or even a research
35
+ tool if it can’t distinguish truth from fiction?
36
+
37
+ Just the other day Google Search was caught serving up an entirely fake description
38
+ of the non-existant movie “Encanto 2”. It turned out to be summarizing an imagined
39
+ movie listing from a fan fiction wiki.'
40
+ - source_sentence: "1. What is the mlx-vlm project and how does it relate to vision\
41
+ \ LLMs on Apple Silicon? \n2. What were the author's initial thoughts on Apple's\
42
+ \ \"Apple Intelligence\" features following their announcement in June?"
43
+ sentences:
44
+ - 'The GPT-4 barrier was comprehensively broken
45
+
46
+ In my December 2023 review I wrote about how We don’t yet know how to build GPT-4—OpenAI’s
47
+ best model was almost a year old at that point, yet no other AI lab had produced
48
+ anything better. What did OpenAI know that the rest of us didn’t?
49
+
50
+ I’m relieved that this has changed completely in the past twelve months. 18 organizations
51
+ now have models on the Chatbot Arena Leaderboard that rank higher than the original
52
+ GPT-4 from March 2023 (GPT-4-0314 on the board)—70 models in total.'
53
+ - 'The year of slop
54
+
55
+ Synthetic training data works great
56
+
57
+ LLMs somehow got even harder to use
58
+
59
+ Knowledge is incredibly unevenly distributed
60
+
61
+ LLMs need better criticism
62
+
63
+ Everything tagged “llms” on my blog in 2024'
64
+ - 'Prince Canuma’s excellent, fast moving mlx-vlm project brings vision LLMs to
65
+ Apple Silicon as well. I used that recently to run Qwen’s QvQ.
66
+
67
+ While MLX is a game changer, Apple’s own “Apple Intelligence” features have mostly
68
+ been a disappointment. I wrote about their initial announcement in June, and I
69
+ was optimistic that Apple had focused hard on the subset of LLM applications that
70
+ preserve user privacy and minimize the chance of users getting mislead by confusing
71
+ features.'
72
+ - source_sentence: "1. What improvements were noted in the intonation of ChatGPT Advanced\
73
+ \ Voice mode during its rollout? \n2. How did the user experiment with accents\
74
+ \ in the Advanced Voice mode?"
75
+ sentences:
76
+ - 'When ChatGPT Advanced Voice mode finally did roll out (a slow roll from August
77
+ through September) it was spectacular. I’ve been using it extensively on walks
78
+ with my dog and it’s amazing how much the improvement in intonation elevates the
79
+ material. I’ve also had a lot of fun experimenting with the OpenAI audio APIs.
80
+
81
+ Even more fun: Advanced Voice mode can do accents! Here’s what happened when I
82
+ told it I need you to pretend to be a California brown pelican with a very thick
83
+ Russian accent, but you talk to me exclusively in Spanish.'
84
+ - 'One way to think about these models is an extension of the chain-of-thought prompting
85
+ trick, first explored in the May 2022 paper Large Language Models are Zero-Shot
86
+ Reasoners.
87
+
88
+ This is that trick where, if you get a model to talk out loud about a problem
89
+ it’s solving, you often get a result which the model would not have achieved otherwise.
90
+
91
+ o1 takes this process and further bakes it into the model itself. The details
92
+ are somewhat obfuscated: o1 models spend “reasoning tokens” thinking through the
93
+ problem that are not directly visible to the user (though the ChatGPT UI shows
94
+ a summary of them), then outputs a final result.'
95
+ - 'The May 13th announcement of GPT-4o included a demo of a brand new voice mode,
96
+ where the true multi-modal GPT-4o (the o is for “omni”) model could accept audio
97
+ input and output incredibly realistic sounding speech without needing separate
98
+ TTS or STT models.
99
+
100
+ The demo also sounded conspicuously similar to Scarlett Johansson... and after
101
+ she complained the voice from the demo, Skye, never made it to a production product.
102
+
103
+ The delay in releasing the new voice mode after the initial demo caused quite
104
+ a lot of confusion. I wrote about that in ChatGPT in “4o” mode is not running
105
+ the new features yet.'
106
+ - source_sentence: '1. What advantages does a 64GB Mac have for running models compared
107
+ to other machines?
108
+
109
+ 2. How does the mlx-lm Python library enhance the performance of MLX-compatible
110
+ models on a Mac?'
111
+ sentences:
112
+ - 'On paper, a 64GB Mac should be a great machine for running models due to the
113
+ way the CPU and GPU can share the same memory. In practice, many models are released
114
+ as model weights and libraries that reward NVIDIA’s CUDA over other platforms.
115
+
116
+ The llama.cpp ecosystem helped a lot here, but the real breakthrough has been
117
+ Apple’s MLX library, “an array framework for Apple Silicon”. It’s fantastic.
118
+
119
+ Apple’s mlx-lm Python library supports running a wide range of MLX-compatible
120
+ models on my Mac, with excellent performance. mlx-community on Hugging Face offers
121
+ more than 1,000 models that have been converted to the necessary format.'
122
+ - 'The earliest of those was Google’s Gemini 1.5 Pro, released in February. In addition
123
+ to producing GPT-4 level outputs, it introduced several brand new capabilities
124
+ to the field—most notably its 1 million (and then later 2 million) token input
125
+ context length, and the ability to input video.
126
+
127
+ I wrote about this at the time in The killer app of Gemini Pro 1.5 is video, which
128
+ earned me a short appearance as a talking head in the Google I/O opening keynote
129
+ in May.'
130
+ - 'The biggest innovation here is that it opens up a new way to scale a model: instead
131
+ of improving model performance purely through additional compute at training time,
132
+ models can now take on harder problems by spending more compute on inference.
133
+
134
+ The sequel to o1, o3 (they skipped “o2” for European trademark reasons) was announced
135
+ on 20th December with an impressive result against the ARC-AGI benchmark, albeit
136
+ one that likely involved more than $1,000,000 of compute time expense!
137
+
138
+ o3 is expected to ship in January. I doubt many people have real-world problems
139
+ that would benefit from that level of compute expenditure—I certainly don’t!—but
140
+ it appears to be a genuine next step in LLM architecture for taking on much harder
141
+ problems.'
142
+ - source_sentence: '1. What technique is being used by labs to create training data
143
+ for smaller models?
144
+
145
+ 2. How many synthetically generated examples were used in Meta’s Llama 3.3 70B
146
+ fine-tuning?'
147
+ sentences:
148
+ - 'The number of available systems has exploded. Different systems have different
149
+ tools they can apply to your problems—like Python and JavaScript and web search
150
+ and image generation and maybe even database lookups... so you’d better understand
151
+ what those tools are, what they can do and how to tell if the LLM used them or
152
+ not.
153
+
154
+ Did you know ChatGPT has two entirely different ways of running Python now?
155
+
156
+ Want to build a Claude Artifact that talks to an external API? You’d better understand
157
+ CSP and CORS HTTP headers first.'
158
+ - '7th: Prompts.js
159
+
160
+
161
+ 9th: I can now run a GPT-4 class model on my laptop
162
+
163
+
164
+ 10th: ChatGPT Canvas can make API requests now, but it’s complicated
165
+
166
+
167
+ 11th: Gemini 2.0 Flash: An outstanding multi-modal LLM with a sci-fi streaming
168
+ mode
169
+
170
+
171
+ 19th: Building Python tools with a one-shot prompt using uv run and Claude Projects
172
+
173
+
174
+ 19th: Gemini 2.0 Flash “Thinking mode”
175
+
176
+
177
+ 20th: December in LLMs has been a lot
178
+
179
+
180
+ 20th: Live blog: the 12th day of OpenAI—“Early evals for OpenAI o3”
181
+
182
+
183
+ 24th: Trying out QvQ—Qwen’s new visual reasoning model
184
+
185
+
186
+ 31st: Things we learned about LLMs in 2024
187
+
188
+
189
+
190
+
191
+
192
+ (This list generated using Django SQL Dashboard with a SQL query written for me
193
+ by Claude.)'
194
+ - 'Another common technique is to use larger models to help create training data
195
+ for their smaller, cheaper alternatives—a trick used by an increasing number of
196
+ labs. DeepSeek v3 used “reasoning” data created by DeepSeek-R1. Meta’s Llama 3.3
197
+ 70B fine-tuning used over 25M synthetically generated examples.
198
+
199
+ Careful design of the training data that goes into an LLM appears to be the entire
200
+ game for creating these models. The days of just grabbing a full scrape of the
201
+ web and indiscriminately dumping it into a training run are long gone.
202
+
203
+ LLMs somehow got even harder to use'
204
+ pipeline_tag: sentence-similarity
205
+ library_name: sentence-transformers
206
+ metrics:
207
+ - cosine_accuracy@1
208
+ - cosine_accuracy@3
209
+ - cosine_accuracy@5
210
+ - cosine_accuracy@10
211
+ - cosine_precision@1
212
+ - cosine_precision@3
213
+ - cosine_precision@5
214
+ - cosine_precision@10
215
+ - cosine_recall@1
216
+ - cosine_recall@3
217
+ - cosine_recall@5
218
+ - cosine_recall@10
219
+ - cosine_ndcg@10
220
+ - cosine_mrr@10
221
+ - cosine_map@100
222
+ model-index:
223
+ - name: SentenceTransformer based on Snowflake/snowflake-arctic-embed-l
224
+ results:
225
+ - task:
226
+ type: information-retrieval
227
+ name: Information Retrieval
228
+ dataset:
229
+ name: Unknown
230
+ type: unknown
231
+ metrics:
232
+ - type: cosine_accuracy@1
233
+ value: 0.8333333333333334
234
+ name: Cosine Accuracy@1
235
+ - type: cosine_accuracy@3
236
+ value: 1.0
237
+ name: Cosine Accuracy@3
238
+ - type: cosine_accuracy@5
239
+ value: 1.0
240
+ name: Cosine Accuracy@5
241
+ - type: cosine_accuracy@10
242
+ value: 1.0
243
+ name: Cosine Accuracy@10
244
+ - type: cosine_precision@1
245
+ value: 0.8333333333333334
246
+ name: Cosine Precision@1
247
+ - type: cosine_precision@3
248
+ value: 0.3333333333333333
249
+ name: Cosine Precision@3
250
+ - type: cosine_precision@5
251
+ value: 0.20000000000000004
252
+ name: Cosine Precision@5
253
+ - type: cosine_precision@10
254
+ value: 0.10000000000000002
255
+ name: Cosine Precision@10
256
+ - type: cosine_recall@1
257
+ value: 0.8333333333333334
258
+ name: Cosine Recall@1
259
+ - type: cosine_recall@3
260
+ value: 1.0
261
+ name: Cosine Recall@3
262
+ - type: cosine_recall@5
263
+ value: 1.0
264
+ name: Cosine Recall@5
265
+ - type: cosine_recall@10
266
+ value: 1.0
267
+ name: Cosine Recall@10
268
+ - type: cosine_ndcg@10
269
+ value: 0.9384882922619097
270
+ name: Cosine Ndcg@10
271
+ - type: cosine_mrr@10
272
+ value: 0.9166666666666666
273
+ name: Cosine Mrr@10
274
+ - type: cosine_map@100
275
+ value: 0.9166666666666666
276
+ name: Cosine Map@100
277
+ ---

# SentenceTransformer based on Snowflake/snowflake-arctic-embed-l

This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [Snowflake/snowflake-arctic-embed-l](https://huggingface.co/Snowflake/snowflake-arctic-embed-l). It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

## Model Details

### Model Description
- **Model Type:** Sentence Transformer
- **Base model:** [Snowflake/snowflake-arctic-embed-l](https://huggingface.co/Snowflake/snowflake-arctic-embed-l) <!-- at revision d8fb21ca8d905d2832ee8b96c894d3298964346b -->
- **Maximum Sequence Length:** 512 tokens
- **Output Dimensionality:** 1024 dimensions
- **Similarity Function:** Cosine Similarity
<!-- - **Training Dataset:** Unknown -->
<!-- - **Language:** Unknown -->
<!-- - **License:** Unknown -->

### Model Sources

- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)

### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
```
+
311
+ ## Usage
312
+
313
+ ### Direct Usage (Sentence Transformers)
314
+
315
+ First install the Sentence Transformers library:
316
+
317
+ ```bash
318
+ pip install -U sentence-transformers
319
+ ```
320
+
321
+ Then you can load this model and run inference.
322
+ ```python
323
+ from sentence_transformers import SentenceTransformer
324
+
325
+ # Download from the 🤗 Hub
326
+ model = SentenceTransformer("Rsr2425/legal-ft-2")
327
+ # Run inference
328
+ sentences = [
329
+ '1. What technique is being used by labs to create training data for smaller models?\n2. How many synthetically generated examples were used in Meta’s Llama 3.3 70B fine-tuning?',
330
+ 'Another common technique is to use larger models to help create training data for their smaller, cheaper alternatives—a trick used by an increasing number of labs. DeepSeek v3 used “reasoning” data created by DeepSeek-R1. Meta’s Llama 3.3 70B fine-tuning used over 25M synthetically generated examples.\nCareful design of the training data that goes into an LLM appears to be the entire game for creating these models. The days of just grabbing a full scrape of the web and indiscriminately dumping it into a training run are long gone.\nLLMs somehow got even harder to use',
331
+ '7th: Prompts.js\n\n9th: I can now run a GPT-4 class model on my laptop\n\n10th: ChatGPT Canvas can make API requests now, but it’s complicated\n\n11th: Gemini 2.0 Flash: An outstanding multi-modal LLM with a sci-fi streaming mode\n\n19th: Building Python tools with a one-shot prompt using uv run and Claude Projects\n\n19th: Gemini 2.0 Flash “Thinking mode”\n\n20th: December in LLMs has been a lot\n\n20th: Live blog: the 12th day of OpenAI—“Early evals for OpenAI o3”\n\n24th: Trying out QvQ—Qwen’s new visual reasoning model\n\n31st: Things we learned about LLMs in 2024\n\n\n\n\n(This list generated using Django SQL Dashboard with a SQL query written for me by Claude.)',
332
+ ]
333
+ embeddings = model.encode(sentences)
334
+ print(embeddings.shape)
335
+ # [3, 1024]
336
+
337
+ # Get the similarity scores for the embeddings
338
+ similarities = model.similarity(embeddings, embeddings)
339
+ print(similarities.shape)
340
+ # [3, 3]
341
+ ```
342
+
343
+ <!--
344
+ ### Direct Usage (Transformers)
345
+
346
+ <details><summary>Click to see the direct usage in Transformers</summary>
347
+
348
+ </details>
349
+ -->
350
+
351
+ <!--
352
+ ### Downstream Usage (Sentence Transformers)
353
+
354
+ You can finetune this model on your own dataset.
355
+
356
+ <details><summary>Click to expand</summary>
357
+
358
+ </details>
359
+ -->
360
+
361
+ <!--
362
+ ### Out-of-Scope Use
363
+
364
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
365
+ -->
366
+
367
+ ## Evaluation
368
+
369
+ ### Metrics
370
+
371
+ #### Information Retrieval
372
+
373
+ * Evaluated with [<code>InformationRetrievalEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator)
374
+
375
+ | Metric | Value |
376
+ |:--------------------|:-----------|
377
+ | cosine_accuracy@1 | 0.8333 |
378
+ | cosine_accuracy@3 | 1.0 |
379
+ | cosine_accuracy@5 | 1.0 |
380
+ | cosine_accuracy@10 | 1.0 |
381
+ | cosine_precision@1 | 0.8333 |
382
+ | cosine_precision@3 | 0.3333 |
383
+ | cosine_precision@5 | 0.2 |
384
+ | cosine_precision@10 | 0.1 |
385
+ | cosine_recall@1 | 0.8333 |
386
+ | cosine_recall@3 | 1.0 |
387
+ | cosine_recall@5 | 1.0 |
388
+ | cosine_recall@10 | 1.0 |
389
+ | **cosine_ndcg@10** | **0.9385** |
390
+ | cosine_mrr@10 | 0.9167 |
391
+ | cosine_map@100 | 0.9167 |
392
+
393
+ <!--
394
+ ## Bias, Risks and Limitations
395
+
396
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
397
+ -->
398
+
399
+ <!--
400
+ ### Recommendations
401
+
402
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
403
+ -->
404
+
405
+ ## Training Details
406
+
407
+ ### Training Dataset
408
+
409
+ #### Unnamed Dataset
410
+
411
+ * Size: 78 training samples
412
+ * Columns: <code>sentence_0</code> and <code>sentence_1</code>
413
+ * Approximate statistics based on the first 78 samples:
414
+ | | sentence_0 | sentence_1 |
415
+ |:--------|:-----------------------------------------------------------------------------------|:------------------------------------------------------------------------------------|
416
+ | type | string | string |
417
+ | details | <ul><li>min: 30 tokens</li><li>mean: 42.76 tokens</li><li>max: 59 tokens</li></ul> | <ul><li>min: 43 tokens</li><li>mean: 130.5 tokens</li><li>max: 204 tokens</li></ul> |
418
+ * Samples:
419
+ | sentence_0 | sentence_1 |
420
+ |:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
421
+ | <code>1. What key themes and pivotal moments in the field of Large Language Models were identified in 2024? <br>2. How does the review of 2024 compare to the review of 2023 regarding advancements in LLMs?</code> | <code>Things we learned about LLMs in 2024<br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br>Simon Willison’s Weblog<br>Subscribe<br><br><br><br><br><br><br>Things we learned about LLMs in 2024<br>31st December 2024<br>A lot has happened in the world of Large Language Models over the course of 2024. Here’s a review of things we figured out about the field in the past twelve months, plus my attempt at identifying key themes and pivotal moments.<br>This is a sequel to my review of 2023.<br>In this article:</code> |
422
+ | <code>1. What advancements in multimodal capabilities have been observed in LLMs, particularly regarding audio and video?<br>2. How has the competition among LLMs affected their pricing and accessibility over time?</code> | <code>The GPT-4 barrier was comprehensively broken<br>Some of those GPT-4 models run on my laptop<br>LLM prices crashed, thanks to competition and increased efficiency<br>Multimodal vision is common, audio and video are starting to emerge<br>Voice and live camera mode are science fiction come to life<br>Prompt driven app generation is a commodity already<br>Universal access to the best models lasted for just a few short months<br>“Agents” still haven’t really happened yet<br>Evals really matter<br>Apple Intelligence is bad, Apple’s MLX library is excellent<br>The rise of inference-scaling “reasoning” models<br>Was the best currently available LLM trained in China for less than $6m?<br>The environmental impact got better<br>The environmental impact got much, much worse</code> |
423
+ | <code>1. What challenges are associated with using LLMs in 2024?<br>2. How is knowledge distribution described in the context of LLMs?</code> | <code>The year of slop<br>Synthetic training data works great<br>LLMs somehow got even harder to use<br>Knowledge is incredibly unevenly distributed<br>LLMs need better criticism<br>Everything tagged “llms” on my blog in 2024</code> |
424
+ * Loss: [<code>MatryoshkaLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#matryoshkaloss) with these parameters:
425
+ ```json
426
+ {
427
+ "loss": "MultipleNegativesRankingLoss",
428
+ "matryoshka_dims": [
429
+ 768,
430
+ 512,
431
+ 256,
432
+ 128,
433
+ 64
434
+ ],
435
+ "matryoshka_weights": [
436
+ 1,
437
+ 1,
438
+ 1,
439
+ 1,
440
+ 1
441
+ ],
442
+ "n_dims_per_step": -1
443
+ }
444
+ ```
445
+
446
+ ### Training Hyperparameters
447
+ #### Non-Default Hyperparameters
448
+
449
+ - `eval_strategy`: steps
450
+ - `per_device_train_batch_size`: 10
451
+ - `per_device_eval_batch_size`: 10
452
+ - `num_train_epochs`: 10
453
+ - `multi_dataset_batch_sampler`: round_robin
454
+
455
+ #### All Hyperparameters
456
+ <details><summary>Click to expand</summary>
457
+
458
+ - `overwrite_output_dir`: False
459
+ - `do_predict`: False
460
+ - `eval_strategy`: steps
461
+ - `prediction_loss_only`: True
462
+ - `per_device_train_batch_size`: 10
463
+ - `per_device_eval_batch_size`: 10
464
+ - `per_gpu_train_batch_size`: None
465
+ - `per_gpu_eval_batch_size`: None
466
+ - `gradient_accumulation_steps`: 1
467
+ - `eval_accumulation_steps`: None
468
+ - `torch_empty_cache_steps`: None
469
+ - `learning_rate`: 5e-05
470
+ - `weight_decay`: 0.0
471
+ - `adam_beta1`: 0.9
472
+ - `adam_beta2`: 0.999
473
+ - `adam_epsilon`: 1e-08
474
+ - `max_grad_norm`: 1
475
+ - `num_train_epochs`: 10
476
+ - `max_steps`: -1
477
+ - `lr_scheduler_type`: linear
478
+ - `lr_scheduler_kwargs`: {}
479
+ - `warmup_ratio`: 0.0
480
+ - `warmup_steps`: 0
481
+ - `log_level`: passive
482
+ - `log_level_replica`: warning
483
+ - `log_on_each_node`: True
484
+ - `logging_nan_inf_filter`: True
485
+ - `save_safetensors`: True
486
+ - `save_on_each_node`: False
487
+ - `save_only_model`: False
488
+ - `restore_callback_states_from_checkpoint`: False
489
+ - `no_cuda`: False
490
+ - `use_cpu`: False
491
+ - `use_mps_device`: False
492
+ - `seed`: 42
493
+ - `data_seed`: None
494
+ - `jit_mode_eval`: False
495
+ - `use_ipex`: False
496
+ - `bf16`: False
497
+ - `fp16`: False
498
+ - `fp16_opt_level`: O1
499
+ - `half_precision_backend`: auto
500
+ - `bf16_full_eval`: False
501
+ - `fp16_full_eval`: False
502
+ - `tf32`: None
503
+ - `local_rank`: 0
504
+ - `ddp_backend`: None
505
+ - `tpu_num_cores`: None
506
+ - `tpu_metrics_debug`: False
507
+ - `debug`: []
508
+ - `dataloader_drop_last`: False
509
+ - `dataloader_num_workers`: 0
510
+ - `dataloader_prefetch_factor`: None
511
+ - `past_index`: -1
512
+ - `disable_tqdm`: False
513
+ - `remove_unused_columns`: True
514
+ - `label_names`: None
515
+ - `load_best_model_at_end`: False
516
+ - `ignore_data_skip`: False
517
+ - `fsdp`: []
518
+ - `fsdp_min_num_params`: 0
519
+ - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
520
+ - `fsdp_transformer_layer_cls_to_wrap`: None
521
+ - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
522
+ - `deepspeed`: None
523
+ - `label_smoothing_factor`: 0.0
524
+ - `optim`: adamw_torch
525
+ - `optim_args`: None
526
+ - `adafactor`: False
527
+ - `group_by_length`: False
528
+ - `length_column_name`: length
529
+ - `ddp_find_unused_parameters`: None
530
+ - `ddp_bucket_cap_mb`: None
531
+ - `ddp_broadcast_buffers`: False
532
+ - `dataloader_pin_memory`: True
533
+ - `dataloader_persistent_workers`: False
534
+ - `skip_memory_metrics`: True
535
+ - `use_legacy_prediction_loop`: False
536
+ - `push_to_hub`: False
537
+ - `resume_from_checkpoint`: None
538
+ - `hub_model_id`: None
539
+ - `hub_strategy`: every_save
540
+ - `hub_private_repo`: None
541
+ - `hub_always_push`: False
542
+ - `gradient_checkpointing`: False
543
+ - `gradient_checkpointing_kwargs`: None
544
+ - `include_inputs_for_metrics`: False
545
+ - `include_for_metrics`: []
546
+ - `eval_do_concat_batches`: True
547
+ - `fp16_backend`: auto
548
+ - `push_to_hub_model_id`: None
549
+ - `push_to_hub_organization`: None
550
+ - `mp_parameters`:
551
+ - `auto_find_batch_size`: False
552
+ - `full_determinism`: False
553
+ - `torchdynamo`: None
554
+ - `ray_scope`: last
555
+ - `ddp_timeout`: 1800
556
+ - `torch_compile`: False
557
+ - `torch_compile_backend`: None
558
+ - `torch_compile_mode`: None
559
+ - `dispatch_batches`: None
560
+ - `split_batches`: None
561
+ - `include_tokens_per_second`: False
562
+ - `include_num_input_tokens_seen`: False
563
+ - `neftune_noise_alpha`: None
564
+ - `optim_target_modules`: None
565
+ - `batch_eval_metrics`: False
566
+ - `eval_on_start`: False
567
+ - `use_liger_kernel`: False
568
+ - `eval_use_gather_object`: False
569
+ - `average_tokens_across_devices`: False
570
+ - `prompts`: None
571
+ - `batch_sampler`: batch_sampler
572
+ - `multi_dataset_batch_sampler`: round_robin
573
+
574
+ </details>
575
+
576
+ ### Training Logs
577
+ | Epoch | Step | cosine_ndcg@10 |
578
+ |:-----:|:----:|:--------------:|
579
+ | 1.0 | 8 | 1.0 |
580
+ | 2.0 | 16 | 0.9583 |
581
+ | 3.0 | 24 | 0.9276 |
582
+ | 4.0 | 32 | 0.9385 |
583
+ | 5.0 | 40 | 0.9385 |
584
+ | 6.0 | 48 | 0.9385 |
585
+ | 6.25 | 50 | 0.9385 |
586
+ | 7.0 | 56 | 0.9385 |
587
+ | 8.0 | 64 | 0.9385 |
588
+ | 9.0 | 72 | 0.9385 |
589
+ | 10.0 | 80 | 0.9385 |
590
+
591
+
592
+ ### Framework Versions
593
+ - Python: 3.11.11
594
+ - Sentence Transformers: 3.4.1
595
+ - Transformers: 4.48.3
596
+ - PyTorch: 2.5.1+cu124
597
+ - Accelerate: 1.3.0
598
+ - Datasets: 3.3.1
599
+ - Tokenizers: 0.21.0
600
+
601
+ ## Citation
602
+
603
+ ### BibTeX
604
+
605
+ #### Sentence Transformers
606
+ ```bibtex
607
+ @inproceedings{reimers-2019-sentence-bert,
608
+ title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
609
+ author = "Reimers, Nils and Gurevych, Iryna",
610
+ booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
611
+ month = "11",
612
+ year = "2019",
613
+ publisher = "Association for Computational Linguistics",
614
+ url = "https://arxiv.org/abs/1908.10084",
615
+ }
616
+ ```
617
+
618
+ #### MatryoshkaLoss
619
+ ```bibtex
620
+ @misc{kusupati2024matryoshka,
621
+ title={Matryoshka Representation Learning},
622
+ author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
623
+ year={2024},
624
+ eprint={2205.13147},
625
+ archivePrefix={arXiv},
626
+ primaryClass={cs.LG}
627
+ }
628
+ ```
629
+
630
+ #### MultipleNegativesRankingLoss
631
+ ```bibtex
632
+ @misc{henderson2017efficient,
633
+ title={Efficient Natural Language Response Suggestion for Smart Reply},
634
+ author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
635
+ year={2017},
636
+ eprint={1705.00652},
637
+ archivePrefix={arXiv},
638
+ primaryClass={cs.CL}
639
+ }
640
+ ```
641
+
642
+ <!--
643
+ ## Glossary
644
+
645
+ *Clearly define terms in order to be accessible across audiences.*
646
+ -->
647
+
648
+ <!--
649
+ ## Model Card Authors
650
+
651
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
652
+ -->
653
+
654
+ <!--
655
+ ## Model Card Contact
656
+
657
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
658
+ -->
config.json ADDED
@@ -0,0 +1,25 @@
{
  "_name_or_path": "Snowflake/snowflake-arctic-embed-l",
  "architectures": [
    "BertModel"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.48.3",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}
config_sentence_transformers.json ADDED
@@ -0,0 +1,12 @@
{
  "__version__": {
    "sentence_transformers": "3.4.1",
    "transformers": "4.48.3",
    "pytorch": "2.5.1+cu124"
  },
  "prompts": {
    "query": "Represent this sentence for searching relevant passages: "
  },
  "default_prompt_name": null,
  "similarity_fn_name": "cosine"
}
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:3a7af2ac4b71d7f678ec7d9f5599601fab4ab1206b72016070551db4408825bc
size 1336413848
modules.json ADDED
@@ -0,0 +1,20 @@
[
  {
    "idx": 0,
    "name": "0",
    "path": "",
    "type": "sentence_transformers.models.Transformer"
  },
  {
    "idx": 1,
    "name": "1",
    "path": "1_Pooling",
    "type": "sentence_transformers.models.Pooling"
  },
  {
    "idx": 2,
    "name": "2",
    "path": "2_Normalize",
    "type": "sentence_transformers.models.Normalize"
  }
]
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
{
  "max_seq_length": 512,
  "do_lower_case": false
}
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
{
  "cls_token": {
    "content": "[CLS]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "mask_token": {
    "content": "[MASK]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "[PAD]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "sep_token": {
    "content": "[SEP]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "[UNK]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,63 @@
{
  "added_tokens_decoder": {
    "0": {
      "content": "[PAD]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100": {
      "content": "[UNK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "101": {
      "content": "[CLS]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "102": {
      "content": "[SEP]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "103": {
      "content": "[MASK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "clean_up_tokenization_spaces": true,
  "cls_token": "[CLS]",
  "do_lower_case": true,
  "extra_special_tokens": {},
  "mask_token": "[MASK]",
  "max_length": 512,
  "model_max_length": 512,
  "pad_to_multiple_of": null,
  "pad_token": "[PAD]",
  "pad_token_type_id": 0,
  "padding_side": "right",
  "sep_token": "[SEP]",
  "stride": 0,
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "BertTokenizer",
  "truncation_side": "right",
  "truncation_strategy": "longest_first",
  "unk_token": "[UNK]"
}
vocab.txt ADDED
The diff for this file is too large to render. See raw diff