---
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:156
- loss:MatryoshkaLoss
- loss:MultipleNegativesRankingLoss
base_model: Snowflake/snowflake-arctic-embed-l
widget:
- source_sentence: >-
How much has the cost of using OpenAI's most expensive model changed
compared to the previous pricing?
sentences:
- >-
Synthetic data as a substantial component of pretraining is becoming
increasingly common, and the Phi series of models has consistently
emphasized the importance of synthetic data. Rather than serving as a
cheap substitute for organic data, synthetic data has several direct
advantages over organic data.
- >-
Here’s the rest of the transcript. It’s bland and generic, but my phone
can pitch bland and generic Christmas movies to Netflix now!
LLM prices crashed, thanks to competition and increased efficiency
The past twelve months have seen a dramatic collapse in the cost of
running a prompt through the top tier hosted LLMs.
In December 2023 (here’s the Internet Archive for the OpenAI pricing
page) OpenAI were charging $30/million input tokens for GPT-4, $10/mTok
for the then-new GPT-4 Turbo and $1/mTok for GPT-3.5 Turbo.
Today $30/mTok gets you OpenAI’s most expensive model, o1. GPT-4o is
$2.50 (12x cheaper than GPT-4) and GPT-4o mini is $0.15/mTok—nearly 7x
cheaper than GPT-3.5 and massively more capable.
- >-
Then there’s the rest. If you browse the Chatbot Arena leaderboard
today—still the most useful single place to get a vibes-based evaluation
of models—you’ll see that GPT-4-0314 has fallen to around 70th place.
The 18 organizations with higher scoring models are Google, OpenAI,
Alibaba, Anthropic, Meta, Reka AI, 01 AI, Amazon, Cohere, DeepSeek,
Nvidia, Mistral, NexusFlow, Zhipu AI, xAI, AI21 Labs, Princeton and
Tencent.
Training a GPT-4 beating model was a huge deal in 2023. In 2024 it’s an
achievement that isn’t even particularly notable, though I personally
still celebrate any time a new organization joins that list.
Some of those GPT-4 models run on my laptop
- source_sentence: >-
What are some potential consequences of making decisions based on hype and
misinformation?
sentences:
- >-
The GPT-4 barrier was comprehensively broken
In my December 2023 review I wrote about how We don’t yet know how to
build GPT-4—OpenAI’s best model was almost a year old at that point, yet
no other AI lab had produced anything better. What did OpenAI know that
the rest of us didn’t?
I’m relieved that this has changed completely in the past twelve months.
18 organizations now have models on the Chatbot Arena Leaderboard that
rank higher than the original GPT-4 from March 2023 (GPT-4-0314 on the
board)—70 models in total.
- >-
I like people who are skeptical of this stuff. The hype has been
deafening for more than two years now, and there are enormous quantities
of snake oil and misinformation out there. A lot of very bad decisions
are being made based on that hype. Being critical is a virtue.
If we want people with decision-making authority to make good decisions
about how to apply these tools we first need to acknowledge that there
ARE good applications, and then help explain how to put those into
practice while avoiding the many unintuitive traps.
(If you still don’t think there are any good applications at all I’m not
sure why you made it to this point in the article!)
- >-
17th: AI for Data Journalism: demonstrating what we can do with this
stuff right now
22nd: Options for accessing Llama 3 from the terminal using LLM
May
8th: Slop is the new name for unwanted AI-generated content
15th: ChatGPT in “4o” mode is not running the new features yet
29th: Training is not the same as chatting: ChatGPT and other LLMs don’t
remember everything you say
June
6th: Accidental prompt injection against RAG applications
10th: Thoughts on the WWDC 2024 keynote on Apple Intelligence
17th: Language models on the command-line
21st: Building search-based RAG using Claude, Datasette and Val Town
27th: Open challenges for AI engineering
July
14th: Imitation Intelligence, my keynote for PyCon US 2024
- source_sentence: >-
What advancements have been made in multimodal vision and audio/video
capabilities in LLMs?
sentences:
- >-
The year of slop
2024 was the year that the word "slop" became a term of art. I wrote
about this in May, expanding on this tweet by @deepfates:
- |-
The GPT-4 barrier was comprehensively broken
Some of those GPT-4 models run on my laptop
LLM prices crashed, thanks to competition and increased efficiency
Multimodal vision is common, audio and video are starting to emerge
Voice and live camera mode are science fiction come to life
Prompt driven app generation is a commodity already
Universal access to the best models lasted for just a few short months
“Agents” still haven’t really happened yet
Evals really matter
Apple Intelligence is bad, Apple’s MLX library is excellent
The rise of inference-scaling “reasoning” models
Was the best currently available LLM trained in China for less than $6m?
The environmental impact got better
The environmental impact got much, much worse
- >-
Posted 31st December 2024 at 6:07 pm · Follow me on Mastodon or Twitter
or subscribe to my newsletter
More recent articles
LLM 0.22, the annotated release notes - 17th February 2025
Run LLMs on macOS using llm-mlx and Apple's MLX framework - 15th
February 2025
URL-addressable Pyodide Python environments - 13th February 2025
This is Things we learned about LLMs in 2024 by Simon Willison, posted
on 31st December 2024.
Part of series LLMs annual review
Stuff we figured out about AI in 2023 - Dec. 31, 2023, 11:59 p.m.
Things we learned about LLMs in 2024 - Dec. 31, 2024, 6:07 p.m.
google
347
ai
1100
openai
257
- source_sentence: When did the author first run a large language model on their laptop?
sentences:
- >-
24th: Notes on the new Claude analysis JavaScript code execution tool
27th: Run a prompt to generate and execute jq programs using llm-jq
29th: You can now run prompts against images, audio and video in your
terminal using LLM
30th: W̶e̶e̶k̶n̶o̶t̶e̶s̶ Monthnotes for October
November
4th: Claude 3.5 Haiku
7th: Project: VERDAD—tracking misinformation in radio broadcasts using
Gemini 1.5
12th: Qwen2.5-Coder-32B is an LLM that can code well that runs on my Mac
19th: Notes from Bing Chat—Our First Encounter With Manipulative AI
25th: Ask questions of SQLite databases and CSV/JSON files in your
terminal
December
4th: First impressions of the new Amazon Nova LLMs (via a new
llm-bedrock plugin)
7th: Prompts.js
- >-
260 input tokens, 92 output tokens. Cost approximately 0.0024 cents
(that’s less than a 400th of a cent).
This increase in efficiency and reduction in price is my single
favourite trend from 2024. I want the utility of LLMs at a fraction of
the energy cost and it looks like that’s what we’re getting.
Multimodal vision is common, audio and video are starting to emerge
My butterfly example above illustrates another key trend from 2024: the
rise of multi-modal LLMs.
A year ago the single most notable example of these was GPT-4 Vision,
released at OpenAI’s DevDay in November 2023. Google’s multi-modal
Gemini 1.0 was announced on December 7th 2023 so it also (just) makes it
into the 2023 window.
- >-
My personal laptop is a 64GB M2 MacBook Pro from 2023. It’s a powerful
machine, but it’s also nearly two years old now—and crucially it’s the
same laptop I’ve been using ever since I first ran an LLM on my computer
back in March 2023 (see Large language models are having their Stable
Diffusion moment).
That same laptop that could just about run a GPT-3-class model in March
last year has now run multiple GPT-4 class models! Some of my notes on
that:
- source_sentence: >-
What notable development in LLM technology occurred in the final quarter
of 2024?
sentences:
- >-
Now that those features are rolling out they’re pretty weak. As an LLM
power-user I know what these models are capable of, and Apple’s LLM
features offer a pale imitation of what a frontier LLM can do. Instead
we’re getting notification summaries that misrepresent news headlines
and writing assistant tools that I’ve not found useful at all. Genmoji
are kind of fun though.
The rise of inference-scaling “reasoning” models
The most interesting development in the final quarter of 2024 was the
introduction of a new shape of LLM, exemplified by OpenAI’s o1
models—initially released as o1-preview and o1-mini on September 12th.
- |-
The year of slop
Synthetic training data works great
LLMs somehow got even harder to use
Knowledge is incredibly unevenly distributed
LLMs need better criticism
Everything tagged “llms” on my blog in 2024
- >-
Prompt injection is a natural consequence of this gullibility. I’ve seen
precious little progress on tackling that problem in 2024, and we’ve
been talking about it since September 2022.
I’m beginning to see the most popular idea of “agents” as dependent on
AGI itself. A model that’s robust against gullibility is a very tall
order indeed.
Evals really matter
Anthropic’s Amanda Askell (responsible for much of the work behind
Claude’s Character):
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
- cosine_accuracy@1
- cosine_accuracy@3
- cosine_accuracy@5
- cosine_accuracy@10
- cosine_precision@1
- cosine_precision@3
- cosine_precision@5
- cosine_precision@10
- cosine_recall@1
- cosine_recall@3
- cosine_recall@5
- cosine_recall@10
- cosine_ndcg@10
- cosine_mrr@10
- cosine_map@100
model-index:
- name: SentenceTransformer based on Snowflake/snowflake-arctic-embed-l
results:
- task:
type: information-retrieval
name: Information Retrieval
dataset:
name: Unknown
type: unknown
metrics:
- type: cosine_accuracy@1
value: 0.8333333333333334
name: Cosine Accuracy@1
- type: cosine_accuracy@3
value: 1
name: Cosine Accuracy@3
- type: cosine_accuracy@5
value: 1
name: Cosine Accuracy@5
- type: cosine_accuracy@10
value: 1
name: Cosine Accuracy@10
- type: cosine_precision@1
value: 0.8333333333333334
name: Cosine Precision@1
- type: cosine_precision@3
value: 0.3333333333333333
name: Cosine Precision@3
- type: cosine_precision@5
value: 0.20000000000000004
name: Cosine Precision@5
- type: cosine_precision@10
value: 0.10000000000000002
name: Cosine Precision@10
- type: cosine_recall@1
value: 0.8333333333333334
name: Cosine Recall@1
- type: cosine_recall@3
value: 1
name: Cosine Recall@3
- type: cosine_recall@5
value: 1
name: Cosine Recall@5
- type: cosine_recall@10
value: 1
name: Cosine Recall@10
- type: cosine_ndcg@10
value: 0.9330328858630988
name: Cosine Ndcg@10
- type: cosine_mrr@10
value: 0.9097222222222222
name: Cosine Mrr@10
- type: cosine_map@100
value: 0.9097222222222222
name: Cosine Map@100
SentenceTransformer based on Snowflake/snowflake-arctic-embed-l
This is a sentence-transformers model finetuned from Snowflake/snowflake-arctic-embed-l. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
Model Details
Model Description
- Model Type: Sentence Transformer
- Base model: Snowflake/snowflake-arctic-embed-l
- Maximum Sequence Length: 512 tokens
- Output Dimensionality: 1024 dimensions
- Similarity Function: Cosine Similarity
Model Sources
- Documentation: Sentence Transformers Documentation
- Repository: Sentence Transformers on GitHub
- Hugging Face: Sentence Transformers on Hugging Face
Full Model Architecture
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
(1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Normalize()
)
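Because the final `Normalize()` module L2-normalizes every output vector, the cosine similarity this model is scored with reduces to a plain dot product. A minimal numpy sketch of that equivalence, with random vectors standing in for real model outputs:

```python
import numpy as np

# Stand-in for two 1024-dimensional embeddings as the model would produce
# them *before* the final Normalize() module.
rng = np.random.default_rng(0)
u, v = rng.normal(size=1024), rng.normal(size=1024)

# The Normalize() module divides each vector by its L2 norm.
u_n = u / np.linalg.norm(u)
v_n = v / np.linalg.norm(v)

# For unit vectors, cosine similarity is just the dot product.
cosine = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
assert abs(cosine - np.dot(u_n, v_n)) < 1e-9
```

This is why downstream code can safely compare embeddings with a matrix multiply instead of a full cosine computation.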
Usage
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference.

```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("dabraldeepti25/legal-ft-v0")
# Run inference
sentences = [
    'What notable development in LLM technology occurred in the final quarter of 2024?',
    'Now that those features are rolling out they’re pretty weak. As an LLM power-user I know what these models are capable of, and Apple’s LLM features offer a pale imitation of what a frontier LLM can do. Instead we’re getting notification summaries that misrepresent news headlines and writing assistant tools that I’ve not found useful at all. Genmoji are kind of fun though.\nThe rise of inference-scaling “reasoning” models\nThe most interesting development in the final quarter of 2024 was the introduction of a new shape of LLM, exemplified by OpenAI’s o1 models—initially released as o1-preview and o1-mini on September 12th.',
    'Prompt injection is a natural consequence of this gullibility. I’ve seen precious little progress on tackling that problem in 2024, and we’ve been talking about it since September 2022.\nI’m beginning to see the most popular idea of “agents” as dependent on AGI itself. A model that’s robust against gullibility is a very tall order indeed.\nEvals really matter\nAnthropic’s Amanda Askell (responsible for much of the work behind Claude’s Character):',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
```
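The similarity matrix above is all that is needed for retrieval: score a query row against the candidate rows and sort. The sketch below uses random, unit-normalized stand-in embeddings so it runs without downloading the model; in real usage the `embeddings` array would come from `model.encode(...)`.

```python
import numpy as np

# Stand-in embeddings (3 texts x 1024 dims) in place of model.encode() output.
# Rows are unit-normalized, matching the model's final Normalize() module.
rng = np.random.default_rng(42)
embeddings = rng.normal(size=(3, 1024))
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

# model.similarity() computes pairwise cosine similarity; for unit vectors
# that is simply the matrix of dot products.
similarities = embeddings @ embeddings.T

# Treat row 0 as the query and rows 1-2 as candidate passages:
# rank the candidates by their similarity to the query, best first.
ranking = np.argsort(-similarities[0, 1:]) + 1
print(ranking)  # candidate indices, best match first
```

The same pattern scales to a full corpus: encode all passages once, then each incoming query costs one matrix-vector product plus a sort.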
Evaluation
Metrics
Information Retrieval
- Evaluated with `InformationRetrievalEvaluator`
Metric | Value |
---|---|
cosine_accuracy@1 | 0.8333 |
cosine_accuracy@3 | 1.0 |
cosine_accuracy@5 | 1.0 |
cosine_accuracy@10 | 1.0 |
cosine_precision@1 | 0.8333 |
cosine_precision@3 | 0.3333 |
cosine_precision@5 | 0.2 |
cosine_precision@10 | 0.1 |
cosine_recall@1 | 0.8333 |
cosine_recall@3 | 1.0 |
cosine_recall@5 | 1.0 |
cosine_recall@10 | 1.0 |
cosine_ndcg@10 | 0.933 |
cosine_mrr@10 | 0.9097 |
cosine_map@100 | 0.9097 |
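The pattern in this table follows from each query having a single relevant passage: once that passage always appears in the top k (recall@k = 1.0), precision@k is exactly 1/k, which is why precision@5 is 0.2 and precision@10 is 0.1. The sketch below computes these cutoff metrics for one rank assignment that happens to reproduce every reported value (a reconstruction for illustration; the card does not publish per-query ranks):

```python
import math

# Hypothetical per-query ranks consistent with all reported metrics:
# 24 queries, one relevant passage each, retrieved at these 1-based positions.
ranks = [1] * 20 + [2] * 3 + [3]

def accuracy_at(k):   # fraction of queries with the relevant passage in top k
    return sum(r <= k for r in ranks) / len(ranks)

def precision_at(k):  # one relevant passage, so precision@k = (hit in top k) / k
    return sum((r <= k) / k for r in ranks) / len(ranks)

def mrr_at(k):        # mean reciprocal rank, truncated at k
    return sum(1 / r for r in ranks if r <= k) / len(ranks)

def ndcg_at(k):       # one relevant passage: DCG = 1/log2(rank + 1), IDCG = 1
    return sum(1 / math.log2(r + 1) for r in ranks if r <= k) / len(ranks)

print(round(accuracy_at(1), 4))    # 0.8333
print(round(precision_at(10), 4))  # 0.1
print(round(mrr_at(10), 4))        # 0.9097
print(round(ndcg_at(10), 4))       # 0.933
```

With one relevant document per query, MAP@100 collapses to MRR, which is why the two rows agree at 0.9097.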
Training Details
Training Dataset
Unnamed Dataset
- Size: 156 training samples
- Columns: `sentence_0` and `sentence_1`
- Approximate statistics based on the first 156 samples:

  | | sentence_0 | sentence_1 |
  |---|---|---|
  | type | string | string |
  | details | min: 13 tokens, mean: 20.06 tokens, max: 33 tokens | min: 43 tokens, mean: 130.5 tokens, max: 204 tokens |
- Samples:

  - sentence_0: What is the significance of Claude Artifacts in the context of LLMs and application development?

    sentence_1: We already knew LLMs were spookily good at writing code. If you prompt them right, it turns out they can build you a full interactive application using HTML, CSS and JavaScript (and tools like React if you wire up some extra supporting build mechanisms)—often in a single prompt. Anthropic kicked this idea into high gear when they released Claude Artifacts, a groundbreaking new feature that was initially slightly lost in the noise due to being described half way through their announcement of the incredible Claude 3.5 Sonnet. With Artifacts, Claude can write you an on-demand interactive application and then let you use it directly inside the Claude interface. Here’s my Extract URLs app, entirely generated by Claude:
  - sentence_0: How does Claude enable users to interact with applications generated by its capabilities?

    sentence_1: We already knew LLMs were spookily good at writing code. If you prompt them right, it turns out they can build you a full interactive application using HTML, CSS and JavaScript (and tools like React if you wire up some extra supporting build mechanisms)—often in a single prompt. Anthropic kicked this idea into high gear when they released Claude Artifacts, a groundbreaking new feature that was initially slightly lost in the noise due to being described half way through their announcement of the incredible Claude 3.5 Sonnet. With Artifacts, Claude can write you an on-demand interactive application and then let you use it directly inside the Claude interface. Here’s my Extract URLs app, entirely generated by Claude:
  - sentence_0: What are some of the new capabilities introduced in multi-modal models that enhance their functionality beyond text?

    sentence_1: I think people who complain that LLM improvement has slowed are often missing the enormous advances in these multi-modal models. Being able to run prompts against images (and audio and video) is a fascinating new way to apply these models. Voice and live camera mode are science fiction come to life. The audio and live video modes that have started to emerge deserve a special mention. The ability to talk to ChatGPT first arrived in September 2023, but it was mostly an illusion: OpenAI used their excellent Whisper speech-to-text model and a new text-to-speech model (creatively named tts-1) to enable conversations with the ChatGPT mobile apps, but the actual model just saw text.
- Loss: `MatryoshkaLoss` with these parameters:

  ```json
  {
      "loss": "MultipleNegativesRankingLoss",
      "matryoshka_dims": [768, 512, 256, 128, 64],
      "matryoshka_weights": [1, 1, 1, 1, 1],
      "n_dims_per_step": -1
  }
  ```
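`MatryoshkaLoss` wraps `MultipleNegativesRankingLoss` so that prefixes of each embedding at the listed lengths (768, 512, 256, 128, 64) are trained to work as embeddings in their own right. The payoff at inference time is that you can truncate to a prefix and re-normalize, and cosine similarity still behaves at the smaller size. A sketch of that truncation step, using random stand-in vectors in place of real model output:

```python
import numpy as np

matryoshka_dims = [768, 512, 256, 128, 64]

# Stand-in for a batch of full 1024-dim embeddings from the model.
rng = np.random.default_rng(0)
full = rng.normal(size=(4, 1024))

for dim in matryoshka_dims:
    # Keep only the first `dim` components, then re-normalize so cosine
    # similarity is once again a plain dot product at the reduced size.
    truncated = full[:, :dim]
    truncated = truncated / np.linalg.norm(truncated, axis=1, keepdims=True)
    assert truncated.shape == (4, dim)
```

Recent sentence-transformers releases expose a `truncate_dim` argument when loading a model for the same purpose; the explicit re-normalization above is shown to make the mechanics clear (cosine-based similarity functions re-normalize internally anyway).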
Training Hyperparameters
Non-Default Hyperparameters
- `eval_strategy`: steps
- `per_device_train_batch_size`: 10
- `per_device_eval_batch_size`: 10
- `num_train_epochs`: 10
- `multi_dataset_batch_sampler`: round_robin
All Hyperparameters
- `overwrite_output_dir`: False
- `do_predict`: False
- `eval_strategy`: steps
- `prediction_loss_only`: True
- `per_device_train_batch_size`: 10
- `per_device_eval_batch_size`: 10
- `per_gpu_train_batch_size`: None
- `per_gpu_eval_batch_size`: None
- `gradient_accumulation_steps`: 1
- `eval_accumulation_steps`: None
- `torch_empty_cache_steps`: None
- `learning_rate`: 5e-05
- `weight_decay`: 0.0
- `adam_beta1`: 0.9
- `adam_beta2`: 0.999
- `adam_epsilon`: 1e-08
- `max_grad_norm`: 1
- `num_train_epochs`: 10
- `max_steps`: -1
- `lr_scheduler_type`: linear
- `lr_scheduler_kwargs`: {}
- `warmup_ratio`: 0.0
- `warmup_steps`: 0
- `log_level`: passive
- `log_level_replica`: warning
- `log_on_each_node`: True
- `logging_nan_inf_filter`: True
- `save_safetensors`: True
- `save_on_each_node`: False
- `save_only_model`: False
- `restore_callback_states_from_checkpoint`: False
- `no_cuda`: False
- `use_cpu`: False
- `use_mps_device`: False
- `seed`: 42
- `data_seed`: None
- `jit_mode_eval`: False
- `use_ipex`: False
- `bf16`: False
- `fp16`: False
- `fp16_opt_level`: O1
- `half_precision_backend`: auto
- `bf16_full_eval`: False
- `fp16_full_eval`: False
- `tf32`: None
- `local_rank`: 0
- `ddp_backend`: None
- `tpu_num_cores`: None
- `tpu_metrics_debug`: False
- `debug`: []
- `dataloader_drop_last`: False
- `dataloader_num_workers`: 0
- `dataloader_prefetch_factor`: None
- `past_index`: -1
- `disable_tqdm`: False
- `remove_unused_columns`: True
- `label_names`: None
- `load_best_model_at_end`: False
- `ignore_data_skip`: False
- `fsdp`: []
- `fsdp_min_num_params`: 0
- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- `fsdp_transformer_layer_cls_to_wrap`: None
- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- `deepspeed`: None
- `label_smoothing_factor`: 0.0
- `optim`: adamw_torch
- `optim_args`: None
- `adafactor`: False
- `group_by_length`: False
- `length_column_name`: length
- `ddp_find_unused_parameters`: None
- `ddp_bucket_cap_mb`: None
- `ddp_broadcast_buffers`: False
- `dataloader_pin_memory`: True
- `dataloader_persistent_workers`: False
- `skip_memory_metrics`: True
- `use_legacy_prediction_loop`: False
- `push_to_hub`: False
- `resume_from_checkpoint`: None
- `hub_model_id`: None
- `hub_strategy`: every_save
- `hub_private_repo`: None
- `hub_always_push`: False
- `gradient_checkpointing`: False
- `gradient_checkpointing_kwargs`: None
- `include_inputs_for_metrics`: False
- `include_for_metrics`: []
- `eval_do_concat_batches`: True
- `fp16_backend`: auto
- `push_to_hub_model_id`: None
- `push_to_hub_organization`: None
- `mp_parameters`:
- `auto_find_batch_size`: False
- `full_determinism`: False
- `torchdynamo`: None
- `ray_scope`: last
- `ddp_timeout`: 1800
- `torch_compile`: False
- `torch_compile_backend`: None
- `torch_compile_mode`: None
- `dispatch_batches`: None
- `split_batches`: None
- `include_tokens_per_second`: False
- `include_num_input_tokens_seen`: False
- `neftune_noise_alpha`: None
- `optim_target_modules`: None
- `batch_eval_metrics`: False
- `eval_on_start`: False
- `use_liger_kernel`: False
- `eval_use_gather_object`: False
- `average_tokens_across_devices`: False
- `prompts`: None
- `batch_sampler`: batch_sampler
- `multi_dataset_batch_sampler`: round_robin
Training Logs
Epoch | Step | cosine_ndcg@10 |
---|---|---|
1.0 | 16 | 0.9039 |
2.0 | 32 | 0.9010 |
3.0 | 48 | 0.9218 |
3.125 | 50 | 0.9218 |
4.0 | 64 | 0.9218 |
5.0 | 80 | 0.9247 |
6.0 | 96 | 0.9330 |
6.25 | 100 | 0.9330 |
7.0 | 112 | 0.9330 |
8.0 | 128 | 0.9330 |
9.0 | 144 | 0.9330 |
9.375 | 150 | 0.9330 |
10.0 | 160 | 0.9330 |
Framework Versions
- Python: 3.11.11
- Sentence Transformers: 3.4.1
- Transformers: 4.48.3
- PyTorch: 2.5.1+cu124
- Accelerate: 1.3.0
- Datasets: 3.3.1
- Tokenizers: 0.21.0
Citation
BibTeX
Sentence Transformers
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
MatryoshkaLoss
@misc{kusupati2024matryoshka,
title={Matryoshka Representation Learning},
author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
year={2024},
eprint={2205.13147},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
MultipleNegativesRankingLoss
@misc{henderson2017efficient,
title={Efficient Natural Language Response Suggestion for Smart Reply},
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
year={2017},
eprint={1705.00652},
archivePrefix={arXiv},
primaryClass={cs.CL}
}