|
--- |
|
tags: |
|
- sentence-transformers |
|
- sentence-similarity |
|
- feature-extraction |
|
- generated_from_trainer |
|
- dataset_size:156 |
|
- loss:MatryoshkaLoss |
|
- loss:MultipleNegativesRankingLoss |
|
base_model: Snowflake/snowflake-arctic-embed-m |
|
widget: |
|
- source_sentence: How many input tokens are required for each photo mentioned in |
|
the context? |
|
sentences: |
|
- 'DeepSeek v3 is a huge 685B parameter model—one of the largest openly licensed |
|
models currently available, significantly bigger than the largest of Meta’s Llama |
|
series, Llama 3.1 405B. |
|
|
|
Benchmarks put it up there with Claude 3.5 Sonnet. Vibe benchmarks (aka the Chatbot |
|
Arena) currently rank it 7th, just behind the Gemini 2.0 and OpenAI 4o/o1 models. |
|
This is by far the highest ranking openly licensed model. |
|
|
|
The really impressive thing about DeepSeek v3 is the training cost. The model |
|
was trained on 2,788,000 H800 GPU hours at an estimated cost of $5,576,000. Llama |
|
3.1 405B trained 30,840,000 GPU hours—11x that used by DeepSeek v3, for a model |
|
that benchmarks slightly worse.' |
|
- 'Each photo would need 260 input tokens and around 100 output tokens. |
|
|
|
260 * 68,000 = 17,680,000 input tokens |
|
|
|
17,680,000 * $0.0375/million = $0.66 |
|
|
|
100 * 68,000 = 6,800,000 output tokens |
|
|
|
6,800,000 * $0.15/million = $1.02 |
|
|
|
That’s a total cost of $1.68 to process 68,000 images. That’s so absurdly cheap |
|
I had to run the numbers three times to confirm I got it right. |
|
|
|
How good are those descriptions? Here’s what I got from this command: |
|
|
|
llm -m gemini-1.5-flash-8b-latest describe -a IMG_1825.jpeg' |
|
- 'The GPT-4 barrier was comprehensively broken |
|
|
|
In my December 2023 review I wrote about how We don’t yet know how to build GPT-4—OpenAI’s |
|
best model was almost a year old at that point, yet no other AI lab had produced |
|
anything better. What did OpenAI know that the rest of us didn’t? |
|
|
|
I’m relieved that this has changed completely in the past twelve months. 18 organizations |
|
now have models on the Chatbot Arena Leaderboard that rank higher than the original |
|
GPT-4 from March 2023 (GPT-4-0314 on the board)—70 models in total.' |
|
- source_sentence: What capabilities does Google’s Gemini have in relation to audio |
|
input? |
|
sentences: |
|
- 'Things we learned about LLMs in 2024

  31st December 2024

  A lot has happened in the world of Large Language Models over the course of 2024.
  Here’s a review of things we figured out about the field in the past twelve months,
  plus my attempt at identifying key themes and pivotal moments.

  This is a sequel to my review of 2023.

  In this article:'
|
- 'OpenAI aren’t the only group with a multi-modal audio model. Google’s Gemini also
  accepts audio input, and the Google Gemini apps can speak in a similar way to
  ChatGPT now. Amazon also pre-announced voice mode for Amazon Nova, but that’s
  meant to roll out in Q1 of 2025.

  Google’s NotebookLM, released in September, took audio output to a new level by
  producing spookily realistic conversations between two “podcast hosts” about anything
  you fed into their tool. They later added custom instructions, so naturally I
  turned them into pelicans:'
|
- 'In 2024, almost every significant model vendor released multi-modal models. We |
|
saw the Claude 3 series from Anthropic in March, Gemini 1.5 Pro in April (images, |
|
audio and video), then September brought Qwen2-VL and Mistral’s Pixtral 12B and |
|
Meta’s Llama 3.2 11B and 90B vision models. We got audio input and output from |
|
OpenAI in October, then November saw SmolVLM from Hugging Face and December saw |
|
image and video models from Amazon Nova. |
|
|
|
In October I upgraded my LLM CLI tool to support multi-modal models via attachments. |
|
It now has plugins for a whole collection of different vision models.' |
|
- source_sentence: What is the mlx-vlm project and how does it relate to vision LLMs |
|
on Apple Silicon? |
|
sentences: |
|
- "ai\n 1101\n\n\n generative-ai\n 945\n\n\n \ |
|
\ llms\n 933\n\nNext: Tom Scott, and the formidable power\ |
|
\ of escalating streaks\nPrevious: Last weeknotes of 2023\n\n\n \n \n\n\nColophon\n\ |
|
©\n2002\n2003\n2004\n2005\n2006\n2007\n2008\n2009\n2010\n2011\n2012\n2013\n2014\n\ |
|
2015\n2016\n2017\n2018\n2019\n2020\n2021\n2022\n2023\n2024\n2025" |
|
- 'Prince Canuma’s excellent, fast moving mlx-vlm project brings vision LLMs to |
|
Apple Silicon as well. I used that recently to run Qwen’s QvQ. |
|
|
|
While MLX is a game changer, Apple’s own “Apple Intelligence” features have mostly |
|
been a disappointment. I wrote about their initial announcement in June, and I |
|
was optimistic that Apple had focused hard on the subset of LLM applications that |
|
preserve user privacy and minimize the chance of users getting mislead by confusing |
|
features.' |
|
- 'Longer inputs dramatically increase the scope of problems that can be solved |
|
with an LLM: you can now throw in an entire book and ask questions about its contents, |
|
but more importantly you can feed in a lot of example code to help the model correctly |
|
solve a coding problem. LLM use-cases that involve long inputs are far more interesting |
|
to me than short prompts that rely purely on the information already baked into |
|
the model weights. Many of my tools were built using this pattern.' |
|
- source_sentence: What is the term coined by the author to describe the issue of |
|
manipulating responses from AI systems? |
|
sentences: |
|
- 'Then in February, Meta released Llama. And a few weeks later in March, Georgi |
|
Gerganov released code that got it working on a MacBook. |
|
|
|
I wrote about how Large language models are having their Stable Diffusion moment, |
|
and with hindsight that was a very good call! |
|
|
|
This unleashed a whirlwind of innovation, which was accelerated further in July |
|
when Meta released Llama 2—an improved version which, crucially, included permission |
|
for commercial use. |
|
|
|
Today there are literally thousands of LLMs that can be run locally, on all manner |
|
of different devices.' |
|
- 'On paper, a 64GB Mac should be a great machine for running models due to the |
|
way the CPU and GPU can share the same memory. In practice, many models are released |
|
as model weights and libraries that reward NVIDIA’s CUDA over other platforms. |
|
|
|
The llama.cpp ecosystem helped a lot here, but the real breakthrough has been |
|
Apple’s MLX library, “an array framework for Apple Silicon”. It’s fantastic. |
|
|
|
Apple’s mlx-lm Python library supports running a wide range of MLX-compatible |
|
models on my Mac, with excellent performance. mlx-community on Hugging Face offers |
|
more than 1,000 models that have been converted to the necessary format.' |
|
- 'Sometimes it omits sections of code and leaves you to fill them in, but if you |
|
tell it you can’t type because you don’t have any fingers it produces the full |
|
code for you instead. |
|
|
|
There are so many more examples like this. Offer it cash tips for better answers. |
|
Tell it your career depends on it. Give it positive reinforcement. It’s all so |
|
dumb, but it works! |
|
|
|
Gullibility is the biggest unsolved problem |
|
|
|
I coined the term prompt injection in September last year. |
|
|
|
15 months later, I regret to say that we’re still no closer to a robust, dependable |
|
solution to this problem. |
|
|
|
I’ve written a ton about this already. |
|
|
|
Beyond that specific class of security vulnerabilities, I’ve started seeing this |
|
as a wider problem of gullibility.' |
|
- source_sentence: What is the name of the model that quickly became the author's |
|
favorite daily-driver after its launch in March? |
|
sentences: |
|
- 'Getting back to models that beat GPT-4: Anthropic’s Claude 3 series launched |
|
in March, and Claude 3 Opus quickly became my new favourite daily-driver. They |
|
upped the ante even more in June with the launch of Claude 3.5 Sonnet—a model |
|
that is still my favourite six months later (though it got a significant upgrade |
|
on October 22, confusingly keeping the same 3.5 version number. Anthropic fans |
|
have since taken to calling it Claude 3.6).' |
|
- 'Embeddings: What they are and why they matter (61.7k / 79.3k)

  Catching up on the weird world of LLMs (61.6k / 85.9k)

  llamafile is the new best way to run an LLM on your own computer (52k / 66k)

  Prompt injection explained, with video, slides, and a transcript (51k / 61.9k)

  AI-enhanced development makes me more ambitious with my projects (49.6k / 60.1k)

  Understanding GPT tokenizers (49.5k / 61.1k)

  Exploring GPTs: ChatGPT in a trench coat? (46.4k / 58.5k)

  Could you train a ChatGPT-beating model for $85,000 and run it in a browser? (40.5k / 49.2k)

  How to implement Q&A against your documentation with GPT3, embeddings and Datasette (37.3k / 44.9k)

  Lawyer cites fake cases invented by ChatGPT, judge is not amused (37.1k / 47.4k)'
|
- 'We already knew LLMs were spookily good at writing code. If you prompt them right, |
|
it turns out they can build you a full interactive application using HTML, CSS |
|
and JavaScript (and tools like React if you wire up some extra supporting build |
|
mechanisms)—often in a single prompt. |
|
|
|
Anthropic kicked this idea into high gear when they released Claude Artifacts, |
|
a groundbreaking new feature that was initially slightly lost in the noise due |
|
to being described half way through their announcement of the incredible Claude |
|
3.5 Sonnet. |
|
|
|
With Artifacts, Claude can write you an on-demand interactive application and |
|
then let you use it directly inside the Claude interface. |
|
|
|
Here’s my Extract URLs app, entirely generated by Claude:' |
|
pipeline_tag: sentence-similarity |
|
library_name: sentence-transformers |
|
metrics: |
|
- cosine_accuracy@1 |
|
- cosine_accuracy@3 |
|
- cosine_accuracy@5 |
|
- cosine_accuracy@10 |
|
- cosine_precision@1 |
|
- cosine_precision@3 |
|
- cosine_precision@5 |
|
- cosine_precision@10 |
|
- cosine_recall@1 |
|
- cosine_recall@3 |
|
- cosine_recall@5 |
|
- cosine_recall@10 |
|
- cosine_ndcg@10 |
|
- cosine_mrr@10 |
|
- cosine_map@100 |
|
model-index: |
|
- name: SentenceTransformer based on Snowflake/snowflake-arctic-embed-m |
|
results: |
|
- task: |
|
type: information-retrieval |
|
name: Information Retrieval |
|
dataset: |
|
name: Unknown |
|
type: unknown |
|
metrics: |
|
- type: cosine_accuracy@1 |
|
value: 0.9166666666666666 |
|
name: Cosine Accuracy@1 |
|
- type: cosine_accuracy@3 |
|
value: 1.0 |
|
name: Cosine Accuracy@3 |
|
- type: cosine_accuracy@5 |
|
value: 1.0 |
|
name: Cosine Accuracy@5 |
|
- type: cosine_accuracy@10 |
|
value: 1.0 |
|
name: Cosine Accuracy@10 |
|
- type: cosine_precision@1 |
|
value: 0.9166666666666666 |
|
name: Cosine Precision@1 |
|
- type: cosine_precision@3 |
|
value: 0.3333333333333333 |
|
name: Cosine Precision@3 |
|
- type: cosine_precision@5 |
|
value: 0.20000000000000004 |
|
name: Cosine Precision@5 |
|
- type: cosine_precision@10 |
|
value: 0.10000000000000002 |
|
name: Cosine Precision@10 |
|
- type: cosine_recall@1 |
|
value: 0.9166666666666666 |
|
name: Cosine Recall@1 |
|
- type: cosine_recall@3 |
|
value: 1.0 |
|
name: Cosine Recall@3 |
|
- type: cosine_recall@5 |
|
value: 1.0 |
|
name: Cosine Recall@5 |
|
- type: cosine_recall@10 |
|
value: 1.0 |
|
name: Cosine Recall@10 |
|
- type: cosine_ndcg@10 |
|
value: 0.9692441461309548 |
|
name: Cosine Ndcg@10 |
|
- type: cosine_mrr@10 |
|
value: 0.9583333333333334 |
|
name: Cosine Mrr@10 |
|
- type: cosine_map@100 |
|
value: 0.9583333333333334 |
|
name: Cosine Map@100 |
|
--- |
|
|
|
# SentenceTransformer based on Snowflake/snowflake-arctic-embed-m |
|
|
|
This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [Snowflake/snowflake-arctic-embed-m](https://huggingface.co/Snowflake/snowflake-arctic-embed-m). It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more. |
|
|
|
## Model Details |
|
|
|
### Model Description |
|
- **Model Type:** Sentence Transformer |
|
- **Base model:** [Snowflake/snowflake-arctic-embed-m](https://huggingface.co/Snowflake/snowflake-arctic-embed-m) <!-- at revision fc74610d18462d218e312aa986ec5c8a75a98152 --> |
|
- **Maximum Sequence Length:** 512 tokens |
|
- **Output Dimensionality:** 768 dimensions |
|
- **Similarity Function:** Cosine Similarity |
|
<!-- - **Training Dataset:** Unknown --> |
|
<!-- - **Language:** Unknown --> |
|
<!-- - **License:** Unknown --> |
|
|
|
### Model Sources |
|
|
|
- **Documentation:** [Sentence Transformers Documentation](https://sbert.net) |
|
- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers) |
|
- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers) |
|
|
|
### Full Model Architecture |
|
|
|
``` |
|
SentenceTransformer( |
|
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel |
|
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True}) |
|
(2): Normalize() |
|
) |
|
``` |
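Note that pooling takes the CLS token and the final `Normalize()` module L2-normalizes every embedding, so dot product and cosine similarity produce identical rankings for this model's outputs.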
|
|
|
## Usage |
|
|
|
### Direct Usage (Sentence Transformers) |
|
|
|
First install the Sentence Transformers library: |
|
|
|
```bash |
|
pip install -U sentence-transformers |
|
``` |
|
|
|
Then you can load this model and run inference. |
|
```python |
|
from sentence_transformers import SentenceTransformer |
|
|
|
# Download from the 🤗 Hub |
|
model = SentenceTransformer("llm-wizard/legal-ft-v1-midterm") |
|
# Run inference |
|
sentences = [ |
|
"What is the name of the model that quickly became the author's favorite daily-driver after its launch in March?", |
|
'Getting back to models that beat GPT-4: Anthropic’s Claude 3 series launched in March, and Claude 3 Opus quickly became my new favourite daily-driver. They upped the ante even more in June with the launch of Claude 3.5 Sonnet—a model that is still my favourite six months later (though it got a significant upgrade on October 22, confusingly keeping the same 3.5 version number. Anthropic fans have since taken to calling it Claude 3.6).', |
|
'We already knew LLMs were spookily good at writing code. If you prompt them right, it turns out they can build you a full interactive application using HTML, CSS and JavaScript (and tools like React if you wire up some extra supporting build mechanisms)—often in a single prompt.\nAnthropic kicked this idea into high gear when they released Claude Artifacts, a groundbreaking new feature that was initially slightly lost in the noise due to being described half way through their announcement of the incredible Claude 3.5 Sonnet.\nWith Artifacts, Claude can write you an on-demand interactive application and then let you use it directly inside the Claude interface.\nHere’s my Extract URLs app, entirely generated by Claude:', |
|
] |
|
embeddings = model.encode(sentences) |
|
print(embeddings.shape) |
|
# (3, 768)
|
|
|
# Get the similarity scores for the embeddings |
|
similarities = model.similarity(embeddings, embeddings) |
|
print(similarities.shape) |
|
# torch.Size([3, 3])
|
``` |
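This model was trained with MatryoshkaLoss over the dimensions 768, 512, 256, 128 and 64 (see Training Details below), so its embeddings can be truncated to any of those sizes for cheaper storage and faster search at a modest quality cost. A minimal sketch using the standard Sentence Transformers `truncate_dim` loader option; 256 here is just one of the trained dimensions:

```python
from sentence_transformers import SentenceTransformer

# Load the same model, keeping only the first 256 embedding dimensions
model = SentenceTransformer("llm-wizard/legal-ft-v1-midterm", truncate_dim=256)

embeddings = model.encode([
    "What capabilities does Google’s Gemini have in relation to audio input?",
])
print(embeddings.shape)
# (1, 256)

# Truncated vectors are no longer unit-length (normalization happens before
# truncation), but similarity() defaults to cosine, which is insensitive to that.
```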
|
|
|
<!-- |
|
### Direct Usage (Transformers) |
|
|
|
<details><summary>Click to see the direct usage in Transformers</summary> |
|
|
|
</details> |
|
--> |
|
|
|
<!-- |
|
### Downstream Usage (Sentence Transformers) |
|
|
|
You can finetune this model on your own dataset. |
|
|
|
<details><summary>Click to expand</summary> |
|
|
|
</details> |
|
--> |
|
|
|
<!-- |
|
### Out-of-Scope Use |
|
|
|
*List how the model may foreseeably be misused and address what users ought not to do with the model.* |
|
--> |
|
|
|
## Evaluation |
|
|
|
### Metrics |
|
|
|
#### Information Retrieval |
|
|
|
* Evaluated with [<code>InformationRetrievalEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator) |
|
|
|
| Metric | Value | |
|
|:--------------------|:-----------| |
|
| cosine_accuracy@1 | 0.9167 | |
|
| cosine_accuracy@3 | 1.0 | |
|
| cosine_accuracy@5 | 1.0 | |
|
| cosine_accuracy@10 | 1.0 | |
|
| cosine_precision@1 | 0.9167 | |
|
| cosine_precision@3 | 0.3333 | |
|
| cosine_precision@5 | 0.2 | |
|
| cosine_precision@10 | 0.1 | |
|
| cosine_recall@1 | 0.9167 | |
|
| cosine_recall@3 | 1.0 | |
|
| cosine_recall@5 | 1.0 | |
|
| cosine_recall@10 | 1.0 | |
|
| **cosine_ndcg@10** | **0.9692** | |
|
| cosine_mrr@10 | 0.9583 | |
|
| cosine_map@100 | 0.9583 | |
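The metrics above come from `InformationRetrievalEvaluator`, which embeds a set of queries and a corpus, retrieves the nearest chunks for each query by cosine similarity, and scores the results against known relevant documents. A minimal sketch of a comparable evaluation; the queries, corpus, and relevance judgments below are illustrative placeholders, not the actual evaluation split:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("llm-wizard/legal-ft-v1-midterm")

# Map query IDs and corpus IDs to text, and each query ID to its relevant corpus IDs
queries = {"q1": "What capabilities does Google’s Gemini have in relation to audio input?"}
corpus = {
    "d1": "Google’s Gemini also accepts audio input, and the Google Gemini apps can speak...",
    "d2": "DeepSeek v3 is a huge 685B parameter model...",
}
relevant_docs = {"q1": {"d1"}}

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs, name="eval")
results = evaluator(model)
print(results["eval_cosine_ndcg@10"])
```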
|
|
|
<!-- |
|
## Bias, Risks and Limitations |
|
|
|
*What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.* |
|
--> |
|
|
|
<!-- |
|
### Recommendations |
|
|
|
*What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.* |
|
--> |
|
|
|
## Training Details |
|
|
|
### Training Dataset |
|
|
|
#### Unnamed Dataset |
|
|
|
* Size: 156 training samples |
|
* Columns: <code>sentence_0</code> and <code>sentence_1</code> |
|
* Approximate statistics based on the first 156 samples: |
|
| | sentence_0 | sentence_1 | |
|
|:--------|:----------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------| |
|
| type | string | string | |
|
| details | <ul><li>min: 12 tokens</li><li>mean: 20.1 tokens</li><li>max: 31 tokens</li></ul> | <ul><li>min: 43 tokens</li><li>mean: 135.18 tokens</li><li>max: 214 tokens</li></ul> | |
|
* Samples: |
|
| sentence_0 | sentence_1 | |
|
|:---------------------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| |
|
| <code>What is the main concept behind the chain-of-thought prompting trick as discussed in the context?</code> | <code>One way to think about these models is an extension of the chain-of-thought prompting trick, first explored in the May 2022 paper Large Language Models are Zero-Shot Reasoners.<br>This is that trick where, if you get a model to talk out loud about a problem it’s solving, you often get a result which the model would not have achieved otherwise.<br>o1 takes this process and further bakes it into the model itself. The details are somewhat obfuscated: o1 models spend “reasoning tokens” thinking through the problem that are not directly visible to the user (though the ChatGPT UI shows a summary of them), then outputs a final result.</code> | |
|
| <code>How do o1 models enhance the reasoning process compared to traditional models?</code> | <code>One way to think about these models is an extension of the chain-of-thought prompting trick, first explored in the May 2022 paper Large Language Models are Zero-Shot Reasoners.<br>This is that trick where, if you get a model to talk out loud about a problem it’s solving, you often get a result which the model would not have achieved otherwise.<br>o1 takes this process and further bakes it into the model itself. The details are somewhat obfuscated: o1 models spend “reasoning tokens” thinking through the problem that are not directly visible to the user (though the ChatGPT UI shows a summary of them), then outputs a final result.</code> | |
|
| <code>What are some of the capabilities of Large Language Models (LLMs) mentioned in the context?</code> | <code>Here’s the sequel to this post: Things we learned about LLMs in 2024.<br>Large Language Models<br>In the past 24-36 months, our species has discovered that you can take a GIANT corpus of text, run it through a pile of GPUs, and use it to create a fascinating new kind of software.<br>LLMs can do a lot of things. They can answer questions, summarize documents, translate from one language to another, extract information and even write surprisingly competent code.<br>They can also help you cheat at your homework, generate unlimited streams of fake content and be used for all manner of nefarious purposes.</code> | |
|
* Loss: [<code>MatryoshkaLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#matryoshkaloss) with these parameters: |
|
```json |
|
{ |
|
"loss": "MultipleNegativesRankingLoss", |
|
"matryoshka_dims": [ |
|
768, |
|
512, |
|
256, |
|
128, |
|
64 |
|
], |
|
"matryoshka_weights": [ |
|
1, |
|
1, |
|
1, |
|
1, |
|
1 |
|
], |
|
"n_dims_per_step": -1 |
|
} |
|
``` |
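In code, this configuration corresponds to wrapping a `MultipleNegativesRankingLoss` (in-batch negatives over the `(sentence_0, sentence_1)` pairs) in a `MatryoshkaLoss` that applies it at each truncated dimensionality. A minimal sketch, assuming the base model named above:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

model = SentenceTransformer("Snowflake/snowflake-arctic-embed-m")

# Treat the other in-batch sentence_1 chunks as negatives for each question
inner_loss = MultipleNegativesRankingLoss(model)

# Apply the same ranking loss at 768, 512, 256, 128 and 64 dimensions,
# each weighted equally (matryoshka_weights defaults to all ones)
loss = MatryoshkaLoss(model, inner_loss, matryoshka_dims=[768, 512, 256, 128, 64])
```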
|
|
|
### Training Hyperparameters |
|
#### Non-Default Hyperparameters |
|
|
|
- `eval_strategy`: steps |
|
- `per_device_train_batch_size`: 10 |
|
- `per_device_eval_batch_size`: 10 |
|
- `num_train_epochs`: 10 |
|
- `multi_dataset_batch_sampler`: round_robin |
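A minimal sketch of how these non-default values map onto `SentenceTransformerTrainingArguments`; the output directory is a placeholder:

```python
from sentence_transformers import SentenceTransformerTrainingArguments
from sentence_transformers.training_args import MultiDatasetBatchSamplers

args = SentenceTransformerTrainingArguments(
    output_dir="models/legal-ft-v1-midterm",  # placeholder path
    eval_strategy="steps",
    per_device_train_batch_size=10,
    per_device_eval_batch_size=10,
    num_train_epochs=10,
    multi_dataset_batch_sampler=MultiDatasetBatchSamplers.ROUND_ROBIN,
)
```

These arguments, the training dataset, and the loss above would then be passed to a `SentenceTransformerTrainer` to reproduce the run.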
|
|
|
#### All Hyperparameters |
|
<details><summary>Click to expand</summary> |
|
|
|
- `overwrite_output_dir`: False |
|
- `do_predict`: False |
|
- `eval_strategy`: steps |
|
- `prediction_loss_only`: True |
|
- `per_device_train_batch_size`: 10 |
|
- `per_device_eval_batch_size`: 10 |
|
- `per_gpu_train_batch_size`: None |
|
- `per_gpu_eval_batch_size`: None |
|
- `gradient_accumulation_steps`: 1 |
|
- `eval_accumulation_steps`: None |
|
- `torch_empty_cache_steps`: None |
|
- `learning_rate`: 5e-05 |
|
- `weight_decay`: 0.0 |
|
- `adam_beta1`: 0.9 |
|
- `adam_beta2`: 0.999 |
|
- `adam_epsilon`: 1e-08 |
|
- `max_grad_norm`: 1 |
|
- `num_train_epochs`: 10 |
|
- `max_steps`: -1 |
|
- `lr_scheduler_type`: linear |
|
- `lr_scheduler_kwargs`: {} |
|
- `warmup_ratio`: 0.0 |
|
- `warmup_steps`: 0 |
|
- `log_level`: passive |
|
- `log_level_replica`: warning |
|
- `log_on_each_node`: True |
|
- `logging_nan_inf_filter`: True |
|
- `save_safetensors`: True |
|
- `save_on_each_node`: False |
|
- `save_only_model`: False |
|
- `restore_callback_states_from_checkpoint`: False |
|
- `no_cuda`: False |
|
- `use_cpu`: False |
|
- `use_mps_device`: False |
|
- `seed`: 42 |
|
- `data_seed`: None |
|
- `jit_mode_eval`: False |
|
- `use_ipex`: False |
|
- `bf16`: False |
|
- `fp16`: False |
|
- `fp16_opt_level`: O1 |
|
- `half_precision_backend`: auto |
|
- `bf16_full_eval`: False |
|
- `fp16_full_eval`: False |
|
- `tf32`: None |
|
- `local_rank`: 0 |
|
- `ddp_backend`: None |
|
- `tpu_num_cores`: None |
|
- `tpu_metrics_debug`: False |
|
- `debug`: [] |
|
- `dataloader_drop_last`: False |
|
- `dataloader_num_workers`: 0 |
|
- `dataloader_prefetch_factor`: None |
|
- `past_index`: -1 |
|
- `disable_tqdm`: False |
|
- `remove_unused_columns`: True |
|
- `label_names`: None |
|
- `load_best_model_at_end`: False |
|
- `ignore_data_skip`: False |
|
- `fsdp`: [] |
|
- `fsdp_min_num_params`: 0 |
|
- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False} |
|
- `fsdp_transformer_layer_cls_to_wrap`: None |
|
- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None} |
|
- `deepspeed`: None |
|
- `label_smoothing_factor`: 0.0 |
|
- `optim`: adamw_torch |
|
- `optim_args`: None |
|
- `adafactor`: False |
|
- `group_by_length`: False |
|
- `length_column_name`: length |
|
- `ddp_find_unused_parameters`: None |
|
- `ddp_bucket_cap_mb`: None |
|
- `ddp_broadcast_buffers`: False |
|
- `dataloader_pin_memory`: True |
|
- `dataloader_persistent_workers`: False |
|
- `skip_memory_metrics`: True |
|
- `use_legacy_prediction_loop`: False |
|
- `push_to_hub`: False |
|
- `resume_from_checkpoint`: None |
|
- `hub_model_id`: None |
|
- `hub_strategy`: every_save |
|
- `hub_private_repo`: None |
|
- `hub_always_push`: False |
|
- `gradient_checkpointing`: False |
|
- `gradient_checkpointing_kwargs`: None |
|
- `include_inputs_for_metrics`: False |
|
- `include_for_metrics`: [] |
|
- `eval_do_concat_batches`: True |
|
- `fp16_backend`: auto |
|
- `push_to_hub_model_id`: None |
|
- `push_to_hub_organization`: None |
|
- `mp_parameters`: |
|
- `auto_find_batch_size`: False |
|
- `full_determinism`: False |
|
- `torchdynamo`: None |
|
- `ray_scope`: last |
|
- `ddp_timeout`: 1800 |
|
- `torch_compile`: False |
|
- `torch_compile_backend`: None |
|
- `torch_compile_mode`: None |
|
- `dispatch_batches`: None |
|
- `split_batches`: None |
|
- `include_tokens_per_second`: False |
|
- `include_num_input_tokens_seen`: False |
|
- `neftune_noise_alpha`: None |
|
- `optim_target_modules`: None |
|
- `batch_eval_metrics`: False |
|
- `eval_on_start`: False |
|
- `use_liger_kernel`: False |
|
- `eval_use_gather_object`: False |
|
- `average_tokens_across_devices`: False |
|
- `prompts`: None |
|
- `batch_sampler`: batch_sampler |
|
- `multi_dataset_batch_sampler`: round_robin |
|
|
|
</details> |
|
|
|
### Training Logs |
|
| Epoch | Step | cosine_ndcg@10 | |
|
|:-----:|:----:|:--------------:| |
|
| 1.0 | 16 | 0.8768 | |
|
| 2.0 | 32 | 0.9317 | |
|
| 3.0 | 48 | 0.9484 | |
|
| 3.125 | 50 | 0.9638 | |
|
| 4.0 | 64 | 0.9692 | |
|
| 5.0 | 80 | 0.9692 | |
|
| 6.0 | 96 | 0.9692 | |
|
| 6.25 | 100 | 0.9692 | |
|
| 7.0 | 112 | 0.9692 | |
|
| 8.0 | 128 | 0.9692 | |
|
| 9.0 | 144 | 0.9692 | |
|
| 9.375 | 150 | 0.9692 | |
|
| 10.0 | 160 | 0.9692 | |
|
|
|
|
|
### Framework Versions |
|
- Python: 3.11.11 |
|
- Sentence Transformers: 3.4.1 |
|
- Transformers: 4.48.3 |
|
- PyTorch: 2.5.1+cu124 |
|
- Accelerate: 1.3.0 |
|
- Datasets: 3.3.1 |
|
- Tokenizers: 0.21.0 |
|
|
|
## Citation |
|
|
|
### BibTeX |
|
|
|
#### Sentence Transformers |
|
```bibtex |
|
@inproceedings{reimers-2019-sentence-bert, |
|
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks", |
|
author = "Reimers, Nils and Gurevych, Iryna", |
|
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing", |
|
month = "11", |
|
year = "2019", |
|
publisher = "Association for Computational Linguistics", |
|
url = "https://arxiv.org/abs/1908.10084", |
|
} |
|
``` |
|
|
|
#### MatryoshkaLoss |
|
```bibtex |
|
@misc{kusupati2024matryoshka, |
|
title={Matryoshka Representation Learning}, |
|
author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi}, |
|
year={2024}, |
|
eprint={2205.13147}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.LG} |
|
} |
|
``` |
|
|
|
#### MultipleNegativesRankingLoss |
|
```bibtex |
|
@misc{henderson2017efficient, |
|
title={Efficient Natural Language Response Suggestion for Smart Reply}, |
|
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil}, |
|
year={2017}, |
|
eprint={1705.00652}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CL} |
|
} |
|
``` |
|
|
|
<!-- |
|
## Glossary |
|
|
|
*Clearly define terms in order to be accessible across audiences.* |
|
--> |
|
|
|
<!-- |
|
## Model Card Authors |
|
|
|
*Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.* |
|
--> |
|
|
|
<!-- |
|
## Model Card Contact |
|
|
|
*Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.* |
|
--> |