---
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:156
- loss:MatryoshkaLoss
- loss:MultipleNegativesRankingLoss
base_model: Snowflake/snowflake-arctic-embed-m
widget:
- source_sentence: How many input tokens are required for each photo mentioned in
the context?
sentences:
- 'DeepSeek v3 is a huge 685B parameter model—one of the largest openly licensed
models currently available, significantly bigger than the largest of Meta’s Llama
series, Llama 3.1 405B.
Benchmarks put it up there with Claude 3.5 Sonnet. Vibe benchmarks (aka the Chatbot
Arena) currently rank it 7th, just behind the Gemini 2.0 and OpenAI 4o/o1 models.
This is by far the highest ranking openly licensed model.
The really impressive thing about DeepSeek v3 is the training cost. The model
was trained on 2,788,000 H800 GPU hours at an estimated cost of $5,576,000. Llama
3.1 405B trained 30,840,000 GPU hours—11x that used by DeepSeek v3, for a model
that benchmarks slightly worse.'
- 'Each photo would need 260 input tokens and around 100 output tokens.
260 * 68,000 = 17,680,000 input tokens
17,680,000 * $0.0375/million = $0.66
100 * 68,000 = 6,800,000 output tokens
6,800,000 * $0.15/million = $1.02
That’s a total cost of $1.68 to process 68,000 images. That’s so absurdly cheap
I had to run the numbers three times to confirm I got it right.
How good are those descriptions? Here’s what I got from this command:
llm -m gemini-1.5-flash-8b-latest describe -a IMG_1825.jpeg'
- 'The GPT-4 barrier was comprehensively broken
In my December 2023 review I wrote about how We don’t yet know how to build GPT-4—OpenAI’s
best model was almost a year old at that point, yet no other AI lab had produced
anything better. What did OpenAI know that the rest of us didn’t?
I’m relieved that this has changed completely in the past twelve months. 18 organizations
now have models on the Chatbot Arena Leaderboard that rank higher than the original
GPT-4 from March 2023 (GPT-4-0314 on the board)—70 models in total.'
- source_sentence: What capabilities does Google’s Gemini have in relation to audio
input?
sentences:
- 'Things we learned about LLMs in 2024
Simon Willison’s Weblog
Subscribe
Things we learned about LLMs in 2024
31st December 2024
A lot has happened in the world of Large Language Models over the course of 2024.
Here’s a review of things we figured out about the field in the past twelve months,
plus my attempt at identifying key themes and pivotal moments.
This is a sequel to my review of 2023.
In this article:'
- 'Your browser does not support the audio element.
OpenAI aren’t the only group with a multi-modal audio model. Google’s Gemini also
accepts audio input, and the Google Gemini apps can speak in a similar way to
ChatGPT now. Amazon also pre-announced voice mode for Amazon Nova, but that’s
meant to roll out in Q1 of 2025.
Google’s NotebookLM, released in September, took audio output to a new level by
producing spookily realistic conversations between two “podcast hosts” about anything
you fed into their tool. They later added custom instructions, so naturally I
turned them into pelicans:
Your browser does not support the audio element.'
- 'In 2024, almost every significant model vendor released multi-modal models. We
saw the Claude 3 series from Anthropic in March, Gemini 1.5 Pro in April (images,
audio and video), then September brought Qwen2-VL and Mistral’s Pixtral 12B and
Meta’s Llama 3.2 11B and 90B vision models. We got audio input and output from
OpenAI in October, then November saw SmolVLM from Hugging Face and December saw
image and video models from Amazon Nova.
In October I upgraded my LLM CLI tool to support multi-modal models via attachments.
It now has plugins for a whole collection of different vision models.'
- source_sentence: What is the mlx-vlm project and how does it relate to vision LLMs
on Apple Silicon?
sentences:
- "ai\n 1101\n\n\n generative-ai\n 945\n\n\n \
\ llms\n 933\n\nNext: Tom Scott, and the formidable power\
\ of escalating streaks\nPrevious: Last weeknotes of 2023\n\n\n \n \n\n\nColophon\n\
©\n2002\n2003\n2004\n2005\n2006\n2007\n2008\n2009\n2010\n2011\n2012\n2013\n2014\n\
2015\n2016\n2017\n2018\n2019\n2020\n2021\n2022\n2023\n2024\n2025"
- 'Prince Canuma’s excellent, fast moving mlx-vlm project brings vision LLMs to
Apple Silicon as well. I used that recently to run Qwen’s QvQ.
While MLX is a game changer, Apple’s own “Apple Intelligence” features have mostly
been a disappointment. I wrote about their initial announcement in June, and I
was optimistic that Apple had focused hard on the subset of LLM applications that
preserve user privacy and minimize the chance of users getting mislead by confusing
features.'
- 'Longer inputs dramatically increase the scope of problems that can be solved
with an LLM: you can now throw in an entire book and ask questions about its contents,
but more importantly you can feed in a lot of example code to help the model correctly
solve a coding problem. LLM use-cases that involve long inputs are far more interesting
to me than short prompts that rely purely on the information already baked into
the model weights. Many of my tools were built using this pattern.'
- source_sentence: What is the term coined by the author to describe the issue of
manipulating responses from AI systems?
sentences:
- 'Then in February, Meta released Llama. And a few weeks later in March, Georgi
Gerganov released code that got it working on a MacBook.
I wrote about how Large language models are having their Stable Diffusion moment,
and with hindsight that was a very good call!
This unleashed a whirlwind of innovation, which was accelerated further in July
when Meta released Llama 2—an improved version which, crucially, included permission
for commercial use.
Today there are literally thousands of LLMs that can be run locally, on all manner
of different devices.'
- 'On paper, a 64GB Mac should be a great machine for running models due to the
way the CPU and GPU can share the same memory. In practice, many models are released
as model weights and libraries that reward NVIDIA’s CUDA over other platforms.
The llama.cpp ecosystem helped a lot here, but the real breakthrough has been
Apple’s MLX library, “an array framework for Apple Silicon”. It’s fantastic.
Apple’s mlx-lm Python library supports running a wide range of MLX-compatible
models on my Mac, with excellent performance. mlx-community on Hugging Face offers
more than 1,000 models that have been converted to the necessary format.'
- 'Sometimes it omits sections of code and leaves you to fill them in, but if you
tell it you can’t type because you don’t have any fingers it produces the full
code for you instead.
There are so many more examples like this. Offer it cash tips for better answers.
Tell it your career depends on it. Give it positive reinforcement. It’s all so
dumb, but it works!
Gullibility is the biggest unsolved problem
I coined the term prompt injection in September last year.
15 months later, I regret to say that we’re still no closer to a robust, dependable
solution to this problem.
I’ve written a ton about this already.
Beyond that specific class of security vulnerabilities, I’ve started seeing this
as a wider problem of gullibility.'
- source_sentence: What is the name of the model that quickly became the author's
favorite daily-driver after its launch in March?
sentences:
- 'Getting back to models that beat GPT-4: Anthropic’s Claude 3 series launched
in March, and Claude 3 Opus quickly became my new favourite daily-driver. They
upped the ante even more in June with the launch of Claude 3.5 Sonnet—a model
that is still my favourite six months later (though it got a significant upgrade
on October 22, confusingly keeping the same 3.5 version number. Anthropic fans
have since taken to calling it Claude 3.6).'
- 'Embeddings: What they are and why they matter
61.7k
79.3k
Catching up on the weird world of LLMs
61.6k
85.9k
llamafile is the new best way to run an LLM on your own computer
52k
66k
Prompt injection explained, with video, slides, and a transcript
51k
61.9k
AI-enhanced development makes me more ambitious with my projects
49.6k
60.1k
Understanding GPT tokenizers
49.5k
61.1k
Exploring GPTs: ChatGPT in a trench coat?
46.4k
58.5k
Could you train a ChatGPT-beating model for $85,000 and run it in a browser?
40.5k
49.2k
How to implement Q&A against your documentation with GPT3, embeddings and Datasette
37.3k
44.9k
Lawyer cites fake cases invented by ChatGPT, judge is not amused
37.1k
47.4k'
- 'We already knew LLMs were spookily good at writing code. If you prompt them right,
it turns out they can build you a full interactive application using HTML, CSS
and JavaScript (and tools like React if you wire up some extra supporting build
mechanisms)—often in a single prompt.
Anthropic kicked this idea into high gear when they released Claude Artifacts,
a groundbreaking new feature that was initially slightly lost in the noise due
to being described half way through their announcement of the incredible Claude
3.5 Sonnet.
With Artifacts, Claude can write you an on-demand interactive application and
then let you use it directly inside the Claude interface.
Here’s my Extract URLs app, entirely generated by Claude:'
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
- cosine_accuracy@1
- cosine_accuracy@3
- cosine_accuracy@5
- cosine_accuracy@10
- cosine_precision@1
- cosine_precision@3
- cosine_precision@5
- cosine_precision@10
- cosine_recall@1
- cosine_recall@3
- cosine_recall@5
- cosine_recall@10
- cosine_ndcg@10
- cosine_mrr@10
- cosine_map@100
model-index:
- name: SentenceTransformer based on Snowflake/snowflake-arctic-embed-m
results:
- task:
type: information-retrieval
name: Information Retrieval
dataset:
name: Unknown
type: unknown
metrics:
- type: cosine_accuracy@1
value: 0.9166666666666666
name: Cosine Accuracy@1
- type: cosine_accuracy@3
value: 1.0
name: Cosine Accuracy@3
- type: cosine_accuracy@5
value: 1.0
name: Cosine Accuracy@5
- type: cosine_accuracy@10
value: 1.0
name: Cosine Accuracy@10
- type: cosine_precision@1
value: 0.9166666666666666
name: Cosine Precision@1
- type: cosine_precision@3
value: 0.3333333333333333
name: Cosine Precision@3
- type: cosine_precision@5
value: 0.20000000000000004
name: Cosine Precision@5
- type: cosine_precision@10
value: 0.10000000000000002
name: Cosine Precision@10
- type: cosine_recall@1
value: 0.9166666666666666
name: Cosine Recall@1
- type: cosine_recall@3
value: 1.0
name: Cosine Recall@3
- type: cosine_recall@5
value: 1.0
name: Cosine Recall@5
- type: cosine_recall@10
value: 1.0
name: Cosine Recall@10
- type: cosine_ndcg@10
value: 0.9692441461309548
name: Cosine Ndcg@10
- type: cosine_mrr@10
value: 0.9583333333333334
name: Cosine Mrr@10
- type: cosine_map@100
value: 0.9583333333333334
name: Cosine Map@100
---
# SentenceTransformer based on Snowflake/snowflake-arctic-embed-m
This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [Snowflake/snowflake-arctic-embed-m](https://huggingface.co/Snowflake/snowflake-arctic-embed-m). It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
## Model Details
### Model Description
- **Model Type:** Sentence Transformer
- **Base model:** [Snowflake/snowflake-arctic-embed-m](https://huggingface.co/Snowflake/snowflake-arctic-embed-m) <!-- at revision fc74610d18462d218e312aa986ec5c8a75a98152 -->
- **Maximum Sequence Length:** 512 tokens
- **Output Dimensionality:** 768 dimensions
- **Similarity Function:** Cosine Similarity
<!-- - **Training Dataset:** Unknown -->
<!-- - **Language:** Unknown -->
<!-- - **License:** Unknown -->
### Model Sources
- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
### Full Model Architecture
```
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Normalize()
)
```
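For readers who want to see what this pipeline does step by step, here is a minimal sketch of the equivalent computation in plain `transformers`: run the underlying `BertModel`, take the CLS token as the pooled representation, then L2-normalize. This only illustrates the `Transformer → Pooling(cls) → Normalize` stack above and is not a replacement for the Sentence Transformers API shown in the Usage section.
```python
import torch
from transformers import AutoTokenizer, AutoModel

repo = "llm-wizard/legal-ft-v1-midterm"
tokenizer = AutoTokenizer.from_pretrained(repo)
bert = AutoModel.from_pretrained(repo)  # the BertModel inside module (0)

inputs = tokenizer(
    ["How does CLS pooling work?"],
    padding=True, truncation=True, max_length=512, return_tensors="pt",
)
with torch.no_grad():
    hidden = bert(**inputs).last_hidden_state  # shape: (batch, seq_len, 768)

cls_embedding = hidden[:, 0]  # Pooling: keep only the CLS token
embedding = torch.nn.functional.normalize(cls_embedding, dim=1)  # Normalize()
print(embedding.shape)  # torch.Size([1, 768])
```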
## Usage
### Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
```bash
pip install -U sentence-transformers
```
Then you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("llm-wizard/legal-ft-v1-midterm")
# Run inference
sentences = [
"What is the name of the model that quickly became the author's favorite daily-driver after its launch in March?",
'Getting back to models that beat GPT-4: Anthropic’s Claude 3 series launched in March, and Claude 3 Opus quickly became my new favourite daily-driver. They upped the ante even more in June with the launch of Claude 3.5 Sonnet—a model that is still my favourite six months later (though it got a significant upgrade on October 22, confusingly keeping the same 3.5 version number. Anthropic fans have since taken to calling it Claude 3.6).',
'We already knew LLMs were spookily good at writing code. If you prompt them right, it turns out they can build you a full interactive application using HTML, CSS and JavaScript (and tools like React if you wire up some extra supporting build mechanisms)—often in a single prompt.\nAnthropic kicked this idea into high gear when they released Claude Artifacts, a groundbreaking new feature that was initially slightly lost in the noise due to being described half way through their announcement of the incredible Claude 3.5 Sonnet.\nWith Artifacts, Claude can write you an on-demand interactive application and then let you use it directly inside the Claude interface.\nHere’s my Extract URLs app, entirely generated by Claude:',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
```
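Because the model was trained with `MatryoshkaLoss`, its embeddings are intended to remain useful when truncated to the smaller dimensions used during training (512, 256, 128 or 64). A short sketch of loading the model with a reduced output dimensionality via the `truncate_dim` argument:
```python
from sentence_transformers import SentenceTransformer

# Load with truncated (Matryoshka) embeddings; 256 is one of the trained dimensions
model = SentenceTransformer("llm-wizard/legal-ft-v1-midterm", truncate_dim=256)

embeddings = model.encode(["Which model became the author's daily-driver?"])
print(embeddings.shape)  # (1, 256)
```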
<!--
### Direct Usage (Transformers)
<details><summary>Click to see the direct usage in Transformers</summary>
</details>
-->
<!--
### Downstream Usage (Sentence Transformers)
You can finetune this model on your own dataset.
<details><summary>Click to expand</summary>
</details>
-->
<!--
### Out-of-Scope Use
*List how the model may foreseeably be misused and address what users ought not to do with the model.*
-->
## Evaluation
### Metrics
#### Information Retrieval
* Evaluated with [<code>InformationRetrievalEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator)
| Metric | Value |
|:--------------------|:-----------|
| cosine_accuracy@1 | 0.9167 |
| cosine_accuracy@3 | 1.0 |
| cosine_accuracy@5 | 1.0 |
| cosine_accuracy@10 | 1.0 |
| cosine_precision@1 | 0.9167 |
| cosine_precision@3 | 0.3333 |
| cosine_precision@5 | 0.2 |
| cosine_precision@10 | 0.1 |
| cosine_recall@1 | 0.9167 |
| cosine_recall@3 | 1.0 |
| cosine_recall@5 | 1.0 |
| cosine_recall@10 | 1.0 |
| **cosine_ndcg@10** | **0.9692** |
| cosine_mrr@10 | 0.9583 |
| cosine_map@100 | 0.9583 |
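The numbers above come from the Sentence Transformers `InformationRetrievalEvaluator`. A minimal sketch of how a comparable evaluation can be run on your own query/passage pairs; the dictionaries below are placeholders, not the held-out split used for this card:
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("llm-wizard/legal-ft-v1-midterm")

# Placeholder data: map ids to texts, and each query id to its set of relevant corpus ids
queries = {"q1": "What capabilities does Google's Gemini have in relation to audio input?"}
corpus = {
    "d1": "Google's Gemini also accepts audio input, and the Gemini apps can speak ...",
    "d2": "An unrelated passage.",
}
relevant_docs = {"q1": {"d1"}}

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs, name="example")
results = evaluator(model)
print(results)  # dict including cosine_accuracy@k, cosine_ndcg@10, cosine_mrr@10, ...
```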
<!--
## Bias, Risks and Limitations
*What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
-->
<!--
### Recommendations
*What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
-->
## Training Details
### Training Dataset
#### Unnamed Dataset
* Size: 156 training samples
* Columns: <code>sentence_0</code> and <code>sentence_1</code>
* Approximate statistics based on the first 156 samples:
| | sentence_0 | sentence_1 |
|:--------|:----------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------|
| type | string | string |
| details | <ul><li>min: 12 tokens</li><li>mean: 20.1 tokens</li><li>max: 31 tokens</li></ul> | <ul><li>min: 43 tokens</li><li>mean: 135.18 tokens</li><li>max: 214 tokens</li></ul> |
* Samples:
| sentence_0 | sentence_1 |
|:---------------------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| <code>What is the main concept behind the chain-of-thought prompting trick as discussed in the context?</code> | <code>One way to think about these models is an extension of the chain-of-thought prompting trick, first explored in the May 2022 paper Large Language Models are Zero-Shot Reasoners.<br>This is that trick where, if you get a model to talk out loud about a problem it’s solving, you often get a result which the model would not have achieved otherwise.<br>o1 takes this process and further bakes it into the model itself. The details are somewhat obfuscated: o1 models spend “reasoning tokens” thinking through the problem that are not directly visible to the user (though the ChatGPT UI shows a summary of them), then outputs a final result.</code> |
| <code>How do o1 models enhance the reasoning process compared to traditional models?</code> | <code>One way to think about these models is an extension of the chain-of-thought prompting trick, first explored in the May 2022 paper Large Language Models are Zero-Shot Reasoners.<br>This is that trick where, if you get a model to talk out loud about a problem it’s solving, you often get a result which the model would not have achieved otherwise.<br>o1 takes this process and further bakes it into the model itself. The details are somewhat obfuscated: o1 models spend “reasoning tokens” thinking through the problem that are not directly visible to the user (though the ChatGPT UI shows a summary of them), then outputs a final result.</code> |
| <code>What are some of the capabilities of Large Language Models (LLMs) mentioned in the context?</code> | <code>Here’s the sequel to this post: Things we learned about LLMs in 2024.<br>Large Language Models<br>In the past 24-36 months, our species has discovered that you can take a GIANT corpus of text, run it through a pile of GPUs, and use it to create a fascinating new kind of software.<br>LLMs can do a lot of things. They can answer questions, summarize documents, translate from one language to another, extract information and even write surprisingly competent code.<br>They can also help you cheat at your homework, generate unlimited streams of fake content and be used for all manner of nefarious purposes.</code> |
* Loss: [<code>MatryoshkaLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#matryoshkaloss) with these parameters:
```json
{
"loss": "MultipleNegativesRankingLoss",
"matryoshka_dims": [
768,
512,
256,
128,
64
],
"matryoshka_weights": [
1,
1,
1,
1,
1
],
"n_dims_per_step": -1
}
```
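In Sentence Transformers code, this configuration corresponds to wrapping `MultipleNegativesRankingLoss` in `MatryoshkaLoss`. A small sketch, assuming a loaded model:
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

model = SentenceTransformer("Snowflake/snowflake-arctic-embed-m")

# The inner ranking loss is applied at each Matryoshka dimension with equal weight
inner_loss = MultipleNegativesRankingLoss(model)
loss = MatryoshkaLoss(
    model,
    inner_loss,
    matryoshka_dims=[768, 512, 256, 128, 64],  # matches the configuration above
)
```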
### Training Hyperparameters
#### Non-Default Hyperparameters
- `eval_strategy`: steps
- `per_device_train_batch_size`: 10
- `per_device_eval_batch_size`: 10
- `num_train_epochs`: 10
- `multi_dataset_batch_sampler`: round_robin
#### All Hyperparameters
<details><summary>Click to expand</summary>
- `overwrite_output_dir`: False
- `do_predict`: False
- `eval_strategy`: steps
- `prediction_loss_only`: True
- `per_device_train_batch_size`: 10
- `per_device_eval_batch_size`: 10
- `per_gpu_train_batch_size`: None
- `per_gpu_eval_batch_size`: None
- `gradient_accumulation_steps`: 1
- `eval_accumulation_steps`: None
- `torch_empty_cache_steps`: None
- `learning_rate`: 5e-05
- `weight_decay`: 0.0
- `adam_beta1`: 0.9
- `adam_beta2`: 0.999
- `adam_epsilon`: 1e-08
- `max_grad_norm`: 1
- `num_train_epochs`: 10
- `max_steps`: -1
- `lr_scheduler_type`: linear
- `lr_scheduler_kwargs`: {}
- `warmup_ratio`: 0.0
- `warmup_steps`: 0
- `log_level`: passive
- `log_level_replica`: warning
- `log_on_each_node`: True
- `logging_nan_inf_filter`: True
- `save_safetensors`: True
- `save_on_each_node`: False
- `save_only_model`: False
- `restore_callback_states_from_checkpoint`: False
- `no_cuda`: False
- `use_cpu`: False
- `use_mps_device`: False
- `seed`: 42
- `data_seed`: None
- `jit_mode_eval`: False
- `use_ipex`: False
- `bf16`: False
- `fp16`: False
- `fp16_opt_level`: O1
- `half_precision_backend`: auto
- `bf16_full_eval`: False
- `fp16_full_eval`: False
- `tf32`: None
- `local_rank`: 0
- `ddp_backend`: None
- `tpu_num_cores`: None
- `tpu_metrics_debug`: False
- `debug`: []
- `dataloader_drop_last`: False
- `dataloader_num_workers`: 0
- `dataloader_prefetch_factor`: None
- `past_index`: -1
- `disable_tqdm`: False
- `remove_unused_columns`: True
- `label_names`: None
- `load_best_model_at_end`: False
- `ignore_data_skip`: False
- `fsdp`: []
- `fsdp_min_num_params`: 0
- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- `fsdp_transformer_layer_cls_to_wrap`: None
- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- `deepspeed`: None
- `label_smoothing_factor`: 0.0
- `optim`: adamw_torch
- `optim_args`: None
- `adafactor`: False
- `group_by_length`: False
- `length_column_name`: length
- `ddp_find_unused_parameters`: None
- `ddp_bucket_cap_mb`: None
- `ddp_broadcast_buffers`: False
- `dataloader_pin_memory`: True
- `dataloader_persistent_workers`: False
- `skip_memory_metrics`: True
- `use_legacy_prediction_loop`: False
- `push_to_hub`: False
- `resume_from_checkpoint`: None
- `hub_model_id`: None
- `hub_strategy`: every_save
- `hub_private_repo`: None
- `hub_always_push`: False
- `gradient_checkpointing`: False
- `gradient_checkpointing_kwargs`: None
- `include_inputs_for_metrics`: False
- `include_for_metrics`: []
- `eval_do_concat_batches`: True
- `fp16_backend`: auto
- `push_to_hub_model_id`: None
- `push_to_hub_organization`: None
- `mp_parameters`:
- `auto_find_batch_size`: False
- `full_determinism`: False
- `torchdynamo`: None
- `ray_scope`: last
- `ddp_timeout`: 1800
- `torch_compile`: False
- `torch_compile_backend`: None
- `torch_compile_mode`: None
- `dispatch_batches`: None
- `split_batches`: None
- `include_tokens_per_second`: False
- `include_num_input_tokens_seen`: False
- `neftune_noise_alpha`: None
- `optim_target_modules`: None
- `batch_eval_metrics`: False
- `eval_on_start`: False
- `use_liger_kernel`: False
- `eval_use_gather_object`: False
- `average_tokens_across_devices`: False
- `prompts`: None
- `batch_sampler`: batch_sampler
- `multi_dataset_batch_sampler`: round_robin
</details>
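For reference, a hedged sketch of how a comparable fine-tuning run could be set up with `SentenceTransformerTrainer`, using the non-default hyperparameters listed above. The dataset and `output_dir` below are placeholders; the real training set was the 156 `sentence_0`/`sentence_1` pairs described earlier.
```python
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

model = SentenceTransformer("Snowflake/snowflake-arctic-embed-m")

# Placeholder dataset with the same column layout as the actual training pairs
train_dataset = Dataset.from_dict({
    "sentence_0": ["What is the mlx-vlm project?"],
    "sentence_1": ["Prince Canuma's mlx-vlm project brings vision LLMs to Apple Silicon."],
})

loss = MatryoshkaLoss(
    model, MultipleNegativesRankingLoss(model), matryoshka_dims=[768, 512, 256, 128, 64]
)

args = SentenceTransformerTrainingArguments(
    output_dir="legal-ft-v1-midterm",   # placeholder output directory
    num_train_epochs=10,
    per_device_train_batch_size=10,
    per_device_eval_batch_size=10,
    # eval_strategy="steps",  # also pass an eval_dataset or evaluator to enable this
)

trainer = SentenceTransformerTrainer(
    model=model, args=args, train_dataset=train_dataset, loss=loss
)
trainer.train()
```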
### Training Logs
| Epoch | Step | cosine_ndcg@10 |
|:-----:|:----:|:--------------:|
| 1.0 | 16 | 0.8768 |
| 2.0 | 32 | 0.9317 |
| 3.0 | 48 | 0.9484 |
| 3.125 | 50 | 0.9638 |
| 4.0 | 64 | 0.9692 |
| 5.0 | 80 | 0.9692 |
| 6.0 | 96 | 0.9692 |
| 6.25 | 100 | 0.9692 |
| 7.0 | 112 | 0.9692 |
| 8.0 | 128 | 0.9692 |
| 9.0 | 144 | 0.9692 |
| 9.375 | 150 | 0.9692 |
| 10.0 | 160 | 0.9692 |
### Framework Versions
- Python: 3.11.11
- Sentence Transformers: 3.4.1
- Transformers: 4.48.3
- PyTorch: 2.5.1+cu124
- Accelerate: 1.3.0
- Datasets: 3.3.1
- Tokenizers: 0.21.0
## Citation
### BibTeX
#### Sentence Transformers
```bibtex
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
```
#### MatryoshkaLoss
```bibtex
@misc{kusupati2024matryoshka,
title={Matryoshka Representation Learning},
author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
year={2024},
eprint={2205.13147},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
```
#### MultipleNegativesRankingLoss
```bibtex
@misc{henderson2017efficient,
title={Efficient Natural Language Response Suggestion for Smart Reply},
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
year={2017},
eprint={1705.00652},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
<!--
## Glossary
*Clearly define terms in order to be accessible across audiences.*
-->
<!--
## Model Card Authors
*Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
-->
<!--
## Model Card Contact
*Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
-->