---
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:400
- loss:MatryoshkaLoss
- loss:MultipleNegativesRankingLoss
base_model: Snowflake/snowflake-arctic-embed-l
widget:
- source_sentence: What potential issues can arise from the use of AI systems in determining
    access to financial resources and essential services?
  sentences:
  - the dispatching of emergency first response services, including by police, firefighters
    and medical aid, as well as of emergency healthcare patient triage systems, should
    also be classified as high-risk since they make decisions in very critical situations
    for the life and health of persons and their property.
  - systems do not entail a high risk to legal and natural persons. In addition, AI
    systems used to evaluate the credit score or creditworthiness of natural persons
    should be classified as high-risk AI systems, since they determine those persons’
    access to financial resources or essential services such as housing, electricity,
    and telecommunication services. AI systems used for those purposes may lead to
    discrimination between persons or groups and may perpetuate historical patterns
    of discrimination, such as that based on racial or ethnic origins, gender, disabilities,
    age or sexual orientation, or may create new forms of discriminatory impacts.
    However, AI systems provided for by Union law for the purpose of detecting fraud
    in the offering
  - In accordance with Articles 2 and 2a of Protocol No 22 on the position of Denmark,
    annexed to the TEU and to the TFEU, Denmark is not bound by rules laid down in
    Article 5(1), first subparagraph, point (g), to the extent it applies to the use
    of biometric categorisation systems for activities in the field of police cooperation
    and judicial cooperation in criminal matters, Article 5(1), first subparagraph,
    point (d), to the extent it applies to the use of AI systems covered by that provision,
    Article 5(1), first subparagraph, point (h), (2) to (6) and Article 26(10) of
    this Regulation adopted on the basis of Article 16 TFEU, or subject to their application,
    which relate to the processing of personal data by the Member States when carrying
- source_sentence: Why is the failure or malfunctioning of safety components in critical
    infrastructure considered a significant risk?
  sentences:
  - As regards the management and operation of critical infrastructure, it is appropriate
    to classify as high-risk the AI systems intended to be used as safety components
    in the management and operation of critical digital infrastructure as listed in
    point (8) of the Annex to Directive (EU) 2022/2557, road traffic and the supply
    of water, gas, heating and electricity, since their failure or malfunctioning
    may put at risk the life and health of persons at large scale and lead to appreciable
    disruptions in the ordinary conduct of social and economic activities. Safety
    components of critical infrastructure, including critical digital infrastructure,
    are systems used to directly protect the physical integrity of critical infrastructure
    or the
  - (54)
  - (42)
- source_sentence: How does the current Regulation relate to the provisions set out
    in Regulation (EU) 2022/2065?
  sentences:
  - (39)
  - '(11)



    This Regulation should be without prejudice to the provisions regarding the liability
    of providers of intermediary services as set out in Regulation (EU) 2022/2065
    of the European Parliament and of the Council (15).













    (12)'
  - (53)
- source_sentence: Why is it important to ensure a consistent and high level of protection
    for AI throughout the Union?
  sentences:
  - AI systems can be easily deployed in a large variety of sectors of the economy
    and many parts of society, including across borders, and can easily circulate
    throughout the Union. Certain Member States have already explored the adoption
    of national rules to ensure that AI is trustworthy and safe and is developed and
    used in accordance with fundamental rights obligations. Diverging national rules
    may lead to the fragmentation of the internal market and may decrease legal certainty
    for operators that develop, import or use AI systems. A consistent and high level
    of protection throughout the Union should therefore be ensured in order to achieve
    trustworthy AI, while divergences hampering the free circulation, innovation,
    deployment and the
  - '(5)



    At the same time, depending on the circumstances regarding its specific application,
    use, and level of technological development, AI may generate risks and cause harm
    to public interests and fundamental rights that are protected by Union law. Such
    harm might be material or immaterial, including physical, psychological, societal
    or economic harm.













    (6)'
  - (57)
- source_sentence: What is the purpose of implementing a risk-based approach for AI
    systems according to the context?
  sentences:
  - use of lethal force and other AI systems in the context of military and defence
    activities. As regards national security purposes, the exclusion is justified
    both by the fact that national security remains the sole responsibility of Member
    States in accordance with Article 4(2) TEU and by the specific nature and operational
    needs of national security activities and specific national rules applicable to
    those activities. Nonetheless, if an AI system developed, placed on the market,
    put into service or used for military, defence or national security purposes is
    used outside those temporarily or permanently for other purposes, for example,
    civilian or humanitarian purposes, law enforcement or public security purposes,
    such a system would fall
  - '(26)



    In order to introduce a proportionate and effective set of binding rules for AI
    systems, a clearly defined risk-based approach should be followed. That approach
    should tailor the type and content of such rules to the intensity and scope of
    the risks that AI systems can generate. It is therefore necessary to prohibit
    certain unacceptable AI practices, to lay down requirements for high-risk AI systems
    and obligations for the relevant operators, and to lay down transparency obligations
    for certain AI systems.













    (27)'
  - To mitigate the risks from high-risk AI systems placed on the market or put into
    service and to ensure a high level of trustworthiness, certain mandatory requirements
    should apply to high-risk AI systems, taking into account the intended purpose
    and the context of use of the AI system and according to the risk-management system
    to be established by the provider. The measures adopted by the providers to comply
    with the mandatory requirements of this Regulation should take into account the
    generally acknowledged state of the art on AI, be proportionate and effective
    to meet the objectives of this Regulation. Based on the New Legislative Framework,
    as clarified in Commission notice ‘The “Blue Guide” on the implementation of EU
    product rules
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
- cosine_accuracy@1
- cosine_accuracy@3
- cosine_accuracy@5
- cosine_accuracy@10
- cosine_precision@1
- cosine_precision@3
- cosine_precision@5
- cosine_precision@10
- cosine_recall@1
- cosine_recall@3
- cosine_recall@5
- cosine_recall@10
- cosine_ndcg@10
- cosine_mrr@10
- cosine_map@100
model-index:
- name: SentenceTransformer based on Snowflake/snowflake-arctic-embed-l
  results:
  - task:
      type: information-retrieval
      name: Information Retrieval
    dataset:
      name: Unknown
      type: unknown
    metrics:
    - type: cosine_accuracy@1
      value: 0.8958333333333334
      name: Cosine Accuracy@1
    - type: cosine_accuracy@3
      value: 1.0
      name: Cosine Accuracy@3
    - type: cosine_accuracy@5
      value: 1.0
      name: Cosine Accuracy@5
    - type: cosine_accuracy@10
      value: 1.0
      name: Cosine Accuracy@10
    - type: cosine_precision@1
      value: 0.8958333333333334
      name: Cosine Precision@1
    - type: cosine_precision@3
      value: 0.3333333333333333
      name: Cosine Precision@3
    - type: cosine_precision@5
      value: 0.19999999999999998
      name: Cosine Precision@5
    - type: cosine_precision@10
      value: 0.09999999999999999
      name: Cosine Precision@10
    - type: cosine_recall@1
      value: 0.8958333333333334
      name: Cosine Recall@1
    - type: cosine_recall@3
      value: 1.0
      name: Cosine Recall@3
    - type: cosine_recall@5
      value: 1.0
      name: Cosine Recall@5
    - type: cosine_recall@10
      value: 1.0
      name: Cosine Recall@10
    - type: cosine_ndcg@10
      value: 0.9560997762648827
      name: Cosine Ndcg@10
    - type: cosine_mrr@10
      value: 0.9409722222222222
      name: Cosine Mrr@10
    - type: cosine_map@100
      value: 0.9409722222222223
      name: Cosine Map@100
---

# SentenceTransformer based on Snowflake/snowflake-arctic-embed-l

This is a [sentence-transformers](https://www.SBERT.net) model fine-tuned from [Snowflake/snowflake-arctic-embed-l](https://huggingface.co/Snowflake/snowflake-arctic-embed-l). It maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

## Model Details

### Model Description
- **Model Type:** Sentence Transformer
- **Base model:** [Snowflake/snowflake-arctic-embed-l](https://huggingface.co/Snowflake/snowflake-arctic-embed-l) <!-- at revision d8fb21ca8d905d2832ee8b96c894d3298964346b -->
- **Maximum Sequence Length:** 512 tokens
- **Output Dimensionality:** 1024 dimensions
- **Similarity Function:** Cosine Similarity
<!-- - **Training Dataset:** Unknown -->
<!-- - **Language:** Unknown -->
<!-- - **License:** Unknown -->

### Model Sources

- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)

### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
```
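
Read top to bottom, the three modules run the BERT encoder, keep the hidden state of the first (`[CLS]`) token as the sentence vector, and scale that vector to unit length. The following is a minimal pure-Python sketch of the pooling and normalization steps only. The 4-dimensional token vectors are made-up stand-ins for the real 1024-dimensional encoder output, and the function names are illustrative rather than library API.

```python
import math

def cls_pool(token_embeddings):
    """Return the first token's vector (CLS pooling)."""
    return list(token_embeddings[0])

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]

# Toy encoder output for one sentence: 3 tokens x 4 dims (illustrative values).
token_embeddings = [
    [3.0, 0.0, 4.0, 0.0],  # [CLS] token
    [1.0, 2.0, 0.5, 0.1],
    [0.2, 0.3, 0.9, 0.7],
]

sentence_embedding = l2_normalize(cls_pool(token_embeddings))
print(sentence_embedding)
```

Because every output vector has unit length, the cosine similarity between two embeddings reduces to their dot product.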

## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("Mdean77/legal-ft-3")
# Run inference
sentences = [
    'What is the purpose of implementing a risk-based approach for AI systems according to the context?',
    '(26)\n\n\nIn order to introduce a\xa0proportionate and effective set of binding rules for AI systems, a\xa0clearly defined risk-based approach should be followed. That approach should tailor the type and content of such rules to the intensity and scope of the risks that AI systems can generate. It is therefore necessary to prohibit certain unacceptable AI practices, to lay down requirements for high-risk AI systems and obligations for the relevant operators, and to lay down transparency obligations for certain AI systems.\n\n\n\n\n\n\n\n\n\n\n\n\n(27)',
    'use of lethal force and other AI systems in the context of military and defence activities. As regards national security purposes, the exclusion is justified both by the fact that national security remains the sole responsibility of Member States in accordance with Article\xa04(2) TEU and by the specific nature and operational needs of national security activities and specific national rules applicable to those activities. Nonetheless, if an AI system developed, placed on the market, put into service or used for military, defence or national security purposes is used outside those temporarily or permanently for other purposes, for example, civilian or humanitarian purposes, law enforcement or public security purposes, such a\xa0system would fall',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 1024)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
```
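
Since the model was trained with `MatryoshkaLoss` (see the dimensions listed under Training Details), the leading components of each embedding are intended to work as a shorter embedding on their own. The sketch below shows the idea on toy 6-dimensional vectors that stand in for `model.encode(...)` output: keep a prefix, re-normalize, and rank passages by cosine similarity. The helper names are hypothetical. In practice, Sentence Transformers also accepts a `truncate_dim` argument when constructing `SentenceTransformer`, which should achieve the same truncation.

```python
import math

def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` components and rescale to unit length."""
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else list(head)

def rank_by_cosine(query, passages, dim):
    """Return passage indices sorted from most to least similar to the query."""
    q = truncate_and_normalize(query, dim)
    scores = []
    for index, passage in enumerate(passages):
        p = truncate_and_normalize(passage, dim)
        scores.append((sum(a * b for a, b in zip(q, p)), index))
    return [index for _, index in sorted(scores, reverse=True)]

# Toy vectors (illustrative values, not real model output).
query = [0.9, 0.1, 0.0, 0.3, 0.2, 0.1]
passages = [
    [0.1, 0.9, 0.2, 0.0, 0.1, 0.3],
    [0.8, 0.2, 0.1, 0.2, 0.3, 0.0],
    [0.0, 0.1, 0.9, 0.4, 0.0, 0.2],
]

full_ranking = rank_by_cosine(query, passages, dim=6)
short_ranking = rank_by_cosine(query, passages, dim=3)
print(full_ranking, short_ranking)
```

Shorter prefixes trade some retrieval quality for smaller indexes and faster search, so it is worth re-running the evaluation at the chosen dimension before relying on it.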

## Evaluation

### Metrics

#### Information Retrieval

* Evaluated with [<code>InformationRetrievalEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator)

| Metric              | Value      |
|:--------------------|:-----------|
| cosine_accuracy@1   | 0.8958     |
| cosine_accuracy@3   | 1.0        |
| cosine_accuracy@5   | 1.0        |
| cosine_accuracy@10  | 1.0        |
| cosine_precision@1  | 0.8958     |
| cosine_precision@3  | 0.3333     |
| cosine_precision@5  | 0.2        |
| cosine_precision@10 | 0.1        |
| cosine_recall@1     | 0.8958     |
| cosine_recall@3     | 1.0        |
| cosine_recall@5     | 1.0        |
| cosine_recall@10    | 1.0        |
| **cosine_ndcg@10**  | **0.9561** |
| cosine_mrr@10       | 0.941      |
| cosine_map@100      | 0.941      |
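
The pattern in this table (recall@3 of 1.0 alongside precision@3 of 0.3333, precision@5 of 0.2 and precision@10 of 0.1) is what one would expect if every query has exactly one relevant passage, in which case precision@k can never exceed 1/k. Under that assumption, all of the metrics follow from the rank at which the relevant passage is retrieved. The sketch below computes them from a list of ranks. The rank distribution it uses is hypothetical: it was chosen because it reproduces the reported values, but the card does not state the evaluation set size or the per-query ranks.

```python
import math

def retrieval_metrics(ranks, k_values=(1, 3, 5, 10)):
    """Compute IR metrics when each query has exactly one relevant document.

    `ranks` holds the 1-based position of that document for every query.
    """
    n = len(ranks)
    metrics = {}
    for k in k_values:
        hits = sum(1 for r in ranks if r <= k)
        metrics[f"accuracy@{k}"] = hits / n         # equals recall@k in this setting
        metrics[f"precision@{k}"] = hits / (n * k)  # bounded above by 1/k
    metrics["mrr@10"] = sum(1 / r for r in ranks if r <= 10) / n
    # With a single relevant document the ideal DCG is 1, so NDCG equals DCG.
    metrics["ndcg@10"] = sum(1 / math.log2(r + 1) for r in ranks if r <= 10) / n
    return metrics

# Hypothetical distribution (an assumption, not taken from the card):
# 48 queries, relevant passage at rank 1 for 43, rank 2 for 3, rank 3 for 2.
ranks = [1] * 43 + [2] * 3 + [3] * 2
metrics = retrieval_metrics(ranks)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Read this way, the model retrieves the correct passage first for roughly nine queries in ten and within the top three for all of them.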

## Training Details

### Training Dataset

#### Unnamed Dataset

* Size: 400 training samples
* Columns: <code>sentence_0</code> and <code>sentence_1</code>
* Approximate statistics based on the first 400 samples:
  |         | sentence_0                                                                         | sentence_1                                                                         |
  |:--------|:-----------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|
  | type    | string                                                                             | string                                                                             |
  | details | <ul><li>min: 10 tokens</li><li>mean: 20.33 tokens</li><li>max: 33 tokens</li></ul> | <ul><li>min: 5 tokens</li><li>mean: 93.01 tokens</li><li>max: 186 tokens</li></ul> |
* Samples:
  | sentence_0                                                                                  | sentence_1                                                                                                                                                                                                                                                                       |
  |:--------------------------------------------------------------------------------------------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
  | <code>What is the significance of the number 55 in the given context?</code>                | <code>(55)</code>                                                                                                                                                                                                                                                                |
  | <code>How does the number 55 relate to the overall theme or subject being discussed?</code> | <code>(55)</code>                                                                                                                                                                                                                                                                |
  | <code>What types of practices are prohibited by Union law according to the context?</code>  | <code>(45)<br><br><br>Practices that are prohibited by Union law, including data protection law, non-discrimination law, consumer protection law, and competition law, should not be affected by this Regulation.<br><br><br><br><br><br><br><br><br><br><br><br><br>(46)</code> |
* Loss: [<code>MatryoshkaLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#matryoshkaloss) with these parameters:
  ```json
  {
      "loss": "MultipleNegativesRankingLoss",
      "matryoshka_dims": [
          768,
          512,
          256,
          128,
          64
      ],
      "matryoshka_weights": [
          1,
          1,
          1,
          1,
          1
      ],
      "n_dims_per_step": -1
  }
  ```
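
To make the configuration above concrete: the base loss treats each (question, passage) pair in a batch as a positive and every other passage in the batch as a negative, and the Matryoshka wrapper applies that loss to several prefixes of the embeddings and adds the results using `matryoshka_weights`. With `n_dims_per_step` set to -1, all listed dimensions are used at every step. The following is a simplified pure-Python sketch of that computation, not the library implementation. The toy embeddings, the small `dims` list and the `scale` of 20 (assumed to match the library default) are illustrative.

```python
import math

def normalize(vector):
    """Scale a vector to unit length so dot products are cosine scores."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)

def in_batch_ranking_loss(anchors, positives, scale=20.0):
    """Cross-entropy over scaled cosine scores; positives[i] is the target for anchors[i]."""
    anchors = [normalize(a) for a in anchors]
    positives = [normalize(p) for p in positives]
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * sum(x * y for x, y in zip(anchor, p)) for p in positives]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)

def matryoshka_loss(anchors, positives, dims, weights):
    """Weighted sum of the ranking loss computed on each embedding prefix."""
    loss = 0.0
    for dim, weight in zip(dims, weights):
        loss += weight * in_batch_ranking_loss(
            [a[:dim] for a in anchors], [p[:dim] for p in positives]
        )
    return loss

# Toy batch: 3 (question, passage) pairs with 8-dimensional embeddings.
anchors = [
    [1.0, 0.0, 0.0, 0.2, 0.1, 0.0, 0.0, 0.1],
    [0.0, 1.0, 0.0, 0.0, 0.2, 0.1, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.1, 0.0, 0.2, 0.1, 0.0],
]
positives = [
    [0.9, 0.1, 0.0, 0.1, 0.1, 0.0, 0.1, 0.0],
    [0.1, 0.8, 0.1, 0.0, 0.1, 0.2, 0.0, 0.0],
    [0.0, 0.1, 0.9, 0.2, 0.0, 0.1, 0.0, 0.1],
]

loss = matryoshka_loss(anchors, positives, dims=[8, 4], weights=[1, 1])
print(f"{loss:.4f}")
```

Because negatives come from the rest of the batch, the batch size of 10 used here also fixes the number of negatives per question at nine.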

### Training Hyperparameters
#### Non-Default Hyperparameters

- `eval_strategy`: steps
- `per_device_train_batch_size`: 10
- `per_device_eval_batch_size`: 10
- `num_train_epochs`: 10
- `multi_dataset_batch_sampler`: round_robin
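
These settings determine the length of the run: 400 samples at a batch size of 10 give 40 optimizer steps per epoch, and 10 epochs give 400 steps in total, which matches the step counts in the training logs. The short sketch below reproduces that arithmetic and the linear learning-rate decay implied by `lr_scheduler_type: linear` with no warmup. It assumes a single device and no gradient accumulation, and the function names are illustrative.

```python
import math

def training_schedule(num_samples, batch_size, num_epochs):
    """Return (steps_per_epoch, total_steps) for single-device training."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return steps_per_epoch, steps_per_epoch * num_epochs

def linear_lr(step, total_steps, base_lr=5e-05, warmup_steps=0):
    """Linear warmup (if any) followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

steps_per_epoch, total_steps = training_schedule(400, 10, 10)
print(steps_per_epoch, total_steps)  # 40 400

# Learning rate at the start, midpoint and end of training.
for step in (0, total_steps // 2, total_steps):
    print(step, linear_lr(step, total_steps))
```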

#### All Hyperparameters
<details><summary>Click to expand</summary>

- `overwrite_output_dir`: False
- `do_predict`: False
- `eval_strategy`: steps
- `prediction_loss_only`: True
- `per_device_train_batch_size`: 10
- `per_device_eval_batch_size`: 10
- `per_gpu_train_batch_size`: None
- `per_gpu_eval_batch_size`: None
- `gradient_accumulation_steps`: 1
- `eval_accumulation_steps`: None
- `torch_empty_cache_steps`: None
- `learning_rate`: 5e-05
- `weight_decay`: 0.0
- `adam_beta1`: 0.9
- `adam_beta2`: 0.999
- `adam_epsilon`: 1e-08
- `max_grad_norm`: 1
- `num_train_epochs`: 10
- `max_steps`: -1
- `lr_scheduler_type`: linear
- `lr_scheduler_kwargs`: {}
- `warmup_ratio`: 0.0
- `warmup_steps`: 0
- `log_level`: passive
- `log_level_replica`: warning
- `log_on_each_node`: True
- `logging_nan_inf_filter`: True
- `save_safetensors`: True
- `save_on_each_node`: False
- `save_only_model`: False
- `restore_callback_states_from_checkpoint`: False
- `no_cuda`: False
- `use_cpu`: False
- `use_mps_device`: False
- `seed`: 42
- `data_seed`: None
- `jit_mode_eval`: False
- `use_ipex`: False
- `bf16`: False
- `fp16`: False
- `fp16_opt_level`: O1
- `half_precision_backend`: auto
- `bf16_full_eval`: False
- `fp16_full_eval`: False
- `tf32`: None
- `local_rank`: 0
- `ddp_backend`: None
- `tpu_num_cores`: None
- `tpu_metrics_debug`: False
- `debug`: []
- `dataloader_drop_last`: False
- `dataloader_num_workers`: 0
- `dataloader_prefetch_factor`: None
- `past_index`: -1
- `disable_tqdm`: False
- `remove_unused_columns`: True
- `label_names`: None
- `load_best_model_at_end`: False
- `ignore_data_skip`: False
- `fsdp`: []
- `fsdp_min_num_params`: 0
- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- `fsdp_transformer_layer_cls_to_wrap`: None
- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- `deepspeed`: None
- `label_smoothing_factor`: 0.0
- `optim`: adamw_torch
- `optim_args`: None
- `adafactor`: False
- `group_by_length`: False
- `length_column_name`: length
- `ddp_find_unused_parameters`: None
- `ddp_bucket_cap_mb`: None
- `ddp_broadcast_buffers`: False
- `dataloader_pin_memory`: True
- `dataloader_persistent_workers`: False
- `skip_memory_metrics`: True
- `use_legacy_prediction_loop`: False
- `push_to_hub`: False
- `resume_from_checkpoint`: None
- `hub_model_id`: None
- `hub_strategy`: every_save
- `hub_private_repo`: None
- `hub_always_push`: False
- `gradient_checkpointing`: False
- `gradient_checkpointing_kwargs`: None
- `include_inputs_for_metrics`: False
- `include_for_metrics`: []
- `eval_do_concat_batches`: True
- `fp16_backend`: auto
- `push_to_hub_model_id`: None
- `push_to_hub_organization`: None
- `mp_parameters`: 
- `auto_find_batch_size`: False
- `full_determinism`: False
- `torchdynamo`: None
- `ray_scope`: last
- `ddp_timeout`: 1800
- `torch_compile`: False
- `torch_compile_backend`: None
- `torch_compile_mode`: None
- `dispatch_batches`: None
- `split_batches`: None
- `include_tokens_per_second`: False
- `include_num_input_tokens_seen`: False
- `neftune_noise_alpha`: None
- `optim_target_modules`: None
- `batch_eval_metrics`: False
- `eval_on_start`: False
- `use_liger_kernel`: False
- `eval_use_gather_object`: False
- `average_tokens_across_devices`: False
- `prompts`: None
- `batch_sampler`: batch_sampler
- `multi_dataset_batch_sampler`: round_robin

</details>

### Training Logs
| Epoch | Step | cosine_ndcg@10 |
|:-----:|:----:|:--------------:|
| 1.0   | 40   | 0.9715         |
| 1.25  | 50   | 0.9638         |
| 2.0   | 80   | 0.9715         |
| 2.5   | 100  | 0.9638         |
| 3.0   | 120  | 0.9742         |
| 3.75  | 150  | 0.9792         |
| 4.0   | 160  | 0.9700         |
| 5.0   | 200  | 0.9715         |
| 6.0   | 240  | 0.9505         |
| 6.25  | 250  | 0.9505         |
| 7.0   | 280  | 0.9623         |
| 7.5   | 300  | 0.9638         |
| 8.0   | 320  | 0.9561         |
| 8.75  | 350  | 0.9638         |
| 9.0   | 360  | 0.9638         |
| 10.0  | 400  | 0.9561         |


### Framework Versions
- Python: 3.11.11
- Sentence Transformers: 3.4.1
- Transformers: 4.48.2
- PyTorch: 2.5.1+cu124
- Accelerate: 1.3.0
- Datasets: 3.2.0
- Tokenizers: 0.21.0

## Citation

### BibTeX

#### Sentence Transformers
```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```

#### MatryoshkaLoss
```bibtex
@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
```

#### MultipleNegativesRankingLoss
```bibtex
@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```
