ModernBERT Embed base Legal Matryoshka

This is a sentence-transformers model fine-tuned from nomic-ai/nomic-embed-text-v2-moe on the json dataset (legal passage-question pairs). It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: nomic-ai/nomic-embed-text-v2-moe
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity
  • Training Dataset:
    • json
  • Language: en
  • License: apache-2.0

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: NomicBertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
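
The module stack above amounts to mean pooling of the contextual token embeddings followed by L2 normalization. A minimal sketch of the equivalent operation in plain PyTorch (the helper name is illustrative, not part of the library):

import torch

def mean_pool_and_normalize(token_embeddings, attention_mask):
    # token_embeddings: (batch, seq_len, 768); attention_mask: (batch, seq_len)
    mask = attention_mask.unsqueeze(-1).float()
    summed = (token_embeddings * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    mean_pooled = summed / counts                                    # Pooling(mean) module
    return torch.nn.functional.normalize(mean_pooled, p=2, dim=1)   # Normalize() module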

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("tsss1/modernbert-embed-base-legal-matryoshka-2")
# Run inference
sentences = [
    'against six federal agencies pursuant to the Freedom of Information Act (“FOIA”), 5 U.S.C. \n§ 552, claiming that the defendant agencies have violated the FOIA in numerous ways.1  NSC’s \nclaims run the gamut, including challenges to: the withholding of specific information; the \nadequacy of the agencies’ search efforts; the refusal to process FOIA requests; the refusal to',
    'How many federal agencies is the action against?',
    'Which case was quoted in Entertainment Ltd. v. U.S. Dep’t of Interior regarding the retroactivity of statutes?',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
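
Because the model was trained with MatryoshkaLoss, its embeddings can be truncated to the smaller trained dimensions (512, 256, 128, or 64) with only a modest drop in retrieval quality (see the evaluation table below). A minimal sketch using the truncate_dim argument of recent Sentence Transformers releases:

from sentence_transformers import SentenceTransformer

# Load the model truncated to one of the trained Matryoshka dimensions
model = SentenceTransformer(
    "tsss1/modernbert-embed-base-legal-matryoshka-2",
    truncate_dim=256,
)

embeddings = model.encode(["How many federal agencies is the action against?"])
print(embeddings.shape)
# (1, 256)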

Evaluation

Metrics

Information Retrieval

Metric dim_768 dim_512 dim_256 dim_128 dim_64
cosine_accuracy@1 0.5533 0.5502 0.524 0.4621 0.3277
cosine_accuracy@3 0.6105 0.5997 0.5703 0.5209 0.3864
cosine_accuracy@5 0.7125 0.7002 0.6754 0.609 0.4791
cosine_accuracy@10 0.8083 0.7898 0.7682 0.6862 0.5641
cosine_precision@1 0.5533 0.5502 0.524 0.4621 0.3277
cosine_precision@3 0.5276 0.5219 0.4951 0.4456 0.322
cosine_precision@5 0.4127 0.4046 0.3889 0.3536 0.2677
cosine_precision@10 0.2502 0.243 0.2391 0.213 0.1692
cosine_recall@1 0.1985 0.1989 0.1883 0.1656 0.1172
cosine_recall@3 0.5175 0.5138 0.4858 0.4364 0.3215
cosine_recall@5 0.6555 0.6434 0.6172 0.5608 0.4338
cosine_recall@10 0.7895 0.7696 0.7508 0.6692 0.5402
cosine_ndcg@10 0.6787 0.6665 0.6436 0.5742 0.4412
cosine_mrr@10 0.6103 0.6034 0.5769 0.5144 0.3815
cosine_map@100 0.6544 0.6473 0.6222 0.5623 0.4319
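
A minimal sketch of how these retrieval metrics can be reproduced with InformationRetrievalEvaluator; the query and document ids and texts below are placeholders, since the held-out evaluation split is not published with the card:

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("tsss1/modernbert-embed-base-legal-matryoshka-2", truncate_dim=256)

# Placeholder evaluation data: map query ids to questions, document ids to passages,
# and each query id to the set of relevant document ids
queries = {"q1": "How many federal agencies is the action against?"}
corpus = {"d1": "against six federal agencies pursuant to the Freedom of Information Act ..."}
relevant_docs = {"q1": {"d1"}}

evaluator = InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    name="dim_256",
)
results = evaluator(model)
print(results)  # includes cosine_ndcg@10, cosine_mrr@10, cosine_map@100, ...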

Training Details

Training Dataset

json

  • Dataset: json
  • Size: 5,822 training samples
  • Columns: positive and anchor
  • Approximate statistics based on the first 1000 samples:
    • positive: string; min 29 tokens, mean 94.33 tokens, max 156 tokens
    • anchor: string; min 8 tokens, mean 18.25 tokens, max 35 tokens
  • Samples:
    • positive: aspect” of “substantial independent authority.” Dong v. Smithsonian Inst., 125 F.3d 877, 881
      4 See CREW v. Office of Admin., 566 F.3d 219, 220 (D.C. Cir. 2009); Armstrong v. Exec. Office of the President, 90 F.3d 553, 558 (D.C. Cir. 1996); Sweetland v. Walters, 60 F.3d 852, 854
      anchor: What court circuit is mentioned in connection with the case Sweetland v. Walters?
    • positive: the entire list of remaining PQPs shifts up one position. Once GSA has verified, through the evaluation and validation process, the point totals claimed by the 100/80/70 highest-scoring offerors, GSA will cease evaluations and award IDIQ contracts to the successful, verified bidders. AR at 1114, 2154, 2645. If, after the evaluation
      anchor: What is the GSA responsible for verifying?
    • positive: Department components], to assist with the processing of [FOIA or Privacy Act] requests for purposes of administrative expediency and efficiency.” Third Walter Decl. ¶ 3. Indeed, the State Department’s declarant explains that these five State Department components, including DS, “conduct their own FOIA/Privacy Act reviews and respond directly to requesters,” despite
      anchor: What is the identified purpose for assisting with processing FOIA or Privacy Act requests?
  • Loss: MatryoshkaLoss with these parameters:
    {
        "loss": "MultipleNegativesRankingLoss",
        "matryoshka_dims": [
            768,
            512,
            256,
            128,
            64
        ],
        "matryoshka_weights": [
            1,
            1,
            1,
            1,
            1
        ],
        "n_dims_per_step": -1
    }
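
A minimal sketch of how this loss configuration can be constructed in Sentence Transformers; the trust_remote_code flag is an assumption needed to load the Nomic base model:

from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

model = SentenceTransformer("nomic-ai/nomic-embed-text-v2-moe", trust_remote_code=True)

# MultipleNegativesRankingLoss uses the other in-batch positives as negatives;
# MatryoshkaLoss applies it at each truncated dimension with equal weight
inner_loss = MultipleNegativesRankingLoss(model)
loss = MatryoshkaLoss(
    model,
    inner_loss,
    matryoshka_dims=[768, 512, 256, 128, 64],
    matryoshka_weights=[1, 1, 1, 1, 1],
)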
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: epoch
  • per_device_train_batch_size: 4
  • per_device_eval_batch_size: 2
  • gradient_accumulation_steps: 4
  • learning_rate: 2e-05
  • num_train_epochs: 2
  • lr_scheduler_type: cosine
  • warmup_ratio: 0.1
  • bf16: True
  • tf32: False
  • load_best_model_at_end: True
  • optim: adamw_torch_fused
  • batch_sampler: no_duplicates
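
A minimal sketch of reconstructing these settings with SentenceTransformerTrainingArguments (Sentence Transformers 3.x); the output_dir and save_strategy values are assumptions, the rest mirrors the list above:

from sentence_transformers import SentenceTransformerTrainingArguments
from sentence_transformers.training_args import BatchSamplers

args = SentenceTransformerTrainingArguments(
    output_dir="modernbert-embed-base-legal-matryoshka-2",  # assumed
    num_train_epochs=2,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    bf16=True,
    tf32=False,
    eval_strategy="epoch",
    save_strategy="epoch",  # assumed, required when load_best_model_at_end=True
    load_best_model_at_end=True,
    optim="adamw_torch_fused",
    batch_sampler=BatchSamplers.NO_DUPLICATES,
)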

All Hyperparameters

Click to expand
  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: epoch
  • prediction_loss_only: True
  • per_device_train_batch_size: 4
  • per_device_eval_batch_size: 2
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 4
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 2e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 2
  • max_steps: -1
  • lr_scheduler_type: cosine
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: True
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: False
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch_fused
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch Step Training Loss dim_768_cosine_ndcg@10 dim_512_cosine_ndcg@10 dim_256_cosine_ndcg@10 dim_128_cosine_ndcg@10 dim_64_cosine_ndcg@10
0.0549 10 2.6704 - - - - -
0.1099 20 1.7246 - - - - -
0.1648 30 1.3634 - - - - -
0.2198 40 1.0962 - - - - -
0.2747 50 0.8985 - - - - -
0.3297 60 0.8667 - - - - -
0.3846 70 0.7371 - - - - -
0.4396 80 1.038 - - - - -
0.4945 90 0.733 - - - - -
0.5495 100 0.9032 - - - - -
0.6044 110 0.7283 - - - - -
0.6593 120 0.6085 - - - - -
0.7143 130 0.5774 - - - - -
0.7692 140 0.6164 - - - - -
0.8242 150 0.8098 - - - - -
0.8791 160 0.6534 - - - - -
0.9341 170 0.6035 - - - - -
0.9890 180 0.5209 - - - - -
1.0 182 - 0.6911 0.6719 0.6341 0.5600 0.4203
1.0440 190 0.3718 - - - - -
1.0989 200 0.2309 - - - - -
1.1538 210 0.2128 - - - - -
1.2088 220 0.138 - - - - -
1.2637 230 0.1129 - - - - -
1.3187 240 0.0889 - - - - -
1.3736 250 0.0607 - - - - -
1.4286 260 0.1156 - - - - -
1.4835 270 0.0826 - - - - -
1.5385 280 0.098 - - - - -
1.5934 290 0.0891 - - - - -
1.6484 300 0.0451 - - - - -
1.7033 310 0.0581 - - - - -
1.7582 320 0.0722 - - - - -
1.8132 330 0.0785 - - - - -
1.8681 340 0.1407 - - - - -
1.9231 350 0.1022 - - - - -
1.9780 360 0.0771 - - - - -
2.0 364 - 0.6787 0.6665 0.6436 0.5742 0.4412
  • The saved checkpoint corresponds to the final row (epoch 2.0, step 364); its metrics match the evaluation table above.

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 3.3.1
  • Transformers: 4.47.0
  • PyTorch: 2.3.1+cu121
  • Accelerate: 1.2.1
  • Datasets: 3.3.1
  • Tokenizers: 0.21.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}