---
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:156
  - loss:MatryoshkaLoss
  - loss:MultipleNegativesRankingLoss
base_model: Snowflake/snowflake-arctic-embed-l
widget:
  - source_sentence: How is the author planning to utilize prompts in their Datasette project?
    sentences:
      - >-
        January


        7th: It’s OK to call it Artificial Intelligence


        9th: What I should have said about the term Artificial Intelligence


        17th: Talking about Open Source LLMs on Oxide and Friends


        26th: LLM 0.13: The annotated release notes




        February


        21st: The killer app of Gemini Pro 1.5 is video




        March


        5th: Prompt injection and jailbreaking are not the same thing


        8th: The GPT-4 barrier has finally been broken


        22nd: Claude and ChatGPT for ad-hoc sidequests


        23rd: Building and testing C extensions for SQLite with ChatGPT Code
        Interpreter


        26th: llm cmd undo last git commit—a new plugin for LLM




        April


        8th: Building files-to-prompt entirely using Claude 3 Opus


        10th: Three major LLM releases in 24 hours (plus weeknotes)
      - >-
        Then in December, the Chatbot Arena team introduced a whole new
        leaderboard for this feature, driven by users building the same
        interactive app twice with two different models and voting on the
        answer. Hard to come up with a more convincing argument that this
        feature is now a commodity that can be effectively implemented against
        all of the leading models.

        I’ve been tinkering with a version of this myself for my Datasette
        project, with the goal of letting users use prompts to build and iterate
        on custom widgets and data visualizations against their own data. I also
        figured out a similar pattern for writing one-shot Python programs,
        enabled by uv.
      - >-
        Another common technique is to use larger models to help create training
        data for their smaller, cheaper alternatives—a trick used by an
        increasing number of labs. DeepSeek v3 used “reasoning” data created by
        DeepSeek-R1. Meta’s Llama 3.3 70B fine-tuning used over 25M
        synthetically generated examples.

        Careful design of the training data that goes into an LLM appears to be
        the entire game for creating these models. The days of just grabbing a
        full scrape of the web and indiscriminately dumping it into a training
        run are long gone.

        LLMs somehow got even harder to use
  - source_sentence: What are the potential pitfalls of using LLMs as power-user tools?
    sentences:
      - >-
        Another common technique is to use larger models to help create training
        data for their smaller, cheaper alternatives—a trick used by an
        increasing number of labs. DeepSeek v3 used “reasoning” data created by
        DeepSeek-R1. Meta’s Llama 3.3 70B fine-tuning used over 25M
        synthetically generated examples.

        Careful design of the training data that goes into an LLM appears to be
        the entire game for creating these models. The days of just grabbing a
        full scrape of the web and indiscriminately dumping it into a training
        run are long gone.

        LLMs somehow got even harder to use
      - >-
        A drum I’ve been banging for a while is that LLMs are power-user
        tools—they’re chainsaws disguised as kitchen knives. They look
        deceptively simple to use—how hard can it be to type messages to a
        chatbot?—but in reality you need a huge depth of both understanding and
        experience to make the most of them and avoid their many pitfalls.

        If anything, this problem got worse in 2024.

        We’ve built computer systems you can talk to in human language, that
        will answer your questions and usually get them right! ... depending on
        the question, and how you ask it, and whether it’s accurately reflected
        in the undocumented and secret training set.
      - >-
        These abilities are just a few weeks old at this point, and I don’t
        think their impact has been fully felt yet. If you haven’t tried them
        out yet you really should.

        Both Gemini and OpenAI offer API access to these features as well.
        OpenAI started with a WebSocket API that was quite challenging to use,
        but in December they announced a new WebRTC API which is much easier to
        get started with. Building a web app that a user can talk to via voice
        is easy now!

        Prompt driven app generation is a commodity already

        This was possible with GPT-4 in 2023, but the value it provides became
        evident in 2024.
  - source_sentence: What challenges are associated with using LLMs in the year of slop?
    sentences:
      - >-
        So far, I think they’re a net positive. I’ve used them on a personal
        level to improve my productivity (and entertain myself) in all sorts of
        different ways. I think people who learn how to use them effectively can
        gain a significant boost to their quality of life.

        A lot of people are yet to be sold on their value! Some think their
        negatives outweigh their positives, some think they are all hot air, and
        some even think they represent an existential threat to humanity.

        They’re actually quite easy to build

        The most surprising thing we’ve learned about LLMs this year is that
        they’re actually quite easy to build.
      - |-
        The year of slop
        Synthetic training data works great
        LLMs somehow got even harder to use
        Knowledge is incredibly unevenly distributed
        LLMs need better criticism
        Everything tagged “llms” on my blog in 2024
      - >-
        Meta’s Llama 3.2 models deserve a special mention. They may not be GPT-4
        class, but at 1B and 3B sizes they punch massively above their weight. I
        run Llama 3.2 3B on my iPhone using the free MLC Chat iOS app and it’s a
        shockingly capable model for its tiny (<2GB) size. Try firing it up and
        asking it for “a plot outline of a Netflix Christmas movie where a data
        journalist falls in love with a local ceramacist”. Here’s what I got, at
        a respectable 20 tokens per second:
  - source_sentence: >-
      What capabilities does Google’s Gemini have regarding audio input and
      output?
    sentences:
      - >-
        There’s a flipside to this too: a lot of better informed people have
        sworn off LLMs entirely because they can’t see how anyone could benefit
        from a tool with so many flaws. The key skill in getting the most out of
        LLMs is learning to work with tech that is both inherently unreliable
        and incredibly powerful at the same time. This is a decidedly
        non-obvious skill to acquire!

        There is so much space for helpful education content here, but we need
        to do a lot better than outsourcing it all to AI grifters with
        bombastic Twitter threads.

        Knowledge is incredibly unevenly distributed

        Most people have heard of ChatGPT by now. How many have heard of Claude?
      - >-
        There’s still plenty to worry about with respect to the environmental
        impact of the great AI datacenter buildout, but a lot of the concerns
        over the energy cost of individual prompts are no longer credible.

        Here’s a fun napkin calculation: how much would it cost to generate
        short descriptions of every one of the 68,000 photos in my personal
        photo library using Google’s Gemini 1.5 Flash 8B (released in October),
        their cheapest model?

        Each photo would need 260 input tokens and around 100 output tokens.

        260 * 68,000 = 17,680,000 input tokens

        17,680,000 * $0.0375/million = $0.66

        100 * 68,000 = 6,800,000 output tokens

        6,800,000 * $0.15/million = $1.02
      - >-
        Your browser does not support the audio element.


        OpenAI aren’t the only group with a multi-modal audio model. Google’s
        Gemini also accepts audio input, and the Google Gemini apps can speak in
        a similar way to ChatGPT now. Amazon also pre-announced voice mode for
        Amazon Nova, but that’s meant to roll out in Q1 of 2025.

        Google’s NotebookLM, released in September, took audio output to a new
        level by producing spookily realistic conversations between two “podcast
        hosts” about anything you fed into their tool. They later added custom
        instructions, so naturally I turned them into pelicans:



        Your browser does not support the audio element.
  - source_sentence: >-
      What improvements were noted in the intonation of ChatGPT Advanced Voice
      mode during its rollout?
    sentences:
      - >-
        When ChatGPT Advanced Voice mode finally did roll out (a slow roll from
        August through September) it was spectacular. I’ve been using it
        extensively on walks with my dog and it’s amazing how much the
        improvement in intonation elevates the material. I’ve also had a lot of
        fun experimenting with the OpenAI audio APIs.

        Even more fun: Advanced Voice mode can do accents! Here’s what happened
        when I told it I need you to pretend to be a California brown pelican
        with a very thick Russian accent, but you talk to me exclusively in
        Spanish.
      - >-
        When @v0 first came out we were paranoid about protecting the prompt
        with all kinds of pre and post processing complexity.

        We completely pivoted to let it rip. A prompt without the evals, models,
        and especially UX is like getting a broken ASML machine without a manual
      - >-
        January


        7th: It’s OK to call it Artificial Intelligence


        9th: What I should have said about the term Artificial Intelligence


        17th: Talking about Open Source LLMs on Oxide and Friends


        26th: LLM 0.13: The annotated release notes




        February


        21st: The killer app of Gemini Pro 1.5 is video




        March


        5th: Prompt injection and jailbreaking are not the same thing


        8th: The GPT-4 barrier has finally been broken


        22nd: Claude and ChatGPT for ad-hoc sidequests


        23rd: Building and testing C extensions for SQLite with ChatGPT Code
        Interpreter


        26th: llm cmd undo last git commit—a new plugin for LLM




        April


        8th: Building files-to-prompt entirely using Claude 3 Opus


        10th: Three major LLM releases in 24 hours (plus weeknotes)
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
  - cosine_accuracy@1
  - cosine_accuracy@3
  - cosine_accuracy@5
  - cosine_accuracy@10
  - cosine_precision@1
  - cosine_precision@3
  - cosine_precision@5
  - cosine_precision@10
  - cosine_recall@1
  - cosine_recall@3
  - cosine_recall@5
  - cosine_recall@10
  - cosine_ndcg@10
  - cosine_mrr@10
  - cosine_map@100
model-index:
  - name: SentenceTransformer based on Snowflake/snowflake-arctic-embed-l
    results:
      - task:
          type: information-retrieval
          name: Information Retrieval
        dataset:
          name: Unknown
          type: unknown
        metrics:
          - type: cosine_accuracy@1
            value: 0.75
            name: Cosine Accuracy@1
          - type: cosine_accuracy@3
            value: 1
            name: Cosine Accuracy@3
          - type: cosine_accuracy@5
            value: 1
            name: Cosine Accuracy@5
          - type: cosine_accuracy@10
            value: 1
            name: Cosine Accuracy@10
          - type: cosine_precision@1
            value: 0.75
            name: Cosine Precision@1
          - type: cosine_precision@3
            value: 0.3333333333333333
            name: Cosine Precision@3
          - type: cosine_precision@5
            value: 0.20000000000000004
            name: Cosine Precision@5
          - type: cosine_precision@10
            value: 0.10000000000000002
            name: Cosine Precision@10
          - type: cosine_recall@1
            value: 0.75
            name: Cosine Recall@1
          - type: cosine_recall@3
            value: 1
            name: Cosine Recall@3
          - type: cosine_recall@5
            value: 1
            name: Cosine Recall@5
          - type: cosine_recall@10
            value: 1
            name: Cosine Recall@10
          - type: cosine_ndcg@10
            value: 0.8968216255952429
            name: Cosine Ndcg@10
          - type: cosine_mrr@10
            value: 0.861111111111111
            name: Cosine Mrr@10
          - type: cosine_map@100
            value: 0.8611111111111112
            name: Cosine Map@100
---

SentenceTransformer based on Snowflake/snowflake-arctic-embed-l

This is a sentence-transformers model finetuned from Snowflake/snowflake-arctic-embed-l. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: Snowflake/snowflake-arctic-embed-l
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 1024 dimensions
  • Similarity Function: Cosine Similarity

Model Sources

  • Documentation: Sentence Transformers Documentation (https://www.sbert.net)
  • Repository: sentence-transformers on GitHub (https://github.com/UKPLab/sentence-transformers)
  • Hugging Face: Sentence Transformers on the Hugging Face Hub (https://huggingface.co/models?library=sentence-transformers)

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
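
The stack above is a BERT encoder whose CLS token serves as the sentence embedding, followed by L2 normalization. As a rough illustration of what the three modules do, here is a hand-rolled sketch using the plain transformers API; the SentenceTransformer usage below is the supported path, and this assumes the checkpoint exposes its underlying BertModel weights, as sentence-transformers repositories normally do:

from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

tokenizer = AutoTokenizer.from_pretrained("ngiometti/legal-ft-2")
encoder = AutoModel.from_pretrained("ngiometti/legal-ft-2")

# Module (0): tokenize and encode, truncating at 512 tokens
batch = tokenizer(["example sentence"], padding=True, truncation=True,
                  max_length=512, return_tensors="pt")
with torch.no_grad():
    hidden = encoder(**batch).last_hidden_state  # [batch, seq_len, 1024]
cls = hidden[:, 0]                        # Module (1): CLS-token pooling
embeddings = F.normalize(cls, p=2, dim=1) # Module (2): unit-length vectors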

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("ngiometti/legal-ft-2")
# Run inference
sentences = [
    'What improvements were noted in the intonation of ChatGPT Advanced Voice mode during its rollout?',
    'When ChatGPT Advanced Voice mode finally did roll out (a slow roll from August through September) it was spectacular. I’ve been using it extensively on walks with my dog and it’s amazing how much the improvement in intonation elevates the material. I’ve also had a lot of fun experimenting with the OpenAI audio APIs.\nEven more fun: Advanced Voice mode can do accents! Here’s what happened when I told it I need you to pretend to be a California brown pelican with a very thick Russian accent, but you talk to me exclusively in Spanish.',
    'January\n\n7th: It’s OK to call it Artificial Intelligence\n\n9th: What I should have said about the term Artificial Intelligence\n\n17th: Talking about Open Source LLMs on Oxide and Friends\n\n26th: LLM 0.13: The annotated release notes\n\n\n\nFebruary\n\n21st: The killer app of Gemini Pro 1.5 is video\n\n\n\nMarch\n\n5th: Prompt injection and jailbreaking are not the same thing\n\n8th: The GPT-4 barrier has finally been broken\n\n22nd: Claude and ChatGPT for ad-hoc sidequests\n\n23rd: Building and testing C extensions for SQLite with ChatGPT Code Interpreter\n\n26th: llm cmd undo last git commit—a new plugin for LLM\n\n\n\nApril\n\n8th: Building files-to-prompt entirely using Claude 3 Opus\n\n10th: Three major LLM releases in 24 hours (plus weeknotes)',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
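
Because training used MatryoshkaLoss (see Training Details below), the embeddings are likely still useful when truncated to a shorter prefix. A sketch using the truncate_dim argument available in recent sentence-transformers releases; 256 here is one of the Matryoshka training dimensions:

from sentence_transformers import SentenceTransformer

# Load the model with embeddings truncated to 256 dimensions
model = SentenceTransformer("ngiometti/legal-ft-2", truncate_dim=256)
embeddings = model.encode(["What is the killer app of Gemini Pro 1.5?"])
print(embeddings.shape)
# [1, 256]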

Evaluation

Metrics

Information Retrieval

Metric Value
cosine_accuracy@1 0.75
cosine_accuracy@3 1.0
cosine_accuracy@5 1.0
cosine_accuracy@10 1.0
cosine_precision@1 0.75
cosine_precision@3 0.3333
cosine_precision@5 0.2
cosine_precision@10 0.1
cosine_recall@1 0.75
cosine_recall@3 1.0
cosine_recall@5 1.0
cosine_recall@10 1.0
cosine_ndcg@10 0.8968
cosine_mrr@10 0.8611
cosine_map@100 0.8611
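
These are standard information-retrieval metrics computed over a held-out set of query/passage pairs. A minimal sketch of how figures like these can be reproduced with the InformationRetrievalEvaluator from sentence-transformers; the queries, corpus, and relevant_docs mappings below are hypothetical stand-ins for the real evaluation split:

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("ngiometti/legal-ft-2")

# Hypothetical ids: each query id maps to the set of relevant corpus ids
queries = {"q1": "What improvements were noted in the intonation of ChatGPT Advanced Voice mode?"}
corpus = {"d1": "When ChatGPT Advanced Voice mode finally did roll out it was spectacular. ..."}
relevant_docs = {"q1": {"d1"}}

evaluator = InformationRetrievalEvaluator(
    queries=queries, corpus=corpus, relevant_docs=relevant_docs
)
results = evaluator(model)
print(results["cosine_ndcg@10"])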

Training Details

Training Dataset

Unnamed Dataset

  • Size: 156 training samples
  • Columns: sentence_0 and sentence_1
  • Approximate statistics based on the first 156 samples:
    sentence_0: string, min: 14 tokens, mean: 20.31 tokens, max: 36 tokens
    sentence_1: string, min: 43 tokens, mean: 130.44 tokens, max: 204 tokens
  • Samples:

    sentence_0: What are some potential applications of Large Language Models (LLMs) mentioned in the context?
    sentence_1:
      Large Language Models
      They’re actually quite easy to build
      You can run LLMs on your own devices
      Hobbyists can build their own fine-tuned models
      We don’t yet know how to build GPT-4
      Vibes Based Development
      LLMs are really smart, and also really, really dumb
      Gullibility is the biggest unsolved problem
      Code may be the best application
      The ethics of this space remain diabolically complex
      My blog in 2023

    sentence_0: What is identified as the biggest unsolved problem related to LLMs?
    sentence_1:
      Large Language Models
      They’re actually quite easy to build
      You can run LLMs on your own devices
      Hobbyists can build their own fine-tuned models
      We don’t yet know how to build GPT-4
      Vibes Based Development
      LLMs are really smart, and also really, really dumb
      Gullibility is the biggest unsolved problem
      Code may be the best application
      The ethics of this space remain diabolically complex
      My blog in 2023

    sentence_0: What improvements were noted in the intonation of ChatGPT Advanced Voice mode during its rollout?
    sentence_1:
      When ChatGPT Advanced Voice mode finally did roll out (a slow roll from August through September) it was spectacular. I’ve been using it extensively on walks with my dog and it’s amazing how much the improvement in intonation elevates the material. I’ve also had a lot of fun experimenting with the OpenAI audio APIs.
      Even more fun: Advanced Voice mode can do accents! Here’s what happened when I told it I need you to pretend to be a California brown pelican with a very thick Russian accent, but you talk to me exclusively in Spanish.
  • Loss: MatryoshkaLoss with these parameters:
    {
        "loss": "MultipleNegativesRankingLoss",
        "matryoshka_dims": [
            768,
            512,
            256,
            128,
            64
        ],
        "matryoshka_weights": [
            1,
            1,
            1,
            1,
            1
        ],
        "n_dims_per_step": -1
    }
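
In other words, MultipleNegativesRankingLoss (which treats each paired passage as the positive and the rest of the batch as negatives) is applied at five truncated embedding sizes, with the per-dimension losses summed at equal weight. A sketch of the equivalent construction, assuming the base model has just been loaded:

from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

model = SentenceTransformer("Snowflake/snowflake-arctic-embed-l")

# Inner loss: in-batch negatives ranking. Outer loss: re-apply it at
# each truncated dimension so shorter prefixes stay useful on their own.
inner = MultipleNegativesRankingLoss(model)
loss = MatryoshkaLoss(model, inner, matryoshka_dims=[768, 512, 256, 128, 64])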
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 10
  • per_device_eval_batch_size: 10
  • num_train_epochs: 10
  • multi_dataset_batch_sampler: round_robin
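
A sketch of how these non-default values plug into the sentence-transformers 3.x trainer; output_dir, train_dataset, and loss are illustrative placeholders rather than the exact training script:

from sentence_transformers import (
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)

args = SentenceTransformerTrainingArguments(
    output_dir="legal-ft-2",                   # illustrative path
    eval_strategy="steps",
    per_device_train_batch_size=10,
    per_device_eval_batch_size=10,
    num_train_epochs=10,
    multi_dataset_batch_sampler="round_robin",
)

trainer = SentenceTransformerTrainer(
    model=model,                  # the base model being fine-tuned
    args=args,
    train_dataset=train_dataset,  # the 156 (sentence_0, sentence_1) pairs
    loss=loss,                    # the MatryoshkaLoss shown above
)
trainer.train()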

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 10
  • per_device_eval_batch_size: 10
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1
  • num_train_epochs: 10
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.0
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: round_robin

Training Logs

Epoch Step cosine_ndcg@10
1.0 16 0.9122
2.0 32 0.9093
3.0 48 0.8968
3.125 50 0.8968
4.0 64 0.8939
5.0 80 0.8908
6.0 96 0.8908
6.25 100 0.8908
7.0 112 0.8939
8.0 128 0.8968
9.0 144 0.8968
9.375 150 0.8968
10.0 160 0.8968

Framework Versions

  • Python: 3.13.1
  • Sentence Transformers: 3.4.1
  • Transformers: 4.48.3
  • PyTorch: 2.6.0+cu124
  • Accelerate: 1.3.0
  • Datasets: 3.2.0
  • Tokenizers: 0.21.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}