metadata
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:798
  - loss:MatryoshkaLoss
  - loss:MultipleNegativesRankingLoss
base_model: Snowflake/snowflake-arctic-embed-m
widget:
  - source_sentence: >-
      What is the definition of a sponsor-investigator according to the provided
      context?
    sentences:
      - >-
        § 312.47 Meetings.

        (a) General. Meetings between a sponsor and the agency are frequently
        useful in resolving questions and

        issues raised during the course of a clinical investigation. FDA
        encourages such meetings to the extent

        that they aid in the evaluation of the drug and in the solution of
        scientific problems concerning the drug, to

        the extent that FDA's resources permit. The general principle underlying
        the conduct of such meetings is
      - >-
        employees to conduct an investigation that it has initiated is a
        sponsor, not a sponsor-investigator, and

        the employees are investigators.

        Sponsor-Investigator means an individual who both initiates and conducts
        an investigation, and under whose

        immediate direction the investigational drug is administered or
        dispensed. The term does not include any

        person other than an individual. The requirements applicable to a
        sponsor-investigator under this part
      - >-
        practice regulations in part 58, or, if the study was not conducted in
        compliance with those

        regulations, a brief statement of the reason for the noncompliance.

        (9) Previous human experience with the investigational drug. A summary
        of previous human experience

        known to the applicant, if any, with the investigational drug. The
        information is required to include

        the following:

        (i) If the investigational drug has been investigated or marketed
        previously, either in the United
  - source_sentence: What is the primary purpose of Phase 1 studies in drug development?
    sentences:
      - |-
        § 312.53 Selecting investigators and monitors.
        § 312.54 Emergency research under § 50.24 of this chapter.
        § 312.55 Informing investigators.
        This content is from the eCFR and is authoritative but unofficial.
        21 CFR Part 312 (up to date as of 1/23/2025)
        Investigational New Drug Application 21 CFR Part 312 (Jan. 23, 2025)
        21 CFR Part 312 (Jan. 23, 2025) (enhanced display) page 1 of 54
      - >-
        relevant to the safety of the drug as are required under § 312.32. The
        sponsor shall make annual reports

        on the progress of the investigation in accordance with § 312.33.

        (d) A sponsor who determines that its investigational drug presents an
        unreasonable and significant risk to

        subjects shall discontinue those investigations that present the risk,
        notify FDA, all institutional review

        boards, and all investigators who have at any time participated in the
        investigation of the discontinuance,
      - >-
        are typically closely monitored and may be conducted in patients or
        normal volunteer subjects.

        These studies are designed to determine the metabolism and pharmacologic
        actions of the drug in

        humans, the side effects associated with increasing doses, and, if
        possible, to gain early evidence on

        effectiveness. During Phase 1, sufficient information about the drug's
        pharmacokinetics and

        pharmacological effects should be obtained to permit the design of
        well-controlled, scientifically
  - source_sentence: >-
      What is the required format for numbering submissions related to the
      investigation?
    sentences:
      - >-
        using a single, three-digit serial number. The initial IND is required
        to be numbered 000; each subsequent

        submission (e.g., amendment, report, or correspondence) is required to
        be numbered chronologically in

        sequence.

        (f) Identification of exception from informed consent. If the
        investigation involves an exception from informed

        consent under § 50.24 of this chapter, the sponsor shall prominently
        identify on the cover sheet that the
      - >-
        response time, a sponsor may not proceed with a clinical trial on which
        a clinical hold has been imposed

        until the sponsor has been notified by FDA that the hold has been
        lifted.

        (f) Appeal. If the sponsor disagrees with the reasons cited for the
        clinical hold, the sponsor may request

        reconsideration of the decision in accordance with § 312.48.

        (g) Conversion of IND on clinical hold to inactive status. If all
        investigations covered by an IND remain on
      - >-
        investigator, the sponsor of any investigation in which the investigator
        has been named as a participant,

        and the reviewing institutional review boards (IRBs) that the
        investigator is not eligible to receive test

        articles under this part. The notification to the investigator, sponsor,
        and IRBs will provide a statement of

        21 CFR Part 312 (up to date as of 1/23/2025)

        Investigational New Drug Application 21 CFR 312.66

        21 CFR 312.70(b) (enhanced display) page 37 of 54
  - source_sentence: What are the regions mentioned in the context where drugs can be exported?
    sentences:
      - >-
        Africa, or to any country in the European Union or the European Economic
        Area, and complies with

        the laws of the country to which it is being exported, the applicable
        provisions of section 802(c), (f),

        and (g) of the act, and § 1.101 of this chapter. Drugs exported under
        this paragraph that are not the

        subject of an IND are exempt from the label requirement in § 312.6(a);
        or

        (4) Except as provided in paragraph (b)(5) of this section, the person
        exporting the drug sends an email
      - >-
        before its implementation. Protocol amendments to add a new investigator
        or to provide additional

        information about investigators may be grouped and submitted at 30-day
        intervals. When several

        submissions of new protocols or protocol changes are anticipated during
        a short period, the sponsor is

        encouraged, to the extent feasible, to include these all in a single
        submission.

        21 CFR Part 312 (up to date as of 1/23/2025)

        Investigational New Drug Application 21 CFR 312.30(b)(2)(i)(b)
      - >-
        that apply to specific types of expanded access are described in §§
        312.310 through 312.320.

        (a) Scope. This subpart contains the requirements for the use of
        investigational new drugs and approved

        drugs where availability is limited by a risk evaluation and mitigation
        strategy (REMS) when the primary

        purpose is to diagnose, monitor, or treat a patient's disease or
        condition. The aim of this subpart is to
  - source_sentence: >-
      What regulatory framework does 21 CFR Part 312 pertain to as of January
      23, 2025?
    sentences:
      - >-
        risk-benefit judgment in making the final decision on approvability. As
        part of this evaluation, consistent

        with the statement of purpose in § 312.80, FDA will consider whether the
        benefits of the drug outweigh

        the known and potential risks of the drug and the need to answer
        remaining questions about risks and

        benefits of the drug, taking into consideration the severity of the
        disease and the absence of satisfactory

        alternative therapy.
      - >-
        provide for disposition of the unused supplies of the drug under §
        312.59.

        (b) Case histories. An investigator is required to prepare and maintain
        adequate and accurate case histories

        that record all observations and other data pertinent to the
        investigation on each individual administered

        the investigational drug or employed as a control in the investigation.
        Case histories include the case

        report forms and supporting data including, for example, signed and
        dated consent forms and medical
      - |-
        § 312.315 Intermediate-size patient populations.
        21 CFR Part 312 (up to date as of 1/23/2025)
        Investigational New Drug Application 21 CFR Part 312 (Jan. 23, 2025)
        21 CFR Part 312 (Jan. 23, 2025) (enhanced display) page 2 of 54
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
  - cosine_accuracy@1
  - cosine_accuracy@3
  - cosine_accuracy@5
  - cosine_accuracy@10
  - cosine_precision@1
  - cosine_precision@3
  - cosine_precision@5
  - cosine_precision@10
  - cosine_recall@1
  - cosine_recall@3
  - cosine_recall@5
  - cosine_recall@10
  - cosine_ndcg@10
  - cosine_mrr@10
  - cosine_map@100
model-index:
  - name: SentenceTransformer based on Snowflake/snowflake-arctic-embed-m
    results:
      - task:
          type: information-retrieval
          name: Information Retrieval
        dataset:
          name: Unknown
          type: unknown
        metrics:
          - type: cosine_accuracy@1
            value: 0.92
            name: Cosine Accuracy@1
          - type: cosine_accuracy@3
            value: 0.99
            name: Cosine Accuracy@3
          - type: cosine_accuracy@5
            value: 0.99
            name: Cosine Accuracy@5
          - type: cosine_accuracy@10
            value: 1
            name: Cosine Accuracy@10
          - type: cosine_precision@1
            value: 0.92
            name: Cosine Precision@1
          - type: cosine_precision@3
            value: 0.33000000000000007
            name: Cosine Precision@3
          - type: cosine_precision@5
            value: 0.19799999999999998
            name: Cosine Precision@5
          - type: cosine_precision@10
            value: 0.09999999999999998
            name: Cosine Precision@10
          - type: cosine_recall@1
            value: 0.92
            name: Cosine Recall@1
          - type: cosine_recall@3
            value: 0.99
            name: Cosine Recall@3
          - type: cosine_recall@5
            value: 0.99
            name: Cosine Recall@5
          - type: cosine_recall@10
            value: 1
            name: Cosine Recall@10
          - type: cosine_ndcg@10
            value: 0.9637992620139386
            name: Cosine Ndcg@10
          - type: cosine_mrr@10
            value: 0.9516666666666665
            name: Cosine Mrr@10
          - type: cosine_map@100
            value: 0.9516666666666667
            name: Cosine Map@100

SentenceTransformer based on Snowflake/snowflake-arctic-embed-m

This is a sentence-transformers model fine-tuned from Snowflake/snowflake-arctic-embed-m. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more. The training data consists of 798 question-passage pairs drawn from 21 CFR Part 312 (Investigational New Drug Application).

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: Snowflake/snowflake-arctic-embed-m
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
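
The three modules above amount to: run the BERT encoder, keep the vector at the [CLS] position (`pooling_mode_cls_token` is the only pooling mode set to True), then L2-normalize it. The following is a minimal sketch of the last two steps in plain Python. The 4-dimensional token vectors are made-up stand-ins for the real 768-dimensional encoder output, and the helper names `cls_pool` and `l2_normalize` are illustrative, not part of the library.

```python
import math

def cls_pool(token_embeddings):
    """Return the first token's vector, i.e. the [CLS] position."""
    return list(token_embeddings[0])

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (left unchanged if all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)

# Toy stand-in for the encoder output: 3 tokens x 4 dimensions
token_embeddings = [
    [3.0, 0.0, 4.0, 0.0],  # [CLS]
    [1.0, 1.0, 1.0, 1.0],
    [0.5, 2.0, 0.0, 1.0],
]

sentence_embedding = l2_normalize(cls_pool(token_embeddings))
print(sentence_embedding)
```

Only the first row contributes to the sentence embedding; the other token vectors are ignored by CLS pooling.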

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("philipk22/ind312-ft-v0")
# Run inference
sentences = [
    'What regulatory framework does 21 CFR Part 312 pertain to as of January 23, 2025?',
    '§ 312.315 Intermediate-size patient populations.\n21 CFR Part 312 (up to date as of 1/23/2025)\nInvestigational New Drug Application 21 CFR Part 312 (Jan. 23, 2025)\n21 CFR Part 312 (Jan. 23, 2025) (enhanced display) page 2 of 54',
    'risk-benefit judgment in making the final decision on approvability. As part of this evaluation, consistent\nwith the statement of purpose in § 312.80, FDA will consider whether the benefits of the drug outweigh\nthe known and potential risks of the drug and the need to answer remaining questions about risks and\nbenefits of the drug, taking into consideration the severity of the disease and the absence of satisfactory\nalternative therapy.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
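
Because the architecture ends in a Normalize() module, the embeddings have unit length, so cosine similarity reduces to a dot product. The sketch below ranks passages for a query on that basis. The 3-dimensional vectors are toy stand-ins for `model.encode(...)` output, and `rank_passages` is a hypothetical helper, not a library function.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def rank_passages(query_vec, passage_vecs):
    """Return (passage_index, score) pairs sorted by descending similarity."""
    scored = [(i, dot(query_vec, p)) for i, p in enumerate(passage_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy unit-length vectors standing in for real embeddings
query = [1.0, 0.0, 0.0]
passages = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [0.6, 0.8, 0.0],  # partially aligned
    [1.0, 0.0, 0.0],  # same direction as the query
]

ranking = rank_passages(query, passages)
print(ranking)
```

With real embeddings, the query and the passages would each be encoded once and the passage vectors cached for reuse across queries.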

Evaluation

Metrics

Information Retrieval

| Metric              | Value  |
|:--------------------|:-------|
| cosine_accuracy@1   | 0.92   |
| cosine_accuracy@3   | 0.99   |
| cosine_accuracy@5   | 0.99   |
| cosine_accuracy@10  | 1.0    |
| cosine_precision@1  | 0.92   |
| cosine_precision@3  | 0.33   |
| cosine_precision@5  | 0.198  |
| cosine_precision@10 | 0.1    |
| cosine_recall@1     | 0.92   |
| cosine_recall@3     | 0.99   |
| cosine_recall@5     | 0.99   |
| cosine_recall@10    | 1.0    |
| cosine_ndcg@10      | 0.9638 |
| cosine_mrr@10       | 0.9517 |
| cosine_map@100      | 0.9517 |
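
These figures are consistent with each query having exactly one relevant passage: recall@k equals accuracy@k, and precision@k equals accuracy@k divided by k (0.99 / 3 = 0.33, 0.99 / 5 = 0.198). That reading is an inference from the numbers, not something the card states. Under that assumption, the metrics can be sketched as follows; `retrieval_metrics` and the toy ranks are illustrative only and are not the evaluator used to produce the table.

```python
def retrieval_metrics(ranks, k):
    """Compute top-k metrics when each query has a single relevant passage.

    ranks: 1-based rank of the relevant passage per query, or None if it
    was not retrieved at all.
    """
    n = len(ranks)
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return {
        "accuracy": hits / n,         # share of queries with a hit in the top k
        "precision": hits / (n * k),  # at most one relevant item among k results
        "recall": hits / n,           # the single relevant item was found or not
        "mrr": sum(1.0 / r for r in ranks if r is not None and r <= k) / n,
    }

# Toy example: five queries, relevant passage found at these ranks
toy_ranks = [1, 1, 2, 1, 4]
metrics_at_3 = retrieval_metrics(toy_ranks, k=3)
print(metrics_at_3)
```

In the toy example the query whose relevant passage sits at rank 4 counts as a miss at k=3, which lowers accuracy, recall, and MRR together.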

Training Details

Training Dataset

Unnamed Dataset

  • Size: 798 training samples
  • Columns: sentence_0 and sentence_1
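  • Approximate statistics based on the first 798 samples:

    |         | sentence_0                                         | sentence_1                                          |
    |:--------|:---------------------------------------------------|:----------------------------------------------------|
    | type    | string                                             | string                                              |
    | details | min: 12 tokens, mean: 20.82 tokens, max: 46 tokens | min: 19 tokens, mean: 93.06 tokens, max: 158 tokens |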
  • Samples:
    | sentence_0 | sentence_1 |
    |:-----------|:-----------|
    | What is the scope of Part 312 in Title 21 regarding investigational new drug applications? | Title 21 —Food and Drugs<br>Chapter I —Food and Drug Administration, Department of Health and Human Services<br>Subchapter D —Drugs for Human Use<br>Part 312 Investigational New Drug Application<br>Subpart A General Provisions<br>§ 312.1 Scope.<br>§ 312.2 Applicability.<br>§ 312.3 Definitions and interpretations.<br>§ 312.6 Labeling of an investigational new drug.<br>§ 312.7 Promotion of investigational drugs.<br>§ 312.8 Charging for investigational drugs under an IND.<br>§ 312.10 Waivers. |
    | How does § 3126 address the labeling requirements for investigational new drugs? | Title 21 —Food and Drugs<br>Chapter I —Food and Drug Administration, Department of Health and Human Services<br>Subchapter D —Drugs for Human Use<br>Part 312 Investigational New Drug Application<br>Subpart A General Provisions<br>§ 312.1 Scope.<br>§ 312.2 Applicability.<br>§ 312.3 Definitions and interpretations.<br>§ 312.6 Labeling of an investigational new drug.<br>§ 312.7 Promotion of investigational drugs.<br>§ 312.8 Charging for investigational drugs under an IND.<br>§ 312.10 Waivers. |
    | What are the general principles outlined in § 31222 regarding the IND submission? | § 312.10 Waivers.<br>Subpart B Investigational New Drug Application (IND)<br>§ 312.20 Requirement for an IND.<br>§ 312.21 Phases of an investigation.<br>§ 312.22 General principles of the IND submission.<br>§ 312.23 IND content and format.<br>§ 312.30 Protocol amendments.<br>§ 312.31 Information amendments.<br>§ 312.32 IND safety reporting.<br>§ 312.33 Annual reports.<br>§ 312.38 Withdrawal of an IND.<br>Subpart C Administrative Actions |
  • Loss: MatryoshkaLoss with these parameters:
    {
        "loss": "MultipleNegativesRankingLoss",
        "matryoshka_dims": [
            768,
            512,
            256,
            128,
            64
        ],
        "matryoshka_weights": [
            1,
            1,
            1,
            1,
            1
        ],
        "n_dims_per_step": -1
    }
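
With these parameters, the base loss is evaluated on the first 768, 512, 256, 128, and 64 embedding dimensions and the results are summed with equal weights; `n_dims_per_step: -1` means every listed dimension is used at each step. The sketch below illustrates that idea in plain Python on a toy batch. It is not the library's implementation: the helper names are hypothetical, and the cosine similarity and `scale=20.0` are assumed here to mirror the library defaults for MultipleNegativesRankingLoss.

```python
import math

def normalize(v):
    """Scale a vector to unit length (left unchanged if all zeros)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)

def cos_sim(a, b):
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))

def mnr_loss(anchors, positives, scale=20.0):
    """Cross-entropy where positives[i] is the target for anchors[i] and the
    other positives in the batch act as negatives."""
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [scale * cos_sim(a, p) for p in positives]
        m = max(logits)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)

def matryoshka_loss(anchors, positives, dims, weights):
    """Weighted sum of the base loss over truncated embedding prefixes."""
    loss = 0.0
    for d, w in zip(dims, weights):
        loss += w * mnr_loss([a[:d] for a in anchors],
                             [p[:d] for p in positives])
    return loss

# Toy batch: two (question, passage) pairs with 4-dimensional embeddings
anchors = [[1.0, 0.0, 0.2, 0.1], [0.0, 1.0, 0.1, 0.3]]
positives = [[0.9, 0.1, 0.0, 0.2], [0.1, 0.8, 0.3, 0.0]]

loss = matryoshka_loss(anchors, positives, dims=[4, 2], weights=[1, 1])
print(round(loss, 4))
```

Training this way is what makes truncated embeddings usable at inference time: keeping only the first 64 to 512 dimensions (and re-normalizing) trades some accuracy for smaller indexes. Recent Sentence Transformers releases expose a `truncate_dim` argument on `SentenceTransformer` for this purpose.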
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 10
  • per_device_eval_batch_size: 10
  • num_train_epochs: 10
  • multi_dataset_batch_sampler: round_robin

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 10
  • per_device_eval_batch_size: 10
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1
  • num_train_epochs: 10
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.0
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: round_robin

Training Logs

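| Epoch | Step | Training Loss | cosine_ndcg@10 |
|:------|:-----|:--------------|:---------------|
| 0.625 | 50   | -             | 0.9091         |
| 1.0   | 80   | -             | 0.9209         |
| 1.25  | 100  | -             | 0.9329         |
| 1.875 | 150  | -             | 0.9439         |
| 2.0   | 160  | -             | 0.9379         |
| 2.5   | 200  | -             | 0.9367         |
| 3.0   | 240  | -             | 0.9459         |
| 3.125 | 250  | -             | 0.9432         |
| 3.75  | 300  | -             | 0.9479         |
| 4.0   | 320  | -             | 0.9515         |
| 4.375 | 350  | -             | 0.9509         |
| 5.0   | 400  | -             | 0.9581         |
| 5.625 | 450  | -             | 0.9551         |
| 6.0   | 480  | -             | 0.9604         |
| 6.25  | 500  | 0.3078        | 0.9577         |
| 6.875 | 550  | -             | 0.9651         |
| 7.0   | 560  | -             | 0.9651         |
| 7.5   | 600  | -             | 0.9641         |
| 8.0   | 640  | -             | 0.9641         |
| 8.125 | 650  | -             | 0.9638         |
| 8.75  | 700  | -             | 0.9638         |
| 9.0   | 720  | -             | 0.9638         |
| 9.375 | 750  | -             | 0.9601         |
| 10.0  | 800  | -             | 0.9638         |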
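
The epoch and step columns follow from the dataset and batch sizes listed earlier: 798 samples at a batch size of 10 give 80 steps per epoch (the last batch is partial, since `dataloader_drop_last` is False), and 10 epochs give 800 steps. The short check below reproduces that arithmetic, assuming a single device and `gradient_accumulation_steps: 1` as listed in the hyperparameters; the variable and function names are illustrative.

```python
import math

train_samples = 798   # training set size
batch_size = 10       # per_device_train_batch_size
epochs = 10           # num_train_epochs

# The final, partial batch still counts as a step
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs

def epoch_at(step):
    """Fractional epoch reached after a given optimizer step."""
    return step / steps_per_epoch

print(steps_per_epoch, total_steps, epoch_at(50))
# 80 800 0.625
```

The same relation explains the fractional epochs in the log, for example step 500 falling at epoch 6.25.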

Framework Versions

  • Python: 3.11.11
  • Sentence Transformers: 3.4.1
  • Transformers: 4.48.3
  • PyTorch: 2.5.1+cu124
  • Accelerate: 1.3.0
  • Datasets: 3.3.2
  • Tokenizers: 0.21.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}