---
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:798
- loss:MatryoshkaLoss
- loss:MultipleNegativesRankingLoss
base_model: Snowflake/snowflake-arctic-embed-m
widget:
- source_sentence: What is the definition of a sponsor-investigator according to the
provided context?
sentences:
- '§ 312.47 Meetings.
(a) General. Meetings between a sponsor and the agency are frequently useful in
resolving questions and
issues raised during the course of a clinical investigation. FDA encourages such
meetings to the extent
that they aid in the evaluation of the drug and in the solution of scientific
problems concerning the drug, to
the extent that FDA''s resources permit. The general principle underlying the
conduct of such meetings is'
- 'employees to conduct an investigation that it has initiated is a sponsor, not
a sponsor-investigator, and
the employees are investigators.
Sponsor-Investigator means an individual who both initiates and conducts an investigation,
and under whose
immediate direction the investigational drug is administered or dispensed. The
term does not include any
person other than an individual. The requirements applicable to a sponsor-investigator
under this part'
- 'practice regulations in part 58, or, if the study was not conducted in compliance
with those
regulations, a brief statement of the reason for the noncompliance.
(9) Previous human experience with the investigational drug. A summary of previous
human experience
known to the applicant, if any, with the investigational drug. The information
is required to include
the following:
(i) If the investigational drug has been investigated or marketed previously,
either in the United'
- source_sentence: What is the primary purpose of Phase 1 studies in drug development?
sentences:
- '§ 312.53 Selecting investigators and monitors.
§ 312.54 Emergency research under § 50.24 of this chapter.
§ 312.55 Informing investigators.
This content is from the eCFR and is authoritative but unofficial.
21 CFR Part 312 (up to date as of 1/23/2025)
Investigational New Drug Application 21 CFR Part 312 (Jan. 23, 2025)
21 CFR Part 312 (Jan. 23, 2025) (enhanced display) page 1 of 54'
- 'relevant to the safety of the drug as are required under § 312.32. The sponsor
shall make annual reports
on the progress of the investigation in accordance with § 312.33.
(d) A sponsor who determines that its investigational drug presents an unreasonable
and significant risk to
subjects shall discontinue those investigations that present the risk, notify
FDA, all institutional review
boards, and all investigators who have at any time participated in the investigation
of the discontinuance,'
- 'are typically closely monitored and may be conducted in patients or normal volunteer
subjects.
These studies are designed to determine the metabolism and pharmacologic actions
of the drug in
humans, the side effects associated with increasing doses, and, if possible, to
gain early evidence on
effectiveness. During Phase 1, sufficient information about the drug''s pharmacokinetics
and
pharmacological effects should be obtained to permit the design of well-controlled,
scientifically'
- source_sentence: What is the required format for numbering submissions related to
the investigation?
sentences:
- 'using a single, three-digit serial number. The initial IND is required to be
numbered 000; each subsequent
submission (e.g., amendment, report, or correspondence) is required to be numbered
chronologically in
sequence.
(f) Identification of exception from informed consent. If the investigation involves
an exception from informed
consent under § 50.24 of this chapter, the sponsor shall prominently identify
on the cover sheet that the'
- 'response time, a sponsor may not proceed with a clinical trial on which a clinical
hold has been imposed
until the sponsor has been notified by FDA that the hold has been lifted.
(f) Appeal. If the sponsor disagrees with the reasons cited for the clinical hold,
the sponsor may request
reconsideration of the decision in accordance with § 312.48.
(g) Conversion of IND on clinical hold to inactive status. If all investigations
covered by an IND remain on'
- 'investigator, the sponsor of any investigation in which the investigator has
been named as a participant,
and the reviewing institutional review boards (IRBs) that the investigator is
not eligible to receive test
articles under this part. The notification to the investigator, sponsor, and IRBs
will provide a statement of
21 CFR Part 312 (up to date as of 1/23/2025)
Investigational New Drug Application 21 CFR 312.66
21 CFR 312.70(b) (enhanced display) page 37 of 54'
- source_sentence: What are the regions mentioned in the context where drugs can be
exported?
sentences:
- 'Africa, or to any country in the European Union or the European Economic Area,
and complies with
the laws of the country to which it is being exported, the applicable provisions
of section 802(c), (f),
and (g) of the act, and § 1.101 of this chapter. Drugs exported under this paragraph
that are not the
subject of an IND are exempt from the label requirement in § 312.6(a); or
(4) Except as provided in paragraph (b)(5) of this section, the person exporting
the drug sends an email'
- 'before its implementation. Protocol amendments to add a new investigator or to
provide additional
information about investigators may be grouped and submitted at 30-day intervals.
When several
submissions of new protocols or protocol changes are anticipated during a short
period, the sponsor is
encouraged, to the extent feasible, to include these all in a single submission.
21 CFR Part 312 (up to date as of 1/23/2025)
Investigational New Drug Application 21 CFR 312.30(b)(2)(i)(b)'
- 'that apply to specific types of expanded access are described in §§ 312.310 through
312.320.
(a) Scope. This subpart contains the requirements for the use of investigational
new drugs and approved
drugs where availability is limited by a risk evaluation and mitigation strategy
(REMS) when the primary
purpose is to diagnose, monitor, or treat a patient''s disease or condition. The
aim of this subpart is to'
- source_sentence: What regulatory framework does 21 CFR Part 312 pertain to as of
January 23, 2025?
sentences:
- 'risk-benefit judgment in making the final decision on approvability. As part
of this evaluation, consistent
with the statement of purpose in § 312.80, FDA will consider whether the benefits
of the drug outweigh
the known and potential risks of the drug and the need to answer remaining questions
about risks and
benefits of the drug, taking into consideration the severity of the disease and
the absence of satisfactory
alternative therapy.'
- 'provide for disposition of the unused supplies of the drug under § 312.59.
(b) Case histories. An investigator is required to prepare and maintain adequate
and accurate case histories
that record all observations and other data pertinent to the investigation on
each individual administered
the investigational drug or employed as a control in the investigation. Case histories
include the case
report forms and supporting data including, for example, signed and dated consent
forms and medical'
- '§ 312.315 Intermediate-size patient populations.
21 CFR Part 312 (up to date as of 1/23/2025)
Investigational New Drug Application 21 CFR Part 312 (Jan. 23, 2025)
21 CFR Part 312 (Jan. 23, 2025) (enhanced display) page 2 of 54'
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
- cosine_accuracy@1
- cosine_accuracy@3
- cosine_accuracy@5
- cosine_accuracy@10
- cosine_precision@1
- cosine_precision@3
- cosine_precision@5
- cosine_precision@10
- cosine_recall@1
- cosine_recall@3
- cosine_recall@5
- cosine_recall@10
- cosine_ndcg@10
- cosine_mrr@10
- cosine_map@100
model-index:
- name: SentenceTransformer based on Snowflake/snowflake-arctic-embed-m
  results:
  - task:
      type: information-retrieval
      name: Information Retrieval
    dataset:
      name: Unknown
      type: unknown
    metrics:
    - type: cosine_accuracy@1
      value: 0.92
      name: Cosine Accuracy@1
    - type: cosine_accuracy@3
      value: 0.99
      name: Cosine Accuracy@3
    - type: cosine_accuracy@5
      value: 0.99
      name: Cosine Accuracy@5
    - type: cosine_accuracy@10
      value: 1.0
      name: Cosine Accuracy@10
    - type: cosine_precision@1
      value: 0.92
      name: Cosine Precision@1
    - type: cosine_precision@3
      value: 0.33000000000000007
      name: Cosine Precision@3
    - type: cosine_precision@5
      value: 0.19799999999999998
      name: Cosine Precision@5
    - type: cosine_precision@10
      value: 0.09999999999999998
      name: Cosine Precision@10
    - type: cosine_recall@1
      value: 0.92
      name: Cosine Recall@1
    - type: cosine_recall@3
      value: 0.99
      name: Cosine Recall@3
    - type: cosine_recall@5
      value: 0.99
      name: Cosine Recall@5
    - type: cosine_recall@10
      value: 1.0
      name: Cosine Recall@10
    - type: cosine_ndcg@10
      value: 0.9637992620139386
      name: Cosine Ndcg@10
    - type: cosine_mrr@10
      value: 0.9516666666666665
      name: Cosine Mrr@10
    - type: cosine_map@100
      value: 0.9516666666666667
      name: Cosine Map@100
---
# SentenceTransformer based on Snowflake/snowflake-arctic-embed-m
This is a [sentence-transformers](https://www.SBERT.net) model fine-tuned from [Snowflake/snowflake-arctic-embed-m](https://huggingface.co/Snowflake/snowflake-arctic-embed-m) on 798 question-passage pairs built from 21 CFR Part 312 (Investigational New Drug Application). It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
## Model Details
### Model Description
- **Model Type:** Sentence Transformer
- **Base model:** [Snowflake/snowflake-arctic-embed-m](https://huggingface.co/Snowflake/snowflake-arctic-embed-m) <!-- at revision fc74610d18462d218e312aa986ec5c8a75a98152 -->
- **Maximum Sequence Length:** 512 tokens
- **Output Dimensionality:** 768 dimensions
- **Similarity Function:** Cosine Similarity
<!-- - **Training Dataset:** Unknown -->
<!-- - **Language:** Unknown -->
<!-- - **License:** Unknown -->
### Model Sources
- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
### Full Model Architecture
```
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Normalize()
)
```
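The last two modules above can be illustrated without the library. The following is a minimal sketch, assuming toy 4-dimensional token vectors in place of the real 768-dimensional BERT outputs; the helper names `cls_pool` and `l2_normalize` are hypothetical and are not part of Sentence Transformers.

```python
import math

def cls_pool(token_embeddings):
    """Use the first token's vector (the [CLS] position) as the sentence embedding."""
    return list(token_embeddings[0])

def l2_normalize(vector):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)

# Toy token embeddings: 3 tokens x 4 dimensions (assumed values for illustration).
tokens = [
    [3.0, 0.0, 4.0, 0.0],  # [CLS] token
    [1.0, 1.0, 1.0, 1.0],
    [0.5, 2.0, 0.0, 1.0],
]

sentence_embedding = l2_normalize(cls_pool(tokens))
print(sentence_embedding)
```

Because `pooling_mode_cls_token` is the only pooling mode enabled, the other token vectors do not contribute directly to the sentence embedding, and the `Normalize()` step means every output vector has unit length.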
## Usage
### Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
```bash
pip install -U sentence-transformers
```
Then you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("philipk22/ind312-ft-v0")
# Run inference
sentences = [
'What regulatory framework does 21 CFR Part 312 pertain to as of January 23, 2025?',
'§ 312.315 Intermediate-size patient populations.\n21 CFR Part 312 (up to date as of 1/23/2025)\nInvestigational New Drug Application 21 CFR Part 312 (Jan. 23, 2025)\n21 CFR Part 312 (Jan. 23, 2025) (enhanced display) page 2 of 54',
'risk-benefit judgment in making the final decision on approvability. As part of this evaluation, consistent\nwith the statement of purpose in § 312.80, FDA will consider whether the benefits of the drug outweigh\nthe known and potential risks of the drug and the need to answer remaining questions about risks and\nbenefits of the drug, taking into consideration the severity of the disease and the absence of satisfactory\nalternative therapy.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
```
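For retrieval, the usual next step is to rank candidate passages by their similarity to a query. The sketch below shows that ranking in plain Python on toy 3-dimensional vectors; `cosine` and `rank_passages` are hypothetical helpers, and the vectors are made-up stand-ins for real embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_passages(query_vec, passage_vecs):
    """Return (passage_index, score) pairs sorted from most to least similar."""
    scores = [cosine(query_vec, p) for p in passage_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order]

# Toy vectors (assumed values for illustration only).
query = [1.0, 0.0, 0.0]
passages = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 0.0],  # same direction as the query
    [1.0, 1.0, 0.0],  # 45 degrees from the query
]

ranking = rank_passages(query, passages)
for index, score in ranking:
    print(index, round(score, 4))
```

Since this model ends with a `Normalize()` module, its embeddings already have unit length, so the cosine similarity reduces to a plain dot product; the sketch computes the full cosine so that it also works for unnormalized vectors.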
## Evaluation
### Metrics
#### Information Retrieval
* Evaluated with [<code>InformationRetrievalEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator)
| Metric | Value |
|:--------------------|:-----------|
| cosine_accuracy@1 | 0.92 |
| cosine_accuracy@3 | 0.99 |
| cosine_accuracy@5 | 0.99 |
| cosine_accuracy@10 | 1.0 |
| cosine_precision@1 | 0.92 |
| cosine_precision@3 | 0.33 |
| cosine_precision@5 | 0.198 |
| cosine_precision@10 | 0.1 |
| cosine_recall@1 | 0.92 |
| cosine_recall@3 | 0.99 |
| cosine_recall@5 | 0.99 |
| cosine_recall@10 | 1.0 |
| **cosine_ndcg@10** | **0.9638** |
| cosine_mrr@10 | 0.9517 |
| cosine_map@100 | 0.9517 |
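
The relationships between these numbers (recall equal to accuracy, precision@k equal to accuracy@k divided by k) are what one would expect if every query has exactly one relevant passage. The sketch below computes the metrics under that assumption. The evaluation set is not described in this card, so the 100-query rank list is purely hypothetical, chosen only because it is consistent with the reported values; `retrieval_metrics` is likewise a hypothetical helper, not the `InformationRetrievalEvaluator` implementation.

```python
import math

def retrieval_metrics(ranks, k):
    """Top-k metrics when every query has exactly one relevant passage.

    ranks[i] is the 1-based position of that passage in the results for query i.
    """
    n = len(ranks)
    hit_rate = sum(1 for r in ranks if r <= k) / n
    return {
        "accuracy": hit_rate,        # fraction of queries with a hit in the top k
        "precision": hit_rate / k,   # at most one of the k results can be relevant
        "recall": hit_rate,          # the single relevant passage is found or not
        "mrr": sum(1.0 / r for r in ranks if r <= k) / n,
        # With one relevant passage the ideal DCG is 1, so nDCG equals DCG.
        "ndcg": sum(1.0 / math.log2(r + 1) for r in ranks if r <= k) / n,
    }

# HYPOTHETICAL rank distribution for 100 queries (an assumption, not real data).
ranks = [1] * 92 + [2] * 4 + [3] * 3 + [6] * 1

for k in (1, 3, 5, 10):
    metrics = retrieval_metrics(ranks, k)
    print(k, {name: round(value, 4) for name, value in metrics.items()})
```

Under this assumption, a precision@3 of 0.33 is not a sign of poor retrieval: it is simply 0.99 / 3, the best achievable value being 1/3 when only one passage per query is relevant.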
## Training Details
### Training Dataset
#### Unnamed Dataset
* Size: 798 training samples
* Columns: <code>sentence_0</code> and <code>sentence_1</code>
* Approximate statistics based on the first 798 samples:
| | sentence_0 | sentence_1 |
|:--------|:-----------------------------------------------------------------------------------|:------------------------------------------------------------------------------------|
| type | string | string |
| details | <ul><li>min: 12 tokens</li><li>mean: 20.82 tokens</li><li>max: 46 tokens</li></ul> | <ul><li>min: 19 tokens</li><li>mean: 93.06 tokens</li><li>max: 158 tokens</li></ul> |
* Samples:
| sentence_0 | sentence_1 |
|:--------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| <code>What is the scope of Part 312 in Title 21 regarding investigational new drug applications?</code> | <code>Title 21 —Food and Drugs<br>Chapter I —Food and Drug Administration, Department of Health and Human Services<br>Subchapter D —Drugs for Human Use<br>Part 312 Investigational New Drug Application<br>Subpart A General Provisions<br>§ 312.1 Scope.<br>§ 312.2 Applicability.<br>§ 312.3 Definitions and interpretations.<br>§ 312.6 Labeling of an investigational new drug.<br>§ 312.7 Promotion of investigational drugs.<br>§ 312.8 Charging for investigational drugs under an IND.<br>§ 312.10 Waivers.</code> |
| <code>How does § 3126 address the labeling requirements for investigational new drugs?</code> | <code>Title 21 —Food and Drugs<br>Chapter I —Food and Drug Administration, Department of Health and Human Services<br>Subchapter D —Drugs for Human Use<br>Part 312 Investigational New Drug Application<br>Subpart A General Provisions<br>§ 312.1 Scope.<br>§ 312.2 Applicability.<br>§ 312.3 Definitions and interpretations.<br>§ 312.6 Labeling of an investigational new drug.<br>§ 312.7 Promotion of investigational drugs.<br>§ 312.8 Charging for investigational drugs under an IND.<br>§ 312.10 Waivers.</code> |
| <code>What are the general principles outlined in § 31222 regarding the IND submission?</code> | <code>§ 312.10 Waivers.<br>Subpart B Investigational New Drug Application (IND)<br>§ 312.20 Requirement for an IND.<br>§ 312.21 Phases of an investigation.<br>§ 312.22 General principles of the IND submission.<br>§ 312.23 IND content and format.<br>§ 312.30 Protocol amendments.<br>§ 312.31 Information amendments.<br>§ 312.32 IND safety reporting.<br>§ 312.33 Annual reports.<br>§ 312.38 Withdrawal of an IND.<br>Subpart C Administrative Actions</code> |
* Loss: [<code>MatryoshkaLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#matryoshkaloss) with these parameters:
```json
{
"loss": "MultipleNegativesRankingLoss",
"matryoshka_dims": [
768,
512,
256,
128,
64
],
"matryoshka_weights": [
1,
1,
1,
1,
1
],
"n_dims_per_step": -1
}
```
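The configuration above combines two ideas: an in-batch-negatives ranking loss, applied to embeddings truncated to each Matryoshka dimension and summed with the given weights. The following is a minimal pure-Python sketch of that combination, not the library implementation. It assumes cosine similarity scaled by 20 and re-normalization after truncation; the toy 4-dimensional vectors, the dims `[4, 2]`, and the function names are all hypothetical.

```python
import math

def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def mnr_loss(anchors, positives, scale=20.0):
    """Cross-entropy over in-batch candidates: positives[i] is the target for anchors[i].

    Inputs are assumed to be unit vectors, so the dot product is the cosine similarity.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * sum(x * y for x, y in zip(anchor, p)) for p in positives]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)

def matryoshka_loss(anchors, positives, dims, weights):
    """Weighted sum of the ranking loss over truncated, re-normalized embeddings."""
    total = 0.0
    for dim, weight in zip(dims, weights):
        truncated_anchors = [normalize(v[:dim]) for v in anchors]
        truncated_positives = [normalize(v[:dim]) for v in positives]
        total += weight * mnr_loss(truncated_anchors, truncated_positives)
    return total

# Toy batch of two (question, passage) embedding pairs; values are made up.
questions = [[1.0, 0.0, 0.2, 0.0], [0.0, 1.0, 0.0, 0.2]]
passages = [[0.9, 0.1, 0.1, 0.0], [0.1, 0.9, 0.0, 0.1]]

matched = matryoshka_loss(questions, passages, dims=[4, 2], weights=[1, 1])
swapped = matryoshka_loss(questions, passages[::-1], dims=[4, 2], weights=[1, 1])
print(matched, swapped)
```

Training on every prefix length is what allows embeddings from this model to be truncated to 512, 256, 128, or 64 dimensions and re-normalized at inference time, trading some accuracy for smaller vectors.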
### Training Hyperparameters
#### Non-Default Hyperparameters
- `eval_strategy`: steps
- `per_device_train_batch_size`: 10
- `per_device_eval_batch_size`: 10
- `num_train_epochs`: 10
- `multi_dataset_batch_sampler`: round_robin
#### All Hyperparameters
<details><summary>Click to expand</summary>
- `overwrite_output_dir`: False
- `do_predict`: False
- `eval_strategy`: steps
- `prediction_loss_only`: True
- `per_device_train_batch_size`: 10
- `per_device_eval_batch_size`: 10
- `per_gpu_train_batch_size`: None
- `per_gpu_eval_batch_size`: None
- `gradient_accumulation_steps`: 1
- `eval_accumulation_steps`: None
- `torch_empty_cache_steps`: None
- `learning_rate`: 5e-05
- `weight_decay`: 0.0
- `adam_beta1`: 0.9
- `adam_beta2`: 0.999
- `adam_epsilon`: 1e-08
- `max_grad_norm`: 1
- `num_train_epochs`: 10
- `max_steps`: -1
- `lr_scheduler_type`: linear
- `lr_scheduler_kwargs`: {}
- `warmup_ratio`: 0.0
- `warmup_steps`: 0
- `log_level`: passive
- `log_level_replica`: warning
- `log_on_each_node`: True
- `logging_nan_inf_filter`: True
- `save_safetensors`: True
- `save_on_each_node`: False
- `save_only_model`: False
- `restore_callback_states_from_checkpoint`: False
- `no_cuda`: False
- `use_cpu`: False
- `use_mps_device`: False
- `seed`: 42
- `data_seed`: None
- `jit_mode_eval`: False
- `use_ipex`: False
- `bf16`: False
- `fp16`: False
- `fp16_opt_level`: O1
- `half_precision_backend`: auto
- `bf16_full_eval`: False
- `fp16_full_eval`: False
- `tf32`: None
- `local_rank`: 0
- `ddp_backend`: None
- `tpu_num_cores`: None
- `tpu_metrics_debug`: False
- `debug`: []
- `dataloader_drop_last`: False
- `dataloader_num_workers`: 0
- `dataloader_prefetch_factor`: None
- `past_index`: -1
- `disable_tqdm`: False
- `remove_unused_columns`: True
- `label_names`: None
- `load_best_model_at_end`: False
- `ignore_data_skip`: False
- `fsdp`: []
- `fsdp_min_num_params`: 0
- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- `fsdp_transformer_layer_cls_to_wrap`: None
- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- `deepspeed`: None
- `label_smoothing_factor`: 0.0
- `optim`: adamw_torch
- `optim_args`: None
- `adafactor`: False
- `group_by_length`: False
- `length_column_name`: length
- `ddp_find_unused_parameters`: None
- `ddp_bucket_cap_mb`: None
- `ddp_broadcast_buffers`: False
- `dataloader_pin_memory`: True
- `dataloader_persistent_workers`: False
- `skip_memory_metrics`: True
- `use_legacy_prediction_loop`: False
- `push_to_hub`: False
- `resume_from_checkpoint`: None
- `hub_model_id`: None
- `hub_strategy`: every_save
- `hub_private_repo`: None
- `hub_always_push`: False
- `gradient_checkpointing`: False
- `gradient_checkpointing_kwargs`: None
- `include_inputs_for_metrics`: False
- `include_for_metrics`: []
- `eval_do_concat_batches`: True
- `fp16_backend`: auto
- `push_to_hub_model_id`: None
- `push_to_hub_organization`: None
- `mp_parameters`:
- `auto_find_batch_size`: False
- `full_determinism`: False
- `torchdynamo`: None
- `ray_scope`: last
- `ddp_timeout`: 1800
- `torch_compile`: False
- `torch_compile_backend`: None
- `torch_compile_mode`: None
- `dispatch_batches`: None
- `split_batches`: None
- `include_tokens_per_second`: False
- `include_num_input_tokens_seen`: False
- `neftune_noise_alpha`: None
- `optim_target_modules`: None
- `batch_eval_metrics`: False
- `eval_on_start`: False
- `use_liger_kernel`: False
- `eval_use_gather_object`: False
- `average_tokens_across_devices`: False
- `prompts`: None
- `batch_sampler`: batch_sampler
- `multi_dataset_batch_sampler`: round_robin
</details>
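With `lr_scheduler_type: linear`, `warmup_steps: 0`, and `learning_rate: 5e-05`, the learning rate decays in a straight line from its initial value to zero over the run. The sketch below reproduces that shape, taking the 800 total steps from the final row of the training logs; `linear_lr` is a hypothetical helper written for illustration, not a Transformers function.

```python
def linear_lr(step, base_lr=5e-05, warmup_steps=0, total_steps=800):
    """Learning rate at a given optimizer step under linear warmup and linear decay."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr (unused here because warmup_steps is 0).
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

for step in (0, 200, 400, 600, 800):
    print(step, linear_lr(step))
```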
### Training Logs
| Epoch | Step | Training Loss | cosine_ndcg@10 |
|:-----:|:----:|:-------------:|:--------------:|
| 0.625 | 50 | - | 0.9091 |
| 1.0 | 80 | - | 0.9209 |
| 1.25 | 100 | - | 0.9329 |
| 1.875 | 150 | - | 0.9439 |
| 2.0 | 160 | - | 0.9379 |
| 2.5 | 200 | - | 0.9367 |
| 3.0 | 240 | - | 0.9459 |
| 3.125 | 250 | - | 0.9432 |
| 3.75 | 300 | - | 0.9479 |
| 4.0 | 320 | - | 0.9515 |
| 4.375 | 350 | - | 0.9509 |
| 5.0 | 400 | - | 0.9581 |
| 5.625 | 450 | - | 0.9551 |
| 6.0 | 480 | - | 0.9604 |
| 6.25 | 500 | 0.3078 | 0.9577 |
| 6.875 | 550 | - | 0.9651 |
| 7.0 | 560 | - | 0.9651 |
| 7.5 | 600 | - | 0.9641 |
| 8.0 | 640 | - | 0.9641 |
| 8.125 | 650 | - | 0.9638 |
| 8.75 | 700 | - | 0.9638 |
| 9.0 | 720 | - | 0.9638 |
| 9.375 | 750 | - | 0.9601 |
| 10.0 | 800 | - | 0.9638 |
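
The Epoch and Step columns follow from the dataset size and batch size. As a quick check, assuming a single device (with `gradient_accumulation_steps: 1` and `dataloader_drop_last: False` as listed above), the arithmetic can be sketched as:

```python
import math

num_samples = 798   # training samples
batch_size = 10     # per_device_train_batch_size
num_epochs = 10     # num_train_epochs

# The last, partial batch is kept because dataloader_drop_last is False.
steps_per_epoch = math.ceil(num_samples / batch_size)
total_steps = steps_per_epoch * num_epochs

def epoch_at(step):
    """Fractional epoch reached after a given number of optimizer steps."""
    return step / steps_per_epoch

print(steps_per_epoch, total_steps)
print(epoch_at(50), epoch_at(500))
```

This gives 80 steps per epoch and 800 steps in total, matching rows such as step 50 at epoch 0.625 and step 500 at epoch 6.25. The single Training Loss entry at step 500 is consistent with a logging interval of 500 steps, though that is an assumption because `logging_steps` is not listed among the hyperparameters above.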
### Framework Versions
- Python: 3.11.11
- Sentence Transformers: 3.4.1
- Transformers: 4.48.3
- PyTorch: 2.5.1+cu124
- Accelerate: 1.3.0
- Datasets: 3.3.2
- Tokenizers: 0.21.0
## Citation
### BibTeX
#### Sentence Transformers
```bibtex
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
```
#### MatryoshkaLoss
```bibtex
@misc{kusupati2024matryoshka,
title={Matryoshka Representation Learning},
author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
year={2024},
eprint={2205.13147},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
```
#### MultipleNegativesRankingLoss
```bibtex
@misc{henderson2017efficient,
title={Efficient Natural Language Response Suggestion for Smart Reply},
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
year={2017},
eprint={1705.00652},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```