---
license: apache-2.0
datasets:
- netrias/cancer_metadata_harmonization
language:
- en
metrics:
- accuracy
base_model:
- openai-community/gpt2-large
pipeline_tag: text-generation
library_name: transformers
tags:
- harmonization
- curation
- standardization
- metadata-standardization
---
# Cancer Metadata Harmonization GPT2-Large
## Table of Contents
- [Model Details](#model-details)
- [Uses](#uses)
- [Bias, Risks, and Limitations](#bias-risks-and-limitations)
- [How to Get Started with the Model](#how-to-get-started-with-the-model)
- [Training Details](#training-details)
- [Evaluation](#evaluation)
- [Environmental Impact](#environmental-impact)
- [Citation](#citation)
## Model Details
### Model Description
`netrias/cancer_harmonization_gpt2large` is a fine-tuned GPT2-Large model for harmonizing cancer-related biomedical terms. It converts variant input terms, such as synonyms, abbreviations, misspellings, and rephrasings, into standardized terms for consistent metadata curation.
- **Developed by:** [Netrias, LLC](https://www.netrias.com/)
- **Model type:** Autoregressive transformer (GPT2-Large)
- **Language(s):** English
- **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
- **Fine-tuned from model:** [openai-community/gpt2-large](https://huggingface.co/openai-community/gpt2-large)
### Model Sources
- **Repository:** [netrias/cancer_harmonization_gpt2large](https://huggingface.co/netrias/cancer_harmonization_gpt2large)
- **Paper:** [Metadata Harmonization from Biological Datasets with Language Models](https://doi.org/10.1101/2025.01.15.633281)
## Uses
### Direct Use
This model standardizes variant biomedical terms belonging to five semantic types from the National Cancer Institute Thesaurus: Neoplastic Process, Disease or Syndrome, Finding, Laboratory Procedure, and Quantitative Concept. It is intended for harmonizing cancer-related metadata in biomedical informatics and research datasets.
### Downstream Use
The model can be integrated into larger metadata harmonization workflows, including ontology curation tools, search utilities, terminology normalization services, or dataset integration pipelines requiring consistent annotation of biomedical terms.
### Out-of-Scope Use
Not suitable for clinical decision-making, patient-facing applications, or unvalidated production use. It should not be applied outside the scope of metadata harmonization for cancer-related terms or to semantic types beyond those covered in training.
## Bias, Risks, and Limitations
This model was trained on a narrow domain, limiting its applicability to other biomedical areas. Inputs outside this scope may yield unreliable harmonizations. It may misinterpret ambiguous terms or produce incorrect standardizations for unfamiliar variants. The model does not validate input quality and may generate plausible but incorrect outputs for irrelevant, incomplete, or malformed inputs. Human review is recommended before downstream use.
### Recommendations
Limit use to the supported domain, and validate outputs before applying them in downstream applications.
## How to Get Started with the Model
Prompt the model with a sentence of the form `The standardized form of "your input term" is "`. The model completes the prompt with the most likely standardized term followed by a closing quote. The example below uses Hugging Face's `pipeline` to generate the five highest-scoring completions with diverse beam search:
```python
from transformers import pipeline

# Load the fine-tuned harmonization model as a text-generation pipeline.
pipe = pipeline(
    "text-generation",
    model="netrias/cancer_harmonization_gpt2large",
)

# Diverse beam search: 10 beams in 2 groups, returning the 5 best completions.
outputs = pipe(
    'The standardized form of "glioblastoma multiforme" is "',
    num_beams=10,
    num_beam_groups=2,
    num_return_sequences=5,
    diversity_penalty=0.8,
    max_new_tokens=200,
    do_sample=False,
)

# Each completion ends with a closing quote, so the standardized term is the
# second-to-last quote-delimited field of the generated text.
for i, out in enumerate(outputs, 1):
    term = out["generated_text"].split('"')[-2]
    print(f"{i}. {term}")
```
Expected output:
```
1. Glioblastoma
2. Glioblastoma, IDH Wildtype
3. Glioblastoma, IDH-Wildtype
4. Glioblastoma Multiforme
5. Glioblastoma, Not Otherwise Specified
```
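For integration into larger workflows (see Downstream Use), the same call can be wrapped in a small helper that harmonizes a batch of terms and returns the ranked candidates for each input. This is a minimal sketch, not part of the released code; the `harmonize` helper and its handling of completions without a closing quote are illustrative assumptions:
```python
from transformers import pipeline

pipe = pipeline("text-generation", model="netrias/cancer_harmonization_gpt2large")

def harmonize(terms, top_k=5):
    """Illustrative helper: map each input term to up to top_k candidate standard terms."""
    results = {}
    for term in terms:
        prompt = f'The standardized form of "{term}" is "'
        outputs = pipe(
            prompt,
            num_beams=10,
            num_beam_groups=2,
            num_return_sequences=top_k,
            diversity_penalty=0.8,
            max_new_tokens=200,
            do_sample=False,
        )
        candidates = []
        for out in outputs:
            parts = out["generated_text"].split('"')
            if len(parts) >= 5:  # keep only completions that produced a closing quote
                candidates.append(parts[-2])
        results[term] = candidates
    return results

print(harmonize(["glioblastoma multiforme"]))
```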
## Training Details
### Training Data
Trained on the [`netrias/cancer_metadata_harmonization`](https://huggingface.co/datasets/netrias/cancer_metadata_harmonization) dataset.
### Training Procedure
#### Preprocessing
Each training example was formatted as a natural language prompt-completion pair:
`'The standardized form of "{variant_term}" is "{standard_term}"'`
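For illustration, this formatting can be reproduced with a one-line helper (the function name and the example pair are assumptions; actual pairs come from the dataset):
```python
def format_example(variant_term, standard_term):
    # Render one (variant, standard) pair as a single training string.
    return f'The standardized form of "{variant_term}" is "{standard_term}"'

print(format_example("glioblastoma multiforme", "Glioblastoma"))
# The standardized form of "glioblastoma multiforme" is "Glioblastoma"
```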
#### Training Hyperparameters
- **Epochs:** 100
- **Batch size:** 8 (train and eval, per device)
- **Optimizer:** AdamW (`β₁=0.9`, `β₂=0.999`, `ε=1e-8`)
- **Learning rate:** 5e-5
- **Weight decay:** 0.0
- **Gradient accumulation steps:** 1
- **Learning rate scheduler:** Linear
- **Warmup steps:** 0
- **Precision:** Full (FP32; no mixed precision)
- **Evaluation strategy:** Every 1,763 steps
- **Metric for best model selection:** `eval_top_accuracy` (greater is better)
- **Save strategy:** Every 1,763 steps (retain best model checkpoint only)
- **Logging:** Every 500 steps (`mlflow`, `tensorboard`)
- **Seed:** 773819057
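For reference, a `transformers.TrainingArguments` configuration matching the settings above would look roughly like the sketch below. This is a reconstruction rather than the released training script; the output directory is a placeholder, and `eval_top_accuracy` is assumed to come from a custom `compute_metrics` function that is not shown.
```python
from transformers import TrainingArguments

# Sketch of the hyperparameters listed above; output_dir is a placeholder.
training_args = TrainingArguments(
    output_dir="cancer_harmonization_gpt2large",
    num_train_epochs=100,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=5e-5,
    weight_decay=0.0,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    gradient_accumulation_steps=1,
    lr_scheduler_type="linear",
    warmup_steps=0,
    fp16=False,  # full FP32 precision
    evaluation_strategy="steps",
    eval_steps=1763,
    save_strategy="steps",
    save_steps=1763,
    save_total_limit=1,  # retain the best checkpoint only
    load_best_model_at_end=True,
    metric_for_best_model="eval_top_accuracy",  # assumed custom compute_metrics
    greater_is_better=True,
    logging_steps=500,
    report_to=["mlflow", "tensorboard"],
    seed=773819057,
)
```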
## Evaluation
### Testing Data, Factors & Metrics
#### Testing Data
Evaluation was conducted on held-out splits from the [`netrias/cancer_metadata_harmonization`](https://huggingface.co/datasets/netrias/cancer_metadata_harmonization) dataset. The model was tested on a validation set, an in-dictionary (ID) test set, and an out-of-dictionary (OOD) test set.
#### Factors
Results are disaggregated by dictionary inclusion: whether the gold standard appeared during training (ID) or not (OOD).
#### Metrics
- **Accuracy:** Whether the top prediction exactly matches the gold standard.
- **Top-5 Accuracy:** Whether the gold standard appears among the top 5 outputs from beam search (`num_beams=10`, `num_return_sequences=5`, `diversity_penalty=0.8`).
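The sketch below shows one way to compute both metrics from ranked candidate lists; the `predictions` and `gold_standards` structures are assumptions, not the paper's evaluation code.
```python
def accuracy_at_k(predictions, gold_standards, k):
    """predictions: one ranked candidate list per example; gold_standards: the correct terms."""
    hits = sum(
        gold in candidates[:k]
        for candidates, gold in zip(predictions, gold_standards)
    )
    return hits / len(gold_standards)

# Top-1 (exact-match) accuracy and top-5 accuracy over the same ranked outputs:
# top1 = accuracy_at_k(preds, golds, k=1)
# top5 = accuracy_at_k(preds, golds, k=5)
```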
### Results
| Split | Accuracy | Top-5 Accuracy |
|------------------------|----------------|----------------|
| Validation | 94% | 98% |
| In-Dictionary (ID) | 93% | 96% |
| Out-of-Dictionary (OOD)| 9% | 16% |
#### Summary
High performance on the validation and ID test sets indicates effective learning of known representations. Lower performance on OOD terms suggests reduced generalization to unseen standards and highlights the importance of human review for unfamiliar inputs.
## Environmental Impact
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
- **Hardware Type:** NVIDIA A10G (estimate based on RTX A5000, the closest available option in the calculator)
- **Hours Used:** 23.4
- **Cloud Provider:** Amazon Web Services
- **Compute Region:** US East (Ohio)
- **Carbon Emitted:** ~3.07 kg CO₂eq
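
As a rough back-of-the-envelope check, assuming the RTX A5000's 230 W TDP used for the estimate, 23.4 hours corresponds to about 23.4 h × 0.23 kW ≈ 5.4 kWh, so the reported ~3.07 kg CO₂eq implies a grid carbon intensity of roughly 0.57 kg CO₂eq/kWh for the region.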
## Citation
**BibTeX:**
```bibtex
@article{verbitsky2025metadata,
title={Metadata Harmonization from Biological Datasets with Language Models},
author={Verbitsky, Alex and Boutet, Patrick and Eslami, Mohammed},
journal={bioRxiv},
year={2025},
doi={10.1101/2025.01.15.633281},
publisher={Cold Spring Harbor Laboratory}
}
```
**APA:**
Verbitsky, A., Boutet, P., & Eslami, M. (2025). *Metadata Harmonization from Biological Datasets with Language Models*. bioRxiv. https://doi.org/10.1101/2025.01.15.633281 |