---
license: apache-2.0
datasets:
- netrias/cancer_metadata_harmonization
language:
- en
metrics:
- accuracy
base_model:
- openai-community/gpt2-large
pipeline_tag: text-generation
library_name: transformers
tags:
- harmonization
- curation
- standardization
- metadata-standardization
---

# Cancer Metadata Harmonization GPT2-Large

## Table of Contents
- [Model Details](#model-details)
- [Uses](#uses)
- [Bias, Risks, and Limitations](#bias-risks-and-limitations)
- [How to Get Started with the Model](#how-to-get-started-with-the-model)
- [Training Details](#training-details)
- [Evaluation](#evaluation)
- [Environmental Impact](#environmental-impact)
- [Citation](#citation)

## Model Details

### Model Description
`netrias/cancer_harmonization_gpt2large` is a fine-tuned GPT2-Large model for harmonizing cancer-related biomedical terms. It converts variant input terms, such as synonyms, abbreviations, misspellings, and rephrasings, into standardized terms for consistent metadata curation.

- **Developed by:** [Netrias, LLC](https://www.netrias.com/)
- **Model type:** Autoregressive transformer (GPT2-Large)
- **Language(s):** English
- **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
- **Fine-tuned from model:** [openai-community/gpt2-large](https://huggingface.co/openai-community/gpt2-large)

### Model Sources
- **Repository:** [netrias/cancer_harmonization_gpt2large](https://huggingface.co/netrias/cancer_harmonization_gpt2large)
- **Paper:** [Metadata Harmonization from Biological Datasets with Language Models](https://doi.org/10.1101/2025.01.15.633281)

## Uses

### Direct Use
This model standardizes variant biomedical terms belonging to five semantic types from the National Cancer Institute Thesaurus: Neoplastic Process, Disease or Syndrome, Finding, Laboratory Procedure, and Quantitative Concept. It is intended for harmonizing cancer-related metadata in biomedical informatics and research datasets.

### Downstream Use
The model can be integrated into larger metadata harmonization workflows, including ontology curation tools, search utilities, terminology normalization services, or dataset integration pipelines requiring consistent annotation of biomedical terms.

### Out-of-Scope Use
Not suitable for clinical decision-making, patient-facing applications, or unvalidated production use. It should not be applied outside the scope of metadata harmonization for cancer-related terms or to semantic types beyond those covered in training.

## Bias, Risks, and Limitations
This model was trained on a narrow domain, limiting its applicability to other biomedical areas. Inputs outside this scope may yield unreliable harmonizations. It may misinterpret ambiguous terms or produce incorrect standardizations for unfamiliar variants. The model does not validate input quality and may generate plausible but incorrect outputs for irrelevant, incomplete, or malformed inputs. Human review is recommended before downstream use.

### Recommendations
Limit use to the supported domain, and validate outputs before applying them in downstream applications.
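
For example, one lightweight safeguard is to accept a prediction only when it matches a term in the target dictionary. The sketch below illustrates the idea; the `standard_terms` set is a hypothetical placeholder for the curated vocabulary, not part of this repository:

```python
# Hypothetical post-hoc check: accept a prediction only if it is a
# known standard term; otherwise flag it for human review.
standard_terms = {"Glioblastoma", "Glioblastoma Multiforme"}  # placeholder vocabulary

def validate(prediction: str) -> str:
    return prediction if prediction in standard_terms else "NEEDS_REVIEW"

print(validate("Glioblastoma"))  # Glioblastoma
print(validate("Gliobastoma"))   # NEEDS_REVIEW
```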

## How to Get Started with the Model
Prompt the model using a structured sentence of the form: `The standardized form of "your input term" is "`. It returns the most likely standardized term, followed by a closing quote. The example below uses Hugging Face's `pipeline` to generate the top 5 completions using beam search:

```python
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="netrias/cancer_harmonization_gpt2large"
)

# Diverse beam search: 10 beams split into 2 groups,
# returning the 5 highest-scoring completions.
outputs = pipe(
    'The standardized form of "glioblastoma multiforme" is "',
    num_beams=10,
    num_beam_groups=2,
    num_return_sequences=5,
    diversity_penalty=0.8,
    max_new_tokens=200,
    do_sample=False
)

# The standardized term is the text between the final pair of quotes.
for i, out in enumerate(outputs, 1):
    term = out["generated_text"].split('"')[-2]
    print(f"{i}. {term}")
```

Expected output: 
```
1. Glioblastoma
2. Glioblastoma, IDH Wildtype
3. Glioblastoma, IDH-Wildtype
4. Glioblastoma Multiforme
5. Glioblastoma, Not Otherwise Specified
```

## Training Details

### Training Data
Trained on the [`netrias/cancer_metadata_harmonization`](https://huggingface.co/datasets/netrias/cancer_metadata_harmonization) dataset.

### Training Procedure

#### Preprocessing
Each training example was formatted as a natural language prompt-completion pair:  
`'The standardized form of "{variant_term}" is "{standard_term}"'`
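
As a concrete illustration, a minimal formatting function might look like the following. This is a sketch of the prompt template described above; the function name and the example pair are illustrative, and the actual column names should be taken from the dataset card:

```python
# Minimal sketch of the prompt-completion formatting described above.
# The function name and example pair are illustrative only.
def format_example(variant_term: str, standard_term: str) -> str:
    return f'The standardized form of "{variant_term}" is "{standard_term}"'

print(format_example("GBM", "Glioblastoma"))
# The standardized form of "GBM" is "Glioblastoma"
```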

#### Training Hyperparameters
- **Epochs:** 100
- **Batch size:** 8 (train and eval, per device)
- **Optimizer:** AdamW (`β₁=0.9`, `β₂=0.999`, `ε=1e-8`)
- **Learning rate:** 5e-5
- **Weight decay:** 0.0
- **Gradient accumulation steps:** 1
- **Learning rate scheduler:** Linear
- **Warmup steps:** 0
- **Precision:** Full (FP32; no mixed precision)
- **Evaluation strategy:** Every 1,763 steps
- **Metric for best model selection:** `eval_top_accuracy` (greater is better)
- **Save strategy:** Every 1,763 steps (retain best model checkpoint only)
- **Logging:** Every 500 steps (`mlflow`, `tensorboard`)
- **Seed:** 773819057
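
For reference, these settings roughly correspond to the following `transformers.TrainingArguments` configuration. This is a sketch rather than the exact training script: `output_dir` is a placeholder, and older `transformers` releases name the evaluation argument `evaluation_strategy` instead of `eval_strategy`:

```python
from transformers import TrainingArguments

# Sketch mirroring the hyperparameters listed above.
args = TrainingArguments(
    output_dir="cancer_harmonization_gpt2large",  # placeholder path
    num_train_epochs=100,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=5e-5,
    weight_decay=0.0,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    gradient_accumulation_steps=1,
    lr_scheduler_type="linear",
    warmup_steps=0,
    eval_strategy="steps",        # "evaluation_strategy" in older versions
    eval_steps=1763,
    save_strategy="steps",
    save_steps=1763,
    save_total_limit=1,           # approximates "retain best checkpoint only"
    load_best_model_at_end=True,
    metric_for_best_model="eval_top_accuracy",
    greater_is_better=True,
    logging_steps=500,
    report_to=["mlflow", "tensorboard"],
    seed=773819057,
)
```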

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data
Evaluation was conducted on held-out splits from the [`netrias/cancer_metadata_harmonization`](https://huggingface.co/datasets/netrias/cancer_metadata_harmonization) dataset. The model was tested on a validation set, an in-dictionary (ID) test set, and an out-of-dictionary (OOD) test set.

#### Factors
Results are disaggregated by dictionary inclusion: whether the gold standard appeared during training (ID) or not (OOD).

#### Metrics
- **Accuracy:** Whether the top prediction exactly matches the gold standard.
- **Top-5 Accuracy:** Whether the gold standard appears among the top 5 outputs from beam search (`num_beams=10`, `num_return_sequences=5`, `diversity_penalty=0.8`).
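
A minimal sketch of how Top-5 Accuracy could be computed with the generation settings above (the `examples` list and its contents are hypothetical stand-ins for the held-out splits; `num_beam_groups=2` matches the usage example earlier in this card):

```python
from transformers import pipeline

pipe = pipeline("text-generation", model="netrias/cancer_harmonization_gpt2large")

# Hypothetical (variant, gold standard) pairs; the real evaluation
# uses the dataset's validation and test splits.
examples = [("glioblastoma multiforme", "Glioblastoma Multiforme")]

hits = 0
for variant, gold in examples:
    outputs = pipe(
        f'The standardized form of "{variant}" is "',
        num_beams=10,
        num_beam_groups=2,
        num_return_sequences=5,
        diversity_penalty=0.8,
        max_new_tokens=200,
        do_sample=False,
    )
    predictions = [o["generated_text"].split('"')[-2] for o in outputs]
    hits += gold in predictions  # top-5 hit if the gold standard appears

print(f"Top-5 accuracy: {hits / len(examples):.2%}")
```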

### Results
| Split                  | Accuracy       | Top-5 Accuracy |
|------------------------|----------------|----------------|
| Validation             | 94%            | 98%            |
| In-Dictionary (ID)     | 93%            | 96%            |
| Out-of-Dictionary (OOD)| 9%             | 16%            |

#### Summary
High performance on the validation and ID test sets indicates effective learning of known representations. Lower performance on OOD terms suggests reduced generalization to unseen standards and highlights the importance of human review for unfamiliar inputs.

## Environmental Impact
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** NVIDIA A10G  
  *Estimate based on RTX A5000, the closest available option in the calculator.*
- **Hours Used:** 23.4
- **Cloud Provider:** Amazon Web Services
- **Compute Region:** US East (Ohio)
- **Carbon Emitted:** ~3.07 kg CO₂eq

## Citation

**BibTeX:**
```bibtex
@article{verbitsky2025metadata,
  title={Metadata Harmonization from Biological Datasets with Language Models},
  author={Verbitsky, Alex and Boutet, Patrick and Eslami, Mohammed},
  journal={bioRxiv},
  year={2025},
  doi={10.1101/2025.01.15.633281},
  publisher={Cold Spring Harbor Laboratory}
}
```

**APA:**
Verbitsky, A., Boutet, P., & Eslami, M. (2025). *Metadata Harmonization from Biological Datasets with Language Models*. bioRxiv. https://doi.org/10.1101/2025.01.15.633281