Cancer Metadata Harmonization GPT2-Large

Model Details

Model Description

netrias/cancer_harmonization_gpt2large is a fine-tuned GPT2-Large model for harmonizing cancer-related biomedical terms. It converts variant input terms, such as synonyms, abbreviations, misspellings, and rephrasings, into standardized terms for consistent metadata curation.

Uses

Direct Use

This model standardizes variant biomedical terms belonging to five semantic types from the National Cancer Institute Thesaurus: Neoplastic Process, Disease or Syndrome, Finding, Laboratory Procedure, and Quantitative Concept. It is intended for harmonizing cancer-related metadata in biomedical informatics and research datasets.

Downstream Use

The model can be integrated into larger metadata harmonization workflows, including ontology curation tools, search utilities, terminology normalization services, and dataset integration pipelines that require consistent annotation of biomedical terms.

Out-of-Scope Use

The model is not suitable for clinical decision-making, patient-facing applications, or unvalidated production use. It should not be applied outside the scope of metadata harmonization for cancer-related terms, nor to semantic types beyond those covered in training.

Bias, Risks, and Limitations

This model was trained on a narrow domain, limiting its applicability to other biomedical areas. Inputs outside this scope may yield unreliable harmonizations. It may misinterpret ambiguous terms or produce incorrect standardizations for unfamiliar variants. The model does not validate input quality and may generate plausible but incorrect outputs for irrelevant, incomplete, or malformed inputs. Human review is recommended before downstream use.

Recommendations

Limit use to the supported domain, and validate outputs before applying them in downstream applications.

How to Get Started with the Model

Prompt the model with a sentence of the form The standardized form of "your input term" is ", leaving the final quotation mark open. The model completes the sentence with the most likely standardized term, followed by a closing quote. The example below uses Hugging Face's pipeline to generate the top 5 completions with diverse beam search:

from transformers import pipeline

# Load the fine-tuned harmonization model
pipe = pipeline(
    "text-generation",
    model="netrias/cancer_harmonization_gpt2large"
)

# Diverse beam search: 10 beams in 2 groups, returning the top 5 completions
outputs = pipe(
    'The standardized form of "glioblastoma multiforme" is "',
    num_beams=10,
    num_beam_groups=2,
    num_return_sequences=5,
    diversity_penalty=0.8,
    max_new_tokens=200,
    do_sample=False
)

# Each completion ends with the standardized term in quotes;
# take the text between the final pair of quotation marks
for i, out in enumerate(outputs, 1):
    term = out["generated_text"].split('"')[-2]
    print(f"{i}. {term}")

Expected output:

1. Glioblastoma
2. Glioblastoma, IDH Wildtype
3. Glioblastoma, IDH-Wildtype
4. Glioblastoma Multiforme
5. Glioblastoma, Not Otherwise Specified

Training Details

Training Data

Trained on the netrias/cancer_metadata_harmonization dataset.

Training Procedure

Preprocessing

Each training example was formatted as a natural language prompt-completion pair:
'The standardized form of "{variant_term}" is "{standard_term}"'
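In code, this formatting is a single string template; a minimal sketch, where the helper name is illustrative rather than from the training pipeline:

```python
def format_example(variant_term: str, standard_term: str) -> str:
    # Illustrative helper: renders one prompt-completion training pair
    return f'The standardized form of "{variant_term}" is "{standard_term}"'

print(format_example("GBM", "Glioblastoma Multiforme"))
# The standardized form of "GBM" is "Glioblastoma Multiforme"
```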

Training Hyperparameters

  • Epochs: 100
  • Batch size: 8 (train and eval, per device)
  • Optimizer: AdamW (β₁=0.9, β₂=0.999, ε=1e-8)
  • Learning rate: 5e-5
  • Weight decay: 0.0
  • Gradient accumulation steps: 1
  • Learning rate scheduler: Linear
  • Warmup steps: 0
  • Precision: Full (FP32; no mixed precision)
  • Evaluation strategy: Every 1,763 steps
  • Metric for best model selection: eval_top_accuracy (greater is better)
  • Save strategy: Every 1,763 steps (retain best model checkpoint only)
  • Logging: Every 500 steps (mlflow, tensorboard)
  • Seed: 773819057
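The settings above map onto Hugging Face TrainingArguments; a minimal sketch of that configuration, in which the output directory is a placeholder and any argument not listed above takes the library default:

```python
from transformers import TrainingArguments

# Sketch of the reported hyperparameters as a TrainingArguments configuration.
# AdamW betas/epsilon, linear schedule, zero warmup, and zero weight decay
# match the transformers defaults, so they need no explicit arguments.
args = TrainingArguments(
    output_dir="cancer_harmonization_gpt2large",  # placeholder path
    num_train_epochs=100,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=5e-5,
    weight_decay=0.0,
    gradient_accumulation_steps=1,
    lr_scheduler_type="linear",
    warmup_steps=0,
    fp16=False,                      # full FP32; no mixed precision
    eval_strategy="steps",
    eval_steps=1763,
    save_strategy="steps",
    save_steps=1763,
    save_total_limit=1,              # retain best checkpoint only
    load_best_model_at_end=True,
    metric_for_best_model="eval_top_accuracy",
    greater_is_better=True,
    logging_steps=500,
    report_to=["mlflow", "tensorboard"],
    seed=773819057,
)
```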

Evaluation

Testing Data, Factors & Metrics

Testing Data

Evaluation was conducted on held-out splits from the netrias/cancer_metadata_harmonization dataset. The model was tested on a validation set, an in-dictionary (ID) test set, and an out-of-dictionary (OOD) test set.

Factors

Results are disaggregated by dictionary inclusion: whether the gold standard appeared during training (ID) or not (OOD).

Metrics

  • Accuracy: Whether the top prediction exactly matches the gold standard.
  • Top-5 Accuracy: Whether the gold standard appears among the top 5 outputs from beam search (num_beams=10, num_return_sequences=5, diversity_penalty=0.8).
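Both metrics reduce to exact string matching against the gold standard; a minimal sketch, with illustrative function names:

```python
def accuracy(top1_preds, golds):
    # Fraction of inputs whose top prediction exactly matches the gold standard
    return sum(p == g for p, g in zip(top1_preds, golds)) / len(golds)

def top5_accuracy(top5_preds, golds):
    # Fraction of inputs whose gold standard appears among the top-5 outputs
    return sum(g in ps for ps, g in zip(top5_preds, golds)) / len(golds)

print(accuracy(["Glioblastoma"], ["Glioblastoma"]))             # 1.0
print(top5_accuracy([["Glioblastoma", "Glioma"]], ["Glioma"]))  # 1.0
```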

Results

Split                     Accuracy   Top-5 Accuracy
Validation                94%        98%
In-Dictionary (ID)        93%        96%
Out-of-Dictionary (OOD)   9%         16%

Summary

High performance on the validation and ID test sets indicates effective learning of known representations. Lower performance on OOD terms suggests reduced generalization to unseen standards and highlights the importance of human review for unfamiliar inputs.

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

  • Hardware Type: NVIDIA A10G
    Estimate based on RTX A5000, the closest available option in the calculator.
  • Hours Used: 23.4
  • Cloud Provider: Amazon Web Services
  • Compute Region: US East (Ohio)
  • Carbon Emitted: ~3.07 kg CO₂eq

Citation

BibTeX:

@article{verbitsky2025metadata,
  title={Metadata Harmonization from Biological Datasets with Language Models},
  author={Verbitsky, Alex and Boutet, Patrick and Eslami, Mohammed},
  journal={bioRxiv},
  year={2025},
  doi={10.1101/2025.01.15.633281},
  publisher={Cold Spring Harbor Laboratory}
}

APA: Verbitsky, A., Boutet, P., & Eslami, M. (2025). Metadata Harmonization from Biological Datasets with Language Models. bioRxiv. https://doi.org/10.1101/2025.01.15.633281
