---
license: apache-2.0
datasets:
- netrias/cancer_metadata_harmonization
language:
- en
metrics:
- accuracy
base_model:
- openai-community/gpt2-large
pipeline_tag: text-generation
library_name: transformers
tags:
- harmonization
- curation
- standardization
- metadata-standardization
---

# Cancer Metadata Harmonization GPT2-Large

## Table of Contents
- [Model Details](#model-details)
- [Uses](#uses)
- [Bias, Risks, and Limitations](#bias-risks-and-limitations)
- [How to Get Started with the Model](#how-to-get-started-with-the-model)
- [Training Details](#training-details)
- [Evaluation](#evaluation)
- [Environmental Impact](#environmental-impact)
- [Citation](#citation)

## Model Details

### Model Description
`netrias/cancer_harmonization_gpt2large` is a fine-tuned GPT2-Large model for harmonizing cancer-related biomedical terms. It converts variant input terms, such as synonyms, abbreviations, misspellings, and rephrasings, into standardized terms for consistent metadata curation.

- **Developed by:** [Netrias, LLC](https://www.netrias.com/)
- **Model type:** Autoregressive transformer (GPT2-Large)
- **Language(s):** English
- **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
- **Fine-tuned from model:** [openai-community/gpt2-large](https://huggingface.co/openai-community/gpt2-large)

### Model Sources
- **Repository:** [netrias/cancer_harmonization_gpt2large](https://huggingface.co/netrias/cancer_harmonization_gpt2large)
- **Paper:** [Metadata Harmonization from Biological Datasets with Language Models](https://doi.org/10.1101/2025.01.15.633281)

## Uses

### Direct Use
This model standardizes variant biomedical terms belonging to five semantic types from the National Cancer Institute Thesaurus: Neoplastic Process, Disease or Syndrome, Finding, Laboratory Procedure, and Quantitative Concept. It is intended for harmonizing cancer-related metadata in biomedical informatics and research datasets.

### Downstream Use
The model can be integrated into larger metadata harmonization workflows, including ontology curation tools, search utilities, terminology normalization services, or dataset integration pipelines requiring consistent annotation of biomedical terms.

### Out-of-Scope Use
Not suitable for clinical decision-making, patient-facing applications, or unvalidated production use. It should not be applied outside the scope of metadata harmonization for cancer-related terms or to semantic types beyond those covered in training.

## Bias, Risks, and Limitations
This model was trained on a narrow domain, limiting its applicability to other biomedical areas. Inputs outside this scope may yield unreliable harmonizations. It may misinterpret ambiguous terms or produce incorrect standardizations for unfamiliar variants. The model does not validate input quality and may generate plausible but incorrect outputs for irrelevant, incomplete, or malformed inputs. Human review is recommended before downstream use.

### Recommendations
Limit use to the supported domain, and validate outputs before applying them in downstream applications.
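
For example, one lightweight safeguard is to accept a prediction only when it matches a term in the target dictionary. The sketch below illustrates the idea; the `standard_terms` set is a hypothetical placeholder for the curated vocabulary, not part of this repository:

```python
# Hypothetical post-hoc check: accept a prediction only if it is a
# known standard term; otherwise flag it for human review.
standard_terms = {"Glioblastoma", "Glioblastoma Multiforme"}  # placeholder vocabulary

def validate(prediction: str) -> str:
    return prediction if prediction in standard_terms else "NEEDS_REVIEW"

print(validate("Glioblastoma"))  # Glioblastoma
print(validate("Gliobastoma"))   # NEEDS_REVIEW
```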

## How to Get Started with the Model
Prompt the model using a structured sentence of the form: `The standardized form of "your input term" is "`. It returns the most likely standardized term, followed by a closing quote. The example below uses Hugging Face's `pipeline` to generate the top 5 completions using beam search:

```python
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="netrias/cancer_harmonization_gpt2large"
)

# Diverse beam search: 10 beams split into 2 groups,
# returning the 5 highest-scoring completions.
outputs = pipe(
    'The standardized form of "glioblastoma multiforme" is "',
    num_beams=10,
    num_beam_groups=2,
    num_return_sequences=5,
    diversity_penalty=0.8,
    max_new_tokens=200,
    do_sample=False
)

# The standardized term is the text between the final pair of quotes.
for i, out in enumerate(outputs, 1):
    term = out["generated_text"].split('"')[-2]
    print(f"{i}. {term}")
```

Expected output: 
```
1. Glioblastoma
2. Glioblastoma, IDH Wildtype
3. Glioblastoma, IDH-Wildtype
4. Glioblastoma Multiforme
5. Glioblastoma, Not Otherwise Specified
```

## Training Details

### Training Data
Trained on the [`netrias/cancer_metadata_harmonization`](https://huggingface.co/datasets/netrias/cancer_metadata_harmonization) dataset.

### Training Procedure

#### Preprocessing
Each training example was formatted as a natural language prompt-completion pair:  
`'The standardized form of "{variant_term}" is "{standard_term}"'`
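
As a concrete illustration, a minimal formatting function might look like the following. This is a sketch of the prompt template described above; the function name and the example pair are illustrative, and the actual column names should be taken from the dataset card:

```python
# Minimal sketch of the prompt-completion formatting described above.
# The function name and example pair are illustrative only.
def format_example(variant_term: str, standard_term: str) -> str:
    return f'The standardized form of "{variant_term}" is "{standard_term}"'

print(format_example("GBM", "Glioblastoma"))
# The standardized form of "GBM" is "Glioblastoma"
```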

#### Training Hyperparameters
- **Epochs:** 100
- **Batch size:** 8 (train and eval, per device)
- **Optimizer:** AdamW (`β₁=0.9`, `β₂=0.999`, `ε=1e-8`)
- **Learning rate:** 5e-5
- **Weight decay:** 0.0
- **Gradient accumulation steps:** 1
- **Learning rate scheduler:** Linear
- **Warmup steps:** 0
- **Precision:** Full (FP32; no mixed precision)
- **Evaluation strategy:** Every 1,763 steps
- **Metric for best model selection:** `eval_top_accuracy` (greater is better)
- **Save strategy:** Every 1,763 steps (retain best model checkpoint only)
- **Logging:** Every 500 steps (`mlflow`, `tensorboard`)
- **Seed:** 773819057
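
For reference, these settings roughly correspond to the following `transformers.TrainingArguments` configuration. This is a sketch rather than the exact training script: `output_dir` is a placeholder, and older `transformers` releases name the evaluation argument `evaluation_strategy` instead of `eval_strategy`:

```python
from transformers import TrainingArguments

# Sketch mirroring the hyperparameters listed above.
args = TrainingArguments(
    output_dir="cancer_harmonization_gpt2large",  # placeholder path
    num_train_epochs=100,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=5e-5,
    weight_decay=0.0,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    gradient_accumulation_steps=1,
    lr_scheduler_type="linear",
    warmup_steps=0,
    eval_strategy="steps",        # "evaluation_strategy" in older versions
    eval_steps=1763,
    save_strategy="steps",
    save_steps=1763,
    save_total_limit=1,           # approximates "retain best checkpoint only"
    load_best_model_at_end=True,
    metric_for_best_model="eval_top_accuracy",
    greater_is_better=True,
    logging_steps=500,
    report_to=["mlflow", "tensorboard"],
    seed=773819057,
)
```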

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data
Evaluation was conducted on held-out splits from the [`netrias/cancer_metadata_harmonization`](https://huggingface.co/datasets/netrias/cancer_metadata_harmonization) dataset. The model was tested on a validation set, an in-dictionary (ID) test set, and an out-of-dictionary (OOD) test set.

#### Factors
Results are disaggregated by dictionary inclusion: whether the gold standard appeared during training (ID) or not (OOD).

#### Metrics
- **Accuracy:** Whether the top prediction exactly matches the gold standard.
- **Top-5 Accuracy:** Whether the gold standard appears among the top 5 outputs from beam search (`num_beams=10`, `num_return_sequences=5`, `diversity_penalty=0.8`).
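
A minimal sketch of how Top-5 Accuracy could be computed with the generation settings above (the `examples` list and its contents are hypothetical stand-ins for the held-out splits; `num_beam_groups=2` matches the usage example earlier in this card):

```python
from transformers import pipeline

pipe = pipeline("text-generation", model="netrias/cancer_harmonization_gpt2large")

# Hypothetical (variant, gold standard) pairs; the real evaluation
# uses the dataset's validation and test splits.
examples = [("glioblastoma multiforme", "Glioblastoma Multiforme")]

hits = 0
for variant, gold in examples:
    outputs = pipe(
        f'The standardized form of "{variant}" is "',
        num_beams=10,
        num_beam_groups=2,
        num_return_sequences=5,
        diversity_penalty=0.8,
        max_new_tokens=200,
        do_sample=False,
    )
    predictions = [o["generated_text"].split('"')[-2] for o in outputs]
    hits += gold in predictions  # top-5 hit if the gold standard appears

print(f"Top-5 accuracy: {hits / len(examples):.2%}")
```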

### Results
| Split                  | Accuracy       | Top-5 Accuracy |
|------------------------|----------------|----------------|
| Validation             | 94%            | 98%            |
| In-Dictionary (ID)     | 93%            | 96%            |
| Out-of-Dictionary (OOD)| 9%             | 16%            |

#### Summary
High performance on the validation and ID test sets indicates effective learning of known representations. Lower performance on OOD terms suggests reduced generalization to unseen standards and highlights the importance of human review for unfamiliar inputs.

## Environmental Impact
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** NVIDIA A10G  
  *Estimate based on RTX A5000, the closest available option in the calculator.*
- **Hours Used:** 23.4
- **Cloud Provider:** Amazon Web Services
- **Compute Region:** US East (Ohio)
- **Carbon Emitted:** ~3.07 kg CO₂eq

## Citation

**BibTeX:**
```bibtex
@article{verbitsky2025metadata,
  title={Metadata Harmonization from Biological Datasets with Language Models},
  author={Verbitsky, Alex and Boutet, Patrick and Eslami, Mohammed},
  journal={bioRxiv},
  year={2025},
  doi={10.1101/2025.01.15.633281},
  publisher={Cold Spring Harbor Laboratory}
}
```

**APA:**
Verbitsky, A., Boutet, P., & Eslami, M. (2025). *Metadata Harmonization from Biological Datasets with Language Models*. bioRxiv. https://doi.org/10.1101/2025.01.15.633281