Bacteria Metadata Harmonization GPT2-Large
Table of Contents
- Model Details
- Uses
- Bias, Risks, and Limitations
- How to Get Started with the Model
- Training Details
- Evaluation
- Environmental Impact
- Citation
Model Details
Model Description
netrias/alcohol0_bacteria100_harmonization_gpt2large is a fine-tuned GPT2-Large model for harmonizing bacteria-related biomedical terms. It converts variant input terms, such as synonyms, abbreviations, misspellings, and rephrasings, into standardized terms for consistent metadata curation.
- Developed by: Netrias, LLC
- Model type: Autoregressive transformer (GPT2-Large)
- Language(s): English
- License: Apache 2.0
- Fine-tuned from model: openai-community/gpt2-large
Model Sources
- Repository: netrias/alcohol0_bacteria100_harmonization_gpt2large
- Paper: Metadata Harmonization from Biological Datasets with Language Models
Uses
Direct Use
This model standardizes bacteria-related metadata terms, such as those found in experimental variables, survey items, and protocol fields.
Downstream Use
Can be integrated into larger metadata harmonization workflows, including ontology curation tools, search utilities, terminology normalization services, or dataset integration pipelines requiring consistent annotation of biomedical terms.
Out-of-Scope Use
This model is not suitable for clinical decision-making, patient-facing applications, or unvalidated production use. It should not be applied outside the scope of metadata harmonization for bacteria-related terms.
Bias, Risks, and Limitations
This model was trained on a narrow domain, limiting its applicability to other biomedical areas. Inputs outside this scope may yield unreliable harmonizations. It may misinterpret ambiguous terms or produce incorrect standardizations for unfamiliar variants. The model does not validate input quality and may generate plausible but incorrect outputs for irrelevant, incomplete, or malformed inputs. Human review is recommended before downstream use.
Recommendations
Limit use to the supported domain, and validate outputs before applying them in downstream applications.
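One lightweight way to follow this recommendation is to gate model outputs on a controlled vocabulary before accepting them. The sketch below is illustrative and not part of the released model; the dictionary contents are hypothetical stand-ins for your own curated list of standard terms.

```python
# Hypothetical validation step: accept a harmonized term only if it appears
# in a curated dictionary of standard terms; otherwise flag it for human review.
STANDARD_TERMS = {
    "Staphylococcus aureus subsp. aureus RN4220",
    "Escherichia coli K-12",
}

def validate(prediction: str) -> bool:
    """Return True if the predicted term is in the controlled vocabulary."""
    return prediction in STANDARD_TERMS

print(validate("Staphylococcus aureus subsp. aureus RN4220"))  # True
print(validate("staph"))  # False -> route to human review
```

Terms that fail the check are exactly the cases the card flags as risky (unfamiliar variants, out-of-scope inputs), so routing them to review keeps errors out of downstream pipelines.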
How to Get Started with the Model
Prompt the model with a structured sentence of the form 'The standardized form of "your input term" is "' (note the trailing open quote). It returns the most likely standardized term, followed by a closing quote. The example below uses Hugging Face's pipeline to generate the top 5 completions with diverse beam search:
```python
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="netrias/alcohol0_bacteria100_harmonization_gpt2large",
)

outputs = pipe(
    'The standardized form of "staph" is "',
    num_beams=10,
    num_beam_groups=2,
    num_return_sequences=5,
    diversity_penalty=0.8,
    max_new_tokens=200,
    do_sample=False,
)

for i, out in enumerate(outputs, 1):
    # The completion ends with the standardized term and a closing quote,
    # so the term is the second-to-last quote-delimited field.
    term = out["generated_text"].split('"')[-2]
    print(f"{i}. {term}")
Expected output:
1. Staphylococcus aureus subsp. aureus RN4220
2. Staphylococcus aureus subsp. aureus VRS6
3. Staphylococcus aureus subsp. aureus CIG1114
4. Staphylococcus aureus subsp. aureus CO-08
5. Staphylococcus aureus M0270
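The term-extraction step in the loop above can be isolated as a small helper, shown here as a sketch: the model's completion ends with the standardized term followed by a closing quote, so the term is the second-to-last quote-delimited field.

```python
def extract_term(generated_text: str) -> str:
    """Pull the standardized term out of a full generated string.

    Assumes the text ends with the term in double quotes, as produced by
    the prompt format used above.
    """
    return generated_text.split('"')[-2]

sample = 'The standardized form of "staph" is "Staphylococcus aureus subsp. aureus RN4220"'
print(extract_term(sample))  # Staphylococcus aureus subsp. aureus RN4220
```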
Training Details
Training Data
Trained on the alcohol0_bacteria100 subset of the netrias/alcohol_bacteria_metadata_harmonization dataset.
Training Procedure
Preprocessing
Each training example was formatted as a natural-language prompt-completion pair: 'The standardized form of "{variant_term}" is "{standard_term}"'
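The formatting step can be sketched as a one-line helper (the function name is illustrative, not taken from the training code):

```python
def format_example(variant_term: str, standard_term: str) -> str:
    """Render a (variant, standard) pair as a training string."""
    return f'The standardized form of "{variant_term}" is "{standard_term}"'

print(format_example("staph", "Staphylococcus aureus subsp. aureus RN4220"))
```

Because inference uses the same template truncated after the second opening quote, the model only has to complete the standard term and the closing quote.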
Training Hyperparameters
- Epochs: 100
- Batch size: 8 (train and eval, per device)
- Optimizer: AdamW (β₁=0.9, β₂=0.999, ε=1e-8)
- Learning rate: 5e-5
- Weight decay: 0.0
- Gradient accumulation steps: 1
- Learning rate scheduler: Linear
- Warmup steps: 0
- Precision: Full (FP32; no mixed precision)
- Evaluation strategy: Every 1,763 steps
- Metric for best model selection: eval_top_accuracy (greater is better)
- Save strategy: Every 1,763 steps (retain best model checkpoint only)
- Logging: Every 500 steps (mlflow, tensorboard)
- Seed: 773819057
Evaluation
Testing Data, Factors & Metrics
Testing Data
Evaluation was conducted on held-out splits from the alcohol0_bacteria100 subset of the netrias/alcohol_bacteria_metadata_harmonization dataset. The model was tested on a validation set, an in-dictionary (ID) test set, and an out-of-dictionary (OOD) test set.
Factors
Results are disaggregated by dictionary inclusion: whether the gold standard appeared during training (ID) or not (OOD).
Metrics
- Accuracy: Whether the top prediction exactly matches the gold standard.
- Top-5 Accuracy: Whether the gold standard appears among the top 5 outputs from beam search (num_beams=10, num_return_sequences=5, diversity_penalty=0.8).
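The two metrics can be sketched as follows, assuming each prediction is a score-ordered list of the top-5 beam-search outputs (the data layout is an assumption for illustration, not taken from the evaluation code):

```python
def accuracy(preds, golds):
    """Exact-match rate of the top-ranked prediction."""
    return sum(p[0] == g for p, g in zip(preds, golds)) / len(golds)

def top5_accuracy(preds, golds):
    """Rate at which the gold term appears anywhere in the top 5."""
    return sum(g in p[:5] for p, g in zip(preds, golds)) / len(golds)

preds = [["A", "B"], ["C", "D"], ["X", "E"]]
golds = ["A", "D", "Y"]
print(round(accuracy(preds, golds), 3))       # 0.333
print(round(top5_accuracy(preds, golds), 3))  # 0.667
```

Top-5 accuracy is always at least as high as accuracy, which matches the ordering of the two columns in the results table below.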
Results
| Split | Accuracy | Top-5 Accuracy |
|---|---|---|
| Validation | 75% | 83% |
| In-Dictionary (ID) | 72% | 80% |
| Out-of-Dictionary (OOD) | 17% | 35% |
Summary
Stronger performance on the validation and ID test sets indicates the model effectively learned the standard terms seen during training. The much lower OOD scores show limited generalization to unseen standards and underscore the need for human review of unfamiliar inputs.
Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
- Hardware Type: NVIDIA A10G (estimate based on RTX A5000, the closest available option in the calculator)
- Hours Used: 38
- Cloud Provider: Amazon Web Services
- Compute Region: US East (Ohio)
- Carbon Emitted: ~4.98 kg CO₂eq
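The figure above can be reproduced with the back-of-the-envelope method behind the calculator: energy = GPU power × hours, emissions = energy × regional carbon intensity. The 230 W draw (RTX A5000 board power) and ~0.57 kg CO₂eq/kWh intensity are assumed calculator inputs, not values stated in this card.

```python
# Back-of-the-envelope carbon estimate (Lacoste et al., 2019 methodology).
power_kw = 0.230   # RTX A5000 board power in kW (assumed proxy for the A10G)
hours = 38         # training time from the card
intensity = 0.57   # kg CO2eq per kWh, assumed factor for US East (Ohio)

emissions = power_kw * hours * intensity
print(f"{emissions:.2f} kg CO2eq")  # 4.98 kg CO2eq
```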
Citation
BibTeX:
@article{verbitsky2025metadata,
title={Metadata Harmonization from Biological Datasets with Language Models},
author={Verbitsky, Alex and Boutet, Patrick and Eslami, Mohammed},
journal={bioRxiv},
year={2025},
doi={10.1101/2025.01.15.633281},
publisher={Cold Spring Harbor Laboratory}
}
APA: Verbitsky, A., Boutet, P., & Eslami, M. (2025). Metadata Harmonization from Biological Datasets with Language Models. bioRxiv. https://doi.org/10.1101/2025.01.15.633281