Bacteria Metadata Harmonization GPT2-Large

Model Details

Model Description

netrias/alcohol0_bacteria100_harmonization_gpt2large is a fine-tuned GPT2-Large model for harmonizing bacteria-related biomedical terms. It converts variant input terms, such as synonyms, abbreviations, misspellings, and rephrasings, into standardized terms for consistent metadata curation.

Uses

Direct Use

This model standardizes bacteria-related metadata terms, such as those found in experimental variables, survey items, and protocol fields.

Downstream Use

Can be integrated into larger metadata harmonization workflows, including ontology curation tools, search utilities, terminology normalization services, or dataset integration pipelines requiring consistent annotation of biomedical terms.

Out-of-Scope Use

Not suitable for clinical decision-making, patient-facing applications, or unvalidated production use. It should not be applied outside the scope of metadata harmonization for bacteria-related terms.

Bias, Risks, and Limitations

This model was trained on a narrow domain, limiting its applicability to other biomedical areas. Inputs outside this scope may yield unreliable harmonizations. It may misinterpret ambiguous terms or produce incorrect standardizations for unfamiliar variants. The model does not validate input quality and may generate plausible but incorrect outputs for irrelevant, incomplete, or malformed inputs. Human review is recommended before downstream use.

Recommendations

Limit use to the supported domain, and validate outputs before applying them in downstream applications.
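
One lightweight way to validate outputs is to check each prediction against a controlled vocabulary and flag anything unrecognized for human review. The sketch below is illustrative only: the ALLOWED_TERMS set and the validate function are hypothetical stand-ins, not part of the model or its training data.

```python
# Hypothetical stand-in for a real dictionary of accepted standard terms.
ALLOWED_TERMS = {
    "Staphylococcus aureus subsp. aureus RN4220",
    "Escherichia coli K-12",
}

def validate(predictions):
    """Split predictions into accepted terms and terms needing human review."""
    accepted = [p for p in predictions if p in ALLOWED_TERMS]
    flagged = [p for p in predictions if p not in ALLOWED_TERMS]
    return accepted, flagged
```

Predictions that fall outside the vocabulary can then be routed to a curator rather than written directly into downstream metadata.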

How to Get Started with the Model

Prompt the model with a sentence of the form: The standardized form of "your input term" is " — note that the prompt ends with an opening quote. The model completes the sentence with the most likely standardized term, followed by a closing quote. The example below uses Hugging Face's pipeline to generate the top five completions with diverse beam search:

from transformers import pipeline

# Load the fine-tuned model into a text-generation pipeline.
pipe = pipeline(
    "text-generation",
    model="netrias/alcohol0_bacteria100_harmonization_gpt2large"
)

# Diverse beam search: 10 beams in 2 groups, returning the top 5 completions.
outputs = pipe(
    'The standardized form of "staph" is "',
    num_beams=10,
    num_beam_groups=2,
    num_return_sequences=5,
    diversity_penalty=0.8,
    max_new_tokens=200,
    do_sample=False
)

# Each completion ends with the standardized term in closing quotes;
# take the text between the last pair of quotes.
for i, out in enumerate(outputs, 1):
    term = out["generated_text"].split('"')[-2]
    print(f"{i}. {term}")

Expected output:

1. Staphylococcus aureus subsp. aureus RN4220
2. Staphylococcus aureus subsp. aureus VRS6
3. Staphylococcus aureus subsp. aureus CIG1114
4. Staphylococcus aureus subsp. aureus CO-08
5. Staphylococcus aureus M0270

Training Details

Training Data

Trained on the alcohol0_bacteria100 subset of the netrias/alcohol_bacteria_metadata_harmonization dataset.

Training Procedure

Preprocessing

Each training example was formatted as a natural language prompt-completion pair:
'The standardized form of "{variant_term}" is "{standard_term}"'
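
The formatting step above can be sketched as a small helper; the function name is illustrative and not taken from the actual training code.

```python
def format_example(variant_term, standard_term):
    """Render one training example as the prompt-completion string above."""
    return f'The standardized form of "{variant_term}" is "{standard_term}"'
```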

Training Hyperparameters

  • Epochs: 100
  • Batch size: 8 (train and eval, per device)
  • Optimizer: AdamW (β₁=0.9, β₂=0.999, ε=1e-8)
  • Learning rate: 5e-5
  • Weight decay: 0.0
  • Gradient accumulation steps: 1
  • Learning rate scheduler: Linear
  • Warmup steps: 0
  • Precision: Full (FP32; no mixed precision)
  • Evaluation strategy: Every 1,763 steps
  • Metric for best model selection: eval_top_accuracy (greater is better)
  • Save strategy: Every 1,763 steps (retain best model checkpoint only)
  • Logging: Every 500 steps (mlflow, tensorboard)
  • Seed: 773819057

Evaluation

Testing Data, Factors & Metrics

Testing Data

Evaluation was conducted on held-out splits from the alcohol0_bacteria100 subset of the netrias/alcohol_bacteria_metadata_harmonization dataset. The model was tested on a validation set, an in-dictionary (ID) test set, and an out-of-dictionary (OOD) test set.

Factors

Results are disaggregated by dictionary inclusion: whether the gold standard appeared during training (ID) or not (OOD).

Metrics

  • Accuracy: Whether the top prediction exactly matches the gold standard.
  • Top-5 Accuracy: Whether the gold standard appears among the top 5 outputs from beam search (num_beams=10, num_return_sequences=5, diversity_penalty=0.8).
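
Top-k accuracy as described above can be computed with a simple sketch like the following (the function is an assumption for illustration, not the evaluation script used for this model):

```python
def top_k_accuracy(gold_terms, predictions_per_input, k=5):
    """Fraction of inputs whose gold term appears in the top-k predictions.

    gold_terms: one gold-standard term per input.
    predictions_per_input: ranked prediction lists, one list per input.
    """
    hits = sum(
        gold in preds[:k]
        for gold, preds in zip(gold_terms, predictions_per_input)
    )
    return hits / len(gold_terms)
```

With k=1 this reduces to the plain accuracy metric above.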

Results

Split                     Accuracy   Top-5 Accuracy
Validation                75%        83%
In-Dictionary (ID)        72%        80%
Out-of-Dictionary (OOD)   17%        35%

Summary

High performance on the validation and ID test sets indicates effective learning of known representations. Lower performance on OOD terms suggests reduced generalization to unseen standards and highlights the importance of human review for unfamiliar inputs.

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

  • Hardware Type: NVIDIA A10G
    Estimate based on RTX A5000, the closest available option in the calculator.
  • Hours Used: 38
  • Cloud Provider: Amazon Web Services
  • Compute Region: US East (Ohio)
  • Carbon Emitted: ~4.98 kg CO₂eq

Citation

BibTeX:

@article{verbitsky2025metadata,
  title={Metadata Harmonization from Biological Datasets with Language Models},
  author={Verbitsky, Alex and Boutet, Patrick and Eslami, Mohammed},
  journal={bioRxiv},
  year={2025},
  doi={10.1101/2025.01.15.633281},
  publisher={Cold Spring Harbor Laboratory}
}

APA: Verbitsky, A., Boutet, P., & Eslami, M. (2025). Metadata Harmonization from Biological Datasets with Language Models. bioRxiv. https://doi.org/10.1101/2025.01.15.633281
