Update README Usage section
README.md
CHANGED
@@ -20,14 +20,12 @@ tags:
 
 # Cancer Metadata Harmonization GPT2-Large Model
 
-## Summary
-
-A fine-tuned GPT2-Large model for standardizing cancer-related biomedical metadata terms. Converts noisy variants into standardized ontology terms (NCIt, caDSR, GDC, ICD-O3, and MedDRA) for harmonization.
-
 ## Model Details
 
 ### Model Description
 
+`netrias/cancer_harmonization_gpt2large` is a fine-tuned GPT2-Large model for harmonizing cancer-related biomedical terms. It transforms variant input terms—such as synonyms, abbreviations, misspellings, and rephrasings—into standardized ontology terms for consistent metadata curation. Standard terms are drawn from NCIt, caDSR, GDC, ICD-O3, and MedDRA.
+
 - **Developed by:** [Netrias, LLC](https://www.netrias.com/)
 - **Model type:** Autoregressive transformer (GPT2-Large)
 - **Language(s):** English
@@ -36,37 +34,25 @@ A fine-tuned GPT2-Large model for standardizing cancer-related biomedical metada
 
 ### Model Sources
 
-- **Repository:** https://huggingface.co/netrias/cancer_harmonization_gpt2large
-- **Paper:** https://doi.org/10.1101/2025.01.15.633281
+- **Repository:** [netrias/cancer_harmonization_gpt2large](https://huggingface.co/netrias/cancer_harmonization_gpt2large)
+- **Paper:** [Metadata Harmonization from Biological Datasets with Language Models](https://doi.org/10.1101/2025.01.15.633281)
 
 ## Uses
 
 ### Direct Use
-This model
-
-- Standardizing free-text annotations in biological or clinical datasets
-- Improving metadata quality and consistency
-- Enabling downstream analytics and ontology mapping
+This model standardizes variant biomedical terms belonging to five semantic types from the National Cancer Institute Thesaurus: Neoplastic Process, Disease or Syndrome, Finding, Laboratory Procedure, and Quantitative Concept. It is intended for harmonizing cancer-related metadata in biomedical informatics and research datasets.
 
 ### Downstream Use
-
+Can be integrated into larger metadata harmonization workflows, including ontology curation tools, search utilities, terminology normalization services, or dataset integration pipelines requiring consistent annotation of biomedical terms.
 
 ### Out-of-Scope Use
-
-- Clinical decision-making or patient care
-- Generating diagnoses or treatment plans
-- Applications outside biomedical metadata harmonization
-- Harmonizing data in non-biomedical domains
-- Use in production systems without additional validation
+Not suitable for clinical decision-making, patient-facing applications, or unvalidated production use. It should not be applied to domains outside biomedical metadata or to semantic types beyond those covered in training.
 
 ### Bias, Risks, and Limitations
-
-- Misrepresent or hallucinate terms outside the training domain
-- Produce inconsistent results for ambiguous inputs
-- Be biased toward cancer-related vocabularies and standards
+Outputs are constrained by the conventions and vocabulary present in the training data, which focuses on a narrow set of semantic types. The model may misinterpret ambiguous terms or produce incorrect standardizations for unfamiliar variants or unsupported types. The model does not validate input and may generate plausible-looking but incorrect outputs for irrelevant, incomplete, or empty prompts. Manual review is recommended before downstream use.
 
 ### Recommendations
-
+Limit use to the supported semantic types, and review outputs before applying them in downstream applications.
 
 ## How to Get Started with the Model
 
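The recommendations in the README (stay within the five supported semantic types, and review every output) could be wired around the model roughly as follows. This is a minimal sketch and not the authors' documented usage: the prompt template in `build_prompt`, the `Harmonization` record, the `harmonize` helper, and the generation settings are all assumptions introduced for illustration. Only the model ID and the list of semantic types come from the README text.

```python
from dataclasses import dataclass
from typing import Callable, Optional

MODEL_ID = "netrias/cancer_harmonization_gpt2large"

# The five NCIt semantic types the README lists as supported.
SUPPORTED_SEMANTIC_TYPES = frozenset({
    "Neoplastic Process",
    "Disease or Syndrome",
    "Finding",
    "Laboratory Procedure",
    "Quantitative Concept",
})


@dataclass
class Harmonization:
    term: str
    standard_term: Optional[str]
    status: str  # "pending_review" or "skipped"


def build_prompt(term: str) -> str:
    # ASSUMPTION: hypothetical template. The real prompt format is not shown
    # in this diff; check the README's own usage section before relying on it.
    return f"input: {term.strip()}\noutput:"


def load_generator() -> Callable[[str], str]:
    """Wrap a transformers text-generation pipeline as a prompt -> text function.

    Imported lazily because loading downloads the model weights.
    """
    from transformers import pipeline

    pipe = pipeline("text-generation", model=MODEL_ID)

    def generate(prompt: str) -> str:
        # Greedy decoding, returning only the newly generated text.
        result = pipe(prompt, max_new_tokens=32, do_sample=False,
                      return_full_text=False)
        return result[0]["generated_text"]

    return generate


def harmonize(term: str, semantic_type: str,
              generate: Callable[[str], str]) -> Harmonization:
    """Gate on supported types, then mark every model output for manual review."""
    # The model does not validate input, so empty terms and unsupported
    # semantic types are skipped without calling it.
    if not term.strip() or semantic_type not in SUPPORTED_SEMANTIC_TYPES:
        return Harmonization(term, None, "skipped")
    lines = generate(build_prompt(term)).strip().splitlines()
    candidate = lines[0].strip() if lines else None
    # Outputs are never auto-accepted; a curator confirms them downstream.
    return Harmonization(term, candidate, "pending_review")
```

To try it against the real model, something like `harmonize("lung adeno ca", "Neoplastic Process", load_generator())` should work once `transformers` and a backend such as PyTorch are installed. Passing the generator in as a plain function keeps the gating logic testable without downloading the weights.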