averbitsky committed on
Commit 24fd287 · verified · 1 Parent(s): d4425c5

Update README Usage section

Files changed (1)
  1. README.md +9 -23
README.md CHANGED
@@ -20,14 +20,12 @@ tags:
 
  # Cancer Metadata Harmonization GPT2-Large Model
 
- ## Summary
-
- A fine-tuned GPT2-Large model for standardizing cancer-related biomedical metadata terms. Converts noisy variants into standardized ontology terms (NCIt, caDSR, GDC, ICD-O3, and MedDRA) for harmonization.
-
  ## Model Details
 
  ### Model Description
 
+ `netrias/cancer_harmonization_gpt2large` is a fine-tuned GPT2-Large model for harmonizing cancer-related biomedical terms. It transforms variant input terms—such as synonyms, abbreviations, misspellings, and rephrasings—into standardized ontology terms for consistent metadata curation. Standard terms are drawn from NCIt, caDSR, GDC, ICD-O3, and MedDRA.
+
  - **Developed by:** [Netrias, LLC](https://www.netrias.com/)
  - **Model type:** Autoregressive transformer (GPT2-Large)
  - **Language(s):** English
@@ -36,37 +34,25 @@ A fine-tuned GPT2-Large model for standardizing cancer-related biomedical metada
 
  ### Model Sources
 
- - **Repository:** https://huggingface.co/netrias/cancer_harmonization_gpt2large
- - **Paper:** https://doi.org/10.1101/2025.01.15.633281
+ - **Repository:** [netrias/cancer_harmonization_gpt2large](https://huggingface.co/netrias/cancer_harmonization_gpt2large)
+ - **Paper:** [Metadata Harmonization from Biological Datasets with Language Models](https://doi.org/10.1101/2025.01.15.633281)
 
  ## Uses
 
  ### Direct Use
- This model is intended for research, metadata harmonization, and integration tasks in biomedical and clinical informatics. It is useful for:
-
- - Standardizing free-text annotations in biological or clinical datasets
- - Improving metadata quality and consistency
- - Enabling downstream analytics and ontology mapping
+ This model standardizes variant biomedical terms belonging to five semantic types from the National Cancer Institute Thesaurus: Neoplastic Process, Disease or Syndrome, Finding, Laboratory Procedure, and Quantitative Concept. It is intended for harmonizing cancer-related metadata in biomedical informatics and research datasets.
 
  ### Downstream Use
- May be integrated into larger metadata harmonization pipelines for curation workflows, biomedical search tools, or dataset standardization utilities.
+ Can be integrated into larger metadata harmonization workflows, including ontology curation tools, search utilities, terminology normalization services, or dataset integration pipelines requiring consistent annotation of biomedical terms.
 
  ### Out-of-Scope Use
- This model should not be used for:
- - Clinical decision-making or patient care
- - Generating diagnoses or treatment plans
- - Applications outside biomedical metadata harmonization
- - Harmonizing data in non-biomedical domains
- - Use in production systems without additional validation
+ Not suitable for clinical decision-making, patient-facing applications, or unvalidated production use. It should not be applied to domains outside biomedical metadata or to semantic types beyond those covered in training.
 
  ### Bias, Risks, and Limitations
- This model reflects only the terminology and conventions found in its training data. It may:
- - Misrepresent or hallucinate terms outside the training domain
- - Produce inconsistent results for ambiguous inputs
- - Be biased toward cancer-related vocabularies and standards
+ Outputs are constrained by the conventions and vocabulary present in the training data, which focuses on a narrow set of semantic types. The model may misinterpret ambiguous terms or produce incorrect standardizations for unfamiliar variants or unsupported types. The model does not validate input and may generate plausible-looking but incorrect outputs for irrelevant, incomplete, or empty prompts. Manual review is recommended before downstream use.
 
  ### Recommendations
- Users should manually review outputs when using this model in production pipelines or data integration workflows.
+ Limit use to the supported semantic types, and review outputs before applying them in downstream applications.
 
  ## How to Get Started with the Model
 