Aidan Mannion committed
Commit b2b97cc · 1 Parent(s): 87dfd80

Update README.md

Files changed (1):
  1. README.md +72 -3
README.md CHANGED
@@ -6,10 +6,79 @@ tags:
  - medical
  ---
 
- ### UMLS-KGI-BERT-ES
- This is BERT encoder trained on the Spanish section of the European Clinical Case corpus as well as the UMLS metathesaurus knowledge graph, as described in [this paper](https://aclanthology.org/2023.clinicalnlp-1.35/).
 
- If you use this model in your research, please cite one or both of the following:
+ # UMLS-KGI-BERT-ES
+
+ <!-- Provide a quick summary of what the model is/does. -->
+
+ This is a BERT encoder trained on the Spanish-language section of the European Clinical Case corpus as well as the UMLS Metathesaurus knowledge graph, as described in [this paper](https://aclanthology.org/2023.clinicalnlp-1.35/).
+ The training corpus consists of a custom combination of clinical documents from the E3C and text sequences derived from the Metathesaurus (see our [GitHub repo](https://github.com/ap-mannion/bertify-umls) for more details).
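+
+ As a quick-start illustration, a minimal fill-mask sketch with the `transformers` library might look as follows; the Hub model ID below is a placeholder assumption, so substitute the actual repository name for this model.
+
+ ```python
+ from transformers import pipeline
+
+ # NOTE: hypothetical model ID, shown for illustration only; replace it
+ # with the actual Hub repository name for UMLS-KGI-BERT-ES.
+ fill_mask = pipeline("fill-mask", model="<org>/UMLS-KGI-BERT-ES")
+
+ # Example Spanish clinical sentence with a masked token
+ # ([MASK] for BERT-style tokenizers).
+ for pred in fill_mask("El paciente presenta dolor [MASK] agudo."):
+     print(pred["token_str"], pred["score"])
+ ```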
+
+ ## Model Details
+
+ This model was trained using a multi-task approach combining Masked Language Modelling with knowledge-graph-based classification/fill-mask objectives.
+ The idea behind this framework was to improve the robustness of specialised biomedical BERT models by having them learn from structured data as well as natural language, while remaining within the cross-entropy-based learning paradigm (a schematic loss sketch is given below).
+
+ - **Developed by:** Aidan Mannion
+ - **Funded by:** GENCI-IDRIS grant AD011013535R1
+ - **Model type:** DistilBERT
+ - **Language(s) (NLP):** Spanish
+
+ For further details on the model architecture, training objectives, hardware & software used, as well as the preliminary downstream evaluation experiments carried out, refer to the [arXiv paper](https://arxiv.org/abs/2307.11170).
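+
+ As a schematic illustration of the multi-task setup described above (not the authors' actual training code; the task heads and loss weights are assumptions for the sketch), the combined objective can be thought of as a weighted sum of cross-entropy losses:
+
+ ```python
+ import torch
+
+ # Schematic sketch: combine a masked-language-modelling loss with a
+ # knowledge-graph-derived classification loss into one objective.
+ # The loss weights here are illustrative assumptions.
+ def combined_loss(mlm_logits, mlm_labels, kg_logits, kg_labels,
+                   mlm_weight=1.0, kg_weight=1.0):
+     ce = torch.nn.CrossEntropyLoss(ignore_index=-100)
+     # Both objectives stay within the cross-entropy paradigm.
+     loss_mlm = ce(mlm_logits.view(-1, mlm_logits.size(-1)), mlm_labels.view(-1))
+     loss_kg = ce(kg_logits.view(-1, kg_logits.size(-1)), kg_labels.view(-1))
+     return mlm_weight * loss_mlm + kg_weight * loss_kg
+ ```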
+
+
+ ### Direct/Downstream Use
+
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+ This model is intended for use in experimental clinical/biomedical NLP work, either as part of a larger system requiring text encoding or fine-tuned on a specific downstream task requiring clinical language modelling.
+ It has **not** been sufficiently tested for accuracy, robustness and bias to be used in production settings.
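+
+ For the fine-tuning route, a minimal token-classification sketch along these lines might look as follows (the Hub model ID and the label count are placeholder assumptions):
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForTokenClassification
+
+ # Hypothetical model ID, for illustration only.
+ model_id = "<org>/UMLS-KGI-BERT-ES"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ # e.g. a clinical NER task with 5 entity labels (assumed number).
+ model = AutoModelForTokenClassification.from_pretrained(model_id, num_labels=5)
+
+ inputs = tokenizer("El paciente presenta fiebre alta.", return_tensors="pt")
+ outputs = model(**inputs)
+ print(outputs.logits.shape)  # (1, sequence_length, num_labels)
+ ```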
+
+ ### Out-of-Scope Use
+
+ Experiments on general-domain data suggest that, given its specialised training corpus, this model is **not** suitable for use on out-of-domain NLP tasks, and we recommend that it only be used for processing clinical text.
+
+ ### Training Data
+
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+
+ - [European Clinical Case Corpus](https://live.european-language-grid.eu/catalogue/corpus/7618)
+ - [UMLS Metathesaurus](https://www.nlm.nih.gov/research/umls/index.html)
+
+
+ #### Training Hyperparameters
+
+ - sequence length: 256
+ - learning rate: $7.5\times10^{-5}$
+ - linear learning rate schedule with 10,770 warmup steps
+ - effective batch size: 1500 (15 sequences per batch × 100 gradient accumulation steps)
+ - MLM masking probability: 0.15
+
+ **Training regime:** The model was trained with fp16 non-mixed precision, using the AdamW optimizer with default parameters.
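+
+ For orientation, the settings listed above map onto Hugging Face `TrainingArguments` roughly as in the sketch below; this is an assumed reconstruction from the bullet points, not the authors' actual training script, and the output directory name is a placeholder.
+
+ ```python
+ from transformers import TrainingArguments, DataCollatorForLanguageModeling
+
+ args = TrainingArguments(
+     output_dir="umls-kgi-bert-es",   # placeholder path
+     learning_rate=7.5e-5,
+     lr_scheduler_type="linear",
+     warmup_steps=10_770,
+     per_device_train_batch_size=15,  # 15 × 100 accumulation = 1500 effective
+     gradient_accumulation_steps=100,
+     fp16=True,                       # approximation: fp16=True enables mixed
+                                      # precision; the card reports non-mixed fp16
+     optim="adamw_torch",             # AdamW with default parameters
+ )
+
+ # The MLM masking probability of 0.15 is set on the data collator rather than
+ # TrainingArguments, e.g. (with a tokenizer already loaded):
+ # collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
+ ```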
+
+
+ ## Evaluation
+
+ <!-- This section describes the evaluation protocols and provides the results. -->
+
+ ### Testing Data, Factors & Metrics
+
+ #### Testing Data
+
+ <!-- This should link to a Dataset Card if possible. -->
+
+ [More Information Needed]
+
+ #### Metrics
+
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
+
+ [More Information Needed]
+
+ ### Results
+
+ [More Information Needed]
+
+ ## Citation [BibTeX]
+
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
  ```
  @inproceedings{mannion-etal-2023-umls,
  title = "{UMLS}-{KGI}-{BERT}: Data-Centric Knowledge Integration in Transformers for Biomedical Entity Recognition",