Thomas Lemberger committed · Commit d66184b · Parent(s): 7af11f5 · initial card

README.md (ADDED)
---
language:
-
-
thumbnail:
tags:
-
-
-
license:
datasets:
-
-
metrics:
-
-
---

# bio-lm

## Model description

This model is a [RoBERTa base model](https://huggingface.co/roberta-base) further trained with a masked language modeling task on a compendium of English scientific texts from the life sciences, using the [BioLang dataset](https://huggingface.co/datasets/EMBO/biolang).

## Intended uses & limitations

#### How to use

The intended use of this model is to be fine-tuned for downstream tasks, token classification in particular.

For a quick check of the model as-is on a fill-mask task:

```python
from transformers import pipeline, RobertaTokenizerFast

# The model expects the roberta-base tokenizer.
tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base', max_len=512)
text = "Let us try this model to see if it <mask>."
# Fill-mask pipeline using the bio-lm checkpoint.
fill_mask = pipeline(
    "fill-mask",
    model='EMBO/bio-lm',
    tokenizer=tokenizer
)
fill_mask(text)
```
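Since the intended use is fine-tuning for token classification, the following is a minimal sketch of how the checkpoint could be loaded for that purpose. The label set here is a hypothetical placeholder, not something shipped with the model, and the classification head still has to be trained on your own labeled data.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Hypothetical label set for illustration only; replace with the labels of your task.
labels = ["O", "B-ENTITY", "I-ENTITY"]

# The model must be used with the roberta-base tokenizer.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "EMBO/bio-lm",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
# The token classification head is freshly initialized; fine-tune it on labeled data,
# for example with transformers.Trainer.
```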

#### Limitations and bias

This model should be fine-tuned on a specific task like token classification.
The model must be used with the `roberta-base` tokenizer.

## Training data

The model was trained with a masked language modeling task on the [BioLang dataset](https://huggingface.co/datasets/EMBO/biolang), which includes 12 million examples from abstracts and figure legends extracted from papers published in the life sciences.
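The dataset is hosted on the Hugging Face Hub, so it can be loaded with the `datasets` library. The configuration name below is an assumption based on the `bio_lang/MLM` path listed in the training settings; check the dataset card for the configurations that are actually available.

```python
from datasets import load_dataset

# "MLM" is an assumed configuration name; see the EMBO/biolang dataset card for the real ones.
biolang = load_dataset("EMBO/biolang", "MLM")
print(biolang)  # inspect the available splits and number of examples
```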

## Training procedure

The training was run on an NVIDIA DGX Station with 4× Tesla V100 GPUs.

Training code is available at https://github.com/source-data/soda-roberta

The main training settings were as follows (a rough `TrainingArguments` equivalent is sketched after the list):

- Command: `python -m lm.train /data/json/oapmc_abstracts_figs/ MLM`
- Tokenizer vocab size: 50265
- Training data: bio_lang/MLM
- Training with: 12,005,390 examples
- Evaluating on: 36,713 examples
- Epochs: 3.0
- per_device_train_batch_size: 16
- per_device_eval_batch_size: 16
- learning_rate: 5e-05
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1.0
- TensorBoard run: lm-MLM-2021-01-27T15-17-43.113766
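As a rough, non-authoritative illustration, the hyperparameters above correspond approximately to the following `transformers.TrainingArguments`; the output and logging directories are placeholders, and the actual training entry point is `lm.train` in the soda-roberta repository linked above.

```python
from transformers import TrainingArguments

# Approximate equivalent of the settings listed above; directory paths are placeholders.
training_args = TrainingArguments(
    output_dir="./lm-MLM",            # placeholder
    num_train_epochs=3.0,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=5e-05,
    weight_decay=0.0,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-08,
    max_grad_norm=1.0,
    logging_dir="./runs/lm-MLM",      # placeholder for the TensorBoard run
)
```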

End of training eval on validation set:
```
{'loss': 0.8653350830078125, 'learning_rate': 6.708070119323685e-08, 'epoch': 2.995975157928406}
{'eval_loss': 0.8192330598831177, 'eval_recall': 0.8154601116513597, 'epoch': 2.995975157928406}
```

## Eval results

Eval on test set:
{'test_loss': 0.8240728974342346, 'test_recall': 0.814471959728645}

### BibTeX entry and citation info

```bibtex
@inproceedings{...,
    year={2020}
}
```