---
language:
- en
thumbnail: >-
  https://www.fusion.uni-jena.de/fusionmedia/fusionpictures/fusion-service/fusion-transp.png?height=383&width=680
tags:
- bert-base-cased
- biodiversity
- token-classification
- sequence-classification
license: apache-2.0
citation: "Abdelmageed, N., Löffler, F., & König-Ries, B. (2023). BiodivBERT: a Pre-Trained Language Model for the Biodiversity Domain."
paper: https://ceur-ws.org/Vol-3415/paper-7.pdf
metrics:
- f1
- precision
- recall
- accuracy
evaluation datasets:
  - url: https://doi.org/10.5281/zenodo.6554208
  - named entity recognition:
      - COPIOUS
      - QEMP
      - BiodivNER
      - LINNAEUS
      - Species800
  - relation extraction:
      - GAD
      - EU-ADR
      - BiodivRE
      - BioRelEx  
training_data:
- crawling-keywords: 
  - biodivers
  - genetic diversity
  - omic diversity
  - phylogenetic diversity
  - soil diversity
  - population diversity
  - species diversity
  - ecosystem diversity
  - functional diversity
  - microbial diversity
- corpora:
  - (+Abs) Springer and Elsevier abstracts from 1990 to 2020
  - (+Abs+Full) Springer and Elsevier abstracts and open-access full publication text from 1990 to 2020
pre-training-hyperparams:
- MAX_LEN = 512 # Default of BERT Tokenizer
- MLM_PROP = 0.15 # Data Collator
- num_train_epochs = 3 # the minimum sufficient number of epochs reported in many articles && the Trainer default here
- per_device_train_batch_size = 16 # the maximum that a V100 on Ara could hold with MAX_LEN = 512 was 8 in the old run
- per_device_eval_batch_size = 16 # usually the same as above
- gradient_accumulation_steps = 4 # this yields an effective batch size of 16 * 4 * nGPUs
---

# BiodivBERT

## Model description
* BiodivBERT is a domain-specific, BERT-based cased model for biodiversity literature.
* It uses the tokenizer of the BERT base cased model.
* BiodivBERT is pre-trained on abstracts and full text from biodiversity literature.
* BiodivBERT is fine-tuned on two downstream tasks in the biodiversity domain: Named Entity Recognition and Relation Extraction.
* Please visit our [GitHub Repo](https://github.com/fusion-jena/BiodivBERT) for more details.

## How to use
* You can use BiodivBERT via the Hugging Face `transformers` library as follows:

1. Masked Language Model 

```python
>>> from transformers import AutoTokenizer, AutoModelForMaskedLM

>>> tokenizer = AutoTokenizer.from_pretrained("NoYo25/BiodivBERT")
>>> model = AutoModelForMaskedLM.from_pretrained("NoYo25/BiodivBERT")
```
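
For a quick sanity check of the masked-language-model head, the `fill-mask` pipeline can be used directly. The example sentence below is purely illustrative and not taken from the training corpus:

```python
>>> from transformers import pipeline

>>> fill_mask = pipeline("fill-mask", model="NoYo25/BiodivBERT", tokenizer="NoYo25/BiodivBERT")
>>> # [MASK] is the mask token of the cased BERT tokenizer
>>> for pred in fill_mask("Habitat loss is a major driver of [MASK] decline."):
...     print(pred["token_str"], round(pred["score"], 3))
```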

2. Token Classification - Named Entity Recognition

```python
>>> from transformers import AutoTokenizer, AutoModelForTokenClassification

>>> tokenizer = AutoTokenizer.from_pretrained("NoYo25/BiodivBERT")
>>> model = AutoModelForTokenClassification.from_pretrained("NoYo25/BiodivBERT")
```
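
Note that loading `AutoModelForTokenClassification` from this checkpoint initializes a fresh classification head, so it has to be fine-tuned on a labelled NER dataset (e.g. one of the datasets listed in the metadata) before its predictions are meaningful. A minimal inference sketch, assuming a hypothetical 5-label scheme:

```python
>>> import torch
>>> from transformers import AutoTokenizer, AutoModelForTokenClassification

>>> tokenizer = AutoTokenizer.from_pretrained("NoYo25/BiodivBERT")
>>> # num_labels is dataset-specific; the head is newly initialized here
>>> # and only produces meaningful labels after fine-tuning.
>>> model = AutoModelForTokenClassification.from_pretrained("NoYo25/BiodivBERT", num_labels=5)

>>> inputs = tokenizer("Ursus arctos inhabits boreal forests.", return_tensors="pt")
>>> with torch.no_grad():
...     logits = model(**inputs).logits
>>> print(logits.argmax(dim=-1))  # one predicted label id per token
```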

3. Sequence Classification - Relation Extraction

```python
>>> from transformers import AutoTokenizer, AutoModelForSequenceClassification

>>> tokenizer = AutoTokenizer.from_pretrained("NoYo25/BiodivBERT")
>>> model = AutoModelForSequenceClassification.from_pretrained("NoYo25/BiodivBERT")
```
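
As with NER, the sequence-classification head is newly initialized and needs fine-tuning on a labelled relation-extraction dataset first. A minimal sketch, assuming a binary relation label set:

```python
>>> import torch
>>> from transformers import AutoTokenizer, AutoModelForSequenceClassification

>>> tokenizer = AutoTokenizer.from_pretrained("NoYo25/BiodivBERT")
>>> # A binary label set is assumed here for illustration; fine-tune the
>>> # newly initialized head on a labelled RE dataset before relying on it.
>>> model = AutoModelForSequenceClassification.from_pretrained("NoYo25/BiodivBERT", num_labels=2)

>>> inputs = tokenizer("Habitat fragmentation reduces amphibian genetic diversity.", return_tensors="pt")
>>> with torch.no_grad():
...     probs = model(**inputs).logits.softmax(dim=-1)
>>> print(probs)  # class probabilities for the sentence pair / relation candidate
```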

## Training data

* BiodivBERT is pre-trained on abstracts and full text from biodiversity domain-related publications.
* We used both the Elsevier and Springer APIs to crawl these data.
* We covered publications from 1990 to 2020; a pre-training sketch follows below.
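
For reference, the pre-training hyperparameters listed in the metadata above roughly map onto the following `transformers` setup. This is a hedged sketch only: corpus loading, tokenization to MAX_LEN = 512, and the authors' actual training script are omitted or replaced by placeholders.

```python
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Start from the cased BERT base checkpoint and continue MLM pre-training.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")

# 15% masking, as in MLM_PROP above.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

training_args = TrainingArguments(
    output_dir="biodivbert-pretraining",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=4,
)

# `train_dataset` / `eval_dataset` would be the tokenized abstracts and full
# texts (truncated to 512 tokens), which are not distributed with this card.
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    # train_dataset=train_dataset,
    # eval_dataset=eval_dataset,
)
# trainer.train()
```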

## Evaluation results
BiodivBERT outperformed the baselines ``BERT_base_cased``, ``biobert_v1.1``, and ``BiLSTM`` on the downstream tasks.