---
language:
- en
thumbnail: >-
https://www.fusion.uni-jena.de/fusionmedia/fusionpictures/fusion-service/fusion-transp.png?height=383&width=680
tags:
- bert-base-cased
- biodiversity
- token-classification
- sequence-classification
license: apache-2.0
citation: "Abdelmageed, N., Löffler, F., & König-Ries, B. (2023). BiodivBERT: a Pre-Trained Language Model for the Biodiversity Domain."
paper: https://ceur-ws.org/Vol-3415/paper-7.pdf
metrics:
- f1
- precision
- recall
- accuracy
evaluation datasets:
- url: https://doi.org/10.5281/zenodo.6554208
- named entity recognition:
- COPIOUS
- QEMP
- BiodivNER
- LINNAEUS
- Species800
- relation extraction:
- GAD
- EU-ADR
- BiodivRE
- BioRelEx
training_data:
- crawling-keywords:
- biodivers
- genetic diversity
- omic diversity
- phylogenetic diversity
- soil diversity
- population diversity
- species diversity
- ecosystem diversity
- functional diversity
- microbial diversity
- corpora:
- (+Abs) Springer and Elsevier abstracts published between 1990 and 2020
- (+Abs+Full) Springer and Elsevier abstracts and open-access full-text publications published between 1990 and 2020
pre-training-hyperparams:
- MAX_LEN = 512 # Default of BERT Tokenizer
- MLM_PROP = 0.15 # Data Collator
- num_train_epochs = 3 # minimum sufficient number of epochs reported in many articles and the Trainer default
- per_device_train_batch_size = 16 # the maximum that fits on a V100 on Ara with MAX_LEN = 512; it was 8 in the old run
- per_device_eval_batch_size = 16 # same as the training batch size
- gradient_accumulation_steps = 4 # yields an effective batch size of 16 * 4 * nGPUs
---
# BiodivBERT
## Model description
* BiodivBERT is a domain-specific, cased BERT-based model for the biodiversity literature.
* It uses the tokenizer of the BERT base cased model.
* BiodivBERT is pre-trained on abstracts and full text from the biodiversity literature.
* BiodivBERT is fine-tuned on two downstream tasks, Named Entity Recognition and Relation Extraction, in the biodiversity domain.
* Please visit our [GitHub Repo](https://github.com/fusion-jena/BiodivBERT) for more details.
## How to use
* You can use BiodivBERT via the Hugging Face `transformers` library as follows:
1. Masked Language Model
````
>>> from transformers import AutoTokenizer, AutoModelForMaskedLM
>>> tokenizer = AutoTokenizer.from_pretrained("NoYo25/BiodivBERT")
>>> model = AutoModelForMaskedLM.from_pretrained("NoYo25/BiodivBERT")
````
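For a quick check of the masked-language-modeling head, you can wrap the loaded model in a `fill-mask` pipeline; the example sentence below is only illustrative and not taken from the training corpus:
````
>>> from transformers import pipeline
>>> fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
>>> fill_mask("Climate change is a major driver of [MASK] loss.")  # BERT-style [MASK] token
````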
2. Token Classification - Named Entity Recognition
````
>>> from transformers import AutoTokenizer, AutoModelForTokenClassification
>>> tokenizer = AutoTokenizer.from_pretrained("NoYo25/BiodivBERT")
>>> model = AutoModelForTokenClassification.from_pretrained("NoYo25/BiodivBERT")
````
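Note that a token-classification head loaded this way is newly initialized until you fine-tune it on an NER dataset (e.g., BiodivNER). After fine-tuning, inference could look like the following sketch; the example sentence and the `aggregation_strategy` choice are illustrative:
````
>>> from transformers import pipeline
>>> ner = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
>>> ner("Salmo trutta populations are declining in European freshwater habitats.")
````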
3. Sequence Classification - Relation Extraction
````
>>> from transformers import AutoTokenizer, AutoModelForSequenceClassification
>>> tokenizer = AutoTokenizer.from_pretrained("NoYo25/BiodivBERT")
>>> model = AutoModelForSequenceClassification.from_pretrained("NoYo25/BiodivBERT")
````
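Likewise, the sequence-classification head needs to be fine-tuned on a relation extraction dataset (e.g., BiodivRE) before its predictions are meaningful. A minimal inference sketch with the `text-classification` pipeline, assuming the input follows whatever entity-marking scheme the fine-tuning dataset uses (the markers below are hypothetical):
````
>>> from transformers import pipeline
>>> re_clf = pipeline("text-classification", model=model, tokenizer=tokenizer)
>>> re_clf("Warming of @HABITAT$ has been linked to declines in @SPECIES$.")  # entity markers are hypothetical
````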
## Training data
* BiodivBERT is pre-trained on abstracts and full text from biodiversity domain-related publications.
* We used both Elsevier and Springer APIs to crawl such data.
* We covered publications over the duration of 1990-2020.
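
The pre-training hyperparameters listed in the metadata above correspond to a standard masked-language-modeling setup in `transformers`. A minimal sketch under those settings follows; the `tokenized_corpus` dataset is a placeholder for the crawled abstracts/full text tokenized with `MAX_LEN = 512`, not the original pipeline:
````
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")

# Mask 15% of tokens for the MLM objective (MLM_PROP above)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

training_args = TrainingArguments(
    output_dir="biodivbert-pretraining",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=4,  # effective batch size of 16 * 4 * nGPUs
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_corpus["train"],        # placeholder: tokenized crawl corpus
    eval_dataset=tokenized_corpus["validation"],    # placeholder
)
trainer.train()
````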
## Evaluation results
BiodivBERT outperformed ``BERT_base_cased``, ``biobert_v1.1``, and a ``BiLSTM`` baseline on the downstream tasks.