---
language:
- en
thumbnail: >-
  https://www.fusion.uni-jena.de/fusionmedia/fusionpictures/fusion-service/fusion-transp.png?height=383&width=680
tags:
- bert-base-cased
- biodiversity
- token-classification
- sequence-classification
license: apache-2.0
citation: "Abdelmageed, N., Löffler, F., & König-Ries, B. (2023). BiodivBERT: a Pre-Trained Language Model for the Biodiversity Domain."
paper: https://ceur-ws.org/Vol-3415/paper-7.pdf
metrics:
- f1
- precision
- recall
- accuracy
evaluation datasets:
  - url: https://doi.org/10.5281/zenodo.6554208
  - named entity recognition:
      - COPIOUS
      - QEMP
      - BiodivNER
      - LINNAEUS
      - Species800
  - relation extraction:
      - GAD
      - EU-ADR
      - BiodivRE
      - BioRelEx  
training_data:
- crawling-keywords: 
  - biodivers
  - genetic diversity
  - omic diversity
  - phylogenetic diversity
  - soil diversity
  - population diversity
  - species diversity
  - ecosystem diversity
  - functional diversity
  - microbial diversity
- corpora:
  - (+Abs) Springer and Elsevier abstracts from 1990 to 2020
  - (+Abs+Full) Springer and Elsevier abstracts and open-access full publication text from 1990 to 2020
pre-training-hyperparams:
- MAX_LEN = 512 # Default of BERT Tokenizer
- MLM_PROP = 0.15 # Data Collator
- num_train_epochs = 3 # the minimum sufficient number of epochs reported in many articles && the Trainer default here
- per_device_train_batch_size = 16 # the maximum that a V100 on Ara could hold with MAX_LEN = 512 was 8 in the old run
- per_device_eval_batch_size = 16 # usually the same as above
- gradient_accumulation_steps = 4 # this yields an effective batch size of 16 * 4 * nGPUs
---

# BiodivBERT

## Model description
* BiodivBERT is a domain-specific, BERT-based cased model for biodiversity literature.
* It uses the tokenizer of the BERT base cased model.
* BiodivBERT is pre-trained on abstracts and full text from biodiversity literature.
* BiodivBERT is fine-tuned on two downstream tasks in the biodiversity domain: Named Entity Recognition and Relation Extraction.
* Please visit our [GitHub Repo](https://github.com/fusion-jena/BiodivBERT) for more details.

## How to use
* You can use BiodivBERT via the Hugging Face `transformers` library as follows:

1. Masked Language Model 

```python
>>> from transformers import AutoTokenizer, AutoModelForMaskedLM

>>> tokenizer = AutoTokenizer.from_pretrained("NoYo25/BiodivBERT")
>>> model = AutoModelForMaskedLM.from_pretrained("NoYo25/BiodivBERT")
```
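
For a quick sanity check of the masked-language-model head, the `fill-mask` pipeline can be used directly. The example sentence below is purely illustrative and not taken from the training corpus:

```python
>>> from transformers import pipeline

>>> fill_mask = pipeline("fill-mask", model="NoYo25/BiodivBERT", tokenizer="NoYo25/BiodivBERT")
>>> # [MASK] is the mask token of the cased BERT tokenizer
>>> for pred in fill_mask("Habitat loss is a major driver of [MASK] decline."):
...     print(pred["token_str"], round(pred["score"], 3))
```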

2. Token Classification - Named Entity Recognition

```python
>>> from transformers import AutoTokenizer, AutoModelForTokenClassification

>>> tokenizer = AutoTokenizer.from_pretrained("NoYo25/BiodivBERT")
>>> model = AutoModelForTokenClassification.from_pretrained("NoYo25/BiodivBERT")
```
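
Note that loading `AutoModelForTokenClassification` from this checkpoint initializes a fresh classification head, so it has to be fine-tuned on a labelled NER dataset (e.g. one of the datasets listed in the metadata) before its predictions are meaningful. A minimal inference sketch, assuming a hypothetical 5-label scheme:

```python
>>> import torch
>>> from transformers import AutoTokenizer, AutoModelForTokenClassification

>>> tokenizer = AutoTokenizer.from_pretrained("NoYo25/BiodivBERT")
>>> # num_labels is dataset-specific; the head is newly initialized here
>>> # and only produces meaningful labels after fine-tuning.
>>> model = AutoModelForTokenClassification.from_pretrained("NoYo25/BiodivBERT", num_labels=5)

>>> inputs = tokenizer("Ursus arctos inhabits boreal forests.", return_tensors="pt")
>>> with torch.no_grad():
...     logits = model(**inputs).logits
>>> print(logits.argmax(dim=-1))  # one predicted label id per token
```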

3. Sequence Classification - Relation Extraction

```python
>>> from transformers import AutoTokenizer, AutoModelForSequenceClassification

>>> tokenizer = AutoTokenizer.from_pretrained("NoYo25/BiodivBERT")
>>> model = AutoModelForSequenceClassification.from_pretrained("NoYo25/BiodivBERT")
```
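
As with NER, the sequence-classification head is newly initialized and needs fine-tuning on a labelled relation-extraction dataset first. A minimal sketch, assuming a binary relation label set:

```python
>>> import torch
>>> from transformers import AutoTokenizer, AutoModelForSequenceClassification

>>> tokenizer = AutoTokenizer.from_pretrained("NoYo25/BiodivBERT")
>>> # A binary label set is assumed here for illustration; fine-tune the
>>> # newly initialized head on a labelled RE dataset before relying on it.
>>> model = AutoModelForSequenceClassification.from_pretrained("NoYo25/BiodivBERT", num_labels=2)

>>> inputs = tokenizer("Habitat fragmentation reduces amphibian genetic diversity.", return_tensors="pt")
>>> with torch.no_grad():
...     probs = model(**inputs).logits.softmax(dim=-1)
>>> print(probs)  # class probabilities for the sentence pair / relation candidate
```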

## Training data

* BiodivBERT is pre-trained on abstracts and full text from biodiversity domain-related publications.
* We used both the Elsevier and Springer APIs to crawl these data.
* We covered publications from 1990 to 2020; a pre-training sketch follows below.
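
For reference, the pre-training hyperparameters listed in the metadata above roughly map onto the following `transformers` setup. This is a hedged sketch only: corpus loading, tokenization to MAX_LEN = 512, and the authors' actual training script are omitted or replaced by placeholders.

```python
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Start from the cased BERT base checkpoint and continue MLM pre-training.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")

# 15% masking, as in MLM_PROP above.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

training_args = TrainingArguments(
    output_dir="biodivbert-pretraining",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=4,
)

# `train_dataset` / `eval_dataset` would be the tokenized abstracts and full
# texts (truncated to 512 tokens), which are not distributed with this card.
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    # train_dataset=train_dataset,
    # eval_dataset=eval_dataset,
)
# trainer.train()
```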

## Evaluation results
BiodivBERT outperformed the baselines ``BERT_base_cased``, ``biobert_v1.1``, and ``BiLSTM`` on the downstream tasks.