---
language:
- en
thumbnail: >-
https://www.fusion.uni-jena.de/fusionmedia/fusionpictures/fusion-service/fusion-transp.png?height=383&width=680
tags:
- bert-base-cased
- biodiversity
- token-classification
- sequence-classification
license: apache-2.0
citation: "Abdelmageed, N., Löffler, F., & König-Ries, B. (2023). BiodivBERT: a Pre-Trained Language Model for the Biodiversity Domain."
paper: https://ceur-ws.org/Vol-3415/paper-7.pdf
metrics:
- f1
- precision
- recall
- accuracy
evaluation datasets:
- url: https://doi.org/10.5281/zenodo.6554208
- named entity recognition:
- COPIOUS
- QEMP
- BiodivNER
- LINNAEUS
- Species800
- relation extraction:
- GAD
- EU-ADR
- BiodivRE
- BioRelEx
training_data:
- crawling-keywords:
- biodivers
- genetic diversity
- omic diversity
- phylogenetic diversity
- soil diversity
- population diversity
- species diversity
- ecosystem diversity
- functional diversity
- microbial diversity
- corpora:
- (+Abs) Springer and Elsevier abstracts published between 1990 and 2020
- (+Abs+Full) Springer and Elsevier abstracts and open-access full publication text published between 1990 and 2020
pre-training-hyperparams:
- MAX_LEN = 512 # default maximum sequence length of the BERT tokenizer
- MLM_PROP = 0.15 # masking probability used by the data collator
- num_train_epochs = 3 # commonly reported as sufficient and the Trainer default
- per_device_train_batch_size = 16 # the maximum that fits on a V100 on Ara with MAX_LEN = 512 (8 in an earlier run)
- per_device_eval_batch_size = 16 # same as the training batch size
- gradient_accumulation_steps = 4 # yields an effective batch size of 16 * 4 * nGPUs
---
# BiodivBERT
## Model description
* BiodivBERT is a domain-specific, cased BERT-based model for biodiversity literature.
* It uses the tokenizer of the BERT base cased model.
* BiodivBERT is pre-trained on abstracts and full text from biodiversity literature.
* BiodivBERT is fine-tuned on two downstream tasks in the biodiversity domain: Named Entity Recognition and Relation Extraction.
* Please visit our [GitHub Repo](https://github.com/fusion-jena/BiodivBERT) for more details.
## How to use
* You can use BiodivBERT via the Hugging Face `transformers` library as follows:
1. Masked Language Model
````
>>> from transformers import AutoTokenizer, AutoModelForMaskedLM
>>> tokenizer = AutoTokenizer.from_pretrained("NoYo25/BiodivBERT")
>>> model = AutoModelForMaskedLM.from_pretrained("NoYo25/BiodivBERT")
````
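As a quick sanity check of the masked-language-model head, you can also use the `fill-mask` pipeline; the example sentence below is illustrative only.
````
>>> from transformers import pipeline
>>> fill_mask = pipeline("fill-mask", model="NoYo25/BiodivBERT")
>>> fill_mask("Deforestation is a major driver of [MASK] loss.")
````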
2. Token Classification - Named Entity Recognition
````
>>> from transformers import AutoTokenizer, AutoModelForTokenClassification
>>> tokenizer = AutoTokenizer.from_pretrained("NoYo25/BiodivBERT")
>>> model = AutoModelForTokenClassification.from_pretrained("NoYo25/BiodivBERT")
````
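Note that loading the checkpoint this way attaches a freshly initialized token-classification head, so it still has to be fine-tuned on an NER dataset (e.g., those listed in the metadata above). A minimal sketch, where the tag count of 5 is only a placeholder:
````
>>> model = AutoModelForTokenClassification.from_pretrained("NoYo25/BiodivBERT", num_labels=5)  # placeholder tag count
>>> inputs = tokenizer("Quercus robur occurs in temperate forests.", return_tensors="pt")
>>> logits = model(**inputs).logits  # per-token scores, shape (1, sequence_length, num_labels)
````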
3. Sequence Classification - Relation Extraction
````
>>> from transformers import AutoTokenizer, AutoModelForSequenceClassification
>>> tokenizer = AutoTokenizer.from_pretrained("NoYo25/BiodivBERT")
>>> model = AutoModelForSequenceClassification.from_pretrained("NoYo25/BiodivBERT")
````
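Likewise, the sequence-classification head is randomly initialized until fine-tuned on a relation-extraction dataset. A minimal sketch, assuming a binary relation label and GAD/EU-ADR-style entity markers (both are assumptions, not the exact setup used here):
````
>>> import torch
>>> model = AutoModelForSequenceClassification.from_pretrained("NoYo25/BiodivBERT", num_labels=2)  # binary relation label: assumption
>>> inputs = tokenizer("@GENE$ expression is associated with @DISEASE$ in model organisms.", return_tensors="pt")
>>> with torch.no_grad():
...     logits = model(**inputs).logits
>>> predicted_label = logits.argmax(dim=-1).item()
````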
## Training data
* BiodivBERT is pre-trained on abstracts and full text from biodiversity domain-related publications.
* We used the Elsevier and Springer APIs to crawl this data.
* We covered publications from 1990 to 2020.
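For reference, the pre-training hyperparameters listed in the metadata above map onto a standard `transformers` masked-language-modelling run roughly as follows. This is a sketch, not the exact training script; `train_dataset` and `eval_dataset` stand in for the tokenized abstract and full-text corpora.
````
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Start from the BERT base cased vocabulary and weights
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")

# MLM_PROP = 0.15: fraction of tokens masked by the data collator
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="biodivbert-pretraining",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=4,  # effective batch size: 16 * 4 * nGPUs
)

# train_dataset / eval_dataset: abstracts and full texts tokenized with MAX_LEN = 512 (placeholders)
trainer = Trainer(
    model=model,
    args=args,
    data_collator=collator,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()
````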
## Evaluation results
BiodivBERT outperformed ``BERT_base_cased``, ``biobert_v1.1``, and a ``BiLSTM`` baseline on the downstream tasks.