---
license: mit
datasets:
- IEETA/SPACCC-Spanish-NER
language:
- es
metrics:
- f1
---

# Model Card for Biomedical Named Entity Recognition in Spanish Clinical Texts

This model performs biomedical Named Entity Recognition (NER) on Spanish clinical texts, a key step in automated information extraction for medical research and treatment improvement. It uses a novel Multi-Head Conditional Random Field (CRF) classifier for multi-class NER, which allows it to handle overlapping entity instances. It recognizes five classes: symptoms, procedures, diseases, chemicals, and proteins.

We provide 4 different models, available as branches of this repository.

## Model Details

### Model Description

- **Developed by:** IEETA
- **Model type:** Multi-Head-CRF, RoBERTa base
- **Language(s) (NLP):** Spanish
- **License:** MIT
- **Finetuned from model:** lcampillos/roberta-es-clinical-trials-ner

### Model Sources

- **Repository:** [IEETA Multi-Head-CRF GitHub](https://github.com/ieeta-pt/Multi-Head-CRF)
- **Paper:** Multi-head CRF classifier for biomedical multi-class Named Entity Recognition on Spanish clinical notes [Awaiting Publication]

**Authors:**
- Richard A A Jonker ([ORCID: 0000-0002-3806-6940](https://orcid.org/0000-0002-3806-6940))
- Tiago Almeida ([ORCID: 0000-0002-4258-3350](https://orcid.org/0000-0002-4258-3350))
- Rui Antunes ([ORCID: 0000-0003-3533-8872](https://orcid.org/0000-0003-3533-8872))
- João R Almeida ([ORCID: 0000-0003-0729-2264](https://orcid.org/0000-0003-0729-2264))
- Sérgio Matos ([ORCID: 0000-0003-1941-3983](https://orcid.org/0000-0003-1941-3983))


## Uses

Note: we accept no liability for use of the model in any professional or medical setting. The model is intended for academic purposes only. It performs Named Entity Recognition over 5 classes: SYMPTOM, PROCEDURE, DISEASE, PROTEIN, and CHEMICAL.
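Since the classifier is multi-headed, entities of different classes can overlap in the output. The sketch below illustrates one way such output could be turned into spans. It assumes, purely for illustration, that each head emits one `B`/`I`/`O` tag per token for its own class; the helper names `bio_to_spans` and `merge_heads` and the example tags are hypothetical and are not taken from the repository code.

```python
from typing import Dict, List, Tuple

# (start_token, end_token_exclusive, label)
Span = Tuple[int, int, str]


def bio_to_spans(tags: List[str], label: str) -> List[Span]:
    """Convert one head's B/I/O tag sequence into token spans."""
    spans: List[Span] = []
    start = None
    for i, tag in enumerate(tags):
        # A "B" or "O" tag closes any span that is currently open.
        if tag in ("B", "O") and start is not None:
            spans.append((start, i, label))
            start = None
        if tag == "B":
            start = i
        elif tag == "I" and start is None:
            # Tolerate an "I" with no preceding "B" by opening a span here.
            start = i
    if start is not None:
        spans.append((start, len(tags), label))
    return spans


def merge_heads(head_tags: Dict[str, List[str]]) -> List[Span]:
    """Collect spans from every head; spans of different classes may overlap."""
    spans: List[Span] = []
    for label, tags in head_tags.items():
        spans.extend(bio_to_spans(tags, label))
    return sorted(spans)


# Made-up tags for a 4-token sentence: the PROCEDURE span sits inside the SYMPTOM span.
example = {
    "SYMPTOM": ["O", "B", "I", "I"],
    "PROCEDURE": ["O", "O", "B", "O"],
}
spans = merge_heads(example)
print(spans)
```

Because each class is decoded independently, no rule is needed to resolve conflicts between classes; a single-sequence tagger would have to pick one label per token.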

## How to Get Started with the Model

Please refer to our GitHub repository for more information on how to train the model and run inference: [IEETA Multi-Head-CRF GitHub](https://github.com/ieeta-pt/Multi-Head-CRF)

## Training Details

### Training Data

The training data can be found on IEETA/SPACCC-Spanish-NER, which is further described on the dataset card.
The dataset consists of 4 separate datasets:
- [SympTEMIST](https://zenodo.org/records/10635215)
- [MedProcNER](https://zenodo.org/records/8224056)
- [DisTEMIST](https://zenodo.org/records/7614764)
- [PharmaCoNER](https://zenodo.org/records/4270158)

### Speeds, Sizes, Times

The models were trained on an Nvidia Quadro RTX 8000. The 5-class models took approximately 1 hour to train and occupy around 1 GB of disk space. Training time scales linearly with the number of entity classes, adding roughly 8 minutes per class.

### Testing Data, Factors & Metrics

#### Testing Data
The testing data can be found on IEETA/SPACCC-Spanish-NER, which is further described on the dataset card.

#### Metrics

The models were evaluated using the micro-averaged F1-score metric, the standard for entity recognition tasks.
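For readers unfamiliar with the metric, the following is a minimal sketch of micro-averaged F1 over entity spans. It assumes exact matching on document, span boundaries, and class; this card does not state the matching criterion, so treat that as an assumption. The function name `micro_f1` and the example spans are hypothetical.

```python
from typing import Iterable, Tuple

# (doc_id, start, end, label)
Entity = Tuple[str, int, int, str]


def micro_f1(gold: Iterable[Entity], pred: Iterable[Entity]) -> float:
    """Micro-averaged F1: counts are pooled over all classes before averaging."""
    gold_set, pred_set = set(gold), set(pred)
    tp = len(gold_set & pred_set)   # predicted and correct
    fp = len(pred_set - gold_set)   # predicted but not in gold
    fn = len(gold_set - pred_set)   # in gold but missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up annotations: 4 gold entities, 3 predictions, 2 of them exact matches.
gold = [
    ("doc1", 0, 5, "SYMPTOM"),
    ("doc1", 10, 18, "DISEASE"),
    ("doc1", 20, 27, "CHEMICAL"),
    ("doc2", 3, 9, "PROCEDURE"),
]
pred = [
    ("doc1", 0, 5, "SYMPTOM"),
    ("doc1", 10, 18, "DISEASE"),
    ("doc2", 3, 8, "PROCEDURE"),  # boundary off by one, so not an exact match
]
f1 = micro_f1(gold, pred)
print(round(f1, 4))
```

Pooling the counts before computing precision and recall means frequent classes weigh more than rare ones, unlike a macro average over per-class scores.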

### Results

We provide 4 separate models with various hyperparameter changes:

| HLs per head | Augmentation | Percentage Tags | Augmentation Probability | F1     |
|--------------|--------------|-----------------|--------------------------|--------|
| 3            | Random       | 0.25            | 0.50                     | 78.73  |
| 3            | Unknown      | 0.50            | 0.25                     | 78.50  |
| 3            | None         | -               | -                        | **78.89** |
| 1            | Random       | 0.25            | 0.50                     | **78.89** |

All models are trained with a context size of 32 tokens for 60 epochs.


## Citation

**BibTeX:**

[Awaiting Publication]