---
license: mit
datasets:
- IEETA/SPACCC-Spanish-NER
language:
- es
metrics:
- f1
---

# Model Card for Biomedical Named Entity Recognition in Spanish Clinical Texts

This model performs biomedical Named Entity Recognition (NER) on Spanish clinical texts, a key step in automated information extraction for medical research and treatment improvement. It proposes a novel approach based on a Multi-Head Conditional Random Field (CRF) classifier for multi-class NER, which addresses the challenge of overlapping entity instances. The recognized classes are symptoms, procedures, diseases, chemicals, and proteins.

We provide 4 different models, available as branches of this repository.

## Model Details

### Model Description

- **Developed by:** IEETA
- **Model type:** Multi-Head-CRF, RoBERTa Base
- **Language(s) (NLP):** Spanish
- **License:** MIT
- **Finetuned from model:** lcampillos/roberta-es-clinical-trials-ner

### Model Sources

- **Repository:** [IEETA Multi-Head-CRF GitHub](https://github.com/ieeta-pt/Multi-Head-CRF)
- **Paper:** Multi-head CRF classifier for biomedical multi-class Named Entity Recognition on Spanish clinical notes [Awaiting Publication]

**Authors:**

- Richard A A Jonker ([ORCID: 0000-0002-3806-6940](https://orcid.org/0000-0002-3806-6940))
- Tiago Almeida ([ORCID: 0000-0002-4258-3350](https://orcid.org/0000-0002-4258-3350))
- Rui Antunes ([ORCID: 0000-0003-3533-8872](https://orcid.org/0000-0003-3533-8872))
- João R Almeida ([ORCID: 0000-0003-0729-2264](https://orcid.org/0000-0003-0729-2264))
- Sérgio Matos ([ORCID: 0000-0003-1941-3983](https://orcid.org/0000-0003-1941-3983))

## Uses

Note that we accept no liability for the use of the model in any professional or medical setting. The model is intended for academic purposes only. It performs Named Entity Recognition over 5 classes: SYMPTOM, PROCEDURE, DISEASE, PROTEIN, and CHEMICAL.

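To illustrate how a multi-head design can represent overlapping entities, the sketch below decodes one tag sequence per class into token spans. It assumes, for illustration only, that each head emits its own `B`/`I`/`O` sequence over the same tokens; the function names, the tag scheme, and the example sentence are hypothetical and are not taken from the released code or the dataset.

```python
from typing import Dict, List, Tuple

# (start_token, end_token_exclusive, label)
Span = Tuple[int, int, str]


def decode_bio(tags: List[str], label: str) -> List[Span]:
    """Turn one head's B/I/O tag sequence into token spans for `label`."""
    spans: List[Span] = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "B":
            # A new entity starts; close any entity that was still open.
            if start is not None:
                spans.append((start, i, label))
            start = i
        elif tag == "I":
            # Tolerate an I tag without a preceding B by opening a span here.
            if start is None:
                start = i
        else:  # "O"
            if start is not None:
                spans.append((start, i, label))
                start = None
    if start is not None:
        spans.append((start, len(tags), label))
    return spans


def decode_heads(head_tags: Dict[str, List[str]]) -> List[Span]:
    """Decode every head independently, so spans of different classes may overlap."""
    spans: List[Span] = []
    for label, tags in head_tags.items():
        spans.extend(decode_bio(tags, label))
    return sorted(spans)


# Hypothetical example: a symptom mention that contains a disease mention.
tokens = ["dolor", "abdominal", "por", "apendicitis", "aguda"]
head_tags = {
    "SYMPTOM": ["B", "I", "I", "I", "I"],
    "DISEASE": ["O", "O", "O", "B", "I"],
}
spans = decode_heads(head_tags)
for start, end, label in spans:
    print(label, " ".join(tokens[start:end]))
```

Because each class is decoded on its own, the nested DISEASE span does not compete with the enclosing SYMPTOM span, which a single shared tag sequence could not express.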
## How to Get Started with the Model

Please refer to our GitHub repository for more information on how to train the model and run inference: [IEETA Multi-Head-CRF GitHub](https://github.com/ieeta-pt/Multi-Head-CRF)

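Since the model variants are published as branches, one way to fetch a specific variant is to pass the branch name as the `revision` argument of `snapshot_download` from the `huggingface_hub` package. The sketch below is a minimal example under assumptions: `REPO_ID` and `BRANCH` are hypothetical placeholders that must be replaced with this repository's identifier and one of its actual branch names, and loading the downloaded weights still follows the instructions in the GitHub repository.

```python
from typing import Dict

# Hypothetical placeholders: substitute the real repository id and branch name.
REPO_ID = "your-namespace/your-model-repo"
BRANCH = "main"


def build_download_kwargs(repo_id: str, branch: str) -> Dict[str, str]:
    """Collect the arguments that select one branch of a Hub repository."""
    return {"repo_id": repo_id, "revision": branch}


def download_variant(repo_id: str, branch: str) -> str:
    """Download the files of one branch and return the local folder path."""
    # Imported lazily so the helper above works without the third-party package.
    from huggingface_hub import snapshot_download

    return snapshot_download(**build_download_kwargs(repo_id, branch))
```

Calling `download_variant(REPO_ID, BRANCH)` requires network access and the `huggingface_hub` package.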
## Training Details

### Training Data

The training data can be found on IEETA/SPACCC-Spanish-NER, which is further described on the dataset card.

The dataset combines 4 separate datasets:

- [SympTEMIST](https://zenodo.org/records/10635215)
- [MedProcNER](https://zenodo.org/records/8224056)
- [DisTEMIST](https://zenodo.org/records/7614764)
- [PharmaCoNER](https://zenodo.org/records/4270158)

### Speeds, Sizes, Times

The models were trained on an Nvidia Quadro RTX 8000. The 5-class models took approximately 1 hour to train and occupy around 1 GB of disk space. Additionally, the model scales linearly with the number of entity classes to classify, at roughly 8 additional minutes per class.

### Testing Data, Factors & Metrics

#### Testing Data

The testing data can be found on IEETA/SPACCC-Spanish-NER, which is further described on the dataset card.

#### Metrics

The models were evaluated using the micro-averaged F1-score metric, the standard for entity recognition tasks.

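For reference, micro-averaged F1 pools true positives, predicted entities, and gold entities across all classes and documents before computing precision and recall. The sketch below assumes strict matching, where a prediction counts only if document, boundaries, and label all agree; the matching rule and the example spans are illustrative assumptions, and the official evaluation scripts for the underlying datasets may differ.

```python
from typing import Iterable, Tuple

# (document_id, start_offset, end_offset, label)
Entity = Tuple[str, int, int, str]


def micro_f1(gold: Iterable[Entity], pred: Iterable[Entity]) -> float:
    """Micro-averaged F1 over all entities, using exact span and label match."""
    gold_set, pred_set = set(gold), set(pred)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up annotations: 4 gold entities, 3 predictions, 2 exact matches.
gold = {
    ("d1", 0, 5, "SYMPTOM"),
    ("d1", 3, 5, "DISEASE"),
    ("d2", 7, 9, "CHEMICAL"),
    ("d2", 12, 13, "PROTEIN"),
}
pred = {
    ("d1", 0, 5, "SYMPTOM"),
    ("d1", 3, 5, "DISEASE"),
    ("d2", 7, 8, "CHEMICAL"),  # wrong end boundary, so not a match
}
score = micro_f1(gold, pred)
print(f"micro-F1 = {score:.4f}")
```

Here precision is 2/3 and recall is 2/4, which gives an F1 of 4/7.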
### Results

We provide 4 separate models with various hyperparameter changes:

| HLs per head | Augmentation | Percentage Tags | Augmentation Probability | F1 |
|--------------|--------------|-----------------|--------------------------|--------|
| 3 | Random | 0.25 | 0.50 | 78.73 |
| 3 | Unknown | 0.50 | 0.25 | 78.50 |
| 3 | None | - | - | **78.89** |
| 1 | Random | 0.25 | 0.50 | **78.89** |

All models are trained with a context size of 32 tokens for 60 epochs.

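The augmentation columns in the table can be read as follows, though this is only one plausible interpretation and the actual scheme is defined in the GitHub repository: with probability "Augmentation Probability" a training sample is augmented, and in that case a "Percentage Tags" fraction of its tokens is replaced, either by random vocabulary tokens ("Random") or by the unknown token ("Unknown"). The function below is a hypothetical sketch of that reading; all names and parameters are assumptions.

```python
import random
from typing import List, Optional


def augment_tokens(
    token_ids: List[int],
    mode: Optional[str],
    percentage_tags: float,
    aug_probability: float,
    vocab_size: int,
    unk_id: int,
    rng: random.Random,
) -> List[int]:
    """Replace a fraction of tokens, under the assumed reading of the table columns."""
    if mode not in (None, "random", "unknown"):
        raise ValueError(f"unsupported augmentation mode: {mode!r}")
    # Leave the sample untouched when augmentation is off or not drawn this time.
    if mode is None or rng.random() >= aug_probability:
        return list(token_ids)
    n_replace = int(len(token_ids) * percentage_tags)
    positions = rng.sample(range(len(token_ids)), n_replace)
    augmented = list(token_ids)
    for pos in positions:
        if mode == "random":
            augmented[pos] = rng.randrange(vocab_size)  # random vocabulary token
        else:
            augmented[pos] = unk_id  # unknown-token replacement
    return augmented


rng = random.Random(0)
sample = [11, 12, 13, 14, 15, 16, 17, 18]
# Always augment, replacing half of the tokens with the unknown token (id 3 here).
masked = augment_tokens(sample, "unknown", 0.50, 1.0, vocab_size=1000, unk_id=3, rng=rng)
print(masked)
```

Passing a seeded `random.Random` instance keeps the augmentation reproducible across runs.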
## Citation

**BibTeX:**

[Awaiting Publication]