---
license: mit
datasets:
- eriktks/conll2003
language:
- en
base_model:
- google-bert/bert-base-uncased
pipeline_tag: token-classification
library_name: transformers
tags:
- ner
---

# Model Card: BERT for Named Entity Recognition (NER)

## Model Overview

This model, **bert-conll-ner**, is a fine-tuned version of `bert-base-uncased` for Named Entity Recognition (NER), trained on the [CoNLL-2003](https://huggingface.co/datasets/eriktks/conll2003) dataset. It identifies and classifies entities in text: **person names (PER)**, **organizations (ORG)**, **locations (LOC)**, and **miscellaneous (MISC)** entities.

### Model Architecture

- **Base Model**: BERT (Bidirectional Encoder Representations from Transformers), `bert-base-uncased` architecture.
- **Task**: Token classification (NER).

## Training Dataset

- **Dataset**: CoNLL-2003, a standard dataset for NER tasks containing sentences annotated with named entity spans.
- **Classes**:
  - `PER` (Person)
  - `ORG` (Organization)
  - `LOC` (Location)
  - `MISC` (Miscellaneous)
  - `O` (Outside of any entity span)
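
Under the hood these classes are encoded in the IOB2 scheme, giving nine token-level labels. A minimal sketch of the mapping, assuming the label ordering used by the CoNLL-2003 dataset on the Hub (the authoritative `id2label` ships in the model's `config.json`):

```python
# Nine token-level labels in IOB2 format: "B-" opens an entity span,
# "I-" continues it, and "O" marks tokens outside any span.
labels = [
    "O",
    "B-PER", "I-PER",
    "B-ORG", "I-ORG",
    "B-LOC", "I-LOC",
    "B-MISC", "I-MISC",
]
id2label = dict(enumerate(labels))
label2id = {label: i for i, label in enumerate(labels)}
```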

## Performance Metrics

The model demonstrates strong performance on the CoNLL-2003 evaluation set:

| Metric        | Value  |
|---------------|--------|
| **Loss**      | 0.0649 |
| **Precision** | 93.59% |
| **Recall**    | 95.07% |
| **F1 Score**  | 94.32% |
| **Accuracy**  | 98.79% |

These metrics indicate that the model identifies and classifies entities accurately and robustly.
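
As a quick sanity check, the reported F1 score is consistent with the precision and recall in the table, since entity-level F1 is their harmonic mean:

```python
# F1 is the harmonic mean of precision and recall.
precision = 0.9359
recall = 0.9507
f1 = 2 * precision * recall / (precision + recall)
print(round(f1 * 100, 2))  # -> 94.32, matching the table
```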

## Training Details

- **Optimizer**: AdamW (Adam with decoupled weight decay)
- **Learning Rate**: 2e-5
- **Batch Size**: 8
- **Number of Epochs**: 3
- **Scheduler**: Linear decay with warm-up steps
- **Loss Function**: Cross-entropy loss with ignore index (`-100`) for special and padding tokens
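
These hyperparameters map directly onto `transformers`' `TrainingArguments`. A hedged sketch of what the fine-tuning configuration may have looked like; the `output_dir`, `warmup_steps`, and `weight_decay` values are illustrative assumptions, not taken from the actual run:

```python
from transformers import TrainingArguments

# Illustrative configuration mirroring the hyperparameters listed above.
# Trainer's default optimizer is AdamW, and its default cross-entropy
# loss ignores label index -100.
training_args = TrainingArguments(
    output_dir="bert-conll-ner",   # hypothetical path
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    num_train_epochs=3,
    lr_scheduler_type="linear",
    warmup_steps=500,              # assumed; not stated above
    weight_decay=0.01,             # assumed; not stated above
)
```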

## Model Input/Output

- **Input Format**: Tokenized text with special tokens `[CLS]` and `[SEP]`.
- **Output Format**: Token-level predictions with labels from the NER tag set (`B-PER`, `I-PER`, etc.).
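
The `-100` index used during training is how word-level labels are aligned to WordPiece sub-tokens: special tokens and continuation pieces are masked out of the loss. A minimal, self-contained sketch; the `word_ids` values below are hypothetical tokenizer output, and the label ids follow the standard CoNLL-2003 ordering:

```python
def align_labels(word_ids, word_labels):
    """Map word-level labels onto sub-tokens; -100 marks positions the
    cross-entropy loss should ignore ([CLS]/[SEP]/padding and
    non-initial WordPiece pieces)."""
    aligned, previous = [], None
    for wid in word_ids:
        if wid is None:            # special token or padding
            aligned.append(-100)
        elif wid != previous:      # first sub-token of a word
            aligned.append(word_labels[wid])
        else:                      # continuation sub-token
            aligned.append(-100)
        previous = wid
    return aligned

# "John lives in New York City": 6 words, last word split into two pieces.
word_ids = [None, 0, 1, 2, 3, 4, 5, 5, None]  # hypothetical tokenizer output
word_labels = [1, 0, 0, 5, 6, 6]              # B-PER, O, O, B-LOC, I-LOC, I-LOC
print(align_labels(word_ids, word_labels))
# -> [-100, 1, 0, 0, 5, 6, 6, -100, -100]
```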

## How to Use the Model

### Installation

```bash
pip install transformers torch
```

### Loading the Model

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("sfarrukh/modernbert-conll-ner")
model = AutoModelForTokenClassification.from_pretrained("sfarrukh/modernbert-conll-ner")
```

### Running Inference

```python
from transformers import pipeline

nlp = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",  # merge sub-tokens into whole entities
)
text = "John lives in New York City."
result = nlp(text)
print(result)
```

Example output (Python repr of the pipeline result):

```python
[{'entity_group': 'PER',
  'score': 0.99912304,
  'word': 'john',
  'start': 0,
  'end': 4},
 {'entity_group': 'LOC',
  'score': 0.9993351,
  'word': 'new york city',
  'start': 14,
  'end': 27}]
```

## Limitations

1. **Domain-Specific Adaptability**: Performance may drop on domain-specific texts (e.g., legal or medical) not covered by the CoNLL-2003 dataset.
2. **Ambiguity**: Ambiguous entities and overlapping spans are not explicitly handled.

## Recommendations

- For domain-specific tasks, consider fine-tuning this model further on a relevant dataset.
- Use a pre-processing pipeline to handle long texts by splitting them into smaller segments, since BERT accepts at most 512 tokens per input.
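
As a starting point for the second recommendation, a naive whitespace-based chunker with overlapping windows (the window sizes are illustrative; a production pipeline would split on tokenizer tokens or sentence boundaries instead):

```python
def chunk_text(text, max_words=128, stride=100):
    """Split text into overlapping word windows so entities near a
    chunk boundary appear whole in at least one window."""
    words = text.split()
    chunks = []
    for start in range(0, max(len(words), 1), stride):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks

# Each chunk can then be passed to the NER pipeline independently.
long_text = " ".join(f"word{i}" for i in range(300))
print(len(chunk_text(long_text)))  # -> 3 overlapping chunks
```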

## Acknowledgements

- **Transformers Library**: Hugging Face
- **Dataset**: CoNLL-2003
- **Base Model**: `bert-base-uncased` by Google