---
license: mit
datasets:
- eriktks/conll2003
language:
- en
base_model:
- google-bert/bert-base-uncased
pipeline_tag: token-classification
library_name: transformers
tags:
- ner
---

# Model Card: BERT for Named Entity Recognition (NER)

## Model Overview

This model, **bert-conll-ner**, is a fine-tuned version of `bert-base-uncased` for Named Entity Recognition (NER), trained on the [CoNLL-2003](https://huggingface.co/datasets/eriktks/conll2003) dataset. It identifies and classifies entities in text: **person names (PER)**, **organizations (ORG)**, **locations (LOC)**, and **miscellaneous (MISC)** entities.

### Model Architecture

- **Base Model**: BERT (Bidirectional Encoder Representations from Transformers), `bert-base-uncased` architecture.
- **Task**: Token classification (NER).

## Training Dataset

- **Dataset**: CoNLL-2003, a standard dataset for NER tasks containing sentences annotated with named entity spans.
- **Classes**:
  - `PER` (Person)
  - `ORG` (Organization)
  - `LOC` (Location)
  - `MISC` (Miscellaneous)
  - `O` (Outside of any entity span)
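
Under the hood these classes are encoded in the IOB2 scheme, giving nine token-level labels. A minimal sketch of the mapping, assuming the label ordering used by the CoNLL-2003 dataset on the Hub (the authoritative `id2label` ships in the model's `config.json`):

```python
# Nine token-level labels in IOB2 format: "B-" opens an entity span,
# "I-" continues it, and "O" marks tokens outside any span.
labels = [
    "O",
    "B-PER", "I-PER",
    "B-ORG", "I-ORG",
    "B-LOC", "I-LOC",
    "B-MISC", "I-MISC",
]
id2label = dict(enumerate(labels))
label2id = {label: i for i, label in enumerate(labels)}
```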

## Performance Metrics

The model demonstrates strong performance on the CoNLL-2003 evaluation set:

| Metric        | Value  |
|---------------|--------|
| **Loss**      | 0.0649 |
| **Precision** | 93.59% |
| **Recall**    | 95.07% |
| **F1 Score**  | 94.32% |
| **Accuracy**  | 98.79% |

These metrics indicate that the model identifies and classifies entities accurately and robustly.
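
As a quick sanity check, the reported F1 score is consistent with the precision and recall in the table, since entity-level F1 is their harmonic mean:

```python
# F1 is the harmonic mean of precision and recall.
precision = 0.9359
recall = 0.9507
f1 = 2 * precision * recall / (precision + recall)
print(round(f1 * 100, 2))  # -> 94.32, matching the table
```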

## Training Details

- **Optimizer**: AdamW (Adam with decoupled weight decay)
- **Learning Rate**: 2e-5
- **Batch Size**: 8
- **Number of Epochs**: 3
- **Scheduler**: Linear decay with warm-up steps
- **Loss Function**: Cross-entropy loss with ignore index (`-100`) for special and padding tokens
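
These hyperparameters map directly onto `transformers`' `TrainingArguments`. A hedged sketch of what the fine-tuning configuration may have looked like; the `output_dir`, `warmup_steps`, and `weight_decay` values are illustrative assumptions, not taken from the actual run:

```python
from transformers import TrainingArguments

# Illustrative configuration mirroring the hyperparameters listed above.
# Trainer's default optimizer is AdamW, and its default cross-entropy
# loss ignores label index -100.
training_args = TrainingArguments(
    output_dir="bert-conll-ner",   # hypothetical path
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    num_train_epochs=3,
    lr_scheduler_type="linear",
    warmup_steps=500,              # assumed; not stated above
    weight_decay=0.01,             # assumed; not stated above
)
```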

## Model Input/Output

- **Input Format**: Tokenized text with special tokens `[CLS]` and `[SEP]`.
- **Output Format**: Token-level predictions with labels from the NER tag set (`B-PER`, `I-PER`, etc.).
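
The `-100` index used during training is how word-level labels are aligned to WordPiece sub-tokens: special tokens and continuation pieces are masked out of the loss. A minimal, self-contained sketch; the `word_ids` values below are hypothetical tokenizer output, and the label ids follow the standard CoNLL-2003 ordering:

```python
def align_labels(word_ids, word_labels):
    """Map word-level labels onto sub-tokens; -100 marks positions the
    cross-entropy loss should ignore ([CLS]/[SEP]/padding and
    non-initial WordPiece pieces)."""
    aligned, previous = [], None
    for wid in word_ids:
        if wid is None:            # special token or padding
            aligned.append(-100)
        elif wid != previous:      # first sub-token of a word
            aligned.append(word_labels[wid])
        else:                      # continuation sub-token
            aligned.append(-100)
        previous = wid
    return aligned

# "John lives in New York City": 6 words, last word split into two pieces.
word_ids = [None, 0, 1, 2, 3, 4, 5, 5, None]  # hypothetical tokenizer output
word_labels = [1, 0, 0, 5, 6, 6]              # B-PER, O, O, B-LOC, I-LOC, I-LOC
print(align_labels(word_ids, word_labels))
# -> [-100, 1, 0, 0, 5, 6, 6, -100, -100]
```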

## How to Use the Model

### Installation

```bash
pip install transformers torch
```

### Loading the Model

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("sfarrukh/modernbert-conll-ner")
model = AutoModelForTokenClassification.from_pretrained("sfarrukh/modernbert-conll-ner")
```

### Running Inference

```python
from transformers import pipeline

nlp = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",  # merge sub-tokens into whole entities
)
text = "John lives in New York City."
result = nlp(text)
print(result)
```

Example output (Python repr of the pipeline result):

```python
[{'entity_group': 'PER',
  'score': 0.99912304,
  'word': 'john',
  'start': 0,
  'end': 4},
 {'entity_group': 'LOC',
  'score': 0.9993351,
  'word': 'new york city',
  'start': 14,
  'end': 27}]
```

## Limitations

1. **Domain-Specific Adaptability**: Performance may drop on domain-specific texts (e.g., legal or medical) not covered by the CoNLL-2003 dataset.
2. **Ambiguity**: Ambiguous entities and overlapping spans are not explicitly handled.

## Recommendations

- For domain-specific tasks, consider fine-tuning this model further on a relevant dataset.
- Use a pre-processing pipeline to handle long texts by splitting them into smaller segments, since BERT accepts at most 512 tokens per input.
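
As a starting point for the second recommendation, a naive whitespace-based chunker with overlapping windows (the window sizes are illustrative; a production pipeline would split on tokenizer tokens or sentence boundaries instead):

```python
def chunk_text(text, max_words=128, stride=100):
    """Split text into overlapping word windows so entities near a
    chunk boundary appear whole in at least one window."""
    words = text.split()
    chunks = []
    for start in range(0, max(len(words), 1), stride):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks

# Each chunk can then be passed to the NER pipeline independently.
long_text = " ".join(f"word{i}" for i in range(300))
print(len(chunk_text(long_text)))  # -> 3 overlapping chunks
```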

## Acknowledgements

- **Transformers Library**: Hugging Face
- **Dataset**: CoNLL-2003
- **Base Model**: `bert-base-uncased` by Google