---
license: apache-2.0
language:
- ne
base_model: NepBERTa/NepBERTa
tags:
- token-classification
- ner
- nepali
datasets:
- custom
metrics:
- f1
- precision
- recall
---

# Model Card for Finetuned NepBertA-NER

This model is a fine-tuned version of the **NepBERTa** model, trained for Named Entity Recognition (NER) in the Nepali language. It recognizes persons (PER), organizations (ORG), and locations (LOC) in Nepali text. The model was trained on a custom dataset and performs token classification over the following entity tags:

- `O` (Other)
- `B-PER` (Beginning of a person’s name)
- `I-PER` (Inside of a person’s name)
- `B-ORG` (Beginning of an organization)
- `I-ORG` (Inside of an organization)
- `B-LOC` (Beginning of a location)
- `I-LOC` (Inside of a location)
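
The integer ids behind these tags are stored in the checkpoint configuration. A minimal sketch for inspecting them (the printed order is illustrative, not guaranteed):

```python
from transformers import AutoModelForTokenClassification

# Load the checkpoint and read the label mapping it ships with. The numeric
# order of the seven tags may differ from the list above, so read it from the
# config rather than hard-coding it.
model = AutoModelForTokenClassification.from_pretrained("SynapseHQ/Finetuned-NER-NepBertA")
print(model.config.id2label)  # e.g. {0: 'O', 1: 'B-PER', ...} -- order is illustrative
print(model.config.label2id)  # inverse mapping, useful for evaluation
```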

## Model Details

### Model Description

- **Developed by:** Priyanshu Koirala (Synapse Technologies)
- **Model type:** Token Classification (NER)
- **Language(s) (NLP):** Nepali
- **License:** Apache 2.0
- **Finetuned from model:** NepBERTa


## Uses

### Direct Use
The model can be directly used to recognize and classify named entities in Nepali text, such as persons, organizations, and locations. This is useful for text analysis tasks like extracting important information from Nepali documents, news articles, and customer feedback.

### Downstream Use
The model can be further fine-tuned on other similar datasets or integrated into applications for Nepali language processing.
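
As a minimal sketch of such an integration, the Transformers `pipeline` API can wrap the model so that subword pieces are merged back into entity spans; the exact aggregation behaviour depends on your `transformers` version and the tokenizer's word boundaries:

```python
from transformers import pipeline

# Token-classification pipeline; "simple" aggregation merges consecutive
# subword pieces that share an entity tag into a single span.
ner = pipeline(
    "token-classification",
    model="SynapseHQ/Finetuned-NER-NepBertA",
    aggregation_strategy="simple",
)

for entity in ner("सङ्घीय लोकतान्त्रिक गणतन्त्र नेपालको प्रधानमन्त्री शेरबहादुर देउवा हुन्।"):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```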

### Out-of-Scope Use
The model may not perform well on text outside the scope of its training data, such as text containing unseen entity types or text in languages other than Nepali.

## Bias, Risks, and Limitations

As with any NER model, there may be biases in the data that influence how the model identifies and classifies entities. It may struggle with unseen entities, domain-specific jargon, or ambiguous contexts.

### Recommendations
Users should evaluate the model on their specific use case, ensure that the input data resembles the training data, and understand that further fine-tuning may be required for specialized tasks.

## How to Get Started with the Model

Use the following code to start using the model:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Load model and tokenizer
model = AutoModelForTokenClassification.from_pretrained("SynapseHQ/Finetuned-NER-NepBertA")
tokenizer = AutoTokenizer.from_pretrained("SynapseHQ/Finetuned-NER-NepBertA")
model.to(device)

def predict_ner_chunked(text, model, tokenizer, device, max_length=512):
    """Run NER over arbitrarily long text by processing it in word chunks.

    Note: subword tokenization usually produces more tokens than words, so a
    chunk of `max_length` words can exceed `max_length` tokens and be
    truncated; use a smaller chunk size if long inputs must be covered fully.
    """
    model.eval()
    words = text.split()
    ner_results = []

    for i in range(0, len(words), max_length):
        chunk = ' '.join(words[i:i + max_length])
        tokens = tokenizer(chunk, return_tensors="pt", truncation=True,
                           padding=True, max_length=max_length)
        tokens = {k: v.to(device) for k, v in tokens.items()}

        with torch.no_grad():
            outputs = model(**tokens)

        # Highest-scoring label id per token, mapped back to its tag name.
        predictions = torch.argmax(outputs.logits, dim=2)
        predicted_labels = [model.config.id2label[p.item()] for p in predictions[0]]

        # Keep every entity token (PER/ORG/LOC) and skip the special tokens.
        chunk_tokens = tokenizer.convert_ids_to_tokens(tokens["input_ids"][0])
        for token, label in zip(chunk_tokens, predicted_labels):
            if label != "O" and token not in ["[CLS]", "[SEP]", "[PAD]"]:
                ner_results.append((token, label))

    return ner_results

# Test the model
text = "सङ्घीय लोकतान्त्रिक गणतन्त्र नेपालको प्रधानमन्त्री शेरबहादुर देउवा हुन्।"
ner_results = predict_ner_chunked(text, model, tokenizer, device)
print(ner_results)
```
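
The chunked predictor above returns raw subword tokens (the `[CLS]`/`[SEP]` handling suggests a WordPiece-style tokenizer). A small, hedged helper for merging those pieces into readable entity spans might look like this:

```python
def group_entities(token_label_pairs):
    """Merge WordPiece pieces and B-/I- tags from predict_ner_chunked
    into (entity_text, entity_type) spans."""
    spans = []
    for token, label in token_label_pairs:
        tag = label.split("-")[-1]  # PER / ORG / LOC
        if token.startswith("##") and spans and spans[-1][1] == tag:
            # Continuation piece of the previous word: glue without a space.
            spans[-1] = (spans[-1][0] + token[2:], tag)
        elif label.startswith("I-") and spans and spans[-1][1] == tag:
            # Inside tag continuing the previous entity: join with a space.
            spans[-1] = (spans[-1][0] + " " + token, tag)
        else:
            spans.append((token, tag))
    return spans

print(group_entities(ner_results))
```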

## Training Details

### Training Data
The model was trained on a custom-labeled Nepali dataset of sentences annotated with named entities for persons (PER), organizations (ORG), and locations (LOC).

### Training Procedure
- **Optimizer:** AdamW
- **Learning Rate:** 5e-5
- **Batch Size:** 16
- **Epochs:** 5
- **Validation Split:** 20% of the dataset was reserved for validation.
- **Hardware:** Trained on a single GPU.

### Training Hyperparameters
- **Number of labels:** 7 (including O label)
- **Maximum sequence length:** 128 tokens
- **Gradient accumulation:** 1
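
The original training script is not published here; the sketch below reconstructs the listed settings with the Transformers `Trainer`, under stated assumptions. `tokenized_dataset` is a placeholder for a token-classification dataset already tokenized to 128 tokens, label-aligned to the 7-tag scheme, and split 80/20.

```python
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

base = "NepBERTa/NepBERTa"
tokenizer = AutoTokenizer.from_pretrained(base)
# Pass from_tf=True if the base checkpoint only ships TensorFlow weights.
model = AutoModelForTokenClassification.from_pretrained(base, num_labels=7)

# Settings mirror the ones listed above; Trainer uses AdamW by default.
args = TrainingArguments(
    output_dir="nepberta-ner",
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    num_train_epochs=5,
    gradient_accumulation_steps=1,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_dataset["train"],        # placeholder dataset
    eval_dataset=tokenized_dataset["validation"],    # 20% validation split
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```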

## Evaluation

### Metrics

The model was evaluated using the seqeval library, with the following results on the validation set:

- **F1 Score:** 0.89
- **Precision:** 0.86
- **Recall:** 0.90
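
For reference, scores of this kind can be reproduced with seqeval given gold and predicted tag sequences; the lists below are illustrative, not the actual validation data:

```python
from seqeval.metrics import f1_score, precision_score, recall_score

# Illustrative gold and predicted tag sequences (one inner list per sentence);
# seqeval scores whole entity spans rather than individual tokens.
y_true = [["B-PER", "I-PER", "O", "B-LOC"]]
y_pred = [["B-PER", "I-PER", "O", "B-LOC"]]

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("f1:       ", f1_score(y_true, y_pred))
```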

## Citation for the Base Model

If you use this model or the base model in your work, please consider citing **NepBERTa** as follows:

```bibtex
@inproceedings{timilsina2022nepberta,
  title={NepBERTa: Nepali language model trained in a large corpus},
  author={Timilsina, Sulav and Gautam, Milan and Bhattarai, Binod},
  booktitle={Proceedings of the 2nd conference of the Asia-pacific chapter of the association for computational linguistics and the 12th international joint conference on natural language processing},
  year={2022},
  organization={Association for Computational Linguistics (ACL)}
}
```

## Citation

If you use this model in your research, please consider citing it:

```bibtex
@misc{nepali_ner,
  author = {Synapse Technologies},
  title = {Finetuned NepBertA-NER for Nepali},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/SynapseHQ/Finetuned-NER-NepBertA}},
}

```