---
library_name: transformers
base_model:
- panagoa/xlm-roberta-base-kbd
language:
- kbd
tags:
- Part-of-Speech
- XLM-RoBERTa
datasets:
- panagoa/kbd-pos-tags
pipeline_tag: token-classification
---
# XLM-RoBERTa for Kabardian Part-of-Speech Tagging
## Model description
This model is a fine-tuned version of [panagoa/xlm-roberta-base-kbd](https://huggingface.co/panagoa/xlm-roberta-base-kbd) on the [panagoa/kbd-pos-tags](https://huggingface.co/datasets/panagoa/kbd-pos-tags) dataset. It is designed to perform Part-of-Speech (POS) tagging for text in the Kabardian language (kbd).
The model identifies the 17 POS tags of the Universal Dependencies tagset (a snippet for inspecting the label mapping follows the table):
| Tag | Description | Examples |
|-----|-------------|----------|
| ADJ | Adjective | хужь (white), къабзэ (clean) |
| ADP | Adposition | щхьэкIэ (for), папщIэ (because of) |
| ADV | Adverb | псынщIэу (quickly), жыжьэу (far) |
| AUX | Auxiliary | хъунщ (will be), щытащ (was) |
| CCONJ | Coordinating conjunction | икIи (and), ауэ (but) |
| DET | Determiner | мо (that), мыпхуэдэ (this kind) |
| INTJ | Interjection | уэлэхьи (by God), зиунагъуэрэ (oh my) |
| NOUN | Noun | унэ (house), щIалэ (boy) |
| NUM | Numeral | зы (one), тIу (two) |
| PART | Particle | мы (this), а (that) |
| PRON | Pronoun | сэ (I), уэ (you) |
| PROPN | Proper noun | Мурат (Murat), Налшык (Nalchik) |
| PUNCT | Punctuation | . (period), , (comma) |
| SCONJ | Subordinating conjunction | щхьэкIэ (because), щыгъуэ (when) |
| SYM | Symbol | % (percent), $ (dollar) |
| VERB | Verb | мэкIуэ (goes), матхэ (writes) |
| X | Other | - |
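The same inventory is stored in the model configuration, so the label mapping can be checked directly. A minimal sketch; the id-to-label order comes from the checkpoint itself:

```python
from transformers import AutoConfig

# Fetch only the configuration (no weights) and list the label inventory
config = AutoConfig.from_pretrained("panagoa/xlm-roberta-base-kbd-pos-tagger")
for idx, label in sorted(config.id2label.items()):
    print(idx, label)
```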
## Intended Use
This model is intended for:
- Linguistic analysis of Kabardian text
- Natural language processing pipelines for Kabardian
- Research on low-resource languages
- Educational purposes for teaching Kabardian grammar
## Training Data
The model was trained on the [panagoa/kbd-pos-tags](https://huggingface.co/datasets/panagoa/kbd-pos-tags) dataset, which contains 82,925 tagged sentences in Kabardian. The dataset shows the following tag distribution (a snippet for reproducing these counts follows the list):
- VERB: 116,377 (30.0%)
- NOUN: 115,232 (29.7%)
- PRON: 63,827 (16.5%)
- ADV: 35,036 (9.0%)
- ADJ: 20,817 (5.4%)
- PROPN: 18,692 (4.8%)
- DET: 6,830 (1.8%)
- CCONJ: 6,098 (1.6%)
- ADP: 4,793 (1.2%)
- PUNCT: 4,752 (1.2%)
- NUM: 4,741 (1.2%)
- INTJ: 2,787 (0.7%)
- PART: 2,241 (0.6%)
- SCONJ: 1,206 (0.3%)
- AUX: 560 (0.1%)
- X: 273 (0.1%)
- SYM: 7 (<0.1%)
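A sketch for reproducing the counts above with the `datasets` library. The column name `tags` is an assumption, not verified against the dataset; check `dataset.column_names` for the actual schema, and if the column stores integer class ids, map them through its `ClassLabel` feature first:

```python
from collections import Counter
from datasets import load_dataset

# Count per-token tag frequencies across the training split;
# the "tags" column name is an assumption about the dataset schema
dataset = load_dataset("panagoa/kbd-pos-tags", split="train")
counts = Counter(tag for sentence in dataset["tags"] for tag in sentence)
total = sum(counts.values())
for tag, n in counts.most_common():
    print(f"{tag}: {n:,} ({100 * n / total:.1f}%)")
```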
## Training Procedure
The model was trained with the following configuration (reconstructed as code after the list):
- Base model: panagoa/xlm-roberta-base-kbd
- Learning rate: 2e-5
- Batch size: 32
- Epochs: 3
- Weight decay: 0.01
- Class weights: Applied to handle class imbalance
- Maximum sequence length: 128
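A minimal `TrainingArguments` reconstruction of that configuration; this is a sketch, not the original training script, and `output_dir` is a placeholder:

```python
from transformers import TrainingArguments

# Mirrors the hyperparameters listed above; output_dir is arbitrary
training_args = TrainingArguments(
    output_dir="xlm-roberta-base-kbd-pos-tagger",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    num_train_epochs=3,
    weight_decay=0.01,
)
```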
Class weights were calculated inversely proportional to the class frequencies to address the imbalance in the dataset, with rare tags given higher importance during training.
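One common way to implement that weighting, sketched under the assumption that the recipe matched this standard one: compute weights inversely proportional to the tag counts and pass them to the token-classification cross-entropy loss (e.g., via a custom `Trainer.compute_loss`):

```python
import torch
import torch.nn as nn

# Tag counts from the distribution listed above
tag_counts = {
    "VERB": 116_377, "NOUN": 115_232, "PRON": 63_827, "ADV": 35_036,
    "ADJ": 20_817, "PROPN": 18_692, "DET": 6_830, "CCONJ": 6_098,
    "ADP": 4_793, "PUNCT": 4_752, "NUM": 4_741, "INTJ": 2_787,
    "PART": 2_241, "SCONJ": 1_206, "AUX": 560, "X": 273, "SYM": 7,
}

# Inverse-frequency weights, normalized to average 1.0; rare tags such
# as SYM get the largest weights (order must match the model's label ids)
counts = torch.tensor([float(c) for c in tag_counts.values()])
weights = counts.sum() / (len(counts) * counts)

# Weighted token-level cross-entropy; -100 marks subword/padding
# positions excluded from the loss (the usual Hugging Face convention)
loss_fn = nn.CrossEntropyLoss(weight=weights, ignore_index=-100)
```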
## Evaluation Results
The model achieved the following performance on a validation set (20% of the data):
- Overall accuracy: ~85%
- Performance varies across different POS tags, with better results on common tags like NOUN and VERB.
## Limitations
- The model may struggle with rare POS tags (like SYM) due to limited examples in the training data
- Performance may vary with dialectal variations or non-standard Kabardian text
- The model has a context window limitation of 128 tokens (a chunking workaround is sketched below)
- Some ambiguous words might be incorrectly tagged based on context
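For inputs beyond the 128-token window, a rough workaround is to tag long text in word chunks and concatenate the results, at the cost of losing cross-chunk context near the boundaries. A sketch using the `predict_pos_tags` helper defined in the usage example below; the chunk size of 50 words is a conservative guess that leaves room for subword expansion:

```python
def predict_pos_tags_long(words, model, tokenizer, chunk_size=50):
    # Tag a long word list chunk by chunk and concatenate the tags;
    # chunk_size stays well under 128 to allow for subword expansion
    tags = []
    for i in range(0, len(words), chunk_size):
        tags.extend(predict_pos_tags(words[i : i + chunk_size], model, tokenizer))
    return tags
```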
## Usage Example
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("panagoa/xlm-roberta-base-kbd-pos-tagger")
model = AutoModelForTokenClassification.from_pretrained("panagoa/xlm-roberta-base-kbd-pos-tagger")


def predict_pos_tags(text, model, tokenizer):
    # Accept either a raw string or a pre-tokenized list of words
    if isinstance(text, str):
        text = text.split()

    # Run on GPU if available
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)
    model.eval()

    # Tokenize the pre-split words; truncation caps the input at the
    # model's maximum sequence length of 128 tokens
    encoded_input = tokenizer(
        text,
        truncation=True,
        is_split_into_words=True,
        return_tensors="pt",
    )

    # Move inputs to the same device as the model
    inputs = {k: v.to(device) for k, v in encoded_input.items()}

    # Predict a label for every subword token
    with torch.no_grad():
        outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=2)

    # Keep one tag per word: take the prediction for each word's first
    # subword and skip special tokens (their word id is None)
    word_ids = encoded_input.word_ids()
    previous_word_idx = None
    predicted_tags = []
    for idx, word_idx in enumerate(word_ids):
        if word_idx is not None and word_idx != previous_word_idx:
            predicted_tags.append(model.config.id2label[predictions[0][idx].item()])
        previous_word_idx = word_idx
    return predicted_tags


# Example usage
text = "Хъыджэбзыр щIэкIри фошыгъу къыхуихьащ"
words = text.split()
tags = predict_pos_tags(words, model, tokenizer)

# Print results
for word, tag in zip(words, tags):
    print(f"{word}: {tag}")
# Expected output:
# Хъыджэбзыр: NOUN
# щIэкIри: VERB
# фошыгъу: NOUN
# къыхуихьащ: VERB
```
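Alternatively, the built-in `token-classification` pipeline handles device placement and subword alignment for you. One caveat: with an `aggregation_strategy`, consecutive words that share a tag may be merged into a single group, so the manual loop above gives stricter word-by-word alignment:

```python
from transformers import pipeline

# The pipeline wraps tokenization, inference, and label mapping;
# "first" takes each word's first-subword prediction
tagger = pipeline(
    "token-classification",
    model="panagoa/xlm-roberta-base-kbd-pos-tagger",
    aggregation_strategy="first",
)
for item in tagger("Хъыджэбзыр щIэкIри фошыгъу къыхуихьащ"):
    print(item["word"], item["entity_group"])
```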
## Author
This model was trained by panagoa and contributed to the Hugging Face community to support NLP research and applications for the Kabardian language.
## Citation
If you use this model in your research, please cite:
```
@misc{panagoa2025kabardianpos,
author = {Panagoa},
title = {XLM-RoBERTa for Kabardian Part-of-Speech Tagging},
year = {2025},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/panagoa/xlm-roberta-base-kbd-pos-tagger}}
}
```