---
library_name: transformers
base_model:
- panagoa/xlm-roberta-base-kbd
language:
- kbd
tags:
- Part-of-Speech
- XLM-RoBERTa
datasets:
- panagoa/kbd-pos-tags
pipeline_tag: token-classification
---


# XLM-RoBERTa for Kabardian Part-of-Speech Tagging

## Model description

This model is a fine-tuned version of [panagoa/xlm-roberta-base-kbd](https://huggingface.co/panagoa/xlm-roberta-base-kbd) on the [panagoa/kbd-pos-tags](https://huggingface.co/datasets/panagoa/kbd-pos-tags) dataset. It is designed to perform Part-of-Speech (POS) tagging for text in the Kabardian language (kbd).

The model distinguishes the 17 Universal Dependencies (UPOS) tags:

| Tag | Description | Examples |
|-----|-------------|----------|
| ADJ | Adjective | хужь (white), къабзэ (clean) |
| ADP | Adposition | щхьэкIэ (for), папщIэ (because of) |
| ADV | Adverb | псынщIэу (quickly), жыжьэу (far) |
| AUX | Auxiliary | хъунщ (will be), щытащ (was) |
| CCONJ | Coordinating conjunction | икIи (and), ауэ (but) |
| DET | Determiner | мо (that), мыпхуэдэ (this kind) |
| INTJ | Interjection | уэлэхьи (by God), зиунагъуэрэ (oh my) |
| NOUN | Noun | унэ (house), щIалэ (boy) |
| NUM | Numeral | зы (one), тIу (two) |
| PART | Particle | мы (this), а (that) |
| PRON | Pronoun | сэ (I), уэ (you) |
| PROPN | Proper noun | Мурат (Murat), Налшык (Nalchik) |
| PUNCT | Punctuation | . (period), , (comma) |
| SCONJ | Subordinating conjunction | щхьэкIэ (because), щыгъуэ (when) |
| SYM | Symbol | % (percent), $ (dollar) |
| VERB | Verb | мэкIуэ (goes), матхэ (writes) |
| X | Other | - |
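
The full id-to-label mapping ships with the model config and can be inspected directly once the checkpoint is loaded (a quick check; the exact id ordering is whatever the config defines):

```python
from transformers import AutoModelForTokenClassification

# Load the released checkpoint and print its label mapping; the values
# should cover the 17 tags listed in the table above.
model = AutoModelForTokenClassification.from_pretrained(
    "panagoa/xlm-roberta-base-kbd-pos-tagger"
)
print(model.config.id2label)
```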


## Intended Use

This model is intended for:
- Linguistic analysis of Kabardian text
- Natural language processing pipelines for Kabardian
- Research on low-resource languages
- Educational purposes for teaching Kabardian grammar

## Training Data

The model was trained on the [panagoa/kbd-pos-tags](https://huggingface.co/datasets/panagoa/kbd-pos-tags) dataset, which contains 82,925 tagged sentences in Kabardian with the following token-level tag distribution:

- VERB: 116,377 (30.0%)
- NOUN: 115,232 (29.7%)
- PRON: 63,827 (16.5%)
- ADV: 35,036 (9.0%)
- ADJ: 20,817 (5.4%)
- PROPN: 18,692 (4.8%)
- DET: 6,830 (1.8%)
- CCONJ: 6,098 (1.6%)
- ADP: 4,793 (1.2%)
- PUNCT: 4,752 (1.2%)
- NUM: 4,741 (1.2%)
- INTJ: 2,787 (0.7%)
- PART: 2,241 (0.6%)
- SCONJ: 1,206 (0.3%)
- AUX: 560 (0.1%)
- X: 273 (0.1%)
- SYM: 7 (<0.1%)
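
If you want to verify these counts yourself, a short script along the following lines should work; the split and column names (`train`, `tags`) are assumptions, so check the dataset card for the actual schema:

```python
from collections import Counter
from datasets import load_dataset

# Recompute the token-level tag distribution from the dataset.
# NOTE: the split name "train" and column name "tags" are assumptions.
ds = load_dataset("panagoa/kbd-pos-tags", split="train")
counts = Counter(tag for example in ds for tag in example["tags"])
total = sum(counts.values())
for tag, n in counts.most_common():
    print(f"{tag}: {n} ({100 * n / total:.1f}%)")
```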

## Training Procedure

The model was trained with the following configuration:
- Base model: panagoa/xlm-roberta-base-kbd
- Learning rate: 2e-5
- Batch size: 32
- Epochs: 3
- Weight decay: 0.01
- Class weights: Applied to handle class imbalance
- Maximum sequence length: 128

Class weights were set inversely proportional to class frequencies to counter the dataset imbalance, giving rare tags higher importance during training.
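
The exact training script is not published, but the configuration above maps onto the `transformers` Trainer API roughly as follows; the alphabetical label ordering and the precise weighting formula are assumptions, not confirmed details:

```python
import torch
from transformers import Trainer, TrainingArguments

# Tag counts from the distribution above, in alphabetical label order
# (ASSUMPTION: the released model may use a different label ordering).
tag_counts = {
    "ADJ": 20817, "ADP": 4793, "ADV": 35036, "AUX": 560,
    "CCONJ": 6098, "DET": 6830, "INTJ": 2787, "NOUN": 115232,
    "NUM": 4741, "PART": 2241, "PRON": 63827, "PROPN": 18692,
    "PUNCT": 4752, "SCONJ": 1206, "SYM": 7, "VERB": 116377, "X": 273,
}
total = sum(tag_counts.values())
# Inverse-frequency weights: rare tags (e.g. SYM) get larger weights.
class_weights = torch.tensor(
    [total / (len(tag_counts) * c) for c in tag_counts.values()]
)

class WeightedTrainer(Trainer):
    """Trainer whose loss applies the class weights computed above."""

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits
        loss_fct = torch.nn.CrossEntropyLoss(
            weight=class_weights.to(logits.device), ignore_index=-100
        )
        loss = loss_fct(logits.view(-1, logits.size(-1)), labels.view(-1))
        return (loss, outputs) if return_outputs else loss

args = TrainingArguments(
    output_dir="xlm-roberta-base-kbd-pos-tagger",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    num_train_epochs=3,
    weight_decay=0.01,
)
```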

## Evaluation Results

The model achieved the following performance on a validation set (20% of the data):
- Overall accuracy: ~85%
- Performance varies across different POS tags, with better results on common tags like NOUN and VERB.
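
Token-level accuracy of this kind can be computed in a few lines, given parallel gold and predicted tag sequences (for instance, produced by the `predict_pos_tags` helper in the usage example below):

```python
def token_accuracy(gold_seqs, pred_seqs):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = total = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        for g, p in zip(gold, pred):
            correct += int(g == p)
            total += 1
    return correct / total
```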

## Limitations

- The model may struggle with rare POS tags (like SYM) due to limited examples in the training data
- Performance may vary with dialectal variations or non-standard Kabardian text
- The model was trained with a maximum sequence length of 128 tokens, so longer inputs may be truncated or tagged less reliably
- Ambiguous words may be tagged incorrectly when the surrounding context does not disambiguate them
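
If you need to tag texts longer than the training length, one simple workaround is to tag in word-level chunks using the `predict_pos_tags` helper from the usage example below; the chunk size here is a conservative guess, since sub-word tokenization can expand each word into several tokens:

```python
def tag_long_text(words, model, tokenizer, chunk_size=50):
    # Tag a long word list chunk by chunk to stay under the
    # 128-token sequence length the model was trained with.
    tags = []
    for i in range(0, len(words), chunk_size):
        tags.extend(predict_pos_tags(words[i:i + chunk_size], model, tokenizer))
    return tags
```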

## Usage Example

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("panagoa/xlm-roberta-base-kbd-pos-tagger")
model = AutoModelForTokenClassification.from_pretrained("panagoa/xlm-roberta-base-kbd-pos-tagger")

# Define function for prediction
def predict_pos_tags(text, model, tokenizer):
    # Split text into words if it's a string
    if isinstance(text, str):
        text = text.split()
        
    # Determine device
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    model = model.to(device)
    
    # Tokenize input text
    encoded_input = tokenizer(
        text,
        truncation=True,
        is_split_into_words=True,
        return_tensors="pt"
    )
    
    # Move inputs to the same device
    inputs = {k: v.to(device) for k, v in encoded_input.items()}
    
    # Get predictions
    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.argmax(outputs.logits, dim=2)
    
    # Map to POS tags
    word_ids = encoded_input.word_ids()
    previous_word_idx = None
    predicted_tags = []
    
    for idx, word_idx in enumerate(word_ids):
        # Keep the prediction for the first sub-token of each word;
        # special tokens have word_idx None and are skipped
        if word_idx is not None and word_idx != previous_word_idx:
            predicted_tags.append(model.config.id2label[predictions[0][idx].item()])
        previous_word_idx = word_idx
    
    return predicted_tags

# Example usage
text = "Хъыджэбзыр щIэкIри фошыгъу къыхуихьащ"
words = text.split()
tags = predict_pos_tags(words, model, tokenizer)

# Print results
for word, tag in zip(words, tags):
    print(f"{word}: {tag}")

```

Expected output:

```
Хъыджэбзыр: NOUN
щIэкIри: VERB
фошыгъу: NOUN
къыхуихьащ: VERB
```
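
For quick experiments, the generic `token-classification` pipeline also works, though it reports sub-word tokens rather than whole words, so the word-level helper above gives cleaner output for plain POS labels:

```python
from transformers import pipeline

tagger = pipeline(
    "token-classification",
    model="panagoa/xlm-roberta-base-kbd-pos-tagger",
)
print(tagger("Хъыджэбзыр щIэкIри фошыгъу къыхуихьащ"))
```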

## Author

This model was trained by panagoa and contributed to the Hugging Face community to support NLP research and applications for the Kabardian language.

## Citation

If you use this model in your research, please cite:

```
@misc{panagoa2025kabardianpos,
  author = {Panagoa},
  title = {XLM-RoBERTa for Kabardian Part-of-Speech Tagging},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/panagoa/xlm-roberta-base-kbd-pos-tagger}}
}
```