---
library_name: transformers
base_model:
- panagoa/xlm-roberta-base-kbd
language:
- kbd
tags:
- Part-of-Speech
- XLM-RoBERTa
datasets:
- panagoa/kbd-pos-tags
pipeline_tag: token-classification
---

# XLM-RoBERTa for Kabardian Part-of-Speech Tagging

## Model description

This model is a fine-tuned version of [panagoa/xlm-roberta-base-kbd](https://huggingface.co/panagoa/xlm-roberta-base-kbd) on the [panagoa/kbd-pos-tags](https://huggingface.co/datasets/panagoa/kbd-pos-tags) dataset. It is designed to perform Part-of-Speech (POS) tagging for text in the Kabardian language (kbd).

The model identifies 17 different POS tags:

| Tag | Description | Examples |
|-----|-------------|----------|
| ADJ | Adjective | хужь (white), къабзэ (clean) |
| ADP | Adposition | щхьэкIэ (for), папщIэ (because of) |
| ADV | Adverb | псынщIэу (quickly), жыжьэу (far) |
| AUX | Auxiliary | хъунщ (will be), щытащ (was) |
| CCONJ | Coordinating conjunction | икIи (and), ауэ (but) |
| DET | Determiner | мо (that), мыпхуэдэ (this kind) |
| INTJ | Interjection | уэлэхьи (by God), зиунагъуэрэ (oh my) |
| NOUN | Noun | унэ (house), щIалэ (boy) |
| NUM | Numeral | зы (one), тIу (two) |
| PART | Particle | мы (this), а (that) |
| PRON | Pronoun | сэ (I), уэ (you) |
| PROPN | Proper noun | Мурат (Murat), Налшык (Nalchik) |
| PUNCT | Punctuation | . (period), , (comma) |
| SCONJ | Subordinating conjunction | щхьэкIэ (because), щыгъуэ (when) |
| SYM | Symbol | % (percent), $ (dollar) |
| VERB | Verb | мэкIуэ (goes), матхэ (writes) |
| X | Other | - |

## Intended Use

This model is intended for:

- Linguistic analysis of Kabardian text
- Natural language processing pipelines for Kabardian
- Research on low-resource languages
- Educational purposes for teaching Kabardian grammar

## Training Data

The model was trained on the [panagoa/kbd-pos-tags](https://huggingface.co/datasets/panagoa/kbd-pos-tags) dataset, which contains 82,925 tagged sentences in Kabardian. The dataset shows the following tag distribution:

- VERB: 116,377 (30.0%)
- NOUN: 115,232 (29.7%)
- PRON: 63,827 (16.5%)
- ADV: 35,036 (9.0%)
- ADJ: 20,817 (5.4%)
- PROPN: 18,692 (4.8%)
- DET: 6,830 (1.8%)
- CCONJ: 6,098 (1.6%)
- ADP: 4,793 (1.2%)
- PUNCT: 4,752 (1.2%)
- NUM: 4,741 (1.2%)
- INTJ: 2,787 (0.7%)
- PART: 2,241 (0.6%)
- SCONJ: 1,206 (0.3%)
- AUX: 560 (0.1%)
- X: 273 (0.1%)
- SYM: 7 (<0.1%)

## Training Procedure

The model was trained with the following configuration:

- Base model: panagoa/xlm-roberta-base-kbd
- Learning rate: 2e-5
- Batch size: 32
- Epochs: 3
- Weight decay: 0.01
- Class weights: applied to handle class imbalance
- Maximum sequence length: 128

Class weights were calculated inversely proportional to the class frequencies to address the imbalance in the dataset, so that rare tags were given higher importance during training (an illustrative computation is sketched below, after the Evaluation Results section).

## Evaluation Results

On a validation set (20% of the data), the model achieved:

- Overall accuracy: ~85%
- Performance varies across POS tags, with better results on frequent tags such as NOUN and VERB
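## Class Weight Computation (Illustrative)

The exact weighting code is not published with this card; the snippet below is a minimal sketch of the inverse-frequency scheme described under Training Procedure, using the tag counts from the Training Data section. The normalization (dividing by the number of classes) and the use of `torch.nn.CrossEntropyLoss` are assumptions made for illustration, not the released training script.

```python
import torch

# Tag counts from the Training Data section above
tag_counts = {
    "VERB": 116_377, "NOUN": 115_232, "PRON": 63_827, "ADV": 35_036,
    "ADJ": 20_817, "PROPN": 18_692, "DET": 6_830, "CCONJ": 6_098,
    "ADP": 4_793, "PUNCT": 4_752, "NUM": 4_741, "INTJ": 2_787,
    "PART": 2_241, "SCONJ": 1_206, "AUX": 560, "X": 273, "SYM": 7,
}

total = sum(tag_counts.values())
num_classes = len(tag_counts)

# Inverse-frequency weights: rarer tags get proportionally larger weights.
# In practice the weight order must match model.config.label2id.
weights = torch.tensor(
    [total / (num_classes * count) for count in tag_counts.values()],
    dtype=torch.float,
)

# The weights can then be passed to the token-classification loss,
# with -100 reserved for padding/subword positions as usual:
loss_fn = torch.nn.CrossEntropyLoss(weight=weights, ignore_index=-100)
```

With this formula a tag at exactly average frequency gets weight 1.0, while SYM (7 occurrences) gets a weight several thousand times larger than VERB's.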
## Limitations

- The model may struggle with rare POS tags (such as SYM) because of their limited representation in the training data
- Performance may vary on dialectal or otherwise non-standard Kabardian text
- The model has a context window of 128 tokens (see the sketch under "Tagging Longer Texts" at the end of this card)
- Ambiguous words may be tagged incorrectly when the surrounding context does not disambiguate them

## Usage Example

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("panagoa/xlm-roberta-base-kbd-pos-tagger")
model = AutoModelForTokenClassification.from_pretrained("panagoa/xlm-roberta-base-kbd-pos-tagger")


def predict_pos_tags(text, model, tokenizer):
    # Split text into words if it is a string
    if isinstance(text, str):
        text = text.split()

    # Move the model to GPU if available
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)

    # Tokenize the pre-split words; max_length matches the training setup
    encoded_input = tokenizer(
        text,
        truncation=True,
        max_length=128,
        is_split_into_words=True,
        return_tensors="pt",
    )

    # Move inputs to the same device as the model
    inputs = {k: v.to(device) for k, v in encoded_input.items()}

    # Get predictions
    with torch.no_grad():
        outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=2)

    # Keep one tag per word: take the prediction for each word's first
    # subword and skip special tokens (their word_idx is None)
    word_ids = encoded_input.word_ids()
    previous_word_idx = None
    predicted_tags = []
    for idx, word_idx in enumerate(word_ids):
        if word_idx is not None and word_idx != previous_word_idx:
            predicted_tags.append(model.config.id2label[predictions[0][idx].item()])
        previous_word_idx = word_idx

    return predicted_tags[:len(text)]


# Example usage
text = "Хъыджэбзыр щIэкIри фошыгъу къыхуихьащ"
words = text.split()
tags = predict_pos_tags(words, model, tokenizer)

# Print results
for word, tag in zip(words, tags):
    print(f"{word}: {tag}")
```

Expected output:

```
Хъыджэбзыр: NOUN
щIэкIри: VERB
фошыгъу: NOUN
къыхуихьащ: VERB
```

## Author

This model was trained by panagoa and contributed to the Hugging Face community to support NLP research and applications for the Kabardian language.

## Citation

If you use this model in your research, please cite:

```
@misc{panagoa2025kabardianpos,
  author = {Panagoa},
  title = {XLM-RoBERTa for Kabardian Part-of-Speech Tagging},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/panagoa/xlm-roberta-base-kbd-pos-tagger}}
}
```
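## Tagging Longer Texts (Illustrative)

As noted under Limitations, the model sees at most 128 tokens at a time. The helper below is a minimal sketch, not part of the released model: it reuses the `predict_pos_tags` function from the usage example and tags a long text chunk by chunk. The chunk size of 50 words is an assumption chosen to stay under the token limit after subword tokenization; cross-chunk context is lost at the boundaries.

```python
def tag_long_text(text, model, tokenizer, chunk_size=50):
    # Split the text into word chunks small enough that each chunk fits
    # the 128-token context window. chunk_size=50 is an assumption: if a
    # chunk still exceeds the limit after subword tokenization,
    # predict_pos_tags truncates it and returns fewer tags than words.
    words = text.split()
    tagged = []
    for start in range(0, len(words), chunk_size):
        chunk = words[start:start + chunk_size]
        tags = predict_pos_tags(chunk, model, tokenizer)
        tagged.extend(zip(chunk, tags))
    return tagged
```

For better chunk boundaries one could split on sentence-final punctuation instead of fixed-size word windows, since POS decisions rarely depend on context beyond the sentence.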