KyrgyzBert

Overview

KyrgyzBert is a small-scale BERT-based language model pre-trained on a large Kyrgyz text corpus. It is designed for masked language modeling (MLM) and serves as a base model for downstream Kyrgyz NLP applications such as text classification. Developed by Metinov Adilet, it aims to advance Kyrgyz NLP research and practical applications.

Model Details

  • Architecture: BERT (small-scale variant)
  • Parameters: ~35.9M (float32, Safetensors)
  • Tokenizer: Custom Kyrgyz tokenizer (metinovadilet/bert-kyrgyz-tokenizer)
  • Hidden Size: 512
  • Number of Layers: 6
  • Attention Heads: 8
  • Intermediate Size: 2048
  • Max Sequence Length: 512
  • Pretraining Task: Masked Language Modeling (MLM)
  • Framework: Hugging Face Transformers
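
For reference, a minimal sketch of a BertConfig with these dimensions (the vocab_size value is a placeholder, not the released value; the actual size is determined by the released tokenizer):

from transformers import BertConfig, BertForMaskedLM

# Hyperparameters as listed above; vocab_size is a placeholder
config = BertConfig(
    vocab_size=30000,            # placeholder; use the actual tokenizer's vocab size
    hidden_size=512,
    num_hidden_layers=6,
    num_attention_heads=8,
    intermediate_size=2048,
    max_position_embeddings=512,
)
model = BertForMaskedLM(config)
print(model.num_parameters())  # on the order of the card's ~35.9M, depending on vocab size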

Training Data

This model was trained on a non-disclosable dataset containing over 1.5 million sentences. The dataset was tokenized with the metinovadilet/bert-kyrgyz-tokenizer.
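
The tokenizer can be loaded on its own to reproduce the tokenization step (a minimal sketch; the example sentence is illustrative):

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("metinovadilet/bert-kyrgyz-tokenizer")

# Tokenize a sample Kyrgyz sentence the same way the corpus was tokenized
sentence = "Кыргызстан тоолуу өлкө."
print(tokenizer.tokenize(sentence))           # subword tokens
print(tokenizer(sentence)["input_ids"])       # corresponding token ids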

Training Setup

  • Hardware: Trained on an RTX 3090 GPU
  • Batch Size: 16
  • Optimizer: AdamW
  • Learning Rate: 1e-4
  • Weight Decay: 0.01
  • Training Epochs: 1000
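
A minimal sketch of an equivalent MLM training run with the Trainer API, under stated assumptions: the two sample sentences stand in for the real, non-disclosable corpus, and the standard 15% masking rate is assumed rather than taken from the card. Trainer uses AdamW by default, matching the setup above.

from transformers import (BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained("metinovadilet/bert-kyrgyz-tokenizer")
model = BertForMaskedLM.from_pretrained("metinovadilet/KyrgyzBert")

# Toy stand-in for the real 1.5M-sentence corpus
texts = ["Кыргызстан тоолуу өлкө.", "Бишкек Кыргызстандын борбору."]
train_dataset = [{"input_ids": ids}
                 for ids in tokenizer(texts, truncation=True, max_length=512)["input_ids"]]

# Randomly masks tokens for the MLM objective (15% is an assumed, standard rate)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="kyrgyzbert-mlm",
    per_device_train_batch_size=16,  # batch size from the card
    learning_rate=1e-4,              # learning rate from the card
    weight_decay=0.01,               # weight decay from the card (AdamW is the default optimizer)
    num_train_epochs=1000,           # epoch count as listed above
)

Trainer(model=model, args=args, data_collator=collator, train_dataset=train_dataset).train()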

Intended Use

  • Text Completion & Prediction: Filling in missing words in Kyrgyz text.
  • Feature Extraction: Kyrgyz token and sentence embeddings for downstream NLP tasks (see the sketch after this list).
  • Fine-Tuning: Can be fine-tuned for Kyrgyz sentiment analysis, named entity recognition (NER), machine translation, etc.
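
A minimal feature-extraction sketch: loading the checkpoint as a plain encoder (BertModel drops the MLM head; Transformers will warn about the unused head weights) and mean-pooling the hidden states into one 512-dimensional sentence vector. The example sentence and the pooling choice are illustrative.

from transformers import BertTokenizerFast, BertModel
import torch

tokenizer = BertTokenizerFast.from_pretrained("metinovadilet/KyrgyzBert")
model = BertModel.from_pretrained("metinovadilet/KyrgyzBert")  # encoder without the MLM head
model.eval()

inputs = tokenizer("Кыргыз тили мамлекеттик тил.", return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # shape: (1, seq_len, 512)

# Mean-pool over real tokens (padding excluded via the attention mask)
mask = inputs.attention_mask.unsqueeze(-1).float()
embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embedding.shape)  # torch.Size([1, 512])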

How to Use

You can load the model using Hugging Face's transformers library:

from transformers import BertTokenizerFast, BertForMaskedLM
import torch

# Load model and tokenizer
model_name = "metinovadilet/KyrgyzBert"
tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = BertForMaskedLM.from_pretrained(model_name)

# Input text with [MASK] token
text = "Бул жерден [MASK] нерселерди таба аласыз."

# Tokenize input
inputs = tokenizer(text, return_tensors="pt")

# Model prediction
with torch.no_grad():
    outputs = model(**inputs).logits

# Find the masked token index (assumes exactly one [MASK] in the input)
masked_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1].item()

# Get the top 5 predictions for the masked token
probs = torch.softmax(outputs[0, masked_index], dim=-1)
top_k = torch.topk(probs, k=5)

# Decode predicted tokens
predicted_tokens = [tokenizer.decode([token_id]) for token_id in top_k.indices.tolist()]

# Print predictions
print(f"Top predictions for [MASK]: {', '.join(predicted_tokens)}")

Limitations

  • The model may struggle with low-resource dialects and code-switching.
  • Performance depends on the quality and diversity of the training data.
  • It is not fine-tuned for specific tasks like sentiment analysis or NER.

Acknowledgments

This model was developed by Metinov Adilet. If you use this model, please consider citing our work.

License

This model is released under the Apache 2.0 License.

Citation

If you use this model in your research, please cite:

@misc{metinovadilet2025kyrgyzbert,
  author = {Metinov Adilet},
  title = {KyrgyzBert: A Small BERT Model for the Kyrgyz Language},
  year = {2025},
  howpublished = {Hugging Face},
  url = {https://huggingface.co/metinovadilet/KyrgyzBert}
}

Contact

For questions, reach out to Metinov Adilet via Hugging Face or by email: [email protected]

This model was developed in collaboration with Ulutsoft.
