KyrgyzBert

Overview

KyrgyzBert is a small-scale BERT-based language model pre-trained on a large Kyrgyz text corpus. It is designed for masked language modeling (MLM) and serves as a base model for downstream Kyrgyz NLP applications such as text classification. Developed by Metinov Adilet, it aims to advance Kyrgyz NLP research and practical applications.

Model Details

  • Architecture: BERT (small-scale variant)
  • Parameters: ~35.9M (float32, Safetensors)
  • Tokenizer: Custom Kyrgyz tokenizer (metinovadilet/bert-kyrgyz-tokenizer)
  • Hidden Size: 512
  • Number of Layers: 6
  • Attention Heads: 8
  • Intermediate Size: 2048
  • Max Sequence Length: 512
  • Pretraining Task: Masked Language Modeling (MLM)
  • Framework: Hugging Face Transformers
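
For reference, a minimal sketch of a BertConfig with these dimensions (the vocab_size value is a placeholder, not the released value; the actual size is determined by the released tokenizer):

from transformers import BertConfig, BertForMaskedLM

# Hyperparameters as listed above; vocab_size is a placeholder
config = BertConfig(
    vocab_size=30000,            # placeholder; use the actual tokenizer's vocab size
    hidden_size=512,
    num_hidden_layers=6,
    num_attention_heads=8,
    intermediate_size=2048,
    max_position_embeddings=512,
)
model = BertForMaskedLM(config)
print(model.num_parameters())  # on the order of the card's ~35.9M, depending on vocab size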

Training Data

This model was trained on a non-disclosable dataset containing over 1.5 million sentences. The dataset was tokenized with the metinovadilet/bert-kyrgyz-tokenizer.
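
The tokenizer can be loaded on its own to reproduce the tokenization step (a minimal sketch; the example sentence is illustrative):

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("metinovadilet/bert-kyrgyz-tokenizer")

# Tokenize a sample Kyrgyz sentence the same way the corpus was tokenized
sentence = "Кыргызстан тоолуу өлкө."
print(tokenizer.tokenize(sentence))           # subword tokens
print(tokenizer(sentence)["input_ids"])       # corresponding token ids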

Training Setup

  • Hardware: Trained on an RTX 3090 GPU
  • Batch Size: 16
  • Optimizer: AdamW
  • Learning Rate: 1e-4
  • Weight Decay: 0.01
  • Training Epochs: 1000
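
A minimal sketch of an equivalent MLM training run with the Trainer API, under stated assumptions: the two sample sentences stand in for the real, non-disclosable corpus, and the standard 15% masking rate is assumed rather than taken from the card. Trainer uses AdamW by default, matching the setup above.

from transformers import (BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained("metinovadilet/bert-kyrgyz-tokenizer")
model = BertForMaskedLM.from_pretrained("metinovadilet/KyrgyzBert")

# Toy stand-in for the real 1.5M-sentence corpus
texts = ["Кыргызстан тоолуу өлкө.", "Бишкек Кыргызстандын борбору."]
train_dataset = [{"input_ids": ids}
                 for ids in tokenizer(texts, truncation=True, max_length=512)["input_ids"]]

# Randomly masks tokens for the MLM objective (15% is an assumed, standard rate)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="kyrgyzbert-mlm",
    per_device_train_batch_size=16,  # batch size from the card
    learning_rate=1e-4,              # learning rate from the card
    weight_decay=0.01,               # weight decay from the card (AdamW is the default optimizer)
    num_train_epochs=1000,           # epoch count as listed above
)

Trainer(model=model, args=args, data_collator=collator, train_dataset=train_dataset).train()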

Intended Use

  • Text Completion & Prediction: Filling in missing words in Kyrgyz text.
  • Feature Extraction: Kyrgyz token and sentence embeddings for downstream NLP tasks (see the sketch after this list).
  • Fine-Tuning: Can be fine-tuned for Kyrgyz sentiment analysis, named entity recognition (NER), machine translation, etc.
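
A minimal feature-extraction sketch: loading the checkpoint as a plain encoder (BertModel drops the MLM head; Transformers will warn about the unused head weights) and mean-pooling the hidden states into one 512-dimensional sentence vector. The example sentence and the pooling choice are illustrative.

from transformers import BertTokenizerFast, BertModel
import torch

tokenizer = BertTokenizerFast.from_pretrained("metinovadilet/KyrgyzBert")
model = BertModel.from_pretrained("metinovadilet/KyrgyzBert")  # encoder without the MLM head
model.eval()

inputs = tokenizer("Кыргыз тили мамлекеттик тил.", return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # shape: (1, seq_len, 512)

# Mean-pool over real tokens (padding excluded via the attention mask)
mask = inputs.attention_mask.unsqueeze(-1).float()
embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embedding.shape)  # torch.Size([1, 512])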

How to Use

You can load the model using Hugging Face's transformers library:

from transformers import BertTokenizerFast, BertForMaskedLM
import torch

# Load model and tokenizer
model_name = "metinovadilet/KyrgyzBert"
tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = BertForMaskedLM.from_pretrained(model_name)

# Input text with [MASK] token
text = "Бул жерден [MASK] нерселерди таба аласыз."

# Tokenize input
inputs = tokenizer(text, return_tensors="pt")

# Model prediction
with torch.no_grad():
    outputs = model(**inputs).logits

# Find the masked token index (assumes exactly one [MASK] in the input)
masked_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1].item()

# Get the top 5 predictions for the masked token
probs = torch.softmax(outputs[0, masked_index], dim=-1)
top_k = torch.topk(probs, k=5)

# Decode predicted tokens
predicted_tokens = [tokenizer.decode([token_id]) for token_id in top_k.indices.tolist()]

# Print predictions
print(f"Top predictions for [MASK]: {', '.join(predicted_tokens)}")

Limitations

  • The model may struggle with low-resource dialects and code-switching.
  • Performance depends on the quality and diversity of the training data.
  • It is not fine-tuned for specific tasks like sentiment analysis or NER.

Acknowledgments

This model was developed by Metinov Adilet. If you use this model, please consider citing our work.

License

This model is released under the Apache 2.0 License.

Citation

If you use this model in your research, please cite:

@misc{metinovadilet2025kyrgyzbert,
  author = {Metinov Adilet},
  title = {KyrgyzBert: A Small BERT Model for the Kyrgyz Language},
  year = {2025},
  howpublished = {Hugging Face},
  url = {https://huggingface.co/metinovadilet/KyrgyzBert}
}

Contact

For questions, reach out to Metinov Adilet via Hugging Face or by email: [email protected]

This model was developed in collaboration with Ulutsoft.
