KyrgyzBert
Overview
KyrgyzBert is a small-scale BERT-based language model pre-trained on a large Kyrgyz text corpus. It is designed for masked language modeling (MLM), text classification, and other Kyrgyz NLP applications. Developed by Metinov Adilet, it aims to advance Kyrgyz NLP research and practice.
Model Details
- Architecture: BERT (small-scale variant)
- Vocabulary: custom Kyrgyz tokenizer (metinovadilet/bert-kyrgyz-tokenizer)
- Hidden Size: 512
- Number of Layers: 6
- Attention Heads: 8
- Intermediate Size: 2048
- Max Sequence Length: 512
- Pretraining Task: Masked Language Modeling (MLM)
- Framework: Hugging Face Transformers
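The architecture values above can be checked against the published checkpoint. A minimal sketch, assuming the config is stored in the standard Hugging Face format:

```python
from transformers import AutoConfig

# Fetch the model configuration from the Hub and inspect the
# hyperparameters listed above.
config = AutoConfig.from_pretrained("metinovadilet/KyrgyzBert")
print(config.hidden_size)               # expected: 512
print(config.num_hidden_layers)         # expected: 6
print(config.num_attention_heads)       # expected: 8
print(config.intermediate_size)         # expected: 2048
print(config.max_position_embeddings)   # expected: 512
print(config.vocab_size)                # defined by the custom Kyrgyz tokenizer
```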
Training Data
This model was trained on a non-disclosable dataset containing over 1.5 million Kyrgyz sentences. The dataset was tokenized with the metinovadilet/bert-kyrgyz-tokenizer.
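The tokenizer can also be loaded on its own. A small sketch (the example sentence is arbitrary, and loading via AutoTokenizer assumes the repo ships a standard tokenizer config):

```python
from transformers import AutoTokenizer

# Load the custom Kyrgyz tokenizer used for pretraining
tokenizer = AutoTokenizer.from_pretrained("metinovadilet/bert-kyrgyz-tokenizer")

# "Kyrgyzstan is a mountainous country."
print(tokenizer.tokenize("Кыргызстан - тоолуу өлкө."))
print(tokenizer.vocab_size)
```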
Training Setup
- Hardware: Trained on an RTX 3090 GPU
- Batch Size: 16
- Optimizer: AdamW
- Learning Rate: 1e-4
- Weight Decay: 0.01
- Training Epochs: 1000
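The actual training script is not published. The following is an illustrative sketch of an equivalent MLM pretraining setup with the hyperparameters above, using the Hugging Face Trainer; the 15% masking rate and the placeholder corpus are assumptions, since the real dataset is not released:

```python
from datasets import Dataset
from transformers import (AutoTokenizer, BertConfig, BertForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("metinovadilet/bert-kyrgyz-tokenizer")

# Architecture from the Model Details section
config = BertConfig(
    vocab_size=tokenizer.vocab_size,
    hidden_size=512,
    num_hidden_layers=6,
    num_attention_heads=8,
    intermediate_size=2048,
    max_position_embeddings=512,
)
model = BertForMaskedLM(config)

# Placeholder sentences; the real 1.5M-sentence corpus is not released.
texts = [
    "Бишкек - Кыргызстандын борбору.",  # "Bishkek is the capital of Kyrgyzstan."
    "Кыргыз тили - мамлекеттик тил.",   # "Kyrgyz is the state language."
]
dataset = Dataset.from_dict({"text": texts}).map(
    lambda x: tokenizer(x["text"], truncation=True, max_length=512),
    remove_columns=["text"],
)

# Standard BERT masking; the 15% rate is an assumption, not documented
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="kyrgyzbert-mlm",
    per_device_train_batch_size=16,
    learning_rate=1e-4,
    weight_decay=0.01,      # Trainer's default optimizer is AdamW
    num_train_epochs=1000,  # as listed above
)

trainer = Trainer(model=model, args=args, data_collator=collator,
                  train_dataset=dataset)
trainer.train()
```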
Intended Use
- Text Completion & Prediction: Filling in missing words in Kyrgyz text.
- Feature Extraction: Kyrgyz word and sentence embeddings for downstream NLP tasks (see the sketch after this list).
- Fine-Tuning: Can be fine-tuned for Kyrgyz sentiment analysis, named entity recognition (NER), machine translation, etc.
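For the feature-extraction use case, here is a minimal sketch that mean-pools the encoder's last hidden states into a sentence embedding; mean pooling is a common convention, not something prescribed by the model:

```python
import torch
from transformers import AutoTokenizer, BertModel

tokenizer = AutoTokenizer.from_pretrained("metinovadilet/KyrgyzBert")
# Load the encoder only; the MLM head weights are skipped (with a warning)
model = BertModel.from_pretrained("metinovadilet/KyrgyzBert")

# "The Kyrgyz language is a rich language."
inputs = tokenizer("Кыргыз тили - бай тил.", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # shape: (1, seq_len, 512)

# Mean-pool over non-padding tokens to get one vector per sentence
mask = inputs.attention_mask.unsqueeze(-1).float()
embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embedding.shape)  # torch.Size([1, 512])
```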
How to Use
You can load the model with Hugging Face's transformers library:
```python
from transformers import BertTokenizerFast, BertForMaskedLM
import torch

# Load model and tokenizer
model_name = "metinovadilet/KyrgyzBert"
tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = BertForMaskedLM.from_pretrained(model_name)

# Input text with a [MASK] token
# "You can find [MASK] things here."
text = "Бул жерден [MASK] нерселерди таба аласыз."

# Tokenize input
inputs = tokenizer(text, return_tensors="pt")

# Model prediction
with torch.no_grad():
    outputs = model(**inputs).logits

# Find the masked token index
masked_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1].item()

# Get the top 5 predictions for the masked token
probs = torch.softmax(outputs[0, masked_index], dim=-1)
top_k = torch.topk(probs, k=5)

# Decode the predicted tokens
predicted_tokens = [tokenizer.decode([token_id]) for token_id in top_k.indices.tolist()]

print(f"Top predictions for [MASK]: {', '.join(predicted_tokens)}")
```
Limitations
- The model may struggle with low-resource dialects and code-switching.
- Performance depends on the quality and diversity of training data.
- It is not fine-tuned for specific tasks like sentiment analysis or NER.
Acknowledgments
This model was developed by Metinov Adilet. If you use this model, please consider citing our work.
License
This model is released under the Apache 2.0 License.
Citation
If you use this model in your research, please cite:
```bibtex
@misc{metinovadilet2025kyrgyzbert,
  author = {Metinov Adilet},
  title = {KyrgyzBert: A Small BERT Model for the Kyrgyz Language},
  year = {2025},
  howpublished = {Hugging Face},
  url = {https://huggingface.co/metinovadilet/KyrgyzBert}
}
```
Contact
For questions, reach out to Metinov Adilet via Hugging Face or email: [email protected]. This model was developed in collaboration with Ulutsoft.