---
license: apache-2.0
language:
- ky
pipeline_tag: fill-mask
library_name: transformers
---
# KyrgyzBert
## Overview
KyrgyzBert is a **small-scale BERT-based language model** pre-trained on a large **Kyrgyz text corpus**. It is designed for **masked language modeling (MLM), text classification, and other downstream Kyrgyz NLP tasks**. Developed by **Metinov Adilet**, this model aims to advance Kyrgyz NLP research and practical applications.
## Model Details
- **Architecture:** BERT (small-scale variant)
- **Tokenizer:** Custom Kyrgyz tokenizer (`metinovadilet/bert-kyrgyz-tokenizer`)
- **Hidden Size:** 512
- **Number of Layers:** 6
- **Attention Heads:** 8
- **Intermediate Size:** 2048
- **Max Sequence Length:** 512
- **Pretraining Task:** Masked Language Modeling (MLM)
- **Framework:** Hugging Face Transformers
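For reference, the architecture above roughly corresponds to the following `BertConfig` sketch. This is illustrative only; the authoritative settings are in the `config.json` shipped with the checkpoint, and the vocabulary size is defined by the bundled tokenizer.
```python
from transformers import BertConfig

# Illustrative reconstruction of the architecture listed above (not the shipped config.json).
# vocab_size is omitted because it is determined by the custom Kyrgyz tokenizer.
config = BertConfig(
    hidden_size=512,
    num_hidden_layers=6,
    num_attention_heads=8,
    intermediate_size=2048,
    max_position_embeddings=512,
)
```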
## Training Data
This model was trained on a private (non-disclosable) dataset containing over 1.5 million Kyrgyz sentences. The text was tokenized with the **metinovadilet/bert-kyrgyz-tokenizer**.
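Assuming the tokenizer is available on the Hub under that name, it can be loaded and inspected directly (the example sentence is arbitrary):
```python
from transformers import AutoTokenizer

# Load the custom Kyrgyz tokenizer referenced above
tokenizer = AutoTokenizer.from_pretrained("metinovadilet/bert-kyrgyz-tokenizer")

# Inspect how Kyrgyz text is split into subword units
print(tokenizer.tokenize("Саламатсызбы, дүйнө!"))
```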
## Training Setup
- **Hardware:** Trained on an **RTX 3090 GPU**
- **Batch Size:** 16
- **Optimizer:** AdamW
- **Learning Rate:** 1e-4
- **Weight Decay:** 0.01
- **Training Epochs:** 1000
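The training script itself is not published; a minimal sketch of an equivalent optimizer setup with standard PyTorch `AdamW` would look like this:
```python
from torch.optim import AdamW
from transformers import BertForMaskedLM

model = BertForMaskedLM.from_pretrained("metinovadilet/KyrgyzBert")

# Optimizer settings matching the reported hyperparameters (illustrative only)
optimizer = AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
```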
## Intended Use
- **Text Completion & Prediction:** Filling in missing words in Kyrgyz text.
- **Feature Extraction:** Kyrgyz word embeddings for downstream NLP tasks.
- **Fine-Tuning:** Can be fine-tuned for **Kyrgyz sentiment analysis, named entity recognition (NER), machine translation, etc.**
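As an illustration of the feature-extraction use case, the encoder can be loaded without the MLM head. This is a minimal sketch; the example sentence and mean pooling are arbitrary choices, not part of the released model.
```python
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("metinovadilet/KyrgyzBert")
model = BertModel.from_pretrained("metinovadilet/KyrgyzBert")  # the MLM head is simply dropped

inputs = tokenizer("Кыргыз тили - мамлекеттик тил.", return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**inputs).last_hidden_state  # shape: (1, seq_len, 512)

# One simple way to get a sentence embedding: mean-pool the token vectors
sentence_embedding = hidden_states.mean(dim=1)
```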
## How to Use
You can load the model using Hugging Face's `transformers` library:
```python
from transformers import BertTokenizerFast, BertForMaskedLM
import torch
# Load model and tokenizer
model_name = "metinovadilet/KyrgyzBert"
tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = BertForMaskedLM.from_pretrained(model_name)
# Input text with [MASK] token
text = "Бул жерден [MASK] нерселерди таба аласыз."
# Tokenize input
inputs = tokenizer(text, return_tensors="pt")
# Model prediction
with torch.no_grad():
    outputs = model(**inputs).logits
# Find masked token index
masked_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1].item()
# Get top 5 predictions for the masked token
probs = torch.softmax(outputs[0, masked_index], dim=-1)
top_k = torch.topk(probs, k=5) # Get top 5 predictions
# Decode predicted tokens
predicted_tokens = [tokenizer.decode([token_id]) for token_id in top_k.indices.tolist()]
# Print predictions
print(f"Top predictions for [MASK]: {', '.join(predicted_tokens)}")
```
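Equivalently, the `fill-mask` pipeline handles mask lookup and decoding for you (assuming the bundled tokenizer uses the standard `[MASK]` token, as in the example above):
```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="metinovadilet/KyrgyzBert")

# Prints the top predicted tokens with their scores
for prediction in fill_mask("Бул жерден [MASK] нерселерди таба аласыз."):
    print(prediction["token_str"], prediction["score"])
```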
## Limitations
- The model may struggle with **low-resource dialects** and **code-switching**.
- Performance depends on the quality and diversity of training data.
- It is not fine-tuned for **specific tasks** like sentiment analysis or NER.
## Acknowledgments
This model was developed by **Metinov Adilet**. If you use this model, please consider citing our work.
## License
This model is released under the **Apache 2.0 License**.
## Citation
If you use this model in your research, please cite:
```
@misc{metinovadilet2025kyrgyzbert,
  author       = {Metinov Adilet},
  title        = {KyrgyzBert: A Small BERT Model for the Kyrgyz Language},
  year         = {2025},
  howpublished = {Hugging Face},
  url          = {https://huggingface.co/metinovadilet/KyrgyzBert}
}
```
## Contact
For questions, reach out to **Metinov Adilet** via Hugging Face or email: [email protected]
## Ulutsoft Collaboration
This model was developed in collaboration with **Ulutsoft**.