|
--- |
|
license: apache-2.0 |
|
language: |
|
- ky |
|
pipeline_tag: fill-mask |
|
library_name: transformers |
|
--- |
|
# KyrgyzBert |
|
|
|
## Overview |
|
KyrgyzBert is a **small-scale BERT-based language model** pre-trained on a large **Kyrgyz text corpus**. It is designed for **masked language modeling (MLM)** and serves as a base for downstream Kyrgyz NLP tasks such as **text classification**. Developed by **Metinov Adilet**, it aims to advance Kyrgyz NLP research and practical applications.
|
|
|
## Model Details |
|
- **Architecture:** BERT (small-scale variant) |
|
- **Tokenizer:** Custom Kyrgyz tokenizer (**metinovadilet/bert-kyrgyz-tokenizer**)
|
- **Hidden Size:** 512 |
|
- **Number of Layers:** 6 |
|
- **Attention Heads:** 8 |
|
- **Intermediate Size:** 2048 |
|
- **Max Sequence Length:** 512 |
|
- **Pretraining Task:** Masked Language Modeling (MLM) |
|
- **Framework:** Hugging Face Transformers |
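
For reference, the architecture above can be expressed as a `transformers` `BertConfig`. This is an illustrative sketch rather than the shipped config file; in particular, `vocab_size` is left at the library default here, since the real value is determined by the released tokenizer.

```python
from transformers import BertConfig

# Illustrative reconstruction of the architecture listed above.
# NOTE: vocab_size is left at the library default; the actual value
# comes from the metinovadilet/bert-kyrgyz-tokenizer vocabulary.
config = BertConfig(
    hidden_size=512,
    num_hidden_layers=6,
    num_attention_heads=8,
    intermediate_size=2048,
    max_position_embeddings=512,
)
print(config)
```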
|
|
|
## Training Data |
|
This model was trained on a non-disclosable dataset containing over 1.5 million Kyrgyz sentences. The dataset was tokenized using the **metinovadilet/bert-kyrgyz-tokenizer**.
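
To see how that tokenizer segments Kyrgyz text, you can load it directly. A minimal sketch, assuming the tokenizer repo loads via `AutoTokenizer`:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("metinovadilet/bert-kyrgyz-tokenizer")

# Inspect the subword segmentation of a Kyrgyz sentence
print(tokenizer.tokenize("Кыргызстан кооз өлкө."))
print("Vocabulary size:", tokenizer.vocab_size)
```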
|
|
|
## Training Setup |
|
- **Hardware:** Trained on an **RTX 3090 GPU** |
|
- **Batch Size:** 16 |
|
- **Optimizer:** AdamW |
|
- **Learning Rate:** 1e-4 |
|
- **Weight Decay:** 0.01 |
|
- **Training Epochs:** 1000
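
Below is a minimal sketch of how a comparable MLM run could be set up with the `Trainer` API. The toy one-sentence corpus is a stand-in for the real (non-disclosable) dataset, and the 15% masking rate is the library default rather than a documented choice.

```python
from datasets import Dataset
from transformers import (AutoTokenizer, BertConfig, BertForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("metinovadilet/bert-kyrgyz-tokenizer")
config = BertConfig(
    vocab_size=tokenizer.vocab_size,
    hidden_size=512,
    num_hidden_layers=6,
    num_attention_heads=8,
    intermediate_size=2048,
    max_position_embeddings=512,
)
model = BertForMaskedLM(config)

# Toy corpus stand-in for the real 1.5M-sentence dataset
texts = ["Бул жерден көп нерселерди таба аласыз."]
train_dataset = Dataset.from_dict({"text": texts}).map(
    lambda x: tokenizer(x["text"], truncation=True, max_length=512),
    remove_columns=["text"],
)

# Trainer defaults to AdamW, matching the optimizer listed above
args = TrainingArguments(
    output_dir="kyrgyzbert-mlm",
    per_device_train_batch_size=16,
    learning_rate=1e-4,
    weight_decay=0.01,
    num_train_epochs=1000,
)

trainer = Trainer(
    model=model,
    args=args,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
    train_dataset=train_dataset,
)
trainer.train()
```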
|
|
|
## Intended Use |
|
- **Text Completion & Prediction:** Filling in missing words in Kyrgyz text. |
|
- **Feature Extraction:** Kyrgyz word and sentence embeddings for downstream NLP tasks (see the sketch after this list).
|
- **Fine-Tuning:** Can be fine-tuned for **Kyrgyz sentiment analysis, named entity recognition (NER), machine translation, etc.** |
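
For the feature-extraction use case, here is a minimal sketch that mean-pools the final hidden states into a sentence vector; the pooling strategy is an illustrative choice, not part of the release:

```python
import torch
from transformers import AutoTokenizer, BertModel

model_name = "metinovadilet/KyrgyzBert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Loading the encoder from an MLM checkpoint drops the prediction head
# (transformers will warn about unused weights).
encoder = BertModel.from_pretrained(model_name)
encoder.eval()

inputs = tokenizer("Кыргыз тили - мамлекеттик тил.", return_tensors="pt")
with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state  # (1, seq_len, 512)

# Mean-pool over tokens to get a single sentence embedding
embedding = hidden.mean(dim=1)  # (1, 512)
```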
|
|
|
## How to Use |
|
You can load the model using Hugging Face's `transformers` library: |
|
|
|
```python
from transformers import BertTokenizerFast, BertForMaskedLM
import torch

# Load model and tokenizer
model_name = "metinovadilet/KyrgyzBert"
tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = BertForMaskedLM.from_pretrained(model_name)
model.eval()

# Input text with a [MASK] token
text = "Бул жерден [MASK] нерселерди таба аласыз."

# Tokenize input
inputs = tokenizer(text, return_tensors="pt")

# Model prediction
with torch.no_grad():
    outputs = model(**inputs).logits

# Find the masked token index (assumes exactly one [MASK])
masked_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1].item()

# Get the top 5 predictions for the masked token
probs = torch.softmax(outputs[0, masked_index], dim=-1)
top_k = torch.topk(probs, k=5)

# Decode predicted tokens
predicted_tokens = [tokenizer.decode([token_id]) for token_id in top_k.indices.tolist()]

# Print predictions
print(f"Top predictions for [MASK]: {', '.join(predicted_tokens)}")
```
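
Alternatively, the `fill-mask` pipeline wraps the same steps in a single call:

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="metinovadilet/KyrgyzBert")
for prediction in fill_mask("Бул жерден [MASK] нерселерди таба аласыз.", top_k=5):
    print(prediction["token_str"], round(prediction["score"], 4))
```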
|
|
|
## Limitations |
|
- The model may struggle with **low-resource dialects** and **code-switching**. |
|
- Performance depends on the quality and diversity of training data. |
|
- It is not fine-tuned for **specific tasks** like sentiment analysis or NER. |
|
|
|
## Acknowledgments |
|
This model was developed by **Metinov Adilet**. If you use this model, please consider citing our work. |
|
|
|
## License |
|
This model is released under the **Apache 2.0 License**. |
|
|
|
## Citation |
|
If you use this model in your research, please cite: |
|
```bibtex
@misc{metinovadilet2025kyrgyzbert,
  author       = {Metinov Adilet},
  title        = {KyrgyzBert: A Small BERT Model for the Kyrgyz Language},
  year         = {2025},
  howpublished = {Hugging Face},
  url          = {https://huggingface.co/metinovadilet/KyrgyzBert}
}
```
|
|
|
## Contact |
|
For questions, reach out to **Metinov Adilet** via Hugging Face or by email: [email protected]
|
|
|
## Ulutsoft Collaboration

This model was developed in collaboration with Ulutsoft.