|
--- |
|
license: apache-2.0 |
|
language: |
|
- ky |
|
pipeline_tag: fill-mask |
|
library_name: transformers |
|
--- |
|
# KyrgyzBert |
|
|
|
## Overview |
|
KyrgyzBert is a **small-scale BERT-based language model** pre-trained on a large **Kyrgyz text corpus**. It is designed for **masked language modeling (MLM)** and serves as a base for downstream Kyrgyz NLP tasks such as **text classification**. Developed by **Metinov Adilet**, it aims to advance Kyrgyz NLP research and practical applications.
|
|
|
## Model Details |
|
- **Architecture:** BERT (small-scale variant) |
|
- **Tokenizer:** Custom Kyrgyz tokenizer (**metinovadilet/bert-kyrgyz-tokenizer**)
|
- **Hidden Size:** 512 |
|
- **Number of Layers:** 6 |
|
- **Attention Heads:** 8 |
|
- **Intermediate Size:** 2048 |
|
- **Max Sequence Length:** 512 |
|
- **Pretraining Task:** Masked Language Modeling (MLM) |
|
- **Framework:** Hugging Face Transformers |
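
For reference, the architecture above can be expressed as a `transformers` `BertConfig`. This is an illustrative sketch rather than the shipped config file; in particular, `vocab_size` is left at the library default here, since the real value is determined by the released tokenizer.

```python
from transformers import BertConfig

# Illustrative reconstruction of the architecture listed above.
# NOTE: vocab_size is left at the library default; the actual value
# comes from the metinovadilet/bert-kyrgyz-tokenizer vocabulary.
config = BertConfig(
    hidden_size=512,
    num_hidden_layers=6,
    num_attention_heads=8,
    intermediate_size=2048,
    max_position_embeddings=512,
)
print(config)
```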
|
|
|
## Training Data |
|
This model was trained on a non-disclosable dataset containing over 1.5 million Kyrgyz sentences. The dataset was tokenized using the **metinovadilet/bert-kyrgyz-tokenizer**.
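
To see how that tokenizer segments Kyrgyz text, you can load it directly. A minimal sketch, assuming the tokenizer repo loads via `AutoTokenizer`:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("metinovadilet/bert-kyrgyz-tokenizer")

# Inspect the subword segmentation of a Kyrgyz sentence
print(tokenizer.tokenize("Кыргызстан кооз өлкө."))
print("Vocabulary size:", tokenizer.vocab_size)
```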
|
|
|
## Training Setup |
|
- **Hardware:** Trained on an **RTX 3090 GPU** |
|
- **Batch Size:** 16 |
|
- **Optimizer:** AdamW |
|
- **Learning Rate:** 1e-4 |
|
- **Weight Decay:** 0.01 |
|
- **Training Epochs:** 1000
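
Below is a minimal sketch of how a comparable MLM run could be set up with the `Trainer` API. The toy one-sentence corpus is a stand-in for the real (non-disclosable) dataset, and the 15% masking rate is the library default rather than a documented choice.

```python
from datasets import Dataset
from transformers import (AutoTokenizer, BertConfig, BertForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("metinovadilet/bert-kyrgyz-tokenizer")
config = BertConfig(
    vocab_size=tokenizer.vocab_size,
    hidden_size=512,
    num_hidden_layers=6,
    num_attention_heads=8,
    intermediate_size=2048,
    max_position_embeddings=512,
)
model = BertForMaskedLM(config)

# Toy corpus stand-in for the real 1.5M-sentence dataset
texts = ["Бул жерден көп нерселерди таба аласыз."]
train_dataset = Dataset.from_dict({"text": texts}).map(
    lambda x: tokenizer(x["text"], truncation=True, max_length=512),
    remove_columns=["text"],
)

# Trainer defaults to AdamW, matching the optimizer listed above
args = TrainingArguments(
    output_dir="kyrgyzbert-mlm",
    per_device_train_batch_size=16,
    learning_rate=1e-4,
    weight_decay=0.01,
    num_train_epochs=1000,
)

trainer = Trainer(
    model=model,
    args=args,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
    train_dataset=train_dataset,
)
trainer.train()
```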
|
|
|
## Intended Use |
|
- **Text Completion & Prediction:** Filling in missing words in Kyrgyz text. |
|
- **Feature Extraction:** Kyrgyz word and sentence embeddings for downstream NLP tasks (see the sketch after this list).
|
- **Fine-Tuning:** Can be fine-tuned for **Kyrgyz sentiment analysis, named entity recognition (NER), machine translation, etc.** |
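
For the feature-extraction use case, here is a minimal sketch that mean-pools the final hidden states into a sentence vector; the pooling strategy is an illustrative choice, not part of the release:

```python
import torch
from transformers import AutoTokenizer, BertModel

model_name = "metinovadilet/KyrgyzBert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Loading the encoder from an MLM checkpoint drops the prediction head
# (transformers will warn about unused weights).
encoder = BertModel.from_pretrained(model_name)
encoder.eval()

inputs = tokenizer("Кыргыз тили - мамлекеттик тил.", return_tensors="pt")
with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state  # (1, seq_len, 512)

# Mean-pool over tokens to get a single sentence embedding
embedding = hidden.mean(dim=1)  # (1, 512)
```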
|
|
|
## How to Use |
|
You can load the model using Hugging Face's `transformers` library: |
|
|
|
```python
from transformers import BertTokenizerFast, BertForMaskedLM
import torch

# Load model and tokenizer
model_name = "metinovadilet/KyrgyzBert"
tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = BertForMaskedLM.from_pretrained(model_name)
model.eval()

# Input text with a [MASK] token
text = "Бул жерден [MASK] нерселерди таба аласыз."

# Tokenize input
inputs = tokenizer(text, return_tensors="pt")

# Model prediction
with torch.no_grad():
    outputs = model(**inputs).logits

# Find the masked token index (assumes exactly one [MASK])
masked_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1].item()

# Get the top 5 predictions for the masked token
probs = torch.softmax(outputs[0, masked_index], dim=-1)
top_k = torch.topk(probs, k=5)

# Decode predicted tokens
predicted_tokens = [tokenizer.decode([token_id]) for token_id in top_k.indices.tolist()]

# Print predictions
print(f"Top predictions for [MASK]: {', '.join(predicted_tokens)}")
```
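
Alternatively, the `fill-mask` pipeline wraps the same steps in a single call:

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="metinovadilet/KyrgyzBert")
for prediction in fill_mask("Бул жерден [MASK] нерселерди таба аласыз.", top_k=5):
    print(prediction["token_str"], round(prediction["score"], 4))
```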
|
|
|
## Limitations |
|
- The model may struggle with **low-resource dialects** and **code-switching**. |
|
- Performance depends on the quality and diversity of training data. |
|
- It is not fine-tuned for **specific tasks** like sentiment analysis or NER. |
|
|
|
## Acknowledgments |
|
This model was developed by **Metinov Adilet**. If you use this model, please consider citing our work. |
|
|
|
## License |
|
This model is released under the **Apache 2.0 License**. |
|
|
|
## Citation |
|
If you use this model in your research, please cite: |
|
```bibtex
@misc{metinovadilet2025kyrgyzbert,
  author       = {Metinov Adilet},
  title        = {KyrgyzBert: A Small BERT Model for the Kyrgyz Language},
  year         = {2025},
  howpublished = {Hugging Face},
  url          = {https://huggingface.co/metinovadilet/KyrgyzBert}
}
```
|
|
|
## Contact |
|
For questions, reach out to **Metinov Adilet** via Hugging Face or by email: [email protected]
|
|
|
## Ulutsoft Collaboration

This model was developed in collaboration with Ulutsoft.