---
license: apache-2.0
datasets:
- mozilla-foundation/common_voice_17_0
language:
- uz
base_model:
- FacebookAI/xlm-roberta-base
---

# Tokenizer for Uzbek Language

## Introduction

This tokenizer is based on data from the Mozilla Common Voice dataset: 130,000 sentences from the train and validated splits.

## Features

- Splits text into tokens.
- Supports less common pronunciations and accents.

## Installation

Python and the required libraries:

```
pip install transformers datasets
```

## Usage

```python
from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("jamshidahmadov/uz_tokenizer")

# Sample sentence: "Various NLP projects are being built in Uzbekistan"
text = "O'zbekistonda turli xil NLP loyihalari qurilmoqda"

# Split the sentence into tokens and print them
tokens = tokenizer.tokenize(text)
print(tokens)
```

## Dataset Description

The Common Voice 17.0 dataset is multilingual and includes Uzbek.

## Contact

[Jamshid Ahmadov](https://www.linkedin.com/in/jamshid-ds)
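
## Inspecting Token IDs

Beyond splitting text into tokens, it can be useful to see the vocabulary ids and check how the text decodes back. The following is a minimal sketch, assuming the tokenizer loaded from the Hub exposes the standard `transformers` methods `tokenize`, `convert_tokens_to_ids`, and `decode`. The helper names `load_uz_tokenizer` and `inspect_tokenization` are hypothetical and not part of this repository.

```python
from typing import Any, Dict, List


def load_uz_tokenizer(repo_id: str = "jamshidahmadov/uz_tokenizer") -> Any:
    """Download the tokenizer from the Hugging Face Hub (needs network access)."""
    # Imported lazily so inspect_tokenization can be used on its own.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(repo_id)


def inspect_tokenization(tokenizer: Any, text: str) -> Dict[str, Any]:
    """Return the tokens, their vocabulary ids, and the text decoded from those ids."""
    tokens: List[str] = tokenizer.tokenize(text)
    ids: List[int] = tokenizer.convert_tokens_to_ids(tokens)
    decoded: str = tokenizer.decode(ids)
    return {"tokens": tokens, "ids": ids, "decoded": decoded}
```

Calling `inspect_tokenization(load_uz_tokenizer(), "O'zbekistonda turli xil NLP loyihalari qurilmoqda")` should return a dictionary with the three fields. The decoded string may differ slightly from the input (for example in whitespace), depending on how the tokenizer normalizes text.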