---
title: Turkish Tokenizer
colorFrom: blue
colorTo: blue
sdk: gradio
sdk_version: 5.42.0
app_file: app.py
pinned: false
license: cc-by-nc-nd-4.0
short_description: Turkish Morphological Tokenizer
---

# Turkish Tokenizer

A Turkish text tokenizer with morphological analysis, built with Gradio for easy visualization and interaction.

## Features

- **Morphological Analysis**: Breaks Turkish words down into roots, suffixes, and BPE tokens
- **Visual Tokenization**: Color-coded token display with interactive highlighting
- **Statistics Dashboard**: Detailed analytics, including compression ratios and token distribution
- **Real-time Processing**: Instant tokenization with live statistics
- **Example Texts**: Pre-loaded Turkish examples for testing

## How to Use

1. Enter Turkish text in the input field
2. Click "🚀 Tokenize" to process the text
3. View the color-coded tokens in the visualization
4. Check the statistics for detailed analysis
5. See the encoded token IDs and decoded text

## Token Types

- **🔴 Roots (ROOT)**: Base word forms
- **🔵 Suffixes (SUFFIX)**: Turkish grammatical suffixes
- **🟡 BPE**: Byte Pair Encoding tokens for subword units

## Examples

Try these example texts:

- "Merhaba Dünya! Bu bir gelişmiş Türkçe tokenizer testidir."
- "İstanbul'da yaşıyorum ve Türkçe dilini öğreniyorum."
- "KitapOkumak çok güzeldir ve bilgi verir."
- "Türkiye Cumhuriyeti'nin başkenti Ankara'dır."
- "Yapay zeka ve makine öğrenmesi teknolojileri gelişiyor."

## Technical Details

This tokenizer uses:

- Custom morphological analysis for Turkish
- JSON-based vocabulary files
- Gradio for the web interface

## Research Paper

This implementation is based on the research paper:

**"Tokens with Meaning: A Hybrid Tokenization Approach for NLP"**

📄 [arXiv:2508.14292](https://arxiv.org/abs/2508.14292)

**Authors:** M. Ali Bayram, Ali Arda Fincan, Ahmet Semih Gümüş, Sercan Karakaş, Banu Diri, Savaş Yıldırım, Demircan Çelik

**Abstract:** A hybrid tokenization framework combining rule-based morphological analysis with statistical subword segmentation improves tokenization for morphologically rich languages like Turkish. The method uses phonological normalization, root-affix dictionaries, and a novel algorithm that balances morpheme preservation with vocabulary efficiency.

Please cite this paper if you use this tokenizer in your research:

```bibtex
@article{bayram2025tokens,
  title={Tokens with Meaning: A Hybrid Tokenization Approach for NLP},
  author={Bayram, M. Ali and Fincan, Ali Arda and Gümüş, Ahmet Semih and Karakaş, Sercan and Diri, Banu and Yıldırım, Savaş and Çelik, Demircan},
  journal={arXiv preprint arXiv:2508.14292},
  year={2025},
  url={https://arxiv.org/abs/2508.14292}
}
```

## Files

- `app.py`: Main Gradio application
- `requirements.txt`: Python dependencies

## Local Development

To run locally:

```bash
pip install -r requirements.txt
python app.py
```

The app will then be available at `http://localhost:7860`.

## Dependencies

- `gradio`: Web interface framework
- `turkish-tokenizer`: Core tokenization library

## License

This project is available under the CC BY-NC-ND 4.0 license, as declared in the Space metadata above.
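## Appendix: How Hybrid Tokenization Works

The root/suffix/BPE scheme described in the Token Types section can be sketched in a few lines of Python. This is an illustrative toy, not the library's actual API or data: the `ROOTS` and `SUFFIXES` sets stand in for the JSON vocabulary files, and real Turkish morphology additionally requires the phonological normalization mentioned in the paper's abstract.

```python
# Illustrative sketch of hybrid morphological tokenization.
# The vocabularies below are toy examples, NOT the real JSON
# vocabulary files shipped with the turkish-tokenizer library.

ROOTS = {"kitap", "ev", "göz"}                # toy root dictionary
SUFFIXES = {"lar", "ler", "da", "de", "im"}   # toy suffix dictionary

def tokenize(word: str) -> list[tuple[str, str]]:
    """Split a word into (token, type) pairs: ROOT, SUFFIX, or BPE."""
    word = word.lower()
    # 1. Longest-prefix lookup in the root dictionary.
    root = next(
        (word[:i] for i in range(len(word), 0, -1) if word[:i] in ROOTS),
        None,
    )
    if root is None:
        # 3. Fallback: hand the whole word to subword (BPE) segmentation.
        return [(word, "BPE")]
    tokens = [(root, "ROOT")]
    rest = word[len(root):]
    # 2. Greedy longest-match suffix segmentation of the remainder.
    while rest:
        suf = next(
            (rest[:i] for i in range(len(rest), 0, -1) if rest[:i] in SUFFIXES),
            None,
        )
        if suf is None:
            tokens.append((rest, "BPE"))
            break
        tokens.append((suf, "SUFFIX"))
        rest = rest[len(suf):]
    return tokens

print(tokenize("kitaplarda"))
# → [('kitap', 'ROOT'), ('lar', 'SUFFIX'), ('da', 'SUFFIX')]
```

The greedy longest-match pass is what lets "kitaplarda" ("in the books") surface as one ROOT plus two SUFFIX tokens instead of opaque subword pieces; anything the dictionaries cannot explain falls through to BPE, which is the vocabulary-efficiency trade-off the paper describes.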