---
title: Turkish Tokenizer
colorFrom: blue
colorTo: blue
sdk: gradio
sdk_version: 5.42.0
app_file: app.py
pinned: false
license: cc-by-nc-nd-4.0
short_description: Turkish Morphological Tokenizer
---

# Turkish Tokenizer

A Turkish text tokenizer with morphological analysis, built with Gradio for easy visualization and interaction.

## Features

- **Morphological Analysis**: Breaks Turkish words down into roots, suffixes, and BPE tokens
- **Visual Tokenization**: Color-coded token display with interactive highlighting
- **Statistics Dashboard**: Detailed analytics, including compression ratios and token distribution
- **Real-time Processing**: Instant tokenization with live statistics
- **Example Texts**: Pre-loaded Turkish examples for testing

## How to Use

1. Enter Turkish text in the input field
2. Click "🚀 Tokenize" to process the text
3. View the color-coded tokens in the visualization
4. Check the statistics for detailed analysis
5. See the encoded token IDs and decoded text

## Token Types

- **🔴 Roots (ROOT)**: Base word forms
- **🔵 Suffixes (SUFFIX)**: Turkish grammatical suffixes
- **🟡 BPE**: Byte Pair Encoding tokens for subword units

## Examples

Try these example texts:

- "Merhaba Dünya! Bu bir gelişmiş Türkçe tokenizer testidir."
- "İstanbul'da yaşıyorum ve Türkçe dilini öğreniyorum."
- "KitapOkumak çok güzeldir ve bilgi verir."
- "Türkiye Cumhuriyeti'nin başkenti Ankara'dır."
- "Yapay zeka ve makine öğrenmesi teknolojileri gelişiyor."

## Technical Details

This tokenizer uses:

- Custom morphological analysis for Turkish
- JSON-based vocabulary files
- Gradio for the web interface

## Research Paper

This implementation is based on the research paper:

**"Tokens with Meaning: A Hybrid Tokenization Approach for NLP"**

📄 [arXiv:2508.14292](https://arxiv.org/abs/2508.14292)

**Authors:** M. Ali Bayram, Ali Arda Fincan, Ahmet Semih Gümüş, Sercan Karakaş, Banu Diri, Savaş Yıldırım, Demircan Çelik

**Abstract:** A hybrid tokenization framework combining rule-based morphological analysis with statistical subword segmentation improves tokenization for morphologically rich languages like Turkish. The method uses phonological normalization, root-affix dictionaries, and a novel algorithm that balances morpheme preservation with vocabulary efficiency.

Please cite this paper if you use this tokenizer in your research:

```bibtex
@article{bayram2025tokens,
  title={Tokens with Meaning: A Hybrid Tokenization Approach for NLP},
  author={Bayram, M. Ali and Fincan, Ali Arda and Gümüş, Ahmet Semih and Karakaş, Sercan and Diri, Banu and Yıldırım, Savaş and Çelik, Demircan},
  journal={arXiv preprint arXiv:2508.14292},
  year={2025},
  url={https://arxiv.org/abs/2508.14292}
}
```

## Files

- `app.py`: Main Gradio application
- `requirements.txt`: Python dependencies

## Local Development

To run locally:

```bash
pip install -r requirements.txt
python app.py
```

The app will then be available at `http://localhost:7860`.

## Dependencies

- `gradio`: Web interface framework
- `turkish-tokenizer`: Core tokenization library

## License

This project is available under the CC BY-NC-ND 4.0 license, as declared in the Space metadata above.
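## Appendix: How Hybrid Tokenization Works

The root/suffix/BPE scheme described in the Token Types section can be sketched in a few lines of Python. This is an illustrative toy, not the library's actual API or data: the `ROOTS` and `SUFFIXES` sets stand in for the JSON vocabulary files, and real Turkish morphology additionally requires the phonological normalization mentioned in the paper's abstract.

```python
# Illustrative sketch of hybrid morphological tokenization.
# The vocabularies below are toy examples, NOT the real JSON
# vocabulary files shipped with the turkish-tokenizer library.

ROOTS = {"kitap", "ev", "göz"}                # toy root dictionary
SUFFIXES = {"lar", "ler", "da", "de", "im"}   # toy suffix dictionary

def tokenize(word: str) -> list[tuple[str, str]]:
    """Split a word into (token, type) pairs: ROOT, SUFFIX, or BPE."""
    word = word.lower()
    # 1. Longest-prefix lookup in the root dictionary.
    root = next(
        (word[:i] for i in range(len(word), 0, -1) if word[:i] in ROOTS),
        None,
    )
    if root is None:
        # 3. Fallback: hand the whole word to subword (BPE) segmentation.
        return [(word, "BPE")]
    tokens = [(root, "ROOT")]
    rest = word[len(root):]
    # 2. Greedy longest-match suffix segmentation of the remainder.
    while rest:
        suf = next(
            (rest[:i] for i in range(len(rest), 0, -1) if rest[:i] in SUFFIXES),
            None,
        )
        if suf is None:
            tokens.append((rest, "BPE"))
            break
        tokens.append((suf, "SUFFIX"))
        rest = rest[len(suf):]
    return tokens

print(tokenize("kitaplarda"))
# → [('kitap', 'ROOT'), ('lar', 'SUFFIX'), ('da', 'SUFFIX')]
```

The greedy longest-match pass is what lets "kitaplarda" ("in the books") surface as one ROOT plus two SUFFIX tokens instead of opaque subword pieces; anything the dictionaries cannot explain falls through to BPE, which is the vocabulary-efficiency trade-off the paper describes.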