Enhance README with a new section detailing the research paper on the hybrid tokenization approach, including citation information and authors. Update requirements to specify the version of the Turkish Tokenizer package.
Files changed:
- README.md +24 -0
- requirements.txt +1 -1
README.md
CHANGED
````diff
@@ -55,6 +55,30 @@ This tokenizer uses:
 - Gradio for the web interface
 - Advanced tokenization algorithms
 
+## Research Paper
+
+This implementation is based on the research paper:
+
+**"Tokens with Meaning: A Hybrid Tokenization Approach for NLP"**
+
+📄 [arXiv:2508.14292](https://arxiv.org/abs/2508.14292)
+
+**Authors:** M. Ali Bayram, Ali Arda Fincan, Ahmet Semih Gümüş, Sercan Karakaş, Banu Diri, Savaş Yıldırım, Demircan Çelik
+
+**Abstract:** A hybrid tokenization framework combining rule-based morphological analysis with statistical subword segmentation improves tokenization for morphologically rich languages like Turkish. The method uses phonological normalization, root-affix dictionaries, and a novel algorithm that balances morpheme preservation with vocabulary efficiency.
+
+Please cite this paper if you use this tokenizer in your research:
+
+```bibtex
+@article{bayram2024tokens,
+  title={Tokens with Meaning: A Hybrid Tokenization Approach for NLP},
+  author={Bayram, M. Ali and Fincan, Ali Arda and Gümüş, Ahmet Semih and Karakaş, Sercan and Diri, Banu and Yıldırım, Savaş and Çelik, Demircan},
+  journal={arXiv preprint arXiv:2508.14292},
+  year={2025},
+  url={https://arxiv.org/abs/2508.14292}
+}
+```
+
 ## Files
 
 - `app.py`: Main Gradio application
````
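The abstract added above describes a hybrid of rule-based morphological analysis (root-affix dictionaries) with statistical subword fallback. A minimal, self-contained Python sketch of that general idea follows; the toy dictionaries and the greedy fallback are illustrative assumptions, not the actual data or algorithm of the `turkish-tokenizer` package:

```python
# Illustrative hybrid tokenizer: try a root + affix dictionary
# decomposition first, fall back to greedy subword segmentation.
# The tiny dictionaries below are toy assumptions for demonstration.

ROOTS = {"ev", "kitap", "göz"}               # toy root dictionary
AFFIXES = {"ler", "lar", "de", "im"}         # toy affix dictionary
SUBWORDS = {"o", "k", "u", "l", "ok", "ul"}  # toy subword vocabulary

def morph_split(word):
    """Return [root, affix, ...] if the word decomposes fully, else None."""
    for i in range(len(word), 0, -1):        # longest root prefix first
        if word[:i] in ROOTS:
            parts, rest = [word[:i]], word[i:]
            while rest:
                for j in range(len(rest), 0, -1):  # longest affix first
                    if rest[:j] in AFFIXES:
                        parts.append(rest[:j])
                        rest = rest[j:]
                        break
                else:
                    return None              # affix lookup failed
            return parts
    return None

def subword_split(word):
    """Greedy longest-match fallback over the subword vocabulary."""
    parts, rest = [], word
    while rest:
        for j in range(len(rest), 0, -1):
            if rest[:j] in SUBWORDS:
                parts.append(rest[:j])
                rest = rest[j:]
                break
        else:
            parts.append(rest[0])            # unknown character: emit as-is
            rest = rest[1:]
    return parts

def tokenize(word):
    # Prefer the morpheme-preserving path; fall back to subwords.
    return morph_split(word) or subword_split(word)

print(tokenize("evlerde"))  # morphological path: ['ev', 'ler', 'de']
print(tokenize("okul"))     # subword fallback:   ['ok', 'ul']
```

This mirrors the trade-off the paper names: keep whole morphemes when the dictionaries cover the word, and fall back to vocabulary-efficient subwords when they do not.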
requirements.txt
CHANGED
```diff
@@ -1,2 +1,2 @@
 gradio
-turkish-tokenizer
+turkish-tokenizer==0.2.24
```
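With the version now pinned, the environment can be reproduced with a standard pip install (assuming the updated `requirements.txt` is in the working directory):

```shell
pip install -r requirements.txt
```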