BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization • Paper • 2505.24689 • Published May 30
Evaluating Morphological Alignment of Tokenizers in 70 Languages • Paper • 2507.06378 • Published Jul 8
Tokenizer Study • Collection • Models comparing the effects of tokenizer properties on pre-training compression, and the relationship between compression and downstream performance. • 84 items • Updated 5 days ago