---
language: te
tags:
- telugu
- tokenizer
- unigram
license: mit
---

# Telugu Tokenizer

A Unigram tokenizer trained specifically for the Telugu language on a large corpus of Telugu text from Wikipedia and news sources. It is designed to handle Telugu text efficiently while maintaining a high compression ratio.

## Key Features

### Tokenizer Statistics

- **Vocabulary Size**: 50,000 tokens (✓ exceeds the 5,000+ requirement)
- **Compression Ratio**: 6.77 (✓ meets the ≥3.0 requirement)
- **Average Token Length**: 6.26 characters
- **Training Data**: 2,500+ Telugu articles
- **Minimum Text Length**: 500 characters per article

### Model Configuration

- **Architecture**: Unigram Language Model
- **Max Piece Length**: 128
- **Sub-iterations**: 20
- **Initial Vocabulary**: 50,000 tokens
- **Auto-scaling**: up to 500,000 tokens if needed

### Special Tokens

- `<s>`: start-of-text token
- `</s>`: end-of-text token
- `<unk>`: unknown token
- `<pad>`: padding token
- `<mask>`: mask token (for potential MLM tasks)

## Dataset Details

- **Sources**:
  - Telugu Wikipedia articles
  - Major Telugu news websites
  - Combined and cleaned text corpus
- **Content**: diverse topics including literature, culture, history, and general knowledge
- **Preprocessing** (a sketch of these steps appears at the end of this card):
  - Removed references and citations
  - Normalized whitespace
  - Filtered out short articles
  - Cleaned special characters
  - Combined short texts for better context

## Usage

### Installation

```bash
pip install tokenizers
```

### Basic Usage

```python
from tokenizers import Tokenizer

# Load the tokenizer
tokenizer = Tokenizer.from_file("tokenizer.json")

# Encode text
text = "నమస్కారం"  # Hello
encoding = tokenizer.encode(text)

# Get tokens
print("Tokens:", encoding.tokens)
print("Token IDs:", encoding.ids)
```

### Example Output

```python
# Input: "తెలుగు భాష చాలా అందమైనది"
# Output tokens: ['తెలుగు', ' భాష', ' చాలా', ' అంద', 'మైన', 'ది']
```

## Technical Details

### Tokenizer Configuration

- **Model**: Unigram Language Model (SentencePiece-style)
- **Pre-tokenization**: ByteLevel + character-level splitting
- **Decoder**: ByteLevel
- **Post-processor**: ByteLevel with trimmed offsets

### Performance Metrics

1. **Compression Ratio**: 6.77
   - Calculated as total_chars / total_tokens; a higher ratio indicates better compression
   - Median ratio: 7.05
   - A verification sketch appears at the end of this card
2. **Vocabulary Coverage**: 50,000 unique tokens
   - Includes special tokens
   - Optimized for Telugu language patterns
   - Vocabulary size auto-scales for better compression

## Examples

See `examples.json` for more tokenization examples covering different types of Telugu text, including:

- Short phrases
- Complete sentences
- Long paragraphs
- Various writing styles

## Training Process

The tokenizer was trained in the following steps (a training sketch appears at the end of this card):

1. Collected 2,500+ Telugu articles from multiple sources
2. Cleaned and preprocessed the text
3. Combined short texts to create better context
4. Trained a Unigram model with an initial vocabulary size of 50,000
5. Auto-scaled the vocabulary when needed for better compression
6. Validated the result against the requirements
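
## Reproduction Sketches

The snippets below are minimal sketches of the pipeline described above, not the exact scripts used to build this tokenizer. File names, regex patterns, and the combining target length are assumptions made for illustration.

### Preprocessing (sketch)

A rough version of the cleaning steps from the Dataset Details section: stripping reference markers, normalizing whitespace, filtering articles under 500 characters, and combining short texts for better context. The reference-marker pattern and the 2,000-character combining target are assumptions.

```python
import re

MIN_CHARS = 500  # minimum article length kept, per the card


def clean_article(text: str) -> str | None:
    """Clean one article; return None if it falls below the length threshold."""
    # Drop wiki-style reference markers such as [1], [23] (assumed pattern)
    text = re.sub(r"\[\d+\]", "", text)
    # Collapse runs of whitespace and trim
    text = re.sub(r"\s+", " ", text).strip()
    # Filter out short articles
    return text if len(text) >= MIN_CHARS else None


def combine_short_texts(texts: list[str], target_len: int = 2000) -> list[str]:
    """Greedily merge consecutive texts so each training line has more context."""
    combined, buffer = [], ""
    for t in texts:
        buffer = f"{buffer} {t}".strip()
        if len(buffer) >= target_len:
            combined.append(buffer)
            buffer = ""
    if buffer:
        combined.append(buffer)
    return combined
```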
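
### Training (sketch)

A minimal Unigram training setup with the `tokenizers` library, mirroring the configuration listed above: ByteLevel pre-tokenization, decoding, and post-processing, a max piece length of 128, 20 sub-iterations, and an initial vocabulary of 50,000. The corpus file name is a placeholder, and the special-token strings follow the names listed in the Special Tokens section.

```python
from tokenizers import Tokenizer, decoders, pre_tokenizers, processors
from tokenizers.models import Unigram
from tokenizers.trainers import UnigramTrainer

# Empty Unigram model with the ByteLevel components described in the card
tokenizer = Tokenizer(Unigram())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

trainer = UnigramTrainer(
    vocab_size=50000,      # initial vocabulary size
    max_piece_length=128,  # max piece length
    n_sub_iterations=20,   # sub-iterations
    special_tokens=["<s>", "</s>", "<unk>", "<pad>", "<mask>"],
    unk_token="<unk>",
)

# "telugu_corpus.txt" is a placeholder for the cleaned, combined corpus file
tokenizer.train(["telugu_corpus.txt"], trainer)
tokenizer.save("tokenizer.json")
```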
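
### Verifying the Compression Ratio (sketch)

The compression ratio is defined above as total_chars / total_tokens. This snippet recomputes it over a couple of placeholder sentences; the reported 6.77 was measured on the full corpus, so a small sample will not reproduce it exactly.

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")

# Placeholder sample; substitute held-out Telugu text for a meaningful estimate
samples = [
    "తెలుగు భాష చాలా అందమైనది",
    "నమస్కారం",
]

total_chars = sum(len(s) for s in samples)
total_tokens = sum(len(tokenizer.encode(s).ids) for s in samples)

# Compression ratio = total_chars / total_tokens, as defined in the card
print(f"Compression ratio: {total_chars / total_tokens:.2f}")
```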