---
title: Hindi BPE Tokenizer
emoji: 🚀
colorFrom: blue
colorTo: green
sdk: docker
app_file: dockerfile
pinned: false
---

# Hindi Byte Pair Encoding (BPE) Implementation

This project implements Byte Pair Encoding (BPE) for Hindi text tokenization. The implementation includes both character-level and subword tokenization, with a focus on handling the unique characteristics of the Hindi (Devanagari) script.

## Core Components

### 1. Base Vocabulary (वर्णमाला)

The tokenizer starts with a base vocabulary consisting of:

- **व्यंजन (Consonants)**: क, ख, ग, घ, ङ, च, छ, ज, झ, ञ, etc.
- **स्वर (Vowels)**: अ, आ, इ, ई, उ, ऊ, ऋ, ए, ऐ, ओ, औ
- **मात्राएँ (Matras)**: ा, ि, ी, ु, ू, ृ, े, ै, ो, ौ
- **विशेष चिह्न (Special Characters)**:
  - Halant (विराम): ्
  - Anusvara (अनुस्वार): ं
  - Visarga (विसर्ग): ः
  - Chandrabindu (चन्द्रबिन्दु): ँ
  - Nukta (नुक्ता): ़

#### Learned Vocabulary

Refer to `backend/bpe_model_latest.json` and `backend/token_frequencies.json` for the learned merges and token frequencies.
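
A quick way to inspect these artifacts (a minimal sketch; the exact JSON schema depends on the training run, so this only peeks at the top-level structure):

```python
import json

# Load the persisted BPE model and token frequency artifacts
with open("backend/bpe_model_latest.json", encoding="utf-8") as f:
    model = json.load(f)
with open("backend/token_frequencies.json", encoding="utf-8") as f:
    freqs = json.load(f)

# Peek at the top-level structure without assuming a specific schema
print(type(model).__name__, list(model)[:5] if isinstance(model, dict) else len(model))
print(type(freqs).__name__, list(freqs)[:5] if isinstance(freqs, dict) else len(freqs))
```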

### 2. BPE Training Process

#### 2.1 Data Preparation

1. **Text Collection**:
   - Downloads Hindi articles from Wikipedia
   - Cleans and preprocesses the text (removes markup, references, etc.)
   - Splits it into sentences
2. **Initial Tokenization** (see the sketch after this list):
   - Splits text into words
   - Converts each word into space-separated characters
   - Creates an initial frequency dictionary
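
A minimal sketch of the initial tokenization step (the function and variable names here are illustrative, not the project's actual API):

```python
from collections import Counter

def build_word_freqs(sentences):
    """Build the initial word-frequency dictionary.

    Each word is stored as a space-separated sequence of characters,
    the representation the merge loop in section 2.2 operates on.
    """
    word_freqs = Counter()
    for sentence in sentences:
        for word in sentence.split():
            word_freqs[" ".join(word)] += 1
    return dict(word_freqs)

# Example:
# build_word_freqs(["नमस्ते भारत"])
# -> {'न म स ् त े': 1, 'भ ा र त': 1}
```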

#### 2.2 Learning Algorithm

1. **Initialization**:

   ```python
   vocab = set(base_vocabulary)  # start from the Devanagari base vocabulary
   learned_vocab = set()         # tokens discovered by BPE merges
   merges = {}                   # (left, right) pair -> merged token
   ```

2. **Training Loop** (the helpers `get_pair_frequencies` and `apply_merge` are sketched after this list):

   ```python
   while len(vocab) < target_vocab_size:
       # Count pair frequencies
       pairs = get_pair_frequencies(word_freqs)
       if not pairs:
           break

       # Find the most frequent pair
       best_pair, _ = max(pairs.items(), key=lambda x: x[1])
       new_token = ''.join(best_pair)

       # Add to vocabulary
       vocab.add(new_token)
       learned_vocab.add(new_token)
       merges[best_pair] = new_token

       # Update word frequencies
       word_freqs = apply_merge(word_freqs, best_pair)
   ```
    
3. **Merge Operations**:
   - Finds the most frequent adjacent token pair
   - Merges it into a new token
   - Updates the vocabulary and frequency counts
   - Keeps a record of all trained merges in `merge_history`
   - At regular intervals, revisits the vocabulary and refreshes it with the top 5000 merges from `merge_history`
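
The two helpers used in the training loop could look like this (a minimal sketch consistent with the space-separated word representation; the actual implementation may differ):

```python
import re
from collections import Counter

def get_pair_frequencies(word_freqs):
    """Count adjacent token-pair frequencies, weighted by word frequency."""
    pairs = Counter()
    for word, freq in word_freqs.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def apply_merge(word_freqs, pair):
    """Rewrite every word so adjacent occurrences of `pair` become one token."""
    bigram = re.escape(" ".join(pair))
    # Match the bigram only at whole-token boundaries
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    new_token = "".join(pair)
    return {pattern.sub(new_token, word): freq
            for word, freq in word_freqs.items()}
```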

### 3. Tokenization Process

#### 3.1 Character-level Tokenization
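
A minimal sketch of character-level tokenization (illustrative; matras, halant, and other combining marks stay as separate tokens, matching the base vocabulary above):

```python
def char_tokenize(text):
    """Split Hindi text into individual characters, dropping whitespace."""
    return [ch for ch in text if not ch.isspace()]

# char_tokenize("नमस्ते भारत")
# -> ['न', 'म', 'स', '्', 'त', 'े', 'भ', 'ा', 'र', 'त']
```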


### 4. Token Analysis

#### 4.1 Token Types
- **Consonant**: Single consonant characters (व्यंजन)
- **Vowel**: Single vowel characters (स्वर)
- **Matra**: Vowel markers (मात्राएँ)
- **Special**: Special characters like halant, anusvara, etc.
- **Compound**: Learned tokens combining multiple characters
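
One way to classify a token into these types (a sketch using Devanagari Unicode ranges; the project's actual type-identification logic may differ):

```python
CONSONANTS = {chr(c) for c in range(0x0915, 0x093A)}  # क .. ह
VOWELS = {chr(c) for c in range(0x0905, 0x0915)}      # अ .. औ
MATRAS = {chr(c) for c in range(0x093E, 0x094D)}      # ा .. ौ
SPECIALS = {"्", "ं", "ः", "ँ", "़"}  # halant, anusvara, visarga, chandrabindu, nukta

def token_type(token):
    """Classify a token as consonant, vowel, matra, special, or compound."""
    if len(token) > 1:
        return "compound"
    if token in CONSONANTS:
        return "consonant"
    if token in VOWELS:
        return "vowel"
    if token in MATRAS:
        return "matra"
    if token in SPECIALS:
        return "special"
    return "other"
```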

#### 4.2 Statistics Tracking
- Original character count
- Token count
- Unique tokens
- Compression ratio
- Token frequencies
- Merge frequencies
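
A sketch of how these statistics might be computed (names are illustrative; compression ratio is taken here as characters per token, one common definition):

```python
from collections import Counter

def token_stats(text, tokens):
    """Compute the basic statistics tracked for a tokenization."""
    char_count = len(text.replace(" ", ""))
    return {
        "original_char_count": char_count,
        "token_count": len(tokens),
        "unique_tokens": len(set(tokens)),
        "compression_ratio": char_count / max(len(tokens), 1),
        "token_frequencies": Counter(tokens),
    }
```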

### 5. Metrics and Visualization

#### 5.1 Training Metrics
- Vocabulary size growth
- Learned vocabulary size
- Compression ratios
- Merge frequencies
- Unique token counts

#### 5.2 Token Analysis
- Token type distribution
- Token lengths
- Unicode representations
- Usage frequencies

## Implementation Details

### File Structure

### Key Classes

1. **HindiTokenizer**:
   - Manages the base vocabulary
   - Handles character-level tokenization
   - Provides token type identification
2. **BPETokenizer** (skeleton sketch after this list):
   - Inherits from `HindiTokenizer`
   - Implements the BPE training algorithm
   - Handles subword tokenization
   - Tracks statistics and merge history
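
A skeleton showing how the two classes relate (illustrative; `learn_bpe` and `tokenize_with_details` appear in the usage example below, the other names are assumptions):

```python
class HindiTokenizer:
    def __init__(self):
        # BASE_VOCABULARY: the Devanagari characters listed in section 1
        self.vocab = set(BASE_VOCABULARY)

    def char_tokenize(self, text):
        """Character-level tokenization (section 3.1)."""
        return [ch for ch in text if not ch.isspace()]


class BPETokenizer(HindiTokenizer):
    def __init__(self, vocab_size=10000):
        super().__init__()
        self.vocab_size = vocab_size
        self.merges = {}         # (left, right) pair -> merged token
        self.merge_history = []  # every merge performed during training

    def learn_bpe(self, text):
        """Run the training loop from section 2.2 on `text`."""
        ...

    def tokenize_with_details(self, text):
        """Tokenize `text` and return tokens plus per-token statistics."""
        ...
```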

### API Endpoints

1. **POST /tokenize**:
   - Input: Hindi text
   - Output:
     - Original and BPE tokens
     - Token statistics
     - Token details (type, length, Unicode)
2. **GET /vocabulary-stats** (example requests after this list):
   - Returns:
     - Current vocabulary size
     - Most frequent tokens
     - Most frequent pairs
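
Example requests against a locally running container (a sketch; the request field names and port are assumptions based on the descriptions above):

```python
import requests

BASE_URL = "http://localhost:7860"  # adjust to match the Docker setup

# Tokenize a piece of Hindi text
resp = requests.post(f"{BASE_URL}/tokenize", json={"text": "नमस्ते भारत"})
print(resp.json())  # tokens, statistics, per-token details

# Fetch vocabulary statistics
print(requests.get(f"{BASE_URL}/vocabulary-stats").json())
```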

### Usage Example

```python
# Initialize tokenizer
tokenizer = BPETokenizer(vocab_size=10000)

# Train on a Hindi corpus string
text = "नमस्ते भारत। भारत एक देश है।"  # replace with a real corpus
tokenizer.learn_bpe(text)

# Tokenize text
result = tokenizer.tokenize_with_details("नमस्ते भारत")
print(result['bpe_tokens'])
```

## Performance Considerations

1. **Memory Usage**:
   - Stores the full vocabulary in memory
   - Maintains frequency dictionaries
   - Tracks merge history
2. **Speed Optimizations**:
   - Uses sets for vocabulary lookups
   - Caches token types
   - Optimizes merge operations
3. **Training Efficiency**:
   - Stops when no frequent pairs remain
   - Uses a frequency threshold for merges
   - Tracks progress with tqdm

## Future Improvements

1. **Tokenization**:
   - Handle numbers and English text better
   - Improve handling of Hindi conjuncts
   - Add support for rare character combinations
2. **Training**:
   - Add parallel processing for large datasets
   - Implement vocabulary pruning
   - Add support for incremental training
3. **Analysis**:
   - Add more detailed token statistics
   - Improve visualization of merge patterns
   - Add model comparison tools

## UI for the Tokenizer

- Accepts content from any source containing pure Hindi articles
- Tokenizes the content
- Displays the original text, original tokens, BPE tokens, original encoded tokens, token numbers, statistics, and token details

*UI Screenshot*