---
title: Awadhi BPE Tokenizer
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: "4.19.1"
app_file: app.py
pinned: false
license: mit
python_version: "3.10"
app_port: 7860
tags:
- awadhi
- tokenizer
- bpe
- text-compression
datasets:
- sunderkand_awdhi
---
# Awadhi BPE Tokenizer

This space provides a Byte Pair Encoding (BPE) implementation for Awadhi text compression. It features:
- Custom BPE implementation for Awadhi text
- Vocabulary size < 5000 tokens
- Compression ratio > 3.2
- Interactive web interface
## Usage

1. Enter Awadhi text in the input box
2. Click "Tokenize"
3. View tokenization results and statistics
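
The three steps above map naturally onto a small Gradio layout: one text box, one "Tokenize" button, and two outputs. The following is only a sketch of how such an interface might be wired, not the contents of this Space's `app.py`. The names `tokenize_and_report` and `build_demo` are hypothetical, and the character-level split is a placeholder for the trained BPE tokenizer.

```python
def tokenize_and_report(text):
    """Placeholder callback: returns (tokens, statistics) as display strings.

    Assumption: a real app would call the trained BPE tokenizer here;
    this stand-in simply splits the input into single characters.
    """
    tokens = list(text)
    shown = " | ".join(tokens)
    stats = f"{len(text)} characters, {len(tokens)} tokens"
    return shown, stats


def build_demo(tokenize_fn=tokenize_and_report):
    """Build the UI: input box, "Tokenize" button, results and statistics."""
    import gradio as gr  # imported here so the callback stays usable without Gradio

    with gr.Blocks(title="Awadhi BPE Tokenizer") as demo:
        text_in = gr.Textbox(label="Awadhi text", lines=4)
        button = gr.Button("Tokenize")
        tokens_out = gr.Textbox(label="Tokens")
        stats_out = gr.Textbox(label="Statistics")
        # Clicking the button sends the input text to the callback and
        # routes its two return values to the two output boxes.
        button.click(fn=tokenize_fn, inputs=text_in, outputs=[tokens_out, stats_out])
    return demo
```

Under these assumptions, `build_demo().launch(server_port=7860)` would serve the interface on the port named in the Space configuration.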
## Implementation Details

- Uses character-level tokenization as base
- Implements BPE merging strategy
- Handles UTF-8 encoded Awadhi text
- Provides compression statistics
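
To make the merging strategy concrete, here is a minimal sketch of character-level BPE in plain Python. It illustrates the general technique and is not this Space's actual code. The function names (`train_bpe`, `merge_pair`, `encode`) and the stopping rule (stop when no pair occurs at least twice) are assumptions made for the example.

```python
from collections import Counter


def merge_pair(tokens, left, right):
    """Replace each adjacent (left, right) pair with the merged token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
            out.append(left + right)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def train_bpe(text, max_vocab_size=4500):
    """Learn an ordered list of merges, starting from single characters."""
    tokens = list(text)          # character-level base (Unicode code points)
    vocab = set(tokens)
    merges = []
    while len(vocab) < max_vocab_size:
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:
            break                # assumed stopping rule: no repeated pair left
        merges.append((left, right))
        vocab.add(left + right)
        tokens = merge_pair(tokens, left, right)
    return merges, vocab


def encode(text, merges):
    """Tokenize new text by replaying the learned merges in order."""
    tokens = list(text)
    for left, right in merges:
        tokens = merge_pair(tokens, left, right)
    return tokens
```

Because the base units are Python string characters, Devanagari vowel signs and consonants start out as separate tokens and are joined only when the corpus makes that pair frequent. Concatenating the output of `encode` always reproduces the input text.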
## Model Details

- Base tokenization: Character-level
- Maximum vocabulary size: 4500 tokens
- Training corpus: Sunderkand in Awadhi
- Compression target: > 3.2x
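
The list above does not say how the compression ratio is measured. Two common definitions are characters per token and UTF-8 bytes per token, and they differ by roughly a factor of three for Devanagari text, since each Devanagari code point takes three bytes in UTF-8. The sketch below computes both. The helper names `compression_stats` and `meets_targets` are hypothetical, and the default choice of `bytes_per_token` is an assumption, not something stated by this project.

```python
def compression_stats(text, tokens):
    """Return token count plus two candidate compression ratios."""
    n_tokens = len(tokens)
    if n_tokens == 0:
        return {"tokens": 0, "chars_per_token": 0.0, "bytes_per_token": 0.0}
    return {
        "tokens": n_tokens,
        "chars_per_token": len(text) / n_tokens,
        "bytes_per_token": len(text.encode("utf-8")) / n_tokens,
    }


def meets_targets(stats, vocab_size, max_vocab=4500, min_ratio=3.2,
                  ratio_key="bytes_per_token"):
    """Check the vocabulary cap and the ratio target under the chosen definition."""
    return vocab_size <= max_vocab and stats[ratio_key] > min_ratio
```

For example, the word "राम" encoded as a single token gives 3 characters per token but 9 bytes per token, so the definition in use should be stated alongside any reported ratio.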
## Technical Requirements

- Python 3.10+
- PyTorch
- Gradio 4.19.1+
## License

This project is licensed under the MIT License.