---
title: Awadhi BPE Tokenizer
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: "4.19.1"
app_file: app.py
pinned: false
license: mit
python_version: "3.10"
app_port: 7860
tags:
- awadhi
- tokenizer
- bpe
- text-compression
datasets:
- sunderkand_awdhi
---
# Awadhi BPE Tokenizer
This Space provides a Byte Pair Encoding (BPE) tokenizer for Awadhi text and reports how much it compresses the input. It features:
- Custom BPE implementation for Awadhi text
- Vocabulary size under 5,000 tokens (capped at 4,500)
- Compression ratio > 3.2
- Interactive web interface
## Usage
1. Enter Awadhi text in the input box
2. Click "Tokenize"
3. View tokenization results and statistics
## Implementation Details
- Starts from character-level tokens as the base vocabulary
- Implements BPE merging strategy
- Handles UTF-8 encoded Awadhi text
- Provides compression statistics
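
The points above can be illustrated with a minimal sketch of character-level BPE. This is not the code in `app.py`; the function names `train_bpe`, `merge_pair`, and `encode` are hypothetical, and the greedy "most frequent adjacent pair" merge rule is the textbook strategy, assumed here because the README does not spell out the exact variant.

```python
from collections import Counter


def merge_pair(tokens, left, right):
    """Replace each non-overlapping occurrence of (left, right) with left + right."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
            merged.append(left + right)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def train_bpe(text, max_vocab_size):
    """Learn BPE merges, starting from a character-level vocabulary (assumed rule)."""
    tokens = list(text)   # base tokens: individual Unicode characters
    vocab = set(tokens)
    merges = []           # ordered list of (left, right) pairs
    while len(vocab) < max_vocab_size:
        pair_counts = Counter(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        (left, right), count = pair_counts.most_common(1)[0]
        if count < 2:
            break         # no pair repeats, so further merges gain nothing
        tokens = merge_pair(tokens, left, right)
        merges.append((left, right))
        vocab.add(left + right)
    return merges, vocab


def encode(text, merges):
    """Tokenize text by replaying the learned merges in training order."""
    tokens = list(text)
    for left, right in merges:
        tokens = merge_pair(tokens, left, right)
    return tokens


if __name__ == "__main__":
    merges, vocab = train_bpe("राम राम राम", max_vocab_size=4500)
    print(encode("राम राम", merges))
```

Because Python strings are sequences of Unicode code points, UTF-8 input only needs to be decoded once before tokenization; joining the tokens returned by `encode` reproduces the original text.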
## Model Details
- Base tokenization: Character-level
- Maximum vocabulary size: 4500 tokens
- Training corpus: Sunderkand in Awadhi
- Compression target: > 3.2x
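
As a rough illustration of how the vocabulary cap and compression target above could be checked, here is a small sketch. The README does not define the compression ratio, so both definitions below (UTF-8 bytes per token and characters per token) are assumptions, and `compression_ratio` and `meets_targets` are hypothetical helper names rather than functions from this Space.

```python
MAX_VOCAB_SIZE = 4500      # from "Maximum vocabulary size" above
TARGET_RATIO = 3.2         # from "Compression target" above


def compression_ratio(text, tokens, unit="bytes"):
    """Return original size divided by token count.

    unit="bytes" measures the input in UTF-8 bytes (assumed definition);
    unit="chars" measures it in Unicode characters (alternative assumption).
    """
    if not tokens:
        raise ValueError("tokens must not be empty")
    if unit == "bytes":
        original_size = len(text.encode("utf-8"))
    elif unit == "chars":
        original_size = len(text)
    else:
        raise ValueError(f"unknown unit: {unit!r}")
    return original_size / len(tokens)


def meets_targets(vocab_size, ratio):
    """Check a trained tokenizer against the stated cap and target."""
    return vocab_size <= MAX_VOCAB_SIZE and ratio > TARGET_RATIO


if __name__ == "__main__":
    sample = "राम"
    print(compression_ratio(sample, [sample], unit="bytes"))
    print(compression_ratio(sample, [sample], unit="chars"))
```

The two definitions can differ by a large factor for Devanagari text, since each Devanagari character takes three bytes in UTF-8, so it matters which one a reported ratio refers to.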
## Technical Requirements
- Python 3.10+
- PyTorch
- Gradio 4.19.1+
## License
This project is licensed under the MIT License.