Space: pradeep6kumar2024/awadhi_bpe

---
title: Awadhi BPE Tokenizer
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 4.19.1
app_file: app.py
pinned: false
license: mit
python_version: '3.10'
app_port: 7860
tags:
  - awadhi
  - tokenizer
  - bpe
  - text-compression
datasets:
  - sunderkand_awdhi
---

Awadhi BPE Tokenizer

This Space provides a Byte Pair Encoding (BPE) tokenizer for compressing Awadhi text. It features:

- A custom BPE implementation for Awadhi text
- A vocabulary of fewer than 5,000 tokens
- A compression ratio above 3.2
- An interactive web interface

Usage

1. Enter Awadhi text in the input box.
2. Click "Tokenize".
3. View the tokenization results and statistics.

Implementation Details

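- Starts from character-level tokenization as the base vocabulary
- Applies the BPE merging strategy: repeatedly merges the most frequent pair of adjacent tokens into a new token
- Handles UTF-8 encoded Awadhi text
- Reports compression statistics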
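
The points above can be illustrated with a minimal sketch of character-level BPE. This is not the code in `app.py`; the function names `train_bpe`, `apply_merge`, and `encode`, the tie-breaking rule, and the stopping condition are all assumptions made for illustration.

```python
from collections import Counter


def apply_merge(tokens, left, right):
    """Replace every adjacent (left, right) pair with the merged token."""
    out = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
            out.append(left + right)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def train_bpe(text, max_vocab_size):
    """Learn an ordered list of merges, starting from single characters.

    Hypothetical helper: stops when the vocabulary reaches max_vocab_size
    or when no adjacent pair occurs at least twice.
    """
    tokens = list(text)          # character-level base tokenization
    vocab = set(tokens)
    merges = []
    while len(vocab) < max_vocab_size:
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:
            break                # nothing repeats, so merging gains nothing
        merges.append((left, right))
        vocab.add(left + right)
        tokens = apply_merge(tokens, left, right)
    return merges, vocab


def encode(text, merges):
    """Tokenize text by replaying the learned merges in training order."""
    tokens = list(text)
    for left, right in merges:
        tokens = apply_merge(tokens, left, right)
    return tokens


corpus = "राम राम राम"
merges, vocab = train_bpe(corpus, max_vocab_size=4500)
tokens = encode(corpus, merges)
print(tokens)
```

Because Python strings are sequences of Unicode code points, `list(text)` splits Devanagari text into individual code points (consonants and vowel signs separately), and joining the tokens always reproduces the input.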

Model Details

- Base tokenization: Character-level
- Maximum vocabulary size: 4500 tokens
- Training corpus: Sunderkand in Awadhi
- Compression target: > 3.2x
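
The README does not say how the compression ratio is measured. The sketch below assumes two common definitions, characters per token and UTF-8 bytes per token, and reports both; `compression_stats` and `meets_target` are hypothetical helpers, not functions from this Space.

```python
def compression_stats(text, tokens):
    """Return size counts and two candidate compression ratios.

    Assumption: "compression ratio" means input size divided by the
    number of tokens. Which input size the Space uses is not stated,
    so both characters and UTF-8 bytes are reported.
    """
    if not tokens:
        raise ValueError("tokens must be non-empty")
    n_chars = len(text)                    # Unicode code points
    n_bytes = len(text.encode("utf-8"))    # Devanagari letters take 3 bytes each
    n_tokens = len(tokens)
    return {
        "characters": n_chars,
        "utf8_bytes": n_bytes,
        "tokens": n_tokens,
        "chars_per_token": n_chars / n_tokens,
        "bytes_per_token": n_bytes / n_tokens,
    }


def meets_target(stats, key="chars_per_token", target=3.2):
    """Check one of the ratios against the 3.2x target."""
    return stats[key] > target


stats = compression_stats("राम राम", ["राम", " ", "राम"])
print(stats)
print(meets_target(stats))
```

The two ratios differ by roughly a factor of three on Devanagari text, so a stated target such as 3.2x is only meaningful once the unit of input size is fixed.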

Technical Requirements

- Python 3.10+
- PyTorch
- Gradio 4.19.1+

License

This project is licensed under the MIT License.