pradeep6kumar2024 / awadhi_bpe / README.md


	---
	title: Awadhi BPE Tokenizer
	colorFrom: blue
	colorTo: red
	sdk: gradio
	sdk_version: "4.19.1"
	app_file: app.py
	pinned: false
	license: mit
	python_version: "3.10"
	app_port: 7860
	tags:
	- awadhi
	- tokenizer
	- bpe
	- text-compression
	datasets:
	- sunderkand_awdhi
	---

	# Awadhi BPE Tokenizer

	This Space provides a Byte Pair Encoding (BPE) tokenizer for compressing Awadhi text. It features:

	- Custom BPE implementation for Awadhi text
	- Vocabulary of fewer than 5,000 tokens (capped at 4,500)
	- Compression ratio above 3.2x
	- Interactive web interface

	## Usage

	1. Enter Awadhi text in the input box
	2. Click "Tokenize"
	3. View tokenization results and statistics

	## Implementation Details

	- Uses character-level tokenization as the base
	- Implements the BPE merging strategy on top of that base
	- Handles UTF-8-encoded Awadhi text
	- Provides compression statistics
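
The points above can be illustrated with a minimal sketch of character-level BPE. This is not the code in `app.py`; the function names `train_bpe`, `merge_pair`, and `encode` are hypothetical, and the stopping rule (stop when no pair occurs at least twice or the vocabulary limit is reached) is an assumption.

```python
from collections import Counter


def merge_pair(tokens, left, right):
    """Replace every adjacent (left, right) pair with the joined token."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
            merged.append(left + right)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def train_bpe(text, max_vocab_size):
    """Learn an ordered list of merges, starting from single characters."""
    tokens = list(text)          # character-level base (Unicode code points)
    vocab = set(tokens)
    merges = []
    while len(vocab) < max_vocab_size:
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:            # assumption: stop once no pair repeats
            break
        merges.append((left, right))
        vocab.add(left + right)
        tokens = merge_pair(tokens, left, right)
    return merges, vocab


def encode(text, merges):
    """Tokenize text by replaying the learned merges in training order."""
    tokens = list(text)
    for left, right in merges:
        tokens = merge_pair(tokens, left, right)
    return tokens


if __name__ == "__main__":
    sample = "राम राम राम"
    merges, vocab = train_bpe(sample, max_vocab_size=4500)
    print(encode("राम राम", merges))
```

Because Python strings are sequences of Unicode code points, `list(text)` splits Devanagari text into characters rather than UTF-8 bytes, which matches the character-level base described above.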

	## Model Details

	- Base tokenization: Character-level
	- Maximum vocabulary size: 4500 tokens
	- Training corpus: Sunderkand in Awadhi
	- Compression target: > 3.2x
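
The README does not say how the compression ratio is measured. The sketch below assumes it is the input length divided by the number of tokens, and reports both a character baseline and a UTF-8 byte baseline, since the two differ considerably for Devanagari text. The function name `compression_stats` and the dictionary keys are hypothetical; the 3.2 default comes from the target listed above.

```python
def compression_stats(text, tokens, target=3.2):
    """Summarize how much shorter the token sequence is than the input."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    n_chars = len(text)                    # Unicode code points
    n_bytes = len(text.encode("utf-8"))    # size of the UTF-8 encoding
    n_tokens = len(tokens)
    chars_per_token = n_chars / n_tokens
    bytes_per_token = n_bytes / n_tokens
    return {
        "characters": n_chars,
        "utf8_bytes": n_bytes,
        "tokens": n_tokens,
        "chars_per_token": chars_per_token,
        "bytes_per_token": bytes_per_token,
        # Assumption: the target is compared against the character-based ratio.
        "meets_target": chars_per_token > target,
    }


if __name__ == "__main__":
    stats = compression_stats("राम राम", ["राम", " ", "राम"])
    for name, value in stats.items():
        print(name, value)
```

A short input like the one in the demo will not reach the target; ratios above 3.2 would only be expected on longer text where many learned merges apply.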

	## Technical Requirements

	- Python 3.10+
	- PyTorch
	- Gradio 4.19.1+

	## License

	This project is licensed under the MIT License.