---
title: Chunking
emoji: ๐
colorFrom: pink
colorTo: pink
sdk: gradio
sdk_version: 4.0.2
app_file: app.py
pinned: false
license: apache-2.0
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# LLMlocal's Text Tokenization Tool

## Introduction

LLMlocal's Text Tokenization Tool is a Gradio application that splits large bodies of text into chunks and reports the character and token count of each chunk. Gradio is a Python library for building web interfaces for machine learning models.

## Features

- Tokenization Method Selection: Choose the text tokenization method from a dropdown. The current version supports the RecursiveCharacterTextSplitter method.
- Customizable Parameters: Set the chunk size, the chunk overlap, and the number of chunks to display.
- Interactive Interface: Dropdowns, textboxes, and number inputs cover every setting, so no code is needed to use the tool.

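To make the chunk size and chunk overlap parameters concrete, here is a minimal sketch of fixed-size character chunking with overlap. It is an illustration only, not the tool's code and not LangChain's RecursiveCharacterTextSplitter, which also tries to break on separators such as paragraphs, lines, and spaces before falling back to characters. The function name `chunk_text` is hypothetical.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into character windows of chunk_size that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    # Each new window starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghij"
chunks = chunk_text(sample, chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

With a chunk size of 4 and an overlap of 1, each chunk repeats the last character of the previous one. A larger overlap keeps more shared context between neighboring chunks at the cost of producing more chunks.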
## Installation

Before using the tool, make sure Python is installed, then install the required packages:

```bash
pip install gradio pandas langchain
```

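One possible way to confirm that the packages are importable before launching the tool is the small check below. It is a sketch that uses only the standard library; the helper name `missing_packages` is hypothetical and not part of the tool.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that cannot be found as importable top-level modules."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# The package list mirrors the pip command above.
required = ["gradio", "pandas", "langchain"]
missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```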
## Usage

To use the tool, follow these steps:

1. Launch the Tool: Run the provided Python script (`app.py`). This will launch the Gradio interface in your default web browser.
2. Select a Tokenization Method: Choose the desired method from the dropdown menu.
3. Enter Text: Type or paste the text you want to tokenize in the textbox.
4. Set Parameters: Adjust the chunk size, chunk overlap, and the number of chunks to display.
5. View Results: The chunks are displayed in a table with the chunk number, text chunk, character count, and token count.

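The results table described in the steps above can be sketched as a list of rows, one per displayed chunk. This is an assumption-laden illustration: the column names and the helper `summarize_chunks` are hypothetical, and because this README does not say which tokenizer the tool uses, whitespace-separated words stand in for tokens here.

```python
def summarize_chunks(chunks, max_chunks):
    """Build one table row per chunk, limited to the first max_chunks chunks."""
    rows = []
    for number, chunk in enumerate(chunks[:max_chunks], start=1):
        rows.append({
            "chunk_number": number,
            "text_chunk": chunk,
            "character_count": len(chunk),
            # Assumption: whitespace-separated words approximate tokens;
            # the real tool may count tokens with a model tokenizer instead.
            "token_count": len(chunk.split()),
        })
    return rows


rows = summarize_chunks(["hello world", "foo bar baz", "last"], max_chunks=2)
for row in rows:
    print(row)
```

A list of dictionaries like this can be passed directly to `pandas.DataFrame(rows)` to get a tabular view.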
## Contributing

Feedback and contributions are always welcome. Feel free to raise issues or submit pull requests on the repository.

## License

This tool is open source and available under the Apache-2.0 license.

## Acknowledgments

Thanks to the developers of Gradio, Pandas, and LangChain for providing the libraries that made this tool possible.