alibayram committed on
Commit
0e68577
·
1 Parent(s): f8c9370

Refactor tokenize_and_display function to remove theme parameter and update README for Gradio integration, including usage instructions and feature highlights.
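The core of the refactor described above is dropping the `theme` parameter from `tokenize_and_display`, which shrinks the function's return tuple from five values to four. A minimal sketch of the new shape, using a placeholder whitespace tokenizer purely for illustration (the real app calls `tokenizer.tokenize_text`):

```python
# Illustrative stub of the refactored signature, not the real app.py.
# After this commit the function takes only `text` and returns the four
# display outputs: highlighted tokens, encoded IDs, decoded text, stats HTML.

def tokenize_and_display(text):
    """Return (highlighted_tokens, encoded_ids, decoded_text, stats_html)."""
    if not text:
        # Empty input must still match the number of Gradio outputs.
        return [], "", "", ""
    # Placeholder tokenization; the real app uses tokenizer.tokenize_text(text).
    tokens = text.split()
    highlighted = [(tok, "BPE") for tok in tokens]
    encoded_ids = str(list(range(len(tokens))))
    decoded = " ".join(tokens)
    stats = f"<div>{len(text)} chars, {len(tokens)} tokens</div>"
    return highlighted, encoded_ids, decoded, stats
```

Because the `theme` value no longer round-trips through the outputs, the Gradio wiring below also drops `theme_state` from both `inputs` and `outputs`.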

Files changed (3)
  1. README.md +45 -83
  2. app.py +12 -13
  3. requirements.txt +1 -1
README.md CHANGED
@@ -1,113 +1,75 @@
  ---
- title: Turkish Tiktokenizer
- emoji: 👍
- colorFrom: red
- colorTo: red
- sdk: streamlit
- sdk_version: 1.41.1
  app_file: app.py
  pinned: false
  license: cc-by-nc-nd-4.0
  short_description: Turkish Morphological Tokenizer
  ---
 
- # Turkish Tiktokenizer Web App
-
- A Streamlit-based web interface for the Turkish Morphological Tokenizer. This app provides an interactive way to tokenize Turkish text with real-time visualization and color-coded token display.
 
  ## Features
 
- - 🔤 Turkish text tokenization with morphological analysis
- - 🎨 Color-coded token visualization
- - 🔢 Token count and ID display
- - 📊 Special token highlighting (uppercase, space, newline, etc.)
- - 🔄 Version selection from GitHub commit history
- - 🌐 Direct integration with GitHub repository
 
- ## Demo
 
- You can try the live demo at [Hugging Face Spaces](https://huggingface.co/spaces/YOUR_USERNAME/turkish-tiktokenizer) (Replace with your actual Spaces URL)
 
- ## Installation
 
- 1. Clone the repository:
- ```bash
- git clone https://github.com/malibayram/tokenizer.git
- cd tokenizer/streamlit_app
- ```
-
- 2. Install dependencies:
- ```bash
- pip install -r requirements.txt
- ```
 
- ## Usage
 
- 1. Run the Streamlit app:
- ```bash
- streamlit run app.py
- ```
 
- 2. Open your browser and navigate to http://localhost:8501
 
- 3. Enter Turkish text in the input area and click "Tokenize"
 
- ## How It Works
 
- 1. **Text Input**: Enter Turkish text in the left panel
- 2. **Tokenization**: Click the "Tokenize" button to process the text
- 3. **Visualization**:
-    - Token count is displayed at the top
-    - Tokens are shown with color-coding:
-      - Special tokens (uppercase, space, etc.) have predefined colors
-      - Regular tokens get unique colors for easy identification
-    - Token IDs are displayed below the visualization
 
- ## Code Structure
 
- - `app.py`: Main Streamlit application
-   - UI components and layout
-   - GitHub integration
-   - Tokenization logic
-   - Color generation and visualization
  - `requirements.txt`: Python dependencies
 
- ## Technical Details
-
- - **Tokenizer Source**: Fetched directly from GitHub repository
- - **Caching**: Uses Streamlit's caching for better performance
- - **Color Generation**: HSV-based algorithm for visually distinct colors
- - **Session State**: Maintains text and results between interactions
- - **Error Handling**: Graceful handling of GitHub API and tokenization errors
-
- ## Deployment to Hugging Face Spaces
-
- 1. Create a new Space:
-    - Go to https://huggingface.co/spaces
-    - Click "Create new Space"
-    - Select "Streamlit" as the SDK
-    - Choose a name for your Space
-
- 2. Upload files:
-    - `app.py`
-    - `requirements.txt`
 
- 3. The app will automatically deploy and be available at your Space's URL
 
- ## Contributing
 
- 1. Fork the repository
- 2. Create your feature branch
- 3. Commit your changes
- 4. Push to the branch
- 5. Create a Pull Request
 
  ## License
 
- MIT License - see the [LICENSE](../LICENSE) file for details
-
- ## Acknowledgments
-
- - Built by dqbd
- - Created with the generous help from Diagram
- - Based on the [Turkish Morphological Tokenizer](https://github.com/malibayram/tokenizer)
 
  ---
+ title: Turkish Tokenizer
+ colorFrom: blue
+ colorTo: blue
+ sdk: gradio
+ sdk_version: 4.0.0
  app_file: app.py
  pinned: false
  license: cc-by-nc-nd-4.0
  short_description: Turkish Morphological Tokenizer
  ---
 
+ A sophisticated Turkish text tokenizer with morphological analysis, built with Gradio for easy visualization and interaction.
 
  ## Features
 
+ - **Morphological Analysis**: Breaks down Turkish words into roots, suffixes, and BPE tokens
+ - **Visual Tokenization**: Color-coded token display with interactive highlighting
+ - **Statistics Dashboard**: Detailed analytics including compression ratios and token distribution
+ - **Real-time Processing**: Instant tokenization with live statistics
+ - **Example Texts**: Pre-loaded Turkish examples for testing
 
+ ## How to Use
 
+ 1. Enter Turkish text in the input field
+ 2. Click "🚀 Tokenize" to process the text
+ 3. View the color-coded tokens in the visualization
+ 4. Check the statistics for detailed analysis
+ 5. See the encoded token IDs and decoded text
 
+ ## Token Types
 
+ - **🔴 Roots (ROOT)**: Base word forms
+ - **🔵 Suffixes (SUFFIX)**: Turkish grammatical suffixes
+ - **🟡 BPE**: Byte Pair Encoding tokens for subword units
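The three token categories above map onto a color table in `app.py` (the diff below shows `TokenType.BPE.name` mapped to `"#FFE66D"`). A minimal sketch of that mapping; the `ROOT` and `SUFFIX` hex values here are placeholders for illustration, since only the BPE color is visible in this diff:

```python
from enum import Enum

# Sketch of the token categories and their display colors.
# TokenType and the BPE color "#FFE66D" appear in app.py; the ROOT and
# SUFFIX colors below are assumed stand-ins, not the app's actual values.

class TokenType(Enum):
    ROOT = 1    # base word forms
    SUFFIX = 2  # Turkish grammatical suffixes
    BPE = 3     # subword units

dark_color_map = {
    TokenType.ROOT.name: "#E74C3C",    # red family for roots (assumed)
    TokenType.SUFFIX.name: "#3498DB",  # blue family for suffixes (assumed)
    TokenType.BPE.name: "#FFE66D",     # darker yellow, as in app.py
}

def color_for(token_type: TokenType) -> str:
    """Look up the display color for a token's category."""
    return dark_color_map[token_type.name]
```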
 
 
 
 
 
 
 
+ ## Examples
 
+ Try these example texts:
 
+ - "Merhaba Dünya! Bu bir gelişmiş Türkçe tokenizer testidir."
+ - "İstanbul'da yaşıyorum ve Türkçe dilini öğreniyorum."
+ - "KitapOkumak çok güzeldir ve bilgi verir."
 
+ ## Technical Details
 
+ This tokenizer uses:
 
+ - Custom morphological analysis for Turkish
+ - JSON-based vocabulary files
+ - Gradio for the web interface
+ - Advanced tokenization algorithms
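The JSON-based vocabulary files mentioned above can be pictured as simple token-to-ID mappings. A hedged sketch of loading and using one; the file layout, the `load_vocab`/`encode` helpers, and the whitespace lookup are illustrative assumptions, not the project's actual morphological algorithm:

```python
import json

# Minimal sketch of a JSON vocabulary of the kind the README describes.
# Structure and helper names are assumptions for illustration; the real
# vocabularies ship alongside app.py as *.json files.

def load_vocab(path):
    """Load a {token: id} mapping from a JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def encode(text, vocab, unk_id=0):
    """Map each whitespace-separated word to its ID; unknowns get unk_id."""
    return [vocab.get(word, unk_id) for word in text.lower().split()]
```

In the real tokenizer, words that miss the root/suffix vocabularies fall through to BPE subword units rather than a single unknown ID.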
 
 
 
 
+ ## Files
 
+ - `app.py`: Main Gradio application
+ - `tr_tokenizer.py`: Core tokenization logic
+ - `tr_decoder.py`: Text decoding functionality
+ - `*.json`: Vocabulary and token data files
  - `requirements.txt`: Python dependencies
 
+ ## Local Development
 
+ To run locally:
 
+ ```bash
+ pip install -r requirements.txt
+ python app.py
+ ```
 
+ The app will be available at `http://localhost:7860`
 
  ## License
 
+ This project is open source and available under the MIT License.
 
 
 
 
 
 
app.py CHANGED
@@ -20,13 +20,13 @@ dark_color_map = {
  TokenType.BPE.name: "#FFE66D",  # Darker Yellow
  }
 
- def tokenize_and_display(text, theme="light"):
      """
      Tokenizes the input text and prepares it for display in Gradio's HighlightedText component.
      """
      if not text:
          # Return a structure that matches all outputs to avoid errors
-         return [], "", "", "", theme
 
      tokens, _ = tokenizer.tokenize_text(text)
 
@@ -59,9 +59,9 @@ def tokenize_and_display(text, theme="light"):
  <div style="background:{bg_col};padding:20px;border-radius:12px;margin:20px 0;">
  <h4 style="color:{text_col};margin-bottom:15px;">📊 Tokenization Statistics</h4>
  <div style="display:grid;grid-template-columns:repeat(auto-fit,minmax(150px,1fr));gap:15px;margin-bottom:20px;">
- <div style="background:{card_col};padding:15px;border-radius:8px;text-align:center;border:1px solid {border_col};"><div style="font-size:24px;font-weight:bold;color:#3b82f6;">{total_chars}</div><div style="color:{'#64748b' if theme == 'light' else '#a0aec0'};font-size:14px;">Characters</div></div>
- <div style="background:{card_col};padding:15px;border-radius:8px;text-align:center;border:1px solid {border_col};"><div style="font-size:24px;font-weight:bold;color:#10b981;">{total_tokens}</div><div style="color:{'#64748b' if theme == 'light' else '#a0aec0'};font-size:14px;">Tokens</div></div>
- <div style="background:{card_col};padding:15px;border-radius:8px;text-align:center;border:1px solid {border_col};"><div style="font-size:24px;font-weight:bold;color:#f59e0b;">{compression_ratio:.1f}%</div><div style="color:{'#64748b' if theme == 'light' else '#a0aec0'};font-size:14px;">Compression</div></div>
  </div>
  <div>
  <h5 style="color:{text_col};margin-bottom:10px;">Token Type Distribution:</h5>
@@ -72,7 +72,7 @@ def tokenize_and_display(text, theme="light"):
  </div>
  </div>
  </div>"""
-     return highlighted_tokens, str(encoded_ids), decoded_text, stats_html, theme
 
  # Custom CSS for better styling
  custom_css = """
@@ -145,8 +145,8 @@ with gr.Blocks(theme=gr.themes.Soft(), title="Turkish Tokenizer", css=custom_css
  gr.Markdown("--- \n **Turkish Tokenizer Pro** - Advanced tokenization for Turkish text.")
 
  # --- Event Handlers ---
- def process_with_theme(text, theme):
-     return tokenize_and_display(text, theme)
 
  def clear_all():
      return "", [], "", "", ""
@@ -154,8 +154,8 @@ with gr.Blocks(theme=gr.themes.Soft(), title="Turkish Tokenizer", css=custom_css
  # Connect the buttons to the functions
  process_button.click(
      fn=process_with_theme,
-     inputs=[input_text, theme_state],
-     outputs=[highlighted_output, encoded_output, decoded_output, stats_output, theme_state]
  )
 
  clear_button.click(
@@ -165,9 +165,8 @@ with gr.Blocks(theme=gr.themes.Soft(), title="Turkish Tokenizer", css=custom_css
 
  # Auto-process on load with a default example
  demo.load(
-     fn=lambda theme: tokenize_and_display("Merhaba Dünya!", theme),
-     inputs=[theme_state],
-     outputs=[highlighted_output, encoded_output, decoded_output, stats_output, theme_state]
  )
 
  if __name__ == "__main__":
 
  TokenType.BPE.name: "#FFE66D",  # Darker Yellow
  }
 
+ def tokenize_and_display(text):
      """
      Tokenizes the input text and prepares it for display in Gradio's HighlightedText component.
      """
      if not text:
          # Return a structure that matches all outputs to avoid errors
+         return [], "", "", ""
 
      tokens, _ = tokenizer.tokenize_text(text)
 
  <div style="background:{bg_col};padding:20px;border-radius:12px;margin:20px 0;">
  <h4 style="color:{text_col};margin-bottom:15px;">📊 Tokenization Statistics</h4>
  <div style="display:grid;grid-template-columns:repeat(auto-fit,minmax(150px,1fr));gap:15px;margin-bottom:20px;">
+ <div style="background:{card_col};padding:15px;border-radius:8px;text-align:center;border:1px solid {border_col};"><div style="font-size:24px;font-weight:bold;color:#3b82f6;">{total_chars}</div><div style="color:{'#a0aec0'};font-size:14px;">Characters</div></div>
+ <div style="background:{card_col};padding:15px;border-radius:8px;text-align:center;border:1px solid {border_col};"><div style="font-size:24px;font-weight:bold;color:#10b981;">{total_tokens}</div><div style="color:{'#a0aec0'};font-size:14px;">Tokens</div></div>
+ <div style="background:{card_col};padding:15px;border-radius:8px;text-align:center;border:1px solid {border_col};"><div style="font-size:24px;font-weight:bold;color:#f59e0b;">{compression_ratio:.1f}%</div><div style="color:{'#a0aec0'};font-size:14px;">Compression</div></div>
  </div>
  <div>
  <h5 style="color:{text_col};margin-bottom:10px;">Token Type Distribution:</h5>
 
  </div>
  </div>
  </div>"""
+     return highlighted_tokens, str(encoded_ids), decoded_text, stats_html
 
  # Custom CSS for better styling
  custom_css = """
 
  gr.Markdown("--- \n **Turkish Tokenizer Pro** - Advanced tokenization for Turkish text.")
 
  # --- Event Handlers ---
+ def process_with_theme(text):
+     return tokenize_and_display(text)
 
  def clear_all():
      return "", [], "", "", ""
 
  # Connect the buttons to the functions
  process_button.click(
      fn=process_with_theme,
+     inputs=[input_text],
+     outputs=[highlighted_output, encoded_output, decoded_output, stats_output]
  )
 
  clear_button.click(
 
  # Auto-process on load with a default example
  demo.load(
+     fn=lambda: tokenize_and_display("Merhaba Dünya!"),
+     outputs=[highlighted_output, encoded_output, decoded_output, stats_output]
  )
 
  if __name__ == "__main__":
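The stats cards in the hunks above interpolate `total_chars`, `total_tokens`, and `compression_ratio` (rendered with `:.1f`). The diff does not show how `compression_ratio` is computed in `app.py`, so the formula below, `(1 - tokens/chars) * 100`, is an assumption chosen only to illustrate the char-vs-token comparison the dashboard describes:

```python
# Sketch of the three values the stats_html block formats.
# The compression formula is an assumption for illustration; app.py's
# actual calculation is outside this diff.

def tokenization_stats(text, tokens):
    """Return (total_chars, total_tokens, compression_ratio_percent)."""
    total_chars = len(text)
    total_tokens = len(tokens)
    # Fewer tokens than characters => positive "compression" percentage.
    compression_ratio = (1 - total_tokens / total_chars) * 100 if total_chars else 0.0
    return total_chars, total_tokens, compression_ratio
```

For the default example `"Merhaba Dünya!"` split into three tokens, this yields 14 characters, 3 tokens, and a ratio that `f"{compression_ratio:.1f}%"` renders just like the stats card does.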
requirements.txt CHANGED
@@ -1 +1 @@
- gradio
+ gradio