alibayram committed on
Commit
0e68577
·
1 Parent(s): f8c9370

Refactor tokenize_and_display function to remove theme parameter and update README for Gradio integration, including usage instructions and feature highlights.
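The core of the refactor described above is dropping the `theme` parameter from `tokenize_and_display`, which shrinks the function's return tuple from five values to four. A minimal sketch of the new shape, using a placeholder whitespace tokenizer purely for illustration (the real app calls `tokenizer.tokenize_text`):

```python
# Illustrative stub of the refactored signature, not the real app.py.
# After this commit the function takes only `text` and returns the four
# display outputs: highlighted tokens, encoded IDs, decoded text, stats HTML.

def tokenize_and_display(text):
    """Return (highlighted_tokens, encoded_ids, decoded_text, stats_html)."""
    if not text:
        # Empty input must still match the number of Gradio outputs.
        return [], "", "", ""
    # Placeholder tokenization; the real app uses tokenizer.tokenize_text(text).
    tokens = text.split()
    highlighted = [(tok, "BPE") for tok in tokens]
    encoded_ids = str(list(range(len(tokens))))
    decoded = " ".join(tokens)
    stats = f"<div>{len(text)} chars, {len(tokens)} tokens</div>"
    return highlighted, encoded_ids, decoded, stats
```

Because the `theme` value no longer round-trips through the outputs, the Gradio wiring below also drops `theme_state` from both `inputs` and `outputs`.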

Files changed (3)
  1. README.md +45 -83
  2. app.py +12 -13
  3. requirements.txt +1 -1
README.md CHANGED
@@ -1,113 +1,75 @@
  ---
- title: Turkish Tiktokenizer
- emoji: 👍
- colorFrom: red
- colorTo: red
- sdk: streamlit
- sdk_version: 1.41.1
  app_file: app.py
  pinned: false
  license: cc-by-nc-nd-4.0
  short_description: Turkish Morphological Tokenizer
  ---
 
- # Turkish Tiktokenizer Web App
-
- A Streamlit-based web interface for the Turkish Morphological Tokenizer. This app provides an interactive way to tokenize Turkish text with real-time visualization and color-coded token display.
 
  ## Features
 
- - 🔤 Turkish text tokenization with morphological analysis
- - 🎨 Color-coded token visualization
- - 🔢 Token count and ID display
- - 📊 Special token highlighting (uppercase, space, newline, etc.)
- - 🔄 Version selection from GitHub commit history
- - 🌐 Direct integration with GitHub repository
 
- ## Demo
 
- You can try the live demo at [Hugging Face Spaces](https://huggingface.co/spaces/YOUR_USERNAME/turkish-tiktokenizer) (Replace with your actual Spaces URL)
 
- ## Installation
 
- 1. Clone the repository:
- ```bash
- git clone https://github.com/malibayram/tokenizer.git
- cd tokenizer/streamlit_app
- ```
-
- 2. Install dependencies:
- ```bash
- pip install -r requirements.txt
- ```
 
- ## Usage
 
- 1. Run the Streamlit app:
- ```bash
- streamlit run app.py
- ```
 
- 2. Open your browser and navigate to http://localhost:8501
 
- 3. Enter Turkish text in the input area and click "Tokenize"
 
- ## How It Works
 
- 1. **Text Input**: Enter Turkish text in the left panel
- 2. **Tokenization**: Click the "Tokenize" button to process the text
- 3. **Visualization**:
-    - Token count is displayed at the top
-    - Tokens are shown with color-coding:
-      - Special tokens (uppercase, space, etc.) have predefined colors
-      - Regular tokens get unique colors for easy identification
-    - Token IDs are displayed below the visualization
 
- ## Code Structure
 
- - `app.py`: Main Streamlit application
-   - UI components and layout
-   - GitHub integration
-   - Tokenization logic
-   - Color generation and visualization
  - `requirements.txt`: Python dependencies
 
- ## Technical Details
-
- - **Tokenizer Source**: Fetched directly from GitHub repository
- - **Caching**: Uses Streamlit's caching for better performance
- - **Color Generation**: HSV-based algorithm for visually distinct colors
- - **Session State**: Maintains text and results between interactions
- - **Error Handling**: Graceful handling of GitHub API and tokenization errors
-
- ## Deployment to Hugging Face Spaces
-
- 1. Create a new Space:
-    - Go to https://huggingface.co/spaces
-    - Click "Create new Space"
-    - Select "Streamlit" as the SDK
-    - Choose a name for your Space
-
- 2. Upload files:
-    - `app.py`
-    - `requirements.txt`
 
- 3. The app will automatically deploy and be available at your Space's URL
 
- ## Contributing
 
- 1. Fork the repository
- 2. Create your feature branch
- 3. Commit your changes
- 4. Push to the branch
- 5. Create a Pull Request
 
  ## License
 
- MIT License - see the [LICENSE](../LICENSE) file for details
-
- ## Acknowledgments
-
- - Built by dqbd
- - Created with the generous help from Diagram
- - Based on the [Turkish Morphological Tokenizer](https://github.com/malibayram/tokenizer)
 
  ---
+ title: Turkish Tokenizer
+ colorFrom: blue
+ colorTo: blue
+ sdk: gradio
+ sdk_version: 4.0.0
  app_file: app.py
  pinned: false
  license: cc-by-nc-nd-4.0
  short_description: Turkish Morphological Tokenizer
  ---
 
+ A sophisticated Turkish text tokenizer with morphological analysis, built with Gradio for easy visualization and interaction.
 
  ## Features
 
+ - **Morphological Analysis**: Breaks down Turkish words into roots, suffixes, and BPE tokens
+ - **Visual Tokenization**: Color-coded token display with interactive highlighting
+ - **Statistics Dashboard**: Detailed analytics including compression ratios and token distribution
+ - **Real-time Processing**: Instant tokenization with live statistics
+ - **Example Texts**: Pre-loaded Turkish examples for testing
 
+ ## How to Use
 
+ 1. Enter Turkish text in the input field
+ 2. Click "🚀 Tokenize" to process the text
+ 3. View the color-coded tokens in the visualization
+ 4. Check the statistics for detailed analysis
+ 5. See the encoded token IDs and decoded text
 
+ ## Token Types
 
+ - **🔴 Roots (ROOT)**: Base word forms
+ - **🔵 Suffixes (SUFFIX)**: Turkish grammatical suffixes
+ - **🟡 BPE**: Byte Pair Encoding tokens for subword units
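The three token categories above map onto a color table in `app.py` (the diff below shows `TokenType.BPE.name` mapped to `"#FFE66D"`). A minimal sketch of that mapping; the `ROOT` and `SUFFIX` hex values here are placeholders for illustration, since only the BPE color is visible in this diff:

```python
from enum import Enum

# Sketch of the token categories and their display colors.
# TokenType and the BPE color "#FFE66D" appear in app.py; the ROOT and
# SUFFIX colors below are assumed stand-ins, not the app's actual values.

class TokenType(Enum):
    ROOT = 1    # base word forms
    SUFFIX = 2  # Turkish grammatical suffixes
    BPE = 3     # subword units

dark_color_map = {
    TokenType.ROOT.name: "#E74C3C",    # red family for roots (assumed)
    TokenType.SUFFIX.name: "#3498DB",  # blue family for suffixes (assumed)
    TokenType.BPE.name: "#FFE66D",     # darker yellow, as in app.py
}

def color_for(token_type: TokenType) -> str:
    """Look up the display color for a token's category."""
    return dark_color_map[token_type.name]
```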
 
 
 
 
 
 
 
+ ## Examples
 
+ Try these example texts:
 
+ - "Merhaba Dünya! Bu bir gelişmiş Türkçe tokenizer testidir."
+ - "İstanbul'da yaşıyorum ve Türkçe dilini öğreniyorum."
+ - "KitapOkumak çok güzeldir ve bilgi verir."
 
+ ## Technical Details
 
+ This tokenizer uses:
 
+ - Custom morphological analysis for Turkish
+ - JSON-based vocabulary files
+ - Gradio for the web interface
+ - Advanced tokenization algorithms
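The JSON-based vocabulary files mentioned above can be pictured as simple token-to-ID mappings. A hedged sketch of loading and using one; the file layout, the `load_vocab`/`encode` helpers, and the whitespace lookup are illustrative assumptions, not the project's actual morphological algorithm:

```python
import json

# Minimal sketch of a JSON vocabulary of the kind the README describes.
# Structure and helper names are assumptions for illustration; the real
# vocabularies ship alongside app.py as *.json files.

def load_vocab(path):
    """Load a {token: id} mapping from a JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def encode(text, vocab, unk_id=0):
    """Map each whitespace-separated word to its ID; unknowns get unk_id."""
    return [vocab.get(word, unk_id) for word in text.lower().split()]
```

In the real tokenizer, words that miss the root/suffix vocabularies fall through to BPE subword units rather than a single unknown ID.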
 
 
 
 
+ ## Files
 
+ - `app.py`: Main Gradio application
+ - `tr_tokenizer.py`: Core tokenization logic
+ - `tr_decoder.py`: Text decoding functionality
+ - `*.json`: Vocabulary and token data files
  - `requirements.txt`: Python dependencies
 
+ ## Local Development
 
+ To run locally:
 
+ ```bash
+ pip install -r requirements.txt
+ python app.py
+ ```
 
+ The app will be available at `http://localhost:7860`
 
  ## License
 
+ This project is open source and available under the MIT License.
 
 
 
 
 
 
app.py CHANGED
@@ -20,13 +20,13 @@ dark_color_map = {
  TokenType.BPE.name: "#FFE66D",  # Darker Yellow
  }
 
- def tokenize_and_display(text, theme="light"):
      """
      Tokenizes the input text and prepares it for display in Gradio's HighlightedText component.
      """
      if not text:
          # Return a structure that matches all outputs to avoid errors
-         return [], "", "", "", theme
 
      tokens, _ = tokenizer.tokenize_text(text)
 
@@ -59,9 +59,9 @@ def tokenize_and_display(text, theme="light"):
  <div style="background:{bg_col};padding:20px;border-radius:12px;margin:20px 0;">
  <h4 style="color:{text_col};margin-bottom:15px;">📊 Tokenization Statistics</h4>
  <div style="display:grid;grid-template-columns:repeat(auto-fit,minmax(150px,1fr));gap:15px;margin-bottom:20px;">
- <div style="background:{card_col};padding:15px;border-radius:8px;text-align:center;border:1px solid {border_col};"><div style="font-size:24px;font-weight:bold;color:#3b82f6;">{total_chars}</div><div style="color:{'#64748b' if theme == 'light' else '#a0aec0'};font-size:14px;">Characters</div></div>
- <div style="background:{card_col};padding:15px;border-radius:8px;text-align:center;border:1px solid {border_col};"><div style="font-size:24px;font-weight:bold;color:#10b981;">{total_tokens}</div><div style="color:{'#64748b' if theme == 'light' else '#a0aec0'};font-size:14px;">Tokens</div></div>
- <div style="background:{card_col};padding:15px;border-radius:8px;text-align:center;border:1px solid {border_col};"><div style="font-size:24px;font-weight:bold;color:#f59e0b;">{compression_ratio:.1f}%</div><div style="color:{'#64748b' if theme == 'light' else '#a0aec0'};font-size:14px;">Compression</div></div>
  </div>
  <div>
  <h5 style="color:{text_col};margin-bottom:10px;">Token Type Distribution:</h5>
@@ -72,7 +72,7 @@ def tokenize_and_display(text, theme="light"):
  </div>
  </div>
  </div>"""
-     return highlighted_tokens, str(encoded_ids), decoded_text, stats_html, theme
 
  # Custom CSS for better styling
  custom_css = """
@@ -145,8 +145,8 @@ with gr.Blocks(theme=gr.themes.Soft(), title="Turkish Tokenizer", css=custom_css
  gr.Markdown("--- \n **Turkish Tokenizer Pro** - Advanced tokenization for Turkish text.")
 
  # --- Event Handlers ---
- def process_with_theme(text, theme):
-     return tokenize_and_display(text, theme)
 
  def clear_all():
      return "", [], "", "", ""
@@ -154,8 +154,8 @@ with gr.Blocks(theme=gr.themes.Soft(), title="Turkish Tokenizer", css=custom_css
  # Connect the buttons to the functions
  process_button.click(
      fn=process_with_theme,
-     inputs=[input_text, theme_state],
-     outputs=[highlighted_output, encoded_output, decoded_output, stats_output, theme_state]
  )
 
  clear_button.click(
@@ -165,9 +165,8 @@ with gr.Blocks(theme=gr.themes.Soft(), title="Turkish Tokenizer", css=custom_css
 
  # Auto-process on load with a default example
  demo.load(
-     fn=lambda theme: tokenize_and_display("Merhaba Dünya!", theme),
-     inputs=[theme_state],
-     outputs=[highlighted_output, encoded_output, decoded_output, stats_output, theme_state]
  )
 
  if __name__ == "__main__":
 
  TokenType.BPE.name: "#FFE66D",  # Darker Yellow
  }
 
+ def tokenize_and_display(text):
      """
      Tokenizes the input text and prepares it for display in Gradio's HighlightedText component.
      """
      if not text:
          # Return a structure that matches all outputs to avoid errors
+         return [], "", "", ""
 
      tokens, _ = tokenizer.tokenize_text(text)
 
  <div style="background:{bg_col};padding:20px;border-radius:12px;margin:20px 0;">
  <h4 style="color:{text_col};margin-bottom:15px;">📊 Tokenization Statistics</h4>
  <div style="display:grid;grid-template-columns:repeat(auto-fit,minmax(150px,1fr));gap:15px;margin-bottom:20px;">
+ <div style="background:{card_col};padding:15px;border-radius:8px;text-align:center;border:1px solid {border_col};"><div style="font-size:24px;font-weight:bold;color:#3b82f6;">{total_chars}</div><div style="color:{'#a0aec0'};font-size:14px;">Characters</div></div>
+ <div style="background:{card_col};padding:15px;border-radius:8px;text-align:center;border:1px solid {border_col};"><div style="font-size:24px;font-weight:bold;color:#10b981;">{total_tokens}</div><div style="color:{'#a0aec0'};font-size:14px;">Tokens</div></div>
+ <div style="background:{card_col};padding:15px;border-radius:8px;text-align:center;border:1px solid {border_col};"><div style="font-size:24px;font-weight:bold;color:#f59e0b;">{compression_ratio:.1f}%</div><div style="color:{'#a0aec0'};font-size:14px;">Compression</div></div>
  </div>
  <div>
  <h5 style="color:{text_col};margin-bottom:10px;">Token Type Distribution:</h5>
 
  </div>
  </div>
  </div>"""
+     return highlighted_tokens, str(encoded_ids), decoded_text, stats_html
 
  # Custom CSS for better styling
  custom_css = """
 
  gr.Markdown("--- \n **Turkish Tokenizer Pro** - Advanced tokenization for Turkish text.")
 
  # --- Event Handlers ---
+ def process_with_theme(text):
+     return tokenize_and_display(text)
 
  def clear_all():
      return "", [], "", "", ""
 
  # Connect the buttons to the functions
  process_button.click(
      fn=process_with_theme,
+     inputs=[input_text],
+     outputs=[highlighted_output, encoded_output, decoded_output, stats_output]
  )
 
  clear_button.click(
 
  # Auto-process on load with a default example
  demo.load(
+     fn=lambda: tokenize_and_display("Merhaba Dünya!"),
+     outputs=[highlighted_output, encoded_output, decoded_output, stats_output]
  )
 
  if __name__ == "__main__":
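The stats cards in the hunks above interpolate `total_chars`, `total_tokens`, and `compression_ratio` (rendered with `:.1f`). The diff does not show how `compression_ratio` is computed in `app.py`, so the formula below, `(1 - tokens/chars) * 100`, is an assumption chosen only to illustrate the char-vs-token comparison the dashboard describes:

```python
# Sketch of the three values the stats_html block formats.
# The compression formula is an assumption for illustration; app.py's
# actual calculation is outside this diff.

def tokenization_stats(text, tokens):
    """Return (total_chars, total_tokens, compression_ratio_percent)."""
    total_chars = len(text)
    total_tokens = len(tokens)
    # Fewer tokens than characters => positive "compression" percentage.
    compression_ratio = (1 - total_tokens / total_chars) * 100 if total_chars else 0.0
    return total_chars, total_tokens, compression_ratio
```

For the default example `"Merhaba Dünya!"` split into three tokens, this yields 14 characters, 3 tokens, and a ratio that `f"{compression_ratio:.1f}%"` renders just like the stats card does.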
requirements.txt CHANGED
@@ -1 +1 @@
- gradio
+ gradio