Commit 8f88df6 (1 parent: d2eec13)

chore: Update leaderboard description and notes in app.py
app.py CHANGED

@@ -211,15 +211,17 @@ def refresh():
 
 leaderboard_description = """The `Total Number of Tokens` in this leaderboard is based on the total number of tokens got from the Arabic section of [rasaif-translations](https://huggingface.co/datasets/MohamedRashad/rasaif-translations) dataset (This dataset was chosen because it represents Arabic Fusha text in a small and concentrated manner).
 
-**A tokenizer that scores high in this leaderboard
+**A tokenizer that scores high in this leaderboard should be efficient in parsing Arabic in its different dialects and forms.**
 
-## Updates
+## Updates/Notes:
 1. New datasets is added for the evaluation (e.g. [arabic-quotes](https://huggingface.co/datasets/HeshamHaroon/arabic-quotes), [Moroccan_Arabic_Wikipedia_20230101_nobots](https://huggingface.co/datasets/SaiedAlshahrani/Moroccan_Arabic_Wikipedia_20230101_nobots)).
 1. `Fertility Score` is calculated by dividing the total number of tokens by the total number of words in the dataset (Lower is better).
 1. `Tokenize Tashkeel` is an indicator of whether the tokenizer maintains the tashkeel when tokenizing or not (`✅` for yes, `❌` for no).
 1. `Vocab Size` is the total number of tokens in the tokenizer's vocabulary (e.g. `10000` tokens).
 1. `Tokenizer Class` is the class of the tokenizer (e.g. `BertTokenizer` or `GPT2Tokenizer`)
 1. `Total Number of Tokens` is the total number of tokens in the dataset after tokenization (Lower is better).
+
+**Note**: Press `Refresh` to get the latest data available in the leaderboard (The initial state may be deceiving).
 """
 
 with gr.Blocks() as demo:
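The `Fertility Score` described in the notes (total tokens divided by total whitespace-separated words, lower is better) can be sketched in a few lines. This is a toy illustration, not the leaderboard's actual code: `toy_tokenize` is an invented stand-in for a real tokenizer's `tokenize` method, and the sample sentences are arbitrary.

```python
def fertility_score(tokenize, texts):
    """Total tokens produced over all texts, divided by total whitespace words."""
    total_tokens = sum(len(tokenize(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    return total_tokens / total_words

def toy_tokenize(text):
    # Hypothetical tokenizer: words longer than 4 characters are split into
    # two "subwords"; shorter words stay whole. A real tokenizer (e.g. a
    # Hugging Face tokenizer's .tokenize method) would be passed in instead.
    tokens = []
    for word in text.split():
        if len(word) > 4:
            mid = len(word) // 2
            tokens.extend([word[:mid], word[mid:]])
        else:
            tokens.append(word)
    return tokens

texts = ["هذا مثال", "الترجمة العربية الفصيحة"]
print(fertility_score(toy_tokenize, texts))  # 8 tokens / 5 words -> 1.6
```

A tokenizer whose vocabulary covers Arabic well splits fewer words into subwords, so its score approaches 1.0.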
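The `Tokenize Tashkeel` indicator can be approximated as a round-trip test: tokenize and detokenize the text, then compare the diacritics before and after. A minimal sketch, assuming a caller-supplied `roundtrip` callable (for a Hugging Face tokenizer this might be `lambda t: tokenizer.decode(tokenizer.encode(t, add_special_tokens=False))`); the code points U+064B–U+0652 cover the common tashkeel marks (fathatan through sukun).

```python
# Common Arabic tashkeel (harakat) code points: U+064B..U+0652.
TASHKEEL = set("\u064b\u064c\u064d\u064e\u064f\u0650\u0651\u0652")

def keeps_tashkeel(roundtrip, text):
    """True if tokenizing then detokenizing preserves every diacritic mark."""
    original = [c for c in text if c in TASHKEEL]
    restored = [c for c in roundtrip(text) if c in TASHKEEL]
    return original == restored

# An identity round-trip keeps the marks; one that strips them does not.
strip_marks = lambda t: "".join(c for c in t if c not in TASHKEEL)
print(keeps_tashkeel(lambda t: t, "كَتَبَ"))   # True  -> ✅
print(keeps_tashkeel(strip_marks, "كَتَبَ"))  # False -> ❌
```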