aloobun/IN-Llama-3-Tokenizer


aloobun committed on Dec 10, 2024

Commit 5c664a3 · verified · 1 Parent(s): c43b202

Update README.md

Files changed (1)

README.md +13 -1


@@ -75,4 +75,16 @@ Script ensures:
 - New tokens are correctly integrated.
 - Token mappings, etc
-I feel there is some unnecessary bloat in the script, like token validation and redundant test methods. I'm still working on how to improve things and will update as soon as I have any progress.
+I feel there is some unnecessary bloat in the script, like token validation and redundant test methods. I'm still working on how to improve things and will update as soon as I have any progress.
+Here's a comparison of subword **fertility** scores (lower means fewer tokens per word) between [sarvam-1](https://huggingface.co/sarvamai/sarvam-1) and this tokenizer.
+|Language|sarvam-1|IN-Llama-3-Tokenizer|
+|--------|--------|--------------------|
+|Bengali|1.7|3.52|
+|Gujarati|2.784313|3.588235|
+|Hindi|1.583333|2.933333|
+|Kannada|2.571428|3.976190|
+|Malayalam|3.487804|4.365853|
+|Tamil|2.767441|3.860465|
+|Telugu|2.372093|3.511627|
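
The README does not say how the fertility scores were computed. A common definition is the average number of tokens a tokenizer produces per whitespace-separated word, and the sketch below assumes that definition. The names `fertility` and `toy_tokenize` are hypothetical and introduced only for illustration; `toy_tokenize` is a stand-in so the example runs without downloading any model.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-separated word (assumed definition)."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_words += len(text.split())
        total_tokens += len(tokenize(text))
    if total_words == 0:
        raise ValueError("no words in input")
    return total_tokens / total_words


def toy_tokenize(text: str, chunk: int = 3) -> List[str]:
    """Hypothetical stand-in tokenizer: cut each word into pieces of at most `chunk` characters."""
    pieces: List[str] = []
    for word in text.split():
        pieces.extend(word[i:i + chunk] for i in range(0, len(word), chunk))
    return pieces


sample = ["tokenizers split words", "into pieces"]
score = fertility(sample, toy_tokenize)
print(f"fertility: {score:.2f}")
```

To score a real tokenizer, one option is to load it with `transformers.AutoTokenizer.from_pretrained(...)` and pass its `tokenize` method as the `tokenize` argument. The result depends on the evaluation text and on how words are counted, so numbers from this sketch may not match the table above.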