Update README.md
README.md
CHANGED
|
@@ -75,4 +75,16 @@ Script ensures:
|
|
| 75 |
- New tokens are correctly integrated.
|
| 76 |
- Token mappings, etc
|
| 77 |
|
| 78 |
-
I feel there is some unnecessary bloat in the script, such as token validation and redundant test methods. I'm still working on improvements and will update this as soon as I make progress.
|
| 75 |
- New tokens are correctly integrated.
|
| 76 |
- Token mappings, etc
|
| 77 |
|
| 78 |
+
I feel there is some unnecessary bloat in the script, such as token validation and redundant test methods. I'm still working on improvements and will update this as soon as I make progress.
|
| 79 |
+
|
| 80 |
+
Here's a comparison of subword **fertility** scores between [sarvam-1](https://huggingface.co/sarvamai/sarvam-1) and this model.
|
| 81 |
+
|
| 82 |
+
|Language|sarvam-1|IN-Llama-3-Tokenizer|
|
| 83 |
+
|--------|------|---------|
|
| 84 |
+
|Bengali|1.7 |3.52 |
|
| 85 |
+
|Gujarati|2.784313 |3.588235 |
|
| 86 |
+
|Hindi|1.583333 |2.933333 |
|
| 87 |
+
|Kannada|2.571428 |3.976190 |
|
| 88 |
+
|Malayalam|3.487804 |4.365853 |
|
| 89 |
+
|Tamil|2.767441 |3.860465 |
|
| 90 |
+
|Telugu|2.372093 |3.511627 |
|
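Fertility is commonly defined as the average number of subword tokens a tokenizer produces per word, so lower values mean a more compact encoding. The README does not say how the scores in the table were computed, so the following is only a minimal sketch under that common definition. The `fertility` helper and the `toy_tokenize` stand-in are hypothetical names introduced for illustration, and words are assumed to be whitespace-separated.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of subword tokens per whitespace-separated word.

    Assumption: each word is tokenized on its own. Tokenizing whole
    sentences instead can give slightly different counts for tokenizers
    that treat leading spaces specially.
    """
    total_words = 0
    total_tokens = 0
    for text in texts:
        for word in text.split():
            total_words += 1
            total_tokens += len(tokenize(word))
    if total_words == 0:
        raise ValueError("no words found in the input texts")
    return total_tokens / total_words


def toy_tokenize(word: str) -> List[str]:
    """Hypothetical stand-in tokenizer: splits a word into 3-character chunks."""
    return [word[i:i + 3] for i in range(0, len(word), 3)]


if __name__ == "__main__":
    # "abc" gives 1 token and "defgh" gives 2 tokens: 3 tokens over 2 words.
    print(fertility(["abc defgh"], toy_tokenize))
```

To score a real tokenizer, one option is to pass the `tokenize` method of a Hugging Face tokenizer loaded with `AutoTokenizer.from_pretrained` as the `tokenize` argument, and to use the same evaluation text for both models so the numbers are comparable.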