Update README.md
README.md
CHANGED
@@ -16,7 +16,7 @@ Meltemi is built on top of [Mistral 7B](https://huggingface.co/mistralai/Mistral
 
 # Model Information
 
-- Vocabulary extension of the Mistral 7B tokenizer with Greek tokens
+- Vocabulary extension of the Mistral 7B tokenizer with Greek tokens for lower costs and faster inference (**1.52** vs. 6.80 tokens/word for Greek)
 - 8192 context length
 - We extend the pretraining of Mistral 7B with added proficiency for the Greek language, by utilizing a large corpus consisting of approximately **55 billion tokens**.
 * This corpus includes 43.3 billion monolingual Greek tokens, constructed from publicly available resources. Additionally, to mitigate catastrophic forgetting and ensure that the model has bilingual capabilities, we use additional sub-corpora with monolingual English texts (10.5 billion tokens) and Greek-English parallel data (600 million tokens).
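The changed bullet quantifies the benefit of the Greek vocabulary extension as tokenizer fertility (subword tokens per word). A minimal sketch along the following lines could reproduce that kind of comparison; the model IDs and the Greek sample sentence are illustrative assumptions, not taken from this README, and the exact figures depend on the evaluation text used.

```python
# Sketch: estimate tokenizer fertility (tokens per word) on a Greek sentence
# for a base tokenizer vs. a Greek-extended one. Model IDs below are assumptions.
from transformers import AutoTokenizer

sample = "Η γρήγορη καφέ αλεπού πηδά πάνω από τον τεμπέλη σκύλο."

def tokens_per_word(model_id: str, text: str) -> float:
    # Fertility = number of subword tokens / number of whitespace-separated words.
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    n_tokens = len(tokenizer.encode(text, add_special_tokens=False))
    n_words = len(text.split())
    return n_tokens / n_words

for model_id in ["mistralai/Mistral-7B-v0.1", "ilsp/Meltemi-7B-v1"]:
    print(model_id, round(tokens_per_word(model_id, sample), 2))
```

A lower tokens-per-word ratio means fewer subword tokens per Greek word, which translates into shorter sequences and therefore lower inference cost for the same text.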