droussis committed on
Commit 9f48821 · verified · 1 Parent(s): 9fa6e51

Update README.md

Files changed (1):
  1. README.md +1 -1
README.md CHANGED
@@ -16,7 +16,7 @@ Meltemi is built on top of [Mistral 7B](https://huggingface.co/mistralai/Mistral

 # Model Information

- - Vocabulary extension of the Mistral 7B tokenizer with Greek tokens
+ - Vocabulary extension of the Mistral 7B tokenizer with Greek tokens for lower costs and faster inference (**1.52** vs. 6.80 tokens/word for Greek)
 - 8192 context length
 - We extend the pretraining of Mistral 7B with added proficiency for the Greek language, by utilizing a large corpus consisting of approximately **55 billion tokens**.
 * This corpus includes 43.3 billion monolingual Greek tokens, constructed from publicly available resources. Additionally, to mitigate catastrophic forgetting and ensure that the model has bilingual capabilities, we use additional sub-corpora with monolingual English texts (10.5 billion tokens) and Greek-English parallel data (600 million tokens).
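The tokens/word figure the changed line cites is the tokenizer's "fertility": how many subword tokens are produced per whitespace-delimited word. A minimal sketch of how such a metric could be computed is below; the `word_tokenizer` and `char_tokenizer` functions and the sample sentence are hypothetical stand-ins for illustration only, not the actual Mistral 7B or Meltemi tokenizers (which one would normally load via `transformers.AutoTokenizer`).

```python
def fertility(tokenize, text):
    """Average number of tokens produced per whitespace-delimited word."""
    words = text.split()
    tokens = tokenize(text)
    return len(tokens) / len(words)

# Hypothetical stand-in tokenizers (illustration only):
def word_tokenizer(text):
    # One token per word -> fertility of exactly 1.0.
    return text.split()

def char_tokenizer(text):
    # One token per non-space character -> high fertility, the worst case
    # for a tokenizer whose vocabulary lacks the target language.
    return [c for c in text if not c.isspace()]

sample = "Η γρήγορη καφέ αλεπού"  # arbitrary Greek sample text
print(fertility(word_tokenizer, sample))  # 1.0
print(fertility(char_tokenizer, sample))  # 4.5
```

Lower fertility means fewer tokens per sentence, hence lower inference cost and a longer effective context, which is the motivation stated in the edited bullet for extending the vocabulary with Greek tokens.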