XLM-RoBERTa model pre-trained on 2.5TB of filtered CommonCrawl data containing 100 languages.
This model is fine-tuned on The Latin Library corpus (~15M tokens).
The dataset was cleaned as follows (a sketch of these steps appears after the list):
- Removal of all "pseudo-Latin" text ("Lorem ipsum ...").
- Use of CLTK for sentence splitting and normalisation.
- Deduplication of the corpus.
- Lowercasing of all text.
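
A minimal sketch of the cleaning steps above, assuming plain-text documents as input. The original pipeline used CLTK for sentence splitting and normalisation; the naive regex splitter and the `clean_corpus` helper below are illustrative stand-ins, not the actual preprocessing code.

```python
import re

def clean_corpus(documents):
    """Illustrative cleaning: drop pseudo-Latin filler, split into sentences,
    lowercase, and deduplicate. CLTK-based splitting is not reproduced here."""
    seen = set()
    cleaned_sentences = []
    for text in documents:
        # Drop "pseudo-Latin" filler text such as "Lorem ipsum ..."
        if "lorem ipsum" in text.lower():
            continue
        # Naive sentence split on terminal punctuation (stand-in for CLTK)
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            # Normalise whitespace and lowercase
            sentence = " ".join(sentence.split()).lower()
            # Deduplicate sentences across the corpus
            if sentence and sentence not in seen:
                seen.add(sentence)
                cleaned_sentences.append(sentence)
    return cleaned_sentences

print(clean_corpus(["Gallia est omnis divisa in partes tres. Gallia est omnis divisa in partes tres."]))
```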
Base model: FacebookAI/xlm-roberta-base
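
A minimal usage sketch with the Hugging Face `transformers` fill-mask pipeline. Since the model is uncased and the training text was lowercased, inputs should be lowercased as well; the example sentence is illustrative.

```python
from transformers import pipeline

# Load the fine-tuned Latin model as a fill-mask pipeline
fill_mask = pipeline("fill-mask", model="Cicciokr/XLM-Roberta-Base-Latin-Uncased")

# XLM-RoBERTa uses <mask> as its mask token; input is lowercased to match training
for prediction in fill_mask("gallia est omnis divisa in partes <mask>."):
    print(prediction["token_str"], round(prediction["score"], 3))
```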