license: apache-2.0
This model was fine-tuned on:
- The Latin Library (15M tokens)
- Perseus Project (15M tokens)
The dataset was cleaned as follows:
- Removal of all "pseudo-Latin" text ("Lorem ipsum ...").
- Sentence splitting and normalisation with CLTK.
- Deduplication of the corpus.
- Lowercasing of all text.
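
The cleaning steps can be sketched roughly as below. This is a minimal illustration, not the script used to build the training data: the function names (`split_sentences`, `normalise`, `clean_corpus`) and the `LOREM_MARKERS` list are hypothetical, a naive regex splitter stands in for CLTK's Latin sentence tokenizer, and sentence-level deduplication and the order of the steps are assumptions.

```python
import re
import unicodedata

# Hypothetical markers for pseudo-Latin filler text (an assumption, not the original filter).
LOREM_MARKERS = ("lorem ipsum", "dolor sit amet")


def split_sentences(text):
    """Naive regex splitter; a stand-in for CLTK's Latin sentence tokenizer."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def normalise(sentence):
    """Unicode-normalise, collapse whitespace, and lowercase a sentence."""
    sentence = unicodedata.normalize("NFKC", sentence)
    sentence = re.sub(r"\s+", " ", sentence).strip()
    return sentence.lower()


def clean_corpus(documents):
    """Return cleaned, lowercased, deduplicated sentences in first-seen order."""
    seen = set()
    cleaned = []
    for doc in documents:
        for sentence in split_sentences(doc):
            sentence = normalise(sentence)
            # Drop pseudo-Latin filler.
            if any(marker in sentence for marker in LOREM_MARKERS):
                continue
            # Sentence-level deduplication (assumed granularity).
            if sentence in seen:
                continue
            seen.add(sentence)
            cleaned.append(sentence)
    return cleaned


if __name__ == "__main__":
    docs = [
        "Gallia est omnis divisa in partes tres. Lorem ipsum dolor sit amet.",
        "Gallia est omnis divisa in partes tres. Arma virumque cano.",
    ]
    for line in clean_corpus(docs):
        print(line)
```

Because the training text was lowercased, inputs to the model should probably be lowercased in the same way before tokenisation.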