Updates README.md
README.md CHANGED
@@ -1,3 +1,22 @@
----
-license: mit
----
+---
+license: mit
+language:
+- la
+- el
+- fr
+- en
+- de
+- it
+base_model:
+- FacebookAI/xlm-roberta-base
+---
+
+# Model Description
+
+<!-- Provide a quick summary of what the model is/does. -->
+
+This model checkpoint was created by further pre-training XLM-RoBERTa-base on a 1.4B-token corpus of classical texts written mainly in Ancient Greek, Latin, French, German, English, and Italian.
+The corpus notably contains data from [Brill-KIEM](https://github.com/kiem-group/pdfParser), various ancient sources from the Internet Archive, the [Corpus Thomisticum](https://www.corpusthomisticum.org/), [Open Greek and Latin](https://www.opengreekandlatin.org/), [JSTOR](https://about.jstor.org/whats-in-jstor/text-mining-support/), [Persée](https://www.persee.fr/), Propylaeum, [Remacle](https://remacle.org/), and Wikipedia.
+The model can be used as a checkpoint for further pre-training or as a base model for fine-tuning.
+The model was evaluated on classics-related named-entity recognition and part-of-speech tagging and surpassed XLM-RoBERTa-base on all tasks.
+It also performed significantly better than similar models retrained from scratch on the same corpus.
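
Since the new card presents the checkpoint as a base for fine-tuning, a minimal sketch of loading it for token classification (the NER and POS-tagging setting it was evaluated on) with Hugging Face Transformers follows. The model ID `your-org/classics-xlm-roberta-base` is a placeholder, as the commit does not name the repository, and the label count is illustrative.

```python
# Minimal sketch: load the further-pre-trained checkpoint as a base model
# for token classification (e.g. NER or POS tagging).
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Hypothetical repo ID; substitute the actual Hugging Face model name.
model_id = "your-org/classics-xlm-roberta-base"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# num_labels is illustrative (17 matches the Universal Dependencies UPOS set);
# set it to the size of your task's tag inventory.
model = AutoModelForTokenClassification.from_pretrained(model_id, num_labels=17)

# Tokenize a Latin sentence and run a forward pass.
inputs = tokenizer("Gallia est omnis divisa in partes tres.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # (1, sequence_length, num_labels)
```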