Update README.md
README.md
@@ -6,8 +6,7 @@ This model is fine tuned with:
 - Perseus Project - 15M Token
 
 The dataset was cleaned:
-
-
-
-
-lowercase all text
+- Removal of all "pseudo-Latin" text ("Lorem ipsum ...").
+- Use of CLTK for sentence splitting and normalisation.
+- deduplication of the corpus
+- lowercase all text
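
The cleaning steps listed in the updated README could be sketched roughly as follows. This is a minimal illustration, not the author's pipeline: the README does not show the actual code, so `clean_corpus`, `LOREM_RE`, and `SENTENCE_END_RE` are hypothetical names, and the regex sentence splitter is a naive stand-in for CLTK, whose API is deliberately not reproduced here.

```python
import re

# Hypothetical filter for pseudo-Latin filler; the real rule is not given in the README.
LOREM_RE = re.compile(r"lorem\s+ipsum", re.IGNORECASE)
# Naive stand-in for CLTK sentence splitting: break after '.', '!' or '?' plus whitespace.
SENTENCE_END_RE = re.compile(r"(?<=[.!?])\s+")


def clean_corpus(documents):
    """Return cleaned, lowercased, deduplicated sentences in first-seen order."""
    seen = set()
    cleaned = []
    for doc in documents:
        if LOREM_RE.search(doc):
            continue  # drop documents that contain "Lorem ipsum" filler
        for sentence in SENTENCE_END_RE.split(doc):
            # Collapse runs of whitespace, then lowercase the sentence.
            sentence = " ".join(sentence.split()).lower()
            if sentence and sentence not in seen:
                seen.add(sentence)
                cleaned.append(sentence)
    return cleaned
```

In this sketch lowercasing happens before deduplication, so sentences that differ only in case or spacing are treated as duplicates. Whether the original pipeline ordered the steps this way is an assumption.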