Update README.md
README.md
CHANGED
@@ -368,6 +368,8 @@ Feel free to click the expand button below to see the full list of sources.
 | The Swedish Culturomics Gigaword Corpus | sv | Rødven-Eide, 2016 |
 | Corpus of laws and legal acts of Ukraine | uk | [Link](https://lang.org.ua/en/corpora/#anchor7) |
 
+To consult the data summary document with the respective licences, please send an e-mail to [email protected].
+
 <details>
 <summary>References</summary>
 
@@ -561,7 +563,7 @@ especially if the content originates from less-regulated sources or user-generat
 
 This dataset is constituted by combining several sources, whose acquisition methods can be classified into three groups:
 - Web-sourced datasets with some preprocessing available under permissive license (p.e. Common Crawl).
-- Domain-specific or language-specific raw crawls (p.e. Spanish Crawling).
+- Domain-specific or language-specific raw crawls, always respecting robots.txt (p.e. Spanish Crawling).
 - Manually curated data obtained through collaborators, data providers (by means of legal assignment agreements) or open source projects
 (p.e. CATalog).
 
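
The revised bullet states that raw crawls always respect robots.txt. As a rough illustration of what such a check can look like, the sketch below uses Python's standard `urllib.robotparser`. It is not the crawler used for this dataset: the robots.txt body, the `example-crawler` user agent, the `example.org` URLs and the helper names are all hypothetical, and the rules are parsed from a string so the example runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; a real crawler would download it from the target site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt rules from a string instead of fetching them."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser: RobotFileParser, user_agent: str, urls: list[str]) -> list[str]:
    """Keep only the URLs that the parsed rules allow this user agent to fetch."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


parser = build_parser(ROBOTS_TXT)
candidates = [
    "https://example.org/news/article-1",   # not covered by any Disallow rule
    "https://example.org/private/report",   # under the disallowed /private/ prefix
]
fetchable = allowed_urls(parser, "example-crawler", candidates)
print(fetchable)
```

Filtering the candidate list before any request is made keeps the policy check separate from the download logic, which is one simple way to make the "always respecting robots.txt" claim auditable.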