Update README.md
README.md
CHANGED
@@ -371,6 +371,8 @@ Feel free to click the expand button below to see the full list of sources.
 | The Swedish Culturomics Gigaword Corpus | sv | Rødven-Eide, 2016 |
 | Corpus of laws and legal acts of Ukraine | uk | [Link](https://lang.org.ua/en/corpora/#anchor7) |
 
+To consult the data summary document with the respective licences, please send an e-mail to [email protected].
+
 <details>
 <summary>References</summary>
 
@@ -565,7 +567,7 @@ especially if the content originates from less-regulated sources or user-generat
 
 This dataset is constituted by combining several sources, whose acquisition methods can be classified into three groups:
 - Web-sourced datasets with some preprocessing available under permissive license (p.e. Common Crawl).
-- Domain-specific or language-specific raw crawls (p.e. Spanish Crawling).
+- Domain-specific or language-specific raw crawls, always respecting robots.txt (p.e. Spanish Crawling).
 - Manually curated data obtained through collaborators, data providers (by means of legal assignment agreements) or open source projects
 (p.e. CATalog).
 
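
The changed bullet says the raw crawls always respect robots.txt. As an illustration of what such a check can look like, here is a minimal sketch using Python's standard `urllib.robotparser`. It is not the project's crawler: the user agent `example-crawler`, the `example.org` URLs, the robots.txt rules, and the helper names `build_parser` and `allowed_urls` are all assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser: RobotFileParser, user_agent: str, urls: list[str]) -> list[str]:
    """Keep only the URLs that the parsed rules allow this user agent to fetch."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


# Hypothetical rules: every crawler is barred from /private/.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

parser = build_parser(EXAMPLE_ROBOTS)
candidates = [
    "https://example.org/news/1",
    "https://example.org/private/2",
]
kept = allowed_urls(parser, "example-crawler", candidates)
print(kept)
```

Parsing the rules from a string keeps the sketch runnable offline; a real crawler would more likely fetch each site's robots.txt first, for instance with `RobotFileParser.set_url` and `read`, and filter its URL queue before issuing requests.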