Text Generation · Transformers · Safetensors · llama · text-generation-inference
jsaizant committed (verified)
Commit 479f0b8 · Parent: 7fd03b1

Update README.md

Files changed (1):
  README.md (+45 -52)
README.md CHANGED
@@ -466,28 +466,26 @@ We provide an extensive Datasheet section following the best practices defined by
 
  **For what purpose was the dataset created? Was there a specific task in mind? Was there a specific gap that needed to be filled? Please provide a description.**
 
- The purpose of creating this dataset is to pre-train the Salamandra family of multilingual models with high performance in a large number of
- European languages (35) and code (including 92 different programming languages). In addition, we aim to represent especially the co-official
- languages of Spain: Spanish, Catalan, Galician, and Basque. This is the reason why we carry out an oversampling of these languages.
 
- We detected that there is a great lack of massive multilingual data, especially in minority languages (Ostendorff & Rehm, 2023), so part of
- our efforts in the creation of this pre-training dataset have resulted in the contribution to large projects such as the Community OSCAR
- (Brack et al., 2024), which includes 151 languages and 40T words, or CATalog (Palomar-Giner et al., 2024), the largest open dataset in
- Catalan in the world.
 
  **Who created the dataset (e.g., which team, research group) and on behalf of which entity (e.g., company, institution, organization)?**
 
- The dataset has been created by the Language Technologies unit (LangTech) of the Barcelona Supercomputing Center - Centro Nacional de
- Supercomputación (BSC-CNS), which aims to advance the field of natural language processing through cutting-edge research and development
- and the use of HPC. In particular, it was created by the unit's data team, the main contributors being Javier Saiz, Ferran Espuña, and
- Jorge Palomar.
 
- However, the creation of the dataset would not have been possible without the collaboration of a large number of collaborators, partners,
- and public institutions, which can be found in detail in the acknowledgements.
 
  **Who funded the creation of the dataset? If there is an associated grant, please provide the name of the grantor and the grant name and number.**
 
- This work has been promoted and financed by the Government of Catalonia through the [Aina Project](https://projecteaina.cat/).
 
  This work is funded by the _Ministerio para la Transformación Digital y de la Función Pública_ - Funded by EU – NextGenerationEU
  within the framework of [ILENIA Project](https://proyectoilenia.es/) with reference 2022/TL22/00215337.
@@ -528,14 +526,14 @@ sources were sampled in proportion to their occurrence.
 
  **What data does each instance consist of? “Raw” data (e.g., unprocessed text or images) or features? In either case, please provide a description.**
 
- Each instance consists of a text document processed for deduplication, language identification, and source-specific filtering. Some
- documents required optical character recognition (OCR) to extract text from non-text formats such as PDFs.
 
  **Is there a label or target associated with each instance? If so, please provide a description.**
 
- Each instance is labeled with a unique identifier, the primary language of the content, and the URL for web-sourced instances. Additional
- labels were automatically assigned to detect specific types of content harmful or toxic content and to assign preliminary indicators of
- undesired qualities —very short documents, high density of symbols, etc.— which were used for filtering instances.
 
  **Is any information missing from individual instances? If so, please provide a description, explaining why this information is missing (e.g., because it was unavailable). This does not include intentionally removed information, but might include, e.g., redacted text.**
 
@@ -547,12 +545,12 @@ Instances are related through shared metadata, such as source and language ident
 
  **Are there recommended data splits (e.g., training, development/validation, testing)? If so, please provide a description of these splits, explaining the rationale behind them.**
 
- The dataset is split randomly into training, validation, and test sets.
 
  **Are there any errors, sources of noise, or redundancies in the dataset? If so, please provide a description.**
 
- Despite removing duplicated instances within each source, redundancy remains at the paragraph and sentence levels, particularly in
- web-sourced instances where SEO techniques and templates contribute to repeated textual patterns. Some instances may also be duplicated
  across sources due to format variations.
 
  **Is the dataset self-contained, or does it link to or otherwise rely on external resources (e.g., websites, tweets, other datasets)? If it links to or relies on external resources, a) are there guarantees that they will exist, and remain constant, over time; b) are there official archival versions of the complete dataset (i.e., including the external resources as they existed at the time the dataset was created); c) are there any restrictions (e.g., licenses, fees) associated with any of the external resources that might apply to a dataset consumer? Please provide descriptions of all external resources and any restrictions associated with them, as well as links or other access points, as appropriate.**
@@ -576,10 +574,10 @@ The dataset does not explicitly identify any subpopulations.
 
  **Is it possible to identify individuals (i.e., one or more natural persons), either directly or indirectly (i.e., in combination with other data) from the dataset? If so, please describe how.**
 
- Web-sourced instances in the dataset may contain personally identifiable information (PII) that is publicly available on the Web, such as
- names, IP addresses, email addresses, and phone numbers. While it would be possible to indirectly identify individuals through the
- combination of multiple data points, the nature and scale of web data makes it difficult to parse such information. In any case, efforts are
- made to filter or anonymize sensitive data during pre-processing, but some identifiable information may remain in the dataset.
 
  **Does the dataset contain data that might be considered sensitive in any way? If so, please provide a description.**
 
@@ -592,29 +590,28 @@ especially if the content originates from less-regulated sources or user-generat
  **How was the data collected?**
 
  This dataset is constituted by combining several sources, whose acquisition methods can be classified into three groups:
- - Web-sourced datasets with some preprocessing available under permissive license (p.e. Common Crawl).
- - Domain-specific or language-specific raw crawls, always respecting robots.txt (p.e. Spanish Crawling).
- - Manually curated data obtained through collaborators, data providers (by means of legal assignment agreements) or open source projects
- (p.e. CATalog).
 
  **What mechanisms or procedures were used to collect the data? How were these mechanisms or procedures validated?**
 
- According to the three groups previously defined, these are the mechanisms used in each of them:
- - Open direct download. Validation: data integrity tests.
- - Ad-hoc scrapers or crawlers. Validation: software unit and data integrity tests.
- - Direct download via FTP, SFTP, API or S3. Validation: data integrity tests.
 
  **If the dataset is a sample from a larger set, what was the sampling strategy?**
 
- The sampling strategy was to use the whole dataset resulting from the filtering explained in the preprocessing/cleaning/labelling section,
- with the particularity that an upsampling of 2 (i.e. twice the probability of sampling a document) was performed for the co-official
- languages of Spain (Spanish, Catalan, Galician, Basque), and a downsampling of 1/2 was applied for code (half the probability of sampling a
- code document, evenly distributed among all programming languages).
 
  **Who was involved in the data collection process and how were they compensated?**
 
- This data is generally extracted, filtered and sampled by automated processes. The code required to run these processes has been developed
- entirely by members of the LangTech data team, or otherwise obtained from open-source software. Furthermore, there has been no monetary
  consideration for acquiring data from suppliers.
 
  **Over what timeframe was the data collected? Does this timeframe match the creation timeframe of the data associated with the instances? If not, please describe the timeframe in which the data associated with the instances was created.**
@@ -633,12 +630,9 @@ ethical and legal point of view, respectively.
 
  **Was any preprocessing/cleaning/labeling of the data done? If so, please provide a description. If not, you may skip the remaining questions in this section.**
 
- Instances of text documents were not altered, but web-sourced documents were filtered based on specific criteria along two dimensions:
- - Quality: documents with a score lower than 0.8, based on undesired qualities, such as documents with low number of lines, very short
- sentences, presence of long footers and headers, and high percentage of punctuation, obtained through CURATE (Palomar-Giner et al., 2024)
- were filtered out.
- - Harmful or adult content: documents originating from Colossal OSCAR were filtered using LLM-Datasets (Ostendorff et al., 2024) based on
- the perplexity from a language model (‘harmful_pp’ field) provided by the Ungoliant pipeline (Abadji et al., 2021).
 
  **Was the “raw” data saved in addition to the preprocessed/cleaned/labeled data? If so, please provide a link or other access point to the “raw” data.**
 
@@ -646,7 +640,7 @@ The original raw data was not kept.
 
  **Is the software that was used to preprocess/clean/label the data available? If so, please provide a link or other access point.**
 
- Yes, the preprocessing and filtering software is open-sourced. The [CURATE](https://github.com/langtech-bsc/CURATE) pipeline was used for Spanish Crawling and CATalog,
  and the [Ungoliant](https://github.com/oscar-project/ungoliant) pipeline was used for the OSCAR project.
 
  #### Uses
@@ -697,11 +691,10 @@ The dataset will not be updated.
 
  **If the dataset relates to people, are there applicable limits on the retention of the data associated with the instances? If so, please describe these limits and explain how they will be enforced.**
 
- The dataset does not keep sensitive data that could allow direct identification of individuals, apart from the data that is publicly
- available in web-sourced content. Due to the sheer volume and diversity of web data, it is not feasible to notify individuals or manage data
- retention on an individual basis. However, efforts are made to mitigate the risks associated with sensitive information through
- pre-processing and filtering to remove identifiable or harmful content. Despite these measures, vigilance is maintained to address potential
- privacy and ethical issues.
 
  **Will older versions of the dataset continue to be supported/hosted/maintained? If so, please describe how. If not, please describe how its obsolescence will be communicated to dataset consumers.**
 
  **For what purpose was the dataset created? Was there a specific task in mind? Was there a specific gap that needed to be filled? Please provide a description.**
 
+ The purpose of creating this dataset is to pre-train the Salamandra family of multilingual models with high performance in a large number of European languages (35)
+ and programming languages (92). We also want to represent the co-official languages of Spain: Spanish, Catalan, Galician and Basque. For this reason, we oversample
+ these languages by a factor of 2.
 
+ There is a great lack of massive multilingual data, especially in minority languages (Ostendorff & Rehm, 2023), so part of our efforts in the creation of
+ this pre-training dataset have resulted in the contribution to large projects such as the Community OSCAR (Brack et al., 2024), which includes 151 languages
+ and 40T words, or CATalog (Palomar-Giner et al., 2024), the largest open dataset in Catalan in the world.
 
  **Who created the dataset (e.g., which team, research group) and on behalf of which entity (e.g., company, institution, organization)?**
 
+ The dataset has been created by the Language Technologies unit (LangTech) of the Barcelona Supercomputing Center - Centro Nacional de Supercomputación (BSC-CNS),
+ which aims to advance the field of natural language processing through cutting-edge research and development and the use of HPC. In particular, it was created by
+ the unit's data team, the main contributors being José Javier Saiz, Ferran Espuña and Jorge Palomar.
 
+ However, the creation of the dataset would not have been possible without the collaboration of a large number of collaborators, partners and public institutions,
+ which can be found in detail in the acknowledgements.
 
  **Who funded the creation of the dataset? If there is an associated grant, please provide the name of the grantor and the grant name and number.**
 
+ This work has been promoted and financed by the Government of Catalonia through the [Aina project](https://projecteaina.cat/).
 
  This work is funded by the _Ministerio para la Transformación Digital y de la Función Pública_ - Funded by EU – NextGenerationEU
  within the framework of [ILENIA Project](https://proyectoilenia.es/) with reference 2022/TL22/00215337.
 
  **What data does each instance consist of? “Raw” data (e.g., unprocessed text or images) or features? In either case, please provide a description.**
 
+ Each instance consists of a text document processed for deduplication, language identification, and source-specific filtering. Some documents required
+ optical character recognition (OCR) to extract text from non-text formats such as PDFs.
 
  **Is there a label or target associated with each instance? If so, please provide a description.**
 
+ Each instance is labelled with a unique identifier, the primary language of the content, and the URL for web-sourced instances. Additional labels were
+ automatically assigned to detect specific types of content (harmful or toxic content) and to assign preliminary indicators of undesired qualities (very
+ short documents, high density of symbols, etc.), which were used for filtering instances.
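
For illustration only, a minimal sketch of what one labelled instance could look like. The field names below are hypothetical and do not reflect the dataset's actual schema; they only mirror the kinds of labels described above (identifier, primary language, source URL, and automatically assigned indicators).

```python
# Hypothetical instance record; field names are illustrative only.
example_instance = {
    "id": "doc-000001",                  # unique identifier
    "language": "ca",                    # primary language of the content
    "url": "https://example.org/page",   # present for web-sourced instances
    "harmful_content": False,            # automatically assigned content indicator
    "quality_flags": ["very_short"],     # preliminary indicators of undesired qualities
    "text": "Document text goes here.",
}
```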
 
  **Is any information missing from individual instances? If so, please provide a description, explaining why this information is missing (e.g., because it was unavailable). This does not include intentionally removed information, but might include, e.g., redacted text.**
 
  **Are there recommended data splits (e.g., training, development/validation, testing)? If so, please provide a description of these splits, explaining the rationale behind them.**
 
+ The dataset is randomly divided into training, validation and test sets, where the validation and test sets are each 1% of the total corpus.
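
As a rough sketch of the split described above (1% validation, 1% test, remainder training), assuming a simple in-memory list of document identifiers; this is illustrative and not the project's actual splitting code.

```python
import random

def split_corpus(doc_ids, val_frac=0.01, test_frac=0.01, seed=0):
    """Randomly assign documents to train/validation/test splits."""
    rng = random.Random(seed)
    shuffled = list(doc_ids)
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_frac)
    n_test = int(len(shuffled) * test_frac)
    return {
        "validation": shuffled[:n_val],
        "test": shuffled[n_val:n_val + n_test],
        "train": shuffled[n_val + n_test:],  # remaining ~98%
    }

splits = split_corpus([f"doc-{i}" for i in range(100_000)])
print({name: len(part) for name, part in splits.items()})
```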
 
  **Are there any errors, sources of noise, or redundancies in the dataset? If so, please provide a description.**
 
+ Despite removing duplicated instances within each source, redundancy remains at the paragraph and sentence levels, particularly in web-sourced
+ instances where search engine optimization techniques and templates contribute to repeated textual patterns. Some instances may also be duplicated
  across sources due to format variations.
 
  **Is the dataset self-contained, or does it link to or otherwise rely on external resources (e.g., websites, tweets, other datasets)? If it links to or relies on external resources, a) are there guarantees that they will exist, and remain constant, over time; b) are there official archival versions of the complete dataset (i.e., including the external resources as they existed at the time the dataset was created); c) are there any restrictions (e.g., licenses, fees) associated with any of the external resources that might apply to a dataset consumer? Please provide descriptions of all external resources and any restrictions associated with them, as well as links or other access points, as appropriate.**
 
  **Is it possible to identify individuals (i.e., one or more natural persons), either directly or indirectly (i.e., in combination with other data) from the dataset? If so, please describe how.**
 
+ Web-sourced instances in the dataset may contain personally identifiable information (PII) that is publicly available on the Web, such as names,
+ IP addresses, email addresses, and phone numbers. While it would be possible to indirectly identify individuals through the combination of multiple
+ data points, the nature and scale of web data make it difficult to parse such information. In any case, efforts are made to filter or anonymize
+ sensitive data (Mina et al., 2024), but some identifiable information may remain in the dataset.
 
  **Does the dataset contain data that might be considered sensitive in any way? If so, please provide a description.**
 
  **How was the data collected?**
 
  This dataset is constituted by combining several sources, whose acquisition methods can be classified into three groups:
+ - Web-sourced datasets with some preprocessing available under permissive license.
+ - Domain-specific or language-specific raw crawls.
+ - Manually curated data obtained through collaborators, data providers (by means of legal assignment agreements) or open source projects (e.g. CATalog).
 
  **What mechanisms or procedures were used to collect the data? How were these mechanisms or procedures validated?**
 
+ The data collection process was carried out using three different mechanisms, each corresponding to one of the groups defined in the previous answer. The specific methods used and their respective validation procedures are outlined below:
+ - Open direct download: Data were obtained directly from publicly accessible sources, such as websites or repositories that provide open data downloads. We validate the data with a data integrity check, which ensures that the downloaded files are complete, uncorrupted and in the expected format and structure.
+ - Ad hoc scrapers or crawlers: Custom web scraping scripts or crawlers were used to extract data from various online sources where direct downloads were not available. These scripts navigate web pages, extract relevant data and store it in a structured format. We validate this method with software unit tests to evaluate the functionality of individual components of the scraping programs, checking for errors or unexpected behaviour. In addition, data integrity tests were performed to verify that the collected data remained complete throughout the extraction and storage process.
+ - Direct download via FTP, SFTP, API or S3: Some datasets were acquired using secure transfer protocols such as FTP (File Transfer Protocol), SFTP (Secure File Transfer Protocol), or API (Application Programming Interface) requests from cloud storage services such as Amazon S3. As with the open direct download method, data integrity tests were used to validate the completeness of the files to ensure that the files were not altered or corrupted during the transfer process.
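
The "data integrity tests" mentioned in the bullets above can be pictured as checks of completeness and format after download. The sketch below assumes published SHA-256 checksums and JSON Lines files; both are assumptions made for illustration, not a description of the actual validation code.

```python
import hashlib
import json
from pathlib import Path

def checksum_matches(path: Path, expected_sha256: str) -> bool:
    """Verify that a downloaded file is complete and uncorrupted
    by comparing its SHA-256 digest with a published checksum."""
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    return digest == expected_sha256

def jsonl_is_well_formed(path: Path, required_field: str = "text") -> bool:
    """Verify that every line parses as JSON and carries the expected field."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                return False
            if required_field not in record:
                return False
    return True
```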
 
  **If the dataset is a sample from a larger set, what was the sampling strategy?**
 
+ The sampling strategy was to use the whole dataset resulting from the filtering explained in the 'preprocessing/cleaning/labelling' section,
+ with the particularity that an upsampling of 2 (i.e. twice the probability of sampling a document) was performed for the co-official languages
+ of Spain (Spanish, Catalan, Galician, Basque), and a downsampling of 1/2 was applied for code (half the probability of sampling a code document,
+ evenly distributed among all programming languages).
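
A back-of-the-envelope way to read the sampling strategy above: each document's probability of being drawn is proportional to a weight of 2 for the co-official languages of Spain, 1/2 for code, and 1 otherwise. The sketch below is a simplification under that reading, not the actual training-mixture code.

```python
# Illustrative per-document sampling weights for the strategy described above.
CO_OFFICIAL_LANGUAGES = {"es", "ca", "gl", "eu"}

def sampling_weight(language: str, is_code: bool) -> float:
    if is_code:
        return 0.5   # downsampling of 1/2 for code, regardless of programming language
    if language in CO_OFFICIAL_LANGUAGES:
        return 2.0   # upsampling of 2 for Spanish, Catalan, Galician and Basque
    return 1.0       # all other languages keep their natural proportion

# A document is then drawn with probability weight / total weight of the corpus.
```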
 
  **Who was involved in the data collection process and how were they compensated?**
 
+ This data is generally extracted, filtered and sampled by automated processes. The code required to run these processes has been developed entirely
+ by members of the Language Technologies data team, or otherwise obtained from open-source software. Furthermore, there has been no monetary
  consideration for acquiring data from suppliers.
 
  **Over what timeframe was the data collected? Does this timeframe match the creation timeframe of the data associated with the instances? If not, please describe the timeframe in which the data associated with the instances was created.**
 
  **Was any preprocessing/cleaning/labeling of the data done? If so, please provide a description. If not, you may skip the remaining questions in this section.**
 
+ No changes were made to the content of individual text document instances. However, the web-sourced documents underwent a filtering process based on specific criteria along two key dimensions:
+ - Quality filtering: The text processing pipeline CURATE (Palomar-Giner et al., 2024) calculates a quality score for each document based on a set of filtering criteria that identify undesirable textual characteristics. Any document with a score below the 0.8 threshold was excluded from the dataset.
+ - Harmful or adult content filtering: To reduce the amount of harmful or inappropriate material in the dataset, documents from Colossal OSCAR were filtered using the Ungoliant pipeline (Abadji et al., 2021), which uses the 'harmful_pp' field, a perplexity-based score generated by a language model.
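
The two filters above can be pictured as simple per-document predicates. In the sketch below, `quality_score` stands in for the CURATE score (documents below 0.8 are dropped) and `harmful_pp` for the perplexity-based field from Ungoliant; the harmful-content threshold and the direction of that comparison are assumptions made for illustration, since neither is specified here.

```python
QUALITY_THRESHOLD = 0.8          # documents scoring below this were excluded
HARMFUL_PP_THRESHOLD = 1000.0    # hypothetical cut-off, for illustration only

def keep_document(doc: dict) -> bool:
    """Return True if a web-sourced document passes both filters."""
    if doc["quality_score"] < QUALITY_THRESHOLD:
        return False  # quality filtering
    harmful_pp = doc.get("harmful_pp")
    if harmful_pp is not None and harmful_pp < HARMFUL_PP_THRESHOLD:
        return False  # assumed: low perplexity means closer to harmful content
    return True
```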
 
 
 
  **Was the “raw” data saved in addition to the preprocessed/cleaned/labeled data? If so, please provide a link or other access point to the “raw” data.**
 
  **Is the software that was used to preprocess/clean/label the data available? If so, please provide a link or other access point.**
 
+ Yes, the preprocessing and filtering software is open-sourced. The [CURATE](https://github.com/langtech-bsc/CURATE) pipeline was used for CATalog and other curated datasets,
  and the [Ungoliant](https://github.com/oscar-project/ungoliant) pipeline was used for the OSCAR project.
 
  #### Uses
 
  **If the dataset relates to people, are there applicable limits on the retention of the data associated with the instances? If so, please describe these limits and explain how they will be enforced.**
 
+ The dataset does not keep sensitive data that could allow direct identification of individuals, apart from the data that is publicly available in
+ web-sourced content. Due to the sheer volume and diversity of web data, it is not feasible to notify individuals or manage data retention on an
+ individual basis. However, efforts are made to mitigate the risks associated with sensitive information through pre-processing and filtering to
+ remove identifiable or harmful content. Despite these measures, vigilance is maintained to address potential privacy and ethical issues.
 
  **Will older versions of the dataset continue to be supported/hosted/maintained? If so, please describe how. If not, please describe how its obsolescence will be communicated to dataset consumers.**