Yeop9690 committed on
Commit ff153c4 · 1 Parent(s): 8de70b8

Initial commit of the custom summarization dataset

README.md ADDED
@@ -0,0 +1,51 @@
# Dataset Card for Custom Text Dataset

## Dataset Name
- Custom Text Summarization Dataset (CNN/DailyMail Subset)

## Overview
This dataset is a subset of the CNN/DailyMail news dataset, used for training text summarization models. Each example pairs a news article with a human-written summary. CNN/DailyMail is widely used in the development of natural language processing models for summarization tasks.

- **Number of examples**: 287,113 (training), 13,368 (validation), and 11,490 (test) in the full dataset; this card describes a subset of roughly 1% of the training split (about 2,871 examples)
- **Languages**: English

## Composition
- **Source**: CNN and DailyMail news articles
- **Size**: 1% subset of the full dataset
- **Text Fields**: Each example consists of:
  - `article`: The news article text
  - `highlights`: The human-generated summary of the article

## Collection Process
The dataset was collected by scraping news articles from the CNN and DailyMail websites. Each article was paired with manually written summary highlights to form a training example. The dataset was originally prepared for the task of abstractive text summarization.

## Preprocessing
- Tokenization using a pretrained tokenizer (e.g., the T5 tokenizer)
- Maximum token length capped at 512 for both input and output sequences
- Lowercasing of all texts to maintain consistency
- Special tokens for start and end of sequences
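
A minimal sketch of the preprocessing steps above, assuming a `t5-small` checkpoint and the `article`/`highlights` fields; the checkpoint name is an illustrative assumption, not something this card specifies:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")  # assumed checkpoint

def preprocess(example):
    # Lowercase both fields, per the preprocessing notes above
    article = example["article"].lower()
    summary = example["highlights"].lower()
    # Truncate both input and target to 512 tokens; the tokenizer
    # appends T5's end-of-sequence token (</s>) automatically
    model_inputs = tokenizer(article, max_length=512, truncation=True)
    labels = tokenizer(text_target=summary, max_length=512, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
```

Applied with `dataset.map(preprocess)`, this yields token IDs ready for a sequence-to-sequence model.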

## How to Use
```python
from datasets import load_dataset

# Load a 1% slice of the CNN/DailyMail training split
dataset = load_dataset("cnn_dailymail", "3.0.0", split="train[:1%]")
```
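
For a quick sanity check, the loaded fields can be inspected directly:

```python
sample = dataset[0]
print(sample["article"][:200])  # first 200 characters of the article
print(sample["highlights"])     # the reference summary
```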

## Evaluation
- ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
- BLEU (Bilingual Evaluation Understudy)
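
As a sketch, ROUGE can be computed with the Hugging Face `evaluate` library; the strings below are hypothetical model outputs, and the metric choice mirrors the list above rather than anything this card mandates:

```python
import evaluate

rouge = evaluate.load("rouge")
predictions = ["police arrested a suspect after a downtown chase"]  # hypothetical model output
references = ["a suspect was arrested following a chase downtown"]  # e.g., a `highlights` entry
print(rouge.compute(predictions=predictions, references=references))
```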

## Limitations
- Data bias: The dataset draws on only two major sources, CNN and DailyMail, which may impose a particular writing style and editorial focus on the summaries.
- Domain-specific issues: The dataset is limited to news articles and may not generalize well to other domains such as scientific texts or casual conversation.

## Ethical Considerations
- Privacy: Since the dataset consists of publicly available news articles, privacy concerns are minimal; still, users should be cautious when generating summaries of sensitive or private information.
- Bias: News articles from CNN and DailyMail may reflect specific political or cultural biases, which could influence the summaries generated by models trained on this dataset.

test/dataset_dict.json ADDED
@@ -0,0 +1 @@
{"splits": ["test"]}

test/test/data-00000-of-00001.arrow ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:1e6aa13a3e10a33624931f6c220c9618528323886bd7b7ac334af681b8dc0646
size 346576

test/test/dataset_info.json ADDED
@@ -0,0 +1,22 @@
{
  "citation": "",
  "description": "",
  "features": {
    "sentence": {
      "feature": {
        "dtype": "string",
        "_type": "Value"
      },
      "_type": "Sequence"
    },
    "labels": {
      "feature": {
        "dtype": "string",
        "_type": "Value"
      },
      "_type": "Sequence"
    }
  },
  "homepage": "",
  "license": ""
}
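
Per this dataset_info.json, the test split stores `sentence` and `labels` as Sequence features, i.e. each example holds lists of strings. A hypothetical construction of a Dataset matching that schema (values invented for illustration):

```python
from datasets import Dataset, Features, Sequence, Value

features = Features({
    "sentence": Sequence(Value("string")),
    "labels": Sequence(Value("string")),
})
ds = Dataset.from_dict(
    {
        "sentence": [["First sentence.", "Second sentence."]],
        "labels": [["summary line a", "summary line b"]],
    },
    features=features,
)
```

Note that the train split's dataset_info.json (below) declares the same columns as plain string Values rather than Sequences.
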
test/test/state.json ADDED
@@ -0,0 +1,13 @@
{
  "_data_files": [
    {
      "filename": "data-00000-of-00001.arrow"
    }
  ],
  "_fingerprint": "a966e5e39a3a551f",
  "_format_columns": null,
  "_format_kwargs": {},
  "_format_type": null,
  "_output_all_columns": false,
  "_split": null
}

train/dataset_dict.json ADDED
@@ -0,0 +1 @@
{"splits": ["train"]}

train/train/data-00000-of-00001.arrow ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c3b84a293ed7afd9641f578c760558feab774e12174775ffef3bd6d130873903
size 1400

train/train/dataset_info.json ADDED
@@ -0,0 +1,16 @@
{
  "citation": "",
  "description": "",
  "features": {
    "sentence": {
      "dtype": "string",
      "_type": "Value"
    },
    "labels": {
      "dtype": "string",
      "_type": "Value"
    }
  },
  "homepage": "",
  "license": ""
}

train/train/state.json ADDED
@@ -0,0 +1,13 @@
{
  "_data_files": [
    {
      "filename": "data-00000-of-00001.arrow"
    }
  ],
  "_fingerprint": "a1df46296853828f",
  "_format_columns": null,
  "_format_kwargs": {},
  "_format_type": null,
  "_output_all_columns": false,
  "_split": null
}
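
The dataset_dict.json / data-*.arrow / state.json layout above matches what the `datasets` library's `save_to_disk` writes, so a sketch for loading the splits back (assuming a local checkout with Git LFS so the .arrow files are materialized):

```python
from datasets import load_from_disk

# Each top-level directory holds a DatasetDict with a single split
train_ds = load_from_disk("train")["train"]
test_ds = load_from_disk("test")["test"]
print(train_ds.features)  # expect 'sentence' and 'labels' per dataset_info.json
```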