Yeop9690 committed on
Commit ff153c4 · 1 Parent(s): 8de70b8

Initial commit of the custom summarization dataset

README.md ADDED
@@ -0,0 +1,51 @@
# Dataset Card for Custom Text Dataset

## Dataset Name
- Custom Text Summarization Dataset (CNN/DailyMail Subset)

## Overview
This dataset is a subset of the CNN/DailyMail news dataset, used for training text summarization models. Each example pairs a news article with a human-written summary. CNN/DailyMail is widely used in the development of natural language processing models for summarization tasks.

- **Number of examples**: 287,113 (training), 13,368 (validation), and 11,490 (test) in the full dataset; this card describes a subset of roughly 1% of the training split (about 2,871 examples)
- **Languages**: English

## Composition
- **Source**: CNN and DailyMail news articles
- **Size**: 1% subset of the full dataset
- **Text Fields**: Each example consists of:
  - `article`: The news article text
  - `highlights`: The human-generated summary of the article

## Collection Process
The dataset was collected by scraping news articles from the CNN and DailyMail websites. Each article was paired with manually written summary highlights to form a training example. The dataset was originally prepared for the task of abstractive text summarization.

## Preprocessing
- Tokenization using a pretrained tokenizer (e.g., the T5 tokenizer)
- Maximum token length capped at 512 for both input and output sequences
- Lowercasing of all texts to maintain consistency
- Special tokens for start and end of sequences
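
A minimal sketch of the preprocessing steps above, assuming a `t5-small` checkpoint and the `article`/`highlights` fields; the checkpoint name is an illustrative assumption, not something this card specifies:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")  # assumed checkpoint

def preprocess(example):
    # Lowercase both fields, per the preprocessing notes above
    article = example["article"].lower()
    summary = example["highlights"].lower()
    # Truncate both input and target to 512 tokens; the tokenizer
    # appends T5's end-of-sequence token (</s>) automatically
    model_inputs = tokenizer(article, max_length=512, truncation=True)
    labels = tokenizer(text_target=summary, max_length=512, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
```

Applied with `dataset.map(preprocess)`, this yields token IDs ready for a sequence-to-sequence model.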

## How to Use
```python
from datasets import load_dataset

# Load a 1% slice of the CNN/DailyMail training split
dataset = load_dataset("cnn_dailymail", "3.0.0", split="train[:1%]")
```
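
For a quick sanity check, the loaded fields can be inspected directly:

```python
sample = dataset[0]
print(sample["article"][:200])  # first 200 characters of the article
print(sample["highlights"])     # the reference summary
```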

## Evaluation
- ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
- BLEU (Bilingual Evaluation Understudy)
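
As a sketch, ROUGE can be computed with the Hugging Face `evaluate` library; the strings below are hypothetical model outputs, and the metric choice mirrors the list above rather than anything this card mandates:

```python
import evaluate

rouge = evaluate.load("rouge")
predictions = ["police arrested a suspect after a downtown chase"]  # hypothetical model output
references = ["a suspect was arrested following a chase downtown"]  # e.g., a `highlights` entry
print(rouge.compute(predictions=predictions, references=references))
```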

## Limitations
- Data bias: The dataset draws on only two major sources, CNN and DailyMail, which may impose a particular writing style and editorial focus on the summaries.
- Domain-specific issues: The dataset is limited to news articles and may not generalize well to other domains such as scientific texts or casual conversation.

## Ethical Considerations
- Privacy: Since the dataset consists of publicly available news articles, privacy concerns are minimal; still, users should be cautious when generating summaries of sensitive or private information.
- Bias: News articles from CNN and DailyMail may reflect specific political or cultural biases, which could influence the summaries generated by models trained on this dataset.

test/dataset_dict.json ADDED
@@ -0,0 +1 @@
{"splits": ["test"]}

test/test/data-00000-of-00001.arrow ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:1e6aa13a3e10a33624931f6c220c9618528323886bd7b7ac334af681b8dc0646
size 346576

test/test/dataset_info.json ADDED
@@ -0,0 +1,22 @@
{
  "citation": "",
  "description": "",
  "features": {
    "sentence": {
      "feature": {
        "dtype": "string",
        "_type": "Value"
      },
      "_type": "Sequence"
    },
    "labels": {
      "feature": {
        "dtype": "string",
        "_type": "Value"
      },
      "_type": "Sequence"
    }
  },
  "homepage": "",
  "license": ""
}
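
Per this dataset_info.json, the test split stores `sentence` and `labels` as Sequence features, i.e. each example holds lists of strings. A hypothetical construction of a Dataset matching that schema (values invented for illustration):

```python
from datasets import Dataset, Features, Sequence, Value

features = Features({
    "sentence": Sequence(Value("string")),
    "labels": Sequence(Value("string")),
})
ds = Dataset.from_dict(
    {
        "sentence": [["First sentence.", "Second sentence."]],
        "labels": [["summary line a", "summary line b"]],
    },
    features=features,
)
```

Note that the train split's dataset_info.json (below) declares the same columns as plain string Values rather than Sequences.
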
test/test/state.json ADDED
@@ -0,0 +1,13 @@
{
  "_data_files": [
    {
      "filename": "data-00000-of-00001.arrow"
    }
  ],
  "_fingerprint": "a966e5e39a3a551f",
  "_format_columns": null,
  "_format_kwargs": {},
  "_format_type": null,
  "_output_all_columns": false,
  "_split": null
}

train/dataset_dict.json ADDED
@@ -0,0 +1 @@
{"splits": ["train"]}

train/train/data-00000-of-00001.arrow ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c3b84a293ed7afd9641f578c760558feab774e12174775ffef3bd6d130873903
size 1400

train/train/dataset_info.json ADDED
@@ -0,0 +1,16 @@
{
  "citation": "",
  "description": "",
  "features": {
    "sentence": {
      "dtype": "string",
      "_type": "Value"
    },
    "labels": {
      "dtype": "string",
      "_type": "Value"
    }
  },
  "homepage": "",
  "license": ""
}

train/train/state.json ADDED
@@ -0,0 +1,13 @@
{
  "_data_files": [
    {
      "filename": "data-00000-of-00001.arrow"
    }
  ],
  "_fingerprint": "a1df46296853828f",
  "_format_columns": null,
  "_format_kwargs": {},
  "_format_type": null,
  "_output_all_columns": false,
  "_split": null
}
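
The dataset_dict.json / data-*.arrow / state.json layout above matches what the `datasets` library's `save_to_disk` writes, so a sketch for loading the splits back (assuming a local checkout with Git LFS so the .arrow files are materialized):

```python
from datasets import load_from_disk

# Each top-level directory holds a DatasetDict with a single split
train_ds = load_from_disk("train")["train"]
test_ds = load_from_disk("test")["test"]
print(train_ds.features)  # expect 'sentence' and 'labels' per dataset_info.json
```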