hwkhw
/

custom_summarization_dataset

hwkhw committed on Sep 19, 2024

Commit

bf07749

1 Parent(s): 62b4b29

init commit

Files changed (9)

README.md ADDED

+# Dataset Card for Custom Text Dataset
+## Dataset Name
+  CNN/Daily Mail
+## Overview
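+  [CNN/Daily Mail](https://github.com/abisee/cnn-dailymail): an English-language dataset containing more than 300,000 unique news articles written by journalists at CNN and the Daily Mail.
+ * Version 1.0.0 of the data is released under the Apache-2.0 License, and the code used to generate the data is released under the MIT License.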
+## Composition
+  The CNN/Daily Mail dataset contains two fields.
+  - article : the news article
+  - highlights : the summary
+## Collection Process
+  More than 300,000 unique news articles written by journalists at CNN and the Daily Mail.
+## Preprocessing
+  No special preprocessing is applied.
+## How to Use
+```bash
+  python train.py
+  python evaluation.py
+```
+## Evaluation
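+  Evaluation measures how well the model summarizes the text.
+  - Performance is measured with ROUGE and BLEU scores.
+  - Predictions are also inspected using a Pipeline and different search strategies.
+  - A compute_metric function is defined to calculate the ROUGE and BLEU scores.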
+## Limitations
+## Ethical Considerations
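
The README's Evaluation section mentions a `compute_metric` function for ROUGE and BLEU, but this commit does not include its source. The following is only a minimal, standard-library sketch of what such a function might compute. It assumes whitespace tokenization, ROUGE-N as an F1 score over clipped n-gram overlap, and unsmoothed sentence-level BLEU up to bigrams; the helper names `ngrams`, `rouge_n`, and `bleu` are hypothetical. The actual `evaluation.py` may well rely on a metrics library with stemming and different defaults, so scores would not match exactly.

```python
from collections import Counter
import math


def ngrams(tokens, n):
    """Return a Counter of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(prediction, reference, n=1):
    """ROUGE-N F1 from clipped n-gram overlap (assumes whitespace tokens)."""
    pred = ngrams(prediction.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def bleu(prediction, reference, max_n=2):
    """Sentence-level BLEU with uniform weights and brevity penalty, no smoothing."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        pred = ngrams(pred_tokens, n)
        ref = ngrams(ref_tokens, n)
        total = sum(pred.values())
        matched = sum((pred & ref).values())
        if total == 0 or matched == 0:
            return 0.0  # without smoothing, any empty n-gram match zeroes the score
        log_precisions.append(math.log(matched / total))
    if len(pred_tokens) >= len(ref_tokens):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(ref_tokens) / len(pred_tokens))
    return brevity * math.exp(sum(log_precisions) / max_n)


def compute_metric(predictions, references):
    """Average ROUGE-1, ROUGE-2, and BLEU over paired predictions and references."""
    rows = [
        (rouge_n(p, r, 1), rouge_n(p, r, 2), bleu(p, r))
        for p, r in zip(predictions, references)
    ]
    count = len(rows) or 1
    return {
        "rouge1": sum(row[0] for row in rows) / count,
        "rouge2": sum(row[1] for row in rows) / count,
        "bleu": sum(row[2] for row in rows) / count,
    }
```

Calling `compute_metric(["the cat sat"], ["the cat sat"])` gives a score of 1.0 for all three metrics, and a prediction sharing no words with its reference scores 0.0 throughout.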

test/dataset_dict.json ADDED

@@ -0,0 +1 @@
+{"splits": ["test"]}

test/test/data-00000-of-00001.arrow ADDED

+version https://git-lfs.github.com/spec/v1
+oid sha256:1e6aa13a3e10a33624931f6c220c9618528323886bd7b7ac334af681b8dc0646
+size 346576
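
The three lines above are a Git LFS pointer, not the Arrow data itself: the real file is stored separately and identified by its SHA-256 digest and byte size. As a rough illustration, a pointer like this could be parsed as follows. The function name `parse_lfs_pointer` is hypothetical, and the sketch assumes the simple `key value` layout shown in this diff, with the leading `+` diff markers removed.

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer into version, hash algorithm, digest, and size."""
    fields = {}
    for line in text.strip().splitlines():
        # Each pointer line is "<key> <value>", separated by a single space.
        key, _, value = line.partition(" ")
        fields[key] = value
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algorithm": algorithm,
        "digest": digest,
        "size": int(fields["size"]),  # size of the real file in bytes
    }


# Pointer content copied from the diff above.
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:1e6aa13a3e10a33624931f6c220c9618528323886bd7b7ac334af681b8dc0646
size 346576
"""
info = parse_lfs_pointer(pointer)
print(info["algorithm"], info["size"])
```

A downloaded file could then be checked against `info["digest"]` with `hashlib.sha256` to confirm it matches the pointer.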

test/test/dataset_info.json ADDED

+{
+  "citation": "",
+  "description": "",
+  "features": {
+    "sentence": {
+      "feature": {
+        "dtype": "string",
+        "_type": "Value"
+      },
+      "_type": "Sequence"
+    },
+    "labels": {
+      "feature": {
+        "dtype": "string",
+        "_type": "Value"
+      },
+      "_type": "Sequence"
+    }
+  },
+  "homepage": "",
+  "license": ""
+}

test/test/state.json ADDED

+{
+  "_data_files": [
+    {
+      "filename": "data-00000-of-00001.arrow"
+    }
+  ],
+  "_fingerprint": "a966e5e39a3a551f",
+  "_format_columns": null,
+  "_format_kwargs": {},
+  "_format_type": null,
+  "_output_all_columns": false,
+  "_split": null
+}

train/dataset_dict.json ADDED

@@ -0,0 +1 @@
+{"splits": ["train"]}

train/train/data-00000-of-00001.arrow ADDED

+version https://git-lfs.github.com/spec/v1
+oid sha256:c3b84a293ed7afd9641f578c760558feab774e12174775ffef3bd6d130873903
+size 1400

train/train/dataset_info.json ADDED

+{
+  "citation": "",
+  "description": "",
+  "features": {
+    "sentence": {
+      "dtype": "string",
+      "_type": "Value"
+    },
+    "labels": {
+      "dtype": "string",
+      "_type": "Value"
+    }
+  },
+  "homepage": "",
+  "license": ""
+}
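
The two `dataset_info.json` files in this commit declare different schemas: the train split stores `sentence` and `labels` as plain strings, while the test split stores both as sequences of strings. Neither uses the `article` and `highlights` names listed in the README. The sketch below, using only the standard library, shows one way to surface such a mismatch before training; `describe_features` is a hypothetical helper, and the dictionaries are copied from the `features` blocks in this diff.

```python
def describe_features(info):
    """Map each column name to a readable type such as 'string' or 'Sequence[string]'."""
    described = {}
    for name, spec in info["features"].items():
        if spec["_type"] == "Sequence":
            # Sequence features nest the element type under "feature".
            described[name] = f"Sequence[{spec['feature']['dtype']}]"
        else:
            described[name] = spec["dtype"]
    return described


# Feature blocks copied from train/train/dataset_info.json and test/test/dataset_info.json.
train_info = {
    "features": {
        "sentence": {"dtype": "string", "_type": "Value"},
        "labels": {"dtype": "string", "_type": "Value"},
    }
}
test_info = {
    "features": {
        "sentence": {"feature": {"dtype": "string", "_type": "Value"}, "_type": "Sequence"},
        "labels": {"feature": {"dtype": "string", "_type": "Value"}, "_type": "Sequence"},
    }
}

train_schema = describe_features(train_info)
test_schema = describe_features(test_info)
mismatched = sorted(k for k in train_schema if train_schema[k] != test_schema.get(k))
print(mismatched)
```

Code that consumes both splits would likely need to account for this difference, for example by joining or indexing into the test sequences, depending on how they were built.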

train/train/state.json ADDED

+{
+  "_data_files": [
+    {
+      "filename": "data-00000-of-00001.arrow"
+    }
+  ],
+  "_fingerprint": "a1df46296853828f",
+  "_format_columns": null,
+  "_format_kwargs": {},
+  "_format_type": null,
+  "_output_all_columns": false,
+  "_split": null
+}