To evaluate the model, summaries were generated by each of its summarization methods, using as source texts documents obtained from existing datasets. The chosen datasets for evaluation were the following:

- **Scientific Papers (arXiv + PubMed)**: [Cohan et al. (2018)](https://arxiv.org/pdf/1804.05685) found that existing datasets contained either short texts (600 words on average) or longer texts paired only with extractive human summaries. To fill this gap and provide a dataset of long documents for abstractive summarization, the authors compiled two new datasets of scientific papers from the arXiv and PubMed databases. Scientific papers are especially convenient for the desired kind of ATS the authors mean to achieve, due to their large length and the fact that each one contains an abstractive summary written by its author – i.e., the paper’s abstract.
- **BIGPATENT**: [Sharma et al. (2019)](https://arxiv.org/pdf/1906.03741) introduced the BIGPATENT dataset, which provides good examples for the task of abstractive summarization. The dataset is built from Google Patents Public Datasets, where each document comes with one gold-standard summary: the patent’s original abstract. One advantage of this dataset is that it avoids difficulties inherent to news summarization datasets, where summaries have a flattened discourse structure and the summary content appears at the beginning of the document.
- **CNN Corpus**: [Lins et al. (2019)](https://par.nsf.gov/servlets/purl/10185297) introduced this corpus to fill a gap: most single-document news summarization datasets have fewer than 1,000 documents. The CNN-Corpus, by contrast, contains 3,000 single documents with two gold-standard summaries each, one extractive and one abstractive. The inclusion of extractive gold-standard summaries is an advantage of this dataset over others with similar goals, which usually contain only abstractive ones.
- **CNN/Daily Mail**: [Hermann et al. (2015)](https://proceedings.neurips.cc/paper/2015/file/afdec7005cc9f14302cd0474fd0f3c96-Paper.pdf) set out to develop a consistent method for what they called “teaching machines how to read”, i.e., making a machine able to comprehend a text via Natural Language Processing techniques. To perform that task, they collected around 400k news articles from the newspapers CNN and Daily Mail and evaluated what they considered to be the key aspect of understanding a text, namely answering somewhat complex questions about it. Even though ATS is not the authors’ main focus, they took inspiration from it to develop their model and included in their dataset the human-made summaries for each news article.
- **XSum**: [Narayan et al. (2018)](https://arxiv.org/pdf/1808.08745) introduced this single-document dataset, which focuses on a kind of summarization the authors describe as extreme summarization – an abstractive kind of ATS aimed at answering the question “What is the document about?”. The data was obtained from BBC articles, each of which is accompanied by a short gold-standard