Update README.md
README.md CHANGED
@@ -20,25 +20,18 @@ thumbnail: https://github.com/Marcosdib/S2Query/Classification_Architecture_mode

Disclaimer:

-## According to the abstract,
-
-of the
-Transformer-based approach, the Word2Vec-based approach improved the accuracy rate to 88%. The research results serve as
-a successful case of artificial intelligence in a federal government application.
-
-This model focuses on a more specific problem, creating a Research Financing Products Portfolio (FPP) outside of the Union budget,
-supported by the Brazilian Ministry of Science, Technology, and Innovation (MCTI). It was introduced in ["Using transfer learning to classify long unstructured texts with small amounts of labeled data"](https://www.scitepress.org/Link.aspx?doi=10.5220/0011527700003318) and first released in
-[this repository](https://huggingface.co/unb-lamfo-nlp-mcti). This model is uncased: it does not make a difference between english
-and English.
+## According to the abstract of the literature review
+
+We provide a literature review about Automatic Text Summarization systems. We consider a citation-based approach. We start with some popular and well-known
+papers that we have in hand about each topic we want to cover and we have tracked the "backward citations" (papers that are cited by the set of papers we
+knew beforehand) and the "forward citations" (newer papers that cite the set of papers we knew beforehand). In order to organize the different methods, we
+present the diverse approaches to ATS guided by the mechanisms they use to generate a summary. Besides presenting the methods, we also present an extensive
+review of the datasets available for summarization tasks and the methods used to evaluate the quality of the summaries. Finally, we present an empirical
+exploration of these methods using the CNN Corpus dataset, which provides golden summaries for extractive and abstractive methods.
+
+This model is an end result of the above-mentioned literature review, from which the best solution was drawn to be applied to the problem of
+summarizing texts extracted from the Research Financing Products Portfolio (FPP) of the Brazilian Ministry of Science, Technology, and Innovation (MCTI).
+It was first released in [this repository](https://huggingface.co/unb-lamfo-nlp-mcti), along with the other models used to address the given problem.

## Model description

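The abstract above mentions methods for evaluating the quality of summaries; in the summarization literature this usually means ROUGE, which compares a candidate summary against a golden reference. A minimal sketch of such a check, assuming the third-party `rouge_score` package (the card itself does not prescribe an evaluation library):

```python
# Minimal sketch: scoring a candidate summary against a golden (reference)
# summary with ROUGE. The `rouge_score` package is an assumption here; the
# model card does not prescribe a particular evaluation library.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(
    "the cat was found under the bed",  # golden (reference) summary
    "the cat was under the bed",        # candidate summary
)
print(scores["rouge1"].fmeasure, scores["rougeL"].fmeasure)
```
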
@@ -65,19 +58,19 @@ methods used for text summarization will be described individually in the following

## Methods

-| Method                 | Kind of ATS |
-| SumyRandom             | Extractive  |
-| SumyLsa                | Extractive  |
-| SumyLexRank            | Extractive  |
-| SumyTextRank           | Extractive  |
-| SumySumBasic           | Extractive  |
-| SumyKL                 | Extractive  |
-| SumyReduction          | Extractive  |
-| BART-Large CNN         | Abstractive | [facebook/bart-large-cnn](https://huggingface.co/facebook/bart-large-cnn) |
-| Pegasus-XSUM           | Abstractive | [google/pegasus-xsum](https://huggingface.co/google/pegasus-xsum) |
-| mT5 Multilingual XLSUM | Abstractive | [csebuetnlp/mT5_multilingual_XLSum](https://huggingface.co/csebuetnlp/mT5_multilingual_XLSum) |
+| Method                 | Kind of ATS | Documentation                                       | Source Article |
+|:----------------------:|:-----------:|:---------------------------------------------------:|:--------------:|
+| SumyRandom             | Extractive  | [Sumy GitHub](https://github.com/miso-belica/sumy/) | None (picks out random sentences from the source text) |
+| SumyLuhn               | Extractive  | Ibid.                                               | (Luhn, 1958) |
+| SumyLsa                | Extractive  | Ibid.                                               | [(Steinberger et al., 2004)](http://www.kiv.zcu.cz/~jstein/publikace/isim2004.pdf) |
+| SumyLexRank            | Extractive  | Ibid.                                               | (Erkan and Radev, 2004) |
+| SumyTextRank           | Extractive  | Ibid.                                               | (Mihalcea and Tarau, 2004) |
+| SumySumBasic           | Extractive  | Ibid.                                               | None (often used as a baseline model in the literature) |
+| SumyKL                 | Extractive  | Ibid.                                               | (Haghighi and Vanderwende, 2009) |
+| SumyReduction          | Extractive  | Ibid.                                               | None |
+| BART-Large CNN         | Abstractive | [facebook/bart-large-cnn](https://huggingface.co/facebook/bart-large-cnn) | (Lewis et al., 2019) |
+| Pegasus-XSUM           | Abstractive | [google/pegasus-xsum](https://huggingface.co/google/pegasus-xsum) | (Zhang et al., 2020) |
+| mT5 Multilingual XLSUM | Abstractive | [csebuetnlp/mT5_multilingual_XLSum](https://huggingface.co/csebuetnlp/mT5_multilingual_XLSum) | (Xue et al., 2020) |

## Model variations

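The extractive methods in the table all come from the [Sumy](https://github.com/miso-belica/sumy/) package, so they share one calling convention. A minimal sketch of running one of them (LexRank here; the input text is a placeholder, and `sumy` plus the NLTK tokenizer data are assumed to be installed):

```python
# Minimal sketch: one of the extractive Sumy methods (LexRank here).
# Assumes `pip install sumy` and NLTK tokenizer data; the text is a placeholder.
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer

parser = PlaintextParser.from_string(
    "Automatic text summarization condenses a source document into a short "
    "summary. Extractive methods select sentences already present in the text. "
    "Abstractive methods generate new sentences instead.",
    Tokenizer("english"),
)
summarizer = LexRankSummarizer()
for sentence in summarizer(parser.document, sentences_count=1):
    print(sentence)
```

Swapping `LexRankSummarizer` for, e.g., `sumy.summarizers.luhn.LuhnSummarizer` or `sumy.summarizers.text_rank.TextRankSummarizer` covers the other extractive rows. The abstractive rows are Hugging Face checkpoints and can be exercised through the standard `summarization` pipeline; a sketch with illustrative length limits:

```python
# Minimal sketch: an abstractive checkpoint from the table via the Hugging Face
# `summarization` pipeline. The length limits below are illustrative, not tuned.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
text = (
    "Research financing opportunities are scraped from the web as long, "
    "unstructured documents. Screening them by hand is slow, which motivates "
    "summarizing each document automatically before classification."
)
print(summarizer(text, max_length=40, min_length=10, do_sample=False)[0]["summary_text"])
```
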
@@ -179,60 +172,7 @@ output = model(encoded_input)

### Limitations and bias

-Even if the training data used for this model could be characterized as fairly neutral, this model can have biased
-predictions:
-
-```python
->>> from transformers import pipeline
->>> unmasker = pipeline('fill-mask', model='bert-base-uncased')
->>> unmasker("The man worked as a [MASK].")
-
-[{'sequence': '[CLS] the man worked as a carpenter. [SEP]',
-  'score': 0.09747550636529922,
-  'token': 10533,
-  'token_str': 'carpenter'},
- {'sequence': '[CLS] the man worked as a waiter. [SEP]',
-  'score': 0.0523831807076931,
-  'token': 15610,
-  'token_str': 'waiter'},
- {'sequence': '[CLS] the man worked as a barber. [SEP]',
-  'score': 0.04962705448269844,
-  'token': 13362,
-  'token_str': 'barber'},
- {'sequence': '[CLS] the man worked as a mechanic. [SEP]',
-  'score': 0.03788609802722931,
-  'token': 15893,
-  'token_str': 'mechanic'},
- {'sequence': '[CLS] the man worked as a salesman. [SEP]',
-  'score': 0.037680890411138535,
-  'token': 18968,
-  'token_str': 'salesman'}]
-
->>> unmasker("The woman worked as a [MASK].")
-
-[{'sequence': '[CLS] the woman worked as a nurse. [SEP]',
-  'score': 0.21981462836265564,
-  'token': 6821,
-  'token_str': 'nurse'},
- {'sequence': '[CLS] the woman worked as a waitress. [SEP]',
-  'score': 0.1597415804862976,
-  'token': 13877,
-  'token_str': 'waitress'},
- {'sequence': '[CLS] the woman worked as a maid. [SEP]',
-  'score': 0.1154729500412941,
-  'token': 10850,
-  'token_str': 'maid'},
- {'sequence': '[CLS] the woman worked as a prostitute. [SEP]',
-  'score': 0.037968918681144714,
-  'token': 19215,
-  'token_str': 'prostitute'},
- {'sequence': '[CLS] the woman worked as a cook. [SEP]',
-  'score': 0.03042375110089779,
-  'token': 5660,
-  'token_str': 'cook'}]
-```

-This bias will also affect all fine-tuned versions of this model.

## Training data
