|
--- |
|
language: |
|
- ko |
|
--- |
|
# NewsKoT5 |
|
The training data for this T5 model consists of Korean news articles (29GB). However, the performance has not been fine-tuned through the use of small batches and a limited number of training steps, so it may not be fully optimized. |
|
|
|
## Quick tour |
|
```python |
|
from transformers import AutoTokenizer, T5ForConditionalGeneration |
|
|
|
tokenizer = AutoTokenizer.from_pretrained("BM-K/NewsKoT5-small") |
|
model = T5ForConditionalGeneration.from_pretrained("BM-K/NewsKoT5-small") |
|
|
|
input_ids = tokenizer("ํ๊ตญํ๋ฐ์ฌ์ฒด ๋๋ฆฌํธ๊ฐ ์ค์ฉ๊ธ <extra_id_0> ๋ฐ์ฌ์ฒด๋ก์ โ๋ฐ๋ทโ๋ฅผ ์ฑ๊ณต์ ์ผ๋ก <extra_id_1>", return_tensors="pt").input_ids |
|
labels = tokenizer("<extra_id_0> ์์ฑ <extra_id_1> ๋ง์ณค๋ค <extra_id_2>", return_tensors="pt").input_ids |
|
|
|
outputs = model(input_ids=input_ids, |
|
labels=labels) |
|
``` |
|
|
|
## News Summarization Performance (F1-score) |
|
After restoring the model's tokenized output to the original text, Rouge performance was evaluated by comparing it to the reference and hypothesis tokenized using [mecab](https://konlpy.org/ko/v0.4.0/). |
|
|
|
- Dacon ํ๊ตญ์ด ๋ฌธ์ ์์ฑ์์ฝ AI ๊ฒฝ์ง๋ํ [Dataset](https://dacon.io/competitions/official/235673/overview/description) |
|
- Training: 29,432 |
|
- Validation: 7,358 |
|
- Test: 9,182 |
|
|
|
| | #Param | rouge-1 |rouge-2|rouge-l| |
|
|-------|--------:|--------:|--------:|--------:| |
|
| pko-t5-small | 95M | 51.48 | 33.18 | 44.96 | |
|
| NewsT5-small | 61M | 52.15 | 33.59 | 45.41 | |
|
|
|
- AI-Hub ๋ฌธ์์์ฝ ํ
์คํธ [Dataset](https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=realm&dataSetSn=97) |
|
- Training: 245,626 |
|
- Validation: 20,296 |
|
- Test: 9,931 |
|
|
|
| | #Param | rouge-1 |rouge-2|rouge-l| |
|
|-------|--------:|--------:|--------:|--------:| |
|
| pko-t5-small | 95M | 53.44 | 34.03 | 45.36 | |
|
| NewsT5-small | 61M | 53.74 | 34.27 | 45.52 | |
|
|
|
- [pko-t5-small](https://github.com/paust-team/pko-t5) |