---
language:
- ko
library_name: transformers
pipeline_tag: token-classification
---
# Korean Spacing Model
A Korean word-spacing model built on a Korean RoBERTa model.

<a href="https://github.com/kwon13/robust-spacing">
  <img src="https://img.shields.io/badge/GitHub-181717?style=flat-square&logo=GitHub&logoColor=white"/>
</a>   

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer, AutoConfig

tokenizer = AutoTokenizer.from_pretrained("fiveflow/roberta-base-spacing")
roberta = AutoModelForTokenClassification.from_pretrained("fiveflow/roberta-base-spacing")

org_text = "ํƒ„์†Œ์ค‘๋ฆฝ๊ณผESG๊ฒฝ์˜์—๋Œ€ํ•œ์‚ฌํšŒ์ ์š”๊ตฌํ™•๋Œ€".replace(" ", "") # ๊ณต๋ฐฑ์ œ๊ฑฐ
label = ["UNK", "PAD", "O", "B", "I", "E", "S"]
# char ๋‹จ์œ„๋กœ ํ† ํฐํ™”
token_list = [tokenizer.cls_token_id]
for char in org_text:
    token_list.append(tokenizer.encode(char)[1]) 
token_list.append(tokenizer.eos_token_id)
tkd = torch.tensor(token_list).unsqueeze(0)

output = roberta(tkd).logits

_, pred_idx = torch.max(output, dim=2)
tags = [label[idx] for idx in pred_idx.squeeze()][1:-1]
pred_sent = ""
for char_idx, spc_idx in enumerate(pred_idx.squeeze()[1:-1]):
    # "E" tag ๋‹จ์œ„๋กœ ๋„์–ด์“ฐ๊ธฐ
    if label[spc_idx] == "E": pred_sent += org_text[char_idx] + " "
    else: pred_sent += org_text[char_idx]

print(pred_sent.strip())
# 'ํƒ„์†Œ์ค‘๋ฆฝ๊ณผ ESG ๊ฒฝ์˜์— ๋Œ€ํ•œ ์‚ฌํšŒ์  ์š”๊ตฌ ํ™•๋Œ€'
```

```bibtex
@misc{park2021klue,
      title={KLUE: Korean Language Understanding Evaluation},
      author={Sungjoon Park and Jihyung Moon and Sungdong Kim and Won Ik Cho and Jiyoon Han and Jangwon Park and Chisung Song and Junseong Kim and Yongsook Song and Taehwan Oh and Joohong Lee and Juhyun Oh and Sungwon Lyu and Younghoon Jeong and Inkwon Lee and Sangwoo Seo and Dongjun Lee and Hyunwoo Kim and Myeonghwa Lee and Seongbo Jang and Seungwon Do and Sunkyoung Kim and Kyungtae Lim and Jongwon Lee and Kyumin Park and Jamin Shin and Seonghyun Kim and Lucy Park and Alice Oh and Jungwoo Ha and Kyunghyun Cho},
      year={2021},
      eprint={2105.09680},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

```