# Bad_text_classifier

## Model Introduction
This repository publishes a model that classifies whether comments and chat messages found across the internet contain sensitive content. The model was fine-tuned on a dataset built by revising the labels of public datasets and merging them. Please understand that the model cannot always judge every sentence correctly.
```
NOTE)
Because of copyright restrictions on the public datasets, the modified data used to train the model cannot be released.
Also, please note in advance that the model's opinions are unrelated to my own.
```

## Dataset
### data label
* **0 : bad sentence**
* **1 : not bad sentence**
### Datasets used
* [smilegate-ai/Korean Unsmile Dataset](https://github.com/smilegate-ai/korean_unsmile_dataset)
* [kocohub/Korean HateSpeech Dataset](https://github.com/kocohub/korean-hate-speech)
### Dataset processing
Neither dataset was originally labeled for binary classification, so both were first relabeled into binary form. Then only the label 1 (not bad sentence) rows of the Korean HateSpeech Dataset were selected and merged into the processed Korean Unsmile Dataset.
</br>

**Some rows labeled clean in the Korean Unsmile Dataset were changed to 0 (bad sentence).**
* Sentences containing "~노" that also include "이기" or "노무" were changed to 0 (bad sentence)
* Rows containing sexual slang such as "좆" or "봊" were changed to 0 (bad sentence)
</br></br>
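
The processing steps above can be sketched roughly as follows. This is only an illustration, not the code used to build the training data: it assumes each row is a `(text, label)` tuple, that matching is plain substring search (the README does not say how matching was done), and the function names `merge_datasets` and `relabel_clean_rows` are hypothetical.

```python
from typing import List, Sequence, Tuple

Row = Tuple[str, int]

# Label convention from the "data label" section
BAD, NOT_BAD = 0, 1


def merge_datasets(unsmile_rows: List[Row], hate_speech_rows: List[Row]) -> List[Row]:
    """Keep only NOT_BAD rows from the HateSpeech data and append them to the Unsmile data."""
    kept = [(text, label) for text, label in hate_speech_rows if label == NOT_BAD]
    return list(unsmile_rows) + kept


def relabel_clean_rows(
    rows: List[Row],
    ending: str = "노",
    co_terms: Sequence[str] = ("이기", "노무"),
    explicit_terms: Sequence[str] = ("좆", "봊"),
) -> List[Row]:
    """Return a copy of rows where NOT_BAD rows matching either rule become BAD.

    Rule 1: the text contains `ending` and at least one of `co_terms`.
    Rule 2: the text contains at least one of `explicit_terms`.
    Substring matching is an assumption made for this sketch.
    """
    relabeled = []
    for text, label in rows:
        if label == NOT_BAD:
            rule_1 = ending in text and any(term in text for term in co_terms)
            rule_2 = any(term in text for term in explicit_terms)
            if rule_1 or rule_2:
                label = BAD
        relabeled.append((text, label))
    return relabeled
```

Rows already labeled 0 are left untouched, so the relabeling only ever moves rows from 1 to 0, which matches the direction described above.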

## Model Training
* Fine-tuning was performed with `ElectraForSequenceClassification` from huggingface transformers.
* Three publicly available Korean Electra models were used, and each was fine-tuned separately.
### Models used
* [Beomi/KcELECTRA](https://github.com/Beomi/KcELECTRA)
* [monologg/KoELECTRA](https://github.com/monologg/KoELECTRA)
* [tunib/electra-ko-base](https://huggingface.co/tunib/electra-ko-base)

### how to train?
```BASH
python codes/model_source/train_torch_sch.py \
--learning_rate=3e-06 \
--use_float_16=True \
--weight-decay=0.001 \
--base_save_ckpt_path=BASE_SAVE_CHPT_PATH \
--epochs=10 \
--batch_size=128 \
--model_type=MODEL_TYPE
```
### parameters
| parameter | type | description | default |
| ---------- | ---------- | ---------- | --------- |
| learning_rate | float | learning rate used for training | 5e-05 |
| use_float_16 | bool | whether to apply float 16 | False |
| weight_decay | float | weight decay lambda | None |
| base_ckpt_save_path | str | base path where trained checkpoints are saved | None |
| epochs | int | total number of training epochs | 5 |
| batch_size | int | batch size used during training | 64 |
| model_type | int | selects which Electra model is used for training | 0 |
```
NOTE) The train and valid datasets can be specified in the config section of train_torch_sch.py.
```
</br>
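
As a reading aid for the parameter table, here is a minimal sketch of how flags with these types and defaults could be declared with `argparse`. It is not taken from `train_torch_sch.py`; `build_parser` and `str_to_bool` are hypothetical names. The flag spellings follow the table, while the training command above writes two of them as `--weight-decay` and `--base_save_ckpt_path`, so check the script for the spelling it actually accepts.

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Parse 'True'/'False' style strings.

    argparse's type=bool would treat any non-empty string, including
    'False', as True, so a small converter is used instead.
    """
    lowered = value.lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    """Declare the flags from the parameter table with their listed defaults."""
    parser = argparse.ArgumentParser(description="Hypothetical training CLI sketch")
    parser.add_argument("--learning_rate", type=float, default=5e-05)
    parser.add_argument("--use_float_16", type=str_to_bool, default=False)
    parser.add_argument("--weight_decay", type=float, default=None)
    parser.add_argument("--base_ckpt_save_path", type=str, default=None)
    parser.add_argument("--epochs", type=int, default=5)
    parser.add_argument("--batch_size", type=int, default=64)
    parser.add_argument("--model_type", type=int, default=0)
    return parser
```

Calling `build_parser().parse_args(["--use_float_16=True", "--epochs=10"])` yields a namespace in which unspecified flags keep the defaults from the table.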

## How to use model?
```PYTHON
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the published fine-tuned checkpoint and its tokenizer from the Hugging Face Hub
model = AutoModelForSequenceClassification.from_pretrained('JminJ/kcElectra_base_Bad_Sentence_Classifier')
tokenizer = AutoTokenizer.from_pretrained('JminJ/kcElectra_base_Bad_Sentence_Classifier')
```
</br>
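
Once the model and tokenizer are loaded, a single sentence could be classified roughly as sketched below. This is an illustration rather than the repository's `predict.py`: the helper names are hypothetical, and it assumes the checkpoint's output indices follow the README's label convention (0 = bad sentence, 1 = not bad sentence).

```python
import math
from typing import List, Tuple

# Label convention from the "data label" section (assumed to match the checkpoint's output order)
ID_TO_LABEL = {0: "bad sentence", 1: "not bad sentence"}


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_from_logits(logits: List[float]) -> Tuple[str, float]:
    """Return the predicted label name and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID_TO_LABEL[best], probs[best]


def classify(text: str, model_name: str = "JminJ/kcElectra_base_Bad_Sentence_Classifier") -> Tuple[str, float]:
    """Download the checkpoint, run one sentence through it, and return (label, probability)."""
    # Imported lazily so the helpers above work without transformers/torch installed.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()  # disable dropout for inference
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()
    return label_from_logits(logits)
```

Calling `classify("some sentence")` requires `torch`, `transformers`, and network access to fetch the checkpoint. As the introduction notes, the prediction will not always be correct.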
|
69 |
+
|
70 |
+
## Predict model
|
71 |
+
์ฌ์ฉ์๊ฐ ํ
์คํธ ํด๋ณด๊ณ ์ถ์ ๋ฌธ์ฅ์ ๋ฃ์ด predict๋ฅผ ์ํํด ๋ณผ ์ ์์ต๋๋ค.
|
72 |
+
```BASH
|
73 |
+
python codes/model_source/utils/predict.py \
|
74 |
+
--input_text=INPUT_TEXT \
|
75 |
+
--base_ckpt=BASE_CKPT
|
76 |
+
```
|
77 |
+
### parameters
|
78 |
+
| parameter | type | description | default |
|
79 |
+
| ---------- | ---------- | ---------- | --------- |
|
80 |
+
| input_text | str | user input text | "๋ฐ๊ฐ์ต๋๋ค. JminJ์
๋๋ค!" |
|
81 |
+
| base_ckpt | str | base path that saved trained checkpoints | False |
|
82 |
+
</br>

## Model Valid Accuracy
| model | accuracy |
| ---------- | ---------- |
| kcElectra_base_fp16_wd_custom_dataset | 0.8849 |
| tunibElectra_base_fp16_wd_custom_dataset | 0.8726 |
| koElectra_base_fp16_wd_custom_dataset | 0.8434 |
```
Note)
All models were trained with the same seed, learning_rate (3e-06), weight_decay lambda (0.001), and batch_size (128).
```
</br>

## Contact
</br></br>

## Github
* https://github.com/JminJ/Bad_text_classifier
</br></br>

## Reference
* [Beomi/KcELECTRA](https://github.com/Beomi/KcELECTRA)
* [monologg/KoELECTRA](https://github.com/monologg/KoELECTRA)
* [tunib/electra-ko-base](https://huggingface.co/tunib/electra-ko-base)
* [smilegate-ai/Korean Unsmile Dataset](https://github.com/smilegate-ai/korean_unsmile_dataset)
* [kocohub/Korean HateSpeech Dataset](https://github.com/kocohub/korean-hate-speech)
* [ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators](https://arxiv.org/abs/2003.10555)