---
library_name: transformers
tags:
  - text-classification
  - spam-detection
  - sms
  - bert
  - multilingual
datasets:
  - sms-spam-cleaned-dataset
language:
  - ko
base_model: bert-base-multilingual-cased
model_architecture: bert
license: apache-2.0
---


# SMS Spam Classifier

The training data was created by manually curating Korean SMS messages. If you are interested in the dataset, please get in touch.

This model is a **multilingual BERT-based model** fine-tuned for SMS spam detection. It classifies SMS messages as **ham (not spam)** or **spam**. It was trained on top of the **`bert-base-multilingual-cased`** model from the Hugging Face Transformers library.

---

## ๋ชจ๋ธ ์„ธ๋ถ€์ •๋ณด

- **๊ธฐ๋ณธ ๋ชจ๋ธ**: `bert-base-multilingual-cased`
- **ํƒœ์Šคํฌ**: ๋ฌธ์žฅ ๋ถ„๋ฅ˜(Sequence Classification)
- **์ง€์› ์–ธ์–ด**: ๋‹ค๊ตญ์–ด
- **๋ผ๋ฒจ ์ˆ˜**: 2 (`ham`, `spam`)
- **๋ฐ์ดํ„ฐ์…‹**: ํด๋ฆฐ๋œ SMS ์ŠคํŒธ ๋ฐ์ดํ„ฐ์…‹

---

## ๋ฐ์ดํ„ฐ์…‹ ์ •๋ณด

ํ›ˆ๋ จ ๋ฐ ํ‰๊ฐ€์— ์‚ฌ์šฉ๋œ ๋ฐ์ดํ„ฐ์…‹์€ `ham`(๋น„์ŠคํŒธ) ๋˜๋Š” `spam`(์ŠคํŒธ)์œผ๋กœ ๋ผ๋ฒจ๋ง๋œ SMS ๋ฉ”์‹œ์ง€๋ฅผ ํฌํ•จํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ๋ฐ์ดํ„ฐ๋Š” ์ „์ฒ˜๋ฆฌ๋ฅผ ๊ฑฐ์นœ ํ›„ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋ถ„๋ฆฌ๋˜์—ˆ์Šต๋‹ˆ๋‹ค:
- **ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ**: 80%
- **๊ฒ€์ฆ ๋ฐ์ดํ„ฐ**: 20%

---

## Training Configuration

- **Learning rate**: 2e-5
- **Batch size**: 8 (per device)
- **Epochs**: 1
- **Evaluation strategy**: per epoch
- **Tokenizer**: `bert-base-multilingual-cased`

The model was fine-tuned efficiently using Hugging Face's `Trainer` API.

---


## Usage

The model can be used directly with the Hugging Face Transformers library:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("blockenters/sms-spam-classifier")
model = AutoModelForSequenceClassification.from_pretrained("blockenters/sms-spam-classifier")
model.eval()

# Sample input ("Congratulations! You have won a free trip to Bali. Reply WIN.")
text = "์ถ•ํ•˜ํ•ฉ๋‹ˆ๋‹ค! ๋ฌด๋ฃŒ ๋ฐœ๋ฆฌ ์—ฌํ–‰ ํ‹ฐ์ผ“์„ ๋ฐ›์œผ์…จ์Šต๋‹ˆ๋‹ค. WIN์ด๋ผ๊ณ  ํšŒ์‹ ํ•˜์„ธ์š”."

# Tokenize and predict (no gradients needed at inference time)
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=128)
with torch.no_grad():
    outputs = model(**inputs)
predictions = outputs.logits.argmax(dim=-1)

# Decode the prediction
label_map = {0: "ham", 1: "spam"}
print(f"Prediction: {label_map[predictions.item()]}")
```