---
library_name: transformers
tags:
- text-classification
- spam-detection
- sms
- bert
- multilingual
datasets:
- sms-spam-cleaned-dataset
language:
- ko
base_model: bert-base-multilingual-cased
model_architecture: bert
license: apache-2.0
---
# SMS Spam Classifier
The training data was built by manually curating Korean SMS messages. If you are interested in the dataset, please get in touch.
This model is a **multilingual BERT-based model** fine-tuned for SMS spam detection. It classifies SMS messages as either **ham (not spam)** or **spam**. It was trained on top of the **`bert-base-multilingual-cased`** model from the Hugging Face Transformers library.
---
## ๋ชจ๋ธ ์„ธ๋ถ€์ •๋ณด
- **๊ธฐ๋ณธ ๋ชจ๋ธ**: `bert-base-multilingual-cased`
- **ํƒœ์Šคํฌ**: ๋ฌธ์žฅ ๋ถ„๋ฅ˜(Sequence Classification)
- **์ง€์› ์–ธ์–ด**: ๋‹ค๊ตญ์–ด
- **๋ผ๋ฒจ ์ˆ˜**: 2 (`ham`, `spam`)
- **๋ฐ์ดํ„ฐ์…‹**: ํด๋ฆฐ๋œ SMS ์ŠคํŒธ ๋ฐ์ดํ„ฐ์…‹
---
## ๋ฐ์ดํ„ฐ์…‹ ์ •๋ณด
ํ›ˆ๋ จ ๋ฐ ํ‰๊ฐ€์— ์‚ฌ์šฉ๋œ ๋ฐ์ดํ„ฐ์…‹์€ `ham`(๋น„์ŠคํŒธ) ๋˜๋Š” `spam`(์ŠคํŒธ)์œผ๋กœ ๋ผ๋ฒจ๋ง๋œ SMS ๋ฉ”์‹œ์ง€๋ฅผ ํฌํ•จํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ๋ฐ์ดํ„ฐ๋Š” ์ „์ฒ˜๋ฆฌ๋ฅผ ๊ฑฐ์นœ ํ›„ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋ถ„๋ฆฌ๋˜์—ˆ์Šต๋‹ˆ๋‹ค:
- **ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ**: 80%
- **๊ฒ€์ฆ ๋ฐ์ดํ„ฐ**: 20%
---
## Training Configuration
- **Learning rate**: 2e-5
- **Batch size**: 8 (per device)
- **Epochs**: 1
- **Evaluation strategy**: per epoch
- **Tokenizer**: `bert-base-multilingual-cased`
The model was fine-tuned efficiently with the Hugging Face `Trainer` API; a configuration sketch is shown below.
---
## Usage
The model can be used directly with the Hugging Face Transformers library:
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the fine-tuned model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("blockenters/sms-spam-classifier")
model = AutoModelForSequenceClassification.from_pretrained("blockenters/sms-spam-classifier")
model.eval()

# Sample input ("Congratulations! You have received a free trip to Bali. Reply WIN.")
text = "์ถ•ํ•˜ํ•ฉ๋‹ˆ๋‹ค! ๋ฌด๋ฃŒ ๋ฐœ๋ฆฌ ์—ฌํ–‰ ํ‹ฐ์ผ“์„ ๋ฐ›์œผ์…จ์Šต๋‹ˆ๋‹ค. WIN์ด๋ผ๊ณ  ํšŒ์‹ ํ•˜์„ธ์š”."

# Tokenize and run inference
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=128)
with torch.no_grad():
    outputs = model(**inputs)
predictions = outputs.logits.argmax(dim=-1)

# Decode the prediction
label_map = {0: "ham", 1: "spam"}
print(f"Prediction: {label_map[predictions.item()]}")
```
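Alternatively, the `pipeline` API wraps the same steps in a couple of lines. Note that the returned label string depends on the `id2label` mapping stored in the model config, so it may appear as `LABEL_0`/`LABEL_1` rather than `ham`/`spam`:

```python
from transformers import pipeline

# Build a text-classification pipeline from the published checkpoint
classifier = pipeline("text-classification", model="blockenters/sms-spam-classifier")

result = classifier("์ถ•ํ•˜ํ•ฉ๋‹ˆ๋‹ค! ๋ฌด๋ฃŒ ๋ฐœ๋ฆฌ ์—ฌํ–‰ ํ‹ฐ์ผ“์„ ๋ฐ›์œผ์…จ์Šต๋‹ˆ๋‹ค. WIN์ด๋ผ๊ณ  ํšŒ์‹ ํ•˜์„ธ์š”.")
print(result)  # e.g. [{'label': ..., 'score': ...}]
```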