---
library_name: transformers
tags:
- text-classification
- spam-detection
- sms
- bert
- multilingual
datasets:
- sms-spam-cleaned-dataset
language:
- ko
base_model: bert-base-multilingual-cased
model_architecture: bert
license: apache-2.0
---

# SMS Spam Classifier

The training data was built by hand from Korean-language SMS messages that were processed directly. If you would like to know more about the dataset, please get in touch.

This model is a **multilingual BERT-based model** fine-tuned for SMS spam detection. It classifies SMS messages as either **ham (not spam)** or **spam**. It was trained on top of the **`bert-base-multilingual-cased`** model from the Hugging Face Transformers library.

---

## Model Details

- **Base model**: `bert-base-multilingual-cased`
- **Task**: Sequence classification
- **Supported languages**: Multilingual
- **Number of labels**: 2 (`ham`, `spam`); see the sketch after this list
- **Dataset**: Cleaned SMS spam dataset

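The index-to-label mapping used throughout this card is `0 = ham`, `1 = spam` (see the `label_map` in the usage example below). As a quick sanity check, the classification head of the published checkpoint can be inspected as in the sketch below; note that the stored `id2label` entries may only contain the generic `LABEL_0`/`LABEL_1` names if explicit label names were not written into the config at training time.

```python
from transformers import AutoConfig

# Inspect the classification head of the published checkpoint.
config = AutoConfig.from_pretrained("blockenters/sms-spam-classifier")
print(config.num_labels)  # 2
print(config.id2label)    # may show generic LABEL_0 / LABEL_1 names; they correspond to ham / spam
```
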
---

## Dataset Information

The dataset used for training and evaluation contains SMS messages labeled as `ham` (not spam) or `spam`. After preprocessing, the data was split as follows (one way to produce such a split is sketched after the list):

- **Training data**: 80%
- **Validation data**: 20%

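The cleaned dataset itself is not published, so the exact split procedure is not part of this card. The sketch below shows one way such an 80/20 split can be produced with the `datasets` library, assuming the cleaned messages live in a local CSV file (`sms_spam_cleaned.csv` is a hypothetical name) with `text` and `label` columns (0 = ham, 1 = spam):

```python
from datasets import load_dataset

# Hypothetical file name and column layout; the actual cleaned SMS dataset is not public.
dataset = load_dataset("csv", data_files="sms_spam_cleaned.csv")["train"]

# 80% training / 20% validation split, as described above.
splits = dataset.train_test_split(test_size=0.2, seed=42)
train_ds, eval_ds = splits["train"], splits["test"]
print(len(train_ds), len(eval_ds))
```
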
---

## Training Settings

- **Learning rate**: 2e-5
- **Batch size**: 8 (per device)
- **Epochs**: 1
- **Evaluation strategy**: per epoch
- **Tokenizer**: `bert-base-multilingual-cased`

The model was fine-tuned efficiently with Hugging Face's `Trainer` API.

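The exact training script is not included in this card; the following is a minimal sketch of a `Trainer` setup that matches the hyperparameters listed above, assuming `train_ds` and `eval_ds` are the 80/20 splits from the previous section with `text` and integer `label` columns (0 = ham, 1 = spam):

```python
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

base_model = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=2)

def tokenize(batch):
    # Same truncation length as the inference example below.
    return tokenizer(batch["text"], truncation=True, max_length=128)

train_tokenized = train_ds.map(tokenize, batched=True)
eval_tokenized = eval_ds.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="sms-spam-classifier",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=1,
    eval_strategy="epoch",  # called evaluation_strategy in older transformers releases
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_tokenized,
    eval_dataset=eval_tokenized,
    tokenizer=tokenizer,  # enables dynamic padding through the default data collator
)
trainer.train()
```
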
---

## Usage

The model can be used directly with the Hugging Face Transformers library:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("blockenters/sms-spam-classifier")
model = AutoModelForSequenceClassification.from_pretrained("blockenters/sms-spam-classifier")

# Sample input (a Korean spam message: "Congratulations! You've received a free Bali trip ticket. Please reply WIN.")
text = "축하합니다! 무료 발리 여행 티켓을 받으셨습니다. WIN이라고 답장해 주세요."

# Tokenize and predict
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=128)
outputs = model(**inputs)
predictions = outputs.logits.argmax(dim=-1)

# Decode the prediction
label_map = {0: "ham", 1: "spam"}
print(f"Prediction: {label_map[predictions.item()]}")
```
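For quick experiments, the same checkpoint can also be called through the `pipeline` API. This is a minimal sketch, assuming the repository name used above; note that the label names in the output come from the `id2label` mapping stored in the model config, so they may appear as `LABEL_0`/`LABEL_1` (corresponding to `ham`/`spam` in the mapping above) if explicit names were not saved:

```python
from transformers import pipeline

# Text-classification pipeline over the same checkpoint.
classifier = pipeline("text-classification", model="blockenters/sms-spam-classifier")

# Returns a list like [{"label": ..., "score": ...}].
print(classifier("내일 회의 시간 확인 부탁드립니다."))  # a normal (ham) message: "Please confirm tomorrow's meeting time."
```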