nahiar committed
Commit d05b9c9 · verified · 1 parent: f256a7f

Upload folder using huggingface_hub

Files changed (7)
  1. README.md +193 -3
  2. config.json +40 -0
  3. model.safetensors +3 -0
  4. special_tokens_map.json +37 -0
  5. tokenizer.json +0 -0
  6. tokenizer_config.json +67 -0
  7. vocab.txt +0 -0
README.md CHANGED
@@ -1,3 +1,193 @@
- ---
- license: mit
- ---
+ ---
+ language:
+ - id
+ license: mit
+ tags:
+ - text-classification
+ - bert
+ - spam-detection
+ - indonesian
+ - twitter
+ - retrained
+ datasets:
+ - nahiar/spam_detection_v2
+ pipeline_tag: text-classification
+ inference: true
+ base_model: nahiar/spam-detection-bert-v1
+ model_type: bert
+ library_name: transformers
+ widget:
+ - text: "lacak hp hilang by no hp / imei lacak penipu/scammer/tabrak lari/terror/revengeporn sadap / hack / pulihkan akun"
+   example_title: "Spam Example"
+ - text: "Senin, 21 Juli 2025, Samapta Polsek Ngaglik melaksanakan patroli stasioner balong jalan palagan donoharjo"
+   example_title: "Ham Example"
+ - text: "Mari berkontribusi terhadap gerakan rakyat dengan membeli baju ini seharga Rp 160.000. Hubungi kami melalui WA 08977472296"
+   example_title: "Obvious Spam"
+ model-index:
+ - name: spam-detection-bert
+   results:
+   - task:
+       type: text-classification
+       name: Text Classification
+     dataset:
+       name: Indonesian Spam Detection Dataset v2
+       type: nahiar/spam_detection_v2
+     metrics:
+     - name: Accuracy
+       type: accuracy
+       value: 0.99
+     - name: F1 Score (Weighted)
+       type: f1
+       value: 0.99
+     - name: Precision (HAM)
+       type: precision
+       value: 0.99
+     - name: Recall (HAM)
+       type: recall
+       value: 1.00
+     - name: Precision (SPAM)
+       type: precision
+       value: 1.00
+     - name: Recall (SPAM)
+       type: recall
+       value: 0.83
+ ---
+
+ # Indonesian Spam Detection BERT
+
+ A BERT model for spam detection in Indonesian with **99%** accuracy. The model has been retrained on an updated, manually re-labelled dataset for optimal performance on Indonesian content.
+
+ ## Quick Start
+
+ ```python
+ from transformers import pipeline
+
+ # The easiest way to use the model
+ classifier = pipeline(
+     "text-classification",
+     model="nahiar/spam-detection-bert",
+     tokenizer="nahiar/spam-detection-bert",
+ )
+
+ # Try a few example texts
+ texts = [
+     "lacak hp hilang by no hp / imei lacak penipu/scammer/tabrak lari/terror/revengeporn sadap / hack / pulihkan akun",
+     "Senin, 21 Juli 2025, Samapta Polsek Ngaglik melaksanakan patroli stasioner balong jalan palagan donoharjo",
+     "Mari berkontribusi terhadap gerakan rakyat dengan membeli baju ini seharga Rp 160.000. Hubungi kami melalui WA 08977472296",
+ ]
+
+ results = classifier(texts)
+ for text, result in zip(texts, results):
+     print(f"Text: {text}")
+     print(f"Result: {result['label']} (confidence: {result['score']:.4f})")
+     print("---")
+ ```
+
+ ## Model Details
+
+ - **Base Model**: nahiar/spam-detection-bert-v1 (fine-tuned from cahya/bert-base-indonesian-1.5G)
+ - **Task**: Binary Text Classification (Spam vs Ham)
+ - **Language**: Indonesian (Bahasa Indonesia)
+ - **Model Size**: ~110M parameters
+ - **Max Sequence Length**: 512 tokens
+ - **Training Epochs**: 3
+ - **Batch Size**: 16
+ - **Learning Rate**: 2e-5
+
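The ~110M parameter figure can be checked directly once the weights are downloaded. A quick illustrative check (not part of the model card itself):

```python
from transformers import AutoModelForSequenceClassification

# Count the parameters of the published checkpoint.
model = AutoModelForSequenceClassification.from_pretrained("nahiar/spam-detection-bert")
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")  # expected to land near the ~110M quoted above
```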
+ ## Performance
+
+ | Metric               | HAM  | SPAM | Overall |
+ | -------------------- | ---- | ---- | ------- |
+ | Precision            | 99%  | 100% | 99%     |
+ | Recall               | 100% | 83%  | 99%     |
+ | F1-Score             | 99%  | 91%  | 99%     |
+ | **Overall Accuracy** | -    | -    | **99%** |
+
+ ### Confusion Matrix
+
+ - True HAM correctly predicted: 430/430 (100%)
+ - True SPAM correctly predicted: 25/30 (83%)
+ - False Positives (HAM predicted as SPAM): 0
+ - False Negatives (SPAM predicted as HAM): 5
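The per-class numbers in the table above follow directly from these counts. A small arithmetic check, with the counts copied from the list:

```python
# Confusion-matrix counts from the model card (positive class = SPAM).
tp, fn = 25, 5     # true SPAM predicted as SPAM / as HAM
tn, fp = 430, 0    # true HAM predicted as HAM / as SPAM

precision_spam = tp / (tp + fp)                   # 25 / 25   = 1.00
recall_spam = tp / (tp + fn)                      # 25 / 30   ≈ 0.83
precision_ham = tn / (tn + fn)                    # 430 / 435 ≈ 0.99
recall_ham = tn / (tn + fp)                       # 430 / 430 = 1.00
accuracy = (tp + tn) / (tp + tn + fp + fn)        # 455 / 460 ≈ 0.989
f1_spam = 2 * precision_spam * recall_spam / (precision_spam + recall_spam)  # ≈ 0.91

print(f"SPAM: precision={precision_spam:.2f} recall={recall_spam:.2f} f1={f1_spam:.2f}")
print(f"HAM:  precision={precision_ham:.2f} recall={recall_ham:.2f}")
print(f"accuracy={accuracy:.3f}")
```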
+
+ ## Dataset
+
+ This v2 model was retrained on an updated, manually re-labelled dataset:
+
+ - **Dataset**: spam_re_labelled_vNew.csv
+ - **Total Samples**: 460 messages
+ - **Distribution**: 430 HAM, 30 SPAM
+ - **Encoding**: Latin-1
+ - **Quality**: Manual re-labelling for higher label accuracy
+
+ **Updated**: January 2025
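The re-labelled data is published on the Hub as `nahiar/spam_detection_v2` (referenced in the metadata above). A minimal loading sketch using the `datasets` library; the `train` split name and the column layout are assumptions, so inspect the dataset before relying on them:

```python
from datasets import load_dataset

# Assumption: the dataset exposes a default "train" split.
ds = load_dataset("nahiar/spam_detection_v2", split="train")
print(ds)       # check the actual columns and number of rows
print(ds[0])    # look at one example before assuming a schema
```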
+
+ ## Key Features
+
+ ✅ **Re-trained** on a manually re-labelled dataset
+ ✅ **High accuracy** (99%) on spam detection in an Indonesian context
+ ✅ **Better handling** of messages with complex formatting
+ ✅ **Enhanced performance** on text that mixes formal and informal language
+ ✅ **Optimized** for Indonesian social-media content
+
+ ## Label Mapping
+
+ ```
+ 0: "HAM" (not spam)
+ 1: "SPAM" (spam)
+ ```
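The same mapping ships in `config.json` as `id2label`/`label2id` (with lowercase `ham`/`spam` strings), so it can be read from the loaded model instead of being hard-coded. For example:

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("nahiar/spam-detection-bert")
print(model.config.id2label)   # {0: 'ham', 1: 'spam'}
print(model.config.label2id)   # {'ham': 0, 'spam': 1}
```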
+
+ ## Training Process
+
+ The model was retrained with the following configuration; an illustrative training sketch follows the list:
+
+ - **Optimizer**: AdamW
+ - **Learning Rate**: 2e-5
+ - **Epochs**: 3
+ - **Batch Size**: 16
+ - **Max Length**: 128 tokens
+ - **Train/Validation Split**: 80/20
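For reference, here is a minimal retraining sketch built around the Hugging Face `Trainer` with the hyperparameters listed above. This is not the author's original training script: the dataset column names (`text`, `label`) and the use of `Trainer` itself are assumptions.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Assumed schema: a "text" column and an integer "label" column (0 = ham, 1 = spam).
raw = load_dataset("nahiar/spam_detection_v2", split="train")
splits = raw.train_test_split(test_size=0.2, seed=42)  # 80/20 split as stated in the card

tokenizer = AutoTokenizer.from_pretrained("nahiar/spam-detection-bert-v1")
model = AutoModelForSequenceClassification.from_pretrained(
    "nahiar/spam-detection-bert-v1", num_labels=2
)

def tokenize(batch):
    # Max length 128 as stated in the card; padding is handled dynamically by the collator.
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = splits.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="spam-detection-bert-v2",
    learning_rate=2e-5,               # from the card
    num_train_epochs=3,               # from the card
    per_device_train_batch_size=16,   # from the card
    per_device_eval_batch_size=16,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    tokenizer=tokenizer,  # `processing_class=tokenizer` on newer transformers releases
)
trainer.train()
```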
+
+ ## Usage Example
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+
+ # Load the model and tokenizer
+ tokenizer = AutoTokenizer.from_pretrained("nahiar/spam-detection-bert")
+ model = AutoModelForSequenceClassification.from_pretrained("nahiar/spam-detection-bert")
+ model.eval()
+
+ def predict_spam(text):
+     inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
+     with torch.no_grad():
+         outputs = model(**inputs)
+     probs = torch.softmax(outputs.logits, dim=1)
+     predicted_label = torch.argmax(probs, dim=1).item()
+     confidence = probs[0][predicted_label].item()
+     label_map = {0: "HAM", 1: "SPAM"}
+     return label_map[predicted_label], confidence
+
+ # Test
+ text = "Dapatkan uang dengan mudah! Klik link ini sekarang!"
+ result, confidence = predict_spam(text)
+ print(f"Prediction: {result} (Confidence: {confidence:.4f})")
+ ```
+
+ ## Citation
+
+ ```bibtex
+ @misc{nahiar_spam_detection_bert,
+   title={Indonesian Spam Detection BERT},
+   author={Raihan Hidayatullah Djunaedi},
+   year={2025},
+   url={https://huggingface.co/nahiar/spam-detection-bert}
+ }
+ ```
+
+ ## Changelog
+
+ ### Current Version (January 2025)
+
+ - Re-trained the model on a manually re-labelled dataset
+ - Enhanced handling of complex Indonesian content
+ - Better spam detection performance in local Indonesian contexts
+ - Optimized for social-media content (Twitter, Instagram, etc.)
+ - Improved accuracy with a more balanced dataset distribution
config.json ADDED
@@ -0,0 +1,40 @@
+ {
+   "architectures": ["BertForSequenceClassification"],
+   "attention_probs_dropout_prob": 0.1,
+   "classifier_dropout": null,
+   "finetuning_task": "text-classification",
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "id2label": {
+     "0": "ham",
+     "1": "spam"
+   },
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "label2id": {
+     "ham": 0,
+     "spam": 1
+   },
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "pipeline_tag": "text-classification",
+   "position_embedding_type": "absolute",
+   "problem_type": "single_label_classification",
+   "task_specific_params": {
+     "text-classification": {
+       "num_labels": 2,
+       "problem_type": "single_label_classification"
+     }
+   },
+   "torch_dtype": "float32",
+   "transformers_version": "4.53.3",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 32000
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8a14db256163318a863b779fa72b2cd497ad6dd4bccb6c9f478bdfdf9d69b298
+ size 442499064
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
+ {
+   "cls_token": {
+     "content": "[CLS]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "[MASK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "[PAD]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "[SEP]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "[UNK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,67 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "4": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "do_basic_tokenize": true,
+   "do_lower_case": true,
+   "extra_special_tokens": {},
+   "full_tokenizer_file": null,
+   "mask_token": "[MASK]",
+   "max_length": 128,
+   "model_max_length": 512,
+   "never_split": null,
+   "pad_to_multiple_of": null,
+   "pad_token": "[PAD]",
+   "pad_token_type_id": 0,
+   "padding_side": "right",
+   "pipeline_tag": "text-classification",
+   "sep_token": "[SEP]",
+   "stride": 0,
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "truncation_side": "right",
+   "truncation_strategy": "longest_first",
+   "unk_token": "[UNK]"
+ }
vocab.txt ADDED
The diff for this file is too large to render. See raw diff