---
language: en
tags:
- toxic-content
- text-classification
- keras
- tensorflow
- deep-learning
- safety
- multiclass
license: mit
datasets:
- custom
metrics:
- accuracy
- f1
pipeline_tag: text-classification
model-index:
- name: Toxic_Classification
  results: []
---

# Toxic-Predict

Toxic-Predict is a machine learning project developed as part of the Cellula Internship, focused on safe and responsible multi-modal toxic content moderation. It classifies text queries and image descriptions into nine toxicity categories, including "Safe", "Violent Crimes", "Non-Violent Crimes", and "Unsafe". The project leverages deep learning (Keras/TensorFlow), NLP preprocessing, and benchmarking against modern transformer models to build and evaluate a robust multi-class toxic content classifier.

---

## 🚩 Project Context

This project is part of the **Cellula Internship** proposal:
**"Safe and Responsible Multi-Modal Toxic Content Moderation"**
The goal is to build a dual-stage moderation pipeline for both text and images, combining hard guardrails (Llama Guard) and soft classification (DistilBERT/deep learning) for nuanced, policy-compliant moderation. A sketch of this two-stage flow appears below.
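
As a rough illustration only, here is a minimal sketch of the dual-stage flow; the function names are hypothetical and both stages are stubbed, since the actual guardrail and classifier live outside this README.

```python
# Hypothetical sketch of the dual-stage moderation flow (stubs, not project code).

def hard_guardrail(text: str, image_desc: str) -> bool:
    """Stage 1: hard filter (e.g. Llama Guard); True means clearly unsafe."""
    return False  # stub: call the guardrail model here

def soft_classifier(text: str, image_desc: str) -> dict:
    """Stage 2: soft multi-class classifier (DistilBERT/CNN/LSTM)."""
    return {"label": "Safe", "score": 1.0}  # stub: run the trained model here

def moderate(text: str, image_desc: str = "") -> dict:
    if hard_guardrail(text, image_desc):          # hard block wins first
        return {"label": "Unsafe", "score": 1.0}
    return soft_classifier(text, image_desc)      # nuanced 9-way classification
```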

---

40
+ ## Features
41
+
42
+ - Dual-stage moderation: hard filter (Llama Guard) + soft classifier (DistilBERT/CNN/LSTM)
43
+ - Data cleaning, preprocessing, and label encoding
44
+ - Tokenization and sequence padding for text data
45
+ - Deep learning and transformer-based models for multi-class toxicity classification
46
+ - Evaluation metrics: classification report and confusion matrix
47
+ - Jupyter notebooks for data exploration and model development
48
+ - Streamlit web app for demo and deployment
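
To make the tokenization bullet concrete, here is a minimal Keras sketch; the vocabulary size and sequence length are illustrative assumptions, not the project's actual settings.

```python
# Keras tokenization + padding sketch (num_words and maxlen are assumed values).
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

texts = ["this is a dangerous post", "a harmless holiday photo"]

tokenizer = Tokenizer(num_words=20000, oov_token="<OOV>")
tokenizer.fit_on_texts(texts)

sequences = tokenizer.texts_to_sequences(texts)                # words -> integer ids
padded = pad_sequences(sequences, maxlen=100, padding="post")  # uniform length
print(padded.shape)  # (2, 100)
```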

---

## Usage

- **Preprocessing and Tokenization:**
  See `notebooks/Preprocessing.ipynb` and `notebooks/tokenization.ipynb` for step-by-step data cleaning, splitting, and tokenization.
- **Model Training:**
  Model architecture and training code are in `models/model.py`.
- **Inference:**
  Load the trained model (`models/toxic_classifier.h5` or `.keras`) and tokenizer (`data/tokenizer.pkl`) to predict toxicity categories for new samples; see the sketch after this list.
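
A minimal local-inference sketch, assuming the pickled object is a Keras `Tokenizer` and the model expects padded integer sequences (the `maxlen` of 100 is an assumption; match it to the training setup):

```python
# Local inference sketch (maxlen is an assumed value).
import pickle
import numpy as np
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.sequence import pad_sequences

model = load_model("models/toxic_classifier.keras")
with open("data/tokenizer.pkl", "rb") as f:
    tokenizer = pickle.load(f)

sample = ["This is a dangerous post. Knife shown in the image."]
seq = pad_sequences(tokenizer.texts_to_sequences(sample), maxlen=100)

probs = model.predict(seq)                # (1, 9) class probabilities
print(int(np.argmax(probs, axis=-1)[0]))  # encoded toxicity category
```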

---

## Data

- CSV files with columns: `query`, `image descriptions`, `Toxic Category`, and `Toxic Category Encoded`.
- Data splits: `train.csv`, `eval.csv`, and `test.csv`, plus `cleaned.csv` for the processed data.
- 9 categories: Safe, Violent Crimes, Elections, Sex-Related Crimes, Unsafe, Non-Violent Crimes, Child Sexual Exploitation, Unknown S-Type, Suicide & Self-Harm. A quick inspection sketch follows.
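
For a quick sanity check on a split, a short pandas sketch (file path and column names as listed above):

```python
# Inspect a data split with pandas.
import pandas as pd

train = pd.read_csv("data/train.csv")
print(train[["query", "image descriptions"]].head())
print(train["Toxic Category"].value_counts())     # per-category counts
print(train["Toxic Category Encoded"].nunique())  # expected: 9
```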

---

## Model

- Deep learning model built with Keras (TensorFlow backend).
- Multi-class classification with label encoding for toxicity categories.
- Benchmarked against PEFT-LoRA DistilBERT and baseline CNN/LSTM models; a representative baseline is sketched below.
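
The actual architecture lives in `models/model.py`; the following is only a representative Keras baseline with assumed hyperparameters, to show the shape of the task (a 9-way softmax over padded sequences).

```python
# Representative baseline sketch (all hyperparameters are illustrative assumptions).
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Embedding(input_dim=20000, output_dim=128),  # assumed vocab/embedding size
    layers.Bidirectional(layers.LSTM(64)),
    layers.Dropout(0.3),
    layers.Dense(9, activation="softmax"),              # 9 toxicity categories
])
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",  # integer-encoded labels
    metrics=["accuracy"],
)
```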

---

## Evaluation

- A classification report and confusion matrix are generated for model evaluation; see the sketch after this list.
- See the evaluation steps in `notebooks/Preprocessing.ipynb`.
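
A minimal scikit-learn sketch of that step; the label arrays here are toy placeholders standing in for the test split:

```python
# Evaluation sketch (toy labels; replace with real test-split values).
from sklearn.metrics import classification_report, confusion_matrix

y_true = [0, 3, 1, 0]   # integer-encoded ground-truth categories
y_pred = [0, 3, 2, 0]   # argmax over the model's predicted probabilities

print(classification_report(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))
```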
85
+
86
+ ---
87
+
88
+ language: en
89
+
## 🤗 Hugging Face Inference

This model is available on the Hugging Face Hub: [NightPrince/Toxic_Classification](https://huggingface.co/NightPrince/Toxic_Classification)

### Inference API Usage

You can use the Hugging Face Inference API or widget with two fields:

- `text`: the main query or post text
- `image_desc`: the image description (if any)

**Example (Python):** the sketch below forwards both fields in one `inputs` payload via a raw POST; the exact payload shape is defined by this repo's custom `pipeline.py`, so adjust if it differs.

```python
import requests

# Hugging Face Inference API endpoint for this model
API_URL = "https://api-inference.huggingface.co/models/NightPrince/Toxic_Classification"
headers = {"Authorization": "Bearer <your_hf_token>"}  # personal access token

payload = {
    "inputs": {
        "text": "This is a dangerous post",
        "image_desc": "Knife shown in the image",
    }
}

response = requests.post(API_URL, headers=headers, json=payload)
print(response.json())  # e.g. {'label': 'Violent Crimes', 'score': 0.98}
```

### Custom Pipeline Details

- The model uses a custom `pipeline.py` for multi-input inference.
- The output is a dictionary with the predicted `label` (class name) and `score` (confidence).
- Class codes are mapped to names using `label_map.json`.

**Files in the repo:**
- `pipeline.py` (custom inference logic)
- `tokenizer.json` (Keras tokenizer)
- `label_map.json` (class code to name mapping)
- TensorFlow SavedModel files (`saved_model.pb`, `variables/`); a local-loading sketch follows.
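
To use these artifacts locally (e.g. after cloning the model repo), a sketch under the assumptions that the SavedModel loads as a Keras model, the label map is keyed by stringified class codes, and `maxlen=100` matches training:

```python
# Local loading sketch (key format and maxlen are assumptions).
import json
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import tokenizer_from_json
from tensorflow.keras.preprocessing.sequence import pad_sequences

model = tf.keras.models.load_model(".")  # reads saved_model.pb + variables/
with open("tokenizer.json") as f:
    tokenizer = tokenizer_from_json(f.read())
with open("label_map.json") as f:
    label_map = json.load(f)             # class code -> class name

seq = pad_sequences(tokenizer.texts_to_sequences(["some post text"]), maxlen=100)
code = int(np.argmax(model.predict(seq), axis=-1)[0])
print(label_map[str(code)])              # assumes string keys in the JSON
```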

**Requirements:**
```
tensorflow
keras
numpy
```

---

## 📚 Resources

- [Cellula Internship Project Proposal](#)
- [BLIP: Bootstrapped Language-Image Pre-training](https://github.com/salesforce/BLIP)
- [Llama Guard](https://llama.meta.com/llama-guard/)
- [DistilBERT](https://huggingface.co/distilbert-base-uncased)
- [Streamlit](https://streamlit.io/)

---

## License

MIT License

---

**Author:** Yahya Muhammad Alnwsany
**Contact:** [email protected]
**Portfolio:** https://nightprincey.github.io/Portfolio/