---
license: mit
language:
- en
library_name: transformers
tags:
- Twitter
- Spam detection
base_model: FacebookAI/xlm-roberta-large
inference: True
---

# Spam detection of Tweets

This model classifies Tweets from X (formerly known as Twitter) into 'Spam' (1) or 'Quality' (0).

## Training Dataset

This model was fine-tuned on [UtkMl's Twitter Spam Detection dataset](https://www.kaggle.com/c/twitter-spam/overview), with [`FacebookAI/xlm-roberta-large`](https://huggingface.co/FacebookAI/xlm-roberta-large) as the base model.

## How to use the model

Here is some starter code for detecting spam in a pandas DataFrame of text-based tweets.

```python
import torch
from datasets import Dataset
from torch.utils.data import DataLoader
from tqdm.auto import tqdm
from transformers import AutoModelForSequenceClassification, AutoTokenizer


def classify_texts(df, text_col, model_path="cja5553/xlm-roberta-Twitter-spam-classification", batch_size=24):
    '''
    Classifies texts as either "Quality" or "Spam" using a pre-trained sequence classification model.

    Parameters:
    -----------
    df : pandas.DataFrame
        DataFrame containing the texts to classify.

    text_col : str
        Name of the column in `df` that contains the text data to be classified.

    model_path : str, default="cja5553/xlm-roberta-Twitter-spam-classification"
        Path to the pre-trained model for sequence classification.

    batch_size : int, optional, default=24
        Batch size for loading and processing data in batches. Adjust based on available GPU memory.

    Returns:
    --------
    pandas.DataFrame
        The original DataFrame with an additional column `spam_prediction`, containing the predicted labels ("Quality" or "Spam") for each text.
        A helper column `text` (the input cast to str) is also added.
    '''
    # Use the GPU when one is available, otherwise fall back to the CPU
    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Load the tokenizer and model
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForSequenceClassification.from_pretrained(model_path).to(device)
    model.eval()  # Set model to evaluation mode

    # Prepare the text data for classification
    df["text"] = df[text_col].astype(str)  # Ensure text is in string format

    # Convert the data to a Hugging Face Dataset and tokenize
    text_dataset = Dataset.from_pandas(df)

    def tokenize_function(example):
        return tokenizer(
            example["text"],
            padding="max_length",
            truncation=True,
            max_length=512
        )

    text_dataset = text_dataset.map(tokenize_function, batched=True)
    text_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask'])

    # DataLoader for the text data
    text_loader = DataLoader(text_dataset, batch_size=batch_size)

    # Make predictions
    predictions = []
    with torch.no_grad():
        for batch in tqdm(text_loader):
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)

            # Forward pass
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            logits = outputs.logits
            preds = torch.argmax(logits, dim=-1).cpu().tolist()  # Get predicted label ids
            predictions.extend(preds)

    # Map predictions to labels
    id2label = {0: "Quality", 1: "Spam"}
    predicted_labels = [id2label[pred] for pred in predictions]

    # Add predictions to the original DataFrame
    df["spam_prediction"] = predicted_labels

    return df


# `df` is your own DataFrame; replace "text_col" with the name of its tweet-text column
spam_df_classification = classify_texts(df, "text_col")
print(spam_df_classification)
```

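The starter code keeps only the `argmax` of the two logits. If a confidence score or a stricter decision rule is useful, the logits can be turned into a spam probability with a softmax. The following is a minimal sketch in plain Python, not part of the original code: the helper names `spam_probability` and `label_with_threshold` and the `threshold` parameter are hypothetical, and the logit order `[quality, spam]` is assumed from the label mapping above.

```python
import math


def spam_probability(logits):
    """Return P(Spam) for one tweet, given its two raw logits [quality, spam]."""
    # Subtract the largest logit before exponentiating, for numerical stability
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    return exps[1] / sum(exps)


def label_with_threshold(logits, threshold=0.5):
    """Label a tweet "Spam" only if P(Spam) reaches the (hypothetical) threshold."""
    return "Spam" if spam_probability(logits) >= threshold else "Quality"


# Example with made-up logits for a single tweet
example_logits = [0.0, 3.0]
print(round(spam_probability(example_logits), 3), label_with_threshold(example_logits, threshold=0.9))
```

With a threshold of 0.5 this matches the `argmax` rule apart from exact ties. Inside the batch loop of the starter code, the same probabilities can be obtained with `torch.softmax(logits, dim=-1)[:, 1]`.
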
## Metrics

Based on an 80-10-10 train-val-test split, the following results were obtained on the test set:

- Accuracy: 0.974555
- Precision: 0.97457
- Recall: 0.97455
- F1-Score: 0.97455

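For readers who want to run a comparable evaluation on their own labelled tweets, the setup above can be sketched as follows. This is an illustrative sketch only, not the code used to produce the reported numbers: the helper names `split_80_10_10` and `binary_metrics` are hypothetical, the split is a plain seeded shuffle, and the metrics treat 'Spam' (1) as the positive class. The card does not state whether the split was stratified or how precision, recall and F1 were averaged, so both choices are assumptions.

```python
import random


def split_80_10_10(items, seed=0):
    """Shuffle items and split them into 80% train, 10% validation, 10% test."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded so the split is reproducible
    n_train = len(items) * 8 // 10
    n_val = len(items) // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 with `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy example: 1 = Spam, 0 = Quality (made-up labels, not data from the model)
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
print(binary_metrics(y_true, y_pred))
```

To score this model, `y_pred` would come from mapping the `spam_prediction` column back to 0/1 and `y_true` from the held-out test labels.
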
## Code

The code used to train this model is available on GitHub at [github.com/cja5553/Twitter_spam_detection](https://github.com/cja5553/Twitter_spam_detection).

## Questions?

Contact me at [email protected]