---
license: mit
language:
- en
library_name: transformers
tags:
- Twitter
- Spam detection
base_model: FacebookAI/xlm-roberta-large
inference: true
---

# Spam detection of Tweets

This model classifies Tweets from X (formerly known as Twitter) as either 'Spam' (1) or 'Quality' (0).

## Training Dataset

The model was fine-tuned on [UtkMl's Twitter Spam Detection dataset](https://www.kaggle.com/c/twitter-spam/overview), with [`FacebookAI/xlm-roberta-large`](https://huggingface.co/FacebookAI/xlm-roberta-large) as the base model.

## How to use model

Here is some starter code you can use to classify a DataFrame of text-based tweets. For a quick check on just one or two strings, see the `pipeline` sketch at the end of this card.

```python
import torch
from torch.utils.data import DataLoader
from datasets import Dataset
from tqdm.auto import tqdm
from transformers import AutoTokenizer, AutoModelForSequenceClassification


def classify_texts(df, text_col, model_path="cja5553/xlm-roberta-Twitter-spam-classification", batch_size=24):
    '''
    Classifies texts as either "Quality" or "Spam" using a pre-trained sequence classification model.

    Parameters:
    -----------
    df : pandas.DataFrame
        DataFrame containing the texts to classify.
    text_col : str
        Name of the column that contains the text data to be classified.
    model_path : str, default="cja5553/xlm-roberta-Twitter-spam-classification"
        Path to the pre-trained model for sequence classification.
    batch_size : int, optional, default=24
        Batch size for loading and processing data in batches. Adjust based on available GPU memory.

    Returns:
    --------
    pandas.DataFrame
        The original DataFrame with an additional column `spam_prediction`, containing the
        predicted labels ("Quality" or "Spam") for each text.
    '''
    # Use the GPU when available, otherwise fall back to the CPU
    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Load the tokenizer and model
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForSequenceClassification.from_pretrained(model_path).to(device)
    model.eval()  # Set model to evaluation mode

    # Prepare the text data for classification
    df["text"] = df[text_col].astype(str)  # Ensure text is in string format

    # Convert the data to a Hugging Face Dataset and tokenize
    text_dataset = Dataset.from_pandas(df)

    def tokenize_function(example):
        return tokenizer(
            example["text"],
            padding="max_length",
            truncation=True,
            max_length=512
        )

    text_dataset = text_dataset.map(tokenize_function, batched=True)
    text_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask'])

    # DataLoader for the text data
    text_loader = DataLoader(text_dataset, batch_size=batch_size)

    # Make predictions
    predictions = []
    with torch.no_grad():
        for batch in tqdm(text_loader):
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)

            # Forward pass
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            logits = outputs.logits
            preds = torch.argmax(logits, dim=-1).cpu().numpy()  # Get predicted labels
            predictions.extend(preds)

    # Map predictions to labels
    id2label = {0: "Quality", 1: "Spam"}
    predicted_labels = [id2label[pred] for pred in predictions]

    # Add predictions to the original DataFrame
    df["spam_prediction"] = predicted_labels
    return df


# `df` is your pandas DataFrame; "text_col" is the name of its tweet-text column
spam_df_classification = classify_texts(df, "text_col")
print(spam_df_classification)
```

## Metrics

Based on an 80-10-10 train-val-test split, the following results were obtained on the test set:

- Accuracy: 0.974555
- Precision: 0.97457
- Recall: 0.97455
- F1-Score: 0.97455

## Code

The code used to train this model is available on GitHub at [github.com/cja5553/Twitter_spam_detection](https://github.com/cja5553/Twitter_spam_detection)

## Questions?

Contact me at alba@wustl.edu
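## Quick check with a pipeline

For a quick sanity check on a few individual strings, rather than a whole DataFrame, the `transformers` `pipeline` API is a lighter-weight alternative to the helper above. This is a minimal sketch: it assumes the checkpoint's config carries the `id2label` mapping `{0: "Quality", 1: "Spam"}` used above; if it does not, the returned labels will appear as `LABEL_0`/`LABEL_1` instead, and the example tweets are made up for illustration.

```python
from transformers import pipeline

# Minimal sketch: single-string spam classification via the pipeline API.
# Assumes id2label = {0: "Quality", 1: "Spam"} in the model config;
# otherwise the labels are reported as LABEL_0 / LABEL_1.
classifier = pipeline(
    "text-classification",
    model="cja5553/xlm-roberta-Twitter-spam-classification",
)

examples = [
    "Congratulations!! You have WON a free iPhone, click here to claim now!",
    "Had a great time at the conference today, learned a lot.",
]
# Truncate long inputs to the model's 512-token limit, as in the batched helper
for text, pred in zip(examples, classifier(examples, truncation=True, max_length=512)):
    print(f"{pred['label']} ({pred['score']:.3f}): {text}")
```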