---
license: mit
language:
- en
library_name: transformers
tags:
- Twitter
- Spam detection
base_model: FacebookAI/xlm-roberta-large
inference: True
---

# Spam detection of Tweets
This model classifies Tweets from X (formerly known as Twitter) into 'Spam' (1) or 'Quality' (0).

## Training Dataset

This model was fine-tuned on [UtkMl's Twitter Spam Detection dataset](https://www.kaggle.com/c/twitter-spam/overview) using [`FacebookAI/xlm-roberta-large`](https://huggingface.co/FacebookAI/xlm-roberta-large) as the base model.

## How to use the model

Here is some starter code for detecting spam tweets in a dataset of text-based tweets.

```python
import pandas as pd
import torch
from datasets import Dataset
from torch.utils.data import DataLoader
from tqdm.auto import tqdm
from transformers import AutoModelForSequenceClassification, AutoTokenizer


def classify_texts(df, text_col, model_path="cja5553/xlm-roberta-Twitter-spam-classification", batch_size=24):
    '''
    Classifies texts as either "Quality" or "Spam" using a pre-trained sequence classification model.

    Parameters:
    -----------
    df : pandas.DataFrame
        DataFrame containing the texts to classify.

    text_col : str
        Name of the column in `df` that contains the text data to be classified.

    model_path : str, default="cja5553/xlm-roberta-Twitter-spam-classification"
        Path to the pre-trained model for sequence classification.

    batch_size : int, optional, default=24
        Batch size for loading and processing data in batches. Adjust based on available GPU memory.

    Returns:
    --------
    pandas.DataFrame
        The original DataFrame with an additional column `spam_prediction`, containing the predicted labels ("Quality" or "Spam") for each text.

    '''
    # Use the GPU when one is available; otherwise fall back to the CPU
    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Load the tokenizer and model
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForSequenceClassification.from_pretrained(model_path).to(device)
    model.eval()  # Set model to evaluation mode

    # Build a Hugging Face Dataset from the text column (cast to str) without modifying df
    text_dataset = Dataset.from_dict({"text": df[text_col].astype(str).tolist()})

    def tokenize_function(example):
        return tokenizer(
            example["text"],
            padding="max_length",
            truncation=True,
            max_length=512
        )

    text_dataset = text_dataset.map(tokenize_function, batched=True)
    text_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask'])

    # DataLoader for the text data (no shuffling, so predictions stay aligned with df rows)
    text_loader = DataLoader(text_dataset, batch_size=batch_size)

    # Make predictions
    predictions = []
    with torch.no_grad():
        for batch in tqdm(text_loader):
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)

            # Forward pass
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            logits = outputs.logits
            preds = torch.argmax(logits, dim=-1).cpu().tolist()  # Get predicted label ids
            predictions.extend(preds)

    # Map predictions to labels
    id2label = {0: "Quality", 1: "Spam"}
    predicted_labels = [id2label[pred] for pred in predictions]

    # Add predictions to the original DataFrame
    df["spam_prediction"] = predicted_labels

    return df


# Example usage with a small illustrative DataFrame; replace it with your own data
df = pd.DataFrame({"text_col": ["Win a FREE phone now, click here!", "Heading to the library to study."]})
spam_df_classification = classify_texts(df, "text_col")
print(spam_df_classification)
```
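
The starter code returns only hard labels. If a confidence score is also useful, the two logits for each tweet can be passed through a softmax. The following is a minimal sketch in plain Python, assuming the logits are ordered as in the `id2label` mapping above (index 0 is "Quality", index 1 is "Spam"). The helper names `softmax` and `label_with_confidence` and the example logits are made up for illustration and are not part of the model's repository.

```python
import math

# Label mapping taken from the model card: 0 = "Quality", 1 = "Spam"
ID2LABEL = {0: "Quality", 1: "Spam"}


def softmax(logits):
    """Convert a list of raw logits into probabilities that sum to 1."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_with_confidence(logits):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]


# Made-up logits for two tweets, one row per tweet
example_logits = [[2.0, -1.0], [-0.5, 3.0]]
results = [label_with_confidence(row) for row in example_logits]
for label, confidence in results:
    print(f"{label}: {confidence:.3f}")
```

When working with PyTorch tensors directly, `torch.softmax(logits, dim=-1)` performs the same computation on a whole batch.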

## Metrics

Based on an 80-10-10 train-val-test split, the model obtained the following results on the test set:
- Accuracy: 0.974555
- Precision: 0.97457
- Recall: 0.97455
- F1-Score: 0.97455
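
The card does not say how precision, recall, and F1 were averaged across the two classes. As an illustration only, the sketch below computes accuracy together with support-weighted precision, recall, and F1 in plain Python. Weighted averaging is an assumption here: weighted recall always equals accuracy, which is consistent with the figures above but does not confirm the method. The function name `weighted_metrics` and the toy labels are hypothetical.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall and F1 (assumed averaging)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    n = len(y_true)
    support = Counter(y_true)  # number of true examples per class
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n

    precision = recall = f1 = 0.0
    for label, count in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        p_c = tp / predicted if predicted else 0.0
        r_c = tp / count
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = count / n  # weight each class by its share of the true labels
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c

    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels for illustration: 0 = "Quality", 1 = "Spam"
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0, 1, 0]
scores = weighted_metrics(y_true, y_pred)
print(scores)
```

For real evaluations, scikit-learn's `precision_recall_fscore_support` with `average="weighted"` offers the same kind of averaging.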



## Code

The code used to train this model is available on GitHub at [github.com/cja5553/Twitter_spam_detection](https://github.com/cja5553/Twitter_spam_detection).

## Questions?
Contact me at [email protected]