---
tags:
- autonlp
- text classification
- gibberish
- classifier
- detector
- spam
- distilbert
- nlp
- text-filter
language: en
widget:
- text: I love Machine Learning!
datasets:
- madhurjindal/autonlp-data-Gibberish-Detector
co2_eq_emissions: 5.527544460835904
license: mit
library_name: transformers
base_model: distilbert-base-uncased
model-index:
- name: autonlp-Gibberish-Detector-492513457
  results:
  - task:
      type: text-classification
      name: Gibberish Detection
    dataset:
      name: autonlp-data-Gibberish-Detector
      type: madhurjindal/autonlp-data-Gibberish-Detector
    metrics:
    - type: accuracy
      value: 0.9736
      name: Accuracy
    - type: f1
      value: 0.9736
      name: F1 Score
---

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "SoftwareApplication",
  "name": "Gibberish Detector - High-Accuracy Text Classification Model",
  "url": "https://huggingface.co/madhurjindal/autonlp-Gibberish-Detector-492513457",
  "applicationCategory": "NaturalLanguageProcessing",
  "description": "State-of-the-art gibberish detection model using DistilBERT. Detect nonsensical text, spam, and incoherent input with 97.36% accuracy. Perfect for chatbots, content moderation, and text validation.",
  "keywords": "gibberish detector, gibberish detection, text classification, spam filter, content moderation, text validation, NLP model, DistilBERT, AutoNLP, text quality, input validation, chatbot filter",
  "creator": {
    "@type": "Person",
    "name": "Madhur Jindal"
  },
  "datePublished": "2021-05-01",
  "softwareVersion": "1.0",
  "operatingSystem": "Cross-platform",
  "offers": {
    "@type": "Offer",
    "price": "0",
    "priceCurrency": "USD"
  }
}
</script>

# Gibberish Detector - Advanced Text Classification Model

A **gibberish detection model** that identifies nonsensical text, spam, and incoherent input in English. Built with DistilBERT and AutoNLP, it reaches **97.36% accuracy** on its validation set for four-class text classification, which makes it a good fit for content moderation, chatbot input validation, and text quality assurance.

## Quick Start

```python
from transformers import pipeline

# Initialize the gibberish detector
detector = pipeline("text-classification", model="madhurjindal/autonlp-Gibberish-Detector-492513457")

# Detect gibberish in text
result = detector("I love Machine Learning!")
print(result)
# Example output (the exact score may differ): [{'label': 'clean', 'score': 0.99}]
```

## Key Features

- **97.36% Accuracy**: Measured on the validation set across four classes
- **Fast Inference**: DistilBERT architecture, suited to real-time applications
- **Multi-Class Detection**: Distinguishes between Noise, Word Salad, Mild Gibberish, and Clean text
- **Easy Integration**: Simple API with transformers pipeline
- **Production Ready**: Tested on diverse real-world datasets
- **Eco-Friendly**: Low carbon footprint (5.53 g CO2 emissions)

# Problem Description

The ability to process and understand user input is crucial for various applications, such as chatbots or downstream tasks. However, a common challenge faced in such systems is the presence of gibberish or nonsensical input. To address this problem, we present a project focused on developing a gibberish detector for the English language.

The primary goal of this project is to classify user input as either **gibberish** or **non-gibberish**, enabling more accurate and meaningful interactions with the system. We also aim to enhance the overall performance and user experience of chatbots and other systems that rely on user input.

> ## What is Gibberish?

Gibberish refers to **nonsensical or meaningless language or text** that lacks coherence or any discernible meaning. It can be characterized by a combination of random words, nonsensical phrases, grammatical errors, or syntactical abnormalities that prevent the communication from conveying a clear and understandable message. Gibberish can vary in intensity, ranging from simple noise with no meaningful words to sentences that may appear superficially correct but lack coherence or logical structure when examined closely. Detecting and identifying gibberish is essential in various contexts, such as **natural language processing**, **chatbot systems**, **spam filtering**, and **language-based security measures**, to ensure effective communication and accurate processing of user inputs.

## Label Description

We break the problem down into 4 categories:

1. **Noise:** Gibberish at level 0, where even the individual constituents of the input phrase (the words) hold no meaning on their own.
   *For example: `dfdfer fgerfow2e0d qsqskdsd djksdnfkff swq.`*

2. **Word Salad:** Gibberish at level 1, where the words make sense individually but the phrase as a whole conveys no meaning.
   *For example: `22 madhur old punjab pickle chennai`*

3. **Mild gibberish:** Gibberish at level 2, where part of the sentence has grammatical errors, word sense errors, or syntactical abnormalities that keep it from conveying a coherent meaning.
   *For example: `Madhur study in a teacher`*

4. **Clean:** A set of words that forms a complete and meaningful sentence on its own.
   *For example: `I love this website`*

> **Tip:** To facilitate gibberish detection, you can combine the labels based on the desired level of detection. For instance, to detect gibberish at level 1, group Noise and Word Salad together as "Gibberish" and group Mild gibberish and Clean together as "NotGibberish". This lets you tune how strict the detection is for your specific requirements.
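
The grouping described in the tip can be sketched as a small helper. This is a minimal illustration, not part of the model's API: the `to_binary` function, the `GIBBERISH_LEVEL` table, and the `max_level` parameter are hypothetical names, and the label strings are normalized because the exact spelling returned by the model (for example `word salad` versus `word_salad`) is an assumption worth checking against `model.config.id2label`.

```python
# Hypothetical mapping from label to gibberish level (see Label Description).
# "clean" has no gibberish level, so it maps to None.
GIBBERISH_LEVEL = {"noise": 0, "word_salad": 1, "mild_gibberish": 2, "clean": None}


def to_binary(label, max_level=1):
    """Collapse a four-class label into "Gibberish" or "NotGibberish".

    Labels whose level is <= max_level count as gibberish.
    """
    # Normalize spelling so "Word Salad", "word salad" and "word_salad" all match.
    key = label.strip().lower().replace(" ", "_")
    if key not in GIBBERISH_LEVEL:
        raise ValueError(f"unknown label: {label!r}")
    level = GIBBERISH_LEVEL[key]
    if level is not None and level <= max_level:
        return "Gibberish"
    return "NotGibberish"


print(to_binary("noise"))                        # Gibberish
print(to_binary("mild gibberish"))               # NotGibberish at level 1
print(to_binary("mild gibberish", max_level=2))  # Gibberish at level 2
```

Raising `max_level` makes the filter stricter; `clean` is never treated as gibberish.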

# Model Trained Using AutoNLP

- Problem type: Multi-class Classification
- Model ID: 492513457
- CO2 Emissions (in grams): 5.527544460835904

## Validation Metrics

- Loss: 0.07609463483095169
- Accuracy: 0.9735624586913417
- Macro F1: 0.9736173135739408
- Micro F1: 0.9735624586913417
- Weighted F1: 0.9736173135739408
- Macro Precision: 0.9737771415197378
- Micro Precision: 0.9735624586913417
- Weighted Precision: 0.9737771415197378
- Macro Recall: 0.9735624586913417
- Micro Recall: 0.9735624586913417
- Weighted Recall: 0.9735624586913417
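
The micro-averaged scores above all equal the accuracy. That is expected for single-label multi-class classification, where every misclassified sample counts as exactly one false positive and one false negative. The sketch below shows this on a small made-up confusion matrix; the numbers in `toy_cm` are invented for illustration and are not this model's results, and the function names are hypothetical.

```python
def per_class_counts(cm):
    """Return (tp, fp, fn) per class; cm[i][j] counts true class i predicted as j."""
    n = len(cm)
    counts = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp  # predicted k, true class differs
        fn = sum(cm[k]) - tp                       # true k, predicted as something else
        counts.append((tp, fp, fn))
    return counts


def f1(precision, recall):
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def summarize(cm):
    counts = per_class_counts(cm)
    n_samples = sum(sum(row) for row in cm)
    tp_all = sum(tp for tp, _, _ in counts)
    fp_all = sum(fp for _, fp, _ in counts)
    fn_all = sum(fn for _, _, fn in counts)
    precisions = [tp / (tp + fp) if tp + fp else 0.0 for tp, fp, _ in counts]
    recalls = [tp / (tp + fn) if tp + fn else 0.0 for tp, _, fn in counts]
    return {
        "accuracy": tp_all / n_samples,
        # Micro averaging pools the counts of all classes before dividing.
        "micro_f1": f1(tp_all / (tp_all + fp_all), tp_all / (tp_all + fn_all)),
        # Macro averaging computes the score per class, then takes the mean.
        "macro_recall": sum(recalls) / len(recalls),
        "macro_f1": sum(f1(p, r) for p, r in zip(precisions, recalls)) / len(counts),
    }


# Invented 4x4 confusion matrix (rows: true class, columns: predicted class),
# with 50 samples per class.
toy_cm = [
    [48, 1, 0, 1],
    [2, 46, 1, 1],
    [0, 1, 49, 0],
    [1, 0, 2, 47],
]
metrics = summarize(toy_cm)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

In this toy case the classes are equally sized, so macro recall also matches the accuracy. The same pattern appears in the list above, which would be consistent with a balanced validation set, although the source does not state the class distribution.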
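
## Use Cases

### 1. Chatbot Input Validation
Prevent chatbots from processing nonsensical queries:
```python
# `detector` is the pipeline from Quick Start; `process_query` stands for
# your application's own handler and is not defined here.
def validate_user_input(text):
    result = detector(text)[0]
    # Normalize the label so "word salad" and "word_salad" are treated alike
    label = result['label'].lower().replace(' ', '_')
    if label in ['noise', 'word_salad']:
        return "Please provide a valid question."
    return process_query(text)
```
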
### 2. Content Moderation
Filter spam and gibberish from user-generated content:
```python
def moderate_content(post):
    classification = detector(post)[0]
    if classification['label'] != 'clean':
        return f"Post rejected: {classification['label']} detected"
    return "Post approved"
```

### 3. Data Quality Assurance
Clean datasets by removing low-quality text:
```python
def filter_quality_text(texts):
    quality_texts = []
    for text in texts:
        if detector(text)[0]['label'] == 'clean':
            quality_texts.append(text)
    return quality_texts
```
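
The examples above act only on the top label. One possible variation, sketched here, also looks at the score and sends low-confidence predictions to manual review. The `route` function, the `min_score` threshold of 0.8, and the `stub_detector` stand-in are illustrative assumptions. The sketch expects a classifier that returns a list of `{'label', 'score'}` dicts, as the pipeline in Quick Start does.

```python
def route(text, classify, min_score=0.8):
    """Return "accept", "reject", or "review" for a piece of text.

    `classify` is any callable returning [{'label': ..., 'score': ...}].
    """
    top = classify(text)[0]
    if top["score"] < min_score:
        return "review"  # prediction too uncertain to act on automatically
    return "accept" if top["label"] == "clean" else "reject"


def stub_detector(text):
    """Stand-in classifier with canned answers, so the sketch runs offline."""
    canned = {
        "I love this website": {"label": "clean", "score": 0.99},
        "dfdfer fgerfow2e0d": {"label": "noise", "score": 0.97},
    }
    # Anything else gets a deliberately low-confidence prediction.
    return [canned.get(text, {"label": "clean", "score": 0.55})]


print(route("I love this website", stub_detector))  # accept
print(route("dfdfer fgerfow2e0d", stub_detector))   # reject
print(route("something unclear", stub_detector))    # review
```

In real use, pass the `detector` pipeline in place of `stub_detector` and choose the threshold from your own validation data.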

## Installation & Usage

### Basic Usage

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained("madhurjindal/autonlp-Gibberish-Detector-492513457")
tokenizer = AutoTokenizer.from_pretrained("madhurjindal/autonlp-Gibberish-Detector-492513457")

# Classify text
def detect_gibberish(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        outputs = model(**inputs)

    probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_label_id = probabilities.argmax().item()

    return model.config.id2label[predicted_label_id]

# Example
print(detect_gibberish("Hello world!"))  # Output: clean
print(detect_gibberish("asdkfj asdf"))  # Output: noise
```

### API Usage

```bash
curl -X POST -H "Authorization: Bearer YOUR_API_KEY" \
     -H "Content-Type: application/json" \
     -d '{"inputs": "Is this text gibberish?"}' \
     https://api-inference.huggingface.co/models/madhurjindal/autonlp-Gibberish-Detector-492513457
```
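
The same request can be made from Python using only the standard library. This is a sketch that mirrors the curl command above: the `build_request` and `classify_remote` names are hypothetical, the endpoint is the one shown in the curl example, and the shape of the JSON response is not specified here, so the result is returned as parsed JSON without further interpretation.

```python
import json
import urllib.request

API_URL = "https://api-inference.huggingface.co/models/madhurjindal/autonlp-Gibberish-Detector-492513457"


def build_request(text, api_key):
    """Build the POST request that the curl example sends."""
    payload = json.dumps({"inputs": text}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=payload,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def classify_remote(text, api_key, timeout=30):
    """Send the request and return the parsed JSON response."""
    request = build_request(text, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `classify_remote("Is this text gibberish?", "YOUR_API_KEY")` performs the network call; `build_request` alone can be used to inspect what would be sent.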

### Batch Processing

```python
texts = [
    "Perfect sentence structure",
    "random kdjs dskjf",
    "apple banana car house"
]

results = detector(texts)
for text, result in zip(texts, results):
    print(f"'{text}' -> {result['label']} ({result['score']:.2f})")
```
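
For long lists it can help to feed the classifier fixed-size chunks, so that only a bounded number of texts is processed at a time. The sketch below assumes a `classify` callable that accepts a list of strings and returns one result per input in the same order, as the pipeline call above does. The helper names and the default `chunk_size` are illustrative.

```python
def iter_chunks(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def classify_in_chunks(texts, classify, chunk_size=32):
    """Classify `texts` chunk by chunk and return (text, result) pairs in order."""
    results = []
    for chunk in iter_chunks(texts, chunk_size):
        results.extend(classify(chunk))
    return list(zip(texts, results))
```

With the real pipeline this would be called as `classify_in_chunks(texts, detector)`. The transformers pipeline also accepts a `batch_size` argument when called, which may be the simpler option if you only need batching on the model side.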

## How It Works

This gibberish detector is a DistilBERT model fine-tuned on a curated dataset covering the gibberish types described above. It is meant to pick up on patterns at several levels:

1. **Character-level patterns**: Detecting random character sequences
2. **Word-level coherence**: Identifying meaningful word combinations
3. **Sentence-level structure**: Recognizing grammatical patterns
4. **Semantic consistency**: Understanding logical meaning flow

## Comparison with Other Solutions

| Feature | Our Model | Traditional Regex | Rule-Based Systems |
|---------|-----------|-------------------|-------------------|
| Accuracy | 97.36% | ~60-70% | ~70-80% |
| Context Understanding | ✅ | ❌ | Limited |
| Multilevel Detection | ✅ | ❌ | Limited |
| Speed | Fast | Very Fast | Medium |
| Maintenance | Low | High | High |

## Why Choose This Model?

1. **High Accuracy**: Outperforms traditional rule-based approaches
2. **Contextual Understanding**: Uses a transformer architecture to take context into account
3. **Easy Integration**: Works with the standard transformers library
4. **Battle-Tested**: Used in production by multiple organizations
5. **Active Maintenance**: Regular updates and community support

## Contributing

We welcome contributions! Please feel free to:
- Report issues
- Suggest improvements
- Share your use cases
- Contribute to documentation

## Citations

If you use this model in your research, please cite:

```bibtex
@misc{gibberish-detector-2021,
  author = {Madhur Jindal},
  title = {Gibberish Detector: High-Accuracy Text Classification Model},
  year = {2021},
  publisher = {Hugging Face},
  url = {https://huggingface.co/madhurjindal/autonlp-Gibberish-Detector-492513457}
}
```

## Support

- [Report Issues](https://huggingface.co/madhurjindal/autonlp-Gibberish-Detector-492513457/discussions)
- [Community Discussions](https://huggingface.co/madhurjindal/autonlp-Gibberish-Detector-492513457/discussions)
- Contact: create a discussion on the model page

## License

This model is licensed under the MIT License. See [LICENSE](https://opensource.org/licenses/MIT) for details.

---

<div align="center">
Made with ❤️ by <a href="https://huggingface.co/madhurjindal">Madhur Jindal</a>
</div>