Spaces:

Ahmedik95316
/

Fake-News-Detection-with-MLOps

Running

App Files Files Community

Fake-News-Detection-with-MLOps / README.md

Ahmedik95316

Update README.md

740af5b 4 months ago

preview code

raw

history blame

15.4 kB

metadata

title: Fake News Detection MLOps Web App
emoji: 📈
colorFrom: blue
colorTo: blue
sdk: docker
pinned: false
short_description: MLOps fake news detector with drift monitoring
license: mit

📰 Advanced Fake News Detection System

A comprehensive, production-ready MLOps system for fake news detection featuring automated data collection, advanced model training, statistical validation, drift monitoring, and real-time inference. Built with modern DevOps practices and deployed on HuggingFace Spaces.

🔗 Live Application: https://huggingface.co/spaces/Ahmedik95316/Fake-News-Detection-MLOs-Web-App

🚀 System Overview

This system represents a complete MLOps implementation with enterprise-grade reliability and monitoring capabilities:

Core Capabilities

Real-time Inference: FastAPI backend with rate limiting, health checks, and comprehensive error handling
Advanced Model Training: Hyperparameter tuning, cross-validation, and statistical model comparison
Automated Data Collection: Robust web scraping from Reuters, BBC, NPR with quality validation
Intelligent Fake News Generation: Template-based synthesis with content quality scoring
Statistical Model Validation: McNemar's test and paired t-tests for promotion decisions
Multi-Method Drift Detection: Jensen-Shannon divergence, KS tests, and performance monitoring
Production Monitoring: Resource tracking, performance metrics, and automated alerting
Advanced Interactive Dashboard: Complete UI overhaul with 5 specialized tabs for different workflows
Batch Processing & Analysis: Upload and analyze multiple articles simultaneously with downloadable results
Real-time Analytics: System metrics dashboard showing API requests, unique clients, model status, and performance trends
Custom Model Training: Built-in functionality to upload datasets and retrain models with progress tracking
System Status Monitoring: Comprehensive health dashboard with resource usage, file system status, and initialization tools

Technical Architecture

ML Pipeline: Advanced feature engineering with TF-IDF, n-grams, and statistical feature selection
Model Management: A/B testing between candidate and production models with automated promotion
Data Pipeline: Automated scraping, generation, validation, and quality assessment
Monitoring Stack: Comprehensive drift detection with multiple statistical methods
Infrastructure: Containerized deployment with health checks and graceful error handling

🧠 How It Works

Automated MLOps Pipeline

Data Collection & Validation
- Hourly scraping of real news from major outlets (Reuters, BBC, NPR, AP)
- Quality validation with content scoring and duplicate detection
- Template-based fake news generation with believability assessment
- Statistical data quality checks and preprocessing
Advanced Model Training
- Hyperparameter optimization using GridSearchCV with cross-validation
- Multiple model comparison (Logistic Regression, Random Forest)
- Advanced feature engineering with TF-IDF and n-gram analysis
- Comprehensive evaluation with precision, recall, F1, and ROC-AUC metrics
Statistical Model Validation
- McNemar's test for paired model comparison
- Practical significance testing with configurable thresholds
- A/B testing framework for model promotion decisions
- Performance degradation detection with confidence intervals
Multi-Dimensional Drift Monitoring
- Distribution Drift: Jensen-Shannon divergence and Kolmogorov-Smirnov tests
- Feature Drift: Population Stability Index and feature importance tracking
- Performance Drift: Model accuracy and prediction confidence monitoring
- Statistical Distance: Euclidean, cosine similarity, and Bhattacharyya distance
Production Monitoring & Alerting
- Real-time system resource monitoring (CPU, memory, disk)
- Automated health checks with service recovery
- Performance metrics tracking with anomaly detection
- Structured logging with audit trails and error tracking

📂 System Architecture

Core Services

`/app/`

fastapi_server.py – Production-ready API with rate limiting, background tasks, and health monitoring
streamlit_app.py – Multi-tab dashboard with batch processing, analytics, and real-time monitoring
initialize_system.py – System initialization with automatic setup and validation

`/data/`

prepare_datasets.py – Advanced data preprocessing with quality validation and statistical analysis
scrape_real_news.py – Robust web scraping with rate limiting, error handling, and content validation
generate_fake_news.py – Sophisticated fake news generation with template variety and quality scoring

`/model/`

train.py – Advanced training pipeline with hyperparameter tuning and model comparison
retrain.py – Statistical model validation with A/B testing and automated promotion
Model artifacts: Production and candidate models with versioning and metadata

`/monitor/`

monitor_drift.py – Multi-method drift detection with statistical validation and alerting

`/scheduler/`

schedule_tasks.py – Fault-tolerant scheduler with retry logic and dependency management

Infrastructure Files

Dockerfile – Multi-stage containerization with security and health checks
start.sh – Robust startup script with service orchestration
requirements.txt – Pinned dependencies for reproducible deployments

💡 Advanced Features

Enhanced User Interface

The updated system features a completely redesigned multi-tab interface:

🔍 Prediction Tab

Single article analysis with confidence scoring
Support for text input and file upload
Real-time prediction with detailed confidence gauges
Analysis history tracking

📊 Batch Analysis Tab

Upload CSV files for bulk processing
Progress tracking with real-time status updates
Summary statistics and visualization of results
Downloadable analysis reports with timestamps

📈 Analytics Tab

System Metrics: Total API requests, unique clients, processing statistics
Model Performance: Current model version, accuracy trends, confidence patterns
Usage Analytics: Request patterns, client distribution, error rates
Interactive Visualizations: Real-time charts and performance dashboards

🎯 Custom Model Training Tab

Dataset Upload: Support for custom CSV training data with validation
Training Configuration: Adjustable parameters (test size, max features, cross-validation)
Real-time Progress: Live training progress with step-by-step status updates
Performance Validation: Automatic accuracy reporting and model comparison
Automatic Integration: Seamless deployment of trained models to production API

⚙️ System Status Tab

Model Health: Current model status, last health check, availability metrics
System Resources: Real-time CPU, memory, and disk usage monitoring
API Health: Request rates, response times, error tracking
Model Information: Version details, training metrics, accuracy scores, timestamps
Recent Activity: Live activity feed with color-coded event logging
File System Status: Critical file availability and integrity checks
System Initialization: One-click system setup and recovery tools

Fault Tolerance: Automatic retry with exponential backoff and circuit breakers
Resource Management: CPU/memory monitoring with automatic throttling
Service Recovery: Health checks with automatic restart capabilities
Data Integrity: Validation pipelines with quality scoring and anomaly detection

Enterprise-Grade Reliability

Model Validation: McNemar's test and paired statistical comparisons
Drift Detection: Multiple statistical methods with significance testing
Performance Tracking: Confidence intervals and trend analysis
Quality Metrics: Comprehensive evaluation with cross-validation

Production Monitoring

Real-time Dashboards: Live performance metrics and system health
Automated Alerting: Threshold-based notifications with severity classification
Audit Logging: Complete operation history with structured metadata
Performance Analytics: Trend analysis and capacity planning metrics

Scalable Architecture

Microservices Design: Loosely coupled components with clear interfaces
Container Orchestration: Docker with health checks and resource limits
Horizontal Scaling: Stateless design supporting multiple instances
Cloud-Native: Designed for containerized cloud deployment

🔧 Technical Implementation

Machine Learning Stack

Feature Engineering: TF-IDF with n-grams, stop word removal, and feature selection
Model Selection: Automated comparison between Logistic Regression and Random Forest
Hyperparameter Optimization: GridSearchCV with stratified cross-validation
Evaluation Framework: Comprehensive metrics including ROC-AUC and confusion matrices

Data Engineering Pipeline

Quality Validation: Content length, structure, and linguistic analysis
Preprocessing: Text cleaning, normalization, and encoding standardization
Duplicate Detection: Hash-based deduplication with similarity scoring
Batch Processing: Efficient handling of large datasets with memory optimization

Infrastructure & DevOps

Containerization: Multi-stage Docker builds with security scanning
Service Mesh: FastAPI + Streamlit with internal communication
Health Monitoring: Endpoint monitoring with automatic recovery
Logging Infrastructure: Structured logging with log rotation and archival

Security & Compliance

Input Validation: Comprehensive sanitization and content filtering
Rate Limiting: Per-client throttling with sliding window algorithms
Error Handling: Graceful degradation without information leakage
Resource Protection: Memory limits and CPU throttling

📊 Performance Metrics

Model Performance

Accuracy: Consistently achieving >95% accuracy on test sets
F1 Score: Balanced precision-recall performance across both classes
Inference Speed: <100ms response time for single predictions
Batch Processing: Support for up to 10 concurrent predictions

System Performance

Uptime: >99.9% availability with automatic recovery
Scalability: Handles 100+ requests per hour with rate limiting
Resource Efficiency: <2GB memory footprint with optimization
Response Time: <30s for model retraining and promotion

Data Pipeline Metrics

Collection Rate: 20+ new articles per hour from multiple sources
Quality Score: >90% content quality after validation pipeline
Processing Speed: Real-time data ingestion with <5s latency
Storage Efficiency: Compressed storage with automatic cleanup

🛠 Deployment & Operation

Pre-built Docker Image Deployment

# Run this Space with Docker
docker run -it -p 7860:7860 --platform=linux/amd64 \
    registry.hf.space/ahmedik95316-end-to-end-fake-news-detection-syst-7d32f4f:latest

Local Development

Method 1 : Using Shell Script

# Install dependencies
pip install -r requirements.txt

# Initialize system
python initialize_system.py

# Start services
./start.sh

Method 2 : Using Docker To Build and Run the Image

# Build and run the complete system
docker build -t fake-news-detector .
docker run -p 7860:7860 -p 8000:8000 fake-news-detector

Health Monitoring

The system includes comprehensive health checks:

API endpoint monitoring at /health
System resource tracking
Model performance validation
Automated service recovery

🔍 Advanced Analytics

Drift Detection Dashboard

Real-time monitoring of:

Feature distribution changes
Model performance degradation
Data quality metrics
Prediction confidence trends

Performance Analytics

Comprehensive tracking of:

Model accuracy over time
Processing speed metrics
Resource utilization patterns
Error rates and recovery times

Business Intelligence

Strategic insights including:

Content source reliability
Prediction confidence patterns
System usage analytics
Performance optimization recommendations

🎯 Technical Achievements

MLOps Excellence

Complete automation from data collection to model deployment
Statistical validation ensuring model improvement before promotion
Comprehensive monitoring with multiple drift detection methods
Production-ready infrastructure with enterprise-grade reliability

Engineering Quality

Extensive error handling with graceful degradation
Comprehensive test coverage with validation pipelines
Security-first design with input validation and rate limiting
Scalable architecture supporting horizontal expansion

Innovation & Research

Novel approach to automated fake news generation for training
Advanced drift detection combining multiple statistical methods
Real-time model performance monitoring with automated alerting
Innovative A/B testing framework for model validation

📜 System Requirements

Core Dependencies

Python 3.11+ with scientific computing stack
FastAPI for high-performance API serving
Streamlit for interactive dashboard
Scikit-learn for machine learning pipeline
Docker for containerized deployment

External Services

News source APIs (Reuters, BBC, NPR)
Statistical computing libraries (scipy, numpy)
Web scraping capabilities (newspaper3k)
Monitoring and logging infrastructure

🚀 Future Enhancements

Planned Features

Advanced NLP: Integration of transformer models and BERT embeddings
Real-time Processing: Stream processing for live news analysis
Enhanced UI: React-based dashboard with advanced visualizations
API Expansion: RESTful APIs for integration with external systems

Scalability Improvements

Kubernetes: Container orchestration for cloud deployment
Microservices: Further decomposition for independent scaling
Caching Layer: Redis integration for improved performance
Database Integration: PostgreSQL for persistent storage

Machine Learning Advances

Ensemble Methods: Multiple model voting systems
Deep Learning: Neural network integration for improved accuracy
Explainable AI: LIME/SHAP integration for prediction explanations
Active Learning: Human-in-the-loop feedback incorporation

📄 License

MIT License - see LICENSE file for details.

This project demonstrates comprehensive MLOps implementation with production-ready features including automated training, statistical validation, drift monitoring, and enterprise-grade reliability. The system showcases modern software engineering practices combined with advanced machine learning techniques for real-world deployment scenarios.