title: Datagouv French Data Analyst
emoji: 🌍
colorFrom: pink
colorTo: blue
sdk: gradio
sdk_version: 5.33.0
app_file: app.py
pinned: false
license: mit
short_description: Public French data analysis agent.
tags:
  - agent-demo-track

🤖 French Public Data Analysis Agent

AI-powered intelligent analysis of French public datasets with automated visualization generation, comprehensive DOCX reports, and interactive follow-up analysis capabilities.

Video Link

Quick demo, sped up to shorten watch time: https://www.loom.com/share/133940ce6f5f4708ba695e1c1b28cc10?sid=95c55c10-f297-40ad-bf82-8aa167bb108d

✨ Features

🔍 Intelligent Dataset Discovery

  • BM25 Keyword Search: Advanced keyword matching with pre-computed search indices
  • Bilingual Query Translation: Search in French or English; queries are automatically translated by an LLM
  • Quality-Weighted Random Selection: Leave the query empty to randomly select a high-quality dataset
  • Real-time Dataset Matching: Instant matching against 5,000+ French government datasets
  • Dynamic Dataset Search: Agent can search for alternative datasets if initial results aren't suitable
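
The quality-weighted random selection described above can be sketched with the standard library alone. This is a minimal illustration, not the project's code: the record shape (`title`, `quality_score`) and the function name `pick_random_dataset` are assumptions made for the example.

```python
import random

def pick_random_dataset(datasets, rng=None):
    """Pick one dataset with probability proportional to its quality score."""
    rng = rng or random.Random()
    # Hypothetical record shape: {"title": str, "quality_score": float}
    weights = [max(d["quality_score"], 0.0) for d in datasets]
    if not datasets or sum(weights) == 0:
        raise ValueError("need at least one dataset with a positive quality score")
    return rng.choices(datasets, weights=weights, k=1)[0]

catalog = [
    {"title": "Road traffic accidents", "quality_score": 0.9},
    {"title": "Education directory", "quality_score": 0.6},
    {"title": "Sparse legacy export", "quality_score": 0.0},
]
chosen = pick_random_dataset(catalog, rng=random.Random(42))
print(chosen["title"])
```

A dataset with a score of zero is never drawn, and higher-scored datasets are drawn more often, which is one plausible reading of "quality-weighted".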

🤖 Automated AI Analysis

  • SmolAgents Integration: Advanced AI agent with 30+ step planning capability
  • Custom Tool Suite: Specialized tools for web scraping, data analysis, and visualization
  • Multi-step Processing: Complete pipeline from data discovery to report generation
  • Error Recovery: Smart error handling and alternative data source selection
  • Autonomous Decision Making: Agent can choose from provided results or find better alternatives

🎯 Interactive Follow-up Analysis ⭐ NEW

  • Dedicated Follow-up Agent: Specialized AI for answering questions about generated reports
  • Dataset Continuity: Automatically loads and analyzes the same dataset from previous report
  • Advanced Analytics: Correlation analysis, statistical summaries, custom filtering
  • Interactive Visualizations: Create new charts and graphs based on follow-up questions
  • Multiple Analysis Types: Support for bar charts, scatter plots, histograms, box plots, and more
  • Example-Driven Interface: Quick-start examples for common follow-up questions

📊 Advanced Visualizations

  • France Geographic Maps: Department and region-level choropleth maps
  • Multiple Chart Types: Bar charts, line plots, scatter plots, heatmaps, histograms, box plots
  • Smart Visualization Selection: AI automatically chooses appropriate chart types
  • High-Quality PNG Output: Publication-ready visualizations
  • Follow-up Visualizations: Generate additional charts based on user questions

📄 Comprehensive Reports

  • Professional DOCX Reports: Complete analysis with embedded visualizations
  • Bilingual Support: Reports generated in the same language as your query
  • Structured Analysis: Title page, methodology, findings, and next steps
  • Direct DOCX Generation: Reports are written with python-docx; no external office suite is required
  • Report Continuity: Follow-up analysis references previous report context

🎨 Modern Web Interface

  • Real-time Progress Tracking: Detailed step-by-step progress updates
  • Responsive Design: Beautiful, modern Gradio interface
  • Quick Start Examples: Pre-built queries for common use cases
  • Accordion Tips: Collapsible help section with usage instructions
  • Follow-up Interface: Dedicated section for asking follow-up questions
  • Visual Feedback: Progress bars and status indicators

🚀 Quick Start

1. Prerequisites

  • Python 3.10+ (Gradio 5 and smolagents both require it)
  • Gemini API key

2. Installation

# Clone the repository
git clone <repository-url>
cd datagouv-french-data-analyst

# Install dependencies
pip install -r requirements.txt

3. Environment Setup

Create a .env file in the project root:

GEMINI_API_KEY=your_Gemini_api_key_here
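
Before launching, it can help to confirm that the key is actually visible to Python. The check below is a hedged sketch: `get_api_key` is a hypothetical helper, and it assumes the app loads the `.env` file into the environment (for example with python-dotenv's `load_dotenv()`), which this README does not spell out.

```python
import os

def get_api_key(env=None, name="GEMINI_API_KEY"):
    """Return the API key, or raise if it is missing or still the placeholder."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key or key == "your_Gemini_api_key_here":
        raise RuntimeError(f"{name} is missing or still set to the placeholder value")
    return key

# Demonstration with an explicit mapping instead of the real environment.
print(get_api_key({"GEMINI_API_KEY": "  example-key  "}))
```

Calling `get_api_key()` with no arguments checks the real process environment.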

4. Launch the Application

Option 1: Using the launch script (Recommended)

python launch_gradio.py

Option 2: Direct launch

python app.py

The interface will be available at the local URL that Gradio prints to the console on startup.

💡 How to Use

Basic Analysis Workflow

  1. Enter Your Query: Type any search term related to French public data

    • Examples: "road traffic accidents", "education directory", "housing data"
    • Supports both French and English queries
  2. Or Use Quick Examples: Click any of the pre-built example queries:

    • 🚗 Road Traffic Accidents 2023
    • 🎓 Education Directory
    • 🏠 French Vacant Housing Private Park
  3. Or Go Random: Leave the query empty to randomly select a high-quality dataset

  4. Click "🚀 Analyze Dataset": The AI agent begins processing (7-15 minutes)

Follow-up Analysis Workflow

After the initial analysis is complete:

  1. Follow-up Section Appears: Located below the generated visualizations

  2. Ask Follow-up Questions: Use the dedicated input field to ask questions about the report

  3. Use Example Questions: Click pre-built examples like:

    • 📊 Correlation Analysis
    • 📈 Statistical Summary
    • 🎯 Filter & Analyze
    • 📋 Dataset Overview
    • 📉 Trend Analysis
    • 🔍 Custom Visualization
  4. Get Detailed Answers: Receive both text explanations and new visualizations

Results

  • Download DOCX Report: Complete analysis with all visualizations
  • View Individual Charts: Up to 4 visualizations displayed in the interface
  • Dataset Reference: Direct link to the original data.gouv.fr page
  • Follow-up Visualizations: Additional charts generated from follow-up questions

🛠️ Technical Architecture

Core Components

📁 Project Structure
├── app.py                    # Main Gradio interface with progress tracking
├── launch_gradio.py          # Simplified launch script
├── agent.py                  # SmolAgents configuration and prompt generation
├── followup_agent.py         # Follow-up analysis agent
├── tools/                    # Custom agent tools
│   ├── webpage_tools.py      # Web scraping and data extraction
│   ├── exploration_tools.py  # Dataset analysis and description
│   ├── drawing_tools.py      # France map generation and visualization
│   ├── libreoffice_tools.py  # Document utilities (legacy)
│   ├── followup_tools.py     # Follow-up analysis tools
│   └── retrieval_tools.py    # Dataset search and retrieval
├── filtered_dataset.csv      # Pre-processed dataset index (5,000+ datasets)
├── france_data/              # Geographic data for France maps
└── generated_data/           # Output folder for reports and visualizations

Key Technologies

  • Frontend: Gradio with custom CSS and real-time progress
  • AI Agents:
    • Primary SmolAgents powered by Gemini
    • Specialized follow-up agent for interactive analysis ⭐
  • Search: BM25 keyword matching with TF-IDF preprocessing
  • Translation: LLM-powered bilingual query translation
  • Visualization: Matplotlib, Geopandas, Seaborn
  • Report Generation: python-docx for DOCX documents
  • Data Processing: Pandas, NumPy, Shapely, Scipy
  • Follow-up Analytics: Statistical analysis, correlation studies, custom filtering ⭐

Smart Features

Enhanced BM25 Search

  • Pre-computed search indices for 5,000+ datasets
  • Accent-insensitive keyword matching
  • Plural form normalization
  • Quality-score weighted ranking
  • Dynamic dataset retrieval during analysis ⭐
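
To make the ranking ideas above concrete, here is a standard-library sketch of accent-insensitive tokenization, crude plural stripping, and BM25 scoring multiplied by a quality score. It is an illustration under assumptions, not the project's implementation: the project lists `rank_bm25` and `unidecode` as dependencies, whereas this sketch reimplements a non-negative-IDF variant of Okapi BM25 by hand, and the names `normalize`, `bm25_scores`, `rank`, and the `quality` field are hypothetical.

```python
import math
import unicodedata
from collections import Counter

def normalize(text):
    """Lowercase, strip accents, and drop a trailing plural 's' (a crude heuristic)."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    tokens = "".join(c if c.isalnum() else " " for c in no_accents).split()
    return [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each document for the query (non-negative IDF variant)."""
    tokenized = [normalize(d) for d in docs]
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)
    doc_freq = Counter(term for t in tokenized for term in set(t))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in normalize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (len(docs) - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

def rank(query, datasets):
    """Rank matching datasets by BM25 score times a quality score in [0, 1]."""
    scores = bm25_scores(query, [d["title"] for d in datasets])
    weighted = [(s * d["quality"], d["title"]) for s, d in zip(scores, datasets)]
    return [title for score, title in sorted(weighted, reverse=True) if score > 0]

datasets = [
    {"title": "Accidents corporels de la circulation routière", "quality": 0.9},
    {"title": "Annuaire de l'éducation", "quality": 0.8},
    {"title": "Logements vacants du parc privé", "quality": 0.7},
]
print(rank("education", datasets))
print(rank("accident routiere", datasets))
```

The unaccented query "education" matches "éducation", and the singular "accident" matches "Accidents", because queries and titles pass through the same `normalize` step.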

Follow-up Analysis System

  • Dataset Continuity: Automatically loads previous analysis dataset
  • Context Awareness: References previous report findings
  • Multi-modal Analysis: Combines statistical analysis with visualizations
  • Tool Integration: 8+ specialized follow-up tools including:
    • load_previous_dataset() - Load analysis dataset
    • get_dataset_summary() - Comprehensive dataset overview
    • create_followup_visualization() - Generate custom charts
    • analyze_column_correlation() - Statistical correlation analysis
    • create_statistical_summary() - Advanced statistical reports
    • filter_and_visualize_data() - Targeted data filtering and visualization

LLM Translation

  • Automatic French ↔ English translation
  • Query language detection
  • Bilingual result matching
  • Context-aware translations

Progress System

  • Thread-safe progress tracking
  • Queue-based status updates
  • Step-by-step visual feedback
  • Non-blocking UI execution
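
The pattern behind these four points can be sketched as a worker thread that pushes status updates onto a `queue.Queue` while a generator drains it. This is a minimal sketch under assumptions: `run_with_progress` and the step names are hypothetical, and the real app may wire the updates into Gradio differently. Gradio does accept generator functions as event handlers, so a generator like this one can stream updates to the interface.

```python
import queue
import threading

def run_with_progress(task, steps):
    """Run `task` on each step in a worker thread and yield status updates."""
    updates = queue.Queue()  # thread-safe channel from the worker to the caller
    done = object()          # sentinel marking the end of the stream

    def worker():
        try:
            for i, step in enumerate(steps, start=1):
                task(step)
                updates.put((i / len(steps), f"Finished: {step}"))
        except Exception as exc:
            updates.put((1.0, f"Failed: {exc}"))
        finally:
            updates.put(done)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = updates.get()  # waits for the next update without polling
        if item is done:
            break
        yield item

steps = ["search datasets", "download CSV", "draw charts", "write report"]
for fraction, message in run_with_progress(lambda step: None, steps):
    print(f"{fraction:.0%} {message}")
```

The sentinel in the `finally` clause guarantees the generator terminates even when a step raises, so the interface never waits on a dead worker.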

🔧 Troubleshooting

Common Issues

  1. "No CSV/JSON files found"

    • The selected dataset doesn't contain processable files
    • Try a different query or use the random selection
    • Agent will automatically search for alternative datasets
  2. DOCX report generation fails

    • Ensure python-docx is installed correctly
    • Check the console for specific error messages
  3. Translation errors

    • Verify your API key is valid
    • Check API quota and rate limits
  4. Slow performance

    • BM25 index computation may take time on first run
    • Pre-computed indices are cached for faster subsequent searches
  5. Follow-up analysis errors

    • Ensure the initial analysis completed successfully
    • Check that dataset files exist in the generated_data/ folder
    • Verify that the follow-up question is clear and specific

Performance Optimization

  • Pre-compute BM25: Run the search once to generate bm25_data.pkl
  • Use SSD storage: Faster file I/O for large datasets
  • Monitor API usage: Both query translation and agent execution consume API calls
  • Clean generated_data: Remove old files to improve follow-up performance
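
Cleaning the output folder can be scripted. The helper below is a sketch, not part of the project: `remove_old_files` and the seven-day threshold are assumptions, and the demonstration runs against a temporary directory so that nothing in a real `generated_data/` folder is touched.

```python
import os
import tempfile
import time
from pathlib import Path

def remove_old_files(folder, max_age_days=7, now=None):
    """Delete regular files in `folder` older than `max_age_days`; return their names."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = []
    for path in Path(folder).glob("*"):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path.name)
    return sorted(removed)

# Demonstration in a throwaway directory with one stale and one fresh file.
with tempfile.TemporaryDirectory() as tmp:
    old, new = Path(tmp, "old_report.docx"), Path(tmp, "new_chart.png")
    old.touch()
    new.touch()
    ten_days_ago = time.time() - 10 * 86400
    os.utime(old, (ten_days_ago, ten_days_ago))  # backdate the modification time
    removed = remove_old_files(tmp, max_age_days=7)
    remaining = sorted(p.name for p in Path(tmp).iterdir())
print(removed, remaining)
```

For the project itself, the equivalent call would be `remove_old_files("generated_data")`, run only after saving any reports worth keeping.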

📊 Dataset Coverage

  • 5,000+ Datasets: Pre-filtered French government datasets
  • Data Sources: data.gouv.fr, INSEE, regional authorities
  • File Formats: CSV, JSON, Excel, XML
  • Topics: All major sectors of French public administration
  • Quality Scores: Datasets ranked by completeness and usability
  • Real-time Search: Agent can discover additional datasets during analysis

🚀 Advanced Usage

Follow-up Analysis Examples

Correlation Analysis:

Show me the correlation between two numerical columns with a scatter plot

Statistical Summary:

Create a comprehensive statistical summary with visualization for unemployment rates

Custom Filtering:

Filter accidents data by night time conditions and create a visualization

Trend Analysis:

Create a line chart showing accident trends over the months

Custom Tool Development

Add new tools to the tools/ directory following the SmolAgents tool pattern.
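
A new tool might look like the sketch below. It relies on the `@tool` decorator from smolagents, which expects type hints and a docstring with an `Args:` section describing each parameter. The tool itself, `count_missing_values`, is a hypothetical example rather than one of the project's tools, and the `ImportError` fallback exists only so that the snippet runs where smolagents is not installed.

```python
import csv
import io

try:
    from smolagents import tool  # decorator provided by the smolagents package
except ImportError:              # fallback so the sketch runs without smolagents
    def tool(func):
        return func

@tool
def count_missing_values(csv_text: str) -> str:
    """Counts empty cells per column in a CSV document.

    Args:
        csv_text: The raw CSV content, including the header row.
    """
    missing = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        for column, value in row.items():
            # A None column collects surplus cells from overlong rows; skip it.
            if column is not None and (value is None or not value.strip()):
                missing[column] = missing.get(column, 0) + 1
    summary = ", ".join(f"{col}: {n}" for col, n in sorted(missing.items()))
    return summary or "no missing values"

print(count_missing_values("city,population\nParis,2100000\nLyon,\n,500000\n"))
```

Returning a short string rather than a complex object keeps the result easy for the agent to read back in its next step.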

BM25 Index Optimization

Regenerate search indices with:

# Run once to create optimized search index
python -c "from app import initialize_models; initialize_models()"

Batch Processing

Process multiple datasets programmatically using the agent directly.

📋 Dependencies

The project requires the following Python packages (see requirements.txt):

pandas, shapely, geopandas, numpy, rtree, pyproj
matplotlib, requests, duckduckgo-search
smolagents[toolkit], smolagents[litellm]
dotenv, beautifulsoup4, reportlab>=3.6.0
scikit-learn, gradio, python-docx
scipy, openpyxl, unidecode, rank_bm25

📄 License

This project was developed for the Gradio MCP x Agents Hackathon and is released under the MIT license, as declared in the Space metadata. See individual tool licenses for third-party components.

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Add your improvements
  4. Submit a pull request

🎉 Ready to explore French public data with AI? Launch the interface and start analyzing!

🔥 NEW: Try the follow-up analysis feature to dive deeper into your reports!