title: Datagouv French Data Analyst
emoji: π
colorFrom: pink
colorTo: blue
sdk: gradio
sdk_version: 5.33.0
app_file: app.py
pinned: false
license: mit
short_description: French public data analysis agent.
tags:
- agent-demo-track
French Public Data Analysis Agent
AI-powered intelligent analysis of French public datasets with automated visualization generation, comprehensive DOCX reports, and interactive follow-up analysis capabilities.
Video Link
Quick link (sped up to shorten watch time): https://www.loom.com/share/133940ce6f5f4708ba695e1c1b28cc10?sid=95c55c10-f297-40ad-bf82-8aa167bb108d
Features
Intelligent Dataset Discovery
- BM25 Keyword Search: Advanced keyword matching with pre-computed search indices
- Bilingual Query Translation: Search in French or English; queries are translated automatically by an LLM
- Quality-Weighted Random Selection: Leave query empty to randomly select high-quality datasets
- Real-time Dataset Matching: Instant matching against 5,000+ French government datasets
- Dynamic Dataset Search: Agent can search for alternative datasets if initial results aren't suitable
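The quality-weighted random selection described above can be sketched in a few lines. This is a minimal illustration, not the project's code: the `title` and `quality_score` keys and the `pick_random_dataset` helper are hypothetical, and the real index in `filtered_dataset.csv` may use different column names.

```python
import random


def pick_random_dataset(datasets, rng=None):
    """Pick one dataset at random, favouring higher quality scores.

    `datasets` is a list of dicts with hypothetical keys "title" and
    "quality_score"; adapt the keys to the real index columns.
    """
    rng = rng or random.Random()
    # Negative scores are clamped to zero so they can never be selected.
    weights = [max(d["quality_score"], 0.0) for d in datasets]
    if not any(weights):
        # Every score is zero: fall back to a uniform choice.
        return rng.choice(datasets)
    return rng.choices(datasets, weights=weights, k=1)[0]


catalog = [
    {"title": "Road traffic accidents", "quality_score": 0.9},
    {"title": "Education directory", "quality_score": 0.6},
    {"title": "Sparse legacy export", "quality_score": 0.0},
]
chosen = pick_random_dataset(catalog, random.Random(0))
print(chosen["title"])
```

Passing a seeded `random.Random` makes the draw reproducible, which helps when debugging a run that started from an empty query.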
Automated AI Analysis
- SmolAgents Integration: Advanced AI agent with 30+ step planning capability
- Custom Tool Suite: Specialized tools for web scraping, data analysis, and visualization
- Multi-step Processing: Complete pipeline from data discovery to report generation
- Error Recovery: Smart error handling and alternative data source selection
- Autonomous Decision Making: Agent can choose from provided results or find better alternatives
Interactive Follow-up Analysis (NEW)
- Dedicated Follow-up Agent: Specialized AI for answering questions about generated reports
- Dataset Continuity: Automatically loads and analyzes the same dataset from previous report
- Advanced Analytics: Correlation analysis, statistical summaries, custom filtering
- Interactive Visualizations: Create new charts and graphs based on follow-up questions
- Multiple Analysis Types: Support for bar charts, scatter plots, histograms, box plots, and more
- Example-Driven Interface: Quick-start examples for common follow-up questions
Advanced Visualizations
- France Geographic Maps: Department and region-level choropleth maps
- Multiple Chart Types: Bar charts, line plots, scatter plots, heatmaps, histograms, box plots
- Smart Visualization Selection: AI automatically chooses appropriate chart types
- High-Quality PNG Output: Publication-ready visualizations
- Follow-up Visualizations: Generate additional charts based on user questions
Comprehensive Reports
- Professional DOCX Reports: Complete analysis with embedded visualizations
- Bilingual Support: Reports generated in the same language as your query
- Structured Analysis: Title page, methodology, findings, and next steps
- Direct DOCX Generation: Built with python-docx, so no external office suite is required
- Report Continuity: Follow-up analysis references previous report context
Modern Web Interface
- Real-time Progress Tracking: Detailed step-by-step progress updates
- Responsive Design: Beautiful, modern Gradio interface
- Quick Start Examples: Pre-built queries for common use cases
- Accordion Tips: Collapsible help section with usage instructions
- Follow-up Interface: Dedicated section for asking follow-up questions
- Visual Feedback: Progress bars and status indicators
Quick Start
1. Prerequisites
- Python 3.10+ (Gradio 5 and smolagents both require at least 3.10)
- Gemini API key
2. Installation
# Clone the repository
git clone <repository-url>
cd datagouv-french-data-analyst
# Install dependencies
pip install -r requirements.txt
3. Environment Setup
Create a .env file in the project root:
GEMINI_API_KEY=your_Gemini_api_key_here
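A missing or placeholder key is easier to diagnose at startup than in the middle of an analysis. The sketch below shows one way to fail fast; `get_gemini_api_key` is a hypothetical helper, not a function from this repository, and it assumes the .env file has already been loaded into the process environment.

```python
import os

# Assumption: the app loads .env at startup, for example with
#   from dotenv import load_dotenv; load_dotenv()
# That call is left out here so the sketch runs with the standard library only.

PLACEHOLDER = "your_Gemini_api_key_here"


def get_gemini_api_key(env=None):
    """Return the Gemini API key, or raise a clear error if it is unusable."""
    env = os.environ if env is None else env
    key = env.get("GEMINI_API_KEY", "").strip()
    if not key or key == PLACEHOLDER:
        raise RuntimeError(
            "GEMINI_API_KEY is not set; add it to .env in the project root."
        )
    return key


# Example with an explicit mapping instead of the real environment.
print(get_gemini_api_key({"GEMINI_API_KEY": "example-key"}))
```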
4. Launch the Application
Option 1: Using the launch script (Recommended)
python launch_gradio.py
Option 2: Direct launch
python app.py
The interface will be available at:
- Local: http://localhost:7860
- Public: Shareable URL provided automatically
How to Use
Basic Analysis Workflow
1. Enter Your Query: Type any search term related to French public data
   - Examples: "road traffic accidents", "education directory", "housing data"
   - Supports both French and English queries
2. Or Use Quick Examples: Click any of the pre-built example queries:
   - Road Traffic Accidents 2023
   - Education Directory
   - French Vacant Housing Private Park
3. Or Go Random: Leave the query empty to randomly select a high-quality dataset
4. Click "Analyze Dataset": The AI agent begins processing (7-15 minutes)
Follow-up Analysis Workflow
After the initial analysis is complete:
1. Follow-up Section Appears: Located below the generated visualizations
2. Ask Follow-up Questions: Use the dedicated input field to ask questions about the report
3. Use Example Questions: Click pre-built examples like:
   - Correlation Analysis
   - Statistical Summary
   - Filter & Analyze
   - Dataset Overview
   - Trend Analysis
   - Custom Visualization
4. Get Detailed Answers: Receive both text explanations and new visualizations
Results
- Download DOCX Report: Complete analysis with all visualizations
- View Individual Charts: Up to 4 visualizations displayed in the interface
- Dataset Reference: Direct link to the original data.gouv.fr page
- Follow-up Visualizations: Additional charts generated from follow-up questions
Technical Architecture
Core Components
Project Structure
├── app.py                    # Main Gradio interface with progress tracking
├── launch_gradio.py          # Simplified launch script
├── agent.py                  # SmolAgents configuration and prompt generation
├── followup_agent.py         # Follow-up analysis agent
├── tools/                    # Custom agent tools
│   ├── webpage_tools.py      # Web scraping and data extraction
│   ├── exploration_tools.py  # Dataset analysis and description
│   ├── drawing_tools.py      # France map generation and visualization
│   ├── libreoffice_tools.py  # Document utilities (legacy)
│   ├── followup_tools.py     # Follow-up analysis tools
│   └── retrieval_tools.py    # Dataset search and retrieval
├── filtered_dataset.csv      # Pre-processed dataset index (5,000+ datasets)
├── france_data/              # Geographic data for France maps
└── generated_data/           # Output folder for reports and visualizations
Key Technologies
- Frontend: Gradio with custom CSS and real-time progress
- AI Agents:
  - Primary SmolAgents powered by Gemini
  - Specialized follow-up agent for interactive analysis
- Search: BM25 keyword matching with TF-IDF preprocessing
- Translation: LLM-powered bilingual query translation
- Visualization: Matplotlib, Geopandas, Seaborn
- Report Generation: python-docx for DOCX documents
- Data Processing: Pandas, NumPy, Shapely, Scipy
- Follow-up Analytics: Statistical analysis, correlation studies, custom filtering
Smart Features
Enhanced BM25 Search
- Pre-computed search indices for 5,000+ datasets
- Accent-insensitive keyword matching
- Plural form normalization
- Quality-score weighted ranking
- Dynamic dataset retrieval during analysis
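To make the search behaviour above concrete, here is a minimal, standard-library sketch of BM25 ranking with accent folding, crude plural normalization, and quality weighting. It is an illustration under assumptions, not the project's implementation: the app depends on rank_bm25 and unidecode, whereas this version uses `unicodedata` and a hand-written scorer with a non-negative IDF variant. The names `normalize` and `bm25_rank`, and the way the quality score multiplies the BM25 score, are hypothetical.

```python
import math
import unicodedata
from collections import Counter


def normalize(text):
    """Lowercase, strip accents, and crudely singularize each token."""
    folded = unicodedata.normalize("NFKD", text.lower())
    folded = "".join(ch for ch in folded if not unicodedata.combining(ch))
    tokens = "".join(ch if ch.isalnum() else " " for ch in folded).split()
    # Naive plural handling: drop a trailing "s" or "x" on longer words.
    return [t[:-1] if len(t) > 3 and t[-1] in "sx" else t for t in tokens]


def bm25_rank(query, docs, quality, k1=1.5, b=0.75):
    """Return document indices sorted by BM25 score times quality weight."""
    if not docs:
        return []
    tokenized = [normalize(d) for d in docs]
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)
    doc_freq = Counter(term for t in tokenized for term in set(t))
    n = len(docs)
    query_terms = normalize(query)
    scores = []
    for tokens, weight in zip(tokenized, quality):
        counts = Counter(tokens)
        score = 0.0
        for term in query_terms:
            tf = counts[term]
            if tf == 0:
                continue
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score * weight)
    return sorted(range(n), key=lambda i: scores[i], reverse=True)


docs = [
    "Accidents corporels de la circulation routière",
    "Annuaire de l'éducation",
    "Logements vacants du parc privé",
]
quality = [0.9, 0.8, 0.7]
best = bm25_rank("accident routiere", docs, quality)[0]
print(docs[best])
```

The query has no accents and uses the singular form, yet it still matches the first title because both sides pass through the same `normalize` step. The singularization rule is deliberately simplistic and will mangle some words; a real index would want a proper stemmer.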
Follow-up Analysis System
- Dataset Continuity: Automatically loads previous analysis dataset
- Context Awareness: References previous report findings
- Multi-modal Analysis: Combines statistical analysis with visualizations
- Tool Integration: 8+ specialized follow-up tools including:
  - load_previous_dataset(): Load analysis dataset
  - get_dataset_summary(): Comprehensive dataset overview
  - create_followup_visualization(): Generate custom charts
  - analyze_column_correlation(): Statistical correlation analysis
  - create_statistical_summary(): Advanced statistical reports
  - filter_and_visualize_data(): Targeted data filtering and visualization
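As an illustration of what a correlation tool in this list might compute, the sketch below calculates a Pearson coefficient between two CSV columns using only the standard library. The signature of the real analyze_column_correlation() is not documented here, so `column_correlation` and the sample data are hypothetical; the actual tool most likely relies on pandas or SciPy.

```python
import csv
import io
import math


def column_correlation(csv_text, col_x, col_y):
    """Pearson correlation between two numeric CSV columns.

    Rows where either value is missing or non-numeric are skipped.
    """
    pairs = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        try:
            pairs.append((float(row[col_x]), float(row[col_y])))
        except (KeyError, TypeError, ValueError):
            continue
    if len(pairs) < 2:
        raise ValueError("need at least two numeric rows")
    mean_x = sum(x for x, _ in pairs) / len(pairs)
    mean_y = sum(y for _, y in pairs) / len(pairs)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


# Made-up sample; the last row has a non-numeric value and is ignored.
sample = (
    "dept,population,accidents\n"
    "75,2100,420\n"
    "13,2000,400\n"
    "69,1800,360\n"
    "2A,n/a,30\n"
)
r = column_correlation(sample, "population", "accidents")
print(round(r, 3))
```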
LLM Translation
- Automatic translation between French and English
- Query language detection
- Bilingual result matching
- Context-aware translations
Progress System
- Thread-safe progress tracking
- Queue-based status updates
- Step-by-step visual feedback
- Non-blocking UI execution
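The queue-based pattern above can be sketched as follows: a worker thread runs the long steps and pushes status tuples onto a `queue.Queue`, while the caller drains the queue and forwards each update. This is a generic sketch, not the code in app.py; `run_with_progress` and the step labels are hypothetical.

```python
import queue
import threading


def run_with_progress(steps, on_update):
    """Run `steps` in a worker thread and forward progress through a queue.

    `steps` is a list of (label, callable) pairs. `on_update` receives
    (fraction_done, label) after each step completes.
    """
    updates = queue.Queue()  # thread-safe hand-off between worker and caller
    done = object()          # sentinel marking the end of the stream

    def worker():
        try:
            for index, (label, action) in enumerate(steps, start=1):
                action()
                updates.put((index / len(steps), label))
        finally:
            # Always signal completion, even if a step raised.
            updates.put(done)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    while True:
        item = updates.get()  # waits for the next update without polling
        if item is done:
            break
        on_update(*item)
    thread.join()


log = []
run_with_progress(
    [("Searching datasets", lambda: None), ("Generating report", lambda: None)],
    lambda fraction, label: log.append((fraction, label)),
)
print(log)
```

In a Gradio handler, the draining loop could yield each update instead of calling a callback, since Gradio accepts generator functions for streaming output. Note that an exception inside a step ends the stream early in this sketch; a fuller version would pass the error through the queue as well.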
Troubleshooting
Common Issues
"No CSV/JSON files found"
- The selected dataset doesn't contain processable files
- Try a different query or use the random selection
- Agent will automatically search for alternative datasets
DOCX report generation fails
- Ensure python-docx is installed correctly
- Check the console for specific error messages
Translation errors
- Verify your API key is valid
- Check API quota and rate limits
Slow performance
- BM25 index computation may take time on first run
- Pre-computed indices are cached for faster subsequent searches
Follow-up analysis errors
- Ensure the initial analysis completed successfully
- Check that dataset files exist in the generated_data/ folder
- Verify follow-up question is clear and specific
Performance Optimization
- Pre-compute BM25: Run the search once to generate bm25_data.pkl
- Use SSD storage: Faster file I/O for large datasets
- Monitor API usage: Both query translation and agent execution consume API calls
- Clean generated_data: Remove old files to improve follow-up performance
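Cleaning generated_data/ can be scripted. The sketch below removes files older than a given age and defaults to a dry run; the `clean_generated_data` helper and the seven-day threshold are assumptions for illustration. Keep in mind that follow-up analysis reads the dataset from this folder, so deleting recent files would break follow-up questions on the latest report.

```python
import time
from pathlib import Path


def clean_generated_data(folder="generated_data", max_age_days=7, dry_run=True):
    """Find files older than `max_age_days` and optionally delete them.

    Returns the list of stale paths. With dry_run=True nothing is removed.
    """
    root = Path(folder)
    if not root.is_dir():
        return []
    cutoff = time.time() - max_age_days * 86400
    stale = [
        path
        for path in root.rglob("*")
        if path.is_file() and path.stat().st_mtime < cutoff
    ]
    if not dry_run:
        for path in stale:
            path.unlink()
    return stale


# Dry run against the default folder: only lists what would be removed.
for path in clean_generated_data():
    print("stale:", path)
```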
Dataset Coverage
- 5,000+ Datasets: Pre-filtered French government datasets
- Data Sources: data.gouv.fr, INSEE, regional authorities
- File Formats: CSV, JSON, Excel, XML
- Topics: All major sectors of French public administration
- Quality Scores: Datasets ranked by completeness and usability
- Real-time Search: Agent can discover additional datasets during analysis
Advanced Usage
Follow-up Analysis Examples
Correlation Analysis:
Show me the correlation between two numerical columns with a scatter plot
Statistical Summary:
Create a comprehensive statistical summary with visualization for unemployment rates
Custom Filtering:
Filter accidents data by night time conditions and create a visualization
Trend Analysis:
Create a line chart showing accident trends over the months
Custom Tool Development
Add new tools to the tools/ directory following the SmolAgents tool pattern.
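A new tool can start as a plain, typed function with a docstring that describes each argument, which is the shape the smolagents `tool` decorator expects. The example below is a hypothetical tool, not one shipped in tools/; `count_csv_rows` and `as_agent_tool` are illustrative names, and the wrapping step only runs when smolagents is installed.

```python
import csv
import io


def count_csv_rows(csv_text: str) -> int:
    """Count the data rows in a CSV document, excluding the header.

    Args:
        csv_text: Full CSV content as a string, with a header row.
    """
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row if there is one
    return sum(1 for row in reader if row)


def as_agent_tool():
    """Wrap the function as an agent tool (requires smolagents)."""
    # Imported lazily so the plain function stays usable without smolagents.
    from smolagents import tool

    return tool(count_csv_rows)


print(count_csv_rows("dept,accidents\n75,420\n13,400\n"))
```

Keeping the logic in a plain function makes it easy to unit-test before handing it to the agent.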
BM25 Index Optimization
Regenerate search indices with:
# Run once to create optimized search index
python -c "from app import initialize_models; initialize_models()"
Batch Processing
Process multiple datasets programmatically using the agent directly.
Dependencies
The project requires the following Python packages (see requirements.txt):
pandas, shapely, geopandas, numpy, rtree, pyproj
matplotlib, requests, duckduckgo-search
smolagents[toolkit], smolagents[litellm]
dotenv, beautifulsoup4, reportlab>=3.6.0
scikit-learn, gradio, python-docx
scipy, openpyxl, unidecode, rank_bm25
License
This project is developed for the Gradio MCP x Agents Hackathon. See individual tool licenses for third-party components.
Contributing
- Fork the repository
- Create a feature branch
- Add your improvements
- Submit a pull request
Ready to explore French public data with AI? Launch the interface and start analyzing!
NEW: Try the follow-up analysis feature to dive deeper into your reports!