---
title: Datagouv French Data Analyst
emoji: 🌍
colorFrom: pink
colorTo: blue
sdk: gradio
sdk_version: 5.33.0
app_file: app.py
pinned: false
license: mit
short_description: Public French data analysis agent.
tags: [agent-demo-track]
---

# 🤖 French Public Data Analysis Agent

**AI-powered intelligent analysis of French public datasets** with automated visualization generation, comprehensive DOCX reports, and **interactive follow-up analysis capabilities**.

# Video Link

Quick demo, sped up to shorten watch time: https://www.loom.com/share/133940ce6f5f4708ba695e1c1b28cc10?sid=95c55c10-f297-40ad-bf82-8aa167bb108d

## ✨ Features

### 🔍 **Intelligent Dataset Discovery**

- **BM25 Keyword Search**: Keyword matching against pre-computed search indices
- **Bilingual Query Translation**: Search in French or English; queries are translated automatically by an LLM
- **Quality-Weighted Random Selection**: Leave the query empty to randomly select a high-quality dataset
- **Real-time Dataset Matching**: Instant matching against 5,000+ French government datasets
- **Dynamic Dataset Search**: The agent can search for alternative datasets if the initial results aren't suitable

### 🤖 **Automated AI Analysis**

- **SmolAgents Integration**: AI agent with 30+ step planning capability
- **Custom Tool Suite**: Specialized tools for web scraping, data analysis, and visualization
- **Multi-step Processing**: Complete pipeline from data discovery to report generation
- **Error Recovery**: Error handling with fallback to alternative data sources
- **Autonomous Decision Making**: The agent can choose from the provided results or find better alternatives

### 🎯 **Interactive Follow-up Analysis** ⭐ NEW

- **Dedicated Follow-up Agent**: Specialized AI for answering questions about generated reports
- **Dataset Continuity**: Automatically loads and analyzes the same dataset as the previous report
- **Advanced Analytics**: Correlation analysis, statistical summaries, custom filtering
- **Interactive Visualizations**: Create new charts and graphs based on follow-up questions
- **Multiple Analysis Types**: Support for bar charts, scatter plots, histograms, box plots, and more
- **Example-Driven Interface**: Quick-start examples for common follow-up questions

### 📊 **Advanced Visualizations**

- **France Geographic Maps**: Department and region-level choropleth maps
- **Multiple Chart Types**: Bar charts, line plots, scatter plots, heatmaps, histograms, box plots
- **Smart Visualization Selection**: The AI automatically chooses appropriate chart types
- **High-Quality PNG Output**: Publication-ready visualizations
- **Follow-up Visualizations**: Generate additional charts based on user questions

### 📄 **Comprehensive Reports**

- **Professional DOCX Reports**: Complete analysis with embedded visualizations
- **Bilingual Support**: Reports are generated in the same language as your query
- **Structured Analysis**: Title page, methodology, findings, and next steps
- **Direct DOCX Generation**: No external dependencies required
- **Report Continuity**: Follow-up analysis references the previous report's context

### 🎨 **Modern Web Interface**

- **Real-time Progress Tracking**: Detailed step-by-step progress updates
- **Responsive Design**: Modern Gradio interface
- **Quick Start Examples**: Pre-built queries for common use cases
- **Accordion Tips**: Collapsible help section with usage instructions
- **Follow-up Interface**: Dedicated section for asking follow-up questions
- **Visual Feedback**: Progress bars and status indicators

## 🚀 Quick Start

### 1. Prerequisites

- Python 3.8+
- Gemini API key

### 2. Installation

```bash
# Clone the repository
git clone <repository-url>
cd datagouv-french-data-analyst

# Install dependencies
pip install -r requirements.txt
```

### 3. Environment Setup

Create a `.env` file in the project root:

```bash
GEMINI_API_KEY=your_Gemini_api_key_here
```

### 4. Launch the Application

**Option 1: Using the launch script (Recommended)**

```bash
python launch_gradio.py
```

**Option 2: Direct launch**

```bash
python app.py
```

The interface will be available at:

- **Local**: http://localhost:7860
- **Public**: Shareable URL provided automatically

## 💡 How to Use

### Basic Analysis Workflow

1. **Enter Your Query**: Type any search term related to French public data
   - Examples: "road traffic accidents", "education directory", "housing data"
   - Supports both French and English queries
2. **Or Use Quick Examples**: Click any of the pre-built example queries:
   - 🚗 Road Traffic Accidents 2023
   - 🎓 Education Directory
   - 🏠 French Vacant Housing Private Park
3. **Or Go Random**: Leave the query empty to randomly select a high-quality dataset
4. **Click "🚀 Analyze Dataset"**: The AI agent begins processing (7-15 minutes)

### Follow-up Analysis Workflow

After the initial analysis is complete:

1. **Follow-up Section Appears**: Located below the generated visualizations
2. **Ask Follow-up Questions**: Use the dedicated input field to ask questions about the report
3. **Use Example Questions**: Click pre-built examples like:
   - 📊 Correlation Analysis
   - 📈 Statistical Summary
   - 🎯 Filter & Analyze
   - 📋 Dataset Overview
   - 📉 Trend Analysis
   - 🔍 Custom Visualization
4. **Get Detailed Answers**: Receive both text explanations and new visualizations

### Results

- **Download DOCX Report**: Complete analysis with all visualizations
- **View Individual Charts**: Up to 4 visualizations displayed in the interface
- **Dataset Reference**: Direct link to the original data.gouv.fr page
- **Follow-up Visualizations**: Additional charts generated from follow-up questions

## 🛠️ Technical Architecture

### Core Components

```
📁 Project Structure
├── app.py                    # Main Gradio interface with progress tracking
├── launch_gradio.py          # Simplified launch script
├── agent.py                  # SmolAgents configuration and prompt generation
├── followup_agent.py         # Follow-up analysis agent
├── tools/                    # Custom agent tools
│   ├── webpage_tools.py      # Web scraping and data extraction
│   ├── exploration_tools.py  # Dataset analysis and description
│   ├── drawing_tools.py      # France map generation and visualization
│   ├── libreoffice_tools.py  # Document utilities (legacy)
│   ├── followup_tools.py     # Follow-up analysis tools
│   └── retrieval_tools.py    # Dataset search and retrieval
├── filtered_dataset.csv      # Pre-processed dataset index (5,000+ datasets)
├── france_data/              # Geographic data for France maps
└── generated_data/           # Output folder for reports and visualizations
```

### Key Technologies

- **Frontend**: Gradio with custom CSS and real-time progress
- **AI Agents**:
  - Primary SmolAgents agent powered by Gemini
  - Specialized follow-up agent for interactive analysis ⭐
- **Search**: BM25 keyword matching with TF-IDF preprocessing
- **Translation**: LLM-powered bilingual query translation
- **Visualization**: Matplotlib, Geopandas, Seaborn
- **Report Generation**: python-docx for DOCX documents
- **Data Processing**: Pandas, NumPy, Shapely, Scipy
- **Follow-up Analytics**: Statistical analysis, correlation studies, custom filtering ⭐

### Smart Features

#### Enhanced BM25 Search

- Pre-computed search indices for 5,000+ datasets
- Accent-insensitive keyword matching
- Plural form normalization
- Quality-score weighted ranking
- Dynamic dataset retrieval during analysis ⭐

#### Follow-up Analysis System

- **Dataset Continuity**: Automatically loads the dataset from the previous analysis
- **Context Awareness**: References the previous report's findings
- **Multi-modal Analysis**: Combines statistical analysis with visualizations
- **Tool Integration**: 8+ specialized follow-up tools including:
  - `load_previous_dataset()` - Load analysis dataset
  - `get_dataset_summary()` - Comprehensive dataset overview
  - `create_followup_visualization()` - Generate custom charts
  - `analyze_column_correlation()` - Statistical correlation analysis
  - `create_statistical_summary()` - Advanced statistical reports
  - `filter_and_visualize_data()` - Targeted data filtering and visualization

#### LLM Translation

- Automatic French ↔ English translation
- Query language detection
- Bilingual result matching
- Context-aware translations

#### Progress System

- Thread-safe progress tracking
- Queue-based status updates
- Step-by-step visual feedback
- Non-blocking UI execution

## 🔧 Troubleshooting

### Common Issues

1. **"No CSV/JSON files found"**
   - The selected dataset doesn't contain processable files
   - Try a different query or use the random selection
   - The agent will automatically search for alternative datasets
2. **DOCX report generation fails**
   - Ensure python-docx is installed correctly
   - Check the console for specific error messages
3. **Translation errors**
   - Verify your API key is valid
   - Check API quota and rate limits
4. **Slow performance**
   - BM25 index computation may take time on first run
   - Pre-computed indices are cached for faster subsequent searches
5. **Follow-up analysis errors**
   - Ensure the initial analysis completed successfully
   - Check that dataset files exist in the `generated_data/` folder
   - Verify that the follow-up question is clear and specific

### Performance Optimization

- **Pre-compute BM25**: Run the search once to generate `bm25_data.pkl`
- **Use SSD storage**: Faster file I/O for large datasets
- **Monitor API usage**: Translation and agent execution both consume API calls
- **Clean generated_data**: Remove old files to improve follow-up performance

## 📊 Dataset Coverage

- **5,000+ Datasets**: Pre-filtered French government datasets
- **Data Sources**: data.gouv.fr, INSEE, regional authorities
- **File Formats**: CSV, JSON, Excel, XML
- **Topics**: All major sectors of French public administration
- **Quality Scores**: Datasets ranked by completeness and usability
- **Real-time Search**: The agent can discover additional datasets during analysis

## 🚀 Advanced Usage

### Follow-up Analysis Examples

**Correlation Analysis:**

```
Show me the correlation between two numerical columns with a scatter plot
```

**Statistical Summary:**

```
Create a comprehensive statistical summary with visualization for unemployment rates
```

**Custom Filtering:**

```
Filter accidents data by night time conditions and create a visualization
```

**Trend Analysis:**

```
Create a line chart showing accident trends over the months
```

### Custom Tool Development

Add new tools to the `tools/` directory following the SmolAgents tool pattern.

### BM25 Index Optimization

Regenerate search indices with:

```bash
# Run once to create optimized search index
python -c "from app import initialize_models; initialize_models()"
```

### Batch Processing

Process multiple datasets programmatically using the agent directly.
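A batch run could be structured roughly as follows. This is a minimal sketch, not the project's actual API: `run_batch` and `fake_analyze` are hypothetical names, and `fake_analyze` (including the report path it returns) is only a stand-in for whatever call into `agent.py` performs a single analysis. The point it illustrates is isolating failures so that one unusable dataset does not stop the remaining queries.

```python
from typing import Callable, Dict, Iterable


def run_batch(queries: Iterable[str], analyze: Callable[[str], str]) -> Dict[str, dict]:
    """Run `analyze` once per query and record success or failure for each."""
    results: Dict[str, dict] = {}
    for query in queries:
        try:
            results[query] = {"ok": True, "output": analyze(query)}
        except Exception as exc:
            # Keep going when a single dataset fails; record the reason instead.
            results[query] = {"ok": False, "error": str(exc)}
    return results


def fake_analyze(query: str) -> str:
    """Hypothetical stand-in for the real agent call; returns a made-up report path."""
    if not query.strip():
        raise ValueError("empty query")
    return f"generated_data/{query.replace(' ', '_')}.docx"


if __name__ == "__main__":
    summary = run_batch(["road traffic accidents", "housing data"], fake_analyze)
    for query, outcome in summary.items():
        print(query, "->", outcome)
```

To adapt it, pass a function that wraps the real agent in place of `fake_analyze`. Since a single analysis takes 7-15 minutes, a sequential loop like this one keeps API usage predictable.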
## 📋 Dependencies

The project requires the following Python packages (see `requirements.txt`):

```
pandas, shapely, geopandas, numpy, rtree, pyproj
matplotlib, requests, duckduckgo-search
smolagents[toolkit], smolagents[litellm]
dotenv, beautifulsoup4, reportlab>=3.6.0
scikit-learn, gradio, python-docx
scipy, openpyxl, unidecode, rank_bm25
```

## 📄 License

This project is developed for the Gradio MCP x Agents Hackathon. See individual tool licenses for third-party components.

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch
3. Add your improvements
4. Submit a pull request

---

**🎉 Ready to explore French public data with AI? Launch the interface and start analyzing!**

**🔥 NEW: Try the follow-up analysis feature to dive deeper into your reports!**