---
title: Datagouv French Data Analyst
emoji: 🌍
colorFrom: pink
colorTo: blue
sdk: gradio
sdk_version: 5.33.0
app_file: app.py
pinned: false
license: mit
short_description: Public french data analysis agent.
tags: [agent-demo-track]
---
# 🤖 French Public Data Analysis Agent
**AI-powered intelligent analysis of French public datasets** with automated visualization generation, comprehensive DOCX reports, and **interactive follow-up analysis capabilities**.
# Video Link
Quick demo video, sped up to shorten watch time: https://www.loom.com/share/133940ce6f5f4708ba695e1c1b28cc10?sid=95c55c10-f297-40ad-bf82-8aa167bb108d
## ✨ Features
### 🔍 **Intelligent Dataset Discovery**
- **BM25 Keyword Search**: Advanced keyword matching with pre-computed search indices
- **Bilingual Query Translation**: Search in French or English - queries are automatically translated using LLM
- **Quality-Weighted Random Selection**: Leave query empty to randomly select high-quality datasets
- **Real-time Dataset Matching**: Instant matching against 5,000+ French government datasets
- **Dynamic Dataset Search**: Agent can search for alternative datasets if initial results aren't suitable
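
The quality-weighted random selection can be sketched with the standard library alone. This is a minimal illustration rather than the app's actual code: the `quality_score` field name, the `pick_random_dataset` helper, and the sample records are all assumptions.

```python
import random


def pick_random_dataset(datasets, rng=None):
    """Pick one dataset at random, favoring higher quality scores.

    `datasets` is a list of dicts carrying a hypothetical `quality_score` key.
    """
    rng = rng or random.Random()
    weights = [max(d["quality_score"], 0.0) for d in datasets]
    if not any(weights):
        # Every score is zero: fall back to a uniform choice.
        return rng.choice(datasets)
    return rng.choices(datasets, weights=weights, k=1)[0]


# Hypothetical sample records, for illustration only.
sample = [
    {"title": "Dataset A", "quality_score": 0.9},
    {"title": "Dataset B", "quality_score": 0.1},
    {"title": "Dataset C", "quality_score": 0.0},
]
chosen = pick_random_dataset(sample, rng=random.Random(0))
print(chosen["title"])
```

A dataset with a zero score is never drawn as long as at least one other dataset has a positive score.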
### 🤖 **Automated AI Analysis**
- **SmolAgents Integration**: Advanced AI agent with 30+ step planning capability
- **Custom Tool Suite**: Specialized tools for web scraping, data analysis, and visualization
- **Multi-step Processing**: Complete pipeline from data discovery to report generation
- **Error Recovery**: Smart error handling and alternative data source selection
- **Autonomous Decision Making**: Agent can choose from provided results or find better alternatives
### 🎯 **Interactive Follow-up Analysis** ⭐ NEW
- **Dedicated Follow-up Agent**: Specialized AI for answering questions about generated reports
- **Dataset Continuity**: Automatically loads and analyzes the same dataset as the previous report
- **Advanced Analytics**: Correlation analysis, statistical summaries, custom filtering
- **Interactive Visualizations**: Create new charts and graphs based on follow-up questions
- **Multiple Analysis Types**: Support for bar charts, scatter plots, histograms, box plots, and more
- **Example-Driven Interface**: Quick-start examples for common follow-up questions
### 📊 **Advanced Visualizations**
- **France Geographic Maps**: Department and region-level choropleth maps
- **Multiple Chart Types**: Bar charts, line plots, scatter plots, heatmaps, histograms, box plots
- **Smart Visualization Selection**: AI automatically chooses appropriate chart types
- **High-Quality PNG Output**: Publication-ready visualizations
- **Follow-up Visualizations**: Generate additional charts based on user questions
### 📄 **Comprehensive Reports**
- **Professional DOCX Reports**: Complete analysis with embedded visualizations
- **Bilingual Support**: Reports generated in the same language as your query
- **Structured Analysis**: Title page, methodology, findings, and next steps
- **Direct DOCX Generation**: No external dependencies required
- **Report Continuity**: Follow-up analysis references previous report context
### 🎨 **Modern Web Interface**
- **Real-time Progress Tracking**: Detailed step-by-step progress updates
- **Responsive Design**: Beautiful, modern Gradio interface
- **Quick Start Examples**: Pre-built queries for common use cases
- **Accordion Tips**: Collapsible help section with usage instructions
- **Follow-up Interface**: Dedicated section for asking follow-up questions
- **Visual Feedback**: Progress bars and status indicators
## 🚀 Quick Start
### 1. Prerequisites
- Python 3.10+ (Gradio 5.x, pinned here via `sdk_version: 5.33.0`, does not support older versions)
- Gemini API key
### 2. Installation
```bash
# Clone the repository
git clone <repository-url>
cd datagouv-french-data-analyst
# Install dependencies
pip install -r requirements.txt
```
### 3. Environment Setup
Create a `.env` file in the project root:
```bash
GEMINI_API_KEY=your_Gemini_api_key_here
```
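
A missing key usually only shows up once the agent makes its first API call, so it can help to check for it at startup. The sketch below is an assumption about how such a check might look, not code from this project; `get_gemini_api_key` is a hypothetical helper, and it only reads the environment (loading the `.env` file itself is left to the app).

```python
import os


def get_gemini_api_key(env=os.environ):
    """Return the Gemini API key, or raise a clear error if it is missing."""
    key = env.get("GEMINI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "GEMINI_API_KEY is not set. Add it to your .env file or export it "
            "in your shell before launching the app."
        )
    return key
```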
### 4. Launch the Application
**Option 1: Using the launch script (Recommended)**
```bash
python launch_gradio.py
```
**Option 2: Direct launch**
```bash
python app.py
```
The interface will be available at:
- **Local**: http://localhost:7860
- **Public**: Shareable URL provided automatically
## 💡 How to Use
### Basic Analysis Workflow
1. **Enter Your Query**: Type any search term related to French public data
   - Examples: "road traffic accidents", "education directory", "housing data"
   - Supports both French and English queries
2. **Or Use Quick Examples**: Click any of the pre-built example queries:
   - 🚗 Road Traffic Accidents 2023
   - 🎓 Education Directory
   - 🏠 French Vacant Housing Private Park
3. **Or Go Random**: Leave the query empty to randomly select a high-quality dataset
4. **Click "🚀 Analyze Dataset"**: The AI agent begins processing (7-15 minutes)
### Follow-up Analysis Workflow
After the initial analysis is complete:
1. **Follow-up Section Appears**: Located below the generated visualizations
2. **Ask Follow-up Questions**: Use the dedicated input field to ask questions about the report
3. **Use Example Questions**: Click pre-built examples like:
   - 📊 Correlation Analysis
   - 📈 Statistical Summary
   - 🎯 Filter & Analyze
   - 📋 Dataset Overview
   - 📉 Trend Analysis
   - 🔍 Custom Visualization
4. **Get Detailed Answers**: Receive both text explanations and new visualizations
### Results
- **Download DOCX Report**: Complete analysis with all visualizations
- **View Individual Charts**: Up to 4 visualizations displayed in the interface
- **Dataset Reference**: Direct link to the original data.gouv.fr page
- **Follow-up Visualizations**: Additional charts generated from follow-up questions
## 🛠️ Technical Architecture
### Core Components
```
📁 Project Structure
├── app.py                     # Main Gradio interface with progress tracking
├── launch_gradio.py           # Simplified launch script
├── agent.py                   # SmolAgents configuration and prompt generation
├── followup_agent.py          # Follow-up analysis agent
├── tools/                     # Custom agent tools
│   ├── webpage_tools.py       # Web scraping and data extraction
│   ├── exploration_tools.py   # Dataset analysis and description
│   ├── drawing_tools.py       # France map generation and visualization
│   ├── libreoffice_tools.py   # Document utilities (legacy)
│   ├── followup_tools.py      # Follow-up analysis tools
│   └── retrieval_tools.py     # Dataset search and retrieval
├── filtered_dataset.csv       # Pre-processed dataset index (5,000+ datasets)
├── france_data/               # Geographic data for France maps
└── generated_data/            # Output folder for reports and visualizations
```
### Key Technologies
- **Frontend**: Gradio with custom CSS and real-time progress
- **AI Agents**:
  - Primary SmolAgents agent powered by Gemini
  - Specialized follow-up agent for interactive analysis ⭐
- **Search**: BM25 keyword matching with TF-IDF preprocessing
- **Translation**: LLM-powered bilingual query translation
- **Visualization**: Matplotlib, Geopandas, Seaborn
- **Report Generation**: python-docx for DOCX documents
- **Data Processing**: Pandas, NumPy, Shapely, Scipy
- **Follow-up Analytics**: Statistical analysis, correlation studies, custom filtering ⭐
### Smart Features
#### Enhanced BM25 Search
- Pre-computed search indices for 5,000+ datasets
- Accent-insensitive keyword matching
- Plural form normalization
- Quality-score weighted ranking
- Dynamic dataset retrieval during analysis ⭐
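
The search features listed above can be illustrated with a small, self-contained BM25 ranker. This is a sketch under stated assumptions, not the project's implementation: the project lists `rank_bm25` and `unidecode` among its dependencies, while this version uses only the standard library, a common non-negative IDF variant, a deliberately naive plural rule (drop one trailing "s"), and a simple multiplication by a quality weight. The `normalize` and `bm25_rank` names are hypothetical.

```python
import math
import unicodedata
from collections import Counter


def normalize(text):
    """Lowercase, strip accents, and crudely singularize each token."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    tokens = "".join(c if c.isalnum() else " " for c in no_accents).split()
    # Naive plural handling: drop one trailing "s" from longer tokens.
    return [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]


def bm25_rank(query, docs, quality, k1=1.5, b=0.75):
    """Return doc indices sorted by BM25 score times a quality weight."""
    if not docs:
        return []
    tokenized = [normalize(d) for d in docs]
    avg_len = (sum(len(t) for t in tokenized) / len(tokenized)) or 1.0
    doc_freq = Counter(term for t in tokenized for term in set(t))
    n_docs = len(docs)
    query_terms = normalize(query)
    scores = []
    for tokens, weight in zip(tokenized, quality):
        counts = Counter(tokens)
        score = 0.0
        for term in query_terms:
            tf = counts[term]
            if tf == 0:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score * weight)
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)


# Hypothetical dataset titles, for illustration only.
titles = [
    "Accidents corporels de la circulation routière",
    "Annuaire de l'éducation",
    "Logements vacants du parc privé",
]
ranking = bm25_rank("accident routiere", titles, quality=[1.0, 1.0, 1.0])
print(titles[ranking[0]])
```

The unaccented, singular query still matches the first title because both the query and the titles go through the same `normalize` step.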
#### Follow-up Analysis System
- **Dataset Continuity**: Automatically loads the dataset used in the previous analysis
- **Context Awareness**: References previous report findings
- **Multi-modal Analysis**: Combines statistical analysis with visualizations
- **Tool Integration**: 8+ specialized follow-up tools, including:
  - `load_previous_dataset()` - Load analysis dataset
  - `get_dataset_summary()` - Comprehensive dataset overview
  - `create_followup_visualization()` - Generate custom charts
  - `analyze_column_correlation()` - Statistical correlation analysis
  - `create_statistical_summary()` - Advanced statistical reports
  - `filter_and_visualize_data()` - Targeted data filtering and visualization
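
The internals of these tools are not documented here. As a rough idea of the statistic a correlation tool reports, the sketch below computes a Pearson coefficient in plain Python. The `pearson_correlation` function is hypothetical and is not the signature of `analyze_column_correlation()`; the project itself more likely relies on pandas or SciPy, both listed in its dependencies.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation of two numeric sequences, skipping pairs with None."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    if len(pairs) < 2:
        raise ValueError("need at least two complete pairs")
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


# Made-up values, for illustration only.
print(pearson_correlation([1, 2, 3, 4], [2, 4, 6, 8]))
```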
#### LLM Translation
- Automatic French ↔ English translation
- Query language detection
- Bilingual result matching
- Context-aware translations
#### Progress System
- Thread-safe progress tracking
- Queue-based status updates
- Step-by-step visual feedback
- Non-blocking UI execution
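
One common way to combine these four points is to run the long task in a worker thread and pass status messages through a `queue.Queue`. The sketch below shows that pattern under stated assumptions: `stream_progress` and `demo_task` are hypothetical names, and the real app may be structured differently. Because it is a generator, it fits Gradio event handlers, which can be generator functions that yield intermediate outputs.

```python
import queue
import threading


def stream_progress(task):
    """Run `task(report)` in a worker thread and yield its progress updates.

    `task` receives a `report(step, message)` callback. Updates are yielded
    in the order the worker reported them.
    """
    updates = queue.Queue()  # thread-safe channel between worker and caller
    done = object()          # sentinel marking the end of the stream

    def worker():
        try:
            task(lambda step, message: updates.put((step, message)))
        finally:
            updates.put(done)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = updates.get()  # waits until the worker reports something
        if item is done:
            return
        yield item


def demo_task(report):
    # Hypothetical steps, for illustration only.
    report(1, "Searching datasets")
    report(2, "Analyzing data")
    report(3, "Generating report")


for step, message in stream_progress(demo_task):
    print(f"Step {step}: {message}")
```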
## 🔧 Troubleshooting
### Common Issues
1. **"No CSV/JSON files found"**
   - The selected dataset doesn't contain processable files
   - Try a different query or use the random selection
   - The agent will automatically search for alternative datasets
2. **DOCX report generation fails**
   - Ensure python-docx is installed correctly
   - Check the console for specific error messages
3. **Translation errors**
   - Verify your API key is valid
   - Check API quota and rate limits
4. **Slow performance**
   - BM25 index computation may take time on first run
   - Pre-computed indices are cached for faster subsequent searches
5. **Follow-up analysis errors**
   - Ensure the initial analysis completed successfully
   - Check that dataset files exist in the `generated_data/` folder
   - Verify that the follow-up question is clear and specific
### Performance Optimization
- **Pre-compute BM25**: Run the search once to generate `bm25_data.pkl`
- **Use SSD storage**: Faster file I/O for large datasets
- **Monitor API usage**: Both translation and agent execution consume API calls
- **Clean generated_data**: Remove old files to improve follow-up performance
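
The build-once, reuse-afterwards behavior behind `bm25_data.pkl` can be sketched as a small load-or-build helper. This is an assumption about the caching pattern, not the project's code: `load_or_build` is a hypothetical name and the cached object here is a placeholder dictionary.

```python
import os
import pickle
import tempfile


def load_or_build(cache_path, build):
    """Return the object cached at `cache_path`, building and saving it if absent."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    obj = build()
    tmp_path = cache_path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(obj, f)
    os.replace(tmp_path, cache_path)  # atomic rename avoids half-written caches
    return obj


build_calls = []


def build_index():
    # Stand-in for the expensive index computation.
    build_calls.append(1)
    return {"index": "placeholder"}


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "bm25_data.pkl")
    first = load_or_build(path, build_index)   # builds and writes the cache
    second = load_or_build(path, build_index)  # reads the cache, no rebuild
print(len(build_calls))
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.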
## 📊 Dataset Coverage
- **5,000+ Datasets**: Pre-filtered French government datasets
- **Data Sources**: data.gouv.fr, INSEE, regional authorities
- **File Formats**: CSV, JSON, Excel, XML
- **Topics**: All major sectors of French public administration
- **Quality Scores**: Datasets ranked by completeness and usability
- **Real-time Search**: Agent can discover additional datasets during analysis
## 🚀 Advanced Usage
### Follow-up Analysis Examples
**Correlation Analysis:**
```
Show me the correlation between two numerical columns with a scatter plot
```
**Statistical Summary:**
```
Create a comprehensive statistical summary with visualization for unemployment rates
```
**Custom Filtering:**
```
Filter accidents data by night time conditions and create a visualization
```
**Trend Analysis:**
```
Create a line chart showing accident trends over the months
```
### Custom Tool Development
Add new tools to the `tools/` directory following the SmolAgents tool pattern.
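
As a minimal sketch of that pattern, the example below uses the `@tool` decorator from `smolagents`, which builds the tool description from the type hints and the `Args:` section of the docstring. The `column_fill_rate` tool is hypothetical and not part of this project, and the `ImportError` fallback exists only so the snippet also runs where `smolagents` is not installed.

```python
try:
    from smolagents import tool
except ImportError:
    # Fallback so this sketch still runs without smolagents installed.
    def tool(func):
        return func


@tool
def column_fill_rate(csv_text: str, column: str) -> str:
    """Report the share of non-empty values in one column of a CSV document.

    Args:
        csv_text: Full CSV content, including the header row.
        column: Name of the column to inspect.
    """
    # Imports live inside the function so the tool stays self-contained.
    import csv
    import io

    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows or column not in rows[0]:
        return f"Column '{column}' not found."
    filled = sum(1 for row in rows if (row[column] or "").strip())
    return f"{column}: {filled}/{len(rows)} values filled ({filled / len(rows):.0%})"


print(column_fill_rate(csv_text="a,b\n1,\n2,x\n", column="b"))
```

A tool defined this way would then be passed to the agent through its list of tools, alongside the existing ones in `tools/`.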
### BM25 Index Optimization
Regenerate search indices with:
```bash
# Run once to create the optimized search index
python -c "from app import initialize_models; initialize_models()"
```
### Batch Processing
Process multiple datasets programmatically using the agent directly.
## 📋 Dependencies
The project requires the following Python packages (see `requirements.txt`):
```
pandas, shapely, geopandas, numpy, rtree, pyproj
matplotlib, requests, duckduckgo-search
smolagents[toolkit], smolagents[litellm]
dotenv, beautifulsoup4, reportlab>=3.6.0
scikit-learn, gradio, python-docx
scipy, openpyxl, unidecode, rank_bm25
```
## 📄 License
This project was developed for the Gradio MCP x Agents Hackathon and is released under the MIT license. See individual tool licenses for third-party components.
## 🀝 Contributing
1. Fork the repository
2. Create a feature branch
3. Add your improvements
4. Submit a pull request
---
**🎉 Ready to explore French public data with AI? Launch the interface and start analyzing!**
**🔥 NEW: Try the follow-up analysis feature to dive deeper into your reports!**