|
--- |
|
title: Datagouv French Data Analyst |
|
emoji: π |
|
colorFrom: pink |
|
colorTo: blue |
|
sdk: gradio |
|
sdk_version: 5.33.0 |
|
app_file: app.py |
|
pinned: false |
|
license: mit |
|
short_description: French public data analysis agent.
|
tags: [agent-demo-track] |
|
--- |
|
|
|
# French Public Data Analysis Agent
|
|
|
**AI-powered intelligent analysis of French public datasets** with automated visualization generation, comprehensive DOCX reports, and **interactive follow-up analysis capabilities**. |
|
|
|
# Video Link |
|
|
|
Quick demo video, sped up to shorten watch time: https://www.loom.com/share/133940ce6f5f4708ba695e1c1b28cc10?sid=95c55c10-f297-40ad-bf82-8aa167bb108d
|
|
|
## Features
|
|
|
### **Intelligent Dataset Discovery**
|
- **BM25 Keyword Search**: Advanced keyword matching with pre-computed search indices |
|
- **Bilingual Query Translation**: Search in French or English; queries are automatically translated by an LLM
|
- **Quality-Weighted Random Selection**: Leave query empty to randomly select high-quality datasets |
|
- **Real-time Dataset Matching**: Instant matching against 5,000+ French government datasets |
|
- **Dynamic Dataset Search**: Agent can search for alternative datasets if initial results aren't suitable |
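
The quality-weighted random selection described above could look roughly like the sketch below. It is only an illustration: the `(title, quality_score)` pair layout, the score scale, and the function name `pick_random_dataset` are assumptions, not the project's actual code.

```python
import random


def pick_random_dataset(datasets, rng=None):
    """Pick one dataset title, favouring higher quality scores.

    `datasets` is a list of (title, quality_score) pairs. Both the shape
    and the score scale are assumptions made for this sketch.
    """
    rng = rng or random.Random()
    titles = [title for title, _ in datasets]
    weights = [max(score, 0.0) for _, score in datasets]
    if not any(weights):
        # Fall back to a uniform draw when no dataset has a positive score.
        return rng.choice(titles)
    return rng.choices(titles, weights=weights, k=1)[0]


catalog = [
    ("Accidents corporels de la circulation", 0.9),
    ("Annuaire de l'éducation", 0.7),
    ("Empty placeholder dataset", 0.0),
]
print(pick_random_dataset(catalog, random.Random(42)))
```

A dataset with a score of zero is never drawn, while the others are picked in proportion to their scores.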
|
|
|
### **Automated AI Analysis**
|
- **SmolAgents Integration**: Advanced AI agent with 30+ step planning capability |
|
- **Custom Tool Suite**: Specialized tools for web scraping, data analysis, and visualization |
|
- **Multi-step Processing**: Complete pipeline from data discovery to report generation |
|
- **Error Recovery**: Smart error handling and alternative data source selection |
|
- **Autonomous Decision Making**: Agent can choose from provided results or find better alternatives |
|
|
|
### **Interactive Follow-up Analysis** (NEW)
|
- **Dedicated Follow-up Agent**: Specialized AI for answering questions about generated reports |
|
- **Dataset Continuity**: Automatically loads and analyzes the same dataset from previous report |
|
- **Advanced Analytics**: Correlation analysis, statistical summaries, custom filtering |
|
- **Interactive Visualizations**: Create new charts and graphs based on follow-up questions |
|
- **Multiple Analysis Types**: Support for bar charts, scatter plots, histograms, box plots, and more |
|
- **Example-Driven Interface**: Quick-start examples for common follow-up questions |
|
|
|
### **Advanced Visualizations**
|
- **France Geographic Maps**: Department and region-level choropleth maps |
|
- **Multiple Chart Types**: Bar charts, line plots, scatter plots, heatmaps, histograms, box plots |
|
- **Smart Visualization Selection**: AI automatically chooses appropriate chart types |
|
- **High-Quality PNG Output**: Publication-ready visualizations |
|
- **Follow-up Visualizations**: Generate additional charts based on user questions |
|
|
|
### **Comprehensive Reports**
|
- **Professional DOCX Reports**: Complete analysis with embedded visualizations |
|
- **Bilingual Support**: Reports generated in the same language as your query |
|
- **Structured Analysis**: Title page, methodology, findings, and next steps |
|
- **Direct DOCX Generation**: No external dependencies required |
|
- **Report Continuity**: Follow-up analysis references previous report context |
|
|
|
### **Modern Web Interface**
|
- **Real-time Progress Tracking**: Detailed step-by-step progress updates |
|
- **Responsive Design**: Beautiful, modern Gradio interface |
|
- **Quick Start Examples**: Pre-built queries for common use cases |
|
- **Accordion Tips**: Collapsible help section with usage instructions |
|
- **Follow-up Interface**: Dedicated section for asking follow-up questions |
|
- **Visual Feedback**: Progress bars and status indicators |
|
|
|
## Quick Start
|
|
|
### 1. Prerequisites |
|
|
|
- Python 3.10+ (Gradio 5.x and smolagents do not support older versions)
|
- Gemini API key |
|
|
|
### 2. Installation |
|
|
|
```bash |
|
# Clone the repository |
|
git clone <repository-url> |
|
cd datagouv-french-data-analyst |
|
|
|
# Install dependencies |
|
pip install -r requirements.txt |
|
``` |
|
|
|
### 3. Environment Setup |
|
|
|
Create a `.env` file in the project root: |
|
|
|
```bash |
|
GEMINI_API_KEY=your_Gemini_api_key_here |
|
``` |
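
To fail fast when the key is missing, a startup check along these lines may help. This is a sketch under assumptions: it presumes the application loads `.env` into the process environment (for example with python-dotenv's `load_dotenv()`), and the helper name `require_gemini_key` is hypothetical.

```python
import os


def require_gemini_key(env=None):
    """Return the Gemini API key, or raise a clear error if it is missing.

    `env` defaults to the process environment; passing a dict makes the
    check easy to exercise in isolation.
    """
    env = os.environ if env is None else env
    key = (env.get("GEMINI_API_KEY") or "").strip()
    # Treat the placeholder from the example .env file as "not configured".
    if not key or key == "your_Gemini_api_key_here":
        raise RuntimeError("GEMINI_API_KEY is not set; add it to your .env file")
    return key
```

Calling `require_gemini_key()` once before launching the interface surfaces a missing or placeholder key immediately instead of partway through an analysis.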
|
|
|
### 4. Launch the Application |
|
|
|
**Option 1: Using the launch script (Recommended)** |
|
```bash |
|
python launch_gradio.py |
|
``` |
|
|
|
**Option 2: Direct launch** |
|
```bash |
|
python app.py |
|
``` |
|
|
|
The interface will be available at: |
|
- **Local**: http://localhost:7860 |
|
- **Public**: Shareable URL provided automatically |
|
|
|
## How to Use
|
|
|
### Basic Analysis Workflow |
|
|
|
1. **Enter Your Query**: Type any search term related to French public data |
|
- Examples: "road traffic accidents", "education directory", "housing data" |
|
- Supports both French and English queries |
|
|
|
2. **Or Use Quick Examples**: Click any of the pre-built example queries: |
|
- Road Traffic Accidents 2023
- Education Directory
- French Vacant Housing Private Park
|
|
|
3. **Or Go Random**: Leave the query empty to randomly select a high-quality dataset |
|
|
|
4. **Click "Analyze Dataset"**: The AI agent begins processing, which typically takes 7-15 minutes
|
|
|
### Follow-up Analysis Workflow |
|
|
|
After the initial analysis is complete: |
|
|
|
1. **Follow-up Section Appears**: Located below the generated visualizations |
|
2. **Ask Follow-up Questions**: Use the dedicated input field to ask questions about the report |
|
3. **Use Example Questions**: Click pre-built examples like: |
|
- Correlation Analysis
- Statistical Summary
- Filter & Analyze
- Dataset Overview
- Trend Analysis
- Custom Visualization
|
|
|
4. **Get Detailed Answers**: Receive both text explanations and new visualizations |
|
|
|
### Results |
|
|
|
- **Download DOCX Report**: Complete analysis with all visualizations |
|
- **View Individual Charts**: Up to 4 visualizations displayed in the interface |
|
- **Dataset Reference**: Direct link to the original data.gouv.fr page |
|
- **Follow-up Visualizations**: Additional charts generated from follow-up questions |
|
|
|
## Technical Architecture
|
|
|
### Core Components |
|
|
|
```
Project Structure
├── app.py                   # Main Gradio interface with progress tracking
├── launch_gradio.py         # Simplified launch script
├── agent.py                 # SmolAgents configuration and prompt generation
├── followup_agent.py        # Follow-up analysis agent
├── tools/                   # Custom agent tools
│   ├── webpage_tools.py     # Web scraping and data extraction
│   ├── exploration_tools.py # Dataset analysis and description
│   ├── drawing_tools.py     # France map generation and visualization
│   ├── libreoffice_tools.py # Document utilities (legacy)
│   ├── followup_tools.py    # Follow-up analysis tools
│   └── retrieval_tools.py   # Dataset search and retrieval
├── filtered_dataset.csv     # Pre-processed dataset index (5,000+ datasets)
├── france_data/             # Geographic data for France maps
└── generated_data/          # Output folder for reports and visualizations
```
|
|
|
### Key Technologies |
|
|
|
- **Frontend**: Gradio with custom CSS and real-time progress |
|
- **AI Agents**: |
|
- Primary SmolAgents agent powered by Gemini
- Specialized follow-up agent for interactive analysis
|
- **Search**: BM25 keyword matching with TF-IDF preprocessing |
|
- **Translation**: LLM-powered bilingual query translation |
|
- **Visualization**: Matplotlib, Geopandas, Seaborn |
|
- **Report Generation**: python-docx for DOCX documents |
|
- **Data Processing**: Pandas, NumPy, Shapely, Scipy |
|
- **Follow-up Analytics**: Statistical analysis, correlation studies, custom filtering
|
|
|
### Smart Features |
|
|
|
#### Enhanced BM25 Search |
|
- Pre-computed search indices for 5,000+ datasets |
|
- Accent-insensitive keyword matching |
|
- Plural form normalization |
|
- Quality-score weighted ranking |
|
- Dynamic dataset retrieval during analysis
|
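
As an illustration of how accent-insensitive matching, plural normalization, and quality weighting can combine, here is a dependency-free sketch. The project itself lists `rank_bm25` and `unidecode` among its dependencies, so this standard-library version is a stand-in rather than the actual implementation. The names `normalize` and `bm25_rank`, the naive plural rule, and the idea of multiplying the score by a quality weight are all assumptions.

```python
import math
import unicodedata
from collections import Counter


def normalize(text):
    """Lowercase, strip accents, and drop a trailing plural 's' or 'x'."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    plain = "".join(c for c in decomposed if not unicodedata.combining(c))
    words = "".join(c if c.isalnum() else " " for c in plain).split()
    # Naive plural rule: it also trims words such as "paris" or "prix".
    return [w[:-1] if len(w) > 3 and w[-1] in "sx" else w for w in words]


def bm25_rank(query, docs, quality, k1=1.5, b=0.75):
    """Return document indices sorted by BM25 score times quality weight."""
    tokenized = [normalize(doc) for doc in docs]
    avg_len = sum(len(tokens) for tokens in tokenized) / len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    query_terms = normalize(query)
    scores = []
    for tokens, weight in zip(tokenized, quality):
        counts = Counter(tokens)
        score = 0.0
        for term in query_terms:
            tf = counts[term]
            if tf == 0:
                continue
            # Non-negative IDF variant, so common terms never subtract score.
            idf = math.log(1 + (len(docs) - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(tokens) / avg_len))
        scores.append(score * weight)
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


docs = [
    "Accidents corporels de la circulation routière",
    "Annuaire de l'éducation",
    "Logements vacants du parc privé",
]
best = bm25_rank("accident routiere", docs, quality=[0.9, 0.8, 0.7])[0]
print(docs[best])
```

The query is written without accents and in the singular, yet it still matches the first title because both sides pass through the same `normalize` step.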
|
|
#### Follow-up Analysis System |
|
- **Dataset Continuity**: Automatically loads previous analysis dataset |
|
- **Context Awareness**: References previous report findings |
|
- **Multi-modal Analysis**: Combines statistical analysis with visualizations |
|
- **Tool Integration**: 8+ specialized follow-up tools including: |
|
- `load_previous_dataset()` - Load analysis dataset |
|
- `get_dataset_summary()` - Comprehensive dataset overview |
|
- `create_followup_visualization()` - Generate custom charts |
|
- `analyze_column_correlation()` - Statistical correlation analysis |
|
- `create_statistical_summary()` - Advanced statistical reports |
|
- `filter_and_visualize_data()` - Targeted data filtering and visualization |
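
To make the correlation step concrete, the sketch below computes a Pearson coefficient with the standard library only. The real `analyze_column_correlation()` tool probably relies on pandas or SciPy and its signature is not documented here, so `pearson_correlation` is a hypothetical stand-in, and skipping rows with missing values is an assumption.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation over rows where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    if len(pairs) < 2:
        raise ValueError("need at least two complete rows")
    mean_x = sum(x for x, _ in pairs) / len(pairs)
    mean_y = sum(y for _, y in pairs) / len(pairs)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant column has no defined correlation")
    return cov / math.sqrt(var_x * var_y)


# Made-up columns; the row with a missing value is ignored.
column_a = [1, 2, 3, None, 4]
column_b = [2, 4, 6, 100, 8]
print(round(pearson_correlation(column_a, column_b), 3))
```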
|
|
|
#### LLM Translation |
|
- Automatic French ↔ English translation
|
- Query language detection |
|
- Bilingual result matching |
|
- Context-aware translations |
|
|
|
#### Progress System |
|
- Thread-safe progress tracking |
|
- Queue-based status updates |
|
- Step-by-step visual feedback |
|
- Non-blocking UI execution |
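
A queue-based progress pattern of this kind can be sketched as follows. The function name `run_with_progress`, the `(fraction, label)` update format, and the `None` sentinel are assumptions for illustration. The actual `app.py` may be organised differently.

```python
import queue
import threading


def run_with_progress(task, steps):
    """Run `task(step)` for each step in a worker thread and yield progress.

    Yields (fraction_done, step_label) tuples as the worker completes steps,
    so the caller can update a UI without blocking on the whole job.
    """
    updates = queue.Queue()  # queue.Queue is safe to share between threads

    def worker():
        try:
            for index, step in enumerate(steps, start=1):
                task(step)
                updates.put((index / len(steps), step))
        finally:
            updates.put(None)  # sentinel: tells the consumer to stop waiting

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = updates.get()
        if item is None:
            break
        yield item


for fraction, label in run_with_progress(lambda step: None, ["search", "download", "report"]):
    print(f"{fraction:.0%} {label}")
```

Because the function is a generator, a Gradio event handler written as a generator could forward each yielded update to the interface while the worker keeps running.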
|
|
|
## Troubleshooting
|
|
|
### Common Issues |
|
|
|
1. **"No CSV/JSON files found"** |
|
- The selected dataset doesn't contain processable files |
|
- Try a different query or use the random selection |
|
- Agent will automatically search for alternative datasets |
|
|
|
2. **DOCX report generation fails** |
|
- Ensure python-docx is installed correctly |
|
- Check the console for specific error messages |
|
|
|
3. **Translation errors** |
|
- Verify your API key is valid |
|
- Check API quota and rate limits |
|
|
|
4. **Slow performance** |
|
- BM25 index computation may take time on first run |
|
- Pre-computed indices are cached for faster subsequent searches |
|
|
|
5. **Follow-up analysis errors** |
|
- Ensure the initial analysis completed successfully |
|
- Check that dataset files exist in `generated_data/` folder |
|
- Verify follow-up question is clear and specific |
|
|
|
### Performance Optimization |
|
|
|
- **Pre-compute BM25**: Run the search once to generate `bm25_data.pkl` |
|
- **Use SSD storage**: Faster file I/O for large datasets |
|
- **Monitor API usage**: Both query translation and agent execution consume API calls, so keep an eye on quota and rate limits
|
- **Clean generated_data**: Remove old files to improve follow-up performance |
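
Cleaning the output folder can be scripted. The helper below is a minimal sketch: the name `remove_old_files` and the seven-day default are assumptions, and it only touches regular files directly inside the folder.

```python
import time
from pathlib import Path


def remove_old_files(folder, max_age_days=7, now=None):
    """Delete regular files in `folder` older than `max_age_days`.

    Returns the names of the deleted files. Subfolders are left untouched.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = []
    root = Path(folder)
    if not root.is_dir():
        return removed
    for path in sorted(root.iterdir()):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path.name)
    return removed


if __name__ == "__main__":
    print(remove_old_files("generated_data", max_age_days=7))
```

Run it between analyses rather than during one, since follow-up questions depend on the files left by the most recent run.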
|
|
|
## Dataset Coverage
|
|
|
- **5,000+ Datasets**: Pre-filtered French government datasets |
|
- **Data Sources**: data.gouv.fr, INSEE, regional authorities |
|
- **File Formats**: CSV, JSON, Excel, XML |
|
- **Topics**: All major sectors of French public administration |
|
- **Quality Scores**: Datasets ranked by completeness and usability |
|
- **Real-time Search**: Agent can discover additional datasets during analysis |
|
|
|
## Advanced Usage
|
|
|
### Follow-up Analysis Examples |
|
|
|
**Correlation Analysis:** |
|
``` |
|
Show me the correlation between two numerical columns with a scatter plot |
|
``` |
|
|
|
**Statistical Summary:** |
|
``` |
|
Create a comprehensive statistical summary with visualization for unemployment rates |
|
``` |
|
|
|
**Custom Filtering:** |
|
``` |
|
Filter accidents data by night time conditions and create a visualization |
|
``` |
|
|
|
**Trend Analysis:** |
|
``` |
|
Create a line chart showing accident trends over the months |
|
``` |
|
|
|
### Custom Tool Development |
|
Add new tools to the `tools/` directory following the SmolAgents tool pattern. |
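
As a rough illustration of that pattern, smolagents exposes a `tool` decorator that turns a typed, documented function into an agent tool. The example tool `count_csv_rows` is hypothetical and not part of this project. The `ImportError` fallback exists only so the sketch still runs where smolagents is not installed.

```python
try:
    from smolagents import tool
except ImportError:  # fallback so the sketch runs without smolagents
    def tool(func):
        return func


@tool
def count_csv_rows(file_path: str) -> int:
    """Count the data rows of a CSV file, excluding the header line.

    Args:
        file_path: Path to a local CSV file.
    """
    import csv  # imported inside the function to keep the tool self-contained

    with open(file_path, newline="", encoding="utf-8") as handle:
        rows = list(csv.reader(handle))
    return max(len(rows) - 1, 0)
```

The type hints and the `Args:` section of the docstring matter: smolagents uses them to describe the tool's inputs to the model. How a new tool gets registered with the agents in `agent.py` is not covered here.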
|
|
|
### BM25 Index Optimization |
|
Regenerate search indices with: |
|
```bash
# Run once to create the optimized search index
python -c "from app import initialize_models; initialize_models()"
```
|
|
|
### Batch Processing |
|
Process multiple datasets programmatically using the agent directly. |
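
One way to structure such a batch run is sketched below. The agent's programmatic entry point is not documented in this README, so `analyze` is a hypothetical callable that you would wrap around the agent yourself. The helper only handles iteration and error collection.

```python
def run_batch(queries, analyze):
    """Run `analyze(query)` for each query, continuing past failures.

    Returns (results, failures): results maps each successful query to the
    value returned by `analyze`; failures maps each failed query to its
    error message.
    """
    results, failures = {}, {}
    for query in queries:
        try:
            results[query] = analyze(query)
        except Exception as error:  # keep going so one bad dataset cannot stop the batch
            failures[query] = str(error)
    return results, failures


def fake_analyze(query):
    """Stand-in for a real agent call, used only to demonstrate the helper."""
    if not query:
        raise ValueError("empty query")
    return f"report for {query}"


results, failures = run_batch(["housing data", "", "education directory"], fake_analyze)
print(results)
print(failures)
```

Given the 7-15 minute duration of a single analysis, running queries sequentially like this also keeps API usage predictable.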
|
|
|
## Dependencies
|
|
|
The project requires the following Python packages (see `requirements.txt`): |
|
|
|
``` |
|
pandas, shapely, geopandas, numpy, rtree, pyproj |
|
matplotlib, requests, duckduckgo-search |
|
smolagents[toolkit], smolagents[litellm] |
|
dotenv, beautifulsoup4, reportlab>=3.6.0 |
|
scikit-learn, gradio, python-docx |
|
scipy, openpyxl, unidecode, rank_bm25 |
|
``` |
|
|
|
## License
|
|
|
This project was developed for the Gradio MCP x Agents Hackathon and is released under the MIT license, as declared in the front matter above. See individual tool licenses for third-party components.
|
|
|
## Contributing
|
|
|
1. Fork the repository |
|
2. Create a feature branch |
|
3. Add your improvements |
|
4. Submit a pull request |
|
|
|
--- |
|
|
|
**Ready to explore French public data with AI? Launch the interface and start analyzing!**
|
|
|
**NEW: Try the follow-up analysis feature to dive deeper into your reports!**
|
|