axel-darmouni committed on
Commit f584ef2 · 1 Parent(s): 2508004

all modifs
README.md CHANGED
@@ -8,44 +8,58 @@ sdk_version: 5.33.0
 app_file: app.py
 pinned: false
 license: mit
- short_description: Agents for data analysis of French public data.
 ---
 
 # 🤖 French Public Data Analysis Agent
 
- **AI-powered intelligent analysis of French government datasets** with automated visualization generation and comprehensive PDF reports.
 
 ## ✨ Features
 
 ### 🔍 **Intelligent Dataset Discovery**
 - **BM25 Keyword Search**: Advanced keyword matching with pre-computed search indices
 - **Bilingual Query Translation**: Search in French or English - queries are automatically translated using LLM
- - **Quality-Weighted Random Selection**: Leave query empty to randomly select high-quality datasets
 - **Real-time Dataset Matching**: Instant matching against 5,000+ French government datasets
 
 ### 🤖 **Automated AI Analysis**
 - **SmolAgents Integration**: Advanced AI agent with 30+ step planning capability
 - **Custom Tool Suite**: Specialized tools for web scraping, data analysis, and visualization
 - **Multi-step Processing**: Complete pipeline from data discovery to report generation
 - **Error Recovery**: Smart error handling and alternative data source selection
 
 ### 📊 **Advanced Visualizations**
 - **France Geographic Maps**: Department and region-level choropleth maps
- - **Multiple Chart Types**: Bar charts, line plots, scatter plots, heatmaps
 - **Smart Visualization Selection**: AI automatically chooses appropriate chart types
 - **High-Quality PNG Output**: Publication-ready visualizations
 
 ### 📄 **Comprehensive Reports**
 - **Professional PDF Reports**: Complete analysis with embedded visualizations
 - **Bilingual Support**: Reports generated in the same language as your query
 - **Structured Analysis**: Title page, methodology, findings, and next steps
 - **LibreOffice Integration**: Cross-platform PDF generation
 
 ### 🎨 **Modern Web Interface**
 - **Real-time Progress Tracking**: Detailed step-by-step progress updates
 - **Responsive Design**: Beautiful, modern Gradio interface
 - **Quick Start Examples**: Pre-built queries for common use cases
 - **Accordion Tips**: Collapsible help section with usage instructions
 
 ## 🚀 Quick Start
 
@@ -60,7 +74,7 @@ short_description: Agents for data analysis of French public data.
 ```bash
 # Clone the repository
 git clone <repository-url>
- cd gradio_hackathon_agent
 
 # Install dependencies
 pip install -r requirements.txt
@@ -76,8 +90,14 @@ GEMINI_API_KEY=your_gemini_api_key_here
 
 ### 4. Launch the Application
 
 ```bash
- python gradio_interface.py
 ```
 
 The interface will be available at:
@@ -86,26 +106,43 @@ The interface will be available at:
 
 ## 💡 How to Use
 
- ### Basic Usage
 
 1. **Enter Your Query**: Type any search term related to French public data
    - Examples: "road traffic accidents", "education directory", "housing data"
    - Supports both French and English queries
 
 2. **Or Use Quick Examples**: Click any of the pre-built example queries:
-    - 🚗 Road Traffic Accidents 2005-2023
    - 🎓 Education Directory
    - 🏠 French Vacant Housing Private Park
 
 3. **Or Go Random**: Leave the query empty to randomly select a high-quality dataset
 
- 4. **Click "🚀 Analyze Dataset"**: The AI agent begins processing
 
 ### Results
 
 - **Download PDF Report**: Complete analysis with all visualizations
 - **View Individual Charts**: Up to 4 visualizations displayed in the interface
 - **Dataset Reference**: Direct link to the original data.gouv.fr page
 
 ## 🛠️ Technical Architecture
 
@@ -113,34 +150,55 @@ The interface will be available at:
 
 ```
 📁 Project Structure
- ├── app.py                   # Main Gradio web interface with progress tracking
- ├── agent.py                 # SmolAgents configuration and prompt generation
- ├── tools/                   # Custom agent tools
- │   ├── webpage_tools.py         # Web scraping and data extraction
- │   ├── exploration_tools.py     # Dataset analysis and description
- │   ├── drawing_tools.py         # France map generation and visualization
- │   └── libreoffice_tools.py     # PDF conversion utilities
- ├── filtered_dataset.csv     # Pre-processed dataset index (5,000+ datasets)
- └── france_data/             # Geographic data for France maps
 ```
 
 ### Key Technologies
 
 - **Frontend**: Gradio with custom CSS and real-time progress
- - **AI Agent**: SmolAgents powered by an MLLM
 - **Search**: BM25 keyword matching with TF-IDF preprocessing
 - **Translation**: LLM-powered bilingual query translation
 - **Visualization**: Matplotlib, Geopandas, Seaborn
 - **PDF Generation**: python-docx + LibreOffice conversion
- - **Data Processing**: Pandas, NumPy, Shapely
 
 ### Smart Features
 
- #### BM25 Search Enhancement
 - Pre-computed search indices for 5,000+ datasets
 - Accent-insensitive keyword matching
 - Plural form normalization
 - Quality-score weighted ranking
 
 #### LLM Translation
 - Automatic French ↔ English translation
@@ -161,6 +219,7 @@ The interface will be available at:
 1. **"No CSV/JSON files found"**
    - The selected dataset doesn't contain processable files
    - Try a different query or use the random selection
 
 2. **LibreOffice PDF conversion fails**
    - Ensure LibreOffice is installed and accessible
@@ -174,11 +233,17 @@ The interface will be available at:
 - BM25 index computation may take time on first run
 - Pre-computed indices are cached for faster subsequent searches
 
 ### Performance Optimization
 
 - **Pre-compute BM25**: Run the search once to generate `bm25_data.pkl`
 - **Use SSD storage**: Faster file I/O for large datasets
 - **Monitor API usage**: API calls for translation and agent execution
 
 ## 📊 Dataset Coverage
 
@@ -187,9 +252,32 @@ The interface will be available at:
 - **File Formats**: CSV, JSON, Excel, XML
 - **Topics**: All major sectors of French public administration
 - **Quality Scores**: Datasets ranked by completeness and usability
 
 ## 🚀 Advanced Usage
 
 ### Custom Tool Development
 Add new tools to the `tools/` directory following the SmolAgents tool pattern.
 
@@ -203,6 +291,19 @@ python -c "from app import initialize_models; initialize_models()"
 ### Batch Processing
 Process multiple datasets programmatically using the agent directly.
 
 ## 📄 License
 
 This project is developed for the Gradio MCP x Agents Hackathon. See individual tool licenses for third-party components.
@@ -217,3 +318,5 @@ This project is developed for the Gradio MCP x Agents Hackathon. See individual
 ---
 
 **🎉 Ready to explore French public data with AI? Launch the interface and start analyzing!**
 app_file: app.py
 pinned: false
 license: mit
+ short_description: AI-powered agents for comprehensive analysis of French public data with follow-up capabilities.
 ---
 
 # 🤖 French Public Data Analysis Agent
 
+ **AI-powered intelligent analysis of French public datasets** with automated visualization generation, comprehensive PDF reports, and **interactive follow-up analysis capabilities**.
 
 ## ✨ Features
 
 ### 🔍 **Intelligent Dataset Discovery**
 - **BM25 Keyword Search**: Advanced keyword matching with pre-computed search indices
 - **Bilingual Query Translation**: Search in French or English - queries are automatically translated using LLM
+ - **Quality-Weighted Random Selection**: Leave query empty to randomly select high-quality datasets
 - **Real-time Dataset Matching**: Instant matching against 5,000+ French government datasets
+ - **Dynamic Dataset Search**: Agent can search for alternative datasets if initial results aren't suitable
 
 ### 🤖 **Automated AI Analysis**
 - **SmolAgents Integration**: Advanced AI agent with 30+ step planning capability
 - **Custom Tool Suite**: Specialized tools for web scraping, data analysis, and visualization
 - **Multi-step Processing**: Complete pipeline from data discovery to report generation
 - **Error Recovery**: Smart error handling and alternative data source selection
+ - **Autonomous Decision Making**: Agent can choose from provided results or find better alternatives
+
+ ### 🎯 **Interactive Follow-up Analysis** ⭐ NEW
+ - **Dedicated Follow-up Agent**: Specialized AI for answering questions about generated reports
+ - **Dataset Continuity**: Automatically loads and analyzes the same dataset from previous report
+ - **Advanced Analytics**: Correlation analysis, statistical summaries, custom filtering
+ - **Interactive Visualizations**: Create new charts and graphs based on follow-up questions
+ - **Multiple Analysis Types**: Support for bar charts, scatter plots, histograms, box plots, and more
+ - **Example-Driven Interface**: Quick-start examples for common follow-up questions
 
 ### 📊 **Advanced Visualizations**
 - **France Geographic Maps**: Department and region-level choropleth maps
+ - **Multiple Chart Types**: Bar charts, line plots, scatter plots, heatmaps, histograms, box plots
 - **Smart Visualization Selection**: AI automatically chooses appropriate chart types
 - **High-Quality PNG Output**: Publication-ready visualizations
+ - **Follow-up Visualizations**: Generate additional charts based on user questions
 
 ### 📄 **Comprehensive Reports**
 - **Professional PDF Reports**: Complete analysis with embedded visualizations
 - **Bilingual Support**: Reports generated in the same language as your query
 - **Structured Analysis**: Title page, methodology, findings, and next steps
 - **LibreOffice Integration**: Cross-platform PDF generation
+ - **Report Continuity**: Follow-up analysis references previous report context
 
 ### 🎨 **Modern Web Interface**
 - **Real-time Progress Tracking**: Detailed step-by-step progress updates
 - **Responsive Design**: Beautiful, modern Gradio interface
 - **Quick Start Examples**: Pre-built queries for common use cases
 - **Accordion Tips**: Collapsible help section with usage instructions
+ - **Follow-up Interface**: Dedicated section for asking follow-up questions
+ - **Visual Feedback**: Progress bars and status indicators
 
 ## 🚀 Quick Start
 
 ```bash
 # Clone the repository
 git clone <repository-url>
+ cd datagouv-french-data-analyst
 
 # Install dependencies
 pip install -r requirements.txt
 
 ### 4. Launch the Application
 
+ **Option 1: Using the launch script (Recommended)**
+ ```bash
+ python launch_gradio.py
+ ```
+
+ **Option 2: Direct launch**
 ```bash
+ python app.py
 ```
 
 The interface will be available at:
 
 ## 💡 How to Use
 
+ ### Basic Analysis Workflow
 
 1. **Enter Your Query**: Type any search term related to French public data
    - Examples: "road traffic accidents", "education directory", "housing data"
    - Supports both French and English queries
 
 2. **Or Use Quick Examples**: Click any of the pre-built example queries:
+    - 🚗 Road Traffic Accidents 2023
    - 🎓 Education Directory
    - 🏠 French Vacant Housing Private Park
 
 3. **Or Go Random**: Leave the query empty to randomly select a high-quality dataset
 
+ 4. **Click "🚀 Analyze Dataset"**: The AI agent begins processing (7-15 minutes)
+
+ ### Follow-up Analysis Workflow
+
+ After the initial analysis is complete:
+
+ 1. **Follow-up Section Appears**: Located below the generated visualizations
+ 2. **Ask Follow-up Questions**: Use the dedicated input field to ask questions about the report
+ 3. **Use Example Questions**: Click pre-built examples like:
+    - 📊 Correlation Analysis
+    - 📈 Statistical Summary
+    - 🎯 Filter & Analyze
+    - 📋 Dataset Overview
+    - 📉 Trend Analysis
+    - 🔍 Custom Visualization
+
+ 4. **Get Detailed Answers**: Receive both text explanations and new visualizations
 
 ### Results
 
 - **Download PDF Report**: Complete analysis with all visualizations
 - **View Individual Charts**: Up to 4 visualizations displayed in the interface
 - **Dataset Reference**: Direct link to the original data.gouv.fr page
+ - **Follow-up Visualizations**: Additional charts generated from follow-up questions
 
 ## 🛠️ Technical Architecture
 
 ```
 📁 Project Structure
+ ├── app.py                  # Main Gradio interface with progress tracking
+ ├── launch_gradio.py        # Simplified launch script
+ ├── agent.py                # SmolAgents configuration and prompt generation
+ ├── followup_agent.py       # Follow-up analysis agent
+ ├── tools/                  # Custom agent tools
+ │   ├── webpage_tools.py        # Web scraping and data extraction
+ │   ├── exploration_tools.py    # Dataset analysis and description
+ │   ├── drawing_tools.py        # France map generation and visualization
+ │   ├── libreoffice_tools.py    # PDF conversion utilities
+ │   ├── followup_tools.py       # Follow-up analysis tools
+ │   └── retrieval_tools.py      # Dataset search and retrieval
+ ├── filtered_dataset.csv    # Pre-processed dataset index (5,000+ datasets)
+ ├── france_data/            # Geographic data for France maps
+ └── generated_data/         # Output folder for reports and visualizations
 ```
 
 ### Key Technologies
 
 - **Frontend**: Gradio with custom CSS and real-time progress
+ - **AI Agents**:
+   - Primary SmolAgents powered by Gemini 2.5 Flash
+   - Specialized follow-up agent for interactive analysis ⭐
 - **Search**: BM25 keyword matching with TF-IDF preprocessing
 - **Translation**: LLM-powered bilingual query translation
 - **Visualization**: Matplotlib, Geopandas, Seaborn
 - **PDF Generation**: python-docx + LibreOffice conversion
+ - **Data Processing**: Pandas, NumPy, Shapely, Scipy
+ - **Follow-up Analytics**: Statistical analysis, correlation studies, custom filtering ⭐
 
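For reference, the model wiring behind these agents (as set up in `agent.py` and `followup_agent.py`) looks roughly like the sketch below; the exact tool lists, step limits, and planning intervals live in those files.

```python
# Sketch of the shared model/agent wiring (see agent.py and followup_agent.py
# for the full tool lists and settings).
import os
from smolagents import CodeAgent, DuckDuckGoSearchTool, LiteLLMModel

model = LiteLLMModel(
    model_id="gemini/gemini-2.5-flash-preview-05-20",
    api_key=os.getenv("GEMINI_API_KEY"),  # loaded from .env by the app
)

agent = CodeAgent(
    tools=[DuckDuckGoSearchTool()],  # the real agents add the custom tools/ suite here
    model=model,
    max_steps=30,
    additional_authorized_imports=["pandas", "numpy", "matplotlib.pyplot", "seaborn"],
)
```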
 ### Smart Features
 
+ #### Enhanced BM25 Search
 - Pre-computed search indices for 5,000+ datasets
 - Accent-insensitive keyword matching
 - Plural form normalization
 - Quality-score weighted ranking
+ - Dynamic dataset retrieval during analysis ⭐
+
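A minimal sketch of the search idea, assuming the pre-processed index in `filtered_dataset.csv` exposes `title` and `url` columns; the real implementation in `app.py` and `tools/retrieval_tools.py` additionally normalizes plurals, weights results by quality score, and caches the index.

```python
# Simplified sketch of BM25 matching over the dataset index; the production code
# also normalizes plurals, applies quality-score weighting, and caches the index.
import pandas as pd
from rank_bm25 import BM25Okapi
from unidecode import unidecode

def tokenize(text: str) -> list[str]:
    # accent-insensitive, lower-cased tokens
    return unidecode(str(text)).lower().split()

df = pd.read_csv("filtered_dataset.csv")
bm25 = BM25Okapi([tokenize(title) for title in df["title"]])

def search(query: str, top_k: int = 5) -> pd.DataFrame:
    scores = bm25.get_scores(tokenize(query))
    return df.assign(score=scores).nlargest(top_k, "score")[["title", "url", "score"]]
```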
+ #### Follow-up Analysis System
+ - **Dataset Continuity**: Automatically loads previous analysis dataset
+ - **Context Awareness**: References previous report findings
+ - **Multi-modal Analysis**: Combines statistical analysis with visualizations
+ - **Tool Integration**: 8+ specialized follow-up tools including:
+   - `load_previous_dataset()` - Load analysis dataset
+   - `get_dataset_summary()` - Comprehensive dataset overview
+   - `create_followup_visualization()` - Generate custom charts
+   - `analyze_column_correlation()` - Statistical correlation analysis
+   - `create_statistical_summary()` - Advanced statistical reports
+   - `filter_and_visualize_data()` - Targeted data filtering and visualization
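Inside the follow-up agent's generated code these tools are used as plain callables; a typical sequence might look like the sketch below (the signatures come from `tools/followup_tools.py`, while the column name is hypothetical and depends on the loaded dataset).

```python
# Illustration of a follow-up step the agent might execute; tool signatures are
# defined in tools/followup_tools.py, the column name here is hypothetical.
from tools.followup_tools import (
    load_previous_dataset,
    get_dataset_summary,
    create_followup_visualization,
)

df = load_previous_dataset()            # reload the dataset saved by the first run
print(get_dataset_summary(df))          # columns, dtypes, null counts, sample rows

chart_path = create_followup_visualization(
    df,
    chart_type="bar",                   # 'bar', 'line', 'scatter', 'histogram', 'box', 'pie'
    x_column="departement",             # hypothetical column, depends on the dataset
    title="Follow-up Analysis",
    filename="followup_chart.png",      # written under generated_data/
)
```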
 
 #### LLM Translation
 - Automatic French ↔ English translation
 
 1. **"No CSV/JSON files found"**
    - The selected dataset doesn't contain processable files
    - Try a different query or use the random selection
+    - Agent will automatically search for alternative datasets
 
 2. **LibreOffice PDF conversion fails**
    - Ensure LibreOffice is installed and accessible
 
 - BM25 index computation may take time on first run
 - Pre-computed indices are cached for faster subsequent searches
 
+ 5. **Follow-up analysis errors**
+    - Ensure the initial analysis completed successfully
+    - Check that dataset files exist in `generated_data/` folder
+    - Verify follow-up question is clear and specific
+
 ### Performance Optimization
 
 - **Pre-compute BM25**: Run the search once to generate `bm25_data.pkl`
 - **Use SSD storage**: Faster file I/O for large datasets
 - **Monitor API usage**: API calls for translation and agent execution
+ - **Clean generated_data**: Remove old files to improve follow-up performance
 
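One plausible way to pre-compute and cache the index is sketched below; the app produces `bm25_data.pkl` automatically on first run, and the exact cache layout may differ.

```python
# Sketch of pre-computing the BM25 index once and caching it with pickle;
# the contents of the real bm25_data.pkl may differ.
import pickle
import pandas as pd
from rank_bm25 import BM25Okapi
from unidecode import unidecode

df = pd.read_csv("filtered_dataset.csv")
tokenized = [unidecode(str(title)).lower().split() for title in df["title"]]

with open("bm25_data.pkl", "wb") as f:
    pickle.dump({"bm25": BM25Okapi(tokenized), "tokenized": tokenized}, f)

# Later runs can load the cached index instead of recomputing it:
with open("bm25_data.pkl", "rb") as f:
    cache = pickle.load(f)
```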
 ## 📊 Dataset Coverage
 
 - **File Formats**: CSV, JSON, Excel, XML
 - **Topics**: All major sectors of French public administration
 - **Quality Scores**: Datasets ranked by completeness and usability
+ - **Real-time Search**: Agent can discover additional datasets during analysis
 
 ## 🚀 Advanced Usage
 
+ ### Follow-up Analysis Examples
+
+ **Correlation Analysis:**
+ ```
+ Show me the correlation between two numerical columns with a scatter plot
+ ```
+
+ **Statistical Summary:**
+ ```
+ Create a comprehensive statistical summary with visualization for unemployment rates
+ ```
+
+ **Custom Filtering:**
+ ```
+ Filter accidents data by night time conditions and create a visualization
+ ```
+
+ **Trend Analysis:**
+ ```
+ Create a line chart showing accident trends over the months
+ ```
+
 ### Custom Tool Development
 Add new tools to the `tools/` directory following the SmolAgents tool pattern.
 
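A minimal hypothetical tool following that pattern is shown below (compare the real tools in `tools/followup_tools.py`). The new tool becomes available to the agent once it is added to the `tools=[...]` list in `create_web_agent` (see `agent.py`).

```python
# tools/example_tools.py - hypothetical new tool following the same pattern as
# the existing ones (see tools/followup_tools.py for real examples).
import pandas as pd
from smolagents import tool

@tool
def count_missing_values(df: pd.DataFrame) -> str:
    """
    Count missing values per column of a DataFrame.

    Args:
        df: The pandas DataFrame to inspect

    Returns:
        A formatted string listing the number of missing values per column
    """
    counts = df.isnull().sum()
    return "\n".join(f"{col}: {n}" for col, n in counts.items())
```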
 
 ### Batch Processing
 Process multiple datasets programmatically using the agent directly.
 
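A rough sketch of such a batch driver, assuming a no-op step callback is acceptable for `create_web_agent`:

```python
# Hypothetical batch driver that bypasses the Gradio UI; create_web_agent expects
# a step callback, so a no-op callable is passed here (an assumption).
from agent import create_web_agent, generate_prompt

queries = ["road traffic accidents 2023", "education directory"]
agent = create_web_agent(step_callback=lambda *args, **kwargs: None)

for q in queries:
    prompt = generate_prompt(user_query=q)
    answer = agent.run(prompt)  # reports and PNGs are written to generated_data/
    print(q, "->", answer)
```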
+ ## 📋 Dependencies
+
+ The project requires the following Python packages (see `requirements.txt`):
+
+ ```
+ pandas, shapely, geopandas, numpy, rtree, pyproj
+ matplotlib, requests, duckduckgo-search
+ smolagents[toolkit], smolagents[litellm]
+ dotenv, beautifulsoup4, reportlab>=3.6.0
+ scikit-learn, gradio, pypdf2, python-docx
+ scipy, openpyxl, unidecode, rank_bm25
+ ```
+
+ ## 📄 License
 
 This project is developed for the Gradio MCP x Agents Hackathon. See individual tool licenses for third-party components.
 
 ---
 
 **🎉 Ready to explore French public data with AI? Launch the interface and start analyzing!**
+
+ **🔥 NEW: Try the follow-up analysis feature to dive deeper into your reports!**
agent.py CHANGED
@@ -3,6 +3,7 @@ from tools.webpage_tools import (
3
  visit_webpage,
4
  get_all_links,
5
  read_file_from_url,
 
6
  )
7
  from tools.exploration_tools import (
8
  get_dataset_description,
@@ -13,6 +14,12 @@ from tools.drawing_tools import (
13
  from tools.libreoffice_tools import (
14
  convert_to_pdf_with_libreoffice,
15
  check_libreoffice_availability,
16
  )
17
  from smolagents import (
18
  CodeAgent,
@@ -29,11 +36,12 @@ def create_web_agent(step_callback):
29
  web_agent = CodeAgent(
30
  tools=[
31
  search_tool,
32
- visit_webpage, get_all_links, read_file_from_url,
33
  get_dataset_description,
34
  plot_departments_data,
35
  convert_to_pdf_with_libreoffice,
36
- check_libreoffice_availability
 
37
  ],
38
  model=model,
39
  max_steps=30,
@@ -48,40 +56,82 @@ def create_web_agent(step_callback):
48
  )
49
  return web_agent
50
 
51
- def generate_prompt(data_gouv_page):
52
- return f"""Fetch me a dataset that can be just read by using the read_file_from_url tool
53
- from {data_gouv_page}
54
- Follow the steps below to generate a pdf report from the dataset.
55
 
56
- The steps should be as follows:
57
- 1. Examine the page
58
- 2. Get all links
59
- 3. Get the dataset from the link
60
- 4. Get information about the dataset using the get_dataset_description tool
61
- 5. Decide on what you can draw based on either department or region data
62
- 5.1 if no data department or region level, look for another file!
63
- 6. Draw a map of France using your idea
64
- 7. Save the map in png file
65
- 8. Make as well 3 additional visualizations, not maps, that you can save in png files
66
- 9. Write an interesting analysis text for each of your visualizations. Be smart and think cleverly about the data and what it can state
67
- 10. Think of next step analysis to look at the data
68
- 11. Generate a comprehensive PDF report using the python-docx library that includes:
69
- - A title page with the dataset name and analysis overview
70
- - All your visualizations (PNG files) embedded in the report
71
- - Your analysis text for each visualization
72
- - Conclusions and next steps
73
- Make the visualizations appropriately sized so they fit well in the PDF report.
74
- Convert then that docx file to pdf using the convert_to_pdf_with_libreoffice tool.
75
 
76
- Do not overcommit, just do the steps one by one and it should go fine! Do not, under any circumstance, use the 'os' module!
77
- Do not generate a lot of code every step, go slowly but surely and it will work out. Save everything within the generated_data folder.
78
- If question is in english, report is in english.
79
- If question is in french, report is in french.
80
-
81
- IMPORTANT LIBREOFFICE NOTES:
82
- - If you need to use LibreOffice, first call check_libreoffice_availability() to verify it's available
83
- - If LibreOffice is available, "LibreOffice found" is returned by "check_libreoffice_availability()"
84
- - Use convert_to_pdf_with_libreoffice() tool instead of subprocess calls
85
- - Do NOT use subprocess.run(['libreoffice', ...]) or subprocess.run(['soffice', ...]) directly
86
- - The LibreOffice tools handle macOS, Linux, and Windows path differences automatically
87
- """
 
3
  visit_webpage,
4
  get_all_links,
5
  read_file_from_url,
6
+ save_dataset_for_followup,
7
  )
8
  from tools.exploration_tools import (
9
  get_dataset_description,
 
14
  from tools.libreoffice_tools import (
15
  convert_to_pdf_with_libreoffice,
16
  check_libreoffice_availability,
17
+ get_libreoffice_info,
18
+ )
19
+ from tools.retrieval_tools import (
20
+ search_datasets,
21
+ get_dataset_info,
22
+ get_random_quality_dataset,
23
  )
24
  from smolagents import (
25
  CodeAgent,
 
36
  web_agent = CodeAgent(
37
  tools=[
38
  search_tool,
39
+ visit_webpage, get_all_links, read_file_from_url, save_dataset_for_followup,
40
  get_dataset_description,
41
  plot_departments_data,
42
  convert_to_pdf_with_libreoffice,
43
+ check_libreoffice_availability, get_libreoffice_info,
44
+ search_datasets, get_dataset_info, get_random_quality_dataset
45
  ],
46
  model=model,
47
  max_steps=30,
 
56
  )
57
  return web_agent
58
 
59
+ def generate_prompt(user_query=None, initial_search_results=None):
60
+ """Generate a unified prompt for dataset search and analysis"""
61
+
62
+ base_instructions = """Follow these steps to analyze French public data:
63
+
64
+ 1. **Dataset Selection**:
65
+ - You can use the search_datasets tool to find relevant datasets
66
+ - You can use get_dataset_info to get detailed information about specific datasets
67
+ - You can use get_random_quality_dataset to explore interesting datasets
68
+
69
+ 2. **Dataset Analysis**:
70
+ - Examine the selected dataset page using visit_webpage
71
+ - Get all available data links using get_all_links
72
+ - Download and analyze the dataset using read_file_from_url
73
+ - Save the dataset for follow-up analysis using save_dataset_for_followup
74
+ - Get dataset description using get_dataset_description
75
+
76
+ 3. **Visualization Creation**:
77
+ - If geographic data (departments/regions) is available, create a map of France
78
+ - Create 3 additional non-map visualizations
79
+ - Save all visualizations as PNG files
80
+
81
+ 4. **Report Generation**:
82
+ - Write insightful analysis text for each visualization
83
+ - Generate a comprehensive PDF report using python-docx library that includes:
84
+ * Title page with dataset name and analysis overview
85
+ * All visualizations (PNG files) embedded in the report
86
+ * Analysis text for each visualization
87
+ * Conclusions and next steps
88
+ - Convert the docx file to PDF using convert_to_pdf_with_libreoffice tool
89
+
90
+ **Important Technical Notes:**
91
+ - Save everything in the generated_data folder
92
+ - Do NOT use the 'os' module
93
+ - Work step by step, don't generate too much code at once
94
+ - Before PDF conversion, call check_libreoffice_availability() - it returns True/False
95
+ - If check_libreoffice_availability() returns True, use convert_to_pdf_with_libreoffice() tool
96
+ - If check_libreoffice_availability() returns False, skip PDF conversion and inform user
97
+ - Do NOT use subprocess calls directly for LibreOffice
98
+ - If question is in English, report is in English. If in French, report is in French.
99
+ """
100
+
101
+ if user_query and initial_search_results:
102
+ return f"""I need you to analyze French public datasets related to: "{user_query}"
103
+
104
+ **INITIAL SEARCH RESULTS:**
105
+ {initial_search_results}
106
+
107
+ You have these options:
108
+ 1. **Use one of the datasets from the initial search results above** - select the most relevant one
109
+ 2. **Search for different datasets** using the search_datasets tool if none of the above seem perfect
110
+ 3. **Get more information** about any dataset using get_dataset_info tool
111
+
112
+ {base_instructions}
113
+
114
+ Focus your analysis on insights related to "{user_query}". Choose the most relevant dataset and create meaningful visualizations that answer questions about "{user_query}".
115
+ If user query is not specific, remain generic with respect to the dataset at hand.
116
+ Focus on getting results and analytics; do not go with too much data, we can always improve it later.
117
+ """
118
+
119
+ elif user_query:
120
+ return f"""I need you to find and analyze French public datasets related to: "{user_query}"
121
+
122
+ {base_instructions}
123
+
124
+ Start by using the search_datasets tool to find relevant datasets related to "{user_query}". Focus your analysis on insights related to "{user_query}".
125
+ If user query is not specific, remain generic with respect to the dataset at hand.
126
+ Focus on getting results and analytics; do not go with too much data, we can always improve it later.
127
+ """
128
+
129
+ else:
130
+ return f"""I need you to find and analyze an interesting French public dataset.
131
 
132
+ {base_instructions}
133
 
134
+ Start by using the search_datasets tool to find interesting datasets, or use get_random_quality_dataset to explore a high-quality dataset.
135
+ If user query is not specific, remain generic with respect to the dataset at hand.
136
+ Focus on getting results and analytics; do not go with too much data, we can always improve it later.
137
+ """
app.py CHANGED
@@ -7,10 +7,10 @@ import time
7
  import queue
8
  import numpy as np
9
  from rank_bm25 import BM25Okapi
10
- import re
11
  from dotenv import load_dotenv
12
  from smolagents import CodeAgent, LiteLLMModel
13
  from agent import create_web_agent, generate_prompt
 
14
  from unidecode import unidecode
15
 
16
  load_dotenv()
@@ -302,30 +302,9 @@ def run_agent_analysis_with_progress(query, progress_callback, df=None, page_url
302
 
303
  def search_and_analyze(query, progress=gr.Progress()):
304
  """
305
- Main function called when search button is clicked.
306
  Uses Gradio's progress bar for visual feedback.
307
  """
308
- # Read the filtered dataset first
309
- df = pd.read_csv('filtered_dataset.csv')
310
-
311
- # If no query provided, randomly select one weighted by quality score
312
- if not query.strip():
313
- progress(0, desc="🎲 No query provided - selecting random high-quality dataset...")
314
-
315
- # Use quality_score as weights for random selection
316
- if 'quality_score' in df.columns:
317
- # Ensure quality scores are positive for weighting
318
- weights = df['quality_score'].fillna(0)
319
- weights = weights - weights.min() + 0.1 # Shift to make all positive
320
- else:
321
- weights = None
322
-
323
- # Randomly sample one dataset weighted by quality
324
- selected_row = df.sample(n=1, weights=weights).iloc[0]
325
- query = selected_row['title']
326
-
327
- progress(0.02, f"🎯 Random selection: {query[:60]}...")
328
-
329
  # Clear the progress queue
330
  while not progress_queue.empty():
331
  try:
@@ -336,10 +315,10 @@ def search_and_analyze(query, progress=gr.Progress()):
336
  # Initialize outputs
337
  pdf_file = None
338
  images_output = [gr.Image(visible=False)] * 4
339
- status = "πŸš€ Starting analysis..."
340
 
341
  # Initial progress
342
- progress(0.05, desc="πŸš€ Initializing...")
343
 
344
  def progress_callback(progress_val, description):
345
  """Callback function to update progress - puts updates in queue"""
@@ -351,40 +330,76 @@ def search_and_analyze(query, progress=gr.Progress()):
351
  # Run analysis in a separate thread
352
  result_queue = queue.Queue()
353
 
354
- # Store the page URL to show immediately (kept for compatibility)
355
- page_url_to_show = None
356
-
357
- def page_url_callback(url):
358
- nonlocal page_url_to_show
359
- page_url_to_show = url
360
-
361
- # Find and show the page URL immediately FIRST
362
- initialize_models()
363
- progress(0.06, desc="πŸ” Finding relevant dataset...")
364
- most_similar_idx, similarity_score, translated_query, original_lang = find_similar_dataset_bm25(query, df)
365
- data_gouv_page = df.iloc[most_similar_idx]['url']
366
- dataset_title = df.iloc[most_similar_idx]['title']
367
-
368
- progress(0.07, desc=f"πŸ“‹ Found dataset: {dataset_title[:50]}...")
369
-
370
- # Now start the analysis thread with the found dataset info
371
  def run_analysis():
372
  try:
373
- # Pass the already found dataset info to the analysis function
374
- result = run_agent_analysis_with_progress(query, progress_callback, df, page_url_callback, data_gouv_page, most_similar_idx)
375
- result_queue.put(result)
376
  except Exception as e:
377
- result_queue.put((f"Error: {str(e)}", [], data_gouv_page))
 
378
 
379
  analysis_thread = threading.Thread(target=run_analysis)
380
  analysis_thread.start()
381
 
382
- # Show page URL immediately by returning current state
383
- current_page_display = gr.Textbox(value=data_gouv_page, visible=True)
384
- current_status = "πŸ”— Page found - starting analysis..."
385
-
386
- # Initial update to show the page URL immediately
387
- progress(0.08, desc="πŸ”— Page found - starting analysis...")
388
 
389
  # Monitor progress while analysis runs
390
  last_progress = 0.08
@@ -408,11 +423,18 @@ def search_and_analyze(query, progress=gr.Progress()):
408
  # Check if this is a "no data" case
409
  if "❌ No CSV/JSON files found" in final_status:
410
  progress(1.0, desc="❌ No processable data found")
411
- return (gr.Textbox(value=page_url if page_url else data_gouv_page, visible=True),
412
  final_status,
413
  gr.File(visible=False),
414
  gr.Image(visible=False), gr.Image(visible=False),
415
- gr.Image(visible=False), gr.Image(visible=False))
416
 
417
  # Final progress update
418
  progress(1.0, desc="βœ… Processing results...")
@@ -441,7 +463,16 @@ def search_and_analyze(query, progress=gr.Progress()):
441
  # final progress completion
442
  progress(1.0, desc="πŸŽ‰ Complete!")
443
 
444
- return gr.Textbox(value=page_url if page_url else data_gouv_page, visible=True), final_status, download_button, *images
445
 
446
  except queue.Empty:
447
  pass
@@ -450,14 +481,86 @@ def search_and_analyze(query, progress=gr.Progress()):
450
 
451
  except Exception as e:
452
  progress(1.0, desc=f"❌ Error: {str(e)}")
453
- return gr.Textbox(value=data_gouv_page, visible=True), f"❌ Error: {str(e)}", None, *images_output
454
 
455
  # Ensure thread completes
456
  analysis_thread.join(timeout=1)
457
 
458
  # Fallback return
459
  progress(1.0, desc="🏁 Finished")
460
- return gr.Textbox(value=data_gouv_page, visible=True), current_status, pdf_file, *images_output
 
461
 
462
  # Create the Gradio interface
463
  with gr.Blocks(title="πŸ€– French Public Data Analysis Agent", theme=gr.themes.Soft(), css="""
@@ -516,7 +619,10 @@ with gr.Blocks(title="πŸ€– French Public Data Analysis Agent", theme=gr.themes.S
516
  gr.HTML("""
517
  <div style="text-align: center; background: #f8fafc; padding: 1.5rem; border-radius: 10px; margin: 1rem 0;">
518
  <p style="font-size: 1.1rem; color: #374151; margin: 0;">
519
- 🌐 <strong>Search in French or English</strong> β€’ πŸ—ΊοΈ <strong>Generate Reports with visualizations from the data</strong>
 
 
 
520
  </p>
521
  </div>
522
  """)
@@ -527,18 +633,21 @@ with gr.Blocks(title="πŸ€– French Public Data Analysis Agent", theme=gr.themes.S
527
  with gr.Column():
528
  gr.Markdown("""
529
  🎯 **How to Use:**
530
- - Enter any search term related to French public data
531
- - Leave empty to randomly select a high-quality dataset
 
 
532
  - Results include visualizations and downloadable reports
533
 
534
  ⏱️ **Processing Time:**
535
- - Report generation takes 5-10 minutes depending on dataset complexity
536
- - Larger datasets may require additional processing time
537
  """)
538
  with gr.Column():
539
  gr.Markdown("""
540
  ⚠️ **Important Notes:**
541
- - Still a work in progress, might be better to start with the example queries
 
542
  - Some datasets may not contain processable CSV/JSON files
543
  - All visualizations are automatically generated
544
  - Maps focus on France when geographic data is available
@@ -571,7 +680,7 @@ with gr.Blocks(title="πŸ€– French Public Data Analysis Agent", theme=gr.themes.S
571
 
572
  with gr.Row():
573
  examples = [
574
- ("πŸš— Road Traffic Accidents 2005 - 2023", "road traffic accidents 2005 - 2023"),
575
  ("πŸŽ“ Education Directory", "education directory"),
576
  ("🏠 French Vacant Housing Private Park", "French vacant housing private park"),
577
  ]
@@ -615,14 +724,90 @@ with gr.Blocks(title="πŸ€– French Public Data Analysis Agent", theme=gr.themes.S
615
  image3 = gr.Image(label="πŸ—ΊοΈ Map/Chart 3", visible=False, height=400)
616
  image4 = gr.Image(label="πŸ“‰ Chart 4", visible=False, height=400)
617
 
618
  # Set up the search button click event with progress bar
619
  search_button.click(
620
  fn=search_and_analyze,
621
  inputs=[query_input],
622
- outputs=[page_url_display, status_output, download_button, image1, image2, image3, image4],
 
 
623
  show_progress="full" # Show the built-in progress bar
624
  )
625
 
626
 
627
 
628
  if __name__ == "__main__":
@@ -631,5 +816,5 @@ if __name__ == "__main__":
631
  share=True,
632
  server_name="0.0.0.0",
633
  server_port=7860,
634
- show_error=True
635
  )
 
7
  import queue
8
  import numpy as np
9
  from rank_bm25 import BM25Okapi
 
10
  from dotenv import load_dotenv
11
  from smolagents import CodeAgent, LiteLLMModel
12
  from agent import create_web_agent, generate_prompt
13
+ from followup_agent import run_followup_analysis
14
  from unidecode import unidecode
15
 
16
  load_dotenv()
 
302
 
303
  def search_and_analyze(query, progress=gr.Progress()):
304
  """
305
+ Unified function that does initial search then lets agent analyze with full autonomy.
306
  Uses Gradio's progress bar for visual feedback.
307
  """
 
 
308
  # Clear the progress queue
309
  while not progress_queue.empty():
310
  try:
 
315
  # Initialize outputs
316
  pdf_file = None
317
  images_output = [gr.Image(visible=False)] * 4
318
+ status = "πŸš€ Starting agent-driven analysis..."
319
 
320
  # Initial progress
321
+ progress(0.05, desc="πŸš€ Initializing agent...")
322
 
323
  def progress_callback(progress_val, description):
324
  """Callback function to update progress - puts updates in queue"""
 
330
  # Run analysis in a separate thread
331
  result_queue = queue.Queue()
332
 
333
  def run_analysis():
334
  try:
335
+ # Clean up previous results
336
+ if os.path.exists('generated_data'):
337
+ for file in glob.glob('generated_data/*'):
338
+ try:
339
+ os.remove(file)
340
+ except OSError:
341
+ pass  # ignore files that cannot be removed
342
+ else:
343
+ os.makedirs('generated_data', exist_ok=True)
344
+
345
+ # Do initial search if query provided
346
+ initial_search_results = None
347
+ if query.strip():
348
+ progress_callback(0.06, f"πŸ” Initial search for: {query[:50]}...")
349
+ try:
350
+ # Import search function from tools
351
+ from tools.retrieval_tools import search_datasets
352
+ initial_search_results = search_datasets(query, top_k=5)
353
+ progress_callback(0.08, "πŸ€– Starting agent with search results...")
354
+ except Exception as e:
355
+ print(f"Initial search failed: {e}")
356
+ progress_callback(0.08, "πŸ€– Starting agent without initial results...")
357
+ else:
358
+ progress_callback(0.08, "πŸ€– Starting agent for random selection...")
359
+
360
+ step_callback = create_progress_callback()
361
+
362
+ # Create the agent with progress callback
363
+ web_agent = create_web_agent(step_callback)
364
+
365
+ # Generate unified prompt with initial search results
366
+ prompt = generate_prompt(user_query=query, initial_search_results=initial_search_results)
367
+ progress_callback(0.1, "πŸ€– Agent analyzing datasets...")
368
+
369
+ # Run the agent - the step_callbacks will automatically update progress
370
+ answer = web_agent.run(prompt)
371
+
372
+ # Check if the agent found no processable data
373
+ answer_lower = str(answer).lower() if answer else ""
374
+ if ("no processable data" in answer_lower or
375
+ "no csv nor json" in answer_lower or
376
+ "cannot find csv" in answer_lower or
377
+ "cannot find json" in answer_lower or
378
+ "no data to process" in answer_lower):
379
+ progress_callback(1.0, "❌ No CSV/JSON files found in the dataset")
380
+ result_queue.put(("❌ No CSV/JSON files found in the selected dataset. This dataset cannot be processed automatically.", [], None))
381
+ return
382
+
383
+ # Check if files were generated
384
+ generated_files = glob.glob('generated_data/*')
385
+
386
+ if generated_files:
387
+ progress_callback(1.0, "βœ… Analysis completed successfully!")
388
+ result_queue.put(("Analysis completed successfully!", generated_files, "Agent-selected dataset"))
389
+ else:
390
+ progress_callback(1.0, "⚠️ Analysis completed but no files were generated.")
391
+ result_queue.put(("Analysis completed but no files were generated.", [], None))
392
+
393
  except Exception as e:
394
+ progress_callback(1.0, f"❌ Error: {str(e)}")
395
+ result_queue.put((f"Error during analysis: {str(e)}", [], None))
396
 
397
  analysis_thread = threading.Thread(target=run_analysis)
398
  analysis_thread.start()
399
 
400
+ # Show initial status
401
+ current_status = "πŸ€– Agent is finding relevant datasets..."
402
+ progress(0.08, desc=current_status)
 
 
 
403
 
404
  # Monitor progress while analysis runs
405
  last_progress = 0.08
 
423
  # Check if this is a "no data" case
424
  if "❌ No CSV/JSON files found" in final_status:
425
  progress(1.0, desc="❌ No processable data found")
426
+ return (gr.Textbox(value="Agent-selected dataset", visible=True),
427
  final_status,
428
  gr.File(visible=False),
429
  gr.Image(visible=False), gr.Image(visible=False),
430
+ gr.Image(visible=False), gr.Image(visible=False),
431
+ gr.Markdown(visible=False), # keep follow-up hidden
432
+ gr.HTML(visible=False),
433
+ gr.Row(visible=False),
434
+ gr.Row(visible=False),
435
+ gr.Row(visible=False),
436
+ gr.Row(visible=False),
437
+ gr.Row(visible=False))
438
 
439
  # Final progress update
440
  progress(1.0, desc="βœ… Processing results...")
 
463
  # final progress completion
464
  progress(1.0, desc="πŸŽ‰ Complete!")
465
 
466
+ # Show follow-up section after successful completion
467
+ return (gr.Textbox(value=page_url if page_url else "Agent-selected dataset", visible=True),
468
+ final_status, download_button, *images,
469
+ gr.Markdown(visible=True), # followup_section_divider
470
+ gr.HTML(visible=True), # followup_section_header
471
+ gr.Row(visible=True), # followup_input_row
472
+ gr.Row(visible=True), # followup_result_row
473
+ gr.Row(visible=True), # followup_image_row
474
+ gr.Row(visible=True), # followup_examples_header_row
475
+ gr.Row(visible=True)) # followup_examples_row
476
 
477
  except queue.Empty:
478
  pass
 
481
 
482
  except Exception as e:
483
  progress(1.0, desc=f"❌ Error: {str(e)}")
484
+ return (gr.Textbox(value="Error", visible=True), f"❌ Error: {str(e)}", None, *images_output,
485
+ gr.Markdown(visible=False), # keep follow-up hidden on error
486
+ gr.HTML(visible=False),
487
+ gr.Row(visible=False),
488
+ gr.Row(visible=False),
489
+ gr.Row(visible=False),
490
+ gr.Row(visible=False),
491
+ gr.Row(visible=False))
492
 
493
  # Ensure thread completes
494
  analysis_thread.join(timeout=1)
495
 
496
  # Fallback return
497
  progress(1.0, desc="🏁 Finished")
498
+ return (gr.Textbox(value="Completed", visible=True), current_status, pdf_file, *images_output,
499
+ gr.Markdown(visible=False), # keep follow-up hidden
500
+ gr.HTML(visible=False),
501
+ gr.Row(visible=False),
502
+ gr.Row(visible=False),
503
+ gr.Row(visible=False),
504
+ gr.Row(visible=False),
505
+ gr.Row(visible=False))
506
+
507
+ def run_followup_question(question, progress=gr.Progress()):
508
+ """
509
+ Run a follow-up analysis based on user's question about the previous report.
510
+ """
511
+ if not question.strip():
512
+ return "Please enter a follow-up question.", gr.Image(visible=False)
513
+
514
+ progress(0.1, desc="πŸ€– Starting follow-up analysis...")
515
+
516
+ try:
517
+ # Check if there are previous results
518
+ if not os.path.exists('generated_data') or not os.listdir('generated_data'):
519
+ return "No previous analysis found. Please run an analysis first.", gr.Image(visible=False)
520
+
521
+ progress(0.3, desc="πŸ” Analyzing previous report and dataset...")
522
+
523
+ # Run the follow-up analysis
524
+ result = run_followup_analysis(question)
525
+
526
+ progress(0.9, desc="πŸ“Š Processing results...")
527
+
528
+ # Look for new visualizations created by the follow-up analysis
529
+ import glob
530
+
531
+ # Get all images that were created after the analysis started
532
+ all_images = glob.glob('generated_data/*.png')
533
+
534
+ # Get recent images (created in the last few seconds)
535
+ import time
536
+ current_time = time.time()
537
+ recent_images = []
538
+
539
+ for img_path in all_images:
540
+ img_time = os.path.getctime(img_path)
541
+ if current_time - img_time < 120: # Images created in last 2 minutes
542
+ recent_images.append(img_path)
543
+
544
+ # Get the most recent image if any
545
+ latest_image = None
546
+ if recent_images:
547
+ latest_image = max(recent_images, key=os.path.getctime)
548
+
549
+ progress(1.0, desc="βœ… Follow-up analysis complete!")
550
+
551
+ # Enhanced result formatting
552
+ final_result = result
553
+ if latest_image:
554
+ final_result += f"\n\nπŸ“Š **Visualization Created:** {os.path.basename(latest_image)}"
555
+ if len(recent_images) > 1:
556
+ final_result += f"\nπŸ“ˆ **Total new visualizations:** {len(recent_images)}"
557
+ return final_result, gr.Image(value=latest_image, visible=True)
558
+ else:
559
+ return final_result, gr.Image(visible=False)
560
+
561
+ except Exception as e:
562
+ progress(1.0, desc="❌ Error in follow-up analysis")
563
+ return f"Error: {str(e)}", gr.Image(visible=False)
564
 
565
  # Create the Gradio interface
566
  with gr.Blocks(title="πŸ€– French Public Data Analysis Agent", theme=gr.themes.Soft(), css="""
 
619
  gr.HTML("""
620
  <div style="text-align: center; background: #f8fafc; padding: 1.5rem; border-radius: 10px; margin: 1rem 0;">
621
  <p style="font-size: 1.1rem; color: #374151; margin: 0;">
622
+ 🌐 <strong>Search in French or English</strong> β€’ πŸ€– <strong>AI Agent finds & analyzes datasets</strong> β€’ πŸ—ΊοΈ <strong>Generate Reports with visualizations</strong>
623
+ </p>
624
+ <p style="font-size: 0.9rem; color: #6b7280; margin-top: 0.5rem;">
625
+ Initial search results guide the agent, but it can search for different datasets if needed
626
  </p>
627
  </div>
628
  """)
 
633
  with gr.Column():
634
  gr.Markdown("""
635
  🎯 **How to Use:**
636
+ - Enter search terms related to French public data
637
+ - Leave empty for random high-quality dataset selection
638
+ - System provides initial search results to guide the agent
639
+ - Agent can use provided results or search for different datasets
640
  - Results include visualizations and downloadable reports
641
 
642
  ⏱️ **Processing Time:**
643
+ - Analysis takes 7-15 minutes depending on dataset complexity
644
+ - Agent has full autonomy to find the best datasets
645
  """)
646
  with gr.Column():
647
  gr.Markdown("""
648
  ⚠️ **Important Notes:**
649
+ - Agent gets initial search results but has full autonomy to make decisions
650
+ - Agent can choose from initial results or search for different datasets
651
  - Some datasets may not contain processable CSV/JSON files
652
  - All visualizations are automatically generated
653
  - Maps focus on France when geographic data is available
 
680
 
681
  with gr.Row():
682
  examples = [
683
+ ("πŸš— Road Traffic Accidents 2023", "road traffic accidents 2023"),
684
  ("πŸŽ“ Education Directory", "education directory"),
685
  ("🏠 French Vacant Housing Private Park", "French vacant housing private park"),
686
  ]
 
724
  image3 = gr.Image(label="πŸ—ΊοΈ Map/Chart 3", visible=False, height=400)
725
  image4 = gr.Image(label="πŸ“‰ Chart 4", visible=False, height=400)
726
 
727
+ # Follow-up Analysis Section (initially hidden)
728
+ followup_section_divider = gr.Markdown("---", visible=False)
729
+ followup_section_header = gr.HTML("""
730
+ <div style="text-align: center; margin: 2rem 0;">
731
+ <h2 style="color: #374151; margin-bottom: 0.5rem;">πŸ€– Follow-up Analysis</h2>
732
+ <p style="color: #6b7280; margin: 0;">Ask questions about the generated report and dataset</p>
733
+ </div>
734
+ """, visible=False)
735
+
736
+ with gr.Row(visible=False) as followup_input_row:
737
+ followup_input = gr.Textbox(
738
+ label="Follow-up Question",
739
+ placeholder="e.g., Show me correlation between two columns, Create a chart for specific regions, What are the trends over time?",
740
+ scale=4
741
+ )
742
+ followup_button = gr.Button(
743
+ "πŸ” Analyze",
744
+ variant="secondary",
745
+ scale=1,
746
+ size="lg"
747
+ )
748
+
749
+ with gr.Row(visible=False) as followup_result_row:
750
+ followup_result = gr.Textbox(
751
+ label="πŸ“Š Follow-up Analysis Results",
752
+ interactive=False,
753
+ lines=10,
754
+ visible=True
755
+ )
756
+
757
+ with gr.Row(visible=False) as followup_image_row:
758
+ followup_image = gr.Image(
759
+ label="πŸ“ˆ Follow-up Visualization",
760
+ visible=False,
761
+ height=500
762
+ )
763
+
764
+ # Follow-up Examples (initially hidden)
765
+ with gr.Row(visible=False) as followup_examples_header_row:
766
+ gr.HTML("""
767
+ <div>
768
+ <h4 style="color: #374151">πŸ’‘ Example Follow-up Questions</h4>
769
+ <p style="color: #6b7280">Click any example below to try it out</p>
770
+ </div>
771
+ """)
772
+
773
+ with gr.Row(visible=False) as followup_examples_row:
774
+ followup_examples = [
775
+ ("πŸ“Š Correlation Analysis", "Show me the correlation between two numerical columns with a scatter plot"),
776
+ ("πŸ“ˆ Statistical Summary", "Create a comprehensive statistical summary with visualization for a specific column"),
777
+ ("🎯 Filter & Analyze", "Filter the data by specific criteria and create a visualization"),
778
+ ("πŸ“‹ Dataset Overview", "Give me a detailed summary of the dataset structure and contents"),
779
+ ("πŸ“‰ Trend Analysis", "Create a line chart showing trends over time for specific data"),
780
+ ("πŸ” Custom Visualization", "Create a custom bar/pie/histogram chart for specific columns"),
781
+ ]
782
+
783
+ for emoji_text, query_text in followup_examples:
784
+ gr.Button(
785
+ emoji_text,
786
+ variant="secondary",
787
+ size="sm"
788
+ ).click(
789
+ lambda x=query_text: x,
790
+ outputs=followup_input
791
+ )
792
+
793
  # Set up the search button click event with progress bar
794
  search_button.click(
795
  fn=search_and_analyze,
796
  inputs=[query_input],
797
+ outputs=[page_url_display, status_output, download_button, image1, image2, image3, image4,
798
+ followup_section_divider, followup_section_header, followup_input_row,
799
+ followup_result_row, followup_image_row, followup_examples_header_row, followup_examples_row],
800
  show_progress="full" # Show the built-in progress bar
801
  )
802
 
803
+ # Set up the follow-up button click event
804
+ followup_button.click(
805
+ fn=run_followup_question,
806
+ inputs=[followup_input],
807
+ outputs=[followup_result, followup_image],
808
+ show_progress="full"
809
+ )
810
+
811
 
812
 
813
  if __name__ == "__main__":
 
816
  share=True,
817
  server_name="0.0.0.0",
818
  server_port=7860,
819
+ show_error=True
820
  )
followup_agent.py ADDED
@@ -0,0 +1,119 @@
1
+ import os
2
+ from tools.followup_tools import (
3
+ load_previous_dataset,
4
+ get_dataset_summary,
5
+ create_followup_visualization,
6
+ get_previous_report_content,
7
+ analyze_column_correlation,
8
+ create_statistical_summary,
9
+ filter_and_visualize_data,
10
+ )
11
+ from tools.retrieval_tools import (
12
+ search_datasets,
13
+ get_dataset_info,
14
+ )
15
+ from smolagents import (
16
+ CodeAgent,
17
+ DuckDuckGoSearchTool,
18
+ LiteLLMModel,
19
+ )
20
+
21
+ def create_followup_agent():
22
+ """Create a specialized agent for follow-up analysis"""
23
+ search_tool = DuckDuckGoSearchTool()
24
+ model = LiteLLMModel(
25
+ model_id="gemini/gemini-2.5-flash-preview-05-20",
26
+ api_key=os.getenv("GEMINI_API_KEY"),
27
+ )
28
+
29
+ followup_agent = CodeAgent(
30
+ tools=[
31
+ search_tool,
32
+ load_previous_dataset,
33
+ get_dataset_summary,
34
+ create_followup_visualization,
35
+ get_previous_report_content,
36
+ analyze_column_correlation,
37
+ create_statistical_summary,
38
+ filter_and_visualize_data,
39
+ search_datasets,
40
+ get_dataset_info,
41
+ ],
42
+ model=model,
43
+ max_steps=20,
44
+ verbosity_level=1,
45
+ planning_interval=2,
46
+ additional_authorized_imports=[
47
+ "pandas", "numpy", "matplotlib", "matplotlib.pyplot", "seaborn",
48
+ "os", "json", "datetime", "math", "statistics"
49
+ ],
50
+ )
51
+ return followup_agent
52
+
53
+ def generate_followup_prompt(user_question, report_context=None):
54
+ """Generate a prompt for follow-up analysis"""
55
+
56
+ base_prompt = f"""You are a data analysis assistant helping with follow-up questions about a previously generated report.
57
+
58
+ USER'S FOLLOW-UP QUESTION: "{user_question}"
59
+
60
+ AVAILABLE TOOLS:
61
+ 1. **load_previous_dataset()** - Load the dataset used in the previous analysis
62
+ 2. **get_dataset_summary(df)** - Get detailed info about the dataset structure
63
+ 3. **get_previous_report_content()** - Get context about the previous report
64
+ 4. **create_followup_visualization()** - Create new charts and graphs (bar, line, scatter, histogram, box, pie)
65
+ 5. **analyze_column_correlation()** - Analyze relationships between columns with scatter plots
66
+ 6. **create_statistical_summary()** - Generate comprehensive stats + visualizations for any column
67
+ 7. **filter_and_visualize_data()** - Filter data by criteria and create targeted visualizations
68
+ 8. **search_datasets()** - Search for additional datasets if needed
69
+ 9. **get_dataset_info()** - Get info about specific datasets
70
+
71
+ ANALYSIS APPROACH:
72
+ 1. First, get context by calling get_previous_report_content()
73
+ 2. Load the previous dataset using load_previous_dataset()
74
+ 3. Get a summary of the dataset structure with get_dataset_summary()
75
+ 4. Based on the user's question, perform the appropriate analysis:
76
+ - Create new visualizations if they want different charts
77
+ - Analyze correlations if they ask about relationships
78
+ - Filter or group data if they want specific subsets
79
+ - Calculate statistics if they want numerical insights
80
+ 5. **ALWAYS create visualizations when relevant** - save to generated_data folder
81
+ 6. Provide a comprehensive text answer AND create supporting visualizations
82
+
83
+ IMPORTANT GUIDELINES:
84
+ - Always start by understanding the previous report context
85
+ - Use the same dataset that was used in the original analysis
86
+ - **CREATE VISUALIZATIONS whenever possible** - charts help answer questions better
87
+ - Provide clear, actionable insights in TEXT format
88
+ - Save all new visualization files to the generated_data folder with descriptive filenames
89
+ - Be concise but thorough in your explanations
90
+ - Combine text analysis with visual evidence
91
+
92
+ Answer the user's question: "{user_question}"
93
+ """
94
+
95
+ if report_context:
96
+ base_prompt += f"""
97
+
98
+ ADDITIONAL CONTEXT ABOUT PREVIOUS REPORT:
99
+ {report_context}
100
+ """
101
+
102
+ return base_prompt
103
+
104
+ def run_followup_analysis(user_question, report_context=None):
105
+ """Run a follow-up analysis based on user question"""
106
+ try:
107
+ # Create the follow-up agent
108
+ agent = create_followup_agent()
109
+
110
+ # Generate the prompt
111
+ prompt = generate_followup_prompt(user_question, report_context)
112
+
113
+ # Run the analysis
114
+ result = agent.run(prompt)
115
+
116
+ return str(result)
117
+
118
+ except Exception as e:
119
+ return f"Error in follow-up analysis: {str(e)}"
tools/followup_tools.py ADDED
@@ -0,0 +1,515 @@
+ import os
+ import pandas as pd
+ import json
+ import glob
+ from smolagents import tool
+ import matplotlib.pyplot as plt
+ import seaborn as sns
+ from pathlib import Path
+ import numpy as np
+
+ @tool
+ def load_previous_dataset() -> pd.DataFrame:
+ """
+ Load the dataset that was used in the previous analysis.
+
+ Returns:
+ The pandas DataFrame that was used in the previous report generation
+ """
+ try:
+ # Look for saved dataset in generated_data folder
+ dataset_files = glob.glob('generated_data/*dataset*.csv') + glob.glob('generated_data/*data*.csv')
+
+ if not dataset_files:
+ # Try to find any CSV file in generated_data
+ csv_files = glob.glob('generated_data/*.csv')
+ if csv_files:
+ dataset_files = csv_files
+
+ if not dataset_files:
+ raise Exception("No dataset found in generated_data folder")
+
+ # Use the most recent dataset file
+ latest_file = max(dataset_files, key=os.path.getctime)
+ df = pd.read_csv(latest_file)
+
+ print(f"✅ Loaded dataset from {latest_file} with {len(df)} rows and {len(df.columns)} columns")
+ return df
+
+ except Exception as e:
+ raise Exception(f"Error loading previous dataset: {str(e)}")
+
+ @tool
+ def get_dataset_summary(df: pd.DataFrame) -> str:
+ """
+ Get a comprehensive summary of the dataset structure and content.
+
+ Args:
+ df: The pandas DataFrame to analyze
+
+ Returns:
+ A formatted string with dataset summary information
+ """
+ try:
+ summary_lines = []
+ summary_lines.append("=== DATASET SUMMARY ===")
+ summary_lines.append(f"Shape: {df.shape[0]} rows × {df.shape[1]} columns")
+ summary_lines.append("")
+
+ summary_lines.append("Column Information:")
+ for col in df.columns:
+ dtype = str(df[col].dtype)
+ non_null = df[col].count()
+ null_count = df[col].isnull().sum()
+ unique_count = df[col].nunique()
+
+ summary_lines.append(f" • {col}: {dtype}, {non_null} non-null, {null_count} null, {unique_count} unique")
+
+ # Show sample values for categorical columns
+ if df[col].dtype == 'object' and unique_count <= 10:
+ sample_values = df[col].value_counts().head(5).index.tolist()
+ summary_lines.append(f" Sample values: {sample_values}")
+
+ summary_lines.append("")
+ summary_lines.append("First 3 rows:")
+ summary_lines.append(df.head(3).to_string())
+
+ return "\n".join(summary_lines)
+
+ except Exception as e:
+ return f"Error analyzing dataset: {str(e)}"
+
+ @tool
+ def create_followup_visualization(df: pd.DataFrame, chart_type: str, x_column: str, y_column: str = None, title: str = "Follow-up Analysis", filename: str = "followup_chart.png") -> str:
+ """
+ Create a visualization for follow-up analysis.
+
+ Args:
+ df: The pandas DataFrame to visualize
+ chart_type: Type of chart ('bar', 'line', 'scatter', 'histogram', 'box', 'pie')
+ x_column: Column name for x-axis
+ y_column: Column name for y-axis (optional for some chart types)
+ title: Title for the chart
+ filename: Name of the file to save (should end with .png)
+
+ Returns:
+ Path to the saved visualization file
+ """
+ try:
+ plt.figure(figsize=(12, 8))
+
+ if chart_type == 'bar':
+ if y_column:
+ df_grouped = df.groupby(x_column)[y_column].sum().sort_values(ascending=False)
+ plt.bar(range(len(df_grouped)), df_grouped.values)
+ plt.xticks(range(len(df_grouped)), df_grouped.index, rotation=45)
+ plt.ylabel(y_column)
+ else:
+ value_counts = df[x_column].value_counts().head(10)
+ plt.bar(range(len(value_counts)), value_counts.values)
+ plt.xticks(range(len(value_counts)), value_counts.index, rotation=45)
+ plt.ylabel('Count')
+
+ elif chart_type == 'line':
+ if y_column:
+ df_sorted = df.sort_values(x_column)
+ plt.plot(df_sorted[x_column], df_sorted[y_column])
+ plt.ylabel(y_column)
+ else:
+ value_counts = df[x_column].value_counts().sort_index()
+ plt.plot(value_counts.index, value_counts.values)
+ plt.ylabel('Count')
+
+ elif chart_type == 'scatter':
+ if y_column:
+ plt.scatter(df[x_column], df[y_column], alpha=0.6)
+ plt.ylabel(y_column)
+ else:
+ raise ValueError("Scatter plot requires both x_column and y_column")
+
+ elif chart_type == 'histogram':
+ plt.hist(df[x_column], bins=30, alpha=0.7)
+ plt.ylabel('Frequency')
+
+ elif chart_type == 'box':
+ if y_column:
+ df.boxplot(column=y_column, by=x_column)
+ else:
+ plt.boxplot(df[x_column])
+ plt.ylabel(x_column)
+
+ elif chart_type == 'pie':
+ value_counts = df[x_column].value_counts().head(10)
+ plt.pie(value_counts.values, labels=value_counts.index, autopct='%1.1f%%')
+
+ else:
+ raise ValueError(f"Unsupported chart type: {chart_type}")
+
+ plt.xlabel(x_column)
+ plt.title(title)
+ plt.tight_layout()
+
+ # Save to generated_data folder
+ if not filename.endswith('.png'):
+ filename += '.png'
+
+ filepath = os.path.join('generated_data', filename)
+ plt.savefig(filepath, dpi=300, bbox_inches='tight')
+ plt.close()
+
+ return f"Visualization saved to: {filepath}"
+
+ except Exception as e:
+ plt.close() # Ensure plot is closed even on error
+ return f"Error creating visualization: {str(e)}"
+
+ @tool
+ def get_previous_report_content() -> str:
+ """
+ Get the content of the previously generated report.
+
+ Returns:
+ The text content of the previous report for context
+ """
+ try:
+ # Look for PDF or DOCX files in generated_data
+ report_files = glob.glob('generated_data/*.pdf') + glob.glob('generated_data/*.docx')
+
+ if not report_files:
+ return "No previous report found in generated_data folder"
+
+ # Use the most recent report file
+ latest_report = max(report_files, key=os.path.getctime)
+
+ # For now, return basic info about the report
+ # In a full implementation, you'd extract text from PDF/DOCX
+ file_size = os.path.getsize(latest_report)
+
+ # Also look for any text files that might contain analysis
+ text_files = glob.glob('generated_data/*.txt')
+ text_content = ""
+
+ if text_files:
+ latest_text = max(text_files, key=os.path.getctime)
+ with open(latest_text, 'r', encoding='utf-8') as f:
+ text_content = f.read()
+
+ summary = f"""=== PREVIOUS REPORT CONTEXT ===
+ Report file: {latest_report}
+ File size: {file_size} bytes
+ Created: {os.path.getctime(latest_report)}
+
+ Additional analysis content:
+ {text_content if text_content else 'No additional text content found'}
+
+ The report was generated from the dataset in the previous analysis.
+ You can use load_previous_dataset() to access the same data.
+ """
+
+ return summary
+
+ except Exception as e:
+ return f"Error accessing previous report: {str(e)}"
+
+ @tool
+ def analyze_column_correlation(df: pd.DataFrame, column1: str, column2: str) -> str:
+ """
+ Analyze correlation between two columns in the dataset.
+
+ Args:
+ df: The pandas DataFrame
+ column1: First column name
+ column2: Second column name
+
+ Returns:
+ Correlation analysis results
+ """
+ try:
+ # Check if columns exist
+ if column1 not in df.columns or column2 not in df.columns:
+ return f"Error: One or both columns not found. Available columns: {list(df.columns)}"
+
+ # Convert to numeric if possible
+ try:
+ col1_numeric = pd.to_numeric(df[column1], errors='coerce')
+ col2_numeric = pd.to_numeric(df[column2], errors='coerce')
+ except Exception:
+ return f"Error: Cannot convert columns to numeric for correlation analysis"
+
+ # Calculate correlation
+ correlation = col1_numeric.corr(col2_numeric)
+
+ # Create scatter plot
+ plt.figure(figsize=(10, 6))
+ plt.scatter(col1_numeric, col2_numeric, alpha=0.6)
+ plt.xlabel(column1)
+ plt.ylabel(column2)
+ plt.title(f'Correlation between {column1} and {column2}\nCorrelation coefficient: {correlation:.3f}')
+
+ # Add trend line, fitted only on rows where both columns are non-null so the arrays stay aligned
+ valid_mask = col1_numeric.notna() & col2_numeric.notna()
+ if valid_mask.sum() >= 2:
+ z = np.polyfit(col1_numeric[valid_mask], col2_numeric[valid_mask], 1)
+ p = np.poly1d(z)
+ plt.plot(col1_numeric[valid_mask], p(col1_numeric[valid_mask]), "r--", alpha=0.8)
+
+ plt.tight_layout()
+
+ # Save plot
+ filename = f'correlation_{column1}_{column2}.png'
+ filepath = os.path.join('generated_data', filename)
+ plt.savefig(filepath, dpi=300, bbox_inches='tight')
+ plt.close()
+
+ # Interpret correlation
+ if abs(correlation) > 0.7:
+ strength = "strong"
+ elif abs(correlation) > 0.4:
+ strength = "moderate"
+ elif abs(correlation) > 0.2:
+ strength = "weak"
+ else:
+ strength = "very weak"
+
+ direction = "positive" if correlation > 0 else "negative"
+
+ result = f"""=== CORRELATION ANALYSIS ===
+ Columns: {column1} vs {column2}
+ Correlation coefficient: {correlation:.3f}
+ Strength: {strength} {direction} correlation
+
+ Interpretation:
+ - The correlation is {strength} and {direction}
+ - Values closer to 1 or -1 indicate stronger linear relationships
+ - Values closer to 0 indicate weaker linear relationships
+
+ Visualization saved to: {filepath}
+ """
+
+ return result
+
+ except Exception as e:
+ return f"Error in correlation analysis: {str(e)}"
+
+ @tool
+ def create_statistical_summary(df: pd.DataFrame, column_name: str) -> str:
+ """
+ Create a comprehensive statistical summary with visualization for a specific column.
+
+ Args:
+ df: The pandas DataFrame
+ column_name: Name of the column to analyze
+
+ Returns:
+ Statistical summary and saves a visualization
+ """
+ try:
+ if column_name not in df.columns:
+ return f"Error: Column '{column_name}' not found. Available columns: {list(df.columns)}"
+
+ column_data = df[column_name]
+
+ # Generate statistical summary
+ summary_lines = [f"=== STATISTICAL SUMMARY: {column_name} ==="]
+
+ if pd.api.types.is_numeric_dtype(column_data):
+ # Numeric column analysis
+ stats = column_data.describe()
+ summary_lines.extend([
+ f"Count: {stats['count']:.0f}",
+ f"Mean: {stats['mean']:.2f}",
+ f"Median: {stats['50%']:.2f}",
+ f"Standard Deviation: {stats['std']:.2f}",
+ f"Min: {stats['min']:.2f}",
+ f"Max: {stats['max']:.2f}",
+ f"25th Percentile: {stats['25%']:.2f}",
+ f"75th Percentile: {stats['75%']:.2f}",
+ ])
+
+ # Create histogram and box plot
+ fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
+
+ # Histogram
+ ax1.hist(column_data.dropna(), bins=30, alpha=0.7, color='skyblue', edgecolor='black')
+ ax1.set_title(f'Distribution of {column_name}')
+ ax1.set_xlabel(column_name)
+ ax1.set_ylabel('Frequency')
+ ax1.grid(True, alpha=0.3)
+
+ # Box plot
+ ax2.boxplot(column_data.dropna())
+ ax2.set_title(f'Box Plot of {column_name}')
+ ax2.set_ylabel(column_name)
+ ax2.grid(True, alpha=0.3)
+
+ else:
+ # Categorical column analysis
+ value_counts = column_data.value_counts()
+ summary_lines.extend([
+ f"Total unique values: {column_data.nunique()}",
+ f"Most frequent value: {value_counts.index[0]} ({value_counts.iloc[0]} times)",
+ f"Least frequent value: {value_counts.index[-1]} ({value_counts.iloc[-1]} times)",
+ "",
+ "Top 10 values:"
+ ])
+
+ for value, count in value_counts.head(10).items():
+ percentage = (count / len(column_data)) * 100
+ summary_lines.append(f" {value}: {count} ({percentage:.1f}%)")
+
+ # Create bar chart and pie chart
+ fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
+
+ # Bar chart
+ top_values = value_counts.head(10)
+ ax1.bar(range(len(top_values)), top_values.values, color='lightcoral')
+ ax1.set_title(f'Top 10 Values in {column_name}')
+ ax1.set_xlabel('Categories')
+ ax1.set_ylabel('Count')
+ ax1.set_xticks(range(len(top_values)))
+ ax1.set_xticklabels(top_values.index, rotation=45, ha='right')
+ ax1.grid(True, alpha=0.3)
+
+ # Pie chart (top 8 values + others)
+ top_8 = value_counts.head(8)
+ others_count = value_counts.iloc[8:].sum() if len(value_counts) > 8 else 0
+
+ if others_count > 0:
+ pie_data = list(top_8.values) + [others_count]
+ pie_labels = list(top_8.index) + ['Others']
+ else:
+ pie_data = top_8.values
+ pie_labels = top_8.index
+
+ ax2.pie(pie_data, labels=pie_labels, autopct='%1.1f%%', startangle=90)
+ ax2.set_title(f'Distribution of {column_name}')
+
+ plt.tight_layout()
+
+ # Save the plot
+ filename = f'statistical_summary_{column_name}.png'
+ filepath = os.path.join('generated_data', filename)
+ plt.savefig(filepath, dpi=300, bbox_inches='tight')
+ plt.close()
+
+ summary_lines.append(f"\nVisualization saved to: {filepath}")
+
+ return "\n".join(summary_lines)
+
+ except Exception as e:
+ return f"Error in statistical analysis: {str(e)}"
+
+ @tool
+ def filter_and_visualize_data(df: pd.DataFrame, filter_column: str, filter_value: str, analysis_column: str, chart_type: str = "bar") -> str:
+ """
+ Filter the dataset and create a visualization of the filtered data.
+
+ Args:
+ df: The pandas DataFrame
+ filter_column: Column to filter by
+ filter_value: Value to filter for (can be partial match for string columns)
+ analysis_column: Column to analyze in the filtered data
+ chart_type: Type of chart to create ('bar', 'line', 'histogram', 'pie')
+
+ Returns:
+ Analysis results and saves a visualization
+ """
+ try:
+ if filter_column not in df.columns:
+ return f"Error: Filter column '{filter_column}' not found. Available columns: {list(df.columns)}"
+
+ if analysis_column not in df.columns:
+ return f"Error: Analysis column '{analysis_column}' not found. Available columns: {list(df.columns)}"
+
+ # Filter the data
+ if df[filter_column].dtype == 'object':
+ # String filtering - partial match
+ filtered_df = df[df[filter_column].str.contains(filter_value, case=False, na=False)]
+ else:
+ # Numeric filtering - exact match
+ try:
+ filter_value_numeric = float(filter_value)
+ filtered_df = df[df[filter_column] == filter_value_numeric]
+ except ValueError:
+ return f"Error: Cannot convert '{filter_value}' to numeric for filtering"
+
+ if filtered_df.empty:
+ return f"No data found matching filter: {filter_column} = '{filter_value}'"
+
+ result_lines = [
+ f"=== FILTERED DATA ANALYSIS ===",
+ f"Filter: {filter_column} contains/equals '{filter_value}'",
+ f"Filtered dataset size: {len(filtered_df)} rows (from {len(df)} total)",
+ f"Analysis column: {analysis_column}",
+ ""
+ ]
+
+ # Analyze the filtered data
+ analysis_data = filtered_df[analysis_column]
+
+ plt.figure(figsize=(12, 8))
+
+ if chart_type == "bar":
+ if pd.api.types.is_numeric_dtype(analysis_data):
+ # For numeric data, create bins
+ analysis_data.hist(bins=20, alpha=0.7, color='lightblue', edgecolor='black')
+ plt.ylabel('Frequency')
+ else:
+ # For categorical data, show value counts
+ value_counts = analysis_data.value_counts().head(15)
+ plt.bar(range(len(value_counts)), value_counts.values, color='lightcoral')
+ plt.xticks(range(len(value_counts)), value_counts.index, rotation=45, ha='right')
+ plt.ylabel('Count')
+
+ # Add statistics to result
+ result_lines.extend([
+ f"Top value: {value_counts.index[0]} ({value_counts.iloc[0]} occurrences)",
+ f"Total unique values: {analysis_data.nunique()}"
+ ])
+
+ elif chart_type == "line":
+ if pd.api.types.is_numeric_dtype(analysis_data):
+ sorted_data = analysis_data.sort_values()
+ plt.plot(range(len(sorted_data)), sorted_data.values, marker='o', alpha=0.7)
+ plt.ylabel(analysis_column)
+ plt.xlabel('Sorted Index')
+ else:
+ return "Line chart requires numeric data for analysis column"
+
+ elif chart_type == "histogram":
+ if pd.api.types.is_numeric_dtype(analysis_data):
+ plt.hist(analysis_data.dropna(), bins=30, alpha=0.7, color='green', edgecolor='black')
+ plt.ylabel('Frequency')
+
+ # Add statistics
+ mean_val = analysis_data.mean()
+ median_val = analysis_data.median()
+ result_lines.extend([
+ f"Mean: {mean_val:.2f}",
+ f"Median: {median_val:.2f}",
+ f"Standard Deviation: {analysis_data.std():.2f}"
+ ])
+ else:
+ return "Histogram requires numeric data for analysis column"
+
+ elif chart_type == "pie":
+ value_counts = analysis_data.value_counts().head(10)
+ plt.pie(value_counts.values, labels=value_counts.index, autopct='%1.1f%%', startangle=90)
+
+ plt.title(f'{chart_type.title()} Chart: {analysis_column}\nFiltered by {filter_column} = "{filter_value}"')
+ plt.xlabel(analysis_column)
+ plt.tight_layout()
+
+ # Save the plot
+ filename = f'filtered_{filter_column}_{filter_value}_{analysis_column}_{chart_type}.png'
+ # Clean filename
+ filename = "".join(c for c in filename if c.isalnum() or c in ('_', '-', '.')).rstrip()
+ filepath = os.path.join('generated_data', filename)
+ plt.savefig(filepath, dpi=300, bbox_inches='tight')
+ plt.close()
+
+ result_lines.append(f"\nVisualization saved to: {filepath}")
+
+ return "\n".join(result_lines)
+
+ except Exception as e:
+ return f"Error in filtered analysis: {str(e)}"
tools/libreoffice_tools.py CHANGED
@@ -122,12 +122,23 @@ def convert_to_pdf_with_libreoffice(input_file: str, output_dir: str = None) ->
  return f"Error during LibreOffice conversion: {str(e)}"
 
  @tool
- def check_libreoffice_availability() -> str:
  """
- Check if LibreOffice is available and return its path and version.
 
  Returns:
- str: Information about LibreOffice availability
  """
  libreoffice_path = get_libreoffice_path()
 
 
  return f"Error during LibreOffice conversion: {str(e)}"
 
  @tool
+ def check_libreoffice_availability() -> bool:
  """
+ Check if LibreOffice is available on the system.
 
  Returns:
+ bool: True if LibreOffice is available, False otherwise
+ """
+ libreoffice_path = get_libreoffice_path()
+ return libreoffice_path is not None
+
+ @tool
+ def get_libreoffice_info() -> str:
+ """
+ Get detailed information about LibreOffice installation for troubleshooting.
+
+ Returns:
+ str: Detailed information about LibreOffice availability and installation
  """
  libreoffice_path = get_libreoffice_path()
 
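A sketch of how the new boolean check and the troubleshooting helper might be combined around the existing converter; `convert_to_pdf_with_libreoffice` and both tools appear in this file, while the report path is purely illustrative.

```python
from tools.libreoffice_tools import (
    check_libreoffice_availability,
    get_libreoffice_info,
    convert_to_pdf_with_libreoffice,
)

# Convert only when LibreOffice is found; otherwise surface installation details
if check_libreoffice_availability():
    print(convert_to_pdf_with_libreoffice("generated_data/report.docx"))
else:
    print(get_libreoffice_info())
```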
tools/retrieval_tools.py ADDED
@@ -0,0 +1,277 @@
+ import os
+ import pandas as pd
+ import pickle
+ import numpy as np
+ from smolagents import tool
+ from rank_bm25 import BM25Okapi
+ from dotenv import load_dotenv
+ from smolagents import CodeAgent, LiteLLMModel
+ from unidecode import unidecode
+ import numpy as np
+
+ load_dotenv()
+
+ # Global variables for BM25 model
+ _bm25_model = None
+ _precomputed_titles = None
+ _dataset_df = None
+ _llm_translator = None
+
+ def _initialize_retrieval_system():
+ """Initialize the retrieval system with BM25 model and dataset"""
+ global _bm25_model, _precomputed_titles, _dataset_df, _llm_translator
+
+ # Load dataset if not already loaded
+ if _dataset_df is None:
+ try:
+ _dataset_df = pd.read_csv('filtered_dataset.csv')
+ print(f"✅ Loaded dataset with {len(_dataset_df)} entries")
+ except FileNotFoundError:
+ raise Exception("filtered_dataset.csv not found. Please ensure the dataset file exists.")
+
+ # Initialize LLM translator if not already initialized
+ if _llm_translator is None:
+ try:
+ model = LiteLLMModel(
+ model_id="gemini/gemini-2.5-flash-preview-05-20",
+ api_key=os.getenv("GEMINI_API_KEY")
+ )
+ _llm_translator = CodeAgent(tools=[], model=model, max_steps=1)
+ print("✅ LLM translator initialized")
+ except Exception as e:
+ print(f"⚠️ Error initializing LLM translator: {e}")
+
+ # Load pre-computed BM25 model if available
+ if _bm25_model is None:
+ try:
+ with open('bm25_data.pkl', 'rb') as f:
+ bm25_data = pickle.load(f)
+ _bm25_model = bm25_data['bm25_model']
+ _precomputed_titles = bm25_data['titles']
+ print(f"✅ Loaded pre-computed BM25 model for {len(_precomputed_titles)} datasets")
+ except FileNotFoundError:
+ print("⚠️ Pre-computed BM25 model not found. Will compute at runtime.")
+ except Exception as e:
+ print(f"⚠️ Error loading pre-computed BM25 model: {e}")
+
+ def _translate_query_llm(query, target_lang='fr'):
+ """Translate query using LLM"""
+ global _llm_translator
+
+ if _llm_translator is None:
+ return query, 'unknown'
+
+ try:
+ if target_lang == 'fr':
+ target_language = "French"
+ elif target_lang == 'en':
+ target_language = "English"
+ else:
+ target_language = target_lang
+
+ translation_prompt = f"""
+ Translate the following text to {target_language}.
+ If the text is already in {target_language}, return it as is.
+ Only return the translated text, nothing else.
+
+ Text to translate: "{query}"
+ """
+
+ response = _llm_translator.run(translation_prompt)
+ translated_text = str(response).strip().strip('"').strip("'")
+
+ # Simple language detection
+ if query.lower() == translated_text.lower():
+ source_lang = target_lang
+ else:
+ source_lang = 'en' if target_lang == 'fr' else 'fr'
+
+ return translated_text, source_lang
+
+ except Exception as e:
+ print(f"LLM translation error: {e}")
+ return query, 'unknown'
+
+ def _simple_keyword_preprocessing(text):
+ """Simple preprocessing for keyword matching - handles case, accents and basic plurals"""
+ text = unidecode(str(text).lower())
+
+ words = text.split()
+ processed_words = []
+
+ for word in words:
+ if word.endswith('s') and len(word) > 3 and not word.endswith('ss'):
+ word = word[:-1]
+ elif word.endswith('x') and len(word) > 3:
+ word = word[:-1]
+ processed_words.append(word)
+
+ return processed_words
+
+ @tool
+ def search_datasets(query: str, top_k: int = 5) -> str:
+ """
+ Search for relevant datasets in the French public data catalog using BM25-based keyword matching.
+
+ Args:
+ query: The search query describing what kind of dataset you're looking for
+ top_k: Number of top results to return (default: 5)
+
+ Returns:
+ A formatted string containing the top matching datasets with their titles, URLs, and relevance scores
+ """
+ try:
+ # Initialize the retrieval system
+ _initialize_retrieval_system()
+
+ global _bm25_model, _precomputed_titles, _dataset_df
+
+ # Translate query to French for better matching
+ translated_query, original_lang = _translate_query_llm(query, target_lang='fr')
+
+ # Combine original and translated queries for search
+ search_queries = [query, translated_query] if query != translated_query else [query]
+
+ # Get dataset titles
+ dataset_titles = _dataset_df['title'].fillna('').tolist()
+
+ # Use pre-computed BM25 model if available and matches current dataset
+ if (_bm25_model is not None and _precomputed_titles is not None and
+ len(dataset_titles) == len(_precomputed_titles) and dataset_titles == _precomputed_titles):
+ bm25 = _bm25_model
+ else:
+ # Build BM25 model at runtime
+ processed_titles = [_simple_keyword_preprocessing(title) for title in dataset_titles]
+ bm25 = BM25Okapi(processed_titles)
+
+ # Get scores for all search queries and find best matches
+ all_scores = []
+ for search_query in search_queries:
+ try:
+ processed_query = _simple_keyword_preprocessing(search_query)
+ scores = bm25.get_scores(processed_query)
+ all_scores.append(scores)
+ except Exception as e:
+ print(f"Error processing query '{search_query}': {e}")
+ continue
+
+ if not all_scores:
+ return "Error: Could not process any search queries"
+
+ # Combine scores (take maximum across all queries)
+ combined_scores = all_scores[0]
+ for scores in all_scores[1:]:
+ combined_scores = np.maximum(combined_scores, scores)
+
+ # Get top-k results
+ top_indices = combined_scores.argsort()[-top_k:][::-1]
+
+ # Format results
+ results = []
+ results.append(f"Top {top_k} datasets for query: '{query}'")
+ if query != translated_query:
+ results.append(f"(Translated to French: '{translated_query}')")
+ results.append("")
+
+ for i, idx in enumerate(top_indices, 1):
+ score = combined_scores[idx]
+ title = _dataset_df.iloc[idx]['title']
+ url = _dataset_df.iloc[idx]['url']
+ organization = _dataset_df.iloc[idx].get('organization', 'N/A')
+
+ results.append(f"{i}. Score: {score:.2f}")
+ results.append(f" Title: {title}")
+ results.append(f" URL: {url}")
+ results.append(f" Organization: {organization}")
+ results.append("")
+
+ return "\n".join(results)
+
+ except Exception as e:
+ return f"Error during dataset search: {str(e)}"
+
+ @tool
+ def get_dataset_info(dataset_url: str) -> str:
+ """
+ Get detailed information about a specific dataset from its data.gouv.fr URL.
+
+ Args:
+ dataset_url: The URL of the dataset page on data.gouv.fr
+
+ Returns:
+ Detailed information about the dataset including title, description, organization, and metadata
+ """
+ try:
+ _initialize_retrieval_system()
+
+ global _dataset_df
+
+ # Find the dataset in our catalog
+ matching_rows = _dataset_df[_dataset_df['url'] == dataset_url]
+
+ if matching_rows.empty:
+ return f"Dataset not found in catalog for URL: {dataset_url}"
+
+ dataset = matching_rows.iloc[0]
+
+ # Format the dataset information
+ info_lines = []
+ info_lines.append("=== DATASET INFORMATION ===")
+ info_lines.append(f"Title: {dataset.get('title', 'N/A')}")
+ info_lines.append(f"URL: {dataset.get('url', 'N/A')}")
+ info_lines.append(f"Organization: {dataset.get('organization', 'N/A')}")
+
+ if 'description' in dataset and pd.notna(dataset['description']):
+ description = str(dataset['description'])
+ if len(description) > 500:
+ description = description[:500] + "..."
+ info_lines.append(f"Description: {description}")
+
+ if 'tags' in dataset and pd.notna(dataset['tags']):
+ info_lines.append(f"Tags: {dataset['tags']}")
+
+ if 'license' in dataset and pd.notna(dataset['license']):
+ info_lines.append(f"License: {dataset['license']}")
+
+ if 'temporal_coverage' in dataset and pd.notna(dataset['temporal_coverage']):
+ info_lines.append(f"Temporal Coverage: {dataset['temporal_coverage']}")
+
+ if 'spatial_coverage' in dataset and pd.notna(dataset['spatial_coverage']):
+ info_lines.append(f"Spatial Coverage: {dataset['spatial_coverage']}")
+
+ if 'quality_score' in dataset and pd.notna(dataset['quality_score']):
+ info_lines.append(f"Quality Score: {dataset['quality_score']}")
+
+ return "\n".join(info_lines)
+
+ except Exception as e:
+ return f"Error getting dataset info: {str(e)}"
+
+ @tool
+ def get_random_quality_dataset() -> str:
+ """
+ Get a random high-quality dataset from the catalog, weighted by quality score.
+
+ Returns:
+ Information about a randomly selected high-quality dataset
+ """
+ try:
+ _initialize_retrieval_system()
+
+ global _dataset_df
+
+ # Use quality_score as weights for random selection
+ if 'quality_score' in _dataset_df.columns:
+ weights = _dataset_df['quality_score'].fillna(0)
+ weights = weights - weights.min() + 0.1 # Shift to make all positive
+ else:
+ weights = None
+
+ # Randomly sample one dataset weighted by quality
+ selected_row = _dataset_df.sample(n=1, weights=weights).iloc[0]
+
+ # Return dataset info
+ return get_dataset_info(selected_row['url'])
+
+ except Exception as e:
+ return f"Error getting random dataset: {str(e)}"
tools/webpage_tools.py CHANGED
@@ -153,6 +153,32 @@ def read_file_from_url(url: str) -> pd.DataFrame:
  except Exception as e:
  raise Exception(f"An unexpected error occurred: {str(e)}")
 
  if __name__ == "__main__":
  url = "https://www.data.gouv.fr/fr/datasets/repertoire-national-des-elus-1/"
  url = "https://www.data.gouv.fr/fr/datasets/catalogue-des-donnees-de-data-gouv-fr/"
 
  except Exception as e:
  raise Exception(f"An unexpected error occurred: {str(e)}")
 
+ @tool
+ def save_dataset_for_followup(df: pd.DataFrame, filename: str = "analysis_dataset.csv") -> str:
+ """
+ Save the current dataset to the generated_data folder for follow-up analysis.
+
+ Args:
+ df: The pandas DataFrame to save
+ filename: Name of the file to save (default: "analysis_dataset.csv")
+
+ Returns:
+ Confirmation message with file path
+ """
+ try:
+ # Ensure generated_data directory exists
+ import os
+ os.makedirs('generated_data', exist_ok=True)
+
+ # Save the dataset
+ filepath = os.path.join('generated_data', filename)
+ df.to_csv(filepath, index=False)
+
+ return f"Dataset saved for follow-up analysis: {filepath} ({len(df)} rows, {len(df.columns)} columns)"
+
+ except Exception as e:
+ return f"Error saving dataset: {str(e)}"
+
  if __name__ == "__main__":
  url = "https://www.data.gouv.fr/fr/datasets/repertoire-national-des-elus-1/"
  url = "https://www.data.gouv.fr/fr/datasets/catalogue-des-donnees-de-data-gouv-fr/"