Commit 2508004
Parent(s): 97eafcb
update

Files changed:
- .gitignore +6 -0
- README.md +206 -1
- agent.py +87 -0
- app.py +635 -0
- filtered_dataset.csv +0 -0
- france_data/departements.geojson +0 -0
- france_data/regions.geojson +0 -0
- launch_gradio.py +30 -0
- reexport_data.py +97 -0
- requirements.txt +22 -0
- tools/drawing_tools.py +206 -0
- tools/exploration_tools.py +84 -0
- tools/libreoffice_tools.py +155 -0
- tools/webpage_tools.py +168 -0
.gitignore
ADDED
@@ -0,0 +1,6 @@
.env
__pycache__/
OLD/
dataset_metadata/
generated_data/
.gradio/
README.md
CHANGED
@@ -11,4 +11,209 @@ license: mit
short_description: Agents for data analysis of French public data.
---

# 🤖 French Public Data Analysis Agent

**AI-powered intelligent analysis of French government datasets** with automated visualization generation and comprehensive PDF reports.

## ✨ Features

### 🔍 **Intelligent Dataset Discovery**
- **BM25 Keyword Search**: Advanced keyword matching with pre-computed search indices
- **Bilingual Query Translation**: Search in French or English - queries are automatically translated using an LLM
- **Quality-Weighted Random Selection**: Leave the query empty to randomly select high-quality datasets
- **Real-time Dataset Matching**: Instant matching against 5,000+ French government datasets

### 🤖 **Automated AI Analysis**
- **SmolAgents Integration**: Advanced AI agent with up to 30 planning steps
- **Custom Tool Suite**: Specialized tools for web scraping, data analysis, and visualization
- **Multi-step Processing**: Complete pipeline from data discovery to report generation
- **Error Recovery**: Smart error handling and alternative data source selection

### 📊 **Advanced Visualizations**
- **France Geographic Maps**: Department- and region-level choropleth maps
- **Multiple Chart Types**: Bar charts, line plots, scatter plots, heatmaps
- **Smart Visualization Selection**: The AI automatically chooses appropriate chart types
- **High-Quality PNG Output**: Publication-ready visualizations

### 📄 **Comprehensive Reports**
- **Professional PDF Reports**: Complete analysis with embedded visualizations
- **Bilingual Support**: Reports generated in the same language as your query
- **Structured Analysis**: Title page, methodology, findings, and next steps
- **LibreOffice Integration**: Cross-platform PDF generation

### 🎨 **Modern Web Interface**
- **Real-time Progress Tracking**: Detailed step-by-step progress updates
- **Responsive Design**: Beautiful, modern Gradio interface
- **Quick Start Examples**: Pre-built queries for common use cases
- **Accordion Tips**: Collapsible help section with usage instructions

## 🚀 Quick Start

### 1. Prerequisites

- Python 3.8+
- LibreOffice (for PDF generation)
- Google Gemini API key

### 2. Installation

```bash
# Clone the repository
git clone <repository-url>
cd gradio_hackathon_agent

# Install dependencies
pip install -r requirements.txt
```

### 3. Environment Setup

Create a `.env` file in the project root:

```bash
GEMINI_API_KEY=your_gemini_api_key_here
```

### 4. Launch the Application

```bash
python app.py
```

The interface will be available at:
- **Local**: http://localhost:7860
- **Public**: Shareable URL provided automatically

## 💡 How to Use

### Basic Usage

1. **Enter Your Query**: Type any search term related to French public data
   - Examples: "road traffic accidents", "education directory", "housing data"
   - Supports both French and English queries

2. **Or Use Quick Examples**: Click any of the pre-built example queries:
   - 🚗 Road Traffic Accidents 2005-2023
   - 🎓 Education Directory
   - 🏠 French Vacant Housing Private Park

3. **Or Go Random**: Leave the query empty to randomly select a high-quality dataset

4. **Click "🚀 Analyze Dataset"**: The AI agent begins processing

### Results

- **Download PDF Report**: Complete analysis with all visualizations
- **View Individual Charts**: Up to 4 visualizations displayed in the interface
- **Dataset Reference**: Direct link to the original data.gouv.fr page

## 🛠️ Technical Architecture

### Core Components

```
📁 Project Structure
├── app.py                   # Main Gradio web interface with progress tracking
├── agent.py                 # SmolAgents configuration and prompt generation
├── tools/                   # Custom agent tools
│   ├── webpage_tools.py     # Web scraping and data extraction
│   ├── exploration_tools.py # Dataset analysis and description
│   ├── drawing_tools.py     # France map generation and visualization
│   └── libreoffice_tools.py # PDF conversion utilities
├── filtered_dataset.csv     # Pre-processed dataset index (5,000+ datasets)
└── france_data/             # Geographic data for France maps
```

### Key Technologies

- **Frontend**: Gradio with custom CSS and real-time progress
- **AI Agent**: SmolAgents powered by an MLLM
- **Search**: BM25 keyword matching with accent- and plural-normalized preprocessing
- **Translation**: LLM-powered bilingual query translation
- **Visualization**: Matplotlib, Geopandas, Seaborn
- **PDF Generation**: python-docx + LibreOffice conversion
- **Data Processing**: Pandas, NumPy, Shapely

### Smart Features

#### BM25 Search Enhancement
- Pre-computed search indices for 5,000+ datasets
- Accent-insensitive keyword matching
- Plural form normalization
- Quality-score weighted ranking

#### LLM Translation
- Automatic French ↔ English translation
- Query language detection
- Bilingual result matching
- Context-aware translations

#### Progress System
- Thread-safe progress tracking
- Queue-based status updates
- Step-by-step visual feedback
- Non-blocking UI execution

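The preprocessing and matching described above are implemented in `app.py` (added in this commit). As an illustration, a minimal sketch of the same idea using `rank_bm25` and `unidecode` follows; the dataset titles and query here are made-up placeholders, not entries from `filtered_dataset.csv`.

```python
# Minimal sketch of the BM25 matching described above; see app.py for the
# full implementation. Titles and query are placeholders for illustration.
from rank_bm25 import BM25Okapi
from unidecode import unidecode

def preprocess(text):
    # Lowercase, strip accents, and crudely normalize plurals (trailing 's'/'x').
    words = unidecode(str(text).lower()).split()
    return [w[:-1] if len(w) > 3 and w[-1] in "sx" and not w.endswith("ss") else w
            for w in words]

titles = [
    "Accidents corporels de la circulation routière",
    "Annuaire de l'éducation",
    "Logements vacants du parc privé",
]
bm25 = BM25Okapi([preprocess(t) for t in titles])

query = "accidents de la route"  # queries are first translated to French
scores = bm25.get_scores(preprocess(query))
print(titles[scores.argmax()], scores.max())
```
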
## 🔧 Troubleshooting

### Common Issues

1. **"No CSV/JSON files found"**
   - The selected dataset doesn't contain processable files
   - Try a different query or use the random selection

2. **LibreOffice PDF conversion fails**
   - Ensure LibreOffice is installed and accessible
   - Check the console for specific error messages

3. **Translation errors**
   - Verify your API key is valid
   - Check API quota and rate limits

4. **Slow performance**
   - BM25 index computation may take time on first run
   - Pre-computed indices are cached for faster subsequent searches

### Performance Optimization

- **Pre-compute BM25**: Generate `bm25_data.pkl` once so searches can load the pre-built index
- **Use SSD storage**: Faster file I/O for large datasets
- **Monitor API usage**: API calls are made for both translation and agent execution

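Note that this commit does not ship the script that builds `bm25_data.pkl`; `app.py` only loads it and expects a pickled dict with the keys `bm25_model` and `titles`. A hypothetical sketch of a generation step consistent with that loading code:

```python
# Hypothetical generation script (not part of this commit): builds the pickle
# that app.py's initialize_models() loads, with keys 'bm25_model' and 'titles'.
import pickle
import pandas as pd
from rank_bm25 import BM25Okapi
from app import simple_keyword_preprocessing

titles = pd.read_csv("filtered_dataset.csv")["title"].fillna("").tolist()
bm25 = BM25Okapi([simple_keyword_preprocessing(t) for t in titles])

with open("bm25_data.pkl", "wb") as f:
    pickle.dump({"bm25_model": bm25, "titles": titles}, f)
```

Importing `app` also builds the Gradio interface as a side effect; a standalone copy of the preprocessing function would avoid that in a real script.
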
## 📊 Dataset Coverage

- **5,000+ Datasets**: Pre-filtered French government datasets
- **Data Sources**: data.gouv.fr, INSEE, regional authorities
- **File Formats**: CSV, JSON, Excel, XML
- **Topics**: All major sectors of French public administration
- **Quality Scores**: Datasets ranked by completeness and usability

## 🚀 Advanced Usage

### Custom Tool Development
Add new tools to the `tools/` directory following the SmolAgents tool pattern.

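The existing tools (e.g. `tools/drawing_tools.py` in this commit) use the smolagents `@tool` decorator with typed parameters and an `Args:`/`Returns:` docstring, which is what the agent relies on to understand each tool. A hypothetical new tool following that pattern might look like this (the function itself is illustrative, not part of the repo):

```python
# Hypothetical example tool following the pattern used in tools/drawing_tools.py.
import pandas as pd
from smolagents import tool

@tool
def count_rows(file_path: str) -> int:
    """Counts the number of data rows in a semicolon-separated CSV file.

    Args:
        file_path (str): Path to the CSV file to inspect.

    Returns:
        int: The number of rows, excluding the header.
    """
    df = pd.read_csv(file_path, delimiter=';')
    return len(df)
```

To make such a tool available to the agent, add it to the `tools=[...]` list in `create_web_agent` (agent.py).
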
### BM25 Index Optimization
Regenerate search indices with:
```bash
# Run once to create optimized search index
python -c "from app import initialize_models; initialize_models()"
```

### Batch Processing
Process multiple datasets programmatically using the agent directly.

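One way to do this is to reuse `create_web_agent` and `generate_prompt` from `agent.py` without the Gradio UI. A rough sketch, where the dataset URLs and the no-op step callback are placeholders:

```python
# Sketch of batch processing with the agent from agent.py. The URLs below are
# placeholders; outputs land in generated_data/ as in the normal pipeline.
from agent import create_web_agent, generate_prompt

pages = [
    "https://www.data.gouv.fr/fr/datasets/example-dataset-1/",
    "https://www.data.gouv.fr/fr/datasets/example-dataset-2/",
]

web_agent = create_web_agent(step_callback=lambda memory_step, agent=None: None)
for page in pages:
    answer = web_agent.run(generate_prompt(page))
    print(page, "->", answer)
```
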
## 📄 License

This project is developed for the Gradio MCP x Agents Hackathon. See individual tool licenses for third-party components.

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch
3. Add your improvements
4. Submit a pull request

---

**🎉 Ready to explore French public data with AI? Launch the interface and start analyzing!**
agent.py
ADDED
@@ -0,0 +1,87 @@
import os
from tools.webpage_tools import (
    visit_webpage,
    get_all_links,
    read_file_from_url,
)
from tools.exploration_tools import (
    get_dataset_description,
)
from tools.drawing_tools import (
    plot_departments_data,
)
from tools.libreoffice_tools import (
    convert_to_pdf_with_libreoffice,
    check_libreoffice_availability,
)
from smolagents import (
    CodeAgent,
    DuckDuckGoSearchTool,
    LiteLLMModel,
)

def create_web_agent(step_callback):
    search_tool = DuckDuckGoSearchTool()
    model = LiteLLMModel(
        model_id="gemini/gemini-2.5-flash-preview-05-20",
        api_key=os.getenv("GEMINI_API_KEY"),
    )
    web_agent = CodeAgent(
        tools=[
            search_tool,
            visit_webpage, get_all_links, read_file_from_url,
            get_dataset_description,
            plot_departments_data,
            convert_to_pdf_with_libreoffice,
            check_libreoffice_availability
        ],
        model=model,
        max_steps=30,
        verbosity_level=1,  # Reduced verbosity for cleaner output
        planning_interval=3,
        step_callbacks=[step_callback],  # Use the built-in callback system
        additional_authorized_imports=[
            "subprocess", "docx", "docx.*",
            "os", "bs4", "io", "requests", "json", "pandas",
            "matplotlib", "matplotlib.pyplot", "matplotlib.*", "numpy", "seaborn"
        ],
    )
    return web_agent

def generate_prompt(data_gouv_page):
    return f"""Fetch me a dataset that can be read directly using the read_file_from_url tool
from {data_gouv_page}
Follow the steps below to generate a pdf report from the dataset.

The steps should be as follows:
1. Examine the page
2. Get all links
3. Get the dataset from the link
4. Get information about the dataset using the get_dataset_description tool
5. Decide on what you can draw based on either department or region data
5.1 If there is no department- or region-level data, look for another file!
6. Draw a map of France using your idea
7. Save the map in a png file
8. Also make 3 additional visualizations, not maps, that you can save in png files
9. Write an interesting analysis text for each of your visualizations. Be smart and think cleverly about the data and what it can state
10. Think of next-step analyses to look at the data
11. Generate a comprehensive PDF report using the python-docx library that includes:
    - A title page with the dataset name and analysis overview
    - All your visualizations (PNG files) embedded in the report
    - Your analysis text for each visualization
    - Conclusions and next steps
Make the visualizations appropriately sized so they fit well in the PDF report.
Then convert that docx file to pdf using the convert_to_pdf_with_libreoffice tool.

Do not overcommit, just do the steps one by one and it should go fine! Do not, under any circumstance, use the 'os' module!
Do not generate a lot of code every step, go slowly but surely and it will work out. Save everything within the generated_data folder.
If the question is in English, the report is in English.
If the question is in French, the report is in French.

IMPORTANT LIBREOFFICE NOTES:
- If you need to use LibreOffice, first call check_libreoffice_availability() to verify it's available
- If LibreOffice is available, "LibreOffice found" is returned by check_libreoffice_availability()
- Use the convert_to_pdf_with_libreoffice() tool instead of subprocess calls
- Do NOT use subprocess.run(['libreoffice', ...]) or subprocess.run(['soffice', ...]) directly
- The LibreOffice tools handle macOS, Linux, and Windows path differences automatically
"""
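For reference, the step_callback argument above can be any callable that smolagents invokes after each agent step with that step's memory record; app.py (below) builds a richer one in create_progress_callback. A minimal stand-in, shown as an assumption about the expected shape rather than project code:

def print_step(memory_step, agent=None):
    # Called after each step; memory_step.step_number is used the same way
    # in app.py's progress callback.
    print(f"Agent finished step {memory_step.step_number}")

web_agent = create_web_agent(print_step)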
app.py
ADDED
@@ -0,0 +1,635 @@
import os
import pandas as pd
import gradio as gr
import glob
import threading
import time
import queue
import numpy as np
from rank_bm25 import BM25Okapi
import re
from dotenv import load_dotenv
from smolagents import CodeAgent, LiteLLMModel
from agent import create_web_agent, generate_prompt
from unidecode import unidecode

load_dotenv()

# Global variables for progress tracking
progress_queue = queue.Queue()
current_status = ""

# Initialize LLM translator and BM25
llm_translator = None
bm25_model = None
precomputed_titles = None

def initialize_models():
    """Initialize the LLM translator and BM25 model"""
    global llm_translator, bm25_model, precomputed_titles

    if llm_translator is None:
        # Initialize LLM for translation
        try:
            model = LiteLLMModel(
                model_id="gemini/gemini-2.5-flash-preview-05-20",
                api_key=os.getenv("GEMINI_API_KEY")
            )
            llm_translator = CodeAgent(tools=[], model=model, max_steps=1)
            print("✅ LLM translator initialized")
        except Exception as e:
            print(f"⚠️ Error initializing LLM translator: {e}")

    # Load pre-computed BM25 model if available
    if bm25_model is None:
        try:
            import pickle
            with open('bm25_data.pkl', 'rb') as f:
                bm25_data = pickle.load(f)
            bm25_model = bm25_data['bm25_model']
            precomputed_titles = bm25_data['titles']
            print(f"✅ Loaded pre-computed BM25 model for {len(precomputed_titles)} datasets")
        except FileNotFoundError:
            print("⚠️ Pre-computed BM25 model not found. Will compute at runtime.")
        except Exception as e:
            print(f"⚠️ Error loading pre-computed BM25 model: {e}")
            print("Will compute BM25 at runtime.")

def translate_query_llm(query, target_lang='fr'):
    """Translate query using LLM"""
    global llm_translator

    if llm_translator is None:
        initialize_models()

    if llm_translator is None:
        print("⚠️ LLM translator not available, returning original query")
        return query, 'unknown'

    try:
        # Create translation prompt
        if target_lang == 'fr':
            target_language = "French"
        elif target_lang == 'en':
            target_language = "English"
        else:
            target_language = target_lang

        translation_prompt = f"""
        Translate the following text to {target_language}.
        If the text is already in {target_language}, return it as is.
        Only return the translated text, nothing else.

        Text to translate: "{query}"
        """

        # Get translation from LLM
        response = llm_translator.run(translation_prompt)
        translated_text = str(response).strip().strip('"').strip("'")

        # Simple language detection
        if query.lower() == translated_text.lower():
            source_lang = target_lang
        else:
            source_lang = 'en' if target_lang == 'fr' else 'fr'

        return translated_text, source_lang

    except Exception as e:
        print(f"LLM translation error: {e}")
        return query, 'unknown'

def simple_keyword_preprocessing(text):
    """Simple preprocessing for keyword matching - handles case, accents and basic plurals"""
    # Convert to lowercase and remove accents
    text = unidecode(str(text).lower())

    # Basic plural handling - just remove trailing 's' and 'x'
    words = text.split()
    processed_words = []

    for word in words:
        # Remove common plural endings
        if word.endswith('s') and len(word) > 3 and not word.endswith('ss'):
            word = word[:-1]
        elif word.endswith('x') and len(word) > 3:
            word = word[:-1]
        processed_words.append(word)

    return processed_words

def find_similar_dataset_bm25(query, df):
    """Find the most similar dataset using BM25 keyword matching"""
    global bm25_model, precomputed_titles

    # Translate query to French for better matching with French datasets
    translated_query, original_lang = translate_query_llm(query, target_lang='fr')

    # Combine original and translated queries for search
    search_queries = [query, translated_query] if query != translated_query else [query]

    # Get dataset titles
    dataset_titles = df['title'].fillna('').tolist()

    # Use pre-computed BM25 model if available and matches current dataset
    if (bm25_model is not None and precomputed_titles is not None and
            len(dataset_titles) == len(precomputed_titles) and dataset_titles == precomputed_titles):
        print("🚀 Using pre-computed BM25 model for fast matching")
        bm25 = bm25_model
    else:
        # Build BM25 model at runtime
        print("⚠️ Computing BM25 model at runtime...")
        # Preprocess all dataset titles into tokenized form
        processed_titles = [simple_keyword_preprocessing(title) for title in dataset_titles]
        bm25 = BM25Okapi(processed_titles)

    best_score = -1
    best_idx = 0

    for search_query in search_queries:
        try:
            # Preprocess the search query
            processed_query = simple_keyword_preprocessing(search_query)

            # Get BM25 scores for all documents
            scores = bm25.get_scores(processed_query)

            max_score = scores.max()
            max_idx = scores.argmax()
            if max_score > best_score:
                best_score = max_score
                best_idx = max_idx
        except Exception as e:
            print(f"Error processing query '{search_query}': {e}")
            continue

    # Show top 5 matches for comparison
    if len(search_queries) > 0:
        processed_query = simple_keyword_preprocessing(search_queries[0])
        scores = bm25.get_scores(processed_query)
    return best_idx, best_score, translated_query, original_lang

def create_progress_callback():
    """Create a callback function for tracking agent progress"""

    def progress_callback(memory_step, agent=None):
        """Callback function called at each agent step"""
        step_number = memory_step.step_number

        # Extract information about the current step
        if hasattr(memory_step, 'action_input') and memory_step.action_input:
            action_content = memory_step.action_input
        elif hasattr(memory_step, 'action_output') and memory_step.action_output:
            action_content = str(memory_step.action_output)
        else:
            action_content = ""

        # Define progress based on step content and number
        progress_val = min(0.1 + (step_number * 0.03), 0.95)  # Progressive increase

        # Analyze the step content to provide meaningful status
        action_lower = action_content.lower() if action_content else ""

        if "visit_webpage" in action_lower or "examining" in action_lower:
            description = f"🔍 Step {step_number}: Examining webpage..."
        elif "get_all_links" in action_lower or "links" in action_lower:
            description = f"🔗 Step {step_number}: Extracting data links..."
        elif "read_file_from_url" in action_lower or "reading" in action_lower:
            description = f"📊 Step {step_number}: Loading dataset..."
        elif "get_dataset_description" in action_lower or "description" in action_lower:
            description = f"📋 Step {step_number}: Analyzing dataset structure..."
        elif "department" in action_lower or "region" in action_lower:
            description = f"🗺️ Step {step_number}: Processing geographic data..."
        elif "plot" in action_lower or "map" in action_lower or "france" in action_lower:
            description = f"🗺️ Step {step_number}: Creating France map..."
        elif "visualization" in action_lower or "chart" in action_lower:
            description = f"📈 Step {step_number}: Generating visualizations..."
        elif "save" in action_lower or "png" in action_lower:
            description = f"💾 Step {step_number}: Saving visualizations..."
        elif "pdf" in action_lower or "report" in action_lower:
            description = f"📄 Step {step_number}: Creating PDF report..."
        elif hasattr(memory_step, 'error') and memory_step.error:
            description = f"⚠️ Step {step_number}: Handling error..."
        else:
            description = f"🤖 Step {step_number}: Processing..."

        # Check if this is the final step
        if hasattr(memory_step, 'action_output') and memory_step.action_output and "final" in action_lower:
            progress_val = 1.0
            description = "✅ Analysis complete!"

        # Put the progress update in the queue
        try:
            progress_queue.put((progress_val, description))
        except:
            pass

    return progress_callback

def run_agent_analysis_with_progress(query, progress_callback, df=None, page_url_callback=None, data_gouv_page=None, most_similar_idx=None):
    """
    Run the agent analysis with progress tracking using smolagents callbacks.
    """
    try:
        # Clean up previous results
        if os.path.exists('generated_data'):
            for file in glob.glob('generated_data/*'):
                try:
                    os.remove(file)
                except:
                    pass
        else:
            os.makedirs('generated_data', exist_ok=True)

        # If dataset info not provided, find it (fallback)
        if data_gouv_page is None or most_similar_idx is None:
            progress_callback(0.02, "🤖 Initializing LLM translator and BM25...")
            initialize_models()

            progress_callback(0.05, "🔍 Searching for relevant datasets (using BM25 keyword matching)...")

            # Read the filtered dataset if not provided
            if df is None:
                df = pd.read_csv('filtered_dataset.csv')

            # Find the most similar dataset using BM25 keyword matching
            most_similar_idx, similarity_score, translated_query, original_lang = find_similar_dataset_bm25(query, df)
            data_gouv_page = df.iloc[most_similar_idx]['url']

            # Immediately show the page URL via callback
            if page_url_callback:
                page_url_callback(data_gouv_page)

            progress_callback(0.08, "🤖 Initializing agent...")
        else:
            # Dataset already found, continue from where we left off
            progress_callback(0.09, "🤖 Initializing agent...")

        step_callback = create_progress_callback()

        progress_callback(0.1, "🤖 Starting agent analysis...")

        # Create the agent with progress callback
        web_agent = create_web_agent(step_callback)
        prompt = generate_prompt(data_gouv_page)

        # Run the agent - the step_callbacks will automatically update progress
        answer = web_agent.run(prompt)

        # Check if the agent found no processable data
        answer_lower = str(answer).lower() if answer else ""
        if ("no processable data" in answer_lower or
                "no csv nor json" in answer_lower or
                "cannot find csv" in answer_lower or
                "cannot find json" in answer_lower or
                "no data to process" in answer_lower):
            progress_callback(1.0, "❌ No CSV/JSON files found in the dataset")
            return "❌ No CSV/JSON files found in the selected dataset. This dataset cannot be processed automatically.", [], data_gouv_page

        # Check if files were generated
        generated_files = glob.glob('generated_data/*')

        if generated_files:
            progress_callback(1.0, "✅ Analysis completed successfully!")
            return "Analysis completed successfully!", generated_files, data_gouv_page
        else:
            progress_callback(1.0, "⚠️ Analysis completed but no files were generated.")
            return "Analysis completed but no files were generated.", [], data_gouv_page

    except Exception as e:
        progress_callback(1.0, f"❌ Error: {str(e)}")
        return f"Error during analysis: {str(e)}", [], None

def search_and_analyze(query, progress=gr.Progress()):
    """
    Main function called when the search button is clicked.
    Uses Gradio's progress bar for visual feedback.
    """
    # Read the filtered dataset first
    df = pd.read_csv('filtered_dataset.csv')

    # If no query provided, randomly select one weighted by quality score
    if not query.strip():
        progress(0, desc="🎲 No query provided - selecting random high-quality dataset...")

        # Use quality_score as weights for random selection
        if 'quality_score' in df.columns:
            # Ensure quality scores are positive for weighting
            weights = df['quality_score'].fillna(0)
            weights = weights - weights.min() + 0.1  # Shift to make all positive
        else:
            weights = None

        # Randomly sample one dataset weighted by quality
        selected_row = df.sample(n=1, weights=weights).iloc[0]
        query = selected_row['title']

        progress(0.02, f"🎯 Random selection: {query[:60]}...")

    # Clear the progress queue
    while not progress_queue.empty():
        try:
            progress_queue.get_nowait()
        except queue.Empty:
            break

    # Initialize outputs
    pdf_file = None
    images_output = [gr.Image(visible=False)] * 4
    status = "🚀 Starting analysis..."

    # Initial progress
    progress(0.05, desc="🚀 Initializing...")

    def progress_callback(progress_val, description):
        """Callback function to update progress - puts updates in queue"""
        try:
            progress_queue.put((progress_val, description))
        except:
            pass

    # Run analysis in a separate thread
    result_queue = queue.Queue()

    # Store the page URL to show immediately (kept for compatibility)
    page_url_to_show = None

    def page_url_callback(url):
        nonlocal page_url_to_show
        page_url_to_show = url

    # Find and show the page URL immediately FIRST
    initialize_models()
    progress(0.06, desc="🔍 Finding relevant dataset...")
    most_similar_idx, similarity_score, translated_query, original_lang = find_similar_dataset_bm25(query, df)
    data_gouv_page = df.iloc[most_similar_idx]['url']
    dataset_title = df.iloc[most_similar_idx]['title']

    progress(0.07, desc=f"📋 Found dataset: {dataset_title[:50]}...")

    # Now start the analysis thread with the found dataset info
    def run_analysis():
        try:
            # Pass the already found dataset info to the analysis function
            result = run_agent_analysis_with_progress(query, progress_callback, df, page_url_callback, data_gouv_page, most_similar_idx)
            result_queue.put(result)
        except Exception as e:
            result_queue.put((f"Error: {str(e)}", [], data_gouv_page))

    analysis_thread = threading.Thread(target=run_analysis)
    analysis_thread.start()

    # Show page URL immediately by returning current state
    current_page_display = gr.Textbox(value=data_gouv_page, visible=True)
    current_status = "🔗 Page found - starting analysis..."

    # Initial update to show the page URL immediately
    progress(0.08, desc="🔗 Page found - starting analysis...")

    # Monitor progress while analysis runs
    last_progress = 0.08

    while analysis_thread.is_alive() or not result_queue.empty():
        try:
            # Check for progress updates from queue
            try:
                progress_val, description = progress_queue.get(timeout=0.1)
                if progress_val > last_progress:
                    last_progress = progress_val
                    current_status = description
                    progress(progress_val, desc=description)
            except queue.Empty:
                pass

            # Check if analysis is complete
            try:
                final_status, files, page_url = result_queue.get(timeout=0.1)

                # Check if this is a "no data" case
                if "❌ No CSV/JSON files found" in final_status:
                    progress(1.0, desc="❌ No processable data found")
                    return (gr.Textbox(value=page_url if page_url else data_gouv_page, visible=True),
                            final_status,
                            gr.File(visible=False),
                            gr.Image(visible=False), gr.Image(visible=False),
                            gr.Image(visible=False), gr.Image(visible=False))

                # Final progress update
                progress(1.0, desc="✅ Processing results...")

                # Process results
                pdf_file = None
                png_files = []

                for file in files:
                    if file.endswith('.pdf'):
                        pdf_file = file
                    elif file.endswith('.png'):
                        png_files.append(file)

                # Prepare final outputs
                download_button = gr.File(value=pdf_file, visible=True) if pdf_file else None

                # Prepare images for display (up to 4 images)
                images = []
                for i in range(4):
                    if i < len(png_files):
                        images.append(gr.Image(value=png_files[i], visible=True))
                    else:
                        images.append(gr.Image(visible=False))

                # Final progress completion
                progress(1.0, desc="🎉 Complete!")

                return gr.Textbox(value=page_url if page_url else data_gouv_page, visible=True), final_status, download_button, *images

            except queue.Empty:
                pass

            time.sleep(0.5)  # Small delay to prevent excessive updates

        except Exception as e:
            progress(1.0, desc=f"❌ Error: {str(e)}")
            return gr.Textbox(value=data_gouv_page, visible=True), f"❌ Error: {str(e)}", None, *images_output

    # Ensure thread completes
    analysis_thread.join(timeout=1)

    # Fallback return
    progress(1.0, desc="🏁 Finished")
    return gr.Textbox(value=data_gouv_page, visible=True), current_status, pdf_file, *images_output

# Create the Gradio interface
with gr.Blocks(title="🤖 French Public Data Analysis Agent", theme=gr.themes.Soft(), css="""
.gradio-container {
    max-width: 1200px !important;
    margin: auto;
    width: 100% !important;
}
.main-header {
    text-align: center;
    background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
    color: white;
    padding: 2rem;
    border-radius: 15px;
    margin-bottom: 2rem;
    box-shadow: 0 8px 32px rgba(0,0,0,0.1);
}
.accordion-content {
    overflow: hidden !important;
    width: 100% !important;
}
.gr-accordion {
    width: 100% !important;
    max-width: 100% !important;
}
.gr-accordion .gr-row {
    width: 100% !important;
    max-width: 100% !important;
    margin: 0 !important;
}
.gr-accordion .gr-column {
    min-width: 0 !important;
    flex: 1 !important;
    max-width: 50% !important;
    padding-right: 1rem !important;
}
.gr-accordion .gr-column:last-child {
    padding-right: 0 !important;
    padding-left: 1rem !important;
}
""") as demo:

    # Main header with better styling
    gr.HTML("""
    <div class="main-header">
        <h1 style="margin: 0; font-size: 2.5rem; font-weight: bold;">
            🤖 French Public Data Analysis Agent
        </h1>
        <p style="font-size: 1.2rem; opacity: 0.9;">
            Intelligent analysis of French public datasets with AI-powered insights
        </p>
    </div>
    """)

    # What this agent does
    gr.HTML("""
    <div style="text-align: center; background: #f8fafc; padding: 1.5rem; border-radius: 10px; margin: 1rem 0;">
        <p style="font-size: 1.1rem; color: #374151; margin: 0;">
            🌐 <strong>Search in French or English</strong> • 🗺️ <strong>Generate Reports with visualizations from the data</strong>
        </p>
    </div>
    """)

    # Tips & Information accordion - moved to the top
    with gr.Accordion("💡 Tips & Information", open=False):
        with gr.Row():
            with gr.Column():
                gr.Markdown("""
                🎯 **How to Use:**
                - Enter any search term related to French public data
                - Leave empty to randomly select a high-quality dataset
                - Results include visualizations and downloadable reports

                ⏱️ **Processing Time:**
                - Report generation takes 5-10 minutes depending on dataset complexity
                - Larger datasets may require additional processing time
                """)
            with gr.Column():
                gr.Markdown("""
                ⚠️ **Important Notes:**
                - Still a work in progress, might be better to start with the example queries
                - Some datasets may not contain processable CSV/JSON files
                - All visualizations are automatically generated
                - Maps focus on France when geographic data is available

                🌐 **Language Support:**
                - Search in French or English - queries are automatically translated
                """)

    with gr.Row():
        query_input = gr.Textbox(
            label="Search Query",
            placeholder="e.g., road traffic accidents, education, housing (or leave empty for random selection)",
            scale=4
        )
        search_button = gr.Button(
            "🚀 Analyze Dataset",
            variant="primary",
            scale=1,
            size="lg"
        )

    # Quick Start Examples row
    with gr.Row():
        gr.HTML("""
        <div>
            <h3 style="color: #374151">🚀 Quick Start Examples</h3>
            <p style="color: #6b7280">Click any example below to get started</p>
        </div>
        """)

    with gr.Row():
        examples = [
            ("🚗 Road Traffic Accidents 2005 - 2023", "road traffic accidents 2005 - 2023"),
            ("🎓 Education Directory", "education directory"),
            ("🏠 French Vacant Housing Private Park", "French vacant housing private park"),
        ]

        for emoji_text, query_text in examples:
            gr.Button(
                emoji_text,
                variant="secondary",
                size="sm"
            ).click(
                lambda x=query_text: x,
                outputs=query_input
            )

    # Page info and analysis status with progress bar
    with gr.Group():
        page_url_display = gr.Textbox(label="🔗 Page Started On", interactive=False, visible=False)
        with gr.Row():
            status_output = gr.Textbox(label="📊 Analysis Status", interactive=False, scale=1)

    # Download section
    with gr.Row():
        download_button = gr.File(
            label="📄 Download PDF Report",
            visible=False
        )

    gr.Markdown("---")
    gr.HTML("""
    <div style="text-align: center; margin: 2rem 0;">
        <h2 style="color: #374151; margin-bottom: 0.5rem;">📊 Generated Visualizations</h2>
        <p style="color: #6b7280; margin: 0;">Automatically generated charts and maps will appear below</p>
    </div>
    """)

    with gr.Row():
        with gr.Column():
            image1 = gr.Image(label="📈 Chart 1", visible=False, height=400)
            image2 = gr.Image(label="📊 Chart 2", visible=False, height=400)
        with gr.Column():
            image3 = gr.Image(label="🗺️ Map/Chart 3", visible=False, height=400)
            image4 = gr.Image(label="📉 Chart 4", visible=False, height=400)

    # Set up the search button click event with progress bar
    search_button.click(
        fn=search_and_analyze,
        inputs=[query_input],
        outputs=[page_url_display, status_output, download_button, image1, image2, image3, image4],
        show_progress="full"  # Show the built-in progress bar
    )


if __name__ == "__main__":
    demo.queue()  # Enable queuing for real-time updates
    demo.launch(
        share=True,
        server_name="0.0.0.0",
        server_port=7860,
        show_error=True
    )
filtered_dataset.csv
ADDED
The diff for this file is too large to render.
france_data/departements.geojson
ADDED
The diff for this file is too large to render.
france_data/regions.geojson
ADDED
The diff for this file is too large to render.
launch_gradio.py
ADDED
@@ -0,0 +1,30 @@
#!/usr/bin/env python3
"""
Launch script for the Data Analysis Agent Gradio Interface
"""

import sys
import os

# Add the current directory to Python path
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))

try:
    from app import demo

    print("🚀 Starting Data Analysis Agent...")
    print("📊 The interface will be available at: http://localhost:7860")
    print("🌐 A shareable link will also be provided")
    print("\n" + "="*50)

    # Launch the interface
    demo.launch()

except ImportError as e:
    print(f"❌ Import Error: {e}")
    print("\n💡 Make sure you have installed all dependencies:")
    print("   pip install -r requirements.txt")
    sys.exit(1)
except Exception as e:
    print(f"❌ Error launching interface: {e}")
    sys.exit(1)
reexport_data.py
ADDED
@@ -0,0 +1,97 @@
import pandas as pd

# Read the CSV file
df = pd.read_csv('dataset_metadata/full_dts_list.csv', sep=";")

# Print all columns
print("Columns in the dataset:")
print(df.columns.tolist())
print("\n" + "="*50 + "\n")

# Print unique values for license column
print("Unique values in 'license' column:")
if 'license' in df.columns:
    unique_licences = df['license'].unique()
    for i, licence in enumerate(unique_licences, 1):
        print(f"{i}. {licence}")

    print(f"\nTotal unique license values: {len(unique_licences)}")

    # Also show value counts for license column
    print("\nLicense value counts:")
    print(df['license'].value_counts())
else:
    print("Column 'license' not found in the dataset.")
    print("Available columns are:", df.columns.tolist())

print("\n" + "="*50 + "\n")

# Select only the required columns
required_columns = ["title", "url", "license", "quality_score"]

# Check which columns exist
existing_columns = [col for col in required_columns if col in df.columns]
missing_columns = [col for col in required_columns if col not in df.columns]

print(f"Found columns: {existing_columns}")
if missing_columns:
    print(f"Missing columns: {missing_columns}")

# Select only existing columns
df_filtered = df[existing_columns].copy()

print(f"\nOriginal dataset shape: {df.shape}")
print(f"After selecting columns: {df_filtered.shape}")

# Filter out rows where license is NaN
df_filtered = df_filtered.dropna(subset=['license'])

print(f"After removing NaN license values: {df_filtered.shape}")
# # Filter for datasets that include France in spatial granularity
# if 'spatial.zones' in df_filtered.columns:
#     # Check unique values in spatial.granularity before filtering
#     print(f"\nUnique values in 'spatial.zones' column (first 10):")
#     unique_spatial = df_filtered['spatial.zones'].dropna().unique()
#     for i, value in enumerate(unique_spatial[:10], 1):
#         print(f"{i}. {value}")
#     if len(unique_spatial) > 10:
#         print(f"... and {len(unique_spatial) - 10} more values")
#
#     # Filter for France (case-insensitive search)
#     france_filter = df_filtered['spatial.zones'].str.contains('France', case=False, na=False)
#     df_filtered = df_filtered[france_filter]
#
#     print(f"After filtering for France in spatial zones: {df_filtered.shape}")
# else:
#     print("Warning: 'spatial.zones' column not found, skipping France filter")

# Filter by quality score (keep only >= 0.8)
if 'quality_score' in df_filtered.columns:
    print(f"Before quality filtering: {df_filtered.shape}")
    df_filtered = df_filtered[df_filtered['quality_score'] >= 0.8]
    print(f"After filtering quality_score >= 0.8: {df_filtered.shape}")

    # Sort by quality score (descending order - highest quality first)
    df_filtered = df_filtered.sort_values('quality_score', ascending=False)
    print(f"Dataset sorted by quality_score (highest first)")

    # Show quality score distribution
    if not df_filtered.empty:
        print(f"Quality score range: {df_filtered['quality_score'].min():.2f} - {df_filtered['quality_score'].max():.2f}")
else:
    print("Warning: 'quality_score' column not found, skipping quality filtering and sorting")

# Save to CSV
# Drop license column before saving
df_filtered = df_filtered.drop('license', axis=1)
output_file = 'filtered_dataset.csv'
df_filtered.to_csv(output_file, index=False)

print(f"\nFiltered dataset saved to: {output_file}")
print(f"Final dataset contains {len(df_filtered)} rows and {len(df_filtered.columns)} columns")

print("\n" + "="*50)
print("✅ Pre-processing complete!")
print("Files created:")
print(f"  - {output_file}: Filtered dataset")
print("="*50)
requirements.txt
ADDED
@@ -0,0 +1,22 @@
pandas
shapely
geopandas
numpy
rtree
pyproj
matplotlib
requests
duckduckgo-search
smolagents[toolkit]
smolagents[litellm]
python-dotenv
beautifulsoup4
reportlab>=3.6.0
scikit-learn
gradio
pypdf2
python-docx
scipy
openpyxl
unidecode
rank_bm25
tools/drawing_tools.py
ADDED
@@ -0,0 +1,206 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
import geopandas as gpd
|
2 |
+
import matplotlib.pyplot as plt
|
3 |
+
import requests
|
4 |
+
import os
|
5 |
+
import pandas as pd
|
6 |
+
from smolagents import tool
|
7 |
+
from typing import Dict, Tuple, Optional
|
8 |
+
from matplotlib.figure import Figure
|
9 |
+
from matplotlib.axes import Axes
|
10 |
+
from shapely.geometry.base import BaseGeometry
|
11 |
+
|
12 |
+
def _download_geojson(url: str, file_name: str) -> str:
|
13 |
+
"""Downloads a GeoJSON file if it doesn't exist.
|
14 |
+
|
15 |
+
Args:
|
16 |
+
url (str): The URL of the GeoJSON file.
|
17 |
+
file_name (str): The name of the file to save the data in.
|
18 |
+
|
19 |
+
Returns:
|
20 |
+
str: The path to the downloaded file.
|
21 |
+
"""
|
22 |
+
data_dir = "france_data"
|
23 |
+
if not os.path.exists(data_dir):
|
24 |
+
os.makedirs(data_dir)
|
25 |
+
|
26 |
+
file_path = os.path.join(data_dir, file_name)
|
27 |
+
|
28 |
+
if not os.path.exists(file_path):
|
29 |
+
print(f"Downloading {file_name} from {url}...")
|
30 |
+
response = requests.get(url)
|
31 |
+
response.raise_for_status() # Raise an exception for bad status codes
|
32 |
+
|
33 |
+
with open(file_path, 'w') as f:
|
34 |
+
f.write(response.text)
|
35 |
+
print("Download complete.")
|
36 |
+
|
37 |
+
return file_path
|
38 |
+
|
39 |
+
def get_france_geodata(level: str = 'regions') -> gpd.GeoDataFrame:
|
40 |
+
"""Gets a GeoDataFrame for Metropolitan France with its regions or departments.
|
41 |
+
|
42 |
+
Args:
|
43 |
+
level (str): The administrative level to draw ('regions' or 'departments').
|
44 |
+
|
45 |
+
Returns:
|
46 |
+
gpd.GeoDataFrame: A GeoDataFrame with the requested administrative level.
|
47 |
+
"""
|
48 |
+
if level == 'regions':
|
49 |
+
url = "https://raw.githubusercontent.com/gregoiredavid/france-geojson/master/regions.geojson"
|
50 |
+
file_name = "regions.geojson"
|
51 |
+
elif level == 'departments':
|
52 |
+
url = "https://raw.githubusercontent.com/gregoiredavid/france-geojson/master/departements.geojson"
|
53 |
+
file_name = "departements.geojson"
|
54 |
+
else:
|
55 |
+
raise ValueError("level must be 'regions' or 'departments'")
|
56 |
+
|
57 |
+
geojson_path = _download_geojson(url, file_name)
|
58 |
+
gdf = gpd.read_file(geojson_path)
|
59 |
+
|
60 |
+
# Although the geojson files are for metropolitan France, we can filter to be safe.
|
61 |
+
if level == 'regions':
|
62 |
+
# Metropolitan region codes are between 11 and 94.
|
63 |
+
gdf['code'] = gdf['code'].astype(int)
|
64 |
+
france_metropolitan = gdf[gdf['code'].between(11, 94)]
|
65 |
+
    else: # departments
        # Metropolitan department codes are 01-19, 21-95, 2A, 2B. Corsica (20) is split.
        metro_codes = [f'{i:02d}' for i in range(1, 20)] + [f'{i:02d}' for i in range(21, 96)] + ['2A', '2B']
        france_metropolitan = gdf[gdf['code'].isin(metro_codes)]

    france_metropolitan = france_metropolitan.to_crs(epsg=2154)
    return france_metropolitan

@tool
def draw_france_map(level: str = 'regions') -> Tuple[Figure, Axes]:
    """Draws a map of Metropolitan France with its regions or departments.

    Args:
        level (str): The administrative level to draw ('regions' or 'departments').

    Returns:
        Tuple[Figure, Axes]: A tuple containing the Matplotlib figure and axes objects.
    """
    france_metropolitan = get_france_geodata(level)

    fig, ax = plt.subplots(1, 1, figsize=(15, 12))

    # Plot with a single color
    france_metropolitan.plot(ax=ax, color='lightgray', edgecolor='black')

    minx, miny, maxx, maxy = france_metropolitan.total_bounds

    padding = 0.1
    ax.set_xlim(minx - padding * (maxx - minx), maxx + padding * (maxx - minx))
    ax.set_ylim(miny - padding * (maxy - miny), maxy + padding * (maxy - miny))

    ax.set_aspect('equal', adjustable='box')
    ax.set_axis_off()
    ax.set_title(f'Metropolitan France with {level.capitalize()}', fontsize=20)

    return fig, ax

@tool
def get_geodata_mapping(level: str = 'regions') -> Dict[str, BaseGeometry]:
    """Returns a mapping from region/department name to its polygon.

    Args:
        level (str): The administrative level to get the mapping for ('regions' or 'departments').

    Returns:
        Dict[str, BaseGeometry]: A dictionary mapping the name to the polygon.
    """
    france_metropolitan = get_france_geodata(level)

    mapping = {row['nom']: row['geometry'] for _, row in france_metropolitan.iterrows()}

    return mapping

@tool
def plot_geodata(geodata: gpd.GeoDataFrame, ax: Axes, color: str = None, edgecolor: str = 'black', alpha: float = 1.0, output_path: Optional[str] = None) -> Optional[str]:
    """Plots geodata on a given map axes and optionally saves the map as an image file.

    Args:
        geodata (gpd.GeoDataFrame): The geodata to plot.
        ax (Axes): The axes to plot on.
        color (str, optional): The color for the geometries. Defaults to None.
        edgecolor (str, optional): The color for the geometry edges. Defaults to 'black'.
        alpha (float, optional): The alpha blending value, between 0 and 1. Defaults to 1.0.
        output_path (Optional[str], optional): Path to save the map image file (e.g., 'map.png'). Defaults to None.

    Returns:
        Optional[str]: The path to the saved file if output_path is provided, otherwise None.
    """
    # Ensure the geodata is in the same CRS as the base map (Lambert-93)
    geodata = geodata.to_crs(epsg=2154)
    geodata.plot(ax=ax, color=color, edgecolor=edgecolor, alpha=alpha)

    if output_path:
        fig = ax.get_figure()
        fig.savefig(output_path, bbox_inches='tight')

    return output_path

@tool
def plot_departments_data(
    data: pd.DataFrame,
    dep_col: str = 'dep',
    value_col: str = 'value',
    map_title: str = 'French Departments Data',
    output_path: Optional[str] = 'france_data.png'
) -> Optional[str]:
    """
    Plots data for French departments on a map of France.

    Args:
        data (pd.DataFrame): DataFrame with department data. Must contain at least two columns:
            one for department codes and one for the values to plot.
        dep_col (str): The name of the column in `data` that contains the department codes.
        value_col (str): The name of the column in `data` that contains the values to plot.
        map_title (str): The title of the map.
        output_path (Optional[str]): Path to save the map image file. If None, the plot is not saved.
            Defaults to 'france_data.png'.

    Returns:
        Optional[str]: The path to the saved file if output_path is provided, otherwise None.
    """
    # Get the geodata for departments
    departments_gdf = get_france_geodata('departments')

    # Ensure department codes in user data are strings for merging
    data[dep_col] = data[dep_col].astype(str).str.zfill(2)

    # Merge the geodata with the user's data
    merged_gdf = departments_gdf.merge(data, left_on='code', right_on=dep_col)

    # Create the plot
    fig, ax = plt.subplots(1, 1, figsize=(15, 12))
    ax.set_aspect('equal')
    ax.set_axis_off()

    # Plot the base map of all departments
    departments_gdf.plot(ax=ax, color='lightgray', edgecolor='black')

    # Plot the data on top
    if not merged_gdf.empty:
        merged_gdf.plot(column=value_col, ax=ax, legend=True, cmap='viridis')

    ax.set_title(map_title, fontsize=20)

    if output_path:
        fig.savefig(output_path, bbox_inches='tight')
        print(f"Map saved to {output_path}")
        return output_path

    return None

if __name__ == '__main__':
    # Create sample data for five departments (codes 5, 92, 63, 45, 32)
    sample_data = {
        'dep': [5, 92, 63, 45, 32],
        'value': [10, 50, 20, 30, 45]  # Some arbitrary values
    }
    data_df = pd.DataFrame(sample_data)

    print("Generating map with department data...")
    plot_departments_data(data_df, output_path='france_departments_data.png')
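
# Illustrative composition of the drawing tools above (a commented sketch; it assumes the
# region name 'Bretagne' is present in the 'nom' field of france_data/regions.geojson):
#
#   fig, ax = draw_france_map('regions')
#   regions = get_geodata_mapping('regions')
#   highlight = gpd.GeoDataFrame(geometry=[regions['Bretagne']], crs='EPSG:2154')
#   plot_geodata(highlight, ax, color='steelblue', alpha=0.7, output_path='bretagne.png')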
tools/exploration_tools.py
ADDED
@@ -0,0 +1,84 @@
import io
import pandas as pd
from smolagents import tool

@tool
def read_data(file_path: str) -> pd.DataFrame:
    """
    Reads a CSV, JSON, or Excel (.xlsx) file into a pandas DataFrame.

    Args:
        file_path: The path to the CSV, JSON, or Excel file.

    Returns:
        A pandas DataFrame with the loaded data, or an error message if the file cannot be read.
    """
    try:
        if file_path.lower().endswith('.csv'):
            df = pd.read_csv(file_path, delimiter=';')
        elif file_path.lower().endswith('.json'):
            df = pd.read_json(file_path)
        elif file_path.lower().endswith('.xlsx'):
            df = pd.read_excel(file_path, engine='openpyxl')
        else:
            return "Unsupported file format. Please use a CSV, JSON, or Excel (.xlsx) file."
        return df
    except Exception as e:
        return f"Error reading the data file: {str(e)}"

@tool
def get_dataset_description(df: pd.DataFrame) -> str:
    """
    Provides a description of the dataset, including info, description, and head.

    Args:
        df: The pandas DataFrame to describe.

    Returns:
        A string containing the description of the DataFrame.
    """
    try:
        # df.info() writes to a buffer and returns None, so capture its output explicitly
        buffer = io.StringIO()
        df.info(verbose=False, buf=buffer)
        info = buffer.getvalue()
        description = df.describe()
        head = df.head()
        return f"Info:\n{info}\n\nDescription:\n{description}\n\nHead:\n{head}"
    except Exception as e:
        return f"Error describing the DataFrame: {str(e)}"

@tool
def get_value_counts(df: pd.DataFrame, column_name: str) -> str:
    """
    Gets the value counts for a specified column in the DataFrame.

    Args:
        df: The pandas DataFrame.
        column_name: The name of the column to get the value counts for.

    Returns:
        A string containing the value counts for the column, or an error message.
    """
    try:
        value_counts = df[column_name].value_counts().to_string()
        return f"Value counts for column '{column_name}':\n{value_counts}"
    except KeyError:
        return f"Error: Column '{column_name}' not found in the DataFrame."
    except Exception as e:
        return f"Error getting value counts: {str(e)}"

@tool
def get_correlation_matrix(df: pd.DataFrame) -> str:
    """
    Computes and returns the correlation matrix for the numerical columns in the DataFrame.

    Args:
        df: The pandas DataFrame.

    Returns:
        A string containing the correlation matrix, or an error message.
    """
    try:
        # Select only numeric columns for correlation matrix
        numeric_df = df.select_dtypes(include=['number'])
        correlation_matrix = numeric_df.corr().to_string()
        return f"Correlation Matrix:\n{correlation_matrix}"
    except Exception as e:
        return f"Error computing correlation matrix: {str(e)}"
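
if __name__ == "__main__":
    # Illustrative smoke test (a minimal sketch): exercise the exploration tools on a local
    # CSV. It assumes a semicolon-delimited file such as filtered_dataset.csv is present,
    # matching the delimiter that read_data expects.
    df = read_data("filtered_dataset.csv")
    if isinstance(df, pd.DataFrame):
        print(get_dataset_description(df))
        print(get_correlation_matrix(df))
    else:
        print(df)  # read_data returned an error message instead of a DataFrame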
tools/libreoffice_tools.py
ADDED
@@ -0,0 +1,155 @@
import os
import subprocess
import platform
from smolagents import tool

def get_libreoffice_path():
    """
    Get the correct LibreOffice path based on the operating system.

    Returns:
        str: Path to LibreOffice executable or None if not found
    """
    system = platform.system()

    if system == "Darwin":  # macOS
        # Common LibreOffice installation paths on macOS
        possible_paths = [
            "/Applications/LibreOffice.app/Contents/MacOS/soffice",
            "/Applications/LibreOffice Developer Edition.app/Contents/MacOS/soffice",
            "/opt/homebrew/bin/soffice",  # Homebrew installation
            "/usr/local/bin/soffice"
        ]

        for path in possible_paths:
            if os.path.exists(path):
                return path

    elif system == "Linux":
        # Common LibreOffice paths on Linux
        possible_paths = [
            "/usr/bin/libreoffice",
            "/usr/bin/soffice",
            "/snap/bin/libreoffice",
            "/usr/local/bin/libreoffice"
        ]

        for path in possible_paths:
            if os.path.exists(path):
                return path

    elif system == "Windows":
        # Common LibreOffice paths on Windows
        possible_paths = [
            r"C:\Program Files\LibreOffice\program\soffice.exe",
            r"C:\Program Files (x86)\LibreOffice\program\soffice.exe"
        ]

        for path in possible_paths:
            if os.path.exists(path):
                return path

    # Try to find it in PATH as a fallback
    try:
        result = subprocess.run(['which', 'soffice'], capture_output=True, text=True)
        if result.returncode == 0:
            return result.stdout.strip()
    except Exception:
        pass

    try:
        result = subprocess.run(['which', 'libreoffice'], capture_output=True, text=True)
        if result.returncode == 0:
            return result.stdout.strip()
    except Exception:
        pass

    return None

@tool
def convert_to_pdf_with_libreoffice(input_file: str, output_dir: str = None) -> str:
    """
    Convert a document to PDF using LibreOffice.

    Args:
        input_file: Path to the input document
        output_dir: Directory to save the PDF (optional, defaults to same directory as input)

    Returns:
        str: Path to the generated PDF file or error message
    """
    libreoffice_path = get_libreoffice_path()

    if not libreoffice_path:
        return "LibreOffice not found. Please install LibreOffice from https://www.libreoffice.org/"

    if not os.path.exists(input_file):
        return f"Input file not found: {input_file}"

    if output_dir is None:
        output_dir = os.path.dirname(input_file)

    if not os.path.exists(output_dir):
        os.makedirs(output_dir, exist_ok=True)

    try:
        # Use LibreOffice headless mode to convert to PDF
        cmd = [
            libreoffice_path,
            '--headless',
            '--convert-to', 'pdf',
            '--outdir', output_dir,
            input_file
        ]

        result = subprocess.run(cmd, capture_output=True, text=True, timeout=60)

        if result.returncode == 0:
            # Generate expected output filename
            base_name = os.path.splitext(os.path.basename(input_file))[0]
            pdf_path = os.path.join(output_dir, f"{base_name}.pdf")

            if os.path.exists(pdf_path):
                return pdf_path
            else:
                return f"PDF conversion completed but file not found at expected location: {pdf_path}"
        else:
            return f"LibreOffice conversion failed: {result.stderr}"

    except subprocess.TimeoutExpired:
        return "LibreOffice conversion timed out after 60 seconds"
    except Exception as e:
        return f"Error during LibreOffice conversion: {str(e)}"

@tool
def check_libreoffice_availability() -> str:
    """
    Check if LibreOffice is available and return its path and version.

    Returns:
        str: Information about LibreOffice availability
    """
    libreoffice_path = get_libreoffice_path()

    if not libreoffice_path:
        system = platform.system()
        install_info = {
            "Darwin": "Install with: brew install libreoffice OR download from https://www.libreoffice.org/",
            "Linux": "Install with: sudo apt install libreoffice OR sudo yum install libreoffice",
            "Windows": "Download from https://www.libreoffice.org/"
        }

        return f"LibreOffice not found on {system}. {install_info.get(system, 'Install from https://www.libreoffice.org/')}"

    try:
        # Get version info
        result = subprocess.run([libreoffice_path, '--version'], capture_output=True, text=True, timeout=10)
        version_info = result.stdout.strip() if result.returncode == 0 else "Version unknown"

        return f"LibreOffice found at: {libreoffice_path}\nVersion: {version_info}"
    except Exception:
        return f"LibreOffice found at: {libreoffice_path}\nVersion: Unable to determine"

if __name__ == "__main__":
    # Test the LibreOffice detection
    print(check_libreoffice_availability())
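    # Illustrative follow-up (a commented sketch; the input path below is hypothetical).
    # The tool returns either a PDF path or a human-readable error string, so callers can
    # branch on the result instead of catching exceptions:
    #
    #   result = convert_to_pdf_with_libreoffice("generated_data/report.odt")
    #   if result.lower().endswith(".pdf"):
    #       print(f"PDF written to {result}")
    #   else:
    #       print(f"Conversion failed: {result}")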
tools/webpage_tools.py
ADDED
@@ -0,0 +1,168 @@
import requests
from smolagents import tool
from requests.exceptions import RequestException
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from dotenv import load_dotenv
import pandas as pd
import json
from io import StringIO, BytesIO

load_dotenv()

@tool
def visit_webpage(url: str) -> str:
    """Visits a webpage at the given URL and returns its full DOM content.

    Args:
        url: The URL of the webpage to visit.

    Returns:
        The DOM of the webpage as a string, or an error message if the request fails.
    """
    try:
        # Send a GET request to the URL
        response = requests.get(url)
        response.raise_for_status()  # Raise an exception for bad status codes

        return response.text

    except RequestException as e:
        return f"Error fetching the webpage: {str(e)}"
    except Exception as e:
        return f"An unexpected error occurred: {str(e)}"

@tool
def get_all_links(html_content: str, base_url: str) -> list[str]:
    """
    Finds all links to CSV, JSON, and Excel (.xlsx) files in the given HTML content.

    Args:
        html_content: The HTML content of a webpage.
        base_url: The base URL of the webpage to resolve relative links.

    Returns:
        A list of all unique absolute URLs to CSV, JSON, or Excel files found on the page.
    """
    soup = BeautifulSoup(html_content, 'html.parser')
    links = set()
    for a_tag in soup.find_all('a', href=True):
        href = a_tag['href']
        # Join the base URL with the found href to handle relative links
        absolute_url = urljoin(base_url, href)
        if absolute_url.lower().endswith(('.csv', '.json', '.xlsx')):
            links.add(absolute_url)
    return list(links)

@tool
def read_csv_file(file_path: str) -> str:
    """
    Reads a CSV file and returns its content as a string.

    Args:
        file_path: The path to the CSV file.

    Returns:
        The content of the CSV file as a string, or an error message if the file cannot be read.
    """
    try:
        df = pd.read_csv(file_path, delimiter=';')
        return df.to_string()
    except Exception as e:
        return f"Error reading the CSV file: {str(e)}"

@tool
def read_file_from_url(url: str) -> pd.DataFrame:
    """
    Reads a CSV, JSON, or Excel (.xlsx) file from a static URL and loads it into a pandas DataFrame.

    Args:
        url: The URL of the CSV, JSON, or Excel file to read.

    Returns:
        A pandas DataFrame containing the data from the file, or raises an exception if the file cannot be read.
    """
    try:
        # Send a GET request to the URL
        response = requests.get(url)
        response.raise_for_status()  # Raise an exception for bad status codes

        # Handle encoding properly
        if response.encoding is None or response.encoding.lower() in ['iso-8859-1', 'ascii']:
            response.encoding = 'utf-8'

        # Determine file type based on URL extension
        if url.lower().endswith('.csv'):
            # Use BytesIO to handle encoding properly
            content_bytes = response.content

            # Try different delimiters for CSV files
            try:
                # First try comma separator
                df = pd.read_csv(BytesIO(content_bytes), encoding='utf-8')
            except Exception:
                try:
                    # Then try semicolon separator
                    df = pd.read_csv(BytesIO(content_bytes), delimiter=';', encoding='utf-8')
                except Exception:
                    try:
                        # Finally try tab separator
                        df = pd.read_csv(BytesIO(content_bytes), delimiter='\t', encoding='utf-8')
                    except Exception:
                        # Last resort: try latin-1 encoding
                        df = pd.read_csv(BytesIO(content_bytes), delimiter=';', encoding='latin-1')

        elif url.lower().endswith('.json'):
            # Parse JSON and convert to DataFrame - use proper encoding
            json_data = json.loads(response.text)

            # Handle different JSON structures
            if isinstance(json_data, list):
                df = pd.DataFrame(json_data)
            elif isinstance(json_data, dict):
                # If it's a dict, try to find the main data array
            if False: pass  # (placeholder removed)
        else:
            raise ValueError("Unsupported JSON structure")

    except RequestException as e:
        raise Exception(f"Error fetching the file from URL: {str(e)}")
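    # Illustrative follow-up (a commented sketch): the DataFrame returned by
    # read_file_from_url can be handed to the exploration tools defined in
    # tools/exploration_tools.py, e.g.
    #
    #   from tools.exploration_tools import get_dataset_description
    #   print(get_dataset_description(content))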