axel-darmouni committed on
Commit 2dd2794 · 1 Parent(s): b765960

update: docx use

Files changed (5):
  1. README.md +9 -10
  2. agent.py +4 -14
  3. app.py +9 -9
  4. requirements.txt +0 -1
  5. tools/followup_tools.py +18 -4
README.md CHANGED

````diff
@@ -14,7 +14,7 @@ tag: agent-demo-track
 
 # πŸ€– French Public Data Analysis Agent
 
-**AI-powered intelligent analysis of French public datasets** with automated visualization generation, comprehensive PDF reports, and **interactive follow-up analysis capabilities**.
+**AI-powered intelligent analysis of French public datasets** with automated visualization generation, comprehensive DOCX reports, and **interactive follow-up analysis capabilities**.
 
 ## ✨ Features
 
@@ -48,10 +48,10 @@ tag: agent-demo-track
 - **Follow-up Visualizations**: Generate additional charts based on user questions
 
 ### πŸ“„ **Comprehensive Reports**
-- **Professional PDF Reports**: Complete analysis with embedded visualizations
+- **Professional DOCX Reports**: Complete analysis with embedded visualizations
 - **Bilingual Support**: Reports generated in the same language as your query
 - **Structured Analysis**: Title page, methodology, findings, and next steps
-- **LibreOffice Integration**: Cross-platform PDF generation
+- **Direct DOCX Generation**: No external dependencies required
 - **Report Continuity**: Follow-up analysis references previous report context
 
 ### 🎨 **Modern Web Interface**
@@ -67,7 +67,6 @@ tag: agent-demo-track
 ### 1. Prerequisites
 
 - Python 3.8+
-- LibreOffice (for PDF generation)
 - Google Gemini API key
 
 ### 2. Installation
@@ -140,7 +139,7 @@ After the initial analysis is complete:
 
 ### Results
 
-- **Download PDF Report**: Complete analysis with all visualizations
+- **Download DOCX Report**: Complete analysis with all visualizations
 - **View Individual Charts**: Up to 4 visualizations displayed in the interface
 - **Dataset Reference**: Direct link to the original data.gouv.fr page
 - **Follow-up Visualizations**: Additional charts generated from follow-up questions
@@ -159,7 +158,7 @@ After the initial analysis is complete:
 β”‚ β”œβ”€β”€ webpage_tools.py # Web scraping and data extraction
 β”‚ β”œβ”€β”€ exploration_tools.py # Dataset analysis and description
 β”‚ β”œβ”€β”€ drawing_tools.py # France map generation and visualization
-β”‚ β”œβ”€β”€ libreoffice_tools.py # PDF conversion utilities
+β”‚ β”œβ”€β”€ libreoffice_tools.py # Document utilities (legacy)
 β”‚ β”œβ”€β”€ followup_tools.py # Follow-up analysis tools
 β”‚ └── retrieval_tools.py # Dataset search and retrieval
 β”œβ”€β”€ filtered_dataset.csv # Pre-processed dataset index (5,000+ datasets)
@@ -176,7 +175,7 @@ After the initial analysis is complete:
 - **Search**: BM25 keyword matching with TF-IDF preprocessing
 - **Translation**: LLM-powered bilingual query translation
 - **Visualization**: Matplotlib, Geopandas, Seaborn
-- **PDF Generation**: python-docx + LibreOffice conversion
+- **Report Generation**: python-docx for DOCX documents
 - **Data Processing**: Pandas, NumPy, Shapely, Scipy
 - **Follow-up Analytics**: Statistical analysis, correlation studies, custom filtering ⭐
@@ -222,8 +221,8 @@ After the initial analysis is complete:
    - Try a different query or use the random selection
    - Agent will automatically search for alternative datasets
 
-2. **LibreOffice PDF conversion fails**
-   - Ensure LibreOffice is installed and accessible
+2. **DOCX report generation fails**
+   - Ensure python-docx is installed correctly
    - Check the console for specific error messages
 
 3. **Translation errors**
@@ -301,7 +300,7 @@ pandas, shapely, geopandas, numpy, rtree, pyproj
 matplotlib, requests, duckduckgo-search
 smolagents[toolkit], smolagents[litellm]
 dotenv, beautifulsoup4, reportlab>=3.6.0
-scikit-learn, gradio, pypdf2, python-docx
+scikit-learn, gradio, python-docx
 scipy, openpyxl, unidecode, rank_bm25
 ```
````
 
agent.py CHANGED

```diff
@@ -11,11 +11,6 @@ from tools.exploration_tools import (
 from tools.drawing_tools import (
     plot_departments_data,
 )
-from tools.libreoffice_tools import (
-    convert_to_pdf_with_libreoffice,
-    check_libreoffice_availability,
-    get_libreoffice_info,
-)
 from tools.retrieval_tools import (
     search_datasets,
     get_dataset_info,
@@ -39,8 +34,6 @@ def create_web_agent(step_callback):
             visit_webpage, get_all_links, read_file_from_url, save_dataset_for_followup,
             get_dataset_description,
             plot_departments_data,
-            convert_to_pdf_with_libreoffice,
-            check_libreoffice_availability, get_libreoffice_info,
             search_datasets, get_dataset_info, get_random_quality_dataset
         ],
         model=model,
@@ -49,7 +42,7 @@ def create_web_agent(step_callback):
         planning_interval=3,
         step_callbacks=[step_callback],  # Use the built-in callback system
         additional_authorized_imports=[
-            "subprocess", "docx", "docx.*",
+            "docx", "docx.*",
             "os", "bs4", "io", "requests", "json", "pandas",
             "matplotlib", "matplotlib.pyplot", "matplotlib.*", "numpy", "seaborn"
         ],
@@ -80,21 +73,18 @@ def generate_prompt(user_query=None, initial_search_results=None):
 
     4. **Report Generation**:
        - Write insightful analysis text for each visualization
-       - Generate a comprehensive PDF report using python-docx library that includes:
+       - Generate a comprehensive DOCX report using python-docx library that includes:
          * Title page with dataset name and analysis overview
          * All visualizations (PNG files) embedded in the report
          * Analysis text for each visualization
         * Conclusions and next steps
-       - Convert the docx file to PDF using convert_to_pdf_with_libreoffice tool
+       - Save the final DOCX report in the generated_data folder
 
     **Important Technical Notes:**
     - Save everything in the generated_data folder
     - Do NOT use the 'os' module
     - Work step by step, don't generate too much code at once
-    - Before PDF conversion, call check_libreoffice_availability() - it returns True/False
-    - If check_libreoffice_availability() returns True, use convert_to_pdf_with_libreoffice() tool
-    - If check_libreoffice_availability() returns False, skip PDF conversion and inform user
-    - Do NOT use subprocess calls directly for LibreOffice
+    - Generate a complete DOCX report that can be downloaded by the user
     - If question is in English, report is in English. If in French, report is in French.
     """
```
 
app.py CHANGED

```diff
@@ -206,8 +206,8 @@ def create_progress_callback():
             description = f"πŸ“ˆ Step {step_number}: Generating visualizations..."
         elif "save" in action_lower or "png" in action_lower:
             description = f"πŸ’Ύ Step {step_number}: Saving visualizations..."
-        elif "pdf" in action_lower or "report" in action_lower:
-            description = f"πŸ“„ Step {step_number}: Creating PDF report..."
+        elif "docx" in action_lower or "report" in action_lower:
+            description = f"πŸ“„ Step {step_number}: Creating DOCX report..."
         elif hasattr(memory_step, 'error') and memory_step.error:
             description = f"⚠️ Step {step_number}: Handling error..."
         else:
@@ -313,7 +313,7 @@ def search_and_analyze(query, progress=gr.Progress()):
             break
 
     # Initialize outputs
-    pdf_file = None
+    docx_file = None
     images_output = [gr.Image(visible=False)] * 4
     status = "πŸš€ Starting agent-driven analysis..."
 
@@ -440,17 +440,17 @@ def search_and_analyze(query, progress=gr.Progress()):
     progress(1.0, desc="βœ… Processing results...")
 
     # Process results
-    pdf_file = None
+    docx_file = None
     png_files = []
 
     for file in files:
-        if file.endswith('.pdf'):
-            pdf_file = file
+        if file.endswith('.docx'):
+            docx_file = file
         elif file.endswith('.png'):
             png_files.append(file)
 
     # Prepare final outputs
-    download_button = gr.File(value=pdf_file, visible=True) if pdf_file else None
+    download_button = gr.File(value=docx_file, visible=True) if docx_file else None
 
     # Prepare images for display (up to 4 images)
     images = []
@@ -495,7 +495,7 @@ def search_and_analyze(query, progress=gr.Progress()):
 
     # Fallback return
     progress(1.0, desc="🏁 Finished")
-    return (gr.Textbox(value="Completed", visible=True), current_status, pdf_file, *images_output,
+    return (gr.Textbox(value="Completed", visible=True), current_status, docx_file, *images_output,
             gr.Markdown(visible=False),  # keep follow-up hidden
             gr.HTML(visible=False),
             gr.Row(visible=False),
@@ -704,7 +704,7 @@ with gr.Blocks(title="πŸ€– French Public Data Analysis Agent", theme=gr.themes.S
     # Download section
     with gr.Row():
         download_button = gr.File(
-            label="πŸ“„ Download PDF Report",
+            label="πŸ“„ Download DOCX Report",
             visible=False
         )
```
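The result-processing loop in app.py reduces to a small pure function, shown here as an illustrative sketch (`classify_outputs` is a hypothetical name, not a function in the repo):

```python
def classify_outputs(files):
    """Split generated files into the DOCX report and PNG charts,
    mirroring the extension checks in the app's results loop."""
    docx_file = None
    png_files = []
    for f in files:
        if f.endswith('.docx'):
            docx_file = f  # last .docx wins, as in the app's loop
        elif f.endswith('.png'):
            png_files.append(f)
    return docx_file, png_files
```

The download button is then made visible only when the report slot is non-None, which is exactly the `gr.File(value=docx_file, visible=True) if docx_file else None` expression in the hunk above.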
 
requirements.txt CHANGED

```diff
@@ -14,7 +14,6 @@ beautifulsoup4
 reportlab>=3.6.0
 scikit-learn
 gradio
-pypdf2
 python-docx
 scipy
 openpyxl
```
tools/followup_tools.py CHANGED

```diff
@@ -172,8 +172,8 @@ def get_previous_report_content() -> str:
         The text content of the previous report for context
     """
     try:
-        # Look for PDF or DOCX files in generated_data
-        report_files = glob.glob('generated_data/*.pdf') + glob.glob('generated_data/*.docx')
+        # Look for DOCX files in generated_data
+        report_files = glob.glob('generated_data/*.docx')
 
         if not report_files:
             return "No previous report found in generated_data folder"
@@ -181,8 +181,19 @@ def get_previous_report_content() -> str:
         # Use the most recent report file
         latest_report = max(report_files, key=os.path.getctime)
 
-        # For now, return basic info about the report
-        # In a full implementation, you'd extract text from PDF/DOCX
+        # Try to extract basic text from DOCX file
+        docx_content = ""
+        try:
+            from docx import Document
+            doc = Document(latest_report)
+            paragraphs = []
+            for para in doc.paragraphs:
+                if para.text.strip():
+                    paragraphs.append(para.text.strip())
+            docx_content = "\n".join(paragraphs[:10])  # First 10 paragraphs for context
+        except Exception as e:
+            docx_content = f"Could not extract text from DOCX: {str(e)}"
+
         file_size = os.path.getsize(latest_report)
 
         # Also look for any text files that might contain analysis
@@ -199,6 +210,9 @@ Report file: {latest_report}
 File size: {file_size} bytes
 Created: {os.path.getctime(latest_report)}
 
+DOCX Report Content (first 10 paragraphs):
+{docx_content}
+
 Additional analysis content:
 {text_content if text_content else 'No additional text content found'}
```
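The "most recent report" selection used above can be isolated as a standalone sketch; `latest_report` is a hypothetical helper name for illustration, not part of followup_tools.py:

```python
import glob
import os

def latest_report(folder="generated_data"):
    """Return the most recently created .docx in `folder`, or None."""
    reports = glob.glob(os.path.join(folder, "*.docx"))
    if not reports:
        return None
    # os.path.getctime: creation time on Windows, metadata-change time on Unix
    return max(reports, key=os.path.getctime)
```

Note that `glob.glob` on a missing directory simply returns an empty list, so the None branch also covers a first run before any report exists.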