Ervinoreo committed
Commit 846f122 · 1 Parent(s): ecf227f
.gitignore CHANGED
@@ -22,6 +22,8 @@ share/python-wheels/
22
  .installed.cfg
23
  *.egg
24
  MANIFEST
 
 
25
 
26
  # Virtual Environment
27
  .venv/
@@ -31,6 +33,7 @@ ENV/
31
  env/
32
  .venv
33
  myenv/
 
34
 
35
  # Environment Variables
36
  .env
 
22
  .installed.cfg
23
  *.egg
24
  MANIFEST
25
+ tabs/__pycache__/
26
+ .gradio
27
 
28
  # Virtual Environment
29
  .venv/
 
33
  env/
34
  .venv
35
  myenv/
36
+ gradio/
37
 
38
  # Environment Variables
39
  .env
README.md DELETED
@@ -1,274 +0,0 @@
1
- # PanSea University Search
2
-
3
- An AI-powered RAG (Retrieval-Augmented Generation) system for searching ASEAN university admission requirements, designed to help prospective students find accurate and up-to-date information about study opportunities across Southeast Asia.
4
-
5
- ## 🎯 Problem & Solution
6
-
7
- **Problem:** Prospective students worldwide who want to study abroad struggle to find accurate, up-to-date university admission requirements. Information is scattered across PDFs, brochures, and outdated agency websites. Many students waste time applying to unsuitable programs because key criteria are hard to find, and many pay high agent fees.
8
-
9
- **Solution:** An LLM-powered, RAG-based platform built on **SEA-LION multilingual models** that ingests official admissions documents from ASEAN universities. Students can query in any ASEAN language and receive ranked program matches with fees, entry requirements, deadlines, application windows, and source citations.
10
-
11
- ## 🌟 Features
12
-
13
- - 📄 **PDF Document Ingestion**: Upload official university admission documents
14
- - 🔍 **Intelligent Search**: Natural language queries in multiple ASEAN languages
15
- - 🎯 **Accurate Responses**: AI-powered answers with source citations
16
- - 🔗 **Shareable Results**: Generate links to share query results
17
- - 🌏 **Multi-language Support**: English, Chinese, Malay, Thai, Indonesian, Vietnamese, Filipino
18
- - 💰 **Advanced Filtering**: Budget range, study level, country preferences
19
-
20
- ## 🚀 Quick Start
21
-
22
- ### Prerequisites
23
-
24
- - Python 3.11+
25
- - SEA-LION API Key
26
- - OpenAI API Key (optional, for fallback embeddings)
27
-
28
- ### Installation
29
-
30
- 1. **Clone and navigate to the project:**
31
-
32
- ```bash
33
- cd pansea
34
- ```
35
-
36
- 2. **Activate virtual environment:**
37
-
38
- ```bash
39
- source .venv/bin/activate # On Windows: .venv\Scripts\activate
40
- ```
41
-
42
- 3. **Install dependencies:**
43
-
44
- ```bash
45
- pip install -r requirements.txt
46
- ```
47
-
48
- 4. **Set up environment variables:**
49
-
50
- ```bash
51
- cp .env.example .env
52
- # Edit .env and add your SEA-LION API key (OpenAI key optional for fallback)
53
- ```
54
-
55
- 5. **Run the application:**
56
-
57
- ```bash
58
- streamlit run app.py
59
- ```
60
-
61
- 6. **Open your browser to:** `http://localhost:8501`
62
-
63
- ### Usage
64
-
65
- #### 1. Upload Documents
66
-
67
- - Go to the "Upload Documents" page
68
- - Enter university name and country
69
- - Select document type (admission requirements, tuition fees, etc.)
70
- - Upload PDF files containing university information
71
- - Click "Process Documents"
72
-
73
- #### 2. Search Universities
74
-
75
- - Go to the "Search Universities" page
76
- - Choose your response language
77
- - Enter questions like:
78
- - "Show me universities in Malaysia for master's degrees with tuition under 40,000 RMB per year"
79
- - "专科毕业,无雅思,想在马来西亚读硕士,学费不超过 4 万人民币/年"
80
- - "What are the English proficiency requirements for Singapore universities?"
81
- - Apply optional filters (budget, study level, countries)
82
- - Get AI-powered responses with source citations
83
-
84
- #### 3. Share Results
85
-
86
- - Each query generates a unique shareable link
87
- - Share results with friends, family, or education consultants
88
- - Access shared results without needing to upload documents again
89
-
90
- ## 📁 Project Structure
91
-
92
- ```
93
- pansea/
94
- ├── app.py # Main Streamlit application
95
- ├── rag_system.py # RAG system implementation
96
- ├── requirements.txt # Python dependencies
97
- ├── .env # Environment variables
98
- ├── .venv/ # Virtual environment
99
- ├── chroma_db/ # Vector database storage
100
- ├── documents/ # Uploaded documents storage
101
- ├── query_results/ # Shared query results
102
- └── README.md # This file
103
- ```
104
-
105
- ## 🛠️ Core Components
106
-
107
- ### DocumentIngestion Class
108
-
109
- - Handles PDF text extraction using PyPDF2
110
- - Creates document chunks with metadata
111
- - Builds and persists ChromaDB vector store
112
- - Manages document preprocessing and storage
113
-
114
- ### RAGSystem Class
115
-
116
- - Implements retrieval-augmented generation
117
- - Uses BGE-small-en-v1.5 embeddings for semantic search (with OpenAI fallback)
118
- - Leverages SEA-LION models for response generation:
119
- - **SEA-LION v3.5 Reasoning Model** for complex university queries
120
- - **SEA-LION v3 Instruct Model** for translation and simple questions
121
- - Provides multilingual query support with automatic model selection
122
-
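
For orientation, here is a minimal sketch of how these two classes fit together. It uses only the method names that appear elsewhere in this repository (`process_documents`, `create_vector_store`, `query`); treat it as illustrative rather than the exact API.

```python
from utils.rag_system import DocumentIngestion, RAGSystem

# uploaded_files: a list of PDF file objects, e.g. from the app's file uploader
ingestion = DocumentIngestion()
documents = ingestion.process_documents(uploaded_files)   # PDFs -> chunked Documents with metadata
vectorstore = ingestion.create_vector_store(documents)    # embed chunks and persist them to ChromaDB

rag = RAGSystem()
result = rag.query(
    question="Master's programs in Malaysia under 40,000 RMB per year?",
    language="English",
)
print(result["answer"])
for doc in result["source_documents"]:
    print(doc.metadata.get("source"), doc.metadata.get("university"))
```
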
123
- ### Streamlit UI
124
-
125
- - Clean, intuitive interface
126
- - Multi-page navigation
127
- - File upload with progress tracking
128
- - Advanced search filters
129
- - Shareable query results
130
-
131
- ## 🌏 Supported Languages
132
-
133
- The system supports queries and responses in:
134
-
135
- - **English** - Primary language
136
- - **中文 (Chinese)** - For Chinese-speaking students
137
- - **Bahasa Malaysia** - For Malaysian context
138
- - **ไทย (Thai)** - For Thai students
139
- - **Bahasa Indonesia** - For Indonesian students
140
- - **Tiếng Việt (Vietnamese)** - For Vietnamese students
141
- - **Filipino** - For Philippines context
142
-
143
- ## 🎯 Target ASEAN Countries
144
-
145
- - 🇸🇬 Singapore
146
- - 🇲🇾 Malaysia
147
- - 🇹🇭 Thailand
148
- - 🇮🇩 Indonesia
149
- - 🇵🇭 Philippines
150
- - 🇻🇳 Vietnam
151
- - 🇧🇳 Brunei
152
- - 🇰🇭 Cambodia
153
- - 🇱🇦 Laos
154
- - 🇲🇲 Myanmar
155
-
156
- ## 🔧 Configuration
157
-
158
- ### Environment Variables (.env)
159
-
160
- ```bash
161
- # SEA-LION API Configuration
162
- SEA_LION_API_KEY=your_sea_lion_api_key_here
163
- SEA_LION_BASE_URL=https://api.sea-lion.ai/v1
164
-
165
- # OpenAI API Configuration (for embeddings)
166
- OPENAI_API_KEY=your_openai_api_key_here
167
-
168
- # Application Configuration
169
- APP_NAME=Top.Edu University Search
170
- APP_VERSION=1.0.0
171
- CHROMA_PERSIST_DIRECTORY=./chroma_db
172
- UPLOAD_FOLDER=./documents
173
- MAX_FILE_SIZE_MB=50
174
- ```
175
-
176
- ### Customization Options
177
-
178
- - **Chunk Size**: Adjust text splitting in `rag_system.py`
179
- - **Retrieval Count**: Modify number of retrieved documents (default: 5)
180
- - **Model Selection**: Configure SEA-LION model selection logic
181
- - **UI Themes**: Modify CSS in `app.py`
182
- - **Query Classification**: Adjust complex vs simple query detection
183
-
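
For example, chunk size is controlled where the text splitter is built in `rag_system.py`; the sketch below shows the kind of adjustment involved (variable names are illustrative, the actual code may differ):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# documents: the list of LangChain Documents produced during ingestion.
# Larger chunks keep more context per retrieved passage; smaller chunks improve precision.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,     # characters per chunk
    chunk_overlap=150,   # overlap so requirements are not cut mid-sentence
)
chunks = text_splitter.split_documents(documents)
```
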
184
- ## 📊 Example Queries
185
-
186
- Try these sample queries to test the system and see different model usage:
187
-
188
- ### Complex Queries (Uses SEA-LION Reasoning Model)
189
-
190
- 1. **Multi-criteria Search**: "Show me universities in Thailand and Malaysia for engineering master's programs with tuition under $15,000 per year"
191
-
192
- 2. **Chinese Query**: "专科毕业,无雅思,想在马来西亚读硕士,学费不超过 4 万人民币/年"
193
-
194
- 3. **Comparative Analysis**: "Compare MBA programs in Singapore and Indonesia with GMAT requirements and scholarship opportunities"
195
-
196
- ### Simple Queries (Uses SEA-LION Instruct Model)
197
-
198
- 4. **Translation**: "How do you say 'application deadline' in Thai and Indonesian?"
199
-
200
- 5. **Definition**: "What is the difference between IELTS and TOEFL?"
201
-
202
- 6. **Basic Information**: "What does GPA stand for and how is it calculated?"
203
-
204
- ## 🔍 Technical Stack
205
-
206
- - **Backend**: Python 3.11, LangChain
207
- - **LLM Models**:
208
- - SEA-LION v3.5 8B Reasoning (complex queries)
209
- - SEA-LION v3 9B Instruct (simple queries & translation)
210
- - **Embeddings**: BGE-small-en-v1.5 (with OpenAI ada-002 fallback)
211
- - **Vector Database**: ChromaDB with persistence
212
- - **Frontend**: Streamlit with custom CSS
213
- - **Document Processing**: PyPDF2, PyCryptodome (for encrypted PDFs), RecursiveCharacterTextSplitter
214
-
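
As a rough illustration of the embedding setup listed above (BGE-small by default, OpenAI ada-002 as fallback), the selection logic might look like the sketch below; the actual wiring lives in `rag_system.py` and may differ:

```python
import os

def build_embedder():
    """Return a callable that maps a list of texts to embedding vectors."""
    try:
        # Default: local BGE model via sentence-transformers (already in requirements)
        from sentence_transformers import SentenceTransformer
        model = SentenceTransformer("BAAI/bge-small-en-v1.5")
        return lambda texts: model.encode(texts).tolist()
    except Exception:
        # Fallback: OpenAI embeddings (needs the openai package and OPENAI_API_KEY)
        from openai import OpenAI
        client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
        def embed(texts):
            response = client.embeddings.create(model="text-embedding-ada-002", input=texts)
            return [item.embedding for item in response.data]
        return embed
```
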
215
- ## 📈 Roadmap
216
-
217
- - [ ] Support for additional document formats (Word, Excel)
218
- - [x] Integration with SEA-LION multilingual models
219
- - [ ] Real-time web scraping of university websites
220
- - [ ] Mobile-responsive design
221
- - [ ] User authentication and query history
222
- - [ ] Advanced analytics and insights
223
- - [ ] Integration with university application systems
224
- - [ ] Fine-tuning SEA-LION models on university-specific data
225
-
226
- ## 🤝 Contributing
227
-
228
- 1. Fork the repository
229
- 2. Create a feature branch (`git checkout -b feature/amazing-feature`)
230
- 3. Commit your changes (`git commit -m 'Add amazing feature'`)
231
- 4. Push to the branch (`git push origin feature/amazing-feature`)
232
- 5. Open a Pull Request
233
-
234
- ## 📄 License
235
-
236
- This project is licensed under the MIT License - see the LICENSE file for details.
237
-
238
- ## 💡 Tips for Best Results
239
-
240
- 1. **Upload Quality Documents**: Use official admission guides and requirements documents
241
- 2. **Be Specific**: Include specific criteria in your queries (budget, location, program type)
242
- 3. **Use Natural Language**: Ask questions as you would to a human counselor
243
- 4. **Try Multiple Languages**: The system works well with mixed-language queries
244
- 5. **Check Sources**: Always review the source documents cited in responses
245
-
246
- ## 🆘 Troubleshooting
247
-
248
- ### Common Issues
249
-
250
- **"No documents found"**: Upload PDF documents first on the Upload Documents page
251
-
252
- **"API Key not found"**: Add your SEA-LION API key to the .env file
253
-
254
- **"No embeddings available"**: BGE embeddings are used by default. If issues occur, add your OpenAI API key for fallback embeddings
255
-
256
- **"Import errors"**: Install dependencies using `pip install -r requirements.txt`
257
-
258
- **"ChromaDB errors"**: Delete the `chroma_db` folder and restart the application
259
-
260
- **"PyCryptodome is required for AES algorithm"**: This error occurs with encrypted PDFs. PyCryptodome is now included in requirements.txt
261
-
262
- **"Could not extract text from PDF"**: This can happen with:
263
-
264
- - Password-protected PDFs (provide unprotected versions)
265
- - Scanned PDFs or image-based documents (consider OCR tools)
266
- - Heavily encrypted or corrupted PDF files
267
-
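
A quick way to check whether a PDF has extractable text before uploading it (PyPDF2 is already in the requirements):

```python
from PyPDF2 import PdfReader

reader = PdfReader("admission_guide.pdf")  # replace with your file
text = "".join((page.extract_text() or "") for page in reader.pages)
if not text.strip():
    print("No extractable text found; the PDF is probably scanned, so run OCR first.")
else:
    print(f"Extracted {len(text)} characters from {len(reader.pages)} pages.")
```
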
268
- ## 📞 Support
269
-
270
- For support, please create an issue on GitHub or contact the development team.
271
-
272
- ---
273
-
274
- **Made with ❤️ for students seeking education opportunities in ASEAN** 🎓
README_GRADIO.md ADDED
@@ -0,0 +1,204 @@
1
+ # 🌏 ASEAN University Search - Gradio Version
2
+
3
+ An AI-powered university document search and Q&A system built with Gradio, specifically designed for ASEAN universities. This version uses **SEA-LION AI models** for intelligent responses and supports multiple Southeast Asian languages.
4
+
5
+ ## ✨ Features
6
+
7
+ - 🤖 **AI-Powered Search**: Uses SEA-LION models for intelligent document analysis
8
+ - 🌍 **Multi-Language Support**: English, Chinese, Malay, Thai, Indonesian, Vietnamese, Filipino
9
+ - 📚 **Automatic Metadata Extraction**: Detects university names, countries, and document types
10
+ - 🔍 **Semantic Document Chunking**: Intelligent text splitting for better retrieval
11
+ - 📱 **Shareable Links**: Built-in Gradio sharing for easy deployment
12
+ - 🎯 **Source Citations**: Always shows which documents were used for answers
13
+
14
+ ## 🚀 Quick Start
15
+
16
+ ### Option 1: Using the Startup Script (Recommended)
17
+
18
+ ```bash
19
+ ./start_gradio.sh
20
+ ```
21
+
22
+ ### Option 2: Manual Setup
23
+
24
+ ```bash
25
+ # Create virtual environment
26
+ python3 -m venv venv
27
+ source venv/bin/activate
28
+
29
+ # Install requirements
30
+ pip install -r requirements_gradio.txt
31
+
32
+ # Run the application
33
+ python app_gradio.py
34
+ ```
35
+
36
+ ## 🌐 Deployment Options
37
+
38
+ ### 1. **Local with Public Link** (Immediate)
39
+
40
+ - Run the app locally
41
+ - Gradio automatically creates a public shareable link
42
+ - Perfect for testing and sharing
43
+
44
+ ### 2. **HuggingFace Spaces** (Free, Recommended)
45
+
46
+ 1. Go to [HuggingFace Spaces](https://huggingface.co/spaces)
47
+ 2. Create new space with Gradio SDK
48
+ 3. Upload your files:
49
+ - `app_gradio.py`
50
+ - `requirements_gradio.txt` (rename to `requirements.txt`)
51
+ - `utils/` folder
52
+ - `.env` file (with your API keys)
53
+ 4. Deploy automatically!
54
+
55
+ ### 3. **Google Colab** (Free)
56
+
57
+ ```python
58
+ # Upload files to Colab
59
+ !pip install -r requirements_gradio.txt
60
+ !python app_gradio.py
61
+ ```
62
+
63
+ ### 4. **Railway/Render** (Paid but reliable)
64
+
65
+ - Push to GitHub
66
+ - Connect to Railway/Render
67
+ - Auto-deploy with custom domain
68
+
69
+ ## 🔧 Configuration
70
+
71
+ ### Environment Variables
72
+
73
+ Create a `.env` file:
74
+
75
+ ```env
76
+ # Required for SEA-LION models
77
+ SEA_LION_API_KEY=your_sea_lion_api_key_here
78
+ SEA_LION_BASE_URL=https://api.sea-lion.ai/v1
79
+
80
+ # Optional: For OpenAI embeddings fallback
81
+ OPENAI_API_KEY=your_openai_api_key_here
82
+
83
+ # Optional: Custom vector database location
84
+ CHROMA_PERSIST_DIRECTORY=./chroma_db
85
+ ```
86
+
87
+ ### Model Configuration
88
+
89
+ The system automatically chooses the appropriate model:
90
+
91
+ - **Simple queries**: SEA-LION Instruct (faster)
92
+ - **Complex analysis**: SEA-LION Reasoning (more thorough)
93
+
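
The routing between the two models is implemented in `utils/rag_system.py`; as a rough sketch of the idea (not the actual code), a keyword-based heuristic could look like this:

```python
# Illustrative heuristic only; the real classifier may use different signals,
# and the returned labels are placeholders for the two SEA-LION models.
COMPLEX_HINTS = ("compare", "tuition", "scholarship", "requirements", "deadline", "under")

def pick_model(question: str) -> str:
    q = question.lower()
    if len(q.split()) > 12 or any(hint in q for hint in COMPLEX_HINTS):
        return "sea-lion-reasoning"   # thorough, multi-criteria analysis
    return "sea-lion-instruct"        # fast answers and translation
```
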
94
+ ## 📋 How to Use
95
+
96
+ 1. **Initialize System** 🚀
97
+
98
+ - Click "Initialize Systems"
99
+ - Wait for models to download (first time only)
100
+
101
+ 2. **Upload Documents** 📄
102
+
103
+ - Upload PDF university documents
104
+ - System automatically extracts metadata
105
+ - Supports multiple documents at once
106
+
107
+ 3. **Ask Questions** 🔍
108
+ - Type questions in natural language
109
+ - Choose response language
110
+ - Get AI answers with source citations
111
+
112
+ ## 🎯 Example Questions
113
+
114
+ - "What are the admission requirements for Computer Science in Singapore?"
115
+ - "Which universities offer scholarships under $5000?"
116
+ - "Compare MBA programs in Thailand and Malaysia"
117
+ - "找到学费低于 5000 美元的工程专业" (Chinese)
118
+ - "Cari universitas dengan beasiswa di Indonesia" (Indonesian)
119
+
120
+ ## 🛠️ Troubleshooting
121
+
122
+ ### Common Issues
123
+
124
+ **"No embedding model available"**
125
+
126
+ ```bash
127
+ # Install sentence transformers
128
+ pip install sentence-transformers torch
129
+
130
+ # Or set OpenAI API key
131
+ export OPENAI_API_KEY=your_key_here
132
+ ```
133
+
134
+ **"Cannot load model"**
135
+
136
+ - Ensure internet connection for model download
137
+ - Try a smaller model: set `EMBEDDING_MODEL=all-MiniLM-L6-v2` (see the snippet below)
138
+
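
Assuming the app reads `EMBEDDING_MODEL` from the environment, as the tip above implies, you can set it in your shell or in `.env`:

```bash
# In your shell before launching, or as a line in .env (without "export")
export EMBEDDING_MODEL=all-MiniLM-L6-v2
```
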
139
+ **PDF extraction fails**
140
+
141
+ - Ensure PDFs are text-based (not scanned images)
142
+ - Check if PDF is password-protected
143
+
144
+ ## 🔄 Differences from Streamlit Version
145
+
146
+ | Feature | Streamlit | Gradio |
147
+ | ----------------- | ------------------------ | ------------------------ |
148
+ | **Deployment** | Complex, SQLite issues | Simple, multiple options |
149
+ | **Sharing** | Limited | Built-in public links |
150
+ | **UI** | More customizable | Clean, mobile-friendly |
151
+ | **Dependencies** | Heavy, version conflicts | Lighter, more stable |
152
+ | **Cloud Hosting** | Streamlit Cloud only | HF Spaces, Colab, etc. |
153
+
154
+ ## 📁 Project Structure
155
+
156
+ ```
157
+ 📦 ASEAN University Search (Gradio)
158
+ ├── 🚀 app_gradio.py # Main Gradio application
159
+ ├── 📋 requirements_gradio.txt # Gradio-specific dependencies
160
+ ├── ⚡ start_gradio.sh # Quick startup script
161
+ ├── 🔧 utils/
162
+ │ ├── rag_system.py # Core RAG logic (Streamlit-free)
163
+ │ ├── display.py # Display utilities
164
+ │ └── translations.py # Language translations
165
+ ├── 📁 documents/ # Document storage
166
+ ├── 🗄️ chroma_db/ # Vector database
167
+ ├── 📊 query_results/ # Saved query results
168
+ └── 🔐 .env # Environment variables
169
+ ```
170
+
171
+ ## 🌟 Benefits of Gradio Version
172
+
173
+ 1. **🚀 Faster Deployment**: No SQLite version conflicts
174
+ 2. **🌐 Built-in Sharing**: Automatic public links
175
+ 3. **📱 Mobile-Friendly**: Responsive design
176
+ 4. **🔧 Fewer Dependencies**: More stable installation
177
+ 5. **🎯 Multiple Hosting Options**: HF Spaces, Colab, Railway, etc.
178
+ 6. **🛠️ Better Error Handling**: Clearer error messages
179
+ 7. **⚡ Faster Loading**: Optimized model initialization
180
+
181
+ ## 🤝 Contributing
182
+
183
+ 1. Fork the repository
184
+ 2. Create a feature branch: `git checkout -b feature-name`
185
+ 3. Make your changes
186
+ 4. Commit: `git commit -m "Add feature"`
187
+ 5. Push: `git push origin feature-name`
188
+ 6. Create a Pull Request
189
+
190
+ ## 📄 License
191
+
192
+ This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
193
+
194
+ ## 🙏 Acknowledgments
195
+
196
+ - **SEA-LION AI**: For the amazing Southeast Asia-focused language models
197
+ - **Gradio**: For the excellent web interface framework
198
+ - **LangChain**: For the robust RAG pipeline
199
+ - **ChromaDB**: For efficient vector storage
200
+ - **Sentence Transformers**: For semantic embeddings
201
+
202
+ ---
203
+
204
+ **Built with ❤️ for the ASEAN education community**
app.py DELETED
@@ -1,123 +0,0 @@
1
- import streamlit as st
2
- import os
3
- from urllib.parse import urlparse, parse_qs
4
- from utils.rag_system import DocumentIngestion, RAGSystem, save_query_result, load_shared_query
5
- from datetime import datetime
6
- import uuid
7
- from utils.translations import translations, get_text, get_language_code
8
- from pathlib import Path
9
- from my_pages.search_uni import search_page
10
- from my_pages.upload_documents import upload_documents_page
11
- from my_pages.manage_documents import manage_documents_page
12
- from my_pages.about import about_page
13
- from utils.display import display_shared_query
14
-
15
- # Load external CSS
16
- def load_css(file_name):
17
- css_file = Path(file_name)
18
- if css_file.exists():
19
- with open(css_file) as f:
20
- st.markdown(f"<style>{f.read()}</style>", unsafe_allow_html=True)
21
-
22
- load_css("styles.css")
23
-
24
- # Configure Streamlit page
25
- st.set_page_config(
26
- page_title="PanSea University Search",
27
- page_icon="🎓",
28
- layout="wide",
29
- initial_sidebar_state="expanded"
30
- )
31
-
32
- def main():
33
- # Initialize language in session state if not present
34
- if 'app_language' not in st.session_state:
35
- st.session_state.app_language = "English"
36
-
37
- # Get current language from session state
38
- current_lang = st.session_state.app_language
39
-
40
- # Check for shared query in URL
41
- query_params = st.query_params
42
- shared_query_id = query_params.get("share", [None])[0]
43
-
44
- if shared_query_id:
45
- display_shared_query(shared_query_id)
46
- return
47
-
48
- # Main header
49
- st.markdown(f"""
50
- <div class="main-header">
51
- <h1>{get_text("app_title", current_lang)}</h1>
52
- <h5>{get_text("app_subtitle", current_lang)}</h5>
53
- </div>
54
- """, unsafe_allow_html=True)
55
-
56
- # Sidebar
57
- with st.sidebar:
58
- # Global language selector
59
- selected_language = st.selectbox(
60
- "🌐 Language / 语言 / Bahasa",
61
- ["English", "中文 (Chinese)", "Bahasa Malaysia", "ไทย (Thai)",
62
- "Bahasa Indonesia", "Tiếng Việt (Vietnamese)"],
63
- index=["English", "中文 (Chinese)", "Bahasa Malaysia", "ไทย (Thai)",
64
- "Bahasa Indonesia", "Tiếng Việt (Vietnamese)"].index(
65
- next((lang for lang in ["English", "中文 (Chinese)", "Bahasa Malaysia", "ไทย (Thai)",
66
- "Bahasa Indonesia", "Tiếng Việt (Vietnamese)"]
67
- if get_language_code(lang) == current_lang), "English")),
68
- key="global_language_selector"
69
- )
70
-
71
- # Update session state when language changes
72
- new_lang = get_language_code(selected_language)
73
- if new_lang != current_lang:
74
- st.session_state.app_language = new_lang
75
- st.rerun()
76
-
77
- # Update current_lang after potential change
78
- current_lang = st.session_state.app_language
79
-
80
- st.divider()
81
-
82
- # Navigation header
83
- st.markdown(f"## {get_text('navigation', current_lang)}")
84
-
85
-
86
- # Define the pages
87
- page_keys = ["search_universities", "upload_documents", "manage_documents", "about"]
88
- page_translations = {key: get_text(key, current_lang) for key in page_keys}
89
-
90
- # Initialize current page if needed
91
- if "current_page_key" not in st.session_state:
92
- st.session_state.current_page_key = page_keys[0]
93
-
94
- # Sidebar buttons
95
- for key in page_keys:
96
- if st.button(page_translations[key], use_container_width=True):
97
- st.session_state.current_page_key = key
98
-
99
- # Main content
100
- if st.session_state.current_page_key == "upload_documents":
101
- upload_documents_page()
102
- elif st.session_state.current_page_key == "manage_documents":
103
- manage_documents_page()
104
- elif st.session_state.current_page_key == "about":
105
- about_page()
106
- else:
107
- search_page()
108
-
109
-
110
-
111
- if __name__ == "__main__":
112
- # Check if SEA-LION API key is set
113
- if not os.getenv("SEA_LION_API_KEY"):
114
- st.error("🚨 SEA-LION API Key not found! Please set your SEA_LION_API_KEY in the .env file.")
115
- st.code("SEA_LION_API_KEY=your_api_key_here")
116
- st.stop()
117
-
118
- # Check if OpenAI API key is set (needed for embeddings)
119
- if not os.getenv("OPENAI_API_KEY") or os.getenv("OPENAI_API_KEY") == "your_openai_api_key_here":
120
- st.warning("⚠️ OpenAI API Key not configured properly. You'll need it for document embeddings.")
121
- st.info("The system will use SEA-LION models for text generation, but OpenAI for document embeddings.")
122
-
123
- main()
app_gradio.py ADDED
@@ -0,0 +1,137 @@
1
+ """
2
+ PANSEA University Requirements Assistant - Gradio Version (Modular)
3
+ A comprehensive tool for navigating university admission requirements across Southeast Asia.
4
+ """
5
+ import gradio as gr
6
+ import os
7
+ import sys
8
+ from datetime import datetime
9
+
10
+ # Add the current directory to Python path for imports
11
+ sys.path.append(os.path.dirname(os.path.abspath(__file__)))
12
+
13
+ # Import our RAG system
14
+ from utils.rag_system import DocumentIngestion, RAGSystem
15
+
16
+ # Import modular tab components
17
+ from tabs.initialize import create_initialize_tab
18
+ from tabs.upload import create_upload_tab
19
+ from tabs.query import create_query_tab
20
+ from tabs.manage import create_manage_tab
21
+ from tabs.help import create_help_tab
22
+
23
+ def create_interface():
24
+ """Create the main Gradio interface using modular components"""
25
+
26
+ # Global state management - shared across all tabs
27
+ global_vars = {
28
+ 'doc_ingestion': None,
29
+ 'rag_system': None,
30
+ 'vectorstore': None
31
+ }
32
+
33
+ # Custom CSS for better styling
34
+ custom_css = """
35
+ .gradio-container {
36
+ font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
37
+ }
38
+ .tab-nav button {
39
+ font-weight: 500;
40
+ font-size: 14px;
41
+ }
42
+ .tab-nav button[aria-selected="true"] {
43
+ background: linear-gradient(45deg, #1e3a8a, #3b82f6);
44
+ color: white;
45
+ }
46
+ .feedback-box {
47
+ background: #f8fafc;
48
+ border: 1px solid #e2e8f0;
49
+ border-radius: 8px;
50
+ padding: 16px;
51
+ margin: 8px 0;
52
+ }
53
+ .success-message {
54
+ background: #dcfce7;
55
+ color: #166534;
56
+ border: 1px solid #bbf7d0;
57
+ padding: 12px;
58
+ border-radius: 6px;
59
+ margin: 8px 0;
60
+ }
61
+ .error-message {
62
+ background: #fef2f2;
63
+ color: #dc2626;
64
+ border: 1px solid #fecaca;
65
+ padding: 12px;
66
+ border-radius: 6px;
67
+ margin: 8px 0;
68
+ }
69
+ """
70
+
71
+ # Create the main interface
72
+ with gr.Blocks(
73
+ title="🌏 PANSEA University Assistant",
74
+ theme=gr.themes.Soft(
75
+ primary_hue="blue",
76
+ secondary_hue="slate"
77
+ ),
78
+ css=custom_css,
79
+ analytics_enabled=False
80
+ ) as interface:
81
+
82
+ # Header
83
+ gr.Markdown("""
84
+ # 🌏 TopEdu
85
+
86
+ **Navigate University Admission Requirements Across Southeast Asia with AI-Powered Assistance**
87
+
88
+ Upload university documents, ask questions, and get intelligent answers about admission requirements,
89
+ programs, deadlines, and more across Southeast Asian universities.
90
+
91
+ ---
92
+ """)
93
+
94
+ # Main tabs using modular components
95
+ with gr.Tabs():
96
+ create_initialize_tab(global_vars)
97
+ create_upload_tab(global_vars)
98
+ create_query_tab(global_vars)
99
+ create_manage_tab(global_vars)
100
+ create_help_tab(global_vars)
101
+
102
+ # Footer
103
+ gr.Markdown(f"""
104
+ ---
105
+
106
+ **🔧 System Status**: Ready | **📅 Session**: {datetime.now().strftime('%Y-%m-%d %H:%M')} | **🔄 Version**: Modular Gradio
107
+
108
+ 💡 **Tip**: Start by initializing the system, then upload your university documents, and begin querying!
109
+ """)
110
+
111
+ return interface
112
+
113
+ def main():
114
+ """Launch the application"""
115
+ interface = create_interface()
116
+
117
+ # Launch configuration
118
+ interface.launch(
119
+ share=False, # Set to True for public sharing
120
+ server_name="0.0.0.0", # Allow external connections
121
+ server_port=7860, # Default Gradio port
122
+ show_api=False, # Hide API documentation
123
+ show_error=True, # Show detailed error messages
124
+ quiet=False, # Show startup messages
125
+ favicon_path=None, # Could add custom favicon
126
+ app_kwargs={
127
+ "docs_url": None, # Disable FastAPI docs
128
+ "redoc_url": None # Disable ReDoc docs
129
+ }
130
+ )
131
+
132
+ if __name__ == "__main__":
133
+ print("🚀 Starting PANSEA University Requirements Assistant...")
134
+ print("📍 Access the application at: http://localhost:7860")
135
+ print("🔗 For public sharing, set share=True in the launch() method")
136
+ print("-" * 60)
137
+ main()
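
The `tabs/*` modules imported above are not shown in this commit. Purely as a hypothetical sketch of the pattern that `create_*_tab(global_vars)` implies (a factory that builds its UI inside the active `gr.Blocks` context and stores shared objects in `global_vars`), such a module might look like:

```python
# Hypothetical sketch of a tab factory such as tabs/initialize.py; the real module
# in this commit may differ. It uses only names already imported in app_gradio.py.
import gradio as gr
from utils.rag_system import DocumentIngestion, RAGSystem

def create_initialize_tab(global_vars):
    with gr.Tab("🔧 Initialize"):
        status = gr.Markdown("System not initialized yet.")
        init_btn = gr.Button("Initialize All Systems", variant="primary")

        def initialize():
            # Store shared instances so the other tabs can reuse them.
            global_vars["doc_ingestion"] = DocumentIngestion()
            global_vars["rag_system"] = RAGSystem()
            global_vars["vectorstore"] = global_vars["doc_ingestion"].load_existing_vectorstore()
            return "✅ Systems initialized."

        init_btn.click(fn=initialize, outputs=status)
```
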
app_gradio_modular.py ADDED
@@ -0,0 +1,137 @@
1
+ """
2
+ PANSEA University Requirements Assistant - Gradio Version (Modular)
3
+ A comprehensive tool for navigating university admission requirements across Southeast Asia.
4
+ """
5
+ import gradio as gr
6
+ import os
7
+ import sys
8
+ from datetime import datetime
9
+
10
+ # Add the current directory to Python path for imports
11
+ sys.path.append(os.path.dirname(os.path.abspath(__file__)))
12
+
13
+ # Import our RAG system
14
+ from utils.rag_system import DocumentIngestion, RAGSystem
15
+
16
+ # Import modular tab components
17
+ from tabs.initialize import create_initialize_tab
18
+ from tabs.upload import create_upload_tab
19
+ from tabs.query import create_query_tab
20
+ from tabs.manage import create_manage_tab
21
+ from tabs.help import create_help_tab
22
+
23
+ def create_interface():
24
+ """Create the main Gradio interface using modular components"""
25
+
26
+ # Global state management - shared across all tabs
27
+ global_vars = {
28
+ 'doc_ingestion': None,
29
+ 'rag_system': None,
30
+ 'vectorstore': None
31
+ }
32
+
33
+ # Custom CSS for better styling
34
+ custom_css = """
35
+ .gradio-container {
36
+ font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
37
+ }
38
+ .tab-nav button {
39
+ font-weight: 500;
40
+ font-size: 14px;
41
+ }
42
+ .tab-nav button[aria-selected="true"] {
43
+ background: linear-gradient(45deg, #1e3a8a, #3b82f6);
44
+ color: white;
45
+ }
46
+ .feedback-box {
47
+ background: #f8fafc;
48
+ border: 1px solid #e2e8f0;
49
+ border-radius: 8px;
50
+ padding: 16px;
51
+ margin: 8px 0;
52
+ }
53
+ .success-message {
54
+ background: #dcfce7;
55
+ color: #166534;
56
+ border: 1px solid #bbf7d0;
57
+ padding: 12px;
58
+ border-radius: 6px;
59
+ margin: 8px 0;
60
+ }
61
+ .error-message {
62
+ background: #fef2f2;
63
+ color: #dc2626;
64
+ border: 1px solid #fecaca;
65
+ padding: 12px;
66
+ border-radius: 6px;
67
+ margin: 8px 0;
68
+ }
69
+ """
70
+
71
+ # Create the main interface
72
+ with gr.Blocks(
73
+ title="🌏 PANSEA University Assistant",
74
+ theme=gr.themes.Soft(
75
+ primary_hue="blue",
76
+ secondary_hue="slate"
77
+ ),
78
+ css=custom_css,
79
+ analytics_enabled=False
80
+ ) as interface:
81
+
82
+ # Header
83
+ gr.Markdown("""
84
+ # 🌏 TopEdu
85
+
86
+ **Navigate University Admission Requirements Across Southeast Asia with AI-Powered Assistance**
87
+
88
+ Upload university documents, ask questions, and get intelligent answers about admission requirements,
89
+ programs, deadlines, and more across Southeast Asian universities.
90
+
91
+ ---
92
+ """)
93
+
94
+ # Main tabs using modular components
95
+ with gr.Tabs():
96
+ create_initialize_tab(global_vars)
97
+ create_upload_tab(global_vars)
98
+ create_query_tab(global_vars)
99
+ create_manage_tab(global_vars)
100
+ create_help_tab(global_vars)
101
+
102
+ # Footer
103
+ gr.Markdown(f"""
104
+ ---
105
+
106
+ **🔧 System Status**: Ready | **📅 Session**: {datetime.now().strftime('%Y-%m-%d %H:%M')} | **🔄 Version**: Modular Gradio
107
+
108
+ 💡 **Tip**: Start by initializing the system, then upload your university documents, and begin querying!
109
+ """)
110
+
111
+ return interface
112
+
113
+ def main():
114
+ """Launch the application"""
115
+ interface = create_interface()
116
+
117
+ # Launch configuration
118
+ interface.launch(
119
+ share=False, # Set to True for public sharing
120
+ server_name="0.0.0.0", # Allow external connections
121
+ server_port=7860, # Default Gradio port
122
+ show_api=False, # Hide API documentation
123
+ show_error=True, # Show detailed error messages
124
+ quiet=False, # Show startup messages
125
+ favicon_path=None, # Could add custom favicon
126
+ app_kwargs={
127
+ "docs_url": None, # Disable FastAPI docs
128
+ "redoc_url": None # Disable ReDoc docs
129
+ }
130
+ )
131
+
132
+ if __name__ == "__main__":
133
+ print("🚀 Starting PANSEA University Requirements Assistant...")
134
+ print("📍 Access the application at: http://localhost:7860")
135
+ print("🔗 For public sharing, set share=True in the launch() method")
136
+ print("-" * 60)
137
+ main()
installed_packages.txt DELETED
@@ -1,178 +0,0 @@
1
- aiohappyeyeballs==2.6.1
2
- aiohttp==3.12.15
3
- aiosignal==1.4.0
4
- altair==5.5.0
5
- altex==0.2.0
6
- annotated-types==0.7.0
7
- anyio==4.10.0
8
- asgiref==3.9.1
9
- async-timeout==4.0.3
10
- attrs==25.3.0
11
- backoff==2.2.1
12
- bcrypt==4.3.0
13
- beautifulsoup4==4.13.4
14
- blinker==1.9.0
15
- build==1.3.0
16
- cachetools==5.5.2
17
- certifi==2025.8.3
18
- charset-normalizer==3.4.3
19
- chroma-hnswlib==0.7.3
20
- chromadb==1.0.16
21
- click==8.2.1
22
- coloredlogs==15.0.1
23
- contourpy==1.3.2
24
- cycler==0.12.1
25
- dataclasses-json==0.6.7
26
- Deprecated==1.2.18
27
- distro==1.9.0
28
- durationpy==0.10
29
- entrypoints==0.4
30
- exceptiongroup==1.3.0
31
- faiss-cpu==1.7.4
32
- Faker==37.5.3
33
- fastapi==0.116.1
34
- favicon==0.7.0
35
- filelock==3.18.0
36
- flatbuffers==25.2.10
37
- fonttools==4.59.0
38
- frozenlist==1.7.0
39
- fsspec==2025.7.0
40
- gitdb==4.0.12
41
- GitPython==3.1.45
42
- google-auth==2.40.3
43
- googleapis-common-protos==1.70.0
44
- grpcio==1.74.0
45
- h11==0.16.0
46
- hf-xet==1.1.7
47
- htbuilder==0.9.0
48
- httpcore==1.0.9
49
- httptools==0.6.4
50
- httpx==0.28.1
51
- huggingface-hub==0.34.4
52
- humanfriendly==10.0
53
- idna==3.10
54
- importlib-metadata==6.11.0
55
- importlib_resources==6.5.2
56
- Jinja2==3.1.6
57
- jiter==0.10.0
58
- joblib==1.5.1
59
- jsonpatch==1.33
60
- jsonpointer==3.0.0
61
- jsonschema==4.25.0
62
- jsonschema-specifications==2025.4.1
63
- kiwisolver==1.4.9
64
- kubernetes==33.1.0
65
- langchain-text-splitters==0.3.9
66
- lxml==6.0.0
67
- Markdown==3.8.2
68
- markdown-it-py==4.0.0
69
- markdownlit==0.0.7
70
- MarkupSafe==3.0.2
71
- marshmallow==3.26.1
72
- matplotlib==3.10.5
73
- mdurl==0.1.2
74
- mmh3==5.2.0
75
- mpmath==1.3.0
76
- multidict==6.6.4
77
- mypy_extensions==1.1.0
78
- narwhals==2.1.0
79
- networkx==3.4.2
80
- numpy==1.26.4
81
- oauthlib==3.3.1
82
- onnxruntime==1.22.1
83
- opentelemetry-api==1.27.0
84
- opentelemetry-exporter-otlp-proto-common==1.27.0
85
- opentelemetry-exporter-otlp-proto-grpc==1.27.0
86
- opentelemetry-instrumentation==0.48b0
87
- opentelemetry-instrumentation-asgi==0.48b0
88
- opentelemetry-instrumentation-fastapi==0.48b0
89
- opentelemetry-proto==1.27.0
90
- opentelemetry-sdk==1.27.0
91
- opentelemetry-semantic-conventions==0.48b0
92
- opentelemetry-util-http==0.48b0
93
- orjson==3.11.2
94
- overrides==7.7.0
95
- packaging==23.2
96
- pandas==2.3.1
97
- pillow==10.4.0
98
- posthog==5.4.0
99
- propcache==0.3.2
100
- protobuf==4.25.8
101
- pulsar-client==3.8.0
102
- pyarrow==21.0.0
103
- pyasn1==0.6.1
104
- pyasn1_modules==0.4.2
105
- pybase64==1.4.2
106
- pycryptodome==3.23.0
107
- pydantic==2.11.7
108
- pydantic_core==2.33.2
109
- pydeck==0.9.1
110
- Pygments==2.19.2
111
- pymdown-extensions==10.16.1
112
- pyparsing==3.2.3
113
- PyPDF2==3.0.1
114
- PyPika==0.48.9
115
- pyproject_hooks==1.2.0
116
- python-dateutil==2.9.0.post0
117
- python-dotenv==1.0.0
118
- pytz==2025.2
119
- PyYAML==6.0.2
120
- referencing==0.36.2
121
- regex==2025.7.34
122
- requests==2.32.4
123
- requests-oauthlib==2.0.0
124
- requests-toolbelt==1.0.0
125
- rich==13.9.4
126
- rpds-py==0.27.0
127
- rsa==4.9.1
128
- safetensors==0.6.2
129
- scikit-learn==1.7.1
130
- scipy==1.15.3
131
- sentence-transformers==5.1.0
132
- shellingham==1.5.4
133
- six==1.17.0
134
- smmap==5.0.2
135
- sniffio==1.3.1
136
- soupsieve==2.7
137
- SQLAlchemy==2.0.43
138
- st-annotated-text==4.0.2
139
- starlette==0.47.2
140
- streamlit==1.48.0
141
- streamlit-camera-input-live==0.2.0
142
- streamlit-card==1.0.2
143
- streamlit-embedcode==0.1.2
144
- streamlit-extras==0.3.5
145
- streamlit-image-coordinates==0.1.9
146
- streamlit-keyup==0.3.0
147
- streamlit-toggle-switch==1.0.2
148
- streamlit-vertical-slider==2.5.5
149
- streamlit_faker==0.0.4
150
- sympy==1.14.0
151
- tenacity==8.5.0
152
- threadpoolctl==3.6.0
153
- tiktoken==0.11.0
154
- tokenizers==0.21.4
155
- toml==0.10.2
156
- tomli==2.2.1
157
- torch==2.8.0
158
- tornado==6.5.2
159
- tqdm==4.67.1
160
- transformers==4.55.0
161
- typer==0.16.0
162
- typing-inspect==0.9.0
163
- typing-inspection==0.4.1
164
- typing_extensions==4.14.1
165
- tzdata==2025.2
166
- tzlocal==5.3.1
167
- urllib3==2.5.0
168
- uvicorn==0.35.0
169
- uvloop==0.21.0
170
- validators==0.35.0
171
- watchdog==3.0.0
172
- watchfiles==1.1.0
173
- websocket-client==1.8.0
174
- websockets==15.0.1
175
- wrapt==1.17.3
176
- yarl==1.20.1
177
- zipp==3.23.0
178
- zstandard==0.23.0
my_pages/about.py DELETED
@@ -1,37 +0,0 @@
1
- import streamlit as st
2
- from utils.translations import get_text
3
-
4
- def about_page():
5
- # Get current language from session state
6
- lang = st.session_state.get('app_language', 'English')
7
-
8
- st.header(get_text("about_header", lang))
9
-
10
- # col1, col2 = st.columns([2, 1])
11
-
12
- # with col1:
13
- st.markdown(f"""
14
- ### {get_text("who_we_are", lang)}
15
- {get_text("who_we_are_description", lang)}
16
-
17
- ### {get_text("what_we_do", lang)}
18
- {get_text("what_we_do_description", lang)}
19
-
20
- ### {get_text("supported_languages", lang)}
21
- - English
22
- - 中文 (Chinese / Mandarin)
23
- - Bahasa Malaysia
24
- - ไทย (Thai)
25
- - Bahasa Indonesia
26
- - Tiếng Việt (Vietnamese)
27
- - Filipino
28
- - ភាសាខ្មែរ (Khmer)
29
- - ພາສາລາວ (Lao)
30
- - မြန်မာဘာသာ (Burmese)
31
- """)
32
-
33
- # with col2:
34
- # st.markdown(f"""
35
- # ### {get_text("contact", lang)}
36
- # Reach out to us for support or inquiries!
37
- # """)
my_pages/manage_documents.py DELETED
@@ -1,73 +0,0 @@
1
- import streamlit as st
2
- from utils.rag_system import DocumentIngestion
3
- from utils.translations import get_text
4
-
5
-
6
- def manage_documents_page():
7
- # Get current language from session state
8
- current_lang = st.session_state.get('app_language', 'English')
9
-
10
- st.header(get_text("manage_header", current_lang))
11
- st.write(get_text("manage_description", current_lang))
12
-
13
- from utils.rag_system import DocumentIngestion
14
- doc_ingestion = DocumentIngestion()
15
- vectorstore = doc_ingestion.load_existing_vectorstore()
16
-
17
- if not vectorstore:
18
- st.warning("No files found. Upload documents first.")
19
- return
20
-
21
- # Get all documents (chunks) in the vectorstore
22
- try:
23
- # Chroma stores documents as chunks, but we want to show original metadata
24
- # We'll group by file_id to show unique documents
25
- collection = vectorstore._collection
26
- all_docs = collection.get(include=["metadatas", "documents"]) # Removed 'ids'
27
- metadatas = all_docs["metadatas"]
28
- ids = all_docs["ids"] # ids are always returned
29
- documents = all_docs["documents"]
30
-
31
- # Group by file_id
32
- doc_map = {}
33
- for meta, doc_id, doc_text in zip(metadatas, ids, documents):
34
- file_id = meta.get("file_id", doc_id)
35
- if file_id not in doc_map:
36
- doc_map[file_id] = {
37
- "source": meta.get("source", "Unknown"),
38
- "university": meta.get("university", "Unknown"),
39
- "country": meta.get("country", "Unknown"),
40
- "document_type": meta.get("document_type", "Unknown"),
41
- "upload_timestamp": meta.get("upload_timestamp", "Unknown"),
42
- "file_id": file_id,
43
- "chunks": []
44
- }
45
- doc_map[file_id]["chunks"].append(doc_text)
46
-
47
- if not doc_map:
48
- st.info(get_text("no_documents", current_lang))
49
- return
50
-
51
- st.subheader(get_text("document_list", current_lang))
52
- for file_id, info in doc_map.items():
53
- with st.expander(f"{info['source']} ({info['university']}, {info['country']})"):
54
- st.write(f"**Type:** {info['document_type']}")
55
- st.write(f"**{get_text('last_updated', current_lang)}:** {info['upload_timestamp']}")
56
- st.write(f"**File ID:** {file_id}")
57
- st.write(f"**{get_text('total_chunks', current_lang)}:** {len(info['chunks'])}")
58
- if st.button(f"🗑️ Delete Document", key=f"del_{file_id}"):
59
- # Delete all chunks with this file_id
60
- ids_to_delete = [doc_id for meta, doc_id in zip(metadatas, ids) if meta.get("file_id", doc_id) == file_id]
61
- vectorstore._collection.delete(ids=ids_to_delete)
62
- st.success(f"Deleted document: {info['source']}")
63
- st.rerun()
64
-
65
- # Add Delete All button
66
- if doc_map:
67
- if st.button(get_text("delete_all", current_lang), key="del_all_docs", type="secondary"):
68
- all_ids = list(ids)
69
- vectorstore._collection.delete(ids=all_ids)
70
- st.success(get_text("documents_deleted", current_lang))
71
- st.rerun()
72
- except Exception as e:
73
- st.error(f"Error loading documents: {str(e)}")
my_pages/search_uni.py DELETED
@@ -1,104 +0,0 @@
1
- import streamlit as st
2
- from utils.translations import get_text
3
- from utils.rag_system import RAGSystem, save_query_result
4
-
5
- def search_page():
6
- lang = st.session_state.get('app_language', 'English')
7
-
8
- # --- Header & description ---
9
- st.header(get_text("search_header", lang))
10
- st.write(get_text("search_description", lang))
11
- if lang != "English":
12
- st.info(f'{get_text("responses_in", lang)} **{lang}**')
13
-
14
- # --- Initialize query_text ---
15
- if "query_text" not in st.session_state:
16
- st.session_state.query_text = ""
17
-
18
- # --- Example queries ---
19
- complex_examples = [
20
- get_text("example_complex_1", lang),
21
- get_text("example_complex_2", lang),
22
- get_text("example_complex_3", lang),
23
- get_text("example_complex_4", lang)
24
- ]
25
- simple_examples = [
26
- get_text("example_simple_1", lang),
27
- get_text("example_simple_2", lang),
28
- get_text("example_simple_3", lang),
29
- get_text("example_simple_4", lang)
30
- ]
31
-
32
- with st.expander(get_text("example_queries", lang)):
33
- tab1, tab2 = st.tabs([get_text("complex_queries", lang), get_text("simple_queries", lang)])
34
- with tab1:
35
- for i, ex in enumerate(complex_examples):
36
- if st.button(ex, key=f"complex_{i}", use_container_width=True):
37
- st.session_state.query_text = ex
38
- with tab2:
39
- for i, ex in enumerate(simple_examples):
40
- if st.button(ex, key=f"simple_{i}", use_container_width=True):
41
- st.session_state.query_text = ex
42
-
43
- # --- Query input ---
44
- st.text_area(
45
- get_text("your_question", lang),
46
- height=120,
47
- placeholder=get_text("placeholder_text", lang),
48
- key="query_text"
49
- )
50
-
51
- # --- Optional filters (initially empty) ---
52
- with st.expander(get_text("advanced_filters", lang)):
53
- col1, col2, col3 = st.columns(3)
54
-
55
- budget_options = [get_text(opt, lang) for opt in ["any", "under_10k", "10k_20k", "20k_30k", "30k_40k", "over_40k"]]
56
- study_level_options = [get_text(lvl, lang) for lvl in ["diploma", "bachelor", "master", "phd"]]
57
- country_options = [get_text(c, lang) for c in ["singapore", "malaysia", "thailand", "indonesia", "philippines", "vietnam", "brunei"]]
58
-
59
- selected_budget = col1.select_slider(get_text("budget_range", lang), options=budget_options, value=budget_options[0])
60
- selected_levels = col2.multiselect(get_text("study_level", lang), study_level_options, default=[])
61
- selected_countries = col3.multiselect(get_text("preferred_countries", lang), country_options, default=[])
62
-
63
- # --- Ensure RAG system is initialized once ---
64
- if "rag_system_ready" not in st.session_state:
65
- st.session_state.rag_system_ready = False
66
- try:
67
- st.session_state.rag_system = RAGSystem()
68
- st.session_state.rag_system_ready = True
69
- except Exception as e:
70
- st.error(f"Failed to initialize RAG system: {e}")
71
-
72
- # --- Search button ---
73
- search_disabled = not st.session_state.query_text.strip() or not st.session_state.rag_system_ready
74
-
75
- if st.button(get_text("search_button", lang), disabled=search_disabled):
76
- placeholder = st.empty()
77
- placeholder.info("Searching...")
78
-
79
- # Combine query with filter info
80
- filter_info = {
81
- "budget": selected_budget if selected_budget != budget_options[0] else None,
82
- "study_levels": selected_levels,
83
- "countries": selected_countries
84
- }
85
- full_query = f"{st.session_state.query_text.strip()}\nFilters: {filter_info}"
86
-
87
- # Call RAG system with filters
88
- query_result = st.session_state.rag_system.query(
89
- question=full_query,
90
- language=lang
91
- )
92
-
93
- placeholder.empty()
94
- save_query_result(query_result)
95
-
96
- st.success(query_result["answer"])
97
-
98
- if query_result["source_documents"]:
99
- st.markdown("#### Source Documents")
100
- for i, doc in enumerate(query_result["source_documents"], 1):
101
- st.markdown(
102
- f"- **{i}. {doc.metadata.get('source', 'Unknown')}** "
103
- f"({doc.metadata.get('university', 'Unknown')}, {doc.metadata.get('country', 'Unknown')})"
104
- )
my_pages/upload_documents.py DELETED
@@ -1,202 +0,0 @@
1
- from langchain.schema import Document
2
- import streamlit as st
3
- from utils.rag_system import DocumentIngestion
4
- from utils.translations import get_text
5
-
6
- def upload_documents_page():
7
- # Get current language from session state
8
- current_lang = st.session_state.get('app_language', 'English')
9
-
10
- st.header(get_text("upload_header", current_lang))
11
- st.write(get_text("upload_description", current_lang))
12
-
13
- # Add information about automatic metadata detection
14
- st.info("🤖 **Automatic Metadata Detection Enabled**: The system will automatically detect university name, country, and document type from your uploaded files using AI.")
15
-
16
- # File upload (removed manual metadata input fields)
17
- uploaded_files = st.file_uploader(
18
- get_text("choose_files", current_lang),
19
- accept_multiple_files=True,
20
- type=['pdf'],
21
- help=get_text("file_limit", current_lang)
22
- )
23
-
24
- # # Optional: Add language selection for processing (if needed for multilingual documents)
25
- # col1, col2 = st.columns(2)
26
- # with col1:
27
- # processing_language = st.selectbox(
28
- # f"🌐 Processing Language (Optional)",
29
- # ["Auto-detect", "English", "Chinese", "Malay", "Thai", "Indonesian", "Vietnamese", "Filipino"],
30
- # help="Select the primary language of your documents for better metadata extraction"
31
- # )
32
-
33
- # with col2:
34
- # # Optional: Allow users to override detected metadata if needed
35
- # allow_manual_override = st.checkbox(
36
- # "🔧 Allow manual metadata correction after processing",
37
- # value=False,
38
- # help="Enable this to manually correct any incorrectly detected metadata"
39
- # )
40
-
41
- if uploaded_files and st.button(get_text("process_documents", current_lang), type="primary"):
42
- with st.spinner(f"{get_text('processing_docs', current_lang)} (with automatic metadata detection)..."):
43
- try:
44
- # Initialize document ingestion
45
- doc_ingestion = DocumentIngestion()
46
-
47
- # Process documents with automatic metadata extraction
48
- documents = doc_ingestion.process_documents(uploaded_files)
49
-
50
- if documents:
51
- # Show detected metadata for review/correction if enabled
52
- # if allow_manual_override and documents:
53
- # st.subheader("🔍 Review Detected Metadata")
54
- # st.write("Review and correct the automatically detected metadata if needed:")
55
-
56
- # corrected_documents = []
57
- # for i, doc in enumerate(documents):
58
- # with st.expander(f"📄 {doc.metadata['source']}", expanded=False):
59
- # col1, col2, col3 = st.columns(3)
60
-
61
- # with col1:
62
- # corrected_university = st.text_input(
63
- # "University Name",
64
- # value=doc.metadata['university'],
65
- # key=f"uni_{i}"
66
- # )
67
-
68
- # with col2:
69
- # corrected_country = st.selectbox(
70
- # "Country",
71
- # ["Unknown", "Singapore", "Malaysia", "Thailand", "Indonesia",
72
- # "Philippines", "Vietnam", "Brunei", "Cambodia", "Laos", "Myanmar"],
73
- # index=0 if doc.metadata['country'] == "Unknown" else
74
- # (["Unknown", "Singapore", "Malaysia", "Thailand", "Indonesia",
75
- # "Philippines", "Vietnam", "Brunei", "Cambodia", "Laos", "Myanmar"].index(doc.metadata['country'])
76
- # if doc.metadata['country'] in ["Singapore", "Malaysia", "Thailand", "Indonesia",
77
- # "Philippines", "Vietnam", "Brunei", "Cambodia", "Laos", "Myanmar"] else 0),
78
- # key=f"country_{i}"
79
- # )
80
-
81
- # with col3:
82
- # corrected_doc_type = st.selectbox(
83
- # "Document Type",
84
- # ["admission_requirements", "tuition_fees", "program_information",
85
- # "scholarship_info", "application_deadlines", "general_info"],
86
- # index=["admission_requirements", "tuition_fees", "program_information",
87
- # "scholarship_info", "application_deadlines", "general_info"].index(doc.metadata['document_type']),
88
- # key=f"doctype_{i}"
89
- # )
90
-
91
- # # Update document metadata with corrections
92
- # corrected_doc = Document(
93
- # page_content=doc.page_content,
94
- # metadata={
95
- # **doc.metadata,
96
- # "university": corrected_university,
97
- # "country": corrected_country,
98
- # "document_type": corrected_doc_type,
99
- # "manually_corrected": True
100
- # }
101
- # )
102
- # corrected_documents.append(corrected_doc)
103
-
104
- # # Use corrected documents
105
- # documents = corrected_documents
106
-
107
- # if st.button("✅ Confirm and Save Documents", type="primary"):
108
- # # Create or update vector store with corrected metadata
109
- # vectorstore = doc_ingestion.create_vector_store(documents)
110
-
111
- # if vectorstore:
112
- # st.success(f"✅ {get_text('successfully_processed', current_lang)} {len(documents)} {get_text('documents', current_lang)} with corrected metadata!")
113
-
114
- # # Show final processed files
115
- # with st.expander("📋 Final Processed Files"):
116
- # for doc in documents:
117
- # st.write(f"• **{doc.metadata['source']}**")
118
- # st.write(f" - University: {doc.metadata['university']}")
119
- # st.write(f" - Country: {doc.metadata['country']}")
120
- # st.write(f" - Type: {doc.metadata['document_type']}")
121
- # if doc.metadata.get('manually_corrected'):
122
- # st.write(f" - ✏️ Manually corrected")
123
- # st.write("---")
124
- # else:
125
- # Process normally without manual override
126
- vectorstore = doc_ingestion.create_vector_store(documents)
127
-
128
- if vectorstore:
129
- st.success(f"✅ {get_text('successfully_processed', current_lang)} {len(documents)} {get_text('documents', current_lang)} with automatic metadata detection!")
130
-
131
- # Show processed files with detected metadata
132
- with st.expander("📋 Processed Files with Detected Metadata"):
133
- for doc in documents:
134
- st.write(f"• **{doc.metadata['source']}**")
135
- st.write(f" - 🏫 University: {doc.metadata['university']}")
136
- st.write(f" - 🌏 Country: {doc.metadata['country']}")
137
- st.write(f" - 📋 Type: {doc.metadata['document_type']}")
138
- st.write(f" - 🤖 Auto-detected: Yes")
139
- st.write("---")
140
-
141
- # Show summary of detected metadata
142
- universities = list(set([doc.metadata['university'] for doc in documents if doc.metadata['university'] != 'Unknown']))
143
- countries = list(set([doc.metadata['country'] for doc in documents if doc.metadata['country'] != 'Unknown']))
144
- doc_types = list(set([doc.metadata['document_type'] for doc in documents]))
145
-
146
- if universities or countries or doc_types:
147
- st.subheader("📊 Detection Summary")
148
- if universities:
149
- st.write(f"🏫 **Universities detected**: {', '.join(universities)}")
150
- if countries:
151
- st.write(f"🌏 **Countries detected**: {', '.join(countries)}")
152
- if doc_types:
153
- st.write(f"📋 **Document types detected**: {', '.join(doc_types)}")
154
- else:
155
- st.error(get_text("no_docs_processed", current_lang))
156
-
157
- except Exception as e:
158
- st.error(f"{get_text('failed_to_process', current_lang)}: {str(e)}")
159
- st.error("Please check your API keys and model configurations.")
160
-
161
- # Additional helper function for metadata validation
162
- def validate_metadata(metadata: dict) -> dict:
163
- """Validate and clean extracted metadata"""
164
-
165
- # List of valid countries for ASEAN region
166
- valid_countries = [
167
- "Singapore", "Malaysia", "Thailand", "Indonesia",
168
- "Philippines", "Vietnam", "Brunei", "Cambodia", "Laos", "Myanmar"
169
- ]
170
-
171
- # List of valid document types
172
- valid_doc_types = [
173
- "admission_requirements", "tuition_fees", "program_information",
174
- "scholarship_info", "application_deadlines", "general_info"
175
- ]
176
-
177
- # Clean and validate country
178
- if metadata.get('country', '').strip():
179
- country = metadata['country'].strip()
180
- # Try to match with valid countries (case insensitive)
181
- for valid_country in valid_countries:
182
- if valid_country.lower() in country.lower() or country.lower() in valid_country.lower():
183
- metadata['country'] = valid_country
184
- break
185
- else:
186
- # If no match found, keep original but mark as unvalidated
187
- if country.lower() not in [c.lower() for c in valid_countries]:
188
- metadata['country'] = country # Keep original
189
-
190
- # Validate document type
191
- if metadata.get('document_type') not in valid_doc_types:
192
- metadata['document_type'] = "general_info" # Default fallback
193
-
194
- # Clean university name
195
- if metadata.get('university_name'):
196
- # Remove common prefixes/suffixes that might be incorrectly included
197
- university = metadata['university_name'].strip()
198
- # Remove quotes if present
199
- university = university.strip('"\'')
200
- metadata['university_name'] = university
201
-
202
- return metadata
requirements.txt CHANGED
@@ -1,5 +1,4 @@
1
- #change requirements
2
-
3
  aiohappyeyeballs==2.6.1
4
  aiohttp==3.12.15
5
  aiosignal==1.4.0
@@ -10,6 +9,7 @@ attrs==25.3.0
10
  backoff==2.2.1
11
  bcrypt==4.3.0
12
  blinker==1.9.0
 
13
  build==1.3.0
14
  cachetools==5.5.2
15
  certifi==2025.8.3
@@ -20,6 +20,8 @@ coloredlogs==15.0.1
20
  dataclasses-json==0.6.7
21
  distro==1.9.0
22
  durationpy==0.10
 
 
23
  filelock==3.18.0
24
  flatbuffers==25.2.10
25
  frozenlist==1.7.0
@@ -28,6 +30,9 @@ gitdb==4.0.12
28
  GitPython==3.1.45
29
  google-auth==2.40.3
30
  googleapis-common-protos==1.70.0
 
 
 
31
  grpcio==1.74.0
32
  h11==0.16.0
33
  hf-xet==1.1.7
@@ -91,12 +96,14 @@ pydantic==2.11.7
91
  pydantic-settings==2.10.1
92
  pydantic_core==2.33.2
93
  pydeck==0.9.1
 
94
  Pygments==2.19.2
95
  PyPDF2==3.0.1
96
  PyPika==0.48.9
97
  pyproject_hooks==1.2.0
98
  python-dateutil==2.9.0.post0
99
  python-dotenv==1.1.1
 
100
  pytz==2025.2
101
  PyYAML==6.0.2
102
  referencing==0.36.2
@@ -107,15 +114,19 @@ requests-toolbelt==1.0.0
107
  rich==14.1.0
108
  rpds-py==0.27.0
109
  rsa==4.9.1
 
 
110
  safetensors==0.6.2
111
  scikit-learn==1.7.1
112
  scipy==1.16.1
 
113
  sentence-transformers==5.1.0
114
  shellingham==1.5.4
115
  six==1.17.0
116
  smmap==5.0.2
117
  sniffio==1.3.1
118
-
 
119
  streamlit==1.48.0
120
  sympy==1.14.0
121
  tenacity==9.1.2
@@ -123,6 +134,7 @@ threadpoolctl==3.6.0
123
  tiktoken==0.11.0
124
  tokenizers==0.21.4
125
  toml==0.10.2
 
126
  torch==2.8.0
127
  tornado==6.5.2
128
  tqdm==4.67.1
 
1
+ aiofiles==24.1.0
 
2
  aiohappyeyeballs==2.6.1
3
  aiohttp==3.12.15
4
  aiosignal==1.4.0
 
9
  backoff==2.2.1
10
  bcrypt==4.3.0
11
  blinker==1.9.0
12
+ Brotli==1.1.0
13
  build==1.3.0
14
  cachetools==5.5.2
15
  certifi==2025.8.3
 
20
  dataclasses-json==0.6.7
21
  distro==1.9.0
22
  durationpy==0.10
23
+ fastapi==0.116.1
24
+ ffmpy==0.6.1
25
  filelock==3.18.0
26
  flatbuffers==25.2.10
27
  frozenlist==1.7.0
 
30
  GitPython==3.1.45
31
  google-auth==2.40.3
32
  googleapis-common-protos==1.70.0
33
+ gradio==5.42.0
34
+ gradio_client==1.11.1
35
+ groovy==0.1.2
36
  grpcio==1.74.0
37
  h11==0.16.0
38
  hf-xet==1.1.7
 
96
  pydantic-settings==2.10.1
97
  pydantic_core==2.33.2
98
  pydeck==0.9.1
99
+ pydub==0.25.1
100
  Pygments==2.19.2
101
  PyPDF2==3.0.1
102
  PyPika==0.48.9
103
  pyproject_hooks==1.2.0
104
  python-dateutil==2.9.0.post0
105
  python-dotenv==1.1.1
106
+ python-multipart==0.0.20
107
  pytz==2025.2
108
  PyYAML==6.0.2
109
  referencing==0.36.2
 
114
  rich==14.1.0
115
  rpds-py==0.27.0
116
  rsa==4.9.1
117
+ ruff==0.12.8
118
+ safehttpx==0.1.6
119
  safetensors==0.6.2
120
  scikit-learn==1.7.1
121
  scipy==1.16.1
122
+ semantic-version==2.10.0
123
  sentence-transformers==5.1.0
124
  shellingham==1.5.4
125
  six==1.17.0
126
  smmap==5.0.2
127
  sniffio==1.3.1
128
+ SQLAlchemy==2.0.43
129
+ starlette==0.47.2
130
  streamlit==1.48.0
131
  sympy==1.14.0
132
  tenacity==9.1.2
 
134
  tiktoken==0.11.0
135
  tokenizers==0.21.4
136
  toml==0.10.2
137
+ tomlkit==0.13.3
138
  torch==2.8.0
139
  tornado==6.5.2
140
  tqdm==4.67.1
runtime.txt DELETED
@@ -1 +0,0 @@
1
- python-3.10.12
 
 
start.sh DELETED
@@ -1,43 +0,0 @@
1
- #!/bin/bash
2
-
3
- # PanSea University Search - Startup Script
4
-
5
- echo "🎓 Starting PanSea University Search..."
6
-
7
- # Check if virtual environment exists
8
- if [ ! -d ".venv" ]; then
9
- echo "❌ Virtual environment not found. Please run setup first."
10
- exit 1
11
- fi
12
-
13
- # Activate virtual environment
14
- source .venv/bin/activate
15
-
16
- # Check if .env file exists
17
- if [ ! -f ".env" ]; then
18
- echo "⚠️ .env file not found. Please create one with your OpenAI API key."
19
- echo "Example:"
20
- echo "OPENAI_API_KEY=your_api_key_here"
21
- exit 1
22
- fi
23
-
24
- # Create necessary directories
25
- mkdir -p chroma_db
26
- mkdir -p documents
27
- mkdir -p query_results
28
-
29
- # Check if required packages are installed
30
- echo "🔍 Checking dependencies..."
31
- python -c "import streamlit, langchain, chromadb" 2>/dev/null
32
- if [ $? -ne 0 ]; then
33
- echo "❌ Dependencies not found. Installing..."
34
- pip install -r requirements.txt
35
- fi
36
-
37
- echo "🚀 Starting Streamlit application..."
38
- echo "📱 Open your browser to: http://localhost:8501"
39
- echo "🛑 Press Ctrl+C to stop the application"
40
- echo ""
41
-
42
- # Start the Streamlit app
43
- streamlit run app.py --server.port=8501 --server.address=0.0.0.0
tabs/help.py ADDED
@@ -0,0 +1,168 @@
1
+ """
2
+ Help tab functionality for the Gradio app
3
+ """
4
+ import gradio as gr
5
+
6
+ def create_help_tab(global_vars):
7
+ """Create the Help tab with comprehensive documentation"""
8
+ with gr.Tab("❓ Help", id="help"):
9
+ gr.Markdown("""
10
+ # 🌏 PANSEA University Requirements Assistant - User Guide
11
+
12
+ Welcome to the PANSEA (Pan-Southeast Asian) University Requirements Assistant! This tool helps you navigate university admission requirements across Southeast Asian countries using advanced AI-powered document analysis.
13
+
14
+ ---
15
+
16
+ ## 🚀 Getting Started
17
+
18
+ ### Step 1: Initialize the System
19
+ 1. Go to the **🚀 Initialize System** tab
20
+ 2. Click **"Initialize Systems"**
21
+ 3. Wait for the success message
22
+ 4. The system will set up AI models and document processing capabilities
23
+
24
+ ### Step 2: Upload Documents
25
+ 1. Navigate to the **📄 Upload Documents** tab
26
+ 2. Select one or more PDF files containing university requirement information
27
+ 3. No manual metadata entry is needed; the system detects it automatically for each file:
28
+ - **University Name**: the official name of the institution
29
+ - **Country**: the Southeast Asian country the document covers
30
+ - **Document Type**: e.g. admission requirements or tuition fees
31
+ - **Language**: the language the document is written in
32
+ 4. Click **"Process Documents"**
33
+ 5. Wait for processing completion
34
+
35
+ ### Step 3: Query Documents
36
+ 1. Go to the **🔍 Search & Query** tab
37
+ 2. Type your question in the query box
38
+ 3. Click **"Search Documents"**
39
+ 4. Review the AI-generated answer and source references
40
+ 5. Use example questions to explore different types of queries
41
+
42
+ ### Step 4: Manage Documents
43
+ 1. Visit the **🗂 Manage Documents** tab
44
+ 2. View all uploaded documents and statistics
45
+ 3. Delete individual documents or clear all documents as needed
46
+
47
+ ---
48
+
49
+ ## 📖 Features Overview
50
+
51
+ ### 🤖 AI-Powered Analysis
52
+ - Uses advanced SEA-LION AI models optimized for Southeast Asian contexts
53
+ - Semantic search across your document collection
54
+ - Contextual answers with source citations
55
+ - Multi-language document support
56
+
57
+ ### 📚 Document Management
58
+ - Support for PDF documents
59
+ - Intelligent text chunking for better search results
60
+ - Metadata tracking (university, country, document type, language)
61
+ - Easy document deletion and management
62
+
63
+ ### 🌐 Regional Focus
64
+ - Specialized for Southeast Asian universities
65
+ - Supports multiple countries and languages
66
+ - Culturally aware responses
67
+ - Up-to-date admission requirement information
68
+
69
+ ---
70
+
71
+ ## 💡 Usage Tips
72
+
73
+ ### Asking Better Questions
74
+ - **Be Specific**: "What are the English proficiency requirements for Computer Science at NUS?" instead of "What are the requirements?"
75
+ - **Include Context**: Mention specific programs, countries, or universities you're interested in
76
+ - **Use Keywords**: Include terms like "admission", "requirements", "GPA", "test scores", etc.
77
+
78
+ ### Document Upload Best Practices
79
+ - **Quality Documents**: Upload official university brochures, requirement documents, or application guides
80
+ - **Accurate Metadata**: Fill in all metadata fields correctly for better search results
81
+ - **Regular Updates**: Replace outdated documents with current versions
82
+ - **Organized Approach**: Upload documents systematically by country or university
83
+
84
+ ### Managing Your Knowledge Base
85
+ - **Regular Maintenance**: Remove outdated documents periodically
86
+ - **Logical Organization**: Group related documents together
87
+ - **Backup Important Queries**: Save important answers for future reference
88
+
89
+ ---
90
+
91
+ ## 🛠 Troubleshooting
92
+
93
+ ### Common Issues
94
+
95
+ **Problem**: "Please initialize systems first" error
96
+ - **Solution**: Go to the Initialize System tab and click "Initialize Systems"
97
+
98
+ **Problem**: Document upload fails
99
+ - **Solution**: Ensure PDF files are not corrupted and contain text (not just images)
100
+
101
+ **Problem**: No search results
102
+ - **Solution**: Check if documents are uploaded and try different keywords
103
+
104
+ **Problem**: Slow performance
105
+ - **Solution**: Wait for processing to complete, avoid uploading too many large documents at once
106
+
107
+ ### Technical Requirements
108
+ - **File Format**: PDF documents only
109
+ - **File Size**: Reasonable size limits (avoid extremely large files)
110
+ - **Content**: Text-based PDFs work best (scanned images may not work well)
111
+ - **Internet**: Required for AI model access
112
+
113
+ ---
114
+
115
+ ## 📊 Understanding Results
116
+
117
+ ### Query Responses
118
+ - **Answer**: AI-generated response based on your documents
119
+ - **Sources**: Specific document chunks used to generate the answer
120
+ - **Confidence**: Implied by the specificity and detail of the response
121
+ - **Context**: Related information that might be helpful
122
+
123
+ ### Document Statistics
124
+ - **Total Documents**: Number of unique documents uploaded
125
+ - **Total Chunks**: Number of text segments for searching
126
+ - **Metadata**: Information about each document's origin and type
127
+
128
+ ---
129
+
130
+ ## 🌟 Best Practices for University Research
131
+
132
+ ### Research Strategy
133
+ 1. **Start Broad**: Upload general university information first
134
+ 2. **Get Specific**: Add detailed program requirements
135
+ 3. **Compare Options**: Query for comparisons between universities
136
+ 4. **Verify Information**: Cross-reference with official university websites
137
+
138
+ ### Question Types to Try
139
+ - **Admission Requirements**: "What are the minimum GPA requirements for..."
140
+ - **Test Scores**: "What IELTS/TOEFL scores are needed for..."
141
+ - **Application Deadlines**: "When is the application deadline for..."
142
+ - **Program Details**: "What courses are included in the... program at..."
143
+ - **Scholarships**: "What scholarship opportunities are available for..."
144
+
145
+ ---
146
+
147
+ ## 🆘 Support & Feedback
148
+
149
+ If you encounter issues or have suggestions for improvement:
150
+
151
+ 1. **Check Documentation**: Review this help section first
152
+ 2. **Try Different Approaches**: Rephrase your queries or check document formats
153
+ 3. **Document Issues**: Note specific error messages or unexpected behavior
154
+ 4. **Feature Requests**: Consider what additional functionality would be helpful
155
+
156
+ ---
157
+
158
+ ## 🔄 Version Information
159
+
160
+ **Current Version**: Gradio-based PANSEA Assistant
161
+ **AI Models**: SEA-LION optimized for Southeast Asian contexts
162
+ **Document Processing**: Advanced semantic chunking and embedding
163
+ **Search Technology**: Vector similarity search with contextual ranking
164
+
165
+ ---
166
+
167
+ *Happy university hunting! 🎓 We hope this tool helps you find the perfect educational opportunity in Southeast Asia.*
168
+ """)
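The tab modules added in this commit all follow the same pattern: a `create_*_tab(global_vars)` builder that renders its UI inside a `gr.Tab` and reads shared state from a plain dict. A minimal sketch of how these builders might be wired together in an entry point; the `app.py` layout and `gr.Blocks` usage here are assumptions, not part of this commit:

```python
import gradio as gr

from tabs.initialize import create_initialize_tab
from tabs.upload import create_upload_tab
from tabs.query import create_query_tab
from tabs.manage import create_manage_tab
from tabs.help import create_help_tab

# Shared mutable state passed into every tab builder
global_vars = {"doc_ingestion": None, "rag_system": None, "vectorstore": None}

with gr.Blocks(title="PANSEA University Requirements Assistant") as demo:
    with gr.Tabs():
        create_initialize_tab(global_vars)
        create_upload_tab(global_vars)
        create_query_tab(global_vars)
        create_manage_tab(global_vars)
        create_help_tab(global_vars)

if __name__ == "__main__":
    demo.launch()
```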
tabs/initialize.py ADDED
@@ -0,0 +1,55 @@
1
+ """
2
+ Initialize tab functionality for the Gradio app
3
+ """
4
+ import gradio as gr
5
+ from utils.rag_system import DocumentIngestion, RAGSystem
6
+
7
+ def initialize_systems(global_vars):
8
+ """Initialize the RAG systems"""
9
+ try:
10
+ print("🚀 Initializing document ingestion system...")
11
+ global_vars['doc_ingestion'] = DocumentIngestion()
12
+ print("🚀 Initializing RAG system...")
13
+ global_vars['rag_system'] = RAGSystem()
14
+ return "✅ Systems initialized successfully! You can now upload documents."
15
+ except Exception as e:
16
+ error_msg = f"❌ Error initializing systems: {str(e)}\n\n"
17
+
18
+ if "sentence-transformers" in str(e):
19
+ error_msg += """
20
+ **Possible solutions:**
21
+ 1. Install sentence-transformers: `pip install sentence-transformers`
22
+ 2. Or provide OpenAI API key in environment variables
23
+ 3. Check that PyTorch is properly installed
24
+
25
+ **For deployment:**
26
+ - Ensure requirements.txt includes: sentence-transformers, torch, transformers
27
+ """
28
+ return error_msg
29
+
30
+ def create_initialize_tab(global_vars):
31
+ """Create the Initialize System tab"""
32
+ with gr.Tab("🚀 Initialize System", id="init"):
33
+ gr.Markdown("""
34
+ ### Step 1: Initialize the System
35
+ Click the button below to initialize the AI models and embedding systems.
36
+ This may take a few moments on first run as models are downloaded.
37
+ """)
38
+
39
+ init_btn = gr.Button(
40
+ "🚀 Initialize Systems",
41
+ variant="primary",
42
+ size="lg"
43
+ )
44
+
45
+ init_status = gr.Textbox(
46
+ label="Initialization Status",
47
+ interactive=False,
48
+ lines=8,
49
+ placeholder="Click 'Initialize Systems' to start..."
50
+ )
51
+
52
+ init_btn.click(
53
+ lambda: initialize_systems(global_vars),
54
+ outputs=init_status
55
+ )
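Outside the Gradio UI, the same helper can be exercised directly, which is handy for smoke-testing the SEA-LION and embedding setup before launching the app. A usage sketch, assuming the required API keys are already set in the environment:

```python
from tabs.initialize import initialize_systems

global_vars = {}
status = initialize_systems(global_vars)
print(status)  # "✅ Systems initialized successfully! ..." or an error message with install hints
```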
tabs/manage.py ADDED
@@ -0,0 +1,237 @@
1
+ """
2
+ Manage documents tab functionality for the Gradio app
3
+ """
4
+ import gradio as gr
5
+
6
+ def manage_documents(global_vars):
7
+ """Manage uploaded documents - view, delete individual or all documents"""
8
+ doc_ingestion = global_vars.get('doc_ingestion')
9
+
10
+ if not doc_ingestion:
11
+ return "❌ Please initialize systems first!", "", []
12
+
13
+ try:
14
+ vectorstore = doc_ingestion.load_existing_vectorstore()
15
+
16
+ if not vectorstore:
17
+ return "⚠️ No documents found. Upload documents first.", "", []
18
+
19
+ # Get all documents from vectorstore
20
+ collection = vectorstore._collection
21
+ all_docs = collection.get(include=["metadatas", "documents"])
22
+ metadatas = all_docs["metadatas"]
23
+ ids = all_docs["ids"]
24
+ documents = all_docs["documents"]
25
+
26
+ # Group by file_id to show unique documents
27
+ doc_map = {}
28
+ for meta, doc_id, doc_text in zip(metadatas, ids, documents):
29
+ file_id = meta.get("file_id", doc_id)
30
+ if file_id not in doc_map:
31
+ doc_map[file_id] = {
32
+ "source": meta.get("source", "Unknown"),
33
+ "university": meta.get("university", "Unknown"),
34
+ "country": meta.get("country", "Unknown"),
35
+ "document_type": meta.get("document_type", "Unknown"),
36
+ "language": meta.get("language", "Unknown"),
37
+ "upload_timestamp": meta.get("upload_timestamp", "Unknown"),
38
+ "file_id": file_id,
39
+ "chunks": []
40
+ }
41
+ doc_map[file_id]["chunks"].append(doc_text)
42
+
43
+ if not doc_map:
44
+ return "ℹ️ No documents found in the system.", "", []
45
+
46
+ # Create summary
47
+ total_documents = len(doc_map)
48
+ total_chunks = sum(len(info["chunks"]) for info in doc_map.values())
49
+
50
+ summary = f"""## 📊 Document Statistics
51
+
52
+ **Total Documents:** {total_documents}
53
+ **Total Text Chunks:** {total_chunks}
54
+ **Storage Status:** Active
55
+
56
+ ## 📚 Document List
57
+ """
58
+
59
+ # Create document list with details
60
+ document_list = ""
61
+ file_id_list = []
62
+
63
+ for i, (file_id, info) in enumerate(doc_map.items(), 1):
64
+ timestamp = info['upload_timestamp'][:19] if len(info['upload_timestamp']) > 19 else info['upload_timestamp']
65
+
66
+ document_list += f"""
67
+ **{i}. {info['source']}**
68
+ - University: {info['university']}
69
+ - Country: {info['country']}
70
+ - Type: {info['document_type']}
71
+ - Language: {info['language']}
72
+ - Chunks: {len(info['chunks'])}
73
+ - Uploaded: {timestamp}
74
+ - File ID: `{file_id}`
75
+
76
+ ---
77
+ """
78
+ file_id_list.append(file_id)
79
+
80
+ # Create dropdown options for individual deletion
81
+ file_options = [f"{info['source']} ({info['university']})" for info in doc_map.values()]
82
+
83
+ return summary, document_list, file_options
84
+
85
+ except Exception as e:
86
+ return f"❌ Error loading documents: {str(e)}", "", []
87
+
88
+ def delete_document(selected_file, current_doc_list, global_vars):
89
+ """Delete a specific document"""
90
+ doc_ingestion = global_vars.get('doc_ingestion')
91
+
92
+ if not doc_ingestion or not selected_file:
93
+ return "❌ Please select a document to delete.", current_doc_list
94
+
95
+ try:
96
+ vectorstore = doc_ingestion.load_existing_vectorstore()
97
+ if not vectorstore:
98
+ return "❌ No vectorstore found.", current_doc_list
99
+
100
+ # Get all documents and find the matching file_id
101
+ collection = vectorstore._collection
102
+ all_docs = collection.get(include=["metadatas"])
103
+ metadatas = all_docs["metadatas"]
104
+ ids = all_docs["ids"]
105
+
106
+ # Find file_id for the selected document
107
+ target_file_id = None
108
+ for meta, doc_id in zip(metadatas, ids):
109
+ source = meta.get("source", "Unknown")
110
+ university = meta.get("university", "Unknown")
111
+ if f"{source} ({university})" == selected_file:
112
+ target_file_id = meta.get("file_id", doc_id)
113
+ break
114
+
115
+ if not target_file_id:
116
+ return "❌ Document not found.", current_doc_list
117
+
118
+ # Delete all chunks with this file_id
119
+ ids_to_delete = [doc_id for meta, doc_id in zip(metadatas, ids) if meta.get("file_id", doc_id) == target_file_id]
120
+ collection.delete(ids=ids_to_delete)
121
+
122
+ # Refresh the document list
123
+ _, new_doc_list, _ = manage_documents(global_vars)
124
+
125
+ return f"✅ Successfully deleted document: {selected_file}", new_doc_list
126
+
127
+ except Exception as e:
128
+ return f"❌ Error deleting document: {str(e)}", current_doc_list
129
+
130
+ def delete_all_documents(global_vars):
131
+ """Delete all documents from the vectorstore"""
132
+ doc_ingestion = global_vars.get('doc_ingestion')
133
+
134
+ if not doc_ingestion:
135
+ return "❌ Please initialize systems first.", ""
136
+
137
+ try:
138
+ vectorstore_instance = doc_ingestion.load_existing_vectorstore()
139
+ if not vectorstore_instance:
140
+ return "⚠️ No documents found to delete.", ""
141
+
142
+ # Get all document IDs
143
+ collection = vectorstore_instance._collection
144
+ all_docs = collection.get()
145
+ all_ids = all_docs["ids"]
146
+
147
+ # Delete all documents
148
+ if all_ids:
149
+ collection.delete(ids=all_ids)
150
+ # Clear global vectorstore
151
+ global_vars['vectorstore'] = None
152
+ return f"✅ Successfully deleted all {len(all_ids)} document chunks.", ""
153
+ else:
154
+ return "ℹ️ No documents found to delete.", ""
155
+
156
+ except Exception as e:
157
+ return f"❌ Error deleting all documents: {str(e)}", ""
158
+
159
+ def create_manage_tab(global_vars):
160
+ """Create the Manage Documents tab"""
161
+ with gr.Tab("🗂 Manage Documents", id="manage"):
162
+ gr.Markdown("""
163
+ ### Step 4: Manage Your Documents
164
+ View, inspect, and manage all uploaded documents in your knowledge base.
165
+ You can see document details and delete individual documents or all documents.
166
+ """)
167
+
168
+ # Buttons for actions
169
+ with gr.Row():
170
+ refresh_btn = gr.Button("🔄 Refresh Document List", variant="secondary")
171
+ delete_all_btn = gr.Button("🗑️ Delete All Documents", variant="stop")
172
+
173
+ # Document statistics and list
174
+ doc_summary = gr.Markdown(
175
+ value="📊 Click 'Refresh Document List' to view your documents.",
176
+ label="Document Summary"
177
+ )
178
+
179
+ doc_list = gr.Markdown(
180
+ value="📚 Document details will appear here after refresh.",
181
+ label="Document List"
182
+ )
183
+
184
+ # Individual document deletion
185
+ gr.Markdown("### 🗑️ Delete Individual Document")
186
+
187
+ with gr.Row():
188
+ file_selector = gr.Dropdown(
189
+ choices=[],
190
+ label="Select Document to Delete",
191
+ interactive=True,
192
+ info="First click 'Refresh Document List' to see available documents"
193
+ )
194
+ delete_single_btn = gr.Button("🗑️ Delete Selected", variant="stop")
195
+
196
+ delete_status = gr.Textbox(
197
+ label="Action Status",
198
+ interactive=False,
199
+ lines=2,
200
+ placeholder="Deletion status will appear here..."
201
+ )
202
+
203
+ # Event handlers
204
+ def refresh_documents():
205
+ summary, documents, file_options = manage_documents(global_vars)
206
+ # Update dropdown choices
207
+ return summary, documents, gr.Dropdown(choices=file_options, value=None)
208
+
209
+ def delete_selected_document(selected_file, current_list):
210
+ if not selected_file:
211
+ return "❌ Please select a document to delete first.", current_list, gr.Dropdown(choices=[])
212
+
213
+ status, new_list = delete_document(selected_file, current_list, global_vars)
214
+ # Also refresh the file options after deletion
215
+ _, _, new_options = manage_documents(global_vars)
216
+ return status, new_list, gr.Dropdown(choices=new_options, value=None)
217
+
218
+ def delete_all_docs():
219
+ status, empty_list = delete_all_documents(global_vars)
220
+ return status, "📚 No documents in the system.", gr.Dropdown(choices=[], value=None)
221
+
222
+ # Connect event handlers
223
+ refresh_btn.click(
224
+ refresh_documents,
225
+ outputs=[doc_summary, doc_list, file_selector]
226
+ )
227
+
228
+ delete_single_btn.click(
229
+ delete_selected_document,
230
+ inputs=[file_selector, doc_list],
231
+ outputs=[delete_status, doc_list, file_selector]
232
+ )
233
+
234
+ delete_all_btn.click(
235
+ delete_all_docs,
236
+ outputs=[delete_status, doc_list, file_selector]
237
+ )
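Because the management helpers only depend on the shared `global_vars` dict, they can also be driven headlessly, for example to inspect or prune the Chroma collection from a script. A sketch, assuming systems have been initialized and documents uploaded beforehand:

```python
from tabs.initialize import initialize_systems
from tabs.manage import manage_documents, delete_document

global_vars = {}
initialize_systems(global_vars)

summary, doc_list, file_options = manage_documents(global_vars)
print(summary)

if file_options:
    # entries look like "brochure.pdf (Universiti Malaya)"
    status, _ = delete_document(file_options[0], doc_list, global_vars)
    print(status)
```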
tabs/query.py ADDED
@@ -0,0 +1,139 @@
1
+ """
2
+ Query documents tab functionality for the Gradio app
3
+ """
4
+ import gradio as gr
5
+
6
+ def query_documents(question, language, global_vars):
7
+ """Handle document queries"""
8
+ rag_system = global_vars.get('rag_system')
9
+ vectorstore = global_vars.get('vectorstore')
10
+
11
+ if not rag_system:
12
+ return "❌ Please initialize systems first using the 'Initialize System' tab!"
13
+
14
+ if not vectorstore:
15
+ return "❌ Please upload and process documents first using the 'Upload Documents' tab!"
16
+
17
+ if not question.strip():
18
+ return "❌ Please enter a question."
19
+
20
+ try:
21
+ print(f"🔍 Processing query: {question}")
22
+ result = rag_system.query(question, language)
23
+
24
+ # Format response
25
+ answer = result["answer"]
26
+ sources = result.get("source_documents", [])
27
+ model_used = result.get("model_used", "SEA-LION")
28
+
29
+ # Add model information
30
+ response = f"**Model Used:** {model_used}\n\n"
31
+ response += f"**Answer:**\n{answer}\n\n"
32
+
33
+ if sources:
34
+ response += "**📚 Sources:**\n"
35
+ for i, doc in enumerate(sources[:3], 1):
36
+ metadata = doc.metadata
37
+ source_name = metadata.get('source', 'Unknown')
38
+ university = metadata.get('university', 'Unknown')
39
+ country = metadata.get('country', 'Unknown')
40
+ doc_type = metadata.get('document_type', 'Unknown')
41
+
42
+ response += f"{i}. **{source_name}**\n"
43
+ response += f" - University: {university}\n"
44
+ response += f" - Country: {country}\n"
45
+ response += f" - Type: {doc_type}\n"
46
+ response += f" - Preview: {doc.page_content[:150]}...\n\n"
47
+ else:
48
+ response += "\n*No specific sources found. This might be a general response.*"
49
+
50
+ return response
51
+
52
+ except Exception as e:
53
+ return f"❌ Error querying documents: {str(e)}\n\nPlease check the console for more details."
54
+
55
+ def get_example_questions():
56
+ """Return example questions for the interface"""
57
+ return [
58
+ "What are the admission requirements for Computer Science programs in Singapore?",
59
+ "Which universities offer scholarships for international students?",
60
+ "What are the tuition fees for MBA programs in Thailand?",
61
+ "Find universities with engineering programs under $5000 per year",
62
+ "What are the application deadlines for programs in Malaysia?",
63
+ "Compare admission requirements between different ASEAN countries"
64
+ ]
65
+
66
+ def create_query_tab(global_vars):
67
+ """Create the Search & Query tab"""
68
+ with gr.Tab("🔍 Search & Query", id="query"):
69
+ gr.Markdown("""
70
+ ### Step 3: Ask Questions
71
+ Ask questions about the uploaded documents in your preferred language.
72
+ The AI will provide detailed answers with source citations.
73
+ """)
74
+
75
+ with gr.Row():
76
+ with gr.Column(scale=3):
77
+ question_input = gr.Textbox(
78
+ label="💭 Your Question",
79
+ placeholder="Ask anything about the universities...",
80
+ lines=3
81
+ )
82
+
83
+ with gr.Column(scale=1):
84
+ language_dropdown = gr.Dropdown(
85
+ choices=[
86
+ "English", "Chinese", "Malay", "Thai",
87
+ "Indonesian", "Vietnamese", "Filipino"
88
+ ],
89
+ value="English",
90
+ label="🌍 Response Language"
91
+ )
92
+
93
+ query_btn = gr.Button(
94
+ "🔍 Search Documents",
95
+ variant="primary",
96
+ size="lg"
97
+ )
98
+
99
+ answer_output = gr.Textbox(
100
+ label="🤖 AI Response",
101
+ interactive=False,
102
+ lines=20,
103
+ placeholder="Ask a question to get AI-powered answers..."
104
+ )
105
+
106
+ # Example questions section
107
+ gr.Markdown("### 💡 Example Questions")
108
+ example_questions = get_example_questions()
109
+
110
+ with gr.Row():
111
+ for i in range(0, len(example_questions), 2):
112
+ with gr.Column():
113
+ if i < len(example_questions):
114
+ example_btn = gr.Button(
115
+ example_questions[i],
116
+ size="sm",
117
+ variant="secondary"
118
+ )
119
+ example_btn.click(
120
+ lambda x=example_questions[i]: x,
121
+ outputs=question_input
122
+ )
123
+
124
+ if i + 1 < len(example_questions):
125
+ example_btn2 = gr.Button(
126
+ example_questions[i + 1],
127
+ size="sm",
128
+ variant="secondary"
129
+ )
130
+ example_btn2.click(
131
+ lambda x=example_questions[i + 1]: x,
132
+ outputs=question_input
133
+ )
134
+
135
+ query_btn.click(
136
+ lambda question, language: query_documents(question, language, global_vars),
137
+ inputs=[question_input, language_dropdown],
138
+ outputs=answer_output
139
+ )
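`query_documents` relies on a specific shape for the dict returned by `RAGSystem.query`. For clarity, this is the contract the formatting code above assumes; the field names are taken from the code, while the values below are purely illustrative:

```python
# Expected result shape consumed by query_documents()
result = {
    "answer": "NUS requires an IELTS score of at least 6.5 ...",  # illustrative text
    "model_used": "SEA-LION",                                      # optional, defaults to "SEA-LION"
    "source_documents": [                                          # optional list of LangChain Documents;
        # each item exposes .metadata (source, university, country, document_type)
        # and .page_content, of which the first 150 characters are shown as a preview
    ],
}
```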
tabs/upload.py ADDED
@@ -0,0 +1,99 @@
1
+ """
2
+ Upload documents tab functionality for the Gradio app
3
+ """
4
+ import gradio as gr
5
+
6
+ def upload_documents(files, global_vars):
7
+ """Handle document upload and processing"""
8
+ doc_ingestion = global_vars.get('doc_ingestion')
9
+
10
+ if not doc_ingestion:
11
+ return "❌ Please initialize systems first using the 'Initialize System' tab!"
12
+
13
+ if not files:
14
+ return "❌ Please upload at least one PDF file."
15
+
16
+ try:
17
+ # Filter for PDF files only
18
+ pdf_files = []
19
+ for file_path in files:
20
+ if file_path.endswith('.pdf'):
21
+ pdf_files.append(file_path)
22
+
23
+ if not pdf_files:
24
+ return "❌ Please upload PDF files only."
25
+
26
+ print(f"📄 Processing {len(pdf_files)} PDF file(s)...")
27
+
28
+ # Process documents
29
+ documents = doc_ingestion.process_documents(pdf_files)
30
+
31
+ if documents:
32
+ print("🔗 Creating vector store...")
33
+ # Create vector store
34
+ vectorstore = doc_ingestion.create_vector_store(documents)
35
+
36
+ if vectorstore:
37
+ # Store vectorstore in global vars
38
+ global_vars['vectorstore'] = vectorstore
39
+
40
+ # Create summary
41
+ summary = f"✅ Successfully processed {len(documents)} document(s):\n\n"
42
+
43
+ for i, doc in enumerate(documents, 1):
44
+ metadata = doc.metadata
45
+ university = metadata.get('university', 'Unknown')
46
+ country = metadata.get('country', 'Unknown')
47
+ doc_type = metadata.get('document_type', 'Unknown')
48
+ language = metadata.get('language', 'Unknown')
49
+
50
+ summary += f"{i}. **{metadata['source']}**\n"
51
+ summary += f" - University: {university}\n"
52
+ summary += f" - Country: {country}\n"
53
+ summary += f" - Type: {doc_type}\n"
54
+ summary += f" - Language: {language}\n\n"
55
+
56
+ summary += "🎉 **Ready for queries!** Go to the 'Search & Query' tab to start asking questions."
57
+ return summary
58
+ else:
59
+ return "❌ Failed to create vector store from documents."
60
+ else:
61
+ return "❌ No documents were successfully processed. Please check if your PDFs are readable."
62
+
63
+ except Exception as e:
64
+ return f"❌ Error processing documents: {str(e)}\n\nPlease check the console for more details."
65
+
66
+ def create_upload_tab(global_vars):
67
+ """Create the Upload Documents tab"""
68
+ with gr.Tab("📄 Upload Documents", id="upload"):
69
+ gr.Markdown("""
70
+ ### Step 2: Upload PDF Documents
71
+ Upload university documents (brochures, admission guides, etc.) in PDF format.
72
+ The system will automatically extract metadata including university name, country, and document type.
73
+ """)
74
+
75
+ file_upload = gr.File(
76
+ label="📁 Upload PDF Documents",
77
+ file_types=[".pdf"],
78
+ file_count="multiple",
79
+ height=120
80
+ )
81
+
82
+ upload_btn = gr.Button(
83
+ "📄 Process Documents",
84
+ variant="primary",
85
+ size="lg"
86
+ )
87
+
88
+ upload_status = gr.Textbox(
89
+ label="Processing Status",
90
+ interactive=False,
91
+ lines=12,
92
+ placeholder="Upload PDF files and click 'Process Documents'..."
93
+ )
94
+
95
+ upload_btn.click(
96
+ lambda files: upload_documents(files, global_vars),
97
+ inputs=file_upload,
98
+ outputs=upload_status
99
+ )
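With the Streamlit upload objects gone, `DocumentIngestion` now works on plain file paths, so the pipeline behind the upload tab can also be scripted directly. A sketch; the PDF path below is a placeholder:

```python
from utils.rag_system import DocumentIngestion, RAGSystem

ingestion = DocumentIngestion()
docs = ingestion.process_documents(["./documents/sample_admissions.pdf"])  # placeholder path
vectorstore = ingestion.create_vector_store(docs)

rag = RAGSystem()
result = rag.query("What are the English requirements?", language="English")
print(result["answer"])
```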
utils/rag_system.py CHANGED
@@ -2,7 +2,6 @@ import os
2
  import uuid
3
  import tempfile
4
  from typing import List, Optional, Dict, Any
5
- import streamlit as st
6
  from pathlib import Path
7
  import PyPDF2
8
  from langchain.text_splitter import RecursiveCharacterTextSplitter
@@ -27,24 +26,54 @@ class AlternativeEmbeddings:
27
  """Alternative embeddings using Sentence Transformers when OpenAI is not available"""
28
 
29
  def __init__(self):
 
 
 
30
  try:
31
  from sentence_transformers import SentenceTransformer
32
- # Use BGE-small-en for better performance
33
- self.model = SentenceTransformer("BAAI/bge-small-en-v1.5")
34
- self.embedding_size = 384
 
35
  except ImportError:
36
- st.error("sentence-transformers not available. Please install it or provide OpenAI API key.")
37
- self.model = None
38
 
39
  def embed_documents(self, texts):
40
  if not self.model:
41
- return []
42
- return self.model.encode(texts).tolist()
 
 
 
 
43
 
44
  def embed_query(self, text):
45
  if not self.model:
46
- return []
47
- return self.model.encode([text])[0].tolist()
 
 
 
 
48
 
49
  class SEALionLLM:
50
  """Custom LLM class for SEA-LION models"""
@@ -168,7 +197,7 @@ class SEALionLLM:
168
  return response_text
169
 
170
  except Exception as e:
171
- st.error(f"Error with SEA-LION model: {str(e)}")
172
  return f"I apologize, but I encountered an error processing your query. Please try rephrasing your question. Error: {str(e)}"
173
 
174
  def extract_metadata(self, document_text: str) -> Dict[str, str]:
@@ -213,33 +242,33 @@ class SEALionLLM:
213
  )
214
 
215
  response_text = response.choices[0].message.content.strip()
216
- st.subheader("--- DEBUG: LLM Metadata Extraction Details ---")
217
- st.write(f"**Input Text for LLM (first 2 pages):**\n```\n{document_text[:1000]}...\n```") # Show first 1000 chars of input
218
- st.write(f"**Raw LLM Response:**\n```json\n{response_text}\n```")
219
 
220
  json_match = re.search(r'\{.*?\}', response_text, re.DOTALL)
221
  if json_match:
222
  json_str = json_match.group(0)
223
  try:
224
  metadata = json.loads(json_str)
225
- st.write(f"**Parsed JSON Metadata:**\n```json\n{json.dumps(metadata, indent=2)}\n```")
226
  required_keys = ["university_name", "country", "document_type", "language"]
227
  if all(key in metadata for key in required_keys):
228
- st.success("DEBUG: Successfully extracted and parsed metadata from LLM.")
229
  return metadata
230
  else:
231
- st.warning("DEBUG: LLM response missing required keys, attempting fallback or using defaults.")
232
  return self._get_default_metadata()
233
  except json.JSONDecodeError as e:
234
- st.error(f"DEBUG: JSON Parsing Failed: {e}")
235
- st.write(f"DEBUG: Attempting fallback text extraction from raw response.")
236
  return self._extract_from_text_response(response_text)
237
  else:
238
- st.error("DEBUG: No JSON object found in LLM response.")
239
  return self._extract_from_text_response(response_text)
240
 
241
  except Exception as e:
242
- st.error(f"DEBUG: Error during LLM Metadata Extraction: {str(e)}")
243
  return self._get_default_metadata()
244
 
245
  def _extract_from_text_response(self, response_text: str) -> Dict[str, str]:
@@ -260,7 +289,7 @@ class SEALionLLM:
260
  elif "language" in line.lower() and ":" in line:
261
  value = line.split(":", 1)[1].strip().strip('",')
262
  metadata["language"] = value
263
- st.write(f"DEBUG: Fallback text extraction result: {metadata}")
264
  return metadata
265
 
266
  def _get_default_metadata(self) -> Dict[str, str]:
@@ -301,10 +330,10 @@ class DocumentIngestion:
301
  self.embeddings = OpenAIEmbeddings()
302
  self.embedding_type = "OpenAI"
303
  except Exception as e:
304
- st.error("Both BGE and OpenAI embeddings failed. Please check your setup.")
305
  raise e
306
  else:
307
- st.error("No embedding model available. Please install sentence-transformers or provide OpenAI API key.")
308
  raise Exception("No embedding model available")
309
 
310
  self.text_splitter = SemanticChunker(
@@ -321,80 +350,77 @@ class DocumentIngestion:
321
  self.persist_directory = os.getenv("CHROMA_PERSIST_DIRECTORY", "./chroma_db")
322
  os.makedirs(self.persist_directory, exist_ok=True)
323
 
324
- def extract_text_from_pdf(self, pdf_file) -> List[str]:
325
- """Extract text from uploaded PDF file with multiple fallback methods."""
326
  try:
327
  # Method 1: Try with PyPDF2 (handles most PDFs including encrypted ones with PyCryptodome)
328
- pdf_reader = PyPDF2.PdfReader(pdf_file)
329
-
330
- # Check if PDF is encrypted
331
- if pdf_reader.is_encrypted:
332
- # Try to decrypt with empty password (common for protected but not password-protected PDFs)
333
- try:
334
- pdf_reader.decrypt("")
335
- except Exception:
336
- st.warning(f"PDF {pdf_file.name} is password-protected. Please provide an unprotected version.")
337
- return [] # Return empty list for password-protected PDFs
338
-
339
- text_per_page = []
340
- for page_num, page in enumerate(pdf_reader.pages):
341
- try:
342
- page_text = page.extract_text()
343
- text_per_page.append(page_text)
344
- except Exception as e:
345
- st.warning(f"Could not extract text from page {page_num + 1} of {pdf_file.name}: {str(e)}")
346
- text_per_page.append("") # Append empty string for failed pages
347
-
348
- if any(text.strip() for text in text_per_page):
349
- return text_per_page
350
- else:
351
- st.warning(f"No extractable text found in {pdf_file.name}. This might be a scanned PDF or image-based document.")
352
- return []
 
353
 
354
  except Exception as e:
355
  error_msg = str(e)
356
  if "PyCryptodome" in error_msg:
357
- st.error(f"Encryption error with {pdf_file.name}: {error_msg}")
358
- st.info("💡 The PDF uses encryption. PyCryptodome has been installed to handle this.")
359
  elif "password" in error_msg.lower():
360
- st.error(f"Password-protected PDF: {pdf_file.name}")
361
- st.info("💡 Please provide an unprotected version of this PDF.")
362
  else:
363
- st.error(f"Error extracting text from {pdf_file.name}: {error_msg}")
364
  return []
365
 
366
- def process_documents(self, uploaded_files) -> List[Document]: # Removed university_name, country, document_type parameters
367
- """Process uploaded PDF files and convert to documents with automatic metadata extraction."""
368
  documents = []
369
  processed_count = 0
370
  failed_count = 0
371
 
372
- st.info(f"📄 Processing {len(uploaded_files)} document(s) with automatic metadata detection...") # Changed to print
373
 
374
- for uploaded_file in uploaded_files:
375
- if uploaded_file.type == "application/pdf":
376
- st.write(f"🔍 Extracting text from: **{uploaded_file.name}**") # Changed to print
 
377
 
378
  # Extract text per page
379
- text_per_page = self.extract_text_from_pdf(uploaded_file)
380
- st.write(f"DEBUG: Extracted {len(text_per_page)} pages from {uploaded_file.name}")
381
 
382
  if text_per_page:
383
  # Combine first two pages for metadata extraction
384
  text_for_metadata = "\n".join(text_per_page[:2])
385
- st.write(f"DEBUG: Text for metadata extraction (first 500 chars): {text_for_metadata[:500]}")
386
  # Extract metadata using LLM
387
- st.write(f"🤖 Detecting metadata for: **{uploaded_file.name}**") # Changed to print
388
  extracted_metadata = self.sea_lion_llm.extract_metadata(text_for_metadata)
389
 
390
- # Validate and clean metadata (assuming validate_metadata is defined elsewhere or will be added)
391
- # For now, we\'ll use the extracted_metadata directly.
392
- # If you want me to add validate_metadata here, please provide its content.
393
- # extracted_metadata = validate_metadata(extracted_metadata)
394
-
395
  # Create metadata
396
  metadata = {
397
- "source": uploaded_file.name,
398
  "university": extracted_metadata.get("university_name", "Unknown"),
399
  "country": extracted_metadata.get("country", "Unknown"),
400
  "document_type": extracted_metadata.get("document_type", "general_info"),
@@ -410,26 +436,27 @@ class DocumentIngestion:
410
  )
411
  documents.append(doc)
412
  processed_count += 1
413
- st.success(f"✅ Successfully processed: **{uploaded_file.name}** ({len(doc.page_content)} characters)") # Changed to print
414
  else:
415
  failed_count += 1
416
- st.warning(f"⚠️ Could not extract text from **{uploaded_file.name}**") # Changed to print
417
  else:
418
  failed_count += 1
419
- st.error(f"❌ Unsupported file type: **{uploaded_file.type}** for {uploaded_file.name}") # Changed to print
 
420
 
421
  # Summary
422
  if processed_count > 0:
423
- st.success(f"🎉 Successfully processed **{processed_count}** document(s)") # Changed to print
424
  if failed_count > 0:
425
- st.warning(f"⚠️ Failed to process **{failed_count}** document(s)") # Changed to print
426
 
427
  return documents
428
 
429
  def create_vector_store(self, documents: List[Document]) -> Chroma:
430
  """Create and persist vector store from documents."""
431
  if not documents:
432
- st.error("No documents to process") # Changed to print
433
  return None
434
 
435
  # Split documents into chunks
@@ -453,7 +480,7 @@ class DocumentIngestion:
453
  )
454
  return vectorstore
455
  except Exception as e:
456
- st.warning(f"Could not load existing vector store: {str(e)}") # Changed to print
457
  return None
458
 
459
  class RAGSystem:
@@ -480,7 +507,7 @@ class RAGSystem:
480
  )
481
  return vectorstore
482
  except Exception as e:
483
- st.error(f"Error loading vector store: {str(e)}")
484
  return None
485
 
486
  def query(self, question: str, language: str = "English") -> Dict[str, Any]:
@@ -532,7 +559,7 @@ Document {i} (Source: {source_info}, University: {university}, Country: {country
532
  }
533
 
534
  except Exception as e:
535
- st.error(f"Error querying system: {str(e)}")
536
  return {
537
  "answer": f"Error processing your question: {str(e)}",
538
  "source_documents": [],
@@ -570,7 +597,7 @@ def save_query_result(query_result: Dict[str, Any]):
570
  json.dump(save_data, f, indent=2, ensure_ascii=False)
571
  return True
572
  except Exception as e:
573
- st.error(f"Error saving query result: {str(e)}")
574
  return False
575
  return False
576
 
@@ -583,6 +610,6 @@ def load_shared_query(query_id: str) -> Optional[Dict[str, Any]]:
583
  with open(result_file, 'r', encoding='utf-8') as f:
584
  return json.load(f)
585
  except Exception as e:
586
- st.error(f"Error loading shared query: {str(e)}")
587
 
588
  return None
 
2
  import uuid
3
  import tempfile
4
  from typing import List, Optional, Dict, Any
 
5
  from pathlib import Path
6
  import PyPDF2
7
  from langchain.text_splitter import RecursiveCharacterTextSplitter
 
26
  """Alternative embeddings using Sentence Transformers when OpenAI is not available"""
27
 
28
  def __init__(self):
29
+ self.model = None
30
+ self.embedding_size = 384
31
+
32
  try:
33
  from sentence_transformers import SentenceTransformer
34
+
35
+ # Try smaller models in order of preference for better cloud compatibility
36
+ model_options = [
37
+ ("all-MiniLM-L6-v2", 384), # Very small and reliable
38
+ ("paraphrase-MiniLM-L3-v2", 384), # Even smaller
39
+ ("BAAI/bge-small-en-v1.5", 384) # Original choice
40
+ ]
41
+
42
+ for model_name, embed_size in model_options:
43
+ try:
44
+ print(f"🔄 Trying to load model: {model_name}")
45
+ self.model = SentenceTransformer(model_name)
46
+ self.embedding_size = embed_size
47
+ print(f"✅ Successfully loaded: {model_name}")
48
+ break
49
+ except Exception as e:
50
+ print(f"⚠️ Failed to load {model_name}: {str(e)}")
51
+ continue
52
+
53
+ if not self.model:
54
+ raise Exception("All embedding models failed to load")
55
+
56
  except ImportError:
57
+ print("sentence-transformers not available. Please install it or provide OpenAI API key.")
58
+ raise ImportError("sentence-transformers not available")
59
 
60
  def embed_documents(self, texts):
61
  if not self.model:
62
+ raise Exception("No embedding model available")
63
+ try:
64
+ return self.model.encode(texts, convert_to_numpy=True).tolist()
65
+ except Exception as e:
66
+ print(f"Error encoding documents: {e}")
67
+ raise
68
 
69
  def embed_query(self, text):
70
  if not self.model:
71
+ raise Exception("No embedding model available")
72
+ try:
73
+ return self.model.encode([text], convert_to_numpy=True)[0].tolist()
74
+ except Exception as e:
75
+ print(f"Error encoding query: {e}")
76
+ raise
77
 
78
  class SEALionLLM:
79
  """Custom LLM class for SEA-LION models"""
 
197
  return response_text
198
 
199
  except Exception as e:
200
+ print(f"Error with SEA-LION model: {str(e)}")
201
  return f"I apologize, but I encountered an error processing your query. Please try rephrasing your question. Error: {str(e)}"
202
 
203
  def extract_metadata(self, document_text: str) -> Dict[str, str]:
 
242
  )
243
 
244
  response_text = response.choices[0].message.content.strip()
245
+ print("--- DEBUG: LLM Metadata Extraction Details ---")
246
+ print(f"**Input Text for LLM (first 2 pages):**\n```\n{document_text[:1000]}...\n```") # Show first 1000 chars of input
247
+ print(f"**Raw LLM Response:**\n```json\n{response_text}\n```")
248
 
249
  json_match = re.search(r'\{.*?\}', response_text, re.DOTALL)
250
  if json_match:
251
  json_str = json_match.group(0)
252
  try:
253
  metadata = json.loads(json_str)
254
+ print(f"**Parsed JSON Metadata:**\n```json\n{json.dumps(metadata, indent=2)}\n```")
255
  required_keys = ["university_name", "country", "document_type", "language"]
256
  if all(key in metadata for key in required_keys):
257
+ print("DEBUG: Successfully extracted and parsed metadata from LLM.")
258
  return metadata
259
  else:
260
+ print("DEBUG: LLM response missing required keys, attempting fallback or using defaults.")
261
  return self._get_default_metadata()
262
  except json.JSONDecodeError as e:
263
+ print(f"DEBUG: JSON Parsing Failed: {e}")
264
+ print(f"DEBUG: Attempting fallback text extraction from raw response.")
265
  return self._extract_from_text_response(response_text)
266
  else:
267
+ print("DEBUG: No JSON object found in LLM response.")
268
  return self._extract_from_text_response(response_text)
269
 
270
  except Exception as e:
271
+ print(f"DEBUG: Error during LLM Metadata Extraction: {str(e)}")
272
  return self._get_default_metadata()
273
 
274
  def _extract_from_text_response(self, response_text: str) -> Dict[str, str]:
 
289
  elif "language" in line.lower() and ":" in line:
290
  value = line.split(":", 1)[1].strip().strip('",')
291
  metadata["language"] = value
292
+ print(f"DEBUG: Fallback text extraction result: {metadata}")
293
  return metadata
294
 
295
  def _get_default_metadata(self) -> Dict[str, str]:
 
330
  self.embeddings = OpenAIEmbeddings()
331
  self.embedding_type = "OpenAI"
332
  except Exception as e:
333
+ print("Both BGE and OpenAI embeddings failed. Please check your setup.")
334
  raise e
335
  else:
336
+ print("No embedding model available. Please install sentence-transformers or provide OpenAI API key.")
337
  raise Exception("No embedding model available")
338
 
339
  self.text_splitter = SemanticChunker(
 
350
  self.persist_directory = os.getenv("CHROMA_PERSIST_DIRECTORY", "./chroma_db")
351
  os.makedirs(self.persist_directory, exist_ok=True)
352
 
353
+ def extract_text_from_pdf(self, pdf_file_path) -> List[str]:
354
+ """Extract text from PDF file path with multiple fallback methods."""
355
  try:
356
  # Method 1: Try with PyPDF2 (handles most PDFs including encrypted ones with PyCryptodome)
357
+ with open(pdf_file_path, 'rb') as pdf_file:
358
+ pdf_reader = PyPDF2.PdfReader(pdf_file)
359
+
360
+ # Check if PDF is encrypted
361
+ if pdf_reader.is_encrypted:
362
+ # Try to decrypt with empty password (common for protected but not password-protected PDFs)
363
+ try:
364
+ pdf_reader.decrypt("")
365
+ except Exception:
366
+ print(f"PDF {os.path.basename(pdf_file_path)} is password-protected. Please provide an unprotected version.")
367
+ return [] # Return empty list for password-protected PDFs
368
+
369
+ text_per_page = []
370
+ for page_num, page in enumerate(pdf_reader.pages):
371
+ try:
372
+ page_text = page.extract_text()
373
+ text_per_page.append(page_text)
374
+ except Exception as e:
375
+ print(f"Could not extract text from page {page_num + 1} of {os.path.basename(pdf_file_path)}: {str(e)}")
376
+ text_per_page.append("") # Append empty string for failed pages
377
+
378
+ if any(text.strip() for text in text_per_page):
379
+ return text_per_page
380
+ else:
381
+ print(f"No extractable text found in {os.path.basename(pdf_file_path)}. This might be a scanned PDF or image-based document.")
382
+ return []
383
 
384
  except Exception as e:
385
  error_msg = str(e)
386
  if "PyCryptodome" in error_msg:
387
+ print(f"Encryption error with {os.path.basename(pdf_file_path)}: {error_msg}")
388
+ print("💡 The PDF uses encryption. PyCryptodome has been installed to handle this.")
389
  elif "password" in error_msg.lower():
390
+ print(f"Password-protected PDF: {os.path.basename(pdf_file_path)}")
391
+ print("💡 Please provide an unprotected version of this PDF.")
392
  else:
393
+ print(f"Error extracting text from {os.path.basename(pdf_file_path)}: {error_msg}")
394
  return []
395
 
396
+ def process_documents(self, pdf_file_paths) -> List[Document]:
397
+ """Process PDF file paths and convert to documents with automatic metadata extraction."""
398
  documents = []
399
  processed_count = 0
400
  failed_count = 0
401
 
402
+ print(f"📄 Processing {len(pdf_file_paths)} document(s) with automatic metadata detection...") # Changed to print
403
 
404
+ for pdf_file_path in pdf_file_paths:
405
+ if pdf_file_path.endswith('.pdf'):
406
+ filename = os.path.basename(pdf_file_path)
407
+ print(f"🔍 Extracting text from: **{filename}**") # Changed to print
408
 
409
  # Extract text per page
410
+ text_per_page = self.extract_text_from_pdf(pdf_file_path)
411
+ print(f"DEBUG: Extracted {len(text_per_page)} pages from {filename}")
412
 
413
  if text_per_page:
414
  # Combine first two pages for metadata extraction
415
  text_for_metadata = "\n".join(text_per_page[:2])
416
+ print(f"DEBUG: Text for metadata extraction (first 500 chars): {text_for_metadata[:500]}")
417
  # Extract metadata using LLM
418
+ print(f"🤖 Detecting metadata for: **{filename}**") # Changed to print
419
  extracted_metadata = self.sea_lion_llm.extract_metadata(text_for_metadata)
420
 
 
 
 
 
 
421
  # Create metadata
422
  metadata = {
423
+ "source": filename,
424
  "university": extracted_metadata.get("university_name", "Unknown"),
425
  "country": extracted_metadata.get("country", "Unknown"),
426
  "document_type": extracted_metadata.get("document_type", "general_info"),
 
436
  )
437
  documents.append(doc)
438
  processed_count += 1
439
+ print(f"✅ Successfully processed: **{filename}** ({len(doc.page_content)} characters)") # Changed to print
440
  else:
441
  failed_count += 1
442
+ print(f"⚠️ Could not extract text from **{filename}**") # Changed to print
443
  else:
444
  failed_count += 1
445
+ filename = os.path.basename(pdf_file_path)
446
+ print(f"❌ Unsupported file type for {filename} (expected .pdf)") # Changed to print
447
 
448
  # Summary
449
  if processed_count > 0:
450
+ print(f"🎉 Successfully processed **{processed_count}** document(s)") # Changed to print
451
  if failed_count > 0:
452
+ print(f"⚠️ Failed to process **{failed_count}** document(s)") # Changed to print
453
 
454
  return documents
455
 
456
  def create_vector_store(self, documents: List[Document]) -> Chroma:
457
  """Create and persist vector store from documents."""
458
  if not documents:
459
+ print("No documents to process") # Changed to print
460
  return None
461
 
462
  # Split documents into chunks
 
480
  )
481
  return vectorstore
482
  except Exception as e:
483
+ print(f"Could not load existing vector store: {str(e)}") # Changed to print
484
  return None
485
 
486
  class RAGSystem:
 
507
  )
508
  return vectorstore
509
  except Exception as e:
510
+ print(f"Error loading vector store: {str(e)}")
511
  return None
512
 
513
  def query(self, question: str, language: str = "English") -> Dict[str, Any]:
 
559
  }
560
 
561
  except Exception as e:
562
+ print(f"Error querying system: {str(e)}")
563
  return {
564
  "answer": f"Error processing your question: {str(e)}",
565
  "source_documents": [],
 
597
  json.dump(save_data, f, indent=2, ensure_ascii=False)
598
  return True
599
  except Exception as e:
600
+ print(f"Error saving query result: {str(e)}")
601
  return False
602
  return False
603
 
 
610
  with open(result_file, 'r', encoding='utf-8') as f:
611
  return json.load(f)
612
  except Exception as e:
613
+ print(f"Error loading shared query: {str(e)}")
614
 
615
  return None
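The reworked `AlternativeEmbeddings` class now tries progressively smaller sentence-transformers models and raises instead of silently returning empty vectors. A quick way to check which model was actually picked up, assuming sentence-transformers is installed:

```python
from utils.rag_system import AlternativeEmbeddings

emb = AlternativeEmbeddings()  # prints which model loaded, e.g. all-MiniLM-L6-v2
vec = emb.embed_query("tuition fees for MBA programs in Thailand")
print(len(vec))                # 384-dimensional embedding for all listed models
```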
utils/translations.py CHANGED
@@ -110,6 +110,40 @@ translations = {
110
  "example_simple_2": "What is the difference between bachelor and master degree?",
111
  "example_simple_3": "How to apply for student visa?",
112
  "example_simple_4": "What documents are needed for university application?",
  },
114
 
115
  "中文": {
@@ -223,6 +257,40 @@ translations = {
223
  "example_simple_2": "学士学位和硕士学位有什么区别?",
224
  "example_simple_3": "如何申请学生签证?",
225
  "example_simple_4": "大学申请需要哪些文件?",
  },
227
 
228
  "Malay": {
 
110
  "example_simple_2": "What is the difference between bachelor and master degree?",
111
  "example_simple_3": "How to apply for student visa?",
112
  "example_simple_4": "What documents are needed for university application?",
113
+
114
+ # System messages
115
+ "systems_initialized": "✅ Systems initialized successfully!",
116
+ "can_upload_documents": "You can now upload documents.",
117
+ "initialization_error": "Error initializing systems",
118
+ "installation_help": """**Possible solutions:**
119
+ 1. Install sentence-transformers: `pip install sentence-transformers`
120
+ 2. Or provide OpenAI API key in environment variables
121
+ 3. Check that PyTorch is properly installed
122
+
123
+ **For deployment:**
124
+ - Ensure requirements.txt includes: sentence-transformers, torch, transformers""",
125
+ "please_initialize_first": "Please initialize systems first using the 'Initialize System' tab!",
126
+ "please_upload_pdf": "Please upload at least one PDF file.",
127
+ "upload_pdf_only": "Please upload PDF files only.",
128
+ "successfully_processed_docs": "Successfully processed",
129
+ "failed_create_vectorstore": "Failed to create vector store from documents.",
130
+ "no_docs_successfully_processed": "No documents were successfully processed. Please check if your PDFs are readable.",
131
+ "error_processing_docs": "Error processing documents",
132
+ "check_console": "Please check the console for more details.",
133
+ "please_upload_process_first": "Please upload and process documents first using the 'Upload Documents' tab!",
134
+ "please_enter_question": "Please enter a question.",
135
+ "processing_query": "Processing query",
136
+ "model_used": "Model Used",
137
+ "answer": "Answer",
138
+ "sources": "Sources",
139
+ "no_sources_found": "No specific sources found. This might be a general response.",
140
+ "error_querying_docs": "Error querying documents",
141
+ "ready_for_queries": "Ready for queries! Go to the 'Search & Query' tab to start asking questions.",
142
+
143
+ # Interface elements
144
+ "initialize_system": "Initialize System",
145
+ "initialize_systems": "Initialize Systems",
146
+ "initialization_status": "Initialization Status",
147
  },
148
 
149
  "中文": {
 
257
  "example_simple_2": "学士学位和硕士学位有什么区别?",
258
  "example_simple_3": "如何申请学生签证?",
259
  "example_simple_4": "大学申请需要哪些文件?",
260
+
261
+ # System messages
262
+ "systems_initialized": "✅ 系统初始化成功!",
263
+ "can_upload_documents": "您现在可以上传文档。",
264
+ "initialization_error": "系统初始化错误",
265
+ "installation_help": """**可能的解决方案:**
266
+ 1. 安装 sentence-transformers: `pip install sentence-transformers`
267
+ 2. 或在环境变量中提供 OpenAI API 密钥
268
+ 3. 检查 PyTorch 是否正确安装
269
+
270
+ **部署时:**
271
+ - 确保 requirements.txt 包含:sentence-transformers, torch, transformers""",
272
+ "please_initialize_first": "请先使用'初始化系统'选项卡初始化系统!",
273
+ "please_upload_pdf": "请至少上传一个PDF文件。",
274
+ "upload_pdf_only": "请仅上传PDF文件。",
275
+ "successfully_processed_docs": "成功处理",
276
+ "failed_create_vectorstore": "创建向量存储失败。",
277
+ "no_docs_successfully_processed": "没有成功处理任何文档。请检查您的PDF是否可读。",
278
+ "error_processing_docs": "处理文档时出错",
279
+ "check_console": "请查看控制台获取更多详细信息。",
280
+ "please_upload_process_first": "请先使用'上传文档'选项卡上传和处理文档!",
281
+ "please_enter_question": "请输入问题。",
282
+ "processing_query": "正在处理查询",
283
+ "model_used": "使用的模型",
284
+ "answer": "答案",
285
+ "sources": "来源",
286
+ "no_sources_found": "未找到特定来源。这可能是一般性回答。",
287
+ "error_querying_docs": "查询文档时出错",
288
+ "ready_for_queries": "准备查询!前往'搜索与查询'选项卡开始提问。",
289
+
290
+ # Interface elements
291
+ "initialize_system": "初始化系统",
292
+ "initialize_systems": "初始化系统",
293
+ "initialization_status": "初始化状态",
294
  },
295
 
296
  "Malay": {