Ervinoreo committed
Commit 846f122 · 1 Parent(s): ecf227f
.gitignore CHANGED
@@ -22,6 +22,8 @@ share/python-wheels/
22
  .installed.cfg
23
  *.egg
24
  MANIFEST
 
 
25
 
26
  # Virtual Environment
27
  .venv/
@@ -31,6 +33,7 @@ ENV/
31
  env/
32
  .venv
33
  myenv/
 
34
 
35
  # Environment Variables
36
  .env
 
22
  .installed.cfg
23
  *.egg
24
  MANIFEST
25
+ tabs/__pycache__/
26
+ .gradio
27
 
28
  # Virtual Environment
29
  .venv/
 
33
  env/
34
  .venv
35
  myenv/
36
+ gradio/
37
 
38
  # Environment Variables
39
  .env
README.md DELETED
@@ -1,274 +0,0 @@
1
- # PanSea University Search
2
-
3
- An AI-powered RAG (Retrieval-Augmented Generation) system for searching ASEAN university admission requirements, designed to help prospective students find accurate and up-to-date information about study opportunities across Southeast Asia.
4
-
5
- ## 🎯 Problem & Solution
6
-
7
- **Problem:** Prospective students worldwide who want to study abroad struggle to find accurate, up-to-date university admission requirements. Information is scattered across PDFs, brochures, and outdated agency websites. Many students waste time applying to unsuitable programs because key criteria are hard to find, and many pay high agent fees.
8
-
9
- **Solution:** An LLM-powered, RAG-based platform built on **SEA-LION multilingual models** that ingests official admissions documents from ASEAN universities. Students can query in any ASEAN language and receive ranked program matches with fees, entry requirements, deadlines, application windows, and source citations.
10
-
11
- ## 🌟 Features
12
-
13
- - 📄 **PDF Document Ingestion**: Upload official university admission documents
14
- - 🔍 **Intelligent Search**: Natural language queries in multiple ASEAN languages
15
- - 🎯 **Accurate Responses**: AI-powered answers with source citations
16
- - 🔗 **Shareable Results**: Generate links to share query results
17
- - 🌏 **Multi-language Support**: English, Chinese, Malay, Thai, Indonesian, Vietnamese, Filipino
18
- - 💰 **Advanced Filtering**: Budget range, study level, country preferences
19
-
20
- ## 🚀 Quick Start
21
-
22
- ### Prerequisites
23
-
24
- - Python 3.11+
25
- - SEA-LION API Key
26
- - OpenAI API Key (optional, for fallback embeddings)
27
-
28
- ### Installation
29
-
30
- 1. **Clone and navigate to the project:**
31
-
32
- ```bash
33
- cd pansea
34
- ```
35
-
36
- 2. **Activate virtual environment:**
37
-
38
- ```bash
39
- source .venv/bin/activate # On Windows: .venv\Scripts\activate
40
- ```
41
-
42
- 3. **Install dependencies:**
43
-
44
- ```bash
45
- pip install -r requirements.txt
46
- ```
47
-
48
- 4. **Set up environment variables:**
49
-
50
- ```bash
51
- cp .env.example .env
52
- # Edit .env and add your SEA-LION API key (OpenAI key optional for fallback)
53
- ```
54
-
55
- 5. **Run the application:**
56
-
57
- ```bash
58
- streamlit run app.py
59
- ```
60
-
61
- 6. **Open your browser to:** `http://localhost:8501`
62
-
63
- ### Usage
64
-
65
- #### 1. Upload Documents
66
-
67
- - Go to the "Upload Documents" page
68
- - Enter university name and country
69
- - Select document type (admission requirements, tuition fees, etc.)
70
- - Upload PDF files containing university information
71
- - Click "Process Documents"
72
-
73
- #### 2. Search Universities
74
-
75
- - Go to the "Search Universities" page
76
- - Choose your response language
77
- - Enter questions like:
78
- - "Show me universities in Malaysia for master's degrees with tuition under 40,000 RMB per year"
79
- - "专科毕业,无雅思,想在马来西亚读硕士,学费不超过 4 万人民币/年"
80
- - "What are the English proficiency requirements for Singapore universities?"
81
- - Apply optional filters (budget, study level, countries)
82
- - Get AI-powered responses with source citations
83
-
84
- #### 3. Share Results
85
-
86
- - Each query generates a unique shareable link
87
- - Share results with friends, family, or education consultants
88
- - Access shared results without needing to upload documents again
89
-
90
- ## 📁 Project Structure
91
-
92
- ```
93
- pansea/
94
- ├── app.py # Main Streamlit application
95
- ├── rag_system.py # RAG system implementation
96
- ├── requirements.txt # Python dependencies
97
- ├── .env # Environment variables
98
- ├── .venv/ # Virtual environment
99
- ├── chroma_db/ # Vector database storage
100
- ├── documents/ # Uploaded documents storage
101
- ├── query_results/ # Shared query results
102
- └── README.md # This file
103
- ```
104
-
105
- ## 🛠️ Core Components
106
-
107
- ### DocumentIngestion Class
108
-
109
- - Handles PDF text extraction using PyPDF2
110
- - Creates document chunks with metadata
111
- - Builds and persists ChromaDB vector store
112
- - Manages document preprocessing and storage
113
-
114
- ### RAGSystem Class
115
-
116
- - Implements retrieval-augmented generation
117
- - Uses BGE-small-en-v1.5 embeddings for semantic search (with OpenAI fallback)
118
- - Leverages SEA-LION models for response generation:
119
- - **SEA-LION v3.5 Reasoning Model** for complex university queries
120
- - **SEA-LION v3 Instruct Model** for translation and simple questions
121
- - Provides multilingual query support with automatic model selection
122
-
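
For orientation, here is a minimal sketch of how these two classes fit together. It uses only the method names that appear elsewhere in this repository (`process_documents`, `create_vector_store`, `query`); treat it as illustrative rather than the exact API.

```python
from utils.rag_system import DocumentIngestion, RAGSystem

# uploaded_files: a list of PDF file objects, e.g. from the app's file uploader
ingestion = DocumentIngestion()
documents = ingestion.process_documents(uploaded_files)   # PDFs -> chunked Documents with metadata
vectorstore = ingestion.create_vector_store(documents)    # embed chunks and persist them to ChromaDB

rag = RAGSystem()
result = rag.query(
    question="Master's programs in Malaysia under 40,000 RMB per year?",
    language="English",
)
print(result["answer"])
for doc in result["source_documents"]:
    print(doc.metadata.get("source"), doc.metadata.get("university"))
```
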
123
- ### Streamlit UI
124
-
125
- - Clean, intuitive interface
126
- - Multi-page navigation
127
- - File upload with progress tracking
128
- - Advanced search filters
129
- - Shareable query results
130
-
131
- ## 🌏 Supported Languages
132
-
133
- The system supports queries and responses in:
134
-
135
- - **English** - Primary language
136
- - **中文 (Chinese)** - For Chinese-speaking students
137
- - **Bahasa Malaysia** - For Malaysian context
138
- - **ไทย (Thai)** - For Thai students
139
- - **Bahasa Indonesia** - For Indonesian students
140
- - **Tiếng Việt (Vietnamese)** - For Vietnamese students
141
- - **Filipino** - For Philippines context
142
-
143
- ## 🎯 Target ASEAN Countries
144
-
145
- - 🇸🇬 Singapore
146
- - 🇲🇾 Malaysia
147
- - 🇹🇭 Thailand
148
- - 🇮🇩 Indonesia
149
- - 🇵🇭 Philippines
150
- - 🇻🇳 Vietnam
151
- - 🇧🇳 Brunei
152
- - 🇰🇭 Cambodia
153
- - 🇱🇦 Laos
154
- - 🇲🇲 Myanmar
155
-
156
- ## 🔧 Configuration
157
-
158
- ### Environment Variables (.env)
159
-
160
- ```bash
161
- # SEA-LION API Configuration
162
- SEA_LION_API_KEY=your_sea_lion_api_key_here
163
- SEA_LION_BASE_URL=https://api.sea-lion.ai/v1
164
-
165
- # OpenAI API Configuration (for embeddings)
166
- OPENAI_API_KEY=your_openai_api_key_here
167
-
168
- # Application Configuration
169
- APP_NAME=Top.Edu University Search
170
- APP_VERSION=1.0.0
171
- CHROMA_PERSIST_DIRECTORY=./chroma_db
172
- UPLOAD_FOLDER=./documents
173
- MAX_FILE_SIZE_MB=50
174
- ```
175
-
176
- ### Customization Options
177
-
178
- - **Chunk Size**: Adjust text splitting in `rag_system.py`
179
- - **Retrieval Count**: Modify number of retrieved documents (default: 5)
180
- - **Model Selection**: Configure SEA-LION model selection logic
181
- - **UI Themes**: Modify CSS in `app.py`
182
- - **Query Classification**: Adjust complex vs simple query detection
183
-
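
For example, chunk size is controlled where the text splitter is built in `rag_system.py`; the sketch below shows the kind of adjustment involved (variable names are illustrative, the actual code may differ):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# documents: the list of LangChain Documents produced during ingestion.
# Larger chunks keep more context per retrieved passage; smaller chunks improve precision.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,     # characters per chunk
    chunk_overlap=150,   # overlap so requirements are not cut mid-sentence
)
chunks = text_splitter.split_documents(documents)
```
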
184
- ## 📊 Example Queries
185
-
186
- Try these sample queries to test the system and see different model usage:
187
-
188
- ### Complex Queries (Uses SEA-LION Reasoning Model)
189
-
190
- 1. **Multi-criteria Search**: "Show me universities in Thailand and Malaysia for engineering master's programs with tuition under $15,000 per year"
191
-
192
- 2. **Chinese Query**: "专科毕业,无雅思,想在马来西亚读硕士,学费不超过 4 万人民币/年"
193
-
194
- 3. **Comparative Analysis**: "Compare MBA programs in Singapore and Indonesia with GMAT requirements and scholarship opportunities"
195
-
196
- ### Simple Queries (Uses SEA-LION Instruct Model)
197
-
198
- 4. **Translation**: "How do you say 'application deadline' in Thai and Indonesian?"
199
-
200
- 5. **Definition**: "What is the difference between IELTS and TOEFL?"
201
-
202
- 6. **Basic Information**: "What does GPA stand for and how is it calculated?"
203
-
204
- ## 🔍 Technical Stack
205
-
206
- - **Backend**: Python 3.11, LangChain
207
- - **LLM Models**:
208
- - SEA-LION v3.5 8B Reasoning (complex queries)
209
- - SEA-LION v3 9B Instruct (simple queries & translation)
210
- - **Embeddings**: BGE-small-en-v1.5 (with OpenAI ada-002 fallback)
211
- - **Vector Database**: ChromaDB with persistence
212
- - **Frontend**: Streamlit with custom CSS
213
- - **Document Processing**: PyPDF2, PyCryptodome (for encrypted PDFs), RecursiveCharacterTextSplitter
214
-
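
As a rough illustration of the embedding setup listed above (BGE-small by default, OpenAI ada-002 as fallback), the selection logic might look like the sketch below; the actual wiring lives in `rag_system.py` and may differ:

```python
import os

def build_embedder():
    """Return a callable that maps a list of texts to embedding vectors."""
    try:
        # Default: local BGE model via sentence-transformers (already in requirements)
        from sentence_transformers import SentenceTransformer
        model = SentenceTransformer("BAAI/bge-small-en-v1.5")
        return lambda texts: model.encode(texts).tolist()
    except Exception:
        # Fallback: OpenAI embeddings (needs the openai package and OPENAI_API_KEY)
        from openai import OpenAI
        client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
        def embed(texts):
            response = client.embeddings.create(model="text-embedding-ada-002", input=texts)
            return [item.embedding for item in response.data]
        return embed
```
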
215
- ## 📈 Roadmap
216
-
217
- - [ ] Support for additional document formats (Word, Excel)
218
- - [x] Integration with SEA-LION multilingual models
219
- - [ ] Real-time web scraping of university websites
220
- - [ ] Mobile-responsive design
221
- - [ ] User authentication and query history
222
- - [ ] Advanced analytics and insights
223
- - [ ] Integration with university application systems
224
- - [ ] Fine-tuning SEA-LION models on university-specific data
225
-
226
- ## 🤝 Contributing
227
-
228
- 1. Fork the repository
229
- 2. Create a feature branch (`git checkout -b feature/amazing-feature`)
230
- 3. Commit your changes (`git commit -m 'Add amazing feature'`)
231
- 4. Push to the branch (`git push origin feature/amazing-feature`)
232
- 5. Open a Pull Request
233
-
234
- ## 📄 License
235
-
236
- This project is licensed under the MIT License - see the LICENSE file for details.
237
-
238
- ## 💡 Tips for Best Results
239
-
240
- 1. **Upload Quality Documents**: Use official admission guides and requirements documents
241
- 2. **Be Specific**: Include specific criteria in your queries (budget, location, program type)
242
- 3. **Use Natural Language**: Ask questions as you would to a human counselor
243
- 4. **Try Multiple Languages**: The system works well with mixed-language queries
244
- 5. **Check Sources**: Always review the source documents cited in responses
245
-
246
- ## 🆘 Troubleshooting
247
-
248
- ### Common Issues
249
-
250
- **"No documents found"**: Upload PDF documents first on the Upload Documents page
251
-
252
- **"API Key not found"**: Add your SEA-LION API key to the .env file
253
-
254
- **"No embeddings available"**: BGE embeddings are used by default. If issues occur, add your OpenAI API key for fallback embeddings
255
-
256
- **"Import errors"**: Install dependencies using `pip install -r requirements.txt`
257
-
258
- **"ChromaDB errors"**: Delete the `chroma_db` folder and restart the application
259
-
260
- **"PyCryptodome is required for AES algorithm"**: This error occurs with encrypted PDFs. PyCryptodome is now included in requirements.txt
261
-
262
- **"Could not extract text from PDF"**: This can happen with:
263
-
264
- - Password-protected PDFs (provide unprotected versions)
265
- - Scanned PDFs or image-based documents (consider OCR tools)
266
- - Heavily encrypted or corrupted PDF files
267
-
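
A quick way to check whether a PDF has extractable text before uploading it (PyPDF2 is already in the requirements):

```python
from PyPDF2 import PdfReader

reader = PdfReader("admission_guide.pdf")  # replace with your file
text = "".join((page.extract_text() or "") for page in reader.pages)
if not text.strip():
    print("No extractable text found; the PDF is probably scanned, so run OCR first.")
else:
    print(f"Extracted {len(text)} characters from {len(reader.pages)} pages.")
```
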
268
- ## 📞 Support
269
-
270
- For support, please create an issue on GitHub or contact the development team.
271
-
272
- ---
273
-
274
- **Made with ❤️ for students seeking education opportunities in ASEAN** 🎓
README_GRADIO.md ADDED
@@ -0,0 +1,204 @@
1
+ # 🌏 ASEAN University Search - Gradio Version
2
+
3
+ An AI-powered university document search and Q&A system built with Gradio, specifically designed for ASEAN universities. This version uses **SEA-LION AI models** for intelligent responses and supports multiple Southeast Asian languages.
4
+
5
+ ## ✨ Features
6
+
7
+ - 🤖 **AI-Powered Search**: Uses SEA-LION models for intelligent document analysis
8
+ - 🌍 **Multi-Language Support**: English, Chinese, Malay, Thai, Indonesian, Vietnamese, Filipino
9
+ - 📚 **Automatic Metadata Extraction**: Detects university names, countries, and document types
10
+ - 🔍 **Semantic Document Chunking**: Intelligent text splitting for better retrieval
11
+ - 📱 **Shareable Links**: Built-in Gradio sharing for easy deployment
12
+ - 🎯 **Source Citations**: Always shows which documents were used for answers
13
+
14
+ ## 🚀 Quick Start
15
+
16
+ ### Option 1: Using the Startup Script (Recommended)
17
+
18
+ ```bash
19
+ ./start_gradio.sh
20
+ ```
21
+
22
+ ### Option 2: Manual Setup
23
+
24
+ ```bash
25
+ # Create virtual environment
26
+ python3 -m venv venv
27
+ source venv/bin/activate
28
+
29
+ # Install requirements
30
+ pip install -r requirements_gradio.txt
31
+
32
+ # Run the application
33
+ python app_gradio.py
34
+ ```
35
+
36
+ ## 🌐 Deployment Options
37
+
38
+ ### 1. **Local with Public Link** (Immediate)
39
+
40
+ - Run the app locally
41
+ - Gradio automatically creates a public shareable link
42
+ - Perfect for testing and sharing
43
+
44
+ ### 2. **HuggingFace Spaces** (Free, Recommended)
45
+
46
+ 1. Go to [HuggingFace Spaces](https://huggingface.co/spaces)
47
+ 2. Create new space with Gradio SDK
48
+ 3. Upload your files:
49
+ - `app_gradio.py`
50
+ - `requirements_gradio.txt` (rename to `requirements.txt`)
51
+ - `utils/` folder
52
+ - `.env` file (with your API keys)
53
+ 4. Deploy automatically!
54
+
55
+ ### 3. **Google Colab** (Free)
56
+
57
+ ```python
58
+ # Upload files to Colab
59
+ !pip install -r requirements_gradio.txt
60
+ !python app_gradio.py
61
+ ```
62
+
63
+ ### 4. **Railway/Render** (Paid but reliable)
64
+
65
+ - Push to GitHub
66
+ - Connect to Railway/Render
67
+ - Auto-deploy with custom domain
68
+
69
+ ## 🔧 Configuration
70
+
71
+ ### Environment Variables
72
+
73
+ Create a `.env` file:
74
+
75
+ ```env
76
+ # Required for SEA-LION models
77
+ SEA_LION_API_KEY=your_sea_lion_api_key_here
78
+ SEA_LION_BASE_URL=https://api.sea-lion.ai/v1
79
+
80
+ # Optional: For OpenAI embeddings fallback
81
+ OPENAI_API_KEY=your_openai_api_key_here
82
+
83
+ # Optional: Custom vector database location
84
+ CHROMA_PERSIST_DIRECTORY=./chroma_db
85
+ ```
86
+
87
+ ### Model Configuration
88
+
89
+ The system automatically chooses the appropriate model:
90
+
91
+ - **Simple queries**: SEA-LION Instruct (faster)
92
+ - **Complex analysis**: SEA-LION Reasoning (more thorough)
93
+
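
The routing between the two models is implemented in `utils/rag_system.py`; as a rough sketch of the idea (not the actual code), a keyword-based heuristic could look like this:

```python
# Illustrative heuristic only; the real classifier may use different signals,
# and the returned labels are placeholders for the two SEA-LION models.
COMPLEX_HINTS = ("compare", "tuition", "scholarship", "requirements", "deadline", "under")

def pick_model(question: str) -> str:
    q = question.lower()
    if len(q.split()) > 12 or any(hint in q for hint in COMPLEX_HINTS):
        return "sea-lion-reasoning"   # thorough, multi-criteria analysis
    return "sea-lion-instruct"        # fast answers and translation
```
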
94
+ ## 📋 How to Use
95
+
96
+ 1. **Initialize System** 🚀
97
+
98
+ - Click "Initialize Systems"
99
+ - Wait for models to download (first time only)
100
+
101
+ 2. **Upload Documents** 📄
102
+
103
+ - Upload PDF university documents
104
+ - System automatically extracts metadata
105
+ - Supports multiple documents at once
106
+
107
+ 3. **Ask Questions** 🔍
108
+ - Type questions in natural language
109
+ - Choose response language
110
+ - Get AI answers with source citations
111
+
112
+ ## 🎯 Example Questions
113
+
114
+ - "What are the admission requirements for Computer Science in Singapore?"
115
+ - "Which universities offer scholarships under $5000?"
116
+ - "Compare MBA programs in Thailand and Malaysia"
117
+ - "找到学费低于 5000 美元的工程专业" (Chinese)
118
+ - "Cari universitas dengan beasiswa di Indonesia" (Indonesian)
119
+
120
+ ## 🛠️ Troubleshooting
121
+
122
+ ### Common Issues
123
+
124
+ **"No embedding model available"**
125
+
126
+ ```bash
127
+ # Install sentence transformers
128
+ pip install sentence-transformers torch
129
+
130
+ # Or set OpenAI API key
131
+ export OPENAI_API_KEY=your_key_here
132
+ ```
133
+
134
+ **"Cannot load model"**
135
+
136
+ - Ensure internet connection for model download
137
+ - Try a smaller model: set `EMBEDDING_MODEL=all-MiniLM-L6-v2` (see the snippet below)
138
+
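
Assuming the app reads `EMBEDDING_MODEL` from the environment, as the tip above implies, you can set it in your shell or in `.env`:

```bash
# In your shell before launching, or as a line in .env (without "export")
export EMBEDDING_MODEL=all-MiniLM-L6-v2
```
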
139
+ **PDF extraction fails**
140
+
141
+ - Ensure PDFs are text-based (not scanned images)
142
+ - Check if PDF is password-protected
143
+
144
+ ## 🔄 Differences from Streamlit Version
145
+
146
+ | Feature | Streamlit | Gradio |
147
+ | ----------------- | ------------------------ | ------------------------ |
148
+ | **Deployment** | Complex, SQLite issues | Simple, multiple options |
149
+ | **Sharing** | Limited | Built-in public links |
150
+ | **UI** | More customizable | Clean, mobile-friendly |
151
+ | **Dependencies** | Heavy, version conflicts | Lighter, more stable |
152
+ | **Cloud Hosting** | Streamlit Cloud only | HF Spaces, Colab, etc. |
153
+
154
+ ## 📁 Project Structure
155
+
156
+ ```
157
+ 📦 ASEAN University Search (Gradio)
158
+ ├── 🚀 app_gradio.py # Main Gradio application
159
+ ├── 📋 requirements_gradio.txt # Gradio-specific dependencies
160
+ ├── ⚡ start_gradio.sh # Quick startup script
161
+ ├── 🔧 utils/
162
+ │ ├── rag_system.py # Core RAG logic (Streamlit-free)
163
+ │ ├── display.py # Display utilities
164
+ │ └── translations.py # Language translations
165
+ ├── 📁 documents/ # Document storage
166
+ ├── 🗄️ chroma_db/ # Vector database
167
+ ├── 📊 query_results/ # Saved query results
168
+ └── 🔐 .env # Environment variables
169
+ ```
170
+
171
+ ## 🌟 Benefits of Gradio Version
172
+
173
+ 1. **🚀 Faster Deployment**: No SQLite version conflicts
174
+ 2. **🌐 Built-in Sharing**: Automatic public links
175
+ 3. **📱 Mobile-Friendly**: Responsive design
176
+ 4. **🔧 Fewer Dependencies**: More stable installation
177
+ 5. **🎯 Multiple Hosting Options**: HF Spaces, Colab, Railway, etc.
178
+ 6. **🛠️ Better Error Handling**: Clearer error messages
179
+ 7. **⚡ Faster Loading**: Optimized model initialization
180
+
181
+ ## 🤝 Contributing
182
+
183
+ 1. Fork the repository
184
+ 2. Create a feature branch: `git checkout -b feature-name`
185
+ 3. Make your changes
186
+ 4. Commit: `git commit -m "Add feature"`
187
+ 5. Push: `git push origin feature-name`
188
+ 6. Create a Pull Request
189
+
190
+ ## 📄 License
191
+
192
+ This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
193
+
194
+ ## 🙏 Acknowledgments
195
+
196
+ - **SEA-LION AI**: For the amazing Southeast Asia-focused language models
197
+ - **Gradio**: For the excellent web interface framework
198
+ - **LangChain**: For the robust RAG pipeline
199
+ - **ChromaDB**: For efficient vector storage
200
+ - **Sentence Transformers**: For semantic embeddings
201
+
202
+ ---
203
+
204
+ **Built with ❤️ for the ASEAN education community**
app.py DELETED
@@ -1,123 +0,0 @@
1
- import streamlit as st
2
- import os
3
- from urllib.parse import urlparse, parse_qs
4
- from utils.rag_system import DocumentIngestion, RAGSystem, save_query_result, load_shared_query
5
- from datetime import datetime
6
- import uuid
7
- from utils.translations import translations, get_text, get_language_code
8
- from pathlib import Path
9
- from my_pages.search_uni import search_page
10
- from my_pages.upload_documents import upload_documents_page
11
- from my_pages.manage_documents import manage_documents_page
12
- from my_pages.about import about_page
13
- from utils.display import display_shared_query
14
-
15
- # Load external CSS
16
- def load_css(file_name):
17
- css_file = Path(file_name)
18
- if css_file.exists():
19
- with open(css_file) as f:
20
- st.markdown(f"<style>{f.read()}</style>", unsafe_allow_html=True)
21
-
22
- load_css("styles.css")
23
-
24
- # Configure Streamlit page
25
- st.set_page_config(
26
- page_title="PanSea University Search",
27
- page_icon="🎓",
28
- layout="wide",
29
- initial_sidebar_state="expanded"
30
- )
31
-
32
- def main():
33
- # Initialize language in session state if not present
34
- if 'app_language' not in st.session_state:
35
- st.session_state.app_language = "English"
36
-
37
- # Get current language from session state
38
- current_lang = st.session_state.app_language
39
-
40
- # Check for shared query in URL
41
- query_params = st.query_params
42
- shared_query_id = query_params.get("share", [None])[0]
43
-
44
- if shared_query_id:
45
- display_shared_query(shared_query_id)
46
- return
47
-
48
- # Main header
49
- st.markdown(f"""
50
- <div class="main-header">
51
- <h1>{get_text("app_title", current_lang)}</h1>
52
- <h5>{get_text("app_subtitle", current_lang)}</h5>
53
- </div>
54
- """, unsafe_allow_html=True)
55
-
56
- # Sidebar
57
- with st.sidebar:
58
- # Global language selector
59
- selected_language = st.selectbox(
60
- "🌐 Language / 语言 / Bahasa",
61
- ["English", "中文 (Chinese)", "Bahasa Malaysia", "ไทย (Thai)",
62
- "Bahasa Indonesia", "Tiếng Việt (Vietnamese)"],
63
- index=["English", "中文 (Chinese)", "Bahasa Malaysia", "ไทย (Thai)",
64
- "Bahasa Indonesia", "Tiếng Việt (Vietnamese)"].index(
65
- next((lang for lang in ["English", "中文 (Chinese)", "Bahasa Malaysia", "ไทย (Thai)",
66
- "Bahasa Indonesia", "Tiếng Việt (Vietnamese)"]
67
- if get_language_code(lang) == current_lang), "English")),
68
- key="global_language_selector"
69
- )
70
-
71
- # Update session state when language changes
72
- new_lang = get_language_code(selected_language)
73
- if new_lang != current_lang:
74
- st.session_state.app_language = new_lang
75
- st.rerun()
76
-
77
- # Update current_lang after potential change
78
- current_lang = st.session_state.app_language
79
-
80
- st.divider()
81
-
82
- # Navigation header
83
- st.markdown(f"## {get_text('navigation', current_lang)}")
84
-
85
-
86
- # Define the pages
87
- page_keys = ["search_universities", "upload_documents", "manage_documents", "about"]
88
- page_translations = {key: get_text(key, current_lang) for key in page_keys}
89
-
90
- # Initialize current page if needed
91
- if "current_page_key" not in st.session_state:
92
- st.session_state.current_page_key = page_keys[0]
93
-
94
- # Sidebar buttons
95
- for key in page_keys:
96
- if st.button(page_translations[key], use_container_width=True):
97
- st.session_state.current_page_key = key
98
-
99
- # Main content
100
- if st.session_state.current_page_key == "upload_documents":
101
- upload_documents_page()
102
- elif st.session_state.current_page_key == "manage_documents":
103
- manage_documents_page()
104
- elif st.session_state.current_page_key == "about":
105
- about_page()
106
- else:
107
- search_page()
108
-
109
-
110
-
111
- if __name__ == "__main__":
112
- # Check if SEA-LION API key is set
113
- if not os.getenv("SEA_LION_API_KEY"):
114
- st.error("🚨 SEA-LION API Key not found! Please set your SEA_LION_API_KEY in the .env file.")
115
- st.code("SEA_LION_API_KEY=your_api_key_here")
116
- st.stop()
117
-
118
- # Check if OpenAI API key is set (needed for embeddings)
119
- if not os.getenv("OPENAI_API_KEY") or os.getenv("OPENAI_API_KEY") == "your_openai_api_key_here":
120
- st.warning("⚠️ OpenAI API Key not configured properly. You'll need it for document embeddings.")
121
- st.info("The system will use SEA-LION models for text generation, but OpenAI for document embeddings.")
122
-
123
- main()
app_gradio.py ADDED
@@ -0,0 +1,137 @@
1
+ """
2
+ PANSEA University Requirements Assistant - Gradio Version (Modular)
3
+ A comprehensive tool for navigating university admission requirements across Southeast Asia.
4
+ """
5
+ import gradio as gr
6
+ import os
7
+ import sys
8
+ from datetime import datetime
9
+
10
+ # Add the current directory to Python path for imports
11
+ sys.path.append(os.path.dirname(os.path.abspath(__file__)))
12
+
13
+ # Import our RAG system
14
+ from utils.rag_system import DocumentIngestion, RAGSystem
15
+
16
+ # Import modular tab components
17
+ from tabs.initialize import create_initialize_tab
18
+ from tabs.upload import create_upload_tab
19
+ from tabs.query import create_query_tab
20
+ from tabs.manage import create_manage_tab
21
+ from tabs.help import create_help_tab
22
+
23
+ def create_interface():
24
+ """Create the main Gradio interface using modular components"""
25
+
26
+ # Global state management - shared across all tabs
27
+ global_vars = {
28
+ 'doc_ingestion': None,
29
+ 'rag_system': None,
30
+ 'vectorstore': None
31
+ }
32
+
33
+ # Custom CSS for better styling
34
+ custom_css = """
35
+ .gradio-container {
36
+ font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
37
+ }
38
+ .tab-nav button {
39
+ font-weight: 500;
40
+ font-size: 14px;
41
+ }
42
+ .tab-nav button[aria-selected="true"] {
43
+ background: linear-gradient(45deg, #1e3a8a, #3b82f6);
44
+ color: white;
45
+ }
46
+ .feedback-box {
47
+ background: #f8fafc;
48
+ border: 1px solid #e2e8f0;
49
+ border-radius: 8px;
50
+ padding: 16px;
51
+ margin: 8px 0;
52
+ }
53
+ .success-message {
54
+ background: #dcfce7;
55
+ color: #166534;
56
+ border: 1px solid #bbf7d0;
57
+ padding: 12px;
58
+ border-radius: 6px;
59
+ margin: 8px 0;
60
+ }
61
+ .error-message {
62
+ background: #fef2f2;
63
+ color: #dc2626;
64
+ border: 1px solid #fecaca;
65
+ padding: 12px;
66
+ border-radius: 6px;
67
+ margin: 8px 0;
68
+ }
69
+ """
70
+
71
+ # Create the main interface
72
+ with gr.Blocks(
73
+ title="🌏 PANSEA University Assistant",
74
+ theme=gr.themes.Soft(
75
+ primary_hue="blue",
76
+ secondary_hue="slate"
77
+ ),
78
+ css=custom_css,
79
+ analytics_enabled=False
80
+ ) as interface:
81
+
82
+ # Header
83
+ gr.Markdown("""
84
+ # 🌏 TopEdu
85
+
86
+ **Navigate University Admission Requirements Across Southeast Asia with AI-Powered Assistance**
87
+
88
+ Upload university documents, ask questions, and get intelligent answers about admission requirements,
89
+ programs, deadlines, and more across Southeast Asian universities.
90
+
91
+ ---
92
+ """)
93
+
94
+ # Main tabs using modular components
95
+ with gr.Tabs():
96
+ create_initialize_tab(global_vars)
97
+ create_upload_tab(global_vars)
98
+ create_query_tab(global_vars)
99
+ create_manage_tab(global_vars)
100
+ create_help_tab(global_vars)
101
+
102
+ # Footer
103
+ gr.Markdown(f"""
104
+ ---
105
+
106
+ **🔧 System Status**: Ready | **📅 Session**: {datetime.now().strftime('%Y-%m-%d %H:%M')} | **🔄 Version**: Modular Gradio
107
+
108
+ 💡 **Tip**: Start by initializing the system, then upload your university documents, and begin querying!
109
+ """)
110
+
111
+ return interface
112
+
113
+ def main():
114
+ """Launch the application"""
115
+ interface = create_interface()
116
+
117
+ # Launch configuration
118
+ interface.launch(
119
+ share=False, # Set to True for public sharing
120
+ server_name="0.0.0.0", # Allow external connections
121
+ server_port=7860, # Default Gradio port
122
+ show_api=False, # Hide API documentation
123
+ show_error=True, # Show detailed error messages
124
+ quiet=False, # Show startup messages
125
+ favicon_path=None, # Could add custom favicon
126
+ app_kwargs={
127
+ "docs_url": None, # Disable FastAPI docs
128
+ "redoc_url": None # Disable ReDoc docs
129
+ }
130
+ )
131
+
132
+ if __name__ == "__main__":
133
+ print("🚀 Starting PANSEA University Requirements Assistant...")
134
+ print("📍 Access the application at: http://localhost:7860")
135
+ print("🔗 For public sharing, set share=True in the launch() method")
136
+ print("-" * 60)
137
+ main()
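
The `tabs/*` modules imported above are not shown in this commit. Purely as a hypothetical sketch of the pattern that `create_*_tab(global_vars)` implies (a factory that builds its UI inside the active `gr.Blocks` context and stores shared objects in `global_vars`), such a module might look like:

```python
# Hypothetical sketch of a tab factory such as tabs/initialize.py; the real module
# in this commit may differ. It uses only names already imported in app_gradio.py.
import gradio as gr
from utils.rag_system import DocumentIngestion, RAGSystem

def create_initialize_tab(global_vars):
    with gr.Tab("🔧 Initialize"):
        status = gr.Markdown("System not initialized yet.")
        init_btn = gr.Button("Initialize All Systems", variant="primary")

        def initialize():
            # Store shared instances so the other tabs can reuse them.
            global_vars["doc_ingestion"] = DocumentIngestion()
            global_vars["rag_system"] = RAGSystem()
            global_vars["vectorstore"] = global_vars["doc_ingestion"].load_existing_vectorstore()
            return "✅ Systems initialized."

        init_btn.click(fn=initialize, outputs=status)
```
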
app_gradio_modular.py ADDED
@@ -0,0 +1,137 @@
1
+ """
2
+ PANSEA University Requirements Assistant - Gradio Version (Modular)
3
+ A comprehensive tool for navigating university admission requirements across Southeast Asia.
4
+ """
5
+ import gradio as gr
6
+ import os
7
+ import sys
8
+ from datetime import datetime
9
+
10
+ # Add the current directory to Python path for imports
11
+ sys.path.append(os.path.dirname(os.path.abspath(__file__)))
12
+
13
+ # Import our RAG system
14
+ from utils.rag_system import DocumentIngestion, RAGSystem
15
+
16
+ # Import modular tab components
17
+ from tabs.initialize import create_initialize_tab
18
+ from tabs.upload import create_upload_tab
19
+ from tabs.query import create_query_tab
20
+ from tabs.manage import create_manage_tab
21
+ from tabs.help import create_help_tab
22
+
23
+ def create_interface():
24
+ """Create the main Gradio interface using modular components"""
25
+
26
+ # Global state management - shared across all tabs
27
+ global_vars = {
28
+ 'doc_ingestion': None,
29
+ 'rag_system': None,
30
+ 'vectorstore': None
31
+ }
32
+
33
+ # Custom CSS for better styling
34
+ custom_css = """
35
+ .gradio-container {
36
+ font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
37
+ }
38
+ .tab-nav button {
39
+ font-weight: 500;
40
+ font-size: 14px;
41
+ }
42
+ .tab-nav button[aria-selected="true"] {
43
+ background: linear-gradient(45deg, #1e3a8a, #3b82f6);
44
+ color: white;
45
+ }
46
+ .feedback-box {
47
+ background: #f8fafc;
48
+ border: 1px solid #e2e8f0;
49
+ border-radius: 8px;
50
+ padding: 16px;
51
+ margin: 8px 0;
52
+ }
53
+ .success-message {
54
+ background: #dcfce7;
55
+ color: #166534;
56
+ border: 1px solid #bbf7d0;
57
+ padding: 12px;
58
+ border-radius: 6px;
59
+ margin: 8px 0;
60
+ }
61
+ .error-message {
62
+ background: #fef2f2;
63
+ color: #dc2626;
64
+ border: 1px solid #fecaca;
65
+ padding: 12px;
66
+ border-radius: 6px;
67
+ margin: 8px 0;
68
+ }
69
+ """
70
+
71
+ # Create the main interface
72
+ with gr.Blocks(
73
+ title="🌏 PANSEA University Assistant",
74
+ theme=gr.themes.Soft(
75
+ primary_hue="blue",
76
+ secondary_hue="slate"
77
+ ),
78
+ css=custom_css,
79
+ analytics_enabled=False
80
+ ) as interface:
81
+
82
+ # Header
83
+ gr.Markdown("""
84
+ # 🌏 TopEdu
85
+
86
+ **Navigate University Admission Requirements Across Southeast Asia with AI-Powered Assistance**
87
+
88
+ Upload university documents, ask questions, and get intelligent answers about admission requirements,
89
+ programs, deadlines, and more across Southeast Asian universities.
90
+
91
+ ---
92
+ """)
93
+
94
+ # Main tabs using modular components
95
+ with gr.Tabs():
96
+ create_initialize_tab(global_vars)
97
+ create_upload_tab(global_vars)
98
+ create_query_tab(global_vars)
99
+ create_manage_tab(global_vars)
100
+ create_help_tab(global_vars)
101
+
102
+ # Footer
103
+ gr.Markdown(f"""
104
+ ---
105
+
106
+ **🔧 System Status**: Ready | **📅 Session**: {datetime.now().strftime('%Y-%m-%d %H:%M')} | **🔄 Version**: Modular Gradio
107
+
108
+ 💡 **Tip**: Start by initializing the system, then upload your university documents, and begin querying!
109
+ """)
110
+
111
+ return interface
112
+
113
+ def main():
114
+ """Launch the application"""
115
+ interface = create_interface()
116
+
117
+ # Launch configuration
118
+ interface.launch(
119
+ share=False, # Set to True for public sharing
120
+ server_name="0.0.0.0", # Allow external connections
121
+ server_port=7860, # Default Gradio port
122
+ show_api=False, # Hide API documentation
123
+ show_error=True, # Show detailed error messages
124
+ quiet=False, # Show startup messages
125
+ favicon_path=None, # Could add custom favicon
126
+ app_kwargs={
127
+ "docs_url": None, # Disable FastAPI docs
128
+ "redoc_url": None # Disable ReDoc docs
129
+ }
130
+ )
131
+
132
+ if __name__ == "__main__":
133
+ print("🚀 Starting PANSEA University Requirements Assistant...")
134
+ print("📍 Access the application at: http://localhost:7860")
135
+ print("🔗 For public sharing, set share=True in the launch() method")
136
+ print("-" * 60)
137
+ main()
installed_packages.txt DELETED
@@ -1,178 +0,0 @@
1
- aiohappyeyeballs==2.6.1
2
- aiohttp==3.12.15
3
- aiosignal==1.4.0
4
- altair==5.5.0
5
- altex==0.2.0
6
- annotated-types==0.7.0
7
- anyio==4.10.0
8
- asgiref==3.9.1
9
- async-timeout==4.0.3
10
- attrs==25.3.0
11
- backoff==2.2.1
12
- bcrypt==4.3.0
13
- beautifulsoup4==4.13.4
14
- blinker==1.9.0
15
- build==1.3.0
16
- cachetools==5.5.2
17
- certifi==2025.8.3
18
- charset-normalizer==3.4.3
19
- chroma-hnswlib==0.7.3
20
- chromadb==1.0.16
21
- click==8.2.1
22
- coloredlogs==15.0.1
23
- contourpy==1.3.2
24
- cycler==0.12.1
25
- dataclasses-json==0.6.7
26
- Deprecated==1.2.18
27
- distro==1.9.0
28
- durationpy==0.10
29
- entrypoints==0.4
30
- exceptiongroup==1.3.0
31
- faiss-cpu==1.7.4
32
- Faker==37.5.3
33
- fastapi==0.116.1
34
- favicon==0.7.0
35
- filelock==3.18.0
36
- flatbuffers==25.2.10
37
- fonttools==4.59.0
38
- frozenlist==1.7.0
39
- fsspec==2025.7.0
40
- gitdb==4.0.12
41
- GitPython==3.1.45
42
- google-auth==2.40.3
43
- googleapis-common-protos==1.70.0
44
- grpcio==1.74.0
45
- h11==0.16.0
46
- hf-xet==1.1.7
47
- htbuilder==0.9.0
48
- httpcore==1.0.9
49
- httptools==0.6.4
50
- httpx==0.28.1
51
- huggingface-hub==0.34.4
52
- humanfriendly==10.0
53
- idna==3.10
54
- importlib-metadata==6.11.0
55
- importlib_resources==6.5.2
56
- Jinja2==3.1.6
57
- jiter==0.10.0
58
- joblib==1.5.1
59
- jsonpatch==1.33
60
- jsonpointer==3.0.0
61
- jsonschema==4.25.0
62
- jsonschema-specifications==2025.4.1
63
- kiwisolver==1.4.9
64
- kubernetes==33.1.0
65
- langchain-text-splitters==0.3.9
66
- lxml==6.0.0
67
- Markdown==3.8.2
68
- markdown-it-py==4.0.0
69
- markdownlit==0.0.7
70
- MarkupSafe==3.0.2
71
- marshmallow==3.26.1
72
- matplotlib==3.10.5
73
- mdurl==0.1.2
74
- mmh3==5.2.0
75
- mpmath==1.3.0
76
- multidict==6.6.4
77
- mypy_extensions==1.1.0
78
- narwhals==2.1.0
79
- networkx==3.4.2
80
- numpy==1.26.4
81
- oauthlib==3.3.1
82
- onnxruntime==1.22.1
83
- opentelemetry-api==1.27.0
84
- opentelemetry-exporter-otlp-proto-common==1.27.0
85
- opentelemetry-exporter-otlp-proto-grpc==1.27.0
86
- opentelemetry-instrumentation==0.48b0
87
- opentelemetry-instrumentation-asgi==0.48b0
88
- opentelemetry-instrumentation-fastapi==0.48b0
89
- opentelemetry-proto==1.27.0
90
- opentelemetry-sdk==1.27.0
91
- opentelemetry-semantic-conventions==0.48b0
92
- opentelemetry-util-http==0.48b0
93
- orjson==3.11.2
94
- overrides==7.7.0
95
- packaging==23.2
96
- pandas==2.3.1
97
- pillow==10.4.0
98
- posthog==5.4.0
99
- propcache==0.3.2
100
- protobuf==4.25.8
101
- pulsar-client==3.8.0
102
- pyarrow==21.0.0
103
- pyasn1==0.6.1
104
- pyasn1_modules==0.4.2
105
- pybase64==1.4.2
106
- pycryptodome==3.23.0
107
- pydantic==2.11.7
108
- pydantic_core==2.33.2
109
- pydeck==0.9.1
110
- Pygments==2.19.2
111
- pymdown-extensions==10.16.1
112
- pyparsing==3.2.3
113
- PyPDF2==3.0.1
114
- PyPika==0.48.9
115
- pyproject_hooks==1.2.0
116
- python-dateutil==2.9.0.post0
117
- python-dotenv==1.0.0
118
- pytz==2025.2
119
- PyYAML==6.0.2
120
- referencing==0.36.2
121
- regex==2025.7.34
122
- requests==2.32.4
123
- requests-oauthlib==2.0.0
124
- requests-toolbelt==1.0.0
125
- rich==13.9.4
126
- rpds-py==0.27.0
127
- rsa==4.9.1
128
- safetensors==0.6.2
129
- scikit-learn==1.7.1
130
- scipy==1.15.3
131
- sentence-transformers==5.1.0
132
- shellingham==1.5.4
133
- six==1.17.0
134
- smmap==5.0.2
135
- sniffio==1.3.1
136
- soupsieve==2.7
137
- SQLAlchemy==2.0.43
138
- st-annotated-text==4.0.2
139
- starlette==0.47.2
140
- streamlit==1.48.0
141
- streamlit-camera-input-live==0.2.0
142
- streamlit-card==1.0.2
143
- streamlit-embedcode==0.1.2
144
- streamlit-extras==0.3.5
145
- streamlit-image-coordinates==0.1.9
146
- streamlit-keyup==0.3.0
147
- streamlit-toggle-switch==1.0.2
148
- streamlit-vertical-slider==2.5.5
149
- streamlit_faker==0.0.4
150
- sympy==1.14.0
151
- tenacity==8.5.0
152
- threadpoolctl==3.6.0
153
- tiktoken==0.11.0
154
- tokenizers==0.21.4
155
- toml==0.10.2
156
- tomli==2.2.1
157
- torch==2.8.0
158
- tornado==6.5.2
159
- tqdm==4.67.1
160
- transformers==4.55.0
161
- typer==0.16.0
162
- typing-inspect==0.9.0
163
- typing-inspection==0.4.1
164
- typing_extensions==4.14.1
165
- tzdata==2025.2
166
- tzlocal==5.3.1
167
- urllib3==2.5.0
168
- uvicorn==0.35.0
169
- uvloop==0.21.0
170
- validators==0.35.0
171
- watchdog==3.0.0
172
- watchfiles==1.1.0
173
- websocket-client==1.8.0
174
- websockets==15.0.1
175
- wrapt==1.17.3
176
- yarl==1.20.1
177
- zipp==3.23.0
178
- zstandard==0.23.0
my_pages/about.py DELETED
@@ -1,37 +0,0 @@
1
- import streamlit as st
2
- from utils.translations import get_text
3
-
4
- def about_page():
5
- # Get current language from session state
6
- lang = st.session_state.get('app_language', 'English')
7
-
8
- st.header(get_text("about_header", lang))
9
-
10
- # col1, col2 = st.columns([2, 1])
11
-
12
- # with col1:
13
- st.markdown(f"""
14
- ### {get_text("who_we_are", lang)}
15
- {get_text("who_we_are_description", lang)}
16
-
17
- ### {get_text("what_we_do", lang)}
18
- {get_text("what_we_do_description", lang)}
19
-
20
- ### {get_text("supported_languages", lang)}
21
- - English
22
- - 中文 (Chinese / Mandarin)
23
- - Bahasa Malaysia
24
- - ไทย (Thai)
25
- - Bahasa Indonesia
26
- - Tiếng Việt (Vietnamese)
27
- - Filipino
28
- - ភាសាខ្មែរ (Khmer)
29
- - ພາສາລາວ (Lao)
30
- - မြန်မာဘာသာ (Burmese)
31
- """)
32
-
33
- # with col2:
34
- # st.markdown(f"""
35
- # ### {get_text("contact", lang)}
36
- # Reach out to us for support or inquiries!
37
- # """)
my_pages/manage_documents.py DELETED
@@ -1,73 +0,0 @@
1
- import streamlit as st
2
- from utils.rag_system import DocumentIngestion
3
- from utils.translations import get_text
4
-
5
-
6
- def manage_documents_page():
7
- # Get current language from session state
8
- current_lang = st.session_state.get('app_language', 'English')
9
-
10
- st.header(get_text("manage_header", current_lang))
11
- st.write(get_text("manage_description", current_lang))
12
-
13
- from utils.rag_system import DocumentIngestion
14
- doc_ingestion = DocumentIngestion()
15
- vectorstore = doc_ingestion.load_existing_vectorstore()
16
-
17
- if not vectorstore:
18
- st.warning("No files found. Upload documents first.")
19
- return
20
-
21
- # Get all documents (chunks) in the vectorstore
22
- try:
23
- # Chroma stores documents as chunks, but we want to show original metadata
24
- # We'll group by file_id to show unique documents
25
- collection = vectorstore._collection
26
- all_docs = collection.get(include=["metadatas", "documents"]) # Removed 'ids'
27
- metadatas = all_docs["metadatas"]
28
- ids = all_docs["ids"] # ids are always returned
29
- documents = all_docs["documents"]
30
-
31
- # Group by file_id
32
- doc_map = {}
33
- for meta, doc_id, doc_text in zip(metadatas, ids, documents):
34
- file_id = meta.get("file_id", doc_id)
35
- if file_id not in doc_map:
36
- doc_map[file_id] = {
37
- "source": meta.get("source", "Unknown"),
38
- "university": meta.get("university", "Unknown"),
39
- "country": meta.get("country", "Unknown"),
40
- "document_type": meta.get("document_type", "Unknown"),
41
- "upload_timestamp": meta.get("upload_timestamp", "Unknown"),
42
- "file_id": file_id,
43
- "chunks": []
44
- }
45
- doc_map[file_id]["chunks"].append(doc_text)
46
-
47
- if not doc_map:
48
- st.info(get_text("no_documents", current_lang))
49
- return
50
-
51
- st.subheader(get_text("document_list", current_lang))
52
- for file_id, info in doc_map.items():
53
- with st.expander(f"{info['source']} ({info['university']}, {info['country']})"):
54
- st.write(f"**Type:** {info['document_type']}")
55
- st.write(f"**{get_text('last_updated', current_lang)}:** {info['upload_timestamp']}")
56
- st.write(f"**File ID:** {file_id}")
57
- st.write(f"**{get_text('total_chunks', current_lang)}:** {len(info['chunks'])}")
58
- if st.button(f"🗑️ Delete Document", key=f"del_{file_id}"):
59
- # Delete all chunks with this file_id
60
- ids_to_delete = [doc_id for meta, doc_id in zip(metadatas, ids) if meta.get("file_id", doc_id) == file_id]
61
- vectorstore._collection.delete(ids=ids_to_delete)
62
- st.success(f"Deleted document: {info['source']}")
63
- st.rerun()
64
-
65
- # Add Delete All button
66
- if doc_map:
67
- if st.button(get_text("delete_all", current_lang), key="del_all_docs", type="secondary"):
68
- all_ids = list(ids)
69
- vectorstore._collection.delete(ids=all_ids)
70
- st.success(get_text("documents_deleted", current_lang))
71
- st.rerun()
72
- except Exception as e:
73
- st.error(f"Error loading documents: {str(e)}")
my_pages/search_uni.py DELETED
@@ -1,104 +0,0 @@
1
- import streamlit as st
2
- from utils.translations import get_text
3
- from utils.rag_system import RAGSystem, save_query_result
4
-
5
- def search_page():
6
- lang = st.session_state.get('app_language', 'English')
7
-
8
- # --- Header & description ---
9
- st.header(get_text("search_header", lang))
10
- st.write(get_text("search_description", lang))
11
- if lang != "English":
12
- st.info(f'{get_text("responses_in", lang)} **{lang}**')
13
-
14
- # --- Initialize query_text ---
15
- if "query_text" not in st.session_state:
16
- st.session_state.query_text = ""
17
-
18
- # --- Example queries ---
19
- complex_examples = [
20
- get_text("example_complex_1", lang),
21
- get_text("example_complex_2", lang),
22
- get_text("example_complex_3", lang),
23
- get_text("example_complex_4", lang)
24
- ]
25
- simple_examples = [
26
- get_text("example_simple_1", lang),
27
- get_text("example_simple_2", lang),
28
- get_text("example_simple_3", lang),
29
- get_text("example_simple_4", lang)
30
- ]
31
-
32
- with st.expander(get_text("example_queries", lang)):
33
- tab1, tab2 = st.tabs([get_text("complex_queries", lang), get_text("simple_queries", lang)])
34
- with tab1:
35
- for i, ex in enumerate(complex_examples):
36
- if st.button(ex, key=f"complex_{i}", use_container_width=True):
37
- st.session_state.query_text = ex
38
- with tab2:
39
- for i, ex in enumerate(simple_examples):
40
- if st.button(ex, key=f"simple_{i}", use_container_width=True):
41
- st.session_state.query_text = ex
42
-
43
- # --- Query input ---
44
- st.text_area(
45
- get_text("your_question", lang),
46
- height=120,
47
- placeholder=get_text("placeholder_text", lang),
48
- key="query_text"
49
- )
50
-
51
- # --- Optional filters (initially empty) ---
52
- with st.expander(get_text("advanced_filters", lang)):
53
- col1, col2, col3 = st.columns(3)
54
-
55
- budget_options = [get_text(opt, lang) for opt in ["any", "under_10k", "10k_20k", "20k_30k", "30k_40k", "over_40k"]]
56
- study_level_options = [get_text(lvl, lang) for lvl in ["diploma", "bachelor", "master", "phd"]]
57
- country_options = [get_text(c, lang) for c in ["singapore", "malaysia", "thailand", "indonesia", "philippines", "vietnam", "brunei"]]
58
-
59
- selected_budget = col1.select_slider(get_text("budget_range", lang), options=budget_options, value=budget_options[0])
60
- selected_levels = col2.multiselect(get_text("study_level", lang), study_level_options, default=[])
61
- selected_countries = col3.multiselect(get_text("preferred_countries", lang), country_options, default=[])
62
-
63
- # --- Ensure RAG system is initialized once ---
64
- if "rag_system_ready" not in st.session_state:
65
- st.session_state.rag_system_ready = False
66
- try:
67
- st.session_state.rag_system = RAGSystem()
68
- st.session_state.rag_system_ready = True
69
- except Exception as e:
70
- st.error(f"Failed to initialize RAG system: {e}")
71
-
72
- # --- Search button ---
73
- search_disabled = not st.session_state.query_text.strip() or not st.session_state.rag_system_ready
74
-
75
- if st.button(get_text("search_button", lang), disabled=search_disabled):
76
- placeholder = st.empty()
77
- placeholder.info("Searching...")
78
-
79
- # Combine query with filter info
80
- filter_info = {
81
- "budget": selected_budget if selected_budget != budget_options[0] else None,
82
- "study_levels": selected_levels,
83
- "countries": selected_countries
84
- }
85
- full_query = f"{st.session_state.query_text.strip()}\nFilters: {filter_info}"
86
-
87
- # Call RAG system with filters
88
- query_result = st.session_state.rag_system.query(
89
- question=full_query,
90
- language=lang
91
- )
92
-
93
- placeholder.empty()
94
- save_query_result(query_result)
95
-
96
- st.success(query_result["answer"])
97
-
98
- if query_result["source_documents"]:
99
- st.markdown("#### Source Documents")
100
- for i, doc in enumerate(query_result["source_documents"], 1):
101
- st.markdown(
102
- f"- **{i}. {doc.metadata.get('source', 'Unknown')}** "
103
- f"({doc.metadata.get('university', 'Unknown')}, {doc.metadata.get('country', 'Unknown')})"
104
- )
my_pages/upload_documents.py DELETED
@@ -1,202 +0,0 @@
1
- from langchain.schema import Document
2
- import streamlit as st
3
- from utils.rag_system import DocumentIngestion
4
- from utils.translations import get_text
5
-
6
- def upload_documents_page():
7
- # Get current language from session state
8
- current_lang = st.session_state.get('app_language', 'English')
9
-
10
- st.header(get_text("upload_header", current_lang))
11
- st.write(get_text("upload_description", current_lang))
12
-
13
- # Add information about automatic metadata detection
14
- st.info("🤖 **Automatic Metadata Detection Enabled**: The system will automatically detect university name, country, and document type from your uploaded files using AI.")
15
-
16
- # File upload (removed manual metadata input fields)
17
- uploaded_files = st.file_uploader(
18
- get_text("choose_files", current_lang),
19
- accept_multiple_files=True,
20
- type=['pdf'],
21
- help=get_text("file_limit", current_lang)
22
- )
23
-
24
- # # Optional: Add language selection for processing (if needed for multilingual documents)
25
- # col1, col2 = st.columns(2)
26
- # with col1:
27
- # processing_language = st.selectbox(
28
- # f"🌐 Processing Language (Optional)",
29
- # ["Auto-detect", "English", "Chinese", "Malay", "Thai", "Indonesian", "Vietnamese", "Filipino"],
30
- # help="Select the primary language of your documents for better metadata extraction"
31
- # )
32
-
33
- # with col2:
34
- # # Optional: Allow users to override detected metadata if needed
35
- # allow_manual_override = st.checkbox(
36
- # "🔧 Allow manual metadata correction after processing",
37
- # value=False,
38
- # help="Enable this to manually correct any incorrectly detected metadata"
39
- # )
40
-
41
- if uploaded_files and st.button(get_text("process_documents", current_lang), type="primary"):
42
- with st.spinner(f"{get_text('processing_docs', current_lang)} (with automatic metadata detection)..."):
43
- try:
44
- # Initialize document ingestion
45
- doc_ingestion = DocumentIngestion()
46
-
47
- # Process documents with automatic metadata extraction
48
- documents = doc_ingestion.process_documents(uploaded_files)
49
-
50
- if documents:
51
- # Show detected metadata for review/correction if enabled
52
- # if allow_manual_override and documents:
53
- # st.subheader("🔍 Review Detected Metadata")
54
- # st.write("Review and correct the automatically detected metadata if needed:")
55
-
56
- # corrected_documents = []
57
- # for i, doc in enumerate(documents):
58
- # with st.expander(f"📄 {doc.metadata['source']}", expanded=False):
59
- # col1, col2, col3 = st.columns(3)
60
-
61
- # with col1:
62
- # corrected_university = st.text_input(
63
- # "University Name",
64
- # value=doc.metadata['university'],
65
- # key=f"uni_{i}"
66
- # )
67
-
68
- # with col2:
69
- # corrected_country = st.selectbox(
70
- # "Country",
71
- # ["Unknown", "Singapore", "Malaysia", "Thailand", "Indonesia",
72
- # "Philippines", "Vietnam", "Brunei", "Cambodia", "Laos", "Myanmar"],
73
- # index=0 if doc.metadata['country'] == "Unknown" else
74
- # (["Unknown", "Singapore", "Malaysia", "Thailand", "Indonesia",
75
- # "Philippines", "Vietnam", "Brunei", "Cambodia", "Laos", "Myanmar"].index(doc.metadata['country'])
76
- # if doc.metadata['country'] in ["Singapore", "Malaysia", "Thailand", "Indonesia",
77
- # "Philippines", "Vietnam", "Brunei", "Cambodia", "Laos", "Myanmar"] else 0),
78
- # key=f"country_{i}"
79
- # )
80
-
81
- # with col3:
82
- # corrected_doc_type = st.selectbox(
83
- # "Document Type",
84
- # ["admission_requirements", "tuition_fees", "program_information",
85
- # "scholarship_info", "application_deadlines", "general_info"],
86
- # index=["admission_requirements", "tuition_fees", "program_information",
87
- # "scholarship_info", "application_deadlines", "general_info"].index(doc.metadata['document_type']),
88
- # key=f"doctype_{i}"
89
- # )
90
-
91
- # # Update document metadata with corrections
92
- # corrected_doc = Document(
93
- # page_content=doc.page_content,
94
- # metadata={
95
- # **doc.metadata,
96
- # "university": corrected_university,
97
- # "country": corrected_country,
98
- # "document_type": corrected_doc_type,
99
- # "manually_corrected": True
100
- # }
101
- # )
102
- # corrected_documents.append(corrected_doc)
103
-
104
- # # Use corrected documents
105
- # documents = corrected_documents
106
-
107
- # if st.button("✅ Confirm and Save Documents", type="primary"):
108
- # # Create or update vector store with corrected metadata
109
- # vectorstore = doc_ingestion.create_vector_store(documents)
110
-
111
- # if vectorstore:
112
- # st.success(f"✅ {get_text('successfully_processed', current_lang)} {len(documents)} {get_text('documents', current_lang)} with corrected metadata!")
113
-
114
- # # Show final processed files
115
- # with st.expander("📋 Final Processed Files"):
116
- # for doc in documents:
117
- # st.write(f"• **{doc.metadata['source']}**")
118
- # st.write(f" - University: {doc.metadata['university']}")
119
- # st.write(f" - Country: {doc.metadata['country']}")
120
- # st.write(f" - Type: {doc.metadata['document_type']}")
121
- # if doc.metadata.get('manually_corrected'):
122
- # st.write(f" - ✏️ Manually corrected")
123
- # st.write("---")
124
- # else:
125
- # Process normally without manual override
126
- vectorstore = doc_ingestion.create_vector_store(documents)
127
-
128
- if vectorstore:
129
- st.success(f"✅ {get_text('successfully_processed', current_lang)} {len(documents)} {get_text('documents', current_lang)} with automatic metadata detection!")
130
-
131
- # Show processed files with detected metadata
132
- with st.expander("📋 Processed Files with Detected Metadata"):
133
- for doc in documents:
134
- st.write(f"• **{doc.metadata['source']}**")
135
- st.write(f" - 🏫 University: {doc.metadata['university']}")
136
- st.write(f" - 🌏 Country: {doc.metadata['country']}")
137
- st.write(f" - 📋 Type: {doc.metadata['document_type']}")
138
- st.write(f" - 🤖 Auto-detected: Yes")
139
- st.write("---")
140
-
141
- # Show summary of detected metadata
142
- universities = list(set([doc.metadata['university'] for doc in documents if doc.metadata['university'] != 'Unknown']))
143
- countries = list(set([doc.metadata['country'] for doc in documents if doc.metadata['country'] != 'Unknown']))
144
- doc_types = list(set([doc.metadata['document_type'] for doc in documents]))
145
-
146
- if universities or countries or doc_types:
147
- st.subheader("📊 Detection Summary")
148
- if universities:
149
- st.write(f"🏫 **Universities detected**: {', '.join(universities)}")
150
- if countries:
151
- st.write(f"🌏 **Countries detected**: {', '.join(countries)}")
152
- if doc_types:
153
- st.write(f"📋 **Document types detected**: {', '.join(doc_types)}")
154
- else:
155
- st.error(get_text("no_docs_processed", current_lang))
156
-
157
- except Exception as e:
158
- st.error(f"{get_text('failed_to_process', current_lang)}: {str(e)}")
159
- st.error("Please check your API keys and model configurations.")
160
-
161
- # Additional helper function for metadata validation
162
- def validate_metadata(metadata: dict) -> dict:
163
- """Validate and clean extracted metadata"""
164
-
165
- # List of valid countries for ASEAN region
166
- valid_countries = [
167
- "Singapore", "Malaysia", "Thailand", "Indonesia",
168
- "Philippines", "Vietnam", "Brunei", "Cambodia", "Laos", "Myanmar"
169
- ]
170
-
171
- # List of valid document types
172
- valid_doc_types = [
173
- "admission_requirements", "tuition_fees", "program_information",
174
- "scholarship_info", "application_deadlines", "general_info"
175
- ]
176
-
177
- # Clean and validate country
178
- if metadata.get('country', '').strip():
179
- country = metadata['country'].strip()
180
- # Try to match with valid countries (case insensitive)
181
- for valid_country in valid_countries:
182
- if valid_country.lower() in country.lower() or country.lower() in valid_country.lower():
183
- metadata['country'] = valid_country
184
- break
185
- else:
186
- # If no match found, keep original but mark as unvalidated
187
- if country.lower() not in [c.lower() for c in valid_countries]:
188
- metadata['country'] = country # Keep original
189
-
190
- # Validate document type
191
- if metadata.get('document_type') not in valid_doc_types:
192
- metadata['document_type'] = "general_info" # Default fallback
193
-
194
- # Clean university name
195
- if metadata.get('university_name'):
196
- # Remove common prefixes/suffixes that might be incorrectly included
197
- university = metadata['university_name'].strip()
198
- # Remove quotes if present
199
- university = university.strip('"\'')
200
- metadata['university_name'] = university
201
-
202
- return metadata
requirements.txt CHANGED
@@ -1,5 +1,4 @@
1
- #change requirements
2
-
3
  aiohappyeyeballs==2.6.1
4
  aiohttp==3.12.15
5
  aiosignal==1.4.0
@@ -10,6 +9,7 @@ attrs==25.3.0
10
  backoff==2.2.1
11
  bcrypt==4.3.0
12
  blinker==1.9.0
 
13
  build==1.3.0
14
  cachetools==5.5.2
15
  certifi==2025.8.3
@@ -20,6 +20,8 @@ coloredlogs==15.0.1
20
  dataclasses-json==0.6.7
21
  distro==1.9.0
22
  durationpy==0.10
 
 
23
  filelock==3.18.0
24
  flatbuffers==25.2.10
25
  frozenlist==1.7.0
@@ -28,6 +30,9 @@ gitdb==4.0.12
28
  GitPython==3.1.45
29
  google-auth==2.40.3
30
  googleapis-common-protos==1.70.0
 
 
 
31
  grpcio==1.74.0
32
  h11==0.16.0
33
  hf-xet==1.1.7
@@ -91,12 +96,14 @@ pydantic==2.11.7
91
  pydantic-settings==2.10.1
92
  pydantic_core==2.33.2
93
  pydeck==0.9.1
 
94
  Pygments==2.19.2
95
  PyPDF2==3.0.1
96
  PyPika==0.48.9
97
  pyproject_hooks==1.2.0
98
  python-dateutil==2.9.0.post0
99
  python-dotenv==1.1.1
 
100
  pytz==2025.2
101
  PyYAML==6.0.2
102
  referencing==0.36.2
@@ -107,15 +114,19 @@ requests-toolbelt==1.0.0
107
  rich==14.1.0
108
  rpds-py==0.27.0
109
  rsa==4.9.1
 
 
110
  safetensors==0.6.2
111
  scikit-learn==1.7.1
112
  scipy==1.16.1
 
113
  sentence-transformers==5.1.0
114
  shellingham==1.5.4
115
  six==1.17.0
116
  smmap==5.0.2
117
  sniffio==1.3.1
118
-
 
119
  streamlit==1.48.0
120
  sympy==1.14.0
121
  tenacity==9.1.2
@@ -123,6 +134,7 @@ threadpoolctl==3.6.0
123
  tiktoken==0.11.0
124
  tokenizers==0.21.4
125
  toml==0.10.2
 
126
  torch==2.8.0
127
  tornado==6.5.2
128
  tqdm==4.67.1
 
1
+ aiofiles==24.1.0
 
2
  aiohappyeyeballs==2.6.1
3
  aiohttp==3.12.15
4
  aiosignal==1.4.0
 
9
  backoff==2.2.1
10
  bcrypt==4.3.0
11
  blinker==1.9.0
12
+ Brotli==1.1.0
13
  build==1.3.0
14
  cachetools==5.5.2
15
  certifi==2025.8.3
 
20
  dataclasses-json==0.6.7
21
  distro==1.9.0
22
  durationpy==0.10
23
+ fastapi==0.116.1
24
+ ffmpy==0.6.1
25
  filelock==3.18.0
26
  flatbuffers==25.2.10
27
  frozenlist==1.7.0
 
30
  GitPython==3.1.45
31
  google-auth==2.40.3
32
  googleapis-common-protos==1.70.0
33
+ gradio==5.42.0
34
+ gradio_client==1.11.1
35
+ groovy==0.1.2
36
  grpcio==1.74.0
37
  h11==0.16.0
38
  hf-xet==1.1.7
 
96
  pydantic-settings==2.10.1
97
  pydantic_core==2.33.2
98
  pydeck==0.9.1
99
+ pydub==0.25.1
100
  Pygments==2.19.2
101
  PyPDF2==3.0.1
102
  PyPika==0.48.9
103
  pyproject_hooks==1.2.0
104
  python-dateutil==2.9.0.post0
105
  python-dotenv==1.1.1
106
+ python-multipart==0.0.20
107
  pytz==2025.2
108
  PyYAML==6.0.2
109
  referencing==0.36.2
 
114
  rich==14.1.0
115
  rpds-py==0.27.0
116
  rsa==4.9.1
117
+ ruff==0.12.8
118
+ safehttpx==0.1.6
119
  safetensors==0.6.2
120
  scikit-learn==1.7.1
121
  scipy==1.16.1
122
+ semantic-version==2.10.0
123
  sentence-transformers==5.1.0
124
  shellingham==1.5.4
125
  six==1.17.0
126
  smmap==5.0.2
127
  sniffio==1.3.1
128
+ SQLAlchemy==2.0.43
129
+ starlette==0.47.2
130
  streamlit==1.48.0
131
  sympy==1.14.0
132
  tenacity==9.1.2
 
134
  tiktoken==0.11.0
135
  tokenizers==0.21.4
136
  toml==0.10.2
137
+ tomlkit==0.13.3
138
  torch==2.8.0
139
  tornado==6.5.2
140
  tqdm==4.67.1
runtime.txt DELETED
@@ -1 +0,0 @@
1
- python-3.10.12
 
 
start.sh DELETED
@@ -1,43 +0,0 @@
1
- #!/bin/bash
2
-
3
- # PanSea University Search - Startup Script
4
-
5
- echo "🎓 Starting PanSea University Search..."
6
-
7
- # Check if virtual environment exists
8
- if [ ! -d ".venv" ]; then
9
- echo "❌ Virtual environment not found. Please run setup first."
10
- exit 1
11
- fi
12
-
13
- # Activate virtual environment
14
- source .venv/bin/activate
15
-
16
- # Check if .env file exists
17
- if [ ! -f ".env" ]; then
18
- echo "⚠️ .env file not found. Please create one with your OpenAI API key."
19
- echo "Example:"
20
- echo "OPENAI_API_KEY=your_api_key_here"
21
- exit 1
22
- fi
23
-
24
- # Create necessary directories
25
- mkdir -p chroma_db
26
- mkdir -p documents
27
- mkdir -p query_results
28
-
29
- # Check if required packages are installed
30
- echo "🔍 Checking dependencies..."
31
- python -c "import streamlit, langchain, chromadb" 2>/dev/null
32
- if [ $? -ne 0 ]; then
33
- echo "❌ Dependencies not found. Installing..."
34
- pip install -r requirements.txt
35
- fi
36
-
37
- echo "🚀 Starting Streamlit application..."
38
- echo "📱 Open your browser to: http://localhost:8501"
39
- echo "🛑 Press Ctrl+C to stop the application"
40
- echo ""
41
-
42
- # Start the Streamlit app
43
- streamlit run app.py --server.port=8501 --server.address=0.0.0.0
tabs/help.py ADDED
@@ -0,0 +1,168 @@
1
+ """
2
+ Help tab functionality for the Gradio app
3
+ """
4
+ import gradio as gr
5
+
6
+ def create_help_tab(global_vars):
7
+ """Create the Help tab with comprehensive documentation"""
8
+ with gr.Tab("❓ Help", id="help"):
9
+ gr.Markdown("""
10
+ # 🌏 PANSEA University Requirements Assistant - User Guide
11
+
12
+ Welcome to the PANSEA (Pan-Southeast Asian) University Requirements Assistant! This tool helps you navigate university admission requirements across Southeast Asian countries using advanced AI-powered document analysis.
13
+
14
+ ---
15
+
16
+ ## 🚀 Getting Started
17
+
18
+ ### Step 1: Initialize the System
19
+ 1. Go to the **🚀 Initialize System** tab
20
+ 2. Click **"Initialize Systems"**
21
+ 3. Wait for the success message
22
+ 4. The system will set up AI models and document processing capabilities
23
+
24
+ ### Step 2: Upload Documents
25
+ 1. Navigate to the **📄 Upload Documents** tab
26
+ 2. Select one or more PDF files containing university requirement information
27
+ 3. No manual metadata entry is needed; the system detects it automatically for each file:
28
+ - **University Name**: the official name of the institution
29
+ - **Country**: the Southeast Asian country the document covers
30
+ - **Document Type**: e.g. admission requirements or tuition fees
31
+ - **Language**: the language the document is written in
32
+ 4. Click **"Process Documents"**
33
+ 5. Wait for processing completion
34
+
35
+ ### Step 3: Query Documents
36
+ 1. Go to the **🔍 Search & Query** tab
37
+ 2. Type your question in the query box
38
+ 3. Click **"Search Documents"**
39
+ 4. Review the AI-generated answer and source references
40
+ 5. Use example questions to explore different types of queries
41
+
42
+ ### Step 4: Manage Documents
43
+ 1. Visit the **🗂 Manage Documents** tab
44
+ 2. View all uploaded documents and statistics
45
+ 3. Delete individual documents or clear all documents as needed
46
+
47
+ ---
48
+
49
+ ## 📖 Features Overview
50
+
51
+ ### 🤖 AI-Powered Analysis
52
+ - Uses advanced SEA-LION AI models optimized for Southeast Asian contexts
53
+ - Semantic search across your document collection
54
+ - Contextual answers with source citations
55
+ - Multi-language document support
56
+
57
+ ### 📚 Document Management
58
+ - Support for PDF documents
59
+ - Intelligent text chunking for better search results
60
+ - Metadata tracking (university, country, document type, language)
61
+ - Easy document deletion and management
62
+
63
+ ### 🌐 Regional Focus
64
+ - Specialized for Southeast Asian universities
65
+ - Supports multiple countries and languages
66
+ - Culturally aware responses
67
+ - Up-to-date admission requirement information
68
+
69
+ ---
70
+
71
+ ## 💡 Usage Tips
72
+
73
+ ### Asking Better Questions
74
+ - **Be Specific**: "What are the English proficiency requirements for Computer Science at NUS?" instead of "What are the requirements?"
75
+ - **Include Context**: Mention specific programs, countries, or universities you're interested in
76
+ - **Use Keywords**: Include terms like "admission", "requirements", "GPA", "test scores", etc.
77
+
78
+ ### Document Upload Best Practices
79
+ - **Quality Documents**: Upload official university brochures, requirement documents, or application guides
80
+ - **Accurate Metadata**: Fill in all metadata fields correctly for better search results
81
+ - **Regular Updates**: Replace outdated documents with current versions
82
+ - **Organized Approach**: Upload documents systematically by country or university
83
+
84
+ ### Managing Your Knowledge Base
85
+ - **Regular Maintenance**: Remove outdated documents periodically
86
+ - **Logical Organization**: Group related documents together
87
+ - **Backup Important Queries**: Save important answers for future reference
88
+
89
+ ---
90
+
91
+ ## 🛠 Troubleshooting
92
+
93
+ ### Common Issues
94
+
95
+ **Problem**: "Please initialize systems first" error
96
+ - **Solution**: Go to the Initialize System tab and click "Initialize Systems"
97
+
98
+ **Problem**: Document upload fails
99
+ - **Solution**: Ensure PDF files are not corrupted and contain text (not just images)
100
+
101
+ **Problem**: No search results
102
+ - **Solution**: Check if documents are uploaded and try different keywords
103
+
104
+ **Problem**: Slow performance
105
+ - **Solution**: Wait for processing to complete, avoid uploading too many large documents at once
106
+
107
+ ### Technical Requirements
108
+ - **File Format**: PDF documents only
109
+ - **File Size**: Reasonable size limits (avoid extremely large files)
110
+ - **Content**: Text-based PDFs work best (scanned images may not work well)
111
+ - **Internet**: Required for AI model access
112
+
113
+ ---
114
+
115
+ ## 📊 Understanding Results
116
+
117
+ ### Query Responses
118
+ - **Answer**: AI-generated response based on your documents
119
+ - **Sources**: Specific document chunks used to generate the answer
120
+ - **Confidence**: Implied by the specificity and detail of the response
121
+ - **Context**: Related information that might be helpful
122
+
123
+ ### Document Statistics
124
+ - **Total Documents**: Number of unique documents uploaded
125
+ - **Total Chunks**: Number of text segments for searching
126
+ - **Metadata**: Information about each document's origin and type
127
+
128
+ ---
129
+
130
+ ## 🌟 Best Practices for University Research
131
+
132
+ ### Research Strategy
133
+ 1. **Start Broad**: Upload general university information first
134
+ 2. **Get Specific**: Add detailed program requirements
135
+ 3. **Compare Options**: Query for comparisons between universities
136
+ 4. **Verify Information**: Cross-reference with official university websites
137
+
138
+ ### Question Types to Try
139
+ - **Admission Requirements**: "What are the minimum GPA requirements for..."
140
+ - **Test Scores**: "What IELTS/TOEFL scores are needed for..."
141
+ - **Application Deadlines**: "When is the application deadline for..."
142
+ - **Program Details**: "What courses are included in the... program at..."
143
+ - **Scholarships**: "What scholarship opportunities are available for..."
144
+
145
+ ---
146
+
147
+ ## 🆘 Support & Feedback
148
+
149
+ If you encounter issues or have suggestions for improvement:
150
+
151
+ 1. **Check Documentation**: Review this help section first
152
+ 2. **Try Different Approaches**: Rephrase your queries or check document formats
153
+ 3. **Document Issues**: Note specific error messages or unexpected behavior
154
+ 4. **Feature Requests**: Consider what additional functionality would be helpful
155
+
156
+ ---
157
+
158
+ ## 🔄 Version Information
159
+
160
+ **Current Version**: Gradio-based PANSEA Assistant
161
+ **AI Models**: SEA-LION optimized for Southeast Asian contexts
162
+ **Document Processing**: Advanced semantic chunking and embedding
163
+ **Search Technology**: Vector similarity search with contextual ranking
164
+
165
+ ---
166
+
167
+ *Happy university hunting! 🎓 We hope this tool helps you find the perfect educational opportunity in Southeast Asia.*
168
+ """)
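The tab modules added in this commit all follow the same pattern: a `create_*_tab(global_vars)` builder that renders its UI inside a `gr.Tab` and reads shared state from a plain dict. A minimal sketch of how these builders might be wired together in an entry point; the `app.py` layout and `gr.Blocks` usage here are assumptions, not part of this commit:

```python
import gradio as gr

from tabs.initialize import create_initialize_tab
from tabs.upload import create_upload_tab
from tabs.query import create_query_tab
from tabs.manage import create_manage_tab
from tabs.help import create_help_tab

# Shared mutable state passed into every tab builder
global_vars = {"doc_ingestion": None, "rag_system": None, "vectorstore": None}

with gr.Blocks(title="PANSEA University Requirements Assistant") as demo:
    with gr.Tabs():
        create_initialize_tab(global_vars)
        create_upload_tab(global_vars)
        create_query_tab(global_vars)
        create_manage_tab(global_vars)
        create_help_tab(global_vars)

if __name__ == "__main__":
    demo.launch()
```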
tabs/initialize.py ADDED
@@ -0,0 +1,55 @@
1
+ """
2
+ Initialize tab functionality for the Gradio app
3
+ """
4
+ import gradio as gr
5
+ from utils.rag_system import DocumentIngestion, RAGSystem
6
+
7
+ def initialize_systems(global_vars):
8
+ """Initialize the RAG systems"""
9
+ try:
10
+ print("🚀 Initializing document ingestion system...")
11
+ global_vars['doc_ingestion'] = DocumentIngestion()
12
+ print("🚀 Initializing RAG system...")
13
+ global_vars['rag_system'] = RAGSystem()
14
+ return "✅ Systems initialized successfully! You can now upload documents."
15
+ except Exception as e:
16
+ error_msg = f"❌ Error initializing systems: {str(e)}\n\n"
17
+
18
+ if "sentence-transformers" in str(e):
19
+ error_msg += """
20
+ **Possible solutions:**
21
+ 1. Install sentence-transformers: `pip install sentence-transformers`
22
+ 2. Or provide OpenAI API key in environment variables
23
+ 3. Check that PyTorch is properly installed
24
+
25
+ **For deployment:**
26
+ - Ensure requirements.txt includes: sentence-transformers, torch, transformers
27
+ """
28
+ return error_msg
29
+
30
+ def create_initialize_tab(global_vars):
31
+ """Create the Initialize System tab"""
32
+ with gr.Tab("🚀 Initialize System", id="init"):
33
+ gr.Markdown("""
34
+ ### Step 1: Initialize the System
35
+ Click the button below to initialize the AI models and embedding systems.
36
+ This may take a few moments on first run as models are downloaded.
37
+ """)
38
+
39
+ init_btn = gr.Button(
40
+ "🚀 Initialize Systems",
41
+ variant="primary",
42
+ size="lg"
43
+ )
44
+
45
+ init_status = gr.Textbox(
46
+ label="Initialization Status",
47
+ interactive=False,
48
+ lines=8,
49
+ placeholder="Click 'Initialize Systems' to start..."
50
+ )
51
+
52
+ init_btn.click(
53
+ lambda: initialize_systems(global_vars),
54
+ outputs=init_status
55
+ )
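Outside the Gradio UI, the same helper can be exercised directly, which is handy for smoke-testing the SEA-LION and embedding setup before launching the app. A usage sketch, assuming the required API keys are already set in the environment:

```python
from tabs.initialize import initialize_systems

global_vars = {}
status = initialize_systems(global_vars)
print(status)  # "✅ Systems initialized successfully! ..." or an error message with install hints
```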
tabs/manage.py ADDED
@@ -0,0 +1,237 @@
1
+ """
2
+ Manage documents tab functionality for the Gradio app
3
+ """
4
+ import gradio as gr
5
+
6
+ def manage_documents(global_vars):
7
+ """Manage uploaded documents - view, delete individual or all documents"""
8
+ doc_ingestion = global_vars.get('doc_ingestion')
9
+
10
+ if not doc_ingestion:
11
+ return "❌ Please initialize systems first!", "", []
12
+
13
+ try:
14
+ vectorstore = doc_ingestion.load_existing_vectorstore()
15
+
16
+ if not vectorstore:
17
+ return "⚠️ No documents found. Upload documents first.", "", []
18
+
19
+ # Get all documents from vectorstore
20
+ collection = vectorstore._collection
21
+ all_docs = collection.get(include=["metadatas", "documents"])
22
+ metadatas = all_docs["metadatas"]
23
+ ids = all_docs["ids"]
24
+ documents = all_docs["documents"]
25
+
26
+ # Group by file_id to show unique documents
27
+ doc_map = {}
28
+ for meta, doc_id, doc_text in zip(metadatas, ids, documents):
29
+ file_id = meta.get("file_id", doc_id)
30
+ if file_id not in doc_map:
31
+ doc_map[file_id] = {
32
+ "source": meta.get("source", "Unknown"),
33
+ "university": meta.get("university", "Unknown"),
34
+ "country": meta.get("country", "Unknown"),
35
+ "document_type": meta.get("document_type", "Unknown"),
36
+ "language": meta.get("language", "Unknown"),
37
+ "upload_timestamp": meta.get("upload_timestamp", "Unknown"),
38
+ "file_id": file_id,
39
+ "chunks": []
40
+ }
41
+ doc_map[file_id]["chunks"].append(doc_text)
42
+
43
+ if not doc_map:
44
+ return "ℹ️ No documents found in the system.", "", []
45
+
46
+ # Create summary
47
+ total_documents = len(doc_map)
48
+ total_chunks = sum(len(info["chunks"]) for info in doc_map.values())
49
+
50
+ summary = f"""## 📊 Document Statistics
51
+
52
+ **Total Documents:** {total_documents}
53
+ **Total Text Chunks:** {total_chunks}
54
+ **Storage Status:** Active
55
+
56
+ ## 📚 Document List
57
+ """
58
+
59
+ # Create document list with details
60
+ document_list = ""
61
+ file_id_list = []
62
+
63
+ for i, (file_id, info) in enumerate(doc_map.items(), 1):
64
+ timestamp = info['upload_timestamp'][:19] if len(info['upload_timestamp']) > 19 else info['upload_timestamp']
65
+
66
+ document_list += f"""
67
+ **{i}. {info['source']}**
68
+ - University: {info['university']}
69
+ - Country: {info['country']}
70
+ - Type: {info['document_type']}
71
+ - Language: {info['language']}
72
+ - Chunks: {len(info['chunks'])}
73
+ - Uploaded: {timestamp}
74
+ - File ID: `{file_id}`
75
+
76
+ ---
77
+ """
78
+ file_id_list.append(file_id)
79
+
80
+ # Create dropdown options for individual deletion
81
+ file_options = [f"{info['source']} ({info['university']})" for info in doc_map.values()]
82
+
83
+ return summary, document_list, file_options
84
+
85
+ except Exception as e:
86
+ return f"❌ Error loading documents: {str(e)}", "", []
87
+
88
+ def delete_document(selected_file, current_doc_list, global_vars):
89
+ """Delete a specific document"""
90
+ doc_ingestion = global_vars.get('doc_ingestion')
91
+
92
+ if not doc_ingestion or not selected_file:
93
+ return "❌ Please select a document to delete.", current_doc_list
94
+
95
+ try:
96
+ vectorstore = doc_ingestion.load_existing_vectorstore()
97
+ if not vectorstore:
98
+ return "❌ No vectorstore found.", current_doc_list
99
+
100
+ # Get all documents and find the matching file_id
101
+ collection = vectorstore._collection
102
+ all_docs = collection.get(include=["metadatas"])
103
+ metadatas = all_docs["metadatas"]
104
+ ids = all_docs["ids"]
105
+
106
+ # Find file_id for the selected document
107
+ target_file_id = None
108
+ for meta, doc_id in zip(metadatas, ids):
109
+ source = meta.get("source", "Unknown")
110
+ university = meta.get("university", "Unknown")
111
+ if f"{source} ({university})" == selected_file:
112
+ target_file_id = meta.get("file_id", doc_id)
113
+ break
114
+
115
+ if not target_file_id:
116
+ return "❌ Document not found.", current_doc_list
117
+
118
+ # Delete all chunks with this file_id
119
+ ids_to_delete = [doc_id for meta, doc_id in zip(metadatas, ids) if meta.get("file_id", doc_id) == target_file_id]
120
+ collection.delete(ids=ids_to_delete)
121
+
122
+ # Refresh the document list
123
+ _, new_doc_list, _ = manage_documents(global_vars)
124
+
125
+ return f"✅ Successfully deleted document: {selected_file}", new_doc_list
126
+
127
+ except Exception as e:
128
+ return f"❌ Error deleting document: {str(e)}", current_doc_list
129
+
130
+ def delete_all_documents(global_vars):
131
+ """Delete all documents from the vectorstore"""
132
+ doc_ingestion = global_vars.get('doc_ingestion')
133
+
134
+ if not doc_ingestion:
135
+ return "❌ Please initialize systems first.", ""
136
+
137
+ try:
138
+ vectorstore_instance = doc_ingestion.load_existing_vectorstore()
139
+ if not vectorstore_instance:
140
+ return "⚠️ No documents found to delete.", ""
141
+
142
+ # Get all document IDs
143
+ collection = vectorstore_instance._collection
144
+ all_docs = collection.get()
145
+ all_ids = all_docs["ids"]
146
+
147
+ # Delete all documents
148
+ if all_ids:
149
+ collection.delete(ids=all_ids)
150
+ # Clear global vectorstore
151
+ global_vars['vectorstore'] = None
152
+ return f"✅ Successfully deleted all {len(all_ids)} document chunks.", ""
153
+ else:
154
+ return "ℹ️ No documents found to delete.", ""
155
+
156
+ except Exception as e:
157
+ return f"❌ Error deleting all documents: {str(e)}", ""
158
+
159
+ def create_manage_tab(global_vars):
160
+ """Create the Manage Documents tab"""
161
+ with gr.Tab("🗂 Manage Documents", id="manage"):
162
+ gr.Markdown("""
163
+ ### Step 4: Manage Your Documents
164
+ View, inspect, and manage all uploaded documents in your knowledge base.
165
+ You can see document details and delete individual documents or all documents.
166
+ """)
167
+
168
+ # Buttons for actions
169
+ with gr.Row():
170
+ refresh_btn = gr.Button("🔄 Refresh Document List", variant="secondary")
171
+ delete_all_btn = gr.Button("🗑️ Delete All Documents", variant="stop")
172
+
173
+ # Document statistics and list
174
+ doc_summary = gr.Markdown(
175
+ value="📊 Click 'Refresh Document List' to view your documents.",
176
+ label="Document Summary"
177
+ )
178
+
179
+ doc_list = gr.Markdown(
180
+ value="📚 Document details will appear here after refresh.",
181
+ label="Document List"
182
+ )
183
+
184
+ # Individual document deletion
185
+ gr.Markdown("### 🗑️ Delete Individual Document")
186
+
187
+ with gr.Row():
188
+ file_selector = gr.Dropdown(
189
+ choices=[],
190
+ label="Select Document to Delete",
191
+ interactive=True,
192
+ info="First click 'Refresh Document List' to see available documents"
193
+ )
194
+ delete_single_btn = gr.Button("🗑️ Delete Selected", variant="stop")
195
+
196
+ delete_status = gr.Textbox(
197
+ label="Action Status",
198
+ interactive=False,
199
+ lines=2,
200
+ placeholder="Deletion status will appear here..."
201
+ )
202
+
203
+ # Event handlers
204
+ def refresh_documents():
205
+ summary, documents, file_options = manage_documents(global_vars)
206
+ # Update dropdown choices
207
+ return summary, documents, gr.Dropdown(choices=file_options, value=None)
208
+
209
+ def delete_selected_document(selected_file, current_list):
210
+ if not selected_file:
211
+ return "❌ Please select a document to delete first.", current_list, gr.Dropdown(choices=[])
212
+
213
+ status, new_list = delete_document(selected_file, current_list, global_vars)
214
+ # Also refresh the file options after deletion
215
+ _, _, new_options = manage_documents(global_vars)
216
+ return status, new_list, gr.Dropdown(choices=new_options, value=None)
217
+
218
+ def delete_all_docs():
219
+ status, empty_list = delete_all_documents(global_vars)
220
+ return status, "📚 No documents in the system.", gr.Dropdown(choices=[], value=None)
221
+
222
+ # Connect event handlers
223
+ refresh_btn.click(
224
+ refresh_documents,
225
+ outputs=[doc_summary, doc_list, file_selector]
226
+ )
227
+
228
+ delete_single_btn.click(
229
+ delete_selected_document,
230
+ inputs=[file_selector, doc_list],
231
+ outputs=[delete_status, doc_list, file_selector]
232
+ )
233
+
234
+ delete_all_btn.click(
235
+ delete_all_docs,
236
+ outputs=[delete_status, doc_list, file_selector]
237
+ )
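Because the management helpers only depend on the shared `global_vars` dict, they can also be driven headlessly, for example to inspect or prune the Chroma collection from a script. A sketch, assuming systems have been initialized and documents uploaded beforehand:

```python
from tabs.initialize import initialize_systems
from tabs.manage import manage_documents, delete_document

global_vars = {}
initialize_systems(global_vars)

summary, doc_list, file_options = manage_documents(global_vars)
print(summary)

if file_options:
    # entries look like "brochure.pdf (Universiti Malaya)"
    status, _ = delete_document(file_options[0], doc_list, global_vars)
    print(status)
```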
tabs/query.py ADDED
@@ -0,0 +1,139 @@
1
+ """
2
+ Query documents tab functionality for the Gradio app
3
+ """
4
+ import gradio as gr
5
+
6
+ def query_documents(question, language, global_vars):
7
+ """Handle document queries"""
8
+ rag_system = global_vars.get('rag_system')
9
+ vectorstore = global_vars.get('vectorstore')
10
+
11
+ if not rag_system:
12
+ return "❌ Please initialize systems first using the 'Initialize System' tab!"
13
+
14
+ if not vectorstore:
15
+ return "❌ Please upload and process documents first using the 'Upload Documents' tab!"
16
+
17
+ if not question.strip():
18
+ return "❌ Please enter a question."
19
+
20
+ try:
21
+ print(f"🔍 Processing query: {question}")
22
+ result = rag_system.query(question, language)
23
+
24
+ # Format response
25
+ answer = result["answer"]
26
+ sources = result.get("source_documents", [])
27
+ model_used = result.get("model_used", "SEA-LION")
28
+
29
+ # Add model information
30
+ response = f"**Model Used:** {model_used}\n\n"
31
+ response += f"**Answer:**\n{answer}\n\n"
32
+
33
+ if sources:
34
+ response += "**📚 Sources:**\n"
35
+ for i, doc in enumerate(sources[:3], 1):
36
+ metadata = doc.metadata
37
+ source_name = metadata.get('source', 'Unknown')
38
+ university = metadata.get('university', 'Unknown')
39
+ country = metadata.get('country', 'Unknown')
40
+ doc_type = metadata.get('document_type', 'Unknown')
41
+
42
+ response += f"{i}. **{source_name}**\n"
43
+ response += f" - University: {university}\n"
44
+ response += f" - Country: {country}\n"
45
+ response += f" - Type: {doc_type}\n"
46
+ response += f" - Preview: {doc.page_content[:150]}...\n\n"
47
+ else:
48
+ response += "\n*No specific sources found. This might be a general response.*"
49
+
50
+ return response
51
+
52
+ except Exception as e:
53
+ return f"❌ Error querying documents: {str(e)}\n\nPlease check the console for more details."
54
+
55
+ def get_example_questions():
56
+ """Return example questions for the interface"""
57
+ return [
58
+ "What are the admission requirements for Computer Science programs in Singapore?",
59
+ "Which universities offer scholarships for international students?",
60
+ "What are the tuition fees for MBA programs in Thailand?",
61
+ "Find universities with engineering programs under $5000 per year",
62
+ "What are the application deadlines for programs in Malaysia?",
63
+ "Compare admission requirements between different ASEAN countries"
64
+ ]
65
+
66
+ def create_query_tab(global_vars):
67
+ """Create the Search & Query tab"""
68
+ with gr.Tab("🔍 Search & Query", id="query"):
69
+ gr.Markdown("""
70
+ ### Step 3: Ask Questions
71
+ Ask questions about the uploaded documents in your preferred language.
72
+ The AI will provide detailed answers with source citations.
73
+ """)
74
+
75
+ with gr.Row():
76
+ with gr.Column(scale=3):
77
+ question_input = gr.Textbox(
78
+ label="💭 Your Question",
79
+ placeholder="Ask anything about the universities...",
80
+ lines=3
81
+ )
82
+
83
+ with gr.Column(scale=1):
84
+ language_dropdown = gr.Dropdown(
85
+ choices=[
86
+ "English", "Chinese", "Malay", "Thai",
87
+ "Indonesian", "Vietnamese", "Filipino"
88
+ ],
89
+ value="English",
90
+ label="🌍 Response Language"
91
+ )
92
+
93
+ query_btn = gr.Button(
94
+ "🔍 Search Documents",
95
+ variant="primary",
96
+ size="lg"
97
+ )
98
+
99
+ answer_output = gr.Textbox(
100
+ label="🤖 AI Response",
101
+ interactive=False,
102
+ lines=20,
103
+ placeholder="Ask a question to get AI-powered answers..."
104
+ )
105
+
106
+ # Example questions section
107
+ gr.Markdown("### 💡 Example Questions")
108
+ example_questions = get_example_questions()
109
+
110
+ with gr.Row():
111
+ for i in range(0, len(example_questions), 2):
112
+ with gr.Column():
113
+ if i < len(example_questions):
114
+ example_btn = gr.Button(
115
+ example_questions[i],
116
+ size="sm",
117
+ variant="secondary"
118
+ )
119
+ example_btn.click(
120
+ lambda x=example_questions[i]: x,
121
+ outputs=question_input
122
+ )
123
+
124
+ if i + 1 < len(example_questions):
125
+ example_btn2 = gr.Button(
126
+ example_questions[i + 1],
127
+ size="sm",
128
+ variant="secondary"
129
+ )
130
+ example_btn2.click(
131
+ lambda x=example_questions[i + 1]: x,
132
+ outputs=question_input
133
+ )
134
+
135
+ query_btn.click(
136
+ lambda question, language: query_documents(question, language, global_vars),
137
+ inputs=[question_input, language_dropdown],
138
+ outputs=answer_output
139
+ )
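`query_documents` relies on a specific shape for the dict returned by `RAGSystem.query`. For clarity, this is the contract the formatting code above assumes; the field names are taken from the code, while the values below are purely illustrative:

```python
# Expected result shape consumed by query_documents()
result = {
    "answer": "NUS requires an IELTS score of at least 6.5 ...",  # illustrative text
    "model_used": "SEA-LION",                                      # optional, defaults to "SEA-LION"
    "source_documents": [                                          # optional list of LangChain Documents;
        # each item exposes .metadata (source, university, country, document_type)
        # and .page_content, of which the first 150 characters are shown as a preview
    ],
}
```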
tabs/upload.py ADDED
@@ -0,0 +1,99 @@
1
+ """
2
+ Upload documents tab functionality for the Gradio app
3
+ """
4
+ import gradio as gr
5
+
6
+ def upload_documents(files, global_vars):
7
+ """Handle document upload and processing"""
8
+ doc_ingestion = global_vars.get('doc_ingestion')
9
+
10
+ if not doc_ingestion:
11
+ return "❌ Please initialize systems first using the 'Initialize System' tab!"
12
+
13
+ if not files:
14
+ return "❌ Please upload at least one PDF file."
15
+
16
+ try:
17
+ # Filter for PDF files only
18
+ pdf_files = []
19
+ for file_path in files:
20
+ if file_path.endswith('.pdf'):
21
+ pdf_files.append(file_path)
22
+
23
+ if not pdf_files:
24
+ return "❌ Please upload PDF files only."
25
+
26
+ print(f"📄 Processing {len(pdf_files)} PDF file(s)...")
27
+
28
+ # Process documents
29
+ documents = doc_ingestion.process_documents(pdf_files)
30
+
31
+ if documents:
32
+ print("🔗 Creating vector store...")
33
+ # Create vector store
34
+ vectorstore = doc_ingestion.create_vector_store(documents)
35
+
36
+ if vectorstore:
37
+ # Store vectorstore in global vars
38
+ global_vars['vectorstore'] = vectorstore
39
+
40
+ # Create summary
41
+ summary = f"✅ Successfully processed {len(documents)} document(s):\n\n"
42
+
43
+ for i, doc in enumerate(documents, 1):
44
+ metadata = doc.metadata
45
+ university = metadata.get('university', 'Unknown')
46
+ country = metadata.get('country', 'Unknown')
47
+ doc_type = metadata.get('document_type', 'Unknown')
48
+ language = metadata.get('language', 'Unknown')
49
+
50
+ summary += f"{i}. **{metadata['source']}**\n"
51
+ summary += f" - University: {university}\n"
52
+ summary += f" - Country: {country}\n"
53
+ summary += f" - Type: {doc_type}\n"
54
+ summary += f" - Language: {language}\n\n"
55
+
56
+ summary += "🎉 **Ready for queries!** Go to the 'Search & Query' tab to start asking questions."
57
+ return summary
58
+ else:
59
+ return "❌ Failed to create vector store from documents."
60
+ else:
61
+ return "❌ No documents were successfully processed. Please check if your PDFs are readable."
62
+
63
+ except Exception as e:
64
+ return f"❌ Error processing documents: {str(e)}\n\nPlease check the console for more details."
65
+
66
+ def create_upload_tab(global_vars):
67
+ """Create the Upload Documents tab"""
68
+ with gr.Tab("📄 Upload Documents", id="upload"):
69
+ gr.Markdown("""
70
+ ### Step 2: Upload PDF Documents
71
+ Upload university documents (brochures, admission guides, etc.) in PDF format.
72
+ The system will automatically extract metadata including university name, country, and document type.
73
+ """)
74
+
75
+ file_upload = gr.File(
76
+ label="📁 Upload PDF Documents",
77
+ file_types=[".pdf"],
78
+ file_count="multiple",
79
+ height=120
80
+ )
81
+
82
+ upload_btn = gr.Button(
83
+ "📄 Process Documents",
84
+ variant="primary",
85
+ size="lg"
86
+ )
87
+
88
+ upload_status = gr.Textbox(
89
+ label="Processing Status",
90
+ interactive=False,
91
+ lines=12,
92
+ placeholder="Upload PDF files and click 'Process Documents'..."
93
+ )
94
+
95
+ upload_btn.click(
96
+ lambda files: upload_documents(files, global_vars),
97
+ inputs=file_upload,
98
+ outputs=upload_status
99
+ )
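With the Streamlit upload objects gone, `DocumentIngestion` now works on plain file paths, so the pipeline behind the upload tab can also be scripted directly. A sketch; the PDF path below is a placeholder:

```python
from utils.rag_system import DocumentIngestion, RAGSystem

ingestion = DocumentIngestion()
docs = ingestion.process_documents(["./documents/sample_admissions.pdf"])  # placeholder path
vectorstore = ingestion.create_vector_store(docs)

rag = RAGSystem()
result = rag.query("What are the English requirements?", language="English")
print(result["answer"])
```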
utils/rag_system.py CHANGED
@@ -2,7 +2,6 @@ import os
2
  import uuid
3
  import tempfile
4
  from typing import List, Optional, Dict, Any
5
- import streamlit as st
6
  from pathlib import Path
7
  import PyPDF2
8
  from langchain.text_splitter import RecursiveCharacterTextSplitter
@@ -27,24 +26,54 @@ class AlternativeEmbeddings:
27
  """Alternative embeddings using Sentence Transformers when OpenAI is not available"""
28
 
29
  def __init__(self):
 
 
 
30
  try:
31
  from sentence_transformers import SentenceTransformer
32
- # Use BGE-small-en for better performance
33
- self.model = SentenceTransformer("BAAI/bge-small-en-v1.5")
34
- self.embedding_size = 384
 
35
  except ImportError:
36
- st.error("sentence-transformers not available. Please install it or provide OpenAI API key.")
37
- self.model = None
38
 
39
  def embed_documents(self, texts):
40
  if not self.model:
41
- return []
42
- return self.model.encode(texts).tolist()
 
 
 
 
43
 
44
  def embed_query(self, text):
45
  if not self.model:
46
- return []
47
- return self.model.encode([text])[0].tolist()
 
 
 
 
48
 
49
  class SEALionLLM:
50
  """Custom LLM class for SEA-LION models"""
@@ -168,7 +197,7 @@ class SEALionLLM:
168
  return response_text
169
 
170
  except Exception as e:
171
- st.error(f"Error with SEA-LION model: {str(e)}")
172
  return f"I apologize, but I encountered an error processing your query. Please try rephrasing your question. Error: {str(e)}"
173
 
174
  def extract_metadata(self, document_text: str) -> Dict[str, str]:
@@ -213,33 +242,33 @@ class SEALionLLM:
213
  )
214
 
215
  response_text = response.choices[0].message.content.strip()
216
- st.subheader("--- DEBUG: LLM Metadata Extraction Details ---")
217
- st.write(f"**Input Text for LLM (first 2 pages):**\n```\n{document_text[:1000]}...\n```") # Show first 1000 chars of input
218
- st.write(f"**Raw LLM Response:**\n```json\n{response_text}\n```")
219
 
220
  json_match = re.search(r'\{.*?\}', response_text, re.DOTALL)
221
  if json_match:
222
  json_str = json_match.group(0)
223
  try:
224
  metadata = json.loads(json_str)
225
- st.write(f"**Parsed JSON Metadata:**\n```json\n{json.dumps(metadata, indent=2)}\n```")
226
  required_keys = ["university_name", "country", "document_type", "language"]
227
  if all(key in metadata for key in required_keys):
228
- st.success("DEBUG: Successfully extracted and parsed metadata from LLM.")
229
  return metadata
230
  else:
231
- st.warning("DEBUG: LLM response missing required keys, attempting fallback or using defaults.")
232
  return self._get_default_metadata()
233
  except json.JSONDecodeError as e:
234
- st.error(f"DEBUG: JSON Parsing Failed: {e}")
235
- st.write(f"DEBUG: Attempting fallback text extraction from raw response.")
236
  return self._extract_from_text_response(response_text)
237
  else:
238
- st.error("DEBUG: No JSON object found in LLM response.")
239
  return self._extract_from_text_response(response_text)
240
 
241
  except Exception as e:
242
- st.error(f"DEBUG: Error during LLM Metadata Extraction: {str(e)}")
243
  return self._get_default_metadata()
244
 
245
  def _extract_from_text_response(self, response_text: str) -> Dict[str, str]:
@@ -260,7 +289,7 @@ class SEALionLLM:
260
  elif "language" in line.lower() and ":" in line:
261
  value = line.split(":", 1)[1].strip().strip('",')
262
  metadata["language"] = value
263
- st.write(f"DEBUG: Fallback text extraction result: {metadata}")
264
  return metadata
265
 
266
  def _get_default_metadata(self) -> Dict[str, str]:
@@ -301,10 +330,10 @@ class DocumentIngestion:
301
  self.embeddings = OpenAIEmbeddings()
302
  self.embedding_type = "OpenAI"
303
  except Exception as e:
304
- st.error("Both BGE and OpenAI embeddings failed. Please check your setup.")
305
  raise e
306
  else:
307
- st.error("No embedding model available. Please install sentence-transformers or provide OpenAI API key.")
308
  raise Exception("No embedding model available")
309
 
310
  self.text_splitter = SemanticChunker(
@@ -321,80 +350,77 @@ class DocumentIngestion:
321
  self.persist_directory = os.getenv("CHROMA_PERSIST_DIRECTORY", "./chroma_db")
322
  os.makedirs(self.persist_directory, exist_ok=True)
323
 
324
- def extract_text_from_pdf(self, pdf_file) -> List[str]:
325
- """Extract text from uploaded PDF file with multiple fallback methods."""
326
  try:
327
  # Method 1: Try with PyPDF2 (handles most PDFs including encrypted ones with PyCryptodome)
328
- pdf_reader = PyPDF2.PdfReader(pdf_file)
329
-
330
- # Check if PDF is encrypted
331
- if pdf_reader.is_encrypted:
332
- # Try to decrypt with empty password (common for protected but not password-protected PDFs)
333
- try:
334
- pdf_reader.decrypt("")
335
- except Exception:
336
- st.warning(f"PDF {pdf_file.name} is password-protected. Please provide an unprotected version.")
337
- return [] # Return empty list for password-protected PDFs
338
-
339
- text_per_page = []
340
- for page_num, page in enumerate(pdf_reader.pages):
341
- try:
342
- page_text = page.extract_text()
343
- text_per_page.append(page_text)
344
- except Exception as e:
345
- st.warning(f"Could not extract text from page {page_num + 1} of {pdf_file.name}: {str(e)}")
346
- text_per_page.append("") # Append empty string for failed pages
347
-
348
- if any(text.strip() for text in text_per_page):
349
- return text_per_page
350
- else:
351
- st.warning(f"No extractable text found in {pdf_file.name}. This might be a scanned PDF or image-based document.")
352
- return []
 
353
 
354
  except Exception as e:
355
  error_msg = str(e)
356
  if "PyCryptodome" in error_msg:
357
- st.error(f"Encryption error with {pdf_file.name}: {error_msg}")
358
- st.info("💡 The PDF uses encryption. PyCryptodome has been installed to handle this.")
359
  elif "password" in error_msg.lower():
360
- st.error(f"Password-protected PDF: {pdf_file.name}")
361
- st.info("💡 Please provide an unprotected version of this PDF.")
362
  else:
363
- st.error(f"Error extracting text from {pdf_file.name}: {error_msg}")
364
  return []
365
 
366
- def process_documents(self, uploaded_files) -> List[Document]: # Removed university_name, country, document_type parameters
367
- """Process uploaded PDF files and convert to documents with automatic metadata extraction."""
368
  documents = []
369
  processed_count = 0
370
  failed_count = 0
371
 
372
- st.info(f"📄 Processing {len(uploaded_files)} document(s) with automatic metadata detection...") # Changed to print
373
 
374
- for uploaded_file in uploaded_files:
375
- if uploaded_file.type == "application/pdf":
376
- st.write(f"🔍 Extracting text from: **{uploaded_file.name}**") # Changed to print
 
377
 
378
  # Extract text per page
379
- text_per_page = self.extract_text_from_pdf(uploaded_file)
380
- st.write(f"DEBUG: Extracted {len(text_per_page)} pages from {uploaded_file.name}")
381
 
382
  if text_per_page:
383
  # Combine first two pages for metadata extraction
384
  text_for_metadata = "\n".join(text_per_page[:2])
385
- st.write(f"DEBUG: Text for metadata extraction (first 500 chars): {text_for_metadata[:500]}")
386
  # Extract metadata using LLM
387
- st.write(f"🤖 Detecting metadata for: **{uploaded_file.name}**") # Changed to print
388
  extracted_metadata = self.sea_lion_llm.extract_metadata(text_for_metadata)
389
 
390
- # Validate and clean metadata (assuming validate_metadata is defined elsewhere or will be added)
391
- # For now, we\'ll use the extracted_metadata directly.
392
- # If you want me to add validate_metadata here, please provide its content.
393
- # extracted_metadata = validate_metadata(extracted_metadata)
394
-
395
  # Create metadata
396
  metadata = {
397
- "source": uploaded_file.name,
398
  "university": extracted_metadata.get("university_name", "Unknown"),
399
  "country": extracted_metadata.get("country", "Unknown"),
400
  "document_type": extracted_metadata.get("document_type", "general_info"),
@@ -410,26 +436,27 @@ class DocumentIngestion:
410
  )
411
  documents.append(doc)
412
  processed_count += 1
413
- st.success(f"✅ Successfully processed: **{uploaded_file.name}** ({len(doc.page_content)} characters)") # Changed to print
414
  else:
415
  failed_count += 1
416
- st.warning(f"⚠️ Could not extract text from **{uploaded_file.name}**") # Changed to print
417
  else:
418
  failed_count += 1
419
- st.error(f"❌ Unsupported file type: **{uploaded_file.type}** for {uploaded_file.name}") # Changed to print
 
420
 
421
  # Summary
422
  if processed_count > 0:
423
- st.success(f"🎉 Successfully processed **{processed_count}** document(s)") # Changed to print
424
  if failed_count > 0:
425
- st.warning(f"⚠️ Failed to process **{failed_count}** document(s)") # Changed to print
426
 
427
  return documents
428
 
429
  def create_vector_store(self, documents: List[Document]) -> Chroma:
430
  """Create and persist vector store from documents."""
431
  if not documents:
432
- st.error("No documents to process") # Changed to print
433
  return None
434
 
435
  # Split documents into chunks
@@ -453,7 +480,7 @@ class DocumentIngestion:
453
  )
454
  return vectorstore
455
  except Exception as e:
456
- st.warning(f"Could not load existing vector store: {str(e)}") # Changed to print
457
  return None
458
 
459
  class RAGSystem:
@@ -480,7 +507,7 @@ class RAGSystem:
480
  )
481
  return vectorstore
482
  except Exception as e:
483
- st.error(f"Error loading vector store: {str(e)}")
484
  return None
485
 
486
  def query(self, question: str, language: str = "English") -> Dict[str, Any]:
@@ -532,7 +559,7 @@ Document {i} (Source: {source_info}, University: {university}, Country: {country
532
  }
533
 
534
  except Exception as e:
535
- st.error(f"Error querying system: {str(e)}")
536
  return {
537
  "answer": f"Error processing your question: {str(e)}",
538
  "source_documents": [],
@@ -570,7 +597,7 @@ def save_query_result(query_result: Dict[str, Any]):
570
  json.dump(save_data, f, indent=2, ensure_ascii=False)
571
  return True
572
  except Exception as e:
573
- st.error(f"Error saving query result: {str(e)}")
574
  return False
575
  return False
576
 
@@ -583,6 +610,6 @@ def load_shared_query(query_id: str) -> Optional[Dict[str, Any]]:
583
  with open(result_file, 'r', encoding='utf-8') as f:
584
  return json.load(f)
585
  except Exception as e:
586
- st.error(f"Error loading shared query: {str(e)}")
587
 
588
  return None
 
2
  import uuid
3
  import tempfile
4
  from typing import List, Optional, Dict, Any
 
5
  from pathlib import Path
6
  import PyPDF2
7
  from langchain.text_splitter import RecursiveCharacterTextSplitter
 
26
  """Alternative embeddings using Sentence Transformers when OpenAI is not available"""
27
 
28
  def __init__(self):
29
+ self.model = None
30
+ self.embedding_size = 384
31
+
32
  try:
33
  from sentence_transformers import SentenceTransformer
34
+
35
+ # Try smaller models in order of preference for better cloud compatibility
36
+ model_options = [
37
+ ("all-MiniLM-L6-v2", 384), # Very small and reliable
38
+ ("paraphrase-MiniLM-L3-v2", 384), # Even smaller
39
+ ("BAAI/bge-small-en-v1.5", 384) # Original choice
40
+ ]
41
+
42
+ for model_name, embed_size in model_options:
43
+ try:
44
+ print(f"🔄 Trying to load model: {model_name}")
45
+ self.model = SentenceTransformer(model_name)
46
+ self.embedding_size = embed_size
47
+ print(f"✅ Successfully loaded: {model_name}")
48
+ break
49
+ except Exception as e:
50
+ print(f"⚠️ Failed to load {model_name}: {str(e)}")
51
+ continue
52
+
53
+ if not self.model:
54
+ raise Exception("All embedding models failed to load")
55
+
56
  except ImportError:
57
+ print("sentence-transformers not available. Please install it or provide OpenAI API key.")
58
+ raise ImportError("sentence-transformers not available")
59
 
60
  def embed_documents(self, texts):
61
  if not self.model:
62
+ raise Exception("No embedding model available")
63
+ try:
64
+ return self.model.encode(texts, convert_to_numpy=True).tolist()
65
+ except Exception as e:
66
+ print(f"Error encoding documents: {e}")
67
+ raise
68
 
69
  def embed_query(self, text):
70
  if not self.model:
71
+ raise Exception("No embedding model available")
72
+ try:
73
+ return self.model.encode([text], convert_to_numpy=True)[0].tolist()
74
+ except Exception as e:
75
+ print(f"Error encoding query: {e}")
76
+ raise
77
 
78
  class SEALionLLM:
79
  """Custom LLM class for SEA-LION models"""
 
197
  return response_text
198
 
199
  except Exception as e:
200
+ print(f"Error with SEA-LION model: {str(e)}")
201
  return f"I apologize, but I encountered an error processing your query. Please try rephrasing your question. Error: {str(e)}"
202
 
203
  def extract_metadata(self, document_text: str) -> Dict[str, str]:
 
242
  )
243
 
244
  response_text = response.choices[0].message.content.strip()
245
+ print("--- DEBUG: LLM Metadata Extraction Details ---")
246
+ print(f"**Input Text for LLM (first 2 pages):**\n```\n{document_text[:1000]}...\n```") # Show first 1000 chars of input
247
+ print(f"**Raw LLM Response:**\n```json\n{response_text}\n```")
248
 
249
  json_match = re.search(r'\{.*?\}', response_text, re.DOTALL)
250
  if json_match:
251
  json_str = json_match.group(0)
252
  try:
253
  metadata = json.loads(json_str)
254
+ print(f"**Parsed JSON Metadata:**\n```json\n{json.dumps(metadata, indent=2)}\n```")
255
  required_keys = ["university_name", "country", "document_type", "language"]
256
  if all(key in metadata for key in required_keys):
257
+ print("DEBUG: Successfully extracted and parsed metadata from LLM.")
258
  return metadata
259
  else:
260
+ print("DEBUG: LLM response missing required keys, attempting fallback or using defaults.")
261
  return self._get_default_metadata()
262
  except json.JSONDecodeError as e:
263
+ print(f"DEBUG: JSON Parsing Failed: {e}")
264
+ print(f"DEBUG: Attempting fallback text extraction from raw response.")
265
  return self._extract_from_text_response(response_text)
266
  else:
267
+ print("DEBUG: No JSON object found in LLM response.")
268
  return self._extract_from_text_response(response_text)
269
 
270
  except Exception as e:
271
+ print(f"DEBUG: Error during LLM Metadata Extraction: {str(e)}")
272
  return self._get_default_metadata()
273
 
274
  def _extract_from_text_response(self, response_text: str) -> Dict[str, str]:
 
289
  elif "language" in line.lower() and ":" in line:
290
  value = line.split(":", 1)[1].strip().strip('",')
291
  metadata["language"] = value
292
+ print(f"DEBUG: Fallback text extraction result: {metadata}")
293
  return metadata
294
 
295
  def _get_default_metadata(self) -> Dict[str, str]:
 
330
  self.embeddings = OpenAIEmbeddings()
331
  self.embedding_type = "OpenAI"
332
  except Exception as e:
333
+ print("Both BGE and OpenAI embeddings failed. Please check your setup.")
334
  raise e
335
  else:
336
+ print("No embedding model available. Please install sentence-transformers or provide OpenAI API key.")
337
  raise Exception("No embedding model available")
338
 
339
  self.text_splitter = SemanticChunker(
 
350
  self.persist_directory = os.getenv("CHROMA_PERSIST_DIRECTORY", "./chroma_db")
351
  os.makedirs(self.persist_directory, exist_ok=True)
352
 
353
+ def extract_text_from_pdf(self, pdf_file_path) -> List[str]:
354
+ """Extract text from PDF file path with multiple fallback methods."""
355
  try:
356
  # Method 1: Try with PyPDF2 (handles most PDFs including encrypted ones with PyCryptodome)
357
+ with open(pdf_file_path, 'rb') as pdf_file:
358
+ pdf_reader = PyPDF2.PdfReader(pdf_file)
359
+
360
+ # Check if PDF is encrypted
361
+ if pdf_reader.is_encrypted:
362
+ # Try to decrypt with empty password (common for protected but not password-protected PDFs)
363
+ try:
364
+ pdf_reader.decrypt("")
365
+ except Exception:
366
+ print(f"PDF {os.path.basename(pdf_file_path)} is password-protected. Please provide an unprotected version.")
367
+ return [] # Return empty list for password-protected PDFs
368
+
369
+ text_per_page = []
370
+ for page_num, page in enumerate(pdf_reader.pages):
371
+ try:
372
+ page_text = page.extract_text()
373
+ text_per_page.append(page_text)
374
+ except Exception as e:
375
+ print(f"Could not extract text from page {page_num + 1} of {os.path.basename(pdf_file_path)}: {str(e)}")
376
+ text_per_page.append("") # Append empty string for failed pages
377
+
378
+ if any(text.strip() for text in text_per_page):
379
+ return text_per_page
380
+ else:
381
+ print(f"No extractable text found in {os.path.basename(pdf_file_path)}. This might be a scanned PDF or image-based document.")
382
+ return []
383
 
384
  except Exception as e:
385
  error_msg = str(e)
386
  if "PyCryptodome" in error_msg:
387
+ print(f"Encryption error with {os.path.basename(pdf_file_path)}: {error_msg}")
388
+ print("💡 The PDF uses encryption. PyCryptodome has been installed to handle this.")
389
  elif "password" in error_msg.lower():
390
+ print(f"Password-protected PDF: {os.path.basename(pdf_file_path)}")
391
+ print("💡 Please provide an unprotected version of this PDF.")
392
  else:
393
+ print(f"Error extracting text from {os.path.basename(pdf_file_path)}: {error_msg}")
394
  return []
395
 
396
+ def process_documents(self, pdf_file_paths) -> List[Document]:
397
+ """Process PDF file paths and convert to documents with automatic metadata extraction."""
398
  documents = []
399
  processed_count = 0
400
  failed_count = 0
401
 
402
+ print(f"📄 Processing {len(pdf_file_paths)} document(s) with automatic metadata detection...") # Changed to print
403
 
404
+ for pdf_file_path in pdf_file_paths:
405
+ if pdf_file_path.endswith('.pdf'):
406
+ filename = os.path.basename(pdf_file_path)
407
+ print(f"🔍 Extracting text from: **{filename}**") # Changed to print
408
 
409
  # Extract text per page
410
+ text_per_page = self.extract_text_from_pdf(pdf_file_path)
411
+ print(f"DEBUG: Extracted {len(text_per_page)} pages from {filename}")
412
 
413
  if text_per_page:
414
  # Combine first two pages for metadata extraction
415
  text_for_metadata = "\n".join(text_per_page[:2])
416
+ print(f"DEBUG: Text for metadata extraction (first 500 chars): {text_for_metadata[:500]}")
417
  # Extract metadata using LLM
418
+ print(f"🤖 Detecting metadata for: **{filename}**") # Changed to print
419
  extracted_metadata = self.sea_lion_llm.extract_metadata(text_for_metadata)
420
 
 
 
 
 
 
421
  # Create metadata
422
  metadata = {
423
+ "source": filename,
424
  "university": extracted_metadata.get("university_name", "Unknown"),
425
  "country": extracted_metadata.get("country", "Unknown"),
426
  "document_type": extracted_metadata.get("document_type", "general_info"),
 
436
  )
437
  documents.append(doc)
438
  processed_count += 1
439
+ print(f"✅ Successfully processed: **{filename}** ({len(doc.page_content)} characters)") # Changed to print
440
  else:
441
  failed_count += 1
442
+ print(f"⚠️ Could not extract text from **{filename}**") # Changed to print
443
  else:
444
  failed_count += 1
445
+ filename = os.path.basename(pdf_file_path)
446
+ print(f"❌ Unsupported file type for {filename} (expected .pdf)") # Changed to print
447
 
448
  # Summary
449
  if processed_count > 0:
450
+ print(f"🎉 Successfully processed **{processed_count}** document(s)") # Changed to print
451
  if failed_count > 0:
452
+ print(f"⚠️ Failed to process **{failed_count}** document(s)") # Changed to print
453
 
454
  return documents
455
 
456
  def create_vector_store(self, documents: List[Document]) -> Chroma:
457
  """Create and persist vector store from documents."""
458
  if not documents:
459
+ print("No documents to process") # Changed to print
460
  return None
461
 
462
  # Split documents into chunks
 
480
  )
481
  return vectorstore
482
  except Exception as e:
483
+ print(f"Could not load existing vector store: {str(e)}") # Changed to print
484
  return None
485
 
486
  class RAGSystem:
 
507
  )
508
  return vectorstore
509
  except Exception as e:
510
+ print(f"Error loading vector store: {str(e)}")
511
  return None
512
 
513
  def query(self, question: str, language: str = "English") -> Dict[str, Any]:
 
559
  }
560
 
561
  except Exception as e:
562
+ print(f"Error querying system: {str(e)}")
563
  return {
564
  "answer": f"Error processing your question: {str(e)}",
565
  "source_documents": [],
 
597
  json.dump(save_data, f, indent=2, ensure_ascii=False)
598
  return True
599
  except Exception as e:
600
+ print(f"Error saving query result: {str(e)}")
601
  return False
602
  return False
603
 
 
610
  with open(result_file, 'r', encoding='utf-8') as f:
611
  return json.load(f)
612
  except Exception as e:
613
+ print(f"Error loading shared query: {str(e)}")
614
 
615
  return None
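The reworked `AlternativeEmbeddings` class now tries progressively smaller sentence-transformers models and raises instead of silently returning empty vectors. A quick way to check which model was actually picked up, assuming sentence-transformers is installed:

```python
from utils.rag_system import AlternativeEmbeddings

emb = AlternativeEmbeddings()  # prints which model loaded, e.g. all-MiniLM-L6-v2
vec = emb.embed_query("tuition fees for MBA programs in Thailand")
print(len(vec))                # 384-dimensional embedding for all listed models
```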
utils/translations.py CHANGED
@@ -110,6 +110,40 @@ translations = {
110
  "example_simple_2": "What is the difference between bachelor and master degree?",
111
  "example_simple_3": "How to apply for student visa?",
112
  "example_simple_4": "What documents are needed for university application?",
  },
114
 
115
  "中文": {
@@ -223,6 +257,40 @@ translations = {
223
  "example_simple_2": "学士学位和硕士学位有什么区别?",
224
  "example_simple_3": "如何申请学生签证?",
225
  "example_simple_4": "大学申请需要哪些文件?",
  },
227
 
228
  "Malay": {
 
110
  "example_simple_2": "What is the difference between bachelor and master degree?",
111
  "example_simple_3": "How to apply for student visa?",
112
  "example_simple_4": "What documents are needed for university application?",
113
+
114
+ # System messages
115
+ "systems_initialized": "✅ Systems initialized successfully!",
116
+ "can_upload_documents": "You can now upload documents.",
117
+ "initialization_error": "Error initializing systems",
118
+ "installation_help": """**Possible solutions:**
119
+ 1. Install sentence-transformers: `pip install sentence-transformers`
120
+ 2. Or provide OpenAI API key in environment variables
121
+ 3. Check that PyTorch is properly installed
122
+
123
+ **For deployment:**
124
+ - Ensure requirements.txt includes: sentence-transformers, torch, transformers""",
125
+ "please_initialize_first": "Please initialize systems first using the 'Initialize System' tab!",
126
+ "please_upload_pdf": "Please upload at least one PDF file.",
127
+ "upload_pdf_only": "Please upload PDF files only.",
128
+ "successfully_processed_docs": "Successfully processed",
129
+ "failed_create_vectorstore": "Failed to create vector store from documents.",
130
+ "no_docs_successfully_processed": "No documents were successfully processed. Please check if your PDFs are readable.",
131
+ "error_processing_docs": "Error processing documents",
132
+ "check_console": "Please check the console for more details.",
133
+ "please_upload_process_first": "Please upload and process documents first using the 'Upload Documents' tab!",
134
+ "please_enter_question": "Please enter a question.",
135
+ "processing_query": "Processing query",
136
+ "model_used": "Model Used",
137
+ "answer": "Answer",
138
+ "sources": "Sources",
139
+ "no_sources_found": "No specific sources found. This might be a general response.",
140
+ "error_querying_docs": "Error querying documents",
141
+ "ready_for_queries": "Ready for queries! Go to the 'Search & Query' tab to start asking questions.",
142
+
143
+ # Interface elements
144
+ "initialize_system": "Initialize System",
145
+ "initialize_systems": "Initialize Systems",
146
+ "initialization_status": "Initialization Status",
147
  },
148
 
149
  "中文": {
 
257
  "example_simple_2": "学士学位和硕士学位有什么区别?",
258
  "example_simple_3": "如何申请学生签证?",
259
  "example_simple_4": "大学申请需要哪些文件?",
260
+
261
+ # System messages
262
+ "systems_initialized": "✅ 系统初始化成功!",
263
+ "can_upload_documents": "您现在可以上传文档。",
264
+ "initialization_error": "系统初始化错误",
265
+ "installation_help": """**可能的解决方案:**
266
+ 1. 安装 sentence-transformers: `pip install sentence-transformers`
267
+ 2. 或在环境变量中提供 OpenAI API 密钥
268
+ 3. 检查 PyTorch 是否正确安装
269
+
270
+ **部署时:**
271
+ - 确保 requirements.txt 包含:sentence-transformers, torch, transformers""",
272
+ "please_initialize_first": "请先使用'初始化系统'选项卡初始化系统!",
273
+ "please_upload_pdf": "请至少上传一个PDF文件。",
274
+ "upload_pdf_only": "请仅上传PDF文件。",
275
+ "successfully_processed_docs": "成功处理",
276
+ "failed_create_vectorstore": "创建向量存储失败。",
277
+ "no_docs_successfully_processed": "没有成功处理任何文档。请检查您的PDF是否可读。",
278
+ "error_processing_docs": "处理文档时出错",
279
+ "check_console": "请查看控制台获取更多详细信息。",
280
+ "please_upload_process_first": "请先使用'上传文档'选项卡上传和处理文档!",
281
+ "please_enter_question": "请输入问题。",
282
+ "processing_query": "正在处理查询",
283
+ "model_used": "使用的模型",
284
+ "answer": "答案",
285
+ "sources": "来源",
286
+ "no_sources_found": "未找到特定来源。这可能是一般性回答。",
287
+ "error_querying_docs": "查询文档时出错",
288
+ "ready_for_queries": "准备查询!前往'搜索与查询'选项卡开始提问。",
289
+
290
+ # Interface elements
291
+ "initialize_system": "初始化系统",
292
+ "initialize_systems": "初始化系统",
293
+ "initialization_status": "初始化状态",
294
  },
295
 
296
  "Malay": {