Ervinoreo committed
Commit · 846f122
1 Parent(s): ecf227f
gradio
Browse files
- .gitignore +3 -0
- README.md +0 -274
- README_GRADIO.md +204 -0
- app.py +0 -123
- app_gradio.py +137 -0
- app_gradio_modular.py +137 -0
- installed_packages.txt +0 -178
- my_pages/about.py +0 -37
- my_pages/manage_documents.py +0 -73
- my_pages/search_uni.py +0 -104
- my_pages/upload_documents.py +0 -202
- requirements.txt +15 -3
- runtime.txt +0 -1
- start.sh +0 -43
- tabs/help.py +168 -0
- tabs/initialize.py +55 -0
- tabs/manage.py +237 -0
- tabs/query.py +139 -0
- tabs/upload.py +99 -0
- utils/rag_system.py +110 -83
- utils/translations.py +68 -0
.gitignore CHANGED

@@ -22,6 +22,8 @@ share/python-wheels/
 .installed.cfg
 *.egg
 MANIFEST
+tabs/__pycache__/
+.gradio

 # Virtual Environment
 .venv/
@@ -31,6 +33,7 @@ ENV/
 env/
 .venv
 myenv/
+gradio/

 # Environment Variables
 .env
README.md DELETED

@@ -1,274 +0,0 @@

# PanSea University Search

An AI-powered RAG (Retrieval-Augmented Generation) system for searching ASEAN university admission requirements, designed to help prospective students find accurate and up-to-date information about study opportunities across Southeast Asia.

## 🎯 Problem & Solution

**Problem:** Prospective students worldwide seeking to study abroad face difficulty finding accurate, up-to-date university admission requirements. Information is scattered across PDFs, brochures, and outdated agency websites. Many waste time applying to unsuitable programs due to missing criteria and pay high agent fees.

**Solution:** An LLM-powered, RAG-based platform powered by **SEA-LION multilingual models** that ingests official admissions documents from ASEAN universities. Students can query in any ASEAN language and receive ranked program matches with fees, entry requirements, deadlines, application windows, and source citations.

## 🌟 Features

- 📄 **PDF Document Ingestion**: Upload official university admission documents
- 🔍 **Intelligent Search**: Natural language queries in multiple ASEAN languages
- 🎯 **Accurate Responses**: AI-powered answers with source citations
- 🔗 **Shareable Results**: Generate links to share query results
- 🌏 **Multi-language Support**: English, Chinese, Malay, Thai, Indonesian, Vietnamese, Filipino
- 💰 **Advanced Filtering**: Budget range, study level, country preferences

## 🚀 Quick Start

### Prerequisites

- Python 3.11+
- SEA-LION API Key
- OpenAI API Key (optional, for fallback embeddings)

### Installation

1. **Clone and navigate to the project:**

```bash
cd pansea
```

2. **Activate the virtual environment:**

```bash
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
```

3. **Install dependencies:**

```bash
pip install -r requirements.txt
```

4. **Set up environment variables:**

```bash
cp .env.example .env
# Edit .env and add your SEA-LION API key (OpenAI key optional for fallback)
```

5. **Run the application:**

```bash
streamlit run app.py
```

6. **Open your browser to:** `http://localhost:8501`

### Usage

#### 1. Upload Documents

- Go to the "Upload Documents" page
- Enter the university name and country
- Select the document type (admission requirements, tuition fees, etc.)
- Upload PDF files containing university information
- Click "Process Documents"

#### 2. Search Universities

- Go to the "Search Universities" page
- Choose your response language
- Enter questions like:
  - "Show me universities in Malaysia for master's degrees with tuition under 40,000 RMB per year"
  - "专科毕业,无雅思,想在马来西亚读硕士,学费不超过 4 万人民币/年"
  - "What are the English proficiency requirements for Singapore universities?"
- Apply optional filters (budget, study level, countries)
- Get AI-powered responses with source citations

#### 3. Share Results

- Each query generates a unique shareable link
- Share results with friends, family, or education consultants
- Access shared results without needing to upload documents again

## 📁 Project Structure

```
pansea/
├── app.py              # Main Streamlit application
├── rag_system.py       # RAG system implementation
├── requirements.txt    # Python dependencies
├── .env                # Environment variables
├── .venv/              # Virtual environment
├── chroma_db/          # Vector database storage
├── documents/          # Uploaded documents storage
├── query_results/      # Shared query results
└── README.md           # This file
```

## 🛠️ Core Components

### DocumentIngestion Class

- Handles PDF text extraction using PyPDF2
- Creates document chunks with metadata
- Builds and persists the ChromaDB vector store
- Manages document preprocessing and storage
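
The ingestion code itself lives in `utils/rag_system.py`, which this view only shows in summary. As a rough sketch of the extract-then-chunk flow described above (illustrative only; the function name and parameters are assumptions, not the project's actual code):

```python
# Illustrative sketch only, not the project's actual DocumentIngestion implementation.
from PyPDF2 import PdfReader
from langchain_text_splitters import RecursiveCharacterTextSplitter

def extract_and_chunk(pdf_path: str, chunk_size: int = 1000, chunk_overlap: int = 200):
    """Extract text from a PDF and split it into overlapping chunks with source metadata."""
    reader = PdfReader(pdf_path)
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    # Each chunk keeps a pointer to its source file so answers can cite it later.
    return [{"text": chunk, "metadata": {"source": pdf_path}} for chunk in splitter.split_text(text)]
```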

### RAGSystem Class

- Implements retrieval-augmented generation
- Uses BGE-small-en-v1.5 embeddings for semantic search (with OpenAI fallback)
- Leverages SEA-LION models for response generation:
  - **SEA-LION v3.5 Reasoning Model** for complex university queries
  - **SEA-LION v3 Instruct Model** for translation and simple questions
- Provides multilingual query support with automatic model selection
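
The routing between the two SEA-LION models is implemented in `utils/rag_system.py` and not shown here; as a hedged sketch of how a query might reach them through the OpenAI-compatible endpoint (the client choice, model IDs, and complexity heuristic are assumptions for illustration):

```python
# Illustrative sketch only; model IDs and the routing heuristic are assumptions.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["SEA_LION_API_KEY"],
    base_url=os.getenv("SEA_LION_BASE_URL", "https://api.sea-lion.ai/v1"),
)

def answer(question: str, context: str) -> str:
    # Long or comparative questions go to the reasoning model; short lookups
    # and translations go to the instruct model.
    is_complex = len(question.split()) > 15 or "compare" in question.lower()
    model = "sea-lion-reasoning" if is_complex else "sea-lion-instruct"
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```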

### Streamlit UI

- Clean, intuitive interface
- Multi-page navigation
- File upload with progress tracking
- Advanced search filters
- Shareable query results

## 🌏 Supported Languages

The system supports queries and responses in:

- **English** - Primary language
- **中文 (Chinese)** - For Chinese-speaking students
- **Bahasa Malaysia** - For Malaysian context
- **ไทย (Thai)** - For Thai students
- **Bahasa Indonesia** - For Indonesian students
- **Tiếng Việt (Vietnamese)** - For Vietnamese students
- **Filipino** - For Philippines context

## 🎯 Target ASEAN Countries

- 🇸🇬 Singapore
- 🇲🇾 Malaysia
- 🇹🇭 Thailand
- 🇮🇩 Indonesia
- 🇵🇭 Philippines
- 🇻🇳 Vietnam
- 🇧🇳 Brunei
- 🇰🇭 Cambodia
- 🇱🇦 Laos
- 🇲🇲 Myanmar

## 🔧 Configuration

### Environment Variables (.env)

```bash
# SEA-LION API Configuration
SEA_LION_API_KEY=your_sea_lion_api_key_here
SEA_LION_BASE_URL=https://api.sea-lion.ai/v1

# OpenAI API Configuration (for embeddings)
OPENAI_API_KEY=your_openai_api_key_here

# Application Configuration
APP_NAME=Top.Edu University Search
APP_VERSION=1.0.0
CHROMA_PERSIST_DIRECTORY=./chroma_db
UPLOAD_FOLDER=./documents
MAX_FILE_SIZE_MB=50
```
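
A minimal sketch of how these values are typically loaded at startup with `python-dotenv` (the keys mirror the block above; the loading code itself is illustrative rather than taken from this repository):

```python
# Illustrative sketch; variable names mirror the .env keys listed above.
import os
from dotenv import load_dotenv

load_dotenv()  # read .env from the project root

SEA_LION_API_KEY = os.environ["SEA_LION_API_KEY"]  # required
SEA_LION_BASE_URL = os.getenv("SEA_LION_BASE_URL", "https://api.sea-lion.ai/v1")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")  # optional embedding fallback
CHROMA_PERSIST_DIRECTORY = os.getenv("CHROMA_PERSIST_DIRECTORY", "./chroma_db")
MAX_FILE_SIZE_MB = int(os.getenv("MAX_FILE_SIZE_MB", "50"))
```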

### Customization Options

- **Chunk Size**: Adjust text splitting in `rag_system.py`
- **Retrieval Count**: Modify the number of retrieved documents (default: 5)
- **Model Selection**: Configure the SEA-LION model selection logic
- **UI Themes**: Modify the CSS in `app.py`
- **Query Classification**: Adjust complex vs. simple query detection
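
For the first two options, a hedged sketch of the kind of knobs involved, assuming a LangChain-style splitter and Chroma retriever as described in this README (the values shown are illustrative defaults, not the project's confirmed settings):

```python
# Illustrative sketch of where chunk size and retrieval count would be tuned.
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Chunk Size: smaller chunks give more precise citations, larger chunks keep more context.
splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=150)

# Retrieval Count: how many chunks the RAG step pulls from ChromaDB per query.
# With a LangChain Chroma vectorstore this is usually the `k` search kwarg, e.g.:
# retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
```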

## 📊 Example Queries

Try these sample queries to test the system and see different model usage:

### Complex Queries (Uses SEA-LION Reasoning Model)

1. **Multi-criteria Search**: "Show me universities in Thailand and Malaysia for engineering master's programs with tuition under $15,000 per year"

2. **Chinese Query**: "专科毕业,无雅思,想在马来西亚读硕士,学费不超过 4 万人民币/年"

3. **Comparative Analysis**: "Compare MBA programs in Singapore and Indonesia with GMAT requirements and scholarship opportunities"

### Simple Queries (Uses SEA-LION Instruct Model)

4. **Translation**: "How do you say 'application deadline' in Thai and Indonesian?"

5. **Definition**: "What is the difference between IELTS and TOEFL?"

6. **Basic Information**: "What does GPA stand for and how is it calculated?"

## 🔍 Technical Stack

- **Backend**: Python 3.11, LangChain
- **LLM Models**:
  - SEA-LION v3.5 8B Reasoning (complex queries)
  - SEA-LION v3 9B Instruct (simple queries & translation)
- **Embeddings**: BGE-small-en-v1.5 (with OpenAI ada-002 fallback)
- **Vector Database**: ChromaDB with persistence
- **Frontend**: Streamlit with custom CSS
- **Document Processing**: PyPDF2, PyCryptodome (for encrypted PDFs), RecursiveCharacterTextSplitter

## 📈 Roadmap

- [ ] Support for additional document formats (Word, Excel)
- [x] Integration with SEA-LION multilingual models
- [ ] Real-time web scraping of university websites
- [ ] Mobile-responsive design
- [ ] User authentication and query history
- [ ] Advanced analytics and insights
- [ ] Integration with university application systems
- [ ] Fine-tuning SEA-LION models on university-specific data

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

## 📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

## 💡 Tips for Best Results

1. **Upload Quality Documents**: Use official admission guides and requirements documents
2. **Be Specific**: Include specific criteria in your queries (budget, location, program type)
3. **Use Natural Language**: Ask questions as you would to a human counselor
4. **Try Multiple Languages**: The system works well with mixed-language queries
5. **Check Sources**: Always review the source documents cited in responses

## 🆘 Troubleshooting

### Common Issues

**"No documents found"**: Upload PDF documents first on the Upload Documents page

**"API Key not found"**: Add your SEA-LION API key to the .env file

**"No embeddings available"**: BGE embeddings are used by default. If issues occur, add your OpenAI API key for fallback embeddings

**"Import errors"**: Install dependencies using `pip install -r requirements.txt`

**"ChromaDB errors"**: Delete the `chroma_db` folder and restart the application

**"PyCryptodome is required for AES algorithm"**: This error occurs with encrypted PDFs. PyCryptodome is now included in requirements.txt

**"Could not extract text from PDF"**: This can happen with:

- Password-protected PDFs (provide unprotected versions)
- Scanned PDFs or image-based documents (consider OCR tools)
- Heavily encrypted or corrupted PDF files
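
A small, hedged sketch of a pre-flight check that catches these cases before ingestion (standard PyPDF2 usage; the helper itself is illustrative, not part of this commit):

```python
# Illustrative sketch: quick sanity check before handing a PDF to ingestion.
from PyPDF2 import PdfReader

def pdf_looks_extractable(path: str) -> bool:
    """Return True if the PDF is not encrypted and yields some extractable text."""
    reader = PdfReader(path)
    if reader.is_encrypted:
        return False  # password-protected; ask for an unprotected copy
    sample_pages = list(reader.pages)[:3]
    sample = "".join((page.extract_text() or "") for page in sample_pages)
    return bool(sample.strip())  # empty text usually means a scanned, image-only PDF
```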

## 📞 Support

For support, please create an issue on GitHub or contact the development team.

---

**Made with ❤️ for students seeking education opportunities in ASEAN** 🎓
README_GRADIO.md ADDED

@@ -0,0 +1,204 @@

# 🌏 ASEAN University Search - Gradio Version

An AI-powered university document search and Q&A system built with Gradio, specifically designed for ASEAN universities. This version uses **SEA-LION AI models** for intelligent responses and supports multiple Southeast Asian languages.

## ✨ Features

- 🤖 **AI-Powered Search**: Uses SEA-LION models for intelligent document analysis
- 🌍 **Multi-Language Support**: English, Chinese, Malay, Thai, Indonesian, Vietnamese, Filipino
- 📚 **Automatic Metadata Extraction**: Detects university names, countries, and document types
- 🔍 **Semantic Document Chunking**: Intelligent text splitting for better retrieval
- 📱 **Shareable Links**: Built-in Gradio sharing for easy deployment
- 🎯 **Source Citations**: Always shows which documents were used for answers

## 🚀 Quick Start

### Option 1: Using the Startup Script (Recommended)

```bash
./start_gradio.sh
```

### Option 2: Manual Setup

```bash
# Create virtual environment
python3 -m venv venv
source venv/bin/activate

# Install requirements
pip install -r requirements_gradio.txt

# Run the application
python app_gradio.py
```

## 🌐 Deployment Options

### 1. **Local with Public Link** (Immediate)

- Run the app locally
- Gradio automatically creates a public shareable link
- Perfect for testing and sharing

### 2. **HuggingFace Spaces** (Free, Recommended)

1. Go to [HuggingFace Spaces](https://huggingface.co/spaces)
2. Create a new Space with the Gradio SDK
3. Upload your files:
   - `app_gradio.py`
   - `requirements_gradio.txt` (rename to `requirements.txt`)
   - `utils/` folder
   - `.env` file (with your API keys)
4. Deploy automatically!

### 3. **Google Colab** (Free)

```python
# Upload files to Colab
!pip install -r requirements_gradio.txt
!python app_gradio.py
```

### 4. **Railway/Render** (Paid but reliable)

- Push to GitHub
- Connect to Railway/Render
- Auto-deploy with a custom domain

## 🔧 Configuration

### Environment Variables

Create a `.env` file:

```env
# Required for SEA-LION models
SEA_LION_API_KEY=your_sea_lion_api_key_here
SEA_LION_BASE_URL=https://api.sea-lion.ai/v1

# Optional: For OpenAI embeddings fallback
OPENAI_API_KEY=your_openai_api_key_here

# Optional: Custom vector database location
CHROMA_PERSIST_DIRECTORY=./chroma_db
```

### Model Configuration

The system automatically chooses the appropriate model:

- **Simple queries**: SEA-LION Instruct (faster)
- **Complex analysis**: SEA-LION Reasoning (more thorough)
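
The README does not spell out the selection heuristic; a minimal, hedged sketch of one common way to implement such routing (the keyword list and model names are placeholders, not values confirmed by this commit):

```python
# Illustrative sketch only; the heuristic and model names are assumptions.
COMPLEX_HINTS = ("compare", "versus", "cheapest", "under $", "requirements and")

def pick_model(question: str) -> str:
    """Route multi-criteria or comparative questions to the reasoning model."""
    q = question.lower()
    if len(q.split()) > 15 or any(hint in q for hint in COMPLEX_HINTS):
        return "sea-lion-reasoning"  # slower, more thorough
    return "sea-lion-instruct"  # faster, fine for lookups and translation
```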

## 📋 How to Use

1. **Initialize System** 🚀

   - Click "Initialize Systems"
   - Wait for models to download (first time only)

2. **Upload Documents** 📄

   - Upload PDF university documents
   - The system automatically extracts metadata
   - Supports multiple documents at once

3. **Ask Questions** 🔍

   - Type questions in natural language
   - Choose a response language
   - Get AI answers with source citations

## 🎯 Example Questions

- "What are the admission requirements for Computer Science in Singapore?"
- "Which universities offer scholarships under $5000?"
- "Compare MBA programs in Thailand and Malaysia"
- "找到学费低于 5000 美元的工程专业" (Chinese)
- "Cari universitas dengan beasiswa di Indonesia" (Indonesian)

## 🛠️ Troubleshooting

### Common Issues

**"No embedding model available"**

```bash
# Install sentence transformers
pip install sentence-transformers torch

# Or set OpenAI API key
export OPENAI_API_KEY=your_key_here
```

**"Cannot load model"**

- Ensure an internet connection for the model download
- Try a smaller model: set `EMBEDDING_MODEL=all-MiniLM-L6-v2`

**PDF extraction fails**

- Ensure PDFs are text-based (not scanned images)
- Check if the PDF is password-protected

## 🔄 Differences from Streamlit Version

| Feature           | Streamlit                | Gradio                   |
| ----------------- | ------------------------ | ------------------------ |
| **Deployment**    | Complex, SQLite issues   | Simple, multiple options |
| **Sharing**       | Limited                  | Built-in public links    |
| **UI**            | More customizable        | Clean, mobile-friendly   |
| **Dependencies**  | Heavy, version conflicts | Lighter, more stable     |
| **Cloud Hosting** | Streamlit Cloud only     | HF Spaces, Colab, etc.   |

## 📁 Project Structure

```
📦 ASEAN University Search (Gradio)
├── 🚀 app_gradio.py             # Main Gradio application
├── 📋 requirements_gradio.txt   # Gradio-specific dependencies
├── ⚡ start_gradio.sh            # Quick startup script
├── 🔧 utils/
│   ├── rag_system.py            # Core RAG logic (Streamlit-free)
│   ├── display.py               # Display utilities
│   └── translations.py          # Language translations
├── 📁 documents/                # Document storage
├── 🗄️ chroma_db/                # Vector database
├── 📊 query_results/            # Saved query results
└── 🔐 .env                      # Environment variables
```

## 🌟 Benefits of Gradio Version

1. **🚀 Faster Deployment**: No SQLite version conflicts
2. **🌐 Built-in Sharing**: Automatic public links
3. **📱 Mobile-Friendly**: Responsive design
4. **🔧 Fewer Dependencies**: More stable installation
5. **🎯 Multiple Hosting Options**: HF Spaces, Colab, Railway, etc.
6. **🛠️ Better Error Handling**: Clearer error messages
7. **⚡ Faster Loading**: Optimized model initialization

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch: `git checkout -b feature-name`
3. Make your changes
4. Commit: `git commit -m "Add feature"`
5. Push: `git push origin feature-name`
6. Create a Pull Request

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🙏 Acknowledgments

- **SEA-LION AI**: For the amazing Southeast Asia-focused language models
- **Gradio**: For the excellent web interface framework
- **LangChain**: For the robust RAG pipeline
- **ChromaDB**: For efficient vector storage
- **Sentence Transformers**: For semantic embeddings

---

**Built with ❤️ for the ASEAN education community**
app.py DELETED

@@ -1,123 +0,0 @@

import streamlit as st
import os
from urllib.parse import urlparse, parse_qs
from utils.rag_system import DocumentIngestion, RAGSystem, save_query_result, load_shared_query
from datetime import datetime
import uuid
from utils.translations import translations, get_text, get_language_code
from pathlib import Path
from my_pages.search_uni import search_page
from my_pages.upload_documents import upload_documents_page
from my_pages.manage_documents import manage_documents_page
from my_pages.about import about_page
from utils.display import display_shared_query

# Load external CSS
def load_css(file_name):
    css_file = Path(file_name)
    if css_file.exists():
        with open(css_file) as f:
            st.markdown(f"<style>{f.read()}</style>", unsafe_allow_html=True)

load_css("styles.css")

# Configure Streamlit page
st.set_page_config(
    page_title="PanSea University Search",
    page_icon="🎓",
    layout="wide",
    initial_sidebar_state="expanded"
)

def main():
    # Initialize language in session state if not present
    if 'app_language' not in st.session_state:
        st.session_state.app_language = "English"

    # Get current language from session state
    current_lang = st.session_state.app_language

    # Check for shared query in URL
    query_params = st.query_params
    shared_query_id = query_params.get("share", [None])[0]

    if shared_query_id:
        display_shared_query(shared_query_id)
        return

    # Main header
    st.markdown(f"""
    <div class="main-header">
        <h1>{get_text("app_title", current_lang)}</h1>
        <h5>{get_text("app_subtitle", current_lang)}</h5>
    </div>
    """, unsafe_allow_html=True)

    # Sidebar
    with st.sidebar:
        # Global language selector
        selected_language = st.selectbox(
            "🌐 Language / 语言 / Bahasa",
            ["English", "中文 (Chinese)", "Bahasa Malaysia", "ไทย (Thai)",
             "Bahasa Indonesia", "Tiếng Việt (Vietnamese)"],
            index=["English", "中文 (Chinese)", "Bahasa Malaysia", "ไทย (Thai)",
                   "Bahasa Indonesia", "Tiếng Việt (Vietnamese)"].index(
                next((lang for lang in ["English", "中文 (Chinese)", "Bahasa Malaysia", "ไทย (Thai)",
                                        "Bahasa Indonesia", "Tiếng Việt (Vietnamese)"]
                      if get_language_code(lang) == current_lang), "English")),
            key="global_language_selector"
        )

        # Update session state when language changes
        new_lang = get_language_code(selected_language)
        if new_lang != current_lang:
            st.session_state.app_language = new_lang
            st.rerun()

        # Update current_lang after potential change
        current_lang = st.session_state.app_language

        st.divider()

        # Navigation header
        st.markdown(f"## {get_text('navigation', current_lang)}")

        # Define the pages
        page_keys = ["search_universities", "upload_documents", "manage_documents", "about"]
        page_translations = {key: get_text(key, current_lang) for key in page_keys}

        # Initialize current page if needed
        if "current_page_key" not in st.session_state:
            st.session_state.current_page_key = page_keys[0]

        # Sidebar buttons
        for key in page_keys:
            if st.button(page_translations[key], use_container_width=True):
                st.session_state.current_page_key = key

    # Main content
    if st.session_state.current_page_key == "upload_documents":
        upload_documents_page()
    elif st.session_state.current_page_key == "manage_documents":
        manage_documents_page()
    elif st.session_state.current_page_key == "about":
        about_page()
    else:
        search_page()


if __name__ == "__main__":
    # Check if SEA-LION API key is set
    if not os.getenv("SEA_LION_API_KEY"):
        st.error("🚨 SEA-LION API Key not found! Please set your SEA_LION_API_KEY in the .env file.")
        st.code("SEA_LION_API_KEY=your_api_key_here")
        st.stop()

    # Check if OpenAI API key is set (needed for embeddings)
    if not os.getenv("OPENAI_API_KEY") or os.getenv("OPENAI_API_KEY") == "your_openai_api_key_here":
        st.warning("⚠️ OpenAI API Key not configured properly. You'll need it for document embeddings.")
        st.info("The system will use SEA-LION models for text generation, but OpenAI for document embeddings.")

    main()
app_gradio.py ADDED

@@ -0,0 +1,137 @@

"""
PANSEA University Requirements Assistant - Gradio Version (Modular)
A comprehensive tool for navigating university admission requirements across Southeast Asia.
"""
import gradio as gr
import os
import sys
from datetime import datetime

# Add the current directory to Python path for imports
sys.path.append(os.path.dirname(os.path.abspath(__file__)))

# Import our RAG system
from utils.rag_system import DocumentIngestion, RAGSystem

# Import modular tab components
from tabs.initialize import create_initialize_tab
from tabs.upload import create_upload_tab
from tabs.query import create_query_tab
from tabs.manage import create_manage_tab
from tabs.help import create_help_tab

def create_interface():
    """Create the main Gradio interface using modular components"""

    # Global state management - shared across all tabs
    global_vars = {
        'doc_ingestion': None,
        'rag_system': None,
        'vectorstore': None
    }

    # Custom CSS for better styling
    custom_css = """
    .gradio-container {
        font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
    }
    .tab-nav button {
        font-weight: 500;
        font-size: 14px;
    }
    .tab-nav button[aria-selected="true"] {
        background: linear-gradient(45deg, #1e3a8a, #3b82f6);
        color: white;
    }
    .feedback-box {
        background: #f8fafc;
        border: 1px solid #e2e8f0;
        border-radius: 8px;
        padding: 16px;
        margin: 8px 0;
    }
    .success-message {
        background: #dcfce7;
        color: #166534;
        border: 1px solid #bbf7d0;
        padding: 12px;
        border-radius: 6px;
        margin: 8px 0;
    }
    .error-message {
        background: #fef2f2;
        color: #dc2626;
        border: 1px solid #fecaca;
        padding: 12px;
        border-radius: 6px;
        margin: 8px 0;
    }
    """

    # Create the main interface
    with gr.Blocks(
        title="🌏 PANSEA University Assistant",
        theme=gr.themes.Soft(
            primary_hue="blue",
            secondary_hue="slate"
        ),
        css=custom_css,
        analytics_enabled=False
    ) as interface:

        # Header
        gr.Markdown("""
        # 🌏 TopEdu

        **Navigate University Admission Requirements Across Southeast Asia with AI-Powered Assistance**

        Upload university documents, ask questions, and get intelligent answers about admission requirements,
        programs, deadlines, and more across Southeast Asian universities.

        ---
        """)

        # Main tabs using modular components
        with gr.Tabs():
            create_initialize_tab(global_vars)
            create_upload_tab(global_vars)
            create_query_tab(global_vars)
            create_manage_tab(global_vars)
            create_help_tab(global_vars)

        # Footer
        gr.Markdown(f"""
        ---

        **🔧 System Status**: Ready | **📅 Session**: {datetime.now().strftime('%Y-%m-%d %H:%M')} | **🔄 Version**: Modular Gradio

        💡 **Tip**: Start by initializing the system, then upload your university documents, and begin querying!
        """)

    return interface

def main():
    """Launch the application"""
    interface = create_interface()

    # Launch configuration
    interface.launch(
        share=False,            # Set to True for public sharing
        server_name="0.0.0.0",  # Allow external connections
        server_port=7860,       # Default Gradio port
        show_api=False,         # Hide API documentation
        show_error=True,        # Show detailed error messages
        quiet=False,            # Show startup messages
        favicon_path=None,      # Could add custom favicon
        app_kwargs={
            "docs_url": None,   # Disable FastAPI docs
            "redoc_url": None   # Disable ReDoc docs
        }
    )

if __name__ == "__main__":
    print("🚀 Starting PANSEA University Requirements Assistant...")
    print("📍 Access the application at: http://localhost:7860")
    print("🔗 For public sharing, set share=True in the launch() method")
    print("-" * 60)
    main()
app_gradio_modular.py ADDED

@@ -0,0 +1,137 @@

(The 137 lines added here are identical, line for line, to app_gradio.py above: the same module docstring, the create_interface() function with the custom CSS, header, modular tab wiring, and footer, and the same main() launcher and __main__ block.)
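
Both Gradio entry points assemble their UI from the `tabs/` modules added in this commit (initialize, upload, query, manage, help), whose bodies are not shown in this view. Purely as a hedged illustration of the `create_*_tab(global_vars)` call pattern used above, and not the actual contents of `tabs/initialize.py`, such a tab factory might look like:

```python
# Hypothetical sketch of a tab factory; the real tabs/initialize.py is not shown in this diff.
import gradio as gr
from utils.rag_system import DocumentIngestion, RAGSystem

def create_initialize_tab(global_vars: dict) -> None:
    """Add an 'Initialize' tab that loads the shared systems into global_vars."""
    with gr.Tab("🚀 Initialize"):
        status = gr.Markdown("System not initialized yet.")
        init_btn = gr.Button("Initialize Systems")

        def initialize() -> str:
            # Store shared objects so the other tabs can reuse them.
            global_vars["doc_ingestion"] = DocumentIngestion()
            global_vars["rag_system"] = RAGSystem()
            global_vars["vectorstore"] = global_vars["doc_ingestion"].load_existing_vectorstore()
            return "✅ Systems initialized."

        init_btn.click(fn=initialize, outputs=status)
```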
installed_packages.txt DELETED

@@ -1,178 +0,0 @@

aiohappyeyeballs==2.6.1
aiohttp==3.12.15
aiosignal==1.4.0
altair==5.5.0
altex==0.2.0
annotated-types==0.7.0
anyio==4.10.0
asgiref==3.9.1
async-timeout==4.0.3
attrs==25.3.0
backoff==2.2.1
bcrypt==4.3.0
beautifulsoup4==4.13.4
blinker==1.9.0
build==1.3.0
cachetools==5.5.2
certifi==2025.8.3
charset-normalizer==3.4.3
chroma-hnswlib==0.7.3
chromadb==1.0.16
click==8.2.1
coloredlogs==15.0.1
contourpy==1.3.2
cycler==0.12.1
dataclasses-json==0.6.7
Deprecated==1.2.18
distro==1.9.0
durationpy==0.10
entrypoints==0.4
exceptiongroup==1.3.0
faiss-cpu==1.7.4
Faker==37.5.3
fastapi==0.116.1
favicon==0.7.0
filelock==3.18.0
flatbuffers==25.2.10
fonttools==4.59.0
frozenlist==1.7.0
fsspec==2025.7.0
gitdb==4.0.12
GitPython==3.1.45
google-auth==2.40.3
googleapis-common-protos==1.70.0
grpcio==1.74.0
h11==0.16.0
hf-xet==1.1.7
htbuilder==0.9.0
httpcore==1.0.9
httptools==0.6.4
httpx==0.28.1
huggingface-hub==0.34.4
humanfriendly==10.0
idna==3.10
importlib-metadata==6.11.0
importlib_resources==6.5.2
Jinja2==3.1.6
jiter==0.10.0
joblib==1.5.1
jsonpatch==1.33
jsonpointer==3.0.0
jsonschema==4.25.0
jsonschema-specifications==2025.4.1
kiwisolver==1.4.9
kubernetes==33.1.0
langchain-text-splitters==0.3.9
lxml==6.0.0
Markdown==3.8.2
markdown-it-py==4.0.0
markdownlit==0.0.7
MarkupSafe==3.0.2
marshmallow==3.26.1
matplotlib==3.10.5
mdurl==0.1.2
mmh3==5.2.0
mpmath==1.3.0
multidict==6.6.4
mypy_extensions==1.1.0
narwhals==2.1.0
networkx==3.4.2
numpy==1.26.4
oauthlib==3.3.1
onnxruntime==1.22.1
opentelemetry-api==1.27.0
opentelemetry-exporter-otlp-proto-common==1.27.0
opentelemetry-exporter-otlp-proto-grpc==1.27.0
opentelemetry-instrumentation==0.48b0
opentelemetry-instrumentation-asgi==0.48b0
opentelemetry-instrumentation-fastapi==0.48b0
opentelemetry-proto==1.27.0
opentelemetry-sdk==1.27.0
opentelemetry-semantic-conventions==0.48b0
opentelemetry-util-http==0.48b0
orjson==3.11.2
overrides==7.7.0
packaging==23.2
pandas==2.3.1
pillow==10.4.0
posthog==5.4.0
propcache==0.3.2
protobuf==4.25.8
pulsar-client==3.8.0
pyarrow==21.0.0
pyasn1==0.6.1
pyasn1_modules==0.4.2
pybase64==1.4.2
pycryptodome==3.23.0
pydantic==2.11.7
pydantic_core==2.33.2
pydeck==0.9.1
Pygments==2.19.2
pymdown-extensions==10.16.1
pyparsing==3.2.3
PyPDF2==3.0.1
PyPika==0.48.9
pyproject_hooks==1.2.0
python-dateutil==2.9.0.post0
python-dotenv==1.0.0
pytz==2025.2
PyYAML==6.0.2
referencing==0.36.2
regex==2025.7.34
requests==2.32.4
requests-oauthlib==2.0.0
requests-toolbelt==1.0.0
rich==13.9.4
rpds-py==0.27.0
rsa==4.9.1
safetensors==0.6.2
scikit-learn==1.7.1
scipy==1.15.3
sentence-transformers==5.1.0
shellingham==1.5.4
six==1.17.0
smmap==5.0.2
sniffio==1.3.1
soupsieve==2.7
SQLAlchemy==2.0.43
st-annotated-text==4.0.2
starlette==0.47.2
streamlit==1.48.0
streamlit-camera-input-live==0.2.0
streamlit-card==1.0.2
streamlit-embedcode==0.1.2
streamlit-extras==0.3.5
streamlit-image-coordinates==0.1.9
streamlit-keyup==0.3.0
streamlit-toggle-switch==1.0.2
streamlit-vertical-slider==2.5.5
streamlit_faker==0.0.4
sympy==1.14.0
tenacity==8.5.0
threadpoolctl==3.6.0
tiktoken==0.11.0
tokenizers==0.21.4
toml==0.10.2
tomli==2.2.1
torch==2.8.0
tornado==6.5.2
tqdm==4.67.1
transformers==4.55.0
typer==0.16.0
typing-inspect==0.9.0
typing-inspection==0.4.1
typing_extensions==4.14.1
tzdata==2025.2
tzlocal==5.3.1
urllib3==2.5.0
uvicorn==0.35.0
uvloop==0.21.0
validators==0.35.0
watchdog==3.0.0
watchfiles==1.1.0
websocket-client==1.8.0
websockets==15.0.1
wrapt==1.17.3
yarl==1.20.1
zipp==3.23.0
zstandard==0.23.0
my_pages/about.py DELETED

@@ -1,37 +0,0 @@

import streamlit as st
from utils.translations import get_text

def about_page():
    # Get current language from session state
    lang = st.session_state.get('app_language', 'English')

    st.header(get_text("about_header", lang))

    # col1, col2 = st.columns([2, 1])

    # with col1:
    st.markdown(f"""
    ### {get_text("who_we_are", lang)}
    {get_text("who_we_are_description", lang)}

    ### {get_text("what_we_do", lang)}
    {get_text("what_we_do_description", lang)}

    ### {get_text("supported_languages", lang)}
    - English
    - 中文 (Chinese / Mandarin)
    - Bahasa Malaysia
    - ไทย (Thai)
    - Bahasa Indonesia
    - Tiếng Việt (Vietnamese)
    - Filipino
    - ភាសាខ្មែរ (Khmer)
    - ພາສາລາວ (Lao)
    - မြန်မာဘာသာ (Burmese)
    """)

    # with col2:
    #     st.markdown(f"""
    #     ### {get_text("contact", lang)}
    #     Reach out to us for support or inquiries!
    #     """)
my_pages/manage_documents.py DELETED

@@ -1,73 +0,0 @@

import streamlit as st
from utils.rag_system import DocumentIngestion
from utils.translations import get_text


def manage_documents_page():
    # Get current language from session state
    current_lang = st.session_state.get('app_language', 'English')

    st.header(get_text("manage_header", current_lang))
    st.write(get_text("manage_description", current_lang))

    from utils.rag_system import DocumentIngestion
    doc_ingestion = DocumentIngestion()
    vectorstore = doc_ingestion.load_existing_vectorstore()

    if not vectorstore:
        st.warning("No files found. Upload documents first.")
        return

    # Get all documents (chunks) in the vectorstore
    try:
        # Chroma stores documents as chunks, but we want to show original metadata
        # We'll group by file_id to show unique documents
        collection = vectorstore._collection
        all_docs = collection.get(include=["metadatas", "documents"])  # Removed 'ids'
        metadatas = all_docs["metadatas"]
        ids = all_docs["ids"]  # ids are always returned
        documents = all_docs["documents"]

        # Group by file_id
        doc_map = {}
        for meta, doc_id, doc_text in zip(metadatas, ids, documents):
            file_id = meta.get("file_id", doc_id)
            if file_id not in doc_map:
                doc_map[file_id] = {
                    "source": meta.get("source", "Unknown"),
                    "university": meta.get("university", "Unknown"),
                    "country": meta.get("country", "Unknown"),
                    "document_type": meta.get("document_type", "Unknown"),
                    "upload_timestamp": meta.get("upload_timestamp", "Unknown"),
                    "file_id": file_id,
                    "chunks": []
                }
            doc_map[file_id]["chunks"].append(doc_text)

        if not doc_map:
            st.info(get_text("no_documents", current_lang))
            return

        st.subheader(get_text("document_list", current_lang))
        for file_id, info in doc_map.items():
            with st.expander(f"{info['source']} ({info['university']}, {info['country']})"):
                st.write(f"**Type:** {info['document_type']}")
                st.write(f"**{get_text('last_updated', current_lang)}:** {info['upload_timestamp']}")
                st.write(f"**File ID:** {file_id}")
                st.write(f"**{get_text('total_chunks', current_lang)}:** {len(info['chunks'])}")
                if st.button(f"🗑️ Delete Document", key=f"del_{file_id}"):
                    # Delete all chunks with this file_id
                    ids_to_delete = [doc_id for meta, doc_id in zip(metadatas, ids) if meta.get("file_id", doc_id) == file_id]
                    vectorstore._collection.delete(ids=ids_to_delete)
                    st.success(f"Deleted document: {info['source']}")
                    st.rerun()

        # Add Delete All button
        if doc_map:
            if st.button(get_text("delete_all", current_lang), key="del_all_docs", type="secondary"):
                all_ids = list(ids)
                vectorstore._collection.delete(ids=all_ids)
                st.success(get_text("documents_deleted", current_lang))
                st.rerun()
    except Exception as e:
        st.error(f"Error loading documents: {str(e)}")
my_pages/search_uni.py DELETED

@@ -1,104 +0,0 @@

import streamlit as st
from utils.translations import get_text
from utils.rag_system import RAGSystem, save_query_result

def search_page():
    lang = st.session_state.get('app_language', 'English')

    # --- Header & description ---
    st.header(get_text("search_header", lang))
    st.write(get_text("search_description", lang))
    if lang != "English":
        st.info(f'{get_text("responses_in", lang)} **{lang}**')

    # --- Initialize query_text ---
    if "query_text" not in st.session_state:
        st.session_state.query_text = ""

    # --- Example queries ---
    complex_examples = [
        get_text("example_complex_1", lang),
        get_text("example_complex_2", lang),
        get_text("example_complex_3", lang),
        get_text("example_complex_4", lang)
    ]
    simple_examples = [
        get_text("example_simple_1", lang),
        get_text("example_simple_2", lang),
        get_text("example_simple_3", lang),
        get_text("example_simple_4", lang)
    ]

    with st.expander(get_text("example_queries", lang)):
        tab1, tab2 = st.tabs([get_text("complex_queries", lang), get_text("simple_queries", lang)])
        with tab1:
            for i, ex in enumerate(complex_examples):
                if st.button(ex, key=f"complex_{i}", use_container_width=True):
                    st.session_state.query_text = ex
        with tab2:
            for i, ex in enumerate(simple_examples):
                if st.button(ex, key=f"simple_{i}", use_container_width=True):
                    st.session_state.query_text = ex

    # --- Query input ---
    st.text_area(
        get_text("your_question", lang),
        height=120,
        placeholder=get_text("placeholder_text", lang),
        key="query_text"
    )

    # --- Optional filters (initially empty) ---
    with st.expander(get_text("advanced_filters", lang)):
        col1, col2, col3 = st.columns(3)

        budget_options = [get_text(opt, lang) for opt in ["any", "under_10k", "10k_20k", "20k_30k", "30k_40k", "over_40k"]]
        study_level_options = [get_text(lvl, lang) for lvl in ["diploma", "bachelor", "master", "phd"]]
        country_options = [get_text(c, lang) for c in ["singapore", "malaysia", "thailand", "indonesia", "philippines", "vietnam", "brunei"]]

        selected_budget = col1.select_slider(get_text("budget_range", lang), options=budget_options, value=budget_options[0])
        selected_levels = col2.multiselect(get_text("study_level", lang), study_level_options, default=[])
        selected_countries = col3.multiselect(get_text("preferred_countries", lang), country_options, default=[])

    # --- Ensure RAG system is initialized once ---
    if "rag_system_ready" not in st.session_state:
        st.session_state.rag_system_ready = False
        try:
            st.session_state.rag_system = RAGSystem()
            st.session_state.rag_system_ready = True
        except Exception as e:
            st.error(f"Failed to initialize RAG system: {e}")

    # --- Search button ---
    search_disabled = not st.session_state.query_text.strip() or not st.session_state.rag_system_ready

    if st.button(get_text("search_button", lang), disabled=search_disabled):
        placeholder = st.empty()
        placeholder.info("Searching...")

        # Combine query with filter info
        filter_info = {
            "budget": selected_budget if selected_budget != budget_options[0] else None,
            "study_levels": selected_levels,
            "countries": selected_countries
        }
        full_query = f"{st.session_state.query_text.strip()}\nFilters: {filter_info}"

        # Call RAG system with filters
        query_result = st.session_state.rag_system.query(
            question=full_query,
            language=lang
        )

        placeholder.empty()
        save_query_result(query_result)

        st.success(query_result["answer"])

        if query_result["source_documents"]:
            st.markdown("#### Source Documents")
            for i, doc in enumerate(query_result["source_documents"], 1):
                st.markdown(
                    f"- **{i}. {doc.metadata.get('source', 'Unknown')}** "
                    f"({doc.metadata.get('university', 'Unknown')}, {doc.metadata.get('country', 'Unknown')})"
                )
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
my_pages/upload_documents.py
DELETED
@@ -1,202 +0,0 @@
from langchain.schema import Document
import streamlit as st
from utils.rag_system import DocumentIngestion
from utils.translations import get_text

def upload_documents_page():
    # Get current language from session state
    current_lang = st.session_state.get('app_language', 'English')

    st.header(get_text("upload_header", current_lang))
    st.write(get_text("upload_description", current_lang))

    # Add information about automatic metadata detection
    st.info("🤖 **Automatic Metadata Detection Enabled**: The system will automatically detect university name, country, and document type from your uploaded files using AI.")

    # File upload (removed manual metadata input fields)
    uploaded_files = st.file_uploader(
        get_text("choose_files", current_lang),
        accept_multiple_files=True,
        type=['pdf'],
        help=get_text("file_limit", current_lang)
    )

    # # Optional: Add language selection for processing (if needed for multilingual documents)
    # col1, col2 = st.columns(2)
    # with col1:
    #     processing_language = st.selectbox(
    #         f"🌐 Processing Language (Optional)",
    #         ["Auto-detect", "English", "Chinese", "Malay", "Thai", "Indonesian", "Vietnamese", "Filipino"],
    #         help="Select the primary language of your documents for better metadata extraction"
    #     )

    # with col2:
    #     # Optional: Allow users to override detected metadata if needed
    #     allow_manual_override = st.checkbox(
    #         "🔧 Allow manual metadata correction after processing",
    #         value=False,
    #         help="Enable this to manually correct any incorrectly detected metadata"
    #     )

    if uploaded_files and st.button(get_text("process_documents", current_lang), type="primary"):
        with st.spinner(f"{get_text('processing_docs', current_lang)} (with automatic metadata detection)..."):
            try:
                # Initialize document ingestion
                doc_ingestion = DocumentIngestion()

                # Process documents with automatic metadata extraction
                documents = doc_ingestion.process_documents(uploaded_files)

                if documents:
                    # Show detected metadata for review/correction if enabled
                    # if allow_manual_override and documents:
                    #     st.subheader("🔍 Review Detected Metadata")
                    #     st.write("Review and correct the automatically detected metadata if needed:")

                    #     corrected_documents = []
                    #     for i, doc in enumerate(documents):
                    #         with st.expander(f"📄 {doc.metadata['source']}", expanded=False):
                    #             col1, col2, col3 = st.columns(3)

                    #             with col1:
                    #                 corrected_university = st.text_input(
                    #                     "University Name",
                    #                     value=doc.metadata['university'],
                    #                     key=f"uni_{i}"
                    #                 )

                    #             with col2:
                    #                 corrected_country = st.selectbox(
                    #                     "Country",
                    #                     ["Unknown", "Singapore", "Malaysia", "Thailand", "Indonesia",
                    #                      "Philippines", "Vietnam", "Brunei", "Cambodia", "Laos", "Myanmar"],
                    #                     index=0 if doc.metadata['country'] == "Unknown" else
                    #                     (["Unknown", "Singapore", "Malaysia", "Thailand", "Indonesia",
                    #                       "Philippines", "Vietnam", "Brunei", "Cambodia", "Laos", "Myanmar"].index(doc.metadata['country'])
                    #                      if doc.metadata['country'] in ["Singapore", "Malaysia", "Thailand", "Indonesia",
                    #                                                     "Philippines", "Vietnam", "Brunei", "Cambodia", "Laos", "Myanmar"] else 0),
                    #                     key=f"country_{i}"
                    #                 )

                    #             with col3:
                    #                 corrected_doc_type = st.selectbox(
                    #                     "Document Type",
                    #                     ["admission_requirements", "tuition_fees", "program_information",
                    #                      "scholarship_info", "application_deadlines", "general_info"],
                    #                     index=["admission_requirements", "tuition_fees", "program_information",
                    #                            "scholarship_info", "application_deadlines", "general_info"].index(doc.metadata['document_type']),
                    #                     key=f"doctype_{i}"
                    #                 )

                    #             # Update document metadata with corrections
                    #             corrected_doc = Document(
                    #                 page_content=doc.page_content,
                    #                 metadata={
                    #                     **doc.metadata,
                    #                     "university": corrected_university,
                    #                     "country": corrected_country,
                    #                     "document_type": corrected_doc_type,
                    #                     "manually_corrected": True
                    #                 }
                    #             )
                    #             corrected_documents.append(corrected_doc)

                    #     # Use corrected documents
                    #     documents = corrected_documents

                    #     if st.button("✅ Confirm and Save Documents", type="primary"):
                    #         # Create or update vector store with corrected metadata
                    #         vectorstore = doc_ingestion.create_vector_store(documents)

                    #         if vectorstore:
                    #             st.success(f"✅ {get_text('successfully_processed', current_lang)} {len(documents)} {get_text('documents', current_lang)} with corrected metadata!")

                    #             # Show final processed files
                    #             with st.expander("📋 Final Processed Files"):
                    #                 for doc in documents:
                    #                     st.write(f"• **{doc.metadata['source']}**")
                    #                     st.write(f"  - University: {doc.metadata['university']}")
                    #                     st.write(f"  - Country: {doc.metadata['country']}")
                    #                     st.write(f"  - Type: {doc.metadata['document_type']}")
                    #                     if doc.metadata.get('manually_corrected'):
                    #                         st.write(f"  - ✏️ Manually corrected")
                    #                     st.write("---")
                    # else:
                    # Process normally without manual override
                    vectorstore = doc_ingestion.create_vector_store(documents)

                    if vectorstore:
                        st.success(f"✅ {get_text('successfully_processed', current_lang)} {len(documents)} {get_text('documents', current_lang)} with automatic metadata detection!")

                        # Show processed files with detected metadata
                        with st.expander("📋 Processed Files with Detected Metadata"):
                            for doc in documents:
                                st.write(f"• **{doc.metadata['source']}**")
                                st.write(f"  - 🏫 University: {doc.metadata['university']}")
                                st.write(f"  - 🌏 Country: {doc.metadata['country']}")
                                st.write(f"  - 📋 Type: {doc.metadata['document_type']}")
                                st.write(f"  - 🤖 Auto-detected: Yes")
                                st.write("---")

                        # Show summary of detected metadata
                        universities = list(set([doc.metadata['university'] for doc in documents if doc.metadata['university'] != 'Unknown']))
                        countries = list(set([doc.metadata['country'] for doc in documents if doc.metadata['country'] != 'Unknown']))
                        doc_types = list(set([doc.metadata['document_type'] for doc in documents]))

                        if universities or countries or doc_types:
                            st.subheader("📊 Detection Summary")
                            if universities:
                                st.write(f"🏫 **Universities detected**: {', '.join(universities)}")
                            if countries:
                                st.write(f"🌏 **Countries detected**: {', '.join(countries)}")
                            if doc_types:
                                st.write(f"📋 **Document types detected**: {', '.join(doc_types)}")
                else:
                    st.error(get_text("no_docs_processed", current_lang))

            except Exception as e:
                st.error(f"{get_text('failed_to_process', current_lang)}: {str(e)}")
                st.error("Please check your API keys and model configurations.")

# Additional helper function for metadata validation
def validate_metadata(metadata: dict) -> dict:
    """Validate and clean extracted metadata"""

    # List of valid countries for ASEAN region
    valid_countries = [
        "Singapore", "Malaysia", "Thailand", "Indonesia",
        "Philippines", "Vietnam", "Brunei", "Cambodia", "Laos", "Myanmar"
    ]

    # List of valid document types
    valid_doc_types = [
        "admission_requirements", "tuition_fees", "program_information",
        "scholarship_info", "application_deadlines", "general_info"
    ]

    # Clean and validate country
    if metadata.get('country', '').strip():
        country = metadata['country'].strip()
        # Try to match with valid countries (case insensitive)
        for valid_country in valid_countries:
            if valid_country.lower() in country.lower() or country.lower() in valid_country.lower():
                metadata['country'] = valid_country
                break
        else:
            # If no match found, keep original but mark as unvalidated
            if country.lower() not in [c.lower() for c in valid_countries]:
                metadata['country'] = country  # Keep original

    # Validate document type
    if metadata.get('document_type') not in valid_doc_types:
        metadata['document_type'] = "general_info"  # Default fallback

    # Clean university name
    if metadata.get('university_name'):
        # Remove common prefixes/suffixes that might be incorrectly included
        university = metadata['university_name'].strip()
        # Remove quotes if present
        university = university.strip('"\'')
        metadata['university_name'] = university

    return metadata
requirements.txt
CHANGED
@@ -1,5 +1,4 @@
-
-
+aiofiles==24.1.0
 aiohappyeyeballs==2.6.1
 aiohttp==3.12.15
 aiosignal==1.4.0
@@ -10,6 +9,7 @@ attrs==25.3.0
 backoff==2.2.1
 bcrypt==4.3.0
 blinker==1.9.0
+Brotli==1.1.0
 build==1.3.0
 cachetools==5.5.2
 certifi==2025.8.3
@@ -20,6 +20,8 @@ coloredlogs==15.0.1
 dataclasses-json==0.6.7
 distro==1.9.0
 durationpy==0.10
+fastapi==0.116.1
+ffmpy==0.6.1
 filelock==3.18.0
 flatbuffers==25.2.10
 frozenlist==1.7.0
@@ -28,6 +30,9 @@ gitdb==4.0.12
 GitPython==3.1.45
 google-auth==2.40.3
 googleapis-common-protos==1.70.0
+gradio==5.42.0
+gradio_client==1.11.1
+groovy==0.1.2
 grpcio==1.74.0
 h11==0.16.0
 hf-xet==1.1.7
@@ -91,12 +96,14 @@ pydantic==2.11.7
 pydantic-settings==2.10.1
 pydantic_core==2.33.2
 pydeck==0.9.1
+pydub==0.25.1
 Pygments==2.19.2
 PyPDF2==3.0.1
 PyPika==0.48.9
 pyproject_hooks==1.2.0
 python-dateutil==2.9.0.post0
 python-dotenv==1.1.1
+python-multipart==0.0.20
 pytz==2025.2
 PyYAML==6.0.2
 referencing==0.36.2
@@ -107,15 +114,19 @@ requests-toolbelt==1.0.0
 rich==14.1.0
 rpds-py==0.27.0
 rsa==4.9.1
+ruff==0.12.8
+safehttpx==0.1.6
 safetensors==0.6.2
 scikit-learn==1.7.1
 scipy==1.16.1
+semantic-version==2.10.0
 sentence-transformers==5.1.0
 shellingham==1.5.4
 six==1.17.0
 smmap==5.0.2
 sniffio==1.3.1
-
+SQLAlchemy==2.0.43
+starlette==0.47.2
 streamlit==1.48.0
 sympy==1.14.0
 tenacity==9.1.2
@@ -123,6 +134,7 @@ threadpoolctl==3.6.0
 tiktoken==0.11.0
 tokenizers==0.21.4
 toml==0.10.2
+tomlkit==0.13.3
 torch==2.8.0
 tornado==6.5.2
 tqdm==4.67.1
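The additions above are the Gradio/FastAPI serving stack that replaces the Streamlit front end. The old `start.sh` (deleted below) used to verify the Streamlit imports before launch; a comparable sanity check for the new environment is sketched here. This script is not part of the commit, and it assumes the LangChain/Chroma entries from the unchanged part of requirements.txt are still installed.

```python
# check_env.py - illustrative only, not part of this commit.
# Verifies that the Gradio-era dependencies import cleanly in the current virtualenv.
import importlib

# Import names (which can differ from PyPI package names).
modules = ["gradio", "fastapi", "chromadb", "langchain", "sentence_transformers", "PyPDF2"]

for name in modules:
    try:
        mod = importlib.import_module(name)
        print(f"OK   {name} {getattr(mod, '__version__', '?')}")
    except ImportError as exc:
        print(f"MISS {name}: {exc}")
```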
runtime.txt
DELETED
@@ -1 +0,0 @@
python-3.10.12
start.sh
DELETED
@@ -1,43 +0,0 @@
#!/bin/bash

# PanSea University Search - Startup Script

echo "🎓 Starting PanSea University Search..."

# Check if virtual environment exists
if [ ! -d ".venv" ]; then
    echo "❌ Virtual environment not found. Please run setup first."
    exit 1
fi

# Activate virtual environment
source .venv/bin/activate

# Check if .env file exists
if [ ! -f ".env" ]; then
    echo "⚠️ .env file not found. Please create one with your OpenAI API key."
    echo "Example:"
    echo "OPENAI_API_KEY=your_api_key_here"
    exit 1
fi

# Create necessary directories
mkdir -p chroma_db
mkdir -p documents
mkdir -p query_results

# Check if required packages are installed
echo "🔍 Checking dependencies..."
python -c "import streamlit, langchain, chromadb" 2>/dev/null
if [ $? -ne 0 ]; then
    echo "❌ Dependencies not found. Installing..."
    pip install -r requirements.txt
fi

echo "🚀 Starting Streamlit application..."
echo "📱 Open your browser to: http://localhost:8501"
echo "🛑 Press Ctrl+C to stop the application"
echo ""

# Start the Streamlit app
streamlit run app.py --server.port=8501 --server.address=0.0.0.0
tabs/help.py
ADDED
@@ -0,0 +1,168 @@
"""
Help tab functionality for the Gradio app
"""
import gradio as gr

def create_help_tab(global_vars):
    """Create the Help tab with comprehensive documentation"""
    with gr.Tab("❓ Help", id="help"):
        gr.Markdown("""
# 🌏 PANSEA University Requirements Assistant - User Guide

Welcome to the PANSEA (Pan-Southeast Asian) University Requirements Assistant! This tool helps you navigate university admission requirements across Southeast Asian countries using advanced AI-powered document analysis.

---

## 🚀 Getting Started

### Step 1: Initialize the System
1. Go to the **🔧 Initialize** tab
2. Click **"Initialize All Systems"**
3. Wait for the success message
4. The system will set up AI models and document processing capabilities

### Step 2: Upload Documents
1. Navigate to the **📤 Upload Documents** tab
2. Select one or more PDF files containing university requirement information
3. Fill in the document metadata:
   - **University Name**: Official name of the institution
   - **Country**: Select from Southeast Asian countries
   - **Document Type**: Choose the type of document
   - **Language**: Document language
4. Click **"Process Documents"**
5. Wait for processing completion

### Step 3: Query Documents
1. Go to the **🔍 Query Documents** tab
2. Type your question in the query box
3. Click **"Search Documents"**
4. Review the AI-generated answer and source references
5. Use example questions to explore different types of queries

### Step 4: Manage Documents
1. Visit the **🗂 Manage Documents** tab
2. View all uploaded documents and statistics
3. Delete individual documents or clear all documents as needed

---

## 📖 Features Overview

### 🤖 AI-Powered Analysis
- Uses advanced SEA-LION AI models optimized for Southeast Asian contexts
- Semantic search across your document collection
- Contextual answers with source citations
- Multi-language document support

### 📚 Document Management
- Support for PDF documents
- Intelligent text chunking for better search results
- Metadata tracking (university, country, document type, language)
- Easy document deletion and management

### 🌐 Regional Focus
- Specialized for Southeast Asian universities
- Supports multiple countries and languages
- Culturally aware responses
- Up-to-date admission requirement information

---

## 💡 Usage Tips

### Asking Better Questions
- **Be Specific**: "What are the English proficiency requirements for Computer Science at NUS?" instead of "What are the requirements?"
- **Include Context**: Mention specific programs, countries, or universities you're interested in
- **Use Keywords**: Include terms like "admission", "requirements", "GPA", "test scores", etc.

### Document Upload Best Practices
- **Quality Documents**: Upload official university brochures, requirement documents, or application guides
- **Accurate Metadata**: Fill in all metadata fields correctly for better search results
- **Regular Updates**: Replace outdated documents with current versions
- **Organized Approach**: Upload documents systematically by country or university

### Managing Your Knowledge Base
- **Regular Maintenance**: Remove outdated documents periodically
- **Logical Organization**: Group related documents together
- **Backup Important Queries**: Save important answers for future reference

---

## 🛠 Troubleshooting

### Common Issues

**Problem**: "Please initialize systems first" error
- **Solution**: Go to the Initialize tab and click "Initialize All Systems"

**Problem**: Document upload fails
- **Solution**: Ensure PDF files are not corrupted and contain text (not just images)

**Problem**: No search results
- **Solution**: Check if documents are uploaded and try different keywords

**Problem**: Slow performance
- **Solution**: Wait for processing to complete, avoid uploading too many large documents at once

### Technical Requirements
- **File Format**: PDF documents only
- **File Size**: Reasonable size limits (avoid extremely large files)
- **Content**: Text-based PDFs work best (scanned images may not work well)
- **Internet**: Required for AI model access

---

## 📊 Understanding Results

### Query Responses
- **Answer**: AI-generated response based on your documents
- **Sources**: Specific document chunks used to generate the answer
- **Confidence**: Implied by the specificity and detail of the response
- **Context**: Related information that might be helpful

### Document Statistics
- **Total Documents**: Number of unique documents uploaded
- **Total Chunks**: Number of text segments for searching
- **Metadata**: Information about each document's origin and type

---

## 🌟 Best Practices for University Research

### Research Strategy
1. **Start Broad**: Upload general university information first
2. **Get Specific**: Add detailed program requirements
3. **Compare Options**: Query for comparisons between universities
4. **Verify Information**: Cross-reference with official university websites

### Question Types to Try
- **Admission Requirements**: "What are the minimum GPA requirements for..."
- **Test Scores**: "What IELTS/TOEFL scores are needed for..."
- **Application Deadlines**: "When is the application deadline for..."
- **Program Details**: "What courses are included in the... program at..."
- **Scholarships**: "What scholarship opportunities are available for..."

---

## 🆘 Support & Feedback

If you encounter issues or have suggestions for improvement:

1. **Check Documentation**: Review this help section first
2. **Try Different Approaches**: Rephrase your queries or check document formats
3. **Document Issues**: Note specific error messages or unexpected behavior
4. **Feature Requests**: Consider what additional functionality would be helpful

---

## 🔄 Version Information

**Current Version**: Gradio-based PANSEA Assistant
**AI Models**: SEA-LION optimized for Southeast Asian contexts
**Document Processing**: Advanced semantic chunking and embedding
**Search Technology**: Vector similarity search with contextual ranking

---

*Happy university hunting! 🎓 We hope this tool helps you find the perfect educational opportunity in Southeast Asia.*
""")
tabs/initialize.py
ADDED
@@ -0,0 +1,55 @@
"""
Initialize tab functionality for the Gradio app
"""
import gradio as gr
from utils.rag_system import DocumentIngestion, RAGSystem

def initialize_systems(global_vars):
    """Initialize the RAG systems"""
    try:
        print("🚀 Initializing document ingestion system...")
        global_vars['doc_ingestion'] = DocumentIngestion()
        print("🚀 Initializing RAG system...")
        global_vars['rag_system'] = RAGSystem()
        return "✅ Systems initialized successfully! You can now upload documents."
    except Exception as e:
        error_msg = f"❌ Error initializing systems: {str(e)}\n\n"

        if "sentence-transformers" in str(e):
            error_msg += """
**Possible solutions:**
1. Install sentence-transformers: `pip install sentence-transformers`
2. Or provide OpenAI API key in environment variables
3. Check that PyTorch is properly installed

**For deployment:**
- Ensure requirements.txt includes: sentence-transformers, torch, transformers
"""
        return error_msg

def create_initialize_tab(global_vars):
    """Create the Initialize System tab"""
    with gr.Tab("🚀 Initialize System", id="init"):
        gr.Markdown("""
        ### Step 1: Initialize the System
        Click the button below to initialize the AI models and embedding systems.
        This may take a few moments on first run as models are downloaded.
        """)

        init_btn = gr.Button(
            "🚀 Initialize Systems",
            variant="primary",
            size="lg"
        )

        init_status = gr.Textbox(
            label="Initialization Status",
            interactive=False,
            lines=8,
            placeholder="Click 'Initialize Systems' to start..."
        )

        init_btn.click(
            lambda: initialize_systems(global_vars),
            outputs=init_status
        )
tabs/manage.py
ADDED
@@ -0,0 +1,237 @@
"""
Manage documents tab functionality for the Gradio app
"""
import gradio as gr

def manage_documents(global_vars):
    """Manage uploaded documents - view, delete individual or all documents"""
    doc_ingestion = global_vars.get('doc_ingestion')

    if not doc_ingestion:
        return "❌ Please initialize systems first!", "", ""

    try:
        vectorstore = doc_ingestion.load_existing_vectorstore()

        if not vectorstore:
            return "⚠️ No documents found. Upload documents first.", "", ""

        # Get all documents from vectorstore
        collection = vectorstore._collection
        all_docs = collection.get(include=["metadatas", "documents"])
        metadatas = all_docs["metadatas"]
        ids = all_docs["ids"]
        documents = all_docs["documents"]

        # Group by file_id to show unique documents
        doc_map = {}
        for meta, doc_id, doc_text in zip(metadatas, ids, documents):
            file_id = meta.get("file_id", doc_id)
            if file_id not in doc_map:
                doc_map[file_id] = {
                    "source": meta.get("source", "Unknown"),
                    "university": meta.get("university", "Unknown"),
                    "country": meta.get("country", "Unknown"),
                    "document_type": meta.get("document_type", "Unknown"),
                    "language": meta.get("language", "Unknown"),
                    "upload_timestamp": meta.get("upload_timestamp", "Unknown"),
                    "file_id": file_id,
                    "chunks": []
                }
            doc_map[file_id]["chunks"].append(doc_text)

        if not doc_map:
            return "ℹ️ No documents found in the system.", "", ""

        # Create summary
        total_documents = len(doc_map)
        total_chunks = sum(len(info["chunks"]) for info in doc_map.values())

        summary = f"""## 📊 Document Statistics

**Total Documents:** {total_documents}
**Total Text Chunks:** {total_chunks}
**Storage Status:** Active

## 📚 Document List
"""

        # Create document list with details
        document_list = ""
        file_id_list = []

        for i, (file_id, info) in enumerate(doc_map.items(), 1):
            timestamp = info['upload_timestamp'][:19] if len(info['upload_timestamp']) > 19 else info['upload_timestamp']

            document_list += f"""
**{i}. {info['source']}**
- University: {info['university']}
- Country: {info['country']}
- Type: {info['document_type']}
- Language: {info['language']}
- Chunks: {len(info['chunks'])}
- Uploaded: {timestamp}
- File ID: `{file_id}`

---
"""
            file_id_list.append(file_id)

        # Create dropdown options for individual deletion
        file_options = [f"{info['source']} ({info['university']})" for info in doc_map.values()]

        return summary, document_list, file_options

    except Exception as e:
        return f"❌ Error loading documents: {str(e)}", "", []

def delete_document(selected_file, current_doc_list, global_vars):
    """Delete a specific document"""
    doc_ingestion = global_vars.get('doc_ingestion')

    if not doc_ingestion or not selected_file:
        return "❌ Please select a document to delete.", current_doc_list

    try:
        vectorstore = doc_ingestion.load_existing_vectorstore()
        if not vectorstore:
            return "❌ No vectorstore found.", current_doc_list

        # Get all documents and find the matching file_id
        collection = vectorstore._collection
        all_docs = collection.get(include=["metadatas"])
        metadatas = all_docs["metadatas"]
        ids = all_docs["ids"]

        # Find file_id for the selected document
        target_file_id = None
        for meta, doc_id in zip(metadatas, ids):
            source = meta.get("source", "Unknown")
            university = meta.get("university", "Unknown")
            if f"{source} ({university})" == selected_file:
                target_file_id = meta.get("file_id", doc_id)
                break

        if not target_file_id:
            return "❌ Document not found.", current_doc_list

        # Delete all chunks with this file_id
        ids_to_delete = [doc_id for meta, doc_id in zip(metadatas, ids) if meta.get("file_id", doc_id) == target_file_id]
        collection.delete(ids=ids_to_delete)

        # Refresh the document list
        _, new_doc_list, _ = manage_documents(global_vars)

        return f"✅ Successfully deleted document: {selected_file}", new_doc_list

    except Exception as e:
        return f"❌ Error deleting document: {str(e)}", current_doc_list

def delete_all_documents(global_vars):
    """Delete all documents from the vectorstore"""
    doc_ingestion = global_vars.get('doc_ingestion')

    if not doc_ingestion:
        return "❌ Please initialize systems first.", ""

    try:
        vectorstore_instance = doc_ingestion.load_existing_vectorstore()
        if not vectorstore_instance:
            return "⚠️ No documents found to delete.", ""

        # Get all document IDs
        collection = vectorstore_instance._collection
        all_docs = collection.get()
        all_ids = all_docs["ids"]

        # Delete all documents
        if all_ids:
            collection.delete(ids=all_ids)
            # Clear global vectorstore
            global_vars['vectorstore'] = None
            return f"✅ Successfully deleted all {len(all_ids)} document chunks.", ""
        else:
            return "ℹ️ No documents found to delete.", ""

    except Exception as e:
        return f"❌ Error deleting all documents: {str(e)}", ""

def create_manage_tab(global_vars):
    """Create the Manage Documents tab"""
    with gr.Tab("🗂 Manage Documents", id="manage"):
        gr.Markdown("""
        ### Step 4: Manage Your Documents
        View, inspect, and manage all uploaded documents in your knowledge base.
        You can see document details and delete individual documents or all documents.
        """)

        # Buttons for actions
        with gr.Row():
            refresh_btn = gr.Button("🔄 Refresh Document List", variant="secondary")
            delete_all_btn = gr.Button("🗑️ Delete All Documents", variant="stop")

        # Document statistics and list
        doc_summary = gr.Markdown(
            value="📊 Click 'Refresh Document List' to view your documents.",
            label="Document Summary"
        )

        doc_list = gr.Markdown(
            value="📚 Document details will appear here after refresh.",
            label="Document List"
        )

        # Individual document deletion
        gr.Markdown("### 🗑️ Delete Individual Document")

        with gr.Row():
            file_selector = gr.Dropdown(
                choices=[],
                label="Select Document to Delete",
                interactive=True,
                info="First click 'Refresh Document List' to see available documents"
            )
            delete_single_btn = gr.Button("🗑️ Delete Selected", variant="stop")

        delete_status = gr.Textbox(
            label="Action Status",
            interactive=False,
            lines=2,
            placeholder="Deletion status will appear here..."
        )

        # Event handlers
        def refresh_documents():
            summary, documents, file_options = manage_documents(global_vars)
            # Update dropdown choices
            return summary, documents, gr.Dropdown(choices=file_options, value=None)

        def delete_selected_document(selected_file, current_list):
            if not selected_file:
                return "❌ Please select a document to delete first.", current_list, gr.Dropdown(choices=[])

            status, new_list = delete_document(selected_file, current_list, global_vars)
            # Also refresh the file options after deletion
            _, _, new_options = manage_documents(global_vars)
            return status, new_list, gr.Dropdown(choices=new_options, value=None)

        def delete_all_docs():
            status, empty_list = delete_all_documents(global_vars)
            return status, "📚 No documents in the system.", gr.Dropdown(choices=[], value=None)

        # Connect event handlers
        refresh_btn.click(
            refresh_documents,
            outputs=[doc_summary, doc_list, file_selector]
        )

        delete_single_btn.click(
            delete_selected_document,
            inputs=[file_selector, doc_list],
            outputs=[delete_status, doc_list, file_selector]
        )

        delete_all_btn.click(
            delete_all_docs,
            outputs=[delete_status, doc_list, file_selector]
        )
tabs/query.py
ADDED
@@ -0,0 +1,139 @@
"""
Query documents tab functionality for the Gradio app
"""
import gradio as gr

def query_documents(question, language, global_vars):
    """Handle document queries"""
    rag_system = global_vars.get('rag_system')
    vectorstore = global_vars.get('vectorstore')

    if not rag_system:
        return "❌ Please initialize systems first using the 'Initialize System' tab!"

    if not vectorstore:
        return "❌ Please upload and process documents first using the 'Upload Documents' tab!"

    if not question.strip():
        return "❌ Please enter a question."

    try:
        print(f"🔍 Processing query: {question}")
        result = rag_system.query(question, language)

        # Format response
        answer = result["answer"]
        sources = result.get("source_documents", [])
        model_used = result.get("model_used", "SEA-LION")

        # Add model information
        response = f"**Model Used:** {model_used}\n\n"
        response += f"**Answer:**\n{answer}\n\n"

        if sources:
            response += "**📚 Sources:**\n"
            for i, doc in enumerate(sources[:3], 1):
                metadata = doc.metadata
                source_name = metadata.get('source', 'Unknown')
                university = metadata.get('university', 'Unknown')
                country = metadata.get('country', 'Unknown')
                doc_type = metadata.get('document_type', 'Unknown')

                response += f"{i}. **{source_name}**\n"
                response += f"   - University: {university}\n"
                response += f"   - Country: {country}\n"
                response += f"   - Type: {doc_type}\n"
                response += f"   - Preview: {doc.page_content[:150]}...\n\n"
        else:
            response += "\n*No specific sources found. This might be a general response.*"

        return response

    except Exception as e:
        return f"❌ Error querying documents: {str(e)}\n\nPlease check the console for more details."

def get_example_questions():
    """Return example questions for the interface"""
    return [
        "What are the admission requirements for Computer Science programs in Singapore?",
        "Which universities offer scholarships for international students?",
        "What are the tuition fees for MBA programs in Thailand?",
        "Find universities with engineering programs under $5000 per year",
        "What are the application deadlines for programs in Malaysia?",
        "Compare admission requirements between different ASEAN countries"
    ]

def create_query_tab(global_vars):
    """Create the Search & Query tab"""
    with gr.Tab("🔍 Search & Query", id="query"):
        gr.Markdown("""
        ### Step 3: Ask Questions
        Ask questions about the uploaded documents in your preferred language.
        The AI will provide detailed answers with source citations.
        """)

        with gr.Row():
            with gr.Column(scale=3):
                question_input = gr.Textbox(
                    label="💭 Your Question",
                    placeholder="Ask anything about the universities...",
                    lines=3
                )

            with gr.Column(scale=1):
                language_dropdown = gr.Dropdown(
                    choices=[
                        "English", "Chinese", "Malay", "Thai",
                        "Indonesian", "Vietnamese", "Filipino"
                    ],
                    value="English",
                    label="🌍 Response Language"
                )

        query_btn = gr.Button(
            "🔍 Search Documents",
            variant="primary",
            size="lg"
        )

        answer_output = gr.Textbox(
            label="🤖 AI Response",
            interactive=False,
            lines=20,
            placeholder="Ask a question to get AI-powered answers..."
        )

        # Example questions section
        gr.Markdown("### 💡 Example Questions")
        example_questions = get_example_questions()

        with gr.Row():
            for i in range(0, len(example_questions), 2):
                with gr.Column():
                    if i < len(example_questions):
                        example_btn = gr.Button(
                            example_questions[i],
                            size="sm",
                            variant="secondary"
                        )
                        example_btn.click(
                            lambda x=example_questions[i]: x,
                            outputs=question_input
                        )

                    if i + 1 < len(example_questions):
                        example_btn2 = gr.Button(
                            example_questions[i + 1],
                            size="sm",
                            variant="secondary"
                        )
                        example_btn2.click(
                            lambda x=example_questions[i + 1]: x,
                            outputs=question_input
                        )

        query_btn.click(
            lambda question, language: query_documents(question, language, global_vars),
            inputs=[question_input, language_dropdown],
            outputs=answer_output
        )
tabs/upload.py
ADDED
@@ -0,0 +1,99 @@
"""
Upload documents tab functionality for the Gradio app
"""
import gradio as gr

def upload_documents(files, global_vars):
    """Handle document upload and processing"""
    doc_ingestion = global_vars.get('doc_ingestion')

    if not doc_ingestion:
        return "❌ Please initialize systems first using the 'Initialize System' tab!"

    if not files:
        return "❌ Please upload at least one PDF file."

    try:
        # Filter for PDF files only
        pdf_files = []
        for file_path in files:
            if file_path.endswith('.pdf'):
                pdf_files.append(file_path)

        if not pdf_files:
            return "❌ Please upload PDF files only."

        print(f"📄 Processing {len(pdf_files)} PDF file(s)...")

        # Process documents
        documents = doc_ingestion.process_documents(pdf_files)

        if documents:
            print("🔗 Creating vector store...")
            # Create vector store
            vectorstore = doc_ingestion.create_vector_store(documents)

            if vectorstore:
                # Store vectorstore in global vars
                global_vars['vectorstore'] = vectorstore

                # Create summary
                summary = f"✅ Successfully processed {len(documents)} document(s):\n\n"

                for i, doc in enumerate(documents, 1):
                    metadata = doc.metadata
                    university = metadata.get('university', 'Unknown')
                    country = metadata.get('country', 'Unknown')
                    doc_type = metadata.get('document_type', 'Unknown')
                    language = metadata.get('language', 'Unknown')

                    summary += f"{i}. **{metadata['source']}**\n"
                    summary += f"   - University: {university}\n"
                    summary += f"   - Country: {country}\n"
                    summary += f"   - Type: {doc_type}\n"
                    summary += f"   - Language: {language}\n\n"

                summary += "🎉 **Ready for queries!** Go to the 'Search & Query' tab to start asking questions."
                return summary
            else:
                return "❌ Failed to create vector store from documents."
        else:
            return "❌ No documents were successfully processed. Please check if your PDFs are readable."

    except Exception as e:
        return f"❌ Error processing documents: {str(e)}\n\nPlease check the console for more details."

def create_upload_tab(global_vars):
    """Create the Upload Documents tab"""
    with gr.Tab("📄 Upload Documents", id="upload"):
        gr.Markdown("""
        ### Step 2: Upload PDF Documents
        Upload university documents (brochures, admission guides, etc.) in PDF format.
        The system will automatically extract metadata including university name, country, and document type.
        """)

        file_upload = gr.File(
            label="📁 Upload PDF Documents",
            file_types=[".pdf"],
            file_count="multiple",
            height=120
        )

        upload_btn = gr.Button(
            "📄 Process Documents",
            variant="primary",
            size="lg"
        )

        upload_status = gr.Textbox(
            label="Processing Status",
            interactive=False,
            lines=12,
            placeholder="Upload PDF files and click 'Process Documents'..."
        )

        upload_btn.click(
            lambda files: upload_documents(files, global_vars),
            inputs=file_upload,
            outputs=upload_status
        )
utils/rag_system.py
CHANGED
|
@@ -2,7 +2,6 @@ import os
|
|
| 2 |
import uuid
|
| 3 |
import tempfile
|
| 4 |
from typing import List, Optional, Dict, Any
|
| 5 |
-
import streamlit as st
|
| 6 |
from pathlib import Path
|
| 7 |
import PyPDF2
|
| 8 |
from langchain.text_splitter import RecursiveCharacterTextSplitter
|
|
@@ -27,24 +26,54 @@ class AlternativeEmbeddings:
|
|
| 27 |
"""Alternative embeddings using Sentence Transformers when OpenAI is not available"""
|
| 28 |
|
| 29 |
def __init__(self):
|
|
|
|
|
|
|
|
|
|
| 30 |
try:
|
| 31 |
from sentence_transformers import SentenceTransformer
|
| 32 |
-
|
| 33 |
-
|
| 34 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 35 |
except ImportError:
|
| 36 |
-
|
| 37 |
-
|
| 38 |
|
| 39 |
def embed_documents(self, texts):
|
| 40 |
if not self.model:
|
| 41 |
-
|
| 42 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 43 |
|
| 44 |
def embed_query(self, text):
|
| 45 |
if not self.model:
|
| 46 |
-
|
| 47 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 48 |
|
| 49 |
class SEALionLLM:
|
| 50 |
"""Custom LLM class for SEA-LION models"""
|
|
@@ -168,7 +197,7 @@ class SEALionLLM:
|
|
| 168 |
return response_text
|
| 169 |
|
| 170 |
except Exception as e:
|
| 171 |
-
|
| 172 |
return f"I apologize, but I encountered an error processing your query. Please try rephrasing your question. Error: {str(e)}"
|
| 173 |
|
| 174 |
def extract_metadata(self, document_text: str) -> Dict[str, str]:
|
|
@@ -213,33 +242,33 @@ class SEALionLLM:
|
|
| 213 |
)
|
| 214 |
|
| 215 |
response_text = response.choices[0].message.content.strip()
|
| 216 |
-
|
| 217 |
-
|
| 218 |
-
|
| 219 |
|
| 220 |
json_match = re.search(r'\{.*?\}', response_text, re.DOTALL)
|
| 221 |
if json_match:
|
| 222 |
json_str = json_match.group(0)
|
| 223 |
try:
|
| 224 |
metadata = json.loads(json_str)
|
| 225 |
-
|
| 226 |
required_keys = ["university_name", "country", "document_type", "language"]
|
| 227 |
if all(key in metadata for key in required_keys):
|
| 228 |
-
|
| 229 |
return metadata
|
| 230 |
else:
|
| 231 |
-
|
| 232 |
return self._get_default_metadata()
|
| 233 |
except json.JSONDecodeError as e:
|
| 234 |
-
|
| 235 |
-
|
| 236 |
return self._extract_from_text_response(response_text)
|
| 237 |
else:
|
| 238 |
-
|
| 239 |
return self._extract_from_text_response(response_text)
|
| 240 |
|
| 241 |
except Exception as e:
|
| 242 |
-
|
| 243 |
return self._get_default_metadata()
|
| 244 |
|
| 245 |
def _extract_from_text_response(self, response_text: str) -> Dict[str, str]:
|
|
@@ -260,7 +289,7 @@ class SEALionLLM:
|
|
| 260 |
elif "language" in line.lower() and ":" in line:
|
| 261 |
value = line.split(":", 1)[1].strip().strip('",')
|
| 262 |
metadata["language"] = value
|
| 263 |
-
|
| 264 |
return metadata
|
| 265 |
|
| 266 |
def _get_default_metadata(self) -> Dict[str, str]:
|
|
@@ -301,10 +330,10 @@ class DocumentIngestion:
|
|
| 301 |
self.embeddings = OpenAIEmbeddings()
|
| 302 |
self.embedding_type = "OpenAI"
|
| 303 |
except Exception as e:
|
| 304 |
-
|
| 305 |
raise e
|
| 306 |
else:
|
| 307 |
-
|
| 308 |
raise Exception("No embedding model available")
|
| 309 |
|
| 310 |
self.text_splitter = SemanticChunker(
|
|
@@ -321,80 +350,77 @@ class DocumentIngestion:
|
|
| 321 |
self.persist_directory = os.getenv("CHROMA_PERSIST_DIRECTORY", "./chroma_db")
|
| 322 |
os.makedirs(self.persist_directory, exist_ok=True)
|
| 323 |
|
| 324 |
-
def extract_text_from_pdf(self,
|
| 325 |
-
"""Extract text from
|
| 326 |
try:
|
| 327 |
# Method 1: Try with PyPDF2 (handles most PDFs including encrypted ones with PyCryptodome)
|
| 328 |
-
|
| 329 |
-
|
| 330 |
-
|
| 331 |
-
|
| 332 |
-
|
| 333 |
-
|
| 334 |
-
|
| 335 |
-
|
| 336 |
-
|
| 337 |
-
|
| 338 |
-
|
| 339 |
-
|
| 340 |
-
|
| 341 |
-
|
| 342 |
-
|
| 343 |
-
|
| 344 |
-
|
| 345 |
-
|
| 346 |
-
|
| 347 |
-
|
| 348 |
-
|
| 349 |
-
|
| 350 |
-
|
| 351 |
-
|
| 352 |
-
|
|
|
|
| 353 |
|
| 354 |
except Exception as e:
|
| 355 |
error_msg = str(e)
|
| 356 |
if "PyCryptodome" in error_msg:
|
| 357 |
-
|
| 358 |
-
|
| 359 |
elif "password" in error_msg.lower():
|
| 360 |
-
|
| 361 |
-
|
| 362 |
else:
|
| 363 |
-
|
| 364 |
return []
|
| 365 |
|
| 366 |
-
def process_documents(self,
|
| 367 |
-
"""Process
|
| 368 |
documents = []
|
| 369 |
processed_count = 0
|
| 370 |
failed_count = 0
|
| 371 |
|
| 372 |
-
|
| 373 |
|
| 374 |
-
for
|
| 375 |
-
if
|
| 376 |
-
|
|
|
|
| 377 |
|
| 378 |
# Extract text per page
|
| 379 |
-
text_per_page = self.extract_text_from_pdf(
|
| 380 |
-
|
| 381 |
|
| 382 |
if text_per_page:
|
| 383 |
# Combine first two pages for metadata extraction
|
| 384 |
text_for_metadata = "\n".join(text_per_page[:2])
|
| 385 |
-
|
| 386 |
# Extract metadata using LLM
|
| 387 |
-
|
| 388 |
extracted_metadata = self.sea_lion_llm.extract_metadata(text_for_metadata)
|
| 389 |
|
| 390 |
-
# Validate and clean metadata (assuming validate_metadata is defined elsewhere or will be added)
|
| 391 |
-
# For now, we\'ll use the extracted_metadata directly.
|
| 392 |
-
# If you want me to add validate_metadata here, please provide its content.
|
| 393 |
-
# extracted_metadata = validate_metadata(extracted_metadata)
|
| 394 |
-
|
| 395 |
# Create metadata
|
| 396 |
metadata = {
|
| 397 |
-
"source":
|
| 398 |
"university": extracted_metadata.get("university_name", "Unknown"),
|
| 399 |
"country": extracted_metadata.get("country", "Unknown"),
|
| 400 |
"document_type": extracted_metadata.get("document_type", "general_info"),
|
|
@@ -410,26 +436,27 @@ class DocumentIngestion:
|
|
| 410 |
)
|
| 411 |
documents.append(doc)
|
| 412 |
processed_count += 1
|
| 413 |
-
|
| 414 |
else:
|
| 415 |
failed_count += 1
|
| 416 |
-
|
| 417 |
else:
|
| 418 |
failed_count += 1
|
| 419 |
-
|
| 420 |
|
| 421 |
# Summary
|
| 422 |
if processed_count > 0:
|
| 423 |
-
|
| 424 |
if failed_count > 0:
|
| 425 |
-
|
| 426 |
|
| 427 |
return documents
|
| 428 |
|
| 429 |
def create_vector_store(self, documents: List[Document]) -> Chroma:
|
| 430 |
"""Create and persist vector store from documents."""
|
| 431 |
if not documents:
|
| 432 |
-
|
| 433 |
return None
|
| 434 |
|
| 435 |
# Split documents into chunks
|
|
@@ -453,7 +480,7 @@ class DocumentIngestion:
|
|
| 453 |
)
|
| 454 |
return vectorstore
|
| 455 |
except Exception as e:
|
| 456 |
-
|
| 457 |
return None
|
| 458 |
|
| 459 |
class RAGSystem:
|
|
@@ -480,7 +507,7 @@ class RAGSystem:
|
|
| 480 |
)
|
| 481 |
return vectorstore
|
| 482 |
except Exception as e:
|
| 483 |
-
|
| 484 |
return None
|
| 485 |
|
| 486 |
def query(self, question: str, language: str = "English") -> Dict[str, Any]:
|
|
@@ -532,7 +559,7 @@ Document {i} (Source: {source_info}, University: {university}, Country: {country
|
|
| 532 |
}
|
| 533 |
|
| 534 |
except Exception as e:
|
| 535 |
-
|
| 536 |
return {
|
| 537 |
"answer": f"Error processing your question: {str(e)}",
|
| 538 |
"source_documents": [],
|
|
@@ -570,7 +597,7 @@ def save_query_result(query_result: Dict[str, Any]):
|
|
| 570 |
json.dump(save_data, f, indent=2, ensure_ascii=False)
|
| 571 |
return True
|
| 572 |
except Exception as e:
|
| 573 |
-
|
| 574 |
return False
|
| 575 |
return False
|
| 576 |
|
|
@@ -583,6 +610,6 @@ def load_shared_query(query_id: str) -> Optional[Dict[str, Any]]:
|
|
| 583 |
with open(result_file, 'r', encoding='utf-8') as f:
|
| 584 |
return json.load(f)
|
| 585 |
except Exception as e:
|
| 586 |
-
|
| 587 |
|
| 588 |
return None
|
|
|
|
| 2 |
import uuid
|
| 3 |
import tempfile
|
| 4 |
from typing import List, Optional, Dict, Any
|
|
|
|
| 5 |
from pathlib import Path
|
| 6 |
import PyPDF2
|
| 7 |
from langchain.text_splitter import RecursiveCharacterTextSplitter
|
|
|
|
| 26 |
"""Alternative embeddings using Sentence Transformers when OpenAI is not available"""
|
| 27 |
|
| 28 |
def __init__(self):
|
| 29 |
+
self.model = None
|
| 30 |
+
self.embedding_size = 384
|
| 31 |
+
|
| 32 |
try:
|
| 33 |
from sentence_transformers import SentenceTransformer
|
| 34 |
+
|
| 35 |
+
# Try smaller models in order of preference for better cloud compatibility
|
| 36 |
+
model_options = [
|
| 37 |
+
("all-MiniLM-L6-v2", 384), # Very small and reliable
|
| 38 |
+
("paraphrase-MiniLM-L3-v2", 384), # Even smaller
|
| 39 |
+
("BAAI/bge-small-en-v1.5", 384) # Original choice
|
| 40 |
+
]
|
| 41 |
+
|
| 42 |
+
for model_name, embed_size in model_options:
|
| 43 |
+
try:
|
| 44 |
+
print(f"🔄 Trying to load model: {model_name}")
|
| 45 |
+
self.model = SentenceTransformer(model_name)
|
| 46 |
+
self.embedding_size = embed_size
|
| 47 |
+
print(f"✅ Successfully loaded: {model_name}")
|
| 48 |
+
break
|
| 49 |
+
except Exception as e:
|
| 50 |
+
print(f"⚠️ Failed to load {model_name}: {str(e)}")
|
| 51 |
+
continue
|
| 52 |
+
|
| 53 |
+
if not self.model:
|
| 54 |
+
raise Exception("All embedding models failed to load")
|
| 55 |
+
|
| 56 |
except ImportError:
|
| 57 |
+
print("❌ sentence-transformers not available. Please install it or provide OpenAI API key.")
|
| 58 |
+
raise ImportError("sentence-transformers not available")
|
| 59 |
|
| 60 |
def embed_documents(self, texts):
|
| 61 |
if not self.model:
|
| 62 |
+
raise Exception("No embedding model available")
|
| 63 |
+
try:
|
| 64 |
+
return self.model.encode(texts, convert_to_numpy=True).tolist()
|
| 65 |
+
except Exception as e:
|
| 66 |
+
print(f"Error encoding documents: {e}")
|
| 67 |
+
raise
|
| 68 |
|
| 69 |
def embed_query(self, text):
|
| 70 |
if not self.model:
|
| 71 |
+
raise Exception("No embedding model available")
|
| 72 |
+
try:
|
| 73 |
+
return self.model.encode([text], convert_to_numpy=True)[0].tolist()
|
| 74 |
+
except Exception as e:
|
| 75 |
+
print(f"Error encoding query: {e}")
|
| 76 |
+
raise
|
| 77 |
|
| 78 |
class SEALionLLM:
|
| 79 |
"""Custom LLM class for SEA-LION models"""
|
|
|
|
| 197 |
return response_text
|
| 198 |
|
| 199 |
except Exception as e:
|
| 200 |
+
print(f"Error with SEA-LION model: {str(e)}")
|
| 201 |
return f"I apologize, but I encountered an error processing your query. Please try rephrasing your question. Error: {str(e)}"
|
| 202 |
|
| 203 |
def extract_metadata(self, document_text: str) -> Dict[str, str]:
|
|
|
|
| 242 |
)
|
| 243 |
|
| 244 |
response_text = response.choices[0].message.content.strip()
|
| 245 |
+
print("--- DEBUG: LLM Metadata Extraction Details ---")
|
| 246 |
+
print(f"**Input Text for LLM (first 2 pages):**\n```\n{document_text[:1000]}...\n```") # Show first 1000 chars of input
|
| 247 |
+
print(f"**Raw LLM Response:**\n```json\n{response_text}\n```")
|
| 248 |
|
| 249 |
json_match = re.search(r'\{.*?\}', response_text, re.DOTALL)
|
| 250 |
if json_match:
|
| 251 |
json_str = json_match.group(0)
|
| 252 |
try:
|
| 253 |
metadata = json.loads(json_str)
|
| 254 |
+
print(f"**Parsed JSON Metadata:**\n```json\n{json.dumps(metadata, indent=2)}\n```")
|
| 255 |
required_keys = ["university_name", "country", "document_type", "language"]
|
| 256 |
if all(key in metadata for key in required_keys):
|
| 257 |
+
print("DEBUG: Successfully extracted and parsed metadata from LLM.")
|
| 258 |
return metadata
|
| 259 |
else:
|
| 260 |
+
print("DEBUG: LLM response missing required keys, attempting fallback or using defaults.")
|
| 261 |
return self._get_default_metadata()
|
| 262 |
except json.JSONDecodeError as e:
|
| 263 |
+
print(f"DEBUG: JSON Parsing Failed: {e}")
|
| 264 |
+
print(f"DEBUG: Attempting fallback text extraction from raw response.")
|
| 265 |
return self._extract_from_text_response(response_text)
|
| 266 |
else:
|
| 267 |
+
print("DEBUG: No JSON object found in LLM response.")
|
| 268 |
return self._extract_from_text_response(response_text)
|
| 269 |
|
| 270 |
except Exception as e:
|
| 271 |
+
print(f"DEBUG: Error during LLM Metadata Extraction: {str(e)}")
|
| 272 |
return self._get_default_metadata()
|
| 273 |
|
| 274 |
def _extract_from_text_response(self, response_text: str) -> Dict[str, str]:
|
|
|
|
| 289 |
elif "language" in line.lower() and ":" in line:
|
| 290 |
value = line.split(":", 1)[1].strip().strip('",')
|
| 291 |
metadata["language"] = value
|
| 292 |
+
print(f"DEBUG: Fallback text extraction result: {metadata}")
|
| 293 |
return metadata
|
| 294 |
|
| 295 |
def _get_default_metadata(self) -> Dict[str, str]:
|
|
|
|
| 330 |
self.embeddings = OpenAIEmbeddings()
|
| 331 |
self.embedding_type = "OpenAI"
|
| 332 |
except Exception as e:
|
| 333 |
+
print("Both BGE and OpenAI embeddings failed. Please check your setup.")
|
| 334 |
raise e
|
| 335 |
else:
|
| 336 |
+
print("No embedding model available. Please install sentence-transformers or provide OpenAI API key.")
|
| 337 |
raise Exception("No embedding model available")
|
| 338 |
|
| 339 |
self.text_splitter = SemanticChunker(
|
|
|
|
| 350 |
self.persist_directory = os.getenv("CHROMA_PERSIST_DIRECTORY", "./chroma_db")
|
| 351 |
os.makedirs(self.persist_directory, exist_ok=True)
|
| 352 |
|
| 353 |
+
def extract_text_from_pdf(self, pdf_file_path) -> List[str]:
|
| 354 |
+
"""Extract text from PDF file path with multiple fallback methods."""
|
| 355 |
try:
|
| 356 |
# Method 1: Try with PyPDF2 (handles most PDFs including encrypted ones with PyCryptodome)
|
| 357 |
+
with open(pdf_file_path, 'rb') as pdf_file:
|
| 358 |
+
pdf_reader = PyPDF2.PdfReader(pdf_file)
|
| 359 |
+
|
| 360 |
+
# Check if PDF is encrypted
|
| 361 |
+
if pdf_reader.is_encrypted:
|
| 362 |
+
# Try to decrypt with empty password (common for protected but not password-protected PDFs)
|
| 363 |
+
try:
|
| 364 |
+
pdf_reader.decrypt("")
|
| 365 |
+
except Exception:
|
| 366 |
+
print(f"PDF {os.path.basename(pdf_file_path)} is password-protected. Please provide an unprotected version.")
|
| 367 |
+
return [] # Return empty list for password-protected PDFs
|
| 368 |
+
|
| 369 |
+
text_per_page = []
|
| 370 |
+
for page_num, page in enumerate(pdf_reader.pages):
|
| 371 |
+
try:
|
| 372 |
+
page_text = page.extract_text()
|
| 373 |
+
text_per_page.append(page_text)
|
| 374 |
+
except Exception as e:
|
| 375 |
+
print(f"Could not extract text from page {page_num + 1} of {os.path.basename(pdf_file_path)}: {str(e)}")
|
| 376 |
+
text_per_page.append("") # Append empty string for failed pages
|
| 377 |
+
|
| 378 |
+
if any(text.strip() for text in text_per_page):
|
| 379 |
+
return text_per_page
|
| 380 |
+
else:
|
| 381 |
+
print(f"No extractable text found in {os.path.basename(pdf_file_path)}. This might be a scanned PDF or image-based document.")
|
| 382 |
+
return []
|
| 383 |
|
| 384 |
except Exception as e:
|
| 385 |
error_msg = str(e)
|
| 386 |
if "PyCryptodome" in error_msg:
|
| 387 |
+
print(f"Encryption error with {os.path.basename(pdf_file_path)}: {error_msg}")
|
| 388 |
+
print("💡 The PDF uses encryption. PyCryptodome has been installed to handle this.")
|
| 389 |
elif "password" in error_msg.lower():
|
| 390 |
+
print(f"Password-protected PDF: {os.path.basename(pdf_file_path)}")
|
| 391 |
+
print("💡 Please provide an unprotected version of this PDF.")
|
| 392 |
else:
|
| 393 |
+
print(f"Error extracting text from {os.path.basename(pdf_file_path)}: {error_msg}")
|
| 394 |
return []
|
| 395 |
|
| 396 |
+
def process_documents(self, pdf_file_paths) -> List[Document]:
|
| 397 |
+
"""Process PDF file paths and convert to documents with automatic metadata extraction."""
|
| 398 |
documents = []
|
| 399 |
processed_count = 0
|
| 400 |
failed_count = 0
|
| 401 |
|
| 402 |
+
print(f"📄 Processing {len(pdf_file_paths)} document(s) with automatic metadata detection...") # Changed to print
|
| 403 |
|
| 404 |
+
for pdf_file_path in pdf_file_paths:
|
| 405 |
+
if pdf_file_path.endswith('.pdf'):
|
| 406 |
+
filename = os.path.basename(pdf_file_path)
|
| 407 |
+
print(f"🔍 Extracting text from: **{filename}**") # Changed to print
|
| 408 |
|
| 409 |
# Extract text per page
|
| 410 |
+
text_per_page = self.extract_text_from_pdf(pdf_file_path)
|
| 411 |
+
print(f"DEBUG: Extracted {len(text_per_page)} pages from {filename}")
|
| 412 |
|
| 413 |
if text_per_page:
|
| 414 |
# Combine first two pages for metadata extraction
|
| 415 |
text_for_metadata = "\n".join(text_per_page[:2])
|
| 416 |
+
print(f"DEBUG: Text for metadata extraction (first 500 chars): {text_for_metadata[:500]}")
|
| 417 |
# Extract metadata using LLM
|
| 418 |
+
print(f"🤖 Detecting metadata for: **{filename}**") # Changed to print
|
| 419 |
extracted_metadata = self.sea_lion_llm.extract_metadata(text_for_metadata)
|
|
| 421 |
# Create metadata
|
| 422 |
metadata = {
|
| 423 |
+
"source": filename,
|
| 424 |
"university": extracted_metadata.get("university_name", "Unknown"),
|
| 425 |
"country": extracted_metadata.get("country", "Unknown"),
|
| 426 |
"document_type": extracted_metadata.get("document_type", "general_info"),
|
|
|
|
| 436 |
)
|
| 437 |
documents.append(doc)
|
| 438 |
processed_count += 1
|
| 439 |
+
print(f"✅ Successfully processed: **{filename}** ({len(doc.page_content)} characters)") # Changed to print
|
| 440 |
else:
|
| 441 |
failed_count += 1
|
| 442 |
+
print(f"⚠️ Could not extract text from **{filename}**") # Changed to print
|
| 443 |
else:
|
| 444 |
failed_count += 1
|
| 445 |
+
filename = os.path.basename(pdf_file_path)
|
| 446 |
+
print(f"❌ Unsupported file type for {filename} (expected .pdf)") # Changed to print
|
| 447 |
|
| 448 |
# Summary
|
| 449 |
if processed_count > 0:
|
| 450 |
+
print(f"🎉 Successfully processed **{processed_count}** document(s)") # Changed to print
|
| 451 |
if failed_count > 0:
|
| 452 |
+
print(f"⚠️ Failed to process **{failed_count}** document(s)") # Changed to print
|
| 453 |
|
| 454 |
return documents
|
| 455 |
|
| 456 |
def create_vector_store(self, documents: List[Document]) -> Chroma:
|
| 457 |
"""Create and persist vector store from documents."""
|
| 458 |
if not documents:
|
| 459 |
+
print("No documents to process") # Changed to print
|
| 460 |
return None
|
| 461 |
|
| 462 |
# Split documents into chunks
|
|
|
|
| 480 |
)
|
| 481 |
return vectorstore
|
| 482 |
except Exception as e:
|
| 483 |
+
print(f"Could not load existing vector store: {str(e)}") # Changed to print
|
| 484 |
return None
|
| 485 |
|
| 486 |
class RAGSystem:
|
|
|
|
| 507 |
)
|
| 508 |
return vectorstore
|
| 509 |
except Exception as e:
|
| 510 |
+
print(f"Error loading vector store: {str(e)}")
|
| 511 |
return None
|
| 512 |
|
| 513 |
def query(self, question: str, language: str = "English") -> Dict[str, Any]:
|
|
|
|
| 559 |
}
|
| 560 |
|
| 561 |
except Exception as e:
|
| 562 |
+
print(f"Error querying system: {str(e)}")
|
| 563 |
return {
|
| 564 |
"answer": f"Error processing your question: {str(e)}",
|
| 565 |
"source_documents": [],
|
|
|
|
| 597 |
json.dump(save_data, f, indent=2, ensure_ascii=False)
|
| 598 |
return True
|
| 599 |
except Exception as e:
|
| 600 |
+
print(f"Error saving query result: {str(e)}")
|
| 601 |
return False
|
| 602 |
return False
|
| 603 |
|
|
|
|
| 610 |
with open(result_file, 'r', encoding='utf-8') as f:
|
| 611 |
return json.load(f)
|
| 612 |
except Exception as e:
|
| 613 |
+
print(f"Error loading shared query: {str(e)}")
|
| 614 |
|
| 615 |
return None
|
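
For reference, the per-page PDF extraction flow introduced above in utils/rag_system.py can be exercised on its own. The sketch below is illustrative only: it assumes PyPDF2 3.x (with PyCryptodome installed for AES-encrypted files), and the standalone helper name `extract_pages` is not part of the repository.

```python
# Minimal sketch of the path-based extraction pattern shown in the diff above.
# Assumes PyPDF2 >= 3.0; PyCryptodome is needed for AES-encrypted PDFs.
import os
from typing import List

import PyPDF2


def extract_pages(pdf_file_path: str) -> List[str]:
    """Return the text of each page, or an empty list if the PDF is unreadable."""
    try:
        with open(pdf_file_path, "rb") as pdf_file:
            reader = PyPDF2.PdfReader(pdf_file)
            if reader.is_encrypted:
                try:
                    # Many "protected" PDFs accept an empty password.
                    reader.decrypt("")
                except Exception:
                    print(f"{os.path.basename(pdf_file_path)} is password-protected.")
                    return []
            return [(page.extract_text() or "") for page in reader.pages]
    except Exception as exc:
        print(f"Error extracting {os.path.basename(pdf_file_path)}: {exc}")
        return []


# Example usage (hypothetical file name):
# pages = extract_pages("admissions_brochure.pdf")
```
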
utils/translations.py
CHANGED
|
@@ -110,6 +110,40 @@ translations = {
|
|
| 110 |
"example_simple_2": "What is the difference between bachelor and master degree?",
|
| 111 |
"example_simple_3": "How to apply for student visa?",
|
| 112 |
"example_simple_4": "What documents are needed for university application?",
|
| 113 |
},
|
| 114 |
|
| 115 |
"中文": {
|
|
@@ -223,6 +257,40 @@ translations = {
|
|
| 223 |
"example_simple_2": "学士学位和硕士学位有什么区别?",
|
| 224 |
"example_simple_3": "如何申请学生签证?",
|
| 225 |
"example_simple_4": "大学申请需要哪些文件?",
|
| 226 |
},
|
| 227 |
|
| 228 |
"Malay": {
|
|
|
|
| 110 |
"example_simple_2": "What is the difference between bachelor and master degree?",
|
| 111 |
"example_simple_3": "How to apply for student visa?",
|
| 112 |
"example_simple_4": "What documents are needed for university application?",
|
| 113 |
+
|
| 114 |
+
# System messages
|
| 115 |
+
"systems_initialized": "✅ Systems initialized successfully!",
|
| 116 |
+
"can_upload_documents": "You can now upload documents.",
|
| 117 |
+
"initialization_error": "Error initializing systems",
|
| 118 |
+
"installation_help": """**Possible solutions:**
|
| 119 |
+
1. Install sentence-transformers: `pip install sentence-transformers`
|
| 120 |
+
2. Or provide OpenAI API key in environment variables
|
| 121 |
+
3. Check that PyTorch is properly installed
|
| 122 |
+
|
| 123 |
+
**For deployment:**
|
| 124 |
+
- Ensure requirements.txt includes: sentence-transformers, torch, transformers""",
|
| 125 |
+
"please_initialize_first": "Please initialize systems first using the 'Initialize System' tab!",
|
| 126 |
+
"please_upload_pdf": "Please upload at least one PDF file.",
|
| 127 |
+
"upload_pdf_only": "Please upload PDF files only.",
|
| 128 |
+
"successfully_processed_docs": "Successfully processed",
|
| 129 |
+
"failed_create_vectorstore": "Failed to create vector store from documents.",
|
| 130 |
+
"no_docs_successfully_processed": "No documents were successfully processed. Please check if your PDFs are readable.",
|
| 131 |
+
"error_processing_docs": "Error processing documents",
|
| 132 |
+
"check_console": "Please check the console for more details.",
|
| 133 |
+
"please_upload_process_first": "Please upload and process documents first using the 'Upload Documents' tab!",
|
| 134 |
+
"please_enter_question": "Please enter a question.",
|
| 135 |
+
"processing_query": "Processing query",
|
| 136 |
+
"model_used": "Model Used",
|
| 137 |
+
"answer": "Answer",
|
| 138 |
+
"sources": "Sources",
|
| 139 |
+
"no_sources_found": "No specific sources found. This might be a general response.",
|
| 140 |
+
"error_querying_docs": "Error querying documents",
|
| 141 |
+
"ready_for_queries": "Ready for queries! Go to the 'Search & Query' tab to start asking questions.",
|
| 142 |
+
|
| 143 |
+
# Interface elements
|
| 144 |
+
"initialize_system": "Initialize System",
|
| 145 |
+
"initialize_systems": "Initialize Systems",
|
| 146 |
+
"initialization_status": "Initialization Status",
|
| 147 |
},
|
| 148 |
|
| 149 |
"中文": {
|
|
|
|
| 257 |
"example_simple_2": "学士学位和硕士学位有什么区别?",
|
| 258 |
"example_simple_3": "如何申请学生签证?",
|
| 259 |
"example_simple_4": "大学申请需要哪些文件?",
|
| 260 |
+
|
| 261 |
+
# System messages
|
| 262 |
+
"systems_initialized": "✅ 系统初始化成功!",
|
| 263 |
+
"can_upload_documents": "您现在可以上传文档。",
|
| 264 |
+
"initialization_error": "系统初始化错误",
|
| 265 |
+
"installation_help": """**可能的解决方案:**
|
| 266 |
+
1. 安装 sentence-transformers: `pip install sentence-transformers`
|
| 267 |
+
2. 或在环境变量中提供 OpenAI API 密钥
|
| 268 |
+
3. 检查 PyTorch 是否正确安装
|
| 269 |
+
|
| 270 |
+
**部署时:**
|
| 271 |
+
- 确保 requirements.txt 包含:sentence-transformers, torch, transformers""",
|
| 272 |
+
"please_initialize_first": "请先使用'初始化系统'选项卡初始化系统!",
|
| 273 |
+
"please_upload_pdf": "请至少上传一个PDF文件。",
|
| 274 |
+
"upload_pdf_only": "请仅上传PDF文件。",
|
| 275 |
+
"successfully_processed_docs": "成功处理",
|
| 276 |
+
"failed_create_vectorstore": "创建向量存储失败。",
|
| 277 |
+
"no_docs_successfully_processed": "没有成功处理任何文档。请检查您的PDF是否可读。",
|
| 278 |
+
"error_processing_docs": "处理文档时出错",
|
| 279 |
+
"check_console": "请查看控制台获取更多详细信息。",
|
| 280 |
+
"please_upload_process_first": "请先使用'上传文档'选项卡上传和处理文档!",
|
| 281 |
+
"please_enter_question": "请输入问题。",
|
| 282 |
+
"processing_query": "正在处理查询",
|
| 283 |
+
"model_used": "使用的模型",
|
| 284 |
+
"answer": "答案",
|
| 285 |
+
"sources": "来源",
|
| 286 |
+
"no_sources_found": "未找到特定来源。这可能是一般性回答。",
|
| 287 |
+
"error_querying_docs": "查询文档时出错",
|
| 288 |
+
"ready_for_queries": "准备查询!前往'搜索与查询'选项卡开始提问。",
|
| 289 |
+
|
| 290 |
+
# Interface elements
|
| 291 |
+
"initialize_system": "初始化系统",
|
| 292 |
+
"initialize_systems": "初始化系统",
|
| 293 |
+
"initialization_status": "初始化状态",
|
| 294 |
},
|
| 295 |
|
| 296 |
"Malay": {
|
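
The new keys added above follow the existing pattern of `utils/translations.py`: a top-level `translations` dict keyed by display language, then by message id. A lookup helper along these lines is all the Gradio tabs need; `get_text` and the English fallback below are illustrative assumptions, not code from this commit.

```python
# Illustrative lookup against the translations dict shown above.
# get_text() and the English fallback are assumptions, not part of the commit.
from utils.translations import translations


def get_text(key: str, language: str = "English") -> str:
    """Return the UI string for `key`, falling back to English, then to the key itself."""
    lang_map = translations.get(language, translations["English"])
    return lang_map.get(key, translations["English"].get(key, key))


print(get_text("systems_initialized", "中文"))  # -> "✅ 系统初始化成功!"
```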