Commit d383abe: Final Project

Files changed:

- DEPLOYMENT.md +74 -0
- README.md +340 -0
- app.py +465 -0
- rag_notebook.ipynb +0 -0
- requirements.txt +13 -0
DEPLOYMENT.md
ADDED
@@ -0,0 +1,74 @@
# 🚀 Hugging Face Spaces Deployment Guide

## Quick Deployment Steps

### 1. Create a New Space

- Go to [Hugging Face Spaces](https://huggingface.co/spaces)
- Click "Create new Space"
- Choose "Streamlit" as the SDK
- Set the visibility (Public/Private)

### 2. Upload Files

Upload these files to your Space:

- `app.py` (the main Streamlit application)
- `requirements.txt` (dependencies)
- `README.md` (documentation)

### 3. Set Environment Variables

- Go to Settings → Secrets
- Add `GOOGLE_API_KEY` with your Gemini API key
- The app picks this environment variable up automatically, as sketched below
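For reference, here is a minimal sketch of how the app can read the key; the environment lookup mirrors `app.py`, while the sidebar fallback is one possible pattern:

```python
import os

import streamlit as st

# Space secrets are exposed to the app as environment variables;
# fall back to manual entry in the sidebar if the secret is missing.
api_key = os.environ.get("GOOGLE_API_KEY") or st.sidebar.text_input(
    "Google Gemini API Key", type="password"
)
```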
### 4. Deploy

- Push your code to the Space
- The app builds and deploys automatically
- Wait for the build to complete (usually 2-3 minutes)

### 5. Test Your App

- Open your Space URL
- Enter your Gemini API key in the sidebar
- Click "Initialize RAG System"
- Start chatting!

## Important Notes

- **API Key**: Make sure `GOOGLE_API_KEY` is set in the Space secrets
- **Memory**: The app creates its Chroma database in memory, so the index is rebuilt on restart
- **Performance**: The first initialization may take a few minutes
- **Limits**: Hugging Face Spaces have resource limits, so keep the knowledge base small

## Troubleshooting

### Build Fails

- Check `requirements.txt` for correct package versions
- Ensure all imports are available

### Runtime Errors

- Verify the API key is set correctly
- Check the logs in the Space interface
- Ensure all dependencies are installed (a quick check is sketched below)
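A quick way to verify both at once; a minimal sketch that assumes the packages from `requirements.txt` are installed:

```python
import os

# Fail fast if the key is missing, then confirm the core imports resolve.
assert os.environ.get("GOOGLE_API_KEY"), "GOOGLE_API_KEY is not set"

import chromadb                # noqa: F401
import langchain_google_genai  # noqa: F401
import sentence_transformers   # noqa: F401
import streamlit               # noqa: F401

print("Environment looks OK")
```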
### Performance Issues

- Reduce the number of documents processed
- Use a smaller embedding model
- Optimize the RAG pipeline

## Customization

You can customize the app by:

- Modifying the UI in `app.py`
- Changing the embedding model
- Adjusting the RAG pipeline parameters
- Adding new features

Happy deploying! 🎉
README.md
ADDED
@@ -0,0 +1,340 @@
# 🤖 InsightRAG Chatbot: ML/AI Knowledge Assistant

[Open in Colab](https://colab.research.google.com/drive/1u4hwe39XZZlQbtdQDecR4MStnQlgBj2h?usp=sharing)

A fully functional Retrieval-Augmented Generation (RAG) chatbot that provides comprehensive information about machine learning, deep learning, AI, and related topics. Built with modern AI technologies and ready for deployment.

## 🎯 Project Purpose

This RAG chatbot serves as an intelligent knowledge assistant specializing in machine learning, deep learning, and artificial intelligence topics. It leverages a retrieval-augmented generation pipeline to provide accurate, contextual answers by combining:

- **Knowledge Retrieval**: Accessing relevant information from a curated ML/AI knowledge base
- **Contextual Generation**: Using Google Gemini 2.5 Flash to generate comprehensive responses
- **Interactive Learning**: Enabling users to explore complex AI concepts through natural conversation

The primary goal is to make AI and machine learning knowledge accessible through an intuitive, conversational interface that can handle both basic concepts and advanced technical questions.

## 📚 Dataset Information

### Dataset Source

- **Primary Dataset**: The Pile (EleutherAI/the_pile) from Hugging Face
- **Access Method**: Hugging Face Datasets API, streamed (no local downloads required)
- **Content Type**: Text-only data (no tables, images, or PDFs)

### Dataset Structure

The dataset contains diverse text content filtered specifically for ML/AI relevance:

- **Content Filtering**: Text samples are filtered using ML/AI keywords, including:
  - Machine learning, deep learning, neural networks
  - Artificial intelligence, algorithms, models
  - Training, data, features, classification
  - Regression, clustering, optimization, gradient, tensor
- **Text Processing** (see the sketch after this list):
  - Content is cleaned and preprocessed
  - Text is chunked into manageable pieces (500 words with a 50-word overlap)
  - Only substantial chunks (100-2000 characters) are retained
  - Text is embedded using sentence transformers for vector search
- **Storage**: Processed text chunks are stored in a Chroma vector database for efficient similarity search
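A minimal sketch of the chunking step, with the parameters and thresholds taken from the description above:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping word-based chunks."""
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for i in range(0, len(words), step):
        chunk = " ".join(words[i : i + chunk_size])
        if 100 <= len(chunk) <= 2000:  # keep only substantial chunks
            chunks.append(chunk)
    return chunks
```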
### Usage in RAG Pipeline

The dataset serves as the knowledge base for the RAG system, enabling:

- Semantic search for relevant context
- Contextual answer generation
- Comprehensive coverage of ML/AI topics

## 🔧 Methods Used

### RAG Pipeline Architecture

The chatbot implements a Retrieval-Augmented Generation pipeline:

#### 1. **Data Processing Pipeline**

```
Raw Text → Filtering → Chunking → Embedding → Vector Storage
```

- **Text Filtering**: ML/AI keyword-based content selection
- **Chunking**: Word-based text segmentation with overlap
- **Embedding**: Sentence-transformer-based vectorization
- **Storage**: Chroma vector database for efficient retrieval

#### 2. **Retrieval System**

- **Embedding Model**: `all-MiniLM-L6-v2` (sentence-transformers)
- **Vector Database**: Chroma with persistent storage
- **Similarity Search**: Cosine similarity for document retrieval (illustrated below)
- **Context Assembly**: Top-k relevant documents combined
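As a standalone illustration of the similarity scoring (not the app's exact code path, which queries Chroma directly):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
query_emb = model.encode("What is overfitting?", convert_to_tensor=True)
doc_emb = model.encode(
    "Overfitting occurs when a model memorizes the training data.",
    convert_to_tensor=True,
)
print(util.cos_sim(query_emb, doc_emb).item())  # closer to 1.0 means more similar
```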
#### 3. **Generation System**

- **Language Model**: Google Gemini 2.5 Flash
- **Temperature**: 0.7 for balanced creativity and accuracy
- **Context Integration**: Retrieved documents are passed to the model as context
- **Response Formatting**: Markdown support for rich text
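A minimal sketch of the generation step, mirroring the configuration used in `app.py` (where the model id is `gemini-2.0-flash-exp`):

```python
from langchain.schema import HumanMessage, SystemMessage
from langchain_google_genai import ChatGoogleGenerativeAI

# Requires GOOGLE_API_KEY in the environment.
llm = ChatGoogleGenerativeAI(
    model="gemini-2.0-flash-exp",
    temperature=0.7,
    max_output_tokens=1024,
)
messages = [
    SystemMessage(content="You are an ML/AI assistant. Answer from the provided context."),
    HumanMessage(content="Context:\n<retrieved chunks>\n\nQuestion: What is gradient descent?"),
]
print(llm.invoke(messages).content)
```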
#### 4. **Technical Stack**

- **RAG Framework**: LangChain for pipeline orchestration
- **Vector Database**: Chroma for embedding storage and retrieval
- **Embeddings**: Sentence Transformers for text vectorization
- **LLM**: Google Gemini 2.5 Flash for response generation
- **Interface**: Streamlit for the web-based chat interface
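To make the stack concrete, here is a self-contained toy version of the retrieval half, assuming an in-memory Chroma client with its default embedding function:

```python
import chromadb

client = chromadb.Client()  # in-memory; uses Chroma's default embedding function
col = client.create_collection("demo")
col.add(
    documents=["Overfitting is when a model memorizes the training data."],
    ids=["d1"],
)
hits = col.query(query_texts=["What is overfitting?"], n_results=1)
print(hits["documents"][0][0])
```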
## 📊 Results Summary

The RAG chatbot provides comprehensive answers across multiple ML/AI domains:

### **Answer Quality**

- **Contextual Accuracy**: Responses are grounded in retrieved knowledge
- **Comprehensive Coverage**: Handles both basic and advanced topics
- **Structured Output**: Well-formatted responses with examples
- **Technical Depth**: Can explain complex algorithms and concepts

### **Performance Metrics**

- **Response Time**: Fast retrieval and generation (under 5 seconds)
- **Relevance**: High-quality context retrieval from the knowledge base
- **Coverage**: Extensive ML/AI topic coverage
- **Usability**: Intuitive conversational interface

### **Capabilities Demonstrated**

- Explains fundamental ML/AI concepts
- Provides algorithm explanations with examples
- Offers practical implementation guidance
- Covers current trends and advanced topics
- Handles both theoretical and applied questions

## 💡 Example Questions

The chatbot can answer a comprehensive range of questions across multiple categories:

### **Basic Concepts**

- What is the difference between AI, machine learning, and deep learning?
- Can you explain supervised, unsupervised, and reinforcement learning?
- What are features and labels in a dataset?
- Explain overfitting vs. underfitting.

### **Algorithms & Models**

- How does a neural network learn?
- What is gradient descent and how does it work?
- Explain decision trees and random forests.
- What are convolutional neural networks (CNNs) used for?
- How does a transformer model like GPT work?

### **Practical Applications**

- How do I preprocess data for machine learning?
- How can I use AI for image recognition?
- Give an example of AI in healthcare.
- What are common pitfalls when training deep learning models?

### **Technical Details**

- What is backpropagation?
- How does regularization prevent overfitting?
- Explain embedding vectors and similarity search.
- What are activation functions and why are they important?

### **Performance & Optimization**

- How do I improve model accuracy?
- What is cross-validation and why is it used?
- Explain hyperparameter tuning.
- What is transfer learning?

### **Trends & Advanced**

- Explain reinforcement learning with examples.
- What are large language models and how do they work?
- How is generative AI different from predictive AI?
- What is the future of AI in finance and medicine?

## 🚀 Quick Start

### Option 1: Google Colab (Recommended)

[Open in Colab](https://colab.research.google.com/github/your-username/your-repo/blob/main/rag_notebook.ipynb)

1. **Open the notebook**: Click the Colab link above or upload `rag_notebook.ipynb` to Google Colab
2. **Set up the API key**: Add your Gemini API key to Colab secrets
3. **Run all cells**: Execute the notebook to build the RAG system
4. **Test the system**: Try the sample questions provided

### Option 2: Local Development

1. **Create a virtual environment**:

   ```bash
   python -m venv rag_chatbot_env
   source rag_chatbot_env/bin/activate  # On Windows: rag_chatbot_env\Scripts\activate
   ```

2. **Install dependencies**:

   ```bash
   pip install -r requirements.txt
   ```

3. **Set up the environment**:

   ```bash
   export GOOGLE_API_KEY="your_gemini_api_key_here"
   ```

4. **Run the Streamlit app**:

   ```bash
   streamlit run app.py
   ```

5. **Access the interface**: Open `http://localhost:8501` in your browser

## 🔑 API Key Setup

### Google Colab

1. Go to the key icon (🔑) in the left sidebar
2. Add a new secret with key `GEMINI_API_KEY` and your API key as the value
3. Restart the runtime and run the notebook (the notebook reads the secret as sketched below)
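For reference, a minimal sketch of how a notebook cell reads that secret (key name from step 2):

```python
import os

from google.colab import userdata  # available inside Colab runtimes

os.environ["GOOGLE_API_KEY"] = userdata.get("GEMINI_API_KEY")
```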
### Local/Hugging Face Spaces

1. Get your API key from [Google AI Studio](https://makersuite.google.com/app/apikey)
2. Set it as the `GOOGLE_API_KEY` environment variable
3. Or enter it directly in the Streamlit interface

## 🏗️ Solution Architecture

### Problem Statement

Traditional chatbots often provide generic responses without access to specific domain knowledge. This project solves the challenge of creating an AI assistant that can provide accurate, contextual information about machine learning and AI topics.

### Technology Stack

- **Frontend**: Streamlit for the web interface
- **Backend**: Python with the LangChain framework
- **Vector Database**: Chroma for embedding storage
- **Embeddings**: Sentence Transformers
- **LLM**: Google Gemini 2.5 Flash
- **Data Source**: The Pile dataset via Hugging Face

### Architecture Benefits

- **Scalable**: Can handle multiple users simultaneously
- **Accurate**: Grounded responses using retrieved context
- **Flexible**: Easy to extend with additional knowledge sources
- **Efficient**: Fast retrieval and generation pipeline

## 🌐 Web Interface & Deployment

### Local Testing

1. Run `streamlit run app.py`
2. Open `http://localhost:8501`
3. Enter your Gemini API key
4. Initialize the RAG system
5. Start chatting!

### Web Deployment

[Deploy to Hugging Face Spaces](https://huggingface.co/spaces) - _Add your deployment link here_

### Interface Features

- **Chat Interface**: Clean, responsive design
- **Real-time Responses**: Instant AI-generated answers
- **Context Display**: Shows retrieved documents and similarity scores
- **Sample Questions**: Quick-start buttons for common queries
- **System Status**: Real-time monitoring of RAG system health

## 📁 Project Structure

```
Chatbot_Project/
├── rag_notebook.ipynb   # Complete Colab notebook with the RAG pipeline
├── app.py               # Streamlit web application
├── requirements.txt     # Python dependencies
├── README.md            # This documentation
└── chroma_db/           # Vector database (created during execution)
```

## 🔧 Configuration Options

The system can be customized through various parameters:

```python
# RAG Pipeline Configuration
EMBEDDING_MODEL = 'all-MiniLM-L6-v2'
GEMINI_MODEL = 'gemini-2.0-flash-exp'
TEMPERATURE = 0.7
MAX_OUTPUT_TOKENS = 1024
N_RETRIEVAL_RESULTS = 5
CHUNK_SIZE = 500
CHUNK_OVERLAP = 50
```

## 🐛 Troubleshooting

### Common Issues

1. **API Key Error**: Ensure your Gemini API key is set correctly
2. **Memory Issues**: Reduce the number of documents processed in Colab
3. **Chroma Connection**: Check that the vector database directory exists
4. **Model Loading**: Ensure all dependencies are installed correctly

### Solutions

- **Restart Runtime**: In Colab, use Runtime → Restart Runtime
- **Check Logs**: Look for error messages in the console
- **Verify Dependencies**: Run `pip list` to check installed packages
- **Test Components**: Use the test functions in the notebook

## 🤝 Contributing

Contributions are welcome! Please feel free to:

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Submit a pull request

## 📄 License

This project is open source and available under the MIT License.

## 🙏 Acknowledgments

- **EleutherAI** for The Pile dataset
- **Google** for the Gemini API
- **LangChain** for the RAG framework
- **Chroma** for the vector database
- **Streamlit** for the web interface
- **Hugging Face** for dataset access and the deployment platform

## 📞 Support

If you encounter any issues or have questions:

1. Check the troubleshooting section above
2. Review the notebook comments and documentation
3. Open an issue in the repository
4. Contact the development team

---

**🚀 Ready to explore the world of AI with our RAG chatbot!**

_Built with ❤️ using modern AI technologies_
app.py
ADDED
@@ -0,0 +1,465 @@
import os
import re

import chromadb
import streamlit as st
from chromadb.config import Settings
from datasets import load_dataset
from langchain.schema import HumanMessage, SystemMessage
from langchain_google_genai import ChatGoogleGenerativeAI
from sentence_transformers import SentenceTransformer
from tqdm import tqdm

# Page configuration
st.set_page_config(
    page_title="🤖 RAG Chatbot: ML/AI Assistant",
    page_icon="🤖",
    layout="wide",
    initial_sidebar_state="expanded"
)

# Custom CSS for better styling
st.markdown("""
<style>
.main-header {
    text-align: center;
    padding: 2rem 0;
    background: linear-gradient(90deg, #667eea 0%, #764ba2 100%);
    color: white;
    border-radius: 10px;
    margin-bottom: 2rem;
}
.chat-message {
    padding: 1rem;
    border-radius: 10px;
    margin: 1rem 0;
    border-left: 4px solid #667eea;
}
.user-message {
    background-color: #f0f2f6;
    border-left-color: #667eea;
}
.bot-message {
    background-color: #e8f4fd;
    border-left-color: #764ba2;
}
.sidebar-content {
    padding: 1rem;
}
.metric-card {
    background-color: #f8f9fa;
    padding: 1rem;
    border-radius: 8px;
    border: 1px solid #e9ecef;
    margin: 0.5rem 0;
}
</style>
""", unsafe_allow_html=True)

# Initialize session state
if 'messages' not in st.session_state:
    st.session_state.messages = []
if 'rag_system' not in st.session_state:
    st.session_state.rag_system = None
if 'initialized' not in st.session_state:
    st.session_state.initialized = False


# RAG System Functions (from the notebook)
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping word-based chunks."""
    words = text.split()
    chunks = []

    for i in range(0, len(words), chunk_size - overlap):
        chunk = ' '.join(words[i:i + chunk_size])
        if len(chunk.strip()) > 50:  # Only keep substantial chunks
            chunks.append(chunk)

    return chunks


def load_and_process_dataset():
    """Load The Pile dataset and filter it for ML/AI content."""
    print("📚 Loading The Pile dataset...")

    try:
        # Stream the dataset so nothing is downloaded in full
        dataset = load_dataset("EleutherAI/the_pile", split="train", streaming=True)

        # Collect up to 1000 matching samples for demonstration
        texts = []
        ml_keywords = ['machine learning', 'deep learning', 'neural network', 'artificial intelligence',
                       'algorithm', 'model', 'training', 'data', 'feature', 'classification',
                       'regression', 'clustering', 'optimization', 'gradient', 'tensor']

        print("🔍 Filtering ML/AI related content...")
        count = 0
        for sample in tqdm(dataset, desc="Processing samples"):
            if count >= 1000:  # Limit to 1000 samples for the demo
                break

            text = sample['text']
            # Keep only texts that mention at least one ML/AI keyword
            if any(keyword in text.lower() for keyword in ml_keywords):
                # Clean and preprocess text
                text = re.sub(r'\s+', ' ', text)  # Collapse extra whitespace
                text = text.strip()

                # Keep only texts of reasonable length (not too short or too long)
                if 100 <= len(text) <= 2000:
                    texts.append(text)
                    count += 1

        print(f"✅ Loaded {len(texts)} ML/AI related text samples")
        return texts

    except Exception as e:
        print(f"❌ Error loading dataset: {e}")
        print("🔄 Using fallback sample data...")

        # Fallback sample data if The Pile is not accessible
        texts = [
            "Machine learning is a subset of artificial intelligence that focuses on algorithms that can learn from data. Deep learning uses neural networks with multiple layers to process complex patterns in data.",
            "Neural networks are computing systems inspired by biological neural networks. They consist of interconnected nodes that process information using a connectionist approach.",
            "Supervised learning uses labeled training data to learn a mapping from inputs to outputs. Common algorithms include linear regression, decision trees, and support vector machines.",
            "Unsupervised learning finds hidden patterns in data without labeled examples. Clustering algorithms like K-means group similar data points together.",
            "Natural language processing combines computational linguistics with machine learning to help computers understand human language. It includes tasks like text classification and sentiment analysis.",
            "Computer vision enables machines to interpret and understand visual information from the world. It uses deep learning models like convolutional neural networks.",
            "Reinforcement learning is a type of machine learning where agents learn to make decisions by interacting with an environment and receiving rewards or penalties.",
            "Feature engineering is the process of selecting and transforming raw data into features that can be used by machine learning algorithms. Good features can significantly improve model performance.",
            "Cross-validation is a technique used to assess how well a machine learning model generalizes to new data. It involves splitting data into training and validation sets multiple times.",
            "Overfitting occurs when a model learns the training data too well and performs poorly on new data. Regularization techniques help prevent overfitting."
        ]
        print(f"✅ Using {len(texts)} sample texts")
        return texts


def initialize_rag_system(api_key):
    """Initialize the RAG system with all components."""
    try:
        # Set the API key for the Gemini client
        os.environ['GOOGLE_API_KEY'] = api_key

        # Load the embedding model (kept for reference; the Chroma queries
        # below use the collection's default embedding function)
        embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

        # Initialize the Chroma client (effectively in-memory in this
        # configuration; see DEPLOYMENT.md)
        chroma_client = chromadb.Client(Settings(
            persist_directory="./chroma_db",
            anonymized_telemetry=False
        ))

        collection_name = "ml_ai_knowledge"
        try:
            collection = chroma_client.get_collection(collection_name)
            print(f"✅ Found existing collection: {collection_name}")
        except Exception:
            collection = chroma_client.create_collection(
                name=collection_name,
                metadata={"description": "ML/AI knowledge base from The Pile dataset"}
            )
            print(f"✅ Created new collection: {collection_name}")

        # Check whether the collection already has data
        existing_count = collection.count()
        print(f"📊 Current documents in collection: {existing_count}")

        if existing_count == 0:
            print("🔄 Adding new documents to collection...")

            # Load and process the dataset
            texts = load_and_process_dataset()

            all_chunks = []
            chunk_ids = []
            chunk_metadatas = []

            for i, text in enumerate(tqdm(texts, desc="Processing texts")):
                chunks = chunk_text(text)

                for j, chunk in enumerate(chunks):
                    chunk_id = f"doc_{i}_chunk_{j}"
                    metadata = {
                        "source": f"the_pile_doc_{i}",
                        "chunk_index": j,
                        "total_chunks": len(chunks),
                        "text_length": len(chunk)
                    }

                    all_chunks.append(chunk)
                    chunk_ids.append(chunk_id)
                    chunk_metadatas.append(metadata)

            print(f"📊 Created {len(all_chunks)} text chunks")

            # Add documents to Chroma in batches to avoid memory issues
            batch_size = 100
            for i in tqdm(range(0, len(all_chunks), batch_size), desc="Adding to Chroma"):
                batch_chunks = all_chunks[i:i + batch_size]
                batch_ids = chunk_ids[i:i + batch_size]
                batch_metadatas = chunk_metadatas[i:i + batch_size]

                collection.add(
                    documents=batch_chunks,
                    ids=batch_ids,
                    metadatas=batch_metadatas
                )

            print("✅ All documents added to Chroma!")
        else:
            print("✅ Collection already contains data, skipping addition")

        # Initialize Gemini
        llm = ChatGoogleGenerativeAI(
            model="gemini-2.0-flash-exp",
            temperature=0.7,
            max_output_tokens=1024,
            convert_system_message_to_human=True
        )

        return {
            'embedding_model': embedding_model,
            'chroma_client': chroma_client,
            'collection': collection,
            'llm': llm
        }
    except Exception as e:
        st.error(f"Error initializing RAG system: {e}")
        return None


def retrieve_relevant_docs(query, collection, n_results=5):
    """Retrieve relevant documents from Chroma."""
    try:
        # Chroma embeds the query with the collection's embedding function
        results = collection.query(
            query_texts=[query],
            n_results=n_results
        )

        # Extract documents, metadata, and distances for the single query
        documents = results['documents'][0]
        metadatas = results['metadatas'][0]
        distances = results['distances'][0]

        return documents, metadatas, distances
    except Exception as e:
        print(f"Error retrieving documents: {e}")
        return [], [], []


def create_context(documents):
    """Create a context string from retrieved documents."""
    context = "\n\n".join(documents)
    return context


def generate_answer(query, context, llm):
    """Generate an answer using Gemini with the retrieved context."""
    system_prompt = """You are an AI assistant specialized in machine learning, deep learning, and artificial intelligence.
Use the provided context to answer questions accurately and comprehensively. If the context doesn't contain enough
information, you can supplement with your general knowledge, but always prioritize the provided context.

Provide clear, well-structured answers with examples when appropriate."""

    user_prompt = f"""Context:
{context}

Question: {query}

Please provide a comprehensive answer based on the context above."""

    try:
        messages = [
            SystemMessage(content=system_prompt),
            HumanMessage(content=user_prompt)
        ]

        response = llm.invoke(messages)
        return response.content
    except Exception as e:
        return f"Error generating answer: {e}"


def rag_pipeline(query, rag_system, n_results=5):
    """Complete RAG pipeline: retrieve context, then generate an answer."""
    try:
        collection = rag_system['collection']
        llm = rag_system['llm']

        # Retrieve relevant documents
        documents, metadatas, distances = retrieve_relevant_docs(query, collection, n_results)

        if not documents:
            # Return the same (answer, documents, distances) shape as the success path
            return ("I couldn't find relevant information for your query. "
                    "Please try asking about machine learning, deep learning, or AI topics."), [], []

        # Create context
        context = create_context(documents)

        # Generate answer
        answer = generate_answer(query, context, llm)
        return answer, documents, distances

    except Exception as e:
        return f"Error generating response: {e}", [], []


# Header
st.markdown("""
<div class="main-header">
    <h1>🤖 RAG Chatbot: ML/AI Assistant</h1>
    <p>Powered by Google Gemini 2.5 Flash + LangChain + Chroma</p>
</div>
""", unsafe_allow_html=True)

# Sidebar
with st.sidebar:
    st.markdown("## 🛠️ Configuration")

    # API key input
    api_key = st.text_input(
        "🔑 Google Gemini API Key",
        type="password",
        help="Get your API key from Google AI Studio"
    )

    if api_key:
        os.environ['GOOGLE_API_KEY'] = api_key

    # Initialize button
    if st.button("🚀 Initialize RAG System", disabled=not api_key):
        with st.spinner("Initializing RAG system..."):
            try:
                rag_system = initialize_rag_system(api_key)
                if rag_system:
                    st.session_state.rag_system = rag_system
                    st.session_state.initialized = True
                    st.success("✅ RAG system initialized successfully!")
                else:
                    st.error("❌ Failed to initialize system")
            except Exception as e:
                st.error(f"❌ Error initializing system: {e}")

    # System status
    st.markdown("## 📊 System Status")
    if st.session_state.initialized:
        st.success("🟢 System Ready")
        try:
            doc_count = st.session_state.rag_system['collection'].count()
            st.metric("📚 Documents", doc_count)
        except Exception:
            st.metric("📚 Documents", "Unknown")
    else:
        st.warning("🟡 System Not Initialized")

    # Sample questions
    st.markdown("## 💡 Sample Questions")
    sample_questions = [
        "What is machine learning?",
        "How do neural networks work?",
        "Explain deep learning",
        "What is overfitting?",
        "Difference between supervised and unsupervised learning"
    ]

    for question in sample_questions:
        if st.button(f"❓ {question}", key=f"sample_{question}"):
            if st.session_state.initialized:
                # Queue the question; it is answered in the main chat area below
                st.session_state.messages.append({"role": "user", "content": question})
                st.rerun()
            else:
                st.warning("Please initialize the system first!")

# Main chat interface
if not st.session_state.initialized:
    st.info("👆 Please initialize the RAG system using the sidebar to start chatting!")

    # Show project information
    st.markdown("""
## 🎯 About This Project

This RAG (Retrieval-Augmented Generation) chatbot provides information about machine learning,
deep learning, AI, and related topics using:

- **🤖 Generation Model**: Google Gemini 2.5 Flash
- **🔗 RAG Framework**: LangChain
- **🗄️ Vector Database**: Chroma
- **📚 Dataset**: The Pile (EleutherAI/the_pile) from Hugging Face
- **🌐 Interface**: Streamlit

### 🚀 How It Works

1. **Data Loading**: Text data from The Pile dataset is loaded and filtered for ML/AI content
2. **Embedding**: Text is processed and embedded using sentence transformers
3. **Storage**: Embeddings are stored in the Chroma vector database
4. **Retrieval**: Relevant context is retrieved for user queries
5. **Generation**: Gemini generates answers using the retrieved context

### 📝 Sample Questions You Can Ask

- What is machine learning?
- How do neural networks work?
- Explain deep learning
- What is overfitting in ML?
- Difference between supervised and unsupervised learning
- What is natural language processing?
- How does computer vision work?
- Explain reinforcement learning
""")

else:
    # Chat interface
    st.markdown("## 💬 Chat with the AI Assistant")

    # Display the chat history
    for message in st.session_state.messages:
        with st.chat_message(message["role"]):
            st.markdown(message["content"])

    # Chat input
    if prompt := st.chat_input("Ask me anything about ML/AI..."):
        # Add and display the new user message
        st.session_state.messages.append({"role": "user", "content": prompt})
        with st.chat_message("user"):
            st.markdown(prompt)

    # Answer the latest user message if it has not been answered yet.
    # This also covers questions queued by the sidebar sample buttons.
    if st.session_state.messages and st.session_state.messages[-1]["role"] == "user":
        query = st.session_state.messages[-1]["content"]

        with st.chat_message("assistant"):
            with st.spinner("Thinking..."):
                try:
                    # Run the RAG pipeline
                    rag_system = st.session_state.rag_system
                    response, documents, distances = rag_pipeline(query, rag_system)

                    # Display the response
                    st.markdown(response)

                    # Add the assistant message to the history
                    st.session_state.messages.append({"role": "assistant", "content": response})

                    # Show retrieval info
                    with st.expander("🔍 Retrieval Information"):
                        st.write(f"**Retrieved Documents**: {len(documents)}")
                        st.write(f"**Distances (lower = more similar)**: {[f'{d:.3f}' for d in distances]}")

                        for i, doc in enumerate(documents):
                            st.write(f"**Document {i+1}**: {doc[:200]}...")

                except Exception as e:
                    error_msg = f"❌ Error: {e}"
                    st.error(error_msg)
                    st.session_state.messages.append({"role": "assistant", "content": error_msg})

    # Clear chat button
    if st.button("🗑️ Clear Chat History"):
        st.session_state.messages = []
        st.rerun()

# Footer
st.markdown("---")
st.markdown("""
<div style="text-align: center; color: #666; padding: 1rem;">
    <p>🤖 RAG Chatbot | Powered by Google Gemini 2.5 Flash + LangChain + Chroma</p>
    <p>📚 Knowledge Base: The Pile Dataset (EleutherAI/the_pile)</p>
</div>
""", unsafe_allow_html=True)
rag_notebook.ipynb
ADDED
The diff for this file is too large to render; see the raw diff.
requirements.txt
ADDED
@@ -0,0 +1,13 @@
streamlit==1.28.1
langchain==0.1.0
langchain-community==0.0.10
langchain-google-genai==0.0.6
chromadb==0.4.18
datasets==2.14.6
transformers==4.35.2
sentence-transformers==2.2.2
google-generativeai==0.3.2
tiktoken==0.5.1
numpy==1.24.3
pandas==2.0.3
tqdm==4.66.1