Space: vikramvasudevan/youtube-channel-surfer-ai

vikramvasudevan committed 12 days ago

Commit f315fdc (verified) · 1 parent: 180b122

Upload folder using huggingface_hub


Files changed (10)

.github/workflows/update_space.yml +28 -0
README.md +102 -102
app.py +187 -149
modules/answerer.py +109 -109
modules/collector.py +69 -69
modules/db.py +36 -36
modules/indexer.py +34 -34
modules/retriever.py +36 -36
modules/youtube_utils.py +26 -26
tests/search.py +13 -13

.github/workflows/update_space.yml ADDED

@@ -0,0 +1,28 @@

+name: Run Python script
+on:
+  push:
+    branches:
+      - main
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    steps:
+    - name: Checkout
+      uses: actions/checkout@v2
+    - name: Set up Python
+      uses: actions/setup-python@v2
+      with:
+        python-version: '3.9'
+    - name: Install Gradio
+      run: python -m pip install gradio
+    - name: Log in to Hugging Face
+      run: python -c 'import huggingface_hub; huggingface_hub.login(token="${{ secrets.hf_token }}")'
+    - name: Deploy to Spaces
+      run: gradio deploy

README.md CHANGED

@@ -1,102 +1,102 @@
----
-title: youtube-channel-surfer-ai
-license: mit
-emoji: "📺"
-app_file: "app.py"
-sdk: "gradio"
-pinned: false
-python_version: 3.13
----
-# 📺 YouTube Metadata Q&A Agent
-This application allows you to index YouTube channels and ask natural language questions about the videos. It leverages **OpenAI embeddings** and **GPT-4o-mini** to provide insightful answers based on video metadata (titles + descriptions), and it displays top relevant videos in a clean, interactive table.
----
-## Features
-- **Index YouTube Channels**: Provide one or more YouTube channel URLs to index video metadata.
-- **Search & Answer Questions**: Ask questions about channel content and get answers generated by an LLM.
-- **Top Video Results**: View top relevant videos in a structured HTML table with clickable links.
-- **Embedded Video Player**: Watch videos directly in the app using YouTube embeds.
-- **Refresh Channels**: Update previously indexed channels to include the latest videos.
-- **Lightweight Storage**: Uses a local **ChromaDB** persistent database to store video embeddings for fast retrieval.
-- **Structured LLM Output**: LLM returns structured `LLMAnswer` objects with textual answer + top videos for clean rendering.
----
-## How it Works
-1. **Channel Indexing**:
-   - The app fetches the latest videos from provided YouTube channels using the YouTube Data API.
-   - Video metadata (title, description, channel, video ID) is embedded with OpenAI embeddings and stored in ChromaDB.
-2. **Query & Retrieval**:
-   - User queries are embedded and compared with stored video embeddings.
-   - Top matching videos are retrieved.
-3. **Answer Generation**:
-   - The LLM generates an answer based on the top video metadata.
-   - The answer and top videos are returned as structured data (`LLMAnswer`).
-4. **Rendering**:
-   - Answer text is displayed in Markdown.
-   - Top videos are displayed in a structured HTML table with clickable links and embedded YouTube players.
----
-## Installation
-## Steps to Run
-1. **Clone the repository:**
-        git clone <repo_url>
-        cd youtube_surfer_ai_agent
-2. **Create and activate a virtual environment:**
-    - Linux/macOS:
-            python -m venv .venv
-            source .venv/bin/activate
-    - Windows:
-            python -m venv .venv
-            .venv\Scripts\activate
-3. **Install dependencies:**
-        pip install -r requirements.txt
-4. **Create a `.env` file** in the project root with your API keys:
-        YOUTUBE_API_KEY=your_youtube_api_key
-        OPENAI_API_KEY=your_openai_api_key
-5. **Run the application:**
-        python app.py
-6. **Open the Gradio interface** in your browser (default: http://127.0.0.1:7860).
----
-## How to Use
-- **Index Channels:** Paste one or more YouTube channel URLs (comma or newline separated) and click "Index Channels".
-- **Refresh Channels:** Use the sidebar "Refresh All Channels" button to update existing channels.
-- **Ask Questions:** Type a query in the text box and click "Get Answer" to receive a structured response with embedded videos.
-- **View Indexed Channels:** The sidebar lists all channels that have been indexed with clickable links.
----
-## Notes
-- The LLM uses structured outputs (`LLMAnswer` + `VideoItem`) internally to produce consistent results.
-- Top videos are embedded as iframes in the Gradio interface.
-- You can adjust the number of top videos returned by modifying the `top_k` parameter in `answer_query`.
----

+---
+title: youtube-channel-surfer-ai
+license: mit
+emoji: "📺"
+app_file: "app.py"
+sdk: "gradio"
+pinned: false
+python_version: 3.13
+---
+# 📺 YouTube Metadata Q&A Agent
+This application allows you to index YouTube channels and ask natural language questions about the videos. It leverages **OpenAI embeddings** and **GPT-4o-mini** to provide insightful answers based on video metadata (titles + descriptions), and it displays top relevant videos in a clean, interactive table.
+---
+## Features
+- **Index YouTube Channels**: Provide one or more YouTube channel URLs to index video metadata.
+- **Search & Answer Questions**: Ask questions about channel content and get answers generated by an LLM.
+- **Top Video Results**: View top relevant videos in a structured HTML table with clickable links.
+- **Embedded Video Player**: Watch videos directly in the app using YouTube embeds.
+- **Refresh Channels**: Update previously indexed channels to include the latest videos.
+- **Lightweight Storage**: Uses a local **ChromaDB** persistent database to store video embeddings for fast retrieval.
+- **Structured LLM Output**: LLM returns structured `LLMAnswer` objects with textual answer + top videos for clean rendering.
+---
+## How it Works
+1. **Channel Indexing**:
+   - The app fetches the latest videos from provided YouTube channels using the YouTube Data API.
+   - Video metadata (title, description, channel, video ID) is embedded with OpenAI embeddings and stored in ChromaDB.
+2. **Query & Retrieval**:
+   - User queries are embedded and compared with stored video embeddings.
+   - Top matching videos are retrieved.
+3. **Answer Generation**:
+   - The LLM generates an answer based on the top video metadata.
+   - The answer and top videos are returned as structured data (`LLMAnswer`).
+4. **Rendering**:
+   - Answer text is displayed in Markdown.
+   - Top videos are displayed in a structured HTML table with clickable links and embedded YouTube players.
+---
+## Installation
+## Steps to Run
+1. **Clone the repository:**
+        git clone <repo_url>
+        cd youtube_surfer_ai_agent
+2. **Create and activate a virtual environment:**
+    - Linux/macOS:
+            python -m venv .venv
+            source .venv/bin/activate
+    - Windows:
+            python -m venv .venv
+            .venv\Scripts\activate
+3. **Install dependencies:**
+        pip install -r requirements.txt
+4. **Create a `.env` file** in the project root with your API keys:
+        YOUTUBE_API_KEY=your_youtube_api_key
+        OPENAI_API_KEY=your_openai_api_key
+5. **Run the application:**
+        python app.py
+6. **Open the Gradio interface** in your browser (default: http://127.0.0.1:7860).
+---
+## How to Use
+- **Index Channels:** Paste one or more YouTube channel URLs (comma or newline separated) and click "Index Channels".
+- **Refresh Channels:** Use the sidebar "Refresh All Channels" button to update existing channels.
+- **Ask Questions:** Type a query in the text box and click "Get Answer" to receive a structured response with embedded videos.
+- **View Indexed Channels:** The sidebar lists all channels that have been indexed with clickable links.
+---
+## Notes
+- The LLM uses structured outputs (`LLMAnswer` + `VideoItem`) internally to produce consistent results.
+- Top videos are embedded as iframes in the Gradio interface.
+- You can adjust the number of top videos returned by modifying the `top_k` parameter in `answer_query`.
+---
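
Note on the README's "Query & Retrieval" step: the README says queries are embedded and compared with stored video embeddings, but the comparison itself is delegated to ChromaDB and is not visible in this diff. The following is a minimal sketch of that idea in plain Python, assuming cosine similarity as the metric (an assumption; the collection's actual distance function is not shown). The names `cosine_similarity` and `top_k` are hypothetical and do not appear in the repository.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], indexed: dict[str, list[float]], k: int = 5) -> list[str]:
    """Return the ids of the k stored vectors most similar to the query.

    `indexed` maps a video id to its embedding (hypothetical in-memory
    stand-in for the ChromaDB collection).
    """
    scored = [(cosine_similarity(query_vec, vec), vid) for vid, vec in indexed.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [vid for _, vid in scored[:k]]


if __name__ == "__main__":
    store = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
    print(top_k([1.0, 0.0], store, k=2))
```

In the app itself the embeddings come from the OpenAI embeddings API and the ranking is done by the vector store, so this only illustrates the ranking logic, not the project's code path.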

app.py CHANGED

@@ -1,149 +1,187 @@
-import os
-import re
-import gradio as gr
-import chromadb
-from modules.collector import fetch_channel_videos_from_url
-from modules.db import get_indexed_channels
-from modules.indexer import index_videos
-from modules.answerer import answer_query, LLMAnswer, VideoItem, build_video_html
-from dotenv import load_dotenv
-load_dotenv()
-# -------------------------------
-# Setup Chroma
-# -------------------------------
-client = chromadb.PersistentClient(path="./youtube_db")
-collection = client.get_or_create_collection("yt_metadata", embedding_function=None)
-# -------------------------------
-# Utils
-# -------------------------------
-def refresh_channel(api_key, channel_url: str):
-    """Fetch + re-index a single channel."""
-    videos = fetch_channel_videos_from_url(api_key, channel_url)
-    for v in videos:
-        v["channel_url"] = channel_url
-    index_videos(videos, collection, channel_url=channel_url)
-    return len(videos)
-def index_channels(channel_urls: str):
-    yt_api_key = os.environ["YOUTUBE_API_KEY"]
-    urls = [u.strip() for u in re.split(r"[\n,]+", channel_urls) if u.strip()]
-    total_videos = sum(refresh_channel(yt_api_key, url) for url in urls)
-    return (
-        f"✅ Indexed {total_videos} videos from {len(urls)} channels.",
-        list_channels(),
-    )
-def list_channels():
-    channels = get_indexed_channels(collection)
-    if not channels:
-        return "No channels indexed yet."
-    md = []
-    for key, val in channels.items():
-        if isinstance(val, dict):
-            cname = val.get("channel_title", "Unknown")
-            curl = val.get("channel_url", None)
-        else:
-            cname = val
-            curl = key
-        if curl:
-            md.append(f"- **{cname}** ([link]({curl}))")
-        else:
-            md.append(f"- **{cname}**")
-    return "\n".join(md)
-def refresh_all_channels():
-    yt_api_key = os.environ["YOUTUBE_API_KEY"]
-    channels = get_indexed_channels(collection)
-    if not channels:
-        return "⚠️ No channels available to refresh.", list_channels()
-    total_videos = 0
-    for key, val in channels.items():
-        url = val.get("channel_url") if isinstance(val, dict) else key
-        if url:
-            total_videos += refresh_channel(yt_api_key, url)
-    return (
-        f"🔄 Refreshed {len(channels)} channels, re-indexed {total_videos} videos.",
-        list_channels(),
-    )
-def handle_query(query: str):
-    (answer_text, video_html) = answer_query(query, collection)  # returns LLMAnswer
-    return answer_text, video_html
-# -------------------------------
-# Gradio UI
-# -------------------------------
-def show_component():
-    return gr.update(visible=True)
-def hide_component():
-    return gr.update(visible=False)
-def close_component():
-    return gr.update(open=False)
-def open_component():
-    return gr.update(open=True)
-with gr.Blocks() as demo:
-    gr.Markdown("## 📺 YouTube Metadata Q&A Agent")
-    from gradio_modal import Modal
-    with Modal(visible=False) as add_channel_modal:
-        channel_input = gr.Textbox(
-            label="Channel URLs",
-            placeholder="Paste one or more YouTube channel URLs (comma or newline separated)",
-        )
-        save_add_channels_btn = gr.Button("Add Channels")
-        index_status = gr.Markdown(label="Index Status", container=False)
-    with gr.Row():
-        with gr.Sidebar() as my_sidebar:
-            gr.Markdown("### 📺 Channels")
-            channel_list = gr.Markdown(list_channels())
-            with gr.Row():
-                refresh_all_btn = gr.Button(
-                    "🔄 Refresh", size="sm", scale=0
-                )
-                add_channels_btn = gr.Button("+ Add", size="sm", scale=0)
-            refresh_status = gr.Markdown(label="Refresh Status", container=False)
-            refresh_all_btn.click(
-                fn=refresh_all_channels,
-                inputs=None,
-                outputs=[refresh_status, channel_list],
-            )
-            add_channels_btn.click(close_component, outputs=[my_sidebar]).then(show_component, outputs=[add_channel_modal])
-            save_add_channels_btn.click(
-                index_channels,
-                inputs=[channel_input],
-                outputs=[index_status, channel_list],
-            ).then(hide_component, outputs=[add_channel_modal]).then(open_component, outputs=[my_sidebar])
-        with gr.Column(scale=3):
-            question = gr.Textbox(
-                label="Ask a Question",
-                placeholder="e.g., What topics did they cover on AI ethics?",
-            )
-            gr.Examples(
-                [
-                    "Show me some videos that mention Ranganatha.",
-                    "Slokas that mention gajendra moksham",
-                ],
-                inputs=question,
-            )
-            answer = gr.Markdown()
-            video_embed = gr.HTML()  # iframe embeds will render here
-            ask_btn = gr.Button("Get Answer")
-            ask_btn.click(handle_query, inputs=question, outputs=[answer, video_embed])
-if __name__ == "__main__":
-    demo.launch()

+import os
+import re
+import gradio as gr
+from gradio_modal import Modal
+import chromadb
+from modules.collector import fetch_channel_videos_from_url
+from modules.db import get_indexed_channels
+from modules.indexer import index_videos
+from modules.answerer import answer_query, LLMAnswer, VideoItem, build_video_html
+from dotenv import load_dotenv
+load_dotenv()
+# -------------------------------
+# Setup Chroma
+# -------------------------------
+client = chromadb.PersistentClient(path="./youtube_db")
+collection = client.get_or_create_collection("yt_metadata", embedding_function=None)
+# -------------------------------
+# Utils
+# -------------------------------
+def refresh_channel(api_key, channel_url: str):
+    """Fetch + re-index a single channel."""
+    videos = fetch_channel_videos_from_url(api_key, channel_url)
+    for v in videos:
+        v["channel_url"] = channel_url
+    index_videos(videos, collection, channel_url=channel_url)
+    return len(videos)
+def index_channels(channel_urls: str):
+    yield "saving ...", gr.update()
+    yt_api_key = os.environ["YOUTUBE_API_KEY"]
+    urls = [u.strip() for u in re.split(r"[\n,]+", channel_urls) if u.strip()]
+    total_videos = sum(refresh_channel(yt_api_key, url) for url in urls)
+    yield (
+        f"✅ Indexed {total_videos} videos from {len(urls)} channels.",
+        list_channels(),
+    )
+    return
+def list_channels():
+    channels = get_indexed_channels(collection)
+    if not channels:
+        return "No channels indexed yet."
+    md = []
+    for key, val in channels.items():
+        if isinstance(val, dict):
+            cname = val.get("channel_title", "Unknown")
+            curl = val.get("channel_url", None)
+        else:
+            cname = val
+            curl = key
+        if curl:
+            md.append(f"- **{cname}** ([link]({curl}))")
+        else:
+            md.append(f"- **{cname}**")
+    return "\n".join(md)
+def refresh_all_channels():
+    yt_api_key = os.environ["YOUTUBE_API_KEY"]
+    channels = get_indexed_channels(collection)
+    if not channels:
+        return "⚠️ No channels available to refresh.", list_channels()
+    total_videos = 0
+    for key, val in channels.items():
+        url = val.get("channel_url") if isinstance(val, dict) else key
+        if url:
+            total_videos += refresh_channel(yt_api_key, url)
+    return (
+        f"🔄 Refreshed {len(channels)} channels, re-indexed {total_videos} videos.",
+        list_channels(),
+    )
+def handle_query(query: str):
+    (answer_text, video_html) = answer_query(query, collection)  # returns LLMAnswer
+    return answer_text, video_html
+# -------------------------------
+# Gradio UI
+# -------------------------------
+def show_component():
+    return gr.update(visible=True)
+def hide_component():
+    return gr.update(visible=False)
+def close_component():
+    return gr.update(open=False)
+def open_component():
+    return gr.update(open=True)
+def disable_component():
+    return gr.update(interactive=False)
+def enable_component():
+    return gr.update(interactive=True)
+def clear_component():
+    return gr.update(value="")
+def show_loading():
+    return gr.update(value="loading")
+with gr.Blocks() as demo:
+    gr.Markdown("## 📺 YouTube Metadata Q&A Agent")
+    with Modal(visible=False) as add_channel_modal:
+        channel_input = gr.Textbox(
+            label="Channel URLs",
+            placeholder="Paste one or more YouTube channel URLs (comma or newline separated)",
+        )
+        save_add_channels_btn = gr.Button("Add Channels")
+        index_status = gr.Markdown(label="Index Status", container=False)
+    with gr.Row():
+        with gr.Sidebar() as my_sidebar:
+            gr.Markdown("### 📺 Channels")
+            channel_list = gr.Markdown(list_channels())
+            with gr.Row():
+                refresh_all_btn = gr.Button("🔄 Refresh", size="sm", scale=0)
+                add_channels_btn = gr.Button("+ Add", size="sm", scale=0)
+            refresh_status = gr.Markdown(label="Refresh Status", container=False)
+            refresh_all_btn.click(
+                fn=refresh_all_channels,
+                inputs=None,
+                outputs=[refresh_status, channel_list],
+            )
+            add_channels_btn.click(close_component, outputs=[my_sidebar]).then(
+                show_component, outputs=[add_channel_modal]
+            )
+            save_add_channels_btn.click(
+                disable_component, outputs=[save_add_channels_btn]
+            ).then(
+                index_channels,
+                inputs=[channel_input],
+                outputs=[index_status, channel_list],
+            ).then(
+                hide_component, outputs=[add_channel_modal]
+            ).then(
+                open_component, outputs=[my_sidebar]
+            ).then(
+                enable_component, outputs=[save_add_channels_btn]
+            )
+        with gr.Column(scale=3):
+            question = gr.Textbox(
+                label="Ask a Question",
+                placeholder="e.g., What topics did they cover on AI ethics?",
+            )
+            gr.Examples(
+                [
+                    "Show me some videos that mention Ranganatha.",
+                    "Slokas that mention gajendra moksham",
+                ],
+                inputs=question,
+            )
+            answer = gr.Markdown()
+            video_embed = gr.HTML()  # iframe embeds will render here
+            ask_btn = gr.Button("Get Answer")
+            ask_status = gr.Markdown()
+            ask_btn.click(show_loading, outputs=[ask_status]).then(
+                disable_component, outputs=[ask_btn]
+            ).then(handle_query, inputs=question, outputs=[answer, video_embed]).then(
+                enable_component, outputs=[ask_btn]
+            ).then(
+                clear_component, outputs=[ask_status]
+            )
+if __name__ == "__main__":
+    demo.launch()
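
Note on `index_channels`: this commit turns the handler into a generator that yields an interim "saving ..." status before the final summary. A possible extension, sketched here without Gradio so it can run standalone, is to yield one status line per channel. `parse_channel_urls` and `index_with_progress` are hypothetical names, and `refresh` is a stand-in for `refresh_channel(api_key, url)`; in the real handler each yield would presumably be a `(status, gr.update())` pair to match the two outputs.

```python
import re
from typing import Callable, Iterator


def parse_channel_urls(raw: str) -> list[str]:
    """Split on commas and newlines, dropping blanks and surrounding spaces."""
    return [u.strip() for u in re.split(r"[\n,]+", raw) if u.strip()]


def index_with_progress(raw: str, refresh: Callable[[str], int]) -> Iterator[str]:
    """Yield a status line before each channel is indexed, then a summary.

    `refresh` takes a channel URL and returns the number of videos indexed.
    """
    urls = parse_channel_urls(raw)
    total = 0
    for position, url in enumerate(urls, start=1):
        yield f"Indexing {position}/{len(urls)}: {url}"
        total += refresh(url)
    yield f"Indexed {total} videos from {len(urls)} channels."


if __name__ == "__main__":
    # Fake refresh function that pretends every channel has 2 videos.
    for status in index_with_progress("x, y", lambda url: 2):
        print(status)
```

The splitting expression is the same one used in `index_channels`; only the per-channel status lines are new.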

modules/answerer.py CHANGED

@@ -1,109 +1,109 @@
-# -------------------------------
-# 4. Answerer
-# -------------------------------
-from typing import List
-from pydantic import BaseModel
-from openai import OpenAI
-from modules.retriever import retrieve_videos
-# -------------------------------
-# Structured Output Classes
-# -------------------------------
-class VideoItem(BaseModel):
-    video_id: str
-    title: str
-    channel: str
-    description: str
-class LLMAnswer(BaseModel):
-    answer_text: str
-    top_videos: List[VideoItem]
-# -------------------------------
-# Main Function
-# -------------------------------
-def answer_query(query: str, collection, top_k: int = 5) -> LLMAnswer:
-    """
-    Answer a user query using YouTube video metadata.
-    Returns an LLMAnswer object with textual answer + list of videos.
-    """
-    results = retrieve_videos(query, collection, top_k=top_k)
-    if not results:
-        return LLMAnswer(answer_text="No relevant videos found.", top_videos=[])
-    # Build context lines for the LLM
-    context_lines = []
-    top_videos_list = []
-    for r in results:
-        # Ensure each result is a dict
-        if not isinstance(r, dict):
-            continue
-        vid_id = r.get("video_id", "")
-        title = r.get("video_title") or r.get("title", "")
-        channel = r.get("channel") or r.get("channel_title", "")
-        description = r.get("description", "")
-        context_lines.append(f"- {title} ({channel}) (https://youtube.com/watch?v={vid_id})\n  description: {description}")
-        top_videos_list.append(
-            VideoItem(
-                video_id=vid_id,
-                title=title,
-                channel=channel,
-                description=description
-            )
-        )
-    context_text = "\n".join(context_lines)
-    # Call LLM with structured output
-    client = OpenAI()
-    response = client.chat.completions.parse(
-        model="gpt-4o-mini",
-        messages=[
-            {
-                "role": "system",
-                "content": (
-                    "You are a helpful assistant that answers questions using YouTube video metadata. "
-                    "Return your response strictly as the LLMAnswer class, including 'answer_text' and a list of 'top_videos'."
-                )
-            },
-            {
-                "role": "user",
-                "content": f"Question: {query}\n\nRelevant videos:\n{context_text}\n\nAnswer based only on this."
-            }
-        ],
-        response_format=LLMAnswer
-    )
-    llm_answer = response.choices[0].message.parsed  # already LLMAnswer object
-    answer_text = llm_answer.answer_text
-    video_html = build_video_html(llm_answer.top_videos)
-    return answer_text, video_html
-def build_video_html(videos: list[VideoItem]) -> str:
-    """Build a clean HTML table from top_videos."""
-    if not videos:
-        return "<p>No relevant videos found.</p>"
-    html = """
-    <table border="1" style="border-collapse: collapse; width: 100%;">
-        <tr>
-            <th>Title</th>
-            <th>Channel</th>
-            <th>Description</th>
-            <th>Watch</th>
-        </tr>
-    """
-    for v in videos:
-        html += f"""
-        <tr>
-            <td>{v.title}</td>
-            <td>{v.channel}</td>
-            <td>{v.description}</td>
-            <td><a href="https://youtube.com/watch?v={v.video_id}" target="_blank">▶️ Watch</a></td>
-        </tr>
-        """
-    html += "</table>"
-    return html

+# -------------------------------
+# 4. Answerer
+# -------------------------------
+from typing import List
+from pydantic import BaseModel
+from openai import OpenAI
+from modules.retriever import retrieve_videos
+# -------------------------------
+# Structured Output Classes
+# -------------------------------
+class VideoItem(BaseModel):
+    video_id: str
+    title: str
+    channel: str
+    description: str
+class LLMAnswer(BaseModel):
+    answer_text: str
+    top_videos: List[VideoItem]
+# -------------------------------
+# Main Function
+# -------------------------------
+def answer_query(query: str, collection, top_k: int = 5) -> LLMAnswer:
+    """
+    Answer a user query using YouTube video metadata.
+    Returns an LLMAnswer object with textual answer + list of videos.
+    """
+    results = retrieve_videos(query, collection, top_k=top_k)
+    if not results:
+        return LLMAnswer(answer_text="No relevant videos found.", top_videos=[])
+    # Build context lines for the LLM
+    context_lines = []
+    top_videos_list = []
+    for r in results:
+        # Ensure each result is a dict
+        if not isinstance(r, dict):
+            continue
+        vid_id = r.get("video_id", "")
+        title = r.get("video_title") or r.get("title", "")
+        channel = r.get("channel") or r.get("channel_title", "")
+        description = r.get("description", "")
+        context_lines.append(f"- {title} ({channel}) (https://youtube.com/watch?v={vid_id})\n  description: {description}")
+        top_videos_list.append(
+            VideoItem(
+                video_id=vid_id,
+                title=title,
+                channel=channel,
+                description=description
+            )
+        )
+    context_text = "\n".join(context_lines)
+    # Call LLM with structured output
+    client = OpenAI()
+    response = client.chat.completions.parse(
+        model="gpt-4o-mini",
+        messages=[
+            {
+                "role": "system",
+                "content": (
+                    "You are a helpful assistant that answers questions using YouTube video metadata. "
+                    "Return your response strictly as the LLMAnswer class, including 'answer_text' and a list of 'top_videos'."
+                )
+            },
+            {
+                "role": "user",
+                "content": f"Question: {query}\n\nRelevant videos:\n{context_text}\n\nAnswer based only on this."
+            }
+        ],
+        response_format=LLMAnswer
+    )
+    llm_answer = response.choices[0].message.parsed  # already LLMAnswer object
+    answer_text = llm_answer.answer_text
+    video_html = build_video_html(llm_answer.top_videos)
+    return answer_text, video_html
+def build_video_html(videos: list[VideoItem]) -> str:
+    """Build a clean HTML table from top_videos."""
+    if not videos:
+        return "<p>No relevant videos found.</p>"
+    html = """
+    <table border="1" style="border-collapse: collapse; width: 100%;">
+        <tr>
+            <th>Title</th>
+            <th>Channel</th>
+            <th>Description</th>
+            <th>Watch</th>
+        </tr>
+    """
+    for v in videos:
+        html += f"""
+        <tr>
+            <td>{v.title}</td>
+            <td>{v.channel}</td>
+            <td>{v.description}</td>
+            <td><a href="https://youtube.com/watch?v={v.video_id}" target="_blank">▶️ Watch</a></td>
+        </tr>
+        """
+    html += "</table>"
+    return html
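
Note on `build_video_html`: titles, channel names, and descriptions are interpolated into the table as-is, so any `<`, `>`, or `&` in YouTube metadata reaches the page unescaped. A minimal sketch of escaped row rendering follows, using only the standard library. `Video` is a hypothetical stand-in for `VideoItem`, and `build_rows` is an illustrative name, not part of the repository.

```python
from dataclasses import dataclass
from html import escape
from urllib.parse import quote


@dataclass
class Video:  # hypothetical stand-in for the pydantic VideoItem model
    video_id: str
    title: str
    channel: str
    description: str


def build_rows(videos: list[Video]) -> str:
    """Render one <tr> per video with all text fields HTML-escaped."""
    rows = []
    for v in videos:
        # Percent-encode the id so it cannot break out of the href attribute.
        url = f"https://youtube.com/watch?v={quote(v.video_id, safe='')}"
        rows.append(
            "<tr>"
            f"<td>{escape(v.title)}</td>"
            f"<td>{escape(v.channel)}</td>"
            f"<td>{escape(v.description)}</td>"
            f'<td><a href="{url}" target="_blank">Watch</a></td>'
            "</tr>"
        )
    return "\n".join(rows)


if __name__ == "__main__":
    print(build_rows([Video("abc", "<b>x</b>", "Ch & Co", "demo")]))
```

Separately, `answer_query` is annotated `-> LLMAnswer` but its main path returns `(answer_text, video_html)`, while the early "no results" path returns an `LLMAnswer`; the early path therefore does not yield the text/HTML pair that `handle_query` unpacks.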

modules/collector.py CHANGED

@@ -1,69 +1,69 @@
-# -------------------------------
-# 1. Collector
-# -------------------------------
-from typing import List,Dict
-from googleapiclient.discovery import build
-from modules.youtube_utils import get_channel_id
-def fetch_channel_videos_from_url(api_key: str, channel_url: str, max_results=20):
-    youtube = build("youtube", "v3", developerKey=api_key)
-    channel_id = get_channel_id(youtube, channel_url)
-    # Get channel details to fetch its title
-    channel_response = youtube.channels().list(
-        part="snippet",
-        id=channel_id
-    ).execute()
-    channel_title = channel_response["items"][0]["snippet"]["title"]
-    request = youtube.search().list(
-        part="snippet",
-        channelId=channel_id,
-        maxResults=max_results,
-        order="date"
-    )
-    response = request.execute()
-    videos = []
-    for item in response.get("items", []):
-        if item["id"]["kind"] == "youtube#video":
-            videos.append({
-                "video_id": item["id"]["videoId"],
-                "title": item["snippet"]["title"],
-                "description": item["snippet"].get("description", ""),
-                "channel_id": channel_id,
-                "channel_title": channel_title,
-            })
-    return videos
-def fetch_channel_videos(api_key: str, channel_id: str, max_results=20):
-    youtube = build("youtube", "v3", developerKey=api_key)
-    # Fetch channel title
-    channel_response = youtube.channels().list(
-        part="snippet",
-        id=channel_id
-    ).execute()
-    channel_title = channel_response["items"][0]["snippet"]["title"]
-    request = youtube.search().list(
-        part="snippet",
-        channelId=channel_id,
-        maxResults=max_results,
-        order="date"
-    )
-    response = request.execute()
-    videos = []
-    for item in response.get("items", []):
-        if item["id"]["kind"] == "youtube#video":
-            videos.append({
-                "video_id": item["id"]["videoId"],
-                "title": item["snippet"]["title"],
-                "description": item["snippet"].get("description", ""),
-                "channel_id": channel_id,
-                "channel_title": channel_title,
-            })
-    return videos

+# -------------------------------
+# 1. Collector
+# -------------------------------
+from typing import List,Dict
+from googleapiclient.discovery import build
+from modules.youtube_utils import get_channel_id
+def fetch_channel_videos_from_url(api_key: str, channel_url: str, max_results=20):
+    youtube = build("youtube", "v3", developerKey=api_key)
+    channel_id = get_channel_id(youtube, channel_url)
+    # Get channel details to fetch its title
+    channel_response = youtube.channels().list(
+        part="snippet",
+        id=channel_id
+    ).execute()
+    channel_title = channel_response["items"][0]["snippet"]["title"]
+    request = youtube.search().list(
+        part="snippet",
+        channelId=channel_id,
+        maxResults=max_results,
+        order="date"
+    )
+    response = request.execute()
+    videos = []
+    for item in response.get("items", []):
+        if item["id"]["kind"] == "youtube#video":
+            videos.append({
+                "video_id": item["id"]["videoId"],
+                "title": item["snippet"]["title"],
+                "description": item["snippet"].get("description", ""),
+                "channel_id": channel_id,
+                "channel_title": channel_title,
+            })
+    return videos
+def fetch_channel_videos(api_key: str, channel_id: str, max_results=20):
+    youtube = build("youtube", "v3", developerKey=api_key)
+    # Fetch channel title
+    channel_response = youtube.channels().list(
+        part="snippet",
+        id=channel_id
+    ).execute()
+    channel_title = channel_response["items"][0]["snippet"]["title"]
+    request = youtube.search().list(
+        part="snippet",
+        channelId=channel_id,
+        maxResults=max_results,
+        order="date"
+    )
+    response = request.execute()
+    videos = []
+    for item in response.get("items", []):
+        if item["id"]["kind"] == "youtube#video":
+            videos.append({
+                "video_id": item["id"]["videoId"],
+                "title": item["snippet"]["title"],
+                "description": item["snippet"].get("description", ""),
+                "channel_id": channel_id,
+                "channel_title": channel_title,
+            })
+    return videos
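
Note on `modules/collector.py`: `fetch_channel_videos_from_url` and `fetch_channel_videos` differ only in how they obtain `channel_id`; the loop that filters search results is duplicated. One way to share it is sketched below. `extract_videos` is a hypothetical helper, and the response dictionary in the usage example is hand-written to mirror the fields the diff reads, not captured from the API.

```python
def extract_videos(response: dict, channel_id: str, channel_title: str) -> list[dict]:
    """Keep only video items from a search response and flatten their fields."""
    videos = []
    for item in response.get("items", []):
        # Search results can also contain channels and playlists; skip those.
        if item.get("id", {}).get("kind") != "youtube#video":
            continue
        snippet = item.get("snippet", {})
        videos.append({
            "video_id": item["id"]["videoId"],
            "title": snippet.get("title", ""),
            "description": snippet.get("description", ""),
            "channel_id": channel_id,
            "channel_title": channel_title,
        })
    return videos


if __name__ == "__main__":
    fake_response = {
        "items": [
            {"id": {"kind": "youtube#video", "videoId": "v1"},
             "snippet": {"title": "First", "description": "Intro"}},
            {"id": {"kind": "youtube#playlist", "playlistId": "p1"},
             "snippet": {"title": "A playlist"}},
        ]
    }
    print(extract_videos(fake_response, "UC123", "Demo Channel"))
```

Both fetch functions could then call this helper after `request.execute()`, leaving the channel lookup as their only difference.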

modules/db.py CHANGED

@@ -1,36 +1,36 @@
-import chromadb
-def get_collection():
-    client = chromadb.PersistentClient(path="./youtube_db")
-    # Ensure fresh collection with correct dimension
-    try:
-        collection = client.get_collection("yt_metadata")
-    except Exception:
-        collection = client.create_collection("yt_metadata")
-    # Check dimension mismatch
-    try:
-        # quick test query
-        collection.query(query_embeddings=[[0.0] * 1536], n_results=1)
-    except Exception:
-        # Delete and recreate with fresh schema
-        client.delete_collection("yt_metadata")
-        collection = client.create_collection("yt_metadata")
-    return collection
-# modules/db.py
-def get_indexed_channels(collection):
-    results = collection.get(include=["metadatas"])
-    channels = {}
-    for meta in results["metadatas"]:
-        cid = meta.get("channel_id")  # ✅ safe
-        cname = meta.get("channel_title", "Unknown Channel")
-        if cid:  # only include if we have a channel_id
-            channels[cid] = cname
-    return channels

+import chromadb
+def get_collection():
+    client = chromadb.PersistentClient(path="./youtube_db")
+    # Ensure fresh collection with correct dimension
+    try:
+        collection = client.get_collection("yt_metadata")
+    except Exception:
+        collection = client.create_collection("yt_metadata")
+    # Check dimension mismatch
+    try:
+        # quick test query
+        collection.query(query_embeddings=[[0.0] * 1536], n_results=1)
+    except Exception:
+        # Delete and recreate with fresh schema
+        client.delete_collection("yt_metadata")
+        collection = client.create_collection("yt_metadata")
+    return collection
+# modules/db.py
+def get_indexed_channels(collection):
+    results = collection.get(include=["metadatas"])
+    channels = {}
+    for meta in results["metadatas"]:
+        cid = meta.get("channel_id")  # ✅ safe
+        cname = meta.get("channel_title", "Unknown Channel")
+        if cid:  # only include if we have a channel_id
+            channels[cid] = cname
+    return channels
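
Review note: `get_collection` above deletes and recreates `yt_metadata` whenever the probe query raises, for any reason, which could discard an existing index on an unrelated error. A sketch of a narrower check, assuming the stored embeddings can be sampled with `collection.get(limit=1, include=["embeddings"])`, is shown below. The function name `open_collection` and its parameters are hypothetical, not part of this commit.

```python
def open_collection(client, name: str = "yt_metadata", expected_dim: int = 1536):
    """Return the named collection, recreating it only on a confirmed dimension mismatch."""
    collection = client.get_or_create_collection(name)
    if collection.count() == 0:
        # Nothing stored yet, so there is no dimension to conflict with.
        return collection
    sample = collection.get(limit=1, include=["embeddings"])
    embeddings = sample.get("embeddings")
    # Explicit None/len checks: embeddings may come back as an array type
    # whose truth value is ambiguous.
    if (
        embeddings is not None
        and len(embeddings) > 0
        and len(embeddings[0]) != expected_dim
    ):
        client.delete_collection(name)
        collection = client.create_collection(name)
    return collection
```

Usage would mirror the committed code: `open_collection(chromadb.PersistentClient(path="./youtube_db"))`. The default of 1536 matches the probe vector length used in the diff.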

modules/indexer.py CHANGED

@@ -1,34 +1,34 @@
-# modules/indexer.py
-from typing import Dict, List
-from openai import OpenAI
-def index_videos(videos: List[Dict], collection,channel_url : str):
-    client = OpenAI()
-    for vid in videos:
-        text = f"{vid.get('title', '')} - {vid.get('description', '')}"
-        embedding = client.embeddings.create(
-            input=text,
-            model="text-embedding-3-small"
-        ).data[0].embedding
-        # build metadata safely
-        metadata = {
-            "video_id": vid.get("video_id"),
-            "video_title": vid.get("title", ""),
-            "description" : vid.get('description', ''),
-            "channel_url" : channel_url,
-        }
-        # add channel info if available
-        if "channel_id" in vid:
-            metadata["channel_id"] = vid["channel_id"]
-        if "channel_title" in vid:
-            metadata["channel_title"] = vid["channel_title"]
-        collection.add(
-            documents=[text],
-            embeddings=[embedding],
-            metadatas=[metadata],
-            ids=[vid.get("video_id")]
-        )

+# modules/indexer.py
+from typing import Dict, List
+from openai import OpenAI
+def index_videos(videos: List[Dict], collection,channel_url : str):
+    client = OpenAI()
+    for vid in videos:
+        text = f"{vid.get('title', '')} - {vid.get('description', '')}"
+        embedding = client.embeddings.create(
+            input=text,
+            model="text-embedding-3-small"
+        ).data[0].embedding
+        # build metadata safely
+        metadata = {
+            "video_id": vid.get("video_id"),
+            "video_title": vid.get("title", ""),
+            "description" : vid.get('description', ''),
+            "channel_url" : channel_url,
+        }
+        # add channel info if available
+        if "channel_id" in vid:
+            metadata["channel_id"] = vid["channel_id"]
+        if "channel_title" in vid:
+            metadata["channel_title"] = vid["channel_title"]
+        collection.add(
+            documents=[text],
+            embeddings=[embedding],
+            metadatas=[metadata],
+            ids=[vid.get("video_id")]
+        )
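
Review note: `index_videos` above makes one embeddings request per video and calls `collection.add`, which does not update entries whose IDs already exist. A sketch of a batched variant using `collection.upsert` follows. The name `index_videos_batched`, the injected `embed_batch` callable, and `batch_size` are hypothetical and not part of this commit. Dropping `None` metadata values rests on the assumption that Chroma expects scalar metadata values, which is worth checking against the installed version.

```python
from typing import Callable, Dict, List, Sequence


def index_videos_batched(
    videos: List[Dict],
    collection,
    channel_url: str,
    embed_batch: Callable[[List[str]], Sequence[Sequence[float]]],
    batch_size: int = 64,
) -> int:
    """Embed and upsert videos in batches; returns how many were indexed."""
    # A video ID is required because it is used as the document key.
    usable = [v for v in videos if v.get("video_id")]
    for start in range(0, len(usable), batch_size):
        chunk = usable[start:start + batch_size]
        texts = [f"{v.get('title', '')} - {v.get('description', '')}" for v in chunk]
        metadatas = []
        for v in chunk:
            meta = {
                "video_id": v["video_id"],
                "video_title": v.get("title", ""),
                "description": v.get("description", ""),
                "channel_url": channel_url,
                "channel_id": v.get("channel_id"),
                "channel_title": v.get("channel_title"),
            }
            # Keep only keys that have a value (assumption: None is not accepted).
            metadatas.append({k: val for k, val in meta.items() if val is not None})
        collection.upsert(
            ids=[v["video_id"] for v in chunk],
            documents=texts,
            embeddings=list(embed_batch(texts)),
            metadatas=metadatas,
        )
    return len(usable)
```

With the OpenAI client, `embed_batch` could be written as `lambda texts: [d.embedding for d in OpenAI().embeddings.create(input=texts, model="text-embedding-3-small").data]`, since the embeddings endpoint accepts a list of inputs.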

modules/retriever.py CHANGED

@@ -1,36 +1,36 @@
-# modules/retriever.py
-from typing import List, Dict
-from openai import OpenAI
-def retrieve_videos(query: str, collection, top_k: int = 3) -> List[Dict]:
-    client = OpenAI()
-    # Create embedding for query
-    embedding = client.embeddings.create(
-        input=query,
-        model="text-embedding-3-small"
-    ).data[0].embedding
-    # Query Chroma
-    results = collection.query(
-        query_embeddings=[embedding],
-        n_results=top_k,
-        include=["metadatas", "documents", "distances"]
-    )
-    # Build list of standardized dicts
-    videos = []
-    metadatas_list = results.get("metadatas", [[]])[0]  # list of metadata dicts
-    documents_list = results.get("documents", [[]])[0]  # list of text
-    distances_list = results.get("distances", [[]])[0]  # optional
-    for idx, meta in enumerate(metadatas_list):
-        videos.append({
-            "video_id": meta.get("video_id", ""),
-            "video_title": meta.get("video_title", meta.get("title", documents_list[idx])),
-            "channel": meta.get("channel", meta.get("channel_title", "")),
-            "description": documents_list[idx] if idx < len(documents_list) else "",
-            "score": distances_list[idx] if idx < len(distances_list) else None
-        })
-    return videos

+# modules/retriever.py
+from typing import List, Dict
+from openai import OpenAI
+def retrieve_videos(query: str, collection, top_k: int = 3) -> List[Dict]:
+    client = OpenAI()
+    # Create embedding for query
+    embedding = client.embeddings.create(
+        input=query,
+        model="text-embedding-3-small"
+    ).data[0].embedding
+    # Query Chroma
+    results = collection.query(
+        query_embeddings=[embedding],
+        n_results=top_k,
+        include=["metadatas", "documents", "distances"]
+    )
+    # Build list of standardized dicts
+    videos = []
+    metadatas_list = results.get("metadatas", [[]])[0]  # list of metadata dicts
+    documents_list = results.get("documents", [[]])[0]  # list of text
+    distances_list = results.get("distances", [[]])[0]  # optional
+    for idx, meta in enumerate(metadatas_list):
+        videos.append({
+            "video_id": meta.get("video_id", ""),
+            "video_title": meta.get("video_title", meta.get("title", documents_list[idx])),
+            "channel": meta.get("channel", meta.get("channel_title", "")),
+            "description": documents_list[idx] if idx < len(documents_list) else "",
+            "score": distances_list[idx] if idx < len(distances_list) else None
+        })
+    return videos
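
Review note: in `retrieve_videos` above, the fallback `documents_list[idx]` inside `meta.get(...)` is evaluated eagerly, so it can raise `IndexError` when fewer documents than metadata entries come back, even if `video_title` is present. A sketch of the same flattening with bounds-checked lookups follows. The helper name `flatten_query_results` is hypothetical and not part of this commit, and the sample input is made up.

```python
from typing import Dict, List


def flatten_query_results(results: Dict) -> List[Dict]:
    """Turn a single-query result dict into a flat list of video rows."""

    def first(key: str) -> list:
        # Each field is a list with one inner list per query; tolerate None or missing.
        rows = results.get(key) or [[]]
        return rows[0] if rows else []

    metas, docs, dists = first("metadatas"), first("documents"), first("distances")
    videos = []
    for idx, meta in enumerate(metas):
        meta = meta or {}
        doc = docs[idx] if idx < len(docs) else ""
        videos.append({
            "video_id": meta.get("video_id", ""),
            "video_title": meta.get("video_title") or meta.get("title") or doc,
            "channel": meta.get("channel") or meta.get("channel_title", ""),
            "description": doc,
            # This is a distance, so smaller values mean closer matches.
            "score": dists[idx] if idx < len(dists) else None,
        })
    return videos


sample = {
    "metadatas": [[{"video_id": "a", "video_title": "T1", "channel_title": "C"},
                   {"video_id": "b"}]],
    "documents": [["T1 - d1"]],
    "distances": [[0.12, 0.5]],
}
rows = flatten_query_results(sample)
print(rows[0]["video_title"], rows[1]["score"])  # prints: T1 0.5
```

The `score` key keeps the name used in the diff, although the value is a distance rather than a similarity.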

modules/youtube_utils.py CHANGED

@@ -1,26 +1,26 @@
-def get_channel_id(youtube, channel_url: str) -> str:
-    """
-    Extract channel ID from a YouTube URL or handle.
-    Supports:
-    - https://www.youtube.com/channel/UCxxxx
-    - https://www.youtube.com/@handle
-    - @handle
-    """
-    # If already a UC... ID
-    if "channel/" in channel_url:
-        return channel_url.split("channel/")[-1].split("/")[0]
-    # If it's a handle (@xyz or full URL)
-    if "@" in channel_url:
-        handle = channel_url.split("@")[-1]
-        request = youtube.channels().list(
-            part="id",
-            forHandle=handle
-        )
-        response = request.execute()
-        return response["items"][0]["id"]
-    if channel_url.startswith("UC"):
-        return channel_url
-    raise ValueError(f"Unsupported channel URL format {channel_url}")

+def get_channel_id(youtube, channel_url: str) -> str:
+    """
+    Extract channel ID from a YouTube URL or handle.
+    Supports:
+    - https://www.youtube.com/channel/UCxxxx
+    - https://www.youtube.com/@handle
+    - @handle
+    """
+    # If already a UC... ID
+    if "channel/" in channel_url:
+        return channel_url.split("channel/")[-1].split("/")[0]
+    # If it's a handle (@xyz or full URL)
+    if "@" in channel_url:
+        handle = channel_url.split("@")[-1]
+        request = youtube.channels().list(
+            part="id",
+            forHandle=handle
+        )
+        response = request.execute()
+        return response["items"][0]["id"]
+    if channel_url.startswith("UC"):
+        return channel_url
+    raise ValueError(f"Unsupported channel URL format {channel_url}")
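
Review note: `get_channel_id` above takes everything after the last `@`, so an input such as `https://www.youtube.com/@handle/videos` would pass `handle/videos` to `forHandle`. A sketch that separates parsing from the API lookup follows. The function name `parse_channel_ref` and its `(kind, value)` return shape are hypothetical and not part of this commit.

```python
from typing import Tuple
from urllib.parse import urlparse


def parse_channel_ref(ref: str) -> Tuple[str, str]:
    """Classify a channel reference as ("id", UC...) or ("handle", name)."""
    ref = ref.strip()
    is_url = "://" in ref
    # Use only the path of a full URL; bare inputs such as "@handle" are taken as-is.
    path = urlparse(ref).path if is_url else ref.split("?")[0]
    segments = [s for s in path.split("/") if s]
    for i, seg in enumerate(segments):
        if seg == "channel" and i + 1 < len(segments):
            return ("id", segments[i + 1])
        if seg.startswith("@") and len(seg) > 1:
            return ("handle", seg[1:])
    if not is_url and len(segments) == 1 and segments[0].startswith("UC"):
        return ("id", segments[0])
    raise ValueError(f"Unsupported channel URL format {ref}")


print(parse_channel_ref("https://www.youtube.com/@handle/videos?view=0"))
# prints ('handle', 'handle')
```

A caller would then run the `channels().list(part="id", forHandle=...)` request only for the `"handle"` case, and could check that `response.get("items")` is non-empty before indexing into it.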

tests/search.py CHANGED

@@ -1,14 +1,14 @@
-from chromadb import PersistentClient
-from modules.db import get_collection
-from modules.retriever import retrieve_videos
-from dotenv import load_dotenv
-load_dotenv()
-collection = get_collection()
-all_metas = collection.get(include=["metadatas"])["metadatas"]
-print("Sample metadatas:", all_metas[:5])
-print("-------")
 retrieve_videos("Show me some videos that mention Ranganatha.", collection)

+from chromadb import PersistentClient
+from modules.db import get_collection
+from modules.retriever import retrieve_videos
+from dotenv import load_dotenv
+load_dotenv()
+collection = get_collection()
+all_metas = collection.get(include=["metadatas"])["metadatas"]
+print("Sample metadatas:", all_metas[:5])
+print("-------")
 retrieve_videos("Show me some videos that mention Ranganatha.", collection)