vikramvasudevan committed
Commit 4617295 · verified · 1 Parent(s): df9283a

Upload folder using huggingface_hub

.gitignore ADDED
@@ -0,0 +1,12 @@
+ # Python-generated files
+ __pycache__/
+ *.py[oc]
+ build/
+ dist/
+ wheels/
+ *.egg-info
+
+ # Virtual environments
+ .venv
+ .env
+ youtube_db/
.python-version ADDED
@@ -0,0 +1 @@
+ 3.13
README.md CHANGED
@@ -1,12 +1,102 @@
- ---
- title: Youtube Channel Surfer Ai
- emoji: 📊
- colorFrom: yellow
- colorTo: yellow
- sdk: gradio
- sdk_version: 5.44.0
- app_file: app.py
- pinned: false
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ ---
+ title: youtube-channel-surfer-ai
+ license: mit
+ emoji: "📺"
+ app_file: "app.py"
+ sdk: "gradio"
+ pinned: false
+ python_version: "3.13"
+ ---
+
+ # 📺 YouTube Metadata Q&A Agent
+
+ This application lets you index YouTube channels and ask natural-language questions about their videos. It uses **OpenAI embeddings** and **GPT-4o-mini** to answer from video metadata (titles and descriptions), and it displays the most relevant videos in a clean, interactive table.
+
+ ---
+
+ ## Features
+
+ - **Index YouTube Channels**: Provide one or more YouTube channel URLs to index video metadata.
+ - **Search & Answer Questions**: Ask questions about channel content and get answers generated by an LLM.
+ - **Top Video Results**: View top relevant videos in a structured HTML table with clickable links.
+ - **Embedded Video Player**: Watch videos directly in the app using YouTube embeds.
+ - **Refresh Channels**: Update previously indexed channels to include the latest videos.
+ - **Lightweight Storage**: Uses a local, persistent **ChromaDB** database to store video embeddings for fast retrieval.
+ - **Structured LLM Output**: The LLM returns structured `LLMAnswer` objects (answer text plus top videos) for clean rendering.
+
+ ---
+
+ ## How it Works
+
+ 1. **Channel Indexing**:
+    - The app fetches the latest videos (up to 20 per channel by default) from the provided YouTube channels using the YouTube Data API.
+    - Each video's title and description are embedded with OpenAI's `text-embedding-3-small` model and stored in ChromaDB together with its metadata (video ID, channel, channel URL).
+
+ 2. **Query & Retrieval**:
+    - The user's query is embedded with the same model and compared against the stored video embeddings.
+    - The top matching videos are retrieved.
+
+ 3. **Answer Generation**:
+    - The LLM generates an answer based only on the metadata of the top videos.
+    - The answer and top videos are returned as structured data (`LLMAnswer`).
+
+ 4. **Rendering**:
+    - Answer text is displayed in Markdown.
+    - Top videos are displayed in a structured HTML table with clickable links and embedded YouTube players.
+
+ ---
+
+ ## Installation
+
+ Steps to install and run the app locally:
+
+ 1. **Clone the repository:**
+
+        git clone <repo_url>
+        cd youtube_surfer_ai_agent
+
+ 2. **Create and activate a virtual environment:**
+
+    - Linux/macOS:
+
+          python -m venv .venv
+          source .venv/bin/activate
+
+    - Windows:
+
+          python -m venv .venv
+          .venv\Scripts\activate
+
+ 3. **Install dependencies:**
+
+        pip install -r requirements.txt
+
+ 4. **Create a `.env` file** in the project root with your API keys:
+
+        YOUTUBE_API_KEY=your_youtube_api_key
+        OPENAI_API_KEY=your_openai_api_key
+
+ 5. **Run the application:**
+
+        python app.py
+
+ 6. **Open the Gradio interface** in your browser (default: http://127.0.0.1:7860).
+
+ ---
+
+ ## How to Use
+
+ - **Index Channels:** Click "+ Add" in the sidebar, paste one or more YouTube channel URLs (comma or newline separated), and click "Add Channels".
+ - **Refresh Channels:** Use the "🔄 Refresh" button in the sidebar to update all indexed channels.
+ - **Ask Questions:** Type a query in the text box and click "Get Answer" to receive a structured response with the matching videos.
+ - **View Indexed Channels:** The sidebar lists all indexed channels with clickable links.
+
+ ---
+
+ ## Notes
+
+ - The LLM uses structured outputs (`LLMAnswer` + `VideoItem`) internally to produce consistent results.
+ - Top videos are embedded as iframes in the Gradio interface.
+ - You can adjust the number of top videos returned by changing the `top_k` parameter of `answer_query` (default: 5).
+
+ ---
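The "Query & Retrieval" step in the README can be illustrated with a small, self-contained sketch. It ranks toy vectors by cosine similarity, which is an assumption made for illustration: the metric ChromaDB actually applies depends on how the collection is configured, and real embeddings have far more dimensions. The names `cosine_similarity`, `top_k_videos`, and `toy_index` are hypothetical and do not appear in the repository.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_videos(query_vec, indexed, k=2):
    """Return the k (video_id, score) pairs most similar to the query.

    `indexed` maps a video ID to its embedding vector (toy data here).
    """
    scored = [(vid, cosine_similarity(query_vec, vec)) for vid, vec in indexed.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional "embeddings" standing in for real ones
toy_index = {
    "vid_a": [1.0, 0.0, 0.0],
    "vid_b": [0.7, 0.7, 0.0],
    "vid_c": [0.0, 0.0, 1.0],
}

if __name__ == "__main__":
    for video_id, score in top_k_videos([1.0, 0.1, 0.0], toy_index):
        print(video_id, round(score, 3))
```

In the app itself this ranking is delegated to ChromaDB; the sketch only shows the idea of comparing a query vector against stored vectors.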
app.py ADDED
@@ -0,0 +1,149 @@
+ import os
+ import re
+
+ import chromadb
+ import gradio as gr
+ from dotenv import load_dotenv
+ from gradio_modal import Modal
+
+ from modules.answerer import answer_query
+ from modules.collector import fetch_channel_videos_from_url
+ from modules.db import get_indexed_channels
+ from modules.indexer import index_videos
+
+ load_dotenv()
+
+ # -------------------------------
+ # Setup Chroma
+ # -------------------------------
+ client = chromadb.PersistentClient(path="./youtube_db")
+ collection = client.get_or_create_collection("yt_metadata", embedding_function=None)
+
+
+ # -------------------------------
+ # Utils
+ # -------------------------------
+ def refresh_channel(api_key, channel_url: str):
+     """Fetch + re-index a single channel."""
+     videos = fetch_channel_videos_from_url(api_key, channel_url)
+     for v in videos:
+         v["channel_url"] = channel_url
+     index_videos(videos, collection, channel_url=channel_url)
+     return len(videos)
+
+
+ def index_channels(channel_urls: str):
+     yt_api_key = os.environ["YOUTUBE_API_KEY"]
+     # Accept comma- or newline-separated URLs and drop empty entries
+     urls = [u.strip() for u in re.split(r"[\n,]+", channel_urls) if u.strip()]
+     total_videos = sum(refresh_channel(yt_api_key, url) for url in urls)
+     return (
+         f"✅ Indexed {total_videos} videos from {len(urls)} channels.",
+         list_channels(),
+     )
+
+
+ def list_channels():
+     """Render the indexed channels as a Markdown bullet list."""
+     channels = get_indexed_channels(collection)
+     if not channels:
+         return "No channels indexed yet."
+     md = []
+     for key, val in channels.items():
+         if isinstance(val, dict):
+             cname = val.get("channel_title", "Unknown")
+             curl = val.get("channel_url", None)
+         else:
+             # Plain {channel_id: title} mapping: build the channel URL from the ID
+             cname = val
+             curl = f"https://www.youtube.com/channel/{key}"
+         if curl:
+             md.append(f"- **{cname}** ([link]({curl}))")
+         else:
+             md.append(f"- **{cname}**")
+     return "\n".join(md)
+
+
+ def refresh_all_channels():
+     yt_api_key = os.environ["YOUTUBE_API_KEY"]
+     channels = get_indexed_channels(collection)
+     if not channels:
+         return "⚠️ No channels available to refresh.", list_channels()
+     total_videos = 0
+     for key, val in channels.items():
+         # Fall back to the channel ID, which get_channel_id also accepts
+         url = val.get("channel_url") if isinstance(val, dict) else key
+         if url:
+             total_videos += refresh_channel(yt_api_key, url)
+     return (
+         f"🔄 Refreshed {len(channels)} channels, re-indexed {total_videos} videos.",
+         list_channels(),
+     )
+
+
+ def handle_query(query: str):
+     # answer_query returns (answer_text, video_html)
+     answer_text, video_html = answer_query(query, collection)
+     return answer_text, video_html
+
+
+ # -------------------------------
+ # Gradio UI
+ # -------------------------------
+ def show_component():
+     return gr.update(visible=True)
+
+
+ def hide_component():
+     return gr.update(visible=False)
+
+
+ def close_component():
+     return gr.update(open=False)
+
+
+ def open_component():
+     return gr.update(open=True)
+
+
+ with gr.Blocks() as demo:
+     gr.Markdown("## 📺 YouTube Metadata Q&A Agent")
+
+     with Modal(visible=False) as add_channel_modal:
+         channel_input = gr.Textbox(
+             label="Channel URLs",
+             placeholder="Paste one or more YouTube channel URLs (comma or newline separated)",
+         )
+         save_add_channels_btn = gr.Button("Add Channels")
+         index_status = gr.Markdown(label="Index Status", container=False)
+
+     with gr.Row():
+         with gr.Sidebar() as my_sidebar:
+             gr.Markdown("### 📺 Channels")
+             channel_list = gr.Markdown(list_channels())
+             with gr.Row():
+                 refresh_all_btn = gr.Button(
+                     "🔄 Refresh", size="sm", scale=0
+                 )
+                 add_channels_btn = gr.Button("+ Add", size="sm", scale=0)
+             refresh_status = gr.Markdown(label="Refresh Status", container=False)
+             refresh_all_btn.click(
+                 fn=refresh_all_channels,
+                 inputs=None,
+                 outputs=[refresh_status, channel_list],
+             )
+             # Close the sidebar while the modal is open, then reopen it after indexing
+             add_channels_btn.click(close_component, outputs=[my_sidebar]).then(show_component, outputs=[add_channel_modal])
+             save_add_channels_btn.click(
+                 index_channels,
+                 inputs=[channel_input],
+                 outputs=[index_status, channel_list],
+             ).then(hide_component, outputs=[add_channel_modal]).then(open_component, outputs=[my_sidebar])
+
+         with gr.Column(scale=3):
+             question = gr.Textbox(
+                 label="Ask a Question",
+                 placeholder="e.g., What topics did they cover on AI ethics?",
+             )
+             gr.Examples(
+                 [
+                     "Show me some videos that mention Ranganatha.",
+                     "Slokas that mention gajendra moksham",
+                 ],
+                 inputs=question,
+             )
+
+             answer = gr.Markdown()
+             video_embed = gr.HTML()  # the HTML table of top videos renders here
+
+             ask_btn = gr.Button("Get Answer")
+             ask_btn.click(handle_query, inputs=question, outputs=[answer, video_embed])
+
+ if __name__ == "__main__":
+     demo.launch()
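The way `index_channels` turns the textbox contents into a list of URLs can be tried in isolation. The sketch below uses the same `re.split(r"[\n,]+", ...)` pattern as `app.py`; the helper name `parse_channel_urls` and the de-duplication step are assumptions added for illustration and are not part of the commit.

```python
import re


def parse_channel_urls(raw: str) -> list[str]:
    """Split comma- or newline-separated input into unique, trimmed URLs."""
    seen = set()
    urls = []
    for part in re.split(r"[\n,]+", raw):
        url = part.strip()
        # Skip empty fragments and (as an added assumption) repeated URLs
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


if __name__ == "__main__":
    raw_input_text = "https://www.youtube.com/@a, https://www.youtube.com/@b\n\nhttps://www.youtube.com/@a ,"
    print(parse_channel_urls(raw_input_text))
```

De-duplicating up front would avoid fetching and embedding the same channel twice when a URL is pasted more than once.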
main.py ADDED
@@ -0,0 +1,38 @@
+ # modules/
+ # ├── collector.py
+ # ├── indexer.py
+ # ├── retriever.py
+ # ├── answerer.py
+ # └── main.py
+
+ import os
+
+ from dotenv import load_dotenv
+
+ from modules.answerer import answer_query
+ from modules.collector import fetch_channel_videos
+ from modules.db import get_collection
+ from modules.indexer import index_videos
+
+ # -------------------------------
+ # 5. Main
+ # -------------------------------
+ def main():
+     load_dotenv()
+     YT_API_KEY = os.getenv("YOUTUBE_API_KEY")
+     CHANNELS = ["UCqa48rNanVRKmG4qxl-YmEQ"]  # YouTube channel IDs
+
+     collection = get_collection()
+
+     # Collect + Index
+     for ch in CHANNELS:
+         videos = fetch_channel_videos(YT_API_KEY, ch)
+         # index_videos expects the channel URL; derive it from the channel ID
+         index_videos(videos, collection, channel_url=f"https://www.youtube.com/channel/{ch}")
+
+     # Ask a question
+     query = "Show me some videos that mention Ranganatha."
+     print(answer_query(query, collection))
+
+
+ if __name__ == "__main__":
+     main()
modules/answerer.py ADDED
@@ -0,0 +1,109 @@
+ # -------------------------------
+ # 4. Answerer
+ # -------------------------------
+ from html import escape
+ from typing import List, Tuple
+
+ from openai import OpenAI
+ from pydantic import BaseModel
+
+ from modules.retriever import retrieve_videos
+
+ # -------------------------------
+ # Structured Output Classes
+ # -------------------------------
+ class VideoItem(BaseModel):
+     video_id: str
+     title: str
+     channel: str
+     description: str
+
+
+ class LLMAnswer(BaseModel):
+     answer_text: str
+     top_videos: List[VideoItem]
+
+
+ # -------------------------------
+ # Main Function
+ # -------------------------------
+ def answer_query(query: str, collection, top_k: int = 5) -> Tuple[str, str]:
+     """
+     Answer a user query using YouTube video metadata.
+     Returns (answer_text, video_html): the answer text and an HTML table of the top videos.
+     """
+     results = retrieve_videos(query, collection, top_k=top_k)
+
+     if not results:
+         # Keep the same (text, html) shape as the normal return value
+         return "No relevant videos found.", build_video_html([])
+
+     # Build context lines for the LLM
+     context_lines = []
+     top_videos_list = []
+     for r in results:
+         # Ensure each result is a dict
+         if not isinstance(r, dict):
+             continue
+         vid_id = r.get("video_id", "")
+         title = r.get("video_title") or r.get("title", "")
+         channel = r.get("channel") or r.get("channel_title", "")
+         description = r.get("description", "")
+         context_lines.append(f"- {title} ({channel}) (https://youtube.com/watch?v={vid_id})\n  description: {description}")
+
+         top_videos_list.append(
+             VideoItem(
+                 video_id=vid_id,
+                 title=title,
+                 channel=channel,
+                 description=description
+             )
+         )
+
+     context_text = "\n".join(context_lines)
+
+     # Call LLM with structured output
+     client = OpenAI()
+     response = client.chat.completions.parse(
+         model="gpt-4o-mini",
+         messages=[
+             {
+                 "role": "system",
+                 "content": (
+                     "You are a helpful assistant that answers questions using YouTube video metadata. "
+                     "Return your response strictly as the LLMAnswer class, including 'answer_text' and a list of 'top_videos'."
+                 )
+             },
+             {
+                 "role": "user",
+                 "content": f"Question: {query}\n\nRelevant videos:\n{context_text}\n\nAnswer based only on this."
+             }
+         ],
+         response_format=LLMAnswer
+     )
+
+     llm_answer = response.choices[0].message.parsed  # LLMAnswer, or None if nothing was parsed
+     if llm_answer is None:
+         # Fall back to the retrieved videos when no structured answer is available
+         llm_answer = LLMAnswer(answer_text="No answer could be generated.", top_videos=top_videos_list)
+     answer_text = llm_answer.answer_text
+     video_html = build_video_html(llm_answer.top_videos)
+     return answer_text, video_html
+
+
+ def build_video_html(videos: List[VideoItem]) -> str:
+     """Build a clean HTML table from top_videos."""
+     if not videos:
+         return "<p>No relevant videos found.</p>"
+
+     table = """
+     <table border="1" style="border-collapse: collapse; width: 100%;">
+         <tr>
+             <th>Title</th>
+             <th>Channel</th>
+             <th>Description</th>
+             <th>Watch</th>
+         </tr>
+     """
+     for v in videos:
+         # Escape every field: titles and descriptions are untrusted text from YouTube
+         table += f"""
+         <tr>
+             <td>{escape(v.title)}</td>
+             <td>{escape(v.channel)}</td>
+             <td>{escape(v.description)}</td>
+             <td><a href="https://youtube.com/watch?v={escape(v.video_id)}" target="_blank">▶️ Watch</a></td>
+         </tr>
+         """
+     table += "</table>"
+     return table
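Because video titles and descriptions come from YouTube and end up interpolated into an HTML table, escaping is worth seeing on its own. The following is a minimal standalone sketch using only the standard library's `html.escape`; the helper name `video_row` is hypothetical and simplified compared with `build_video_html`.

```python
from html import escape


def video_row(title: str, channel: str, video_id: str) -> str:
    """Render one table row, escaping all text taken from video metadata."""
    url = f"https://youtube.com/watch?v={escape(video_id)}"
    return (
        "<tr>"
        f"<td>{escape(title)}</td>"
        f"<td>{escape(channel)}</td>"
        f'<td><a href="{url}" target="_blank">Watch</a></td>'
        "</tr>"
    )


if __name__ == "__main__":
    # A title containing markup is rendered as text instead of being executed
    print(video_row("<script>alert(1)</script>", "Some Channel", "abc123"))
```

`html.escape` also escapes quotes by default, which keeps the `href` attribute intact even if a value contains `"`.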
modules/collector.py ADDED
@@ -0,0 +1,69 @@
+ # -------------------------------
+ # 1. Collector
+ # -------------------------------
+ from typing import Dict, List
+
+ from googleapiclient.discovery import build
+
+ from modules.youtube_utils import get_channel_id
+
+
+ def _fetch_videos(youtube, channel_id: str, max_results: int) -> List[Dict]:
+     """Return metadata for the most recent videos of a channel."""
+     # Get channel details to fetch its title
+     channel_response = youtube.channels().list(
+         part="snippet",
+         id=channel_id
+     ).execute()
+     items = channel_response.get("items", [])
+     if not items:
+         raise ValueError(f"Channel not found: {channel_id}")
+     channel_title = items[0]["snippet"]["title"]
+
+     # Newest uploads first; a single request, so no pagination
+     request = youtube.search().list(
+         part="snippet",
+         channelId=channel_id,
+         maxResults=max_results,
+         order="date"
+     )
+     response = request.execute()
+
+     videos = []
+     for item in response.get("items", []):
+         # Search results can also contain playlists and channels; keep videos only
+         if item["id"]["kind"] == "youtube#video":
+             videos.append({
+                 "video_id": item["id"]["videoId"],
+                 "title": item["snippet"]["title"],
+                 "description": item["snippet"].get("description", ""),
+                 "channel_id": channel_id,
+                 "channel_title": channel_title,
+             })
+     return videos
+
+
+ def fetch_channel_videos_from_url(api_key: str, channel_url: str, max_results=20) -> List[Dict]:
+     """Resolve a channel URL or handle to its ID, then fetch its videos."""
+     youtube = build("youtube", "v3", developerKey=api_key)
+     channel_id = get_channel_id(youtube, channel_url)
+     return _fetch_videos(youtube, channel_id, max_results)
+
+
+ def fetch_channel_videos(api_key: str, channel_id: str, max_results=20) -> List[Dict]:
+     """Fetch videos for a channel that is already identified by its ID."""
+     youtube = build("youtube", "v3", developerKey=api_key)
+     return _fetch_videos(youtube, channel_id, max_results)
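The collector issues a single `search().list` request, so it sees only one page of results. YouTube Data API list responses carry a `nextPageToken` field when more results exist, and requests accept a matching `pageToken` parameter. A minimal sketch of following those tokens is shown below; `collect_all_pages` and `fake_fetch` are hypothetical names, and the fetch function is injected so the example runs without an API key.

```python
from typing import Callable, Dict, List, Optional


def collect_all_pages(
    fetch_page: Callable[[Optional[str]], Dict],
    max_items: int = 100,
) -> List[Dict]:
    """Follow nextPageToken until no pages remain or max_items is reached."""
    items: List[Dict] = []
    page_token: Optional[str] = None
    while len(items) < max_items:
        response = fetch_page(page_token)
        items.extend(response.get("items", []))
        page_token = response.get("nextPageToken")
        if not page_token:
            break
    return items[:max_items]


# Fake two-page response standing in for the real API (illustrative data only)
_PAGES = {
    None: {"items": [{"id": 1}, {"id": 2}], "nextPageToken": "p2"},
    "p2": {"items": [{"id": 3}]},
}


def fake_fetch(page_token: Optional[str]) -> Dict:
    return _PAGES[page_token]


if __name__ == "__main__":
    print(collect_all_pages(fake_fetch))
```

With the real client, `fetch_page` would wrap a call along the lines of `youtube.search().list(..., pageToken=token).execute()`. Each extra page costs additional API quota, so a cap such as `max_items` is a sensible guard.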
modules/db.py ADDED
@@ -0,0 +1,36 @@
+ # modules/db.py
+ import chromadb
+
+
+ def get_collection():
+     client = chromadb.PersistentClient(path="./youtube_db")
+     collection = client.get_or_create_collection("yt_metadata")
+
+     # Check for a dimension mismatch with a quick test query
+     # (text-embedding-3-small produces 1536-dimensional vectors)
+     try:
+         collection.query(query_embeddings=[[0.0] * 1536], n_results=1)
+     except Exception:
+         # Delete and recreate with a fresh schema; this discards everything indexed so far
+         client.delete_collection("yt_metadata")
+         collection = client.create_collection("yt_metadata")
+
+     return collection
+
+
+ def get_indexed_channels(collection):
+     """Map each indexed channel_id to its title and URL."""
+     results = collection.get(include=["metadatas"])
+     channels = {}
+
+     for meta in results["metadatas"] or []:
+         if not meta:
+             continue
+         cid = meta.get("channel_id")
+
+         if cid:  # only include if we have a channel_id
+             channels[cid] = {
+                 "channel_title": meta.get("channel_title", "Unknown Channel"),
+                 "channel_url": meta.get("channel_url"),
+             }
+
+     return channels
modules/indexer.py ADDED
@@ -0,0 +1,34 @@
+ # modules/indexer.py
+ from typing import Dict, List
+
+ from openai import OpenAI
+
+
+ def index_videos(videos: List[Dict], collection, channel_url: str = ""):
+     client = OpenAI()
+
+     for vid in videos:
+         video_id = vid.get("video_id")
+         if not video_id:
+             continue  # every record needs an ID
+
+         text = f"{vid.get('title', '')} - {vid.get('description', '')}"
+         embedding = client.embeddings.create(
+             input=text,
+             model="text-embedding-3-small"
+         ).data[0].embedding
+
+         # build metadata safely
+         metadata = {
+             "video_id": video_id,
+             "video_title": vid.get("title", ""),
+             "description": vid.get("description", ""),
+             "channel_url": channel_url,
+         }
+
+         # add channel info if available
+         if "channel_id" in vid:
+             metadata["channel_id"] = vid["channel_id"]
+         if "channel_title" in vid:
+             metadata["channel_title"] = vid["channel_title"]
+
+         # upsert rather than add, so re-indexing a channel updates records that already exist
+         collection.upsert(
+             documents=[text],
+             embeddings=[embedding],
+             metadatas=[metadata],
+             ids=[video_id]
+         )
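The indexer requests one embedding per video. The OpenAI embeddings endpoint also accepts a list of strings as `input`, so requests could be batched. Below is a minimal sketch of the batching logic only; `chunked`, `embed_in_batches`, and `fake_embed` are hypothetical names, and the embedding call is injected as a function so the example runs offline.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 64,
) -> List[List[float]]:
    """Embed texts batch by batch, preserving the input order."""
    vectors: List[List[float]] = []
    for batch in chunked(texts, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors


def fake_embed(batch: List[str]) -> List[List[float]]:
    # Stand-in for a real embedding call: one 1-dimensional vector per text
    return [[float(len(text))] for text in batch]


if __name__ == "__main__":
    print(embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2))
```

With the real client, `embed_batch` could be a small wrapper that calls `client.embeddings.create(input=batch, model="text-embedding-3-small")` and collects the `embedding` of each item in `.data`; the batch size of 64 is an arbitrary assumption, not a documented limit.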
modules/retriever.py ADDED
@@ -0,0 +1,36 @@
+ # modules/retriever.py
+ from typing import Dict, List
+
+ from openai import OpenAI
+
+
+ def retrieve_videos(query: str, collection, top_k: int = 3) -> List[Dict]:
+     client = OpenAI()
+
+     # Create embedding for query
+     embedding = client.embeddings.create(
+         input=query,
+         model="text-embedding-3-small"
+     ).data[0].embedding
+
+     # Query Chroma
+     results = collection.query(
+         query_embeddings=[embedding],
+         n_results=top_k,
+         include=["metadatas", "documents", "distances"]
+     )
+
+     # Build list of standardized dicts; each result field holds one inner list per query
+     videos = []
+     metadatas_list = (results.get("metadatas") or [[]])[0]  # list of metadata dicts
+     documents_list = (results.get("documents") or [[]])[0]  # list of "title - description" texts
+     distances_list = (results.get("distances") or [[]])[0]  # optional
+
+     for idx, meta in enumerate(metadatas_list):
+         meta = meta or {}
+         document = documents_list[idx] if idx < len(documents_list) else ""
+         videos.append({
+             "video_id": meta.get("video_id", ""),
+             "video_title": meta.get("video_title") or meta.get("title") or document,
+             "channel": meta.get("channel") or meta.get("channel_title", ""),
+             # Prefer the stored description; the document also contains the title
+             "description": meta.get("description") or document,
+             "score": distances_list[idx] if idx < len(distances_list) else None  # distance: lower is closer
+         })
+
+     return videos
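The shape of a Chroma query result explains the `[0]` indexing in the retriever: each field is a list with one inner list per query embedding. A minimal sketch of flattening such a result for a single query is given below; `flatten_query_result` is a hypothetical helper and `sample` is hand-written illustrative data, not output from a real collection.

```python
def flatten_query_result(result: dict) -> list[dict]:
    """Turn a single-query result (lists of lists) into one dict per hit."""
    metadatas = (result.get("metadatas") or [[]])[0]
    documents = (result.get("documents") or [[]])[0]
    distances = (result.get("distances") or [[]])[0]

    rows = []
    for idx, meta in enumerate(metadatas):
        rows.append({
            "metadata": meta or {},
            "document": documents[idx] if idx < len(documents) else "",
            "distance": distances[idx] if idx < len(distances) else None,
        })
    return rows


# Illustrative result for one query with two hits
sample = {
    "ids": [["v1", "v2"]],
    "metadatas": [[{"video_title": "A"}, {"video_title": "B"}]],
    "documents": [["A - first", "B - second"]],
    "distances": [[0.12, 0.48]],
}

if __name__ == "__main__":
    for row in flatten_query_result(sample):
        print(row)
```

The `or [[]]` guard covers fields that are missing or `None`, which can happen when a field is not requested through `include`.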
modules/youtube_utils.py ADDED
@@ -0,0 +1,26 @@
+ import re
+
+
+ def get_channel_id(youtube, channel_url: str) -> str:
+     """
+     Extract channel ID from a YouTube URL or handle.
+     Supports:
+     - https://www.youtube.com/channel/UCxxxx
+     - https://www.youtube.com/@handle
+     - @handle
+     - UCxxxx (a bare channel ID)
+     """
+     channel_url = channel_url.strip()
+
+     # If the URL already contains a UC... ID
+     if "channel/" in channel_url:
+         # Drop any trailing path or query, e.g. ".../channel/UCxxxx/videos"
+         return re.split(r"[/?#]", channel_url.split("channel/")[-1])[0]
+
+     # If it's a handle (@xyz or full URL)
+     if "@" in channel_url:
+         # Drop any trailing path or query, e.g. "@handle/videos"
+         handle = re.split(r"[/?#]", channel_url.split("@")[-1])[0]
+         request = youtube.channels().list(
+             part="id",
+             forHandle=handle
+         )
+         response = request.execute()
+         items = response.get("items", [])
+         if not items:
+             raise ValueError(f"No channel found for handle @{handle}")
+         return items[0]["id"]
+
+     if channel_url.startswith("UC"):
+         return channel_url
+
+     raise ValueError(f"Unsupported channel URL format: {channel_url}")
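The URL forms listed in the `get_channel_id` docstring can also be told apart with `urllib.parse` rather than string splitting. The sketch below only classifies a reference as an ID or a handle and makes no API call; `classify_channel_reference` is a hypothetical helper, not part of the commit.

```python
from urllib.parse import urlparse


def classify_channel_reference(ref: str) -> tuple[str, str]:
    """Return ("id", value) or ("handle", value) for a channel reference."""
    ref = ref.strip()
    if ref.startswith("@"):
        return "handle", ref[1:]
    if ref.startswith("UC") and "/" not in ref:
        return "id", ref

    # Only the path matters; query strings and fragments are ignored
    path_parts = [part for part in urlparse(ref).path.split("/") if part]
    if len(path_parts) >= 2 and path_parts[0] == "channel":
        return "id", path_parts[1]
    if path_parts and path_parts[0].startswith("@"):
        return "handle", path_parts[0][1:]
    raise ValueError(f"Unsupported channel reference: {ref}")


if __name__ == "__main__":
    for example in (
        "https://www.youtube.com/channel/UCabc123/videos",
        "https://www.youtube.com/@somehandle?si=xyz",
        "@somehandle",
        "UCabc123",
    ):
        print(example, "->", classify_channel_reference(example))
```

A handle still has to be resolved to a channel ID through the API; separating classification from resolution keeps the parsing testable without network access.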
pyproject.toml ADDED
@@ -0,0 +1,14 @@
+ [project]
+ name = "youtube-surfer-ai-agent"
+ version = "0.1.0"
+ description = "Add your description here"
+ readme = "README.md"
+ requires-python = ">=3.13"
+ dependencies = [
+     "chromadb>=1.0.20",
+     "dotenv>=0.9.9",
+     "google-api-python-client>=2.179.0",
+     "gradio>=5.44.0",
+     "gradio-modal>=0.0.4",
+     "openai>=1.102.0",
+ ]
requirements.txt ADDED
@@ -0,0 +1,369 @@
1
+ # This file was autogenerated by uv via the following command:
2
+ # uv pip compile pyproject.toml -o requirements.txt
3
+ aiofiles==24.1.0
4
+ # via gradio
5
+ annotated-types==0.7.0
6
+ # via pydantic
7
+ anyio==4.10.0
8
+ # via
9
+ # gradio
10
+ # httpx
11
+ # openai
12
+ # starlette
13
+ # watchfiles
14
+ attrs==25.3.0
15
+ # via
16
+ # jsonschema
17
+ # referencing
18
+ audioop-lts==0.2.2
19
+ # via gradio
20
+ backoff==2.2.1
21
+ # via posthog
22
+ bcrypt==4.3.0
23
+ # via chromadb
24
+ brotli==1.1.0
25
+ # via gradio
26
+ build==1.3.0
27
+ # via chromadb
28
+ cachetools==5.5.2
29
+ # via google-auth
30
+ certifi==2025.8.3
31
+ # via
32
+ # httpcore
33
+ # httpx
34
+ # kubernetes
35
+ # requests
36
+ charset-normalizer==3.4.3
37
+ # via requests
38
+ chromadb==1.0.20
39
+ # via youtube-surfer-ai-agent (pyproject.toml)
40
+ click==8.2.1
41
+ # via
42
+ # typer
43
+ # uvicorn
44
+ colorama==0.4.6
45
+ # via
46
+ # build
47
+ # click
48
+ # tqdm
49
+ # uvicorn
50
+ coloredlogs==15.0.1
51
+ # via onnxruntime
52
+ distro==1.9.0
53
+ # via
54
+ # openai
55
+ # posthog
56
+ dotenv==0.9.9
57
+ # via youtube-surfer-ai-agent (pyproject.toml)
58
+ durationpy==0.10
59
+ # via kubernetes
60
+ fastapi==0.116.1
61
+ # via gradio
62
+ ffmpy==0.6.1
63
+ # via gradio
64
+ filelock==3.19.1
65
+ # via huggingface-hub
66
+ flatbuffers==25.2.10
67
+ # via onnxruntime
68
+ fsspec==2025.7.0
69
+ # via
70
+ # gradio-client
71
+ # huggingface-hub
72
+ google-api-core==2.25.1
73
+ # via google-api-python-client
74
+ google-api-python-client==2.179.0
75
+ # via youtube-surfer-ai-agent (pyproject.toml)
76
+ google-auth==2.40.3
77
+ # via
78
+ # google-api-core
79
+ # google-api-python-client
80
+ # google-auth-httplib2
81
+ # kubernetes
82
+ google-auth-httplib2==0.2.0
83
+ # via google-api-python-client
84
+ googleapis-common-protos==1.70.0
85
+ # via
86
+ # google-api-core
87
+ # opentelemetry-exporter-otlp-proto-grpc
88
+ gradio==5.44.0
89
+ # via
90
+ # youtube-surfer-ai-agent (pyproject.toml)
91
+ # gradio-modal
92
+ gradio-client==1.12.1
93
+ # via gradio
94
+ gradio-modal==0.0.4
95
+ # via youtube-surfer-ai-agent (pyproject.toml)
96
+ groovy==0.1.2
97
+ # via gradio
98
+ grpcio==1.74.0
99
+ # via
100
+ # chromadb
101
+ # opentelemetry-exporter-otlp-proto-grpc
102
+ h11==0.16.0
103
+ # via
104
+ # httpcore
105
+ # uvicorn
106
+ httpcore==1.0.9
107
+ # via httpx
108
+ httplib2==0.22.0
109
+ # via
110
+ # google-api-python-client
111
+ # google-auth-httplib2
112
+ httptools==0.6.4
113
+ # via uvicorn
114
+ httpx==0.28.1
115
+ # via
116
+ # chromadb
117
+ # gradio
118
+ # gradio-client
119
+ # openai
120
+ # safehttpx
121
+ huggingface-hub==0.34.4
122
+ # via
123
+ # gradio
124
+ # gradio-client
125
+ # tokenizers
126
+ humanfriendly==10.0
127
+ # via coloredlogs
128
+ idna==3.10
129
+ # via
130
+ # anyio
131
+ # httpx
132
+ # requests
133
+ importlib-metadata==8.7.0
134
+ # via opentelemetry-api
135
+ importlib-resources==6.5.2
136
+ # via chromadb
137
+ jinja2==3.1.6
138
+ # via gradio
139
+ jiter==0.10.0
140
+ # via openai
141
+ jsonschema==4.25.1
142
+ # via chromadb
143
+ jsonschema-specifications==2025.4.1
144
+ # via jsonschema
145
+ kubernetes==33.1.0
146
+ # via chromadb
147
+ markdown-it-py==4.0.0
148
+ # via rich
149
+ markupsafe==3.0.2
150
+ # via
151
+ # gradio
152
+ # jinja2
153
+ mdurl==0.1.2
154
+ # via markdown-it-py
155
+ mmh3==5.2.0
156
+ # via chromadb
157
+ mpmath==1.3.0
158
+ # via sympy
159
+ numpy==2.3.2
160
+ # via
161
+ # chromadb
162
+ # gradio
163
+ # onnxruntime
164
+ # pandas
165
+ oauthlib==3.3.1
166
+ # via
167
+ # kubernetes
168
+ # requests-oauthlib
169
+ onnxruntime==1.22.1
170
+ # via chromadb
171
+ openai==1.102.0
172
+ # via youtube-surfer-ai-agent (pyproject.toml)
173
+ opentelemetry-api==1.36.0
174
+ # via
175
+ # chromadb
176
+ # opentelemetry-exporter-otlp-proto-grpc
177
+ # opentelemetry-sdk
178
+ # opentelemetry-semantic-conventions
179
+ opentelemetry-exporter-otlp-proto-common==1.36.0
180
+ # via opentelemetry-exporter-otlp-proto-grpc
181
+ opentelemetry-exporter-otlp-proto-grpc==1.36.0
182
+ # via chromadb
183
+ opentelemetry-proto==1.36.0
184
+ # via
185
+ # opentelemetry-exporter-otlp-proto-common
186
+ # opentelemetry-exporter-otlp-proto-grpc
187
+ opentelemetry-sdk==1.36.0
188
+ # via
189
+ # chromadb
190
+ # opentelemetry-exporter-otlp-proto-grpc
191
+ opentelemetry-semantic-conventions==0.57b0
192
+ # via opentelemetry-sdk
193
+ orjson==3.11.3
194
+ # via
195
+ # chromadb
196
+ # gradio
197
+ overrides==7.7.0
198
+ # via chromadb
199
+ packaging==25.0
200
+ # via
201
+ # build
202
+ # gradio
203
+ # gradio-client
204
+ # huggingface-hub
205
+ # onnxruntime
206
+ pandas==2.3.2
207
+ # via gradio
208
+ pillow==11.3.0
209
+ # via gradio
210
+ posthog==5.4.0
211
+ # via chromadb
212
+ proto-plus==1.26.1
213
+ # via google-api-core
214
+ protobuf==6.32.0
215
+ # via
216
+ # google-api-core
217
+ # googleapis-common-protos
218
+ # onnxruntime
219
+ # opentelemetry-proto
220
+ # proto-plus
221
+ pyasn1==0.6.1
222
+ # via
223
+ # pyasn1-modules
224
+ # rsa
225
+ pyasn1-modules==0.4.2
226
+ # via google-auth
227
+ pybase64==1.4.2
228
+ # via chromadb
229
+ pydantic==2.11.7
230
+ # via
231
+ # chromadb
232
+ # fastapi
233
+ # gradio
234
+ # openai
235
+ pydantic-core==2.33.2
236
+ # via pydantic
237
+ pydub==0.25.1
238
+ # via gradio
239
+ pygments==2.19.2
240
+ # via rich
241
+ pyparsing==3.2.3
242
+ # via httplib2
243
+ pypika==0.48.9
244
+ # via chromadb
245
+ pyproject-hooks==1.2.0
246
+ # via build
247
+ pyreadline3==3.5.4
248
+ # via humanfriendly
249
+ python-dateutil==2.9.0.post0
250
+ # via
251
+ # kubernetes
252
+ # pandas
253
+ # posthog
254
+ python-dotenv==1.1.1
255
+ # via
256
+ # dotenv
257
+ # uvicorn
258
+ python-multipart==0.0.20
259
+ # via gradio
260
+ pytz==2025.2
261
+ # via pandas
262
+ pyyaml==6.0.2
263
+ # via
264
+ # chromadb
265
+ # gradio
266
+ # huggingface-hub
267
+ # kubernetes
268
+ # uvicorn
269
+ referencing==0.36.2
270
+ # via
271
+ # jsonschema
272
+ # jsonschema-specifications
273
+ requests==2.32.5
274
+ # via
275
+ # google-api-core
276
+ # huggingface-hub
277
+ # kubernetes
278
+ # posthog
279
+ # requests-oauthlib
280
+ requests-oauthlib==2.0.0
281
+ # via kubernetes
282
+ rich==14.1.0
283
+ # via
284
+ # chromadb
285
+ # typer
286
+ rpds-py==0.27.0
287
+ # via
288
+ # jsonschema
289
+ # referencing
290
+ rsa==4.9.1
291
+ # via google-auth
292
+ ruff==0.12.10
293
+ # via gradio
294
+ safehttpx==0.1.6
295
+ # via gradio
296
+ semantic-version==2.10.0
297
+ # via gradio
298
+ shellingham==1.5.4
299
+ # via typer
300
+ six==1.17.0
301
+ # via
302
+ # kubernetes
303
+ # posthog
304
+ # python-dateutil
305
+ sniffio==1.3.1
306
+ # via
307
+ # anyio
308
+ # openai
309
+ starlette==0.47.3
310
+ # via
311
+ # fastapi
312
+ # gradio
313
+ sympy==1.14.0
314
+ # via onnxruntime
315
+ tenacity==9.1.2
316
+ # via chromadb
317
+ tokenizers==0.21.4
318
+ # via chromadb
319
+ tomlkit==0.13.3
320
+ # via gradio
321
+ tqdm==4.67.1
322
+ # via
323
+ # chromadb
324
+ # huggingface-hub
325
+ # openai
326
+ typer==0.16.1
327
+ # via
328
+ # chromadb
329
+ # gradio
330
+ typing-extensions==4.15.0
331
+ # via
332
+ # chromadb
333
+ # fastapi
334
+ # gradio
335
+ # gradio-client
336
+ # huggingface-hub
337
+ # openai
338
+ # opentelemetry-api
339
+ # opentelemetry-exporter-otlp-proto-grpc
340
+ # opentelemetry-sdk
341
+ # opentelemetry-semantic-conventions
342
+ # pydantic
343
+ # pydantic-core
344
+ # typer
345
+ # typing-inspection
346
+ typing-inspection==0.4.1
347
+ # via pydantic
348
+ tzdata==2025.2
349
+ # via pandas
350
+ uritemplate==4.2.0
351
+ # via google-api-python-client
352
+ urllib3==2.5.0
353
+ # via
354
+ # kubernetes
355
+ # requests
356
+ uvicorn==0.35.0
357
+ # via
358
+ # chromadb
359
+ # gradio
360
+ watchfiles==1.1.0
361
+ # via uvicorn
362
+ websocket-client==1.8.0
363
+ # via kubernetes
364
+ websockets==15.0.1
365
+ # via
366
+ # gradio-client
367
+ # uvicorn
368
+ zipp==3.23.0
369
+ # via importlib-metadata
tests/search.py ADDED
@@ -0,0 +1,14 @@
+ from dotenv import load_dotenv
+
+ from modules.db import get_collection
+ from modules.retriever import retrieve_videos
+
+ load_dotenv()
+
+ collection = get_collection()
+
+ all_metas = collection.get(include=["metadatas"])["metadatas"]
+ print("Sample metadatas:", all_metas[:5])
+
+ print("-------")
+ # Print the retrieved videos so the script shows the search results
+ print(retrieve_videos("Show me some videos that mention Ranganatha.", collection))
uv.lock ADDED
The diff for this file is too large to render. See raw diff