longobardomartin committed on
Commit 1d9f996 (verified)
Parent(s): db31bb9

final commit

Files changed (7):
  1. README.md +57 -12
  2. agent.py +80 -0
  3. app.py +39 -0
  4. knowledgebase.py +86 -0
  5. requirements.txt +15 -0
  6. urls.txt +9 -0
  7. utils.py +71 -0
README.md CHANGED
@@ -1,12 +1,57 @@
- ---
- title: Final Project Marbellia
- emoji: 👀
- colorFrom: gray
- colorTo: red
- sdk: gradio
- sdk_version: 5.13.2
- app_file: app.py
- pinned: false
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ ![logo_ironhack_blue 7](https://user-images.githubusercontent.com/23629340/40541063-a07a0a8a-601a-11e8-91b5-2f13e4e6b441.png)
+
+ # Final Project | Marbella Tourism Q&A Chatbot
+
+ ### Project Overview
+
+ This project builds a YouTube-based Q&A chatbot using Gradio, LangChain, Pinecone, and OpenAI. The chatbot answers questions about tourism in the city of Marbella, leveraging both internal document vectors and web search. Users can update the knowledge base by adding new YouTube video URLs to the urls.txt file.
+
+ Please visit the following link to try the agent: https://huggingface.co/spaces/longobardomartin/proyectofinal
+
+ ### Table of Contents
+
+ - Folder Structure
+ - Environment Setup
+ - Project Architecture
+ - Usage
+
+ ### Folder Structure
+
+ - app.py - Main script to run the chatbot.
+ - knowledgebase.py - Script to fetch transcripts from YouTube videos and store the data in Pinecone.
+ - agent.py - Script that defines the agent's behaviour.
+ - utils.py - Script with auxiliary functions.
+ - urls.txt - Text file containing the YouTube video links used in the project.
+ - requirements.txt - Text file listing the libraries and dependencies to install locally.
+ - README.md - Project documentation (this file).
+ - Marbella turism.pdf - Presentation slides for the Final Project.
+
+ ### Environment Setup
+
+ 1. Add your environment variables by setting up a .env file or using prompts in the script (see the loading sketch below):
+    - OPENAI_API_KEY: API key for OpenAI.
+    - LANGCHAIN_API_KEY: API key for LangChain.
+    - PINECONE_API_KEY: API key for Pinecone.
+    - SERPAPI_API_KEY: API key for SerpAPI.
+    - HUGGINGFACEHUB_API_TOKEN: API token for the Hugging Face Inference API (used by utils.py for summarization).
+
+ ### Project Architecture
+
+ The chatbot uses the following architecture:
+
+ ![Solution Architecture](solution_architecture.png)
+
+ - Data Retrieval: Combines a vector database (Pinecone) for retrieval from the transcript knowledge base with SerpAPI for web search.
+ - Routing: Uses LangChain's ReAct agent to dynamically route user questions to the appropriate source (vector store or web search) via Tools.
+ - Memory: ConversationBufferWindowMemory maintains recent chat history for contextual, multi-turn conversations.
+ - LLM Integration: GPT-4 processes user queries, generates responses, and summarizes search results, supported by the conversation memory.
+
+ ### Usage
+
+ - Use requirements.txt to install the necessary packages.
+ - Add video links to urls.txt (one link per line).
+ - Run the script to generate transcriptions, create embeddings, and set up the chatbot interface.
+ - Interact with the chatbot by typing questions relevant to the video content.
+ - The script also serves a Gradio app locally.
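For reference, here is a minimal sketch of the key-loading pattern shared by the project's scripts (it mirrors the load_dotenv calls in agent.py, app.py, knowledgebase.py, and utils.py; the variable names are the ones from the Environment Setup list):

```python
# Minimal sketch of the .env loading pattern used across the scripts.
import os
from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())  # picks up the .env file at the project root

OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
PINECONE_API_KEY = os.getenv('PINECONE_API_KEY')
SERPAPI_API_KEY = os.getenv('SERPAPI_API_KEY')
HUGGINGFACEHUB_API_TOKEN = os.getenv('HUGGINGFACEHUB_API_TOKEN')
```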
agent.py ADDED
@@ -0,0 +1,80 @@
+ from langchain.agents import Tool
+ from langchain.agents import initialize_agent
+ from langchain_openai import ChatOpenAI
+ from langchain.chains.conversation.memory import ConversationBufferWindowMemory
+ from langchain.chains import LLMChain
+ from langchain.prompts import PromptTemplate
+ from utils import get_question_context, google_search_result
+ import os
+ from dotenv import load_dotenv, find_dotenv
+ _ = load_dotenv(find_dotenv())
+ OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
+
+ # Prompt template for tourism queries about Marbella
+ turism_template = """You are a very experienced tourist guide specialised in recommending activities \
+ and things to do in Marbella, a city located in Andalusia, Spain. \
+ You have an excellent knowledge and understanding of restaurants, sports, activities, experiences and places to visit in the city, \
+ specifically targeted at families, couples, friends and solo travellers. \
+ You have the ability to think, reflect, debate, discuss and evaluate the data stored in a knowledge base built from YouTube videos about \
+ tourism in Marbella, and the ability to make use of it to support your advice to the future tourists who will visit the city. \
+ Remember: your answer must be accurate and grounded in your knowledge base. \
+ Here is a question from a user: \
+ {input}"""
+
+ # Fallback prompt template for questions outside the Marbella knowledge base
+ default_template = """You are a bot specialised in giving answers to questions about a wide range of topics. \
+ You are provided with the user's question and context from the first non-sponsored URL of a Google search. \
+ If you don't know the answer, simply say "I don't know", but if you do, please answer the question precisely. \
+ Here is a question from a user and a bit of context from Google Search: \
+ {input}"""
+
+ llm = ChatOpenAI(model='gpt-4', temperature=0)
+
+ # Answer from the Pinecone knowledge base: retrieve context passages, then prompt the LLM
+ def get_turism_answer(input):
+     input = get_question_context(query=input, top_k=3)
+     llm_prompt = PromptTemplate.from_template(turism_template)
+     chain = LLMChain(llm=llm, prompt=llm_prompt)
+     answer = chain.run(input)
+     return answer
+
+ # Answer from the web: summarize the top Google result, then prompt the LLM with it
+ def get_internet_answer(input):
+     context = google_search_result(input)
+     input = f"User question: {input} \n Context for answering the user question: {context}"
+     llm_prompt = PromptTemplate.from_template(default_template)
+     chain = LLMChain(llm=llm, prompt=llm_prompt)
+     answer = chain.run(input)
+     return answer
+
+ tools = [
+     Tool(
+         name='Turism knowledgebase tool',
+         func=get_turism_answer,
+         description='Use this tool when answering questions about tourism in Marbella.'
+     ),
+     Tool(
+         name='Default knowledgebase tool',
+         func=get_internet_answer,
+         description='Use this tool when the input question is not related to tourism in Marbella.'
+     )
+ ]
+
+ # Conversational memory: keep the last 5 exchanges for multi-turn context
+ conversational_memory = ConversationBufferWindowMemory(
+     memory_key='chat_history',
+     k=5,
+     return_messages=True
+ )
+
+ agent = initialize_agent(
+     agent='chat-conversational-react-description',
+     tools=tools,
+     llm=llm,
+     verbose=True,
+     max_iterations=3,
+     early_stopping_method='generate',
+     memory=conversational_memory
+ )
+
+ def call_agent(input):
+     return agent(input)['output']
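A quick way to exercise the agent on its own (a hypothetical smoke test, not part of the commit; it assumes valid keys in .env and a populated Pinecone index):

```python
# Hypothetical smoke test: route one question through the ReAct agent.
from agent import call_agent

print(call_agent("What are the best beaches to visit in Marbella?"))
```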
app.py ADDED
@@ -0,0 +1,39 @@
+ import gradio as gr
+ from agent import call_agent
+ import os
+ from dotenv import load_dotenv, find_dotenv
+ _ = load_dotenv(find_dotenv())
+ OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
+
+
+ # Bot function that processes the user's message
+ def chatbot(message, history=[]):
+     # Add the user's message to the history
+     history.append(("User:", message))
+     # Query the OpenAI agent
+     response = call_agent(message)
+     # Wrap the agent's answer as the bot's reply
+     response = f"Bot:'{response}'"
+     history.append((response,))
+     # Format the history as a single block of text
+     chat_history = "\n".join([f"{msg[0]} {msg[1]}" if len(msg) > 1 else msg[0] for msg in history])
+     return chat_history, history
+
+ # Gradio interface
+ with gr.Blocks() as demo:
+     gr.Markdown("## Simple chatbot with Gradio")
+
+     # Box that displays the message history
+     chatbox = gr.Textbox(lines=10, label="Message history", interactive=False)
+
+     # Box for typing messages
+     input_box = gr.Textbox(lines=1, placeholder="Type your message here", label="Message")
+
+     # Internal storage for the chat history
+     state = gr.State([])
+
+     # Logic run when Enter is pressed in the text box
+     input_box.submit(chatbot, inputs=[input_box, state], outputs=[chatbox, state])
+
+ # Run the application
+ demo.launch()
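demo.launch() serves the interface locally. If a temporary public link is needed while testing outside the Space (optional; the hosted Space does not use it), Gradio's share flag can be passed instead:

```python
# Optional: expose a temporary public URL in addition to the local server.
demo.launch(share=True)
```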
knowledgebase.py ADDED
@@ -0,0 +1,86 @@
+ # yt-dlp --write-subs --skip-download [youtube_url]
+ from pinecone import Pinecone
+ from pinecone import ServerlessSpec
+ from youtube_transcript_api import YouTubeTranscriptApi
+ import os
+ from dotenv import load_dotenv, find_dotenv
+ import torch
+ from sentence_transformers import SentenceTransformer
+ from tqdm import tqdm
+
+ _ = load_dotenv(find_dotenv())
+ PINECONE_API_KEY = os.getenv('PINECONE_API_KEY')
+
+ # Get the YouTube video IDs from a file of watch URLs (one per line)
+ def get_youtube_ids(route):
+     yt_ids = []
+     with open(route, 'r') as file:
+         for line in file:
+             # Take the value of the v= parameter, e.g. 7nDyUry3esM
+             yt_ids.append(line.split('=')[1].strip())
+     return yt_ids
+
+ # Fetch transcripts and clean them into one plain-text string per video
+ def get_clean_transcriptions(yt_ids):
+     trans_bruto = YouTubeTranscriptApi.get_transcripts(yt_ids, languages=['es', 'en'])
+     return {k: " ".join(d['text'] for d in v) for k, v in trans_bruto[0].items()}
+
+ # Create (or connect to) the Pinecone index
+ def create_index():
+     pc = Pinecone(api_key=PINECONE_API_KEY)
+     cloud = os.environ.get('PINECONE_CLOUD') or 'aws'
+     region = os.environ.get('PINECONE_REGION') or 'us-east-1'
+     spec = ServerlessSpec(cloud=cloud, region=region)
+     index_name = "youtube-videos"
+     if index_name not in pc.list_indexes().names():
+         # Create the index if it does not exist
+         pc.create_index(index_name, dimension=768, metric="cosine", spec=spec)
+     # Connect to the index we created
+     index = pc.Index(index_name)
+     return pc, index
+
+ # Load the retriever model
+ def load_retriever():
+     # Set device to GPU if available
+     device = 'cuda' if torch.cuda.is_available() else 'cpu'
+     # Load the flax-sentence-embeddings/all_datasets_v3_mpnet-base retriever from the Hugging Face model hub
+     retriever = SentenceTransformer('flax-sentence-embeddings/all_datasets_v3_mpnet-base', device=device)
+     return retriever
+
+ # Create embeddings and upsert them into the index
+ def create_embeddings(dicc, index, retriever):
+     # Passage id
+     p_id = 0
+     # Iterate over the transcriptions
+     for yt_id, transcription in dicc.items():
+         # Split the transcription into 1000-character passages
+         passages = [transcription[i:i+1000] for i in range(0, len(transcription), 1000)]
+         # For each passage, create an embedding and upsert it into the index
+         for passage in tqdm(passages):
+             emb = retriever.encode(passage, convert_to_tensor=True)
+             meta = {'yt_id': yt_id, 'passage_text': passage}
+             to_upsert = [(str(p_id), emb.tolist(), meta)]
+             _ = index.upsert(vectors=to_upsert)
+             p_id += 1
+     # Check that all vectors made it into the index
+     print(index.describe_index_stats())
+
+ """
+ # Get the video IDs
+ ls_ids = get_youtube_ids('./urls.txt')
+
+ # Get the video transcriptions
+ d_trans = get_clean_transcriptions(ls_ids)
+
+ # Create the index
+ pc, index = create_index()
+
+ # Load the retriever model
+ retriever = load_retriever()
+
+ # Populate the database
+ create_embeddings(d_trans, index, retriever)
+ """
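The ingestion pipeline ships commented out above. One way to make it runnable on demand (an editorial sketch under the same assumptions: urls.txt present and a valid PINECONE_API_KEY) is to place it behind a main guard:

```python
# Sketch: run the ingestion pipeline only when this module is executed
# directly, mirroring the commented-out block above.
if __name__ == "__main__":
    ls_ids = get_youtube_ids('./urls.txt')        # video IDs from urls.txt
    d_trans = get_clean_transcriptions(ls_ids)    # one transcript string per video
    pc, index = create_index()                    # create/connect the Pinecone index
    retriever = load_retriever()                  # sentence-transformers encoder
    create_embeddings(d_trans, index, retriever)  # embed and upsert the passages
```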
requirements.txt ADDED
@@ -0,0 +1,15 @@
+ langchain
+ serpapi
+ langchain_openai
+ pinecone
+ youtube_transcript_api
+ torch
+ python-dotenv
+ tqdm
+ bs4
+ regex
+ sentence-transformers
+ transformers
+ requests
+ gradio
+
urls.txt ADDED
@@ -0,0 +1,9 @@
+ https://www.youtube.com/watch?v=7nDyUry3esM
+ https://www.youtube.com/watch?v=sH9iFSeef-g
+ https://www.youtube.com/watch?v=bCy5zSWSKL8
+ https://www.youtube.com/watch?v=3CPzO9bHEOM
+ https://www.youtube.com/watch?v=spAraLH3N-4
+ https://www.youtube.com/watch?v=20UPUvLHKUY
+ https://www.youtube.com/watch?v=nDC2PqM4YpY
+ https://www.youtube.com/watch?v=QaiOb9I-ogA
+ https://www.youtube.com/watch?v=HJd0LnkR63o
utils.py ADDED
@@ -0,0 +1,71 @@
+ from knowledgebase import create_index, load_retriever
+ from bs4 import BeautifulSoup
+ import requests
+ import serpapi
+ import os
+ import re
+ from transformers import BartTokenizer
+ from dotenv import load_dotenv, find_dotenv
+ load_dotenv(find_dotenv())
+ SERPAPI_API_KEY = os.getenv('SERPAPI_API_KEY')
+ HUGGINGFACEHUB_API_TOKEN = os.getenv('HUGGINGFACEHUB_API_TOKEN')
+
+ def query_pinecone(query, top_k, index, retriever):
+     # Generate an embedding for the query
+     xq = retriever.encode([query], convert_to_tensor=True).tolist()[0]
+     # Search the Pinecone index for context passages containing the answer
+     xc = index.query(vector=xq, top_k=top_k, include_metadata=True)
+     return xc
+
+ def format_query(query, context):
+     # Extract passage_text from the Pinecone search result and add the <P> tag
+     context = " ".join([f"<P> {m['metadata']['passage_text']}" for m in context['matches']])
+     # Concatenate the query and the context passages
+     query = f"User question: {query} \n Context for answering the user question: {context}"
+     return query
+
+ def get_question_context(query, top_k):
+     # Create/connect the index
+     _, index = create_index()
+     # Load the retriever model
+     retriever = load_retriever()
+     # Search the Pinecone index for context passages containing the answer
+     context = query_pinecone(query, top_k, index, retriever)
+     # Format the query with the context passages
+     query = format_query(query, context)
+     return query
+
+ # Run a Google search and extract the relevant content from the first non-sponsored URL
+ def google_search_result(query):
+     # Make a Google search
+     s = serpapi.search(q=query, engine="google", location="Madrid, Spain", hl="es", gl="es", api_key=SERPAPI_API_KEY)
+     # Get the first non-ad URL
+     url = s["organic_results"][0]["link"]
+
+     # Fetch the page content
+     response = requests.get(url)
+     soup = BeautifulSoup(response.text, 'html.parser')
+
+     # Extract the visible text from the page
+     page_content = soup.get_text()
+
+     # Collapse newlines and extra whitespace
+     page_content = re.sub(r'\n+', ' ', page_content)
+     page_content = re.sub(r'\s+', ' ', page_content)
+
+     # Load the tokenizer for BART
+     tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
+
+     # Tokenize the content, truncating it to the model's input budget
+     tokens = tokenizer.encode(page_content, truncation=True, max_length=1000)
+
+     # Decode the tokens back into (possibly truncated) text
+     truncated_content = tokenizer.decode(tokens, skip_special_tokens=True)
+
+     # Summarize the page content with the Hugging Face Inference API
+     API_URL = "https://api-inference.huggingface.co/models/facebook/bart-large-cnn"
+     # Set the API headers
+     headers = {"Authorization": "Bearer " + HUGGINGFACEHUB_API_TOKEN}
+     # Make a request to the API
+     response = requests.post(API_URL, headers=headers, json={"inputs": truncated_content})
+     # Get the summary text from the response
+     return response.json()[0]['summary_text'] if len(response.json()) > 0 else "Could not obtain a summary of the page"
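For illustration, the retrieval helper can be checked in isolation (a sketch assuming a populated index and the keys from .env); the string it returns is what get_turism_answer in agent.py feeds into the tourism prompt:

```python
# Hypothetical standalone check of the retrieval step; assumes the
# Pinecone index was already populated by knowledgebase.py.
from utils import get_question_context

prompt_input = get_question_context(query="Where can I eat paella in Marbella?", top_k=3)
print(prompt_input)  # the question followed by <P>-tagged context passages
```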