{ "cells": [ { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "\n", "# Use Watsonx to respond to natural language questions using a RAG approach for Doctor AI" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "\n", "\n", "#### About Retrieval Augmented Generation\n", "Retrieval Augmented Generation (RAG) is a versatile pattern that can unlock a number of use cases requiring factual recall of information, such as querying a knowledge base in natural language.\n", "\n", "In its simplest form, RAG requires three steps:\n", "\n", "- Index knowledge base passages (once)\n", "- Retrieve relevant passage(s) from the knowledge base (for every user query)\n", "- Generate a response by feeding the retrieved passage(s) into a large language model (for every user query)\n" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "\n", "## Set up the environment" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "### Install and import dependencies" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "#!pip install chromadb==0.3.27\n", "#!pip install sentence_transformers \n", "#!pip install pandas \n", "#!pip install rouge_score \n", "#!pip install nltk\n", "#!pip install \"ibm-watson-machine-learning>=1.0.312\" " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note:** Please restart the notebook kernel to pick up the proper versions of the packages installed above." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "import os, getpass\n", "import pandas as pd\n", "from typing import Optional, Dict, Any, Iterable, List\n", "\n", "try:\n", "    from sentence_transformers import SentenceTransformer\n", "except ImportError:\n", "    raise ImportError(\"Could not import sentence_transformers: Please install the sentence-transformers package.\")\n", "\n", "try:\n", "    import chromadb\n", "    from chromadb.api.types import EmbeddingFunction\n", "except ImportError:\n", "    raise ImportError(\"Could not import chromadb: Please install the chromadb package.\")" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "### Watsonx API connection\n", "This cell defines the credentials required to work with the watsonx API for foundation\n", "model inferencing.\n", "\n", "**Action:** Provide the IBM Cloud user API key. For details, see the\n", "[documentation](https://cloud.ibm.com/docs/account?topic=account-userapikey&interface=ui)." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import json\n", "\n", "# Read the API key and project ID from ./credentials/api.json.\n", "# The file must define the keys IBM_CLOUD_API and PROJECT_ID.\n", "with open('./credentials/api.json') as f:\n", "    data = json.load(f)\n", "\n", "IBM_CLOUD_API = data['IBM_CLOUD_API']\n", "PROJECT_ID = data['PROJECT_ID']" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "credentials = {\n", "    \"url\": \"https://us-south.ml.cloud.ibm.com\",\n", "    \"apikey\": IBM_CLOUD_API\n", "}" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "### Defining the project id\n", "The API requires a project id that provides the context for the call. 
We first try to obtain the id from the `PROJECT_ID` environment variable, which is set when this notebook runs inside a project. Otherwise, the project id from the credentials file is used.\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "try:\n", "    project_id = os.environ[\"PROJECT_ID\"]\n", "except KeyError:\n", "    project_id = PROJECT_ID" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "\n", "## Train data loading" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "Load the train and test datasets. First, use the training dataset (`train_data`) to work with the models and to prepare and tune the prompt. Then use the test dataset (`test_data`) to calculate the metric scores for the selected model, prompts, and parameters." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# imports\n", "import numpy as np\n", "import pandas as pd\n", "# load data\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "filename_data = \"../2-Data/dialogues_embededd.pkl\"\n", "data = pd.read_pickle(filename_data)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "#data = data.reset_index()\n", "#data.rename(columns = {'index':'ids'}, inplace = True)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "# Hold out 5% of the rows as the test set\n", "train_data, test_data = train_test_split(data, test_size=0.05)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(950, 6)" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ 
"train_data.shape" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(50, 6)" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_data.shape" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Build up knowledge base\n", "\n", "The current state-of-the-art in RAG is to create dense vector representations of the knowledge base in order to calculate the semantic similarity to a given user query.\n", "\n", "We can generate dense vector representations using embedding models. In this notebook, we use [SentenceTransformers](https://www.google.com/search?client=safari&rls=en&q=sentencetransformers&ie=UTF-8&oe=UTF-8) [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) to embed both the knowledge base passages and user queries. `all-MiniLM-L6-v2` is a performant open-source model that is small enough to run locally.\n", "\n", "A vector database is optimized for dense vector indexing and retrieval. This notebook uses [Chroma](https://docs.trychroma.com), a user-friendly open-source vector database, licensed under Apache 2.0, which offers good speed and performance with all-MiniLM-L6-v2 embedding model." ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "The dataset we are using is already split into self-contained passages that can be ingested by Chroma. \n", "\n", "The size of each passage is limited by the embedding model's context window (which is 256 tokens for `all-MiniLM-L6-v2`)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load knowledge base documents\n", "\n", "Load set of documents used further to build knowledge base. 
" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "data_root = \"../2-Data/\"\n", "knowledge_base_dir = f\"{data_root}/knowledge_base\"" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'../2-Data//knowledge_base'" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "knowledge_base_dir" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "#if not os.path.exists(knowledge_base_dir):\n", "# from zipfile import ZipFile\n", "# with ZipFile(knowledge_base_dir + \".zip\", 'r') as zObject:\n", "# zObject.extractall(data_root)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "#documents = pd.read_csv(f\"{knowledge_base_dir}/psgs.tsv\", sep='\\t', header=0)\n", "#documents['indextext'] = documents['title'].astype(str) + \"\\n\" + documents['text']" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "| | Question | Patient | Answer | combined |\n", "|---|---|---|---|---|\n", "| 0 | Q. What does abutment of the nerve root mean? | Hi doctor,I am just wondering what is abutting... | Hi. I have gone through your query with dilige... | Question: Q. What does abutment of the nerve r... |\n", "| 1 | Q. What should I do to reduce my weight gained... | Hi doctor, I am a 22-year-old female who was d... | Hi. You have really done well with the hypothy... | Question: Q. What should I do to reduce my wei... |\n", "\n", "| | ids | Question | Patient | Answer | combined |\n", "|---|---|---|---|---|---|\n", "| 0 | 0 | Q. What does abutment of the nerve root mean? | Hi doctor,I am just wondering what is abutting... | Hi. I have gone through your query with dilige... | Question: Q. What does abutment of the nerve r... |\n", "| 1 | 1 | Q. What should I do to reduce my weight gained... | Hi doctor, I am a 22-year-old female who was d... | Hi. You have really done well with the hypothy... | Question: Q. What should I do to reduce my wei... |\n", "| 2 | 2 | Q. I have started to get lots of acne on my fa... | Hi doctor! I used to have clear skin but since... | Hi there Acne has multifactorial etiology. Onl... | Question: Q. I have started to get lots of acn... |\n", "| 3 | 3 | Q. Why do I have uncomfortable feeling between... | Hello doctor,I am having an uncomfortable feel... | Hello. The popping and discomfort what you fel... | Question: Q. Why do I have uncomfortable feeli... |\n", "| 4 | 4 | Q. My symptoms after intercourse threatns me e... | Hello doctor,Before two years had sex with a c... | Hello. The HIV test uses a finger prick blood ... | Question: Q. My symptoms after intercourse thr... |\n", "| ... | ... | ... | ... | ... | ... |\n", "| 246533 | 256911 | Why is hair fall increasing while using Bontre... | I am suffering from excessive hairfall. My doc... | Hello Dear Thanks for writing to us, we are he... | Question: Why is hair fall increasing while us... |\n", "| 246534 | 256912 | Why was I asked to discontinue Androanagen whi... | Hi Doctor, I have been having severe hair fall... | hello, hair4u is combination of minoxid... | Question: Why was I asked to discontinue Andro... |\n", "| 246535 | 256913 | Can Mintop 5% Lotion be used by women for seve... | Hi..i hav sever hair loss problem so consulted... | HI I have evaluated your query thoroughly you... | Question: Can Mintop 5% Lotion be used by wome... |\n", "| 246536 | 256914 | Is Minoxin 5% lotion advisable instead of Foli... | Hi, i am 25 year old girl, i am having massive... | Hello and Welcome to ‘Ask A Doctor’ service.I ... | Question: Is Minoxin 5% lotion advisable inste... |\n", "| 246537 | 256915 | Are Biotin supplements need to reduce severe h... | iam having hairfall for a decade.. but fews we... | you did'nt mention about thyroid problem ...us... | Question: Are Biotin supplements need to reduc... |\n", "\n", "246538 rows × 5 columns
\n", "