
# Use Watsonx to respond to natural language questions using a RAG approach for Doctor AI



#### About Retrieval Augmented Generation
Retrieval Augmented Generation (RAG) is a versatile pattern that can unlock a number of use cases requiring factual recall of information, such as querying a knowledge base in natural language.

In its simplest form, RAG requires 3 steps:

- Index knowledge base passages (once)
- Retrieve relevant passage(s) from the knowledge base (for every user query)
- Generate a response by feeding retrieved passage into a large language model (for every user query)
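
The three steps above can be sketched end to end with a toy example. This is only an illustration under simplifying assumptions, not the implementation used later in this notebook: scoring by shared words stands in for a real embedding model, and `build_prompt` is a hypothetical placeholder for the call to a large language model.

```python
def tokenize(text):
    """Lower-case the text, drop basic punctuation, and return the set of words."""
    return set(text.lower().replace("?", "").replace(".", "").split())

# Step 1: index the knowledge base passages (done once).
passages = [
    "Spicy food can irritate the stomach lining.",
    "Regular exercise helps to control body weight.",
]
index = [(passage, tokenize(passage)) for passage in passages]

def retrieve(query, k=1):
    """Step 2: return the k passages sharing the most words with the query."""
    query_tokens = tokenize(query)
    ranked = sorted(index, key=lambda item: len(query_tokens & item[1]), reverse=True)
    return [passage for passage, _ in ranked[:k]]

def build_prompt(query, context):
    """Step 3 (placeholder): assemble the text that would be sent to the model."""
    return "Context:\n" + "\n".join(context) + "\n\nQuestion: " + query

question = "Why does spicy food hurt my stomach?"
top_passages = retrieve(question)
prompt = build_prompt(question, top_passages)
print(prompt)
```

In a real system, step 1 stores dense vectors in a vector database and step 3 sends the prompt to a foundation model.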


<a id="setup"></a>
##  Set up the environment

### Install and import dependencies

In [1]:
#!pip install chromadb==0.3.27
#!pip install sentence_transformers 
#!pip install pandas 
#!pip install rouge_score 
#!pip install nltk
#!pip install "ibm-watson-machine-learning>=1.0.312" 

**Note:** Restart the notebook kernel so that it picks up the package versions installed above.

In [2]:
import os, getpass
import pandas as pd
from typing import Optional, Dict, Any, Iterable, List

try:
    from sentence_transformers import SentenceTransformer
except ImportError:
    raise ImportError("Could not import sentence_transformers: Please install sentence-transformers package.")
    
try:
    import chromadb
    from chromadb.api.types import EmbeddingFunction
except ImportError:
    raise ImportError("Could not import chromadb: Please install chromadb package.")

### Watsonx API connection
This cell defines the credentials required to work with the watsonx API for foundation
model inferencing.

**Action:** Provide the IBM Cloud user API key. For details, see
[documentation](https://cloud.ibm.com/docs/account?topic=account-userapikey&interface=ui).

In [3]:
import json

# Read the API key and the project id from ./credentials/api.json.
# The file is expected to contain the keys IBM_CLOUD_API and PROJECT_ID.
with open('./credentials/api.json') as f:
    data = json.load(f)

IBM_CLOUD_API = data['IBM_CLOUD_API']
PROJECT_ID = data['PROJECT_ID']

In [4]:
credentials = {
    "url": "https://us-south.ml.cloud.ibm.com",
    "apikey": IBM_CLOUD_API
}

### Defining the project id
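The API requires a project id that provides the context for the call. The cell below reads the id from the `PROJECT_ID` environment variable of the project in which this notebook runs. If that variable is not set, it falls back to the `PROJECT_ID` value loaded from the credentials file.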


In [5]:
try:
    project_id = os.environ["PROJECT_ID"]
except KeyError:
    project_id = PROJECT_ID

<a id="data"></a>
## Train data loading

Load the train and test datasets. First, use the training dataset (`train_data`) to work with the models and to prepare and tune the prompt. Then use the test dataset (`test_data`) to calculate the metric scores for the selected model, prompts, and parameters.

In [6]:
# imports
import numpy as np
import pandas as pd
# load data


In [7]:
filename_data = "../2-Data/dialogues_embededd.pkl"
data =  pd.read_pickle(filename_data)


In [8]:
#data = data.reset_index()
#data.rename(columns = {'index':'ids'}, inplace = True)

In [9]:
from sklearn.model_selection import train_test_split

In [10]:
train_data, test_data= train_test_split(data, test_size=0.05)

In [11]:
train_data.shape

(950, 6)

In [12]:
test_data.shape

(50, 6)

## Build up knowledge base

The current state-of-the-art in RAG is to create dense vector representations of the knowledge base in order to calculate the semantic similarity to a given user query.
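
To make "semantic similarity" between dense vectors concrete, here is a minimal sketch of cosine similarity. The three-dimensional vectors are made-up illustrative values, not real embeddings, which typically have hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors standing in for the embeddings of a query and two passages.
query_vec = [0.9, 0.1, 0.0]
close_passage_vec = [0.8, 0.2, 0.1]
far_passage_vec = [0.0, 0.1, 0.9]

sim_close = cosine_similarity(query_vec, close_passage_vec)
sim_far = cosine_similarity(query_vec, far_passage_vec)
print(f"close passage: {sim_close:.3f}, far passage: {sim_far:.3f}")
```

The passage whose vector points in nearly the same direction as the query vector gets the higher score and would be retrieved first.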

We can generate dense vector representations using embedding models. In this notebook, we use [SentenceTransformers](https://www.google.com/search?client=safari&rls=en&q=sentencetransformers&ie=UTF-8&oe=UTF-8) [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) to embed both the knowledge base passages and user queries. `all-MiniLM-L6-v2` is a performant open-source model that is small enough to run locally.

A vector database is optimized for dense vector indexing and retrieval. This notebook uses [Chroma](https://docs.trychroma.com), a user-friendly open-source vector database licensed under Apache 2.0, which offers good speed and performance with the `all-MiniLM-L6-v2` embedding model.

The dataset we are using is already split into self-contained passages that can be ingested by Chroma. 

The size of each passage is limited by the embedding model's context window (which is 256 tokens for `all-MiniLM-L6-v2`).

### Load knowledge base documents

Load the set of documents used to build the knowledge base.

In [13]:
data_root = "../2-Data/"
knowledge_base_dir = f"{data_root}/knowledge_base"

In [14]:
knowledge_base_dir

'../2-Data//knowledge_base'

In [15]:
#if not os.path.exists(knowledge_base_dir):
#    from zipfile import ZipFile
#    with ZipFile(knowledge_base_dir + ".zip", 'r') as zObject:
#        zObject.extractall(data_root)

In [16]:
#documents = pd.read_csv(f"{knowledge_base_dir}/psgs.tsv", sep='\t', header=0)
#documents['indextext'] = documents['title'].astype(str) + "\n" + documents['text']

In [17]:
# load & inspect dataset
df = pd.read_csv("../2-Data/dialogues.csv", sep = '\t')
df = df.dropna()#.head(1000)
df.rename(columns = {'Description':'Question',"Doctor":"Answer"}, inplace = True)
#df["case"] = (" Patient: " + df.Patient.str.strip()+ "\n" + "Question: " + df.Question.str.strip() +)
#df["combined"] = ("Question: " + df.Question.str.strip() + "\n" +" Patient: " + df.Patient.str.strip()+  "\n" +" Answer: " + df.Answer.str.strip())

df["combined"] = ("Question: " + df.Question.str.strip() + "\n" +" Answer: " + df.Answer.str.strip())

df.head(2)

| | Question | Patient | Answer | combined |
|---|---|---|---|---|
| 0 | Q. What does abutment of the nerve root mean? | Hi doctor,I am just wondering what is abutting... | Hi. I have gone through your query with dilige... | Question: Q. What does abutment of the nerve r... |
| 1 | Q. What should I do to reduce my weight gained... | Hi doctor, I am a 22-year-old female who was d... | Hi. You have really done well with the hypothy... | Question: Q. What should I do to reduce my wei... |


In [18]:
df.shape

(256916, 4)

In [19]:
df =df.drop_duplicates()

In [20]:
df.shape

(246538, 4)

In [21]:
df = df.reset_index()
df.rename(columns = {'index':'ids'}, inplace = True)

In [22]:
documents=df

In [23]:
documents.shape

(246538, 5)

In [24]:
documents

| | ids | Question | Patient | Answer | combined |
|---|---|---|---|---|---|
| 0 | 0 | Q. What does abutment of the nerve root mean? | Hi doctor,I am just wondering what is abutting... | Hi. I have gone through your query with dilige... | Question: Q. What does abutment of the nerve r... |
| 1 | 1 | Q. What should I do to reduce my weight gained... | Hi doctor, I am a 22-year-old female who was d... | Hi. You have really done well with the hypothy... | Question: Q. What should I do to reduce my wei... |
| 2 | 2 | Q. I have started to get lots of acne on my fa... | Hi doctor! I used to have clear skin but since... | Hi there Acne has multifactorial etiology. Onl... | Question: Q. I have started to get lots of acn... |
| 3 | 3 | Q. Why do I have uncomfortable feeling between... | Hello doctor,I am having an uncomfortable feel... | Hello. The popping and discomfort what you fel... | Question: Q. Why do I have uncomfortable feeli... |
| 4 | 4 | Q. My symptoms after intercourse threatns me e... | Hello doctor,Before two years had sex with a c... | Hello. The HIV test uses a finger prick blood ... | Question: Q. My symptoms after intercourse thr... |
| ... | ... | ... | ... | ... | ... |
| 246533 | 256911 | Why is hair fall increasing while using Bontre... | I am suffering from excessive hairfall. My doc... | Hello Dear Thanks for writing to us, we are he... | Question: Why is hair fall increasing while us... |
| 246534 | 256912 | Why was I asked to discontinue Androanagen whi... | Hi Doctor, I have been having severe hair fall... | hello, hair4u is combination of minoxid... | Question: Why was I asked to discontinue Andro... |
| 246535 | 256913 | Can Mintop 5% Lotion be used by women for seve... | Hi..i hav sever hair loss problem so consulted... | HI I have evaluated your query thoroughly you... | Question: Can Mintop 5% Lotion be used by wome... |
| 246536 | 256914 | Is Minoxin 5% lotion advisable instead of Foli... | Hi, i am 25 year old girl, i am having massive... | Hello and Welcome to ‘Ask A Doctor’ service.I ... | Question: Is Minoxin 5% lotion advisable inste... |


In [30]:
documents=documents.head(2000)

In [31]:
documents.shape

(2000, 5)

### Create an embedding function

Note that you can feed a custom embedding function to be used by chromadb. The performance of chromadb may differ depending on the embedding model used.

In [32]:
class MiniLML6V2EmbeddingFunction(EmbeddingFunction):
    MODEL = SentenceTransformer('all-MiniLM-L6-v2')
    def __call__(self, texts):
        return MiniLML6V2EmbeddingFunction.MODEL.encode(texts).tolist()
emb_func = MiniLML6V2EmbeddingFunction()

### Set up Chroma upsert

Upserting a document inserts it if its id is not yet in the database and updates it if the id already exists. A plain insert of an id that already exists throws an error, so upserting is convenient when cells are re-run during experimentation.
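
The difference between a plain insert and an upsert can be illustrated with a small in-memory store. `ToyStore` is a hypothetical class written for this illustration only. It is not part of Chroma and does no embedding.

```python
class ToyStore:
    """Hypothetical in-memory document store keyed by document id."""

    def __init__(self):
        self._docs = {}

    def insert(self, doc_id, text):
        # A plain insert refuses to overwrite an existing id.
        if doc_id in self._docs:
            raise KeyError(f"id already exists: {doc_id}")
        self._docs[doc_id] = text

    def upsert(self, doc_id, text):
        # An upsert inserts new ids and overwrites existing ones.
        self._docs[doc_id] = text

    def get(self, doc_id):
        return self._docs[doc_id]

    def count(self):
        return len(self._docs)

store = ToyStore()
store.upsert("0", "first version")
store.upsert("0", "second version")  # safe to repeat: the document is replaced

try:
    store.insert("0", "third version")  # the same id again raises an error
    insert_failed = False
except KeyError:
    insert_failed = True

print(store.count(), store.get("0"), insert_failed)
```

Because repeated upserts leave exactly one document per id, a cell that loads the knowledge base can be executed several times without failing or creating duplicates.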

In [33]:
class ChromaWithUpsert:
    def __init__(
            self,
            name: Optional[str] = "watsonx_rag_collection",
            persist_directory:Optional[str]=None,
            embedding_function: Optional[EmbeddingFunction]=None,
            collection_metadata: Optional[Dict] = None,
    ):
        self._client_settings = chromadb.config.Settings()
        if persist_directory is not None:
            self._client_settings = chromadb.config.Settings(
                chroma_db_impl="duckdb+parquet",
                persist_directory=persist_directory,
            )
        self._client = chromadb.Client(self._client_settings)
        self._embedding_function = embedding_function
        self._persist_directory = persist_directory
        self._name = name
        self._collection = self._client.get_or_create_collection(
            name=self._name,
            embedding_function=self._embedding_function
            if self._embedding_function is not None
            else None,
            metadata=collection_metadata,
        )

    def upsert_texts(
        self,
        texts: Iterable[str],
        metadata: Optional[List[dict]] = None,
        ids: Optional[List[str]] = None,
        **kwargs: Any,
    ) -> List[str]:
        """Run more texts through the embeddings and add to the vectorstore.
        Args:
            texts (Iterable[str]): Texts to add to the vectorstore.
            metadata (Optional[List[dict]], optional): Optional list of metadata
                dicts (such as title, etc.), one per text.
            ids (Optional[List[str]], optional): Optional list of IDs.
        Returns:
            List[str]: List of IDs of the added texts.
        """
        # Materialize the iterable so it can be traversed more than once.
        texts = list(texts)
        # Generate ids when the caller does not provide them.
        if ids is None:
            import uuid
            ids = [str(uuid.uuid1()) for _ in texts]
        self._collection.upsert(
            metadatas=metadata, documents=texts, ids=ids
        )
        return ids

    def is_empty(self):
        return self._collection.count()==0

    def persist(self):
        self._client.persist()

    def query(self, query_texts:str, n_results:int=5):
        """
        Returns the vectors closest to the question vector
        :param query_texts: the question
        :param n_results: number of results to return
        :return: the closest results to the given question
        """
        return self._collection.query(query_texts=query_texts, n_results=n_results)

In [55]:
%%time
chroma = ChromaWithUpsert(
    name=f"nq910_minilm6v2",
    embedding_function=emb_func,  # you can have something here using /embed endpoint
    persist_directory=knowledge_base_dir,
)
if chroma.is_empty():
    _ = chroma.upsert_texts(
        texts=documents.combined.tolist(),
        # we handle tokenization, embedding, and indexing automatically. 
        #You can skip that and add your own embeddings as well
        metadata=[{'Question': Question,
                   'Patient':Patient,
                   'ids': ids}
                  for (Question,Patient,ids) in
                  zip(documents.Question,documents.Patient, documents.ids)],  # filter on these!
        ids=[str(i) for i in documents.ids],  # unique for each doc
    )
    chroma.persist()

CPU times: total: 93.8 ms
Wall time: 93 ms


### Embed and index documents with Chroma

**Note: Could take several minutes if you don't have pre-built indices**

In [34]:
%%time
chroma = ChromaWithUpsert(
    name=f"nq910_minilm6v2",
    embedding_function=emb_func,  # you can have something here using /embed endpoint
    persist_directory=knowledge_base_dir,
)
if chroma.is_empty():
    _ = chroma.upsert_texts(
        texts=documents.combined.tolist(),
        # we handle tokenization, embedding, and indexing automatically. 
        #You can skip that and add your own embeddings as well
        metadata=[{'Question': Question, 
                   'ids': ids}
                  for (Question,ids) in
                  zip(documents.Question, documents.ids)],  # filter on these!
        ids=[str(i) for i in documents.ids],  # unique for each doc
    )
    chroma.persist()

CPU times: total: 20.1 s
Wall time: 16.3 s


<a id="models"></a>
## Foundation Models on Watsonx

You need to specify the `model_id` that will be used for inferencing.

**Action**: Use `FLAN_UL2` model.

In [35]:
from ibm_watson_machine_learning.foundation_models.utils.enums import ModelTypes

In [36]:
model_id = ModelTypes.FLAN_UL2

<a id="predict"></a>
## Generate a retrieval-augmented response to a question

### Select questions

Get questions from the previously loaded test dataset.

In [37]:
question_texts = [q.strip("?") + "?" for q in test_data['Question'].tolist()]
print("\n".join(question_texts))

Q. Every time I eat spicy food, I poop blood. Why?
Q. Will Kalarchikai cure multiple ovarian cysts in PCOD?
Q. Please enlighten me on non-invasive procedures to detect prostate cancer.?
Q. My sciatica is heavy after a minor herniated disc L4 or L5. Why?
Q. I feel as if the skin over my belly button is firm. Is it hernia?
Q. A white patch has been formed at the tip of the penis associated with skin tightness. Why?
Q. I masturbate only by rubbing the tip of the penis. Is it a wrong way?
Q. Every time I eat spicy food, I poop blood. Why?
Q. Please provide opinion on my complete blood count report.?
Q. My child got hurt while playing. Can we use T-Bact or Neosporin ointment?
Q. Please comment on the severity of my wife's wrist x-ray.?
Q. Why am I having extreme bloating, abdominal pain, and fatigue with scaly marks?
Q. I masturbate only by rubbing the tip of the penis. Is it a wrong way?
Q. What can be done for tender and itchy red spots on hands?
Q. Will Kalarchikai cure multiple ovarian 

### Retrieve relevant context

Fetch paragraphs similar to the question.

In [38]:
relevant_contexts = []

for question_text in question_texts:
    relevant_chunks = chroma.query(
        query_texts=[question_text],
        n_results=5,
    )
    relevant_contexts.append(relevant_chunks)

Get the set of chunks for one of the questions.

In [39]:
sample_chunks = relevant_contexts[0]
for i, chunk in enumerate(sample_chunks['documents'][0]):
    print("=========")
    print("Paragraph index : ", sample_chunks['ids'][0][i])
    print("Paragraph : ", chunk)
    print("Distance : ", sample_chunks['distances'][0][i])

Paragraph index :  10
Paragraph :  Question: Q. Every time I eat spicy food, I poop blood. Why?
 Answer: Hello. I have gone through your information and test reports (attachment removed to protect patient identity). So, in view of that, there are a couple of things that I can opine upon: Hope that helps. For more information consult a general surgeon online -->
Distance :  0.23510286211967468
Paragraph index :  2968
Paragraph :  Question: Q. Why is there burning sensation after passing stools?
 Answer: Hello. Intake of spicy food may cause burning sensation and irritation of anal mucosa which may lead to a burning pain during defecation. It is due to the spiciness of the food. You may try yogurt, cucumber, tender coconut water, probiotic capsules, buttermilk. Use tablet Nexium before breakfast for one week. Avoid spicy food intake. If symptoms do not improve, please consult a physician or post me a query.
Distance :  0.7934628129005432
Paragraph index :  397
Paragraph :  Question: Q. C

### Feed the context and the questions to the `watsonx.ai` model

Define instructions for the model.

**Note:** Start by finding better prompts using a small subset of the training records (the `train_data` variable). Do not run inference on all of `train_data`, as it takes a long time to get the results. To get a sample from `train_data`, you can use, for example, `train_data.head(n=10)` for the first 10 records or `train_data.sample(n=10)` for 10 random records. Only once you have identified the best-performing prompt, update this notebook to use that prompt and compute the metrics on the test data.

**Action:** Edit the cell below and add your own prompt. The prompt below consists of the instruction (the first sentence), the retrieved question-answer passages, which serve as examples, and the question. If you want to change the instruction or add more examples of your own, change the prompt accordingly.
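
One possible direction for prompt experiments is to label the parts of the prompt explicitly. The function below is a hypothetical variant for illustration, with an assumed instruction wording and placeholder texts. It has not been evaluated on this dataset, so treat it as a starting point rather than a recommended prompt.

```python
def make_prompt_with_sections(context, question_text):
    """Hypothetical prompt variant with labelled context and question sections."""
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question_text}\n"
        "Answer:"
    )

# Placeholder texts; in the notebook these come from the retrieval step.
example_context = "Question: <retrieved question>\n Answer: <retrieved answer>"
example_question = "<user question>"

example_prompt = make_prompt_with_sections(example_context, example_question)
print(example_prompt)
```

Ending the prompt with `Answer:` signals where the model's completion should begin. Whether this helps for a given model is something to check on a small sample of `train_data`.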

In [41]:
def make_prompt(context, question_text):
    return (f"Please answer the following.\n"
          + f"{context}:\n\n"
          + f"{question_text}")

prompt_texts = []

for relevant_context, question_text in zip(relevant_contexts, question_texts):
    context = "\n\n\n".join(relevant_context["documents"][0])
    prompt_text = make_prompt(context, question_text)
    prompt_texts.append(prompt_text)

Inspect prompt for sample question.

In [42]:
print(prompt_texts[0])

Please answer the following.
Question: Q. Every time I eat spicy food, I poop blood. Why?
 Answer: Hello. I have gone through your information and test reports (attachment removed to protect patient identity). So, in view of that, there are a couple of things that I can opine upon: Hope that helps. For more information consult a general surgeon online -->


Question: Q. Why is there burning sensation after passing stools?
 Answer: Hello. Intake of spicy food may cause burning sensation and irritation of anal mucosa which may lead to a burning pain during defecation. It is due to the spiciness of the food. You may try yogurt, cucumber, tender coconut water, probiotic capsules, buttermilk. Use tablet Nexium before breakfast for one week. Avoid spicy food intake. If symptoms do not improve, please consult a physician or post me a query.


Question: Q. Can you explain the reason behind burning sensation of gums?
 Answer: Hi. Hurting of gums while taking spicy food might be due to the follo

### Defining the model parameters
We need to provide a set of model parameters that will influence the result:

In [43]:
from ibm_watson_machine_learning.metanames import GenTextParamsMetaNames as GenParams
from ibm_watson_machine_learning.foundation_models.utils.enums import DecodingMethods

parameters = {
    GenParams.DECODING_METHOD: DecodingMethods.GREEDY,
    GenParams.MIN_NEW_TOKENS: 1,
    GenParams.MAX_NEW_TOKENS: 200
}
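
To build intuition for what greedy decoding with a token limit means, here is a toy sketch. The probability table and the `greedy_decode` function are invented for illustration and do not reflect how the watsonx service is implemented. A real model computes the next-token probabilities from the whole prompt.

```python
# Made-up next-token probabilities for a tiny vocabulary.
NEXT_TOKEN_PROBS = {
    "<start>": {"avoid": 0.6, "eat": 0.4},
    "avoid": {"spicy": 0.7, "food": 0.3},
    "spicy": {"food": 0.9, "<end>": 0.1},
    "food": {"<end>": 0.8, "avoid": 0.2},
}

def greedy_decode(max_new_tokens, min_new_tokens=1):
    """Always pick the most probable next token until <end> or the limit."""
    tokens = []
    current = "<start>"
    while len(tokens) < max_new_tokens:
        candidates = dict(NEXT_TOKEN_PROBS[current])
        if len(tokens) < min_new_tokens:
            # Do not allow the sequence to end before the minimum length.
            candidates.pop("<end>", None)
        best = max(candidates, key=candidates.get)
        if best == "<end>":
            break
        tokens.append(best)
        current = best
    return tokens

full_output = greedy_decode(max_new_tokens=200)
short_output = greedy_decode(max_new_tokens=2)
print(full_output, short_output)
```

Greedy decoding is deterministic: the same prompt always yields the same output, which makes prompt comparisons reproducible. The maximum token limit only cuts the output short when the end token has not been reached yet.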

Initialize the `Model` class.

In [44]:
#this cell should never fail, and will produce no output
import requests

def getBearer(apikey):
    form = {'apikey': apikey, 'grant_type': "urn:ibm:params:oauth:grant-type:apikey"}
    print("About to create bearer")
#    print(form)
    response = requests.post("https://iam.cloud.ibm.com/oidc/token", data = form)
    if response.status_code != 200:
        print("Bad response code retrieving token")
        raise Exception("Failed to get token, invalid status")
    # Use a name that does not shadow the imported json module.
    token_data = response.json()
    if not token_data:
        print("Invalid/no JSON retrieving token")
        raise Exception("Failed to get token, invalid response")
    print("Bearer retrieved")
    return token_data.get("access_token")

In [45]:
credentials["token"] = getBearer(credentials["apikey"])

About to create bearer
Bearer retrieved


In [46]:
from ibm_watson_machine_learning.foundation_models import Model
model = Model(
    model_id=model_id,
    params=parameters,
    credentials=credentials,
    project_id=project_id)

### Generate a retrieval-augmented response

**Note:** Execution of this cell could take several minutes.

In [47]:
prompt_texts[:1]

['Please answer the following.\nQuestion: Q. Every time I eat spicy food, I poop blood. Why?\n Answer: Hello. I have gone through your information and test reports (attachment removed to protect patient identity). So, in view of that, there are a couple of things that I can opine upon: Hope that helps. For more information consult a general surgeon online -->\n\n\nQuestion: Q. Why is there burning sensation after passing stools?\n Answer: Hello. Intake of spicy food may cause burning sensation and irritation of anal mucosa which may lead to a burning pain during defecation. It is due to the spiciness of the food. You may try yogurt, cucumber, tender coconut water, probiotic capsules, buttermilk. Use tablet Nexium before breakfast for one week. Avoid spicy food intake. If symptoms do not improve, please consult a physician or post me a query.\n\n\nQuestion: Q. Can you explain the reason behind burning sensation of gums?\n Answer: Hi. Hurting of gums while taking spicy food might be due 

In [48]:
len(prompt_texts[:1])

1

In [49]:
results = []
for prompt_text in prompt_texts[:1]:
    results.append(model.generate_text(prompt=prompt_text))

In [50]:
#test_data

In [51]:
for idx, result in enumerate(results):
    print("Question = ", test_data.iloc[idx]['Question'])
    print("Answer = ", result)
    print("Expected Answer(s) (may not appear with exact wording in the dataset) = ", test_data.iloc[idx]['Answer'])
    print("\n")

Question =  Q. Every time I eat spicy food, I poop blood. Why?
Answer =  Hello. I have gone through your information and test reports (attachment removed to protect patient identity). So, in view of that, there are a couple of things that I can opine upon: Hope that helps. For more information consult a general surgeon online -->
Expected Answer(s) (may not appear with exact wording in the dataset) =  Hello. I have gone through your information and test reports (attachment removed to protect patient identity). So, in view of that, there are a couple of things that I can opine upon: Hope that helps. For more information consult a general surgeon online -->




<a id="score"></a>
## Calculate rougeL metric

This sample notebook uses the `rouge_score` module to calculate rougeL.

#### Rouge Metric

**Note:** The Rouge (Recall-Oriented Understudy for Gisting Evaluation) metric is a set of evaluation measures used in natural language processing (NLP) and specifically in text summarization and machine translation tasks. The Rouge metrics are designed to assess the quality of generated summaries or translations by comparing them to one or more reference texts.

The main idea behind Rouge is to measure the overlap between the generated summary (or translation) and the reference text(s) in terms of n-grams or longest common subsequences. By calculating recall, precision, and F1 scores based on these overlapping units, Rouge provides a quantitative assessment of the summary's content overlap with the reference(s).

Rouge-1 focuses on individual word overlap, Rouge-2 considers pairs of consecutive words, and Rouge-L takes into account the ordering of words and phrases. These metrics provide different perspectives on the similarity between two texts and can be used to evaluate different aspects of summarization or text generation models.
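
As a worked example of the longest-common-subsequence idea behind Rouge-L, the sketch below computes precision, recall, and F1 for one pair of sentences. It is a simplified illustration that splits on whitespace. The `rouge_score` package applies its own tokenization, so its numbers can differ from this sketch.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_f1(reference, prediction):
    """Simplified Rouge-L F1 on whitespace tokens."""
    ref_tokens = reference.lower().split()
    pred_tokens = prediction.lower().split()
    lcs = lcs_length(ref_tokens, pred_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)  # share of the prediction that matches
    recall = lcs / len(ref_tokens)      # share of the reference that is covered
    return 2 * precision * recall / (precision + recall)

reference = "avoid spicy food for one week"
prediction = "avoid spicy food"
score = rouge_l_f1(reference, prediction)
print(round(score, 3))
```

Here the longest common subsequence has 3 tokens, so precision is 3/3, recall is 3/6, and F1 is 2/3. A prediction identical to the reference scores 1.0.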

In [52]:
from rouge_score import rouge_scorer
from collections import defaultdict
import numpy as np

def get_rouge_score(predictions, references):
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL', 'rougeLsum'])
    aggregate_score = defaultdict(list)

    for result, ref in zip(predictions, references):
        # RougeScorer.score expects the reference (target) first, then the prediction.
        for key, val in scorer.score(ref, result).items():
            aggregate_score[key].append(val.fmeasure)

    scores = {}
    for key in aggregate_score:
        scores[key] = np.mean(aggregate_score[key])
    
    return scores

In [53]:
print(get_rouge_score(results, test_data.Answer))

{'rouge1': 1.0, 'rouge2': 1.0, 'rougeL': 1.0, 'rougeLsum': 1.0}
