{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Load the dataset\n", "We will combine the Question, Patient, and Answer text (the Description and Doctor columns are renamed to Question and Answer below) into a single combined text. The embedding model encodes a text and outputs a single embedding vector." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "To run this notebook, you will need to install: pandas, openai, tiktoken, transformers, sentence-transformers, plotly, matplotlib, scikit-learn, torch (a transformers dependency), torchvision, and scipy." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "# imports\n", "import pandas as pd\n", "import tiktoken\n", "from openai.embeddings_utils import get_embedding\n", "import time" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "# embedding model parameters\n", "embedding_model = \"text-embedding-ada-002\"\n", "embedding_encoding = \"cl100k_base\" # this is the encoding for text-embedding-ada-002\n", "max_tokens = 8000 # the maximum for text-embedding-ada-002 is 8191" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "# load & inspect dataset (tab-separated file)\n", "df = pd.read_csv(\"../2-Data/dialogues.csv\", sep = '\\t')\n", "# drop rows with missing values\n", "df = df.dropna()#.head(1000)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "df.rename(columns = {'Description':'Question',\"Doctor\":\"Answer\"}, inplace = True)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Question \\\n", "0 Q. What does abutment of the nerve root mean? \n", "1 Q. What should I do to reduce my weight gained... \n", "2 Q. I have started to get lots of acne on my fa... \n", "3 Q. Why do I have uncomfortable feeling between... \n", "4 Q. My symptoms after intercourse threatns me e... \n", "... ... \n", "256911 Why is hair fall increasing while using Bontre... \n", "256912 Why was I asked to discontinue Androanagen whi... \n", "256913 Can Mintop 5% Lotion be used by women for seve... \n", "256914 Is Minoxin 5% lotion advisable instead of Foli... \n", "256915 Are Biotin supplements need to reduce severe h... \n", "\n", " Patient \\\n", "0 Hi doctor,I am just wondering what is abutting... \n", "1 Hi doctor, I am a 22-year-old female who was d... \n", "2 Hi doctor! I used to have clear skin but since... \n", "3 Hello doctor,I am having an uncomfortable feel... \n", "4 Hello doctor,Before two years had sex with a c... \n", "... ... \n", "256911 I am suffering from excessive hairfall. My doc... \n", "256912 Hi Doctor, I have been having severe hair fall... \n", "256913 Hi..i hav sever hair loss problem so consulted... \n", "256914 Hi, i am 25 year old girl, i am having massive... \n", "256915 iam having hairfall for a decade.. but fews we... \n", "\n", " Answer \n", "0 Hi. I have gone through your query with dilige... \n", "1 Hi. You have really done well with the hypothy... \n", "2 Hi there Acne has multifactorial etiology. Onl... \n", "3 Hello. The popping and discomfort what you fel... \n", "4 Hello. The HIV test uses a finger prick blood ... \n", "... ... \n", "256911 Hello Dear Thanks for writing to us, we are he... \n", "256912 hello, hair4u is combination of minoxid... \n", "256913 HI I have evaluated your query thoroughly you... \n", "256914 Hello and Welcome to ‘Ask A Doctor’ service.I ... \n", "256915 you did'nt mention about thyroid problem ...us... 
\n", "\n", "[256916 rows x 3 columns]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Question \\\n", "0 Q. What does abutment of the nerve root mean? \n", "1 Q. What should I do to reduce my weight gained... \n", "\n", " Patient \\\n", "0 Hi doctor,I am just wondering what is abutting... \n", "1 Hi doctor, I am a 22-year-old female who was d... \n", "\n", " Answer \\\n", "0 Hi. I have gone through your query with dilige... \n", "1 Hi. You have really done well with the hypothy... \n", "\n", " combined \n", "0 Question: Q. What does abutment of the nerve r... \n", "1 Question: Q. What should I do to reduce my wei... " ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"combined\"] = (\n", " \"Question: \" + df.Question.str.strip() + \"; Patient: \" + df.Patient.str.strip()+ \"; Answer: \" + df.Answer.str.strip()\n", ")\n", "df.head(2)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "#df[\"combined\"] = ( \"Description: \" + df.Description.str.strip() + \"; Patient: \" + df.Patient.str.strip())\n", "#df.head(2)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "# subsample to 1k most recent reviews and remove samples that are too long\n", "top_n = df.shape[0]\n", "#df = df.tail(top_n * 2) # first cut to first 2k entries, assuming less than half will be filtered out" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "256916" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "encoding = tiktoken.get_encoding(embedding_encoding)\n", "# omit reviews that are too long to embed\n", "df[\"n_tokens\"] = df.combined.apply(lambda x: len(encoding.encode(x)))\n", "df = df[df.n_tokens <= max_tokens].tail(top_n)\n", "len(df)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Description \\\n", "0 Q. What does abutment of the nerve root mean? \n", "1 Q. What should I do to reduce my weight gained... \n", "2 Q. I have started to get lots of acne on my fa... \n", "3 Q. Why do I have uncomfortable feeling between... \n", "4 Q. My symptoms after intercourse threatns me e... \n", "... ... \n", "256911 Why is hair fall increasing while using Bontre... \n", "256912 Why was I asked to discontinue Androanagen whi... \n", "256913 Can Mintop 5% Lotion be used by women for seve... \n", "256914 Is Minoxin 5% lotion advisable instead of Foli... \n", "256915 Are Biotin supplements need to reduce severe h... \n", "\n", " Patient \\\n", "0 Hi doctor,I am just wondering what is abutting... \n", "1 Hi doctor, I am a 22-year-old female who was d... \n", "2 Hi doctor! I used to have clear skin but since... \n", "3 Hello doctor,I am having an uncomfortable feel... \n", "4 Hello doctor,Before two years had sex with a c... \n", "... ... \n", "256911 I am suffering from excessive hairfall. My doc... \n", "256912 Hi Doctor, I have been having severe hair fall... \n", "256913 Hi..i hav sever hair loss problem so consulted... \n", "256914 Hi, i am 25 year old girl, i am having massive... \n", "256915 iam having hairfall for a decade.. but fews we... \n", "\n", " Doctor \\\n", "0 Hi. I have gone through your query with dilige... \n", "1 Hi. You have really done well with the hypothy... \n", "2 Hi there Acne has multifactorial etiology. Onl... \n", "3 Hello. The popping and discomfort what you fel... \n", "4 Hello. The HIV test uses a finger prick blood ... \n", "... ... \n", "256911 Hello Dear Thanks for writing to us, we are he... \n", "256912 hello, hair4u is combination of minoxid... \n", "256913 HI I have evaluated your query thoroughly you... \n", "256914 Hello and Welcome to ‘Ask A Doctor’ service.I ... \n", "256915 you did'nt mention about thyroid problem ...us... \n", "\n", " combined n_tokens \n", "0 Description: Q. 
What does abutment of the nerv... 95 \n", "1 Description: Q. What should I do to reduce my ... 519 \n", "2 Description: Q. I have started to get lots of ... 285 \n", "3 Description: Q. Why do I have uncomfortable fe... 324 \n", "4 Description: Q. My symptoms after intercourse ... 442 \n", "... ... ... \n", "256911 Description: Why is hair fall increasing while... 211 \n", "256912 Description: Why was I asked to discontinue An... 154 \n", "256913 Description: Can Mintop 5% Lotion be used by w... 191 \n", "256914 Description: Is Minoxin 5% lotion advisable in... 232 \n", "256915 Description: Are Biotin supplements need to re... 213 \n", "\n", "[256916 rows x 5 columns]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are several ways to convert text into embedding vectors.\n", "\n", "Some popular hosted embedding APIs, such as OpenAI's, are paid, so we first use a free, open-source alternative.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Get embeddings using SentenceTransformers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us use SentenceTransformers, a Python framework for state-of-the-art sentence, text, and image embeddings. The initial work is described in the paper Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we verify that PyTorch can access a CUDA-capable GPU." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import torch\n", "torch.cuda.is_available()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We define our list of sentences. 
You can use a larger list; passing the sentences as a list makes it easy to process each one." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can install SentenceTransformers (Sentence-BERT) using:\n", "`!pip install sentence-transformers`\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Step 1: We then load a pre-trained BERT model. Many other pre-trained models are available." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "from sentence_transformers import SentenceTransformer\n", "sbert_model = SentenceTransformer('bert-base-nli-mean-tokens')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We proceed to test the embedding creation." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "from sentence_transformers import SentenceTransformer\n", "model = SentenceTransformer('paraphrase-MiniLM-L6-v2')\n", "# Sentences we want to encode. Example:\n", "sentence = ['This framework generates embeddings for each input sentence']\n", "# Sentences are encoded by calling model.encode()\n", "embedding = model.encode(sentence)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['This framework generates embeddings for each input sentence']" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sentence" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "from functools import lru_cache\n", "\n", "@lru_cache(maxsize=None)\n", "def load_model(transformer):\n", "    # Load each model only once; later calls reuse the cached instance\n", "    return SentenceTransformer(transformer)\n", "\n", "def get_embeddings(x, transformer='paraphrase-MiniLM-L6-v2'):\n", "    # x is the sentence (or list of sentences) we want to encode\n", "    model = load_model(transformer)\n", "    # Sentences are encoded by calling model.encode()\n", "    return model.encode(x)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "# This may take a few minutes\n", "embedding_mod='paraphrase-MiniLM-L6-v2'\n", 
"#df[\"embedding\"] = df.combined.apply(lambda x: get_embeddings(x, transformer=embedding_mod))" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "df=df.head(1000)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "#embedding_doctor\n", "# This may take a few minutes\n", "df[\"embedding\"] = df.Answer.apply(lambda x: get_embeddings(x, transformer=embedding_mod))" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Question \\\n", "0 Q. What does abutment of the nerve root mean? \n", "1 Q. What should I do to reduce my weight gained... \n", "2 Q. I have started to get lots of acne on my fa... \n", "3 Q. Why do I have uncomfortable feeling between... \n", "4 Q. My symptoms after intercourse threatns me e... \n", ".. ... \n", "995 Q. My lax les is 38 cm with inflamed gastric f... \n", "996 Q. I am suffering from mood swings. Kindly adv... \n", "997 Q. I am having swollen lymph node in my neck. ... \n", "998 Q. How good is Albenza for a raccoon roundworm... \n", "999 Q. Will Kalarchikai cure multiple ovarian cyst... \n", "\n", " Patient \\\n", "0 Hi doctor,I am just wondering what is abutting... \n", "1 Hi doctor, I am a 22-year-old female who was d... \n", "2 Hi doctor! I used to have clear skin but since... \n", "3 Hello doctor,I am having an uncomfortable feel... \n", "4 Hello doctor,Before two years had sex with a c... \n", ".. ... \n", "995 Hello doctor, My lax les is 38 cm with inflame... \n", "996 Hello doctor,I want to get some information re... \n", "997 Hello doctor, I went to the chiropractor and g... \n", "998 Hello doctor,I am concerned about a possible r... \n", "999 Hello doctor, I have multiple small cysts in b... \n", "\n", " Answer \\\n", "0 Hi. I have gone through your query with dilige... \n", "1 Hi. You have really done well with the hypothy... \n", "2 Hi there Acne has multifactorial etiology. Onl... \n", "3 Hello. The popping and discomfort what you fel... \n", "4 Hello. The HIV test uses a finger prick blood ... \n", ".. ... \n", "995 Hello. Gastritis is an inflammation of stomach... \n", "996 Hello. Let me answer your questions via some b... \n", "997 Hello. I do not think that because of chiropra... \n", "998 Hello. Albendazole 400 mg single star dose is ... \n", "999 Hello. I just read your query. See Kalarachi K... \n", "\n", " combined n_tokens \\\n", "0 Question: Q. What does abutment of the nerve r... 95 \n", "1 Question: Q. 
What should I do to reduce my wei... 519 \n", "2 Question: Q. I have started to get lots of acn... 285 \n", "3 Question: Q. Why do I have uncomfortable feeli... 324 \n", "4 Question: Q. My symptoms after intercourse thr... 442 \n", ".. ... ... \n", "995 Question: Q. My lax les is 38 cm with inflamed... 214 \n", "996 Question: Q. I am suffering from mood swings. ... 491 \n", "997 Question: Q. I am having swollen lymph node in... 395 \n", "998 Question: Q. How good is Albenza for a raccoon... 240 \n", "999 Question: Q. Will Kalarchikai cure multiple ov... 309 \n", "\n", " embedding \n", "0 [-0.109211065, -0.17469415, 0.18996556, 0.0599... \n", "1 [-0.014065318, 0.0440334, 0.26095688, 0.086799... \n", "2 [-0.39175138, -0.025890486, -0.024644196, -0.0... \n", "3 [-0.29406005, -0.31878802, 0.27588362, 0.09649... \n", "4 [-0.36187398, 0.18491694, -0.3090741, -0.30197... \n", ".. ... \n", "995 [-0.1555396, -0.44157797, -0.15364785, 0.25760... \n", "996 [-0.2296337, 0.119730674, 0.37153018, 0.062901... \n", "997 [-0.10149522, -0.33532476, 0.40812746, -0.2713... \n", "998 [-0.06408733, 0.17669381, 0.09132431, -0.09456... \n", "999 [0.03657364, 0.24297515, 0.09555141, 0.0270566... 
\n", "\n", "[1000 rows x 6 columns]" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "from ast import literal_eval\n", "import numpy as np" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "df[\"embedding\"] = df.embedding.apply(np.array) # ensure each embedding is a NumPy array (model.encode already returns one)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "#df[\"embedding_doctor\"] = df.embedding_doctor.apply(np.array) # ensure each embedding is a NumPy array" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "df.to_pickle(\"../2-Data/dialogues_embededd.pkl\")" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "#df.to_csv(\"../2-Data/dialogues_embededd.csv\", sep = '\\t', encoding='utf-8', index=False)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Get embeddings using OpenAI (optional)\n", "If you have an OpenAI subscription, you can follow the steps below.\n", "This section is optional; we will continue with the embeddings from the previous method." 
] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "# Read the OpenAI credentials from a local JSON file\n", "import json\n", "# The with-statement closes the file automatically;\n", "# json.load returns the JSON object as a dictionary\n", "with open('./credentials/api.json') as f:\n", "    data = json.load(f)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "# The API key is read from the credentials file loaded above; for other options see the README: https://github.com/openai/openai-python#usage\n", "import openai\n", "openai.api_key = data['OPENAI_API_KEY']" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [], "source": [ "# This may take a few minutes\n", "# Note: get_embedding comes from openai.embeddings_utils, which is only available in openai versions before 1.0\n", "df[\"embedding\"] = df.combined.apply(lambda x: get_embedding(x, engine=embedding_model))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.to_csv(\"../2-Data/dialogues_embededd_openai.csv\", sep='\\t', encoding='utf-8', index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Additional Notes (not needed)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.feature_extraction.text import TfidfVectorizer\n", "# list of text documents\n", "text = [\"I am doga.\",\n", " \"I am a dog\"]\n", "# create the transform\n", "vectorizer = TfidfVectorizer()\n", "# tokenize and build vocab\n", "vectorizer.fit(text)\n", "# summarize\n", "print(vectorizer.vocabulary_)\n", "print(vectorizer.idf_)\n", "# encode document\n", "vector = vectorizer.transform([text[0]])\n", "# summarize encoded vector\n", "print(vector.shape)\n", "print(vector.toarray())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.feature_extraction.text import HashingVectorizer\n", "# list of text documents\n", "text = [\"I am doc.\", \"I am dog\"]\n", "# create the transform\n", "vectorizer = 
HashingVectorizer(n_features=20)\n", "# encode document\n", "vector = vectorizer.transform(text)\n", "# summarize encoded vector\n", "print(vector.shape)\n", "print(vector.toarray())" ] } ], "metadata": { "kernelspec": { "display_name": "Python3 (GPT)", "language": "python", "name": "gpt" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.11" }, "vscode": { "interpreter": { "hash": "365536dcbde60510dc9073d6b991cd35db2d9bac356a11f5b64279a5e6708b97" } } }, "nbformat": 4, "nbformat_minor": 4 }