Spaces: mgbam/Medic


mgbam committed 5 days ago

Commit

b8986a1

1 Parent(s): 879a34e

Add application file


Files changed (7)

README.md +86 -5
app.py +38 -0
backend.py +42 -0
mini_ladder.py +56 -0
requirements.txt +9 -0
retrieval.py +126 -0
visualization.py +42 -0

README.md CHANGED

@@ -1,12 +1,93 @@
 ---
-title: Medic
-emoji: 🌖
-colorFrom: yellow
-colorTo: indigo
 sdk: streamlit
 sdk_version: 1.43.2
 app_file: app.py
 pinned: false
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

 ---
+title: Med
+emoji: 🏆
+colorFrom: blue
+colorTo: purple
 sdk: streamlit
 sdk_version: 1.43.2
 app_file: app.py
 pinned: false
+short_description: Medical field with next‑gen technology
 ---
+# AI-Powered Medical Knowledge Graph Assistant (Mini-LADDER Demo)
+This repository demonstrates a Streamlit application that:
+- Retrieves PubMed abstracts via NCBI’s E-utilities.
+- Indexes and retrieves relevant documents using ChromaDB.
+- Generates biomedical answers using Microsoft BioGPT-Large-PubMedQA.
+- Applies a two-stage self-improvement mechanism (inspired by Tufa Labs’ LADDER) that:
+  - Generates naive sub-questions.
+  - Produces an initial answer.
+  - Self-critiques and refines the answer.
+- Visualizes key terms in a knowledge graph using PyVis.
+## Key Features
+1. **PubMed + Chroma**: Retrieves and indexes relevant abstracts.
+2. **BioGPT-Large-PubMedQA**: Generates an initial answer.
+3. **Mini-LADDER Approach**:
+   - **Sub-Question Decomposition**: Generates sub-questions from the main query.
+   - **Self-Critique & Refinement**: Uses a second pass to critique and refine the answer.
+4. **Interactive Knowledge Graph**: Displays a PyVis graph of the top documents and key terms.
+## Setup Instructions
+1. **Install Dependencies**
+   Create a virtual environment (optional) and install the required packages:
+   ```bash
+   pip install -r requirements.txt
+   ```
+2. **Set Environment Variables**
+   Set your PubMed API key:
+   ```bash
+   export PUBMED_API_KEY=<YOUR_NCBI_API_KEY>
+   ```
+3. **Run the App**
+   Launch the Streamlit app:
+   ```bash
+   streamlit run app.py
+   ```
+4. **Access the App**
+   Open your browser at http://localhost:8501.
+## Project Structure
+```
+.
+├── app.py
+├── backend.py
+├── mini_ladder.py
+├── retrieval.py
+├── visualization.py
+├── README.md
+└── requirements.txt
+```
+## About the Mini-LADDER Approach
+Inspired by Tufa Labs’ LADDER (Learning through Autonomous Difficulty-Driven Example Recursion), this demo shows how a model might:
+- Decompose a query into simpler sub-questions.
+- Generate an initial answer using retrieval-augmented generation.
+- Self-critique and refine the answer based on detected gaps.
+
+Ultimately, this approach hints at how autonomous, recursive learning could be implemented.
+Enjoy exploring potential extensions into code generation, theorem proving, or other domains!
+
+---
+## File: `requirements.txt`
+```txt
+streamlit
+pyvis
+chromadb
+transformers
+sentence-transformers
+torch
+requests
+```

app.py ADDED

	@@ -0,0 +1,38 @@

+import streamlit as st
+import streamlit.components.v1 as components
+from backend import process_medical_query, docs_cache
+from visualization import create_medical_graph
+def main():
+    st.title("AI-Powered Medical Knowledge Graph Assistant")
+    st.markdown(
+        "**Using BioGPT-Large-PubMedQA + PubMed + Chroma** for advanced retrieval-augmented generation."
+    )
+    user_query = st.text_input("Enter biomedical/medical query", "Malaria and cough treatment")
+    if st.button("Submit"):
+        with st.spinner("Generating answer..."):
+            final_answer, sub_questions, initial_answer, critique = process_medical_query(user_query)
+        st.subheader("Sub-Question Decomposition")
+        st.write(sub_questions)
+        st.subheader("Initial AI Answer")
+        st.write(initial_answer)
+        st.subheader("Self-Critique")
+        st.write(critique)
+        st.subheader("Refined AI Answer")
+        st.write(final_answer)
+        st.subheader("Knowledge Graph")
+        docs = docs_cache.get(user_query, [])
+        if docs:
+            graph_html = create_medical_graph(user_query, docs)
+            components.html(graph_html, height=600, scrolling=True)
+        else:
+            st.info("No documents to visualize.")
+if __name__ == "__main__":
+    main()

backend.py ADDED

	@@ -0,0 +1,42 @@

+from transformers import pipeline
+from retrieval import get_relevant_pubmed_docs
+from mini_ladder import generate_sub_questions, self_critique_and_refine
+# Use Microsoft BioGPT-Large-PubMedQA for generation
+MODEL_NAME = "microsoft/BioGPT-Large-PubMedQA"
+qa_pipeline = pipeline("text-generation", model=MODEL_NAME)
+# In-memory cache for documents (used for graph generation)
+docs_cache = {}
+def process_medical_query(query: str):
+    """
+    Processes the query in four steps:
+    1. Generate sub-questions.
+    2. Retrieve relevant PubMed documents.
+    3. Generate an initial answer.
+    4. Self-critique and refine the answer.
+    """
+    # Step 1: Generate sub-questions (naively)
+    sub_questions = generate_sub_questions(query)
+    # Step 2: Retrieve relevant documents via PubMed and Chroma
+    relevant_docs = get_relevant_pubmed_docs(query)
+    docs_cache[query] = relevant_docs
+    if not relevant_docs:
+        return ("No documents found for this query.", sub_questions, "", "")
+    # Step 3: Generate an initial answer
+    context_text = "\n\n".join(relevant_docs)
+    prompt = f"Question: {query}\nContext: {context_text}\nAnswer:"
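+    # return_full_text=False keeps only the newly generated text, not the echoed prompt
+    initial_gen = qa_pipeline(prompt, max_new_tokens=100, truncation=True, return_full_text=False)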
+    if initial_gen and isinstance(initial_gen, list):
+        initial_answer = initial_gen[0]["generated_text"]
+    else:
+        initial_answer = "No answer found."
+    # Step 4: Self-critique and refine the answer
+    final_answer, critique = self_critique_and_refine(query, initial_answer, relevant_docs)
+    return (final_answer, sub_questions, initial_answer, critique)
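
Step 3 of `process_medical_query` joins every retrieved abstract into one prompt and relies on `truncation=True` to keep it within the model's limit. If the tokenizer truncates from the right (the usual default), an over-long prompt would lose its tail, including the `Answer:` cue. The sketch below shows one way to cap the context before the prompt is assembled. The helper name `build_prompt` and the `max_context_chars` budget are hypothetical, and the character count is only a crude stand-in for a token count.

```python
from typing import List


def build_prompt(query: str, docs: List[str], max_context_chars: int = 2000) -> str:
    """Assemble a QA prompt whose context never exceeds max_context_chars."""
    pieces: List[str] = []
    remaining = max_context_chars
    for doc in docs:
        if remaining <= 0:
            break
        # Cut the document to whatever budget is left.
        snippet = doc.strip()[:remaining]
        if snippet:
            pieces.append(snippet)
            # Account for the "\n\n" separator added by the join below.
            remaining -= len(snippet) + 2
    context_text = "\n\n".join(pieces)
    # The "Answer:" cue always stays at the end of the prompt.
    return f"Question: {query}\nContext: {context_text}\nAnswer:"
```

Calling `build_prompt(query, relevant_docs)` in place of the inline f-string would leave the rest of the function unchanged.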

mini_ladder.py ADDED

	@@ -0,0 +1,56 @@

+from transformers import pipeline
+# A second pipeline for self-critique (using a lighter model for demonstration)
+CRITIQUE_MODEL = "gpt2"  # This can be replaced with another model as needed
+critique_pipeline = pipeline("text-generation", model=CRITIQUE_MODEL)
+def generate_sub_questions(main_query: str):
+    """
+    Naively generates sub-questions for the given main query.
+    """
+    return [
+        f"1) What are common causes of {main_query}?",
+        f"2) Which medications are typically used for {main_query}?",
+        f"3) What are non-pharmacological approaches to {main_query}?"
+    ]
+def self_critique_and_refine(query: str, initial_answer: str, docs: list):
+    """
+    Critiques the initial answer and refines it if necessary.
+    """
+    # Step 1: Generate a critique using a critique prompt
+    critique_prompt = (
+        f"The following is an answer to the question '{query}'. "
+        "Evaluate its correctness, clarity, and completeness. "
+        "List any missing details or inaccuracies.\n\n"
+        f"ANSWER:\n{initial_answer}\n\n"
+        "CRITIQUE:"
+    )
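+    # return_full_text=False drops the echoed prompt; otherwise the word "missing"
+    # in the prompt itself would always trigger the refinement check below.
+    critique_gen = critique_pipeline(critique_prompt, max_new_tokens=80, truncation=True, return_full_text=False)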
+    if critique_gen and isinstance(critique_gen, list):
+        critique_text = critique_gen[0]["generated_text"]
+    else:
+        critique_text = "No critique generated."
+    # Step 2: If the critique suggests issues, refine the answer using the original QA pipeline.
+    if any(word in critique_text.lower() for word in ["missing", "incomplete", "incorrect", "lacks"]):
+        refine_prompt = (
+            f"Question: {query}\n"
+            f"Current Answer: {initial_answer}\n"
+            f"Critique: {critique_text}\n"
+            "Refine the answer by adding missing or corrected information. "
+            "Use the context below if needed:\n\n"
+            + "\n\n".join(docs)
+            + "\nREFINED ANSWER:"
+        )
+        # Import the qa_pipeline from backend to reuse it (local import to avoid circular dependencies)
+        from backend import qa_pipeline
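+        # return_full_text=False keeps only the refined answer, not the echoed prompt
+        refined_gen = qa_pipeline(refine_prompt, max_new_tokens=120, truncation=True, return_full_text=False)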
+        if refined_gen and isinstance(refined_gen, list):
+            refined_answer = refined_gen[0]["generated_text"]
+        else:
+            refined_answer = initial_answer
+    else:
+        refined_answer = initial_answer
+    return refined_answer, critique_text
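
The refinement gate in `self_critique_and_refine` is a plain keyword check on the critique text. By default, a `transformers` text-generation pipeline returns the prompt followed by the continuation (unless `return_full_text=False` is passed). The critique prompt itself contains the word "missing", so a check run on the full text would fire every time. The sketch below isolates that behavior without loading any model. The helper names `strip_prompt` and `needs_refinement` are hypothetical, and the prompt is shortened for illustration.

```python
CRITIQUE_KEYWORDS = ("missing", "incomplete", "incorrect", "lacks")


def strip_prompt(generated_text: str, prompt: str) -> str:
    """Return only the continuation if the model output echoes the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):]
    return generated_text


def needs_refinement(critique_text: str) -> bool:
    """True when the critique mentions any of the trigger keywords."""
    lowered = critique_text.lower()
    return any(word in lowered for word in CRITIQUE_KEYWORDS)


prompt = "List any missing details or inaccuracies.\n\nCRITIQUE:"
echoed = prompt + " The answer is clear and complete."
print(needs_refinement(echoed))                        # True
print(needs_refinement(strip_prompt(echoed, prompt)))  # False
```

Substring matching stays naive either way. For example, a critique saying "nothing is missing" would still trigger refinement.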

requirements.txt ADDED

	@@ -0,0 +1,9 @@

+uvicorn
+sacremoses
+streamlit
+pyvis
+chromadb
+transformers
+sentence-transformers
+torch
+requests

retrieval.py ADDED

	@@ -0,0 +1,126 @@

+import os
+import requests
+import torch
+from typing import List
+import chromadb
+from chromadb.config import Settings
+from transformers import AutoTokenizer, AutoModel
+# Optional: read the NCBI API key from the environment. When it is unset the value is
+# None, and requests omits None-valued params, so no placeholder key is ever sent.
+PUBMED_API_KEY = os.environ.get("PUBMED_API_KEY")
+#############################################
+# 1) FETCH PUBMED ABSTRACTS
+#############################################
+def fetch_pubmed_abstracts(query: str, max_results: int = 5) -> List[str]:
+    """
+    Fetches PubMed abstracts for the given query using NCBI's E-utilities.
+    Returns a list of abstract texts.
+    """
+    search_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
+    params = {
+        "db": "pubmed",
+        "term": query,
+        "retmax": max_results,
+        "api_key": PUBMED_API_KEY,
+        "retmode": "json"
+    }
+    r = requests.get(search_url, params=params, timeout=10)
+    r.raise_for_status()
+    data = r.json()
+    pmid_list = data["esearchresult"].get("idlist", [])
+    abstracts = []
+    fetch_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
+    for pmid in pmid_list:
+        fetch_params = {
+            "db": "pubmed",
+            "id": pmid,
+            "rettype": "abstract",
+            "retmode": "text",
+            "api_key": PUBMED_API_KEY
+        }
+        fetch_resp = requests.get(fetch_url, params=fetch_params, timeout=10)
+        fetch_resp.raise_for_status()
+        abstract_text = fetch_resp.text.strip()
+        if abstract_text:
+            abstracts.append(abstract_text)
+    return abstracts
+#############################################
+# 2) CHROMA + EMBEDDINGS SETUP
+#############################################
+class EmbedFunction:
+    """
+    Wraps a Hugging Face embedding model to produce embeddings for a list of strings.
+    """
+    def __init__(self, model_name: str):
+        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
+        self.model = AutoModel.from_pretrained(model_name)
+        self.model.eval()
+    def __call__(self, input: List[str]) -> List[List[float]]:
+        if not input:
+            return []
+        tokenized = self.tokenizer(
+            input,
+            return_tensors="pt",
+            padding=True,
+            truncation=True,
+            max_length=512
+        )
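+        with torch.no_grad():
+            outputs = self.model(**tokenized)
+        last_hidden = outputs.last_hidden_state
+        # Mask out padding tokens so they do not dilute the mean
+        mask = tokenized["attention_mask"].unsqueeze(-1).to(last_hidden.dtype)
+        pooled = (last_hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
+        embeddings = pooled.cpu().tolist()
+        return embeddings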
+EMBED_MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
+embed_function = EmbedFunction(EMBED_MODEL_NAME)
+client = chromadb.Client(
+    settings=Settings(
+        persist_directory="chromadb_data",
+        anonymized_telemetry=False
+    )
+)
+# Updated collection name for clarity.
+collection = client.get_or_create_collection(
+    name="ai_medical_knowledge",
+    embedding_function=embed_function
+)
+def index_pubmed_docs(docs: List[str], prefix: str = "doc"):
+    """
+    Adds documents to the Chroma collection with unique IDs.
+    """
+    for i, doc in enumerate(docs):
+        if doc.strip():
+            doc_id = f"{prefix}-{i}"
+            # upsert avoids duplicate-ID problems when the same query is submitted again
+            collection.upsert(documents=[doc], ids=[doc_id])
+def query_similar_docs(query: str, top_k: int = 3) -> List[str]:
+    """
+    Retrieves the top_k similar documents from Chroma based on embedding similarity.
+    """
+    results = collection.query(query_texts=[query], n_results=top_k)
+    return results["documents"][0] if results and results["documents"] else []
+#############################################
+# 3) MAIN RETRIEVAL PIPELINE
+#############################################
+def get_relevant_pubmed_docs(user_query: str) -> List[str]:
+    """
+    End-to-end pipeline:
+      1. Fetch PubMed abstracts for the query.
+      2. Index them in Chroma.
+      3. Retrieve the top relevant documents.
+    """
+    new_abstracts = fetch_pubmed_abstracts(user_query, max_results=5)
+    if not new_abstracts:
+        return []
+    index_pubmed_docs(new_abstracts, prefix=user_query)
+    top_docs = query_similar_docs(user_query, top_k=3)
+    return top_docs
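
`fetch_pubmed_abstracts` issues one `efetch` request per PMID. E-utilities accepts a comma-separated list of UIDs in the `id` parameter, so the abstracts could be requested in a single call. The sketch below only builds the request parameters. `build_efetch_params` and `EFETCH_URL` are hypothetical names, and sending the key only when it is set is an assumption about the desired behavior.

```python
from typing import Dict, List, Optional

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_params(pmids: List[str], api_key: Optional[str] = None) -> Dict[str, str]:
    """Build one efetch query for all PMIDs instead of one query per PMID."""
    params = {
        "db": "pubmed",
        "id": ",".join(pmids),  # efetch accepts a comma-separated UID list
        "rettype": "abstract",
        "retmode": "text",
    }
    if api_key:  # only send the key when one is actually configured
        params["api_key"] = api_key
    return params
```

A call such as `requests.get(EFETCH_URL, params=build_efetch_params(pmid_list), timeout=10)` would then return all records in one text body. Splitting that body back into individual abstracts is left out here.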

visualization.py ADDED

	@@ -0,0 +1,42 @@

+import re
+import tempfile
+import os
+from pyvis.network import Network
+def extract_key_terms(text: str):
+    """
+    A naive approach to extract key terms by matching capitalized words.
+    """
+    return re.findall(r"\b[A-Z][a-zA-Z]+\b", text)
+def create_medical_graph(query: str, docs: list) -> str:
+    """
+    Builds a PyVis network:
+      - A central "QUERY" node.
+      - A node for each retrieved document.
+      - Sub-nodes for extracted key terms.
+    Returns the full HTML of the generated graph.
+    """
+    net = Network(height="600px", width="100%", directed=False)
+    net.add_node("QUERY", label=f"Query: {query}", color="red", shape="star")
+    for i, doc in enumerate(docs):
+        doc_id = f"Doc_{i}"
+        net.add_node(doc_id, label=f"Abstract {i+1}", color="blue")
+        net.add_edge("QUERY", doc_id)
+        terms = extract_key_terms(doc)
+        for term in set(terms):
+            term_id = f"{doc_id}_{term}"
+            net.add_node(term_id, label=term, color="green")
+            net.add_edge(doc_id, term_id)
+    # Write the network HTML to a temporary file and return its content
+    with tempfile.NamedTemporaryFile(delete=False, suffix=".html") as tmp:
+        temp_filename = tmp.name
+    net.save_graph(temp_filename)  # writes the HTML without trying to open a browser
+    with open(temp_filename, "r", encoding="utf-8") as f:
+        html_content = f.read()
+    os.remove(temp_filename)
+    return html_content
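
`extract_key_terms` matches every capitalized word, so sentence-initial words such as "The" become graph nodes alongside real terms. The sketch below is one possible refinement, not the method used in this commit. It keeps the same regular expression, drops a small set of common words, and ranks the rest by frequency. The `STOPWORDS` set, the `top_key_terms` name, and the `limit` parameter are all hypothetical.

```python
import re
from collections import Counter
from typing import List

# Hypothetical stopword list; extend as needed.
STOPWORDS = {"The", "This", "These", "In", "We", "An", "Of", "And", "For", "With"}


def top_key_terms(text: str, limit: int = 10) -> List[str]:
    """Return the most frequent capitalized words, skipping common stopwords."""
    candidates = re.findall(r"\b[A-Z][a-zA-Z]+\b", text)
    counts = Counter(term for term in candidates if term not in STOPWORDS)
    return [term for term, _ in counts.most_common(limit)]


sample = "The patient received Artemisinin. Artemisinin resolved Malaria symptoms. The Cough persisted."
print(top_key_terms(sample, limit=2))  # ['Artemisinin', 'Malaria']
```

Capping the number of terms per document would also keep the PyVis graph readable for long abstracts.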