sccastillo committed on
Commit
e8fb55a
·
1 Parent(s): cc16e90

update research-team with websearch mocked

Files changed (8)
  1. .gitignore +1 -1
  2. README.md +10 -0
  3. SETUP.md +84 -16
  4. app.py +199 -17
  5. globant_quality_assesment.md +772 -0
  6. requirements.txt +7 -0
  7. research_team.py +726 -0
  8. test_document.md +41 -0
.gitignore CHANGED
@@ -1,7 +1,7 @@
 # Environment variables
 .env
 env_template.txt
-
+.sciresearch/
 # Python
 __pycache__/
 *.py[cod]
README.md CHANGED
@@ -24,3 +24,13 @@ Scientific research FastAPI application deployed on Hugging Face Spaces.
 - `GET /` - Returns a greeting message
 
 Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+
+```bash
+uvicorn app:app \
+  --host 0.0.0.0 \
+  --port 8000 \
+  --reload \
+  --log-level info \
+  --access-log \
+  --log-config logging.conf
+```
SETUP.md CHANGED
@@ -1,4 +1,4 @@
-# SciResearch API Setup
+# SciResearch API Setup with Research Team
 
 ## 🔑 Configure the OpenAI API Key
 
@@ -16,38 +16,106 @@ OPENAI_API_KEY=your_api_key_here
 2. Click on "Settings"
 3. Go to the "Variables and secrets" section
 4. Add a new variable:
-   - **Name**: `OPENAI_API_KEY`
-   - **Value**: Your OpenAI API key (sk-...)
+   - Name: `OPENAI_API_KEY`
+   - Value: Your OpenAI API key (sk-...)
+
+## 🧬 Research Team - Core Functionality
+
+The new Research Team functionality implements a multi-agent system for Claims Anchoring and Reference Formatting following the Johnson & Johnson specifications:
+
+### 🎯 Research Team features:
+
+#### Claims Anchoring Workflow:
+- Analyzer Agent: extracts claims and classifies them hierarchically (core, supporting, contextual)
+- SearchAssistant: parallel search across Google Scholar, PubMed, and arXiv
+- Researcher Agent: anchors claims to references and validates the evidence
+
+#### Reference Formatting Workflow:
+- Editor Agent: formats references according to the J&J guidelines
+- Validation: verifies reference integrity and completeness
+
+### 🔧 Technical Architecture:
+- LangGraph: multi-agent workflow orchestration
+- Parallel Processing: simultaneous processing of multiple claims
+- Mock Tools: simulated tools for development and testing
 
42
  ## 🚀 Características
43
 
44
+ - Interfaz web interactiva: Pregunta directamente en la página principal
45
+ - Research Team Interface: Procesa documentos para análisis de claims
46
+ - API REST: Endpoints para integración
47
+ - Respuestas inteligentes: Usa OpenAI para responder preguntas
48
+ - Documentación automática: Disponible en `/docs`
49
 
50
  ## 📝 Endpoints disponibles:
51
 
52
+ ### Endpoints básicos:
53
  - `GET /` - Página principal con interfaz interactiva
54
  - `POST /api/generate` - Generar respuestas con IA
55
  - `GET /api/health` - Estado de la aplicación
56
  - `GET /docs` - Documentación Swagger UI
57
 
58
+ ### Endpoints del Research Team:
59
+ - `POST /api/research/process` - Procesar documento con Research Team
60
+ - `GET /api/research/status` - Estado del Research Team
61
+
62
  ## 🧪 Ejemplo de uso con curl:
63
 
64
+ ### Respuesta básica con IA:
65
  ```bash
66
  curl -X POST "https://sccastillo-sciresearch.hf.space/api/generate" \
67
  -H "Content-Type: application/json" \
68
  -d '{"question": "¿Qué es la inteligencia artificial?"}'
69
  ```
70
 
71
+ ### Procesamiento de documento con Research Team:
72
+ ```bash
73
+ curl -X POST "https://sccastillo-sciresearch.hf.space/api/research/process" \
74
+ -H "Content-Type: application/json" \
75
+ -d '{
76
+ "document_content": "Daratumumab is a human monoclonal antibody that targets CD38. Clinical studies have demonstrated significant efficacy in treating multiple myeloma patients. The POLLUX study demonstrated that daratumumab in combination with lenalidomide and dexamethasone significantly improved progression-free survival."
77
+ }'
78
  ```
79
+
80
+ ## 🧪 Testing con Documento de Ejemplo
81
+
82
+ El archivo `test_document.md` contiene un documento de muestra con:
83
+ - Claims médicos estructurados
84
+ - Referencias formateadas
85
+ - Metadatos de producto (Daratumumab, países LATAM)
86
+ - Información de contacto
87
+
88
+ Puedes usar este contenido para probar la funcionalidad del Research Team.
89
+
90
+
91
+ ### 1. Análisis de Documento (Analyzer Agent)
92
+ - Extrae claims y los clasifica por importancia
93
+ - Identifica producto, países, y idioma
94
+ - Genera estructura jerárquica de claims
95
+
96
+ ### 2. Búsqueda Paralela (SearchAssistant)
97
+ - Procesa solo claims core (alta prioridad)
98
+ - Búsqueda simultánea en múltiples fuentes
99
+ - Optimización de recursos y rate limiting
100
+
101
+ ### 3. Anclaje de Claims (Researcher Agent)
102
+ - Valida evidencia de soporte para cada claim
103
+ - Extrae pasajes relevantes de referencias
104
+ - Genera scoring de relevancia y calidad
105
+
106
+ ### 4. Formateo de Referencias (Editor Agent)
107
+ - Aplica guidelines de formato J&J
108
+ - Completa información faltante
109
+ - Estandariza citaciones según tipo de fuente
110
+
111
+ ### 5. Ensamblaje Final
112
+ - Combina resultados de todos los agentes
113
+ - Genera reporte completo con métricas
114
+ - Proporciona documento reconstructado
115
+
116
+ ## Optimizaciones de Performance
117
+
118
+ - Parallel Processing: Múltiples claims procesados simultáneamente
119
+ - Mock Tools: Evita rate limits durante desarrollo
120
+ - State Management: LangGraph maneja estado distribuido
121
+ - Error Handling: Tolerancia a fallos individuales
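A minimal sketch of the parallel, fault-tolerant claim processing these bullets describe, assuming a hypothetical `anchor_claim` coroutine (not the committed code):

```python
import asyncio

async def anchor_claim(claim: str) -> dict:
    # Placeholder for the Researcher agent's per-claim validation.
    await asyncio.sleep(0)  # simulate I/O (search, fetch, LLM calls)
    return {"claim": claim, "validated": True}

async def anchor_all(claims: list[str]) -> list[dict]:
    # return_exceptions=True keeps one failing claim from aborting the whole
    # batch -- the "tolerance to individual failures" noted above.
    results = await asyncio.gather(
        *(anchor_claim(c) for c in claims), return_exceptions=True
    )
    return [
        r if not isinstance(r, Exception)
        else {"claim": c, "validated": False, "error": str(r)}
        for c, r in zip(claims, results)
    ]

# asyncio.run(anchor_all(["Daratumumab targets CD38."]))
```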
app.py CHANGED
@@ -1,14 +1,18 @@
 import os
-from fastapi import FastAPI, HTTPException
+from fastapi import FastAPI, HTTPException, UploadFile, File
 from fastapi.responses import HTMLResponse
 from pydantic import BaseModel
 from dotenv import load_dotenv
+import asyncio
 
 # Import LangChain and OpenAI dependencies
-from langchain_openai import OpenAI
+from langchain_openai import OpenAI, ChatOpenAI
 from langchain.chains import LLMChain
 from langchain.prompts import PromptTemplate
 
+# Import ResearchTeam
+from research_team import create_research_team
+
 # Load environment variables
 load_dotenv()
 
@@ -16,17 +20,34 @@ load_dotenv()
 class QuestionRequest(BaseModel):
     question: str
 
+class DocumentRequest(BaseModel):
+    document_content: str
+
 class GenerateResponse(BaseModel):
     text: str
     status: str = "success"
 
+class ResearchResponse(BaseModel):
+    result: dict
+    status: str = "success"
+
 # Create the FastAPI application
 app = FastAPI(
     title="SciResearch API",
-    description="Scientific Research FastAPI application with OpenAI integration on Hugging Face Spaces",
+    description="Scientific Research FastAPI application with OpenAI integration and Research Team for Claims Anchoring",
     version="1.0.0"
 )
 
+# Initialize the ResearchTeam lazily
+research_team = None
+
+def get_research_team():
+    """Get or create the ResearchTeam instance"""
+    global research_team
+    if research_team is None:
+        research_team = create_research_team()
+    return research_team
+
 def answer_question(question: str):
     """
     Answer questions using the OpenAI LLM
@@ -51,6 +72,18 @@ def answer_question(question: str):
         api_key=openai_api_key,
         temperature=0.7
     )
+    #llm = ChatOpenAI(
+    #    model="openai/gpt-4.1",
+    #    temperature=0.7,
+    #    api_key=os.getenv("GEAI_API_KEY"),
+    #    base_url=os.getenv("GEAI_BASE_URL")
+    #)
+    llm = ChatOpenAI(
+        model="openai/gpt-4.1",
+        temperature=0.7,
+        api_key=os.getenv("GEAI_API_KEY"),
+        base_url=os.getenv("GEAI_BASE_URL")
+    )
 
     # Create the LLM chain
     llm_chain = LLMChain(
@@ -78,28 +111,50 @@ def read_root():
         <style>
             body { font-family: Arial, sans-serif; margin: 40px; }
             h1 { color: #333; }
-            .container { max-width: 600px; margin: 0 auto; }
+            .container { max-width: 800px; margin: 0 auto; }
             .form-group { margin: 20px 0; }
+            .section { border: 1px solid #ddd; padding: 20px; margin: 20px 0; border-radius: 5px; }
             input[type="text"] { width: 100%; padding: 10px; margin: 5px 0; }
-            button { background-color: #4CAF50; color: white; padding: 10px 20px; border: none; cursor: pointer; }
+            textarea { width: 100%; padding: 10px; margin: 5px 0; height: 150px; }
+            button { background-color: #4CAF50; color: white; padding: 10px 20px; border: none; cursor: pointer; margin: 5px; }
             button:hover { background-color: #45a049; }
-            #response { background-color: #f9f9f9; padding: 15px; margin-top: 20px; border-left: 4px solid #4CAF50; }
+            .research-button { background-color: #2196F3; }
+            .research-button:hover { background-color: #1976D2; }
+            #response, #research-response { background-color: #f9f9f9; padding: 15px; margin-top: 20px; border-left: 4px solid #4CAF50; }
+            #research-response { border-left-color: #2196F3; }
+            .result-section { margin: 10px 0; padding: 10px; background-color: #f5f5f5; }
+            .loading { color: #666; font-style: italic; }
         </style>
     </head>
     <body>
        <div class="container">
-            <h1>🦀 SciResearch API</h1>
-            <p>¡Bienvenido a la aplicación de investigación científica con IA!</p>
+            <h1>🦀 SciResearch API with Research Team</h1>
+            <p>¡Bienvenido a la aplicación de investigación científica con IA y equipo de research para análisis de documentos!</p>
 
-            <div class="form-group">
-                <h3>Pregunta a la IA:</h3>
-                <input type="text" id="question" placeholder="Escribe tu pregunta aquí..." />
-                <button onclick="askQuestion()">Preguntar</button>
+            <div class="section">
+                <h3>💬 Pregunta a la IA:</h3>
+                <div class="form-group">
+                    <input type="text" id="question" placeholder="Escribe tu pregunta aquí..." />
+                    <button onclick="askQuestion()">Preguntar</button>
+                </div>
+
+                <div id="response" style="display:none;">
+                    <h4>Respuesta:</h4>
+                    <p id="answer"></p>
+                </div>
             </div>
-
-            <div id="response" style="display:none;">
-                <h4>Respuesta:</h4>
-                <p id="answer"></p>
+
+            <div class="section">
+                <h3>📄 Research Team - Claims Anchoring & Reference Formatting:</h3>
+                <div class="form-group">
+                    <textarea id="document" placeholder="Pega aquí el contenido del documento para analizar claims y referencias..."></textarea>
+                    <button class="research-button" onclick="processDocument()">Procesar Documento</button>
+                </div>
+
+                <div id="research-response" style="display:none;">
+                    <h4>Resultados del Research Team:</h4>
+                    <div id="research-results"></div>
+                </div>
             </div>
 
             <h2>Endpoints disponibles:</h2>
@@ -108,6 +163,7 @@ def read_root():
             <li><a href="/api/hello">/api/hello</a> - Saludo JSON</li>
             <li><a href="/api/health">/api/health</a> - Estado de la aplicación</li>
             <li><strong>/api/generate</strong> - Generar respuestas con IA (POST)</li>
+            <li><strong>/api/research/process</strong> - Procesar documento con Research Team (POST)</li>
         </ul>
     </div>
 
@@ -140,6 +196,87 @@ def read_root():
                 alert('Error de conexión: ' + error.message);
             }
         }
+
+        async function processDocument() {
+            const document_content = document.getElementById('document').value;
+            if (!document_content.trim()) {
+                alert('Por favor pega el contenido del documento');
+                return;
+            }
+
+            // Show loading state
+            const resultsDiv = document.getElementById('research-results');
+            resultsDiv.innerHTML = '<p class="loading">Procesando documento... Esto puede tomar unos minutos.</p>';
+            document.getElementById('research-response').style.display = 'block';
+
+            try {
+                const response = await fetch('/api/research/process', {
+                    method: 'POST',
+                    headers: {
+                        'Content-Type': 'application/json',
+                    },
+                    body: JSON.stringify({document_content: document_content})
+                });
+
+                const data = await response.json();
+
+                if (response.ok) {
+                    displayResearchResults(data.result);
+                } else {
+                    resultsDiv.innerHTML = '<p style="color: red;">Error: ' + data.detail + '</p>';
+                }
+            } catch (error) {
+                resultsDiv.innerHTML = '<p style="color: red;">Error de conexión: ' + error.message + '</p>';
+            }
+        }
+
+        function displayResearchResults(result) {
+            const resultsDiv = document.getElementById('research-results');
+
+            let html = '';
+
+            // Document metadata
+            if (result.document_metadata) {
+                html += '<div class="result-section">';
+                html += '<h4>📋 Metadatos del Documento:</h4>';
+                html += '<p><strong>Producto:</strong> ' + (result.document_metadata.product || 'No detectado') + '</p>';
+                html += '<p><strong>Países:</strong> ' + (result.document_metadata.countries?.join(', ') || 'No detectados') + '</p>';
+                html += '<p><strong>Idioma:</strong> ' + (result.document_metadata.language || 'No detectado') + '</p>';
+                html += '</div>';
+            }
+
+            // Claims analysis
+            if (result.claims_analysis) {
+                html += '<div class="result-section">';
+                html += '<h4>🔍 Análisis de Claims:</h4>';
+                html += '<p><strong>Total de Claims:</strong> ' + result.claims_analysis.total_claims + '</p>';
+                html += '<p><strong>Claims Principales:</strong> ' + result.claims_analysis.core_claims_count + '</p>';
+                html += '</div>';
+            }
+
+            // Claims anchoring
+            if (result.claims_anchoring) {
+                html += '<div class="result-section">';
+                html += '<h4>⚓ Claims Anchoring:</h4>';
+                if (result.claims_anchoring.summary) {
+                    const summary = result.claims_anchoring.summary;
+                    html += '<p><strong>Claims Procesados:</strong> ' + summary.total_claims_processed + '</p>';
+                    html += '<p><strong>Validados Exitosamente:</strong> ' + summary.successfully_validated + '</p>';
+                    html += '<p><strong>Tasa de Validación:</strong> ' + Math.round(summary.validation_rate * 100) + '%</p>';
+                }
+                html += '</div>';
+            }
+
+            // Reference formatting
+            if (result.reference_formatting) {
+                html += '<div class="result-section">';
+                html += '<h4>📚 Formateo de Referencias:</h4>';
+                html += '<p><strong>Referencias Formateadas:</strong> ' + result.reference_formatting.total_references + '</p>';
+                html += '</div>';
+            }
+
+            resultsDiv.innerHTML = html;
+        }
 
         // Allow submitting with Enter
         document.getElementById('question').addEventListener('keypress', function(e) {
@@ -171,7 +308,8 @@ def health_check():
         "status": "healthy",
         "service": "sciresearch",
         "version": "1.0.0",
-        "openai_configured": openai_configured
+        "openai_configured": openai_configured,
+        "research_team_available": True
     }
 
 @app.post("/api/generate", summary="Answer user questions using OpenAI", tags=["AI Generate"], response_model=GenerateResponse)
@@ -180,3 +318,47 @@ def inference(request: QuestionRequest):
     Endpoint that generates answers to questions using the OpenAI LLM
     """
     return answer_question(question=request.question)
+
+@app.post("/api/research/process", summary="Process document with Research Team", tags=["Research Team"], response_model=ResearchResponse)
+async def process_document_research(request: DocumentRequest):
+    """
+    Endpoint that processes documents with the Research Team for Claims Anchoring and Reference Formatting
+    """
+    if not request.document_content or request.document_content.strip() == "":
+        raise HTTPException(status_code=400, detail="Please provide document content.")
+
+    try:
+        # Get the research team instance
+        team = get_research_team()
+
+        # Process the document
+        result = await team.process_document(request.document_content)
+
+        return ResearchResponse(result=result)
+
+    except Exception as e:
+        raise HTTPException(status_code=500, detail=f"Error processing document: {str(e)}")
+
+@app.get("/api/research/status")
+def get_research_status():
+    """
+    Endpoint that reports the Research Team status
+    """
+    try:
+        team = get_research_team()
+        return {
+            "status": "ready",
+            "workflow_available": True,
+            "agents": {
+                "analyzer": "ready",
+                "search_assistant": "ready",
+                "researcher": "ready",
+                "editor": "ready"
+            }
+        }
+    except Exception as e:
+        return {
+            "status": "error",
            "error": str(e),
+            "workflow_available": False
+        }
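The new `research_team.py` (726 added lines) is not expanded in this view. Given the commit message ("websearch mocked") and the `team.process_document(...)` call above, a minimal sketch of the interface app.py relies on could look like this; everything below is an assumption about the hidden module, not its actual contents:

```python
# research_team.py -- illustrative skeleton only; the committed file is not shown in this diff.
import asyncio

class MockSearchTool:
    """Stands in for Google Scholar / PubMed / arXiv during development."""
    async def search(self, query: str) -> list[dict]:
        await asyncio.sleep(0)  # no network call: avoids rate limits in testing
        return [{"title": f"Mock result for '{query}'", "url": "https://example.org"}]

class ResearchTeam:
    def __init__(self, search_tool: MockSearchTool):
        self.search_tool = search_tool

    async def process_document(self, document_content: str) -> dict:
        # The real implementation orchestrates Analyzer/Search/Researcher/Editor
        # agents via LangGraph; this stub only shapes the response the UI expects.
        claims = [s.strip() for s in document_content.split(".") if s.strip()]
        return {
            "document_metadata": {"product": None, "countries": [], "language": None},
            "claims_analysis": {"total_claims": len(claims), "core_claims_count": 0},
        }

def create_research_team() -> ResearchTeam:
    """Factory used by app.py's get_research_team()."""
    return ResearchTeam(search_tool=MockSearchTool())
```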
globant_quality_assesment.md ADDED
@@ -0,0 +1,772 @@
1
+# J&J Quality Assessment Multiagent FastAPI backend
+_______________________________________________________________________________________________________________
+
+This is a design of a multi-agent system for the content and proofreading teams engaged in the Quality Assessment process for Johnson & Johnson. We need to orchestrate a process with clear definitions of responsibilities and objectives, leveraging the MoE architecture of LLMs and compositionality/modularity.
+
+The system should serve as an internal checkpoint for the content team, ensuring that generated content undergoes a thorough review before it is submitted to the design and proofreading teams.
+
+For the proofreading team, the multi-agent backend should be designed to enhance efficiency by accelerating review times.
+
+The system needs to review a PDF document and recreate it with the necessary changes implemented. The solution needs to address each error identified during the baseline assessment individually (typos, mistranslations, reference errors, footer misalignment, etc.), prioritized by frequency of occurrence, significance, and complexity of the required solution.
+
+# Problem context
+_______________________________________________________________________________________________________________
+
+Possible asset types are: Visual Aids (VA), Email Marketing (EMKT), JPRO webpage (JPRO) and Approved Email (AE).
+
+
+# The Quality Assessment Step-by-Step
+_______________________________________________________________________________________________________________
+
+# Step #1 - PDF Parsing:
+
+- Consists of parsing the PDF as accurately and reliably as possible, following the usual reading order a person would have and considering the particularities of each content type's layout. In this step the product name, content language and target countries are identified.
+
+# Step #2 - Content Analysis:
+
+- Typo Detection: detection of grammatical, formatting, typing, syntax and translation errors, etc. in the previously extracted text. *The user can select which of the suggested changes to apply.*
+
+- Reference Search, Retrieval, Completion and Formatting: identifies the list of references in the text, searches for them through different tools in PubMed, arXiv and on the web, retrieves all the metadata related to the reference and the full text if available, completes the reference and formats it according to Johnson & Johnson guidelines. *The user can select which of the suggested changes to apply.*
+
+- Claims - References Anchoring: identifies claims/statements that are anchored with a reference, uses the text retrieved from the reference in the previous step and extracts passages from the text that support the respective claim/statement (see the data-shape sketch below).
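A minimal sketch of the output shape this step produces; the field names are illustrative assumptions, not a committed schema:

```python
from dataclasses import dataclass, field

@dataclass
class AnchoredClaim:
    claim: str                      # the anchored statement from the document
    reference: str                  # the citation it points to
    link: str | None = None         # URL resolved during reference retrieval
    supporting_passages: list[str] = field(default_factory=list)
    section: str | None = None      # where the passage appears (not the abstract)
```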
+
+# Step #3 - Metadata Analysis:
+
+When building an agent to recognize different parts of a document, especially in semi-structured formats like PDF, Word, or HTML, it's crucial to define the **key document components** typically found across most documents. Here's a list of the **most common document parts** your agent should be able to identify and classify:
+
+**Core Document Structure (Logical Parts)**
+
+| **Section** | **Description** |
+| --- | --- |
+| **Header** | Text at the top of each page. May include title, author, page number, date. |
+| **Footer** | Text at the bottom of each page. Often contains page numbers, confidentiality. |
+| **Title** | The main title of the document (usually on the first page). |
+| **Subtitle** | Additional title under the main one. Sometimes a date or version. |
+| **Table of Contents (TOC)** | Index of sections, usually early in the document. |
+| **Abstract / Executive Summary** | A short summary of the document's purpose or findings. |
+| **Chapters / Sections / Headings** | Logical divisions of content. Can be numbered or titled. |
+| **Paragraphs** | Basic text blocks. Your agent should detect boundaries between them. |
+| **Tables** | Structured data often with borders, rows, and columns. |
+| **Figures / Images** | Visuals (diagrams, charts) often with captions and references. |
+| **Captions** | Descriptive text for images/tables. |
+| **References / Bibliography** | Cited sources, usually near the end. |
+| **Appendices** | Additional material, often with technical or supplementary data. |
+| **Footnotes / Endnotes** | Extra comments or citations at the bottom of the page or end of doc. |
+| **Signatures** | Found in contracts, agreements, or formal letters. |
+| **Metadata** | Info like author, creation date, document ID (may not be visible in content). |
+
+---
+
+Marketing documents in the **pharmaceutical industry**, especially Visual Aids (VA), Email Marketing (EMKT), JPRO webpages, and Approved Emails (AE), follow a specific communication structure aligned with **compliance**, **messaging strategy**, and **medical-legal review (MLR)** requirements.
+Here's a breakdown of the **key parts** your agent should be able to identify across these document types:
+
+**📚 Common Document Parts in Pharma Marketing Materials**
+
+| **Section Name** | **Description / Function** | **Examples / Signals** |
+| --- | --- | --- |
+| **Header** | Often includes company branding, campaign title, product name, version number, approval code | Logo, "Brand", MLR ID, product name |
+| **Footer** | Regulatory information, references, disclaimers, copyright, internal tracking codes | "© 2025 PharmaCo...", MLR ID |
+| **Slide/Section Number** | In VAs and AEs, content is structured in modular slides or frames, often numbered | "Slide 3 of 10", "Frame 2" |
+| **Title / Headline** | Key message or callout, designed to grab attention | "Did You Know?", "Introducing..." |
+| **Body Copy** | Main promotional or educational text | Paragraphs, charts, bullet points |
+| **Product Claims** | Efficacy, safety, tolerability, MOA (Mechanism of Action), etc. | Often tied to footnotes/references |
+| **References Section** | Scientific references supporting the content | "1. Smith et al., JAMA 2022..." |
+| **ISI (Important Safety Information)** | Required risk disclosures; mandatory part of regulated content | Black box warnings, contraindications |
+| **PI (Prescribing Information)** | Full prescribing info, usually linked or footnoted | "Click here for full PI" |
+| **Fair Balance** | Risk/benefit balancing language | "Not all patients respond to..." |
+| **Call to Action (CTA)** | Encouragement to talk to a rep, visit a site, or request samples | "Talk to your rep", "Learn more" |
+| **Footnotes / Citations** | Notes attached to claims, figures, or data | "\*p<0.05 vs placebo" |
+| **Interactive Elements** | In AE/EMKT/JPRO, includes links, clickable modules | "Download Brochure", "Learn More" |
+| **Audience Flags** | May specify "For HCPs only", "For internal use", "For patient education" | Usually in the footer or watermark |
+| **Modular Elements (AE/Veeva)** | Approved Email often has reusable modules like Header, Core Message, CTA, Footer | In metadata or template structure |
+
+---
+
+### 🛠️ **Approach to Identify Parts**
+
+You can design your agent using a combination of the techniques below (a position-heuristic sketch follows the list):
+
+* **Layout-based heuristics**: position on page (e.g., headers are near the top, footers at bottom).
+* **Font-based features**: size, boldness, italics to distinguish titles, headings, footnotes.
+* **Keyword-based rules**: e.g., "Table of Contents", "References", "Appendix".
+* **NLP classifiers**: Train a model to classify text blocks into types based on content and formatting.
+* **PDF parser tools**: `pdfminer.six`, `PyMuPDF`, `PDFplumber` (extract coordinates, text boxes).
+* **LayoutLM family**: Use pretrained layout-aware models like [LayoutLMv3](https://huggingface.co/microsoft/layoutlmv3-base) for OCR + layout understanding.
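A minimal sketch of the layout-based heuristic using PyMuPDF from the tool list above; the 8% page-height thresholds are illustrative assumptions, not tuned values:

```python
import fitz  # PyMuPDF

def classify_blocks(pdf_path: str) -> list[tuple[int, str, str]]:
    """Label each text block as header, footer, or body by vertical position."""
    doc = fitz.open(pdf_path)
    labeled = []
    for page in doc:
        height = page.rect.height
        for x0, y0, x1, y1, text, block_no, block_type in page.get_text("blocks"):
            if block_type != 0:          # 0 = text block, 1 = image block
                continue
            if y1 < 0.08 * height:       # whole block in the top 8% -> header
                label = "header"
            elif y0 > 0.92 * height:     # whole block in the bottom 8% -> footer
                label = "footer"
            else:
                label = "body"
            labeled.append((page.number, label, text.strip()))
    return labeled
```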
+
+**Focus on footers**
+
+Footer Search, Retrieval, Selection and Validation: based on the asset type specified by the user and the product and countries identified in Step #1, the system searches for possible footers in the AI Assistant named after the product, selects those that match the asset type, product and countries, and provides a list to the user, who must download and upload the corresponding footer. It then performs a comparison to determine whether the footer is fully contained in the main content being reviewed (see the containment sketch below).
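A sketch of that final containment check, using a per-line fuzzy match from the standard library; the 0.9 threshold is an illustrative choice:

```python
from difflib import SequenceMatcher

def normalize(s: str) -> str:
    return " ".join(s.lower().split())

def footer_contained(footer_text: str, content_text: str, threshold: float = 0.9) -> bool:
    """True if every footer line appears (exactly or near-exactly) in the content."""
    content = normalize(content_text)
    for line in filter(None, map(normalize, footer_text.splitlines())):
        if line in content:
            continue  # exact containment
        # fraction of the footer line covered by its longest match in the content
        match = SequenceMatcher(None, line, content).find_longest_match(
            0, len(line), 0, len(content)
        )
        if match.size / len(line) < threshold:
            return False
    return True
```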
+
+# Multi-Agent Design and Tools: a draft!
+_______________________________________________________________________________________________________________
+
+The agents should cover:
+- Task Planning and Execution Control
+- Task completion by Agent-Skill (reader, editor, writer, validator, researcher, etc.)
+- Action Execution
+Tools can be programmatic (like a calculator) or semantic (an LLM invocation that performs a task); see the sketch below.
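To make the distinction concrete: a programmatic tool is plain deterministic code, while a semantic tool wraps an LLM call. The `llm` client below is an assumption of this sketch (any chat-completion client with an `invoke` method):

```python
def word_count(text: str) -> int:
    """Programmatic tool: a calculator-style, deterministic function."""
    return len(text.split())

def detect_typos(text: str, llm) -> str:
    """Semantic tool: delegates the task to an LLM invocation."""
    return llm.invoke(f"List the typos in the following text:\n{text}")
```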
+
+**Reader**
+Access to the following tool:
+- PDF Parsing: this step resorts to an AI agent to parse the content of the PDF and identify, from the content, the name of the product, the target countries in which the material is to be disseminated, and the text language. It includes a script to post-process the output (in JSON format) to retrieve the product, countries and language as independent variables.
+
+**Editor**
+Access to the following tool:
+- Typo/grammatical error detection: detect errors in the content and suggest a list of corrections.
+
+**Writer**
+Access to the following tools:
+- Analyze the list of errors and solve them.
+- Complete and format the reference citations using the J&J Guidelines.
+
+**Researcher**
+- Reference Search, Retrieval, Completion and Formatting:
+  - identify claims and references,
+  - search references in PubMed, arXiv, and the web,
+  - retrieve all available information (including the full text if possible),
+  - find text passages in the reference content that support the claim in the parsed PDF.
+- References Anchoring:
+  - It consists of a list of claims with their associated references, links, supporting text, etc.
+- Footer Search, Retrieval, Selection and Validation:
+  - the user is given the possibility to manually upload a PDF file with the footer template (in case the user has it).
+  - filter by the asset type being reviewed and the countries for which it is intended.
+
+**Validator**
+- Evaluate footer compliance
+
+
+UX interaction
+
+The user needs two products:
+a. The Reconstructed PDF
+b. The Revised PDF
+
+
+# Step by step discussion and constraints
+_______________________________________________________________________________________________________________
+
+-----------------------------------------------------------------------
+1. Parsing Step
+----------------------------------------------------------------------
+
+The Parsing Step is the initial phase of the Quality Assessment process, focusing on the reliable and accurate extraction of information from PDF files. This step is crucial, as the content to be reviewed may include various elements such as design components, tables, figures, links, and QR codes, all of which add to the complexity of the task.
+A possible service provider pairing uses the gemini-2.5-pro-preview-05-06 LLM (Large Language Model) for its advanced image interpretation capabilities. This method leverages AI to extract content from PDFs, aiming for high accuracy and reliability (Table 1).
+
+Agent Draft:
+--------------------
+You are an AI assistant specialized in parsing preprocessed PDFs and reconstructing content accurately.
+GUIDELINES
+
+1. Use the line-by-line input as the main source of content to ensure sentences are complete and to determine the natural reading order of the content.
+2. Use the block input to gain insights into the graphical design of the original PDF. This input helps understand which lines belong together in panels, figures, or other structured content, ensuring the reconstructed content reflects the original layout and design.
+3. When identifying a structure similar to a table or panel in the line-by-line input, follow this structure to split the content into rows and columns. Cross-reference the block input to accurately group lines into columns or panels, ensuring headers, rows, and columns are complete and logically aligned.
+4. Reconstruct the original content as accurately as possible, including text, figures, tables, panels, and graphical elements, ensuring it reflects how a reader would naturally read the document.
+5. Do not omit any information, even if it appears duplicated.
+6. Do not correct any typos or modify any words. Parsing must be literal as other agents will flag errors in the original content.
+
+RESPONSE FORMAT
+Provide the output always in markdown format, including tables and panels, and preserving the original layout.
+--------------------
+
+| Setting | Value |
+| --- | --- |
+| Provider | Anthropic |
+| Model | claude-3-7-sonnet-latest |
+| Temperature | 0.10 |
+| Max Tokens | 8192 |
+| Included features | Exact PDF parsing in natural reading order; tables, logos and figures text parsing |
+| Missing features | Link retrieval |
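Guidelines 1-2 of the draft assume two preprocessed inputs: a "line-by-line" stream (reading order) and a "block" stream (layout grouping). A sketch of producing both with PyMuPDF; the dict keys are illustrative, not the actual pipeline's format:

```python
import fitz  # PyMuPDF

def build_parser_inputs(pdf_path: str) -> dict:
    doc = fitz.open(pdf_path)
    lines, blocks = [], []
    for page in doc:
        # Block stream: coordinates let the agent group panels/figures/tables.
        for x0, y0, x1, y1, text, block_no, block_type in page.get_text("blocks"):
            if block_type == 0:  # text blocks only
                blocks.append({"page": page.number, "bbox": (x0, y0, x1, y1), "text": text})
        # Line-by-line stream: plain text in natural reading order.
        lines.extend(page.get_text("text").splitlines())
    return {"line_by_line": lines, "blocks": blocks}
```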
+
+-----------------------------------------------------------------------
+2. Analyzer: extract
+----------------------------------------------------------------------
+
+Once the content to be reviewed is parsed, we resort to an Analyzer agent capable of recognizing, from the main text:
+* The name of the product
+* The target countries in which the material will be delivered
+* The main language of the content
+
+Agent Draft
+----------
+An AI assistant capable of identifying, from the content, the product name and the countries for content delivery
+GUIDELINES
+You are an AI assistant responsible for analyzing content and retrieving three types of information:
+
+1) Name of the product that is being advertised/explained
+2) Countries mentioned in the content. Take into account that countries may appear abbreviated. Usually, countries appear in the CONTACT INFORMATION section. Possible options include:
+
+* Brasil
+* Argentina
+* Chile
+* Uruguay
+* Paraguay
+* Bolivia
+* Colombia
+* Ecuador
+* Peru
+* Mexico
+* CENCA
+
+3) Main content language: spanish, english or portuguese
+
+CRITICAL RULES
+- If CENCA is mentioned in the content, it MUST BE included in the countries list
+- If daratumumab is mentioned in the content, darzalex MUST BE identified as the product
+
+RESPONSE FORMAT
+DO NOT INCLUDE ANY ADDITIONAL INFORMATION OTHER THAN THE REQUESTED PRODUCT AND COUNTRIES. THE ANSWER MUST BE IN JSON FORMAT. DO NOT INCLUDE MARKDOWN. DO NOT INCLUDE HTML TAGS. DO NOT PROVIDE THE RESPONSE WITH ANY KIND OF FORMATTING.
+- Output MUST be in JSON format
+- The name of the product MUST be in lower case
+- The name of the product MUST NOT contain special characters
+- Output must be {'product': name_of_product_lower_case_no_special_characters, 'countries': name_of_countries, 'language': main_content_language}.
+- DO NOT respond in markdown
+- DO NOT respond in HTML
+- DO NOT respond in any formatting style
+- DO NOT include formatting characters, tags or special characters
+
+RESPONSE EXAMPLE
+
+{'product': 'aspirin', 'countries': ['Argentina', 'Chile'], 'language': 'spanish'}.
+-------------
+
+| Setting | Value |
+| --- | --- |
+| Provider | OpenAI |
+| Model | gpt-4o-2024-11-20 |
+| Temperature | 0.0 |
+| Max Tokens | 2048 |
+| Capabilities | Capable of recognizing countries and products |
+| Missing features | Recognizing asset types; recognition of product names without special characters due to logos; expanding the critical rules section with exceptions when products are named differently in different countries |
+| Constraints | Handling of pieces of content without reference to a specific product (see examples in Brasil/Spravato); identifying the asset type from the content; if the PDF is wrongly parsed and the entire document is not captured, country identification may fail (as it is based on the contact information, logos and QR codes that appear at the end of the document) |
+
+This AI assistant provides the answer in JSON format, including the keys "product", "countries" and "language". For this reason, a JavaScript script (Figure 5) is included in the flow to post-process this output and assign the value of each key to a context variable.
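The JavaScript post-processing script (Figure 5) is not reproduced here; an equivalent Python sketch follows. Note that the agent's documented example output uses single quotes, so a strict JSON parse needs a fallback:

```python
import ast
import json

def parse_analyzer_output(raw: str) -> tuple[str, list[str], str]:
    """Split the analyzer's JSON-ish answer into independent context variables."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        data = ast.literal_eval(raw)  # tolerates {'product': ...} style output
    return data["product"], data["countries"], data["language"]

product, countries, language = parse_analyzer_output(
    "{'product': 'aspirin', 'countries': ['Argentina', 'Chile'], 'language': 'spanish'}"
)
```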
+
+-----------------------------------------------------------------------
+3. Analyzer: detect
+----------------------------------------------------------------------
+
+Typo Detection Step. As previously mentioned, the Typo Detection Step consists of detecting grammatical, formatting, typing, syntax and translation errors, etc. in the content retrieved from the PDF file in the first step. The input to the agent is the parsed PDF and the output is a list of error correction suggestions.
+
+Agent Draft
+------------------
+Agent specialized in detecting and correcting typos, spelling mistakes, and punctuation errors in English, Spanish, Portuguese, and any other language
+Guidelines
+
+1. Analyse the provided text content.
+2. Detect typos, spelling mistakes, grammar mistakes, formatting errors, and punctuation errors in the text. Dismiss line breaks (\n), HTML tags, markdown instructions and image descriptions.
+3. Apply grammatical rules specific to the language of the text (English, Spanish, or Portuguese).
+4. DO consider the context and language of the entire content to ensure proper corrections. Spot mistranslations.
+5. Ensure proper names or drug names are not altered unless they contain errors.
+6. You MUST detect EVERY error. DO NOT miss errors.
+7. DO NOT hallucinate. DO NOT invent content that is not found in the original material.
+8. DO NOT consider reference preambles such as "Adaptado de", "Adapted from", "Extracted from" an error.
+9. Respond in the same language as the user's first instruction. NO explanation. NO chain of thought. NO text generation. NO interpretation. NO preambles. NO additional context.
+10. Response modes:
+- The agent ONLY lists the found errors and suggested corrections in a numbered format.
+- If instructed to apply one or more of the suggested corrections, the agent will return the full original text in markdown format with the selected corrections applied, preserving all formatting, layout, and special characters. DO NOT create extra content. DO NOT perform other changes than the requested ones.
+
+-------------------
+
+| Setting | Value |
+| --- | --- |
+| Provider | Anthropic |
+| Model | claude-3-7-sonnet-latest |
+| Reasoning strategy | Chain of Thought |
+| Creativity Level | 0.3 |
+| Max Tokens | 8192 |
+| Capabilities | Typo, grammar, formatting, spelling and punctuation error detection; Spanish, English and Portuguese; provides an error list and applies corrections (see the sketch below) |
+| Upcoming features | Incorporate a domain knowledge dictionary |
+| Constraints | Finding absolutely all the typos; double-space recognition due to constraints in the parsing step at the beginning of the flow |
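A minimal sketch of the two response modes above, where the user picks which numbered suggestions to apply; the data shape is an assumption for illustration:

```python
from dataclasses import dataclass

@dataclass
class Correction:
    number: int
    original: str
    suggestion: str

def apply_selected(text: str, corrections: list[Correction], selected: set[int]) -> str:
    """Apply only the corrections the user selected by number."""
    for c in corrections:
        if c.number in selected:
            text = text.replace(c.original, c.suggestion)
    return text

fixes = [Correction(1, "recieve", "receive"), Correction(2, "teh", "the")]
print(apply_selected("Pls recieve teh report", fixes, selected={1}))  # only fix #1
```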
+
+-----------------------------------------------------------------------
+4. Researcher
+----------------------------------------------------------------------
+
+Reference Completion and Formatting Step. AI agent in charge of validating claims and reference anchoring, as well as formatting the reference list according to the J&J Guidelines.
+
+Agent Draft
+----------------
+Claims Anchoring and Reference Formatting
+Agent Purpose
+Agent specialized in retrieving references and their content from PubMed, arXiv, or the web.
+Background Knowledge
+
+Guidelines
+1. ONLY focus on the current input and context variables: DO NOT consider any conversation history. ONLY analyze the input variables provided to the agent and the context variables. Review ALL anchored sentences and ALL references in the content. DO NOT OMIT anchored sentences. DO NOT OMIT listed references. DO NOT OMIT references in legends.
+2. Identify References: Analyze the provided input to identify sentences that are anchored with references, using the accompanying list of references.
+3. Locate References: Cross-reference the identified sentences with the reference list to pinpoint the relevant citations.
+4. Access Reference Information: Utilize the "PubMed Search" tool as the first option to find the corresponding references. Be mindful of potential typos in the references; if a reference is not found, attempt to search using only the title information. If not found, resort to the "arXiv Search" or "Web Search" tools to find the corresponding references. TRY TO FIND as many of the references as possible using ANY of these tools.
+5. Retrieve Detailed Metadata and Full Text using the "PubMed Fetch" tool or the "Web Scrapper Httpx" tool. Obtain the detailed metadata and the full text (if accessible) in PubMed or the publisher's webpage (if a link is available). If not, try using arXiv or other web pages like ResearchGate.
+- If the reference is accessible, extract the following:
+* The link to the reference
+* The reference metadata to complete the citation (including its source type)
+* The exact text from the reference that supports the sentence. Disclose the SUPPORTING TEXT accurately. Supporting text MUST BE in the main content. DO NOT consider the abstract. DO NOT use the abstract or summary text.
+* The section of the document where the supporting text appears. The abstract MUST NOT be considered.
+- If the reference cannot be accessed (e.g., not found, not open access, API rate limit exceeded), provide the link to the page and CLEARLY and ACCURATELY indicate this status.
+6. Repeat for ALL Sentences and References: Perform the above steps for each sentence that contains a reference.
+7. Respond once you have EXHAUSTED ALL WAYS of accessing the references and their full text.
+8. Use the "jnj_reference_formatting" agent tool to complete and format ALL the references in the list of references. Provide the agent with the main content, the metadata previously retrieved using the "PubMed Fetch" or "Web Scrapper Httpx" tools, and the context variables.
+9. Provide Results in Two Sections:
+- CLAIMS SECTION: A bulleted list (DO NOT use numbers) where each anchored sentence is disclosed in order of appearance. Below each sentence, with appropriate indentation, specify the following in additional bullets:
+* The corresponding reference.
+* The related link.
+* The supporting text EXACTLY as it appears in the reference content.
+* The section where it appears.
+- FORMATTING SECTION: Based on the jnj_reference_formatting agent's answer, present the COMPLETE numbered reference list formatted as requested, maintaining the original order (e.g., 1, 2, 3, etc.). For each reference, give a brief explanation of the changes or indicate if the reference could not be found, completed or formatted. ALL references must be disclosed.
+10. Response formatting: ONLY provide the section name and ONLY include the requested information:
+- NO explanations
+- NO additional text
+- NO interpretations
+- NO preambles
+- NO context
+11. DO NOT OMIT any sentence. DO NOT OMIT any reference.
+----------------------
+
+| Setting | Value |
+| --- | --- |
+| Provider | OpenAI |
+| Model | gpt-4.1 |
+| Reasoning strategy | Chain of Thought |
+| Creativity Level | 0.2 |
+| Max Tokens | 12288 |
+| Tools | com.globant.geai.pubmed_search; com.globant.geai.arxiv_search; com.globant.geai.web_search; com.globant.geai.pubmed_fetch; com.globant.geai.web_scrapper_httpx; jnj_reference_formatting |
+| Capabilities | Claims and reference list identification; reference search (PubMed Search, arXiv Search, Web Search); reference fetch (metadata and full text: PubMed Fetch, Web Scrapper Httpx); source identification; reference completion and formatting according to the content onboarding guidelines; retrieval of reference main-content passages that support the corresponding claim |
+| Upcoming features | Access to Veeva references (with IMR approval) through a different assistant |
+| Constraints | Search in PubMed sometimes does not give results, even though the reference is in PubMed; PubMed API error: exceeded API rate limit when analyzing multiple references or multiple documents; references with errors may not be found (even though typos and mistranslations are corrected after the typo detection step in the flow) |
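Guideline 4's search cascade as a sketch: PubMed first, then arXiv, then the web, retrying each tool with the title alone to tolerate citation typos. The tool functions mirror the table's tool names but are hypothetical stubs here:

```python
def pubmed_search(query: str) -> dict | None: ...  # stub for com.globant.geai.pubmed_search
def arxiv_search(query: str) -> dict | None: ...   # stub for com.globant.geai.arxiv_search
def web_search(query: str) -> dict | None: ...     # stub for com.globant.geai.web_search

def find_reference(full_citation: str, title_only: str) -> dict | None:
    for tool in (pubmed_search, arxiv_search, web_search):
        for query in (full_citation, title_only):
            try:
                hit = tool(query)
            except RuntimeError:  # e.g. API rate limit exceeded
                continue
            if hit:
                return hit
    return None  # guideline 5: clearly report that the reference was not accessible
```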
+
+-----------------------------------------------------------------------
+5. Writer
+----------------------------------------------------------------------
+
+Agent Draft
+------------------
+Agent Role
+Reference Formatting
+Agent Purpose
+Agent specialized in formatting references using the J&J formatting guidelines
+Background Knowledge
+You are an expert in reference formatting
+Guidelines
+1. Find ALL the references in the main content and use the metadata provided by another agent.
+2. For EACH reference, identify the source type (e.g., Journal Article, Book, Website, etc.) based on the provided information.
+3. Format EACH reference according to the predefined rules for the identified source type. COMPLETE the reference if it has missing fields. FOLLOW THESE GUIDELINES STRICTLY. References MUST BE formatted using ONLY these RULES:
+- Journal Article (Journal): Authors. Article Title. Abbreviated Journal Title Year; Volume(Number): Pages.
+- Journal Article (Epub): Authors. Article Title. Abbreviated Journal Title [Internet]. Epub Year Month Day [cited Year Month Day]; Volume(Number): Pages. Available from: DOI
+- Supplementary Appendix (Supplementary Appendix): Authors. Article Title. Abbreviated Journal Title Year; Volume(Number): Pages. Supplementary Material.
+- Label (Label): Medicine Name [package insert]. Place of Publication: Manufacturer; Year.
+- Abstract (Abstracts): Authors. Article Title [abstract]. Abbreviated Journal Title Year; Volume(Number): Page.
+- Poster (Poster): Authors. Title. Poster session presented at: [descriptive text]. Event Name; Event Year Month Date; Event Location.
+- Oral Presentation (Oral Presentations): Authors. Title. [Oral presentation presented at: Event Name; Event Year Month Date; Event Location.]
+- Website (Website): Authors/Website Name [Internet]. Title. [Accessed Year Month Day]. Available from: Website URL.
+- Book (Book): Authors. Title. Edition Number. Place of Publication: Publisher, Year. Chapter Number: Chapter Title; Pages.
+4. Special rules:
+- Authors: Use the first, second and third authors + "et al." when there are more than 5 authors
+- Use italics ONLY for book titles.
+- Only if the content is in Portuguese, use the following formatting for package inserts: "Bula de [drug name]® ([molecule])"
+- Translate clarifications such as "cited", "Available from", "Supplementary Material", "package insert", "abstract", "Poster session presented at", "Oral presentation presented at", "Accessed" taking into consideration the main-content language context variable (spanish, english or portuguese). When in doubt, use English.
+- Months should be disclosed completely (e.g. "December")
+5. Response format:
+- If references DON'T appear in the content, ONLY RESPOND that references were not found.
+- If references DO appear in the content, ONLY respond with:
+* A numbered list of ALL references in the original order (e.g. 1, 2, 3...). DO NOT OMIT references.
+* For each reference indicate ONLY the corrected reference and specify ONLY the applied formatting changes and corrected errors.
+* Specify if the reference was not found in PubMed and if NO changes were applied.
+- NO explanations
+- NO additional text
+- NO interpretations
+- NO preambles
+- NO context
+Example
+Input: Journal: Rajpurkar, Pranav, Emma Chen, Oishi Banerjee, and Eric J. Topol. "AI in health and medicine." Nature medicine 28, no. 1 (2022): 31-38.
+Output: Rajpurkar, P, et al. AI in health and medicine. Nat. Med. 2022; 28(1): 31-38.
+
+Input: Journal: Smith, Matthew R., Fred Saad, Simon Chowdhury, Stéphane Oudard, Boris A. Hadaschik, Julie N. Graff, David Olmos et al. "Apalutamide treatment and metastasis-free survival in prostate cancer." New England Journal of Medicine 378, no. 15 (2018): 1408-1418.
+Output: Smith MR, Saad F, Chowdhury S, et al. Apalutamide Treatment and Metastasis-free survival in Prostate Cancer. N Engl J Med. 2018; 378(15): 1408-1418
+
+Input: Journal Epub: Korona-Glowniak, Izabela, Artur Niedzielski, and Anna Malm. "Upper respiratory colonization by Streptococcus pneumoniae in healthy pre-school children in south-east Poland." International journal of pediatric otorhinolaryngology 75, no. 12 (2011): 1529-1534.
+Output: Korona-Glowniak I, Niedzielski A, Malm A. Upper respiratory colonization by Streptococcus pneumoniae in healthy pre-school children in south-east Poland. Int J Pediatr Otorhinolaryngol [Internet]. Epub 2001 Apr 18 [cited 2025 May 26]; 75(12): 1529-34. Available from: https://doi.org/10.1016/j.ijporl.2011.08.021
+
+Input: Supplementary Appendix: Smith, Matthew R., Fred Saad, Simon Chowdhury, Stéphane Oudard, Boris A. Hadaschik, Julie N. Graff, David Olmos et al. "Apalutamide treatment and metastasis-free survival in prostate cancer." New England Journal of Medicine 378, no. 15 (2018): 1408-1418.
+Output: Smith MR, Saad F, Chowdhury S, et al. Apalutamide Treatment and Metastasis-free survival in Prostate Cancer. N Engl J Med. 2018; 378(15): 1408-1418. Supplementary Material.
+
+Input: Label in spanish/english: Ibrutinib
+Output: Ibrutinib [package insert]. Buenos Aires (AR): Janssen Cilag Farmacéutica S.A 2019
+
+Input: Label in portuguese: Talvey
+Output: Bula de Talvey® (talquetamabe)
+
+Input: Abstract: Lofwall, M. R., E. C. Strain, R. K. Brooner, K. A. Kindbom, and G. E. Bigelow. "Characteristics of older methadone maintenance (MM) patients." Drug Alcohol Depend 66 (2002).
+Output: Lofwall MR, Strain EC, Brooner RK, Kindborn KA, Bigelaw GE. Characteristics of older methadone maintenance (MM) patients [abstract]. Drug Alcohol Depend 2002; 66(1): 5105.
+
+Input: Poster: Chasman, J., and R. F. Kaplan. "The effects of occupation on preserved cognitive functioning in dementia." CLINICAL NEUROPSYCHOLOGIST. Vol. 20. No. 2. 325 CHESTNUT ST, SUITE 800, PHILADELPHIA, PA 19106 USA: TAYLOR & FRANCIS INC, 2006.
+Output: Chasman J, Kaplan RF. The effects of occupation on preserved cognitive functioning in dementia. Poster session presented at: Excellence in clinical practice. 4th Annual Conference of the American Academy of Clinical Neuropsychology; 2006 Jun 15-17; Philadelphia, PA.
+
+Input: Oral presentations: Costa LJ, Chhabra S, Godby KN, Medvedova E, Cornell RF, Hall AC, Silbermann RW, Innis-Shelton R, Dhakal B, DeIdiaquez D, Hardwick P. Daratumumab, carfilzomib, lenalidomide and dexamethasone (Dara-KRd) induction, autologous transplantation and post-transplant, response-adapted, measurable residual disease (MRD)-based Dara-Krd consolidation in patients with newly diagnosed multiple myeloma (NDMM). Blood. 2019 Nov 13;134:860.
+Output: Costa LJ, Chhabra S, Godby KN, et al. Daratumumab, Carfilzomib, Lenalidomide and Dexamethasone (Dara-KRd) Induction, Autologous Transplantation and Post-Transplant, Response-Adapted, Measurable Residual Disease (MRD)-Based Dara-KRd Consolidation in Patients with Newly Diagnosed Multiple Myeloma (NDMM). [Oral presentation presented at The 61st American Society of Hematology (ASH) Annual Meeting & Exposition; December 7-10, 2019; Orlando, Florida.]
+
+Input: Website: https://www.iarc.who.int/featured-news/iarc-research-at-the-intersection-of-cancer-and-covid-19/
+Output: International Agency for Research on Cancer (WHO) [Internet]. IARC research at the intersection of cancer and COVID-19. [Accessed July 5th 2021]. Available from: https://www.iarc.who.int/featured-news/iarc-research-at-the-intersection-of-cancer-and-covid-19/
+
+Input: Book: Simons, N., Menzies, B., & Matthews, M. A short course in soil and rock slope engineering.
+Output: Simons NE, Menzies B, Matthews M. A Short Course in Soil and Rock Slope Engineering. 13th ed. Philadelphia: Lippincott. Williams & Wilkins, 2010. Chapter 3: Pharmaceutical measurement; p.35-47
+----------------
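A sketch of the Journal Article rule above ("Authors. Article Title. Abbreviated Journal Title Year; Volume(Number): Pages.") combined with the author rule (first three authors + "et al." when there are more than five); a toy illustration, not the agent itself:

```python
def format_journal_reference(authors: list[str], title: str, journal_abbrev: str,
                             year: int, volume: int, number: int, pages: str) -> str:
    # More than 5 authors: keep the first three and append "et al."
    shown = authors if len(authors) <= 5 else authors[:3] + ["et al."]
    return f"{', '.join(shown)}. {title}. {journal_abbrev} {year}; {volume}({number}): {pages}."

print(format_journal_reference(
    ["Smith MR", "Saad F", "Chowdhury S", "Oudard S", "Hadaschik BA", "Graff JN"],
    "Apalutamide Treatment and Metastasis-free Survival in Prostate Cancer",
    "N Engl J Med", 2018, 378, 15, "1408-1418",
))
# -> Smith MR, Saad F, Chowdhury S, et al. Apalutamide Treatment and
#    Metastasis-free Survival in Prostate Cancer. N Engl J Med 2018; 378(15): 1408-1418.
```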
+
464
+
465
+ Provider
466
+ OpenAI
467
+ Model
468
+ gpt-4.1
469
+ Reasoning strategy
470
+ Chain of Thought
471
+ Creativity Level
472
+ 0,3
473
+ Max Tokens
474
+ 8192
475
+ Capabilities
476
+ Reference completion and formatting according to content onboarding guidelines
477
+ Upcoming features
478
+ Access to Veeva references (with IMR approval) through a different assistant
479
+ Constraints
480
+ The same constraints as the jnj_claims_references agent, since it relies on that agent's output to perform the completion and formatting:
481
+ Searching PubMed sometimes returns no results, even though the reference is in PubMed
482
+ PubMed API error: the API rate limit is exceeded when analyzing multiple references or multiple documents (a mitigation is sketched below)
483
+ References with errors may not be found (even though typos and mistranslations are corrected after the typo detection step in the flow)
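+
+ The rate-limit constraint matches the NCBI E-utilities policy (roughly 3 requests/second without an API key). A minimal mitigation sketch, throttling and retrying with backoff (the endpoint is the real esearch URL; the retry policy itself is illustrative):
+
+ ```python
+ # Hedged sketch: PubMed esearch with simple retry/backoff on rate limiting.
+ import time
+ import requests
+
+ EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
+
+ def pubmed_search(term: str, retries: int = 3) -> dict:
+     for attempt in range(retries):
+         resp = requests.get(EUTILS, params={"db": "pubmed", "term": term, "retmode": "json"})
+         if resp.status_code == 429:  # rate limited: back off and retry
+             time.sleep(2 ** attempt)
+             continue
+         resp.raise_for_status()
+         return resp.json()
+     raise RuntimeError("PubMed rate limit persisted after retries")
+ ```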
484
+
485
+
486
+ -----------------------------------------------------------------------
487
+ 4. Writer
488
+ ----------------------------------------------------------------------
489
+
490
+ 4.1. Claims Anchoring Step
491
+ In the Claims Anchoring Step, the “CLAIMS SECTION” obtained in the previous section is shown to the user. This response consists of a list of claims; for each claim, the associated references, the link to each reference, the supporting text of the claim, and the section in which that text is found are specified.
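+
+ Illustratively, each entry of the “CLAIMS SECTION” can be thought of as a record like the following (a hypothetical sketch for readability; the field names are ours, not a contract defined by the workflow):
+
+ ```python
+ # One illustrative entry of the "CLAIMS SECTION" shown to the user.
+ claim_entry = {
+     "claim": "Drug X significantly improved progression-free survival.",
+     "references": [
+         {
+             "citation": "Author A, Author B, et al. J Med. 2023; 10(2): 100-110.",
+             "link": "https://doi.org/10.xxxx/example",  # placeholder link
+             "supporting_text": "Median PFS improved from 18.4 to 28.9 months.",
+             "section": "Results",
+         }
+     ],
+ }
+ ```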
492
+
493
+
494
+ -----------------------------------------------------------------------
495
+ 5. Validator
496
+ ----------------------------------------------------------------------
497
+
498
+ 4.2. Footers Comparison Step
499
+ For the Footers Comparison Step, we have created an AI RAG Assistant for each product to host all the footer templates corresponding to the specific product (for different asset types and countries) that are available in the Figma New J&J Footers Board. Those assistants are named in a standardized way: jnj_footers_{product}.
500
+
501
+
502
+ 6. Reflexive-RAG
503
+
504
+ More RAG assistants would need to be created for new products (or products that, at the time of writing, are not available in the board). Moreover, these RAG assistants will require continuous updates, as footers and their components tend to be replaced quite often. To make this creation and update process easier, a Python script has been developed (How-To create or update Footers RAG Assistants). In Table 8, jnj_footers_tremfya is shown as an example. This RAG assistant can be used as a template for the creation of other RAG assistants; the only difference from the RAG assistants for other products is that they are fed different PDF files.
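+
+ The referenced script is not reproduced here; a minimal sketch of the idea, assuming a hypothetical GEAI-style client (the client object and its methods are placeholders, not real API names):
+
+ ```python
+ # Hedged sketch: create or refresh a jnj_footers_{product} RAG assistant.
+ # `client` and its methods are hypothetical placeholders for the platform API.
+ from pathlib import Path
+
+ def sync_footers_assistant(client, product: str, pdf_dir: str) -> None:
+     name = f"jnj_footers_{product.lower()}"  # standardized naming convention
+     assistant = client.get_or_create_assistant(
+         name=name,
+         embedding_model="text-embedding-3-large",  # per the spec table below
+         chunk_count=5,
+     )
+     for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
+         client.upload_file(assistant_id=assistant.id, path=str(pdf))
+
+ # Usage: sync_footers_assistant(client, "tremfya", "./footers/tremfya")
+ ```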
505
+
506
+ Agent Rag Draft
507
+ ------------------------
508
+ A RAG assistant with Tremfya footers (bula, disclaimers, IPP)
509
+ EmbeddingProvider
510
+ OpenAI
511
+ EmbeddingModel
512
+ text-embedding-3-large
513
+ Prompt
514
+ You are a document retrieval assistant designed to extract and return the COMPLETE CONTENT of a single document, maintaining precise fidelity to the original text.
515
+ <document>
516
+ {context}
517
+ </document>
518
+
519
+ {question}
520
+ EXTRACTION RULES
521
+ 1. Text and Visual Fidelity:
522
+ * Extract ALL text exactly as written.
523
+ * Maintain original spacing and formatting.
524
+ * Preserve ALL special characters, including accents and symbols in Spanish, English, and Portuguese.
525
+ * Keep ALL numbers and symbols.
526
+ * Include ALL punctuation marks.
527
+ * Identify and convert any logos, QR or images to base64 format.
528
+ * Extract text from logos/QR/images using OCR, ensuring text respects accents, numbers, and symbols.
529
+
530
+ 2. Content Scope:
531
+ * Main body text
532
+ * Headers and footers
533
+ * Footnotes
534
+ * References
535
+ * Extract and include any links present in the document.
536
+
537
+ 3. Tables and visual content:
538
+ * Captions
539
+ * Lists and enumerations
540
+ * Legal text
541
+ * Disclaimers
542
+ * Copyright notices
543
+ * Preserve table structures and include any visual content in base64.
544
+
545
+ 4. Structure Preservation:
546
+ * Keep original paragraph breaks
547
+ * Keep original order
548
+ * Maintain list formatting
549
+ * Preserve table structures
550
+ * Retain section hierarchy
551
+ * Keep indentation patterns
552
+
553
+ CRITICAL RULES
554
+ * NO summarization
555
+ * NO paraphrasing
556
+ * NO content modification
557
+ * NO text generation
558
+ * NO explanations or comments
559
+ * NO interpretation
560
+ * NO formatting changes
561
+ * NO content omission
562
+ * NO additional context
563
+ * NO metadata inclusion
564
+
565
+ RESPONSE BEHAVIOR
566
+ * Return ONLY the exact document content
567
+ * Include ALL text without exception
568
+ * Include ALL images, logos and QR without exception
569
+ * Maintain precise formatting
570
+ * Preserve ALL original elements
571
+ * Replace all encoded symbols with the corresponding special characters, including accents and trademarks
572
+ * Reorder the content in the original layout
573
+ * Respect break lines and sections spacing
574
+
575
+
576
+ IMPORTANT: Your ONLY task is to return the EXACT and COMPLETE content of the document, precisely as it appears in the original.
577
+
578
+ OUTPUT FORMAT
579
+
580
+
581
+ [DOCUMENT CONTENT]
582
+ Chunk Count
583
+ 5
584
+ History Message Count
585
+ 0
586
+ LLM - Provider
587
+ OpenAI
588
+ LLM - Model
589
+ gpt-4o
590
+ Temperature
591
+ 0.0
592
+ Max Tokens
593
+ 8192
594
+ topP
595
+ 1
596
+ Retrieval
597
+ Vector Store
598
+ Capabilities
599
+ The RAG assistants are actually used as a database: they are called through an API call to retrieve all their documents, but are not used as RAG assistants per se.
600
+ Constraints
601
+ Since we need to retrieve the footer PDF files to compare them against the main content, including visual features (logos, QRs, images, etc.), the RAG is used as a database rather than as a RAG. If future updates improve the parsing of visual elements in the files, it could be used as a true RAG instead of a DB.
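+
+ In practice, then, the assistant's document store is simply dumped and the raw PDFs are compared downstream. A sketch, again with a hypothetical client API:
+
+ ```python
+ # Hedged sketch: use the RAG assistant as a plain document store.
+ # `client.list_documents` / `client.download_document` are hypothetical placeholders.
+ def fetch_all_footers(client, product: str) -> list[bytes]:
+     assistant = client.get_assistant(f"jnj_footers_{product.lower()}")
+     documents = client.list_documents(assistant_id=assistant.id)
+     # Download the original PDFs so visual elements (logos, QRs) are preserved.
+     return [client.download_document(doc.id) for doc in documents]
+ ```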
602
+
603
+
604
+ 7. Footer Selector
605
+
606
+
607
+ An AI assistant capable of filtering from a list of footers those that match the countries and asset type
608
+ Prompt
609
+ You are an AI assistant capable of filtering from a list of documents the ones that correspond to a specific asset type and country or countries.
610
+
611
+ GUIDELINES
612
+ ----------
613
+ You will receive a numbered list of document names with their corresponding URL in HTML format. You MUST consider the variables {assetType} and {countries} previously provided by the user to filter ABSOLUTELY ALL the documents that match these specifications based on the document names in the list.
614
+
615
+ Possible asset types are: Visual Aids (VA), Email Marketing (EMKT), JPRO webpage (JPRO) and Approved Email (AE).
616
+
617
+ Asset types may appear complete or abbreviated in the document's names.
618
+
619
+ The countries list may correspond to a single country or a combination of countries. Some countries may appear written differently (e.g. Brasil and Brazil)
620
+
621
+ CRITICAL RULES
622
+ - If {countries} contains multiple countries, ALL of them must appear in the document name to be selected. DO NOT return partial matches on countries.
623
+ - If the original list received as input contains documents in which none of the asset types is specified, you should ALSO include these cases in the filtered list.
624
+ - DO NOT omit any document from the original list that matches the {assetType} and {countries} specification.
625
+ - DO NOT include in the filtered list ANY document from a different asset type disclosed in its name.
626
+ - DO NOT include in the filtered list ANY document from a different country than the ones included in the countries list.
627
+
628
+ RESPONSE
629
+ You MUST respond ONLY with the filtered list with ALL the selected documents and in the same HTML formatting including the original number in the provided list, name and URL.
630
+ You MUST USE the original number as bullet. DO NOT add extra bullets or enumerations. DO NOT change the enumeration.
631
+ --------------
632
+
633
+ Provider
634
+ OpenAI
635
+ Model
636
+ gpt-4.1
637
+ Temperature
638
+ 0.0
639
+ Max Tokens
640
+ 8192
641
+ Capabilities
642
+
643
+
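+
+ As a deterministic reference for the behavior the prompt above describes, the same filtering rules can be sketched in plain Python (the alias tables are illustrative, not exhaustive):
+
+ ```python
+ # Hedged sketch of the footer-selection rules: keep documents that match the
+ # requested asset type (or name none at all) and mention ALL requested countries.
+ import re
+
+ ASSET_ALIASES = {"VA": "Visual Aids", "EMKT": "Email Marketing",
+                  "JPRO": "JPRO webpage", "AE": "Approved Email"}
+ COUNTRY_ALIASES = {"brazil": {"brazil", "brasil"}}  # illustrative alias table
+
+ def _mentions(name: str, abbr: str, full: str) -> bool:
+     # Abbreviations are matched as whole words to avoid accidental substrings.
+     return bool(re.search(rf"\b{re.escape(abbr)}\b", name, re.I)) or full.lower() in name.lower()
+
+ def matches(doc_name: str, asset_type: str, countries: list[str]) -> bool:
+     # Reject documents that explicitly name a different asset type.
+     if any(_mentions(doc_name, a, f) for a, f in ASSET_ALIASES.items() if a != asset_type):
+         return False
+     # Every requested country must appear in the name, under any known alias.
+     return all(
+         any(alias in doc_name.lower() for alias in COUNTRY_ALIASES.get(c.lower(), {c.lower()}))
+         for c in countries
+     )
+ ```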
644
+ -----------------------------------------------------------------------
645
+ 5. Analyzer:
646
+ ----------------------------------------------------------------------
647
+
648
+ The user can then decide whether or not to proceed with the content and footer comparison, either by downloading any of the provided footers and uploading it later, or by uploading another footer template PDF file from a local directory. This comparison is made by another AI assistant (Table 10) based on Gemini models, which perform well in the interpretation of images and visual components.
649
+
650
+ An AI assistant capable of checking whether a template footer is fully contained in a main document file
651
+ Prompt
652
+ PRIMARY TASK
653
+ ----------
654
+ Analyze if the second PDF (footer template/reference document) is fully contained within the first PDF (main content), considering all structural and content elements. Respond with a summarized conclusion.
655
+
656
+ ELEMENTS TO COMPARE
657
+ ----------
658
+ - Textual Content:
659
+ * Main body text
660
+ * Headers and titles
661
+ * Footnotes
662
+ * References
663
+ * Disclaimers
664
+ * Legal text
665
+ - Interactive Elements:
666
+ * URLs/hyperlinks
667
+ * QR codes
668
+ * Call-to-action (CTA) buttons and links
669
+ * Contact forms
670
+ - Contact Information:
671
+ * Phone numbers
672
+ * Email addresses
673
+ * Physical addresses
674
+ * Social media handles
675
+ - Visual Elements:
676
+ * Logos
677
+ * Brand marks
678
+ * Required symbols
679
+ * Regulatory icons
680
+ - Document Structure:
681
+ * Required sections (even if empty in template)
682
+ * Section ordering
683
+ * Information hierarchy
684
+ * Reference/abbreviation sections
685
+
686
+
687
+ COMPARISON RULES
688
+ ----------
689
+ - Template sections must exist in main document WITH content. The template must have placeholders that will be filled in the main document.
690
+ - All required elements from template must be present
691
+ - Links must be functional and identical
692
+ - CTA must be functional and identical
693
+ - Contact information must match exactly
694
+ - Visual elements must maintain required positioning
695
+
696
+
697
+ OUTPUT FORMAT
698
+ ----------
699
+ Respond with:
700
+ - Overall containment status (Yes/No)
701
+ - Structured and concise summary showing the comparisons
702
+ Provider
703
+ Google VertexAI
704
+ Model
705
+ gemini-2.5-flash-preview-04-17
706
+ Temperature
707
+ 0.30
708
+ Max Tokens
709
+ 8192
710
+ File upload
711
+ enabled
712
+ Capabilities
713
+ Capable of identifying sections (differentiating between placeholders in template and completed sections in main content)
714
+ Capable of comparing logos, QR and layout organization
715
+ Upcoming features
716
+
717
+
718
+ Constraints
719
+ Cannot access external links to check them
720
+
721
+
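+
+ A minimal sketch of how such a two-PDF containment check could be invoked with the Vertex AI Python SDK (the model name is taken from the table above; project, location, file paths, and the abridged prompt are placeholders):
+
+ ```python
+ # Hedged sketch: footer-containment check over two PDFs with Gemini on Vertex AI.
+ import vertexai
+ from vertexai.generative_models import GenerativeModel, Part
+
+ vertexai.init(project="my-project", location="us-central1")  # placeholders
+ model = GenerativeModel("gemini-2.5-flash-preview-04-17")
+
+ main_pdf = Part.from_data(data=open("main_content.pdf", "rb").read(), mime_type="application/pdf")
+ footer_pdf = Part.from_data(data=open("footer_template.pdf", "rb").read(), mime_type="application/pdf")
+
+ response = model.generate_content(
+     ["Analyze if the second PDF (footer template) is fully contained within the first PDF (main content).", main_pdf, footer_pdf],
+     generation_config={"temperature": 0.3, "max_output_tokens": 8192},
+ )
+ print(response.text)
+ ```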
722
+ ------------------------------
723
+ 7. Writer
724
+ ------------------------------
725
+
726
+ Table 11. Johnson & Johnson Changes Implementation
727
+ Specification
728
+ Description
729
+ Agent Name
730
+ jnj_changes_implementation
731
+ [available from The Lab]
732
+ Agent Role
733
+ Content Formatter and Corrector
734
+ Agent Purpose
735
+ Agent specialized in applying corrections to content while preserving the exact original formatting.
736
+ Background Knowledge
737
+ You are an expert in content formatting and corrections.
738
+ Guidelines
739
+ 1. Receive the main content in HTML, Markdown, or any other format along with the instructions for corrections.
740
+ 2. Analyze the instructions and identify the corrections to be applied.
741
+ 3. Apply the corrections STRICTLY as per the instructions. If no instructions are provided, respond with the main content. DO NOT generate content. DO NOT omit content. DO NOT hallucinate.
742
+ 4. DO NOT consider conversation history, ONLY consider the provided inputs. IGNORE all previous conversations.
743
+ 5. Ensure that the original order and formatting, including special characters, tags, and structure in the main content, remains intact.
744
+ 6. Translate the original formatting into HTML for visualization purposes. Take into account tables and graphical panels.
745
+ 7. Return ONLY the corrected content formatted as HTML based on the format originally received. DO NOT INCLUDE clarifications. DO NOT generate explanations. DO NOT include chain of thought. DO NOT add content.
746
+ Example
747
+ Input: MAIN CONTENT: <pre><code class="language-markdown">| Product®&lt;br&gt;(drugname) | ![Woman in sportswear with a circular graphic element behind her](placeholder_image_woman_sportswear) | | --- | --- | | **Increíble mejoría&lt;br&gt;observada en&lt;br&gt; esta enfermedad**&lt;br&gt;Evidencia notable. | | | [VER MÁS INFORMACIÓN] | | Estimado/a. Dr./Dra. [NOMBRE DEL MÉDICO] INSTRUCTIONS FOR CORRECTIONS: ONLY apply: reemplazar "Estimado/a. Dr./Dra. [NOMBRE DEL MÉDICO]" for "Estimado Dr. Juan Perez"
748
+
749
+ Output: <table> <tr> <td>Product®<br>(drugname)</td> <td><img src="placeholder_image_woman_sportswear" alt="Woman in sportswear with a circular graphic element behind her"></td> </tr> <tr> <td><strong>Increíble mejoría<br>observada en<br> esta enfermedad</strong><br>Evidencia notable.</td> <td></td> </tr> <tr> <td><a href="#">VER MÁS INFORMACIÓN</a></td> <td></td> </tr> <tr> <td>Estimado Dr. Juan Perez</td> <td></td> </tr> </table>
750
+
751
+
752
+ Input: MAIN CONTENT: <pre><code class="language-markdown">| Product®&lt;br&gt;(drugname) | ![Woman in sportswear with a circular graphic element behind her](placeholder_image_woman_sportswear) | | --- | --- | | **Increíble mejoría&lt;br&gt;observada en&lt;br&gt; esta enfermedad**&lt;br&gt;Evidencia notable. | | | [VER MÁS INFORMACIÓN] | | Estimado/a. Dr./Dra. [NOMBRE DEL MÉDICO] INSTRUCTIONS FOR CORRECTIONS:
753
+ Output: <table> <tr> <td>Product®<br>(drugname)</td> <td><img src="placeholder_image_woman_sportswear" alt="Woman in sportswear with a circular graphic element behind her"></td> </tr> <tr> <td><strong>Increíble mejoría<br>observada en<br> esta enfermedad</strong><br>Evidencia notable.</td> <td></td> </tr> <tr> <td><a href="#">VER MÁS INFORMACIÓN</a></td> <td></td> </tr> <tr> <td>Estimado/a. Dr./Dra. [NOMBRE DEL MÉDICO] </td> <td></td> </tr> </table>
754
+ Provider
755
+ Anthropic
756
+ Model
757
+ claude-3-7-sonnet-latest
758
+ Reasoning strategy
759
+ Chain of Thought
760
+ Creativity Level
761
+ 0.1
762
+ Max Tokens
763
+ 8192
764
+ Capabilities
765
+
766
+
767
+ Upcoming features
768
+ Reconstruction of the original PDF with the requested changes (it would probably require executing a script outside GEAI - e.g. a Python script using PyMuPDF; see the sketch at the end of this section)
769
+ Replace footer template if needed
770
+ Constraints
771
+
772
+
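+
+ For the PDF-reconstruction feature listed under Upcoming features, a minimal PyMuPDF sketch of a text-replacement pass (file names and the replacement pair are placeholders):
+
+ ```python
+ # Hedged sketch: replace a text snippet in a PDF with PyMuPDF (pip install pymupdf).
+ import fitz  # PyMuPDF
+
+ def replace_text(src: str, dst: str, old: str, new: str) -> None:
+     doc = fitz.open(src)
+     for page in doc:
+         for rect in page.search_for(old):
+             # Redact the old text and stamp the replacement in its place.
+             page.add_redact_annot(rect, text=new)
+         page.apply_redactions()
+     doc.save(dst)
+
+ # Usage: replace_text("asset.pdf", "asset_fixed.pdf",
+ #                     "Estimado/a. Dr./Dra. [NOMBRE DEL MÉDICO]", "Estimado Dr. Juan Perez")
+ ```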
requirements.txt CHANGED
@@ -2,5 +2,12 @@ fastapi
2
  uvicorn[standard]
3
  langchain
4
  langchain-openai
5
+ langchain-core
6
+ langgraph
7
  python-dotenv
8
  pydantic
9
+ typing-extensions
10
+ python-multipart
research_team.py ADDED
@@ -0,0 +1,726 @@
1
+ """
2
+ Research Team for Claims Anchoring and Reference Formatting
3
+ Implementation using LangGraph for multi-agent orchestration
4
+ """
5
+
6
+ import os
7
+ import json
8
+ import asyncio
9
+ import logging
10
+ from typing import List, Dict, Any, Optional, TypedDict, Annotated
11
+ from dataclasses import dataclass, field
12
+ from enum import Enum
13
+ import operator
14
+ from datetime import datetime
15
+
16
+ from langchain_core.messages import HumanMessage, AIMessage
17
+ from langchain_openai import ChatOpenAI
18
+ from langchain_core.prompts import ChatPromptTemplate
19
+ from langgraph.graph import StateGraph, START, END
20
+ from langgraph.graph.message import add_messages
21
+ from langgraph.prebuilt import ToolNode
22
+ from langchain_core.tools import tool
23
+ from pydantic import BaseModel
24
+
25
+ # Configure logging
26
+ logging.basicConfig(
27
+ level=logging.INFO,
28
+ format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
29
+ )
30
+ logger = logging.getLogger("ResearchTeam")
31
+
32
+ # Data Models
33
+ class ClaimType(Enum):
34
+ CORE = "core"
35
+ SUPPORTING = "supporting"
36
+ CONTEXTUAL = "contextual"
37
+
38
+ class SourceType(Enum):
39
+ GOOGLE_SCHOLAR = "google_scholar"
40
+ PUBMED = "pubmed"
41
+ ARXIV = "arxiv"
42
+
43
+ @dataclass
44
+ class Claim:
45
+ id: str
46
+ text: str
47
+ type: ClaimType
48
+ importance_score: float
49
+ position: int
50
+ context: str = ""
51
+
52
+ @dataclass
53
+ class Reference:
54
+ id: str
55
+ text: str
56
+ authors: List[str] = field(default_factory=list)
57
+ title: str = ""
58
+ journal: str = ""
59
+ year: str = ""
60
+ doi: str = ""
61
+ url: str = ""
62
+ source_type: str = ""
63
+
64
+ @dataclass
65
+ class SearchResult:
66
+ claim_id: str
67
+ source: SourceType
68
+ references: List[Reference]
69
+ supporting_text: str = ""
70
+ relevance_score: float = 0.0
71
+
72
+ @dataclass
73
+ class AnchoringResult:
74
+ claim_id: str
75
+ claim_text: str
76
+ anchored_references: List[Reference]
77
+ supporting_passages: List[str]
78
+ validation_status: str = "pending"
79
+
80
+ # State Management
81
+ class ResearchTeamState(TypedDict):
82
+ document_content: str
83
+ product: str
84
+ countries: List[str]
85
+ language: str
86
+ all_claims: List[Dict]
87
+ core_claims: List[Dict]
88
+ search_results: Dict[str, List[Dict]]
89
+ anchoring_results: List[Dict]
90
+ reference_list: List[Dict]
91
+ formatted_references: List[Dict]
92
+ final_output: Dict[str, Any]
93
+ messages: Annotated[List, add_messages]
94
+ processing_status: Dict[str, str]
95
+
96
+ # Mock Tools for Internet Search
97
+ @tool
98
+ def mock_google_scholar_search(query: str, claim_id: str) -> Dict[str, Any]:
99
+ """Mock Google Scholar search tool"""
100
+ logger.info(f"Google Scholar search for claim {claim_id}: '{query[:30]}...'")
101
+ return {
102
+ "claim_id": claim_id,
103
+ "source": "google_scholar",
104
+ "results": [
105
+ {
106
+ "id": f"gs_{claim_id}_1",
107
+ "title": f"Research paper related to: {query[:50]}...",
108
+ "authors": ["Smith, J.", "Doe, A."],
109
+ "journal": "Nature",
110
+ "year": "2023",
111
+ "doi": "10.1038/example",
112
+ "url": "https://nature.com/articles/example",
113
+ "relevance_score": 0.85
114
+ }
115
+ ]
116
+ }
117
+
118
+ @tool
119
+ def mock_pubmed_search(query: str, claim_id: str) -> Dict[str, Any]:
120
+ """Mock PubMed search tool"""
121
+ logger.info(f"PubMed search for claim {claim_id}: '{query[:30]}...'")
122
+ return {
123
+ "claim_id": claim_id,
124
+ "source": "pubmed",
125
+ "results": [
126
+ {
127
+ "id": f"pm_{claim_id}_1",
128
+ "title": f"Medical study on: {query[:50]}...",
129
+ "authors": ["Johnson, R.", "Wilson, K."],
130
+ "journal": "JAMA",
131
+ "year": "2023",
132
+ "doi": "10.1001/jama.2023.example",
133
+ "url": "https://pubmed.ncbi.nlm.nih.gov/example",
134
+ "relevance_score": 0.92
135
+ }
136
+ ]
137
+ }
138
+
139
+ @tool
140
+ def mock_arxiv_search(query: str, claim_id: str) -> Dict[str, Any]:
141
+ """Mock arXiv search tool"""
142
+ logger.info(f"arXiv search for claim {claim_id}: '{query[:30]}...'")
143
+ return {
144
+ "claim_id": claim_id,
145
+ "source": "arxiv",
146
+ "results": [
147
+ {
148
+ "id": f"ar_{claim_id}_1",
149
+ "title": f"Preprint research: {query[:50]}...",
150
+ "authors": ["Chen, L.", "Zhang, M."],
151
+ "journal": "arXiv preprint",
152
+ "year": "2024",
153
+ "url": "https://arxiv.org/abs/example",
154
+ "relevance_score": 0.78
155
+ }
156
+ ]
157
+ }
158
+
159
+ @tool
160
+ def mock_reference_fetch(reference_id: str) -> Dict[str, Any]:
161
+ """Mock tool to fetch full reference content"""
162
+ logger.debug(f"Fetching full content for reference {reference_id}")
163
+ return {
164
+ "reference_id": reference_id,
165
+ "full_text": f"This is the full text content for reference {reference_id}. It contains detailed information that supports the corresponding claim...",
166
+ "abstract": f"Abstract for {reference_id}",
167
+ "sections": ["Introduction", "Methods", "Results", "Discussion"],
168
+ "supporting_passages": [
169
+ f"Key finding 1 from {reference_id}",
170
+ f"Important conclusion from {reference_id}"
171
+ ]
172
+ }
173
+
174
+ # Research Team Agents
175
+ class AnalyzerAgent:
176
+ """Agent for document analysis and claims extraction"""
177
+
178
+ def __init__(self, llm):
179
+ self.llm = llm
180
+ self.prompt = ChatPromptTemplate.from_template("""
181
+ You are an AI assistant specialized in analyzing content and extracting claims systematically.
182
+
183
+ GUIDELINES:
184
+ 1. Analyze the provided document content to identify ALL claims and statements
185
+ 2. Classify claims hierarchically:
186
+ - CORE claims: Primary thesis/conclusions that define the document's main arguments
187
+ - SUPPORTING claims: Evidence that reinforces core claims
188
+ - CONTEXTUAL claims: Background/introductory statements
189
+ 3. Score importance (0-10) based on:
190
+ - Impact on main thesis
191
+ - Frequency of reference
192
+ - Position in document structure
193
+ 4. Extract product name, countries, and language from content
194
+ 5. DO NOT omit any claims, even if they appear duplicated
195
+
196
+ RESPONSE FORMAT:
197
+ Provide response in JSON format with:
198
+ {{
199
+ "product": "product_name_lowercase",
200
+ "countries": ["country1", "country2"],
201
+ "language": "detected_language",
202
+ "claims": [
203
+ {{
204
+ "id": "claim_1",
205
+ "text": "exact claim text",
206
+ "type": "core|supporting|contextual",
207
+ "importance_score": 9,
208
+ "position": 1,
209
+ "context": "surrounding context"
210
+ }}
211
+ ]
212
+ }}
213
+
214
+ Document Content:
215
+ {document_content}
216
+ """)
217
+
218
+ async def analyze(self, document_content: str) -> Dict[str, Any]:
219
+ """Analyze document and extract structured claims"""
220
+ logger.info("STEP 1: Starting document analysis...")
221
+
222
+ try:
223
+ logger.info("Processing document content for claims extraction")
224
+ response = await self.llm.ainvoke(
225
+ self.prompt.format_messages(document_content=document_content)
226
+ )
227
+
228
+ # Parse JSON response
229
+ result = json.loads(response.content)
230
+
231
+ # Separate core claims for priority processing
232
+ core_claims = [claim for claim in result["claims"] if claim["type"] == "core"]
233
+
234
+ logger.info(f"Analysis complete: {len(result['claims'])} total claims found")
235
+ logger.info(f"Core claims identified: {len(core_claims)}")
236
+ logger.info(f"Product detected: {result.get('product', 'Not detected')}")
237
+ logger.info(f"Countries: {', '.join(result.get('countries', ['Not detected']))}")
238
+ logger.info(f"Language: {result.get('language', 'Not detected')}")
239
+
240
+ return {
241
+ "product": result.get("product", ""),
242
+ "countries": result.get("countries", []),
243
+ "language": result.get("language", "english"),
244
+ "all_claims": result["claims"],
245
+ "core_claims": core_claims
246
+ }
247
+ except Exception as e:
248
+ logger.error(f"Analyzer error: {e}")
249
+ return {
250
+ "product": "",
251
+ "countries": [],
252
+ "language": "english",
253
+ "all_claims": [],
254
+ "core_claims": []
255
+ }
256
+
257
+ class SearchAssistant:
258
+ """Agent for parallel reference searching across multiple sources"""
259
+
260
+ def __init__(self, llm):
261
+ self.llm = llm
262
+ self.tools = [mock_google_scholar_search, mock_pubmed_search, mock_arxiv_search]
263
+
264
+ async def search_for_claim(self, claim: Dict[str, Any]) -> Dict[str, List[Dict]]:
265
+ """Perform parallel searches for a specific claim"""
266
+ claim_id = claim["id"]
267
+ claim_text = claim["text"]
268
+
269
+ logger.info(f"Searching references for claim {claim_id}: '{claim_text[:50]}...'")
270
+
271
+ # Generate search query from claim
272
+ search_query = self._extract_search_terms(claim_text)
273
+
274
+ # Parallel search across all sources
275
+ search_tasks = []
276
+ for tool in self.tools:
277
+ task = asyncio.create_task(
278
+ self._execute_search(tool, search_query, claim_id)
279
+ )
280
+ search_tasks.append(task)
281
+
282
+ search_results = await asyncio.gather(*search_tasks, return_exceptions=True)
283
+
284
+ # Aggregate results
285
+ aggregated_results = []
286
+ for result in search_results:
287
+ if isinstance(result, dict) and "results" in result:
288
+ aggregated_results.extend(result["results"])
289
+
290
+ logger.info(f"Search complete for claim {claim_id}: {len(aggregated_results)} references found")
291
+
292
+ return {claim_id: aggregated_results}
293
+
294
+ def _extract_search_terms(self, claim_text: str) -> str:
295
+ """Extract key terms from claim for search"""
296
+ # Simple keyword extraction - could be enhanced with NLP
297
+ return claim_text[:100] # Use first 100 chars as search query
298
+
299
+ async def _execute_search(self, tool, query: str, claim_id: str) -> Dict:
300
+ """Execute individual search tool"""
301
+ try:
302
+ result = tool.invoke({"query": query, "claim_id": claim_id})
303
+ return result
304
+ except Exception as e:
305
+ logger.error(f"Search error for {tool.name}: {e}")
306
+ return {"claim_id": claim_id, "results": []}
307
+
308
+ class ResearcherAgent:
309
+ """Agent for claims anchoring and validation"""
310
+
311
+ def __init__(self, llm):
312
+ self.llm = llm
313
+ self.prompt = ChatPromptTemplate.from_template("""
314
+ You are an AI assistant specialized in claims anchoring and reference validation.
315
+
316
+ GUIDELINES:
317
+ 1. ONLY focus on provided input - DO NOT consider conversation history
318
+ 2. Review ALL anchored sentences and ALL references in the content
319
+ 3. DO NOT OMIT anchored sentences or listed references
320
+ 4. For each claim, analyze provided search results to:
321
+ - Identify relevant references that support the claim
322
+ - Extract supporting text from reference content
323
+ - Validate the strength of evidence
324
+ - Rate the relevance and quality of support
325
+
326
+ RESPONSE FORMAT:
327
+ {{
328
+ "claim_id": "{claim_id}",
329
+ "validation_status": "validated|partial|unsupported",
330
+ "anchored_references": [
331
+ {{
332
+ "reference_id": "ref_id",
333
+ "supporting_text": "exact text that supports claim",
334
+ "relevance_score": 0.92,
335
+ "section": "Results"
336
+ }}
337
+ ],
338
+ "supporting_passages": ["passage1", "passage2"],
339
+ "quality_assessment": "assessment text"
340
+ }}
341
+
342
+ Claim: {claim_text}
343
+ Search Results: {search_results}
344
+ """)
345
+
346
+ async def anchor_claim(self, claim: Dict[str, Any], search_results: List[Dict]) -> Dict[str, Any]:
347
+ """Perform claims anchoring for a specific claim"""
348
+ claim_id = claim["id"]
349
+ logger.info(f"Anchoring claim {claim_id} with {len(search_results)} references")
350
+
351
+ try:
352
+ # Fetch full content for top references
353
+ enriched_results = []
354
+ for result in search_results[:3]: # Limit to top 3 results
355
+ full_content = mock_reference_fetch.invoke({"reference_id": result.get("id", "")})
356
+ result["full_content"] = full_content
357
+ enriched_results.append(result)
358
+
359
+ logger.debug(f"Retrieved full content for {len(enriched_results)} top references")
360
+
361
+ response = await self.llm.ainvoke(
362
+ self.prompt.format_messages(
363
+ claim_id=claim["id"],
364
+ claim_text=claim["text"],
365
+ search_results=json.dumps(enriched_results, indent=2)
366
+ )
367
+ )
368
+
369
+ result = json.loads(response.content)
370
+ result["claim_text"] = claim["text"]
371
+
372
+ logger.info(f"Claim {claim_id} anchored: {result.get('validation_status', 'unknown')} status")
373
+
374
+ return result
375
+ except Exception as e:
376
+ logger.error(f"Researcher error for claim {claim['id']}: {e}")
377
+ return {
378
+ "claim_id": claim["id"],
379
+ "claim_text": claim["text"],
380
+ "validation_status": "error",
381
+ "anchored_references": [],
382
+ "supporting_passages": [],
383
+ "quality_assessment": f"Error during processing: {e}"
384
+ }
385
+
386
+ class EditorAgent:
387
+ """Agent for reference formatting and validation"""
388
+
389
+ def __init__(self, llm):
390
+ self.llm = llm
391
+ self.prompt = ChatPromptTemplate.from_template("""
392
+ You are an expert in reference formatting using J&J formatting guidelines.
393
+
394
+ GUIDELINES:
395
+ 1. Format ALL references according to these rules:
396
+ - Journal Article: Authors. Article Title. Abbreviated Journal Title Year; Volume(Number): Pages.
397
+ - Journal Epub: Authors. Article Title. Journal [Internet]. Epub Year Month Day [cited Year Month Day]; Volume(Number): Pages. Available from: DOI
398
+ - Website: Authors/Website Name [Internet]. Title. [Accessed Year Month Day]. Available from: URL.
399
+ - Book: Authors. Title. Edition. Place: Publisher, Year. Chapter: Chapter Title; Pages.
400
+ 2. Special rules:
401
+ - Use first, second, third authors + "et al." when more than 3 authors
402
+ - Use italic format ONLY for book titles
403
+ - Translate terms based on content language: {language}
404
+ 3. Complete missing information where possible
405
+ 4. Maintain original reference order
406
+
407
+ RESPONSE FORMAT:
408
+ {{
409
+ "formatted_references": [
410
+ {{
411
+ "id": "ref_id",
412
+ "original": "original reference text",
413
+ "formatted": "properly formatted reference",
414
+ "changes_applied": "description of changes",
415
+ "source_type": "journal|book|website|etc",
416
+ "completion_status": "complete|incomplete|not_found"
417
+ }}
418
+ ]
419
+ }}
420
+
421
+ References to format:
422
+ {references}
423
+
424
+ Content Language: {language}
425
+ """)
426
+
427
+ async def format_references(self, references: List[Dict], language: str = "english") -> Dict[str, Any]:
428
+ """Format references according to J&J guidelines"""
429
+ logger.info(f"Formatting {len(references)} references according to J&J guidelines")
430
+ logger.info(f"Content language: {language}")
431
+
432
+ try:
433
+ response = await self.llm.ainvoke(
434
+ self.prompt.format_messages(
435
+ references=json.dumps(references, indent=2),
436
+ language=language
437
+ )
438
+ )
439
+
440
+ result = json.loads(response.content)
441
+ formatted_count = len(result.get("formatted_references", []))
442
+ logger.info(f"Reference formatting complete: {formatted_count} references processed")
443
+
444
+ return result
445
+ except Exception as e:
446
+ logger.error(f"Editor error: {e}")
447
+ return {"formatted_references": []}
448
+
449
+ # LangGraph Workflow Nodes
450
+ class ResearchTeamWorkflow:
451
+ """Main workflow orchestrator using LangGraph"""
452
+
453
+ def __init__(self):
454
+ logger.info("Initializing Research Team Workflow")
455
+
456
+ # Initialize LLM
457
+ self.llm = ChatOpenAI(
458
+ model="gpt-4",
459
+ temperature=0.1,
460
+ api_key=os.getenv("OPENAI_API_KEY")
461
+ )
462
+
463
+ # Initialize agents
464
+ self.analyzer = AnalyzerAgent(self.llm)
465
+ self.search_assistant = SearchAssistant(self.llm)
466
+ self.researcher = ResearcherAgent(self.llm)
467
+ self.editor = EditorAgent(self.llm)
468
+
469
+ # Build workflow graph
470
+ self.workflow = self._build_workflow()
471
+
472
+ logger.info("Research Team Workflow initialized successfully")
473
+
474
+ def _build_workflow(self) -> StateGraph:
475
+ """Build the LangGraph workflow"""
476
+ logger.info("Building LangGraph workflow...")
477
+
478
+ workflow = StateGraph(ResearchTeamState)
479
+
480
+ # Add nodes
481
+ workflow.add_node("analyzer", self._analyzer_node)
482
+ workflow.add_node("claims_dispatcher", self._claims_dispatcher_node)
483
+ workflow.add_node("parallel_search", self._parallel_search_node)
484
+ workflow.add_node("claims_anchoring", self._claims_anchoring_node)
485
+ workflow.add_node("reference_formatting", self._reference_formatting_node)
486
+ workflow.add_node("final_assembly", self._final_assembly_node)
487
+
488
+ # Define workflow edges
489
+ workflow.add_edge(START, "analyzer")
490
+ workflow.add_edge("analyzer", "claims_dispatcher")
491
+ workflow.add_edge("claims_dispatcher", "parallel_search")
492
+ workflow.add_edge("parallel_search", "claims_anchoring")
493
+ workflow.add_edge("claims_anchoring", "reference_formatting")
494
+ workflow.add_edge("reference_formatting", "final_assembly")
495
+ workflow.add_edge("final_assembly", END)
496
+
497
+ logger.info("LangGraph workflow built successfully")
498
+ return workflow.compile()
499
+
500
+ async def _analyzer_node(self, state: ResearchTeamState) -> ResearchTeamState:
501
+ """Document analysis and claims extraction"""
502
+ logger.info("STEP 1: Document Analysis")
503
+
504
+ result = await self.analyzer.analyze(state["document_content"])
505
+
506
+ state.update({
507
+ "product": result["product"],
508
+ "countries": result["countries"],
509
+ "language": result["language"],
510
+ "all_claims": result["all_claims"],
511
+ "core_claims": result["core_claims"],
512
+ "processing_status": {"analyzer": "completed"}
513
+ })
514
+
515
+ logger.info("STEP 1 COMPLETE: Document analysis finished")
516
+ return state
517
+
518
+ async def _claims_dispatcher_node(self, state: ResearchTeamState) -> ResearchTeamState:
519
+ """Prepare claims for parallel processing"""
520
+ logger.info("STEP 2: Claims Dispatcher - Preparing parallel processing")
521
+
522
+ core_claims = state["core_claims"]
523
+
524
+ # Initialize processing status for each claim
525
+ processing_status = state.get("processing_status", {})
526
+ for claim in core_claims:
527
+ processing_status[f"claim_{claim['id']}"] = "pending"
528
+
529
+ state["processing_status"] = processing_status
530
+
531
+ logger.info(f"STEP 2 COMPLETE: {len(core_claims)} core claims prepared for parallel processing")
532
+ return state
533
+
534
+ async def _parallel_search_node(self, state: ResearchTeamState) -> ResearchTeamState:
535
+ """Execute parallel searches for all core claims"""
536
+ logger.info("STEP 3: Parallel Search - Starting multi-source searches")
537
+
538
+ core_claims = state["core_claims"]
539
+ search_results = {}
540
+
541
+ logger.info(f"Launching parallel searches for {len(core_claims)} core claims")
542
+ logger.info("Search sources: Google Scholar, PubMed, arXiv")
543
+
544
+ # Create search tasks for all core claims
545
+ search_tasks = []
546
+ for claim in core_claims:
547
+ task = asyncio.create_task(
548
+ self.search_assistant.search_for_claim(claim)
549
+ )
550
+ search_tasks.append(task)
551
+
552
+ # Execute all searches in parallel
553
+ results = await asyncio.gather(*search_tasks, return_exceptions=True)
554
+
555
+ # Aggregate results
556
+ total_references = 0
557
+ for result in results:
558
+ if isinstance(result, dict):
559
+ search_results.update(result)
560
+ for claim_id, refs in result.items():
561
+ total_references += len(refs)
562
+
563
+ state["search_results"] = search_results
564
+
565
+ logger.info(f"STEP 3 COMPLETE: Parallel search finished - {total_references} total references found")
566
+ return state
567
+
568
+ async def _claims_anchoring_node(self, state: ResearchTeamState) -> ResearchTeamState:
569
+ """Perform claims anchoring for all core claims"""
570
+ logger.info("STEP 4: Claims Anchoring - Validating evidence support")
571
+
572
+ core_claims = state["core_claims"]
573
+ search_results = state["search_results"]
574
+ anchoring_results = []
575
+
576
+ logger.info(f"Processing {len(core_claims)} claims for anchoring")
577
+
578
+ # Create anchoring tasks
579
+ anchoring_tasks = []
580
+ for claim in core_claims:
581
+ claim_search_results = search_results.get(claim["id"], [])
582
+ task = asyncio.create_task(
583
+ self.researcher.anchor_claim(claim, claim_search_results)
584
+ )
585
+ anchoring_tasks.append(task)
586
+
587
+ # Execute all anchoring in parallel
588
+ results = await asyncio.gather(*anchoring_tasks, return_exceptions=True)
589
+
590
+ validated_count = 0
591
+ for result in results:
592
+ if isinstance(result, dict):
593
+ anchoring_results.append(result)
594
+ if result.get("validation_status") == "validated":
595
+ validated_count += 1
596
+
597
+ state["anchoring_results"] = anchoring_results
598
+
599
+ logger.info(f"STEP 4 COMPLETE: Claims anchoring finished - {validated_count}/{len(anchoring_results)} claims validated")
600
+ return state
601
+
602
+ async def _reference_formatting_node(self, state: ResearchTeamState) -> ResearchTeamState:
603
+ """Format all references according to J&J guidelines"""
604
+ logger.info("STEP 5: Reference Formatting - Applying J&J guidelines")
605
+
606
+ # Extract all references from anchoring results
607
+ all_references = []
608
+ for anchoring_result in state["anchoring_results"]:
609
+ if "anchored_references" in anchoring_result:
610
+ all_references.extend(anchoring_result["anchored_references"])
611
+
612
+ logger.info(f"Processing {len(all_references)} references for formatting")
613
+
614
+ # Format references
615
+ formatting_result = await self.editor.format_references(
616
+ all_references,
617
+ state.get("language", "english")
618
+ )
619
+
620
+ state["formatted_references"] = formatting_result.get("formatted_references", [])
621
+
622
+ logger.info("STEP 5 COMPLETE: Reference formatting finished")
623
+ return state
624
+
625
+ async def _final_assembly_node(self, state: ResearchTeamState) -> ResearchTeamState:
626
+ """Assemble final results"""
627
+ logger.info("STEP 6: Final Assembly - Generating comprehensive report")
628
+
629
+ final_output = {
630
+ "document_metadata": {
631
+ "product": state["product"],
632
+ "countries": state["countries"],
633
+ "language": state["language"]
634
+ },
635
+ "claims_analysis": {
636
+ "total_claims": len(state["all_claims"]),
637
+ "core_claims_count": len(state["core_claims"]),
638
+ "claims_details": state["all_claims"]
639
+ },
640
+ "claims_anchoring": {
641
+ "results": state["anchoring_results"],
642
+ "summary": self._generate_anchoring_summary(state["anchoring_results"])
643
+ },
644
+ "reference_formatting": {
645
+ "formatted_references": state["formatted_references"],
646
+ "total_references": len(state["formatted_references"])
647
+ },
648
+ "processing_status": state.get("processing_status", {})
649
+ }
650
+
651
+ state["final_output"] = final_output
652
+
653
+ # Log final summary
654
+ summary = final_output["claims_anchoring"]["summary"]
655
+ logger.info("FINAL RESULTS SUMMARY:")
656
+ logger.info(f" Total claims processed: {final_output['claims_analysis']['total_claims']}")
657
+ logger.info(f" Core claims: {final_output['claims_analysis']['core_claims_count']}")
658
+ logger.info(f" Successfully validated: {summary['successfully_validated']}")
659
+ logger.info(f" Validation rate: {summary['validation_rate']:.1%}")
660
+ logger.info(f" References formatted: {final_output['reference_formatting']['total_references']}")
661
+ logger.info("STEP 6 COMPLETE: Research Team workflow finished successfully!")
662
+
663
+ return state
664
+
665
+ def _generate_anchoring_summary(self, anchoring_results: List[Dict]) -> Dict[str, Any]:
666
+ """Generate summary of anchoring results"""
667
+ total_claims = len(anchoring_results)
668
+ validated_claims = sum(1 for r in anchoring_results if r.get("validation_status") == "validated")
669
+
670
+ return {
671
+ "total_claims_processed": total_claims,
672
+ "successfully_validated": validated_claims,
673
+ "validation_rate": validated_claims / total_claims if total_claims > 0 else 0,
674
+ "claims_summary": [
675
+ {
676
+ "claim_id": r["claim_id"],
677
+ "status": r.get("validation_status", "unknown"),
678
+ "references_found": len(r.get("anchored_references", []))
679
+ }
680
+ for r in anchoring_results
681
+ ]
682
+ }
683
+
684
+ async def process_document(self, document_content: str) -> Dict[str, Any]:
685
+ """Main entry point for document processing"""
686
+ start_time = datetime.now()
687
+ logger.info("=" * 80)
688
+ logger.info("RESEARCH TEAM WORKFLOW STARTED")
689
+ logger.info(f"Start time: {start_time.strftime('%Y-%m-%d %H:%M:%S')}")
690
+ logger.info(f"Document length: {len(document_content)} characters")
691
+ logger.info("=" * 80)
692
+
693
+ initial_state = ResearchTeamState(
694
+ document_content=document_content,
695
+ product="",
696
+ countries=[],
697
+ language="english",
698
+ all_claims=[],
699
+ core_claims=[],
700
+ search_results={},
701
+ anchoring_results=[],
702
+ reference_list=[],
703
+ formatted_references=[],
704
+ final_output={},
705
+ messages=[],
706
+ processing_status={}
707
+ )
708
+
709
+ # Execute workflow
710
+ final_state = await self.workflow.ainvoke(initial_state)
711
+
712
+ end_time = datetime.now()
713
+ duration = end_time - start_time
714
+
715
+ logger.info("=" * 80)
716
+ logger.info("RESEARCH TEAM WORKFLOW COMPLETED")
717
+ logger.info(f"End time: {end_time.strftime('%Y-%m-%d %H:%M:%S')}")
718
+ logger.info(f"Total duration: {duration.total_seconds():.2f} seconds")
719
+ logger.info("=" * 80)
720
+
721
+ return final_state["final_output"]
722
+
723
+ # Factory function for easy instantiation
724
+ def create_research_team() -> ResearchTeamWorkflow:
725
+ """Create and return a configured ResearchTeam instance"""
726
+ return ResearchTeamWorkflow()
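+
+ # Example usage (illustrative; requires OPENAI_API_KEY in the environment):
+ # team = create_research_team()
+ # document = open("test_document.md", encoding="utf-8").read()
+ # result = asyncio.run(team.process_document(document))
+ # print(json.dumps(result, indent=2))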
test_document.md ADDED
@@ -0,0 +1,41 @@
1
+ # Sample Medical Document for Testing ResearchTeam
2
+
3
+ ## Introduction
4
+
5
+ **Daratumumab** is a human monoclonal antibody that targets CD38, a glycoprotein highly expressed on multiple myeloma cells. Clinical studies have demonstrated significant efficacy in treating relapsed and refractory multiple myeloma patients.
6
+
7
+ ## Key Clinical Claims
8
+
9
+ ### Primary Efficacy Findings
10
+
11
+ The POLLUX study demonstrated that **daratumumab in combination with lenalidomide and dexamethasone significantly improved progression-free survival** compared to lenalidomide and dexamethasone alone (median not reached vs. 18.4 months; HR=0.37; 95% CI: 0.27-0.52; p<0.001).¹
12
+
13
+ In the CASTOR trial, **daratumumab plus bortezomib and dexamethasone showed superior overall response rates** of 83% versus 63% in the control arm (p<0.001).²
14
+
15
+ ### Safety Profile
16
+
17
+ **The most common adverse events observed were infusion-related reactions** occurring in approximately 48% of patients during the first infusion, with rates decreasing to less than 5% by the second infusion.³
18
+
19
+ ## Product Information
20
+
21
+ This study was conducted across multiple countries including **Argentina, Brazil, Chile, and Mexico** for regulatory approval in Latin American markets.
22
+
23
+ The content is provided in **Spanish** for healthcare professionals in these regions.
24
+
25
+ ## References
26
+
27
+ 1. Dimopoulos MA, Oriol A, Nopoka H, et al. Daratumumab, lenalidomide, and dexamethasone for multiple myeloma. N Engl J Med. 2016;375(14):1319-1331.
28
+
29
+ 2. Palumbo A, Chanan-Khan A, Weisel K, et al. Daratumumab, bortezomib, and dexamethasone for multiple myeloma. N Engl J Med. 2016;375(8):754-766.
30
+
31
+ 3. Safety data from pooled analysis of POLLUX and CASTOR studies. Presented at ASH 2016.
32
+
33
+ ---
34
+
35
+ **Contact Information:**
36
+ Janssen-Cilag Argentina S.A.
37
+ Buenos Aires, Argentina
38
+ Tel: +54-11-4732-5000
39
+
40
+ **Important Safety Information:**
41
+ Please refer to full prescribing information for complete safety profile and contraindications.