sccastillo committed on
Commit
e8fb55a
·
1 Parent(s): cc16e90

update research-team with websearch mocked

Files changed (8)
  1. .gitignore +1 -1
  2. README.md +10 -0
  3. SETUP.md +84 -16
  4. app.py +199 -17
  5. globant_quality_assesment.md +772 -0
  6. requirements.txt +7 -0
  7. research_team.py +726 -0
  8. test_document.md +41 -0
.gitignore CHANGED
@@ -1,7 +1,7 @@
 # Environment variables
 .env
 env_template.txt
-
+.sciresearch/
 # Python
 __pycache__/
 *.py[cod]
README.md CHANGED
@@ -24,3 +24,13 @@ Scientific research FastAPI application deployed on Hugging Face Spaces.
 - `GET /` - Returns a greeting message
 
 Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+
+```bash
+uvicorn app:app \
+  --host 0.0.0.0 \
+  --port 8000 \
+  --reload \
+  --log-level info \
+  --access-log \
+  --log-config logging.conf
+```
SETUP.md CHANGED
@@ -1,4 +1,4 @@
-# SciResearch API Setup
+# SciResearch API Setup with Research Team
 
 ## 🔑 Configure the OpenAI API Key
 
@@ -16,38 +16,106 @@ OPENAI_API_KEY=your_api_key_here
 2. Click on "Settings"
 3. Go to the "Variables and secrets" section
 4. Add a new variable:
-   - **Name**: `OPENAI_API_KEY`
-   - **Value**: Your OpenAI API key (sk-...)
+   - Name: `OPENAI_API_KEY`
+   - Value: Your OpenAI API key (sk-...)
+
+## 🧬 Research Team - Core Functionality
+
+The new Research Team functionality implements a multi-agent system for Claims Anchoring and Reference Formatting following the Johnson & Johnson specifications:
+
+### 🎯 Research Team features:
+
+#### Claims Anchoring Workflow:
+- Analyzer Agent: extracts claims and classifies them hierarchically (core, supporting, contextual)
+- SearchAssistant: parallel search across Google Scholar, PubMed, and arXiv
+- Researcher Agent: anchors claims to references and validates the evidence
+
+#### Reference Formatting Workflow:
+- Editor Agent: formats references according to the J&J guidelines
+- Validation: verifies reference integrity and completeness
+
+### 🔧 Technical Architecture:
+- LangGraph: multi-agent workflow orchestration
+- Parallel Processing: simultaneous processing of multiple claims
+- Mock Tools: simulated tools for development and testing
 
42
  ## 🚀 Características
43
 
44
+ - Interfaz web interactiva: Pregunta directamente en la página principal
45
+ - Research Team Interface: Procesa documentos para análisis de claims
46
+ - API REST: Endpoints para integración
47
+ - Respuestas inteligentes: Usa OpenAI para responder preguntas
48
+ - Documentación automática: Disponible en `/docs`
49
 
50
  ## 📝 Endpoints disponibles:
51
 
52
+ ### Endpoints básicos:
53
  - `GET /` - Página principal con interfaz interactiva
54
  - `POST /api/generate` - Generar respuestas con IA
55
  - `GET /api/health` - Estado de la aplicación
56
  - `GET /docs` - Documentación Swagger UI
57
 
58
+ ### Endpoints del Research Team:
59
+ - `POST /api/research/process` - Procesar documento con Research Team
60
+ - `GET /api/research/status` - Estado del Research Team
61
+
62
  ## 🧪 Ejemplo de uso con curl:
63
 
64
+ ### Respuesta básica con IA:
65
  ```bash
66
  curl -X POST "https://sccastillo-sciresearch.hf.space/api/generate" \
67
  -H "Content-Type: application/json" \
68
  -d '{"question": "¿Qué es la inteligencia artificial?"}'
69
  ```
70
 
71
+ ### Procesamiento de documento con Research Team:
72
+ ```bash
73
+ curl -X POST "https://sccastillo-sciresearch.hf.space/api/research/process" \
74
+ -H "Content-Type: application/json" \
75
+ -d '{
76
+ "document_content": "Daratumumab is a human monoclonal antibody that targets CD38. Clinical studies have demonstrated significant efficacy in treating multiple myeloma patients. The POLLUX study demonstrated that daratumumab in combination with lenalidomide and dexamethasone significantly improved progression-free survival."
77
+ }'
78
  ```
79
+
80
+ ## 🧪 Testing con Documento de Ejemplo
81
+
82
+ El archivo `test_document.md` contiene un documento de muestra con:
83
+ - Claims médicos estructurados
84
+ - Referencias formateadas
85
+ - Metadatos de producto (Daratumumab, países LATAM)
86
+ - Información de contacto
87
+
88
+ Puedes usar este contenido para probar la funcionalidad del Research Team.
89
+
90
+
91
+ ### 1. Análisis de Documento (Analyzer Agent)
92
+ - Extrae claims y los clasifica por importancia
93
+ - Identifica producto, países, y idioma
94
+ - Genera estructura jerárquica de claims
95
+
96
+ ### 2. Búsqueda Paralela (SearchAssistant)
97
+ - Procesa solo claims core (alta prioridad)
98
+ - Búsqueda simultánea en múltiples fuentes
99
+ - Optimización de recursos y rate limiting
100
+
101
+ ### 3. Anclaje de Claims (Researcher Agent)
102
+ - Valida evidencia de soporte para cada claim
103
+ - Extrae pasajes relevantes de referencias
104
+ - Genera scoring de relevancia y calidad
105
+
106
+ ### 4. Formateo de Referencias (Editor Agent)
107
+ - Aplica guidelines de formato J&J
108
+ - Completa información faltante
109
+ - Estandariza citaciones según tipo de fuente
110
+
111
+ ### 5. Ensamblaje Final
112
+ - Combina resultados de todos los agentes
113
+ - Genera reporte completo con métricas
114
+ - Proporciona documento reconstructado
115
+
116
+ ## Optimizaciones de Performance
117
+
118
+ - Parallel Processing: Múltiples claims procesados simultáneamente
119
+ - Mock Tools: Evita rate limits durante desarrollo
120
+ - State Management: LangGraph maneja estado distribuido
121
+ - Error Handling: Tolerancia a fallos individuales
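A minimal sketch of the parallel, fault-tolerant claim processing these bullets describe, assuming a hypothetical `anchor_claim` coroutine (not the committed code):

```python
import asyncio

async def anchor_claim(claim: str) -> dict:
    # Placeholder for the Researcher agent's per-claim validation.
    await asyncio.sleep(0)  # simulate I/O (search, fetch, LLM calls)
    return {"claim": claim, "validated": True}

async def anchor_all(claims: list[str]) -> list[dict]:
    # return_exceptions=True keeps one failing claim from aborting the whole
    # batch -- the "tolerance to individual failures" noted above.
    results = await asyncio.gather(
        *(anchor_claim(c) for c in claims), return_exceptions=True
    )
    return [
        r if not isinstance(r, Exception)
        else {"claim": c, "validated": False, "error": str(r)}
        for c, r in zip(claims, results)
    ]

# asyncio.run(anchor_all(["Daratumumab targets CD38."]))
```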
app.py CHANGED
@@ -1,14 +1,18 @@
 import os
-from fastapi import FastAPI, HTTPException
+from fastapi import FastAPI, HTTPException, UploadFile, File
 from fastapi.responses import HTMLResponse
 from pydantic import BaseModel
 from dotenv import load_dotenv
+import asyncio
 
 # Import LangChain and OpenAI dependencies
-from langchain_openai import OpenAI
+from langchain_openai import OpenAI, ChatOpenAI
 from langchain.chains import LLMChain
 from langchain.prompts import PromptTemplate
 
+# Import ResearchTeam
+from research_team import create_research_team
+
 # Load environment variables
 load_dotenv()
 
@@ -16,17 +20,34 @@ load_dotenv()
 class QuestionRequest(BaseModel):
     question: str
 
+class DocumentRequest(BaseModel):
+    document_content: str
+
 class GenerateResponse(BaseModel):
     text: str
     status: str = "success"
 
+class ResearchResponse(BaseModel):
+    result: dict
+    status: str = "success"
+
 # Create the FastAPI application
 app = FastAPI(
     title="SciResearch API",
-    description="Scientific Research FastAPI application with OpenAI integration on Hugging Face Spaces",
+    description="Scientific Research FastAPI application with OpenAI integration and Research Team for Claims Anchoring",
     version="1.0.0"
 )
 
+# Initialize the ResearchTeam lazily
+research_team = None
+
+def get_research_team():
+    """Get or create the ResearchTeam instance"""
+    global research_team
+    if research_team is None:
+        research_team = create_research_team()
+    return research_team
+
 def answer_question(question: str):
     """
     Answer questions using the OpenAI LLM
@@ -51,6 +72,18 @@ def answer_question(question: str):
         api_key=openai_api_key,
         temperature=0.7
     )
+    #llm = ChatOpenAI(
+    #    model="openai/gpt-4.1",
+    #    temperature=0.7,
+    #    api_key=os.getenv("GEAI_API_KEY"),
+    #    base_url=os.getenv("GEAI_BASE_URL")
+    #)
+    llm = ChatOpenAI(
+        model="openai/gpt-4.1",
+        temperature=0.7,
+        api_key=os.getenv("GEAI_API_KEY"),
+        base_url=os.getenv("GEAI_BASE_URL")
+    )
 
     # Create the LLM chain
     llm_chain = LLMChain(
@@ -78,28 +111,50 @@ def read_root():
         <style>
             body { font-family: Arial, sans-serif; margin: 40px; }
             h1 { color: #333; }
-            .container { max-width: 600px; margin: 0 auto; }
+            .container { max-width: 800px; margin: 0 auto; }
             .form-group { margin: 20px 0; }
+            .section { border: 1px solid #ddd; padding: 20px; margin: 20px 0; border-radius: 5px; }
             input[type="text"] { width: 100%; padding: 10px; margin: 5px 0; }
-            button { background-color: #4CAF50; color: white; padding: 10px 20px; border: none; cursor: pointer; }
+            textarea { width: 100%; padding: 10px; margin: 5px 0; height: 150px; }
+            button { background-color: #4CAF50; color: white; padding: 10px 20px; border: none; cursor: pointer; margin: 5px; }
             button:hover { background-color: #45a049; }
-            #response { background-color: #f9f9f9; padding: 15px; margin-top: 20px; border-left: 4px solid #4CAF50; }
+            .research-button { background-color: #2196F3; }
+            .research-button:hover { background-color: #1976D2; }
+            #response, #research-response { background-color: #f9f9f9; padding: 15px; margin-top: 20px; border-left: 4px solid #4CAF50; }
+            #research-response { border-left-color: #2196F3; }
+            .result-section { margin: 10px 0; padding: 10px; background-color: #f5f5f5; }
+            .loading { color: #666; font-style: italic; }
         </style>
     </head>
     <body>
        <div class="container">
-            <h1>🦀 SciResearch API</h1>
-            <p>¡Bienvenido a la aplicación de investigación científica con IA!</p>
+            <h1>🦀 SciResearch API with Research Team</h1>
+            <p>¡Bienvenido a la aplicación de investigación científica con IA y equipo de research para análisis de documentos!</p>
 
-            <div class="form-group">
-                <h3>Pregunta a la IA:</h3>
-                <input type="text" id="question" placeholder="Escribe tu pregunta aquí..." />
-                <button onclick="askQuestion()">Preguntar</button>
+            <div class="section">
+                <h3>💬 Pregunta a la IA:</h3>
+                <div class="form-group">
+                    <input type="text" id="question" placeholder="Escribe tu pregunta aquí..." />
+                    <button onclick="askQuestion()">Preguntar</button>
+                </div>
+
+                <div id="response" style="display:none;">
+                    <h4>Respuesta:</h4>
+                    <p id="answer"></p>
+                </div>
             </div>
-
-            <div id="response" style="display:none;">
-                <h4>Respuesta:</h4>
-                <p id="answer"></p>
+
+            <div class="section">
+                <h3>📄 Research Team - Claims Anchoring & Reference Formatting:</h3>
+                <div class="form-group">
+                    <textarea id="document" placeholder="Pega aquí el contenido del documento para analizar claims y referencias..."></textarea>
+                    <button class="research-button" onclick="processDocument()">Procesar Documento</button>
+                </div>
+
+                <div id="research-response" style="display:none;">
+                    <h4>Resultados del Research Team:</h4>
+                    <div id="research-results"></div>
+                </div>
             </div>
 
             <h2>Endpoints disponibles:</h2>
@@ -108,6 +163,7 @@ def read_root():
             <li><a href="/api/hello">/api/hello</a> - Saludo JSON</li>
             <li><a href="/api/health">/api/health</a> - Estado de la aplicación</li>
             <li><strong>/api/generate</strong> - Generar respuestas con IA (POST)</li>
+            <li><strong>/api/research/process</strong> - Procesar documento con Research Team (POST)</li>
         </ul>
     </div>
 
@@ -140,6 +196,87 @@ def read_root():
                 alert('Error de conexión: ' + error.message);
             }
         }
+
+        async function processDocument() {
+            const document_content = document.getElementById('document').value;
+            if (!document_content.trim()) {
+                alert('Por favor pega el contenido del documento');
+                return;
+            }
+
+            // Show loading state
+            const resultsDiv = document.getElementById('research-results');
+            resultsDiv.innerHTML = '<p class="loading">Procesando documento... Esto puede tomar unos minutos.</p>';
+            document.getElementById('research-response').style.display = 'block';
+
+            try {
+                const response = await fetch('/api/research/process', {
+                    method: 'POST',
+                    headers: {
+                        'Content-Type': 'application/json',
+                    },
+                    body: JSON.stringify({document_content: document_content})
+                });
+
+                const data = await response.json();
+
+                if (response.ok) {
+                    displayResearchResults(data.result);
+                } else {
+                    resultsDiv.innerHTML = '<p style="color: red;">Error: ' + data.detail + '</p>';
+                }
+            } catch (error) {
+                resultsDiv.innerHTML = '<p style="color: red;">Error de conexión: ' + error.message + '</p>';
+            }
+        }
+
+        function displayResearchResults(result) {
+            const resultsDiv = document.getElementById('research-results');
+
+            let html = '';
+
+            // Document metadata
+            if (result.document_metadata) {
+                html += '<div class="result-section">';
+                html += '<h4>📋 Metadatos del Documento:</h4>';
+                html += '<p><strong>Producto:</strong> ' + (result.document_metadata.product || 'No detectado') + '</p>';
+                html += '<p><strong>Países:</strong> ' + (result.document_metadata.countries?.join(', ') || 'No detectados') + '</p>';
+                html += '<p><strong>Idioma:</strong> ' + (result.document_metadata.language || 'No detectado') + '</p>';
+                html += '</div>';
+            }
+
+            // Claims analysis
+            if (result.claims_analysis) {
+                html += '<div class="result-section">';
+                html += '<h4>🔍 Análisis de Claims:</h4>';
+                html += '<p><strong>Total de Claims:</strong> ' + result.claims_analysis.total_claims + '</p>';
+                html += '<p><strong>Claims Principales:</strong> ' + result.claims_analysis.core_claims_count + '</p>';
+                html += '</div>';
+            }
+
+            // Claims anchoring
+            if (result.claims_anchoring) {
+                html += '<div class="result-section">';
+                html += '<h4>⚓ Claims Anchoring:</h4>';
+                if (result.claims_anchoring.summary) {
+                    const summary = result.claims_anchoring.summary;
+                    html += '<p><strong>Claims Procesados:</strong> ' + summary.total_claims_processed + '</p>';
+                    html += '<p><strong>Validados Exitosamente:</strong> ' + summary.successfully_validated + '</p>';
+                    html += '<p><strong>Tasa de Validación:</strong> ' + Math.round(summary.validation_rate * 100) + '%</p>';
+                }
+                html += '</div>';
+            }
+
+            // Reference formatting
+            if (result.reference_formatting) {
+                html += '<div class="result-section">';
+                html += '<h4>📚 Formateo de Referencias:</h4>';
+                html += '<p><strong>Referencias Formateadas:</strong> ' + result.reference_formatting.total_references + '</p>';
+                html += '</div>';
+            }
+
+            resultsDiv.innerHTML = html;
+        }
 
         // Allow submitting with Enter
         document.getElementById('question').addEventListener('keypress', function(e) {
@@ -171,7 +308,8 @@ def health_check():
         "status": "healthy",
         "service": "sciresearch",
         "version": "1.0.0",
-        "openai_configured": openai_configured
+        "openai_configured": openai_configured,
+        "research_team_available": True
     }
 
 @app.post("/api/generate", summary="Answer user questions using OpenAI", tags=["AI Generate"], response_model=GenerateResponse)
@@ -180,3 +318,47 @@ def inference(request: QuestionRequest):
     Endpoint that generates answers to questions using the OpenAI LLM
     """
     return answer_question(question=request.question)
+
+@app.post("/api/research/process", summary="Process document with Research Team", tags=["Research Team"], response_model=ResearchResponse)
+async def process_document_research(request: DocumentRequest):
+    """
+    Endpoint that processes documents with the Research Team for Claims Anchoring and Reference Formatting
+    """
+    if not request.document_content or request.document_content.strip() == "":
+        raise HTTPException(status_code=400, detail="Please provide document content.")
+
+    try:
+        # Get the research team instance
+        team = get_research_team()
+
+        # Process the document
+        result = await team.process_document(request.document_content)
+
+        return ResearchResponse(result=result)
+
+    except Exception as e:
+        raise HTTPException(status_code=500, detail=f"Error processing document: {str(e)}")
+
+@app.get("/api/research/status")
+def get_research_status():
+    """
+    Endpoint that reports the Research Team status
+    """
+    try:
+        team = get_research_team()
+        return {
+            "status": "ready",
+            "workflow_available": True,
+            "agents": {
+                "analyzer": "ready",
+                "search_assistant": "ready",
+                "researcher": "ready",
+                "editor": "ready"
+            }
+        }
+    except Exception as e:
+        return {
+            "status": "error",
            "error": str(e),
+            "workflow_available": False
+        }
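The new `research_team.py` (726 added lines) is not expanded in this view. Given the commit message ("websearch mocked") and the `team.process_document(...)` call above, a minimal sketch of the interface app.py relies on could look like this; everything below is an assumption about the hidden module, not its actual contents:

```python
# research_team.py -- illustrative skeleton only; the committed file is not shown in this diff.
import asyncio

class MockSearchTool:
    """Stands in for Google Scholar / PubMed / arXiv during development."""
    async def search(self, query: str) -> list[dict]:
        await asyncio.sleep(0)  # no network call: avoids rate limits in testing
        return [{"title": f"Mock result for '{query}'", "url": "https://example.org"}]

class ResearchTeam:
    def __init__(self, search_tool: MockSearchTool):
        self.search_tool = search_tool

    async def process_document(self, document_content: str) -> dict:
        # The real implementation orchestrates Analyzer/Search/Researcher/Editor
        # agents via LangGraph; this stub only shapes the response the UI expects.
        claims = [s.strip() for s in document_content.split(".") if s.strip()]
        return {
            "document_metadata": {"product": None, "countries": [], "language": None},
            "claims_analysis": {"total_claims": len(claims), "core_claims_count": 0},
        }

def create_research_team() -> ResearchTeam:
    """Factory used by app.py's get_research_team()."""
    return ResearchTeam(search_tool=MockSearchTool())
```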
globant_quality_assesment.md ADDED
@@ -0,0 +1,772 @@
1
+# J&J Quality Assessment Multiagent FastAPI backend
+_______________________________________________________________________________________________________________
+
+This is a design of a multi-agent system for the content and proofreading teams engaged in the Quality Assessment process for Johnson & Johnson. We need to orchestrate a process with clear definitions of responsibilities and objectives, leveraging the MoE architecture of LLMs and compositionality/modularity.
+
+The system should serve as an internal checkpoint for the content team, ensuring that generated content undergoes a thorough review before it is submitted to the design and proofreading teams.
+
+For the proofreading team, the multi-agent backend should be designed to enhance efficiency by accelerating review times.
+
+The system needs to review a PDF document and recreate it with the necessary changes implemented. The solution needs to address each error identified during the baseline assessment individually (typos, mistranslations, reference errors, footer misalignment, etc.), prioritized by frequency of occurrence, significance, and complexity of the required solution.
+
+# Problem context
+_______________________________________________________________________________________________________________
+
+Possible asset types are: Visual Aids (VA), Email Marketing (EMKT), JPRO webpage (JPRO) and Approved Email (AE).
+
+
+# The Quality Assessment Step-by-Step
+_______________________________________________________________________________________________________________
+
+# Step #1 - PDF Parsing:
+
+- Consists of parsing the PDF as accurately and reliably as possible, following the usual reading order a person would have and considering the particularities of each content type's layout. In this step the product name, content language and target countries are identified.
+
+# Step #2 - Content Analysis:
+
+- Typo Detection: detection of grammatical, formatting, typing, syntax and translation errors, etc. in the previously extracted text. *The user can select which of the suggested changes to apply.*
+
+- Reference Search, Retrieval, Completion and Formatting: identifies the list of references in the text, searches for them through different tools in PubMed, arXiv and on the web, retrieves all the metadata related to the reference and the full text if available, completes the reference and formats it according to Johnson & Johnson guidelines. *The user can select which of the suggested changes to apply.*
+
+- Claims - References Anchoring: identifies claims/statements that are anchored with a reference, uses the text retrieved from the reference in the previous step and extracts passages from the text that support the respective claim/statement (see the data-shape sketch below).
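A minimal sketch of the output shape this step produces; the field names are illustrative assumptions, not a committed schema:

```python
from dataclasses import dataclass, field

@dataclass
class AnchoredClaim:
    claim: str                      # the anchored statement from the document
    reference: str                  # the citation it points to
    link: str | None = None         # URL resolved during reference retrieval
    supporting_passages: list[str] = field(default_factory=list)
    section: str | None = None      # where the passage appears (not the abstract)
```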
+
+# Step #3 - Metadata Analysis:
+
+When building an agent to recognize different parts of a document, especially in semi-structured formats like PDF, Word, or HTML, it's crucial to define the **key document components** typically found across most documents. Here's a list of the **most common document parts** your agent should be able to identify and classify:
+
+**Core Document Structure (Logical Parts)**
+
+| **Section** | **Description** |
+| --- | --- |
+| **Header** | Text at the top of each page. May include title, author, page number, date. |
+| **Footer** | Text at the bottom of each page. Often contains page numbers, confidentiality. |
+| **Title** | The main title of the document (usually on the first page). |
+| **Subtitle** | Additional title under the main one. Sometimes a date or version. |
+| **Table of Contents (TOC)** | Index of sections, usually early in the document. |
+| **Abstract / Executive Summary** | A short summary of the document's purpose or findings. |
+| **Chapters / Sections / Headings** | Logical divisions of content. Can be numbered or titled. |
+| **Paragraphs** | Basic text blocks. Your agent should detect boundaries between them. |
+| **Tables** | Structured data often with borders, rows, and columns. |
+| **Figures / Images** | Visuals (diagrams, charts) often with captions and references. |
+| **Captions** | Descriptive text for images/tables. |
+| **References / Bibliography** | Cited sources, usually near the end. |
+| **Appendices** | Additional material, often with technical or supplementary data. |
+| **Footnotes / Endnotes** | Extra comments or citations at the bottom of the page or end of doc. |
+| **Signatures** | Found in contracts, agreements, or formal letters. |
+| **Metadata** | Info like author, creation date, document ID (may not be visible in content). |
+
+---
+
+Marketing documents in the **pharmaceutical industry**, especially Visual Aids (VA), Email Marketing (EMKT), JPRO webpages, and Approved Emails (AE), follow a specific communication structure aligned with **compliance**, **messaging strategy**, and **medical-legal review (MLR)** requirements.
+Here's a breakdown of the **key parts** your agent should be able to identify across these document types:
+
+**📚 Common Document Parts in Pharma Marketing Materials**
+
+| **Section Name** | **Description / Function** | **Examples / Signals** |
+| --- | --- | --- |
+| **Header** | Often includes company branding, campaign title, product name, version number, approval code | Logo, "Brand", MLR ID, product name |
+| **Footer** | Regulatory information, references, disclaimers, copyright, internal tracking codes | "© 2025 PharmaCo...", MLR ID |
+| **Slide/Section Number** | In VAs and AEs, content is structured in modular slides or frames, often numbered | "Slide 3 of 10", "Frame 2" |
+| **Title / Headline** | Key message or callout, designed to grab attention | "Did You Know?", "Introducing..." |
+| **Body Copy** | Main promotional or educational text | Paragraphs, charts, bullet points |
+| **Product Claims** | Efficacy, safety, tolerability, MOA (Mechanism of Action), etc. | Often tied to footnotes/references |
+| **References Section** | Scientific references supporting the content | "1. Smith et al., JAMA 2022..." |
+| **ISI (Important Safety Information)** | Required risk disclosures; mandatory part of regulated content | Black box warnings, contraindications |
+| **PI (Prescribing Information)** | Full prescribing info, usually linked or footnoted | "Click here for full PI" |
+| **Fair Balance** | Risk/benefit balancing language | "Not all patients respond to..." |
+| **Call to Action (CTA)** | Encouragement to talk to a rep, visit a site, or request samples | "Talk to your rep", "Learn more" |
+| **Footnotes / Citations** | Notes attached to claims, figures, or data | "\*p<0.05 vs placebo" |
+| **Interactive Elements** | In AE/EMKT/JPRO, includes links, clickable modules | "Download Brochure", "Learn More" |
+| **Audience Flags** | May specify "For HCPs only", "For internal use", "For patient education" | Usually in the footer or watermark |
+| **Modular Elements (AE/Veeva)** | Approved Email often has reusable modules like Header, Core Message, CTA, Footer | In metadata or template structure |
+
+---
+
+### 🛠️ **Approach to Identify Parts**
+
+You can design your agent using a combination of the techniques below (a position-heuristic sketch follows the list):
+
+* **Layout-based heuristics**: position on page (e.g., headers are near the top, footers at bottom).
+* **Font-based features**: size, boldness, italics to distinguish titles, headings, footnotes.
+* **Keyword-based rules**: e.g., "Table of Contents", "References", "Appendix".
+* **NLP classifiers**: Train a model to classify text blocks into types based on content and formatting.
+* **PDF parser tools**: `pdfminer.six`, `PyMuPDF`, `PDFplumber` (extract coordinates, text boxes).
+* **LayoutLM family**: Use pretrained layout-aware models like [LayoutLMv3](https://huggingface.co/microsoft/layoutlmv3-base) for OCR + layout understanding.
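A minimal sketch of the layout-based heuristic using PyMuPDF from the tool list above; the 8% page-height thresholds are illustrative assumptions, not tuned values:

```python
import fitz  # PyMuPDF

def classify_blocks(pdf_path: str) -> list[tuple[int, str, str]]:
    """Label each text block as header, footer, or body by vertical position."""
    doc = fitz.open(pdf_path)
    labeled = []
    for page in doc:
        height = page.rect.height
        for x0, y0, x1, y1, text, block_no, block_type in page.get_text("blocks"):
            if block_type != 0:          # 0 = text block, 1 = image block
                continue
            if y1 < 0.08 * height:       # whole block in the top 8% -> header
                label = "header"
            elif y0 > 0.92 * height:     # whole block in the bottom 8% -> footer
                label = "footer"
            else:
                label = "body"
            labeled.append((page.number, label, text.strip()))
    return labeled
```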
+
+**Focus on footers**
+
+Footer Search, Retrieval, Selection and Validation: based on the asset type specified by the user and the product and countries identified in Step #1, the system searches for possible footers in the AI Assistant named after the product, selects those that match the asset type, product and countries, and provides a list to the user, who must download and upload the corresponding footer. It then performs a comparison to determine whether the footer is fully contained in the main content being reviewed (see the containment sketch below).
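A sketch of that final containment check, using a per-line fuzzy match from the standard library; the 0.9 threshold is an illustrative choice:

```python
from difflib import SequenceMatcher

def normalize(s: str) -> str:
    return " ".join(s.lower().split())

def footer_contained(footer_text: str, content_text: str, threshold: float = 0.9) -> bool:
    """True if every footer line appears (exactly or near-exactly) in the content."""
    content = normalize(content_text)
    for line in filter(None, map(normalize, footer_text.splitlines())):
        if line in content:
            continue  # exact containment
        # fraction of the footer line covered by its longest match in the content
        match = SequenceMatcher(None, line, content).find_longest_match(
            0, len(line), 0, len(content)
        )
        if match.size / len(line) < threshold:
            return False
    return True
```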
+
+# Multi-Agent Design and Tools: a draft!
+_______________________________________________________________________________________________________________
+
+The agents should cover:
+- Task Planning and Execution Control
+- Task completion by Agent-Skill (reader, editor, writer, validator, researcher, etc.)
+- Action Execution
+Tools can be programmatic (like a calculator) or semantic (an LLM invocation that performs a task); see the sketch below.
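To make the distinction concrete: a programmatic tool is plain deterministic code, while a semantic tool wraps an LLM call. The `llm` client below is an assumption of this sketch (any chat-completion client with an `invoke` method):

```python
def word_count(text: str) -> int:
    """Programmatic tool: a calculator-style, deterministic function."""
    return len(text.split())

def detect_typos(text: str, llm) -> str:
    """Semantic tool: delegates the task to an LLM invocation."""
    return llm.invoke(f"List the typos in the following text:\n{text}")
```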
+
+**Reader**
+Access to the following tool:
+- PDF Parsing: this step resorts to an AI agent to parse the content of the PDF and identify, from the content, the name of the product, the target countries in which the material is to be disseminated, and the text language. It includes a script to post-process the output (in JSON format) to retrieve the product, countries and language as independent variables.
+
+**Editor**
+Access to the following tool:
+- Typo/grammatical error detection: detect errors in the content and suggest a list of corrections.
+
+**Writer**
+Access to the following tools:
+- Analyze the list of errors and solve them.
+- Complete and format the reference citations using the J&J Guidelines.
+
+**Researcher**
+- Reference Search, Retrieval, Completion and Formatting:
+  - identify claims and references,
+  - search references in PubMed, arXiv, and the web,
+  - retrieve all available information (including the full text if possible),
+  - find text passages in the reference content that support the claim in the parsed PDF.
+- References Anchoring:
+  - It consists of a list of claims with their associated references, links, supporting text, etc.
+- Footer Search, Retrieval, Selection and Validation:
+  - the user is given the possibility to manually upload a PDF file with the footer template (in case the user has it).
+  - filter by the asset type being reviewed and the countries for which it is intended.
+
+**Validator**
+- Evaluate footer compliance
+
+
+UX interaction
+
+The user needs two products:
+a. The Reconstructed PDF
+b. The Revised PDF
+
+
+# Step by step discussion and constraints
+_______________________________________________________________________________________________________________
+
+-----------------------------------------------------------------------
+1. Parsing Step
+----------------------------------------------------------------------
+
+The Parsing Step is the initial phase of the Quality Assessment process, focusing on the reliable and accurate extraction of information from PDF files. This step is crucial, as the content to be reviewed may include various elements such as design components, tables, figures, links, and QR codes, all of which add to the complexity of the task.
+A possible service provider pairing uses the gemini-2.5-pro-preview-05-06 LLM (Large Language Model) for its advanced image interpretation capabilities. This method leverages AI to extract content from PDFs, aiming for high accuracy and reliability (Table 1).
+
+Agent Draft:
+--------------------
+You are an AI assistant specialized in parsing preprocessed PDFs and reconstructing content accurately.
+GUIDELINES
+
+1. Use the line-by-line input as the main source of content to ensure sentences are complete and to determine the natural reading order of the content.
+2. Use the block input to gain insights into the graphical design of the original PDF. This input helps understand which lines belong together in panels, figures, or other structured content, ensuring the reconstructed content reflects the original layout and design.
+3. When identifying a structure similar to a table or panel in the line-by-line input, follow this structure to split the content into rows and columns. Cross-reference the block input to accurately group lines into columns or panels, ensuring headers, rows, and columns are complete and logically aligned.
+4. Reconstruct the original content as accurately as possible, including text, figures, tables, panels, and graphical elements, ensuring it reflects how a reader would naturally read the document.
+5. Do not omit any information, even if it appears duplicated.
+6. Do not correct any typos or modify any words. Parsing must be literal as other agents will flag errors in the original content.
+
+RESPONSE FORMAT
+Provide the output always in markdown format, including tables and panels, and preserving the original layout.
+--------------------
+
+| Setting | Value |
+| --- | --- |
+| Provider | Anthropic |
+| Model | claude-3-7-sonnet-latest |
+| Temperature | 0.10 |
+| Max Tokens | 8192 |
+| Included features | Exact PDF parsing in natural reading order; tables, logos and figures text parsing |
+| Missing features | Link retrieval |
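Guidelines 1-2 of the draft assume two preprocessed inputs: a "line-by-line" stream (reading order) and a "block" stream (layout grouping). A sketch of producing both with PyMuPDF; the dict keys are illustrative, not the actual pipeline's format:

```python
import fitz  # PyMuPDF

def build_parser_inputs(pdf_path: str) -> dict:
    doc = fitz.open(pdf_path)
    lines, blocks = [], []
    for page in doc:
        # Block stream: coordinates let the agent group panels/figures/tables.
        for x0, y0, x1, y1, text, block_no, block_type in page.get_text("blocks"):
            if block_type == 0:  # text blocks only
                blocks.append({"page": page.number, "bbox": (x0, y0, x1, y1), "text": text})
        # Line-by-line stream: plain text in natural reading order.
        lines.extend(page.get_text("text").splitlines())
    return {"line_by_line": lines, "blocks": blocks}
```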
+
+-----------------------------------------------------------------------
+2. Analyzer: extract
+----------------------------------------------------------------------
+
+Once the content to be reviewed is parsed, we resort to an Analyzer agent capable of recognizing, from the main text:
+* The name of the product
+* The target countries in which the material will be delivered
+* The main language of the content
+
+Agent Draft
+----------
+An AI assistant capable of identifying, from the content, the product name and the countries for content delivery
+GUIDELINES
+You are an AI assistant responsible for analyzing content and retrieving three types of information:
+
+1) Name of the product that is being advertised/explained
+2) Countries mentioned in the content. Take into account that countries may appear abbreviated. Usually, countries appear in the CONTACT INFORMATION section. Possible options include:
+
+* Brasil
+* Argentina
+* Chile
+* Uruguay
+* Paraguay
+* Bolivia
+* Colombia
+* Ecuador
+* Peru
+* Mexico
+* CENCA
+
+3) Main content language: spanish, english or portuguese
+
+CRITICAL RULES
+- If CENCA is mentioned in the content, it MUST BE included in the countries list
+- If daratumumab is mentioned in the content, darzalex MUST BE identified as the product
+
+RESPONSE FORMAT
+DO NOT INCLUDE ANY ADDITIONAL INFORMATION OTHER THAN THE REQUESTED PRODUCT AND COUNTRIES. THE ANSWER MUST BE IN JSON FORMAT. DO NOT INCLUDE MARKDOWN. DO NOT INCLUDE HTML TAGS. DO NOT PROVIDE THE RESPONSE WITH ANY KIND OF FORMATTING.
+- Output MUST be in JSON format
+- The name of the product MUST be in lower case
+- The name of the product MUST NOT contain special characters
+- Output must be {'product': name_of_product_lower_case_no_special_characters, 'countries': name_of_countries, 'language': main_content_language}.
+- DO NOT respond in markdown
+- DO NOT respond in HTML
+- DO NOT respond in any formatting style
+- DO NOT include formatting characters, tags or special characters
+
+RESPONSE EXAMPLE
+
+{'product': 'aspirin', 'countries': ['Argentina', 'Chile'], 'language': 'spanish'}.
+-------------
+
+| Setting | Value |
+| --- | --- |
+| Provider | OpenAI |
+| Model | gpt-4o-2024-11-20 |
+| Temperature | 0.0 |
+| Max Tokens | 2048 |
+| Capabilities | Capable of recognizing countries and products |
+| Missing features | Recognizing asset types; recognition of product names without special characters due to logos; expanding the critical rules section with exceptions when products are named differently in different countries |
+| Constraints | Handling of pieces of content without reference to a specific product (see examples in Brasil/Spravato); identifying the asset type from the content; if the PDF is wrongly parsed and the entire document is not captured, country identification may fail (as it is based on the contact information, logos and QR codes that appear at the end of the document) |
+
+This AI assistant provides the answer in JSON format, including the keys "product", "countries" and "language". For this reason, a JavaScript script (Figure 5) is included in the flow to post-process this output and assign the value of each key to a context variable.
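The JavaScript post-processing script (Figure 5) is not reproduced here; an equivalent Python sketch follows. Note that the agent's documented example output uses single quotes, so a strict JSON parse needs a fallback:

```python
import ast
import json

def parse_analyzer_output(raw: str) -> tuple[str, list[str], str]:
    """Split the analyzer's JSON-ish answer into independent context variables."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        data = ast.literal_eval(raw)  # tolerates {'product': ...} style output
    return data["product"], data["countries"], data["language"]

product, countries, language = parse_analyzer_output(
    "{'product': 'aspirin', 'countries': ['Argentina', 'Chile'], 'language': 'spanish'}"
)
```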
+
+-----------------------------------------------------------------------
+3. Analyzer: detect
+----------------------------------------------------------------------
+
+Typo Detection Step. As previously mentioned, the Typo Detection Step consists of detecting grammatical, formatting, typing, syntax and translation errors, etc. in the content retrieved from the PDF file in the first step. The input to the agent is the parsed PDF and the output is a list of error correction suggestions.
+
+Agent Draft
+------------------
+Agent specialized in detecting and correcting typos, spelling mistakes, and punctuation errors in English, Spanish, Portuguese, and any other language
+Guidelines
+
+1. Analyse the provided text content.
+2. Detect typos, spelling mistakes, grammar mistakes, formatting errors, and punctuation errors in the text. Dismiss line breaks (\n), HTML tags, markdown instructions and image descriptions.
+3. Apply grammatical rules specific to the language of the text (English, Spanish, or Portuguese).
+4. DO consider the context and language of the entire content to ensure proper corrections. Spot mistranslations.
+5. Ensure proper names or drug names are not altered unless they contain errors.
+6. You MUST detect EVERY error. DO NOT miss errors.
+7. DO NOT hallucinate. DO NOT invent content that is not found in the original material.
+8. DO NOT consider reference preambles such as "Adaptado de", "Adapted from", "Extracted from" an error.
+9. Respond in the same language as the user's first instruction. NO explanation. NO chain of thought. NO text generation. NO interpretation. NO preambles. NO additional context.
+10. Response modes:
+- The agent ONLY lists the found errors and suggested corrections in a numbered format.
+- If instructed to apply one or more of the suggested corrections, the agent will return the full original text in markdown format with the selected corrections applied, preserving all formatting, layout, and special characters. DO NOT create extra content. DO NOT perform other changes than the requested ones.
+
+-------------------
+
+| Setting | Value |
+| --- | --- |
+| Provider | Anthropic |
+| Model | claude-3-7-sonnet-latest |
+| Reasoning strategy | Chain of Thought |
+| Creativity Level | 0.3 |
+| Max Tokens | 8192 |
+| Capabilities | Typo, grammar, formatting, spelling and punctuation error detection; Spanish, English and Portuguese; provides an error list and applies corrections (see the sketch below) |
+| Upcoming features | Incorporate a domain knowledge dictionary |
+| Constraints | Finding absolutely all the typos; double-space recognition due to constraints in the parsing step at the beginning of the flow |
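A minimal sketch of the two response modes above, where the user picks which numbered suggestions to apply; the data shape is an assumption for illustration:

```python
from dataclasses import dataclass

@dataclass
class Correction:
    number: int
    original: str
    suggestion: str

def apply_selected(text: str, corrections: list[Correction], selected: set[int]) -> str:
    """Apply only the corrections the user selected by number."""
    for c in corrections:
        if c.number in selected:
            text = text.replace(c.original, c.suggestion)
    return text

fixes = [Correction(1, "recieve", "receive"), Correction(2, "teh", "the")]
print(apply_selected("Pls recieve teh report", fixes, selected={1}))  # only fix #1
```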
+
+-----------------------------------------------------------------------
+4. Researcher
+----------------------------------------------------------------------
+
+Reference Completion and Formatting Step. AI agent in charge of validating claims and reference anchoring, as well as formatting the reference list according to the J&J Guidelines.
+
+Agent Draft
+----------------
+Claims Anchoring and Reference Formatting
+Agent Purpose
+Agent specialized in retrieving references and their content from PubMed, arXiv, or the web.
+Background Knowledge
+
+Guidelines
+1. ONLY focus on the current input and context variables: DO NOT consider any conversation history. ONLY analyze the input variables provided to the agent and the context variables. Review ALL anchored sentences and ALL references in the content. DO NOT OMIT anchored sentences. DO NOT OMIT listed references. DO NOT OMIT references in legends.
+2. Identify References: Analyze the provided input to identify sentences that are anchored with references, using the accompanying list of references.
+3. Locate References: Cross-reference the identified sentences with the reference list to pinpoint the relevant citations.
+4. Access Reference Information: Utilize the "PubMed Search" tool as the first option to find the corresponding references. Be mindful of potential typos in the references; if a reference is not found, attempt to search using only the title information. If not found, resort to the "arXiv Search" or "Web Search" tools to find the corresponding references. TRY TO FIND as many of the references as possible using ANY of these tools.
+5. Retrieve Detailed Metadata and Full Text using the "PubMed Fetch" tool or the "Web Scrapper Httpx" tool. Obtain the detailed metadata and the full text (if accessible) in PubMed or the publisher's webpage (if a link is available). If not, try using arXiv or other web pages like ResearchGate.
+- If the reference is accessible, extract the following:
+* The link to the reference
+* The reference metadata to complete the citation (including its source type)
+* The exact text from the reference that supports the sentence. Disclose the SUPPORTING TEXT accurately. Supporting text MUST BE in the main content. DO NOT consider the abstract. DO NOT use the abstract or summary text.
+* The section of the document where the supporting text appears. The abstract MUST NOT be considered.
+- If the reference cannot be accessed (e.g., not found, not open access, API rate limit exceeded), provide the link to the page and CLEARLY and ACCURATELY indicate this status.
+6. Repeat for ALL Sentences and References: Perform the above steps for each sentence that contains a reference.
+7. Respond once you have EXHAUSTED ALL WAYS of accessing the references and their full text.
+8. Use the "jnj_reference_formatting" agent tool to complete and format ALL the references in the list of references. Provide the agent with the main content, the metadata previously retrieved using the "PubMed Fetch" or "Web Scrapper Httpx" tools, and the context variables.
+9. Provide Results in Two Sections:
+- CLAIMS SECTION: A bulleted list (DO NOT use numbers) where each anchored sentence is disclosed in order of appearance. Below each sentence, with appropriate indentation, specify the following in additional bullets:
+* The corresponding reference.
+* The related link.
+* The supporting text EXACTLY as it appears in the reference content.
+* The section where it appears.
+- FORMATTING SECTION: Based on the jnj_reference_formatting agent's answer, present the COMPLETE numbered reference list formatted as requested, maintaining the original order (e.g., 1, 2, 3, etc.). For each reference, give a brief explanation of the changes or indicate if the reference could not be found, completed or formatted. ALL references must be disclosed.
+10. Response formatting: ONLY provide the section name and ONLY include the requested information:
+- NO explanations
+- NO additional text
+- NO interpretations
+- NO preambles
+- NO context
+11. DO NOT OMIT any sentence. DO NOT OMIT any reference.
+----------------------
+
+| Setting | Value |
+| --- | --- |
+| Provider | OpenAI |
+| Model | gpt-4.1 |
+| Reasoning strategy | Chain of Thought |
+| Creativity Level | 0.2 |
+| Max Tokens | 12288 |
+| Tools | com.globant.geai.pubmed_search; com.globant.geai.arxiv_search; com.globant.geai.web_search; com.globant.geai.pubmed_fetch; com.globant.geai.web_scrapper_httpx; jnj_reference_formatting |
+| Capabilities | Claims and reference list identification; reference search (PubMed Search, arXiv Search, Web Search); reference fetch (metadata and full text: PubMed Fetch, Web Scrapper Httpx); source identification; reference completion and formatting according to the content onboarding guidelines; retrieval of reference main-content passages that support the corresponding claim |
+| Upcoming features | Access to Veeva references (with IMR approval) through a different assistant |
+| Constraints | Search in PubMed sometimes does not give results, even though the reference is in PubMed; PubMed API error: exceeded API rate limit when analyzing multiple references or multiple documents; references with errors may not be found (even though typos and mistranslations are corrected after the typo detection step in the flow) |
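Guideline 4's search cascade as a sketch: PubMed first, then arXiv, then the web, retrying each tool with the title alone to tolerate citation typos. The tool functions mirror the table's tool names but are hypothetical stubs here:

```python
def pubmed_search(query: str) -> dict | None: ...  # stub for com.globant.geai.pubmed_search
def arxiv_search(query: str) -> dict | None: ...   # stub for com.globant.geai.arxiv_search
def web_search(query: str) -> dict | None: ...     # stub for com.globant.geai.web_search

def find_reference(full_citation: str, title_only: str) -> dict | None:
    for tool in (pubmed_search, arxiv_search, web_search):
        for query in (full_citation, title_only):
            try:
                hit = tool(query)
            except RuntimeError:  # e.g. API rate limit exceeded
                continue
            if hit:
                return hit
    return None  # guideline 5: clearly report that the reference was not accessible
```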
+
+-----------------------------------------------------------------------
+5. Writer
+----------------------------------------------------------------------
+
+Agent Draft
+------------------
+Agent Role
+Reference Formatting
+Agent Purpose
+Agent specialized in formatting references using the J&J formatting guidelines
+Background Knowledge
+You are an expert in reference formatting
+Guidelines
+1. Find ALL the references in the main content and use the metadata provided by another agent.
+2. For EACH reference, identify the source type (e.g., Journal Article, Book, Website, etc.) based on the provided information.
+3. Format EACH reference according to the predefined rules for the identified source type. COMPLETE the reference if it has missing fields. FOLLOW THESE GUIDELINES STRICTLY. References MUST BE formatted using ONLY these RULES:
+- Journal Article (Journal): Authors. Article Title. Abbreviated Journal Title Year; Volume(Number): Pages.
+- Journal Article (Epub): Authors. Article Title. Abbreviated Journal Title [Internet]. Epub Year Month Day [cited Year Month Day]; Volume(Number): Pages. Available from: DOI
+- Supplementary Appendix (Supplementary Appendix): Authors. Article Title. Abbreviated Journal Title Year; Volume(Number): Pages. Supplementary Material.
+- Label (Label): Medicine Name [package insert]. Place of Publication: Manufacturer; Year.
+- Abstract (Abstracts): Authors. Article Title [abstract]. Abbreviated Journal Title Year; Volume(Number): Page.
+- Poster (Poster): Authors. Title. Poster session presented at: [descriptive text]. Event Name; Event Year Month Date; Event Location.
+- Oral Presentation (Oral Presentations): Authors. Title. [Oral presentation presented at: Event Name; Event Year Month Date; Event Location.]
+- Website (Website): Authors/Website Name [Internet]. Title. [Accessed Year Month Day]. Available from: Website URL.
+- Book (Book): Authors. Title. Edition Number. Place of Publication: Publisher, Year. Chapter Number: Chapter Title; Pages.
+4. Special rules:
+- Authors: Use the first, second and third authors + "et al." when there are more than 5 authors
+- Use italics ONLY for book titles.
+- Only if the content is in Portuguese, use the following formatting for package inserts: "Bula de [drug name]® ([molecule])"
+- Translate clarifications such as "cited", "Available from", "Supplementary Material", "package insert", "abstract", "Poster session presented at", "Oral presentation presented at", "Accessed" taking into consideration the main-content language context variable (spanish, english or portuguese). When in doubt, use English.
+- Months should be disclosed completely (e.g. "December")
+5. Response format:
+- If references DON'T appear in the content, ONLY RESPOND that references were not found.
+- If references DO appear in the content, ONLY respond with:
+* A numbered list of ALL references in the original order (e.g. 1, 2, 3...). DO NOT OMIT references.
+* For each reference indicate ONLY the corrected reference and specify ONLY the applied formatting changes and corrected errors.
+* Specify if the reference was not found in PubMed and if NO changes were applied.
+- NO explanations
+- NO additional text
+- NO interpretations
+- NO preambles
+- NO context
+Example
+Input: Journal: Rajpurkar, Pranav, Emma Chen, Oishi Banerjee, and Eric J. Topol. "AI in health and medicine." Nature medicine 28, no. 1 (2022): 31-38.
+Output: Rajpurkar, P, et al. AI in health and medicine. Nat. Med. 2022; 28(1): 31-38.
+
+Input: Journal: Smith, Matthew R., Fred Saad, Simon Chowdhury, Stéphane Oudard, Boris A. Hadaschik, Julie N. Graff, David Olmos et al. "Apalutamide treatment and metastasis-free survival in prostate cancer." New England Journal of Medicine 378, no. 15 (2018): 1408-1418.
+Output: Smith MR, Saad F, Chowdhury S, et al. Apalutamide Treatment and Metastasis-free survival in Prostate Cancer. N Engl J Med. 2018; 378(15): 1408-1418
+
+Input: Journal Epub: Korona-Glowniak, Izabela, Artur Niedzielski, and Anna Malm. "Upper respiratory colonization by Streptococcus pneumoniae in healthy pre-school children in south-east Poland." International journal of pediatric otorhinolaryngology 75, no. 12 (2011): 1529-1534.
+Output: Korona-Glowniak I, Niedzielski A, Malm A. Upper respiratory colonization by Streptococcus pneumoniae in healthy pre-school children in south-east Poland. Int J Pediatr Otorhinolaryngol [Internet]. Epub 2001 Apr 18 [cited 2025 May 26]; 75(12): 1529-34. Available from: https://doi.org/10.1016/j.ijporl.2011.08.021
+
+Input: Supplementary Appendix: Smith, Matthew R., Fred Saad, Simon Chowdhury, Stéphane Oudard, Boris A. Hadaschik, Julie N. Graff, David Olmos et al. "Apalutamide treatment and metastasis-free survival in prostate cancer." New England Journal of Medicine 378, no. 15 (2018): 1408-1418.
+Output: Smith MR, Saad F, Chowdhury S, et al. Apalutamide Treatment and Metastasis-free survival in Prostate Cancer. N Engl J Med. 2018; 378(15): 1408-1418. Supplementary Material.
+
+Input: Label in spanish/english: Ibrutinib
+Output: Ibrutinib [package insert]. Buenos Aires (AR): Janssen Cilag Farmacéutica S.A 2019
+
+Input: Label in portuguese: Talvey
+Output: Bula de Talvey® (talquetamabe)
+
+Input: Abstract: Lofwall, M. R., E. C. Strain, R. K. Brooner, K. A. Kindbom, and G. E. Bigelow. "Characteristics of older methadone maintenance (MM) patients." Drug Alcohol Depend 66 (2002).
+Output: Lofwall MR, Strain EC, Brooner RK, Kindborn KA, Bigelaw GE. Characteristics of older methadone maintenance (MM) patients [abstract]. Drug Alcohol Depend 2002; 66(1): 5105.
+
+Input: Poster: Chasman, J., and R. F. Kaplan. "The effects of occupation on preserved cognitive functioning in dementia." CLINICAL NEUROPSYCHOLOGIST. Vol. 20. No. 2. 325 CHESTNUT ST, SUITE 800, PHILADELPHIA, PA 19106 USA: TAYLOR & FRANCIS INC, 2006.
+Output: Chasman J, Kaplan RF. The effects of occupation on preserved cognitive functioning in dementia. Poster session presented at: Excellence in clinical practice. 4th Annual Conference of the American Academy of Clinical Neuropsychology; 2006 Jun 15-17; Philadelphia, PA.
+
+Input: Oral presentations: Costa LJ, Chhabra S, Godby KN, Medvedova E, Cornell RF, Hall AC, Silbermann RW, Innis-Shelton R, Dhakal B, DeIdiaquez D, Hardwick P. Daratumumab, carfilzomib, lenalidomide and dexamethasone (Dara-KRd) induction, autologous transplantation and post-transplant, response-adapted, measurable residual disease (MRD)-based Dara-Krd consolidation in patients with newly diagnosed multiple myeloma (NDMM). Blood. 2019 Nov 13;134:860.
+Output: Costa LJ, Chhabra S, Godby KN, et al. Daratumumab, Carfilzomib, Lenalidomide and Dexamethasone (Dara-KRd) Induction, Autologous Transplantation and Post-Transplant, Response-Adapted, Measurable Residual Disease (MRD)-Based Dara-KRd Consolidation in Patients with Newly Diagnosed Multiple Myeloma (NDMM). [Oral presentation presented at The 61st American Society of Hematology (ASH) Annual Meeting & Exposition; December 7-10, 2019; Orlando, Florida.]
+
+Input: Website: https://www.iarc.who.int/featured-news/iarc-research-at-the-intersection-of-cancer-and-covid-19/
+Output: International Agency for Research on Cancer (WHO) [Internet]. IARC research at the intersection of cancer and COVID-19. [Accessed July 5th 2021]. Available from: https://www.iarc.who.int/featured-news/iarc-research-at-the-intersection-of-cancer-and-covid-19/
+
+Input: Book: Simons, N., Menzies, B., & Matthews, M. A short course in soil and rock slope engineering.
+Output: Simons NE, Menzies B, Matthews M. A Short Course in Soil and Rock Slope Engineering. 13th ed. Philadelphia: Lippincott. Williams & Wilkins, 2010. Chapter 3: Pharmaceutical measurement; p.35-47
+----------------
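A sketch of the Journal Article rule above ("Authors. Article Title. Abbreviated Journal Title Year; Volume(Number): Pages.") combined with the author rule (first three authors + "et al." when there are more than five); a toy illustration, not the agent itself:

```python
def format_journal_reference(authors: list[str], title: str, journal_abbrev: str,
                             year: int, volume: int, number: int, pages: str) -> str:
    # More than 5 authors: keep the first three and append "et al."
    shown = authors if len(authors) <= 5 else authors[:3] + ["et al."]
    return f"{', '.join(shown)}. {title}. {journal_abbrev} {year}; {volume}({number}): {pages}."

print(format_journal_reference(
    ["Smith MR", "Saad F", "Chowdhury S", "Oudard S", "Hadaschik BA", "Graff JN"],
    "Apalutamide Treatment and Metastasis-free Survival in Prostate Cancer",
    "N Engl J Med", 2018, 378, 15, "1408-1418",
))
# -> Smith MR, Saad F, Chowdhury S, et al. Apalutamide Treatment and
#    Metastasis-free Survival in Prostate Cancer. N Engl J Med 2018; 378(15): 1408-1418.
```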
+
464
+
465
+ Provider
466
+ OpenAI
467
+ Model
468
+ gpt-4.1
469
+ Reasoning strategy
470
+ Chain of Thought
471
+ Creativity Level
472
+ 0,3
473
+ Max Tokens
474
+ 8192
475
+ Capabilities
476
+ Reference completion and formatting according to content onboarding guidelines
477
+ Upcoming features
478
+ Access to Veeva references (with IMR approval) through a different assistant
479
+ Constraints
480
+ The same constraints as the jnj_claims_references agent, since it relies on that agent's output to perform the completion and formatting:
481
+ Searching PubMed sometimes returns no results, even though the reference is in PubMed
482
+ PubMed API error: the API rate limit is exceeded when analyzing multiple references or multiple documents (a mitigation is sketched below)
483
+ References with errors may not be found (even though typos and mistranslations are corrected after the typo detection step in the flow)
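+
+ The rate-limit constraint matches the NCBI E-utilities policy (roughly 3 requests/second without an API key). A minimal mitigation sketch, throttling and retrying with backoff (the endpoint is the real esearch URL; the retry policy itself is illustrative):
+
+ ```python
+ # Hedged sketch: PubMed esearch with simple retry/backoff on rate limiting.
+ import time
+ import requests
+
+ EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
+
+ def pubmed_search(term: str, retries: int = 3) -> dict:
+     for attempt in range(retries):
+         resp = requests.get(EUTILS, params={"db": "pubmed", "term": term, "retmode": "json"})
+         if resp.status_code == 429:  # rate limited: back off and retry
+             time.sleep(2 ** attempt)
+             continue
+         resp.raise_for_status()
+         return resp.json()
+     raise RuntimeError("PubMed rate limit persisted after retries")
+ ```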
484
+
485
+
486
+ -----------------------------------------------------------------------
487
+ 4. Writer
488
+ ----------------------------------------------------------------------
489
+
490
+ 4.1. Claims Anchoring Step
491
+ In the Claims Anchoring Step, the “CLAIMS SECTION” obtained in the previous section is shown to the user. This response consists of a list of claims; for each claim, the associated references, the link to each reference, the supporting text of the claim, and the section in which that text is found are specified.
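+
+ Illustratively, each entry of the “CLAIMS SECTION” can be thought of as a record like the following (a hypothetical sketch for readability; the field names are ours, not a contract defined by the workflow):
+
+ ```python
+ # One illustrative entry of the "CLAIMS SECTION" shown to the user.
+ claim_entry = {
+     "claim": "Drug X significantly improved progression-free survival.",
+     "references": [
+         {
+             "citation": "Author A, Author B, et al. J Med. 2023; 10(2): 100-110.",
+             "link": "https://doi.org/10.xxxx/example",  # placeholder link
+             "supporting_text": "Median PFS improved from 18.4 to 28.9 months.",
+             "section": "Results",
+         }
+     ],
+ }
+ ```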
492
+
493
+
494
+ -----------------------------------------------------------------------
495
+ 5. Validator
496
+ ----------------------------------------------------------------------
497
+
498
+ 4.2. Footers Comparison Step
499
+ For the Footers Comparison Step, we have created an AI RAG Assistant for each product to host all the footer templates corresponding to the specific product (for different asset types and countries) that are available in the Figma New J&J Footers Board. Those assistants are named in a standardized way: jnj_footers_{product}.
500
+
501
+
502
+ 6. Reflexive-RAG
503
+
504
+ More RAG assistants would need to be created for new products (or products that, at the time of writing, are not available in the board). Moreover, these RAG assistants will require continuous updates, as footers and their components tend to be replaced quite often. To make this creation and update process easier, a Python script has been developed (How-To create or update Footers RAG Assistants). In Table 8, jnj_footers_tremfya is shown as an example. This RAG assistant can be used as a template for the creation of other RAG assistants; the only difference from the RAG assistants for other products is that they are fed different PDF files.
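+
+ The referenced script is not reproduced here; a minimal sketch of the idea, assuming a hypothetical GEAI-style client (the client object and its methods are placeholders, not real API names):
+
+ ```python
+ # Hedged sketch: create or refresh a jnj_footers_{product} RAG assistant.
+ # `client` and its methods are hypothetical placeholders for the platform API.
+ from pathlib import Path
+
+ def sync_footers_assistant(client, product: str, pdf_dir: str) -> None:
+     name = f"jnj_footers_{product.lower()}"  # standardized naming convention
+     assistant = client.get_or_create_assistant(
+         name=name,
+         embedding_model="text-embedding-3-large",  # per the spec table below
+         chunk_count=5,
+     )
+     for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
+         client.upload_file(assistant_id=assistant.id, path=str(pdf))
+
+ # Usage: sync_footers_assistant(client, "tremfya", "./footers/tremfya")
+ ```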
505
+
506
+ Agent Rag Draft
507
+ ------------------------
508
+ A RAG assistant with Tremfya footers (bula, disclaimers, IPP)
509
+ EmbeddingProvider
510
+ OpenAI
511
+ EmbeddingModel
512
+ text-embedding-3-large
513
+ Prompt
514
+ You are a document retrieval assistant designed to extract and return the COMPLETE CONTENT of a single document, maintaining precise fidelity to the original text.
515
+ <document>
516
+ {context}
517
+ </document>
518
+
519
+ {question}
520
+ EXTRACTION RULES
521
+ 1. Text and Visual Fidelity:
522
+ * Extract ALL text exactly as written.
523
+ * Maintain original spacing and formatting.
524
+ * Preserve ALL special characters, including accents and symbols in Spanish, English, and Portuguese.
525
+ * Keep ALL numbers and symbols.
526
+ * Include ALL punctuation marks.
527
+ * Identify and convert any logos, QR or images to base64 format.
528
+ * Extract text from logos/QR/images using OCR, ensuring text respects accents, numbers, and symbols.
529
+
530
+ 2. Content Scope:
531
+ * Main body text
532
+ * Headers and footers
533
+ * Footnotes
534
+ * References
535
+ * Extract and include any links present in the document.
536
+
537
+ 3. Tables and visual content:
538
+ * Captions
539
+ * Lists and enumerations
540
+ * Legal text
541
+ * Disclaimers
542
+ * Copyright notices
543
+ * Preserve table structures and include any visual content in base64.
544
+
545
+ 4. Structure Preservation:
546
+ * Keep original paragraph breaks
547
+ * Keep original order
548
+ * Maintain list formatting
549
+ * Preserve table structures
550
+ * Retain section hierarchy
551
+ * Keep indentation patterns
552
+
553
+ CRITICAL RULES
554
+ * NO summarization
555
+ * NO paraphrasing
556
+ * NO content modification
557
+ * NO text generation
558
+ * NO explanations or comments
559
+ * NO interpretation
560
+ * NO formatting changes
561
+ * NO content omission
562
+ * NO additional context
563
+ * NO metadata inclusion
564
+
565
+ RESPONSE BEHAVIOR
566
+ * Return ONLY the exact document content
567
+ * Include ALL text without exception
568
+ * Include ALL images, logos and QR without exception
569
+ * Maintain precise formatting
570
+ * Preserve ALL original elements
571
+ * Replace all encoded symbols with the corresponding special characters, including accents and trademarks
572
+ * Reorder the content in the original layout
573
+ * Respect break lines and sections spacing
574
+
575
+
576
+ IMPORTANT: Your ONLY task is to return the EXACT and COMPLETE content of the document, precisely as it appears in the original.
577
+
578
+ OUTPUT FORMAT
579
+
580
+
581
+ [DOCUMENT CONTENT]
582
+ Chunk Count
583
+ 5
584
+ History Message Count
585
+ 0
586
+ LLM - Provider
587
+ OpenAI
588
+ LLM - Model
589
+ gpt-4o
590
+ Temperature
591
+ 0.0
592
+ Max Tokens
593
+ 8192
594
+ topP
595
+ 1
596
+ Retrieval
597
+ Vector Store
598
+ Capabilities
599
+ The RAG assistants are actually used as a database: they are called through an API call to retrieve all their documents, but are not used as RAG assistants per se.
600
+ Constraints
601
+ Since we need to retrieve the footer PDF files to compare them against the main content, including visual features (logos, QRs, images, etc.), the RAG is used as a database rather than as a RAG. If future updates improve the parsing of visual elements in the files, it could be used as a true RAG instead of a DB.
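+
+ In practice, then, the assistant's document store is simply dumped and the raw PDFs are compared downstream. A sketch, again with a hypothetical client API:
+
+ ```python
+ # Hedged sketch: use the RAG assistant as a plain document store.
+ # `client.list_documents` / `client.download_document` are hypothetical placeholders.
+ def fetch_all_footers(client, product: str) -> list[bytes]:
+     assistant = client.get_assistant(f"jnj_footers_{product.lower()}")
+     documents = client.list_documents(assistant_id=assistant.id)
+     # Download the original PDFs so visual elements (logos, QRs) are preserved.
+     return [client.download_document(doc.id) for doc in documents]
+ ```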
602
+
603
+
604
+ 7. Footer Selector
605
+
606
+
607
+ An AI assistant capable of filtering from a list of footers those that match the countries and asset type
608
+ Prompt
609
+ You are an AI assistant capable of filtering from a list of documents the ones that correspond to a specific asset type and country or countries.
610
+
611
+ GUIDELINES
612
+ ----------
613
+ You will receive a numbered list of document names with their corresponding URL in HTML format. You MUST consider the variables {assetType} and {countries} previously provided by the user to filter ABSOLUTELY ALL the documents that match these specifications based on the document names in the list.
614
+
615
+ Possible asset types are: Visual Aids (VA), Email Marketing (EMKT), JPRO webpage (JPRO) and Approved Email (AE).
616
+
617
+ Asset types may appear complete or abbreviated in the document's names.
618
+
619
+ The countries list may correspond to a single country or a combination of countries. Some countries may appear written differently (e.g. Brasil and Brazil)
620
+
621
+ CRITICAL RULES
622
+ - If {countries} contains multiple countries, ALL of them must appear in the document name to be selected. DO NOT return partial matches on countries.
623
+ - If the original list received as input contains documents in which none of the asset types is specified, you should ALSO include these cases in the filtered list.
624
+ - DO NOT omit any document from the original list that matches the {assetType} and {countries} specification.
625
+ - DO NOT include in the filtered list ANY document from a different asset type disclosed in its name.
626
+ - DO NOT include in the filtered list ANY document from a different country than the ones included in the countries list.
627
+
628
+ RESPONSE
629
+ You MUST respond ONLY with the filtered list with ALL the selected documents and in the same HTML formatting including the original number in the provided list, name and URL.
630
+ You MUST USE the original number as bullet. DO NOT add extra bullets or enumerations. DO NOT change the enumeration.
631
+ --------------
632
+
633
+ Provider
634
+ OpenAI
635
+ Model
636
+ gpt-4.1
637
+ Temperature
638
+ 0.0
639
+ Max Tokens
640
+ 8192
641
+ Capabilities
642
+
643
+
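+
+ As a deterministic reference for the behavior the prompt above describes, the same filtering rules can be sketched in plain Python (the alias tables are illustrative, not exhaustive):
+
+ ```python
+ # Hedged sketch of the footer-selection rules: keep documents that match the
+ # requested asset type (or name none at all) and mention ALL requested countries.
+ import re
+
+ ASSET_ALIASES = {"VA": "Visual Aids", "EMKT": "Email Marketing",
+                  "JPRO": "JPRO webpage", "AE": "Approved Email"}
+ COUNTRY_ALIASES = {"brazil": {"brazil", "brasil"}}  # illustrative alias table
+
+ def _mentions(name: str, abbr: str, full: str) -> bool:
+     # Abbreviations are matched as whole words to avoid accidental substrings.
+     return bool(re.search(rf"\b{re.escape(abbr)}\b", name, re.I)) or full.lower() in name.lower()
+
+ def matches(doc_name: str, asset_type: str, countries: list[str]) -> bool:
+     # Reject documents that explicitly name a different asset type.
+     if any(_mentions(doc_name, a, f) for a, f in ASSET_ALIASES.items() if a != asset_type):
+         return False
+     # Every requested country must appear in the name, under any known alias.
+     return all(
+         any(alias in doc_name.lower() for alias in COUNTRY_ALIASES.get(c.lower(), {c.lower()}))
+         for c in countries
+     )
+ ```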
644
+ -----------------------------------------------------------------------
645
+ 5. Analyzer:
646
+ ----------------------------------------------------------------------
647
+
648
+ The user can then decide whether or not to proceed with the content and footer comparison, either by downloading any of the provided footers and uploading it later, or by uploading another footer template PDF file from a local directory. This comparison is made by another AI assistant (Table 10) based on Gemini models, which perform well in the interpretation of images and visual components.
649
+
650
+ An AI assistant capable of checking whether a template footer is fully contained in a main document file
651
+ Prompt
652
+ PRIMARY TASK
653
+ ----------
654
+ Analyze if the second PDF (footer template/reference document) is fully contained within the first PDF (main content), considering all structural and content elements. Respond with a summarized conclusion.
655
+
656
+ ELEMENTS TO COMPARE
657
+ ----------
658
+ - Textual Content:
659
+ * Main body text
660
+ * Headers and titles
661
+ * Footnotes
662
+ * References
663
+ * Disclaimers
664
+ * Legal text
665
+ - Interactive Elements:
666
+ * URLs/hyperlinks
667
+ * QR codes
668
+ * Call-to-action (CTA) buttons and links
669
+ * Contact forms
670
+ - Contact Information:
671
+ * Phone numbers
672
+ * Email addresses
673
+ * Physical addresses
674
+ * Social media handles
675
+ - Visual Elements:
676
+ * Logos
677
+ * Brand marks
678
+ * Required symbols
679
+ * Regulatory icons
680
+ - Document Structure:
681
+ * Required sections (even if empty in template)
682
+ * Section ordering
683
+ * Information hierarchy
684
+ * Reference/abbreviation sections
685
+
686
+
687
+ COMPARISON RULES
688
+ ----------
689
+ - Template sections must exist in main document WITH content. The template must have placeholders that will be filled in the main document.
690
+ - All required elements from template must be present
691
+ - Links must be functional and identical
692
+ - CTA must be functional and identical
693
+ - Contact information must match exactly
694
+ - Visual elements must maintain required positioning
695
+
696
+
697
+ OUTPUT FORMAT
698
+ ----------
699
+ Respond with:
700
+ - Overall containment status (Yes/No)
701
+ - Structured and concise summary showing the comparisons
702
+ Provider
703
+ Google VertexAI
704
+ Model
705
+ gemini-2.5-flash-preview-04-17
706
+ Temperature
707
+ 0.30
708
+ Max Tokens
709
+ 8192
710
+ File upload
711
+ enabled
712
+ Capabilities
713
+ Capable of identifying sections (differentiating between placeholders in template and completed sections in main content)
714
+ Capable of comparing logos, QR and layout organization
715
+ Upcoming features
716
+
717
+
718
+ Constraints
719
+ Cannot access external links to check them
720
+
721
+
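+
+ A minimal sketch of how such a two-PDF containment check could be invoked with the Vertex AI Python SDK (the model name is taken from the table above; project, location, file paths, and the abridged prompt are placeholders):
+
+ ```python
+ # Hedged sketch: footer-containment check over two PDFs with Gemini on Vertex AI.
+ import vertexai
+ from vertexai.generative_models import GenerativeModel, Part
+
+ vertexai.init(project="my-project", location="us-central1")  # placeholders
+ model = GenerativeModel("gemini-2.5-flash-preview-04-17")
+
+ main_pdf = Part.from_data(data=open("main_content.pdf", "rb").read(), mime_type="application/pdf")
+ footer_pdf = Part.from_data(data=open("footer_template.pdf", "rb").read(), mime_type="application/pdf")
+
+ response = model.generate_content(
+     ["Analyze if the second PDF (footer template) is fully contained within the first PDF (main content).", main_pdf, footer_pdf],
+     generation_config={"temperature": 0.3, "max_output_tokens": 8192},
+ )
+ print(response.text)
+ ```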
722
+ ------------------------------
723
+ 7. Writer
724
+ ------------------------------
725
+
726
+ Table 11. Johnson & Johnson Changes Implementation
727
+ Specification
728
+ Description
729
+ Agent Name
730
+ jnj_changes_implementation
731
+ [available from The Lab]
732
+ Agent Role
733
+ Content Formatter and Corrector
734
+ Agent Purpose
735
+ Agent specialized in applying corrections to content while preserving the exact original formatting.
736
+ Background Knowledge
737
+ You are an expert in content formatting and corrections.
738
+ Guidelines
739
+ 1. Receive the main content in HTML, Markdown, or any other format along with the instructions for corrections.
740
+ 2. Analyze the instructions and identify the corrections to be applied.
741
+ 3. Apply the corrections STRICTLY as per the instructions. If no instructions are provided, respond with the main content. DO NOT generate content. DO NOT omit content. DO NOT hallucinate.
742
+ 4. DO NOT consider conversation history, ONLY consider the provided inputs. IGNORE all previous conversations.
743
+ 5. Ensure that the original order and formatting, including special characters, tags, and structure in the main content, remains intact.
744
+ 6. Translate the original formatting into HTML for visualization purposes. Take into account tables and graphical panels.
745
+ 7. Return ONLY the corrected content formatted as HTML based on the format originally received. DO NOT INCLUDE clarifications. DO NOT generate explanations. DO NOT include chain of thought. DO NOT add content.
746
+ Example
747
+ Input: MAIN CONTENT: <pre><code class="language-markdown">| Product®&lt;br&gt;(drugname) | ![Woman in sportswear with a circular graphic element behind her](placeholder_image_woman_sportswear) | | --- | --- | | **Increíble mejoría&lt;br&gt;observada en&lt;br&gt; esta enfermedad**&lt;br&gt;Evidencia notable. | | | [VER MÁS INFORMACIÓN] | | Estimado/a. Dr./Dra. [NOMBRE DEL MÉDICO] INSTRUCTIONS FOR CORRECTIONS: ONLY apply: reemplazar "Estimado/a. Dr./Dra. [NOMBRE DEL MÉDICO]" for "Estimado Dr. Juan Perez"
748
+
749
+ Output: <table> <tr> <td>Product®<br>(drugname)</td> <td><img src="placeholder_image_woman_sportswear" alt="Woman in sportswear with a circular graphic element behind her"></td> </tr> <tr> <td><strong>Increíble mejoría<br>observada en<br> esta enfermedad</strong><br>Evidencia notable.</td> <td></td> </tr> <tr> <td><a href="#">VER MÁS INFORMACIÓN</a></td> <td></td> </tr> <tr> <td>Estimado Dr. Juan Perez</td> <td></td> </tr> </table>
750
+
751
+
752
+ Input: MAIN CONTENT: <pre><code class="language-markdown">| Product®&lt;br&gt;(drugname) | ![Woman in sportswear with a circular graphic element behind her](placeholder_image_woman_sportswear) | | --- | --- | | **Increíble mejoría&lt;br&gt;observada en&lt;br&gt; esta enfermedad**&lt;br&gt;Evidencia notable. | | | [VER MÁS INFORMACIÓN] | | Estimado/a. Dr./Dra. [NOMBRE DEL MÉDICO] INSTRUCTIONS FOR CORRECTIONS:
753
+ Output: <table> <tr> <td>Product®<br>(drugname)</td> <td><img src="placeholder_image_woman_sportswear" alt="Woman in sportswear with a circular graphic element behind her"></td> </tr> <tr> <td><strong>Increíble mejoría<br>observada en<br> esta enfermedad</strong><br>Evidencia notable.</td> <td></td> </tr> <tr> <td><a href="#">VER MÁS INFORMACIÓN</a></td> <td></td> </tr> <tr> <td>Estimado/a. Dr./Dra. [NOMBRE DEL MÉDICO] </td> <td></td> </tr> </table>
754
+ Provider
755
+ Anthropic
756
+ Model
757
+ claude-3-7-sonnet-latest
758
+ Reasoning strategy
759
+ Chain of Thought
760
+ Creativity Level
761
+ 0.1
762
+ Max Tokens
763
+ 8192
764
+ Capabilities
765
+
766
+
767
+ Upcoming features
768
+ Reconstruction of the original PDF with the requested changes (it would probably require executing a script outside GEAI - e.g. a Python script using PyMuPDF; see the sketch at the end of this section)
769
+ Replace footer template if needed
770
+ Constraints
771
+
772
+
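+
+ For the PDF-reconstruction feature listed under Upcoming features, a minimal PyMuPDF sketch of a text-replacement pass (file names and the replacement pair are placeholders):
+
+ ```python
+ # Hedged sketch: replace a text snippet in a PDF with PyMuPDF (pip install pymupdf).
+ import fitz  # PyMuPDF
+
+ def replace_text(src: str, dst: str, old: str, new: str) -> None:
+     doc = fitz.open(src)
+     for page in doc:
+         for rect in page.search_for(old):
+             # Redact the old text and stamp the replacement in its place.
+             page.add_redact_annot(rect, text=new)
+         page.apply_redactions()
+     doc.save(dst)
+
+ # Usage: replace_text("asset.pdf", "asset_fixed.pdf",
+ #                     "Estimado/a. Dr./Dra. [NOMBRE DEL MÉDICO]", "Estimado Dr. Juan Perez")
+ ```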
requirements.txt CHANGED
@@ -2,5 +2,12 @@ fastapi
2
  uvicorn[standard]
3
  langchain
4
  langchain-openai
5
+ langchain-core
6
+ langgraph
7
  python-dotenv
8
  pydantic
9
+ typing-extensions
10
+ python-multipart
research_team.py ADDED
@@ -0,0 +1,726 @@
1
+ """
2
+ Research Team for Claims Anchoring and Reference Formatting
3
+ Implementation using LangGraph for multi-agent orchestration
4
+ """
5
+
6
+ import os
7
+ import json
8
+ import asyncio
9
+ import logging
10
+ from typing import List, Dict, Any, Optional, TypedDict, Annotated
11
+ from dataclasses import dataclass, field
12
+ from enum import Enum
13
+ import operator
14
+ from datetime import datetime
15
+
16
+ from langchain_core.messages import HumanMessage, AIMessage
17
+ from langchain_openai import ChatOpenAI
18
+ from langchain_core.prompts import ChatPromptTemplate
19
+ from langgraph.graph import StateGraph, START, END
20
+ from langgraph.graph.message import add_messages
21
+ from langgraph.prebuilt import ToolNode
22
+ from langchain_core.tools import tool
23
+ from pydantic import BaseModel
24
+
25
+ # Configure logging
26
+ logging.basicConfig(
27
+ level=logging.INFO,
28
+ format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
29
+ )
30
+ logger = logging.getLogger("ResearchTeam")
31
+
32
+ # Data Models
33
+ class ClaimType(Enum):
34
+ CORE = "core"
35
+ SUPPORTING = "supporting"
36
+ CONTEXTUAL = "contextual"
37
+
38
+ class SourceType(Enum):
39
+ GOOGLE_SCHOLAR = "google_scholar"
40
+ PUBMED = "pubmed"
41
+ ARXIV = "arxiv"
42
+
43
+ @dataclass
44
+ class Claim:
45
+ id: str
46
+ text: str
47
+ type: ClaimType
48
+ importance_score: float
49
+ position: int
50
+ context: str = ""
51
+
52
+ @dataclass
53
+ class Reference:
54
+ id: str
55
+ text: str
56
+ authors: List[str] = field(default_factory=list)
57
+ title: str = ""
58
+ journal: str = ""
59
+ year: str = ""
60
+ doi: str = ""
61
+ url: str = ""
62
+ source_type: str = ""
63
+
64
+ @dataclass
65
+ class SearchResult:
66
+ claim_id: str
67
+ source: SourceType
68
+ references: List[Reference]
69
+ supporting_text: str = ""
70
+ relevance_score: float = 0.0
71
+
72
+ @dataclass
73
+ class AnchoringResult:
74
+ claim_id: str
75
+ claim_text: str
76
+ anchored_references: List[Reference]
77
+ supporting_passages: List[str]
78
+ validation_status: str = "pending"
79
+
80
+ # State Management
81
+ class ResearchTeamState(TypedDict):
82
+ document_content: str
83
+ product: str
84
+ countries: List[str]
85
+ language: str
86
+ all_claims: List[Dict]
87
+ core_claims: List[Dict]
88
+ search_results: Dict[str, List[Dict]]
89
+ anchoring_results: List[Dict]
90
+ reference_list: List[Dict]
91
+ formatted_references: List[Dict]
92
+ final_output: Dict[str, Any]
93
+ messages: Annotated[List, add_messages]
94
+ processing_status: Dict[str, str]
95
+
96
+ # Mock Tools for Internet Search
97
+ @tool
98
+ def mock_google_scholar_search(query: str, claim_id: str) -> Dict[str, Any]:
99
+ """Mock Google Scholar search tool"""
100
+ logger.info(f"Google Scholar search for claim {claim_id}: '{query[:30]}...'")
101
+ return {
102
+ "claim_id": claim_id,
103
+ "source": "google_scholar",
104
+ "results": [
105
+ {
106
+ "id": f"gs_{claim_id}_1",
107
+ "title": f"Research paper related to: {query[:50]}...",
108
+ "authors": ["Smith, J.", "Doe, A."],
109
+ "journal": "Nature",
110
+ "year": "2023",
111
+ "doi": "10.1038/example",
112
+ "url": "https://nature.com/articles/example",
113
+ "relevance_score": 0.85
114
+ }
115
+ ]
116
+ }
117
+
118
+ @tool
119
+ def mock_pubmed_search(query: str, claim_id: str) -> Dict[str, Any]:
120
+ """Mock PubMed search tool"""
121
+ logger.info(f"PubMed search for claim {claim_id}: '{query[:30]}...'")
122
+ return {
123
+ "claim_id": claim_id,
124
+ "source": "pubmed",
125
+ "results": [
126
+ {
127
+ "id": f"pm_{claim_id}_1",
128
+ "title": f"Medical study on: {query[:50]}...",
129
+ "authors": ["Johnson, R.", "Wilson, K."],
130
+ "journal": "JAMA",
131
+ "year": "2023",
132
+ "doi": "10.1001/jama.2023.example",
133
+ "url": "https://pubmed.ncbi.nlm.nih.gov/example",
134
+ "relevance_score": 0.92
135
+ }
136
+ ]
137
+ }
138
+
139
+ @tool
140
+ def mock_arxiv_search(query: str, claim_id: str) -> Dict[str, Any]:
141
+ """Mock arXiv search tool"""
142
+ logger.info(f"arXiv search for claim {claim_id}: '{query[:30]}...'")
143
+ return {
144
+ "claim_id": claim_id,
145
+ "source": "arxiv",
146
+ "results": [
147
+ {
148
+ "id": f"ar_{claim_id}_1",
149
+ "title": f"Preprint research: {query[:50]}...",
150
+ "authors": ["Chen, L.", "Zhang, M."],
151
+ "journal": "arXiv preprint",
152
+ "year": "2024",
153
+ "url": "https://arxiv.org/abs/example",
154
+ "relevance_score": 0.78
155
+ }
156
+ ]
157
+ }
158
+
159
+ @tool
160
+ def mock_reference_fetch(reference_id: str) -> Dict[str, Any]:
161
+ """Mock tool to fetch full reference content"""
162
+ logger.debug(f"Fetching full content for reference {reference_id}")
163
+ return {
164
+ "reference_id": reference_id,
165
+ "full_text": f"This is the full text content for reference {reference_id}. It contains detailed information that supports the corresponding claim...",
166
+ "abstract": f"Abstract for {reference_id}",
167
+ "sections": ["Introduction", "Methods", "Results", "Discussion"],
168
+ "supporting_passages": [
169
+ f"Key finding 1 from {reference_id}",
170
+ f"Important conclusion from {reference_id}"
171
+ ]
172
+ }
173
+
174
+ # Research Team Agents
175
+ class AnalyzerAgent:
176
+ """Agent for document analysis and claims extraction"""
177
+
178
+ def __init__(self, llm):
179
+ self.llm = llm
180
+ self.prompt = ChatPromptTemplate.from_template("""
181
+ You are an AI assistant specialized in analyzing content and extracting claims systematically.
182
+
183
+ GUIDELINES:
184
+ 1. Analyze the provided document content to identify ALL claims and statements
185
+ 2. Classify claims hierarchically:
186
+ - CORE claims: Primary thesis/conclusions that define the document's main arguments
187
+ - SUPPORTING claims: Evidence that reinforces core claims
188
+ - CONTEXTUAL claims: Background/introductory statements
189
+ 3. Score importance (0-10) based on:
190
+ - Impact on main thesis
191
+ - Frequency of reference
192
+ - Position in document structure
193
+ 4. Extract product name, countries, and language from content
194
+ 5. DO NOT omit any claims, even if they appear duplicated
195
+
196
+ RESPONSE FORMAT:
197
+ Provide response in JSON format with:
198
+ {{
199
+ "product": "product_name_lowercase",
200
+ "countries": ["country1", "country2"],
201
+ "language": "detected_language",
202
+ "claims": [
203
+ {{
204
+ "id": "claim_1",
205
+ "text": "exact claim text",
206
+ "type": "core|supporting|contextual",
207
+ "importance_score": 9,
208
+ "position": 1,
209
+ "context": "surrounding context"
210
+ }}
211
+ ]
212
+ }}
213
+
214
+ Document Content:
215
+ {document_content}
216
+ """)
217
+
218
+ async def analyze(self, document_content: str) -> Dict[str, Any]:
219
+ """Analyze document and extract structured claims"""
220
+ logger.info("STEP 1: Starting document analysis...")
221
+
222
+ try:
223
+ logger.info("Processing document content for claims extraction")
224
+ response = await self.llm.ainvoke(
225
+ self.prompt.format_messages(document_content=document_content)
226
+ )
227
+
228
+ # Parse JSON response
229
+ result = json.loads(response.content)
230
+
231
+ # Separate core claims for priority processing
232
+ core_claims = [claim for claim in result["claims"] if claim["type"] == "core"]
233
+
234
+ logger.info(f"Analysis complete: {len(result['claims'])} total claims found")
235
+ logger.info(f"Core claims identified: {len(core_claims)}")
236
+ logger.info(f"Product detected: {result.get('product', 'Not detected')}")
237
+ logger.info(f"Countries: {', '.join(result.get('countries', ['Not detected']))}")
238
+ logger.info(f"Language: {result.get('language', 'Not detected')}")
239
+
240
+ return {
241
+ "product": result.get("product", ""),
242
+ "countries": result.get("countries", []),
243
+ "language": result.get("language", "english"),
244
+ "all_claims": result["claims"],
245
+ "core_claims": core_claims
246
+ }
247
+ except Exception as e:
248
+ logger.error(f"Analyzer error: {e}")
249
+ return {
250
+ "product": "",
251
+ "countries": [],
252
+ "language": "english",
253
+ "all_claims": [],
254
+ "core_claims": []
255
+ }
256
+
257
+ class SearchAssistant:
258
+ """Agent for parallel reference searching across multiple sources"""
259
+
260
+ def __init__(self, llm):
261
+ self.llm = llm
262
+ self.tools = [mock_google_scholar_search, mock_pubmed_search, mock_arxiv_search]
263
+
264
+ async def search_for_claim(self, claim: Dict[str, Any]) -> Dict[str, List[Dict]]:
265
+ """Perform parallel searches for a specific claim"""
266
+ claim_id = claim["id"]
267
+ claim_text = claim["text"]
268
+
269
+ logger.info(f"Searching references for claim {claim_id}: '{claim_text[:50]}...'")
270
+
271
+ # Generate search query from claim
272
+ search_query = self._extract_search_terms(claim_text)
273
+
274
+ # Parallel search across all sources
275
+ search_tasks = []
276
+ for tool in self.tools:
277
+ task = asyncio.create_task(
278
+ self._execute_search(tool, search_query, claim_id)
279
+ )
280
+ search_tasks.append(task)
281
+
282
+ search_results = await asyncio.gather(*search_tasks, return_exceptions=True)
283
+
284
+ # Aggregate results
285
+ aggregated_results = []
286
+ for result in search_results:
287
+ if isinstance(result, dict) and "results" in result:
288
+ aggregated_results.extend(result["results"])
289
+
290
+ logger.info(f"Search complete for claim {claim_id}: {len(aggregated_results)} references found")
291
+
292
+ return {claim_id: aggregated_results}
293
+
294
+ def _extract_search_terms(self, claim_text: str) -> str:
295
+ """Extract key terms from claim for search"""
296
+ # Simple keyword extraction - could be enhanced with NLP
297
+ return claim_text[:100] # Use first 100 chars as search query
298
+
299
+ async def _execute_search(self, tool, query: str, claim_id: str) -> Dict:
300
+ """Execute individual search tool"""
301
+ try:
302
+ result = tool.invoke({"query": query, "claim_id": claim_id})
303
+ return result
304
+ except Exception as e:
305
+ logger.error(f"Search error for {tool.name}: {e}")
306
+ return {"claim_id": claim_id, "results": []}
307
+
308
+ class ResearcherAgent:
309
+ """Agent for claims anchoring and validation"""
310
+
311
+ def __init__(self, llm):
312
+ self.llm = llm
313
+ self.prompt = ChatPromptTemplate.from_template("""
314
+ You are an AI assistant specialized in claims anchoring and reference validation.
315
+
316
+ GUIDELINES:
317
+ 1. ONLY focus on provided input - DO NOT consider conversation history
318
+ 2. Review ALL anchored sentences and ALL references in the content
319
+ 3. DO NOT OMIT anchored sentences or listed references
320
+ 4. For each claim, analyze provided search results to:
321
+ - Identify relevant references that support the claim
322
+ - Extract supporting text from reference content
323
+ - Validate the strength of evidence
324
+ - Rate the relevance and quality of support
325
+
326
+ RESPONSE FORMAT:
327
+ {{
328
+ "claim_id": "{claim_id}",
329
+ "validation_status": "validated|partial|unsupported",
330
+ "anchored_references": [
331
+ {{
332
+ "reference_id": "ref_id",
333
+ "supporting_text": "exact text that supports claim",
334
+ "relevance_score": 0.92,
335
+ "section": "Results"
336
+ }}
337
+ ],
338
+ "supporting_passages": ["passage1", "passage2"],
339
+ "quality_assessment": "assessment text"
340
+ }}
341
+
342
+ Claim: {claim_text}
343
+ Search Results: {search_results}
344
+ """)
345
+
346
+ async def anchor_claim(self, claim: Dict[str, Any], search_results: List[Dict]) -> Dict[str, Any]:
347
+ """Perform claims anchoring for a specific claim"""
348
+ claim_id = claim["id"]
349
+ logger.info(f"Anchoring claim {claim_id} with {len(search_results)} references")
350
+
351
+ try:
352
+ # Fetch full content for top references
353
+ enriched_results = []
354
+ for result in search_results[:3]: # Limit to top 3 results
355
+ full_content = mock_reference_fetch.invoke({"reference_id": result.get("id", "")})
356
+ result["full_content"] = full_content
357
+ enriched_results.append(result)
358
+
359
+ logger.debug(f"Retrieved full content for {len(enriched_results)} top references")
360
+
361
+ response = await self.llm.ainvoke(
362
+ self.prompt.format_messages(
363
+ claim_id=claim["id"],
364
+ claim_text=claim["text"],
365
+ search_results=json.dumps(enriched_results, indent=2)
366
+ )
367
+ )
368
+
369
+ result = json.loads(response.content)
370
+ result["claim_text"] = claim["text"]
371
+
372
+ logger.info(f"Claim {claim_id} anchored: {result.get('validation_status', 'unknown')} status")
373
+
374
+ return result
375
+ except Exception as e:
376
+ logger.error(f"Researcher error for claim {claim['id']}: {e}")
377
+ return {
378
+ "claim_id": claim["id"],
379
+ "claim_text": claim["text"],
380
+ "validation_status": "error",
381
+ "anchored_references": [],
382
+ "supporting_passages": [],
383
+ "quality_assessment": f"Error during processing: {e}"
384
+ }
385
+
386
+ class EditorAgent:
387
+ """Agent for reference formatting and validation"""
388
+
389
+ def __init__(self, llm):
390
+ self.llm = llm
391
+ self.prompt = ChatPromptTemplate.from_template("""
392
+ You are an expert in reference formatting using J&J formatting guidelines.
393
+
394
+ GUIDELINES:
395
+ 1. Format ALL references according to these rules:
396
+ - Journal Article: Authors. Article Title. Abbreviated Journal Title Year; Volume(Number): Pages.
397
+ - Journal Epub: Authors. Article Title. Journal [Internet]. Epub Year Month Day [cited Year Month Day]; Volume(Number): Pages. Available from: DOI
398
+ - Website: Authors/Website Name [Internet]. Title. [Accessed Year Month Day]. Available from: URL.
399
+ - Book: Authors. Title. Edition. Place: Publisher, Year. Chapter: Chapter Title; Pages.
400
+ 2. Special rules:
401
+ - Use first, second, third authors + "et al." when more than 3 authors
402
+ - Use italic format ONLY for book titles
403
+ - Translate terms based on content language: {language}
404
+ 3. Complete missing information where possible
405
+ 4. Maintain original reference order
406
+
407
+ RESPONSE FORMAT:
408
+ {{
409
+ "formatted_references": [
410
+ {{
411
+ "id": "ref_id",
412
+ "original": "original reference text",
413
+ "formatted": "properly formatted reference",
414
+ "changes_applied": "description of changes",
415
+ "source_type": "journal|book|website|etc",
416
+ "completion_status": "complete|incomplete|not_found"
417
+ }}
418
+ ]
419
+ }}
420
+
421
+ References to format:
422
+ {references}
423
+
424
+ Content Language: {language}
425
+ """)
426
+
427
+ async def format_references(self, references: List[Dict], language: str = "english") -> Dict[str, Any]:
428
+ """Format references according to J&J guidelines"""
429
+ logger.info(f"Formatting {len(references)} references according to J&J guidelines")
430
+ logger.info(f"Content language: {language}")
431
+
432
+ try:
433
+ response = await self.llm.ainvoke(
434
+ self.prompt.format_messages(
435
+ references=json.dumps(references, indent=2),
436
+ language=language
437
+ )
438
+ )
439
+
440
+ result = json.loads(response.content)
441
+ formatted_count = len(result.get("formatted_references", []))
442
+ logger.info(f"Reference formatting complete: {formatted_count} references processed")
443
+
444
+ return result
445
+ except Exception as e:
446
+ logger.error(f"Editor error: {e}")
447
+ return {"formatted_references": []}
448
+
449
+ # LangGraph Workflow Nodes
450
+ class ResearchTeamWorkflow:
451
+ """Main workflow orchestrator using LangGraph"""
452
+
453
+ def __init__(self):
454
+ logger.info("Initializing Research Team Workflow")
455
+
456
+ # Initialize LLM
457
+ self.llm = ChatOpenAI(
458
+ model="gpt-4",
459
+ temperature=0.1,
460
+ api_key=os.getenv("OPENAI_API_KEY")
461
+ )
462
+
463
+ # Initialize agents
464
+ self.analyzer = AnalyzerAgent(self.llm)
465
+ self.search_assistant = SearchAssistant(self.llm)
466
+ self.researcher = ResearcherAgent(self.llm)
467
+ self.editor = EditorAgent(self.llm)
468
+
469
+ # Build workflow graph
470
+ self.workflow = self._build_workflow()
471
+
472
+ logger.info("Research Team Workflow initialized successfully")
473
+
474
+ def _build_workflow(self) -> StateGraph:
475
+ """Build the LangGraph workflow"""
476
+ logger.info("Building LangGraph workflow...")
477
+
478
+ workflow = StateGraph(ResearchTeamState)
479
+
480
+ # Add nodes
481
+ workflow.add_node("analyzer", self._analyzer_node)
482
+ workflow.add_node("claims_dispatcher", self._claims_dispatcher_node)
483
+ workflow.add_node("parallel_search", self._parallel_search_node)
484
+ workflow.add_node("claims_anchoring", self._claims_anchoring_node)
485
+ workflow.add_node("reference_formatting", self._reference_formatting_node)
486
+ workflow.add_node("final_assembly", self._final_assembly_node)
487
+
488
+ # Define workflow edges
489
+ workflow.add_edge(START, "analyzer")
490
+ workflow.add_edge("analyzer", "claims_dispatcher")
491
+ workflow.add_edge("claims_dispatcher", "parallel_search")
492
+ workflow.add_edge("parallel_search", "claims_anchoring")
493
+ workflow.add_edge("claims_anchoring", "reference_formatting")
494
+ workflow.add_edge("reference_formatting", "final_assembly")
495
+ workflow.add_edge("final_assembly", END)
496
+
497
+ logger.info("LangGraph workflow built successfully")
498
+ return workflow.compile()
499
+
500
+ async def _analyzer_node(self, state: ResearchTeamState) -> ResearchTeamState:
501
+ """Document analysis and claims extraction"""
502
+ logger.info("STEP 1: Document Analysis")
503
+
504
+ result = await self.analyzer.analyze(state["document_content"])
505
+
506
+ state.update({
507
+ "product": result["product"],
508
+ "countries": result["countries"],
509
+ "language": result["language"],
510
+ "all_claims": result["all_claims"],
511
+ "core_claims": result["core_claims"],
512
+ "processing_status": {"analyzer": "completed"}
513
+ })
514
+
515
+ logger.info("STEP 1 COMPLETE: Document analysis finished")
516
+ return state
517
+
518
+ async def _claims_dispatcher_node(self, state: ResearchTeamState) -> ResearchTeamState:
519
+ """Prepare claims for parallel processing"""
520
+ logger.info("STEP 2: Claims Dispatcher - Preparing parallel processing")
521
+
522
+ core_claims = state["core_claims"]
523
+
524
+ # Initialize processing status for each claim
525
+ processing_status = state.get("processing_status", {})
526
+ for claim in core_claims:
527
+ processing_status[f"claim_{claim['id']}"] = "pending"
528
+
529
+ state["processing_status"] = processing_status
530
+
531
+ logger.info(f"STEP 2 COMPLETE: {len(core_claims)} core claims prepared for parallel processing")
532
+ return state
533
+
534
+ async def _parallel_search_node(self, state: ResearchTeamState) -> ResearchTeamState:
535
+ """Execute parallel searches for all core claims"""
536
+ logger.info("STEP 3: Parallel Search - Starting multi-source searches")
537
+
538
+ core_claims = state["core_claims"]
539
+ search_results = {}
540
+
541
+ logger.info(f"Launching parallel searches for {len(core_claims)} core claims")
542
+ logger.info("Search sources: Google Scholar, PubMed, arXiv")
543
+
544
+ # Create search tasks for all core claims
545
+ search_tasks = []
546
+ for claim in core_claims:
547
+ task = asyncio.create_task(
548
+ self.search_assistant.search_for_claim(claim)
549
+ )
550
+ search_tasks.append(task)
551
+
552
+ # Execute all searches in parallel
553
+ results = await asyncio.gather(*search_tasks, return_exceptions=True)
554
+
555
+ # Aggregate results
556
+ total_references = 0
557
+ for result in results:
558
+ if isinstance(result, dict):
559
+ search_results.update(result)
560
+ for claim_id, refs in result.items():
561
+ total_references += len(refs)
562
+
563
+ state["search_results"] = search_results
564
+
565
+ logger.info(f"STEP 3 COMPLETE: Parallel search finished - {total_references} total references found")
566
+ return state
567
+
568
+ async def _claims_anchoring_node(self, state: ResearchTeamState) -> ResearchTeamState:
569
+ """Perform claims anchoring for all core claims"""
570
+ logger.info("STEP 4: Claims Anchoring - Validating evidence support")
571
+
572
+ core_claims = state["core_claims"]
573
+ search_results = state["search_results"]
574
+ anchoring_results = []
575
+
576
+ logger.info(f"Processing {len(core_claims)} claims for anchoring")
577
+
578
+ # Create anchoring tasks
579
+ anchoring_tasks = []
580
+ for claim in core_claims:
581
+ claim_search_results = search_results.get(claim["id"], [])
582
+ task = asyncio.create_task(
583
+ self.researcher.anchor_claim(claim, claim_search_results)
584
+ )
585
+ anchoring_tasks.append(task)
586
+
587
+ # Execute all anchoring in parallel
588
+ results = await asyncio.gather(*anchoring_tasks, return_exceptions=True)
589
+
590
+ validated_count = 0
591
+ for result in results:
592
+ if isinstance(result, dict):
593
+ anchoring_results.append(result)
594
+ if result.get("validation_status") == "validated":
595
+ validated_count += 1
596
+
597
+ state["anchoring_results"] = anchoring_results
598
+
599
+ logger.info(f"STEP 4 COMPLETE: Claims anchoring finished - {validated_count}/{len(anchoring_results)} claims validated")
600
+ return state
601
+
602
+ async def _reference_formatting_node(self, state: ResearchTeamState) -> ResearchTeamState:
603
+ """Format all references according to J&J guidelines"""
604
+ logger.info("STEP 5: Reference Formatting - Applying J&J guidelines")
605
+
606
+ # Extract all references from anchoring results
607
+ all_references = []
608
+ for anchoring_result in state["anchoring_results"]:
609
+ if "anchored_references" in anchoring_result:
610
+ all_references.extend(anchoring_result["anchored_references"])
611
+
612
+ logger.info(f"Processing {len(all_references)} references for formatting")
613
+
614
+ # Format references
615
+ formatting_result = await self.editor.format_references(
616
+ all_references,
617
+ state.get("language", "english")
618
+ )
619
+
620
+ state["formatted_references"] = formatting_result.get("formatted_references", [])
621
+
622
+ logger.info("STEP 5 COMPLETE: Reference formatting finished")
623
+ return state
624
+
625
+ async def _final_assembly_node(self, state: ResearchTeamState) -> ResearchTeamState:
626
+ """Assemble final results"""
627
+ logger.info("STEP 6: Final Assembly - Generating comprehensive report")
628
+
629
+ final_output = {
630
+ "document_metadata": {
631
+ "product": state["product"],
632
+ "countries": state["countries"],
633
+ "language": state["language"]
634
+ },
635
+ "claims_analysis": {
636
+ "total_claims": len(state["all_claims"]),
637
+ "core_claims_count": len(state["core_claims"]),
638
+ "claims_details": state["all_claims"]
639
+ },
640
+ "claims_anchoring": {
641
+ "results": state["anchoring_results"],
642
+ "summary": self._generate_anchoring_summary(state["anchoring_results"])
643
+ },
644
+ "reference_formatting": {
645
+ "formatted_references": state["formatted_references"],
646
+ "total_references": len(state["formatted_references"])
647
+ },
648
+ "processing_status": state.get("processing_status", {})
649
+ }
650
+
651
+ state["final_output"] = final_output
652
+
653
+ # Log final summary
654
+ summary = final_output["claims_anchoring"]["summary"]
655
+ logger.info("FINAL RESULTS SUMMARY:")
656
+ logger.info(f" Total claims processed: {final_output['claims_analysis']['total_claims']}")
657
+ logger.info(f" Core claims: {final_output['claims_analysis']['core_claims_count']}")
658
+ logger.info(f" Successfully validated: {summary['successfully_validated']}")
659
+ logger.info(f" Validation rate: {summary['validation_rate']:.1%}")
660
+ logger.info(f" References formatted: {final_output['reference_formatting']['total_references']}")
661
+ logger.info("STEP 6 COMPLETE: Research Team workflow finished successfully!")
662
+
663
+ return state
664
+
665
+ def _generate_anchoring_summary(self, anchoring_results: List[Dict]) -> Dict[str, Any]:
666
+ """Generate summary of anchoring results"""
667
+ total_claims = len(anchoring_results)
668
+ validated_claims = sum(1 for r in anchoring_results if r.get("validation_status") == "validated")
669
+
670
+ return {
671
+ "total_claims_processed": total_claims,
672
+ "successfully_validated": validated_claims,
673
+ "validation_rate": validated_claims / total_claims if total_claims > 0 else 0,
674
+ "claims_summary": [
675
+ {
676
+ "claim_id": r["claim_id"],
677
+ "status": r.get("validation_status", "unknown"),
678
+ "references_found": len(r.get("anchored_references", []))
679
+ }
680
+ for r in anchoring_results
681
+ ]
682
+ }
683
+
684
+ async def process_document(self, document_content: str) -> Dict[str, Any]:
685
+ """Main entry point for document processing"""
686
+ start_time = datetime.now()
687
+ logger.info("=" * 80)
688
+ logger.info("RESEARCH TEAM WORKFLOW STARTED")
689
+ logger.info(f"Start time: {start_time.strftime('%Y-%m-%d %H:%M:%S')}")
690
+ logger.info(f"Document length: {len(document_content)} characters")
691
+ logger.info("=" * 80)
692
+
693
+ initial_state = ResearchTeamState(
694
+ document_content=document_content,
695
+ product="",
696
+ countries=[],
697
+ language="english",
698
+ all_claims=[],
699
+ core_claims=[],
700
+ search_results={},
701
+ anchoring_results=[],
702
+ reference_list=[],
703
+ formatted_references=[],
704
+ final_output={},
705
+ messages=[],
706
+ processing_status={}
707
+ )
708
+
709
+ # Execute workflow
710
+ final_state = await self.workflow.ainvoke(initial_state)
711
+
712
+ end_time = datetime.now()
713
+ duration = end_time - start_time
714
+
715
+ logger.info("=" * 80)
716
+ logger.info("RESEARCH TEAM WORKFLOW COMPLETED")
717
+ logger.info(f"End time: {end_time.strftime('%Y-%m-%d %H:%M:%S')}")
718
+ logger.info(f"Total duration: {duration.total_seconds():.2f} seconds")
719
+ logger.info("=" * 80)
720
+
721
+ return final_state["final_output"]
722
+
723
+ # Factory function for easy instantiation
724
+ def create_research_team() -> ResearchTeamWorkflow:
725
+ """Create and return a configured ResearchTeam instance"""
726
+ return ResearchTeamWorkflow()
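+
+ # Example usage (illustrative; requires OPENAI_API_KEY in the environment):
+ # team = create_research_team()
+ # document = open("test_document.md", encoding="utf-8").read()
+ # result = asyncio.run(team.process_document(document))
+ # print(json.dumps(result, indent=2))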
test_document.md ADDED
@@ -0,0 +1,41 @@
1
+ # Sample Medical Document for Testing ResearchTeam
2
+
3
+ ## Introduction
4
+
5
+ **Daratumumab** is a human monoclonal antibody that targets CD38, a glycoprotein highly expressed on multiple myeloma cells. Clinical studies have demonstrated significant efficacy in treating relapsed and refractory multiple myeloma patients.
6
+
7
+ ## Key Clinical Claims
8
+
9
+ ### Primary Efficacy Findings
10
+
11
+ The POLLUX study demonstrated that **daratumumab in combination with lenalidomide and dexamethasone significantly improved progression-free survival** compared to lenalidomide and dexamethasone alone (median not reached vs. 18.4 months; HR=0.37; 95% CI: 0.27-0.52; p<0.001).¹
12
+
13
+ In the CASTOR trial, **daratumumab plus bortezomib and dexamethasone showed superior overall response rates** of 83% versus 63% in the control arm (p<0.001).²
14
+
15
+ ### Safety Profile
16
+
17
+ **The most common adverse events observed were infusion-related reactions** occurring in approximately 48% of patients during the first infusion, with rates decreasing to less than 5% by the second infusion.³
18
+
19
+ ## Product Information
20
+
21
+ This study was conducted across multiple countries including **Argentina, Brazil, Chile, and Mexico** for regulatory approval in Latin American markets.
22
+
23
+ The content is provided in **Spanish** for healthcare professionals in these regions.
24
+
25
+ ## References
26
+
27
+ 1. Dimopoulos MA, Oriol A, Nopoka H, et al. Daratumumab, lenalidomide, and dexamethasone for multiple myeloma. N Engl J Med. 2016;375(14):1319-1331.
28
+
29
+ 2. Palumbo A, Chanan-Khan A, Weisel K, et al. Daratumumab, bortezomib, and dexamethasone for multiple myeloma. N Engl J Med. 2016;375(8):754-766.
30
+
31
+ 3. Safety data from pooled analysis of POLLUX and CASTOR studies. Presented at ASH 2016.
32
+
33
+ ---
34
+
35
+ **Contact Information:**
36
+ Janssen-Cilag Argentina S.A.
37
+ Buenos Aires, Argentina
38
+ Tel: +54-11-4732-5000
39
+
40
+ **Important Safety Information:**
41
+ Please refer to full prescribing information for complete safety profile and contraindications.