Junaidb committed on
Commit 85b3ca7 · verified · 1 Parent(s): 097333a

Update llmeval.py

Files changed (1)
  1. llmeval.py +87 -201
llmeval.py CHANGED
@@ -11,35 +11,38 @@ de=DatabaseEngine()
11
 
12
 
13
 
14
- SYSTEM_FOR_BIO_CONTEXT_EVAL_FOR_OBSERVATION='''
15
  Task:
16
- Evaluate the biological quality of a Prompt, Research Data, and Response from an Observations Generator Agent on a 0–1 continuous scale.
17
 
18
  Goal:
19
  Assess:
20
 
21
- Whether the Prompt clearly defines the research context and specifies the scope of valid biological observations.
22
 
23
- Whether the Research Data is relevant, biologically meaningful, and sufficient to support observation generation.
24
 
25
- Whether the Response includes observations that are biologically plausible, factually grounded, and consistent with the Research Data.
26
 
27
  Scoring Guide (0–1 continuous scale):
28
- Score 1.0 if:
29
 
30
- Prompt is clear, biologically specific, and well-aligned to the data context.
31
 
32
- Research Data is relevant, complete, and interpretable in a biological sense.
33
 
34
- Response consists of multiple observations that are biologically valid, non-redundant, and directly grounded in the data.
35
 
36
  Lower scores if:
37
 
38
- The prompt is vague, generic, or misaligned to the data.
39
 
40
- The research data is noisy, irrelevant, incomplete, or non-biological.
41
 
42
- The response includes irrelevant, biologically implausible, contradictory, or trivial observations.
43
 
44
  Your output must begin with:
45
  Score:
@@ -47,38 +50,38 @@ and contain only two fields:
47
  Score: and Reasoning:
48
  No extra commentary, no markdown, no explanations before or after.
49
  Think step by step
50
-
51
  '''
52
 
53
- SYSTEM_FOR_CONTEXTUAL_RELEVANCE_ALIGNMENT='''
54
  Task:
55
- Evaluate how well the Observation Generator agent’s Response addresses the specific Prompt by leveraging the provided Context on a 0–1 continuous scale.
56
 
57
  Goal:
58
  Assess:
59
 
60
- Whether the Prompt clearly sets expectations aligned with the Context.
61
 
62
- Whether the Context supplies appropriate information to fulfill the Prompt.
63
 
64
- Whether the Response directly responds to the Prompt using relevant Context details.
65
 
66
- Scoring Guide (0–1 continuous scale):
67
- Score 1.0 if:
68
 
69
- The Prompt is precisely tailored to the Context.
70
 
71
- The Context is sufficient and pertinent.
72
 
73
- The Response directly and comprehensively leverages the Context to satisfy the Prompt.
74
 
75
- Lower scores if:
76
 
77
- Prompt is misaligned or too generic.
78
 
79
- Context is irrelevant, incomplete, or off-topic.
80
 
81
- Response fails to use Context or deviates from Prompt intent.
82
 
83
  Your output must begin with:
84
  Score:
@@ -86,78 +89,77 @@ and contain only two fields:
86
  Score: and Reasoning:
87
  No extra commentary, no markdown, no explanations before or after.
88
  Think step by step
 
89
  '''
90
 
91
 
92
- SYSTEM_PROMPT_FOR_COHERENCE='''
93
  Task:
94
- Evaluate the logical and semantic coherence of the Prompt, one or more provided Contexts, and the Response as a unified set on a 0–1 continuous scale.
95
 
96
  Goal:
97
  Assess:
98
 
99
- Whether the Prompt logically fits each of the provided Contexts.
100
 
101
- Whether the Response logically follows from both the Prompt and all provided Contexts.
102
 
103
- Whether there are any gaps, contradictions, or misalignments among the Prompt, the Contexts, and the Response.
104
 
105
- Scoring Guide (0–1 continuous scale):
106
- Score 1.0 if:
107
 
108
- The Prompt is coherent with every Context.
109
 
110
- The Response seamlessly builds on the Prompt and all Contexts without contradiction.
111
 
112
- All elements form a consistent, unified narrative.
113
 
114
- Lower scores if:
115
 
116
- The Prompt and one or more Contexts are disjointed.
117
-
118
- The Response introduces contradictions or unsupported claims relative to any Context.
119
-
120
- Logical or semantic gaps exist between the Prompt, any Context, or the Response.
121
 
122
  Your output must begin with:
123
  Score:
124
  and contain only two fields:
125
  Score: and Reasoning:
126
  No extra commentary, no markdown, no explanations before or after.
127
-
128
  Think step by step
 
129
  '''
130
 
131
 
132
- SYSTEM_PROMPT_FOR_RESPONSE_SPECIFICITY='''
133
  Task:
134
- Evaluate how focused, detailed, and context-aware the agent’s Response is with respect to the Prompt and Context on a 0–1 continuous scale.
135
 
136
  Goal:
137
  Assess:
138
 
139
- Whether the Response provides precise answers targeted to the Prompt.
140
 
141
- Whether it includes sufficient detail drawn from the Context.
142
 
143
- Whether it avoids vagueness or overly generic statements.
144
 
145
- Scoring Guide (0–1 continuous scale):
146
- Score 1.0 if:
147
 
148
- The Response is highly specific to the Prompt.
149
 
150
- It draws clear, detailed insights from the Context.
151
 
152
- No generic or irrelevant content is present.
153
 
154
- Lower scores if:
155
 
156
- Response is vague or broad.
157
 
158
- Lacks detail or context grounding.
 
159
 
160
- Contains filler or off-topic information.
161
 
162
  Your output must begin with:
163
  Score:
@@ -177,164 +179,48 @@ class LLM_as_Evaluator():
177
  def ___engine_core(self,messages):
178
 
179
  completion = client.chat.completions.create(
180
- model="llama3-8b-8192",
181
  messages=messages,
182
  temperature=0.0,
183
- max_completion_tokens=5000,
184
- top_p=1,
185
  stream=False,
186
  stop=None,
187
  )
188
  actual_message=completion.choices[0].message.content
189
- return actual_message
190
- #cleaned_json=re.sub(r"```(?:json)?\s*(.*?)\s*```", r"\1", actual_message, flags=re.DOTALL).strip()
191
- #is_json_like = cleaned_json.strip().startswith("{") and cleaned_json.strip().endswith("}")
192
- #if is_json_like==True:
193
- #return cleaned_json
194
- #else:
195
- #return "FATAL"
196
-
197
-
198
- def Paradigm_LLM_Evaluator(self,promptversion):
199
 
200
 
201
- SYSTEM='''
202
- Task:
203
- Evaluate the biological quality of a prompt, research data, paradigm list, and response on a 0–1 continuous scale.
204
 
205
- Goal:
206
- Assess:
207
-
208
- Whether the Prompt is clear, biologically specific, and aligned with the Research Data and the Paradigm List.
209
-
210
- Whether the response is biologically relevant, mechanistically coherent, and experimentally actionable based on the Research Data.
211
-
212
- Whether the response is correctly chosen from the Paradigm List in light of the Research Data.
213
-
214
- Scoring Guide (0–1 continuous scale):
215
-
216
- Score 1.0 if:
217
-
218
- The Prompt is clear, biologically detailed, and well-aligned to the Research Data and Paradigm List.
219
-
220
- The response correctly reflects a biologically valid interpretation of the Research Data and is appropriately drawn from the Paradigm List.
221
-
222
- Lower scores if:
223
-
224
- The prompt is vague or misaligned with the research context.
225
-
226
- The response is biologically irrelevant, mechanistically incoherent, or mismatched with the Research Data.
227
-
228
- The paradigm is not the most plausible or supported choice from the Paradigm List.
229
-
230
-
231
- Your output must begin with Score: and contain only two fields: Score: and Reasoning:. No extra commentary, no markdown, no explanations before or after.:
232
-
233
- Think step by step
234
- '''
235
 
236
- data_to_evaluate=dbe.GetData(promptversion)
237
- messages=[
238
- {"role":"system","content":SYSTEM},
239
- {"role":"user","content":f"""
240
- Prompt:{data_to_evaluate["prompt"]},
241
- Research Data :{data_to_evaluate["context"]},
242
- Agent's Response:{data_to_evaluate["response"]}
243
-
244
- """}
245
- ]
246
-
247
- evaluation_response=self.___engine_core(messages=messages)
248
- data={
249
- "promptversion":promptversion,
250
- "biological_context_alignment":evaluation_response
251
- }
252
- de.Update(data=data)
253
-
254
-
255
- def Observation_LLM_Evaluator(self,metric,promptversion):
256
- match metric:
257
- case "biological_context_alignment":
258
-
259
-
260
- data_to_evaluate=de.GetData(promptversion)
261
- messages =[
262
 
263
- {"role":"system","content":SYSTEM_FOR_BIO_CONTEXT_EVAL_FOR_OBSERVATION},
264
- {"role":"user","content":f"""
265
  Prompt :{data_to_evaluate["prompt"]}
266
- Research Data :{data_to_evaluate["context"]}
267
  Agent's Response : {data_to_evaluate["response"]}
268
- """}
269
- ]
270
- evaluation_response=self.___engine_core(messages=messages)
271
- data={
272
  "promptversion":promptversion,
273
- "biological_context_alignment":evaluation_response
274
  }
275
- de.Update(data=data)
276
-
277
- case "contextual_relevance_alignment":
278
- pass
279
-
280
-
281
-
282
-
283
- def Anomaly_LLM_Evaluator(self,promptversion):
284
- SYSTEM='''
285
- Task:
286
- Evaluate the biological quality of a prompt , observations , paradigms and response from an Anomaly Detector Agent on a 0–1 continuous scale.
287
-
288
- Goal:
289
- Assess:
290
-
291
- Whether the Prompt clearly defines the biological context and intent.
292
-
293
- Whether the Observations are biologically plausible and internally consistent.
294
 
295
- Whether the Paradigms are plausible biological frameworks given the context.
296
 
297
- Whether the Response correctly identifies biologically relevant inconsistencies or contradictions between the Paradigms and the Observations.
298
-
299
- Scoring Guide (0–1 continuous scale):
300
-
301
- Score 1.0 if:
302
-
303
- The Prompt is clear, biologically grounded, and well-scoped.
304
-
305
- The Observations are plausible and logically consistent.
306
-
307
- The Response accurately identifies true anomalies—i.e., meaningful contradictions or gaps—between the Paradigms and the Observations.
308
-
309
- All major conflicts are captured.
310
-
311
- Lower scores if:
312
-
313
- The Prompt is vague or misaligned with the context.
314
-
315
- Observations are biologically implausible or incoherent.
316
-
317
- The Response overlooks key inconsistencies, includes irrelevant anomalies, or shows poor biological reasoning.
318
-
319
- Your output must begin with Score: and contain only two fields: Score: and Reasoning: No extra commentary, no markdown, no explanations before or after.
320
-
321
- Think step by step
322
- '''
323
-
324
- data_to_evaluate=dbe.GetData(promptversion)
325
- messages=[
326
- {"role":"system","content":SYSTEM},
327
- {"role":"user","content":f"""
328
- Prompt :{data_to_evaluate["prompt"]}
329
- Observations :{ data_to_evaluate["context"]}
330
- Agent's Response :{data_to_evaluate["response"]}
331
- """}
332
- ]
333
- evaluation_response=self.___engine_core(messages=messages)
334
- data={
335
- "promptversion":promptversion,
336
- "biological_context_alignment":evaluation_response
337
- }
338
- de.Update(data=data)
339
-
340
 
 
11
 
12
 
13
 
14
+ SYSTEM_FOR_BIO_CONTEXT_ALIGNMENT=f'''
15
  Task:
16
+ Evaluate the biological quality of a Prompt, Context, and Response from an {agenttype} Agent on a 0–10 continuous scale.
17
 
18
  Goal:
19
  Assess:
20
 
21
+ Whether the Prompt precisely defines a biologically specific research objective, explicitly frames the agent's role, delineates valid output types or constraints, and is well aligned to the Context.
22
 
23
+ Whether the Context is highly relevant, internally consistent, sufficiently rich in biological detail, and presented in a way that supports fine-grained inference or analysis.
24
 
25
+ Whether the Response consists of output that is biologically valid, mechanistically sound, non-redundant, free of trivialities, contradictions, and generic phrasing, and directly grounded in the Context.
26
 
27
  Scoring Guide (0–10 scale):
 
28
 
29
+ Score 10 if all of the following are true:
30
 
31
+ Prompt precisely defines a biologically specific research objective, explicitly frames the agent's role, delineates valid output types or constraints, and is well aligned to the Context.
32
+
33
+ Context is highly relevant, internally consistent, sufficiently rich in biological detail, and presented in a way that supports fine-grained inference or analysis.
34
+
35
+ Response consists of output that is biologically valid, mechanistically sound, non-redundant, free of trivialities, contradictions, and generic phrasing, and directly grounded in the Context.
36
 
 
37
 
38
  Lower scores if:
39
 
40
+ Prompt does not clearly define a biologically specific objective, fails to frame the agent’s role or valid outputs, and is misaligned with the context.
41
+
42
+ Context is irrelevant, inconsistent, lacking biological detail, or presented in a way that hinders meaningful analysis.
43
 
44
+ Response includes output that is biologically invalid, mechanistically flawed, redundant, trivial, contradictory, or generic, and not clearly grounded in the context.
45
 
 
46
 
47
  Your output must begin with:
48
  Score:
 
50
  Score: and Reasoning:
51
  No extra commentary, no markdown, no explanations before or after.
52
  Think step by step
 
53
  '''
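Every evaluator prompt above ends with the same output contract: the reply must begin with Score: and contain only the Score: and Reasoning: fields. A minimal parser sketch for that contract is shown below; the helper name parse_score_reasoning and its fallback behaviour are assumptions for illustration, not part of this commit.

import re

def parse_score_reasoning(reply: str):
    # Hypothetical helper: split a reply of the form
    # "Score: <number> Reasoning: <free text>" into its two fields.
    match = re.search(r"Score:\s*([0-9]*\.?[0-9]+)\s*Reasoning:\s*(.*)", reply, flags=re.DOTALL)
    if not match:
        # Reply violated the contract; keep the raw text for inspection.
        return None, reply.strip()
    return float(match.group(1)), match.group(2).strip()

# Example: parse_score_reasoning("Score: 8.5 Reasoning: grounded in the Context")
# -> (8.5, "grounded in the Context")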
54
 
55
+ SYSTEM_FOR_CONTEXTUAL_RELEVANCE_ALIGNMENT=f'''
56
  Task:
57
+ Evaluate how well the {agenttype} Response addresses the specific Prompt by leveraging the provided Context on a 0–10 continuous scale.
58
 
59
  Goal:
60
  Assess:
61
 
62
+ Whether the Prompt is precisely tailored to the Context, clearly sets expectations, and aligns with the scope of valid outputs.
63
 
64
+ Whether the Context is highly relevant, biologically rich, and sufficient to enable effective fulfillment of the Prompt.
65
 
66
+ Whether the Response directly and comprehensively utilizes the Context to fulfill the Prompt’s objective, without deviating or introducing irrelevant information.
67
 
68
+ Scoring Guide (0–10 scale):
 
69
 
70
+ Score 10 if all of the following are true:
71
 
72
+ Prompt is precisely tailored to the Context, setting clear, biologically specific expectations and constraints for the agent.
73
 
74
+ Context is sufficient, relevant, and complete, directly supporting the generation of appropriate output.
75
 
76
+ Response directly addresses the Prompt, utilizing the Context to comprehensively satisfy the Prompt’s expectations with no deviation or irrelevant information.
77
+
78
+ Lower scores if:
79
 
80
+ Prompt is not tailored to the Context, lacks clear, biologically specific expectations, or fails to set appropriate constraints for the agent.
81
 
82
+ Context is insufficient, irrelevant, or incomplete, failing to support the generation of appropriate output.
83
 
84
+ Response does not directly address the Prompt, fails to utilize the Context effectively, and includes deviations or irrelevant information that do not satisfy the Prompt’s expectations.
85
 
86
  Your output must begin with:
87
  Score:
 
89
  Score: and Reasoning:
90
  No extra commentary, no markdown, no explanations before or after.
91
  Think step by step
92
+
93
  '''
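This commit moves the rubric from a 0–1 to a 0–10 scale. If anything downstream still expects the old 0–1 range, a small normalization step could map the new scores back; the helper below is purely illustrative and not part of llmeval.py.

def normalize_score(score_0_10: float) -> float:
    # Hypothetical: clamp to the 0-10 rubric range, then rescale to 0-1.
    clamped = max(0.0, min(10.0, score_0_10))
    return clamped / 10.0

# Example: normalize_score(8.5) -> 0.85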
94
 
95
 
96
+ SYSTEM_PROMPT_FOR_TRIAD_COHERENCE=f'''
97
  Task:
98
+ Evaluate the logical and semantic coherence of the Prompt, Context, and Response of the {agenttype} Agent as a unified set on a 0–10 continuous scale.
99
 
100
  Goal:
101
  Assess:
102
 
103
+ Whether the Prompt is logically consistent with the provided Context, setting a clear, biologically grounded framework for the Response.
104
 
105
+ Whether the Response logically and semantically follows from both the Prompt and provided Context, without contradictions or unsupported claims.
106
 
107
+ Whether there are gaps, contradictions, or misalignments among the Prompt, Context and the Response that affect the overall coherence.
108
 
109
+ Scoring Guide (0–10 scale):
 
110
 
111
+ Score 10 if all are true:
112
 
113
+ The Prompt is logically coherent with the Context, clearly framing the research objectives.
114
 
115
+ The Response seamlessly builds on the Prompt and the Context, maintaining consistency without contradiction or ambiguity.
116
 
117
+ All elements form a logically unified and semantically sound narrative, with no gaps or contradictions between them.
118
 
119
+ Lower scores if:
120
+ The Prompt is not logically coherent with the Context, failing to clearly frame the research objectives.
121
+ The Response does not seamlessly build on the Prompt and the Context, introducing contradictions or ambiguity.
122
+ The elements do not form a logically unified or semantically sound narrative, containing gaps or contradictions between them.
 
123
 
124
  Your output must begin with:
125
  Score:
126
  and contain only two fields:
127
  Score: and Reasoning:
128
  No extra commentary, no markdown, no explanations before or after.
 
129
  Think step by step
130
+
131
  '''
132
 
133
 
134
+ SYSTEM_PROMPT_FOR_RESPONSE_SPECIFICITY=f'''
135
  Task:
136
+ Evaluate how focused, detailed, and context-aware the {agenttype} Response is with respect to the Prompt and Context on a 0–10 continuous scale.
137
 
138
  Goal:
139
  Assess:
140
 
141
+ Whether the Response is highly specific and precisely targeted to the Prompt, addressing the research objectives without deviation.
142
 
143
+ Whether the Response includes sufficient, detailed insights directly drawn from the Context, ensuring relevance and biological accuracy.
144
 
145
+ Whether the Response avoids vagueness and overly generic statements, providing only relevant, factually grounded content.
146
 
147
+ Scoring Guide (0–10 scale):
 
148
 
149
+ Score 10 if all are true:
150
 
151
+ The Response is exceptionally specific to the Prompt, addressing every aspect with precision and detail.
152
 
153
+ The Response draws clear, biologically grounded, and highly detailed insights from the Context, ensuring all claims are backed by relevant data.
154
 
155
+ No generic, irrelevant, or off-topic content is present, and every statement is purposeful and directly tied to the research objectives.
156
 
157
+ Lower scores if:
158
 
159
+ The Response is not specific to the Prompt, failing to address important aspects with precision or detail.
160
+ The Response does not draw clear, biologically grounded, or detailed insights from the Context, and many claims are not supported by relevant data.
161
 
162
+ The Response contains generic, irrelevant, or off-topic content, and many statements are not purposeful or aligned with the research objectives.
163
 
164
  Your output must begin with:
165
  Score:
 
179
  def ___engine_core(self,messages):
180
 
181
  completion = client.chat.completions.create(
182
+ model="deepseek-r1-distill-llama-70b",
183
  messages=messages,
184
  temperature=0.0,
185
+ max_completion_tokens=6000,
186
+ #top_p=1,
187
  stream=False,
188
  stop=None,
189
  )
190
  actual_message=completion.choices[0].message.content
191
+ return re.sub(r"<think>.*?</think>", "", actual_message, flags=re.DOTALL).strip()
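The new ___engine_core targets deepseek-r1-distill-llama-70b, which wraps its chain of thought in <think>...</think> tags, so the reply is cleaned with re.sub before being returned. A quick illustration of that cleanup on an invented reply:

import re

raw = "<think>weigh the evidence...\nacross observations</think>Score: 7.5\nReasoning: consistent with the Context"
cleaned = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL).strip()
# cleaned == "Score: 7.5\nReasoning: consistent with the Context"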
192
 
193
 
194
 
195
+ def Observation_LLM_Evaluator(self,promptversion):
196
 
197
+ metrics=["biological_context_alignment","contextual_relevance_alignment","coherence","response_specificity"]
198
+
199
+ data_to_evaluate=de.GetData(promptversion)
200
+ import time
201
+
202
+ for metric in metrics:
203
+
204
+ messages =[
205
 
206
+ {"role":"system","content":SYSTEM_FOR_BIO_CONTEXT_ALIGNMENT},
207
+ {"role":"user","content":f"""
208
  Prompt :{data_to_evaluate["prompt"]}
209
+ Context :{data_to_evaluate["context"]}
210
  Agent's Response : {data_to_evaluate["response"]}
211
+ """}
212
+ ]
213
+ evaluation_response=self.___engine_core(messages=messages)
214
+
215
+ data={
216
+
217
  "promptversion":promptversion,
218
+ "biological_context_alignment":"",
219
+ "contextual_relevance_alignment":"",
220
+ "unit_coherence":"",
221
+ "response_specificity":""
222
  }
223
+ de.Update(data=data)
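A per-metric variant of this loop would look up the system prompt that matches each metric and write every evaluation back under its own column. The sketch below is a hypothetical alternative body, shown as it would sit inside LLM_as_Evaluator; it assumes the four prompt constants defined above and the DatabaseEngine instance de are in scope, and that de.Update merges a partial record keyed by promptversion.

def Observation_LLM_Evaluator_per_metric(self, promptversion):
    # Hypothetical sketch, not the committed implementation.
    metric_prompts = {
        "biological_context_alignment": SYSTEM_FOR_BIO_CONTEXT_ALIGNMENT,
        "contextual_relevance_alignment": SYSTEM_FOR_CONTEXTUAL_RELEVANCE_ALIGNMENT,
        "unit_coherence": SYSTEM_PROMPT_FOR_TRIAD_COHERENCE,
        "response_specificity": SYSTEM_PROMPT_FOR_RESPONSE_SPECIFICITY,
    }
    data_to_evaluate = de.GetData(promptversion)
    record = {"promptversion": promptversion}
    for metric, system_prompt in metric_prompts.items():
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"""
            Prompt :{data_to_evaluate["prompt"]}
            Context :{data_to_evaluate["context"]}
            Agent's Response : {data_to_evaluate["response"]}
            """},
        ]
        # One LLM call per metric; each reply is stored under its own key.
        record[metric] = self.___engine_core(messages=messages)
    de.Update(data=record)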
224
 
 
225
 
226