Junaidb committed
Commit 097333a · verified · 1 Parent(s): 2fb73dc

Update llmeval.py

Files changed (1)
  1. llmeval.py +182 -55
llmeval.py CHANGED
@@ -10,6 +10,164 @@ de=DatabaseEngine()



+
+ SYSTEM_FOR_BIO_CONTEXT_EVAL_FOR_OBSERVATION='''
+ Task:
+ Evaluate the biological quality of a Prompt, Research Data, and Response from an Observations Generator Agent on a 0–1 continuous scale.
+
+ Goal:
+ Assess:
+
+ Whether the Prompt clearly defines the research context and specifies the scope of valid biological observations.
+
+ Whether the Research Data is relevant, biologically meaningful, and sufficient to support observation generation.
+
+ Whether the Response includes observations that are biologically plausible, factually grounded, and consistent with the Research Data.
+
+ Scoring Guide (0–1 continuous scale):
+ Score 1.0 if:
+
+ Prompt is clear, biologically specific, and well-aligned to the data context.
+
+ Research Data is relevant, complete, and interpretable in a biological sense.
+
+ Response consists of multiple observations that are biologically valid, non-redundant, and directly grounded in the data.
+
+ Lower scores if:
+
+ The prompt is vague, generic, or misaligned to the data.
+
+ The research data is noisy, irrelevant, incomplete, or non-biological.
+
+ The response includes irrelevant, biologically implausible, contradictory, or trivial observations.
+
+ Your output must begin with:
+ Score:
+ and contain only two fields:
+ Score: and Reasoning:
+ No extra commentary, no markdown, no explanations before or after.
+ Think step by step
+
+ '''
+
+ SYSTEM_FOR_CONTEXTUAL_RELEVANCE_ALIGNMENT='''
+ Task:
+ Evaluate how well the Observation Generator agent’s Response addresses the specific Prompt by leveraging the provided Context on a 0–1 continuous scale.
+
+ Goal:
+ Assess:
+
+ Whether the Prompt clearly sets expectations aligned with the Context.
+
+ Whether the Context supplies appropriate information to fulfill the Prompt.
+
+ Whether the Response directly responds to the Prompt using relevant Context details.
+
+ Scoring Guide (0–1 continuous scale):
+ Score 1.0 if:
+
+ The Prompt is precisely tailored to the Context.
+
+ The Context is sufficient and pertinent.
+
+ The Response directly and comprehensively leverages the Context to satisfy the Prompt.
+
+ Lower scores if:
+
+ Prompt is misaligned or too generic.
+
+ Context is irrelevant, incomplete, or off-topic.
+
+ Response fails to use Context or deviates from Prompt intent.
+
+ Your output must begin with:
+ Score:
+ and contain only two fields:
+ Score: and Reasoning:
+ No extra commentary, no markdown, no explanations before or after.
+ Think step by step
+ '''
+
+
+ SYSTEM_PROMPT_FOR_COHERENCE='''
+ Task:
+ Evaluate the logical and semantic coherence of the Prompt, one or more provided Contexts, and the Response as a unified set on a 0–1 continuous scale.
+
+ Goal:
+ Assess:
+
+ Whether the Prompt logically fits each of the provided Contexts.
+
+ Whether the Response logically follows from both the Prompt and all provided Contexts.
+
+ Whether there are any gaps, contradictions, or misalignments among the Prompt, the Contexts, and the Response.
+
+ Scoring Guide (0–1 continuous scale):
+ Score 1.0 if:
+
+ The Prompt is coherent with every Context.
+
+ The Response seamlessly builds on the Prompt and all Contexts without contradiction.
+
+ All elements form a consistent, unified narrative.
+
+ Lower scores if:
+
+ The Prompt and one or more Contexts are disjointed.
+
+ The Response introduces contradictions or unsupported claims relative to any Context.
+
+ Logical or semantic gaps exist between the Prompt, any Context, or the Response.
+
+ Your output must begin with:
+ Score:
+ and contain only two fields:
+ Score: and Reasoning:
+ No extra commentary, no markdown, no explanations before or after.
+
+ Think step by step
+ '''
+
+
+ SYSTEM_PROMPT_FOR_RESPONSE_SPECIFICITY='''
+ Task:
+ Evaluate how focused, detailed, and context-aware the agent’s Response is with respect to the Prompt and Context on a 0–1 continuous scale.
+
+ Goal:
+ Assess:
+
+ Whether the Response provides precise answers targeted to the Prompt.
+
+ Whether it includes sufficient detail drawn from the Context.
+
+ Whether it avoids vagueness or overly generic statements.
+
+ Scoring Guide (0–1 continuous scale):
+ Score 1.0 if:
+
+ The Response is highly specific to the Prompt.
+
+ It draws clear, detailed insights from the Context.
+
+ No generic or irrelevant content is present.
+
+ Lower scores if:
+
+ Response is vague or broad.
+
+ Lacks detail or context grounding.
+
+ Contains filler or off-topic information.
+
+ Your output must begin with:
+ Score:
+ and contain only two fields:
+ Score: and Reasoning:
+ No extra commentary, no markdown, no explanations before or after.
+ Think step by step
+
+ '''
+
 class LLM_as_Evaluator():

     def __init__(self):
@@ -94,61 +252,30 @@ Think step by step
         de.Update(data=data)


-     def Observation_LLM_Evaluator(self,promptversion):
-         SYSTEM='''
-         Task:
-         Evaluate the biological quality of a Prompt, Research Data, and Response from an Observations Generator Agent on a 0–1 continuous scale.
-
-         Goal:
-         Assess:
-
-         Whether the Prompt clearly defines the research context and specifies the scope of valid biological observations.
-
-         Whether the Research Data is relevant, biologically meaningful, and sufficient to support observation generation.
-
-         Whether the Response includes observations that are biologically plausible, factually grounded, and consistent with the Research Data.
-
-         Scoring Guide (0–1 continuous scale):
-         Score 1.0 if:
-
-         Prompt is clear, biologically specific, and well-aligned to the data context.
-
-         Research Data is relevant, complete, and interpretable in a biological sense.
-
-         Response consists of multiple observations that are biologically valid, non-redundant, and directly grounded in the data.
-
-         Lower scores if:
-
-         The prompt is vague, generic, or misaligned to the data.
-
-         The research data is noisy, irrelevant, incomplete, or non-biological.
-
-         The response includes irrelevant, biologically implausible, contradictory, or trivial observations.
-
-         Your output must begin with:
-         Score:
-         and contain only two fields:
-         Score: and Reasoning:
-         No extra commentary, no markdown, no explanations before or after.
-         Think step by step
-
-         '''
-         data_to_evaluate=de.GetData(promptversion)
-         messages =[
-
-             {"role":"system","content":SYSTEM},
-             {"role":"user","content":f"""
-             Prompt :{data_to_evaluate["prompt"]}
-             Research Data :{data_to_evaluate["context"]}
-             Agent's Response : {data_to_evaluate["response"]}
-             """}
-         ]
-         evaluation_response=self.___engine_core(messages=messages)
-         data={
-             "promptversion":promptversion,
-             "biological_context_alignment":evaluation_response
-         }
-         de.Update(data=data)
+     def Observation_LLM_Evaluator(self,metric,promptversion):
+         match metric:
+             case "biological_context_alignment":
+
+
+                 data_to_evaluate=de.GetData(promptversion)
+                 messages =[
+
+                     {"role":"system","content":SYSTEM_FOR_BIO_CONTEXT_EVAL_FOR_OBSERVATION},
+                     {"role":"user","content":f"""
+                     Prompt :{data_to_evaluate["prompt"]}
+                     Research Data :{data_to_evaluate["context"]}
+                     Agent's Response : {data_to_evaluate["response"]}
+                     """}
+                 ]
+                 evaluation_response=self.___engine_core(messages=messages)
+                 data={
+                     "promptversion":promptversion,
+                     "biological_context_alignment":evaluation_response
+                 }
+                 de.Update(data=data)
+
+             case "contextual_relevance_alignment":
+                 pass


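Note: the refactored Observation_LLM_Evaluator only implements the "biological_context_alignment" metric; the "contextual_relevance_alignment" case is still a pass, and the coherence and specificity prompts are not yet wired up. Below is a minimal sketch (not part of this commit) of how the remaining branches could reuse the new module-level prompts and the existing helpers (de.GetData, de.Update, self.___engine_core). The dict-based dispatch and the "coherence" / "response_specificity" metric keys are assumptions for illustration only.

# Hypothetical sketch, not in this commit: the remaining metric cases following the
# pattern of the implemented branch. de.GetData, de.Update and ___engine_core come
# from llmeval.py; the _SYSTEM_PROMPTS mapping and the extra metric keys are assumed.
class LLM_as_Evaluator():            # abbreviated; __init__ and ___engine_core as in the file

    _SYSTEM_PROMPTS = {
        "biological_context_alignment": SYSTEM_FOR_BIO_CONTEXT_EVAL_FOR_OBSERVATION,
        "contextual_relevance_alignment": SYSTEM_FOR_CONTEXTUAL_RELEVANCE_ALIGNMENT,
        "coherence": SYSTEM_PROMPT_FOR_COHERENCE,               # assumed metric name
        "response_specificity": SYSTEM_PROMPT_FOR_RESPONSE_SPECIFICITY,  # assumed metric name
    }

    def Observation_LLM_Evaluator(self, metric, promptversion):
        system_prompt = self._SYSTEM_PROMPTS[metric]   # raises KeyError for an unknown metric
        data_to_evaluate = de.GetData(promptversion)
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"""
            Prompt :{data_to_evaluate["prompt"]}
            Research Data :{data_to_evaluate["context"]}
            Agent's Response : {data_to_evaluate["response"]}
            """},
        ]
        evaluation_response = self.___engine_core(messages=messages)
        # Store the evaluation under the metric's own key instead of a hard-coded column.
        de.Update(data={"promptversion": promptversion, metric: evaluation_response})

# Example call (promptversion value is illustrative only):
# LLM_as_Evaluator().Observation_LLM_Evaluator("contextual_relevance_alignment", "v1")

Keeping the match statement as committed works equally well; the mapping simply avoids repeating the message-building block once per metric.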