Space: thexForce/guard

Junaidb committed on Apr 30

Commit 097333a (verified) · 1 Parent(s): 2fb73dc

Update llmeval.py

Files changed (1)

llmeval.py +182 -55

@@ -10,6 +10,164 @@ de=DatabaseEngine()
 class LLM_as_Evaluator():
     def __init__(self):
@@ -94,61 +252,30 @@ Think step by step
         de.Update(data=data)
-    def Observation_LLM_Evaluator(self,promptversion):
-        SYSTEM='''
-Task:
-Evaluate the biological quality of a Prompt, Research Data, and Response from an Observations Generator Agent on a 0–1 continuous scale.
-Goal:
-Assess:
-Whether the Prompt clearly defines the research context and specifies the scope of valid biological observations.
-Whether the Research Data is relevant, biologically meaningful, and sufficient to support observation generation.
-Whether the Response includes observations that are biologically plausible, factually grounded, and consistent with the Research Data.
-Scoring Guide (0–1 continuous scale):
-Score 1.0 if:
-Prompt is clear, biologically specific, and well-aligned to the data context.
-Research Data is relevant, complete, and interpretable in a biological sense.
-Response consists of multiple observations that are biologically valid, non-redundant, and directly grounded in the data.
-Lower scores if:
-The prompt is vague, generic, or misaligned to the data.
-The research data is noisy, irrelevant, incomplete, or non-biological.
-The response includes irrelevant, biologically implausible, contradictory, or trivial observations.
-Your output must begin with:
-Score:
-and contain only two fields:
-Score: and Reasoning:
-No extra commentary, no markdown, no explanations before or after.
-Think step by step
-'''
-        data_to_evaluate=de.GetData(promptversion)
-        messages =[
-            {"role":"system","content":SYSTEM},
-            {"role":"user","content":f"""
-            Prompt :{data_to_evaluate["prompt"]}
-            Research Data :{data_to_evaluate["context"]}
-            Agent's Response : {data_to_evaluate["response"]}
-            """}
-        ]
-        evaluation_response=self.___engine_core(messages=messages)
-        data={
-            "promptversion":promptversion,
-            "biological_context_alignment":evaluation_response
-            }
-        de.Update(data=data)

+SYSTEM_FOR_BIO_CONTEXT_EVAL_FOR_OBSERVATION='''
+Task:
+Evaluate the biological quality of a Prompt, Research Data, and Response from an Observations Generator Agent on a 0–1 continuous scale.
+Goal:
+Assess:
+Whether the Prompt clearly defines the research context and specifies the scope of valid biological observations.
+Whether the Research Data is relevant, biologically meaningful, and sufficient to support observation generation.
+Whether the Response includes observations that are biologically plausible, factually grounded, and consistent with the Research Data.
+Scoring Guide (0–1 continuous scale):
+Score 1.0 if:
+Prompt is clear, biologically specific, and well-aligned to the data context.
+Research Data is relevant, complete, and interpretable in a biological sense.
+Response consists of multiple observations that are biologically valid, non-redundant, and directly grounded in the data.
+Lower scores if:
+The prompt is vague, generic, or misaligned to the data.
+The research data is noisy, irrelevant, incomplete, or non-biological.
+The response includes irrelevant, biologically implausible, contradictory, or trivial observations.
+Your output must begin with:
+Score:
+and contain only two fields:
+Score: and Reasoning:
+No extra commentary, no markdown, no explanations before or after.
+Think step by step
+'''
+SYSTEM_FOR_CONTEXTUAL_RELEVANCE_ALIGNMENT='''
+Task:
+Evaluate how well the Observation Generator agent’s Response addresses the specific Prompt by leveraging the provided Context on a 0–1 continuous scale.
+Goal:
+Assess:
+Whether the Prompt clearly sets expectations aligned with the Context.
+Whether the Context supplies appropriate information to fulfill the Prompt.
+Whether the Response directly responds to the Prompt using relevant Context details.
+Scoring Guide (0–1 continuous scale):
+Score 1.0 if:
+The Prompt is precisely tailored to the Context.
+The Context is sufficient and pertinent.
+The Response directly and comprehensively leverages the Context to satisfy the Prompt.
+Lower scores if:
+Prompt is misaligned or too generic.
+Context is irrelevant, incomplete, or off-topic.
+Response fails to use Context or deviates from Prompt intent.
+Your output must begin with:
+Score:
+and contain only two fields:
+Score: and Reasoning:
+No extra commentary, no markdown, no explanations before or after.
+Think step by step
+'''
+SYSTEM_PROMPT_FOR_COHERENCE='''
+Task:
+Evaluate the logical and semantic coherence of the Prompt, one or more provided Contexts, and the Response as a unified set on a 0–1 continuous scale.
+Goal:
+Assess:
+Whether the Prompt logically fits each of the provided Contexts.
+Whether the Response logically follows from both the Prompt and all provided Contexts.
+Whether there are any gaps, contradictions, or misalignments among the Prompt, the Contexts, and the Response.
+Scoring Guide (0–1 continuous scale):
+Score 1.0 if:
+The Prompt is coherent with every Context.
+The Response seamlessly builds on the Prompt and all Contexts without contradiction.
+All elements form a consistent, unified narrative.
+Lower scores if:
+The Prompt and one or more Contexts are disjointed.
+The Response introduces contradictions or unsupported claims relative to any Context.
+Logical or semantic gaps exist between the Prompt, any Context, or the Response.
+Your output must begin with:
+Score:
+and contain only two fields:
+Score: and Reasoning:
+No extra commentary, no markdown, no explanations before or after.
+Think step by step
+'''
+SYSTEM_PROMPT_FOR_RESPONSE_SPECIFICITY='''
+Task:
+Evaluate how focused, detailed, and context-aware the agent’s Response is with respect to the Prompt and Context on a 0–1 continuous scale.
+Goal:
+Assess:
+Whether the Response provides precise answers targeted to the Prompt.
+Whether it includes sufficient detail drawn from the Context.
+Whether it avoids vagueness or overly generic statements.
+Scoring Guide (0–1 continuous scale):
+Score 1.0 if:
+The Response is highly specific to the Prompt.
+It draws clear, detailed insights from the Context.
+No generic or irrelevant content is present.
+Lower scores if:
+Response is vague or broad.
+Lacks detail or context grounding.
+Contains filler or off-topic information.
+Your output must begin with:
+Score:
+and contain only two fields:
+Score: and Reasoning:
+No extra commentary, no markdown, no explanations before or after.
+Think step by step
+'''
 class LLM_as_Evaluator():
     def __init__(self):
         de.Update(data=data)
+    def Observation_LLM_Evaluator(self,metric,promptversion):
+        match metric:
+            case "biological_context_alignment":
+                data_to_evaluate=de.GetData(promptversion)
+                messages =[
+                    {"role":"system","content":SYSTEM_FOR_BIO_CONTEXT_EVAL_FOR_OBSERVATION},
+                    {"role":"user","content":f"""
+                    Prompt :{data_to_evaluate["prompt"]}
+                    Research Data :{data_to_evaluate["context"]}
+                    Agent's Response : {data_to_evaluate["response"]}
+                """}
+                ]
+                evaluation_response=self.___engine_core(messages=messages)
+                data={
+                    "promptversion":promptversion,
+                    "biological_context_alignment":evaluation_response
+                }
+                de.Update(data=data)
+            case "contextual_relevance_alignment":
+                    pass
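
Note on the new metric dispatch: the commit defines four system prompts, but only the "biological_context_alignment" branch is wired up so far. The sketch below is one possible way to finish it, not the author's implementation. It maps each metric name to its prompt in a dictionary so that every branch shares the same message-building code, and adds a small parser for the "Score:" / "Reasoning:" reply format that all four prompts request. The names SYSTEM_PROMPTS, build_messages, and parse_evaluation are hypothetical, as are the metric keys "coherence" and "response_specificity" (the diff only names the first two). The prompt strings are placeholders standing in for the module-level constants.

```python
import re

# Hypothetical lookup table. In llmeval.py the values would be the
# module-level constants (SYSTEM_FOR_BIO_CONTEXT_EVAL_FOR_OBSERVATION, ...);
# placeholder strings are used here so the sketch runs on its own.
SYSTEM_PROMPTS = {
    "biological_context_alignment": "<bio context prompt>",
    "contextual_relevance_alignment": "<contextual relevance prompt>",
    "coherence": "<coherence prompt>",                        # assumed key
    "response_specificity": "<response specificity prompt>",  # assumed key
}


def build_messages(metric, record):
    """Build the chat messages for one metric from a stored record."""
    try:
        system = SYSTEM_PROMPTS[metric]
    except KeyError:
        raise ValueError(f"unknown metric: {metric!r}") from None
    user = (
        f"Prompt: {record['prompt']}\n"
        f"Research Data: {record['context']}\n"
        f"Agent's Response: {record['response']}"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]


_SCORE_RE = re.compile(r"Score:\s*([0-9]*\.?[0-9]+)")
_REASONING_RE = re.compile(r"Reasoning:\s*(.*)", re.DOTALL)


def parse_evaluation(text):
    """Extract (score, reasoning) from a 'Score: ... Reasoning: ...' reply."""
    score_match = _SCORE_RE.search(text)
    if score_match is None:
        raise ValueError("no 'Score:' field found in evaluator reply")
    score = float(score_match.group(1))
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score {score} is outside the 0-1 range")
    reasoning_match = _REASONING_RE.search(text)
    reasoning = reasoning_match.group(1).strip() if reasoning_match else ""
    return score, reasoning
```

With a table like this, Observation_LLM_Evaluator could call build_messages(metric, de.GetData(promptversion)) once and store the result under the metric name, so adding a metric would only require a new dictionary entry. Parsing the reply before storing it would also catch evaluator outputs that ignore the requested format.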