Spaces: thexForce/guard

Junaidb committed on May 1

Commit 85b3ca7 (verified) · 1 Parent(s): 097333a

Update llmeval.py

Files changed (1)

llmeval.py +87 -201

llmeval.py CHANGED

@@ -11,35 +11,38 @@ de=DatabaseEngine()
-SYSTEM_FOR_BIO_CONTEXT_EVAL_FOR_OBSERVATION='''
 Task:
-Evaluate the biological quality of a Prompt, Research Data, and Response from an Observations Generator Agent on a 0–1 continuous scale.
 Goal:
 Assess:
-Whether the Prompt clearly defines the research context and specifies the scope of valid biological observations.
-Whether the Research Data is relevant, biologically meaningful, and sufficient to support observation generation.
-Whether the Response includes observations that are biologically plausible, factually grounded, and consistent with the Research Data.
 Scoring Guide (0–1 continuous scale):
-Score 1.0 if:
-Prompt is clear, biologically specific, and well-aligned to the data context.
-Research Data is relevant, complete, and interpretable in a biological sense.
-Response consists of multiple observations that are biologically valid, non-redundant, and directly grounded in the data.
 Lower scores if:
-The prompt is vague, generic, or misaligned to the data.
-The research data is noisy, irrelevant, incomplete, or non-biological.
-The response includes irrelevant, biologically implausible, contradictory, or trivial observations.
 Your output must begin with:
 Score:
@@ -47,38 +50,38 @@ and contain only two fields:
 Score: and Reasoning:
 No extra commentary, no markdown, no explanations before or after.
 Think step by step
 '''
-SYSTEM_FOR_CONTEXTUAL_RELEVANCE_ALIGNMENT='''
 Task:
-Evaluate how well the Observation Generator agent’s Response addresses the specific Prompt by leveraging the provided Context on a 0–1 continuous scale.
 Goal:
 Assess:
-Whether the Prompt clearly sets expectations aligned with the Context.
-Whether the Context supplies appropriate information to fulfill the Prompt.
-Whether the Response directly responds to the Prompt using relevant Context details.
-Scoring Guide (0–1 continuous scale):
-Score 1.0 if:
-The Prompt is precisely tailored to the Context.
-The Context is sufficient and pertinent.
-The Response directly and comprehensively leverages the Context to satisfy the Prompt.
-Lower scores if:
-Prompt is misaligned or too generic.
-Context is irrelevant, incomplete, or off-topic.
-Response fails to use Context or deviates from Prompt intent.
 Your output must begin with:
 Score:
@@ -86,78 +89,77 @@ and contain only two fields:
 Score: and Reasoning:
 No extra commentary, no markdown, no explanations before or after.
 Think step by step
 '''
-SYSTEM_PROMPT_FOR_COHERENCE='''
 Task:
-Evaluate the logical and semantic coherence of the Prompt, one or more provided Contexts, and the Response as a unified set on a 0–1 continuous scale.
 Goal:
 Assess:
-Whether the Prompt logically fits each of the provided Contexts.
-Whether the Response logically follows from both the Prompt and all provided Contexts.
-Whether there are any gaps, contradictions, or misalignments among the Prompt, the Contexts, and the Response.
-Scoring Guide (0–1 continuous scale):
-Score 1.0 if:
-The Prompt is coherent with every Context.
-The Response seamlessly builds on the Prompt and all Contexts without contradiction.
-All elements form a consistent, unified narrative.
-Lower scores if:
-The Prompt and one or more Contexts are disjointed.
-The Response introduces contradictions or unsupported claims relative to any Context.
-Logical or semantic gaps exist between the Prompt, any Context, or the Response.
 Your output must begin with:
 Score:
 and contain only two fields:
 Score: and Reasoning:
 No extra commentary, no markdown, no explanations before or after.
 Think step by step
 '''
-SYSTEM_PROMPT_FOR_RESPONSE_SPECIFICITY='''
 Task:
-Evaluate how focused, detailed, and context-aware the agent’s Response is with respect to the Prompt and Context on a 0–1 continuous scale.
 Goal:
 Assess:
-Whether the Response provides precise answers targeted to the Prompt.
-Whether it includes sufficient detail drawn from the Context.
-Whether it avoids vagueness or overly generic statements.
-Scoring Guide (0–1 continuous scale):
-Score 1.0 if:
-The Response is highly specific to the Prompt.
-It draws clear, detailed insights from the Context.
-No generic or irrelevant content is present.
-Lower scores if:
-Response is vague or broad.
-Lacks detail or context grounding.
-Contains filler or off-topic information.
 Your output must begin with:
 Score:
@@ -177,164 +179,48 @@ class LLM_as_Evaluator():
     def ___engine_core(self,messages):
         completion = client.chat.completions.create(
-            model="llama3-8b-8192",
             messages=messages,
             temperature=0.0,
-            max_completion_tokens=5000,
-            top_p=1,
             stream=False,
             stop=None,
             )
         actual_message=completion.choices[0].message.content
-        return actual_message
-        #cleaned_json=re.sub(r"```(?:json)?\s*(.*?)\s*```", r"\1", actual_message, flags=re.DOTALL).strip()
-        #is_json_like = cleaned_json.strip().startswith("{") and cleaned_json.strip().endswith("}")
-        #if is_json_like==True:
-            #return cleaned_json
-        #else:
-            #return "FATAL"
-    def Paradigm_LLM_Evaluator(self,promptversion):
-        SYSTEM='''
-Task:
-Evaluate the biological quality of a prompt, research data, paradigm list, and response on a 0–1 continuous scale.
-Goal:
-Assess:
-Whether the Prompt is clear, biologically specific, and aligned with the Research Data and the Paradigm List.
-Whether the response is biologically relevant, mechanistically coherent, and experimentally actionable based on the Research Data.
-Whether the response is correctly chosen from the Paradigm List in light of the Research Data.
-Scoring Guide (0–1 continuous scale):
-Score 1.0 if:
-The Prompt is clear, biologically detailed, and well-aligned to the Research Data and Paradigm List.
-The response correctly reflects a biologically valid interpretation of the Research Data and is appropriately drawn from the Paradigm List.
-Lower scores if:
-The prompt is vague or misaligned with the research context.
-The response is biologically irrelevant, mechanistically incoherent, or mismatched with the Research Data.
-The paradigm is not the most plausible or supported choice from the Paradigm List.
-Your output must begin with Score: and contain only two fields: Score: and Reasoning:. No extra commentary, no markdown, no explanations before or after.:
-Think step by step
-'''
-        data_to_evaluate=dbe.GetData(promptversion)
-        messages=[
-            {"role":"system","content":SYSTEM},
-            {"role":"user","content":f"""
-            Prompt:{data_to_evaluate["prompt"]},
-            Research Data :{data_to_evaluate["context"]},
-            Agent's Response:{data_to_evaluate["response"]}
-            """}
-        ]
-        evaluation_response=self.___engine_core(messages=messages)
-        data={
-            "promptversion":promptversion,
-            "biological_context_alignment":evaluation_response
-            }
-        de.Update(data=data)
-    def Observation_LLM_Evaluator(self,metric,promptversion):
-        match metric:
-            case "biological_context_alignment":
-                data_to_evaluate=de.GetData(promptversion)
-                messages =[
-                    {"role":"system","content":SYSTEM_FOR_BIO_CONTEXT_EVAL_FOR_OBSERVATION},
-                    {"role":"user","content":f"""
                     Prompt :{data_to_evaluate["prompt"]}
-                    Research Data :{data_to_evaluate["context"]}
                     Agent's Response : {data_to_evaluate["response"]}
-                """}
-                ]
-                evaluation_response=self.___engine_core(messages=messages)
-                data={
                     "promptversion":promptversion,
-                    "biological_context_alignment":evaluation_response
                 }
-                de.Update(data=data)
-            case "contextual_relevance_alignment":
-                    pass
-    def  Anomaly_LLM_Evaluator(self,promptversion):
-        SYSTEM='''
-Task:
-Evaluate the biological quality of a prompt , observations , paradigms and response  from an Anomaly Detector Agent on a 0–1 continuous scale.
-Goal:
-Assess:
-Whether the Prompt clearly defines the biological context and intent.
-Whether the Observations are biologically plausible and internally consistent.
-Whether the Paradigms are plausible biological frameworks given the context.
-Whether the Response correctly identifies biologically relevant inconsistencies or contradictions between the Paradigms and the Observations.
-Scoring Guide (0–1 continuous scale):
-Score 1.0 if:
-The Prompt is clear, biologically grounded, and well-scoped.
-The Observations are plausible and logically consistent.
-The Response accurately identifies true anomalies—i.e., meaningful contradictions or gaps—between the Paradigms and the Observations.
-All major conflicts are captured.
-Lower scores if:
-The Prompt is vague or misaligned with the context.
-Observations are biologically implausible or incoherent.
-The Response overlooks key inconsistencies, includes irrelevant anomalies, or shows poor biological reasoning.
-Your output must begin with Score: and contain only two fields: Score: and Reasoning: No extra commentary, no markdown, no explanations before or after.
-Think step by step
-'''
-        data_to_evaluate=dbe.GetData(promptversion)
-        messages=[
-            {"role":"system","content":SYSTEM},
-            {"role":"user","content":f"""
-            Prompt :{data_to_evaluate["prompt"]}
-            Observations :{ data_to_evaluate["context"]}
-            Agent's Response :{data_to_evaluate["response"]}
-            """}
-        ]
-        evaluation_response=self.___engine_core(messages=messages)
-        data={
-            "promptversion":promptversion,
-            "biological_context_alignment":evaluation_response
-            }
-        de.Update(data=data)

+SYSTEM_FOR_BIO_CONTEXT_ALIGNMENT=f'''
 Task:
+Evaluate the biological quality of a Prompt, Context, and Response from an {agenttype} Agent on a 0–10 continuous scale.
 Goal:
 Assess:
+Whether Prompt precisely defines a biologically specific research objective, explicitly frames the agent's role, and delineates valid output types or constraints and is well aligned to the context.
+Whether Context is highly relevant, internally consistent, sufficiently rich in biological context, and presented in a way that supports fine-grained inference or analysis.
+Whether Response consists of output that is biologically valid, mechanistically sound, non-redundant, free from trivialities, contradictions, or generic phrasing, and directly grounded in the context.
 Scoring Guide (0–10 continuous scale):
+Score 10 if all of the following are true:
+Prompt precisely defines a biologically specific research objective, explicitly frames the agent's role, delineates valid output types or constraints, and is well aligned to the context.
+Context is highly relevant, internally consistent, sufficiently rich in biological context, and presented in a way that supports fine-grained inference or analysis.
+Response consists of output that is biologically valid, mechanistically sound, non-redundant, free from trivialities, contradictions, or generic phrasing, and directly grounded in the context.
 Lower scores if:
+Prompt does not clearly define a biologically specific objective, fails to frame the agent’s role or valid outputs, and is misaligned with the context.
+Context is irrelevant, inconsistent, lacking biological detail, or presented in a way that hinders meaningful analysis.
+Response includes output that is biologically invalid, mechanistically flawed, redundant, trivial, contradictory, or generic, and not clearly grounded in the context.
 Your output must begin with:
 Score:
 Score: and Reasoning:
 No extra commentary, no markdown, no explanations before or after.
 Think step by step
 '''
+SYSTEM_FOR_CONTEXTUAL_RELEVANCE_ALIGNMENT=f'''
 Task:
+Evaluate how well the {agenttype} Response addresses the specific Prompt by leveraging the provided Context on a 0–10 continuous scale.
 Goal:
 Assess:
+Whether the Prompt is precisely tailored to the Context, clearly sets expectations, and aligns with the scope of valid outputs.
+Whether the Context is highly relevant, biologically rich, and sufficient to enable effective fulfillment of the Prompt.
+Whether the Response directly and comprehensively utilizes the Context to fulfill the Prompt’s objective, without deviating or introducing irrelevant information.
+Scoring Guide (0–10 scale):
+Score 10 if all of the following are true:
+Prompt is precisely tailored to the Context, setting clear, biologically specific expectations and constraints for the agent.
+Context is sufficient, relevant, and complete, directly supporting the generation of appropriate output.
+Response directly addresses the Prompt, utilizing the Context to comprehensively satisfy the Prompt’s expectations with no deviation or irrelevant information.
+Low scores if:
+Prompt is not tailored to the Context, lacks clear, biologically specific expectations, and fails to set appropriate constraints for the agent.
+Context is insufficient, irrelevant, or incomplete, failing to support the generation of appropriate output.
+Response does not directly address the Prompt, fails to utilize the Context effectively, and includes deviations or irrelevant information that do not satisfy the Prompt’s expectations.
 Your output must begin with:
 Score:
 Score: and Reasoning:
 No extra commentary, no markdown, no explanations before or after.
 Think step by step
 '''
+SYSTEM_PROMPT_FOR_TRIAD_COHERENCE=f'''
 Task:
+Evaluate the logical and semantic coherence of the Prompt, Context, and Response of {agenttype} as a unified set on a 0–10 continuous scale.
 Goal:
 Assess:
+Whether the Prompt is logically consistent with the provided Context, setting a clear, biologically grounded framework for the Response.
+Whether the Response logically and semantically follows from both the Prompt and provided Context, without contradictions or unsupported claims.
+Whether there are gaps, contradictions, or misalignments among the Prompt, Context and the Response that affect the overall coherence.
+Scoring Guide (0–10 scale):
+Score 10 if all are true:
+The Prompt is logically coherent with the Context, clearly framing the research objectives.
+The Response seamlessly builds on the Prompt and the Context, maintaining consistency without contradiction or ambiguity.
+All elements form a logically unified and semantically sound narrative, with no gaps or contradictions between them.
+Low scores if:
+The Prompt is not logically coherent with the Context, failing to clearly frame the research objectives.
+The Response does not seamlessly build on the Prompt and the Context, introducing contradictions or ambiguity.
+The elements do not form a logically unified or semantically sound narrative, containing gaps or contradictions between them.
 Your output must begin with:
 Score:
 and contain only two fields:
 Score: and Reasoning:
 No extra commentary, no markdown, no explanations before or after.
 Think step by step
 '''
+SYSTEM_PROMPT_FOR_RESPONSE_SPECIFICITY=f'''
 Task:
+Evaluate how focused, detailed, and context-aware the {agenttype} Response is with respect to the Prompt and Context on a 0–10 continuous scale.
 Goal:
 Assess:
+Whether the Response is highly specific and precisely targeted to the Prompt, addressing the research objectives without deviation.
+Whether the Response includes sufficient, detailed insights directly drawn from the Context, ensuring relevance and biological accuracy.
+Whether the Response avoids vagueness, overly generic statements, and provides only relevant, factually grounded content.
+Scoring Guide (0–10 scale):
+Score 10 if all are true:
+The Response is exceptionally specific to the Prompt, addressing every aspect with precision and detail.
+The Response draws clear, biologically grounded, and highly detailed insights from the Context, ensuring all claims are backed by relevant data.
+No generic, irrelevant, or off-topic content is present, and every statement is purposeful and directly tied to the research objectives.
+Low scores if:
+The Response is not specific to the Prompt, failing to address important aspects with precision or detail.
+The Response does not draw clear, biologically grounded, or detailed insights from the Context, and many claims are not supported by relevant data.
+The Response contains generic, irrelevant, or off-topic content, and many statements are not purposeful or aligned with the research objectives.
 Your output must begin with:
 Score:
     def ___engine_core(self,messages):
         completion = client.chat.completions.create(
+            model="deepseek-r1-distill-llama-70b",
             messages=messages,
             temperature=0.0,
+            max_completion_tokens=6000,
+            #top_p=1,
             stream=False,
             stop=None,
             )
         actual_message=completion.choices[0].message.content
+        return re.sub(r"<think>.*?</think>", "", actual_message, flags=re.DOTALL).strip()
+    def Observation_LLM_Evaluator(self,promptversion):
+        # Pair each stored metric name with the system prompt that scores it.
+        metrics={
+            "biological_context_alignment":SYSTEM_FOR_BIO_CONTEXT_ALIGNMENT,
+            "contextual_relevance_alignment":SYSTEM_FOR_CONTEXTUAL_RELEVANCE_ALIGNMENT,
+            "unit_coherence":SYSTEM_PROMPT_FOR_TRIAD_COHERENCE,
+            "response_specificity":SYSTEM_PROMPT_FOR_RESPONSE_SPECIFICITY,
+        }
+        data_to_evaluate=de.GetData(promptversion)
+        for metric,system_prompt in metrics.items():
+            messages =[
+                {"role":"system","content":system_prompt},
+                {"role":"user","content":f"""
                     Prompt :{data_to_evaluate["prompt"]}
+                    Context :{data_to_evaluate["context"]}
                     Agent's Response : {data_to_evaluate["response"]}
+                    """}
+            ]
+            evaluation_response=self.___engine_core(messages=messages)
+            # Store the evaluator output under the metric just scored
+            # (assumes de.Update accepts a partial record).
+            data={
                     "promptversion":promptversion,
+                    metric:evaluation_response
                 }
+            de.Update(data=data)
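
The system prompts in this commit ask the evaluator to answer with exactly two fields, Score: and Reasoning:, and ___engine_core now strips <think>...</think> blocks before returning. The following is a minimal sketch of how that returned text could be turned into a numeric score on the 0–10 scale. The names parse_evaluation and Evaluation are hypothetical and are not part of llmeval.py, and treating out-of-range scores as unparseable is an assumption, not behaviour taken from the commit.

```python
import re
from typing import NamedTuple, Optional


class Evaluation(NamedTuple):
    score: Optional[float]  # None when no valid numeric score was found
    reasoning: str


# Same pattern the commit uses to drop reasoning-model scratch text.
_THINK_RE = re.compile(r"<think>.*?</think>", flags=re.DOTALL)
_SCORE_RE = re.compile(r"Score:\s*([0-9]+(?:\.[0-9]+)?)")
_REASONING_RE = re.compile(r"Reasoning:\s*(.*)", flags=re.DOTALL)


def parse_evaluation(raw: str, max_score: float = 10.0) -> Evaluation:
    """Hypothetical helper: strip <think> blocks, then read Score and Reasoning."""
    text = _THINK_RE.sub("", raw).strip()
    score_match = _SCORE_RE.search(text)
    reasoning_match = _REASONING_RE.search(text)

    score = float(score_match.group(1)) if score_match else None
    if score is not None and not 0.0 <= score <= max_score:
        score = None  # assumption: out-of-range values count as unparseable

    reasoning = reasoning_match.group(1).strip() if reasoning_match else ""
    return Evaluation(score, reasoning)


if __name__ == "__main__":
    sample = (
        "<think>weighing the evidence</think>\n"
        "Score: 7.5\n"
        "Reasoning: Grounded in the context."
    )
    print(parse_evaluation(sample))
```

Storing the parsed number next to the raw text would let later code compare prompt versions numerically, while a None score signals that the evaluator ignored the output format.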