Spaces: thexForce/guard

Junaidb committed on Apr 29

Commit 324d9f3 (verified) · 1 Parent(s): 3f020e1

Update llmeval.py

Files changed (1)

llmeval.py +25 -17

@@ -42,16 +42,16 @@ class LLM_as_Evaluator():
         SYSTEM='''
 Task:
-Evaluate the biological quality of a prompt, research data, paradigm list, and the selected paradigm on a 0–1 continuous scale.
 Goal:
 Assess:
 Whether the Prompt is clear, biologically specific, and aligned with the Research Data and the Paradigm List.
-Whether the selected Paradigm is biologically relevant, mechanistically coherent, and experimentally actionable based on the Research Data.
-Whether the selected Paradigm is correctly chosen from the Paradigm List in light of the Research Data.
 Scoring Guide (0–1 continuous scale):
@@ -59,15 +59,15 @@ Score 1.0 if:
 The Prompt is clear, biologically detailed, and well-aligned to the Research Data and Paradigm List.
-The selected Paradigm correctly reflects a biologically valid interpretation of the Research Data and is appropriately drawn from the Paradigm List.
 Lower scores if:
 The prompt is vague or misaligned with the research context.
-The selected paradigm is biologically irrelevant, mechanistically incoherent, or mismatched with the Research Data.
-The selected paradigm is not the most plausible or supported choice from the Paradigm List.
 Your output must begin with Score: and contain only two fields: Score: and Reasoning:. No extra commentary, no markdown, no explanations before or after.:
@@ -97,7 +97,7 @@ Think step by step
     def Observation_LLM_Evaluator(self,promptversion):
         SYSTEM='''
 Task:
-Evaluate the biological quality of a prompt , research data . response triplet from an Observations Generator Agent on a 0–1 continuous scale.
 Goal:
 Assess:
@@ -122,6 +122,8 @@ The response includes irrelevant, biologically implausible, contradictory, or tr
 Your output must begin with Score: and contain only two fields: Score: and Reasoning: No extra commentary, no markdown, no explanations before or after.
 '''
         data_to_evaluate=dbe.GetData(promptversion)
         messages =[
@@ -146,36 +148,42 @@ Your output must begin with Score: and contain only two fields: Score: and Reaso
     def  Anomaly_LLM_Evaluator(self,promptversion):
         SYSTEM='''
 Task:
-Evaluate the biological quality of a prompt–observations–response triplet from an Anomaly Detector Agent on a 0–1 continuous scale.
 Goal:
 Assess:
 Whether the Prompt clearly defines the biological context and intent.
 Whether the Observations are biologically plausible and internally consistent.
-Whether the Response correctly identifies biologically relevant inconsistencies between the Paradigm and Observations.
 Scoring Guide (0–1 continuous scale):
 Score 1.0 if:
-The prompt is clear and biologically grounded.
-The response lists true, biologically meaningful anomalies based on the observations.
-All major contradictions or gaps are captured.
 Lower scores if:
-The prompt is vague.
-The response misses key anomalies, adds irrelevant ones, or shows poor biological reasoning.
 Your output must begin with Score: and contain only two fields: Score: and Reasoning: No extra commentary, no markdown, no explanations before or after.
-Output:
-Score: 0.2
-Reasoning: Your reasoning.
 '''
         data_to_evaluate=dbe.GetData(promptversion)

         SYSTEM='''
 Task:
+Evaluate the biological quality of a prompt, research data, paradigm list, and response on a 0–1 continuous scale.
 Goal:
 Assess:
 Whether the Prompt is clear, biologically specific, and aligned with the Research Data and the Paradigm List.
+Whether the response is biologically relevant, mechanistically coherent, and experimentally actionable based on the Research Data.
+Whether the response is correctly chosen from the Paradigm List in light of the Research Data.
 Scoring Guide (0–1 continuous scale):
 The Prompt is clear, biologically detailed, and well-aligned to the Research Data and Paradigm List.
+The response correctly reflects a biologically valid interpretation of the Research Data and is appropriately drawn from the Paradigm List.
 Lower scores if:
 The prompt is vague or misaligned with the research context.
+The response is biologically irrelevant, mechanistically incoherent, or mismatched with the Research Data.
+The paradigm is not the most plausible or supported choice from the Paradigm List.
 Your output must begin with Score: and contain only two fields: Score: and Reasoning:. No extra commentary, no markdown, no explanations before or after.:
     def Observation_LLM_Evaluator(self,promptversion):
         SYSTEM='''
 Task:
+Evaluate the biological quality of a prompt , research data and response  from an Observations Generator Agent on a 0–1 continuous scale.
 Goal:
 Assess:
 Your output must begin with Score: and contain only two fields: Score: and Reasoning: No extra commentary, no markdown, no explanations before or after.
+Think step by step
 '''
         data_to_evaluate=dbe.GetData(promptversion)
         messages =[
     def  Anomaly_LLM_Evaluator(self,promptversion):
         SYSTEM='''
 Task:
+Evaluate the biological quality of a prompt , observations , paradigms and response  from an Anomaly Detector Agent on a 0–1 continuous scale.
 Goal:
 Assess:
 Whether the Prompt clearly defines the biological context and intent.
 Whether the Observations are biologically plausible and internally consistent.
+Whether the Paradigms are plausible biological frameworks given the context.
+Whether the Response correctly identifies biologically relevant inconsistencies or contradictions between the Paradigms and the Observations.
 Scoring Guide (0–1 continuous scale):
 Score 1.0 if:
+The Prompt is clear, biologically grounded, and well-scoped.
+The Observations are plausible and logically consistent.
+The Response accurately identifies true anomalies—i.e., meaningful contradictions or gaps—between the Paradigms and the Observations.
+All major conflicts are captured.
 Lower scores if:
+The Prompt is vague or misaligned with the context.
+Observations are biologically implausible or incoherent.
+The Response overlooks key inconsistencies, includes irrelevant anomalies, or shows poor biological reasoning.
 Your output must begin with Score: and contain only two fields: Score: and Reasoning: No extra commentary, no markdown, no explanations before or after.
+Think step by step
 '''
         data_to_evaluate=dbe.GetData(promptversion)
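
Note on the output contract: every evaluator prompt in this diff requires a reply that begins with `Score:` and contains only the `Score:` and `Reasoning:` fields. The diff does not show how `llmeval.py` consumes that reply, so the following is only a minimal sketch of a strict parser under that assumption. The names `Evaluation` and `parse_evaluation` are hypothetical and do not appear in the repository.

```python
import re
from typing import NamedTuple


class Evaluation(NamedTuple):
    """Hypothetical container for one evaluator reply."""
    score: float
    reasoning: str


# The reply must start with "Score:", followed by a number, then "Reasoning:".
_REPLY = re.compile(
    r"\AScore:\s*([0-9]*\.?[0-9]+)\s*Reasoning:\s*(.*)\Z",
    re.DOTALL,
)


def parse_evaluation(text: str) -> Evaluation:
    """Parse a 'Score: ... Reasoning: ...' reply, rejecting anything else."""
    match = _REPLY.match(text.strip())
    if match is None:
        raise ValueError("reply does not begin with 'Score:' or lacks 'Reasoning:'")
    score = float(match.group(1))
    # The prompts define a 0-1 continuous scale, so out-of-range values are errors.
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score {score} is outside the 0-1 scale")
    return Evaluation(score, match.group(2).strip())
```

A strict parser like this would reject a reply that puts any step-by-step text before `Score:`, which may be worth checking now that the prompts also end with "Think step by step".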