Organise prompts
common.py
CHANGED
@@ -37,7 +37,7 @@ CSS_STYLES = """
     gap: 8px;
 }
 """
-
+
 # Default Eval Prompt
 EVAL_DESCRIPTION = """
 ## 📝 Tips
@@ -47,43 +47,6 @@ EVAL_DESCRIPTION = """
 - Examples (Optional)
 """
 
-DEFAULT_EVAL_PROMPT = """Does the model provide relevant and useful responses to the user's needs or questions?
-
-Scoring Rubric:
-Score 1: The model's responses are irrelevant or unhelpful to the user's needs or queries.
-Score 2: The model sometimes provides helpful information, but often fails to address the user's actual needs or questions.
-Score 3: The model generally provides helpful responses that address the user's needs, though it may occasionally miss the mark.
-Score 4: The model regularly provides helpful responses that are well-aligned with the user's inquiries, with only rare inaccuracies.
-Score 5: The model consistently offers highly relevant and useful responses that perfectly cater to the user's needs and inquiries.
-
-[User Query]: {{input}}
-
-[AI Response]: {{response}}"""
-
-# Split the eval prompt into editable and fixed parts
-DEFAULT_EVAL_PROMPT_EDITABLE = """Does the model provide relevant and useful responses to the user's needs or questions?
-
-Scoring Rubric:
-Score 1: The model's responses are irrelevant or unhelpful to the user's needs or queries.
-Score 2: The model sometimes provides helpful information, but often fails to address the user's actual needs or questions.
-Score 3: The model generally provides helpful responses that address the user's needs, though it may occasionally miss the mark.
-Score 4: The model regularly provides helpful responses that are well-aligned with the user's inquiries, with only rare inaccuracies.
-Score 5: The model consistently offers highly relevant and useful responses that perfectly cater to the user's needs and inquiries."""
-
-# Fixed suffix that will always be appended
-FIXED_EVAL_SUFFIX = """
-[User Query]: {{input}}
-
-[AI Response]: {{response}}"""
-
-# Default Variable Values
-DEFAULT_INPUT = """Which of these animals is least likely to be found in a rainforest?"
-A) Jaguar
-B) Toucan
-C) Polar Bear
-D) Sloth"""
-DEFAULT_RESPONSE = "C) Polar Bear"
-
 # Voting Section Header
 VOTING_HEADER = """
 # Start Voting Now
@@ -103,7 +66,7 @@ We thank [Clementine Fourrier](https://huggingface.co/clefourrier) and Hugging F
 POLICY_CONTENT = """
 # About Atla
 
-
+Atla is an applied research organization that trains models as evaluators to capture human preferences. We're a team of researchers, engineers, and operational leaders, with experience spanning a variety of disciplines, all working together to build reliable and understandable AI systems. Our research is informed by our experiences conducting AI safety research at the UK AI Task Force, OpenAI and the Stanford Existential Risks Initiative.
 <br><br>
 # Our Mission
 
@@ -159,25 +122,5 @@ Atla currently funds this out of our own pocket. We are looking for API credits
 We are training a general-purpose evaluator that you will soon be able to run in this Judge Arena. Our next step will be to open-source a powerful model that the community can use to run fast and accurate evaluations.
 <br><br>
 # Get in touch
-We’d love to hear your feedback! For general feature requests or to submit / suggest new models to add to the arena, please open up a discussion in the [community](https://huggingface.co/spaces/AtlaAI/judge-arena/discussions) tab. You can also contact us directly on [X](https://x.com/Atla_AI) or [Discord](https://discord.gg/
+We’d love to hear your feedback! For general feature requests or to submit / suggest new models to add to the arena, please open up a discussion in the [community](https://huggingface.co/spaces/AtlaAI/judge-arena/discussions) tab. You can also contact us directly on [X](https://x.com/Atla_AI) or [Discord](https://discord.gg/yNpUAMqs).
 \nPlease file any issues on our [Github](https://github.com/atla-ai/judge-arena)."""
-
-
-# Default values for compatible mode
-DEFAULT_EVAL_CRITERIA = """Does the model provide relevant and useful responses to the user's needs or questions?"""
-
-DEFAULT_SCORE_1 = "The model's responses are irrelevant or unhelpful to the user's needs or queries."
-
-DEFAULT_SCORE_2 = "The model sometimes provides helpful information, but often fails to address the user's actual needs or questions."
-
-DEFAULT_SCORE_3 = "The model generally provides helpful responses that address the user's needs, though it may occasionally miss the mark."
-
-DEFAULT_SCORE_4 = "The model regularly provides helpful responses that are well-aligned with the user's inquiries, with only rare inaccuracies."
-
-DEFAULT_SCORE_5 = "The model consistently offers highly relevant and useful responses that perfectly cater to the user's needs and inquiries."
-
-#**What are the Evaluator Prompt Templates based on?**
-
-#As a quick start, we've set up templates that cover the most popular evaluation metrics out there on LLM evaluation / monitoring tools, often known as 'base metrics'. The data samples used in these were randomly picked from popular datasets from academia - [ARC](https://huggingface.co/datasets/allenai/ai2_arc), [Preference Collection](https://huggingface.co/datasets/prometheus-eval/Preference-Collection), [RewardBench](https://huggingface.co/datasets/allenai/reward-bench), [RAGTruth](https://arxiv.org/abs/2401.00396).
-
-#These templates are designed as a starting point to showcase how to interact with the Judge Arena, especially for those less familiar with using LLM judges.
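The removed block split the default evaluation prompt into an editable criteria/rubric section and a fixed suffix carrying the `{{input}}` / `{{response}}` placeholders. As a rough illustration of how those removed constants would presumably have been recombined and filled before being sent to a judge model, here is a minimal sketch; `build_eval_prompt` is a hypothetical helper for illustration only, not code from this repository, and the rubric is abbreviated.

```python
# Sketch (assumption): recombine the editable eval prompt with the fixed
# suffix and substitute the template variables. Not part of common.py.

DEFAULT_EVAL_PROMPT_EDITABLE = """Does the model provide relevant and useful responses to the user's needs or questions?

Scoring Rubric:
Score 1: The model's responses are irrelevant or unhelpful to the user's needs or queries.
Score 5: The model consistently offers highly relevant and useful responses that perfectly cater to the user's needs and inquiries."""

# Fixed suffix that is always appended after the editable part.
FIXED_EVAL_SUFFIX = """
[User Query]: {{input}}

[AI Response]: {{response}}"""


def build_eval_prompt(editable: str, user_input: str, response: str) -> str:
    """Append the fixed suffix and fill the {{input}} / {{response}} placeholders."""
    full_prompt = editable + FIXED_EVAL_SUFFIX
    return full_prompt.replace("{{input}}", user_input).replace("{{response}}", response)


if __name__ == "__main__":
    # Uses the default sample values from the removed block.
    print(build_eval_prompt(
        DEFAULT_EVAL_PROMPT_EDITABLE,
        "Which of these animals is least likely to be found in a rainforest?",
        "C) Polar Bear",
    ))
```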