Selene 1 Mini: the best small language model-as-a-judge
The Atla team is proud to introduce Selene Mini, a state-of-the-art small language model-as-a-judge (SLMJ). Selene Mini is an open-source 8B evaluation model that outperforms GPT-4o-mini and top SLMJs (SFR-Judge, Glider, Flow Judge, Prometheus 2) on overall performance across a wide variety of evaluation tasks.
Using off-the-shelf LLMs as evaluators can produce biased or inaccurate results. By contrast, Selene Mini was trained on dedicated datasets to be a general-purpose evaluator. It supports flexible prompting and produces both scores and critiques with reasoning. It also excels in real-world applications, such as medicine and finance.
In the example below, taken from a medical dataset, we evaluate whether a single diagnosis can be reached from a conversation. Both GPT-4o-mini and Llama 3.1 8B incorrectly reason that a diagnosis can be reached, whereas Selene Mini correctly judges that the conversation does not include enough information for a diagnosis, matching the judgment of a medical expert.
Download Selene Mini.
The larger model in the Selene family will be released soon. Sign up for our waitlist to get first access.
Benchmark performance
Selene Mini outperforms prior small evaluation models on average performance across 11 benchmarks, spanning three different types of evaluation tasks:
- Absolute scoring, e.g. "Evaluate the harmlessness of this response on a scale of 1-5."
- Classification, e.g. "Does this response address the user query? Answer Yes or No."
- Pairwise preference, e.g. "Which of the following responses is more logically consistent - A or B?"
On some benchmarks, Selene Mini beats models several times its size, outperforming GPT-4o on RewardBench, EvalBiasBench, and Auto-J. It is also the highest-scoring 8B generative model on RewardBench and the top-ranking model on Judge Arena.
How we trained Selene Mini
As a general-purpose evaluator, Selene Mini delivers accurate judgments across absolute scoring, classification, and pairwise preference tasks.
We achieve this by developing a principled data curation strategy that augments public datasets with synthetically generated critiques and ensures high quality through filtering and ablation studies. We then fine-tuned a Llama-3.1-8B model on this curated data with a loss that combines the benefits of direct preference optimization (DPO) and supervised fine-tuning (SFT).
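The exact objective is detailed in the technical report; as a rough, illustrative sketch only (not necessarily our exact formulation), one common way to combine the two is to add a supervised negative log-likelihood term on the preferred response to the standard DPO loss:

import torch
import torch.nn.functional as F

def dpo_plus_sft_loss(policy_chosen_logps, policy_rejected_logps,
                      ref_chosen_logps, ref_rejected_logps,
                      beta=0.1, sft_weight=1.0):
    # Inputs are per-example log-probabilities (summed over tokens) of the chosen
    # and rejected responses under the policy and a frozen reference model.
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    # Standard DPO term: push the policy's preference margin above the reference's
    dpo_loss = -F.logsigmoid(beta * (pi_logratios - ref_logratios)).mean()
    # SFT term: maximize the likelihood of the preferred (chosen) response
    sft_loss = -policy_chosen_logps.mean()
    return dpo_loss + sft_weight * sft_loss

# Example call with dummy log-probabilities for a batch of 4 preference pairs
loss = dpo_plus_sft_loss(-torch.rand(4), -torch.rand(4), -torch.rand(4), -torch.rand(4))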
Read more on our methodology and results in the technical report.
Fit for real-world use cases
Selene Mini’s capabilities transfer to specialized domains while remaining broadly applicable. This is crucial for deployment in production environments where domain expertise is required but specialized evaluators may not be available.
Real-world performance: We measure zero-shot performance on two industry datasets annotated by human experts in finance (FinanceBench) and medicine (CRAFT-MD). Selene Mini surpasses the base model by 5% and 10% respectively, showing high agreement with domain experts.
Robust to prompt formatting: Selene Mini maintains consistent performance on benchmarks even in the face of changes to prompt structure and formatting. This ability points to robust learned evaluation capabilities rather than mere pattern matching.
Selene Mini can be prompted to perform evaluations on different scoring scales, such as binary pass/fail, a 1-5 Likert scale, multiple classes (“yes”, “no”, “maybe”), or pairwise preference between responses. It can also take ‘ground truth’ and ‘context’ input variables into account when available. With these options, users can tailor their prompts to the evaluation criteria of their specific use case.
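As an illustration only (the prompt formats we actually used in training are linked further down and should be preferred), here is a hypothetical way such an evaluation prompt could be assembled; the wording and variable names below are purely for demonstration:

# Illustrative template only - not an official Selene Mini prompt format.
# It combines a 1-5 Likert scale with optional ground-truth and context variables.
eval_prompt_template = """Evaluate the factual accuracy of the response on a 1-5 Likert scale.
First write a brief critique explaining your reasoning, then state the score.

Context: {context}
Ground truth: {ground_truth}

User query: {query}
Response to evaluate: {response}"""

prompt = eval_prompt_template.format(
    context="Refund policy: refunds are processed within 5-7 business days.",
    ground_truth="Refunds take 5-7 business days.",
    query="How long do refunds take?",
    response="Refunds are usually processed within a week.",
)
# `prompt` can then be passed as the user message in the quickstart below.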
How to use Selene Mini
🌕 Download Selene Mini on Hugging Face.
Below is a quickstart for Hugging Face Transformers:
from transformers import AutoModelForCausalLM, AutoTokenizer
device = "cuda" # the device to load the model onto
model_id = "AtlaAI/Selene-1-Mini-Llama-3.1-8B"
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)
prompt = "I heard you can evaluate my responses?" # replace with your eval prompt
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(device)
generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512, do_sample=True)
# strip the prompt tokens so only the newly generated evaluation is decoded
generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
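# print the model's evaluation (critique and score/judgment)
print(response)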
Try our cookbooks to get started with two popular use cases.
To achieve the best results, use the prompts we used for training, which we provide here.
More
You can also test out the Selene Mini playground.
We’re excited to see what you evaluate with Selene. To discuss, join our Discord!