---
license: apache-2.0
datasets:
- HPAI-BSC/Egida
language:
- en
base_model:
- Qwen/Qwen2.5-72B-Instruct
tags:
- safety
---
## Model Description
- Fine-Tuned from Model: Qwen/Qwen2.5-72B-Instruct
- Paper: [Efficient Safety Retrofitting Against Jailbreaking for LLMs](https://arxiv.org/abs/2502.13603)
- Point of Contact: Adrián Tormos
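A minimal inference sketch with Hugging Face Transformers follows. The repository id is an assumption inferred from the model name used in this card, and the generation settings are illustrative only:

```python
# Minimal inference sketch. Assumptions: the repo id below is inferred from
# the model name used in this card; generation settings are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HPAI-BSC/Qwen2.5-72B-Instruct-Egida-DPO"  # assumed repository id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "How do I secure my home Wi-Fi network?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```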
## Model Summary
This is a Qwen2.5-72B-Instruct model fine-tuned on the Egida-DPO-Qwen2.5-72B-Instruct dataset.

The Egida dataset is a collection of adversarial prompts designed to elicit unsafe behaviors from language models. In this case, the Egida train split is used to run inference on Qwen2.5-72B-Instruct. Unsafe answers are then selected and paired with safe answers to create a DPO dataset customized for this model. The result is a DPO dataset composed of triplets <"question", "chosen answer", "discarded answer">, containing questions that elicit unsafe responses from the target model, together with the unsafe responses it produced.
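A minimal sketch of that selection-and-pairing step is below. The judge function, the source of safe reference answers, and the column names are illustrative assumptions, not the authors' exact pipeline:

```python
# Hedged sketch of the Egida-DPO triplet construction described above.
# `is_unsafe` (a safety judge) and `safe_answers` (safe reference responses)
# are hypothetical stand-ins for the components used in the paper.
from datasets import Dataset

def build_dpo_triplets(prompts, model_answers, safe_answers, is_unsafe):
    records = []
    for prompt, answer, safe in zip(prompts, model_answers, safe_answers):
        # Keep only prompts the target model answered unsafely: the unsafe
        # answer becomes "rejected" (discarded), the safe one "chosen".
        if is_unsafe(prompt, answer):
            records.append({"prompt": prompt, "chosen": safe, "rejected": answer})
    return Dataset.from_list(records)
```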
## Training Details
- Hardware: NVIDIA H100 64 GB GPUs
- Devices: 64 GPUs (16 nodes)
- Time: 10.23h
- Batch Size: 63
- LR: 10⁻⁶
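As a rough illustration of how such a run could be configured with TRL's `DPOTrainer` (not the authors' actual training code; only the hyperparameters listed above come from this card, everything else is assumed):

```python
# Hedged sketch of a TRL DPO fine-tuning setup consistent with the listed
# hyperparameters; paths, column names, and all unlisted settings are assumptions.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "Qwen/Qwen2.5-72B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

# DPO pairs built from Egida as described above; the expected columns are
# "prompt", "chosen", "rejected" (split and column names assumed).
train_dataset = load_dataset("HPAI-BSC/Egida", split="train")

config = DPOConfig(
    output_dir="qwen2.5-72b-instruct-egida-dpo",
    learning_rate=1e-6,             # LR listed above
    per_device_train_batch_size=1,  # assumed; the run above reports a batch size of 63
    num_train_epochs=1,             # assumed
    bf16=True,                      # assumed
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```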
## Performance
### Safety Performance (Attack Success Ratio)

| | Egida (test) ↓ | DELPHI ↓ | Alert-Base ↓ | Alert-Adv ↓ |
|---|---|---|---|---|
| Qwen-2.5-72B-Instruct | 0.235 | 0.051 | 0.329 | 0.050 |
| Qwen-2.5-72B-Instruct-Egida-DPO | 0.125 | 0.042 | 0.210 | 0.019 |
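Here, Attack Success Ratio means the fraction of adversarial prompts for which a model's response is judged unsafe (lower is better). A minimal sketch, assuming a binary safety judge:

```python
# Hedged sketch: ASR as the fraction of responses flagged unsafe by a judge.
# `judge_is_unsafe` is a hypothetical stand-in for the evaluation judge used
# in the paper; it is not part of this repository.
def attack_success_ratio(prompts, responses, judge_is_unsafe) -> float:
    flagged = sum(judge_is_unsafe(p, r) for p, r in zip(prompts, responses))
    return flagged / len(prompts)
```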
### General Purpose Performance

| | OpenLLM Leaderboard (Average) ↑ | MMLU Generative (ROUGE1) ↑ |
|---|---|---|
| Qwen-2.5-72B-Instruct | 0.618 | 0.771 |
| Qwen-2.5-72B-Instruct-Egida-DPO | 0.620 | 0.768 |
### Refusal Ratio

| | OR-Bench 80K (refusal) ↓ | OR-Bench Hard (refusal) ↓ |
|---|---|---|
| Qwen-2.5-72B-Instruct | 0.015 | 0.102 |
| Qwen-2.5-72B-Instruct-Egida-DPO | 0.016 | 0.170 |
Note that the refusal ratio is computed via keyword matching against a curated list of keywords. For more information, see the paper.
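A minimal sketch of that kind of keyword-based refusal detection (the keyword list below is a tiny illustrative sample, not the curated list from the paper):

```python
# Hedged sketch: keyword-matching refusal detection as described above.
# REFUSAL_KEYWORDS is an illustrative sample, not the paper's curated list.
REFUSAL_KEYWORDS = ["i cannot", "i can't", "i'm sorry", "as an ai", "i am unable"]

def is_refusal(response: str) -> bool:
    text = response.lower()
    return any(keyword in text for keyword in REFUSAL_KEYWORDS)

def refusal_ratio(responses) -> float:
    return sum(is_refusal(r) for r in responses) / len(responses)
```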
## Environmental Impact
## Citation Information

```bibtex
@misc{garciagasulla2025efficientsafetyretrofittingjailbreaking,
      title={Efficient Safety Retrofitting Against Jailbreaking for LLMs},
      author={Dario Garcia-Gasulla and Adrian Tormos and Anna Arias-Duart and Daniel Hinjos and Oscar Molina-Sedano and Ashwin Kumar Gururajan and Maria Eugenia Cardello},
      year={2025},
      eprint={2502.13603},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.13603},
}
```