Safetensors
English
qwen2
safety

Model Description

Model Summary

This is a fine-tuned Qwen2.5-7B-Instruct model on the Egida-DPO-Qwen2.5-7B-Instruct dataset.

The Egida dataset is a collection of adversarial prompts that are thought to ellicit unsafe behaviors from language models. Specifically for this case, the Egida train split is used to run inference on Qwen2.5-7B-Instruct. Unsafe answers are selected, and paired with safe answers to create a customized DPO dataset for this model. This results in a DPO dataset composed by triplets < ”question”, ”chosen answer”, ”discarded answer” > which contain questions that elicit unsafe responses by this target model, as well as the unsafe responses produced by it.

Training Details

  • Hardware: NVIDIA H100 64 GB GPUs
  • Devices: 4 GPUs (1 node)
  • Time: 1.59h
  • Batch Size: 8
  • LR: 10−7

Performance

Safety Performance (Attack Success Ratio)

Egida (test) ↓ DELPHI ↓ Alert-Base ↓ Alert-Adv ↓
Qwen-2.5-7B-Instruct 0.471 0.138 0.544 0.080
Qwen-2.5-7B-Instruct-Egida-DPO 0.322 0.118 0.410 0.045

General Purpose Performance

OpenLLM Leaderboard (Average) ↑ MMLU Generative (ROUGE1) ↑
Qwen-2.5-7B-Instruct 0.488 0.331
Qwen-2.5-7B-Instruct-Egida-DPO 0.488 0.296

Refusal Ratio

OR Bench 80K (refusal) ↓ OR Bench Hard (refusal) ↓
Qwen-2.5-7B-Instruct 0.021 0.175
Qwen-2.5-7B-Instruct-Egida-DPO 0.029 0.240

Note that this refusal ratio is computed as keyword matching with a curated list of keywords. For more information, check the paper.

Environmental Impact

Citation Information

@misc{garciagasulla2025efficientsafetyretrofittingjailbreaking,
      title={Efficient Safety Retrofitting Against Jailbreaking for LLMs}, 
      author={Dario Garcia-Gasulla and Adrian Tormos and Anna Arias-Duart and Daniel Hinjos and Oscar Molina-Sedano and Ashwin Kumar Gururajan and Maria Eugenia Cardello},
      year={2025},
      eprint={2502.13603},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.13603}, 
}
Downloads last month
57
Safetensors
Model size
7.62B params
Tensor type
BF16
·
Inference Providers NEW
This model is not currently available via any of the supported Inference Providers.
The model cannot be deployed to the HF Inference API: The model has no library tag.

Model tree for HPAI-BSC/Qwen2.5-7B-Instruct-Egida-DPO

Base model

Qwen/Qwen2.5-7B
Finetuned
(703)
this model

Dataset used to train HPAI-BSC/Qwen2.5-7B-Instruct-Egida-DPO

Collection including HPAI-BSC/Qwen2.5-7B-Instruct-Egida-DPO