## Model Description
- Fine-Tuned from Model: meta-llama/Llama-3.1-8B-Instruct
- Paper: Efficient Safety Retrofitting Against Jailbreaking for LLMs
- Point of Contact: Adrián Tormos
## Model Summary
This is a Llama-3.1-8B-Instruct model fine-tuned on the Egida-DPO-Llama-3.1-8B-Instruct dataset.
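A minimal way to load and query the fine-tuned model with `transformers` (standard Hugging Face usage; the prompt and generation settings are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HPAI-BSC/Meta-Llama-3.1-8B-Instruct-Egida-DPO"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Build a chat-formatted prompt and generate a response.
messages = [{"role": "user", "content": "What safety checks should a chatbot perform?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```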
The Egida dataset is a collection of adversarial prompts designed to elicit unsafe behaviors from language models. For this model, the Egida train split is used to run inference on Llama-3.1-8B-Instruct. Unsafe answers are selected and paired with safe answers to create a customized DPO dataset for this model. The result is a DPO dataset composed of triplets ("question", "chosen answer", "discarded answer"), containing questions that elicit unsafe responses from this target model, together with the unsafe responses it produced.
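As an illustration of that construction, a minimal sketch follows; the function signature and the unsafe-answer judge are assumptions, not the exact pipeline from the paper:

```python
from typing import Callable

def build_dpo_triplets(
    prompts: list[str],
    target_answers: list[str],   # answers generated by the target model
    safe_answers: list[str],     # safe reference answers to pair against
    is_unsafe: Callable[[str, str], bool],  # placeholder safety judge
) -> list[dict]:
    """Keep only prompts whose target-model answer was judged unsafe,
    pairing each with a safe (chosen) and unsafe (discarded) answer."""
    triplets = []
    for prompt, answer, safe in zip(prompts, target_answers, safe_answers):
        if is_unsafe(prompt, answer):
            triplets.append({
                "question": prompt,
                "chosen answer": safe,
                "discarded answer": answer,
            })
    return triplets
```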
## Performance

### Safety Performance (Attack Success Ratio)

| | Egida (test) ↓ | DELPHI ↓ | Alert-Base ↓ | Alert-Adv ↓ |
|---|---|---|---|---|
| Meta-Llama-3.1-8B-Instruct | 0.347 | 0.160 | 0.446 | 0.039 |
| Meta-Llama-3.1-8B-Instruct-Egida-DPO | 0.038 | 0.025 | 0.038 | 0.014 |
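The Attack Success Ratio is the fraction of adversarial prompts that elicit an unsafe response (lower is better). A minimal sketch, where the safety judge is a placeholder rather than the paper's actual evaluator:

```python
from typing import Callable

def attack_success_ratio(
    prompts: list[str],
    responses: list[str],
    judge_unsafe: Callable[[str, str], bool],  # placeholder safety classifier
) -> float:
    """Fraction of prompt/response pairs judged unsafe."""
    unsafe = sum(judge_unsafe(p, r) for p, r in zip(prompts, responses))
    return unsafe / len(prompts)
```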
### General Purpose Performance

| | OpenLLM Leaderboard (Average) ↑ | MMLU Generative (ROUGE1) ↑ |
|---|---|---|
| Meta-Llama-3.1-8B-Instruct | 0.453 | 0.646 |
| Meta-Llama-3.1-8B-Instruct-Egida-DPO | 0.453 | 0.643 |
### Refusal Ratio

| | OR-Bench 80K (refusal) ↓ | OR-Bench Hard (refusal) ↓ |
|---|---|---|
| Meta-Llama-3.1-8B-Instruct | 0.035 | 0.324 |
| Meta-Llama-3.1-8B-Instruct-Egida-DPO | 0.037 | 0.319 |
Note that the refusal ratio is computed by keyword matching against a curated list of keywords. For more information, see the paper.
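As a sketch of that procedure (the actual curated keyword list is in the paper; the phrases below are illustrative assumptions):

```python
# Keyword-matching refusal detector, as described above.
# REFUSAL_PHRASES is an illustrative placeholder, not the paper's curated list.
REFUSAL_PHRASES = [
    "i can't", "i cannot", "i'm sorry", "i am unable", "as an ai",
]

def is_refusal(response: str) -> bool:
    """Flag a response as a refusal if it contains any known refusal phrase."""
    text = response.lower()
    return any(phrase in text for phrase in REFUSAL_PHRASES)

def refusal_ratio(responses: list[str]) -> float:
    """Fraction of responses flagged as refusals (lower is better on OR-Bench)."""
    return sum(is_refusal(r) for r in responses) / len(responses)
```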
## Training Details

- Hardware: NVIDIA H100 64 GB GPUs
- Devices: 4 GPUs (1 node)
- Time: 1.59 h
- Batch Size: 8
- LR: 10⁻⁷
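The card does not state the training framework; as a minimal sketch, an equivalent DPO run with TRL's `DPOTrainer` using these hyperparameters might look as follows (the dataset handle, epoch count, and the use of TRL itself are assumptions):

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer  # requires a recent TRL release

model_id = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# DPOTrainer expects "prompt", "chosen", and "rejected" columns.
# The dataset handle below is an assumption based on the name in this card.
dataset = load_dataset("HPAI-BSC/Egida-DPO-Llama-3.1-8B-Instruct", split="train")

config = DPOConfig(
    output_dir="llama-3.1-8b-instruct-egida-dpo",
    per_device_train_batch_size=2,  # 4 GPUs x 2 = global batch size 8, per the card
    learning_rate=1e-7,             # LR from the card
    num_train_epochs=1,             # assumption; not stated on the card
    bf16=True,
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```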
## Citation Information
@misc{garciagasulla2025efficientsafetyretrofittingjailbreaking,
title={Efficient Safety Retrofitting Against Jailbreaking for LLMs},
author={Dario Garcia-Gasulla and Adrian Tormos and Anna Arias-Duart and Daniel Hinjos and Oscar Molina-Sedano and Ashwin Kumar Gururajan and Maria Eugenia Cardello},
year={2025},
eprint={2502.13603},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.13603},
}