---
license: apache-2.0
datasets:
- HPAI-BSC/Egida
language:
- en
base_model:
- meta-llama/Llama-3.1-70B-Instruct
---

## Model Description

- **Fine-Tuned from Model:** [meta-llama/Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct)
- **Paper:** [Efficient Safety Retrofitting Against Jailbreaking for LLMs](https://arxiv.org/abs/2502.13603)
- **Point of Contact:** [Adrián Tormos](mailto:adrian.tormos@bsc.es)


## Model Summary

This model is Llama-3.1-70B-Instruct fine-tuned on the [Egida-DPO-Llama-3.1-70B-Instruct](http://huggingface.co/datasets/HPAI-BSC/Egida/viewer/Egida-DPO-Meta-Llama-3.1-70B-Instruct) dataset.

The [Egida](https://huggingface.co/datasets/HPAI-BSC/Egida/viewer/Egida?views%5B%5D=egida_full) dataset is a collection of adversarial prompts intended to elicit unsafe behaviors from language models. For this model, the Egida train split is used to run inference on Llama-3.1-70B-Instruct. The unsafe answers are selected and paired with safe answers to create a DPO dataset customized for this model. The resulting dataset consists of triplets < "question", "chosen answer", "discarded answer" > containing the questions that elicit unsafe responses from the target model, together with the unsafe responses it produced.
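The construction described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the `Generation` class, the `is_unsafe` flag (standing in for whatever safety judgment is used), the `safe_answers` lookup, and the dictionary keys are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class Generation:
    """One target-model answer to an adversarial prompt (hypothetical structure)."""
    question: str
    model_answer: str
    is_unsafe: bool  # assumed to come from a separate safety evaluation step


def build_dpo_triplets(generations, safe_answers):
    """Pair each unsafe model answer with a safe answer for the same question.

    `safe_answers` maps a question to a safe reference answer (assumption).
    Questions the model already answered safely are skipped.
    """
    triplets = []
    for gen in generations:
        if not gen.is_unsafe:
            continue  # only unsafe answers become "discarded" examples
        safe = safe_answers.get(gen.question)
        if safe is None:
            continue  # no safe counterpart available for this question
        triplets.append(
            {
                "question": gen.question,
                "chosen": safe,                 # preferred (safe) answer
                "discarded": gen.model_answer,  # unsafe answer from the target model
            }
        )
    return triplets
```

Because the discarded answers come from the target model itself, a dataset built this way is specific to that model, which is why the summary calls it customized.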

## Training Details

- **Hardware:** NVIDIA H100 64 GB GPUs
- **Devices:** 64 GPUs (16 nodes)
- **Time:** 10.23h
- **Batch Size:** 64
- **LR:** 1e-6

## Performance

### Safety Performance (Attack Success Ratio)

|                              | Egida (test) ↓ | DELPHI ↓ | Alert-Base ↓ | Alert-Adv ↓ |
|------------------------------|:--------------:|:--------:|:------------:|:-----------:|
| Meta-Llama-3.1-70B-Instruct  |     0.274      |  0.170   |    0.320     |    0.084    |
| Meta-Llama-3.1-70B-Instruct-Egida-DPO |     0.009      |  0.007   |    0.006     |    0.005    |
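To make the size of the change easier to read, the relative reduction in attack success ratio can be computed directly from the table above. The snippet below is only an arithmetic sketch; the dictionary and function names are hypothetical, and the numbers are copied from the table.

```python
# Attack success ratios copied from the table above: (base model, Egida-DPO model).
ASR = {
    "Egida (test)": (0.274, 0.009),
    "DELPHI": (0.170, 0.007),
    "Alert-Base": (0.320, 0.006),
    "Alert-Adv": (0.084, 0.005),
}


def relative_reduction(before, after):
    """Fraction of the original attack success ratio that was removed."""
    return (before - after) / before


reductions = {name: relative_reduction(b, a) for name, (b, a) in ASR.items()}

for name, value in reductions.items():
    print(f"{name}: {value:.1%} lower attack success ratio")
```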

### General Purpose Performance

|                              | OpenLLM Leaderboard (Average) ↑ | MMLU Generative (ROUGE1) ↑ |
|------------------------------|:---------------------:|:---------------:|
| Meta-Llama-3.1-70B-Instruct  |         0.575         |      0.726      |
| Meta-Llama-3.1-70B-Instruct-Egida-DPO |         0.577         |      0.038      |


### Refusal Ratio

|                              | OR Bench 80K (refusal) ↓ | OR Bench Hard (refusal) ↓ |
|------------------------------|:---------------------:|:---------------:|
| Meta-Llama-3.1-70B-Instruct         |          0.008           |           0.022           |
| Meta-Llama-3.1-70B-Instruct-Egida-DPO        |          0.347           |           0.351           |

Note that the refusal ratio is computed by keyword matching against a curated list of keywords. For more information, see the paper.
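A keyword-matching refusal ratio of this kind might look like the sketch below. The keyword list here is a made-up placeholder, not the curated list from the paper, and the case-insensitive substring match is an assumption about how matching is done.

```python
# Placeholder keywords for illustration only; the paper uses its own curated list.
REFUSAL_KEYWORDS = ("i cannot", "i can't", "i'm sorry", "i am unable")


def is_refusal(response, keywords=REFUSAL_KEYWORDS):
    """Return True if the response contains any refusal keyword (case-insensitive)."""
    text = response.lower()
    return any(keyword in text for keyword in keywords)


def refusal_ratio(responses, keywords=REFUSAL_KEYWORDS):
    """Fraction of responses flagged as refusals; 0.0 for an empty list."""
    if not responses:
        return 0.0
    refused = sum(is_refusal(r, keywords) for r in responses)
    return refused / len(responses)
```

Keyword matching is cheap but approximate: a response that declines in other words is missed, and a helpful response that happens to contain a keyword is counted as a refusal, so the ratios are best read as estimates.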

## Environmental Impact


## Citation Information


```
@misc{garciagasulla2025efficientsafetyretrofittingjailbreaking,
      title={Efficient Safety Retrofitting Against Jailbreaking for LLMs}, 
      author={Dario Garcia-Gasulla and Adrian Tormos and Anna Arias-Duart and Daniel Hinjos and Oscar Molina-Sedano and Ashwin Kumar Gururajan and Maria Eugenia Cardello},
      year={2025},
      eprint={2502.13603},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.13603}, 
}
```