	---
	license: apache-2.0
	datasets:
	- HPAI-BSC/Egida
	language:
	- en
	base_model:
	- meta-llama/Llama-3.1-70B-Instruct
	---

	## Model Description

	- Fine-Tuned from Model: [meta-llama/Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct)
	- Paper: [Efficient Safety Retrofitting Against Jailbreaking for LLMs](https://arxiv.org/abs/2502.13603)
	- Point of Contact: [Adrián Tormos](mailto:[email protected])


	## Model Summary

	This model is Llama-3.1-70B-Instruct fine-tuned on the [Egida-DPO-Llama-3.1-70B-Instruct](http://huggingface.co/datasets/HPAI-BSC/Egida/viewer/Egida-DPO-Meta-Llama-3.1-70B-Instruct) dataset.

	The [Egida](https://huggingface.co/datasets/HPAI-BSC/Egida/viewer/Egida?views%5B%5D=egida_full) dataset is a collection of adversarial prompts that are thought to elicit unsafe behaviors from language models. For this model, the Egida train split is used to run inference on Llama-3.1-70B-Instruct. Unsafe answers are selected and paired with safe answers to create a DPO dataset customized for this model. The resulting dataset is composed of triplets <"question", "chosen answer", "discarded answer">, which contain questions that elicit unsafe responses from the target model together with the unsafe responses it produced.
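
The pairing step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `Generation` class, the `is_unsafe` label, the `safe_answers` lookup, and the output field names are all assumptions made for the example. The card does not state how unsafe answers are detected or where the safe answers come from.

```python
from dataclasses import dataclass


@dataclass
class Generation:
    """One answer produced by the target model (hypothetical container)."""
    question: str
    answer: str
    is_unsafe: bool  # assumed to come from some external safety judgment


def build_dpo_triplets(generations, safe_answers):
    """Pair each unsafe answer with a safe answer to the same question.

    `safe_answers` is a hypothetical dict mapping question -> safe answer.
    Questions without an unsafe answer or without a safe counterpart are skipped.
    """
    triplets = []
    for gen in generations:
        if not gen.is_unsafe:
            continue  # only questions that elicited unsafe behavior are kept
        safe = safe_answers.get(gen.question)
        if safe is None:
            continue  # no safe answer available to use as the chosen side
        triplets.append(
            {"question": gen.question, "chosen": safe, "discarded": gen.answer}
        )
    return triplets
```

Because only the target model's own unsafe answers end up on the discarded side, the resulting dataset is specific to that model, which matches the "customized" wording above.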

	## Training Details

	- Hardware: NVIDIA H100 64 GB GPUs
	- Devices: 64 GPUs (16 nodes)
	- Time: 10.23h
	- Batch Size: 64
	- LR: 1e-6

	## Performance

	### Safety Performance (Attack Success Ratio)

	| Model | Egida (test) ↓ | DELPHI ↓ | Alert-Base ↓ | Alert-Adv ↓ |
	|------------------------------|:--------------:|:--------:|:------------:|:-----------:|
	| Meta-Llama-3.1-70B-Instruct | 0.274 | 0.170 | 0.320 | 0.084 |
	| Meta-Llama-3.1-70B-Instruct-Egida-DPO | 0.009 | 0.007 | 0.006 | 0.005 |
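
To read the table as relative reductions rather than absolute ratios, a small calculation such as the one below can help. The numbers are copied from the table above; the `relative_reduction` helper is only an illustrative name, not something defined in the paper.

```python
# Attack Success Ratio per benchmark: (base model, Egida-DPO model),
# copied from the safety table in this card.
ASR = {
    "Egida (test)": (0.274, 0.009),
    "DELPHI": (0.170, 0.007),
    "Alert-Base": (0.320, 0.006),
    "Alert-Adv": (0.084, 0.005),
}


def relative_reduction(base, tuned):
    """Fraction of the base model's attack success removed by fine-tuning."""
    return (base - tuned) / base


if __name__ == "__main__":
    for benchmark, (base, tuned) in ASR.items():
        print(f"{benchmark}: {relative_reduction(base, tuned):.1%} lower ASR")
```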

	### General Purpose Performance

	| Model | OpenLLM Leaderboard (Average) ↑ | MMLU Generative (ROUGE1) ↑ |
	|------------------------------|:---------------------:|:---------------:|
	| Meta-Llama-3.1-70B-Instruct | 0.575 | 0.726 |
	| Meta-Llama-3.1-70B-Instruct-Egida-DPO | 0.577 | 0.038 |


	### Refusal Ratio

	| Model | OR Bench 80K (refusal) ↓ | OR Bench Hard (refusal) ↓ |
	|------------------------------|:---------------------:|:---------------:|
	| Meta-Llama-3.1-70B-Instruct | 0.008 | 0.022 |
	| Meta-Llama-3.1-70B-Instruct-Egida-DPO | 0.347 | 0.351 |

	Note that the refusal ratio is computed by matching model responses against a curated list of keywords. For more information, see the paper.
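
A keyword-based refusal ratio of this kind can be sketched as below. The keyword list here is purely hypothetical and chosen for illustration; the curated list actually used is described in the paper, and the matching rule (case-insensitive substring search) is likewise an assumption.

```python
# Hypothetical refusal markers; NOT the curated list used by the authors.
REFUSAL_KEYWORDS = ("i cannot", "i can't", "i'm sorry", "i am unable")


def is_refusal(response, keywords=REFUSAL_KEYWORDS):
    """Return True if the response contains any refusal keyword (case-insensitive)."""
    text = response.lower()
    return any(keyword in text for keyword in keywords)


def refusal_ratio(responses, keywords=REFUSAL_KEYWORDS):
    """Fraction of responses flagged as refusals; 0.0 for an empty list."""
    if not responses:
        return 0.0
    flagged = sum(is_refusal(response, keywords) for response in responses)
    return flagged / len(responses)
```

Keyword matching is cheap but approximate: a response that refuses in unusual wording is missed, and a helpful response that happens to contain a keyword is counted as a refusal.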

	## Environmental Impact


	## Citation Information


	```
	@misc{garciagasulla2025efficientsafetyretrofittingjailbreaking,
	title={Efficient Safety Retrofitting Against Jailbreaking for LLMs},
	author={Dario Garcia-Gasulla and Adrian Tormos and Anna Arias-Duart and Daniel Hinjos and Oscar Molina-Sedano and Ashwin Kumar Gururajan and Maria Eugenia Cardello},
	year={2025},
	eprint={2502.13603},
	archivePrefix={arXiv},
	primaryClass={cs.CL},
	url={https://arxiv.org/abs/2502.13603},
	}
	```