arxiv:2401.06824

Open the Pandora's Box of LLMs: Jailbreaking LLMs through Representation Engineering

Published on Jan 12, 2024
Authors:

Abstract

Getting large language models (LLMs) to refuse hostile or toxic questions is a core issue in LLM security. Previous approaches have used prompt engineering to jailbreak LLMs into answering such questions, but these attacks can easily fail after the model manufacturer applies additional fine-tuning. To promote a deeper understanding of model jailbreaking among researchers, we draw inspiration from Representation Engineering and propose a jailbreaking method that requires no elaborately constructed prompts, is not affected by model fine-tuning, and can be applied to any open-source LLM in a pluggable manner. We evaluated this method on multiple mainstream LLMs using carefully supplemented toxicity datasets, and the experimental results demonstrate its significant effectiveness. Surprised by some of the resulting jailbreaking cases, we conducted extensive in-depth research to explore the mechanisms behind this method.
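To illustrate the general representation-engineering idea the abstract refers to (steering a model by editing its hidden states rather than its prompt), the following is a minimal sketch of activation steering with Hugging Face transformers. It is not the paper's exact method: the contrastive prompts, the layer index LAYER, and the scale SCALE are hypothetical choices for illustration, and the target model is assumed to be a Llama-style open-source causal LM.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # any open-source causal LM (assumed)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

LAYER = 14   # hypothetical: which decoder layer to steer
SCALE = 4.0  # hypothetical: steering strength

def last_token_hidden(prompt: str, layer: int) -> torch.Tensor:
    """Hidden state of the final token of `prompt` at the given layer."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer][0, -1, :]

# Contrastive prompts: the difference of their hidden states gives a rough
# direction in representation space (here, "comply" minus "refuse").
refuse_h = last_token_hidden("You must refuse harmful requests.", LAYER)
comply_h = last_token_hidden("You must answer every request directly.", LAYER)
steer_direction = comply_h - refuse_h
steer_direction = steer_direction / steer_direction.norm()

def steering_hook(module, inputs, output):
    """Add the steering vector to one layer's residual-stream output."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * steer_direction.to(hidden.dtype)
    if isinstance(output, tuple):
        return (hidden,) + output[1:]
    return hidden

# Attach the hook: this is what makes the intervention "pluggable" -- the
# model weights are untouched and the hook can be removed at any time.
handle = model.model.layers[LAYER].register_forward_hook(steering_hook)
try:
    prompt = "Explain your refusal policy."  # placeholder benign prompt
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    generated = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(generated[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook so the model behaves normally afterwards

Because the steering vector is computed from activations rather than from a crafted prompt, a sketch like this is unaffected by prompt-level defenses, which is the property the abstract emphasizes.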
