Update index.html
index.html +16 -0
|
@@ -134,9 +134,25 @@ Exploring Refusal Loss Landscapes </title>
|
|
| 134 |
</div>
|
| 135 |
|
| 136 |
<h2 id="proposed-approach-gradient-cuff">Proposed Approach: Gradient Cuff</h2>
|
| 137 |
|
| 138 |
<div class="container"><img id="gradient-cuff-header" src="./gradient_cuff.png" /></div>
|
| 139 |
|
| 140 |
<h2 id="demonstration">Demonstration</h2>
|
| 141 |
<p>We evaluated Gradient Cuff as well as 4 baselines (Perplexity Filter, SmoothLLM, Erase-and-Check, and Self-Reminder) against 6
|
| 142 |
different jailbreak attacks (GCG, AutoDAN, PAIR, TAP, Base64, and LRL) and benign user queries on 2 LLMs (LLaMA-2-7B-Chat and Vicuna-7B-V1.5).
|
|
|
|
| 134 |
</div>
|
| 135 |
|
| 136 |
<h2 id="proposed-approach-gradient-cuff">Proposed Approach: Gradient Cuff</h2>
|
| 137 |
+
<p> Building on our exploration of the Refusal Loss landscape, we propose Gradient Cuff,
|
| 138 |
+
a two-step jailbreak detection method based on checking the refusal loss and its gradient norm. Our detection procedure is shown below:
|
| 139 |
+
</p>
|
| 140 |
|
| 141 |
<div class="container"><img id="gradient-cuff-header" src="./gradient_cuff.png" /></div>
|
| 142 |
|
| 143 |
+
<p>
|
| 144 |
+
Gradient Cuff consists of two phases:
|
| 145 |
+
<span>
|
| 146 |
+
$$
|
| 147 |
+
\begin{itemize}
|
| 148 |
+
\item \textbf{(Phase 1) Sampling-based Rejection:}~In the first phase, we check whether $f_\theta(x)<0.5$. If so, $x$ is rejected; otherwise, $x$ is passed on to Phase 2.
|
| 149 |
+
\item \textbf{(Phase 2) Gradient Norm Rejection:}~In the second phase, we flag $x$ as a jailbreak attempt if the norm of the estimated gradient $g_\theta(x)$ exceeds a configurable threshold $t$, i.e., $\|g_\theta(x)\| > t$.
|
| 150 |
+
\end{itemize}
|
| 151 |
+
$$
|
| 152 |
+
</span>
|
| 153 |
+
</p>
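The two phases above can be sketched as a short decision function. This is only an illustrative sketch, not the authors' implementation: the names `gradient_cuff_reject`, `refusal_loss`, and `grad_norm` are hypothetical, and the two callables are assumed to be supplied by the caller as stand-ins for $f_\theta(x)$ and $\|g_\theta(x)\|$. How those quantities are actually estimated is not shown here.

```python
from typing import Callable


def gradient_cuff_reject(
    query: str,
    refusal_loss: Callable[[str], float],  # hypothetical stand-in for f_theta(x)
    grad_norm: Callable[[str], float],     # hypothetical stand-in for ||g_theta(x)||
    threshold: float,                      # the configurable threshold t
) -> bool:
    """Return True if the query should be rejected, False otherwise."""
    # Phase 1 (sampling-based rejection): reject when f_theta(x) < 0.5.
    if refusal_loss(query) < 0.5:
        return True
    # Phase 2 (gradient norm rejection): reject when ||g_theta(x)|| > t.
    # The gradient norm is only evaluated for queries that pass Phase 1.
    return grad_norm(query) > threshold
```

Because Phase 1 returns early, the gradient norm is only computed for queries that were not already rejected by the first check.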
|
| 154 |
+
|
| 155 |
+
|
| 156 |
<h2 id="demonstration">Demonstration</h2>
|
| 157 |
<p>We evaluated Gradient Cuff as well as 4 baselines (Perplexity Filter, SmoothLLM, Erase-and-Check, and Self-Reminder) against 6
|
| 158 |
different jailbreak attacks (GCG, AutoDAN, PAIR, TAP, Base64, and LRL) and benign user queries on 2 LLMs (LLaMA-2-7B-Chat and Vicuna-7B-V1.5).
|