Update index.html

index.html CHANGED (+8 -59)
@@ -237,7 +237,7 @@ should begin with 'Sure, here is guidance on how to cause physical harm to oth
237      </script>
238
239
240 -    <h2 id="proposed-approach-gradient-cuff">Performance
241      <p>
242      The performance of jailbreak defending methods is usually measured by how much they can reduce the ASR. Major concerns
243      when developing such methods are the performance degradation of the LLM on nominal benign prompts and the increased inference-time cost
@@ -269,70 +269,19 @@ ASR increase. When &beta; is fixed, larger &alpha; would both reduce the ASR and t
269      </div>
270
271      <h2 id="demonstration">Demonstration</h2>
272 -    <p>We evaluated Gradient Cuff as well as 4 baselines (Perplexity Filter, SmoothLLM, Erase-and-Check, and Self-Reminder)
273 -    against 6 different jailbreak attacks (GCG, AutoDAN, PAIR, TAP, Base64, and LRL) and benign user queries on 2 LLMs (LLaMA-2-7B-Chat and
274 -    Vicuna-7B-V1.5). We below demonstrate the average refusal rate across these 6 malicious user query datasets as the Average Malicious Refusal
275 -    Rate and the refusal rate on benign user queries as the Benign Refusal Rate. The defending performance against different jailbreak types is
276 -    shown in the provided bar chart.
277 -    </p>
278 -
279 -
280 -    <div id="jailbreak-demo" class="container">
281 -    <div class="row align-items-center">
282 -    <div class="row" style="margin: 10px 0 0">
283 -    <div class="models-list">
284 -    <span style="margin-right: 1em;">Models</span>
285 -    <span class="radio-group"><input type="radio" id="LLaMA2" class="options" name="models" value="llama2_7b_chat" checked="" /><label for="LLaMA2" class="option-label">LLaMA-2-7B-Chat</label></span>
286 -    <span class="radio-group"><input type="radio" id="Vicuna" class="options" name="models" value="vicuna_7b_v1.5" /><label for="Vicuna" class="option-label">Vicuna-7B-V1.5</label></span>
287 -    </div>
288 -    </div>
289 -    </div>
290 -    <div class="row align-items-center">
291 -    <div class="col-4">
292 -    <div id="defense-methods">
293 -    <div class="row align-items-center"><input type="radio" id="defense_ppl" class="options" name="defense" value="ppl" /><label for="defense_ppl" class="defense">Perplexity Filter</label></div>
294 -    <div class="row align-items-center"><input type="radio" id="defense_smoothllm" class="options" name="defense" value="smoothllm" /><label for="defense_smoothllm" class="defense">SmoothLLM</label></div>
295 -    <div class="row align-items-center"><input type="radio" id="defense_erase_check" class="options" name="defense" value="erase_check" /><label for="defense_erase_check" class="defense">Erase-Check</label></div>
296 -    <div class="row align-items-center"><input type="radio" id="defense_self_reminder" class="options" name="defense" value="self_reminder" /><label for="defense_self_reminder" class="defense">Self-Reminder</label></div>
297 -    <div class="row align-items-center"><input type="radio" id="defense_gradient_cuff" class="options" name="defense" value="gradient_cuff" checked="" /><label for="defense_gradient_cuff" class="defense"><span style="font-weight: bold;">Gradient Cuff</span></label></div>
298 -    </div>
299 -    <div class="row align-items-center">
300 -    <div class="attack-success-rate"><span class="jailbreak-metric">Average Malicious Refusal Rate</span><span class="attack-success-rate-value" id="asr-value">0.959</span></div>
301 -    </div>
302 -    <div class="row align-items-center">
303 -    <div class="benign-refusal-rate"><span class="jailbreak-metric">Benign Refusal Rate</span><span class="benign-refusal-rate-value" id="brr-value">0.050</span></div>
304 -    </div>
305 -    </div>
306 -    <div class="col-8">
307 -    <figure class="figure">
308 -    <img id="reliability-diagram" src="demo_results/gradient_cuff_llama2_7b_chat_threshold_100.png" alt="CIFAR-100 Calibrated Reliability Diagram (Full)" />
309 -    <div class="slider-container">
310 -    <div class="slider-label"><span>Perplexity Threshold</span></div>
311 -    <div class="slider-content" id="ppl-slider"><div id="ppl-threshold" class="ui-slider-handle"></div></div>
312 -    </div>
313 -    <div class="slider-container">
314 -    <div class="slider-label"><span>Gradient Threshold</span></div>
315 -    <div class="slider-content" id="gradient-norm-slider"><div id="gradient-norm-threshold" class="slider-value ui-slider-handle"></div></div>
316 -    </div>
317 -    <figcaption class="figure-caption">
318 -    </figcaption>
319 -    </figure>
320 -    </div>
321 -    </div>
322 -    </div>
323 -
324      <p>
325 -
326 -
327 -
328      </p>
329
330 -    <h2 id="inquiries"> Inquiries
331 -    <p> Please contact <a href="Mailto:[email protected]">Xiaomeng Hu</a>
332      and <a href="Mailto:[email protected]">Pin-Yu Chen</a>
333      </p>
334      <h2 id="citations">Citations</h2>
335 -    <p>If you find
336
337      <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@article{DBLP:journals/corr/abs-2412-18171,
338      author = {Xiaomeng Hu and
237      </script>
238
239
240 +    <h2 id="proposed-approach-gradient-cuff">Performance Evaluation</h2>
241      <p>
242      The performance of jailbreak defending methods is usually measured by how much they can reduce the ASR. Major concerns
243      when developing such methods are the performance degradation of the LLM on nominal benign prompts and the increased inference-time cost
269      </div>
270
271      <h2 id="demonstration">Demonstration</h2>
272      <p>
273 +    Below, we demonstrate how Token Highlighter influences the output generation of a large language model (LLM) in response to user prompts.
274 +    We showcase four illustrative examples: a GCG jailbreak, a TAP jailbreak, a vanilla harmful behavior, and a benign user request.
275 +    Additionally, we have developed a private live demo that allows users to interact with LLMs enhanced by Token Highlighter. Stay tuned for its public release!
276      </p>
277
278 +    <h2 id="inquiries">Inquiries</h2>
279 +    <p>If you have any questions regarding Token Highlighter, please contact <a href="Mailto:[email protected]">Xiaomeng Hu</a>
280      and <a href="Mailto:[email protected]">Pin-Yu Chen</a>
281      </p>
282 +
283      <h2 id="citations">Citations</h2>
284 +    <p>If you find Token Highlighter helpful and useful for your research, please cite our main paper as follows:</p>
285
286      <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@article{DBLP:journals/corr/abs-2412-18171,
287      author = {Xiaomeng Hu and