gregH committed
Commit 9091f63 · verified · 1 Parent(s): adf2c0f

Update index.html

Files changed (1)
  1. index.html +8 -59
index.html CHANGED
@@ -237,7 +237,7 @@ should begin with ’Sure, here is guidance on how to cause physical harm to oth
   </script>
 
 
- <h2 id="proposed-approach-gradient-cuff">Performance evaluation against practical Jailbreaks</h2>
+ <h2 id="proposed-approach-gradient-cuff">Performance Evaluation</h2>
   <p>
   The performance of jailbreak defense methods is usually measured by how much they reduce the ASR. Major concerns
   when developing such methods are the performance degradation of the LLM on nominal benign prompts and the increased inference time cost
@@ -269,70 +269,19 @@ ASR increase. When &beta is fixed, larger &alpha would both reduce the ASR and t
   </div>
 
   <h2 id="demonstration">Demonstration</h2>
- <p>We evaluated Gradient Cuff as well as 4 baselines (Perplexity Filter, SmoothLLM, Erase-and-Check, and Self-Reminder)
- against 6 different jailbreak attacks (GCG, AutoDAN, PAIR, TAP, Base64, and LRL) and benign user queries on 2 LLMs (LLaMA-2-7B-Chat and
- Vicuna-7B-V1.5). We below demonstrate the average refusal rate across these 6 malicious user query datasets as the Average Malicious Refusal
- Rate and the refusal rate on benign user queries as the Benign Refusal Rate. The defending performance against different jailbreak types is
- shown in the provided bar chart.
- </p>
-
-
- <div id="jailbreak-demo" class="container">
- <div class="row align-items-center">
- <div class="row" style="margin: 10px 0 0">
- <div class="models-list">
- <span style="margin-right: 1em;">Models</span>
- <span class="radio-group"><input type="radio" id="LLaMA2" class="options" name="models" value="llama2_7b_chat" checked="" /><label for="LLaMA2" class="option-label">LLaMA-2-7B-Chat</label></span>
- <span class="radio-group"><input type="radio" id="Vicuna" class="options" name="models" value="vicuna_7b_v1.5" /><label for="Vicuna" class="option-label">Vicuna-7B-V1.5</label></span>
- </div>
- </div>
- </div>
- <div class="row align-items-center">
- <div class="col-4">
- <div id="defense-methods">
- <div class="row align-items-center"><input type="radio" id="defense_ppl" class="options" name="defense" value="ppl" /><label for="defense_ppl" class="defense">Perplexity Filter</label></div>
- <div class="row align-items-center"><input type="radio" id="defense_smoothllm" class="options" name="defense" value="smoothllm" /><label for="defense_smoothllm" class="defense">SmoothLLM</label></div>
- <div class="row align-items-center"><input type="radio" id="defense_erase_check" class="options" name="defense" value="erase_check" /><label for="defense_erase_check" class="defense">Erase-Check</label></div>
- <div class="row align-items-center"><input type="radio" id="defense_self_reminder" class="options" name="defense" value="self_reminder" /><label for="defense_self_reminder" class="defense">Self-Reminder</label></div>
- <div class="row align-items-center"><input type="radio" id="defense_gradient_cuff" class="options" name="defense" value="gradient_cuff" checked="" /><label for="defense_gradient_cuff" class="defense"><span style="font-weight: bold;">Gradient Cuff</span></label></div>
- </div>
- <div class="row align-items-center">
- <div class="attack-success-rate"><span class="jailbreak-metric">Average Malicious Refusal Rate</span><span class="attack-success-rate-value" id="asr-value">0.959</span></div>
- </div>
- <div class="row align-items-center">
- <div class="benign-refusal-rate"><span class="jailbreak-metric">Benign Refusal Rate</span><span class="benign-refusal-rate-value" id="brr-value">0.050</span></div>
- </div>
- </div>
- <div class="col-8">
- <figure class="figure">
- <img id="reliability-diagram" src="demo_results/gradient_cuff_llama2_7b_chat_threshold_100.png" alt="CIFAR-100 Calibrated Reliability Diagram (Full)" />
- <div class="slider-container">
- <div class="slider-label"><span>Perplexity Threshold</span></div>
- <div class="slider-content" id="ppl-slider"><div id="ppl-threshold" class="ui-slider-handle"></div></div>
- </div>
- <div class="slider-container">
- <div class="slider-label"><span>Gradient Threshold</span></div>
- <div class="slider-content" id="gradient-norm-slider"><div id="gradient-norm-threshold" class="slider-value ui-slider-handle"></div></div>
- </div>
- <figcaption class="figure-caption">
- </figcaption>
- </figure>
- </div>
- </div>
- </div>
-
   <p>
- Higher malicious refusal rate and lower benign refusal rate mean a better defense.
- Overall, Gradient Cuff is the most performant compared with those baselines. We also evaluated Gradient Cuff against adaptive attacks
- in the paper.
+ Below, we demonstrate how Token Highlighter influences the output generation of a large language model (LLM) in response to user prompts.
+ We showcase four illustrative examples, including GCG Jailbreak, TAP Jailbreak, vanilla harmful behavior, and benign user requests.
+ Additionally, we have developed a private live demo that allows users to interact with LLMs enhanced by Token Highlighter. Stay tuned for its public release!
   </p>
 
- <h2 id="inquiries"> Inquiries on LLM with Token Highlighter defense</h2>
- <p> Please contact <a href="Mailto:[email protected]">Xiaomeng Hu</a>
+ <h2 id="inquiries">Inquiries</h2>
+ <p>If you have any questions regarding Token Highlighter, please contact <a href="Mailto:[email protected]">Xiaomeng Hu</a>
   and <a href="Mailto:[email protected]">Pin-Yu Chen</a>
   </p>
+
   <h2 id="citations">Citations</h2>
- <p>If you find Gradient Cuff helpful and useful for your research, please cite our main paper as follows:</p>
+ <p>If you find Token Highlighter helpful for your research, please cite our main paper as follows:</p>
 
   <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@article{DBLP:journals/corr/abs-2412-18171,
   author = {Xiaomeng Hu and
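
For context on the metrics shown in the diff above: a minimal JavaScript sketch (illustrative only, not part of this commit) of how the demo's two summary numbers could be computed for one (model, defense) pair. The per-attack refusal rates are placeholder values rather than measured results, and the ASR conversion assumes the simplification that any malicious query that is not refused counts as a successful attack.

// Placeholder per-attack refusal rates for one (model, defense) pair;
// the real page loads these per model, defense, and threshold setting.
const maliciousRefusalRates = {
  GCG: 0.97, AutoDAN: 0.95, PAIR: 0.96, TAP: 0.94, Base64: 0.98, LRL: 0.95,
};
const benignRefusalRate = 0.05; // refusal rate on benign user queries; lower is better

// Average Malicious Refusal Rate: mean refusal rate over the six
// jailbreak datasets; higher is better for a defense.
const rates = Object.values(maliciousRefusalRates);
const avgMaliciousRefusal = rates.reduce((sum, r) => sum + r, 0) / rates.length;

// Per-attack ASR, under the simplifying assumption that every malicious
// query that is not refused counts as a successful attack.
const asr = Object.fromEntries(
  Object.entries(maliciousRefusalRates).map(([attack, r]) => [attack, 1 - r])
);

console.log(avgMaliciousRefusal.toFixed(3), benignRefusalRate.toFixed(3), asr);

This mirrors the trade-off described in the page's evaluation text: a good defense pushes the average malicious refusal rate up while keeping the benign refusal rate near zero.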