Update index.html
index.html +2 -2
index.html
CHANGED
@@ -81,8 +81,8 @@ Exploring Refusal Loss Landscapes </title>
 <p>Current transformer-based LLMs will return different responses to the same query due to the randomness of
 autoregressive sampling-based generation. With this randomness, it is an
 interesting phenomenon that a malicious user query will sometimes be rejected by the target LLM, but
-sometimes be able to bypass the safety guardrail. Based on this observation, for a given LLM
-define the refusal loss function $\phi_\theta(x)$ for a given input user query $x$ as below:
+sometimes be able to bypass the safety guardrail. Based on this observation, for a given LLM <p>$T_\theta$</p>
+parameterized with $\theta$, we define the refusal loss function $\phi_\theta(x)$ for a given input user query $x$ as below:
 </p>
 
 <div class="container jailbreak-intro-sec">