diff --git a/personas/index.html b/personas/index.html
index b814be3b..a2277857 100644
--- a/personas/index.html
+++ b/personas/index.html
@@ -7,7 +7,7 @@
-
+
@@ -181,7 +181,7 @@

From a mechanistic perspective, we find that safeguards are layer-specific, and that decoding directly from earlier layers may bypass safeguards and recover misaligned content that would otherwise not have been generated.
- We then use Patchscopes to analyze why certain user personas disable safeguards and find that they enable the model to form more charitable interpretations of otherwise dangerous queries.
+ We then use Patchscopes to analyze why certain user personas disable safeguards and find that they enable the model to form more charitable interpretations of otherwise dangerous queries.