From f576a0e149a9a60683460e56e07b770dc3b253d0 Mon Sep 17 00:00:00 2001
From: Asma Ghandeharioun
Date: Thu, 5 Sep 2024 17:28:05 -0400
Subject: [PATCH] linking patchscopes

---
 personas/index.html | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/personas/index.html b/personas/index.html
index b814be3b..a2277857 100644
--- a/personas/index.html
+++ b/personas/index.html
@@ -7,7 +7,7 @@
-
+
@@ -181,7 +181,7 @@

From a mechanistic perspective, we find that safeguards are layer-specific, and that decoding directly from earlier layers may bypass safeguards and recover misaligned content that would otherwise not have been generated.
- We then use Patchscopes to analyze why certain user personas disable safeguards and find that they enable the model to form more charitable interpretations of otherwise dangerous queries.
+ We then use Patchscopes to analyze why certain user personas disable safeguards and find that they enable the model to form more charitable interpretations of otherwise dangerous queries.