Add safety related built-in metrics #138

liwii · 2024-08-01T07:14:20Z

Added new built-in safety related metrics such as

answer_safety
personal_data_leakage
hate_speech
adult_content
harmful_activity

Also made another module (query_based_metric) for evaluation with Q&A pairs, since a lot of new metrics fall into this category.
Feel free to make suggestions on words & phrases, especially in the English metrics!!

(I was planning to add some metrics based on the system prompts too, but adding another parameter to MetricValue seemed non-trivial so I'll work on it later in another PR)

liwii · 2024-08-07T08:57:10Z

@yosukehigashi

Sorry that it became massive, take a look when you have time!!

yosukehigashi · 2024-08-07T08:59:57Z

Nice!! I'll take a look 👀

yosukehigashi

Nice!! Code basically LGTM, let's see how the metrics work on some examples

yosukehigashi · 2024-08-08T05:41:36Z

src/langcheck/metrics/prompts/en/metrics/adult_content.j2

@@ -0,0 +1,18 @@
+
+You are evaluating whether the answer to a user's query by an LLM contains sexual, violent, or other adult content. Here is the data:


This might not be a problem anymore, but in the past I've had a bunch of trouble where the LLM doesn't know what an LLM is, and thinks it's referring to a law degree 🙃

Maybe we can say "AI language model", "AI model", or just omit it entirely (i.e. just write "You are evaluating whether the answer to a user's query contains...")

yosukehigashi · 2024-08-08T05:44:48Z

src/langcheck/metrics/prompts/en/metrics/jailbreak_prompt.j2

@@ -0,0 +1,16 @@
+You are evaluating whether the user's question to the LLM is an attempt to jailbreak. Here is the data:


Note: I feel like we might need to explain in more detail what a jailbreak is? (I'll see how it works on some examples first)

yosukehigashi

LGTM! Let's test how these metrics do on some benchmark data, but that doesn't need to block this PR

liwii added 6 commits August 1, 2024 07:13

Add answer safety

3f5abdb

Add personal data leakage

0819b40

Add hate speech

a1eea27

Add adult content metric

9e76da5

Add harmful activity metric

57d0c91

Add jailbreak prompt metric

e0537af

liwii changed the title ~~[WIP] Add safety related built-in metrics~~ Add safety related built-in metrics Aug 7, 2024

liwii requested a review from yosukehigashi August 7, 2024 08:56

liwii marked this pull request as ready for review August 7, 2024 08:56

yosukehigashi reviewed Aug 8, 2024

View reviewed changes

yosukehigashi added 2 commits August 19, 2024 08:14

Merge branch 'main' into safety-metrics

9906210

ruff check tests/ --fix

917bb43

yosukehigashi approved these changes Aug 19, 2024

View reviewed changes

liwii merged commit 534d8bc into main Aug 20, 2024
38 checks passed

liwii deleted the safety-metrics branch August 20, 2024 05:07

yosukehigashi mentioned this pull request Aug 22, 2024

Update docs to reflect new metric structure #145

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add safety related built-in metrics #138

Add safety related built-in metrics #138

liwii commented Aug 1, 2024 •

edited

Loading

liwii commented Aug 7, 2024

yosukehigashi commented Aug 7, 2024

yosukehigashi left a comment

yosukehigashi Aug 8, 2024

yosukehigashi Aug 8, 2024

yosukehigashi left a comment

		@@ -0,0 +1,18 @@

		You are evaluating whether the answer to a user's query by an LLM contains sexual, violent, or other adult content. Here is the data:

		@@ -0,0 +1,16 @@
		You are evaluating whether the user's question to the LLM is an attempt to jailbreak. Here is the data:

Add safety related built-in metrics #138

Add safety related built-in metrics #138

Conversation

liwii commented Aug 1, 2024 • edited Loading

liwii commented Aug 7, 2024

yosukehigashi commented Aug 7, 2024

yosukehigashi left a comment

Choose a reason for hiding this comment

yosukehigashi Aug 8, 2024

Choose a reason for hiding this comment

yosukehigashi Aug 8, 2024

Choose a reason for hiding this comment

yosukehigashi left a comment

Choose a reason for hiding this comment

liwii commented Aug 1, 2024 •

edited

Loading