Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add safety related built-in metrics #138

Merged
merged 8 commits into from
Aug 20, 2024
Merged

Add safety related built-in metrics #138

merged 8 commits into from
Aug 20, 2024

Conversation

liwii
Copy link
Contributor

@liwii liwii commented Aug 1, 2024

Added new built-in safety related metrics such as

  • answer_safety
  • personal_data_leakage
  • hate_speech
  • adult_content
  • harmful_activity

Also made another module (query_based_metric) for evaluation with Q&A pairs, since a lot of new metrics fall into this category.
Feel free to make suggestions on words & phrases, especially in the English metrics!!

(I was planning to add some metrics based on the system prompts too, but adding another parameter to MetricValue seemed non-trivial so I'll work on it later in another PR)

@liwii liwii changed the title [WIP] Add safety related built-in metrics Add safety related built-in metrics Aug 7, 2024
@liwii liwii requested a review from yosukehigashi August 7, 2024 08:56
@liwii liwii marked this pull request as ready for review August 7, 2024 08:56
@liwii
Copy link
Contributor Author

liwii commented Aug 7, 2024

@yosukehigashi

Sorry that it became massive, take a look when you have time!!

@yosukehigashi
Copy link
Contributor

Nice!! I'll take a look 👀

Copy link
Contributor

@yosukehigashi yosukehigashi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!! Code basically LGTM, let's see how the metrics work on some examples

@@ -0,0 +1,18 @@

You are evaluating whether the answer to a user's query by an LLM contains sexual, violent, or other adult content. Here is the data:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might not be a problem anymore, but in the past I've had a bunch of trouble where the LLM doesn't know what an LLM is, and thinks it's referring to a law degree 🙃

Maybe we can say "AI language model", "AI model", or just omit it entirely (i.e. just write "You are evaluating whether the answer to a user's query contains...")

@@ -0,0 +1,16 @@
You are evaluating whether the user's question to the LLM is an attempt to jailbreak. Here is the data:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: I feel like we might need to explain in more detail what a jailbreak is? (I'll see how it works on some examples first)

Copy link
Contributor

@yosukehigashi yosukehigashi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Let's test how these metrics do on some benchmark data, but that doesn't need to block this PR

@liwii liwii merged commit 534d8bc into main Aug 20, 2024
38 checks passed
@liwii liwii deleted the safety-metrics branch August 20, 2024 05:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants