Add HumanEval #1992
Conversation
Thanks for adding this (super helpful)! I'm currently running humaneval using your PR. Any ideas why greedy scores change when using different batch sizes? Seems odd to me, and I'm wondering if it indicates a bug.
It seems to be a more general and known issue (see huggingface/transformers#26869 or huggingface/transformers#25420 (comment)), but I'm not certain.
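For reference, one quick way to check whether this is just the usual padded-batch numerics rather than a bug in the task is to compare greedy outputs generated one prompt at a time against the same prompts generated in a left-padded batch. This is only a rough sketch, not part of the PR; the model name and prompts are placeholders.

```python
# Sketch: compare greedy generations at bsz=1 vs. a padded batch.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model; any causal LM should work
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"  # left-pad so generation starts at the same position
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

prompts = ["def add(a, b):", "def is_even(n):"]  # placeholder prompts

def generate(batch):
    inputs = tokenizer(batch, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            do_sample=False,
            max_new_tokens=32,
            pad_token_id=tokenizer.eos_token_id,
        )
    # Strip the (padded) prompt tokens, keep only the newly generated part.
    return tokenizer.batch_decode(
        out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )

one_by_one = [generate([p])[0] for p in prompts]
batched = generate(prompts)
for a, b in zip(one_by_one, batched):
    print("MATCH" if a == b else "MISMATCH")
```

If the outputs only diverge on a handful of prompts, it is likely the known padding/float-precision effect from the linked issues rather than something specific to this task.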
What is the status of this PR? Is there a reason why it hasn't landed?
I just pulled this PR and rebased it against the current main (…). I tested this with …; I think the code is correct, and I will run this on many other models if you want more testing.
Hi @RawthiL, I am currently testing HumanEval using GPT-4o-mini as a proof of concept. However, I encountered an issue where the following command results in a 0 value for the pass_at_1 metric. Could you kindly guide me on why this might be happening? I would greatly appreciate your insights. Thanks.

Output:
Hello! Any progress?
```
# Conflicts:
#	lm_eval/api/task.py
```
Hi! Sorry this took so long! Just added some confirmation boilerplate to ensure we handle unsafe code safely. Appreciate your patience in bearing with me.
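The exact boilerplate isn't quoted in this thread, but the idea is roughly the guard sketched below (an assumption on my part, not the code that was actually committed): refuse to execute model-generated code unless the user has explicitly opted in.

```python
# Minimal sketch of an "unsafe code" confirmation guard, assuming the opt-in
# is the HF_ALLOW_CODE_EVAL=1 environment variable used by the code_eval metric.
import os

def confirm_code_execution(interactive: bool = True) -> None:
    """Raise unless the user has explicitly allowed executing model-generated code."""
    if os.environ.get("HF_ALLOW_CODE_EVAL", "0") != "1":
        raise RuntimeError(
            "This task executes untrusted model-generated code. "
            "Set HF_ALLOW_CODE_EVAL=1 to enable it."
        )
    if interactive:
        answer = input("Model-generated code will be executed locally. Continue? [y/N] ")
        if answer.strip().lower() != "y":
            raise RuntimeError("Aborted by user.")
```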
Hi, I added the widely-used HumanEval benchmark. This partially resolves #1157.

The implementation relies on `pass@k` from the HF `evaluate` module, so it requires the environment variable `HF_ALLOW_CODE_EVAL=1` (see the sketch after this list). To implement this, I also made two minimal changes to lm-eval:

- a `custom` filter to utilize custom Python functions;
- for `pass@k`, multiple model-generated strings should be passed to the metric function, so I fixed the type casting of `gold` in `ConfigurableTask.process_results`.
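For anyone unfamiliar with the metric, the call the task ultimately relies on looks roughly like this (a sketch, assuming the HF `evaluate` implementation of pass@k, i.e. the `code_eval` metric; the example problem and candidates are made up):

```python
# HF_ALLOW_CODE_EVAL must be set before the metric will execute any code.
import os
os.environ["HF_ALLOW_CODE_EVAL"] = "1"

import evaluate

code_eval = evaluate.load("code_eval")

# One test program per problem, and a list of candidate completions per problem.
test_cases = ["assert add(2, 3) == 5"]
candidates = [[
    "def add(a, b):\n    return a + b",   # passes
    "def add(a, b):\n    return a * b",   # fails
]]

pass_at_k, results = code_eval.compute(
    references=test_cases, predictions=candidates, k=[1, 2]
)
print(pass_at_k)  # e.g. {'pass@1': 0.5, 'pass@2': 1.0}
```

This is also why multiple generations per document have to reach `process_results`, which is what the `gold` type-casting fix above enables.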
Here are some evaluations I ran for a sanity check. Due to limited resources, I used greedy generation (`humaneval_greedy`). The versions used were `torch==2.3.1` and `transformers==4.41.2`. I found that greedy generation scores can vary with batch size, so I reported results for `bsz=1` and `bsz=32`.

I also found that the poor performance of Mistral is due to its tokenizer: it changes the number of spaces when splitting continuation tokens from context tokens. For example:
However, I didn't attempt to fix it in this PR because it seems to change here, which may have a broader impact. Refer to the reference evaluation below for possible fixes.
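(The original example is not reproduced in this thread. As a rough illustration of the kind of behavior described above, one can check whether a context/continuation split round-trips through the tokenizer; the model ID below is an assumption.)

```python
# Sketch: does tokenizing context and continuation separately reproduce
# the tokenization of the concatenated string?
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")  # assumed model ID

context = "def f():\n    return"
continuation = " 1"

whole = tok.encode(context + continuation, add_special_tokens=False)
ctx = tok.encode(context, add_special_tokens=False)
cont = tok.encode(continuation, add_special_tokens=False)

# If leading spaces are handled differently when the continuation is encoded
# on its own, the concatenated pieces will not match the whole string.
print(tok.decode(whole))
print(tok.decode(ctx) + tok.decode(cont))
print(whole == ctx + cont)
```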
Reference evaluation details
Based on the official repo (https://github.com/openai/human-eval), I ran simple model generation through the following script. Then, I evaluated it as follows:
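(The actual generation script and evaluation command are not included in this thread. The usual flow with the official repo looks roughly like the sketch below; `my_model_generate` is a hypothetical stand-in for whatever model call was used, and the generation settings are assumptions.)

```python
# Sketch: produce samples.jsonl in the format expected by openai/human-eval.
from human_eval.data import read_problems, write_jsonl

problems = read_problems()
samples = []
for task_id, problem in problems.items():
    completion = my_model_generate(problem["prompt"])  # hypothetical generation call
    samples.append({"task_id": task_id, "completion": completion})
write_jsonl("samples.jsonl", samples)
```

followed by the repo's evaluation entry point:

```
evaluate_functional_correctness samples.jsonl
```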