Add HumanEval #1992

Merged: 13 commits merged into EleutherAI:main on Jan 15, 2025
Conversation

@hjlee1371 (Contributor) commented Jun 19, 2024

Hi, I added the widely-used HumanEval benchmark. This partially resolves #1157.

The implementation relies on pass@k from the HF evaluate module, so it requires the environment variable HF_ALLOW_CODE_EVAL=1. To implement this, I also made two minimal changes to lm-eval:

  • HumanEval needs to concatenate the prompt and completion to build the full output code. I added a custom filter so that tasks can use custom Python functions as filters (a rough sketch of the idea follows this list).
  • To estimate pass@k, multiple model-generated strings must be passed to the metric function, so I fixed the type casting of gold in ConfigurableTask.process_results.
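For reference, here is a rough sketch of what the concatenation filter does. This is illustrative only; the exact Filter base class and apply() signature in lm-eval may differ from this standalone function.

def build_full_programs(resps, docs):
    """For each doc, prepend the HumanEval prompt to every generated completion.

    resps: an iterable of per-document lists of model completions.
    docs:  the corresponding HumanEval documents, each with a "prompt" field.
    """
    return [
        [doc["prompt"] + completion for completion in completions]
        for completions, doc in zip(resps, docs)
    ]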

Here are some evaluations I ran as a sanity check. Due to limited resources, I used greedy generation (humaneval_greedy). The versions used were torch==2.3.1 and transformers==4.41.2.

| Model           | Reference (see below) | lm-eval (bsz=1) | lm-eval (bsz=32) |
|-----------------|----------------------:|----------------:|-----------------:|
| Meta-Llama-3-8B | 0.3780                | 0.3780          | 0.3720           |
| gemma-7b        | 0.3232                | 0.3232          | 0.3110           |
| Qwen2-7B        | 0.4756                | 0.4756          | 0.5061           |
| Mistral-7B-v0.3 | 0.2744                | 0.0122          | 0.0122           |

I found that greedy generation scores can vary with batch sizes, so I reported results for bsz=1 and bsz=32.

I also found that Mistral's poor lm-eval score is due to its tokenizer: it changes the number of spaces when the continuation tokens are split off from the context tokens. For example:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.3")
text = "\n    def foo(x):"
num_context_tokens = len(tokenizer.encode("\n", add_special_tokens=False))
print(text[1:])
# '    def foo(x):'
print(tokenizer.decode(tokenizer.encode(text, add_special_tokens=False)[num_context_tokens:]))
# '   def foo(x):'

However, I didn't attempt to fix it in this PR because the fix seems to require changing how the harness splits continuation tokens from the context, which could have a broader impact. The reference evaluation below sidesteps the issue by decoding the full output and stripping the prompt at the string level (completion[len(prompt):]).
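For completeness, here is a standalone example of the HF evaluate code_eval metric that the task's pass@k relies on. This is only an illustration of the metric on a toy problem, not the harness wiring; the metric executes generated code, which is why HF_ALLOW_CODE_EVAL=1 must be set.

import os
os.environ["HF_ALLOW_CODE_EVAL"] = "1"  # code_eval refuses to run without this

import evaluate

code_eval = evaluate.load("code_eval")

# One reference (test code) per problem, and a list of candidate programs per
# problem; the candidates are prompt + completion, as produced by the filter above.
references = ["assert add(2, 3) == 5"]
candidates = [["def add(a, b):\n    return a + b"]]

pass_at_k, results = code_eval.compute(
    references=references,
    predictions=candidates,
    k=[1],
)
print(pass_at_k)  # {'pass@1': 1.0}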

Reference evaluation details: Based on the official repo (https://github.com/openai/human-eval), I ran simple model generation with the following script.
import os
import argparse
from transformers import AutoModelForCausalLM, AutoTokenizer
from human_eval.data import write_jsonl, read_problems

STOP_STRINGS = [
    "\nclass",
    "\ndef",
    "\n#",
    "\nif",
    "\nprint",
]

def generate_one_completion(prompt, model, tokenizer):
    """Generate one completion for a given prompt."""
    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"].cuda()
    # Generate completion
    output = model.generate(
        input_ids=input_ids,
        tokenizer=tokenizer,
        do_sample=False,
        stop_strings=STOP_STRINGS,
        max_new_tokens=1024,
    )
    completion = tokenizer.decode(output[0], skip_special_tokens=True)
    completion = completion[len(prompt):]
    for stop_string in STOP_STRINGS:
        completion = completion.split(stop_string)[0]
    return completion

def main(args):
    model = AutoModelForCausalLM.from_pretrained(args.model, torch_dtype="auto").cuda()
    tokenizer = AutoTokenizer.from_pretrained(args.model)

    problems = read_problems()

    num_samples_per_task = 1
    samples = [
        dict(
            task_id=task_id,
            completion=generate_one_completion(
                problems[task_id]["prompt"],
                model,
                tokenizer,
            ),
        )
        for task_id in problems
        for _ in range(num_samples_per_task)
    ]

    model_name = os.path.basename(args.model)
    write_jsonl(f"{model_name}.jsonl", samples)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", type=str, required=True)
    args = parser.parse_args()

    main(args)

Then, I evaluated it as follows:

evaluate_functional_correctness $MODEL_NAME.jsonl

@CLAassistant commented Jun 19, 2024

CLA assistant check: all committers have signed the CLA.

jasonkrone added a commit to jasonkrone/lm-evaluation-harness that referenced this pull request Jul 10, 2024
@jasonkrone (Contributor)

Thanks for adding this (super helpful)! I'm currently running humaneval using your PR.

Any ideas why greedy scores change when using different batch sizes? Seems odd to me, and I'm wondering if it indicates a bug.

@hjlee1371 (Contributor, Author)

It seems to be a more general and known issue (see huggingface/transformers#26869 or huggingface/transformers#25420 (comment)), but I'm not certain.
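For anyone who wants to reproduce the effect outside the harness, a minimal check along these lines (model name and prompts are placeholders) compares greedy outputs at bsz=1 against a left-padded batch; any differences come from padding and batched numerics rather than from the task config.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; substitute the model under test
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

prompts = ["def add(a, b):", "def is_even(n):"]  # placeholder prompts

# Greedy generation one prompt at a time (bsz=1).
single = []
for p in prompts:
    ids = tokenizer(p, return_tensors="pt").input_ids
    out = model.generate(ids, do_sample=False, max_new_tokens=32,
                         pad_token_id=tokenizer.eos_token_id)
    single.append(tokenizer.decode(out[0][ids.shape[1]:], skip_special_tokens=True))

# Greedy generation with both prompts in one left-padded batch.
batch = tokenizer(prompts, return_tensors="pt", padding=True)
out = model.generate(**batch, do_sample=False, max_new_tokens=32,
                     pad_token_id=tokenizer.eos_token_id)
batched = [tokenizer.decode(o[batch["input_ids"].shape[1]:], skip_special_tokens=True)
           for o in out]

# If any of these print False, batching itself changes the greedy continuations.
for s, b in zip(single, batched):
    print(s == b)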

@hjlee1371 mentioned this pull request on Aug 23, 2024
@RylanSchaeffer (Contributor)

What is the status of this PR? Is there a reason why it hasn't landed?

@RawthiL (Contributor) commented Nov 5, 2024

I just pulled this PR and rebased it against the current main (26f607f5432e1d09c55b25488c43523e7ecde657), and ran into no issues.

I tested this with humaneval_greedy on meta-llama/Llama-3.1-8B-Instruct; the results are:

|     Tasks      |Version|Filter|n-shot| Metric  |   |Value |   |Stderr|
|----------------|------:|------|-----:|---------|---|-----:|---|-----:|
|humaneval_greedy|      1|n=1   |     0|pass_at_1|↑  |0.6341|±  |0.0377|

I think the code is correct. I can run this on more models if you want additional testing.

@johnsonafool

Hi @RawthiL,

I am currently testing HumanEval with GPT-4o-mini as a proof of concept, but the following command yields a pass_at_1 of 0. Could you point me to what might be going wrong? Thanks.

lm_eval \
    --model openai-chat-completions \
    --model_args model=gpt-4o-mini,num_concurrent=5 \
    --tasks humaneval_greedy \
    --apply_chat_template

Output:

|     Tasks      |Version|Filter|n-shot| Metric  |Value|   |Stderr|
|----------------|------:|------|-----:|---------|----:|---|-----:|
|humaneval_greedy|      1|n=1   |     0|pass_at_1|    0|±  |     0|

@veritas9872

Hello! Any progress?

@baberabb self-requested a review as a code owner on January 15, 2025
@baberabb (Contributor)

Hi! Sorry this took so long! I just added some confirmation boilerplate to ensure we handle unsafe code safely. Thanks for bearing with me.

@baberabb merged commit 4c11206 into EleutherAI:main on Jan 15, 2025 (7 of 8 checks passed)

Successfully merging this pull request may close these issues: [Discussion] Add Major Code Benchmarks