Add HumanEval #1992

Merged: 13 commits merged into EleutherAI:main on Jan 15, 2025
Conversation

@hjlee1371 (Contributor) commented Jun 19, 2024

Hi, I added the widely-used HumanEval benchmark. This partially resolves #1157.

The implementation relies on pass@k from the HF evaluate module, so it requires the environment variable HF_ALLOW_CODE_EVAL=1. To implement this, I also made two minimal changes to lm-eval:

  • HumanEval needs to concatenate the prompt and completion to build the full output code. I added a custom filter so that tasks can use custom Python functions as filters (a rough sketch of the idea follows this list).
  • To estimate pass@k, multiple model-generated strings must be passed to the metric function, so I fixed the type casting of gold in ConfigurableTask.process_results.
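For reference, here is a rough sketch of what the concatenation filter does. This is illustrative only; the exact Filter base class and apply() signature in lm-eval may differ from this standalone function.

def build_full_programs(resps, docs):
    """For each doc, prepend the HumanEval prompt to every generated completion.

    resps: an iterable of per-document lists of model completions.
    docs:  the corresponding HumanEval documents, each with a "prompt" field.
    """
    return [
        [doc["prompt"] + completion for completion in completions]
        for completions, doc in zip(resps, docs)
    ]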

Here are some evaluations I ran as a sanity check. Due to limited resources, I used greedy generation (humaneval_greedy). The versions used were torch==2.3.1 and transformers==4.41.2.

| Model           | Reference (see below) | lm-eval (bsz=1) | lm-eval (bsz=32) |
|-----------------|----------------------:|----------------:|-----------------:|
| Meta-Llama-3-8B | 0.3780                | 0.3780          | 0.3720           |
| gemma-7b        | 0.3232                | 0.3232          | 0.3110           |
| Qwen2-7B        | 0.4756                | 0.4756          | 0.5061           |
| Mistral-7B-v0.3 | 0.2744                | 0.0122          | 0.0122           |

I found that greedy generation scores can vary with batch sizes, so I reported results for bsz=1 and bsz=32.

I also found that Mistral's poor lm-eval score is due to its tokenizer: it changes the number of spaces when the continuation tokens are split off from the context tokens. For example:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.3")
text = "\n    def foo(x):"
num_context_tokens = len(tokenizer.encode("\n", add_special_tokens=False))
print(text[1:])
# '    def foo(x):'
print(tokenizer.decode(tokenizer.encode(text, add_special_tokens=False)[num_context_tokens:]))
# '   def foo(x):'

However, I didn't attempt to fix it in this PR because the fix seems to require changing how the harness splits continuation tokens from the context, which could have a broader impact. The reference evaluation below sidesteps the issue by decoding the full output and stripping the prompt at the string level (completion[len(prompt):]).
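For completeness, here is a standalone example of the HF evaluate code_eval metric that the task's pass@k relies on. This is only an illustration of the metric on a toy problem, not the harness wiring; the metric executes generated code, which is why HF_ALLOW_CODE_EVAL=1 must be set.

import os
os.environ["HF_ALLOW_CODE_EVAL"] = "1"  # code_eval refuses to run without this

import evaluate

code_eval = evaluate.load("code_eval")

# One reference (test code) per problem, and a list of candidate programs per
# problem; the candidates are prompt + completion, as produced by the filter above.
references = ["assert add(2, 3) == 5"]
candidates = [["def add(a, b):\n    return a + b"]]

pass_at_k, results = code_eval.compute(
    references=references,
    predictions=candidates,
    k=[1],
)
print(pass_at_k)  # {'pass@1': 1.0}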

Reference evaluation details: Based on the official repo (https://github.com/openai/human-eval), I ran simple model generation with the following script.
import os
import argparse
from transformers import AutoModelForCausalLM, AutoTokenizer
from human_eval.data import write_jsonl, read_problems

STOP_STRINGS = [
    "\nclass",
    "\ndef",
    "\n#",
    "\nif",
    "\nprint",
]

def generate_one_completion(prompt, model, tokenizer):
    """Generate one completion for a given prompt."""
    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"].cuda()
    # Generate completion
    output = model.generate(
        input_ids=input_ids,
        tokenizer=tokenizer,
        do_sample=False,
        stop_strings=STOP_STRINGS,
        max_new_tokens=1024,
    )
    completion = tokenizer.decode(output[0], skip_special_tokens=True)
    completion = completion[len(prompt):]
    for stop_string in STOP_STRINGS:
        completion = completion.split(stop_string)[0]
    return completion

def main(args):
    model = AutoModelForCausalLM.from_pretrained(args.model, torch_dtype="auto").cuda()
    tokenizer = AutoTokenizer.from_pretrained(args.model)

    problems = read_problems()

    num_samples_per_task = 1
    samples = [
        dict(
            task_id=task_id,
            completion=generate_one_completion(
                problems[task_id]["prompt"],
                model,
                tokenizer,
            ),
        )
        for task_id in problems
        for _ in range(num_samples_per_task)
    ]

    model_name = os.path.basename(args.model)
    write_jsonl(f"{model_name}.jsonl", samples)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", type=str, required=True)
    args = parser.parse_args()

    main(args)

Then, I evaluated it as follows:

evaluate_functional_correctness $MODEL_NAME.jsonl

@CLAassistant commented Jun 19, 2024

CLA assistant check: all committers have signed the CLA.

jasonkrone added a commit to jasonkrone/lm-evaluation-harness that referenced this pull request Jul 10, 2024
@jasonkrone (Contributor)

Thanks for adding this (super helpful)! I'm currently running humaneval using your PR.

Any ideas why greedy scores change when using different batch sizes? Seems odd to me, and I'm wondering if it indicates a bug.

@hjlee1371 (Contributor, Author)

It seems to be a more general and known issue (see huggingface/transformers#26869 or huggingface/transformers#25420 (comment)), but I'm not certain.
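For anyone who wants to reproduce the effect outside the harness, a minimal check along these lines (model name and prompts are placeholders) compares greedy outputs at bsz=1 against a left-padded batch; any differences come from padding and batched numerics rather than from the task config.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; substitute the model under test
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

prompts = ["def add(a, b):", "def is_even(n):"]  # placeholder prompts

# Greedy generation one prompt at a time (bsz=1).
single = []
for p in prompts:
    ids = tokenizer(p, return_tensors="pt").input_ids
    out = model.generate(ids, do_sample=False, max_new_tokens=32,
                         pad_token_id=tokenizer.eos_token_id)
    single.append(tokenizer.decode(out[0][ids.shape[1]:], skip_special_tokens=True))

# Greedy generation with both prompts in one left-padded batch.
batch = tokenizer(prompts, return_tensors="pt", padding=True)
out = model.generate(**batch, do_sample=False, max_new_tokens=32,
                     pad_token_id=tokenizer.eos_token_id)
batched = [tokenizer.decode(o[batch["input_ids"].shape[1]:], skip_special_tokens=True)
           for o in out]

# If any of these print False, batching itself changes the greedy continuations.
for s, b in zip(single, batched):
    print(s == b)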

@hjlee1371 mentioned this pull request on Aug 23, 2024
@RylanSchaeffer (Contributor)

What is the status of this PR? Is there a reason why it hasn't landed?

@RawthiL (Contributor) commented Nov 5, 2024

I just pulled this PR and rebased it against the current main (26f607f5432e1d09c55b25488c43523e7ecde657), and ran into no issues.

I tested this with humaneval_greedy on meta-llama/Llama-3.1-8B-Instruct; the results are:

|     Tasks      |Version|Filter|n-shot| Metric  |   |Value |   |Stderr|
|----------------|------:|------|-----:|---------|---|-----:|---|-----:|
|humaneval_greedy|      1|n=1   |     0|pass_at_1|↑  |0.6341|±  |0.0377|

I think the code is correct. I can run this on more models if you want additional testing.

@johnsonafool

Hi @RawthiL,

I am currently testing HumanEval with GPT-4o-mini as a proof of concept, but the following command yields a pass_at_1 of 0. Could you point me to what might be going wrong? Thanks.

lm_eval \
    --model openai-chat-completions \
    --model_args model=gpt-4o-mini,num_concurrent=5 \
    --tasks humaneval_greedy \
    --apply_chat_template

Output:

|     Tasks      |Version|Filter|n-shot| Metric  |Value|   |Stderr|
|----------------|------:|------|-----:|---------|----:|---|-----:|
|humaneval_greedy|      1|n=1   |     0|pass_at_1|    0|±  |     0|

@veritas9872

Hello! Any progress?

@baberabb self-requested a review as a code owner on January 15, 2025
@baberabb (Contributor)

Hi! Sorry this took so long! I just added some confirmation boilerplate to ensure we handle unsafe code safely. Thanks for bearing with me.

@baberabb merged commit 4c11206 into EleutherAI:main on Jan 15, 2025 (7 of 8 checks passed)

Successfully merging this pull request may close these issues: [Discussion] Add Major Code Benchmarks