HumanEval Benchmark

by Stephen M. Walker II, Co-Founder / CEO

What is the HumanEval Benchmark?

OpenAI developed HumanEval and released it in 2021 with the Codex model. Codex, a GPT model fine-tuned on GitHub code, powers GitHub Copilot and solves 28.8% of the original evaluation problems, outperforming GPT-3 and GPT-J. With repeat sampling, its success rate increases to 70.2%. However, it struggled with complex docstrings and variable bindings, raising concerns about safety, security, and economic impacts.

The HumanEval benchmark is a dataset designed to evaluate the code generation capabilities of large language models (LLMs). It consists of 164 hand-crafted programming challenges, each including a function signature, docstring, body, and several unit tests, averaging 7.7 tests per problem. These challenges assess a model's understanding of language, algorithms, and simple mathematics, and are comparable to simple software interview questions.

HumanEval is used to measure the functional correctness of code generated by LLMs from docstrings. The models are evaluated based on their ability to generate code that passes the provided unit tests. The pass@k metric is used to assess the performance, where 'k' different solutions are generated by the model, and if any of these solutions pass all the unit tests, the model is considered to have solved the problem.

This benchmark was created to prevent data leakage by ensuring the problems were not included in the training sets of code generation models, thus providing a fair assessment of a model's ability to generate novel code. The HumanEval dataset has become a popular tool for assessing the progress and capabilities of code generation models since its inception in mid-2021.

The benchmark is also used to compare the performance of different models, with the current state-of-the-art on HumanEval being the Language Agent Tree Search (GPT-4). Despite its popularity, it is recognized that solving programming questions in HumanEval does not encompass all aspects of a code model's potential applications, such as code explanation, docstring generation, code infilling, and writing tests. Therefore, while HumanEval is a significant step towards evaluating code generation models, it is part of an ongoing effort to improve code benchmarking.

Current Leaderboard

As of June 25, 2024, the leaderboard is currently led by GPT-4o with LDB assistant. However, on March 4, 2024, Anthropic released Claude 3 Opus, which scores 84.9.

The leaderboard for the HumanEval benchmark ranks models based on the innovative pass@1, pass@10, and pass@100 metrics. These metrics offer a more meaningful and practical assessment of a model's ability to solve problems.

Rank	Model	pass@1	License
1	🥇 LDB (GPT-4o, based on seed programs from Reflexion)	98.2	Proprietary
2	🥇 LDB (GPT-4, based on seed programs from Reflexion)	96.9	Proprietary
3	🥈 AgentCoder (GPT-4)	96.3	Proprietary
4	🥈 LDB (GPT-3.5, based on seed programs from Reflexion)	95.1	Proprietary
5	🥉 Language Agent Tree Search (GPT-4)	94.4	Proprietary
6	MapCoder (GPT-4)	93.9	Proprietary
7	L2MAC (GPT-4)	90.2	Proprietary
8	OctorCoder (GPT-4)	86.6	Proprietary
9	ANPL (GPT-4)	86.6	Proprietary
10	MetaGPT (GPT-4)	85.9	Proprietary

How is the HumanEval benchmark used in code generation research?

The HumanEval benchmark evaluates the functional correctness of code generated by large language models (LLMs) through 164 programming challenges. These challenges test a model's grasp of language, algorithms, and mathematics by requiring it to generate code from docstrings that must pass specific unit tests. The pass@k metric assesses model performance, considering a model successful if any of its 'k' generated solutions pass all tests.

This benchmark not only ranks models, with GPT-4 currently leading, but also highlights biases and evaluation flaws, such as a focus on a narrow range of programming concepts and an abundance of simpler questions. While HumanEval is crucial for evaluating code generation, it doesn't cover all potential applications like code explanation or docstring generation, indicating the need for broader benchmarking efforts.

HumanEval's scope extends beyond Python, with versions available for other programming languages, enhancing its utility in code generation research.

How does HumanEval work?

HumanEval is a benchmark dataset designed to evaluate the functional correctness of code generated by large language models (LLMs). It consists of 164 hand-crafted programming challenges, each including a function signature, docstring, body, and several unit tests, with an average of 7.7 tests per problem.

The evaluation process involves generating multiple solutions based on a given prompt and running unit tests on these solutions. If any of the generated solutions pass the unit tests, it's counted as a win. This approach is encapsulated in the pass@k metric, where k represents the number of different solutions generated by the model. For instance, pass@1 means the model's first solution is tested, pass@10 means any of the model's first 10 solutions can pass the tests, and so on.

While HumanEval has become a popular benchmark for code generation models, it's important to note that it doesn't capture all aspects of real-world usage of code models. For example, tasks like code explanation, docstring generation, code infilling, and writing tests are not evaluated by HumanEval.

The HumanEval benchmark was introduced in the paper "Evaluating Large Language Models Trained on Code" by OpenAI, and the evaluation harness for the dataset is available on GitHub. The leaderboard for the benchmark, hosted by Papers with Code, ranks models based on the pass@k metrics.

The Pass@k Metric

The pass@k metric is a performance evaluation measure used primarily for models that generate code, such as OpenAI's Codex. It assesses the probability that at least one of the top k-generated code samples for a given problem passes the unit tests. This metric is particularly useful because it aligns more closely with the practices of human developers, focusing on functional correctness rather than text similarity.

When evaluating pass@k, a dataset of natural language/code pairs is used. For each natural language prompt, the model generates k code snippets. If at least one of these code snippets is correct, the model is considered to have succeeded for that prompt. The pass@k value is then the fraction of prompts for which the model succeeded.

The pass@k metric can be calculated using a numerically stable method, as shown in the provided Python function from one of the search results. This method is important because it avoids numerical instability or overflow that can occur when dealing with large values of n, c, and k, where n is the total number of samples, c is the number of correct samples, and k is the number of top samples considered.

It's important to note that the pass@k metric does not behave like a simple probability because the trials are not independent. The generation of code snippets is based on a stochastic procedure, which means that the probability of success does not scale linearly with the number of attempts (k). Therefore, pass@100 will not be 99.999999% even if pass@1 is 28%, because the trials are not independent and the probability of success does not compound in a straightforward manner.

When to use HumanEval

The HumanEval benchmark is used in several scenarios within the field of machine learning, particularly when dealing with large language models (LLMs) and code generation:

Evaluating Functional Correctness: HumanEval is primarily used to measure the functional correctness of code generated by LLMs. It provides a set of programming challenges that the models must solve by generating code from docstrings. The generated code is then evaluated based on its ability to pass the provided unit tests.
Comparing Model Performance: HumanEval serves as a benchmark for comparing the performance of different LLMs in code generation tasks. It provides a standardized set of challenges that all models must solve, allowing for a fair comparison of their capabilities.
Research and Development: HumanEval is used in research to study the capabilities and limitations of LLMs in code generation. It helps expose biases and shortcomings in code generation models, providing insights that can guide the development of improved models.
Benchmarking New Models: When new models are developed, HumanEval can be used to benchmark their performance against existing models. This helps determine whether the new models offer any improvements over the current state-of-the-art.
Developing New Evaluation Metrics: The use of HumanEval has led to the development of new evaluation metrics like the pass@k metric, which provides a more meaningful and practical assessment of a model's ability to solve programming challenges.

However, it's important to note that while HumanEval is a valuable tool, it does not capture all aspects of a code model's potential applications, such as code explanation, docstring generation, code infilling, and writing tests. Therefore, it should be used in conjunction with other evaluation methods to get a comprehensive understanding of a model's capabilities.

Limitations of HumanEval

The HumanEval benchmark, while a significant tool for evaluating code generation models, has several limitations:

Bias Towards Certain Concepts: HumanEval has been found to have a significant bias towards a limited number of programming concepts, with many concepts not represented at all. This could lead to an overestimation of a model's performance on code generation tasks.
Limited Scope: The benchmark consists of problems that are mostly algorithmic and similar to interview-style coding questions, which may not reflect the complexity of real-world software development tasks. For instance, in a corporate setting, developers often work with multiple files and existing codebases, which is not captured by the single-function problems in HumanEval.
Weak Unit Tests: The unit tests provided with HumanEval problems are sometimes too weak, allowing incorrect implementations to pass. This undermines the reliability of the benchmark in assessing the correctness of generated code.
Training Data Contamination: Since HumanEval has been around for a while, there is a risk that the models have been exposed to the problems or similar ones during training, leading to potential data leakage and invalidating the evaluation results.
Narrow Evaluation Metric: The pass@k metric, while useful, provides a narrow view of a model's capabilities. It does not account for other important aspects of code generation such as code explanation, docstring generation, code infilling, handling Stack Overflow questions, and writing tests.
Popularity and Memorization: The popularity of HumanEval might lead to models being fine-tuned specifically to perform well on its problems, which does not necessarily translate to general code generation ability. This can also lead to memorization of solutions rather than genuine problem-solving.
Lack of Real-world Relevance: The benchmark does not fully capture the nuances and expectations of real-world code model usage, as it is based on a limited set of standalone functions rather than complex, multi-file projects.
Non-representative of Diverse Programming Tasks: The tasks in HumanEval may not represent the diversity of programming tasks found in different domains and real-world scenarios.

Researchers and practitioners in the field are aware of these limitations and there is ongoing work to improve code benchmarking to address these issues.

Klu is remote-first and global

Follow us

HumanEval Benchmark

What is the HumanEval Benchmark?

Current Leaderboard

How is the HumanEval benchmark used in code generation research?

How does HumanEval work?

The Pass@k Metric

When to use HumanEval

Limitations of HumanEval

More terms

What is robotics?

What is Google AI Studio?

It's time to build

LLMOps

Guides

LLMs