BigCodeBench: A New Benchmark for Evaluating LLMs on Programming Tasks

by Stephen M. Walker II, Co-Founder / CEO

What is BigCodeBench?

BigCodeBench is a new benchmark designed to evaluate large language models (LLMs) on solving practical and challenging programming tasks. It was created to address the limitations of existing benchmarks like HumanEval, which have been criticized for being too simple and not representative of real-world programming tasks.

Key features of BigCodeBench include:

  • 1,140 function-level tasks that challenge LLMs to follow instructions and compose multiple function calls from 139 libraries
  • An average of 5.6 test cases per task with 99% branch coverage
  • Complex, user-oriented instructions for each task, including functionality descriptions, input/output formats, and error handling
  • Verified interactive examples for each task
  • Tasks that require more complex reasoning and problem-solving skills compared to other benchmarks
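
To make the task style concrete, here is a hypothetical task written in the BigCodeBench format. It is an illustration of the format only, not an item from the dataset: the prompt supplies a function signature and a detailed docstring, and the solution must compose calls from several libraries.

    # Hypothetical BigCodeBench-style task: the docstring is the instruction,
    # and the solution composes the csv, json, and collections libraries.
    import csv
    import json
    from collections import Counter

    def task_func(csv_path, top_n=3):
        """
        Read a CSV file with a 'category' column, count how often each
        category occurs, and return the top_n most common categories as a
        JSON string of [category, count] pairs, most common first.

        Raises KeyError if the CSV rows have no 'category' field.
        """
        with open(csv_path, newline="") as f:
            rows = list(csv.DictReader(f))
        counts = Counter(row["category"] for row in rows)
        return json.dumps(counts.most_common(top_n))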

Current Leaderboard

As of June 28, 2024, the leaderboard is led by GPT-4o-2024-05-13, with a Complete score of 61.1 and an Instruct score of 51.1.

The BigCodeBench leaderboard ranks models on two scenarios, BigCodeBench-Complete and BigCodeBench-Instruct, scored with the Pass@1 metric under greedy decoding: the percentage of tasks whose first generated code snippet passes the curated test cases. An Elo rating system provides an additional overall ranking.

Rank  Model                         Complete  Instruct  Elo MLE  Parameters
1     GPT-4o-2024-05-13             61.1      51.1      1269     —
2     DeepSeek-Coder-V2-Instruct    59.7      48.2      1251     21B
3     Claude-3.5-Sonnet-20240620    58.6      46.8      1214     —
4     GPT-4-Turbo-2024-04-09        58.2      48.2      1216     —
5     Gemini-1.5-Pro-API-0514       57.5      43.8      1218     —
6     Claude-3-Opus-20240229        57.4      45.5      1209     —
7     GPT-4-0613                    57.2      46.0      1213     —
8     Hermes-2-Theta-Llama-3-70B    55.6      45.6      1190     70B
9     Gemini-1.5-Flash-API-0514     55.1      43.5      1187     —
10    Llama-3-70B-Instruct          54.5      43.6      1170     70B
11    Qwen2-72B-Chat                54.0      38.5      1168     72B

(A dash indicates an undisclosed parameter count.)

How does BigCodeBench work?

BigCodeBench evaluates LLMs in two main scenarios:

  • BigCodeBench-Complete — LLMs are required to finish the implementation of a function based on detailed instructions in the docstring.

  • BigCodeBench-Instruct — A more challenging variant designed to evaluate instruction-tuned LLMs, where requirements are described in a more conversational and less verbose manner.
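
The contrast between the two prompt styles is easiest to see side by side. The sketch below is a hypothetical illustration rather than an actual dataset item; it shows how the same requirement might be phrased in each split.

    # Hypothetical illustration of the two prompt styles (not a real dataset item).

    # BigCodeBench-Complete: the model sees imports, a signature, and a detailed
    # docstring, and must finish the implementation.
    complete_prompt = '''
    import re
    from collections import Counter

    def task_func(text, top_n=5):
        """
        Split `text` into lowercase words, ignoring punctuation, and return
        the `top_n` most frequent words as (word, count) tuples.

        >>> task_func("the cat saw the dog", top_n=1)
        [('the', 2)]
        """
    '''

    # BigCodeBench-Instruct: the same requirement, restated as a shorter,
    # conversational instruction without the structured docstring.
    instruct_prompt = (
        "Write a function task_func(text, top_n=5) that returns the top_n most "
        "common lowercase words in text, ignoring punctuation, as (word, count) tuples."
    )

Because the Instruct prompt drops the structured docstring, the model must infer the exact input/output contract from looser natural language, which is one reason scores on this split are consistently lower.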

The benchmark uses the Pass@1 metric with greedy decoding: each model produces a single code snippet per task, and the score is the percentage of tasks whose snippet passes all of the curated test cases.
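
Because greedy decoding produces exactly one sample per task, Pass@1 reduces to a simple fraction of tasks solved. A minimal sketch, assuming a plain list of per-task pass/fail booleans rather than BigCodeBench's actual result format:

    def pass_at_1(task_passed):
        """task_passed: one boolean per task, True if the single greedily
        decoded snippet passed all of that task's test cases."""
        return 100.0 * sum(task_passed) / len(task_passed)

    # Example: 7 of 10 tasks solved on the first (and only) attempt -> 70.0
    print(pass_at_1([True] * 7 + [False] * 3))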

How were the tasks created?

The tasks in BigCodeBench were created through a systematic "Human-LLM collaboration process":

  1. Started with ODEX as a "seed dataset" of short human intents and Python one-liners from Stack Overflow
  2. Used GPT-4 to expand these one-liners into comprehensive function-level tasks
  3. 20 human experts with extensive Python experience guided GPT-4 to refine tasks and add test cases
  4. Tasks and test cases were examined in a local environment, pre-evaluated on other LLMs, and cross-checked by additional human experts
  5. 11 human experts solved a sample of tasks, achieving an average human performance of 97%

How do LLMs perform on BigCodeBench?

Performance on BigCodeBench is significantly lower than human performance:

  • The best model (GPT-4o) achieves a calibrated Pass@1 of 61.1% on BigCodeBench-Complete and 51.1% on BigCodeBench-Instruct
  • There is a notable performance gap between closed and open LLMs
  • 149 tasks in BigCodeBench-Complete and 278 tasks in BigCodeBench-Instruct are not solved by any model
  • Only 6 tasks in BigCodeBench-Complete and 14 tasks in BigCodeBench-Instruct are fully solved by all models

An Elo rating system is used to rank models, with GPT-4o outperforming other models by a large margin.
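
One common way to produce such ratings is to treat each pairwise model comparison as a match in a Bradley-Terry model and fit the strengths by maximum likelihood. The sketch below uses the classic MM (Zermelo) iteration on toy win counts; it illustrates the idea and is not BigCodeBench's actual Elo-MLE script, whose pairing and scaling choices may differ.

    import math

    def bradley_terry_elo(wins, iters=200, base=1000.0, scale=400.0):
        """wins[a][b] = number of comparisons in which model a beats model b
        (assumed example structure). Returns Elo-like ratings per model."""
        models = sorted(set(wins) | {b for a in wins for b in wins[a]})
        strength = {m: 1.0 for m in models}
        for _ in range(iters):
            new = {}
            for i in models:
                total_wins = sum(wins.get(i, {}).values())
                denom = 0.0
                for j in models:
                    if j == i:
                        continue
                    n_ij = wins.get(i, {}).get(j, 0) + wins.get(j, {}).get(i, 0)
                    if n_ij:
                        denom += n_ij / (strength[i] + strength[j])
                new[i] = total_wins / denom if denom else strength[i]
            mean = sum(new.values()) / len(new)
            strength = {m: s / mean for m, s in new.items()}  # rescale each round
        # Map the fitted strengths onto a familiar Elo-like scale.
        return {m: base + scale * math.log10(s) for m, s in strength.items()}

    # Toy data: A beats B on 7 of 10 comparisons, A beats C on 8 of 10, B beats C on 6 of 10.
    wins = {"A": {"B": 7, "C": 8}, "B": {"A": 3, "C": 6}, "C": {"A": 2, "B": 4}}
    print(bradley_terry_elo(wins))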

How can I evaluate my model on BigCodeBench?

BigCodeBench rigorously evaluates the code generation capabilities of large language models (LLMs) in realistic scenarios. It goes beyond traditional HumanEval-style tasks by incorporating complex instructions and diverse function calls, offering a more comprehensive assessment of LLMs' programming abilities.

The bigcodebench Python package comprises three key components:

  1. A curated dataset of challenging programming tasks
  2. Robust scripts for generating code samples
  3. Advanced evaluation tools for assessing model performance

Leveraging the EvalPlus framework, BigCodeBench provides a flexible and extensible platform for evaluating code generation tasks. This architecture enables researchers and developers to conduct thorough, comparative analyses of various LLMs' performance in practical programming contexts.

BigCodeBench provides a simple evaluation framework accessible via PyPI:

  1. Install the package:

    pip install bigcodebench --upgrade
    
  2. Generate code samples:

    bigcodebench.generate --model [model_name] --subset [complete|instruct] ...
    
  3. Post-process the generated code:

    bigcodebench.sanitize --samples samples.jsonl --calibrate
    
  4. Evaluate the code (preferably using Docker):

    docker run -v $(pwd):/app bigcodebench/bigcodebench-evaluate:latest --subset [complete|instruct] --samples samples-sanitized-calibrated.jsonl
    

Future Developments

The BigCodeBench team has outlined several areas for future improvement:

  1. Multilingualism: Extending beyond Python to other programming languages
  2. Rigorousness: Improving test case coverage and assessment accuracy
  3. Generalization: Including tasks with emerging libraries like transformers and langchain
  4. Evolution: Addressing the challenge of evolving libraries and potential test set contamination
  5. Interaction: Exploring LLMs as Agents in less constrained sandbox environments

BigCodeBench represents a significant step forward in evaluating the programming capabilities of LLMs, providing a more comprehensive and challenging benchmark that better reflects real-world programming tasks.

More terms

Attention Mechanisms

An attention mechanism is a component of a machine learning model that allows the model to weigh different parts of the input differently when making predictions. This is particularly useful in tasks that involve sequential data, such as natural language processing or time series analysis, where the importance of different parts of the input can vary.
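
As a minimal sketch, the widely used scaled dot-product form of attention can be written in a few lines of NumPy; the shapes and data here are purely illustrative.

    # Scaled dot-product attention: each output position is a weighted mix of
    # the value vectors, with weights derived from query/key similarity.
    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)                   # query-key similarity
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
        return weights @ V, weights

    # Self-attention over 3 positions with 4-dimensional embeddings.
    x = np.random.default_rng(0).normal(size=(3, 4))
    output, attn_weights = scaled_dot_product_attention(x, x, x)
    print(attn_weights.round(2))                        # each row sums to 1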

What is IBM Deep Blue?

IBM Deep Blue was a chess-playing expert system run on a unique purpose-built IBM supercomputer. It was the first computer to win a game, and the first to win a match, against a reigning world champion under regular time controls. The development of Deep Blue began in 1985 at Carnegie Mellon University under the name ChipTest. It then moved to IBM, where it was first renamed Deep Thought, then again in 1989 to Deep Blue.
