BigCodeBench: A New Benchmark for Evaluating LLMs on Programming Tasks

by Stephen M. Walker II, Co-Founder / CEO

What is BigCodeBench?

BigCodeBench is a new benchmark designed to evaluate large language models (LLMs) on solving practical and challenging programming tasks. It was created to address the limitations of existing benchmarks like HumanEval, which have been criticized for being too simple and not representative of real-world programming tasks.

Key features of BigCodeBench include:

  • 1,140 function-level tasks that challenge LLMs to follow instructions and compose multiple function calls from 139 libraries
  • An average of 5.6 test cases per task with 99% branch coverage
  • Complex, user-oriented instructions for each task, including functionality descriptions, input/output formats, and error handling
  • Verified interactive examples for each task
  • Tasks that require more complex reasoning and problem-solving skills compared to other benchmarks

Current Leaderboard

As of June 28, 2024, the leaderboard is led by GPT-4o-2024-05-13, which achieves a Complete score of 61.1 and an Instruct score of 51.1.

The BigCodeBench leaderboard ranks models on their performance in two main scenarios, BigCodeBench-Complete and BigCodeBench-Instruct, scored with the Pass@1 metric under greedy decoding (explained below). An Elo rating system provides an overall ranking across models.

  Rank  Model                        Complete  Instruct  Elo MLE  Parameters
     1  GPT-4o-2024-05-13                61.1      51.1     1269
     2  DeepSeek-Coder-V2-Instruct       59.7      48.2     1251  21B
     3  Claude-3.5-Sonnet-20240620       58.6      46.8     1214
     4  GPT-4-Turbo-2024-04-09           58.2      48.2     1216
     5  Gemini-1.5-Pro-API-0514          57.5      43.8     1218
     6  Claude-3-Opus-20240229           57.4      45.5     1209
     7  GPT-4-0613                       57.2      46.0     1213
     8  Hermes-2-Theta-Llama-3-70B       55.6      45.6     1190  70B
     9  Gemini-1.5-Flash-API-0514        55.1      43.5     1187
    10  Llama-3-70B-Instruct             54.5      43.6     1170  70B
    11  Qwen2-72B-Chat                   54.0      38.5     1168  72B

How does BigCodeBench work?

BigCodeBench evaluates LLMs in two main scenarios, contrasted in the sketch below:

  • BigCodeBench-Complete — LLMs are required to finish the implementation of a function based on detailed instructions in the docstring.

  • BigCodeBench-Instruct — A more challenging variant designed to evaluate instruction-tuned LLMs, where requirements are described in a more conversational and less verbose manner.
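
To make the contrast concrete, here is a hypothetical task written in both styles. This is not an actual BigCodeBench task; the function name, docstring, and data are made up purely for illustration.

    # Hypothetical illustration of the two prompt styles; not a real
    # BigCodeBench task, and all names below are made up.

    # BigCodeBench-Complete: the model sees a structured docstring covering
    # functionality, input/output format, and error handling, and must
    # finish the implementation.
    def count_status(path: str) -> dict:
        """Read the CSV file at `path` with pandas, count rows per value of
        the 'status' column, and return the counts as a dict.
        Raise FileNotFoundError if the file does not exist.

        >>> count_status("orders.csv")
        {'shipped': 12, 'pending': 3}
        """
        # ... the model completes the body here ...

    # BigCodeBench-Instruct: the same requirement phrased conversationally,
    # e.g. "Write a function that reads a CSV and tells me how many rows
    # there are for each status, returned as a dict."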

The benchmark uses the Pass@1 metric with greedy decoding to assess LLM performance: the percentage of tasks whose first (and only) generated code snippet passes all curated test cases.
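
For intuition, here is a minimal sketch of that scoring rule. The `results` mapping and task IDs are hypothetical; in practice the official evaluation harness computes this for you.

    # Minimal sketch of Pass@1 with greedy decoding: one generated sample per
    # task, and a task counts as solved only if that sample passes every
    # curated test case. The `results` mapping below is hypothetical.

    def pass_at_1(results: dict[str, list[bool]]) -> float:
        solved = sum(all(tests) for tests in results.values())
        return solved / len(results)

    example = {
        "task/0": [True, True, True],   # all tests pass -> solved
        "task/1": [True, False, True],  # one failure -> not solved
    }
    print(f"Pass@1: {pass_at_1(example):.1%}")  # Pass@1: 50.0%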

How were the tasks created?

The tasks in BigCodeBench were created through a systematic "Human-LLM collaboration process":

  1. Started with ODEX as a "seed dataset" of short human intents and Python one-liners from Stack Overflow
  2. Used GPT-4 to expand these one-liners into comprehensive function-level tasks
  3. 20 human experts with extensive Python experience guided GPT-4 to refine tasks and add test cases
  4. Tasks and test cases were examined in a local environment, pre-evaluated on other LLMs, and cross-checked by additional human experts
  5. 11 human experts solved a sample of tasks, achieving an average human performance of 97%

How do LLMs perform on BigCodeBench?

Performance on BigCodeBench is significantly lower than human performance:

  • The best model (GPT-4o) achieves a calibrated Pass@1 of 61.1% on BigCodeBench-Complete and 51.1% on BigCodeBench-Instruct
  • There is a notable performance gap between closed and open LLMs
  • 149 tasks in BigCodeBench-Complete and 278 tasks in BigCodeBench-Instruct are not solved by any model
  • Only 6 tasks in BigCodeBench-Complete and 14 tasks in BigCodeBench-Instruct are solved by every model

An Elo rating system is used to rank models, with GPT-4o outperforming other models by a large margin.
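
For intuition, the sketch below runs a simplified sequential Elo update over hypothetical head-to-head task outcomes. The actual leaderboard reports maximum-likelihood ("Elo MLE") ratings fitted jointly over all pairwise comparisons, so this incremental scheme is only meant to convey the idea.

    # Simplified Elo-style rating from pairwise task outcomes (illustrative).
    def expected_score(r_a: float, r_b: float) -> float:
        return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

    def update(r_a: float, r_b: float, score_a: float, k: float = 16.0):
        e_a = expected_score(r_a, r_b)
        return r_a + k * (score_a - e_a), r_b + k * ((1.0 - score_a) - (1.0 - e_a))

    # Hypothetical per-task results: 1.0 = model A solved the task and B did
    # not, 0.0 = the reverse, 0.5 = both or neither solved it.
    ratings = {"model_a": 1000.0, "model_b": 1000.0}
    for outcome in [1.0, 1.0, 0.5, 0.0, 1.0]:
        ratings["model_a"], ratings["model_b"] = update(
            ratings["model_a"], ratings["model_b"], outcome
        )
    print(ratings)  # model_a ends a few points above model_b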

How can I evaluate my model on BigCodeBench?

BigCodeBench is a sophisticated benchmark that rigorously evaluates the code generation capabilities of large language models (LLMs) in realistic scenarios. It surpasses traditional HumanEval-like tasks by incorporating complex instructions and diverse function calls, offering a more comprehensive assessment of LLMs' programming abilities.

The bigcodebench Python package comprises three key components:

  1. A curated dataset of challenging programming tasks
  2. Robust scripts for generating code samples
  3. Advanced evaluation tools for assessing model performance

Leveraging the EvalPlus framework, BigCodeBench provides a flexible and extensible platform for evaluating code generation tasks. This architecture enables researchers and developers to conduct thorough, comparative analyses of various LLMs' performance in practical programming contexts.

BigCodeBench provides a simple evaluation framework accessible via PyPI:

  1. Install the package:

    pip install bigcodebench --upgrade
    
  2. Generate code samples:

    bigcodebench.generate --model [model_name] --subset [complete|instruct] ...
    
  3. Post-process the generated code:

    bigcodebench.sanitize --samples samples.jsonl --calibrate
    
  4. Evaluate the code (preferably using Docker):

    docker run -v $(pwd):/app bigcodebench/bigcodebench-evaluate:latest --subset [complete|instruct] --samples samples-sanitized-calibrated
    

Future Developments

The BigCodeBench team has outlined several areas for future improvement:

  1. Multilingualism: Extending beyond Python to other programming languages
  2. Rigorousness: Improving test case coverage and assessment accuracy
  3. Generalization: Including tasks with emerging libraries like transformers and langchain
  4. Evolution: Addressing the challenge of evolving libraries and potential test set contamination
  5. Interaction: Exploring LLMs as Agents in less constrained sandbox environments

BigCodeBench represents a significant step forward in evaluating the programming capabilities of LLMs, providing a more comprehensive and challenging benchmark that better reflects real-world programming tasks.

More terms

Context Window (LLMs)

The context window of an LLM works like short-term memory: it determines how much text the model can consider when generating a response. Specifically, it is the number of tokens (the individual pieces of text produced by tokenization) that the model can process at one time. This capacity varies among LLMs and affects how much input they can handle and comprehend. For instance, GPT-3 manages a context of about 2,048 tokens, while GPT-4 Turbo extends to 128,000 tokens. Larger context windows let the model take in more information at once, which is crucial for tasks that require learning from examples provided in the prompt.
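
As a rough illustration, the sketch below counts a prompt's tokens with the third-party tiktoken library and compares the count against an example budget; the 8,000-token limit and the cl100k_base encoding are assumptions for the example, not properties of any particular model.

    # Counting tokens against an assumed context budget
    # (requires `pip install tiktoken`).
    import tiktoken

    CONTEXT_LIMIT = 8_000                        # arbitrary example budget
    enc = tiktoken.get_encoding("cl100k_base")   # one common tokenizer encoding

    prompt = "Summarize the following document: ..."
    n_tokens = len(enc.encode(prompt))
    print(f"{n_tokens} tokens used, {CONTEXT_LIMIT - n_tokens} remaining")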


What is a hyper-heuristic?

A hyper-heuristic is a higher-level strategy or method that helps in selecting, generating, or modifying lower-level heuristics used for solving optimization problems or search tasks. Hyper-heuristics automate the process of choosing the most appropriate low-level heuristic based on problem characteristics and constraints.
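
As a minimal sketch, the example below implements a simple selection hyper-heuristic: a controller keeps a credit score for each low-level heuristic, rewards whichever one produced an improving move, and samples the next heuristic in proportion to those scores. The toy objective and the two heuristics are illustrative, not drawn from any specific paper.

    # Selection hyper-heuristic sketch: credit-based choice among two
    # low-level heuristics on a toy continuous minimization problem.
    import random

    def large_step(x):   # low-level heuristic 1: big random perturbation
        return [xi + random.gauss(0, 1.0) for xi in x]

    def small_step(x):   # low-level heuristic 2: small local tweak
        return [xi + random.gauss(0, 0.05) for xi in x]

    def cost(x):         # toy objective: squared distance to the zero vector
        return sum(xi * xi for xi in x)

    heuristics = [large_step, small_step]
    scores = [1.0, 1.0]                      # optimistic initial credit
    x = [random.uniform(-5, 5) for _ in range(10)]

    for _ in range(2000):
        k = random.choices(range(len(heuristics)), weights=scores)[0]
        candidate = heuristics[k](x)
        if cost(candidate) < cost(x):        # accept only improving moves
            x = candidate
            scores[k] += 1.0                 # reward the helpful heuristic
        scores[k] *= 0.995                   # slowly forget stale credit

    print(round(cost(x), 4))                 # cost drops well below its start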

