GPQA: A Graduate-Level Google-Proof Q&A Benchmark

by Stephen M. Walker II, Co-Founder / CEO

GPQA, the Graduate-Level Google-Proof Q&A Benchmark, is a challenging dataset designed to evaluate the capabilities of Large Language Models (LLMs) and scalable oversight mechanisms. Introduced by Rein et al. in 2023, GPQA comprises 448 multiple-choice questions across biology, physics, and chemistry, written and validated by domain experts to ensure high quality and difficulty.

The dataset ships in three subsets of different sizes: extended (546 questions), main (448), and diamond (198, the hardest split, where expert validators agreed on the answer but most non-experts got it wrong). The original paper evaluated baselines under zero-shot, few-shot, chain-of-thought (CoT), and search-augmented prompting.
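As a concrete starting point, the subsets can be loaded with the Hugging Face `datasets` library. This is a minimal sketch assuming the official gated `Idavidrein/gpqa` dataset (you must accept its terms and authenticate first); the config names below should be verified against its dataset card.

```python
# Minimal sketch: loading the three GPQA subsets with Hugging Face `datasets`.
# Assumes the gated "Idavidrein/gpqa" dataset and an authenticated session
# (e.g., run `huggingface-cli login` beforehand).
from datasets import load_dataset

for config in ("gpqa_extended", "gpqa_main", "gpqa_diamond"):
    ds = load_dataset("Idavidrein/gpqa", config, split="train")
    # Each row holds the question, the expert-written correct answer, and
    # three expert-written incorrect answers.
    print(f"{config}: {len(ds)} questions")
```

Run as-is, this should report 546, 448, and 198 questions for the extended, main, and diamond subsets respectively.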

GPQA Eval Leaderboard

| Organization | Model            | Diamond Set (198 Questions) |
|--------------|------------------|-----------------------------|
| Anthropic    | Claude 3 Opus    | 50.4% (zero-shot CoT)       |
| Anthropic    | Claude 3 Sonnet  | 40.4% (zero-shot CoT)       |
| OpenAI       | GPT-4            | 35.7% (zero-shot CoT)       |
| Anthropic    | Claude 3 Haiku   | 33.3% (zero-shot CoT)       |
| Meta         | Llama-2-70B-chat | 31.1% (zero-shot CoT)       |
| OpenAI       | GPT-3.5          | 28.1% (zero-shot CoT)       |
| Google       | Gemini 1.0 Ultra | not reported                |
| Google       | Gemini 1.0 Pro   | not reported                |

Key Features and Performance Insights

  • Expert-Level Difficulty — The questions are designed to be extremely challenging, with domain experts (those with or pursuing PhDs in the relevant fields) achieving an accuracy of 65% (74% when discounting clear mistakes identified in retrospect). This level of difficulty is intended to reflect graduate-level understanding in the respective sciences.
  • Google-Proof Nature — Highly skilled non-expert validators, despite having unrestricted web access and spending over 30 minutes per question on average, only reached a 34% accuracy rate. This "Google-proof" characteristic underscores the benchmark's resistance to simple lookup or shallow web searches, aiming at deeper understanding and reasoning.
  • Performance of AI Systems — The strongest GPT-4 based baseline achieved 39% accuracy, highlighting the significant challenge GPQA poses even to state-of-the-art AI systems. This gap between expert human performance and AI capability underscores the need for scalable oversight methods that keep AI outputs reliable and truthful in complex scientific domains; a minimal sketch of such an evaluation loop follows this list.
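To make the reported numbers concrete, here is a minimal sketch of a zero-shot CoT accuracy loop over GPQA-style records. The `ask_model` callable is a hypothetical stand-in for whatever LLM API you use, and the prompt format and answer-letter parsing are illustrative simplifications, not the paper's exact protocol.

```python
import random
import re

def build_prompt(question: str, options: list[str]) -> str:
    """Format a four-way multiple-choice question with a zero-shot CoT cue."""
    choices = "\n".join(f"({letter}) {opt}" for letter, opt in zip("ABCD", options))
    return (
        f"{question}\n{choices}\n"
        "Let's think step by step, then give your final answer as a single letter."
    )

def evaluate(records: list[dict], ask_model) -> float:
    """Return accuracy; `ask_model(prompt) -> str` is a hypothetical LLM call.

    Each record is assumed to carry a "question" string, a "correct" answer,
    and a list of three "incorrect" answers.
    """
    num_correct = 0
    for rec in records:
        options = [rec["correct"]] + list(rec["incorrect"])
        random.shuffle(options)  # shuffle so the gold answer's position varies
        gold_letter = "ABCD"[options.index(rec["correct"])]
        reply = ask_model(build_prompt(rec["question"], options))
        letters = re.findall(r"\b([ABCD])\b", reply)
        if letters and letters[-1] == gold_letter:  # score the last letter mentioned
            num_correct += 1
    return num_correct / len(records)
```

If you load records from the Hugging Face dataset shown earlier, `rec["correct"]` corresponds to its `Correct Answer` column and `rec["incorrect"]` to the three `Incorrect Answer` columns (verify the exact names against the dataset card).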

Objectives and Implications

GPQA probes whether LLMs have genuinely deep, graduate-level understanding and reasoning in biology, physics, and chemistry rather than surface-level recall. Just as importantly, it serves as a testbed for scalable oversight: developing methods that let human experts supervise and validate AI outputs, especially in areas where AI systems may come to exceed the capabilities of their overseers.

GPQA also fills a gap in AI evaluation: as frontier models saturate older benchmarks, the field needs tests hard enough to measure progress on accurate reasoning in specialized scientific fields. Benchmarks like GPQA help track movement toward AI systems capable of contributing to scientific work, and they guide the development of oversight techniques that keep AI-generated scientific information reliable and truthful.

