GPQA: A Graduate-Level Google-Proof Q&A Benchmark

by Stephen M. Walker II, Co-Founder / CEO

GPQA, or Graduate-Level Google-Proof Q&A Benchmark, is a challenging dataset designed to evaluate the capabilities of Large Language Models (LLMs) and scalable oversight mechanisms. Introduced by Rein et al. in 2023, GPQA comprises 448 multiple-choice questions across the domains of biology, physics, and chemistry, crafted by domain experts to ensure high quality and difficulty.

The dataset comes in three variants that differ in the number of questions: extended (546), main (448), and diamond (198). The original paper compared zero-shot, few-shot, chain-of-thought (CoT), and search-augmented baselines.
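As a concrete illustration, the sketch below loads one of the splits and turns a single question into a zero-shot CoT prompt. It assumes the benchmark is mirrored on the Hugging Face Hub under the `Idavidrein/gpqa` dataset ID with `gpqa_main`, `gpqa_diamond`, and `gpqa_extended` configs and the paper's CSV column names; both the ID and the column names are assumptions to verify before use (the dataset is also gated, so access must be requested first).

```python
# Minimal sketch: load a GPQA split and build a zero-shot CoT prompt.
# Assumes a Hugging Face mirror at "Idavidrein/gpqa" with configs
# "gpqa_main", "gpqa_diamond", "gpqa_extended" and the paper's CSV
# column names -- verify both before relying on this.
import random

from datasets import load_dataset

dataset = load_dataset("Idavidrein/gpqa", "gpqa_diamond", split="train")

def build_prompt(row: dict) -> tuple[str, str]:
    """Shuffle the four answer options and return (prompt, correct_letter)."""
    options = [
        row["Correct Answer"],
        row["Incorrect Answer 1"],
        row["Incorrect Answer 2"],
        row["Incorrect Answer 3"],
    ]
    random.shuffle(options)
    letters = "ABCD"
    correct_letter = letters[options.index(row["Correct Answer"])]
    choices = "\n".join(f"({l}) {o}" for l, o in zip(letters, options))
    prompt = (
        f"{row['Question']}\n\n{choices}\n\n"
        "Think step by step, then answer with a single letter (A, B, C, or D)."
    )
    return prompt, correct_letter

prompt, answer = build_prompt(dataset[0])
```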

GPQA Eval Leaderboard

The GPQA Eval Leaderboard, updated as of June 26, 2024, showcases AI model performance on the challenging 198-question Diamond Set of the GPQA benchmark. Anthropic's Claude 3.5 Sonnet leads with 59.4% zero-shot chain-of-thought (CoT) accuracy, followed by OpenAI's GPT-4o (0513) at 53.6% and Anthropic's Claude 3 Opus at 50.4%. These results demonstrate significant progress in frontier models' ability to handle complex, graduate-level scientific questions across biology, physics, and chemistry.

Organization | Model | Diamond Set (198 Questions)
Anthropic | Claude 3.5 Sonnet | 59.4% Zero-shot CoT
OpenAI | GPT-4o (0513) | 53.6% Zero-shot CoT
Anthropic | Claude 3 Opus | 50.4% Zero-shot CoT
OpenAI | GPT-4 Turbo (0409) | 48.0% Zero-shot CoT
Google | Gemini 1.5 Pro (052024) | 46.2% Zero-shot CoT
Google | Gemini 1.5 Pro (022024) | 41.4% Zero-shot CoT
Anthropic | Claude 3 Sonnet | 40.4% Zero-shot CoT
Google | Gemini 1.5 Pro | 39.5% Zero-shot CoT
OpenAI | GPT-4 (0314) | 35.7% Zero-shot CoT
Anthropic | Claude 3 Haiku | 33.3% Zero-shot CoT
Meta | Llama-2-70B-chat | 31.1% Zero-shot CoT
OpenAI | GPT-3.5 | 28.1% Zero-shot CoT
Google | Gemini 1.0 Ultra | n/a
Google | Gemini 1.0 Pro | n/a
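The figures above are zero-shot CoT accuracies: each model is prompted to reason before committing to a single answer letter, the letter is parsed from the completion, and accuracy is the fraction of the 198 Diamond questions answered correctly. A minimal, model-agnostic sketch of that scoring loop follows; `ask_model` is a hypothetical placeholder for whichever chat-completion API is being evaluated, and the answer-extraction regex assumes the model is instructed to finish with "Answer: <letter>".

```python
# Minimal sketch of zero-shot CoT scoring on GPQA Diamond.
# `ask_model` is a placeholder for any chat-completion call; the regex
# assumes the model ends its response with "Answer: <letter>".
import re
from typing import Callable

def score_zero_shot_cot(
    items: list[dict],               # each item: {"prompt": str, "answer": "A".."D"}
    ask_model: Callable[[str], str],
) -> float:
    correct = 0
    for item in items:
        completion = ask_model(
            item["prompt"] + "\n\nEnd your response with 'Answer: <letter>'."
        )
        match = re.search(r"Answer:\s*([ABCD])", completion)
        predicted = match.group(1) if match else None
        correct += int(predicted == item["answer"])
    # e.g. 0.594 corresponds to the 59.4% Diamond score reported above
    return correct / len(items)
```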

Key Features and Performance Insights

  • Expert-Level Difficulty — The questions are designed to be extremely challenging, with domain experts (those with or pursuing PhDs in the relevant fields) achieving an accuracy of 65% (74% when discounting clear mistakes identified in retrospect). This level of difficulty is intended to reflect graduate-level understanding in the respective sciences.
  • Google-Proof Nature — Highly skilled non-expert validators, despite having unrestricted web access and spending over 30 minutes per question on average, only reached a 34% accuracy rate. This "Google-proof" characteristic underscores the benchmark's resistance to simple lookup or shallow web searches, aiming at deeper understanding and reasoning.
  • Performance of AI Systems — The strongest GPT-4-based baseline achieved 39% accuracy, highlighting the significant challenge GPQA poses even to state-of-the-art AI systems. This gap between expert human performance and AI capabilities underscores the need for advanced scalable oversight methods to ensure AI systems can provide reliable and truthful information, especially in complex scientific domains.

How does GPQA compare to other benchmarks like GAIA and BASIS?

GAIA: Real-World AI Assistant Assessment

GAIA (General AI Assistant Benchmark) evaluates AI systems on practical, real-world tasks that encompass reasoning, multi-modal processing, web browsing, and tool utilization. Despite being conceptually simple for humans, who achieve 92% accuracy, GAIA poses significant challenges for AI, with GPT-4 (with plugins) scoring only 15%. This stark performance gap underscores GAIA's effectiveness in benchmarking AI systems' robustness and adaptability across diverse, everyday scenarios, emphasizing the need for AI to match or exceed average human performance on practical tasks.

BASIS: Frontier of Scientific AI Capabilities

BASIS (Benchmark for Advanced Scientific Inquiry Systems) pushes the boundaries of AI evaluation in scientific domains, surpassing even GPQA in complexity. Tailored for assessing AI systems expected to perform at or beyond human expert level, BASIS focuses on tasks demanding advanced scientific inquiry and reasoning. This benchmark is crucial for developing and evaluating AI systems capable of contributing meaningfully to cutting-edge scientific research and problem-solving, potentially accelerating breakthroughs across various scientific disciplines.

Objectives and Implications

GPQA, the Graduate-Level Google-Proof Q&A Benchmark, rigorously evaluates Large Language Models (LLMs) through 448 meticulously crafted multiple-choice questions spanning biology, physics, and chemistry. This benchmark probes LLMs' capacity for deep comprehension and sophisticated reasoning within these scientific domains, serving as a critical metric for scalable oversight mechanisms. GPQA's design specifically targets the development of robust methodologies enabling human experts to effectively supervise and validate AI outputs, particularly in domains where AI capabilities may surpass human expertise.

The advent of GPQA represents a significant milestone in AI assessment, directly addressing the critical need for models capable of processing and generating precise information in specialized scientific fields. As AI technology advances, GPQA and similar benchmarks become indispensable tools for quantifying progress towards AI systems capable of meaningful contributions to scientific research. These benchmarks drive the evolution of increasingly sophisticated AI architectures, aiming to minimize the disparity between AI-generated content and human expert knowledge. Ultimately, GPQA's rigorous standards promote the development of AI systems that can reliably produce truthful and accurate scientific information, potentially accelerating the pace of scientific discovery and innovation.
