GPQA: A Graduate-Level Google-Proof Q&A Benchmark

by Stephen M. Walker II, Co-Founder / CEO

GPQA, or Graduate-Level Google-Proof Q&A Benchmark, is a challenging dataset designed to evaluate the capabilities of Large Language Models (LLMs) and scalable oversight mechanisms. Introduced by Rein et al. in 2023, GPQA comprises 448 multiple-choice questions across the domains of biology, physics, and chemistry, crafted by domain experts to ensure high quality and difficulty.

The dataset comes in three variants that differ in the number of questions: extended (546), main (448), and diamond (198). The original paper compared zero-shot, few-shot, chain-of-thought (CoT), and search-augmented baselines.
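As a concrete illustration, the sketch below loads one of the splits and turns a single question into a zero-shot CoT prompt. It assumes the benchmark is mirrored on the Hugging Face Hub under the `Idavidrein/gpqa` dataset ID with `gpqa_main`, `gpqa_diamond`, and `gpqa_extended` configs and the paper's CSV column names; both the ID and the column names are assumptions to verify before use (the dataset is also gated, so access must be requested first).

```python
# Minimal sketch: load a GPQA split and build a zero-shot CoT prompt.
# Assumes a Hugging Face mirror at "Idavidrein/gpqa" with configs
# "gpqa_main", "gpqa_diamond", "gpqa_extended" and the paper's CSV
# column names -- verify both before relying on this.
import random

from datasets import load_dataset

dataset = load_dataset("Idavidrein/gpqa", "gpqa_diamond", split="train")

def build_prompt(row: dict) -> tuple[str, str]:
    """Shuffle the four answer options and return (prompt, correct_letter)."""
    options = [
        row["Correct Answer"],
        row["Incorrect Answer 1"],
        row["Incorrect Answer 2"],
        row["Incorrect Answer 3"],
    ]
    random.shuffle(options)
    letters = "ABCD"
    correct_letter = letters[options.index(row["Correct Answer"])]
    choices = "\n".join(f"({l}) {o}" for l, o in zip(letters, options))
    prompt = (
        f"{row['Question']}\n\n{choices}\n\n"
        "Think step by step, then answer with a single letter (A, B, C, or D)."
    )
    return prompt, correct_letter

prompt, answer = build_prompt(dataset[0])
```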

GPQA Eval Leaderboard

The GPQA Eval Leaderboard, updated as of June 26, 2024, showcases AI model performance on the challenging 198-question Diamond Set of the GPQA benchmark. Anthropic's Claude 3.5 Sonnet leads with 59.4% zero-shot chain-of-thought (CoT) accuracy, followed by OpenAI's GPT-4o (0513) at 53.6% and Anthropic's Claude 3 Opus at 50.4%. These results demonstrate significant progress in frontier models' ability to handle complex, graduate-level scientific questions across biology, physics, and chemistry.

Organization | Model | Diamond Set (198 Questions)
Anthropic | Claude 3.5 Sonnet | 59.4% Zero-shot CoT
OpenAI | GPT-4o (0513) | 53.6% Zero-shot CoT
Anthropic | Claude 3 Opus | 50.4% Zero-shot CoT
OpenAI | GPT-4 Turbo (0409) | 48.0% Zero-shot CoT
Google | Gemini 1.5 Pro (052024) | 46.2% Zero-shot CoT
Google | Gemini 1.5 Pro (022024) | 41.4% Zero-shot CoT
Anthropic | Claude 3 Sonnet | 40.4% Zero-shot CoT
Google | Gemini 1.5 Pro | 39.5% Zero-shot CoT
OpenAI | GPT-4 (0314) | 35.7% Zero-shot CoT
Anthropic | Claude 3 Haiku | 33.3% Zero-shot CoT
Meta | Llama-2-70B-chat | 31.1% Zero-shot CoT
OpenAI | GPT-3.5 | 28.1% Zero-shot CoT
Google | Gemini 1.0 Ultra | n/a
Google | Gemini 1.0 Pro | n/a
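The figures above are zero-shot CoT accuracies: each model is prompted to reason before committing to a single answer letter, the letter is parsed from the completion, and accuracy is the fraction of the 198 Diamond questions answered correctly. A minimal, model-agnostic sketch of that scoring loop follows; `ask_model` is a hypothetical placeholder for whichever chat-completion API is being evaluated, and the answer-extraction regex assumes the model is instructed to finish with "Answer: <letter>".

```python
# Minimal sketch of zero-shot CoT scoring on GPQA Diamond.
# `ask_model` is a placeholder for any chat-completion call; the regex
# assumes the model ends its response with "Answer: <letter>".
import re
from typing import Callable

def score_zero_shot_cot(
    items: list[dict],               # each item: {"prompt": str, "answer": "A".."D"}
    ask_model: Callable[[str], str],
) -> float:
    correct = 0
    for item in items:
        completion = ask_model(
            item["prompt"] + "\n\nEnd your response with 'Answer: <letter>'."
        )
        match = re.search(r"Answer:\s*([ABCD])", completion)
        predicted = match.group(1) if match else None
        correct += int(predicted == item["answer"])
    # e.g. 0.594 corresponds to the 59.4% Diamond score reported above
    return correct / len(items)
```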

Key Features and Performance Insights

  • Expert-Level Difficulty — The questions are designed to be extremely challenging, with domain experts (those with or pursuing PhDs in the relevant fields) achieving an accuracy of 65% (74% when discounting clear mistakes identified in retrospect). This level of difficulty is intended to reflect graduate-level understanding in the respective sciences.
  • Google-Proof Nature — Highly skilled non-expert validators, despite having unrestricted web access and spending over 30 minutes per question on average, only reached a 34% accuracy rate. This "Google-proof" characteristic underscores the benchmark's resistance to simple lookup or shallow web searches, aiming at deeper understanding and reasoning.
  • Performance of AI Systems — The strongest GPT-4-based baseline achieved 39% accuracy, highlighting the significant challenge GPQA poses even to state-of-the-art AI systems. This gap between expert human performance and AI capabilities underscores the need for advanced scalable oversight methods to ensure AI systems can provide reliable and truthful information, especially in complex scientific domains.

How does GPQA compare to other benchmarks like GAIA and BASIS?

GAIA: Real-World AI Assistant Assessment

GAIA (General AI Assistant Benchmark) evaluates AI systems on practical, real-world tasks that encompass reasoning, multi-modal processing, web browsing, and tool utilization. Despite being conceptually simple for humans, who achieve 92% accuracy, GAIA poses significant challenges for AI, with GPT-4 (with plugins) scoring only 15%. This stark performance gap underscores GAIA's effectiveness in benchmarking AI systems' robustness and adaptability across diverse, everyday scenarios, emphasizing the need for AI to match or exceed average human performance on practical tasks.

BASIS: Frontier of Scientific AI Capabilities

BASIS (Benchmark for Advanced Scientific Inquiry Systems) pushes the boundaries of AI evaluation in scientific domains, surpassing even GPQA in complexity. Tailored for assessing AI systems expected to perform at or beyond human expert level, BASIS focuses on tasks demanding advanced scientific inquiry and reasoning. This benchmark is crucial for developing and evaluating AI systems capable of contributing meaningfully to cutting-edge scientific research and problem-solving, potentially accelerating breakthroughs across various scientific disciplines.

Objectives and Implications

GPQA, the Graduate-Level Google-Proof Q&A Benchmark, rigorously evaluates Large Language Models (LLMs) through 448 meticulously crafted multiple-choice questions spanning biology, physics, and chemistry. This benchmark probes LLMs' capacity for deep comprehension and sophisticated reasoning within these scientific domains, serving as a critical metric for scalable oversight mechanisms. GPQA's design specifically targets the development of robust methodologies enabling human experts to effectively supervise and validate AI outputs, particularly in domains where AI capabilities may surpass human expertise.

The advent of GPQA represents a significant milestone in AI assessment, directly addressing the critical need for models capable of processing and generating precise information in specialized scientific fields. As AI technology advances, GPQA and similar benchmarks become indispensable tools for quantifying progress towards AI systems capable of meaningful contributions to scientific research. These benchmarks drive the evolution of increasingly sophisticated AI architectures, aiming to minimize the disparity between AI-generated content and human expert knowledge. Ultimately, GPQA's rigorous standards promote the development of AI systems that can reliably produce truthful and accurate scientific information, potentially accelerating the pace of scientific discovery and innovation.
