MMLU Pro Benchmark

by Stephen M. Walker II, Co-Founder / CEO

What is the MMLU Pro Benchmark?

MMLU-Pro is an enhanced benchmark designed to evaluate the language understanding capabilities of LLMs across a broader and more challenging set of tasks. It builds upon the original Massive Multitask Language Understanding (MMLU) dataset by addressing several limitations and introducing new features to increase the difficulty and robustness of the evaluation.


Key enhancements in MMLU-Pro include:

  • Expanded task diversity — MMLU-Pro introduces additional domains and subjects, providing a more comprehensive assessment of model capabilities.
  • Increased complexity — Questions are designed to be more challenging, requiring deeper reasoning and problem-solving skills.
  • Improved question quality — Ambiguous or poorly phrased questions have been refined or replaced to ensure clarity and accuracy.
  • Enhanced answer validation — A rigorous review process has been implemented to minimize incorrect answers and improve overall dataset quality.
  • Multi-step reasoning tasks — New questions that require models to break down complex problems into multiple steps have been added.

MMLU-Pro offers a more rigorous, standardized way to evaluate AI performance across a wide range of disciplines, from advanced mathematics and specialized scientific fields to intricate legal and ethical scenarios. It assesses an LLM's breadth of knowledge, reasoning capabilities, and problem-solving skills.

The benchmark allows for testing and comparing various state-of-the-art language models, including but not limited to OpenAI's GPT-4, Google's Gemini, Anthropic's Claude, and Mistral AI's models.

Researchers and AI teams can utilize MMLU-Pro for in-depth evaluations when developing, fine-tuning, or benchmarking language models, especially when significant modifications are made to foundation models. This enhanced benchmark offers a more nuanced and challenging assessment of an LLM's true capabilities and limitations.

MMLU Pro Leaderboard

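Model                      | Overall | Business | Engineering | Law
---------------------------|---------|----------|-------------|-------
Claude 3.5 Sonnet          | 0.7283  | 0.782    | 0.5686      | 0.5731
GPT-4o                     | 0.7255  | 0.7858   | 0.55        | 0.5104
Gemini 1.5 Pro             | 0.6903  | 0.7288   | 0.4871      | 0.5077
Claude 3 Opus              | 0.6845  | 0.7338   | 0.484       | 0.5349
Qwen2 72B Chat             | 0.6438  | 0.6996   | 0.6724      | 0.4587
GPT-4 Turbo                | 0.6371  | 0.673    | 0.3591      | 0.5123
DeepSeek-Coder-V2-Instruct | 0.6363  | 0.7326   | 0.5175      | 0.3506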

Updated June 28, 2024

What are the key differences from the original MMLU?

MMLU-Pro introduces several key enhancements over the original MMLU dataset, aimed at providing a more comprehensive and challenging evaluation of language models. These improvements address limitations in the original benchmark and create a more robust testing environment. The main differences include:

  • Increased answer options — MMLU-Pro expands the number of answer choices from 4 to 10, making the evaluation more realistic and challenging. This change significantly reduces the probability of correct answers by random guessing.

  • Enhanced reasoning requirements — While the original MMLU focused primarily on knowledge-based questions, MMLU-Pro incorporates more problems that demand complex reasoning. This shift results in Chain-of-Thought (CoT) approaches outperforming traditional Perplexity (PPL) methods by up to 20% in MMLU-Pro.

  • Improved robustness — By increasing the number of distractors, MMLU-Pro reduces the impact of chance on correct answers. This enhancement leads to greater benchmark stability, with sensitivity to prompt variations decreasing from 4-5% in MMLU to just 2% in MMLU-Pro across 24 different prompt styles tested.
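To make the effect of the extra answer options concrete, the short sketch below computes the expected accuracy of uniform random guessing for a 4-option and a 10-option question. The function name `chance_accuracy` is made up for this illustration, and the calculation assumes every option is equally likely to be picked.

```python
from fractions import Fraction


def chance_accuracy(num_options: int) -> Fraction:
    """Expected accuracy of uniform random guessing on a single question."""
    if num_options < 1:
        raise ValueError("a question needs at least one option")
    return Fraction(1, num_options)


mmlu_chance = chance_accuracy(4)       # original MMLU: 4 options
mmlu_pro_chance = chance_accuracy(10)  # MMLU-Pro: typically 10 options

print(f"MMLU chance accuracy:     {float(mmlu_chance):.0%}")
print(f"MMLU-Pro chance accuracy: {float(mmlu_pro_chance):.0%}")
```

Under this assumption the guessing baseline falls from 25% to 10%, so a lucky guess inflates a model's score far less on MMLU-Pro.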

What is in the MMLU Pro Dataset?

Questions and Options: The MMLU-Pro dataset typically features ten multiple-choice options per question, expanding from the original MMLU's four options. This increase enhances complexity and requires deeper reasoning. Some questions have fewer options due to the removal of unreasonable choices during review.
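Because a question may carry anywhere up to ten options, any code that renders prompts has to handle a variable option count. The following is a minimal sketch of one way to do that, lettering options from A onward. The function name, the prompt layout, and the example record are all assumptions made for illustration; they are not taken from the dataset or its official evaluation code.

```python
import string

MAX_OPTIONS = 10  # MMLU-Pro questions have at most ten options (A-J)


def format_question(question: str, options: list[str]) -> str:
    """Render a question followed by lettered options, one per line."""
    if not 1 <= len(options) <= MAX_OPTIONS:
        raise ValueError(f"expected 1 to {MAX_OPTIONS} options, got {len(options)}")
    letters = string.ascii_uppercase[: len(options)]
    lines = [f"Question: {question}"]
    lines += [f"{letter}. {text}" for letter, text in zip(letters, options)]
    return "\n".join(lines)


# Hypothetical record, invented for illustration (not taken from the dataset).
example = {
    "question": "Which of the following integers is prime?",
    "options": ["21", "27", "33", "39", "49", "51", "57", "61", "63", "69"],
}

print(format_question(example["question"], example["options"]))
```

Deriving the letters from the actual number of options means a question that lost a few choices during review still renders without gaps or unused letters.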

Dataset Sources and Enhanced Disciplines

Source        | Description                                             | Enhanced Disciplines
--------------|---------------------------------------------------------|---------------------
Original MMLU | Refined by removing trivial and ambiguous questions     | All disciplines
STEM Website  | Carefully selected high-quality STEM problems           | Biology, Business, Chemistry, Computer Science, Economics, Engineering, Math, Physics, Psychology
TheoremQA     | Human-annotated questions requiring theorem application | Math, Physics, Computer Science
SciBench      | College-level science exam questions                    | Biology, Chemistry, Physics

Question Distribution

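Discipline       | Total Questions | From Original MMLU | Newly Added
-----------------|-----------------|--------------------|------------
Math             | 1351            | 846                | 505
Physics          | 1299            | 411                | 888
Chemistry        | 1132            | 178                | 954
Law              | 1101            | 1101               | 0
Engineering      | 969             | 67                 | 902
Other            | 924             | 924                | 0
Economics        | 844             | 444                | 400
Health           | 818             | 818                | 0
Psychology       | 798             | 493                | 305
Business         | 789             | 155                | 634
Biology          | 717             | 219                | 498
Philosophy       | 499             | 499                | 0
Computer Science | 410             | 274                | 136
History          | 381             | 381                | 0
Total            | 12032           | 6810               | 5222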
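As a quick sanity check on the distribution above, the sketch below re-enters the per-discipline counts, confirms that each row adds up, and computes the share of newly added questions. The variable names are illustrative; the numbers are copied from the table.

```python
# (total, from original MMLU, newly added) per discipline, copied from the table.
distribution = {
    "Math": (1351, 846, 505),
    "Physics": (1299, 411, 888),
    "Chemistry": (1132, 178, 954),
    "Law": (1101, 1101, 0),
    "Engineering": (969, 67, 902),
    "Other": (924, 924, 0),
    "Economics": (844, 444, 400),
    "Health": (818, 818, 0),
    "Psychology": (798, 493, 305),
    "Business": (789, 155, 634),
    "Biology": (717, 219, 498),
    "Philosophy": (499, 499, 0),
    "Computer Science": (410, 274, 136),
    "History": (381, 381, 0),
}

# Each row should satisfy: total = from original MMLU + newly added.
for name, (total, original, new) in distribution.items():
    assert original + new == total, f"row does not add up: {name}"

grand_total = sum(total for total, _, _ in distribution.values())
original_total = sum(original for _, original, _ in distribution.values())
new_total = sum(new for _, _, new in distribution.values())

print(f"Total questions: {grand_total}")
print(f"From original MMLU: {original_total}")
print(f"Newly added: {new_total} ({new_total / grand_total:.1%} of the dataset)")
```

The column sums match the Total row (12032, 6810, and 5222), and newly added questions make up roughly 43% of the dataset, concentrated in the STEM and Business disciplines.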

Dataset Construction

The MMLU-Pro dataset construction process involved several key steps to ensure its quality and effectiveness. The process began with a thorough review of the original MMLU dataset, retaining only the most challenging and relevant questions.

To expand the dataset, additional high-quality questions were carefully selected from STEM websites, TheoremQA, and SciBench, focusing on complex problems that would challenge advanced models' analytical capabilities.

GPT-4 was then used to increase the number of answer choices from four to ten for each question, creating plausible distractors that require discriminative reasoning. Finally, a panel of over ten experts rigorously reviewed each question and its associated options, ensuring they were challenging, comprehensive, accurate, and fair. This meticulous process was essential to maintain the dataset's integrity and effectiveness as a benchmarking tool for advanced language models.

CoT vs Direct Evaluation

Unlike the original MMLU, which favors PPL evaluation, MMLU-Pro requires Chain-of-Thought (CoT) reasoning to achieve better results. The following table shows the performance difference between CoT and direct (non-CoT) prompting for the GPT-4o model, overall and in three disciplines:

Prompting | Overall | Business | Engineering | Physics
----------|---------|----------|-------------|--------
CoT       | 0.7255  | 0.7858   | 0.5500      | 0.7467
Direct    | 0.5346  | 0.3920   | 0.3981      | 0.3971

The impact of Chain-of-Thought (CoT) prompting varies significantly across different categories in the MMLU-Pro dataset. Math shows the most substantial improvement with CoT, demonstrating a remarkable increase of 0.4182 in performance. Chemistry and Business follow closely behind, with improvements of 0.3946 and 0.3938 respectively. Physics also sees a notable boost of 0.3496, while Computer Science rounds out the top five most impacted categories with a 0.2016 increase.

On the other end of the spectrum, some categories show minimal or even negative impacts from CoT prompting. Law, interestingly, experiences a slight decrease in performance (-0.0316) when using CoT. History sees an almost negligible improvement of 0.0058, while Health, Psychology, and Philosophy show modest gains of 0.0279, 0.0291, and 0.0400 respectively. These variations highlight the diverse nature of reasoning required across different disciplines and underscore the importance of tailored approaches in language model prompting.

As the data shows, GPT-4o's overall accuracy falls by about 19 percentage points (from 0.7255 to 0.5346) without chain-of-thought reasoning, and the gap is larger still in reasoning-heavy disciplines such as Math (0.4182). This substantial difference underscores the challenging nature of the MMLU-Pro dataset and highlights the importance of CoT reasoning in achieving optimal results across various disciplines.
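The per-category gains quoted above are plain differences between the CoT and direct scores. The sketch below reproduces that arithmetic for the four columns reported in the GPT-4o table; the dictionary and variable names are illustrative, and only the figures shown in this article are used.

```python
# GPT-4o scores from the CoT vs. direct prompting table above.
cot = {"Overall": 0.7255, "Business": 0.7858, "Engineering": 0.5500, "Physics": 0.7467}
direct = {"Overall": 0.5346, "Business": 0.3920, "Engineering": 0.3981, "Physics": 0.3971}

# Gain from CoT = CoT score minus direct score, rounded to four decimals.
gains = {name: round(cot[name] - direct[name], 4) for name in cot}

# List categories from largest to smallest gain.
for name, gain in sorted(gains.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<12} +{gain:.4f} ({gain * 100:.2f} percentage points)")
```

The Business and Physics results match the 0.3938 and 0.3496 improvements cited in the text, and the Overall gain of 0.1909 is the roughly 19-point gap between the two prompting styles.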

How do the results compare: MMLU vs. MMLU-Pro

The table below compares the performance of various language models on the original MMLU dataset and the more challenging MMLU-Pro dataset:

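Model                | Original MMLU Score | MMLU-Pro Score | Performance Drop
---------------------|---------------------|----------------|-----------------
GPT-4o               | 0.887               | 0.7255         | 0.1615 (16.15%)
Claude-3-Opus        | 0.868               | 0.6845         | 0.1835 (18.35%)
Claude-3-Sonnet      | 0.815               | 0.5511         | 0.2639 (26.39%)
Gemini 1.5 Flash     | 0.789               | 0.5912         | 0.1978 (19.78%)
Llama-3-70B-Instruct | 0.820               | 0.5620         | 0.2580 (25.80%)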

We can observe significant variations in performance drops across different models when tested on MMLU-Pro compared to the original MMLU:

  • GPT-4o shows the smallest decrease, losing 16.15 percentage points.
  • The other models decline more, by 18.35 to 26.39 percentage points.
  • While not shown in the table, it's worth noting that some models like Mixtral-8x7B reportedly experience even larger drops, exceeding 30%.
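The drops in the table are absolute differences between the two scores (percentage points), which is not the same as a relative decline. The sketch below computes both for the listed models so the two readings can be compared; the function name `score_drop` is an assumption for this example, and the relative figures are derived here rather than taken from the source.

```python
def score_drop(mmlu: float, mmlu_pro: float) -> tuple[float, float]:
    """Return (absolute drop, drop relative to the original MMLU score)."""
    absolute = mmlu - mmlu_pro
    relative = absolute / mmlu
    return absolute, relative


# (original MMLU score, MMLU-Pro score), copied from the table above.
scores = {
    "GPT-4o": (0.887, 0.7255),
    "Claude-3-Opus": (0.868, 0.6845),
    "Claude-3-Sonnet": (0.815, 0.5511),
    "Gemini 1.5 Flash": (0.789, 0.5912),
    "Llama-3-70B-Instruct": (0.820, 0.5620),
}

drops = {model: score_drop(mmlu, pro) for model, (mmlu, pro) in scores.items()}

for model, (absolute, relative) in drops.items():
    print(f"{model:<22} {absolute * 100:6.2f} points   {relative:6.1%} relative")
```

Since every original MMLU score is below 1.0, the relative decline is always larger than the point difference, so it helps to state which of the two a quoted percentage refers to.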
