MMLU Pro Benchmark
by Stephen M. Walker II, Co-Founder / CEO
What is the MMLU Pro Benchmark?
MMLU-Pro is an enhanced benchmark designed to evaluate the language understanding capabilities of LLMs across a broader and more challenging set of tasks. It builds upon the original Massive Multitask Language Understanding (MMLU) dataset by addressing several limitations and introducing new features to increase the difficulty and robustness of the evaluation.
Key enhancements in MMLU-Pro include:
- Expanded task diversity — MMLU-Pro introduces additional domains and subjects, providing a more comprehensive assessment of model capabilities.
- Increased complexity — Questions are designed to be more challenging, requiring deeper reasoning and problem-solving skills.
- Improved question quality — Ambiguous or poorly phrased questions have been refined or replaced to ensure clarity and accuracy.
- Enhanced answer validation — A rigorous review process has been implemented to minimize incorrect answers and improve overall dataset quality.
- Multi-step reasoning tasks — New questions that require models to break down complex problems into multiple steps have been added.
MMLU-Pro serves as a more rigorous, standardized way to evaluate AI performance across a wide range of disciplines, from advanced mathematics and specialized scientific fields to intricate legal and ethical scenarios. It provides a comprehensive assessment of an LLM's knowledge breadth, reasoning capabilities, and problem-solving skills.
The benchmark allows for testing and comparing various state-of-the-art language models, including but not limited to OpenAI's GPT-4, Google's Gemini, Anthropic's Claude, and Mistral AI's models.
Researchers and AI teams can use MMLU-Pro for in-depth evaluations when developing, fine-tuning, or benchmarking language models, especially after making significant modifications to foundation models. This enhanced benchmark offers a more nuanced and challenging assessment of an LLM's true capabilities and limitations.
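For teams that want to run these evaluations themselves, the dataset is distributed through the Hugging Face Hub. The minimal sketch below assumes the publicly hosted `TIGER-Lab/MMLU-Pro` dataset ID and a `test` split with a `category` field; adjust the names if the hosted copy differs.

```python
from datasets import load_dataset

# Pull the benchmark; it ships a small validation split (few-shot exemplars) and a test split.
mmlu_pro = load_dataset("TIGER-Lab/MMLU-Pro")

test_set = mmlu_pro["test"]
print(f"{len(test_set)} questions across {len(set(test_set['category']))} categories")
```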
MMLU Pro Leaderboard
Model | Overall | Business | Engineering | Law |
---|---|---|---|---|
Claude 3.5 Sonnet | 0.7283 | 0.782 | 0.5686 | 0.5731 |
GPT-4o | 0.7255 | 0.7858 | 0.55 | 0.5104 |
Gemini 1.5 Pro | 0.6903 | 0.7288 | 0.4871 | 0.5077 |
Claude 3 Opus | 0.6845 | 0.7338 | 0.484 | 0.5349 |
Qwen2-72B-Chat | 0.6438 | 0.6996 | 0.6724 | 0.4587 |
GPT-4 Turbo | 0.6371 | 0.673 | 0.3591 | 0.5123 |
DeepSeek-Coder-V2-Instruct | 0.6363 | 0.7326 | 0.5175 | 0.3506 |
Updated June 28, 2024
What are the key differences from the original MMLU?
MMLU-Pro introduces several key enhancements over the original MMLU dataset, aimed at providing a more comprehensive and challenging evaluation of language models. These improvements address limitations in the original benchmark and create a more robust testing environment. The main differences include:
- Increased answer options — MMLU-Pro expands the number of answer choices from 4 to 10, making the evaluation more realistic and challenging. This change significantly reduces the probability of answering correctly by random guessing (see the short calculation after this list).
- Enhanced reasoning requirements — While the original MMLU focused primarily on knowledge-based questions, MMLU-Pro incorporates more problems that demand complex reasoning. This shift results in Chain-of-Thought (CoT) approaches outperforming traditional Perplexity (PPL) methods by up to 20% on MMLU-Pro.
- Improved robustness — By increasing the number of distractors, MMLU-Pro reduces the impact of chance on correct answers. This enhancement leads to greater benchmark stability, with sensitivity to prompt variations decreasing from 4-5% in MMLU to just 2% in MMLU-Pro across 24 different prompt styles tested.
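The calculation behind the first point is straightforward: a blind guesser's expected accuracy is one divided by the number of options, so the floor falls from 25% to 10%. A minimal sketch:

```python
# Quick arithmetic on how the larger option set changes the random-guessing floor:
# a blind guesser's expected accuracy is simply 1 / number_of_options.
def random_guess_accuracy(num_options: int) -> float:
    return 1 / num_options

print(f"MMLU (4 options):      {random_guess_accuracy(4):.0%}")   # 25%
print(f"MMLU-Pro (10 options): {random_guess_accuracy(10):.0%}")  # 10%
```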
What is in the MMLU Pro Dataset?
Questions and Options: The MMLU-Pro dataset typically features ten multiple-choice options per question, expanding from the original MMLU's four options. This increase enhances complexity and requires deeper reasoning. Some questions have fewer options due to the removal of unreasonable choices during review.
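Each record in the public Hugging Face release bundles the question stem, the option list, and the gold answer. The field names used below (`question`, `options`, `answer`, `category`) are assumptions based on that release; verify them against the actual schema before relying on them.

```python
from datasets import load_dataset

# Grab one record from the test split; field names are assumptions to verify.
example = load_dataset("TIGER-Lab/MMLU-Pro", split="test")[0]

print(example["category"])          # discipline, e.g. "math"
print(example["question"])          # question stem
for letter, option in zip("ABCDEFGHIJ", example["options"]):
    print(f"  {letter}. {option}")  # up to ten answer choices
print("gold answer:", example["answer"])
```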
Dataset Sources and Enhanced Disciplines
Source | Description | Enhanced Disciplines |
---|---|---|
Original MMLU | Refined by removing trivial and ambiguous questions | All disciplines |
STEM Website | Carefully selected high-quality STEM problems | Biology, Business, Chemistry, Computer Science, Economics, Engineering, Math, Physics, Psychology |
TheoremQA | Human-annotated questions requiring theorem application | Math, Physics, Computer Science |
SciBench | College-level science exam questions | Biology, Chemistry, Physics |
Question Distribution
Discipline | Total Questions | From Original MMLU | Newly Added |
---|---|---|---|
Math | 1351 | 846 | 505 |
Physics | 1299 | 411 | 888 |
Chemistry | 1132 | 178 | 954 |
Law | 1101 | 1101 | 0 |
Engineering | 969 | 67 | 902 |
Other | 924 | 924 | 0 |
Economics | 844 | 444 | 400 |
Health | 818 | 818 | 0 |
Psychology | 798 | 493 | 305 |
Business | 789 | 155 | 634 |
Biology | 717 | 219 | 498 |
Philosophy | 499 | 499 | 0 |
Computer Science | 410 | 274 | 136 |
History | 381 | 381 | 0 |
Total | 12032 | 6810 | 5222 |
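As a sanity check, the per-discipline counts in the table above can be recomputed directly from the test split, again assuming the Hugging Face copy and its `category` field:

```python
from collections import Counter
from datasets import load_dataset

test_set = load_dataset("TIGER-Lab/MMLU-Pro", split="test")
counts = Counter(test_set["category"])  # tally questions per discipline

for category, count in counts.most_common():
    print(f"{category:>20}: {count}")
print(f"{'Total':>20}: {sum(counts.values())}")
```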
Dataset Construction
The MMLU-Pro dataset construction process involved several key steps to ensure its quality and effectiveness. The process began with a thorough review of the original MMLU dataset, retaining only the most challenging and relevant questions.
To expand the dataset, additional high-quality questions were carefully selected from STEM websites, TheoremQA, and SciBench, focusing on complex problems that would challenge advanced models' analytical capabilities.
GPT-4 was then used to increase the number of answer choices from four to ten for each question, creating plausible distractors that require discriminative reasoning. Finally, a panel of over ten experts rigorously reviewed each question and its associated options, ensuring they were challenging, comprehensive, accurate, and fair. This meticulous process was essential to maintain the dataset's integrity and effectiveness as a benchmarking tool for advanced language models.
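The exact augmentation prompt is not published here, so the sketch below is only an illustration of the general approach: prompting a strong model to add plausible distractors until each question has ten options. The prompt wording, the `gpt-4o` model name, and the helper signature are assumptions, not the authors' actual pipeline.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def generate_distractors(question: str, options: list[str], answer: str, target: int = 10) -> str:
    # Ask the model for additional wrong-but-plausible options to reach `target` choices.
    prompt = (
        f"Question: {question}\n"
        f"Existing options: {options}\n"
        f"Correct answer: {answer}\n"
        f"Write {target - len(options)} additional incorrect but plausible options. "
        "Each distractor should be clearly wrong yet require reasoning to rule out. "
        "Return one option per line."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```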
CoT vs Direct Evaluation
Unlike the original MMLU, where PPL evaluation performs well, MMLU-Pro requires Chain-of-Thought (CoT) reasoning to achieve strong results. The following table shows the performance difference between CoT and direct (non-CoT) prompting for GPT-4o across various disciplines:
Prompting | Overall | Business | Engineering | Physics |
---|---|---|---|---|
CoT | 0.7255 | 0.7858 | 0.5500 | 0.7467 |
Direct | 0.5346 | 0.3920 | 0.3981 | 0.3971 |
The impact of Chain-of-Thought (CoT) prompting varies significantly across different categories in the MMLU-Pro dataset. Math shows the most substantial improvement with CoT, demonstrating a remarkable increase of 0.4182 in performance. Chemistry and Business follow closely behind, with improvements of 0.3946 and 0.3938 respectively. Physics also sees a notable boost of 0.3496, while Computer Science rounds out the top five most impacted categories with a 0.2016 increase.
On the other end of the spectrum, some categories show minimal or even negative impacts from CoT prompting. Law, interestingly, experiences a slight decrease in performance (-0.0316) when using CoT. History sees an almost negligible improvement of 0.0058, while Health, Psychology, and Philosophy show modest gains of 0.0279, 0.0291, and 0.0400 respectively. These variations highlight the diverse nature of reasoning required across different disciplines and underscore the importance of tailored approaches in language model prompting.
As the table shows, GPT-4o's overall score drops by roughly 19 percentage points without chain-of-thought reasoning, and the gap approaches 40 points in categories such as Business. This substantial difference underscores the challenging nature of the MMLU-Pro dataset and highlights the importance of CoT reasoning in achieving optimal results across various disciplines.
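The two prompting styles compared above differ only in what the model is asked to produce before the final letter. A rough sketch of what such templates might look like (the official evaluation's exact wording may differ):

```python
# Illustrative templates for the two prompting styles; not the official harness.
LETTERS = "ABCDEFGHIJ"

def format_options(options: list[str]) -> str:
    return "\n".join(f"{LETTERS[i]}. {opt}" for i, opt in enumerate(options))

def direct_prompt(question: str, options: list[str]) -> str:
    # Direct: ask only for the final letter, with no intermediate reasoning.
    return (
        f"Question: {question}\n{format_options(options)}\n"
        "Answer with the letter of the correct option only."
    )

def cot_prompt(question: str, options: list[str]) -> str:
    # Chain-of-thought: ask the model to reason step by step before committing.
    return (
        f"Question: {question}\n{format_options(options)}\n"
        "Let's think step by step, then finish with 'The answer is (X)'."
    )
```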
How do the results compare: MMLU vs. MMLU-Pro
The table below compares the performance of various language models on the original MMLU dataset and the more challenging MMLU-Pro dataset:
Model | Original MMLU Score | MMLU-Pro Score | Performance Drop |
---|---|---|---|
GPT-4o | 0.887 | 0.7255 | 0.1615 (16.15%) |
Claude-3-Opus | 0.868 | 0.6845 | 0.1835 (18.35%) |
Claude-3-Sonnet | 0.815 | 0.5511 | 0.2639 (26.39%) |
Gemini 1.5 Flash | 0.789 | 0.5912 | 0.1978 (19.78%) |
Llama-3-70B-Instruct | 0.820 | 0.5620 | 0.2580 (25.80%) |
We can observe significant variations in performance drops across different models when tested on MMLU-Pro compared to the original MMLU (each drop is simply the difference between the two scores, recomputed in the sketch after this list):
- GPT-4o shows the smallest decrease, with only a 16.15% drop in performance.
- Other models experience more substantial declines, ranging from 18.35% to 26.39%.
- While not shown in the table, it's worth noting that some models like Mixtral-8x7B reportedly experience even larger drops, exceeding 30%.
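A small sketch that recomputes the drop column from the published numbers:

```python
# The "Performance Drop" column is the plain difference between the two scores.
scores = {  # model: (original MMLU, MMLU-Pro)
    "GPT-4o": (0.887, 0.7255),
    "Claude-3-Opus": (0.868, 0.6845),
    "Claude-3-Sonnet": (0.815, 0.5511),
    "Gemini 1.5 Flash": (0.789, 0.5912),
    "Llama-3-70B-Instruct": (0.820, 0.5620),
}

for model, (mmlu, mmlu_pro) in scores.items():
    drop = mmlu - mmlu_pro
    print(f"{model:>22}: {drop:.4f} ({drop:.2%})")
```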