MMLU Pro Benchmark
by Stephen M. Walker II, Co-Founder / CEO
What is the MMLU Pro Benchmark?
MMLU-Pro is an enhanced benchmark designed to evaluate the language understanding capabilities of LLMs across a broader and more challenging set of tasks. It builds upon the original Massive Multitask Language Understanding (MMLU) dataset by addressing several limitations and introducing new features to increase the difficulty and robustness of the evaluation.
Key enhancements in MMLU-Pro include:
- Expanded task diversity — MMLU-Pro introduces additional domains and subjects, providing a more comprehensive assessment of model capabilities.
- Increased complexity — Questions are designed to be more challenging, requiring deeper reasoning and problem-solving skills.
- Improved question quality — Ambiguous or poorly phrased questions have been refined or replaced to ensure clarity and accuracy.
- Enhanced answer validation — A rigorous review process has been implemented to minimize incorrect answers and improve overall dataset quality.
- Multi-step reasoning tasks — New questions that require models to break down complex problems into multiple steps have been added.
MMLU-Pro serves as a more rigorous, standardized way to evaluate AI performance across a wide range of disciplines, from advanced mathematics and specialized scientific fields to intricate legal and ethical scenarios. It provides a comprehensive assessment of an LLM's knowledge breadth, reasoning capabilities, and problem-solving skills.
The benchmark allows for testing and comparing various state-of-the-art language models, including but not limited to OpenAI's GPT-4, Google's Gemini, Anthropic's Claude, and Mistral AI's models.
Researchers and AI teams can use MMLU-Pro for in-depth evaluations when developing, fine-tuning, or benchmarking language models, especially after making significant modifications to foundation models. This enhanced benchmark offers a more nuanced and challenging assessment of an LLM's true capabilities and limitations.
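For teams that want to run these evaluations themselves, the dataset is distributed through the Hugging Face Hub. The minimal sketch below assumes the publicly hosted `TIGER-Lab/MMLU-Pro` dataset ID and a `test` split with a `category` field; adjust the names if the hosted copy differs.

```python
from datasets import load_dataset

# Pull the benchmark; it ships a small validation split (few-shot exemplars) and a test split.
mmlu_pro = load_dataset("TIGER-Lab/MMLU-Pro")

test_set = mmlu_pro["test"]
print(f"{len(test_set)} questions across {len(set(test_set['category']))} categories")
```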
MMLU Pro Leaderboard
Model | Overall | Business | Engineering | Law |
---|---|---|---|---|
Claude 3.5 Sonnet | 0.7283 | 0.782 | 0.5686 | 0.5731 |
GPT-4o | 0.7255 | 0.7858 | 0.55 | 0.5104 |
Gemini 1.5 Pro | 0.6903 | 0.7288 | 0.4871 | 0.5077 |
Claude 3 Opus | 0.6845 | 0.7338 | 0.484 | 0.5349 |
Qwen2-72B-Chat | 0.6438 | 0.6996 | 0.6724 | 0.4587 |
GPT-4 Turbo | 0.6371 | 0.673 | 0.3591 | 0.5123 |
DeepSeek-Coder-V2-Instruct | 0.6363 | 0.7326 | 0.5175 | 0.3506 |
Updated June 28, 2024
What are the key differences from the original MMLU?
MMLU-Pro introduces several key enhancements over the original MMLU dataset, aimed at providing a more comprehensive and challenging evaluation of language models. These improvements address limitations in the original benchmark and create a more robust testing environment. The main differences include:
- Increased answer options — MMLU-Pro expands the number of answer choices from 4 to 10, making the evaluation more realistic and challenging. This change significantly reduces the probability of answering correctly by random guessing (see the short calculation after this list).
- Enhanced reasoning requirements — While the original MMLU focused primarily on knowledge-based questions, MMLU-Pro incorporates more problems that demand complex reasoning. This shift results in Chain-of-Thought (CoT) approaches outperforming traditional Perplexity (PPL) methods by up to 20% on MMLU-Pro.
- Improved robustness — By increasing the number of distractors, MMLU-Pro reduces the impact of chance on correct answers. This enhancement leads to greater benchmark stability, with sensitivity to prompt variations decreasing from 4-5% in MMLU to just 2% in MMLU-Pro across 24 different prompt styles tested.
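The calculation behind the first point is straightforward: a blind guesser's expected accuracy is one divided by the number of options, so the floor falls from 25% to 10%. A minimal sketch:

```python
# Quick arithmetic on how the larger option set changes the random-guessing floor:
# a blind guesser's expected accuracy is simply 1 / number_of_options.
def random_guess_accuracy(num_options: int) -> float:
    return 1 / num_options

print(f"MMLU (4 options):      {random_guess_accuracy(4):.0%}")   # 25%
print(f"MMLU-Pro (10 options): {random_guess_accuracy(10):.0%}")  # 10%
```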
What is in the MMLU Pro Dataset?
Questions and Options: The MMLU-Pro dataset typically features ten multiple-choice options per question, expanding from the original MMLU's four options. This increase enhances complexity and requires deeper reasoning. Some questions have fewer options due to the removal of unreasonable choices during review.
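Each record in the public Hugging Face release bundles the question stem, the option list, and the gold answer. The field names used below (`question`, `options`, `answer`, `category`) are assumptions based on that release; verify them against the actual schema before relying on them.

```python
from datasets import load_dataset

# Grab one record from the test split; field names are assumptions to verify.
example = load_dataset("TIGER-Lab/MMLU-Pro", split="test")[0]

print(example["category"])          # discipline, e.g. "math"
print(example["question"])          # question stem
for letter, option in zip("ABCDEFGHIJ", example["options"]):
    print(f"  {letter}. {option}")  # up to ten answer choices
print("gold answer:", example["answer"])
```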
Dataset Sources and Enhanced Disciplines
Source | Description | Enhanced Disciplines |
---|---|---|
Original MMLU | Refined by removing trivial and ambiguous questions | All disciplines |
STEM Website | Carefully selected high-quality STEM problems | Biology, Business, Chemistry, Computer Science, Economics, Engineering, Math, Physics, Psychology |
TheoremQA | Human-annotated questions requiring theorem application | Math, Physics, Computer Science |
SciBench | College-level science exam questions | Biology, Chemistry, Physics |
Question Distribution
Discipline | Total Questions | From Original MMLU | Newly Added |
---|---|---|---|
Math | 1351 | 846 | 505 |
Physics | 1299 | 411 | 888 |
Chemistry | 1132 | 178 | 954 |
Law | 1101 | 1101 | 0 |
Engineering | 969 | 67 | 902 |
Other | 924 | 924 | 0 |
Economics | 844 | 444 | 400 |
Health | 818 | 818 | 0 |
Psychology | 798 | 493 | 305 |
Business | 789 | 155 | 634 |
Biology | 717 | 219 | 498 |
Philosophy | 499 | 499 | 0 |
Computer Science | 410 | 274 | 136 |
History | 381 | 381 | 0 |
Total | 12032 | 6810 | 5222 |
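As a sanity check, the per-discipline counts in the table above can be recomputed directly from the test split, again assuming the Hugging Face copy and its `category` field:

```python
from collections import Counter
from datasets import load_dataset

test_set = load_dataset("TIGER-Lab/MMLU-Pro", split="test")
counts = Counter(test_set["category"])  # tally questions per discipline

for category, count in counts.most_common():
    print(f"{category:>20}: {count}")
print(f"{'Total':>20}: {sum(counts.values())}")
```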
Dataset Construction
The MMLU-Pro dataset construction process involved several key steps to ensure its quality and effectiveness. The process began with a thorough review of the original MMLU dataset, retaining only the most challenging and relevant questions.
To expand the dataset, additional high-quality questions were carefully selected from STEM websites, TheoremQA, and SciBench, focusing on complex problems that would challenge advanced models' analytical capabilities.
GPT-4 was then used to increase the number of answer choices from four to ten for each question, creating plausible distractors that require discriminative reasoning. Finally, a panel of over ten experts rigorously reviewed each question and its associated options, ensuring they were challenging, comprehensive, accurate, and fair. This meticulous process was essential to maintain the dataset's integrity and effectiveness as a benchmarking tool for advanced language models.
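The exact augmentation prompt is not published here, so the sketch below is only an illustration of the general approach: prompting a strong model to add plausible distractors until each question has ten options. The prompt wording, the `gpt-4o` model name, and the helper signature are assumptions, not the authors' actual pipeline.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def generate_distractors(question: str, options: list[str], answer: str, target: int = 10) -> str:
    # Ask the model for additional wrong-but-plausible options to reach `target` choices.
    prompt = (
        f"Question: {question}\n"
        f"Existing options: {options}\n"
        f"Correct answer: {answer}\n"
        f"Write {target - len(options)} additional incorrect but plausible options. "
        "Each distractor should be clearly wrong yet require reasoning to rule out. "
        "Return one option per line."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```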
CoT vs Direct Evaluation
Unlike the original MMLU, where PPL evaluation performs well, MMLU-Pro requires Chain-of-Thought (CoT) reasoning to achieve strong results. The following table shows the performance difference between CoT and direct (non-CoT) prompting for GPT-4o across various disciplines:
Prompting | Overall | Business | Engineering | Physics |
---|---|---|---|---|
CoT | 0.7255 | 0.7858 | 0.5500 | 0.7467 |
Direct | 0.5346 | 0.3920 | 0.3981 | 0.3971 |
The impact of Chain-of-Thought (CoT) prompting varies significantly across different categories in the MMLU-Pro dataset. Math shows the most substantial improvement with CoT, demonstrating a remarkable increase of 0.4182 in performance. Chemistry and Business follow closely behind, with improvements of 0.3946 and 0.3938 respectively. Physics also sees a notable boost of 0.3496, while Computer Science rounds out the top five most impacted categories with a 0.2016 increase.
On the other end of the spectrum, some categories show minimal or even negative impacts from CoT prompting. Law, interestingly, experiences a slight decrease in performance (-0.0316) when using CoT. History sees an almost negligible improvement of 0.0058, while Health, Psychology, and Philosophy show modest gains of 0.0279, 0.0291, and 0.0400 respectively. These variations highlight the diverse nature of reasoning required across different disciplines and underscore the importance of tailored approaches in language model prompting.
As the table shows, GPT-4o's overall score drops by roughly 19 percentage points without chain-of-thought reasoning, and the gap approaches 40 points in categories such as Business. This substantial difference underscores the challenging nature of the MMLU-Pro dataset and highlights the importance of CoT reasoning in achieving optimal results across various disciplines.
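The two prompting styles compared above differ only in what the model is asked to produce before the final letter. A rough sketch of what such templates might look like (the official evaluation's exact wording may differ):

```python
# Illustrative templates for the two prompting styles; not the official harness.
LETTERS = "ABCDEFGHIJ"

def format_options(options: list[str]) -> str:
    return "\n".join(f"{LETTERS[i]}. {opt}" for i, opt in enumerate(options))

def direct_prompt(question: str, options: list[str]) -> str:
    # Direct: ask only for the final letter, with no intermediate reasoning.
    return (
        f"Question: {question}\n{format_options(options)}\n"
        "Answer with the letter of the correct option only."
    )

def cot_prompt(question: str, options: list[str]) -> str:
    # Chain-of-thought: ask the model to reason step by step before committing.
    return (
        f"Question: {question}\n{format_options(options)}\n"
        "Let's think step by step, then finish with 'The answer is (X)'."
    )
```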
How do the results compare: MMLU vs. MMLU-Pro
The table below compares the performance of various language models on the original MMLU dataset and the more challenging MMLU-Pro dataset:
Model | Original MMLU Score | MMLU-Pro Score | Performance Drop |
---|---|---|---|
GPT-4o | 0.887 | 0.7255 | 0.1615 (16.15%) |
Claude-3-Opus | 0.868 | 0.6845 | 0.1835 (18.35%) |
Claude-3-Sonnet | 0.815 | 0.5511 | 0.2639 (26.39%) |
Gemini 1.5 Flash | 0.789 | 0.5912 | 0.1978 (19.78%) |
Llama-3-70B-Instruct | 0.820 | 0.5620 | 0.2580 (25.80%) |
We can observe significant variations in performance drops across different models when tested on MMLU-Pro compared to the original MMLU (each drop is simply the difference between the two scores, recomputed in the sketch after this list):
- GPT-4o shows the smallest decrease, with only a 16.15% drop in performance.
- Other models experience more substantial declines, ranging from 18.35% to 26.39%.
- While not shown in the table, it's worth noting that some models like Mixtral-8x7B reportedly experience even larger drops, exceeding 30%.
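A small sketch that recomputes the drop column from the published numbers:

```python
# The "Performance Drop" column is the plain difference between the two scores.
scores = {  # model: (original MMLU, MMLU-Pro)
    "GPT-4o": (0.887, 0.7255),
    "Claude-3-Opus": (0.868, 0.6845),
    "Claude-3-Sonnet": (0.815, 0.5511),
    "Gemini 1.5 Flash": (0.789, 0.5912),
    "Llama-3-70B-Instruct": (0.820, 0.5620),
}

for model, (mmlu, mmlu_pro) in scores.items():
    drop = mmlu - mmlu_pro
    print(f"{model:>22}: {drop:.4f} ({drop:.2%})")
```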