What is the MMLU Benchmark (Massive Multitask Language Understanding)?
The MMLU Benchmark, also known as the Massive Multitask Language Understanding benchmark, is a comprehensive and challenging test designed to measure a text model's multitask accuracy by evaluating it in zero-shot and few-shot settings.
MMLU spans 57 tasks, including elementary mathematics, US history, computer science, and law, and requires models to demonstrate both a broad knowledge base and problem-solving skills. It provides a way to test and compare language models such as OpenAI's GPT-4, Google's Gemini, and Anthropic's Claude 2. It thus acts as a standard for assessing the generalization capabilities of these models, helping researchers and developers make informed decisions when selecting models for specific applications.
MMLU 5-Shot Leaderboard
| Model | MMLU Score (%) | Release Date |
| --- | --- | --- |
| GPT-4 | 86.4 | March 2023 |
| Gemini Ultra | 83.7 | December 2023 |
| PaLM 2 | 78.3 | May 2023 |
| PaLM | 75.2 | April 2022 |
| Gemini Pro | 71.8 | December 2023 |
| Mixtral 8x7B | 71.3 | December 2023 |
| GPT-3.5 | 70.0 | November 2022 |
| Zephyr 7B | 66.08 | October 2023 |
| Llama 2 65B | 63.4 | July 2023 |
| Mistral 7B | 60.1 | September 2023 |
Some key details of the MMLU benchmark include:
- Training and Validation Sets: The dataset contains 15,908 questions, split into a few-shot development set, a validation set, and a test set. The few-shot development set has 5 questions per subject, the validation set may be used for selecting hyperparameters and consists of 1,540 questions, and the test set has 14,079 questions. (A sketch of loading and inspecting these splits follows this list.)
- Model Performance: Initial results from MMLU revealed that smaller LLMs tended to perform around chance (25% accuracy), while the larger GPT-3 (175 billion parameters) fared better, with 43.9% few-shot accuracy and 37.7% zero-shot accuracy. In 2023, GPT-4 reached 86.4% 5-shot accuracy and Google Gemini reached 83.7% 5-shot accuracy. However, even the best models still need substantial improvement before they reach human expert-level accuracy (89.8%).
- Challenging Subjects: Models, especially large language models (LLMs), struggle with calculation-intensive tasks (e.g., physics and math) and human-value-laden subjects (e.g., morality and law).
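To make those split sizes concrete, here is a minimal sketch of loading and inspecting the dataset, assuming the `cais/mmlu` copy hosted on the Hugging Face Hub and the `datasets` library; field names and exact row counts may differ in other mirrors.

```python
# Minimal sketch: load MMLU and inspect its splits and question format.
# Assumes the "cais/mmlu" dataset on the Hugging Face Hub; other mirrors
# may use different config or column names.
from datasets import load_dataset

# "all" bundles the 57 subjects; a single subject such as
# "high_school_physics" can be loaded by passing its name instead.
mmlu = load_dataset("cais/mmlu", "all")

# Split sizes should roughly match the dev/validation/test counts above.
print({split: len(rows) for split, rows in mmlu.items()})

example = mmlu["test"][0]
print(example["subject"])   # e.g. "abstract_algebra"
print(example["question"])  # the question stem
print(example["choices"])   # list of four answer options
print(example["answer"])    # index (0-3) of the correct option
```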
What are the key features of the MMLU benchmark?
The MMLU benchmark measures a language model's performance across a wide range of tasks, covering subjects in STEM, humanities, social sciences, and more. Some key features of the MMLU benchmark include:
- 57 Subjects: The benchmark covers 57 subjects across various fields, ranging from elementary mathematics to advanced professional topics in areas like law and ethics.
- Granularity and Breadth: MMLU tests both world knowledge and problem-solving ability, making it well suited to pinpointing where a model's understanding of a subject falls short.
- Multitask Accuracy: The test measures a model's multitask accuracy by covering a diverse set of tasks, ensuring a comprehensive evaluation of the model's academic and professional knowledge.
- No Large Training Sets Required: Unlike some other benchmarks, MMLU does not require large task-specific training sets. Instead, it assumes that models have acquired the necessary knowledge from reading vast quantities of diverse text, a process typically called pretraining, and evaluates them with at most a handful of examples per subject (see the prompt-construction sketch below).
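To show how evaluation works with only a handful of examples per subject, the sketch below assembles a 5-shot prompt from a subject's five development questions. It is loosely modeled on the prompt format popularized by the original MMLU evaluation code; exact wording and layout vary across implementations (a point revisited in the limitations section below).

```python
# Sketch of 5-shot prompt construction for one MMLU test question.
# The `dev_examples` and `test_example` dicts are assumed to carry
# "question", "choices", and "answer" fields, as in the loading sketch above.
LETTERS = ["A", "B", "C", "D"]

def format_question(question: str, choices: list[str], answer: int | None = None) -> str:
    """Render one question; include the answer letter only for the few-shot examples."""
    lines = [question]
    lines += [f"{letter}. {choice}" for letter, choice in zip(LETTERS, choices)]
    lines.append(f"Answer: {LETTERS[answer]}" if answer is not None else "Answer:")
    return "\n".join(lines)

def build_5_shot_prompt(subject: str, dev_examples: list[dict], test_example: dict) -> str:
    header = (
        "The following are multiple choice questions (with answers) "
        f"about {subject.replace('_', ' ')}.\n\n"
    )
    shots = "\n\n".join(
        format_question(ex["question"], ex["choices"], ex["answer"])
        for ex in dev_examples[:5]  # the subject's 5 development questions
    )
    query = format_question(test_example["question"], test_example["choices"])
    return header + shots + "\n\n" + query
```

The model's completion after the final "Answer:" is then compared against the correct letter, so no gradient updates or task-specific training data are needed.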
These key features make the MMLU benchmark a valuable tool for evaluating language models and their ability to apply knowledge and reasoning across a wide variety of contexts.
How does the MMLU work?
The MMLU benchmark works by evaluating the performance of a language model across a wide range of tasks. Each task is a set of multiple-choice questions, and the model's chosen answers are scored against the correct options, testing its ability to apply knowledge in contexts ranging from STEM and the humanities to the social sciences and professional fields such as law and medicine.
The final MMLU score is the average of the scores obtained in each task, providing a comprehensive measure of the model's overall performance.
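A minimal sketch of that scoring step, assuming hypothetical `predictions` and `labels` dictionaries keyed by subject and holding predicted and gold answer indices:

```python
# Sketch: per-subject accuracy, then an unweighted average across subjects.
def mmlu_score(predictions: dict[str, list[int]], labels: dict[str, list[int]]) -> float:
    per_subject = {}
    for subject, gold in labels.items():
        pred = predictions[subject]
        per_subject[subject] = sum(p == g for p, g in zip(pred, gold)) / len(gold)
    # Final MMLU score: the mean of the per-task accuracies.
    return sum(per_subject.values()) / len(per_subject)

# Example with two toy subjects: one task at 100%, one at 50%, final score 0.75.
print(mmlu_score(
    predictions={"astronomy": [1, 2], "virology": [0, 3]},
    labels={"astronomy": [1, 2], "virology": [0, 1]},
))
```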
What are its benefits?
There are many benefits to the MMLU benchmark, but three of the most important are:
- It provides a quantitative way to compare the performance of different language models.
- It can be computed efficiently and is easy to understand.
- It considers the model's ability to apply knowledge across a wide range of subjects and contexts, capturing a breadth that single-task benchmarks miss.
What are the limitations of MMLU?
MMLU (Massive Multitask Language Understanding) is a benchmark designed to measure knowledge acquired during pretraining by evaluating models in zero-shot and few-shot settings across 57 diverse subjects. However, the benchmark has several issues that undermine its reliability:
- Crucial context missing from questions: Some questions in the MMLU benchmark lack necessary context, which makes them difficult or impossible to answer correctly. These omissions are likely due to copy-paste errors.
- Ambiguous sets of answers: The benchmark contains questions with ambiguous answer sets, which can lead to confusion and incorrect evaluation of model performance.
- Wrong sets of answers: Some questions in the MMLU benchmark have incorrect answer keys, which can lead to misleading evaluation results.
- Sensitivity to prompting: MMLU is extremely sensitive to the exact prompting used, which can lead to significant differences in measured performance depending on the prompt format. This makes it difficult to compare results across different implementations and papers (see the sketch after this list).
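As an illustration of the prompting issue, the sketch below shows two plausible templates for the same invented multiple-choice item; formatting differences of roughly this size have been reported to shift measured MMLU scores.

```python
# Illustrative only: the question is invented, not taken from MMLU.
question = "Which gas makes up most of Earth's atmosphere?"
choices = ["Oxygen", "Nitrogen", "Carbon dioxide", "Argon"]

# Template A: "A." option labels and a bare "Answer:" cue.
template_a = (
    f"{question}\n"
    + "\n".join(f"{letter}. {choice}" for letter, choice in zip("ABCD", choices))
    + "\nAnswer:"
)

# Template B: "(A)" option labels and a sentence-style cue.
template_b = (
    f"Question: {question}\n"
    + "\n".join(f"({letter}) {choice}" for letter, choice in zip("ABCD", choices))
    + "\nThe correct answer is"
)

print(template_a, template_b, sep="\n\n---\n\n")
```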
An example of these issues can be found in the video "SmartGPT: Major Benchmark Broken - 89% on MMLU + Exam's Many Errors", in which researchers expose severe shortcomings in the test questions used in the MMLU benchmark. These issues highlight the need for a dedicated benchmarking organization that can research and create accurate, robust, and sensible benchmarking suites for evaluating state-of-the-art models.