MMLU Benchmark (Massive Multi-task Language Understanding)

by Stephen M. Walker II, Co-Founder / CEO

What is the MMLU Benchmark (Massive Multi-task Language Understanding)?

The MMLU Benchmark (Massive Multi-task Language Understanding) is a challenging test designed to measure a text model's multitask accuracy by evaluating it in zero-shot and few-shot settings. The MMLU serves as a standardized way to assess AI performance on tasks that range from elementary mathematics to complex legal reasoning.

While the MMLU is a widely used benchmark, it's important to acknowledge its limitations:

  • Inaccurate question-answer pairs
  • Incorrect answers in some instances
  • Ambiguously or poorly phrased questions

These issues can potentially impact the reliability and validity of MMLU evaluations for LLMs.

The MMLU Benchmark is a diverse set of tests designed to evaluate the understanding and problem-solving abilities of language models across multiple domains. The MMLU contains 57 tasks across topics including elementary mathematics, US history, computer science, and law. It requires models to demonstrate a broad knowledge base and problem-solving skills.

The MMLU provides a way to test and compare various language models, such as OpenAI GPT-4o, Mistral 7B, Google Gemini, and Anthropic Claude 3.

AI teams can use the MMLU for comprehensive evaluations when building or fine-tuning custom models that significantly modify a foundation model.

MMLU 5-Shot Leaderboard

| Organization | Model | MMLU Score | Release Date |
|---|---|---|---|
| OpenAI | GPT-4o | 88.7 | May 2024 |
| Anthropic | Claude 3.5 Sonnet | 88.7 | June 2024 |
| Anthropic | Claude 3 Opus | 86.8 | March 2024 |
| OpenAI | GPT-4 Turbo | 86.4 | April 2024 |
| OpenAI | GPT-4 | 86.4 | April 2023 |
| Meta | Llama 3 400B | 86.1 | Unreleased |
| Google | Gemini 1.5 Pro | 85.9 | May 2024 |
| Google | Gemini Ultra | 83.7 | December 2023 |
| Google | Gemini 1.5 Pro | 81.9 | February 2024 |
| Mistral AI | Mistral Large | 81.2 | February 2024 |
| Google | Gemini 1.5 Flash | 78.9 | May 2024 |
| Anthropic | Claude 2 | 78.5 | November 2023 |
| Google | PaLM 2 | 78.3 | May 2023 |
| Google | PaLM | 75.2 | April 2022 |
| Google | Gemini Pro | 71.8 | December 2023 |
| Mistral AI | Mixtral 8x7B | 71.3 | December 2023 |
| OpenAI | GPT-3.5 | 70.0 | November 2022 |
| Meta | Llama 2 70B | 68.9 | July 2023 |
| Hugging Face | Zephyr 7B | 66.1 | October 2023 |
| Mistral AI | Mistral 7B | 60.1 | September 2023 |

Example MMLU Question and Response

We'll compare responses to a sample MMLU question from three language models: OpenAI's GPT-4 Turbo, Anthropic's Claude 2, and Zephyr (a Mistral 7B fine-tune). This comparison highlights their respective capabilities.

To avoid contaminating future training data, we do not disclose the correct answer.

Example question on High School European History:

This question refers to the following information. Albeit the king's Majesty justly and rightfully is and ought to be the supreme head of the Church of England, and so is recognized by the clergy of this realm in their convocations, yet nevertheless, for corroboration and confirmation thereof, and for increase of virtue in Christ's religion within this realm of England, and to repress and extirpate all errors, heresies, and other enormities and abuses heretofore used in the same, be it enacted, by authority of this present Parliament, that the king, our sovereign lord, his heirs and successors, kings of this realm, shall be taken, accepted, and reputed the only supreme head in earth of the Church of England, called Anglicana Ecclesia; and shall have and enjoy, annexed and united to the imperial crown of this realm, as well the title and style thereof, as all honors, dignities, preeminences, jurisdictions, privileges, authorities, immunities, profits, and commodities to the said dignity of the supreme head of the same Church belonging and appertaining; and that our said sovereign lord, his heirs and successors, kings of this realm, shall have full power and authority from time to time to visit, repress, redress, record, order, correct, restrain, and amend all such errors, heresies, abuses, offenses, contempts, and enormities, whatsoever they be, which by any manner of spiritual authority or jurisdiction ought or may lawfully be reformed, repressed, ordered, redressed, corrected, restrained, or amended, most to the pleasure of Almighty God, the increase of virtue in Christ's religion, and for the conservation of the peace, unity, and tranquility of this realm; any usage, foreign land, foreign authority, prescription, or any other thing or things to the contrary hereof notwithstanding. English Parliament, Act of Supremacy, 1534

From the passage, one may infer that the English Parliament wished to argue that the Act of Supremacy would:

(A) give the English king a new position of authority
(B) give the position of head of the Church of England to Henry VIII
(C) establish Calvinism as the one true theology in England
(D) end various forms of corruption plaguing the Church in England

OpenAI GPT-4 Turbo's Response:

From the passage, one may infer that the English Parliament wished to argue that the Act of Supremacy would:

(D) end various forms of corruption plaguing the Church in England

The passage emphasizes the intent to "repress and extirpate all errors, heresies, and other enormities and abuses heretofore used in the same," and to have the authority to "visit, repress, redress, record, order, correct, restrain, and amend all such errors, heresies, abuses, offenses, contempts, and enormities." This indicates a clear aim to reform the Church and address issues of corruption or malpractice.

Anthropic Claude 2's Response:

Based on the passage, the best inference about what the English Parliament wished to argue is that the Act of Supremacy would:

(D) end various forms of corruption plaguing the Church in England

The passage states that one purpose of the Act is "to repress and extirpate all errors, heresies, and other enormities and abuses" in the Church. It also gives the king power to "repress, redress, record, order, correct, restrain, and amend all such errors, heresies, abuses, offenses, contempts, and enormities." This suggests the Act was intended to address corruption in the Church.

Options A and B are not supported, as the passage suggests the king was already recognized as the supreme head of the Church. Option C about establishing Calvinism is not mentioned. So option D best reflects what Parliament wished to argue about the Act's purpose.

Zephyr's Response:

The English Parliament intended to argue that the Act of Supremacy would:

(D) end various forms of corruption plaguing the Church in England

The passage states that the Act is enacted "for corroboration and confirmation thereof, and for increase of virtue in Christ's religion within this realm of England, and to repress and extirpate all errors, heresies, and other enormities and abuses heretofore used in the same." This suggests that the Parliament believed that the Act would help address issues of corruption, error, heresy, and abuse within the Church in England. The Act specifically mentions "all such errors, heresies, abuses, offenses, contempts, and enormities" and empowers the king to take action against them. Therefore, the Parliament's intention appears to be that the Act would help cleanse the Church of England by addressing issues of corruption and bringing it into greater conformity with Christian virtue.

It is not clear from the passage whether the Parliament intended for the Act to create a new position of authority for the king (A), exclude his heirs from holding the position of head of the Church (B), or establish Calvinism as the one true theology in England (C). Instead, it seems that the Parliament's intent was focused on addressing issues within the English Church and granting further power to the king as the supreme head of the Church.

The performance of the models can be analyzed based on several criteria, such as coherence, relevance, detail, and clarity:

  • Coherence — All three models provide coherent responses that logically follow from the passage cited. They maintain a consistent argument throughout their responses.

  • Relevance — Each model's response is relevant to the question asked, focusing on the intent behind the Act of Supremacy as it relates to corruption in the Church of England.

  • Detail — The responses differ slightly in the amount of detail provided. OpenAI GPT-4 Turbo and Anthropic Claude 2 offer a detailed explanation of the passage, citing specific phrases and explaining their significance. Zephyr's response is also detailed but includes a broader interpretation of the Act's intent, mentioning "increase of virtue in Christ's religion" as an additional goal.

  • Clarity — All models present their arguments in a clear and understandable manner. However, OpenAI GPT-4 Turbo and Anthropic Claude 2 structure their responses with a slightly clearer focus on the direct implications of the Act, while Zephyr provides a more generalized conclusion.

Overall, the performance of each model is quite similar, with only minor differences in the level of detail and the breadth of the context provided. Each model effectively communicates the intended inference from the passage, demonstrating a high level of language understanding and response generation.
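In an automated harness, responses like the ones above are scored by mapping the model's chosen letter back to the stored answer index for the item. A minimal sketch of that bookkeeping, using an invented question (real MMLU answers are deliberately not reproduced here):

```python
# One MMLU-style item: a question, four choices, and the index of the
# correct choice. The item below is invented purely for illustration.
LETTERS = "ABCD"

item = {
    "question": "What is 7 * 8?",
    "choices": ["54", "56", "58", "64"],
    "answer": 1,  # index into choices, i.e. option B
}

def is_correct(item: dict, predicted_letter: str) -> bool:
    """Map a model's letter choice back to a choice index and compare."""
    return LETTERS.index(predicted_letter.strip().upper()) == item["answer"]
```

In practice the hard part is extracting a single clean letter from a free-form completion; the comparison itself is a one-liner.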

The goals and structure of the MMLU

The MMLU acts as a standard, however imperfect, for assessing the relative generalization capabilities of foundation models, helping researchers and developers make informed decisions when selecting models for specific applications.

From the Measuring Massive Multitask Language Understanding paper:

We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability. We find that while most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average. However, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy. Models also have lopsided performance and frequently do not know when they are wrong. Worse, they still have near-random accuracy on some socially important subjects such as morality and law. By comprehensively evaluating the breadth and depth of a model's academic and professional understanding, our test can be used to analyze models across many tasks and to identify important shortcomings.

Some key details of the MMLU benchmark include:

  • Training and Validation Sets — The dataset contains 15,908 questions, split into a few-shot development set, a validation set, and a test set. The few-shot development set has 5 questions per subject, the validation set may be used for selecting hyperparameters and consists of 1,540 questions, and the test set has 14,079 questions.

  • Model Performance — Initial results from MMLU revealed that smaller LLMs tended to perform around chance (25% accuracy), while the larger GPT-3 (175 billion parameters) fared better with 43.9% few-shot accuracy and 37.7% zero-shot accuracy. In 2023, GPT-4 reached 86.4% 5-shot accuracy and Google Gemini Ultra reached 83.7%. However, even the best models still need substantial improvements before they can reach human expert-level accuracy (89.8%).

  • Challenging Subjects — Models, especially large language models (LLMs), struggle with calculation-intensive tasks (e.g., physics and math) and human-value-laden subjects (e.g., morality and law).

What are the key features of the MMLU benchmark?

The MMLU benchmark measures a language model's performance across a wide range of tasks, covering subjects in STEM, humanities, social sciences, and more. Some key features of the MMLU benchmark include:

  • 57 Subjects — The benchmark covers 57 subjects across various fields, ranging from elementary mathematics to advanced professional levels in areas like law and ethics.

  • Granularity and Breadth — MMLU tests both world knowledge and problem-solving ability, making it ideal for identifying a model's understanding of various subjects.

  • Multitask Accuracy — The test measures a model's multitask accuracy by covering a diverse set of tasks, ensuring a comprehensive evaluation of the model's academic and professional knowledge.

  • No Large Training Sets Required — Unlike some other benchmarks, MMLU does not require large training sets. Instead, it assumes that models have acquired the necessary knowledge from reading vast quantities of diverse text, a process typically called pretraining.

These key features make the MMLU benchmark a valuable tool for evaluating the performance of language models and their ability to understand and generate language in various contexts.

How does the MMLU work?

The MMLU (Massive Multitask Language Understanding) benchmark evaluates language models across a diverse set of 57 subjects. Here's how it operates:

  • Assesses models in zero-shot and few-shot settings
  • Covers a wide range of academic and professional domains
  • Tests both world knowledge and problem-solving abilities
  • Utilizes multiple-choice questions for each subject

The benchmark calculates a model's performance by:

  • Evaluating responses to questions in each subject area
  • Aggregating scores across all subjects
  • Computing an average score as the final MMLU metric

This comprehensive approach provides a nuanced measure of a model's language understanding and generalization capabilities across various domains.
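The aggregation step above can be sketched as a macro average of per-subject accuracies, as the description suggests (note that some harnesses instead micro-average over all questions, so reported scores can differ slightly):

```python
def mmlu_score(per_subject: dict[str, tuple[int, int]]) -> float:
    """Average per-subject accuracy into one MMLU score.

    per_subject maps a subject name to (correct, total) counts; accuracy is
    computed per subject, then averaged across subjects (a macro average).
    """
    accuracies = [correct / total for correct, total in per_subject.values()]
    return sum(accuracies) / len(accuracies)
```

With a macro average, every subject counts equally regardless of how many test questions it contains.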

What are its benefits?

The MMLU benchmark offers several key advantages for evaluating language models:

  • Quantitative Comparison — It provides a robust, numerical framework for assessing and contrasting the performance of various language models.

  • Computational Efficiency — The benchmark can be calculated quickly and yields results that are straightforward to interpret.

  • Contextual Language Understanding — By evaluating a model's capacity to comprehend and generate language across diverse scenarios, MMLU captures crucial aspects of linguistic structure and versatility.

These benefits make MMLU a valuable tool for researchers and developers in the field of natural language processing and artificial intelligence.

What are the limitations of MMLU?

MMLU (Massive Multitask Language Understanding) is a benchmark designed to measure knowledge acquired during pretraining by evaluating models in zero-shot and few-shot settings across 57 diverse subjects. However, several critical issues undermine its effectiveness as a reliable benchmark:

  • Missing Contextual Information — Some questions lack crucial context, rendering them difficult or impossible to answer correctly. These omissions are likely due to copy-paste errors.

  • Answer Set Ambiguity — The benchmark includes questions with ambiguous answer sets, leading to potential confusion and inaccurate model performance evaluation.

  • Incorrect Answer Sets — Certain questions contain outright incorrect answer sets, resulting in misleading evaluation results.

  • Prompt Sensitivity — MMLU exhibits extreme sensitivity to the specific prompts used, causing significant performance variations based on prompt formulation. This sensitivity complicates result comparisons across different implementations and research papers.

These issues were highlighted in the video "SmartGPT: Major Benchmark Broken - 89% on MMLU + Exam's Many Errors", where researchers exposed severe shortcomings in the MMLU benchmark's testing questions. The findings underscore the need for a dedicated benchmarking organization capable of developing accurate, robust, and meaningful evaluation suites for state-of-the-art language models.

More terms

What is synthetic intelligence?

Synthetic Intelligence (SI) is an alternative term for Artificial Intelligence (AI), emphasizing that the intelligence of machines can be a genuine form of intelligence, not just a simulation. The term "synthetic" refers to something produced by synthesis, combining parts to form a whole, often a human-made version of something that has arisen naturally.

Read more

What is Backpropagation through time (BPTT)?

Backpropagation through time (BPTT) is a method for training recurrent neural networks (RNNs), which are designed to process sequences of data by maintaining a 'memory' of previous inputs through internal states. BPTT extends the concept of backpropagation used in feedforward networks to RNNs by taking into account the temporal sequence of data.

Read more
