What is the MMLU Benchmark (Massive Multitask Language Understanding)?
The MMLU Benchmark, also known as the Massive Multitask Language Understanding benchmark, is a comprehensive and challenging test designed to measure a text model's multitask accuracy by evaluating it in zero-shot and few-shot settings.
MMLU spans 57 tasks, including elementary mathematics, US history, computer science, and law, and requires models to demonstrate both a broad knowledge base and problem-solving skills. It provides a way to test and compare language models such as OpenAI's GPT-4, Google's Gemini, and Anthropic's Claude 2. It thus acts as a standard for assessing the generalization capabilities of these models, helping researchers and developers make informed decisions when selecting models for specific applications.
MMLU 5-Shot Leaderboard
| Model | MMLU Score (%) | Release Date |
| --- | --- | --- |
| GPT-4 | 86.4 | March 2023 |
| Gemini Ultra | 83.7 | December 2023 |
| PaLM 2 | 78.3 | May 2023 |
| PaLM | 75.2 | April 2022 |
| Gemini Pro | 71.8 | December 2023 |
| Mixtral 8x7B | 71.3 | December 2023 |
| GPT-3.5 | 70.0 | November 2022 |
| Zephyr 7B | 66.08 | October 2023 |
| Llama 2 65B | 63.4 | July 2023 |
| Mistral 7B | 60.1 | September 2023 |
Some key details of the MMLU benchmark include:
- Training and Validation Sets: The dataset contains 15,908 questions, split into a few-shot development set, a validation set, and a test set. The few-shot development set has 5 questions per subject, the validation set may be used for selecting hyperparameters and consists of 1,540 questions, and the test set has 14,079 questions. (A sketch of loading and inspecting these splits follows this list.)
- Model Performance: Initial results from MMLU revealed that smaller LLMs tended to perform around chance (25% accuracy), while the larger GPT-3 (175 billion parameters) fared better, with 43.9% few-shot accuracy and 37.7% zero-shot accuracy. In 2023, GPT-4 reached 86.4% 5-shot accuracy and Google Gemini reached 83.7% 5-shot accuracy. However, even the best models still need substantial improvement before they reach human expert-level accuracy (89.8%).
- Challenging Subjects: Models, especially large language models (LLMs), struggle with calculation-intensive tasks (e.g., physics and math) and human-value-laden subjects (e.g., morality and law).
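To make those split sizes concrete, here is a minimal sketch of loading and inspecting the dataset, assuming the `cais/mmlu` copy hosted on the Hugging Face Hub and the `datasets` library; field names and exact row counts may differ in other mirrors.

```python
# Minimal sketch: load MMLU and inspect its splits and question format.
# Assumes the "cais/mmlu" dataset on the Hugging Face Hub; other mirrors
# may use different config or column names.
from datasets import load_dataset

# "all" bundles the 57 subjects; a single subject such as
# "high_school_physics" can be loaded by passing its name instead.
mmlu = load_dataset("cais/mmlu", "all")

# Split sizes should roughly match the dev/validation/test counts above.
print({split: len(rows) for split, rows in mmlu.items()})

example = mmlu["test"][0]
print(example["subject"])   # e.g. "abstract_algebra"
print(example["question"])  # the question stem
print(example["choices"])   # list of four answer options
print(example["answer"])    # index (0-3) of the correct option
```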
What are the key features of the MMLU benchmark?
The MMLU benchmark measures a language model's performance across a wide range of tasks, covering subjects in STEM, humanities, social sciences, and more. Some key features of the MMLU benchmark include:
- 57 Subjects: The benchmark covers 57 subjects across various fields, ranging from elementary mathematics to advanced professional topics in areas like law and ethics.
- Granularity and Breadth: MMLU tests both world knowledge and problem-solving ability, making it well suited to pinpointing where a model's understanding of a subject falls short.
- Multitask Accuracy: The test measures a model's multitask accuracy by covering a diverse set of tasks, ensuring a comprehensive evaluation of the model's academic and professional knowledge.
- No Large Training Sets Required: Unlike some other benchmarks, MMLU does not require large task-specific training sets. Instead, it assumes that models have acquired the necessary knowledge from reading vast quantities of diverse text, a process typically called pretraining, and evaluates them with at most a handful of examples per subject (see the prompt-construction sketch below).
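To show how evaluation works with only a handful of examples per subject, the sketch below assembles a 5-shot prompt from a subject's five development questions. It is loosely modeled on the prompt format popularized by the original MMLU evaluation code; exact wording and layout vary across implementations (a point revisited in the limitations section below).

```python
# Sketch of 5-shot prompt construction for one MMLU test question.
# The `dev_examples` and `test_example` dicts are assumed to carry
# "question", "choices", and "answer" fields, as in the loading sketch above.
LETTERS = ["A", "B", "C", "D"]

def format_question(question: str, choices: list[str], answer: int | None = None) -> str:
    """Render one question; include the answer letter only for the few-shot examples."""
    lines = [question]
    lines += [f"{letter}. {choice}" for letter, choice in zip(LETTERS, choices)]
    lines.append(f"Answer: {LETTERS[answer]}" if answer is not None else "Answer:")
    return "\n".join(lines)

def build_5_shot_prompt(subject: str, dev_examples: list[dict], test_example: dict) -> str:
    header = (
        "The following are multiple choice questions (with answers) "
        f"about {subject.replace('_', ' ')}.\n\n"
    )
    shots = "\n\n".join(
        format_question(ex["question"], ex["choices"], ex["answer"])
        for ex in dev_examples[:5]  # the subject's 5 development questions
    )
    query = format_question(test_example["question"], test_example["choices"])
    return header + shots + "\n\n" + query
```

The model's completion after the final "Answer:" is then compared against the correct letter, so no gradient updates or task-specific training data are needed.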
These key features make the MMLU benchmark a valuable tool for evaluating language models and their ability to apply knowledge and reasoning across a wide variety of contexts.
How does the MMLU work?
The MMLU benchmark works by evaluating the performance of a language model across a wide range of tasks. Each task is a set of multiple-choice questions, and the model's chosen answers are scored against the correct options, testing its ability to apply knowledge in contexts ranging from STEM and the humanities to the social sciences and professional fields such as law and medicine.
The final MMLU score is the average of the scores obtained in each task, providing a comprehensive measure of the model's overall performance.
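A minimal sketch of that scoring step, assuming hypothetical `predictions` and `labels` dictionaries keyed by subject and holding predicted and gold answer indices:

```python
# Sketch: per-subject accuracy, then an unweighted average across subjects.
def mmlu_score(predictions: dict[str, list[int]], labels: dict[str, list[int]]) -> float:
    per_subject = {}
    for subject, gold in labels.items():
        pred = predictions[subject]
        per_subject[subject] = sum(p == g for p, g in zip(pred, gold)) / len(gold)
    # Final MMLU score: the mean of the per-task accuracies.
    return sum(per_subject.values()) / len(per_subject)

# Example with two toy subjects: one task at 100%, one at 50%, final score 0.75.
print(mmlu_score(
    predictions={"astronomy": [1, 2], "virology": [0, 3]},
    labels={"astronomy": [1, 2], "virology": [0, 1]},
))
```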
What are its benefits?
There are many benefits to the MMLU benchmark, but three of the most important are:
- It provides a quantitative way to compare the performance of different language models.
- It can be computed efficiently and is easy to understand.
- It considers the model's ability to apply knowledge across a wide range of subjects and contexts, capturing a breadth that single-task benchmarks miss.
What are the limitations of MMLU?
MMLU (Massive Multitask Language Understanding) is a benchmark designed to measure knowledge acquired during pretraining by evaluating models in zero-shot and few-shot settings across 57 diverse subjects. However, the benchmark has several issues that undermine its reliability:
- Crucial context missing from questions: Some questions in the MMLU benchmark lack necessary context, which makes them difficult or impossible to answer correctly. These omissions are likely due to copy-paste errors.
- Ambiguous sets of answers: The benchmark contains questions with ambiguous answer sets, which can lead to confusion and incorrect evaluation of model performance.
- Wrong sets of answers: Some questions in the MMLU benchmark have incorrect answer keys, which can lead to misleading evaluation results.
- Sensitivity to prompting: MMLU is extremely sensitive to the exact prompting used, which can lead to significant differences in measured performance depending on the prompt format. This makes it difficult to compare results across different implementations and papers (see the sketch after this list).
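As an illustration of the prompting issue, the sketch below shows two plausible templates for the same invented multiple-choice item; formatting differences of roughly this size have been reported to shift measured MMLU scores.

```python
# Illustrative only: the question is invented, not taken from MMLU.
question = "Which gas makes up most of Earth's atmosphere?"
choices = ["Oxygen", "Nitrogen", "Carbon dioxide", "Argon"]

# Template A: "A." option labels and a bare "Answer:" cue.
template_a = (
    f"{question}\n"
    + "\n".join(f"{letter}. {choice}" for letter, choice in zip("ABCD", choices))
    + "\nAnswer:"
)

# Template B: "(A)" option labels and a sentence-style cue.
template_b = (
    f"Question: {question}\n"
    + "\n".join(f"({letter}) {choice}" for letter, choice in zip("ABCD", choices))
    + "\nThe correct answer is"
)

print(template_a, template_b, sep="\n\n---\n\n")
```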
An example of these issues can be found in the video "SmartGPT: Major Benchmark Broken - 89% on MMLU + Exam's Many Errors", in which researchers expose severe shortcomings in the test questions used in the MMLU benchmark. These issues highlight the need for a dedicated benchmarking organization that can research and create accurate, robust, and sensible benchmarking suites for evaluating state-of-the-art models.