
MMMU: Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark

by Stephen M. Walker II, Co-Founder / CEO

What is the MMMU (Massive Multi-discipline Multimodal Understanding and Reasoning) Benchmark?

The MMMU benchmark, which stands for Massive Multi-discipline Multimodal Understanding and Reasoning, is a new benchmark designed to evaluate the capabilities of multimodal models on tasks that require college-level subject knowledge and expert-level reasoning across multiple disciplines.

MMMU Evaluation

It covers six core disciplines (Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering), spanning 30 subjects and 183 subfields. The questions draw on a wide variety of image formats such as diagrams, tables, charts, chemical structures, photographs, paintings, geometric shapes, and musical scores, among others.
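To make this structure concrete, each MMMU item pairs one or more images with a question, usually multiple-choice. The sketch below shows what a single record might look like in Python; the field names are illustrative assumptions, not the dataset's official schema.

```python
# Hypothetical layout of a single MMMU item (field names are illustrative,
# not the official schema). Each question references its images through
# placeholders such as <image 1> and is typically multiple-choice.
example_item = {
    "subject": "Chemistry",            # one of 30 subjects across the 6 disciplines
    "discipline": "Science",
    "question": "Based on the structure shown in <image 1>, how many "
                "stereocenters does the molecule contain?",
    "images": ["image_1.png"],         # diagrams, charts, photos, musical scores, ...
    "options": ["1", "2", "3", "4"],
    "answer": "B",                     # gold answer for dev/validation; hidden for the test set
}
```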


Unlike traditional benchmarks that focus on everyday knowledge, MMMU assesses advanced AI capabilities through expert-level questions.

These questions are meticulously sourced from university textbooks and lecture materials by a dedicated team of 50 students, including co-authors. The benchmark's structure includes a few-shot development set, a validation set, and a comprehensive test set with 10.5K questions.

The MMMU benchmark is considered a valuable tool for evaluating the capabilities of large multimodal models and is seen as a step toward measuring progress on Artificial General Intelligence (AGI). It has been used to evaluate both closed- and open-source models under a zero-shot setting; GPT-4V, for example, achieved 55.7% accuracy. The benchmark is relatively new, having been introduced in November 2023, and is already influencing the field of AI.

Current Leaderboard

As of March 4, 2024, the leaderboard is led by Anthropic's Claude 3 Opus and Google's Gemini Ultra, tied for first overall, followed by OpenAI's GPT-4 Vision. Gemini Ultra leads on several dimensions but is not yet generally available, making Claude 3 Opus and GPT-4 Vision the leading available models.

In real-world tests, GPT-4 Vision is stronger at describing screenshots, analyzing tabular data in images, and spotting specific details, while Gemini Ultra is markedly better than Gemini Pro at articulating nuanced differences between images.

Looking at the per-subject breakdown, GPT-4 Vision scored 55.7% overall in the original MMMU evaluation, excelling in Humanities & Social Sciences (76.3%), Art & Design (65.3%), Business (64.3%), and Health & Medicine (63.5%), while lagging in Science (48.4%) and Tech & Engineering (41.7%). Qwen-VL-MAX* leads the remaining field with an overall score of 46.8%. The benchmark evaluates multimodal models across six disciplines using 11.5K questions drawn from college-level materials, aiming to push the limits of expert-level reasoning in AI.

| Model | Overall | Art & Design | Business | Science | Health & Medicine | Human. & Social Sci. | Tech & Eng. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Claude 3 Opus | 59.4 | - | - | - | - | - | - |
| Gemini 1.0 Ultra | 59.4 | 70.0 | 56.7 | 48.0 | 67.3 | 78.3 | 47.1 |
| GPT-4 Vision | 56.8 | 65.3 | 64.3 | 54.7 | 63.5 | 76.3 | 41.7 |
| Claude 3 Sonnet | 53.1 | - | - | - | - | - | - |
| Claude 3 Haiku | 50.2 | - | - | - | - | - | - |
| Gemini 1.0 Pro | 47.9 | - | - | - | - | - | - |

The leaderboard also tracks a range of open source models:

| Model | Overall | Art & Design | Business | Science | Health & Medicine | Human. & Social Sci. | Tech & Eng. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen-VL-MAX* | 46.8 | 64.2 | 39.8 | 36.3 | 52.5 | 70.4 | 40.7 |
| Yi-VL-34B* | 41.6 | 56.1 | 33.3 | 32.9 | 45.9 | 66.5 | 36.0 |
| Qwen-VL-PLUS* | 40.8 | 59.9 | 34.5 | 32.8 | 43.7 | 65.5 | 32.9 |
| Marco-VL* | 40.4 | 56.5 | 31.0 | 31.0 | 46.9 | 66.5 | 33.8 |
| InternLM-XComposer2-VL* | 38.2 | 56.8 | 32.8 | 30.1 | 39.8 | 60.7 | 31.8 |
| Yi-VL-6B* | 37.8 | 53.4 | 30.3 | 30.0 | 39.3 | 58.5 | 34.1 |
| InfiMM-Zephyr-7B* | 35.5 | 50.0 | 29.6 | 28.2 | 37.5 | 54.6 | 31.1 |
| InternVL-Chat-V1.1* | 35.3 | 53.7 | 31.7 | 28.2 | 36.5 | 56.4 | 28.0 |
| SVIT* | 34.1 | 48.9 | 28.0 | 26.8 | 35.5 | 50.9 | 30.7 |
| Emu2-Chat* | 34.1 | 50.6 | 27.7 | 28.0 | 32.4 | 50.3 | 31.3 |
| BLIP-2 FLAN-T5-XXL | 34.0 | 49.2 | 28.6 | 27.3 | 33.7 | 51.5 | 30.4 |
| InstructBLIP-T5-XXL | 33.8 | 48.5 | 30.6 | 27.6 | 33.6 | 49.8 | 29.4 |
| LLaVA-1.5-13B | 33.6 | 49.8 | 28.2 | 25.9 | 34.9 | 54.7 | 28.3 |

Google's Gemini Ultra set a new standard on the MMMU benchmark with 59.4% overall accuracy, outperforming models like GPT-4V. It posted strong results in Humanities & Social Science (78.3%), Art & Design (70.0%), and Health & Medicine (67.3%), while Science (48.0%) and Tech & Engineering (47.1%) remain its weakest areas. While MMMU is a rigorous test of multimodal understanding and reasoning, it is not without limitations. It may not encompass all aspects of Expert AGI, as it primarily focuses on college-level knowledge. Nevertheless, a high score on MMMU is considered indicative of a model's progress toward Expert AGI.

How does MMMU work?

The MMMU benchmark operates by evaluating large multimodal models (LMMs) on their ability to perceive, understand, and reason across a wide range of disciplines and subfields using various image types. It is designed to measure three essential skills in LMMs: perception, knowledge, and reasoning.

The benchmark includes 11.5K questions spanning 30 subjects and 183 subfields, with 30 highly heterogeneous image types such as diagrams, tables, charts, chemical structures, photographs, paintings, geometric shapes, and musical scores.
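A minimal loading sketch is shown below. It assumes the dataset is distributed on the Hugging Face Hub as MMMU/MMMU with one configuration per subject and dev/validation/test splits; adjust the identifiers if the actual release differs.

```python
# Sketch: load one MMMU subject and inspect its splits.
# Assumes a Hugging Face release named "MMMU/MMMU" with per-subject configs
# (e.g. "Accounting", "Art") and "dev"/"validation"/"test" splits.
from datasets import load_dataset

subject = "Accounting"  # one of the 30 subjects
ds = load_dataset("MMMU/MMMU", subject)

for split_name, split in ds.items():
    print(f"{subject} / {split_name}: {len(split)} questions")

# The dev split holds the 5 few-shot examples per subject; the validation
# split (~900 questions overall) is commonly used for reported scores, and
# the 10.5K-question test set keeps its answers hidden.
```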

Klu Comparison MMMU Evaluation

The MMMU benchmark assesses models in a zero-shot setting, meaning that the models are tested on their ability to generate answers without any prior specific training or fine-tuning on the benchmark's tasks. This approach is intended to gauge the models' innate capabilities.
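In practice, zero-shot evaluation amounts to formatting each question (with its images) as a single prompt, asking the model to pick an option letter, and parsing that letter from the reply. The sketch below outlines this flow against a generic multimodal chat client; `query_multimodal_model` is a hypothetical stand-in for whatever API you use, not part of MMMU itself.

```python
import re
import string

def build_zero_shot_prompt(question: str, options: list[str]) -> str:
    """Format an MMMU-style multiple-choice question as a zero-shot prompt."""
    letters = string.ascii_uppercase
    lines = [question, ""]
    lines += [f"({letters[i]}) {opt}" for i, opt in enumerate(options)]
    lines += ["", "Answer with the letter of the correct option only."]
    return "\n".join(lines)

def parse_choice(reply: str, num_options: int) -> str | None:
    """Pull the first standalone option letter out of the model's reply, if any."""
    valid = string.ascii_uppercase[:num_options]
    match = re.search(rf"\b([{valid}])\b", reply.strip().upper())
    return match.group(1) if match else None

# Usage (query_multimodal_model is a hypothetical client that accepts text + images):
# reply = query_multimodal_model(build_zero_shot_prompt(q["question"], q["options"]),
#                                images=q["images"])
# prediction = parse_choice(reply, len(q["options"]))
```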

The benchmark is divided into a few-shot development set, a validation set, and a test set, with the test set comprising 10.5K questions. The few-shot development set includes 5 questions per subject (150 questions across the 30 subjects), and the validation set contains approximately 900 questions.

MMMU Subject Distribution

Models are evaluated on their performance across the six core disciplines; in the original paper's result tables, the best-performing model in each category is highlighted in bold and the second best is underlined. The benchmark's comprehensiveness and focus on college-level subject knowledge and expert-level reasoning make it a valuable tool for advancing the development of artificial general intelligence (AGI).
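Scoring then reduces to comparing predictions with gold answers and aggregating accuracy per discipline and overall. Note that the overall number is a micro-average over all questions, so disciplines with more questions weigh more heavily than a simple average of the six discipline scores would suggest. A minimal aggregation sketch, with illustrative field names:

```python
from collections import defaultdict

def score(results):
    """results: iterable of dicts with 'discipline', 'prediction', and 'answer'
    keys (illustrative field names). Returns per-discipline and overall accuracy."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in results:
        total[r["discipline"]] += 1
        correct[r["discipline"]] += int(r["prediction"] == r["answer"])

    per_discipline = {d: correct[d] / total[d] for d in total}
    overall = sum(correct.values()) / sum(total.values())  # micro-average over questions
    return per_discipline, overall
```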

MMMU Evaluation Image Type Distribution

The MMMU benchmark is also used to identify specific areas where models need improvement. For example, an analysis of cases mispredicted by GPT-4 Vision showed that errors could be attributed to flaws in visual perception, lack of domain knowledge, or flaws in the reasoning process. This detailed feedback helps guide further research and development of multimodal models.
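If you annotate mispredicted validation items with an error category of your own, a simple tally is enough to see where a model is failing most often. The category names below follow the breakdown described above; the annotation data itself is hypothetical.

```python
from collections import Counter

# Hypothetical annotations: one error label per mispredicted item.
error_labels = [
    "perception", "perception", "knowledge", "reasoning",
    "perception", "reasoning", "knowledge",
]

counts = Counter(error_labels)
for category, n in counts.most_common():
    print(f"{category}: {n} errors ({n / len(error_labels):.0%})")
```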

When to use MMMU

The MMMU benchmark, designed for evaluating multimodal models, is pivotal for assessing large language models' (LLMs) understanding and reasoning across diverse disciplines. It focuses on college-level knowledge within six core disciplines—Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering—encompassing 30 subjects and 183 subfields. The benchmark's 30 image types, including charts, diagrams, and chemical structures, are sourced from academic textbooks and lectures.

MMMU stands out by testing expert-level perception and reasoning, requiring domain-specific knowledge. It challenges models to provide precise answers without prior fine-tuning or few-shot learning, aiming to simulate expert tasks. For example, Google's Gemini Ultra model set a new benchmark record with a 59.4% accuracy, surpassing GPT-4V's 56%. These results highlight the benchmark's role in identifying areas for AI improvement and guiding future research.

When considering the use of MMMU, it's important to recognize its role in pushing the boundaries of AI capabilities. It is particularly useful for gauging a model's proficiency in expert-level reasoning and multimodal understanding, which are critical for advancements towards Artificial General Intelligence (AGI).

Limitations of MMMU

The Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark (MMMU) is a dataset designed to evaluate the capabilities of large language models (LLMs) and multimodal models in understanding and reasoning across a wide range of disciplines and modalities. However, MMMU has several limitations:

  1. Limited Data Availability: MMMU relies on a limited dataset of questions and images, which could affect the generalizability of its results and raise the potential for overfitting. Expanding the dataset with more diverse and representative data is necessary for improving the benchmark.

  2. Lack of Explanation: The benchmark provides limited insights into the reasoning processes of LLMs, making it difficult to understand why LLMs make certain mistakes and how their performance can be improved.

  3. Comprehensiveness: While MMMU's comprehensiveness is a strength, it also presents a challenge. The benchmark covers six core disciplines and 30 different image types, which can be difficult for models to interpret and understand, especially when it comes to complex scientific concepts and notation.

  4. Performance Gaps: Even advanced models like GPT-4V(ision) and Gemini Ultra only achieve accuracies of 56% and 59% respectively, indicating significant room for improvement in the field.

  5. Manual Curation Process: The manual curation process used to create MMMU may carry biases, and the focus on college-level subjects might not fully test the capabilities required for Expert AGI.

  6. Model Performance Variability: Performance varies across disciplines, with better results in visually simpler fields compared to more complex fields like Science and Engineering.

  7. Need for Advanced Joint Interpretation: Additional features like OCR and captioning do not substantially enhance performance, highlighting the need for more advanced joint interpretation of images and text.

To address these limitations, future work on MMMU includes plans to incorporate human evaluations and to expand the dataset to ensure it captures a broader and deeper range of knowledge and reasoning skills.

