MuSR Benchmark

by Stephen M. Walker II, Co-Founder / CEO

What is the MuSR Benchmark?

MuSR (Multistep Soft Reasoning) is a benchmark designed to evaluate the reasoning capabilities of LLMs through complex, multistep tasks specified in natural language narratives. It addresses the limitations of existing benchmarks by introducing sophisticated natural language and complex reasoning challenges.

MuSR Evaluation

Key features of MuSR include:

  • Diverse task domains — MuSR spans various domains such as murder mysteries, object placements, and team allocation, providing a comprehensive assessment of model capabilities.
  • Complex reasoning requirements — Tasks are designed to be challenging, requiring deep reasoning and problem-solving skills.
  • High-quality narratives — The dataset includes well-crafted natural language narratives to ensure clarity and accuracy.
  • Enhanced validation — A rigorous review process minimizes incorrect answers and improves overall dataset quality.
  • Multistep reasoning tasks — Tasks require models to break down complex problems into multiple steps, testing their chain-of-thought capabilities.

MuSR serves as a robust tool to evaluate AI performance across a wide range of disciplines, from social and physical deductive reasoning to observational and constraint reasoning. It provides a comprehensive assessment of an LLM's reasoning capabilities and problem-solving skills.
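To see what these tasks look like in practice, the dataset can be loaded and inspected with a few lines of Python. The sketch below uses the Hugging Face datasets library; the dataset identifier, the split names, and the field names (narrative, question, choices, answer_index) are assumptions based on common conventions, so verify them against the published release before relying on them.

```python
# Minimal sketch: load MuSR and print one example per domain.
# The dataset identifier, split names, and field names below are assumptions;
# check the published release for the exact schema before relying on them.
from datasets import load_dataset

DOMAINS = ["murder_mysteries", "object_placements", "team_allocation"]  # assumed split names

for domain in DOMAINS:
    split = load_dataset("TAUR-Lab/MuSR", split=domain)  # assumed dataset id
    example = split[0]
    print(f"--- {domain}: {len(split)} examples ---")
    print(example["narrative"][:300], "...")   # long natural-language story
    print("Question:", example["question"])
    print("Choices:", example["choices"])
    print("Gold answer index:", example["answer_index"])
```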

MuSR Dataset Comparison

Figure: Recent reasoning datasets used for benchmarking LLMs.

The benchmark allows for testing and comparing various state-of-the-art language models, including but not limited to OpenAI's GPT-4, Google's Gemini, Anthropic's Claude, and Mistral AI's models.

Researchers and AI teams can utilize MuSR for in-depth evaluations when developing, fine-tuning, or benchmarking language models, especially when significant modifications are made to foundation models. This enhanced benchmark offers a more nuanced and challenging assessment of an LLM's true capabilities and limitations.
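For teams wiring MuSR into their own evaluation pipelines, the core loop is straightforward: present the narrative, the question, and the answer options, collect the model's choice, and compare it against the gold label. The sketch below is a generic harness under the same schema assumptions as the previous snippet; query_model is a hypothetical stand-in for whatever API or local inference call your stack uses.

```python
# Generic accuracy harness for a multiple-choice MuSR split (sketch).
# `query_model` is a hypothetical callable: swap in your own API or local
# inference call that returns the index of the option the model selects.
from typing import Callable, Sequence

def musr_accuracy(examples: Sequence[dict], query_model: Callable[[str], int]) -> float:
    correct = 0
    for ex in examples:
        options = "\n".join(f"{i}. {choice}" for i, choice in enumerate(ex["choices"]))
        prompt = (
            f"{ex['narrative']}\n\n{ex['question']}\n{options}\n\n"
            "Think through the narrative step by step, then answer with the "
            "number of the best option."
        )
        if query_model(prompt) == ex["answer_index"]:
            correct += 1
    return correct / len(examples)
```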

MuSR Evaluation Results

| Model | Murder Mysteries (MM) | Object Placements (OP) | Team Allocation (TA) | Average |
|---|---|---|---|---|
| Human Eval | 94.1 | 95.0 | 100.0 | 96.37 |
| Mistral Large 2 | 83.2 | 68.8 | 80.8 | 77.60 |
| GPT-4o | 85.6 | 62.5 | 76.8 | 74.97 |
| GPT-4 Turbo (0409) | 84.4 | 62.5 | 70.8 | 72.57 |
| GPT-4 (0613) | 80.4 | 60.9 | 68.4 | 69.90 |
| GPT-4o Mini | 72.0 | 61.7 | 60.0 | 64.57 |
| GPT-3.5 | 61.6 | 46.9 | 40.4 | 49.63 |
| Llama 3.1 405B | 51.6 | 26.2 | 30.4 | 36.07 |
| Llama 3.1 70B | 56.4 | 24.6 | 55.6 | 45.53 |
| Llama 3.1 8B | 52.8 | 26.5 | 31.6 | 37.00 |
| Llama 2 70B Chat | 48.8 | 42.2 | 44.8 | 45.27 |
| Llama 2 7B Chat | 50.8 | 29.3 | 36.8 | 38.97 |
| Vicuna 7B v1.5 | 48.4 | 29.7 | 26.4 | 34.83 |
| Vicuna 13B v1.5 | 50.8 | 34.4 | 32.0 | 39.07 |
| Vicuna 33B v1.3 | 49.6 | 31.2 | 30.0 | 36.93 |
| Random baseline | 50.0 | 24.6 | 33.3 | 35.97 |

Updated July 24, 2024
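The Average column is, up to rounding in the source, the unweighted mean of the three domain scores. The short calculation below reproduces a few rows, including the random baseline, to make that relationship explicit.

```python
# Reproduce the Average column from the per-domain scores (unweighted mean).
scores = {
    "Human Eval": (94.1, 95.0, 100.0),
    "Mistral Large 2": (83.2, 68.8, 80.8),
    "GPT-4o": (85.6, 62.5, 76.8),
    "Random baseline": (50.0, 24.6, 33.3),
}

for model, (mm, op, ta) in scores.items():
    print(f"{model}: average = {(mm + op + ta) / 3:.2f}")

# Human Eval: average = 96.37
# Mistral Large 2: average = 77.60
# GPT-4o: average = 74.97
# Random baseline: average = 35.97
```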

Key Differences from Traditional Benchmarks

MuSR introduces several key enhancements over traditional benchmarks, aimed at providing a more comprehensive and challenging evaluation of language models. These improvements address limitations in existing benchmarks and create a more robust testing environment.

Figure: MuSR LLM CoT reasoning trees (partial reasoning trees).

The main differences include:

  • Increased answer options — MuSR expands the number of answer choices, making the evaluation more realistic and challenging. This change significantly reduces the probability of correct answers by random guessing.
  • Enhanced reasoning requirements — MuSR incorporates more problems that demand complex reasoning, resulting in Chain-of-Thought (CoT) approaches outperforming traditional methods.
  • Improved robustness — By increasing the number of distractors, MuSR reduces the impact of chance on correct answers, leading to greater benchmark stability.

MuSR Dataset Composition

The MuSR dataset is created through a novel neurosymbolic synthetic-to-natural generation algorithm. This process involves several key steps, beginning with Tree Template Construction. In this phase, the authors create a high-level fact set and question-answer pairs for each domain. These serve as the foundation for the reasoning trees that will be developed.

Figure: Dataset construction process for MuSR.

Next, the Reasoning Tree Completion stage expands upon the initial fact set. Using recursive sampling from an LLM (in this case, GPT-4), the authors generate a tree of intermediate reasoning steps. This process creates a set of scenario-specific facts and commonsense knowledge that logically lead to the root facts.
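The following is a simplified sketch of that recursive expansion idea, not the authors' released code: each node in the tree holds a deduction, and a model call proposes the supporting facts that would entail it, recursing until a depth limit is reached. The llm_propose_facts function is a hypothetical stand-in for the GPT-4 prompt described in the paper.

```python
# Simplified sketch of recursive reasoning-tree completion (not the authors'
# released code). Each node holds a deduction; an LLM call proposes the
# scenario-specific and commonsense facts that would entail it, recursing to a
# fixed depth. `llm_propose_facts` is a hypothetical stand-in for the GPT-4
# prompt described in the paper.
from dataclasses import dataclass, field

@dataclass
class ReasoningNode:
    fact: str
    children: list["ReasoningNode"] = field(default_factory=list)

def expand(node: ReasoningNode, llm_propose_facts, depth: int = 0, max_depth: int = 3) -> None:
    """Recursively sample intermediate facts that support node.fact."""
    if depth >= max_depth:
        return  # leaves become the facts stated (or implied) in the narrative
    for supporting_fact in llm_propose_facts(node.fact):
        child = ReasoningNode(supporting_fact)
        node.children.append(child)
        expand(child, llm_propose_facts, depth + 1, max_depth)

# Usage: start from a root fact taken from the tree template, e.g. a suspect's
# means in a murder mystery, then embed the leaves in the story.
root = ReasoningNode("The gardener had access to the murder weapon.")
# expand(root, llm_propose_facts=my_gpt4_call)  # my_gpt4_call is hypothetical
```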

The final stage is Story Generation, where the generated facts are embedded into a natural narrative. This is achieved through a process the authors call "chaptering," which involves generating portions of the narrative based on subsets of facts and then combining them into a coherent whole. This approach allows for the creation of longer, more complex narratives that can theoretically scale beyond the 1000-word examples in the current dataset.
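A minimal sketch of the chaptering idea follows: split the leaf facts into groups, generate one chapter per group conditioned on the story so far, and concatenate the chapters. The llm_write_chapter function is a hypothetical stand-in for the narrative-generation prompt, and the group size is arbitrary.

```python
# Minimal sketch of "chaptering": generate the story in pieces, each grounded in
# a subset of leaf facts, then join the pieces. `llm_write_chapter` is a
# hypothetical stand-in for an LLM call that returns narrative text covering
# exactly the facts it is given; the group size of 5 is arbitrary.
from typing import Callable, List

def chapter_story(
    leaf_facts: List[str],
    llm_write_chapter: Callable[[List[str], str], str],
    facts_per_chapter: int = 5,
) -> str:
    story = ""
    for i in range(0, len(leaf_facts), facts_per_chapter):
        chunk = leaf_facts[i : i + facts_per_chapter]
        # Condition each new chapter on the story so far to keep the narrative coherent.
        story += llm_write_chapter(chunk, story) + "\n\n"
    return story.strip()
```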

CoT vs Direct Evaluation

Unlike many traditional benchmarks, where direct answer prompting performs reasonably well, MuSR rewards Chain-of-Thought (CoT) reasoning: models score markedly higher when asked to reason step by step before answering. The following table shows the performance difference between CoT and direct (non-CoT) prompting across the three domains:

| Prompting | Overall | Murder Mysteries | Object Placements | Team Allocation |
|---|---|---|---|---|
| CoT | 0.7255 | 0.7858 | 0.55 | 0.7467 |
| Direct | 0.5346 | 0.3920 | 0.3981 | 0.3971 |

The impact of Chain-of-Thought prompting varies across domains. In the table above, Murder Mysteries and Team Allocation gain roughly 35 to 39 points with CoT, while Object Placements gains about 15, reflecting the different kinds of reasoning each domain demands and the value of tailoring prompting strategies to the task.

Across all three domains, performance drops sharply without chain-of-thought reasoning. This gap underscores the challenging nature of the MuSR dataset and the importance of CoT reasoning for achieving strong results.
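The paper's exact prompts are not reproduced here, but the difference between the two modes can be illustrated with generic templates: direct prompting asks for the answer immediately, while CoT prompting asks the model to reason through the narrative before committing to an option.

```python
# Illustrative prompt templates for the two evaluation modes (generic wording,
# not the paper's exact prompts). Direct prompting asks for the answer
# immediately; CoT prompting asks the model to reason about the narrative first.

def _options(choices: list[str]) -> str:
    return "\n".join(f"{i + 1}. {choice}" for i, choice in enumerate(choices))

def direct_prompt(narrative: str, question: str, choices: list[str]) -> str:
    return (
        f"{narrative}\n\n{question}\n{_options(choices)}\n\n"
        "Answer with the number of the correct option only."
    )

def cot_prompt(narrative: str, question: str, choices: list[str]) -> str:
    return (
        f"{narrative}\n\n{question}\n{_options(choices)}\n\n"
        "Reason step by step about the clues in the narrative, then finish "
        "with 'Answer: <option number>'."
    )
```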

How Do the Results Compare: Traditional Benchmarks vs. MuSR

The table below compares the performance of various language models on traditional benchmarks and the more challenging MuSR dataset:

| Model | MMLU Score | MuSR Score | Performance Drop |
|---|---|---|---|
| GPT-4 | 0.887 | 0.7255 | 0.1615 (16.15%) |
| Claude-3 | 0.868 | 0.6845 | 0.1835 (18.35%) |
| Claude-3.5 | 0.815 | 0.5511 | 0.2639 (26.39%) |
| Gemini 1.5 | 0.789 | 0.5912 | 0.1978 (19.78%) |
| Llama-3-70B | 0.820 | 0.5620 | 0.2580 (25.80%) |

We can observe significant variations in performance drops across different models when tested on MuSR compared to traditional benchmarks:

  • GPT-4 shows the smallest decrease, with only a 16.15% drop in performance.
  • Other models experience more substantial declines, ranging from 18.35% to 26.39%.
  • Some models reportedly experience even larger drops, exceeding 30%.

These results highlight the challenging nature of the MuSR dataset and the need for continued research into more sophisticated reasoning techniques for LLMs.

