MuSR Benchmark

by Stephen M. Walker II, Co-Founder / CEO

What is the MuSR Benchmark?

MuSR (Multistep Soft Reasoning) is a benchmark designed to evaluate the reasoning capabilities of LLMs through complex, multistep tasks specified in natural language narratives. It addresses the limitations of existing benchmarks by introducing sophisticated natural language and complex reasoning challenges.

MuSR Evaluation

Key features of MuSR include:

  • Diverse task domains — MuSR spans various domains such as murder mysteries, object placements, and team allocation, providing a comprehensive assessment of model capabilities.
  • Complex reasoning requirements — Tasks are designed to be challenging, requiring deep reasoning and problem-solving skills.
  • High-quality narratives — The dataset includes well-crafted natural language narratives to ensure clarity and accuracy.
  • Enhanced validation — A rigorous review process minimizes incorrect answers and improves overall dataset quality.
  • Multistep reasoning tasks — Tasks require models to break down complex problems into multiple steps, testing their chain-of-thought capabilities.

MuSR serves as a robust tool to evaluate AI performance across a wide range of disciplines, from social and physical deductive reasoning to observational and constraint reasoning. It provides a comprehensive assessment of an LLM's reasoning capabilities and problem-solving skills.
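To see what these tasks look like in practice, the dataset can be loaded and inspected with a few lines of Python. The sketch below uses the Hugging Face datasets library; the dataset identifier, the split names, and the field names (narrative, question, choices, answer_index) are assumptions based on common conventions, so verify them against the published release before relying on them.

```python
# Minimal sketch: load MuSR and print one example per domain.
# The dataset identifier, split names, and field names below are assumptions;
# check the published release for the exact schema before relying on them.
from datasets import load_dataset

DOMAINS = ["murder_mysteries", "object_placements", "team_allocation"]  # assumed split names

for domain in DOMAINS:
    split = load_dataset("TAUR-Lab/MuSR", split=domain)  # assumed dataset id
    example = split[0]
    print(f"--- {domain}: {len(split)} examples ---")
    print(example["narrative"][:300], "...")   # long natural-language story
    print("Question:", example["question"])
    print("Choices:", example["choices"])
    print("Gold answer index:", example["answer_index"])
```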

MuSR Dataset Comparison

Figure: Recent reasoning datasets used for benchmarking LLMs.

The benchmark allows for testing and comparing various state-of-the-art language models, including but not limited to OpenAI's GPT-4, Google's Gemini, Anthropic's Claude, and Mistral AI's models.

Researchers and AI teams can utilize MuSR for in-depth evaluations when developing, fine-tuning, or benchmarking language models, especially when significant modifications are made to foundation models. This enhanced benchmark offers a more nuanced and challenging assessment of an LLM's true capabilities and limitations.
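For teams wiring MuSR into their own evaluation pipelines, the core loop is straightforward: present the narrative, the question, and the answer options, collect the model's choice, and compare it against the gold label. The sketch below is a generic harness under the same schema assumptions as the previous snippet; query_model is a hypothetical stand-in for whatever API or local inference call your stack uses.

```python
# Generic accuracy harness for a multiple-choice MuSR split (sketch).
# `query_model` is a hypothetical callable: swap in your own API or local
# inference call that returns the index of the option the model selects.
from typing import Callable, Sequence

def musr_accuracy(examples: Sequence[dict], query_model: Callable[[str], int]) -> float:
    correct = 0
    for ex in examples:
        options = "\n".join(f"{i}. {choice}" for i, choice in enumerate(ex["choices"]))
        prompt = (
            f"{ex['narrative']}\n\n{ex['question']}\n{options}\n\n"
            "Think through the narrative step by step, then answer with the "
            "number of the best option."
        )
        if query_model(prompt) == ex["answer_index"]:
            correct += 1
    return correct / len(examples)
```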

MuSR Evaluation Results

| Model | Murder Mysteries (MM) | Object Placements (OP) | Team Allocation (TA) | Average |
|---|---|---|---|---|
| Human Eval | 94.1 | 95.0 | 100.0 | 96.37 |
| Mistral Large 2 | 83.2 | 68.8 | 80.8 | 77.60 |
| GPT-4o | 85.6 | 62.5 | 76.8 | 74.97 |
| GPT-4 Turbo (0409) | 84.4 | 62.5 | 70.8 | 72.57 |
| GPT-4 (0613) | 80.4 | 60.9 | 68.4 | 69.90 |
| GPT-4o Mini | 72.0 | 61.7 | 60.0 | 64.57 |
| GPT-3.5 | 61.6 | 46.9 | 40.4 | 49.63 |
| Llama 3.1 405B | 51.6 | 26.2 | 30.4 | 36.07 |
| Llama 3.1 70B | 56.4 | 24.6 | 55.6 | 45.53 |
| Llama 3.1 8B | 52.8 | 26.5 | 31.6 | 37.00 |
| Llama 2 70B Chat | 48.8 | 42.2 | 44.8 | 45.27 |
| Llama 2 7B Chat | 50.8 | 29.3 | 36.8 | 38.97 |
| Vicuna 7B v1.5 | 48.4 | 29.7 | 26.4 | 34.83 |
| Vicuna 13B v1.5 | 50.8 | 34.4 | 32.0 | 39.07 |
| Vicuna 33B v1.3 | 49.6 | 31.2 | 30.0 | 36.93 |
| Random baseline | 50.0 | 24.6 | 33.3 | 35.97 |

Updated July 24, 2024
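The Average column is, up to rounding in the source, the unweighted mean of the three domain scores. The short calculation below reproduces a few rows, including the random baseline, to make that relationship explicit.

```python
# Reproduce the Average column from the per-domain scores (unweighted mean).
scores = {
    "Human Eval": (94.1, 95.0, 100.0),
    "Mistral Large 2": (83.2, 68.8, 80.8),
    "GPT-4o": (85.6, 62.5, 76.8),
    "Random baseline": (50.0, 24.6, 33.3),
}

for model, (mm, op, ta) in scores.items():
    print(f"{model}: average = {(mm + op + ta) / 3:.2f}")

# Human Eval: average = 96.37
# Mistral Large 2: average = 77.60
# GPT-4o: average = 74.97
# Random baseline: average = 35.97
```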

Key Differences from Traditional Benchmarks

MuSR introduces several key enhancements over traditional benchmarks, aimed at providing a more comprehensive and challenging evaluation of language models. These improvements address limitations in existing benchmarks and create a more robust testing environment.

Figure: MuSR LLM CoT reasoning trees (partial reasoning trees).

The main differences include:

  • Increased answer options — MuSR expands the number of answer choices, making the evaluation more realistic and challenging. This change significantly reduces the probability of correct answers by random guessing.
  • Enhanced reasoning requirements — MuSR incorporates more problems that demand complex reasoning, resulting in Chain-of-Thought (CoT) approaches outperforming traditional methods.
  • Improved robustness — By increasing the number of distractors, MuSR reduces the impact of chance on correct answers, leading to greater benchmark stability.

MuSR Dataset Composition

The MuSR dataset is created through a novel neurosymbolic synthetic-to-natural generation algorithm. This process involves several key steps, beginning with Tree Template Construction. In this phase, the authors create a high-level fact set and question-answer pairs for each domain. These serve as the foundation for the reasoning trees that will be developed.

Figure: Dataset construction process for MuSR.

Next, the Reasoning Tree Completion stage expands upon the initial fact set. Using recursive sampling from an LLM (in this case, GPT-4), the authors generate a tree of intermediate reasoning steps. This process creates a set of scenario-specific facts and commonsense knowledge that logically lead to the root facts.
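The following is a simplified sketch of that recursive expansion idea, not the authors' released code: each node in the tree holds a deduction, and a model call proposes the supporting facts that would entail it, recursing until a depth limit is reached. The llm_propose_facts function is a hypothetical stand-in for the GPT-4 prompt described in the paper.

```python
# Simplified sketch of recursive reasoning-tree completion (not the authors'
# released code). Each node holds a deduction; an LLM call proposes the
# scenario-specific and commonsense facts that would entail it, recursing to a
# fixed depth. `llm_propose_facts` is a hypothetical stand-in for the GPT-4
# prompt described in the paper.
from dataclasses import dataclass, field

@dataclass
class ReasoningNode:
    fact: str
    children: list["ReasoningNode"] = field(default_factory=list)

def expand(node: ReasoningNode, llm_propose_facts, depth: int = 0, max_depth: int = 3) -> None:
    """Recursively sample intermediate facts that support node.fact."""
    if depth >= max_depth:
        return  # leaves become the facts stated (or implied) in the narrative
    for supporting_fact in llm_propose_facts(node.fact):
        child = ReasoningNode(supporting_fact)
        node.children.append(child)
        expand(child, llm_propose_facts, depth + 1, max_depth)

# Usage: start from a root fact taken from the tree template, e.g. a suspect's
# means in a murder mystery, then embed the leaves in the story.
root = ReasoningNode("The gardener had access to the murder weapon.")
# expand(root, llm_propose_facts=my_gpt4_call)  # my_gpt4_call is hypothetical
```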

The final stage is Story Generation, where the generated facts are embedded into a natural narrative. This is achieved through a process the authors call "chaptering," which involves generating portions of the narrative based on subsets of facts and then combining them into a coherent whole. This approach allows for the creation of longer, more complex narratives that can theoretically scale beyond the 1000-word examples in the current dataset.
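A minimal sketch of the chaptering idea follows: split the leaf facts into groups, generate one chapter per group conditioned on the story so far, and concatenate the chapters. The llm_write_chapter function is a hypothetical stand-in for the narrative-generation prompt, and the group size is arbitrary.

```python
# Minimal sketch of "chaptering": generate the story in pieces, each grounded in
# a subset of leaf facts, then join the pieces. `llm_write_chapter` is a
# hypothetical stand-in for an LLM call that returns narrative text covering
# exactly the facts it is given; the group size of 5 is arbitrary.
from typing import Callable, List

def chapter_story(
    leaf_facts: List[str],
    llm_write_chapter: Callable[[List[str], str], str],
    facts_per_chapter: int = 5,
) -> str:
    story = ""
    for i in range(0, len(leaf_facts), facts_per_chapter):
        chunk = leaf_facts[i : i + facts_per_chapter]
        # Condition each new chapter on the story so far to keep the narrative coherent.
        story += llm_write_chapter(chunk, story) + "\n\n"
    return story.strip()
```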

CoT vs Direct Evaluation

Unlike many traditional benchmarks, where direct answer prompting performs reasonably well, MuSR rewards Chain-of-Thought (CoT) reasoning: models score markedly higher when asked to reason step by step before answering. The following table shows the performance difference between CoT and direct (non-CoT) prompting across the three domains:

| Prompting | Overall | Murder Mysteries | Object Placements | Team Allocation |
|---|---|---|---|---|
| CoT | 0.7255 | 0.7858 | 0.55 | 0.7467 |
| Direct | 0.5346 | 0.3920 | 0.3981 | 0.3971 |

The impact of Chain-of-Thought prompting varies across domains. In the table above, Murder Mysteries and Team Allocation gain roughly 35 to 39 points with CoT, while Object Placements gains about 15, reflecting the different kinds of reasoning each domain demands and the value of tailoring prompting strategies to the task.

Across all three domains, performance drops sharply without chain-of-thought reasoning. This gap underscores the challenging nature of the MuSR dataset and the importance of CoT reasoning for achieving strong results.
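The paper's exact prompts are not reproduced here, but the difference between the two modes can be illustrated with generic templates: direct prompting asks for the answer immediately, while CoT prompting asks the model to reason through the narrative before committing to an option.

```python
# Illustrative prompt templates for the two evaluation modes (generic wording,
# not the paper's exact prompts). Direct prompting asks for the answer
# immediately; CoT prompting asks the model to reason about the narrative first.

def _options(choices: list[str]) -> str:
    return "\n".join(f"{i + 1}. {choice}" for i, choice in enumerate(choices))

def direct_prompt(narrative: str, question: str, choices: list[str]) -> str:
    return (
        f"{narrative}\n\n{question}\n{_options(choices)}\n\n"
        "Answer with the number of the correct option only."
    )

def cot_prompt(narrative: str, question: str, choices: list[str]) -> str:
    return (
        f"{narrative}\n\n{question}\n{_options(choices)}\n\n"
        "Reason step by step about the clues in the narrative, then finish "
        "with 'Answer: <option number>'."
    )
```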

How Do the Results Compare: Traditional Benchmarks vs. MuSR

The table below compares the performance of various language models on traditional benchmarks and the more challenging MuSR dataset:

| Model | MMLU Score | MuSR Score | Performance Drop |
|---|---|---|---|
| GPT-4 | 0.887 | 0.7255 | 0.1615 (16.15%) |
| Claude-3 | 0.868 | 0.6845 | 0.1835 (18.35%) |
| Claude-3.5 | 0.815 | 0.5511 | 0.2639 (26.39%) |
| Gemini 1.5 | 0.789 | 0.5912 | 0.1978 (19.78%) |
| Llama-3-70B | 0.820 | 0.5620 | 0.2580 (25.80%) |

We can observe significant variations in performance drops across different models when tested on MuSR compared to traditional benchmarks:

  • GPT-4 shows the smallest decrease, with only a 16.15% drop in performance.
  • Other models experience more substantial declines, ranging from 18.35% to 26.39%.
  • Some models reportedly experience even larger drops, exceeding 30%.

These results highlight the challenging nature of the MuSR dataset and the need for continued research into more sophisticated reasoning techniques for LLMs.

