MuSR Benchmark
by Stephen M. Walker II, Co-Founder / CEO
What is the MuSR Benchmark?
MuSR (Multistep Soft Reasoning) is a benchmark designed to evaluate the reasoning capabilities of large language models (LLMs) through complex, multistep tasks specified in natural language narratives. It addresses the limitations of existing benchmarks by pairing sophisticated natural language with complex reasoning challenges.
Key features of MuSR include:
- Diverse task domains: MuSR spans three domains (murder mysteries, object placements, and team allocation), providing a broad assessment of model capabilities.
- Complex reasoning requirements: Tasks are designed to be challenging and require deep reasoning and problem-solving skills.
- High-quality narratives: The dataset includes well-crafted natural language narratives to ensure clarity and accuracy.
- Enhanced validation: A rigorous review process minimizes incorrect answers and improves overall dataset quality.
- Multistep reasoning tasks: Tasks require models to break down complex problems into multiple steps, testing their chain-of-thought capabilities.
MuSR serves as a robust tool to evaluate AI performance across a wide range of disciplines, from social and physical deductive reasoning to observational and constraint reasoning. It provides a comprehensive assessment of an LLM's reasoning capabilities and problem-solving skills.
Figure: Recent reasoning datasets used for benchmarking LLMs
The benchmark allows for testing and comparing various state-of-the-art language models, including but not limited to OpenAI's GPT-4, Google's Gemini, Anthropic's Claude, and Mistral AI's models.
Researchers and AI teams can use MuSR for in-depth evaluations when developing, fine-tuning, or benchmarking language models, especially when significant modifications are made to foundation models. The benchmark offers a more nuanced and challenging assessment of an LLM's true capabilities and limitations.
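As a minimal sketch of what such an evaluation involves, the Python snippet below scores multiple-choice predictions per domain. The `Example` fields and the `predict` callback are illustrative assumptions, not the official MuSR schema or any particular evaluation harness.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence


@dataclass(frozen=True)
class Example:
    """One multiple-choice item. Field names are hypothetical."""
    domain: str
    narrative: str
    question: str
    choices: Sequence[str]
    answer_index: int  # 0-based index of the correct choice


def accuracy_by_domain(
    examples: Sequence[Example],
    predict: Callable[[Example], int],
) -> Dict[str, float]:
    """Return the percentage of correctly answered examples per domain."""
    correct: Dict[str, int] = {}
    total: Dict[str, int] = {}
    for ex in examples:
        total[ex.domain] = total.get(ex.domain, 0) + 1
        if predict(ex) == ex.answer_index:
            correct[ex.domain] = correct.get(ex.domain, 0) + 1
    return {d: 100.0 * correct.get(d, 0) / n for d, n in total.items()}


# Toy data plus a trivial "model" that always picks the first choice.
toy = [
    Example("murder_mysteries", "(narrative)", "Who did it?", ["Ann", "Bob"], 0),
    Example("murder_mysteries", "(narrative)", "Who did it?", ["Cy", "Dee"], 1),
    Example("team_allocation", "(narrative)", "Best plan?", ["A", "B", "C"], 0),
]
scores = accuracy_by_domain(toy, lambda ex: 0)
print(scores)
```

In practice `predict` would wrap a model call and an answer parser; keeping it as a callback lets the same scorer be reused across models and prompting styles.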
MuSR Evaluation Results
Model | Murder Mysteries (MM) | Object Placements (OP) | Team Allocation (TA) |
---|---|---|---|
Human Eval | 94.1 | 95.0 | 100.0 |
Mistral Large 2 | 83.2 | 68.8 | 80.8 |
GPT-4o | 85.6 | 62.5 | 76.8 |
GPT-4 Turbo (0409) | 84.4 | 62.5 | 70.8 |
GPT-4 (0613) | 80.4 | 60.9 | 68.4 |
GPT-4o Mini | 72.0 | 61.7 | 60.0 |
GPT-3.5 | 61.6 | 46.9 | 40.4 |
Llama3.1 405B | 51.6 | 26.2 | 30.4 |
Llama3.1 70B | 56.4 | 24.6 | 55.6 |
Llama3.1 8B | 52.8 | 26.5 | 31.6 |
Llama2 70b Chat | 48.8 | 42.2 | 44.8 |
Llama2 7b Chat | 50.8 | 29.3 | 36.8 |
Vicuna 7b v1.5 | 48.4 | 29.7 | 26.4 |
Vicuna 13b v1.5 | 50.8 | 34.4 | 32.0 |
Vicuna 33b v1.3 | 49.6 | 31.2 | 30.0 |
random | 50.0 | 24.6 | 33.3 |
Model | Average Score |
---|---|
Human Eval | 96.37 |
Mistral Large 2 | 77.60 |
GPT-4o | 74.97 |
GPT-4 Turbo (0409) | 72.57 |
GPT-4 (0613) | 69.90 |
GPT-4o Mini | 64.57 |
GPT-3.5 | 49.63 |
Llama3.1 405B | 36.07 |
Llama3.1 70B | 45.53 |
Llama3.1 8B | 36.97 |
Llama2 70b Chat | 45.27 |
Llama2 7b Chat | 38.97 |
Vicuna 7b v1.5 | 34.83 |
Vicuna 13b v1.5 | 39.07 |
Vicuna 33b v1.3 | 36.93 |
random | 35.97 |
Updated July 24, 2024
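The average scores appear to be the unweighted mean of the three domain scores. That is inferred from the numbers rather than stated in the source. A short check in Python, using three rows copied from the tables above:

```python
from statistics import mean

# Per-domain accuracies (MM, OP, TA) copied from the first table.
domain_scores = {
    "Mistral Large 2": (83.2, 68.8, 80.8),
    "GPT-4o": (85.6, 62.5, 76.8),
    "random": (50.0, 24.6, 33.3),
}


def average_score(scores):
    """Unweighted mean of the domain scores, rounded to two decimals."""
    return round(mean(scores), 2)


averages = {model: average_score(s) for model, s in domain_scores.items()}
for model, avg in averages.items():
    print(f"{model}: {avg:.2f}")
```

Because the mean is unweighted, a domain with fewer questions counts as much as a larger one; a per-question average could differ.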
Key Differences from Traditional Benchmarks
MuSR introduces several key enhancements over traditional benchmarks, aimed at providing a more comprehensive and challenging evaluation of language models. These improvements address limitations in existing benchmarks and create a more robust testing environment.
Figure: Partial reasoning trees
The main differences include:
- Answer options and chance: The probability of guessing correctly differs by domain. The random baseline in the results table is 50.0 for murder mysteries, 24.6 for object placements, and 33.3 for team allocation, so scores are best read against those floors.
- Enhanced reasoning requirements: MuSR incorporates problems that demand complex, multistep reasoning, and Chain-of-Thought (CoT) prompting outperforms direct answering on them.
- Improved robustness: More distractors reduce the impact of chance on correct answers, which leads to greater benchmark stability.
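To make the role of chance concrete, the sketch below computes the expected accuracy of uniform random guessing over k options and rescales a raw score so that the random baseline maps to 0. The rescaling is a common normalization used here for illustration; the source does not say that MuSR results are reported this way.

```python
def chance_accuracy(num_choices: int) -> float:
    """Expected accuracy (in percent) of uniform random guessing."""
    if num_choices < 1:
        raise ValueError("num_choices must be at least 1")
    return 100.0 / num_choices


def chance_adjusted(accuracy: float, baseline: float) -> float:
    """Rescale so the random baseline maps to 0 and a perfect score to 100."""
    if not 0.0 <= baseline < 100.0:
        raise ValueError("baseline must be in [0, 100)")
    return 100.0 * (accuracy - baseline) / (100.0 - baseline)


# Two options give a 50% floor, three give about 33.3%.
print(chance_accuracy(2), round(chance_accuracy(3), 1))

# GPT-4o on murder mysteries: 85.6 raw, 50.0 random baseline (from the table).
print(round(chance_adjusted(85.6, 50.0), 1))
```

The 50.0 and 33.3 baselines in the results table match two and three options. The 24.6 baseline for object placements is consistent with roughly four options on average, which is an inference from the number rather than something the source states.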
MuSR Dataset Composition
The MuSR dataset is created through a novel neurosymbolic synthetic-to-natural generation algorithm. This process involves several key steps, beginning with Tree Template Construction. In this phase, the authors create a high-level fact set and question-answer pairs for each domain. These serve as the foundation for the reasoning trees that will be developed.
Figure: Dataset construction process for MuSR
Next, the Reasoning Tree Completion stage expands upon the initial fact set. Using recursive sampling from an LLM (in this case, GPT-4), the authors generate a tree of intermediate reasoning steps. This process creates a set of scenario-specific facts and commonsense knowledge that logically lead to the root facts.
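The recursive expansion can be pictured with a small Python sketch. It is not the authors' implementation: `FactNode`, `complete_tree`, and `stub_propose` are hypothetical names, and `stub_propose` stands in for the LLM call with canned output so the example runs offline.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class FactNode:
    """A fact plus the supporting facts that imply it."""
    fact: str
    children: List["FactNode"] = field(default_factory=list)


def complete_tree(
    node: FactNode,
    propose: Callable[[str], List[str]],
    max_depth: int,
) -> FactNode:
    """Recursively attach supporting facts until max_depth is reached."""
    if max_depth == 0:
        return node
    for child_fact in propose(node.fact):
        child = FactNode(child_fact)
        node.children.append(complete_tree(child, propose, max_depth - 1))
    return node


def leaf_facts(node: FactNode) -> List[str]:
    """Collect the leaves, i.e. the facts that would be written into the story."""
    if not node.children:
        return [node.fact]
    leaves: List[str] = []
    for child in node.children:
        leaves.extend(leaf_facts(child))
    return leaves


def stub_propose(fact: str) -> List[str]:
    """Hypothetical stand-in for sampling sub-facts from an LLM."""
    return [f"evidence A for ({fact})", f"evidence B for ({fact})"]


root = complete_tree(FactNode("The suspect has a motive"), stub_propose, max_depth=2)
leaves = leaf_facts(root)
print(len(leaves))
```

With two proposals per node and a depth of two, the tree ends in four leaf facts. A real pipeline would also validate each proposed step before accepting it.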
The final stage is Story Generation, where the generated facts are embedded into a natural narrative. This is achieved through a process the authors call "chaptering," which involves generating portions of the narrative based on subsets of facts and then combining them into a coherent whole. This approach allows for the creation of longer, more complex narratives that can theoretically scale beyond the 1000-word examples in the current dataset.
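A rough sketch of the chaptering idea in Python follows. It assumes the facts arrive as a flat list and that each chapter is written from a fixed-size subset; `write_chapter` is a hypothetical stand-in for the LLM call, and the chunking rule is an illustrative choice, not the authors' exact procedure.

```python
from typing import Callable, List, Sequence


def chunk_facts(facts: Sequence[str], chunk_size: int) -> List[List[str]]:
    """Split the facts into consecutive groups of at most chunk_size."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [list(facts[i:i + chunk_size]) for i in range(0, len(facts), chunk_size)]


def write_story(
    facts: Sequence[str],
    write_chapter: Callable[[List[str]], str],
    chunk_size: int = 2,
) -> str:
    """Generate one chapter per group of facts and join them into one narrative."""
    chapters = [write_chapter(chunk) for chunk in chunk_facts(facts, chunk_size)]
    return "\n\n".join(chapters)


def stub_writer(chunk: List[str]) -> str:
    """Hypothetical stand-in for an LLM that turns facts into prose."""
    return "Chapter covering: " + "; ".join(chunk)


story = write_story(["fact A", "fact B", "fact C"], stub_writer, chunk_size=2)
print(story)
```

Because each call only sees a small subset of facts, the total story length grows with the number of chunks rather than being bounded by what a single generation can cover.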
CoT vs Direct Evaluation
MuSR rewards Chain-of-Thought (CoT) reasoning: models score higher when they reason step by step than when they answer directly. The following table shows the difference between CoT and direct (non-CoT) prompting, overall and for each domain:
Prompting | Overall | Murder Mysteries | Object Placements | Team Allocation |
---|---|---|---|---|
CoT | 0.7255 | 0.7858 | 0.55 | 0.7467 |
Direct | 0.5346 | 0.3920 | 0.3981 | 0.3971 |
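The gap between the two rows can be computed directly. The Python snippet below uses only the numbers from the table above:

```python
# Scores copied from the CoT vs. direct table.
cot = {
    "Overall": 0.7255,
    "Murder Mysteries": 0.7858,
    "Object Placements": 0.55,
    "Team Allocation": 0.7467,
}
direct = {
    "Overall": 0.5346,
    "Murder Mysteries": 0.3920,
    "Object Placements": 0.3981,
    "Team Allocation": 0.3971,
}

# Absolute gain from CoT prompting, rounded to four decimals.
gain = {name: round(cot[name] - direct[name], 4) for name in cot}

for name, value in gain.items():
    print(f"{name}: +{value:.4f}")
```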
The impact of Chain-of-Thought (CoT) prompting varies across domains in the MuSR dataset. In the table above, CoT improves every domain, but the gain ranges from about 0.15 on Object Placements to about 0.39 on Murder Mysteries. This spread reflects the different kinds of reasoning each domain requires and underscores the importance of tailored approaches in language model prompting.
Performance drops significantly without chain-of-thought reasoning: the overall score falls from 0.7255 to 0.5346. The gap underscores the challenging nature of the MuSR dataset and the importance of CoT reasoning for strong results.
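For readers who have not used the two prompting styles, here is a minimal Python sketch of how they might differ. The prompt wording and the `ANSWER:` convention are illustrative assumptions, not the prompts used to produce the results above.

```python
import re
from typing import Optional, Sequence


def format_choices(choices: Sequence[str]) -> str:
    """Render the choices as a numbered list, starting at 1."""
    return "\n".join(f"{i + 1}. {c}" for i, c in enumerate(choices))


def direct_prompt(narrative: str, question: str, choices: Sequence[str]) -> str:
    """Ask for the answer only, with no intermediate reasoning."""
    return (
        f"{narrative}\n\n{question}\n{format_choices(choices)}\n\n"
        "Reply with a single line of the form 'ANSWER: <number>'."
    )


def cot_prompt(narrative: str, question: str, choices: Sequence[str]) -> str:
    """Ask the model to reason step by step before committing to an answer."""
    return (
        f"{narrative}\n\n{question}\n{format_choices(choices)}\n\n"
        "Think through the clues step by step, then finish with a line "
        "of the form 'ANSWER: <number>'."
    )


def parse_answer(completion: str) -> Optional[int]:
    """Return the 1-based choice number from the last 'ANSWER: n', if any."""
    matches = re.findall(r"ANSWER:\s*(\d+)", completion)
    return int(matches[-1]) if matches else None


prompt = cot_prompt("(narrative)", "Who is the most likely murderer?", ["Ann", "Bob"])
print(prompt)
print(parse_answer("The knife belonged to Bob.\nANSWER: 2"))
```

Taking the last match lets the parser ignore any tentative answers that appear earlier in a chain of reasoning.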
How do the results compare: Traditional Benchmarks vs. MuSR
The table below compares the performance of several language models on MMLU, a traditional benchmark, and on the more challenging MuSR dataset:
Model | MMLU Score | MuSR Score | Performance Drop |
---|---|---|---|
GPT-4 | 0.887 | 0.7255 | 0.1615 (16.15%) |
Claude-3 | 0.868 | 0.6845 | 0.1835 (18.35%) |
Claude-3.5 | 0.815 | 0.5511 | 0.2639 (26.39%) |
Gemini 1.5 | 0.789 | 0.5912 | 0.1978 (19.78%) |
Llama-3-70B | 0.820 | 0.5620 | 0.2580 (25.80%) |
The performance drop is the absolute difference between the two scores, so the percentages in the last column are percentage points. The size of the drop varies considerably across models:
- GPT-4 shows the smallest decrease, 16.15 percentage points.
- The other models in the table decline more, by 18.35 to 26.39 percentage points.
- Some models reportedly experience even larger drops, exceeding 30%.
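As a small worked example, the Python sketch below recomputes the drop column from the table and also derives the relative drop (the difference divided by the MMLU score). The relative figures are computed here for comparison and do not appear in the source.

```python
# (MMLU score, MuSR score) pairs copied from the table above.
scores = {
    "GPT-4": (0.887, 0.7255),
    "Claude-3": (0.868, 0.6845),
    "Claude-3.5": (0.815, 0.5511),
    "Gemini 1.5": (0.789, 0.5912),
    "Llama-3-70B": (0.820, 0.5620),
}


def drops(mmlu: float, musr: float):
    """Return (absolute drop in percentage points, relative drop in percent)."""
    absolute = mmlu - musr
    relative = absolute / mmlu
    return round(absolute * 100, 2), round(relative * 100, 2)


results = {model: drops(*pair) for model, pair in scores.items()}
for model, (points, percent) in results.items():
    print(f"{model}: {points:.2f} points, {percent:.2f}% relative")
```

For GPT-4 this gives 16.15 points but about 18.21% relative, so the two readings of "drop" are not interchangeable when comparing models.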
These results highlight the challenging nature of the MuSR dataset and the need for continued research into more sophisticated reasoning techniques for LLMs.