GSM8K Benchmark

by Stephen M. Walker II, Co-Founder / CEO

What is GSM8K?

GSM8K, or Grade School Math 8K, is a dataset of 8,500 high-quality, linguistically diverse grade school math word problems. The dataset was created to support the task of question answering on basic mathematical problems that require multi-step reasoning.

The problems in GSM8K are conceptually simple, but they can be challenging for state-of-the-art language models due to the high diversity of problems.

Some key features of the GSM8K dataset include:

  • Problem Distribution — The dataset is split into roughly 7,500 training problems and 1,000 test problems (the public release contains 7,473 training and 1,319 test problems).
  • Solution Steps — Each problem takes between 2 and 8 steps to solve, and solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations.
  • Linguistic Diversity — The dataset is designed to test models' ability to understand and reason about mathematical word problems with varying linguistic complexity.
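To make the solution-step structure concrete, here is a minimal sketch of how a reference solution can be parsed. It assumes the convention used in the public GSM8K release, where intermediate calculations are wrapped in `<<...>>` and the final answer follows `####`. The sample text and the helper names are made up for illustration.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical solution written in the style of the public GSM8K release:
# each calculation is annotated as <<expression=result>> and the final
# answer is placed on its own line after "####".
SAMPLE_SOLUTION = (
    "Three packs hold 3*12 = <<3*12=36>>36 pencils.\n"
    "After giving away 5, 36-5 = <<36-5=31>>31 pencils remain.\n"
    "#### 31"
)

CALC_PATTERN = re.compile(r"<<([^<>=]+)=([^<>]+)>>")
FINAL_PATTERN = re.compile(r"####\s*(-?[\d,]*\.?\d+)")


def extract_calculations(solution: str) -> List[Tuple[str, str]]:
    """Return (expression, result) pairs for every annotated calculation."""
    return CALC_PATTERN.findall(solution)


def extract_final_answer(solution: str) -> Optional[str]:
    """Return the number after '####' without thousands separators, or None."""
    match = FINAL_PATTERN.search(solution)
    if match is None:
        return None
    return match.group(1).replace(",", "")


if __name__ == "__main__":
    print(extract_calculations(SAMPLE_SOLUTION))
    print(extract_final_answer(SAMPLE_SOLUTION))
```

Counting the annotated calculations gives a rough proxy for the number of solution steps in a problem.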

Researchers have been using GSM8K to develop methods for improving the performance of large language models on multi-step mathematical reasoning tasks. One such method involves training verifiers to judge the correctness of model completions, which has been shown to significantly improve performance on the GSM8K dataset.

GSM8K Leaderboard

Rank  Model                   Accuracy  Methodology
1     Anthropic Claude 3      95%       Zero shot
2     Google Gemini Ultra     94.4%     Majority Vote, 32 Generations
3     OpenAI GPT-4            92%       SFT & 5-shot CoT
4     Anthropic Claude 2      88%       Zero shot
5     Google Gemini Pro       86.5%     Majority Vote, 32 Generations
6     Inflection 2            81.4%     8-shot Learning
7     Mistral Large           81%       5-shot Learning
8     Google PaLM 2           80%       5-shot Learning
9     Mistral Medium          66.7%     5-shot Learning
10    xAI Grok 1              62.9%     8-shot Learning
11    Mistral Mixtral 8x7b    58.4%     5-shot Learning
12    OpenAI GPT3.5           57.1%     5-shot Learning
13    Meta Llama 2            56.8%     5-shot Learning
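Several leaderboard entries report majority voting over 32 generations. The leaderboard does not say how the votes were tallied, so the sketch below shows only the common setup: sample many solutions, extract a final answer from each, and return the most frequent one. The function name and the tie-breaking rule are assumptions.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(answers: Sequence[Optional[str]]) -> Optional[str]:
    """Return the most frequent answer, ignoring generations with no answer.

    Ties go to the answer that appeared first, because Counter.most_common
    keeps first-seen order among equal counts.
    """
    valid = [a for a in answers if a is not None]
    if not valid:
        return None
    return Counter(valid).most_common(1)[0][0]


if __name__ == "__main__":
    # Final answers extracted from five hypothetical generations.
    sampled = ["72", "70", "72", None, "72"]
    print(majority_vote(sampled))
```

Voting makes results hard to compare with single-sample rows: a score obtained from 32 generations per problem costs far more inference compute than a zero-shot score.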

How was the GSM8K dataset created?

The GSM8K dataset, a collaborative effort between OpenAI and Surge AI, comprises 8,500 high-quality math word problems, crafted by experts to reflect linguistic diversity and grade school math concepts. Designed for step-by-step problem-solving, the dataset serves as both a benchmark for large language models like GPT-3 and a tool for advancing AI problem-solving techniques.

While the detailed methodology of problem creation and curation is not public, it likely involved expert knowledge in elementary math, attention to linguistic variety, and stringent curation to ensure clarity and solvability through basic arithmetic.

How does GSM8K work?

GSM8K is a dataset of 8,500 high-quality linguistically diverse grade school math word problems created by human problem writers. The dataset is segmented into 7,500 training problems and 1,000 test problems. These problems take between 2 and 8 steps to solve, and solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations. The dataset is designed to train language models like GPT-3 to solve natural language math problems and measure their performance.

The reported state of the art on GSM8K is GPT-4 Code Interpreter with code-based self-verification (CSV) and voting over K=5 sampled solutions. Earlier research found that even the largest transformer models of the time struggled to achieve high test performance on GSM8K, despite the conceptual simplicity of the problem distribution. To increase performance, some researchers proposed training verifiers to judge the correctness of model completions. At test time, they generate many candidate solutions and select the one ranked highest by the verifier, an approach shown to significantly improve performance on GSM8K.
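The test-time selection step can be sketched as follows. In the research described, the verifier is a trained model. Here `toy_verifier` is a hypothetical stand-in that only checks the arithmetic in each written step, so that the example runs without a model.

```python
import operator
import re
from typing import Callable, Sequence

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}
NUMBER = r"(\d+(?:\.\d+)?)"
STEP = re.compile(NUMBER + r"\s*([+\-*/])\s*" + NUMBER + r"\s*=\s*" + NUMBER)


def toy_verifier(question: str, solution: str) -> float:
    """Hypothetical scorer: fraction of 'a op b = c' steps that are correct."""
    steps = STEP.findall(solution)
    if not steps:
        return 0.0
    correct = 0
    for a, op, b, c in steps:
        if op == "/" and float(b) == 0:
            continue  # a division by zero can never be a correct step
        if abs(OPS[op](float(a), float(b)) - float(c)) < 1e-9:
            correct += 1
    return correct / len(steps)


def select_with_verifier(question: str,
                         candidates: Sequence[str],
                         verifier: Callable[[str, str], float]) -> str:
    """Return the candidate solution that the verifier scores highest."""
    if not candidates:
        raise ValueError("at least one candidate solution is required")
    return max(candidates, key=lambda c: verifier(question, c))


if __name__ == "__main__":
    question = "Three packs of 12 pencils, 5 given away. How many remain?"
    candidates = [
        "3 * 12 = 35\n35 - 5 = 30\nAnswer: 30",
        "3 * 12 = 36\n36 - 5 = 31\nAnswer: 31",
    ]
    print(select_with_verifier(question, candidates, toy_verifier))
```

A real verifier would replace `toy_verifier` with a learned scoring model. The selection logic stays the same.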

What are some common methods for solving GSM8K problems?

Common approaches to GSM8K pair large language models (LLMs) with prompting and training techniques designed for multi-step mathematical reasoning. Some of these methods include:

  • Chain-of-Thought Prompting — This approach involves generating a series of intermediate reasoning steps, which helps LLMs understand the problem description and decompose it into steps, as well as solve each step.

  • Tree-of-Thought Prompting — An extension of chain-of-thought prompting that organizes the problem-solving process as a tree, so the LLM can generate several candidate reasoning steps at each point, evaluate them, and continue along the most promising branch.

  • Process- and Outcome-Based Feedback — This approach combines process-based supervision, which supervises the reasoning process itself, and outcome-based supervision, which supervises the final result. This combination helps improve the performance of LLMs on math word problems.

  • Mixed Policy Exploration — This method proposes a two-level token exploration policy, where the abstract level explores the next token with probability, and the second level is deterministic, selecting the next token with the highest score in a greedy way. This approach has been tested on the GSM8K dataset with the GPT-2 model and demonstrated a performance gain.

These methods aim to improve the performance of LLMs on GSM8K by guiding them through the problem-solving process and providing feedback on both the reasoning steps and the final outcome.
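As an illustration of chain-of-thought prompting, the sketch below assembles a few-shot prompt in which every exemplar shows its reasoning before the answer. The `Q:`/`A:` layout, the phrase "The answer is", and the exemplar are illustrative assumptions, not a prescribed format.

```python
from typing import List, Sequence, Tuple

# Each exemplar is (question, worked reasoning, final answer).
Exemplar = Tuple[str, str, str]


def build_cot_prompt(exemplars: Sequence[Exemplar], question: str) -> str:
    """Build a few-shot prompt whose exemplars spell out intermediate steps."""
    parts: List[str] = []
    for ex_question, reasoning, answer in exemplars:
        parts.append(
            f"Q: {ex_question}\nA: {reasoning} The answer is {answer}."
        )
    # The new question ends with an open "A:" so the model continues with
    # its own reasoning steps before stating an answer.
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    exemplars = [(
        "A pack holds 12 pencils. Sam buys 3 packs and gives away 5 pencils. "
        "How many pencils does Sam have left?",
        "Sam buys 3 * 12 = 36 pencils. After giving away 5, 36 - 5 = 31 remain.",
        "31",
    )]
    print(build_cot_prompt(exemplars, "A box holds 8 eggs. How many eggs are in 4 boxes?"))
```

With an empty exemplar list, the same function produces a zero-shot prompt containing only the new question.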

What are some benefits of GSM8K?

The benefits of GSM8K include:

  • Training language models — GSM8K is used to train large language models like GPT-3 to solve math problems. It has been used in Google's PaLM (540B language model) and Chain of Thought papers.

  • Evaluating problem-solving capabilities — GSM8K is a popular dataset for evaluating the progress of large language models in solving math word problems. The problems in GSM8K are conceptually simple, yet one subtle mistake can derail an entire solution.

  • Verification techniques — Researchers have developed verifiers to evaluate the correctness of generated solutions for GSM8K problems. Related sampling-based selection has also shown promising results, with 8-shot Minerva achieving 78.5% accuracy using majority voting.

  • Improving reasoning skills — GSM8K has been used to develop methods that make large language models better reasoners, such as step-aware verifiers. These methods can further boost accuracy on GSM8K problems, with one example achieving 83.2% accuracy using 8 exemplars in each prompt.

GSM8K offers a valuable resource for training and evaluating large language models in solving math word problems and improving their reasoning skills.
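Accuracy figures like those quoted above are typically computed by comparing each predicted final answer with the reference answer. The sketch below shows one way to do that with a simple numeric normalization. The normalization rules are assumptions, and real evaluation harnesses may differ.

```python
from typing import Optional, Sequence


def normalize_answer(text: Optional[str]) -> Optional[float]:
    """Convert an answer string such as '$1,000' or '72.' to a float."""
    if text is None:
        return None
    cleaned = text.strip().replace(",", "").replace("$", "").rstrip(".")
    try:
        return float(cleaned)
    except ValueError:
        return None  # non-numeric output counts as a wrong answer


def exact_match_accuracy(predictions: Sequence[Optional[str]],
                         references: Sequence[str]) -> float:
    """Fraction of problems whose predicted number equals the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        return 0.0
    correct = 0
    for predicted, reference in zip(predictions, references):
        p, r = normalize_answer(predicted), normalize_answer(reference)
        if p is not None and r is not None and p == r:
            correct += 1
    return correct / len(references)


if __name__ == "__main__":
    preds = ["72", "1,000", "abc", None]
    refs = ["72.0", "1000", "5", "3"]
    print(exact_match_accuracy(preds, refs))
```

Because only the final number is compared, a solution with flawed reasoning that lands on the right answer still counts as correct.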

What are some challenges associated with GSM8K?

Some challenges associated with GSM8K include:

  • Inaccurate answers — Even the largest transformer models struggle to achieve high test performance on GSM8K, despite the conceptual simplicity of the problem distribution. For example, zero-shot chain-of-thought prompting with GPT-3 has been found to return incorrect answers for some GSM8K problems.

  • Verification — To increase performance on GSM8K, researchers have proposed training verifiers to judge the correctness of model completions. However, the approach is costly: the verifier must be trained on many sampled solutions labeled as correct or incorrect, and at test time many candidates must be generated and ranked for every problem.

  • Scaling — The GSM8K dataset has been used to evaluate the progress of large language models (LLMs) on math word problems. However, the dataset's size and complexity can make it difficult to scale and improve LLM performance on GSM8K.

  • Comparison with other datasets — GSM8K problems can be solved in a straightforward, step-by-step fashion, but not all math problems are like that. Strong results on GSM8K therefore do not necessarily carry over to more difficult datasets such as MATH, which makes cross-benchmark comparisons hard to interpret.

GSM8K presents challenges for LLMs in terms of accuracy, verification, scaling, and comparison with other datasets. Researchers continue to explore ways to improve LLM performance on GSM8K and similar datasets.

What are some future directions for GSM8K research?

Some future directions for GSM8K research include improving the reliability of large language models (LLMs) and enhancing their ability to solve complex mathematical problems. OpenAI's reported Q* project is often cited as an example of such research. It reportedly aims to advance artificial general intelligence (AGI) by strengthening the mathematical reasoning ability of conventional LLMs.

For GSM8K, researchers can explore the following directions:

  • Training LLMs on more complex reasoning tasks — OpenAI has started training models not only on the final answers, but also on the reasoning steps between the prompt and the response, moving towards more challenging datasets like MATH.

  • Improving the verification process — Instead of having a verifier grade an entire answer, researchers can train a verifier to evaluate individual steps in a solution, making the process more efficient and accurate.

  • Combining small generators with small verifiers — OpenAI's testing showed that a small generator combined with a small verifier could produce results about as accurate as larger models, which can help reduce computational resources required for training and inference.
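The step-level verification idea can be sketched as follows. The code assumes a step verifier has already produced one score in [0, 1] per solution step. Those scores, and the choice between product and minimum as the aggregation rule, are illustrative assumptions.

```python
import math
from typing import Optional, Sequence


def solution_score(step_scores: Sequence[float], how: str = "product") -> float:
    """Combine per-step scores into a single score for the whole solution."""
    if not step_scores:
        return 0.0
    if how == "product":
        # Treats each score as the probability that the step is correct.
        return math.prod(step_scores)
    if how == "min":
        # A solution is only as strong as its weakest step.
        return min(step_scores)
    raise ValueError(f"unknown aggregation: {how}")


def first_suspect_step(step_scores: Sequence[float],
                       threshold: float = 0.5) -> Optional[int]:
    """Return the index of the first step scored below the threshold."""
    for index, score in enumerate(step_scores):
        if score < threshold:
            return index
    return None


if __name__ == "__main__":
    scores = [0.9, 0.2, 0.8]  # hypothetical verifier outputs for three steps
    print(solution_score(scores), solution_score(scores, how="min"))
    print(first_suspect_step(scores))
```

Unlike a single score for the whole answer, per-step scores also show where a solution first goes wrong.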

As for Q*, the project is still in development, and its future success is uncertain. However, researchers at OpenAI are optimistic about Q*'s potential to advance AI capabilities, particularly in mathematical reasoning. The Q* project aims to solve certain mathematical problems and has the potential to bring significant progress in AGI research.
