GSM8K Benchmark
by Stephen M. Walker II, Co-Founder / CEO
What is GSM8K?
GSM8K, or Grade School Math 8K, is a dataset of 8,500 high-quality, linguistically diverse grade school math word problems. The dataset was created to support the task of question answering on basic mathematical problems that require multi-step reasoning.
The problems in GSM8K are conceptually simple, but they can be challenging for state-of-the-art language models due to the high diversity of problems.
Some key features of the GSM8K dataset include:
 Problem Distribution — The dataset consists of 7,500 training problems and 1,000 test problems.
 Solution Steps — Each problem takes between 2 and 8 steps to solve, and solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations.
 Linguistic Diversity — The dataset is designed to test models' ability to understand and reason about mathematical word problems with varying linguistic complexity.
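To make the multi-step structure concrete, here is a minimal sketch of a GSM8K-style record and a helper that reads its final answer. The problem text is made up for illustration and the helper names are hypothetical. The `#### <number>` final line follows the answer format of the released dataset.

```python
import re
from typing import Optional

# Illustrative record in the GSM8K style (not taken from the dataset).
example = {
    "question": (
        "A baker makes 3 trays of 12 muffins and sells 20 of them. "
        "How many muffins are left?"
    ),
    "answer": (
        "The baker makes 3 * 12 = 36 muffins.\n"
        "After selling 20, there are 36 - 20 = 16 muffins left.\n"
        "#### 16"
    ),
}


def extract_final_answer(solution: str) -> Optional[str]:
    """Return the number after the '####' marker, or None if it is missing."""
    match = re.search(r"####\s*(-?[\d,]*\.?\d+)", solution)
    if match is None:
        return None
    # Drop thousands separators so "1,000" and "1000" compare equal.
    return match.group(1).replace(",", "")


def count_steps(solution: str) -> int:
    """Count non-empty reasoning lines, excluding the final '####' line."""
    lines = [ln for ln in solution.splitlines() if ln.strip()]
    return sum(1 for ln in lines if not ln.startswith("####"))


final_answer = extract_final_answer(example["answer"])
num_steps = count_steps(example["answer"])
```

Here the illustrative solution has two reasoning steps, which sits at the low end of the 2 to 8 step range described above.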
Researchers have been using GSM8K to develop methods for improving the performance of large language models on multi-step mathematical reasoning tasks. One such method involves training verifiers to judge the correctness of model completions, which has been shown to significantly improve performance on the GSM8K dataset.
GSM8K Leaderboard
| Rank | Model | Accuracy | Methodology |
|------|-------|----------|-------------|
| 1 | Anthropic Claude 3 | 95% | Zero-shot |
| 2 | Google Gemini Ultra | 94.4% | Majority Vote, 32 Generations |
| 3 | OpenAI GPT-4 | 92% | SFT & 5-shot CoT |
| 4 | Anthropic Claude 2 | 88% | Zero-shot |
| 5 | Google Gemini Pro | 86.5% | Majority Vote, 32 Generations |
| 6 | Inflection 2 | 81.4% | 8-shot Learning |
| 7 | Mistral Large | 81% | 5-shot Learning |
| 8 | Google PaLM 2 | 80% | 5-shot Learning |
| 9 | Mistral Medium | 66.7% | 5-shot Learning |
| 10 | xAI Grok 1 | 62.9% | 8-shot Learning |
| 11 | Mistral Mixtral 8x7B | 58.4% | 5-shot Learning |
| 12 | OpenAI GPT-3.5 | 57.1% | 5-shot Learning |
| 13 | Meta Llama 2 | 56.8% | 5-shot Learning |
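Several leaderboard entries report results with majority voting over multiple generations. The sketch below shows one plausible way to implement that aggregation step, assuming the final answer has already been extracted from each sampled generation as a string. The function name and the sample values are hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(answers: Sequence[Optional[str]]) -> Optional[str]:
    """Return the most frequent non-missing answer.

    Ties are broken in favour of the answer that was seen first.
    Returns None when no generation produced a parseable answer.
    """
    valid = [a for a in answers if a is not None]
    if not valid:
        return None
    return Counter(valid).most_common(1)[0][0]


# Hypothetical final answers from six sampled generations for one problem;
# None stands for a generation whose answer could not be parsed.
sampled = ["18", "18", "20", None, "18", "20"]
voted = majority_vote(sampled)
```

A result reported as "Majority Vote, 32 Generations" would correspond to calling such a function on 32 sampled answers per problem, so it is not directly comparable to a single-sample zero-shot score.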
How was the GSM8K dataset created?
The GSM8K dataset, a collaborative effort between OpenAI and Surge AI, comprises 8,500 high-quality math word problems, crafted by experts to reflect linguistic diversity and grade school math concepts. Designed for step-by-step problem-solving, the dataset serves as both a benchmark for large language models like GPT-3 and a tool for advancing AI problem-solving techniques.
While the detailed methodology of problem creation and curation is not public, it likely involved expert knowledge in elementary math, attention to linguistic variety, and stringent curation to ensure clarity and solvability through basic arithmetic.
How does GSM8K work?
GSM8K is a dataset of 8,500 high-quality, linguistically diverse grade school math word problems created by human problem writers. The dataset is segmented into 7,500 training problems and 1,000 test problems. These problems take between 2 and 8 steps to solve, and solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations. The dataset is designed to train language models like GPT-3 to solve natural language math problems and measure their performance.
At the time of writing, the reported state of the art on GSM8K is GPT-4 Code Interpreter with code-based self-verification (CSV, K=5). The researchers who introduced GSM8K found that even the largest transformer models of the time struggled to achieve high test performance, despite the conceptual simplicity of the problem distribution. To increase performance, they proposed training verifiers to judge the correctness of model completions. At test time, many candidate solutions are generated and the one ranked highest by the verifier is selected, and this verification step significantly improves performance on GSM8K.
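The test-time procedure described above (sample many candidates, score each with a verifier, keep the top-ranked one) can be sketched as follows. This is a minimal illustration rather than the original implementation: `generate` and `verifier_score` are hypothetical callables standing in for a language model and a trained verifier, and the toy versions below return canned outputs and scores so the example runs end to end.

```python
import itertools
from typing import Callable, Tuple


def best_of_n(
    question: str,
    generate: Callable[[str], str],
    verifier_score: Callable[[str, str], float],
    n: int = 4,
) -> Tuple[str, float]:
    """Sample n candidate solutions and return the one the verifier ranks highest."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(question) for _ in range(n)]
    scored = [(verifier_score(question, c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


# --- Toy stand-ins (hypothetical, not real models) ---
_canned = itertools.cycle([
    "3 * 12 = 36 muffins, 36 - 20 = 15 left.\n#### 15",
    "3 * 12 = 36 muffins, 36 - 20 = 16 left.\n#### 16",
    "3 + 12 = 15 muffins, so none can be left.\n#### 0",
])


def toy_generate(question: str) -> str:
    """Pretend to sample a solution from a language model."""
    return next(_canned)


# Made-up scores; a trained verifier would estimate the probability
# that a solution is correct from the question and solution text.
_TOY_SCORES = {"#### 15": 0.35, "#### 16": 0.90, "#### 0": 0.05}


def toy_verifier_score(question: str, solution: str) -> float:
    final_line = solution.splitlines()[-1]
    return _TOY_SCORES.get(final_line, 0.0)


best, score = best_of_n(
    "A baker makes 3 trays of 12 muffins and sells 20. How many are left?",
    toy_generate,
    toy_verifier_score,
    n=3,
)
```

The quality of the selected solution depends on the verifier: a poorly calibrated scorer can prefer a fluent but wrong candidate over a correct one.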
What are some common methods for implementing GSM8K?
Common approaches to solving GSM8K problems use large language models (LLMs) together with various prompting and training techniques for multi-step mathematical reasoning. Some of these methods include:

Chain-of-Thought Prompting — This approach involves generating a series of intermediate reasoning steps, which helps LLMs understand the problem description and decompose it into steps, as well as solve each step.

Tree-of-Thought Prompting — Similar to chain-of-thought prompting, this method uses a tree-based structure to represent the problem-solving process, guiding the LLM to generate a sequence of reasoning steps.

Process and Outcome-Based Feedback — This approach combines process-based supervision, which supervises the reasoning process itself, and outcome-based supervision, which supervises the final result. This combination helps improve the performance of LLMs on math word problems.

Mixed Policy Exploration — This method proposes a two-level token exploration policy: the first (abstract) level samples the next token probabilistically, while the second level is deterministic, greedily selecting the next token with the highest score. This approach has been tested on the GSM8K dataset with the GPT-2 model and demonstrated a performance gain.
These methods aim to improve the performance of LLMs on GSM8K by guiding them through the problem-solving process and providing feedback on both the reasoning steps and the final outcome.
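As an illustration of the first method, the sketch below assembles a few-shot chain-of-thought prompt from worked exemplars. The `Q:` / `A:` layout, the function name, and the exemplar text are assumptions made for this example, not a prescribed format. With no exemplars it falls back to a zero-shot variant that appends a "Let's think step by step." cue.

```python
from typing import List, Tuple


def build_cot_prompt(exemplars: List[Tuple[str, str]], question: str) -> str:
    """Build a chain-of-thought prompt.

    exemplars: (question, worked solution) pairs whose solutions spell out
    the intermediate reasoning steps before the final answer.
    """
    parts = [f"Q: {q}\nA: {solution}" for q, solution in exemplars]
    if exemplars:
        # Few-shot: the model is expected to imitate the worked solutions.
        parts.append(f"Q: {question}\nA:")
    else:
        # Zero-shot: nudge the model to reason before answering.
        parts.append(f"Q: {question}\nA: Let's think step by step.")
    return "\n\n".join(parts)


# Made-up exemplar for illustration only.
exemplars = [
    (
        "Tom has 4 boxes with 6 pencils each. He gives away 5 pencils. "
        "How many pencils does he have left?",
        "Tom starts with 4 * 6 = 24 pencils. "
        "After giving away 5, he has 24 - 5 = 19 pencils. The answer is 19.",
    ),
]

prompt = build_cot_prompt(
    exemplars,
    "A baker makes 3 trays of 12 muffins and sells 20. How many are left?",
)
zero_shot_prompt = build_cot_prompt([], "What is 7 * 8 - 6?")
```

An "8-shot" or "5-shot" result would correspond to passing eight or five such exemplars.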
What are some benefits of GSM8K?
GSM8K is a dataset consisting of 8,500 high-quality, linguistically diverse grade school math word problems created by human problem writers. The benefits of GSM8K include:

Training language models — GSM8K is used to train large language models like GPT-3 to solve math problems. It has been used in Google's PaLM (540B language model) and Chain-of-Thought papers.

Evaluating problem-solving capabilities — GSM8K is a popular dataset for evaluating the progress of large language models in solving math word problems. The problems in GSM8K are conceptually simple, yet one subtle mistake can derail an entire solution.

Verification techniques — Researchers have developed verifiers to evaluate the correctness of generated solutions for GSM8K problems. Verification techniques have shown promising results, with 8-shot Minerva achieving 78.5% accuracy using majority voting.

Improving reasoning skills — GSM8K has been used to develop methods that make large language models better reasoners, such as step-aware verifiers. These methods can further boost accuracy on GSM8K problems, with one example achieving 83.2% accuracy using 8 exemplars in each prompt.
GSM8K offers a valuable resource for training and evaluating large language models in solving math word problems and improving their reasoning skills.
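Accuracy figures like those quoted above are typically computed by comparing each predicted final answer with the reference answer. The sketch below shows one way to do that, assuming answers arrive as strings. The normalization rules (dropping commas, a dollar sign, and a trailing period) are assumptions for this example, and real evaluation harnesses may differ.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional, Sequence


def normalize(answer: Optional[str]) -> Optional[Decimal]:
    """Parse an answer string into a number, or None if it is not numeric."""
    if answer is None:
        return None
    cleaned = answer.strip().replace(",", "").replace("$", "").rstrip(".")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


def accuracy(
    predictions: Sequence[Optional[str]], references: Sequence[str]
) -> float:
    """Fraction of problems whose predicted answer equals the reference numerically."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = 0
    for pred, ref in zip(predictions, references):
        pred_value, ref_value = normalize(pred), normalize(ref)
        if pred_value is not None and pred_value == ref_value:
            correct += 1
    return correct / len(references)


# Made-up predictions and references for four problems.
acc = accuracy(["18", "1,000", "7.0", None], ["18", "1000", "7", "3"])
```

Comparing numeric values rather than raw strings avoids counting "7.0" against a reference of "7" as an error.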
What are some challenges associated with GSM8K?
GSM8K is a dataset of 8,500 high-quality, linguistically diverse grade school math word problems created by human problem writers. Some challenges associated with GSM8K include:

Inaccurate answers — Even the largest transformer models struggle to achieve high test performance on GSM8K, despite the conceptual simplicity of the problem distribution. For example, Zero-shot-CoT with GPT-3 has been found to return incorrect answers for some GSM8K problems.

Verification — To increase performance on GSM8K, researchers have proposed training verifiers to judge the correctness of model completions. However, verification can be costly, as it requires generating many candidate solutions for each problem and selecting the one ranked highest by the verifier.

Scaling — The GSM8K dataset has been used to evaluate the progress of large language models (LLMs) on math word problems. However, the dataset's size and complexity can make it difficult to scale and improve LLM performance on GSM8K.

Comparison with other datasets — GSM8K problems can be solved in a straightforward, step-by-step fashion, but not all math problems are like that. As a result, LLM performance on GSM8K is difficult to compare with performance on more challenging datasets, such as MATH.
GSM8K presents challenges for LLMs in terms of accuracy, verification, scaling, and comparison with other datasets. Researchers continue to explore ways to improve LLM performance on GSM8K and similar datasets.
What are some future directions for GSM8K research?
Some future directions for GSM8K research include improving the reliability of large language models (LLMs) and enhancing their ability to solve complex mathematical problems. OpenAI's Q* project is an example of such research, which aims to bring groundbreaking progress in artificial general intelligence (AGI) by enhancing mathematical reasoning ability in conventional LLMs.
For GSM8K, researchers can explore the following directions:

Training LLMs on more complex reasoning tasks — OpenAI has started training models not only on the final answers, but also on the reasoning steps between the prompt and the response, moving towards more challenging datasets like MATH.

Improving the verification process — Instead of having a verifier grade an entire answer, researchers can train a verifier to evaluate individual steps in a solution, making the process more efficient and accurate.

Combining small generators with small verifiers — OpenAI's testing showed that a small generator combined with a small verifier could produce results about as accurate as larger models, which can help reduce computational resources required for training and inference.
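To illustrate the idea of grading individual steps rather than a whole answer, here is a toy sketch. It is not a learned verifier: it only re-checks the arithmetic in calculator annotations of the form `<<expression=result>>`, which GSM8K solutions use to mark computations. A trained step-level verifier would replace `step_is_correct`. All function names are hypothetical, and the sample solution is made up.

```python
import ast
import operator
import re
from typing import List

_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _eval_arithmetic(node: ast.AST) -> float:
    """Evaluate a parsed expression containing only numbers and + - * /."""
    if isinstance(node, ast.Expression):
        return _eval_arithmetic(node.body)
    if isinstance(node, ast.Constant) and type(node.value) in (int, float):
        return float(node.value)
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        left = _eval_arithmetic(node.left)
        right = _eval_arithmetic(node.right)
        return _OPS[type(node.op)](left, right)
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_eval_arithmetic(node.operand)
    raise ValueError("unsupported expression")


_ANNOTATION = re.compile(r"<<([^=<>]+)=([^<>]+)>>")


def step_is_correct(step: str, tol: float = 1e-6) -> bool:
    """True if every <<expr=result>> annotation in the step checks out.

    A step without annotations is accepted, since this toy check
    has nothing to verify in it.
    """
    for expr, claimed in _ANNOTATION.findall(step):
        try:
            value = _eval_arithmetic(ast.parse(expr.strip(), mode="eval"))
            if abs(value - float(claimed)) > tol:
                return False
        except (ValueError, SyntaxError, ZeroDivisionError):
            return False
    return True


def score_steps(solution: str) -> List[bool]:
    """Return one verdict per reasoning line, skipping the final '####' line."""
    steps = [
        ln for ln in solution.splitlines()
        if ln.strip() and not ln.startswith("####")
    ]
    return [step_is_correct(s) for s in steps]


# Made-up solution whose second step contains an arithmetic slip.
solution = (
    "She bakes 3*12=<<3*12=36>>36 muffins.\n"
    "She has 36-20=<<36-20=15>>15 muffins left.\n"
    "#### 15"
)
verdicts = score_steps(solution)
```

Per-step verdicts make it possible to locate the first faulty step, which a single score for the whole answer cannot do.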
As for Q*, the project is still in development, and its future success is uncertain. However, researchers at OpenAI are optimistic about Q*'s potential to advance AI capabilities, particularly in mathematical reasoning. The Q* project aims to solve certain mathematical problems and has the potential to bring significant progress in AGI research.