What is SuperGLUE?
SuperGLUE is a benchmark suite designed to evaluate the performance of language understanding models. It was developed as an evolution of the General Language Understanding Evaluation (GLUE) benchmark, aiming to address GLUE's limitations and provide a more comprehensive, more difficult evaluation of language understanding models.
SuperGLUE consists of a public leaderboard built around eight primary tasks and two diagnostic tasks. The tasks are designed to be solvable by an English-speaking college student and to require no specialist domain knowledge, so a model is tested on its language understanding rather than on the breadth of its factual knowledge. The tasks within SuperGLUE include Boolean Questions (BoolQ), CommitmentBank (CB), Choice of Plausible Alternatives (COPA), and Multi-Sentence Reading Comprehension (MultiRC), among others.
The benchmark provides public training and development datasets, while testing data is hidden and only used to evaluate predictions submitted to the leaderboard. The leaderboard contains information about each submission, as well as the scores for the subtasks included within the SuperGLUE benchmark.
SuperGLUE is anticipated to drive significant progress in several core areas of machine learning, including sample-efficient, transfer, multitask, and unsupervised or self-supervised learning. The leaderboard features submissions from both large industry labs and smaller research groups, and submissions often include variations of well-known models such as GPT, BERT, and RoBERTa.
SuperGLUE Leaderboard (January 2024)
- Meta Llama 2 (70B)
- Meta LLaMA (65B)
- Mistral v0.1 (7B)
- Cohere Command beta (52.4B)
- Jurassic-2 Jumbo (178B)
- Meta Llama 2 (13B)
- TNLG v2 (530B)
How does SuperGLUE differ from GLUE?
SuperGLUE is an advanced benchmark designed to evaluate language understanding models, building upon its predecessor, GLUE (General Language Understanding Evaluation). While GLUE provided a composite metric across nine tasks to gauge a model's language capabilities, SuperGLUE narrows its focus to more complex tasks, retaining only two from GLUE (RTE and the Winograd schema data, recast as WSC) and incorporating additional tasks selected for their difficulty. This benchmark challenges models with a wider variety of task formats and demands a deeper understanding of language and the world, setting a higher bar for performance.
How does SuperGLUE work?
SuperGLUE is a benchmark for evaluating the performance of models on a variety of language understanding tasks. It was designed as a more challenging successor to the GLUE benchmark, with the aim of providing a robust evaluation metric for any method that can be applied to a broad range of language understanding tasks.
SuperGLUE consists of eight primary tasks and two diagnostic tasks. These tasks are designed to be solvable by an English-speaking college student and to require no specialist domain knowledge, so a model is tested on its language understanding rather than on the breadth of its factual knowledge. The eight primary tasks are:
- Boolean Questions (BoolQ): a question-answering task pairing a short passage from a Wikipedia article with a yes/no question about the passage.
- CommitmentBank (CB): determining how committed a writer is to the truth of an embedded clause.
- Choice of Plausible Alternatives (COPA): choosing which of two alternatives is the more plausible cause or effect of a premise sentence.
- Multi-Sentence Reading Comprehension (MultiRC): answering questions about a paragraph, where each question may have more than one correct answer.
- Reading Comprehension with Commonsense Reasoning Dataset (ReCoRD): a cloze-style task that requires filling in a masked entity in a news article.
- Recognizing Textual Entailment (RTE): deciding whether one sentence entails another.
- Word-in-Context (WiC): deciding whether a word is used in the same sense in two different sentences.
- Winograd Schema Challenge (WSC): resolving pronoun coreference that requires commonsense reasoning.
To evaluate a model on SuperGLUE, you collect your system's predictions on these tasks. Per-task scores are then computed, and the final benchmark score is the unweighted average of the per-task scores; for tasks reported with two metrics, those two metrics are averaged first.
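The averaging described above can be sketched as follows. The per-task numbers here are illustrative placeholders, not real leaderboard results:

```python
# Sketch of how the overall SuperGLUE score is computed: each task
# contributes a single number (for tasks reported with two metrics,
# those are averaged first), and the benchmark score is the unweighted
# mean across tasks.
def superglue_score(task_scores):
    """Unweighted mean of per-task scores."""
    return sum(task_scores.values()) / len(task_scores)

# Illustrative placeholder scores, not actual results.
task_scores = {
    "BoolQ": 77.4,
    "CB": (90.5 + 83.6) / 2,       # F1 and accuracy averaged first
    "COPA": 70.6,
    "MultiRC": (24.1 + 70.0) / 2,  # EM and F1a averaged first
    "ReCoRD": (72.0 + 71.3) / 2,   # F1 and EM averaged first
    "RTE": 71.7,
    "WiC": 69.6,
    "WSC": 64.4,
}
print(round(superglue_score(task_scores), 1))
```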
SuperGLUE also provides a public leaderboard, which is still active with submissions and improvements. This leaderboard contains information about each submission, as well as the scores for the subtasks included within the SuperGLUE benchmark.
To evaluate your model on SuperGLUE, you can use the provided software toolkit. If you're using Python, you can load the SuperGLUE datasets with Hugging Face's datasets library. For example, to load the BoolQ task:

```python
from datasets import load_dataset

# Downloads and caches the BoolQ subset of SuperGLUE
dataset = load_dataset('super_glue', 'boolq')
```
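Before evaluating a real model, it is common to establish a trivial floor such as a majority-class baseline. The sketch below assumes the field layout of the Hugging Face 'boolq' config (integer labels in a `label` column); the toy label list stands in for the actual training split:

```python
# A minimal sanity-check baseline: predict the most frequent training
# label for every evaluation example. The label list below is a made-up
# stand-in for dataset["train"]["label"].
from collections import Counter

def majority_baseline(train_labels, n_eval):
    """Return the majority training label, repeated for each eval example."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return [majority] * n_eval

train_labels = [1, 1, 0, 1]  # stand-in for dataset["train"]["label"]
preds = majority_baseline(train_labels, n_eval=3)
print(preds)
```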
After generating predictions with your model, you can submit the results to the SuperGLUE leaderboard. Note that to limit overfitting to the private test data, users are limited to a maximum of two submissions per day and six submissions per month.
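Leaderboard submissions take the form of per-task prediction files in JSON-lines format, one object per line with the example index and the predicted label. The exact label values expected per task (e.g. "true"/"false" for BoolQ) should be checked against the official submission instructions; the predictions below are made up for illustration:

```python
import json

# Hypothetical sketch of writing one task's predictions as JSON lines.
# Field names ("idx", "label") follow the common SuperGLUE submission
# layout; verify against the leaderboard's official instructions.
def write_predictions(predictions, path):
    """predictions: iterable of (idx, label) pairs."""
    with open(path, "w") as f:
        for idx, label in predictions:
            f.write(json.dumps({"idx": idx, "label": label}) + "\n")

write_predictions([(0, "true"), (1, "false")], "BoolQ.jsonl")
```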
What are some future directions for SuperGLUE research?
Future research directions for the SuperGLUE benchmark are likely to focus on enhancing task complexity to challenge advanced language models with nuanced context understanding and reasoning. Incorporating multimodal benchmarks, cross-lingual tasks, and assessments of ethical considerations could provide a more comprehensive evaluation of AI capabilities.
Interactive tasks that simulate real-world conversations and domain-specific applications may become prevalent, alongside efficiency metrics that account for computational resource usage. Additionally, robustness to adversarial attacks, generalization from limited data, and continuous learning capabilities are areas poised for development.
These improvements aim to advance core machine learning areas, including sample-efficient learning, transfer learning, multitask learning, and self-supervised learning.
What are some limitations of SuperGLUE benchmark?
SuperGLUE, while a robust benchmark for evaluating language understanding models, does have some limitations:
Limited Coverage — SuperGLUE does not cover all forms of language understanding tasks. It retained only two of GLUE's nine tasks, which means it may not fully represent the diversity and complexity of language understanding.
Distribution Difference — A shift in distribution between the training set and the validation and test sets can cap the performance models are able to achieve.
Uniformity — The format of SuperGLUE is less uniform than its predecessor, GLUE. This could potentially discourage some researchers who were attracted to the highly uniform framework of GLUE.
Dataset Size — The datasets in SuperGLUE are smaller than those in many other benchmarks, which could narrow the range of work evaluated on it. However, smaller datasets also lower the computational barrier to entry, which can democratize research on the benchmark.
Model Performance — Some AI models from major tech companies like Microsoft and Google have already surpassed human performance on the SuperGLUE language benchmark. This raises questions about the benchmark's ability to continue pushing the boundaries of language understanding models.
Biases — Despite efforts to minimize biases, some may still exist in the tasks and datasets included in SuperGLUE.
Despite these limitations, SuperGLUE remains a valuable tool for evaluating the performance of language understanding models, and ongoing improvements and adaptations are expected to address some of these issues over time.