What is HELM?

by Stephen M. Walker II, Co-Founder / CEO

Holistic Evaluation of Language Models (HELM) is a comprehensive benchmark framework designed to improve the transparency of language models (LMs) by taxonomizing the vast space of potential scenarios and metrics of interest for LMs. Developed by Stanford CRFM, HELM serves as a living benchmark for the community, continuously updated with new scenarios, metrics, and models.

Key aspects of HELM include:

  • Taxonomy — HELM taxonomizes the vast space of potential scenarios (use cases) and metrics (desiderata) that are of interest for LMs.
  • Targeted Evaluations — HELM performs 7 targeted evaluations based on 26 targeted scenarios to analyze specific aspects of LMs, such as knowledge, reasoning, memorization/copyright, and disinformation.
  • Large-Scale Evaluation — HELM conducts a large-scale evaluation of 30 prominent language models (spanning open, limited-access, and closed models) on all 42 scenarios, including 21 scenarios that were not previously used in mainstream LM evaluation.

HELM aims to provide greater transparency into the capabilities, limitations, and risks of language models.

HELM Leaderboard (January 2024)

Model                          Score
Meta Llama 2 (70B)             94.40%
Meta LLaMA (65B)               90.80%
OpenAI text-davinci-002        90.50%
Mistral v0.1 (7B)              88.40%
Cohere Command beta (52.4B)    87.40%
OpenAI text-davinci-003        87.20%
Jurassic-2 Jumbo (178B)        82.40%
Meta Llama 2 (13B)             82.30%
TNLG v2 (530B)                 78.70%
OpenAI gpt-3.5-turbo-0613      78.30%
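The scores above correspond to HELM's headline aggregate, the mean win rate: for each scenario, the fraction of head-to-head comparisons a model wins against every other model, averaged over all scenarios. A minimal sketch of that aggregation, using made-up model names and per-scenario accuracies (not HELM's actual code or data):

```python
# Mean win rate: for each scenario, the fraction of head-to-head
# comparisons a model wins against every other model, averaged
# over all scenarios. All data here is illustrative.

def mean_win_rate(scores):
    """scores: dict mapping model name -> list of per-scenario metric values."""
    models = list(scores)
    n_scenarios = len(next(iter(scores.values())))
    rates = {}
    for m in models:
        wins, comparisons = 0.0, 0
        for s in range(n_scenarios):
            for other in models:
                if other == m:
                    continue
                comparisons += 1
                if scores[m][s] > scores[other][s]:
                    wins += 1
                elif scores[m][s] == scores[other][s]:
                    wins += 0.5  # ties count as half a win
        rates[m] = wins / comparisons
    return rates

# Hypothetical per-scenario accuracies for three models.
scores = {
    "model_a": [0.9, 0.8, 0.7],
    "model_b": [0.6, 0.9, 0.5],
    "model_c": [0.5, 0.4, 0.6],
}
print(mean_win_rate(scores))
```

A model that beats every other model on every scenario gets a mean win rate of 1.0, which is why the top leaderboard scores cluster in the 80-95% range rather than at any single metric's raw value.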

How does HELM work?

Holistic Evaluation of Language Models (HELM) is a framework designed to improve the transparency of language models (LMs) by evaluating their capabilities, limitations, and risks across a broad range of scenarios and metrics. HELM involves three main elements:

  • Broad coverage and recognition of incompleteness — HELM evaluates LMs over a wide range of scenarios, recognizing that there is still room for improvement and that not all aspects can be covered.

  • Multi-metric approach — HELM measures seven metrics (accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency) for each of 16 core scenarios, ensuring that metrics beyond accuracy are not overlooked and trade-offs are clearly exposed.

  • Targeted evaluations — HELM performs seven targeted evaluations based on 26 targeted scenarios to analyze specific aspects, such as reasoning and disinformation, in more depth.
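The multi-metric approach above can be sketched as scoring the same set of predictions on more than one desideratum, so trade-offs become visible. The snippet below computes accuracy alongside a simple binned expected calibration error; the data, function names, and binning scheme are illustrative assumptions, not HELM's implementation:

```python
# Sketch of a multi-metric evaluation: the same predictions are scored
# on both accuracy and calibration, exposing a trade-off that an
# accuracy-only leaderboard would hide. Toy data throughout.

def accuracy(preds, labels):
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

def expected_calibration_error(confidences, preds, labels, n_bins=5):
    """Weighted average of |avg confidence - avg accuracy| per confidence bin."""
    bins = [[] for _ in range(n_bins)]
    for c, p, l in zip(confidences, preds, labels):
        idx = min(int(c * n_bins), n_bins - 1)  # equal-width bins over [0, 1]
        bins[idx].append((c, p == l))
    ece, total = 0.0, len(labels)
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        avg_acc = sum(hit for _, hit in b) / len(b)
        ece += len(b) / total * abs(avg_conf - avg_acc)
    return ece

preds  = [1, 0, 1, 1, 0, 1]
labels = [1, 0, 0, 1, 0, 1]
confs  = [0.95, 0.7, 0.9, 0.6, 0.8, 0.99]

metrics = {
    "accuracy": accuracy(preds, labels),
    "calibration_error": expected_calibration_error(confs, preds, labels),
}
print(metrics)
```

HELM reports seven such metrics per core scenario, so a model that tops the accuracy column can still rank poorly on calibration, toxicity, or efficiency.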

HELM conducts a large-scale evaluation of 30 prominent language models on all 42 scenarios, 21 of which were not previously used in mainstream LM evaluation. Prior to HELM, models on average were evaluated on just 17.9% of the core HELM scenarios, but with HELM, this has improved to 96.0%. The framework serves as a living benchmark for transparency in language models, continuously updated with new scenarios, metrics, and models.
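The 17.9% to 96.0% coverage figures describe how densely the model-by-scenario evaluation matrix is filled in. A minimal sketch of that measurement, with hypothetical model and scenario names (the percentages themselves come from the HELM paper, not this toy data):

```python
# Coverage: the mean fraction of benchmark scenarios each model has
# reported results for. Model and scenario names are hypothetical.

def coverage(evaluated, scenarios):
    """evaluated: dict mapping model -> set of scenario names it was run on."""
    fracs = [len(done & set(scenarios)) / len(scenarios)
             for done in evaluated.values()]
    return sum(fracs) / len(fracs)

scenarios = ["qa", "summarization", "toxicity", "sentiment"]
before = {"model_a": {"qa"}, "model_b": {"summarization"}}       # sparse
after  = {"model_a": set(scenarios), "model_b": set(scenarios)}  # dense

print(coverage(before, scenarios), coverage(after, scenarios))  # → 0.25 1.0
```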

What are some future directions for HELM research?

The Holistic Evaluation of Language Models (HELM) is a benchmarking approach developed to improve the transparency of language models. It is intended to be a living benchmark, continuously updated with new scenarios, metrics, and models.

Future directions for HELM research could include:

  • Assessing Large Multimodal Models — Future research could focus on evaluating large multimodal models that can process and generate multiple types of data such as text, images, and audio. This could provide a more comprehensive understanding of the capabilities and limitations of these models.

  • Expanding Language Coverage — The first version of HELM primarily focused on English. Future research could expand the evaluation to include other languages.

  • Exploring Non-traditional Applications — HELM could be applied to tasks beyond traditional Natural Language Processing (NLP) tasks. For instance, it could be used in applications such as copywriting or other creative tasks.

  • Improving Human-LM Interaction Metrics — Future research could focus on developing metrics that better capture the interaction between humans and language models. This could help in understanding how these models are used in real-world scenarios and how they can be improved.

  • Continuous Updating of Scenarios, Metrics, and Models — As stated by the creators of HELM, the benchmark is intended to be a living entity, continuously updated with new scenarios, metrics, and models. This implies that future research will involve the constant addition of new data and the refinement of existing benchmarks.

  • Risk Assessment — Future research could also focus on assessing the risks associated with language models. This could involve developing metrics or methods to identify and mitigate potential risks.

These directions align with the broader trend in AI and machine learning research, which is moving towards more comprehensive, transparent, and robust evaluation methods.

More terms

What is the Google 'No Moat' Memo?

The "no moat" memo is a leaked document from a Google researcher, which suggests that Google and OpenAI lack a competitive edge or "moat" in the AI industry. The memo argues that open-source AI models are outperforming these tech giants, being faster, more customizable, more private, and more capable overall.


What is AutoGPT?

AutoGPT is an open-source autonomous AI agent that, given a goal in natural language, breaks it down into sub-tasks and uses the internet and other tools to achieve it. It is based on the GPT-4 language model and can automate workflows, analyze data, and generate new suggestions without the need for continuous user input.

