
AlpacaEval

by Stephen M. Walker II, Co-Founder / CEO

What is AlpacaEval?

AlpacaEval, along with MT-Bench, is one of the best LLM evaluations for understanding how models rank relative to their peers. While not perfect, it provides a fast, automated comparison.

AlpacaEval is an automated tool for evaluating instruction-following language models against the AlpacaFarm dataset. It stands out for its human-validated, high-quality assessments that are both cost-effective and rapid.

Klu AlpacaEval Workflow

The evaluator is specifically designed for chat-based large language models (LLMs) and features a leaderboard to benchmark model performance.

AlpacaEval calculates win-rates for models across a variety of tasks, including traditional NLP and instruction-tuning datasets, providing a comprehensive measure of model capabilities.
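As a rough illustration of the idea (not AlpacaEval's actual implementation), a win rate over a set of pairwise verdicts can be computed like this; the half-credit convention for ties is an assumption for the sketch:

```python
from collections import Counter

def win_rate(preferences):
    """Compute a model's win rate from pairwise annotator verdicts.

    `preferences` holds one verdict per instruction: "model" if the
    candidate output was preferred over the baseline, "baseline" if
    not. Ties, if present, count as half a win in this sketch.
    """
    counts = Counter(preferences)
    wins = counts["model"] + 0.5 * counts["tie"]
    return 100.0 * wins / len(preferences)

# Hypothetical verdicts over 8 instructions:
verdicts = ["model", "baseline", "model", "tie",
            "baseline", "model", "baseline", "baseline"]
print(round(win_rate(verdicts), 2))  # 43.75
```

Note that under this scheme the baseline model scores exactly 50% against itself, which is why GPT-4 Turbo sits at 50.00% on the leaderboard below.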

AlpacaEval is a single-turn benchmark, which means it evaluates models based on their responses to single-turn prompts. It has been used to assess models like OpenAI GPT-4, Mistral Mixtral, Anthropic Claude 2, and others.

Current Leaderboard

As of January 15, 2024, the current leaderboard is led by GPT-4 Turbo, mirroring the human preference results of MT-Bench.

| Model Name | Win Rate | Length | License |
|---|---|---|---|
| GPT-4 Turbo | 50.00% | 2049 | Proprietary |
| Yi 34B Chat | 29.66% | 2123 | Open Source |
| GPT-4 | 23.58% | 1365 | Proprietary |
| GPT-4 0314 | 22.07% | 1371 | Proprietary |
| Mistral Medium | 21.86% | 1500 | Proprietary |
| Mixtral 8x7B v0.1 | 18.26% | 1465 | Open Source |
| Claude 2 | 17.19% | 1069 | Proprietary |
| Claude | 16.99% | 1082 | Proprietary |
| Gemini Pro | 16.85% | 1315 | Proprietary |
| Tulu 2+DPO 70B | 15.98% | 1418 | Open Source |
| GPT-4 0613 | 15.76% | 1140 | Proprietary |
| Claude 2.1 | 15.73% | 1096 | Proprietary |
| Mistral 7B v0.2 | 14.72% | 1676 | Open Source |
| GPT 3.5 Turbo 0613 | 14.13% | 1328 | Proprietary |
| LLaMA2 Chat 70B | 13.87% | 1790 | Open Source |
| Cohere Command | 12.90% | 1983 | Proprietary |
| Vicuna 33B v1.3 | 12.71% | 1479 | Open Source |
| OpenHermes-2.5-Mistral | 10.34% | 1107 | Open Source |
| GPT 3.5 Turbo 0301 | 9.62% | 827 | Proprietary |
| GPT 3.5 Turbo 1106 | 9.18% | 796 | Proprietary |

How does AlpacaEval work?

AlpacaEval is an automatic evaluation system for instruction-following language models. The original version benchmarked models against OpenAI's text-davinci-003 and used GPT-4 as the auto-annotator.

A recent upgrade replaces both with GPT-4 Turbo, reflecting the state-of-the-art model that has emerged since AlpacaEval's original release.


AlpacaEval is designed to be fast, cost-effective, and replicable, and it has been validated against 20,000 human annotations. The evaluation is based on the AlpacaFarm evaluation set, which tests the ability of models to follow general user instructions.

The responses generated by the models are then compared to reference responses, and the comparison is performed by GPT-4 based auto-annotators.
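At its core, each comparison reduces to prompting the judge model with the instruction and two candidate outputs, then parsing its verdict. The template below is illustrative only, not the exact prompt AlpacaEval ships:

```python
def build_annotation_prompt(instruction, output_a, output_b):
    """Assemble a pairwise-comparison prompt for a GPT-4-style
    auto-annotator. The wording is a hypothetical stand-in for
    AlpacaEval's real annotator template."""
    return (
        "Select the output that better follows the instruction.\n"
        f"Instruction: {instruction}\n"
        f"Output (a): {output_a}\n"
        f"Output (b): {output_b}\n"
        "Answer with exactly 'a' or 'b'."
    )

def parse_verdict(completion):
    """Map the annotator's raw completion to a preference label."""
    answer = completion.strip().lower()
    if answer.startswith("a"):
        return "a"
    if answer.startswith("b"):
        return "b"
    raise ValueError(f"unparseable verdict: {completion!r}")

prompt = build_annotation_prompt("Name a prime number.", "4", "7")
print(parse_verdict(" B "))  # b
```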

AlpacaEval automates the evaluation of instruction-following language models such as ChatGPT, a process that traditionally relies on time-consuming and costly human annotation, making frequent evaluation practical during model development.

Despite advancements, it's important to note limitations such as a bias towards longer outputs and models similar to the evaluator's base model (GPT-4 Turbo).

The system includes a leaderboard that ranks common models based on their performance on the AlpacaEval set. It features an automatic evaluator with high human agreement, capable of comparing model outputs to those of a reference model. The evaluator supports caching and output randomization to enhance reliability.
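The caching and output-randomization ideas can be sketched as follows. This is a minimal illustration under assumed interfaces, not AlpacaEval's code: cached verdicts make reruns free, and randomizing which output appears first cancels position bias in the judge.

```python
import random

class PairwiseAnnotator:
    """Sketch of the reliability tricks described above: results
    are cached, and the two outputs are presented to the judge in
    random order to cancel position bias."""

    def __init__(self, judge, seed=0):
        self.judge = judge            # judge(inst, a, b) -> "a" or "b"
        self.rng = random.Random(seed)
        self.cache = {}

    def annotate(self, instruction, out_model, out_reference):
        key = (instruction, out_model, out_reference)
        if key not in self.cache:
            flipped = self.rng.random() < 0.5
            a, b = (out_reference, out_model) if flipped else (out_model, out_reference)
            verdict = self.judge(instruction, a, b)
            preferred_model = (verdict == "b") if flipped else (verdict == "a")
            self.cache[key] = "model" if preferred_model else "reference"
        return self.cache[key]

def toy_judge(inst, a, b):
    # Prefers whichever output is longer -- a stand-in for an LLM judge.
    return "a" if len(a) >= len(b) else "b"

ann = PairwiseAnnotator(toy_judge)
print(ann.annotate("Say hi", "Hello there!", "Hi"))  # model
```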

AlpacaEval also provides a toolkit for creating sophisticated automatic evaluators with features like caching, batching, and multi-annotator support. This toolkit allows for comprehensive analysis of evaluator performance, including quality, cost, speed, and statistical measures such as bias and variance.

Additionally, AlpacaEval offers access to a dataset of 20,000 human preferences, including 2,500 cross-annotations, to facilitate comparison between model and reference outputs. The AlpacaEval dataset simplifies the AlpacaFarm evaluation set by combining "instructions" and "inputs" into a single field and providing longer reference outputs.
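The field-merging step is simple; a sketch is below, with the caveat that the exact joining template is an assumption for illustration:

```python
def merge_fields(example):
    """Combine AlpacaFarm-style 'instruction' and 'input' fields
    into the single instruction field AlpacaEval uses. The joining
    template here is hypothetical."""
    instruction = example["instruction"]
    extra = example.get("input", "")
    if extra:
        instruction = f"{instruction}\n\n{extra}"
    return {"instruction": instruction}

example = {"instruction": "Summarize the text.", "input": "LLMs are..."}
merged = merge_fields(example)
print(merged["instruction"])
```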

AlpacaEval 2

AlpacaEval 2 is an upgraded version of the original AlpacaEval, bringing several significant improvements:

  • Speed: AlpacaEval 2 is seven times faster than its predecessor.
  • Cost: The new version is three times cheaper, making it more accessible to a wider range of users.
  • Baseline: AlpacaEval 2 uses GPT-4 Turbo as its baseline, a more advanced language model than the ones used in the original version.
  • Length Bias: The updated version has less length bias, making it fairer when evaluating responses of different lengths.
  • New Models: AlpacaEval 2 supports new models such as Mixtral and Gemini, expanding its utility and versatility.

These improvements make AlpacaEval 2 a more efficient, cost-effective, and versatile tool for evaluating instruction-following language models.

Evaluation Dataset

Our evaluation of automatic annotators on the AlpacaEval set is based on a comparison with 2.5K human annotations. The evaluators range from those provided by Alpaca Farm and Aviary to LMSys, with human annotations serving as a reference. Metrics for all available evaluators and their configurations can be found in the AlpacaEval Github repository.

| Model | Human agreement | Price ($/1k) | Time (seconds/1k) | Bias |
|---|---|---|---|---|
| alpaca_eval_gpt4 | 69.2 | 13.6 | 1455 | 28.4 |
| alpaca_eval_cot_gpt4_turbo_fn | 68.6 | 6.3 | 1989 | 29.3 |
| alpaca_eval_gpt4_turbo_fn | 68.1 | 5.5 | 864 | 30.2 |
| gpt4 | 66.9 | 12.5 | 1037 | 31.5 |
| alpaca_farm_greedy_gpt4 | 66.4 | 15.3 | 878 | 30.2 |
| alpaca_eval_cot_gpt4_turbo_fn | 65.7 | 4.3 | 228 | 33.9 |
| humans | 65.7 | 300.0 | 36800 | 0.0 |
| claude | 65.3 | 3.3 | 173 | 32.4 |
| lmsys_gpt4 | 65.3 | 13.9 | 17982 | 31.6 |
| text_davinci_003 | 64.1 | 8.7 | 121 | 33.8 |
| longest | 62.2 | 0.0 | 0 | 37.8 |
| chatgpt | 57.3 | 0.8 | 285 | 39.4 |

Human agreement

Calculated by comparing an annotator's preferences to the majority of human annotations from our dataset of ~650 instructions, each with 4 human annotations. For automatic annotators, the same method is applied, ensuring comparability.
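A simplified version of this agreement metric can be sketched as follows. The even-split tie handling is an assumption for illustration; AlpacaEval's exact treatment may differ:

```python
from collections import Counter

def human_agreement(annotator_labels, human_labels_per_example):
    """Percentage of examples where the annotator matches the
    majority of the human annotations. An even human split counts
    as half agreement in this sketch."""
    score = 0.0
    for label, humans in zip(annotator_labels, human_labels_per_example):
        counts = Counter(humans).most_common()
        if len(counts) > 1 and counts[0][1] == counts[1][1]:
            score += 0.5          # humans split evenly
        elif label == counts[0][0]:
            score += 1.0
    return 100.0 * score / len(annotator_labels)

labels = ["a", "b", "a"]
humans = [["a", "a", "a", "b"],   # majority a -> match
          ["a", "a", "b", "b"],   # even split -> half credit
          ["b", "b", "b", "a"]]   # majority b -> miss
print(human_agreement(labels, humans))  # 50.0
```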

Price $/1000 examples

Reflects the average cost for 1000 annotations. For human annotations, it's based on the payment to Mechanical Turkers. If the cost is machine-dependent, it's left blank.

Time seconds/1000 examples

Represents the average time to annotate 1000 examples. For humans, it's the median time taken by Mechanical Turkers. For automatic annotators, it's the average time taken in our runs, subject to API limits and cluster load.

Spearman and Pearson correlations

These measure the correlation between leaderboards generated by auto-annotator preferences and human preferences, using method-level agreement from AlpacaFarm annotations. Due to a small sample size of 9 models, these correlations may not be highly reliable.
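To make the leaderboard-correlation idea concrete, here is Spearman's rank correlation computed with the classic d-squared formula over hypothetical win rates (the numbers are made up for illustration):

```python
def ranks(values):
    """Rank positions (1 = largest), assuming no ties for simplicity."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x, y):
    """Spearman rank correlation via 1 - 6*sum(d^2)/(n(n^2-1)),
    valid when there are no tied ranks."""
    n = len(x)
    rx, ry = ranks(x), ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n**2 - 1))

# Hypothetical win rates for 5 models: auto-annotator vs. human
auto  = [50.0, 29.7, 23.6, 17.2, 9.6]
human = [48.0, 25.0, 26.0, 15.0, 10.0]
print(round(spearman(auto, human), 3))  # 0.9
```

A correlation near 1.0 means the auto-annotator ranks models in nearly the same order humans do, even if the absolute win rates differ.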

Bias

Assesses the agreement between the most likely human label and the automatic one, estimated by sampling 4 different annotations per example. A low bias indicates that the annotator's preferences align with humans on average.

Variance

Indicates the consistency of an annotator's preferences, estimated similarly to human agreement. Lower variance means more consistent results across different samples.

Proba. prefer longer

The likelihood that an annotator favors a longer output when there's a significant length difference between two outputs.
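A sketch of how such a length-preference probability might be estimated is below; the "significant difference" threshold of 30 characters is an assumption for illustration, not AlpacaEval's actual cutoff:

```python
def proba_prefer_longer(annotations, min_diff=30):
    """Estimate the probability that the annotator picks the longer
    output, over pairs whose lengths differ by at least `min_diff`
    characters. Each annotation is (len_a, len_b, verdict) with
    verdict "a" or "b"."""
    hits, total = 0, 0
    for len_a, len_b, verdict in annotations:
        if abs(len_a - len_b) < min_diff:
            continue  # lengths too close to count
        longer = "a" if len_a > len_b else "b"
        hits += (verdict == longer)
        total += 1
    return hits / total if total else float("nan")

# Hypothetical (len_a, len_b, verdict) annotations:
anns = [(1200, 400, "a"), (300, 900, "b"), (500, 480, "a"), (200, 800, "a")]
print(proba_prefer_longer(anns))  # ~0.667
```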

Automating Evaluations

The AlpacaEval system is particularly useful for evaluating large language models (LLMs) like GPT-4 and Vicuna. It provides a leaderboard that ranks models based on their performance, and this ranking is highly correlated with rankings based on human annotations. The leaderboard is open to contributions from the community, and new models can be added by running the model on the evaluation set, auto-annotating the outputs, and submitting a pull request with the model configuration and leaderboard results.

AlpacaEval is a single-turn benchmark, meaning it evaluates a model's ability to generate a response to a single instruction or prompt. It's part of a broader toolkit for building advanced automatic evaluators, which can include features like caching, batching, and multi-annotator analysis.

AlpacaEval is distinct from the Alpaca model itself, a fine-tuned version of LLaMA trained on a dataset of 52,000 instruction-following demonstrations. Alpaca's training uses Hugging Face's PEFT library and LoRA (Low-Rank Adaptation) to fit on a single commercial GPU, and leverages techniques like Fully Sharded Data Parallel and mixed-precision training.

However, it's important to note that while AlpacaEval provides a useful comparison of model capabilities in following instructions, it's not a comprehensive measure of a model's overall performance or capabilities.

When to use AlpacaEval

AlpacaEval serves as a rapid and cost-effective proxy for human evaluation in simple instruction-following tasks, ideal for frequent assessments during model development.

However, it should not be used as a sole evaluation method in critical decision-making processes, such as determining model releases. AlpacaEval's limitations include a potential lack of representativeness in its instruction set for advanced LLM applications, a possible bias towards stylistic elements over factual accuracy, and an inability to assess the risks associated with model deployment.

Limitations of AlpacaEval

AlpacaEval's evaluation pipeline, like other evaluators, has significant limitations and is not a substitute for human evaluation in critical settings, such as determining model deployment readiness. The limitations include:

The instruction set may not reflect real-world usage: AlpacaEval's dataset, sourced from various collections (self-instruct, open-assistant, vicuna, koala, hh-rlhf), may not accurately represent the diverse applications and usage of advanced models like GPT-4.

Consequently, the performance of proprietary models (GPT-4, Claude, Mistral Medium, etc.) may appear closer to open models than they actually are. While AlpacaFarm's evaluation set shows a high correlation with user interaction win-rates, and the AlpacaEval leaderboard indicates a larger performance gap between open models and OpenAI models compared to other leaderboards, these findings are preliminary and should be interpreted with caution.

The AlpacaEval analysis indicates that automatic annotators have a tendency to favor longer outputs and those containing lists. This bias is also observed in human annotations, but it may reflect limitations in the annotation process rather than true human preferences. Moreover, automatic annotators often prioritize style over content, such as factuality, and show a preference for outputs from models trained on similar data, as evidenced by disparities in leaderboard rankings.

AlpacaEval focuses solely on instruction-following capabilities and does not assess potential harms, such as toxicity or bias. Therefore, the narrow performance gap between ChatGPT and top open-source models should not be misconstrued as an endorsement for deployment readiness.
