MixEval Benchmark for Open LLMs
by Stephen M. Walker II, Co-Founder / CEO
What is MixEval?
MixEval is a benchmark that merges existing evaluation datasets with real-world user queries to bridge the gap between academic benchmarks and practical applications.
MixEval achieves a high correlation with user-facing evaluations like LMSYS Chatbot Arena but at a fraction of the cost and time.
Key Features of MixEval
MixEval provides a robust framework for evaluating language models by combining high correlation with user evaluations and cost-effectiveness.
The benchmark utilizes dynamic data updates, comprehensive query distribution, and fair grading to ensure reliable and unbiased results.
- High Correlation — MixEval achieves a 0.96 model ranking correlation with Chatbot Arena, ensuring reliable model evaluation.
- Cost-Effective — Running MixEval costs around $0.60 when using GPT-3.5 as the judge, far cheaper than other user-facing benchmarks.
- Dynamic Evaluation — The benchmark uses a rapid data update pipeline to reduce contamination risk.
- Comprehensive Query Distribution — Based on a large-scale web corpus, MixEval provides a less biased evaluation.
- Fair Grading — Its ground-truth-based nature ensures an unbiased evaluation process.
How MixEval Works
MixEval evaluates large language models (LLMs) using a mix of established benchmarks and real-world user queries. This ensures a comprehensive and unbiased dataset.
The process starts with filtering and annotating the dataset to maintain quality. LLMs are then tested against this dataset, producing detailed performance metrics.
MixEval-Hard focuses on challenging queries to better distinguish model performance, emphasizing complex language, ambiguous contexts, and deeper reasoning that push models to their limits. Dynamic evaluation periodically refreshes the dataset to reduce the risk of data contamination, keeping each evaluation round fresh and robust.
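At its core, the mixture step is a similarity search: real-world queries mined from the web are matched against questions from existing ground-truth benchmarks, so the final set mirrors real usage while staying automatically gradable. The sketch below illustrates that general idea with sentence embeddings; the encoder name, threshold, and example queries are illustrative assumptions, not MixEval's actual pipeline.

```python
# Illustrative sketch of mixing benchmark questions with web queries via
# embedding similarity. Encoder, threshold, and data are assumptions for
# illustration; the real MixEval pipeline adds filtering and annotation.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder

web_queries = [
    "how many bones does an adult human have",
    "what causes inflation to rise",
]
benchmark_pool = [  # questions drawn from existing ground-truth benchmarks
    "How many bones are in the adult human body?",
    "Which factor most directly drives demand-pull inflation?",
    "What is the boiling point of water at sea level?",
]

web_emb = encoder.encode(web_queries, convert_to_tensor=True, normalize_embeddings=True)
pool_emb = encoder.encode(benchmark_pool, convert_to_tensor=True, normalize_embeddings=True)
scores = util.cos_sim(web_emb, pool_emb)

# For each web query, keep the closest benchmark question above a threshold.
mixed_set = []
for i, query in enumerate(web_queries):
    j = int(scores[i].argmax())
    if float(scores[i][j]) >= 0.5:  # assumed threshold
        mixed_set.append({"web_query": query, "benchmark_question": benchmark_pool[j]})

print(mixed_set)
```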
MixEval Versions
MixEval offers different versions to meet various evaluation needs. Each version provides unique insights into model performance.
- MixEval — The standard version that balances comprehensiveness and efficiency.
- MixEval-Hard — A more challenging version designed to push model improvements and better distinguish strong models.
MixEval Leaderboard
The MixEval Leaderboard ranks language models based on their performance in MixEval and MixEval-Hard benchmarks. It helps users compare models in terms of efficiency, accuracy, and overall capability.
Arena Elo and MMLU metrics are also included for reference.
Top 3 Performers (July 15, 2024)
Claude 3.5 Sonnet-0620
Claude 3.5 Sonnet-0620 tops the MixEval Leaderboard with scores of 68.05 on MixEval-Hard and 89.9 on MixEval, leading on both benchmark variants. It posts an MMLU score of 84.2 and an MMLU-Hard score of 85.4.
GPT-4o-2024-05-13
GPT-4o-2024-05-13 follows with a MixEval-Hard score of 64.7 and a MixEval score of 87.9. It has an Arena Elo of 1287, an MMLU score of 85.4, and an MMLU-Hard score of 86.8, showing strong performance across tasks.
Claude 3 Opus
Claude 3 Opus ranks third with a MixEval-Hard score of 63.5 and a MixEval score of 88.1. It has an Arena Elo of 1248, an MMLU score of 83.2, and an MMLU-Hard score of 87.7, indicating balanced performance.
| Model | MixEval-Hard | MixEval | Arena Elo | MMLU |
|---|---|---|---|---|
| Claude 3.5 Sonnet-0620 | 68.05 | 89.9 | - | 84.2 |
| GPT-4o-2024-05-13 | 64.7 | 87.9 | 1287 | 85.4 |
| Claude 3 Opus | 63.5 | 88.1 | 1248 | 83.2 |
| GPT-4-Turbo-2024-04-09 | 62.6 | 88.8 | 1256 | 82.8 |
| Gemini 1.5 Pro-API-0409 | 58.7 | 84.2 | 1258 | 79.2 |
| Gemini 1.5 Pro-API-0514 | 58.3 | 84.8 | - | 84.0 |
| Yi-Large-preview | 56.8 | 84.4 | 1239 | 80.9 |
| LLaMA-3-70B-Instruct | 55.9 | 84.0 | 1208 | 80.5 |
| Qwen-Max-0428 | 55.8 | 86.1 | 1184 | 80.6 |
| Claude 3 Sonnet | 54.0 | 81.7 | 1201 | 74.7 |
| Reka Core-20240415 | 52.9 | 83.3 | - | 79.3 |
| DeepSeek-V2 | 51.7 | 83.7 | - | 77.3 |
| Command R+ | 51.4 | 81.5 | 1189 | 78.9 |
| Mistral-Large | 50.3 | 84.2 | 1156 | 80.2 |
| Mistral-Medium | 47.8 | 81.9 | 1148 | 76.3 |
| Gemini 1.0 Pro | 46.4 | 78.9 | 1131 | 74.9 |
| Reka Flash-20240226 | 46.2 | 79.8 | 1148 | 75.4 |
| Mistral-Small | 46.2 | 81.2 | - | 75.2 |
| LLaMA-3-8B-Instruct | 45.6 | 75.0 | 1153 | 71.9 |
| Command R | 45.2 | 77.0 | 1147 | 75.0 |
| GPT-3.5-Turbo-0125 | 43.0 | 79.7 | 1102 | 74.5 |
| Claude 3 Haiku | 42.8 | 79.7 | 1178 | 76.1 |
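As a rough sanity check of the correlation claim above, the sketch below computes the Spearman rank correlation between MixEval-Hard scores and Arena Elo for the table rows that report both values. With only this handful of models, the result is merely indicative and will not reproduce the paper's 0.96 figure exactly.

```python
# Rank correlation between MixEval-Hard and Arena Elo for the rows above
# that report both values (rows with "-" for Arena Elo are excluded).
from scipy.stats import spearmanr

rows = {
    "GPT-4o-2024-05-13":       (64.7, 1287),
    "Claude 3 Opus":           (63.5, 1248),
    "GPT-4-Turbo-2024-04-09":  (62.6, 1256),
    "Gemini 1.5 Pro-API-0409": (58.7, 1258),
    "Yi-Large-preview":        (56.8, 1239),
    "LLaMA-3-70B-Instruct":    (55.9, 1208),
    "Qwen-Max-0428":           (55.8, 1184),
    "Claude 3 Sonnet":         (54.0, 1201),
    "Command R+":              (51.4, 1189),
    "Mistral-Large":           (50.3, 1156),
    "Mistral-Medium":          (47.8, 1148),
    "Gemini 1.0 Pro":          (46.4, 1131),
    "Reka Flash-20240226":     (46.2, 1148),
    "LLaMA-3-8B-Instruct":     (45.6, 1153),
    "Command R":               (45.2, 1147),
    "GPT-3.5-Turbo-0125":      (43.0, 1102),
    "Claude 3 Haiku":          (42.8, 1178),
}

mixeval_hard = [v[0] for v in rows.values()]
arena_elo = [v[1] for v in rows.values()]
rho, _ = spearmanr(mixeval_hard, arena_elo)
print(f"Spearman rank correlation on this subset: {rho:.2f}")
```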
Evaluating Open LLMs with MixEval
MixEval's evaluation code is available in a public GitHub repository, and a community fork (philschmid/MixEval, installed below) adds several enhancements that make integration and usage easier during model training:
- Local Model Evaluation — Supports evaluating local models during or after training with Hugging Face Transformers (see the sketch after this list).
- Hugging Face Datasets Integration — Eliminates the need for local files by integrating with Hugging Face Datasets.
- Accelerated Evaluation — Uses Hugging Face TGI or vLLM to speed up the evaluation process.
- Enhanced Output — Provides improved markdown output and timing information for training runs.
- Simplified Installation — Fixes pip installation for remote or CI integration.
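To make the "evaluate during or after training" workflow concrete, here is a minimal sketch that shells out to MixEval after a checkpoint is saved. The helper name and defaults are hypothetical; the flags mirror the CLI command shown in step 3 below.

```python
# Hypothetical helper that runs MixEval on a saved checkpoint by invoking the
# same CLI command shown in the steps below. Names and defaults here are
# illustrative assumptions.
import os
import subprocess

def run_mixeval(checkpoint_dir: str, output_dir: str, benchmark: str = "mixeval_hard") -> None:
    # MixEval reads the parser key from MODEL_PARSER_API (set from OPENAI_API_KEY).
    env = dict(os.environ, MODEL_PARSER_API=os.environ["OPENAI_API_KEY"])
    cmd = [
        "python", "-m", "mix_eval.evaluate",
        "--data_path", "hf://zeitgeist-ai/mixeval",
        "--model_path", checkpoint_dir,
        "--output_dir", output_dir,
        "--model_name", "local_chat",
        "--benchmark", benchmark,
        "--version", "2024-06-01",
        "--batch_size", "20",
        "--api_parallel_num", "20",
    ]
    subprocess.run(cmd, env=env, check=True)

# e.g., after saving a checkpoint in a training loop:
# run_mixeval("checkpoints/step-1000", "results/step-1000")
```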
To get started with MixEval, follow these steps:
1. Install MixEval: Use the following command to install the enhanced fork of MixEval:

       pip install git+https://github.com/philschmid/MixEval --upgrade

2. Prepare for Evaluation: Ensure you have a valid OpenAI API key, as GPT-3.5 will be used as the parser.

3. Evaluate Local LLMs: Run the following command to evaluate a local LLM with MixEval:

       MODEL_PARSER_API=$(echo $OPENAI_API_KEY) python -m mix_eval.evaluate \
           --data_path hf://zeitgeist-ai/mixeval \
           --model_path my/local/path \
           --output_dir results/agi-5 \
           --model_name local_chat \
           --benchmark mixeval_hard \
           --version 2024-06-01 \
           --batch_size 20 \
           --api_parallel_num 20

   - data_path: Location of the MixEval dataset (hosted on Hugging Face).
   - model_path: Path to your trained model (a local path or a Hugging Face ID).
   - output_dir: Directory to store the results.
   - model_name: Name of the model (use local_chat for the model-agnostic version).
   - benchmark: Choose between mixeval and mixeval_hard.
   - version: Version of MixEval to run.
   - batch_size: Batch size for the generator model.
   - api_parallel_num: Number of parallel requests sent to OpenAI.
4. Accelerated Evaluation with vLLM: For faster evaluation, use vLLM or Hugging Face TGI.

   Start your LLM serving framework:

       python -m vllm.entrypoints.openai.api_server --model alignment-handbook/zephyr-7b-dpo-full

   Run MixEval in another terminal:

       MODEL_PARSER_API=$(echo $OPENAI_API_KEY) API_URL=http://localhost:8000/v1 python -m mix_eval.evaluate \
           --data_path hf://zeitgeist-ai/mixeval \
           --model_name local_api \
           --model_path alignment-handbook/zephyr-7b-dpo-full \
           --benchmark mixeval_hard \
           --version 2024-06-01 \
           --batch_size 20 \
           --output_dir results \
           --api_parallel_num 20
   - API_URL: URL where your model is hosted.
   - API_KEY: Optional, for authorization if needed.
   - model_name: Use local_api for the hosted API model.
   - model_path: Model ID in your hosted API (e.g., a Hugging Face model ID).
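Before pointing MixEval at a locally served model, it can help to confirm the OpenAI-compatible endpoint is reachable. A minimal sketch, assuming the vLLM server from the previous step is listening on its default port 8000:

```python
# Quick sanity check that the OpenAI-compatible server is reachable before
# running MixEval. Assumes the vLLM server started above on localhost:8000.
import requests

API_URL = "http://localhost:8000/v1"

# List the models the server is exposing.
models = requests.get(f"{API_URL}/models", timeout=10).json()
print([m["id"] for m in models["data"]])

# Send one short chat completion to the served model.
resp = requests.post(
    f"{API_URL}/chat/completions",
    json={
        "model": "alignment-handbook/zephyr-7b-dpo-full",
        "messages": [{"role": "user", "content": "Say hello in one word."}],
        "max_tokens": 8,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```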
After running the evaluation, MixEval will provide a detailed breakdown of the model's performance across various metrics.
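The exact file layout under the output directory depends on the MixEval version, so the sketch below simply walks output_dir and previews any JSON result files it finds rather than assuming a specific schema.

```python
# Print whatever JSON result files MixEval wrote under the output directory.
# The exact layout and schema depend on the MixEval version, so nothing
# beyond "JSON files under output_dir" is assumed here.
import json
from pathlib import Path

output_dir = Path("results/agi-5")  # same --output_dir as in the command above

for path in sorted(output_dir.rglob("*.json")):
    with path.open() as f:
        data = json.load(f)
    print(f"== {path.relative_to(output_dir)} ==")
    print(json.dumps(data, indent=2)[:500])  # preview the first 500 characters
```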
For more detailed instructions, refer to the MixEval documentation.