MixEval Benchmark for Open LLMs

by Stephen M. Walker II, Co-Founder / CEO

What is MixEval?

MixEval is a benchmark that merges existing evaluation datasets with real-world user queries to bridge the gap between academic benchmarks and practical applications.

MixEval achieves a high correlation with user-facing evaluations like LMSYS Chatbot Arena but at a fraction of the cost and time.


Key Features of MixEval

MixEval provides a robust framework for evaluating language models by combining high correlation with user evaluations and cost-effectiveness.

The benchmark utilizes dynamic data updates, comprehensive query distribution, and fair grading to ensure reliable and unbiased results.

  • High Correlation — MixEval achieves a 0.96 model ranking correlation with Chatbot Arena, ensuring reliable model evaluation.
  • Cost-Effective — A full MixEval run costs around $0.60 with GPT-3.5 as the judge, significantly cheaper than comparable benchmarks.
  • Dynamic Evaluation — The benchmark uses a rapid data update pipeline to reduce contamination risk.
  • Comprehensive Query Distribution — Based on a large-scale web corpus, MixEval provides a less biased evaluation.
  • Fair Grading — Its ground-truth-based nature ensures an unbiased evaluation process (a rough sketch of model-parsed grading follows this list).
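
To make the grading idea concrete, here is a minimal sketch of ground-truth-based grading with a model parser. This is not MixEval's exact judge prompt or scoring scale; the prompt, the 0-to-1 score, and the use of the OpenAI Python SDK are illustrative assumptions.

    # Minimal sketch of model-parsed, ground-truth-based grading.
    # Assumes the OpenAI Python SDK (>=1.0) and OPENAI_API_KEY in the environment;
    # the judge prompt and 0-1 scoring scale are illustrative, not MixEval's exact setup.
    from openai import OpenAI

    client = OpenAI()

    def grade_answer(question: str, ground_truth: str, model_answer: str) -> float:
        """Ask GPT-3.5-Turbo to score a free-form answer against the reference."""
        prompt = (
            "You are grading a model's answer against a reference answer.\n"
            f"Question: {question}\n"
            f"Reference answer: {ground_truth}\n"
            f"Model answer: {model_answer}\n"
            "Reply with a single number between 0 and 1 indicating correctness."
        )
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        return float(response.choices[0].message.content.strip())

    print(grade_answer("What is the capital of France?", "Paris", "It's Paris."))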

How MixEval Works

MixEval evaluates large language models (LLMs) using a mix of established benchmarks and real-world user queries, yielding a dataset that is comprehensive and less biased than any single benchmark.

The process starts with filtering and annotating the dataset to maintain quality. LLMs are then tested against this dataset, producing detailed performance metrics.

MixEval-Hard focuses on challenging queries to better distinguish model performance. Dynamic evaluation updates the dataset periodically to prevent data contamination, ensuring each evaluation round remains fresh and robust.

MixEval-Hard challenges focus on complex language, ambiguous contexts, and deeper reasoning, pushing models to their limits.
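
One way to picture the mixing step is to embed real-world web queries and benchmark questions, then keep the benchmark questions most similar to what users actually ask. The sketch below illustrates that idea with sentence-transformers; the model name, toy data, and similarity threshold are assumptions for illustration, not MixEval's production pipeline.

    # Sketch: match benchmark questions to real-world queries by embedding similarity.
    # Assumes sentence-transformers is installed; data and threshold are toy examples.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    web_queries = [
        "how do I fix a leaking kitchen faucet",
        "what is the boiling point of water at altitude",
    ]
    benchmark_questions = [
        "At higher altitudes, water boils at a lower temperature because...",
        "Which amendment to the US constitution abolished slavery?",
        "A plumber needs to stop a drip from a compression faucet. What should be replaced?",
    ]

    query_emb = model.encode(web_queries, convert_to_tensor=True)
    bench_emb = model.encode(benchmark_questions, convert_to_tensor=True)

    # Keep benchmark questions whose best-matching web query is similar enough.
    similarity = util.cos_sim(bench_emb, query_emb)  # shape: (num_bench, num_queries)
    for question, scores in zip(benchmark_questions, similarity):
        if scores.max().item() > 0.4:  # illustrative threshold
            print(f"kept: {question}")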

MixEval Versions

MixEval offers different versions to meet various evaluation needs. Each version provides unique insights into model performance.

  • MixEval — The standard version that balances comprehensiveness and efficiency.
  • MixEval-Hard — A more challenging version designed to push model improvements and better distinguish strong models.

MixEval Leaderboard

The MixEval Leaderboard ranks language models based on their performance in MixEval and MixEval-Hard benchmarks. It helps users compare models in terms of efficiency, accuracy, and overall capability.

Arena Elo and MMLU metrics are also included for reference.


Top 3 Performers (July 15, 2024)

Claude 3.5 Sonnet-0620

Claude 3.5 Sonnet-0620 tops the MixEval Leaderboard with a MixEval-Hard score of 68.05 and a MixEval score of 89.9. It excels in accuracy and reliability, scoring 84.2 on MMLU and 85.4 on MMLU-Hard.

GPT-4o-2024-05-13

GPT-4o-2024-05-13 follows with a MixEval-Hard score of 64.7 and a MixEval score of 87.9. It has an Arena Elo of 1287 and scores 85.4 on MMLU and 86.8 on MMLU-Hard, showing strong performance across tasks.

Claude 3 Opus

Claude 3 Opus ranks third with a MixEval-Hard score of 63.5 and a MixEval score of 88.1. It has an Arena Elo of 1248 and scores 83.2 on MMLU and 87.7 on MMLU-Hard, indicating balanced performance.

Model | MixEval-Hard | MixEval | Arena Elo | MMLU
Claude 3.5 Sonnet-0620 | 68.05 | 89.9 | - | 84.2
GPT-4o-2024-05-13 | 64.7 | 87.9 | 1287 | 85.4
Claude 3 Opus | 63.5 | 88.1 | 1248 | 83.2
GPT-4-Turbo-2024-04-09 | 62.6 | 88.8 | 1256 | 82.8
Gemini 1.5 Pro-API-0409 | 58.7 | 84.2 | 1258 | 79.2
Gemini 1.5 Pro-API-0514 | 58.3 | 84.8 | - | 84.0
Yi-Large-preview | 56.8 | 84.4 | 1239 | 80.9
LLaMA-3-70B-Instruct | 55.9 | 84.0 | 1208 | 80.5
Qwen-Max-0428 | 55.8 | 86.1 | 1184 | 80.6
Claude 3 Sonnet | 54.0 | 81.7 | 1201 | 74.7
Reka Core-20240415 | 52.9 | 83.3 | - | 79.3
DeepSeek-V2 | 51.7 | 83.7 | - | 77.3
Command R+ | 51.4 | 81.5 | 1189 | 78.9
Mistral-Large | 50.3 | 84.2 | 1156 | 80.2
Mistral-Medium | 47.8 | 81.9 | 1148 | 76.3
Gemini 1.0 Pro | 46.4 | 78.9 | 1131 | 74.9
Reka Flash-20240226 | 46.2 | 79.8 | 1148 | 75.4
Mistral-Small | 46.2 | 81.2 | - | 75.2
LLaMA-3-8B-Instruct | 45.6 | 75.0 | 1153 | 71.9
Command R | 45.2 | 77.0 | 1147 | 75.0
GPT-3.5-Turbo-0125 | 43.0 | 79.7 | 1102 | 74.5
Claude 3 Haiku | 42.8 | 79.7 | 1178 | 76.1
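
The ranking-correlation claim can be checked mechanically. The sketch below computes the Spearman rank correlation between MixEval-Hard scores and Arena Elo for the leaderboard rows above that report both values (scipy is assumed to be installed; this demonstrates the method on a subset of models, not the published 0.96 figure).

    # Spearman rank correlation between MixEval-Hard and Arena Elo,
    # using the leaderboard rows above that report both values.
    from scipy.stats import spearmanr

    # (MixEval-Hard, Arena Elo) pairs, in table order.
    rows = [
        (64.7, 1287), (63.5, 1248), (62.6, 1256), (58.7, 1258), (56.8, 1239),
        (55.9, 1208), (55.8, 1184), (54.0, 1201), (51.4, 1189), (50.3, 1156),
        (47.8, 1148), (46.4, 1131), (46.2, 1148), (45.6, 1153), (45.2, 1147),
        (43.0, 1102), (42.8, 1178),
    ]
    mixeval_hard, arena_elo = zip(*rows)

    rho, p_value = spearmanr(mixeval_hard, arena_elo)
    print(f"Spearman rank correlation: {rho:.2f} (p = {p_value:.3g})")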

Evaluating Open LLMs with MixEval

MixEval offers a public GitHub repository for evaluating language models. The community fork installed below (philschmid/MixEval) adds several enhancements that make integration and usage easier during model training:

  • Local Model Evaluation — Supports evaluating local models during or after training with Hugging Face Transformers (see the callback sketch after this list).
  • Hugging Face Datasets Integration — Eliminates the need for local files by integrating with Hugging Face Datasets.
  • Accelerated Evaluation — Uses Hugging Face TGI or vLLM to speed up the evaluation process.
  • Enhanced Output — Provides improved markdown outputs and timing for training.
  • Simplified Installation — Fixes pip installation for remote or CI integration.
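
One way to use the local-evaluation support during training is to shell out to the MixEval CLI every time a checkpoint is saved. The callback below is a rough sketch of that pattern, not a built-in feature of the fork; the flags mirror the commands in the steps that follow.

    # Rough sketch: trigger a MixEval run on each saved checkpoint via the CLI.
    # Not a built-in feature of the fork; CLI flags mirror the commands shown below.
    import os
    import subprocess
    from transformers import TrainerCallback

    class MixEvalCallback(TrainerCallback):
        def on_save(self, args, state, control, **kwargs):
            checkpoint = os.path.join(args.output_dir, f"checkpoint-{state.global_step}")
            subprocess.run(
                [
                    "python", "-m", "mix_eval.evaluate",
                    "--data_path", "hf://zeitgeist-ai/mixeval",
                    "--model_path", checkpoint,
                    "--model_name", "local_chat",
                    "--benchmark", "mixeval_hard",
                    "--version", "2024-06-01",
                    "--batch_size", "20",
                    "--output_dir", os.path.join(args.output_dir, "mixeval"),
                    "--api_parallel_num", "20",
                ],
                env={**os.environ, "MODEL_PARSER_API": os.environ["OPENAI_API_KEY"]},
                check=False,  # do not abort training if an evaluation run fails
            )
            return control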

To get started with MixEval, follow these steps:

  1. Install MixEval: Use the following command to install the enhanced version of MixEval:

    pip install git+https://github.com/philschmid/MixEval --upgrade
    
  2. Prepare for Evaluation: Ensure you have a valid OpenAI API key, as GPT-3.5 will be used as the parser.

  3. Evaluate Local LLMs: Run the following command to evaluate local LLMs using MixEval:

    MODEL_PARSER_API=$(echo $OPENAI_API_KEY) python -m mix_eval.evaluate \
        --data_path hf://zeitgeist-ai/mixeval \
        --model_path my/local/path \
        --output_dir results/agi-5 \
        --model_name local_chat \
        --benchmark mixeval_hard \
        --version 2024-06-01 \
        --batch_size 20 \
        --api_parallel_num 20
    
    • data_path: Location of the MixEval dataset (hosted on Hugging Face).
    • model_path: Path to your trained model (can be a local path or a Hugging Face ID).
    • output_dir: Directory to store the results.
    • model_name: Name of the model (use local_chat for the agnostic version).
    • benchmark: Choose between mixeval or mixeval_hard.
    • version: Version of MixEval.
    • batch_size: Batch size for the generator model.
    • api_parallel_num: Number of parallel requests sent to OpenAI.
  4. Accelerated Evaluation with vLLM: For faster evaluation, use vLLM or Hugging Face TGI (a quick check of the endpoint is sketched after these steps):

    • Start your LLM serving framework:

      python -m vllm.entrypoints.openai.api_server --model alignment-handbook/zephyr-7b-dpo-full
      
    • Run MixEval in another terminal:

      MODEL_PARSER_API=$(echo $OPENAI_API_KEY) API_URL=http://localhost:8000/v1 python -m mix_eval.evaluate \
          --data_path hf://zeitgeist-ai/mixeval \
          --model_name local_api \
          --model_path alignment-handbook/zephyr-7b-dpo-full \
          --benchmark mixeval_hard \
          --version 2024-06-01 \
          --batch_size 20 \
          --output_dir results \
          --api_parallel_num 20
      
    • API_URL: URL where your model is hosted.

    • API_KEY: Optional, for authorization if needed.

    • model_name: Use local_api for the hosted API model.

    • model_path: Model ID in your hosted API (e.g., Hugging Face model ID).
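
Before launching the full evaluation against a hosted endpoint, it can help to confirm that the OpenAI-compatible server is actually responding. A minimal check, assuming vLLM is serving the model at the API_URL used above:

    # Quick sanity check that the vLLM OpenAI-compatible endpoint is up.
    # Assumes the server from the step above is running at http://localhost:8000/v1.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

    response = client.chat.completions.create(
        model="alignment-handbook/zephyr-7b-dpo-full",
        messages=[{"role": "user", "content": "Say hello in one short sentence."}],
        max_tokens=32,
    )
    print(response.choices[0].message.content)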

After running the evaluation, MixEval will provide a detailed breakdown of the model's performance across various metrics.
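
The exact files written under --output_dir vary by MixEval version, so a generic way to inspect a finished run is to walk the directory and preview any JSON it contains. The sketch below assumes the output_dir from the earlier command; the file names and schema are not a documented contract.

    # Generic sketch for inspecting MixEval output; file names and structure
    # depend on the MixEval version, so this simply previews whatever JSON exists.
    import json
    from pathlib import Path

    output_dir = Path("results/agi-5")  # the --output_dir used above

    for path in sorted(output_dir.rglob("*.json")):
        with path.open() as f:
            data = json.load(f)
        print(f"== {path.relative_to(output_dir)} ==")
        print(json.dumps(data, indent=2)[:500])  # preview the first 500 characters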

For more detailed instructions, refer to the MixEval documentation.

More terms

What is Binary classification?

Binary classification is a supervised learning task in machine learning that categorizes new observations into one of two classes. The goal is to predict which of two possible classes a data instance belongs to, with the outcome represented as 1 or 0, true or false, yes or no, and so on.
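
As a concrete illustration, the sketch below trains a minimal binary classifier with scikit-learn; the synthetic dataset and the choice of logistic regression are just for demonstration.

    # Minimal binary classification example: predict one of two classes (0 or 1).
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Synthetic two-class dataset for illustration.
    X, y = make_classification(n_samples=1000, n_features=10, n_classes=2, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    predictions = model.predict(X_test)  # each prediction is 0 or 1

    print(f"Accuracy: {accuracy_score(y_test, predictions):.2f}")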


What is AI Quality Control?

AI Quality is determined by evaluating an AI system's performance, societal impact, operational compatibility, and data quality. Performance is measured by the accuracy and generalization of the AI model's predictions, along with its robustness, fairness, and privacy. Societal impact considers ethical implications, including bias and fairness. Operational compatibility ensures the AI system integrates well within its environment, and data quality is critical for the model's predictive power and reliability.
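
As a small illustration of how two of these dimensions can be quantified, the sketch below computes predictive accuracy and a simple demographic-parity gap on toy data; the data and the choice of fairness metric are illustrative assumptions, not a complete quality framework.

    # Illustrative check of two AI-quality dimensions: accuracy and a fairness gap.
    import numpy as np

    y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
    y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])
    group  = np.array(["a", "a", "a", "a", "b", "b", "b", "b"])  # e.g. a protected attribute

    accuracy = (y_true == y_pred).mean()

    # Demographic parity difference: gap in positive prediction rates between groups.
    rate_a = y_pred[group == "a"].mean()
    rate_b = y_pred[group == "b"].mean()
    parity_gap = abs(rate_a - rate_b)

    print(f"accuracy={accuracy:.2f}, demographic parity gap={parity_gap:.2f}")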

