MixEval Benchmark for Open LLMs

by Stephen M. Walker II, Co-Founder / CEO

What is MixEval?

MixEval is a benchmark that merges existing evaluation datasets with real-world user queries to bridge the gap between academic benchmarks and practical applications.

MixEval achieves a high correlation with user-facing evaluations like LMSYS Chatbot Arena but at a fraction of the cost and time.


Key Features of MixEval

MixEval provides a robust framework for evaluating language models by combining high correlation with user evaluations and cost-effectiveness.

The benchmark utilizes dynamic data updates, comprehensive query distribution, and fair grading to ensure reliable and unbiased results.

  • High Correlation — MixEval achieves a 0.96 model ranking correlation with Chatbot Arena, ensuring reliable model evaluation.
  • Cost-Effective — Running MixEval costs around $0.60 with GPT-3.5 as the judge, significantly cheaper than comparable benchmarks (a rough cost breakdown follows this list).
  • Dynamic Evaluation — The benchmark uses a rapid data update pipeline to reduce contamination risk.
  • Comprehensive Query Distribution — Queries are matched to the distribution of a large-scale web corpus of real user queries, giving a less topic-biased evaluation.
  • Fair Grading — Its ground-truth-based nature ensures an unbiased evaluation process.
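
To make the cost figure concrete, here is some rough arithmetic in Python. The roughly 4,000 graded queries reflect the size reported for MixEval; the per-item token counts and the GPT-3.5-Turbo prices ($0.50 per million input tokens, $1.50 per million output tokens, as of mid-2024) are assumptions, so treat the result as an order-of-magnitude estimate rather than the exact published figure.

    # Rough, illustrative estimate of the judge cost for one MixEval run.
    # Assumed values: ~4,000 graded items, ~250 input tokens and ~30 output
    # tokens per judge call, and mid-2024 GPT-3.5-Turbo pricing.
    n_items = 4_000
    input_tokens_per_item = 250    # assumption: prompt + candidate answer + ground truth
    output_tokens_per_item = 30    # assumption: short judge verdict
    price_in = 0.50 / 1_000_000    # USD per input token (gpt-3.5-turbo-0125)
    price_out = 1.50 / 1_000_000   # USD per output token

    cost = n_items * (input_tokens_per_item * price_in + output_tokens_per_item * price_out)
    print(f"estimated judge cost: ${cost:.2f}")  # about $0.68 with these assumptions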

How MixEval Works

MixEval evaluates large language models (LLMs) by matching real-world user queries to similar, ground-truth-annotated questions from established benchmarks. Mixing the two sources yields a test set that is comprehensive and less biased than any single benchmark.

The process starts with filtering and annotating the dataset to maintain quality. LLMs are then tested against this dataset, producing detailed performance metrics.

MixEval-Hard focuses on challenging queries to better distinguish model performance. Dynamic evaluation updates the dataset periodically to prevent data contamination, ensuring each evaluation round remains fresh and robust.

MixEval-Hard challenges focus on complex language, ambiguous contexts, and deeper reasoning, pushing models to their limits.
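
To make the grading step concrete, here is a minimal, illustrative sketch of a ground-truth evaluation loop in Python. It is not MixEval's actual implementation: the item set is a toy example, the grade function uses simple string matching where MixEval uses a model parser (GPT-3.5), and the dummy model stands in for the LLM under test.

    # Illustrative only: a toy version of ground-truth-based grading.
    # MixEval mixes items from existing benchmarks with real-world-style
    # queries and grades free-form answers with an LLM parser; here the
    # parser is replaced by naive string matching.
    from dataclasses import dataclass

    @dataclass
    class Item:
        query: str          # user-style question
        ground_truth: str   # reference answer from the source benchmark
        source: str         # which benchmark the item was drawn from

    def grade(prediction: str, ground_truth: str) -> bool:
        """Stand-in for the model parser (MixEval uses GPT-3.5 here)."""
        return ground_truth.strip().lower() in prediction.strip().lower()

    def evaluate(model_fn, items) -> float:
        """Score a model callable (query -> answer) against the mixed item set."""
        correct = sum(grade(model_fn(item.query), item.ground_truth) for item in items)
        return correct / len(items)

    if __name__ == "__main__":
        items = [  # toy mixed set; the real benchmark draws thousands of items
            Item("What is the capital of France?", "Paris", "web-matched trivia"),
            Item("2 + 2 * 3 = ?", "8", "web-matched math"),
        ]
        dummy_model = lambda q: "The answer is Paris." if "France" in q else "8"
        print(f"accuracy: {evaluate(dummy_model, items):.2f}")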

MixEval Versions

MixEval offers different versions to meet various evaluation needs. Each version provides unique insights into model performance.

  • MixEval — The standard version that balances comprehensiveness and efficiency.
  • MixEval-Hard — A more challenging version designed to push model improvements and better distinguish strong models.

MixEval Leaderboard

The MixEval Leaderboard ranks language models based on their performance in MixEval and MixEval-Hard benchmarks. It helps users compare models in terms of efficiency, accuracy, and overall capability.

Arena Elo and MMLU metrics are also included for reference.


Top 3 Performers (July 15, 2024)

Claude 3.5 Sonnet-0620

Claude 3.5 Sonnet-0620 tops the MixEval Leaderboard with 68.05 on MixEval-Hard and 89.9 on MixEval. It leads in accuracy and reliability, scoring 84.2 on MMLU and 85.4 on MMLU-Hard.

GPT-4o-2024-05-13

GPT-4o-2024-05-13 follows with a MixEval-Hard score of 64.7 and a MixEval score of 87.9. It has an Arena Elo of 1287 and scores 85.4 on MMLU and 86.8 on MMLU-Hard, showing strong performance across tasks.

Claude 3 Opus

Claude 3 Opus ranks third with a MixEval-Hard score of 63.5 and a MixEval score of 88.1. It has an Arena Elo of 1248 and scores 83.2 on MMLU and 87.7 on MMLU-Hard, indicating balanced performance.

Model                      MixEval-Hard   MixEval   Arena Elo   MMLU
Claude 3.5 Sonnet-0620     68.05          89.9      -           84.2
GPT-4o-2024-05-13          64.7           87.9      1287        85.4
Claude 3 Opus              63.5           88.1      1248        83.2
GPT-4-Turbo-2024-04-09     62.6           88.8      1256        82.8
Gemini 1.5 Pro-API-0409    58.7           84.2      1258        79.2
Gemini 1.5 Pro-API-0514    58.3           84.8      -           84.0
Yi-Large-preview           56.8           84.4      1239        80.9
LLaMA-3-70B-Instruct       55.9           84.0      1208        80.5
Qwen-Max-0428              55.8           86.1      1184        80.6
Claude 3 Sonnet            54.0           81.7      1201        74.7
Reka Core-20240415         52.9           83.3      -           79.3
DeepSeek-V2                51.7           83.7      -           77.3
Command R+                 51.4           81.5      1189        78.9
Mistral-Large              50.3           84.2      1156        80.2
Mistral-Medium             47.8           81.9      1148        76.3
Gemini 1.0 Pro             46.4           78.9      1131        74.9
Reka Flash-20240226        46.2           79.8      1148        75.4
Mistral-Small              46.2           81.2      -           75.2
LLaMA-3-8B-Instruct        45.6           75.0      1153        71.9
Command R                  45.2           77.0      1147        75.0
GPT-3.5-Turbo-0125         43.0           79.7      1102        74.5
Claude 3 Haiku             42.8           79.7      1178        76.1
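
As a rough sanity check on the correlation claim, the snippet below computes a Spearman rank correlation between MixEval-Hard and Arena Elo using only the leaderboard rows above that report an Elo value. It requires scipy, covers only this subset of models, and uses plain rank correlation rather than the exact methodology behind the published 0.96 figure, so the printed value is indicative only.

    # Spearman rank correlation between MixEval-Hard and Arena Elo,
    # using only the leaderboard rows above that report an Elo value.
    from scipy.stats import spearmanr

    rows = [  # (model, MixEval-Hard, Arena Elo), copied from the table above
        ("GPT-4o-2024-05-13", 64.7, 1287),
        ("Claude 3 Opus", 63.5, 1248),
        ("GPT-4-Turbo-2024-04-09", 62.6, 1256),
        ("Gemini 1.5 Pro-API-0409", 58.7, 1258),
        ("Yi-Large-preview", 56.8, 1239),
        ("LLaMA-3-70B-Instruct", 55.9, 1208),
        ("Qwen-Max-0428", 55.8, 1184),
        ("Claude 3 Sonnet", 54.0, 1201),
        ("Command R+", 51.4, 1189),
        ("Mistral-Large", 50.3, 1156),
        ("Mistral-Medium", 47.8, 1148),
        ("Gemini 1.0 Pro", 46.4, 1131),
        ("Reka Flash-20240226", 46.2, 1148),
        ("LLaMA-3-8B-Instruct", 45.6, 1153),
        ("Command R", 45.2, 1147),
        ("GPT-3.5-Turbo-0125", 43.0, 1102),
        ("Claude 3 Haiku", 42.8, 1178),
    ]

    rho, p_value = spearmanr([r[1] for r in rows], [r[2] for r in rows])
    print(f"Spearman correlation over {len(rows)} models: {rho:.2f} (p={p_value:.1e})")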

Evaluating Open LLMs with MixEval

MixEval is available as a public GitHub repository for evaluating language models. The community fork used in this guide (philschmid/MixEval) adds several enhancements that make integration and use during model training easier:

  • Local Model Evaluation — Supports evaluating local models during or after training with transformers.
  • Hugging Face Datasets Integration — Eliminates the need for local files by integrating with Hugging Face Datasets.
  • Accelerated Evaluation — Uses Hugging Face TGI or vLLM to speed up the evaluation process.
  • Enhanced Output — Provides improved markdown summaries and timing information for training runs.
  • Simplified Installation — Fixes pip installation so the package can be installed from a remote source or in CI.

To get started with MixEval, follow these steps:

  1. Install MixEval: Use the following command to install the enhanced version of MixEval:

    pip install git+https://github.com/philschmid/MixEval --upgrade
    
  2. Prepare for Evaluation: Ensure you have a valid OpenAI API key available as OPENAI_API_KEY, since GPT-3.5 is used as the answer parser.

  3. Evaluate Local LLMs: Run the following command to evaluate local LLMs using MixEval:

    MODEL_PARSER_API=$(echo $OPENAI_API_KEY) python -m mix_eval.evaluate \
        --data_path hf://zeitgeist-ai/mixeval \
        --model_path my/local/path \
        --output_dir results/agi-5 \
        --model_name local_chat \
        --benchmark mixeval_hard \
        --version 2024-06-01 \
        --batch_size 20 \
        --api_parallel_num 20
    
    • data_path: Location of the MixEval dataset (hosted on Hugging Face).
    • model_path: Path to your trained model (can be a local path or a Hugging Face ID).
    • output_dir: Directory to store the results.
    • model_name: Name of the model (use local_chat for the model-agnostic local version).
    • benchmark: Choose between mixeval or mixeval_hard.
    • version: Version of MixEval.
    • batch_size: Batch size for the generator model.
    • api_parallel_num: Number of parallel requests sent to OpenAI.
  4. Accelerated Evaluation with vLLM: For faster evaluation, serve the model with vLLM or Hugging Face TGI (a quick sanity-check snippet follows these steps):

    • Start your LLM serving framework:

      python -m vllm.entrypoints.openai.api_server --model alignment-handbook/zephyr-7b-dpo-full
      
    • Run MixEval in another terminal:

      MODEL_PARSER_API=$(echo $OPENAI_API_KEY) API_URL=http://localhost:8000/v1 python -m mix_eval.evaluate \
          --data_path hf://zeitgeist-ai/mixeval \
          --model_name local_api \
          --model_path alignment-handbook/zephyr-7b-dpo-full \
          --benchmark mixeval_hard \
          --version 2024-06-01 \
          --batch_size 20 \
          --output_dir results \
          --api_parallel_num 20
      
    • API_URL: URL where your model is hosted.
    • API_KEY: Optional, for authorization if needed.
    • model_name: Use local_api for the hosted API model.
    • model_path: Model ID in your hosted API (e.g., Hugging Face model ID).
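
Before kicking off the full evaluation against the vLLM server, it can be worth confirming that the OpenAI-compatible endpoint is actually serving the model. The sketch below assumes the openai Python client (v1 or later) is installed and that the server started in the previous step is listening on localhost:8000; vLLM ignores the API key, so any placeholder value works.

    # Quick sanity check against the vLLM OpenAI-compatible server started above.
    # Assumes `pip install openai` (v1+) and the server running on localhost:8000.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # key is ignored by vLLM

    # The served model ID should appear in this list.
    print([m.id for m in client.models.list().data])

    resp = client.chat.completions.create(
        model="alignment-handbook/zephyr-7b-dpo-full",
        messages=[{"role": "user", "content": "Reply with the single word: ready"}],
        max_tokens=5,
    )
    print(resp.choices[0].message.content)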

After running the evaluation, MixEval will provide a detailed breakdown of the model's performance across various metrics.
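
The layout of the output directory can vary between MixEval versions, so the helper below assumes nothing beyond the run writing JSON files somewhere under the chosen output_dir (results/agi-5 in the example above). It simply locates those files and prints their top-level keys so you know which score files to open.

    # List any JSON result files produced under the output directory and show
    # their top-level keys. The directory layout is version-dependent, so this
    # only locates files rather than assuming a specific schema.
    import json
    from pathlib import Path

    output_dir = Path("results/agi-5")  # same --output_dir as in the evaluation command
    for path in sorted(output_dir.rglob("*.json")):
        try:
            with path.open() as f:
                data = json.load(f)
        except (json.JSONDecodeError, OSError):
            continue  # skip partial or non-JSON files
        summary = list(data)[:5] if isinstance(data, dict) else f"{type(data).__name__} of {len(data)} items"
        print(f"{path}: {summary}")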

For more detailed instructions, refer to the MixEval documentation.

