MixEval Benchmark for Open LLMs

by Stephen M. Walker II, Co-Founder / CEO

What is MixEval?

MixEval is a benchmark that merges existing evaluation datasets with real-world user queries to bridge the gap between academic benchmarks and practical applications.

MixEval achieves a high correlation with user-facing evaluations like LMSYS Chatbot Arena but at a fraction of the cost and time.


Key Features of MixEval

MixEval provides a robust framework for evaluating language models by combining high correlation with user evaluations and cost-effectiveness.

The benchmark utilizes dynamic data updates, comprehensive query distribution, and fair grading to ensure reliable and unbiased results.

  • High Correlation — MixEval achieves a 0.96 model ranking correlation with Chatbot Arena, ensuring reliable model evaluation.
  • Cost-Effective — A full MixEval run costs around $0.60 with GPT-3.5 as the judge, significantly cheaper than comparable benchmarks.
  • Dynamic Evaluation — The benchmark uses a rapid data update pipeline to reduce contamination risk.
  • Comprehensive Query Distribution — Based on a large-scale web corpus, MixEval provides a less biased evaluation.
  • Fair Grading — Its ground-truth-based nature ensures an unbiased evaluation process (a rough sketch of model-parsed grading follows this list).
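
To make the grading idea concrete, here is a minimal sketch of ground-truth-based grading with a model parser. This is not MixEval's exact judge prompt or scoring scale; the prompt, the 0-to-1 score, and the use of the OpenAI Python SDK are illustrative assumptions.

    # Minimal sketch of model-parsed, ground-truth-based grading.
    # Assumes the OpenAI Python SDK (>=1.0) and OPENAI_API_KEY in the environment;
    # the judge prompt and 0-1 scoring scale are illustrative, not MixEval's exact setup.
    from openai import OpenAI

    client = OpenAI()

    def grade_answer(question: str, ground_truth: str, model_answer: str) -> float:
        """Ask GPT-3.5-Turbo to score a free-form answer against the reference."""
        prompt = (
            "You are grading a model's answer against a reference answer.\n"
            f"Question: {question}\n"
            f"Reference answer: {ground_truth}\n"
            f"Model answer: {model_answer}\n"
            "Reply with a single number between 0 and 1 indicating correctness."
        )
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        return float(response.choices[0].message.content.strip())

    print(grade_answer("What is the capital of France?", "Paris", "It's Paris."))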

How MixEval Works

MixEval evaluates large language models (LLMs) using a mix of established benchmarks and real-world user queries, yielding a dataset that is comprehensive and less biased than any single benchmark.

The process starts with filtering and annotating the dataset to maintain quality. LLMs are then tested against this dataset, producing detailed performance metrics.

MixEval-Hard focuses on challenging queries to better distinguish model performance. Dynamic evaluation updates the dataset periodically to prevent data contamination, ensuring each evaluation round remains fresh and robust.

MixEval-Hard challenges focus on complex language, ambiguous contexts, and deeper reasoning, pushing models to their limits.
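
One way to picture the mixing step is to embed real-world web queries and benchmark questions, then keep the benchmark questions most similar to what users actually ask. The sketch below illustrates that idea with sentence-transformers; the model name, toy data, and similarity threshold are assumptions for illustration, not MixEval's production pipeline.

    # Sketch: match benchmark questions to real-world queries by embedding similarity.
    # Assumes sentence-transformers is installed; data and threshold are toy examples.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    web_queries = [
        "how do I fix a leaking kitchen faucet",
        "what is the boiling point of water at altitude",
    ]
    benchmark_questions = [
        "At higher altitudes, water boils at a lower temperature because...",
        "Which amendment to the US constitution abolished slavery?",
        "A plumber needs to stop a drip from a compression faucet. What should be replaced?",
    ]

    query_emb = model.encode(web_queries, convert_to_tensor=True)
    bench_emb = model.encode(benchmark_questions, convert_to_tensor=True)

    # Keep benchmark questions whose best-matching web query is similar enough.
    similarity = util.cos_sim(bench_emb, query_emb)  # shape: (num_bench, num_queries)
    for question, scores in zip(benchmark_questions, similarity):
        if scores.max().item() > 0.4:  # illustrative threshold
            print(f"kept: {question}")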

MixEval Versions

MixEval offers different versions to meet various evaluation needs. Each version provides unique insights into model performance.

  • MixEval — The standard version that balances comprehensiveness and efficiency.
  • MixEval-Hard — A more challenging version designed to push model improvements and better distinguish strong models.

MixEval Leaderboard

The MixEval Leaderboard ranks language models based on their performance in MixEval and MixEval-Hard benchmarks. It helps users compare models in terms of efficiency, accuracy, and overall capability.

Arena Elo and MMLU metrics are also included for reference.


Top 3 Performers (July 15, 2024)

Claude 3.5 Sonnet-0620

Claude 3.5 Sonnet-0620 tops the MixEval Leaderboard with a MixEval-Hard score of 68.05 and a MixEval score of 89.9. It excels in accuracy and reliability, scoring 84.2 on MMLU and 85.4 on MMLU-Hard.

GPT-4o-2024-05-13

GPT-4o-2024-05-13 follows with a MixEval-Hard score of 64.7 and a MixEval score of 87.9. It has an Arena Elo of 1287 and scores 85.4 on MMLU and 86.8 on MMLU-Hard, showing strong performance across tasks.

Claude 3 Opus

Claude 3 Opus ranks third with a MixEval-Hard score of 63.5 and a MixEval score of 88.1. It has an Arena Elo of 1248 and scores 83.2 on MMLU and 87.7 on MMLU-Hard, indicating balanced performance.

Model | MixEval-Hard | MixEval | Arena Elo | MMLU
Claude 3.5 Sonnet-0620 | 68.05 | 89.9 | - | 84.2
GPT-4o-2024-05-13 | 64.7 | 87.9 | 1287 | 85.4
Claude 3 Opus | 63.5 | 88.1 | 1248 | 83.2
GPT-4-Turbo-2024-04-09 | 62.6 | 88.8 | 1256 | 82.8
Gemini 1.5 Pro-API-0409 | 58.7 | 84.2 | 1258 | 79.2
Gemini 1.5 Pro-API-0514 | 58.3 | 84.8 | - | 84.0
Yi-Large-preview | 56.8 | 84.4 | 1239 | 80.9
LLaMA-3-70B-Instruct | 55.9 | 84.0 | 1208 | 80.5
Qwen-Max-0428 | 55.8 | 86.1 | 1184 | 80.6
Claude 3 Sonnet | 54.0 | 81.7 | 1201 | 74.7
Reka Core-20240415 | 52.9 | 83.3 | - | 79.3
DeepSeek-V2 | 51.7 | 83.7 | - | 77.3
Command R+ | 51.4 | 81.5 | 1189 | 78.9
Mistral-Large | 50.3 | 84.2 | 1156 | 80.2
Mistral-Medium | 47.8 | 81.9 | 1148 | 76.3
Gemini 1.0 Pro | 46.4 | 78.9 | 1131 | 74.9
Reka Flash-20240226 | 46.2 | 79.8 | 1148 | 75.4
Mistral-Small | 46.2 | 81.2 | - | 75.2
LLaMA-3-8B-Instruct | 45.6 | 75.0 | 1153 | 71.9
Command R | 45.2 | 77.0 | 1147 | 75.0
GPT-3.5-Turbo-0125 | 43.0 | 79.7 | 1102 | 74.5
Claude 3 Haiku | 42.8 | 79.7 | 1178 | 76.1
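
The ranking-correlation claim can be checked mechanically. The sketch below computes the Spearman rank correlation between MixEval-Hard scores and Arena Elo for the leaderboard rows above that report both values (scipy is assumed to be installed; this demonstrates the method on a subset of models, not the published 0.96 figure).

    # Spearman rank correlation between MixEval-Hard and Arena Elo,
    # using the leaderboard rows above that report both values.
    from scipy.stats import spearmanr

    # (MixEval-Hard, Arena Elo) pairs, in table order.
    rows = [
        (64.7, 1287), (63.5, 1248), (62.6, 1256), (58.7, 1258), (56.8, 1239),
        (55.9, 1208), (55.8, 1184), (54.0, 1201), (51.4, 1189), (50.3, 1156),
        (47.8, 1148), (46.4, 1131), (46.2, 1148), (45.6, 1153), (45.2, 1147),
        (43.0, 1102), (42.8, 1178),
    ]
    mixeval_hard, arena_elo = zip(*rows)

    rho, p_value = spearmanr(mixeval_hard, arena_elo)
    print(f"Spearman rank correlation: {rho:.2f} (p = {p_value:.3g})")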

Evaluating Open LLMs with MixEval

MixEval offers a public GitHub repository for evaluating language models. The community fork installed below (philschmid/MixEval) adds several enhancements that make integration and usage easier during model training:

  • Local Model Evaluation — Supports evaluating local models during or after training with Hugging Face Transformers (see the callback sketch after this list).
  • Hugging Face Datasets Integration — Eliminates the need for local files by integrating with Hugging Face Datasets.
  • Accelerated Evaluation — Uses Hugging Face TGI or vLLM to speed up the evaluation process.
  • Enhanced Output — Provides improved markdown outputs and timing for training.
  • Simplified Installation — Fixes pip installation for remote or CI integration.
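
One way to use the local-evaluation support during training is to shell out to the MixEval CLI every time a checkpoint is saved. The callback below is a rough sketch of that pattern, not a built-in feature of the fork; the flags mirror the commands in the steps that follow.

    # Rough sketch: trigger a MixEval run on each saved checkpoint via the CLI.
    # Not a built-in feature of the fork; CLI flags mirror the commands shown below.
    import os
    import subprocess
    from transformers import TrainerCallback

    class MixEvalCallback(TrainerCallback):
        def on_save(self, args, state, control, **kwargs):
            checkpoint = os.path.join(args.output_dir, f"checkpoint-{state.global_step}")
            subprocess.run(
                [
                    "python", "-m", "mix_eval.evaluate",
                    "--data_path", "hf://zeitgeist-ai/mixeval",
                    "--model_path", checkpoint,
                    "--model_name", "local_chat",
                    "--benchmark", "mixeval_hard",
                    "--version", "2024-06-01",
                    "--batch_size", "20",
                    "--output_dir", os.path.join(args.output_dir, "mixeval"),
                    "--api_parallel_num", "20",
                ],
                env={**os.environ, "MODEL_PARSER_API": os.environ["OPENAI_API_KEY"]},
                check=False,  # do not abort training if an evaluation run fails
            )
            return control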

To get started with MixEval, follow these steps:

  1. Install MixEval: Use the following command to install the enhanced version of MixEval:

    pip install git+https://github.com/philschmid/MixEval --upgrade
    
  2. Prepare for Evaluation: Ensure you have a valid OpenAI API key, as GPT-3.5 will be used as the parser.

  3. Evaluate Local LLMs: Run the following command to evaluate local LLMs using MixEval:

    MODEL_PARSER_API=$(echo $OPENAI_API_KEY) python -m mix_eval.evaluate \
        --data_path hf://zeitgeist-ai/mixeval \
        --model_path my/local/path \
        --output_dir results/agi-5 \
        --model_name local_chat \
        --benchmark mixeval_hard \
        --version 2024-06-01 \
        --batch_size 20 \
        --api_parallel_num 20
    
    • data_path: Location of the MixEval dataset (hosted on Hugging Face).
    • model_path: Path to your trained model (can be a local path or a Hugging Face ID).
    • output_dir: Directory to store the results.
    • model_name: Name of the model (use local_chat for the agnostic version).
    • benchmark: Choose between mixeval or mixeval_hard.
    • version: Version of MixEval.
    • batch_size: Batch size for the generator model.
    • api_parallel_num: Number of parallel requests sent to OpenAI.
  4. Accelerated Evaluation with vLLM: For faster evaluation, use vLLM or Hugging Face TGI (a quick check of the endpoint is sketched after these steps):

    • Start your LLM serving framework:

      python -m vllm.entrypoints.openai.api_server --model alignment-handbook/zephyr-7b-dpo-full
      
    • Run MixEval in another terminal:

      MODEL_PARSER_API=$(echo $OPENAI_API_KEY) API_URL=http://localhost:8000/v1 python -m mix_eval.evaluate \
          --data_path hf://zeitgeist-ai/mixeval \
          --model_name local_api \
          --model_path alignment-handbook/zephyr-7b-dpo-full \
          --benchmark mixeval_hard \
          --version 2024-06-01 \
          --batch_size 20 \
          --output_dir results \
          --api_parallel_num 20
      
    • API_URL: URL where your model is hosted.

    • API_KEY: Optional, for authorization if needed.

    • model_name: Use local_api for the hosted API model.

    • model_path: Model ID in your hosted API (e.g., Hugging Face model ID).
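
Before launching the full evaluation against a hosted endpoint, it can help to confirm that the OpenAI-compatible server is actually responding. A minimal check, assuming vLLM is serving the model at the API_URL used above:

    # Quick sanity check that the vLLM OpenAI-compatible endpoint is up.
    # Assumes the server from the step above is running at http://localhost:8000/v1.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

    response = client.chat.completions.create(
        model="alignment-handbook/zephyr-7b-dpo-full",
        messages=[{"role": "user", "content": "Say hello in one short sentence."}],
        max_tokens=32,
    )
    print(response.choices[0].message.content)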

After running the evaluation, MixEval will provide a detailed breakdown of the model's performance across various metrics.
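
The exact files written under --output_dir vary by MixEval version, so a generic way to inspect a finished run is to walk the directory and preview any JSON it contains. The sketch below assumes the output_dir from the earlier command; the file names and schema are not a documented contract.

    # Generic sketch for inspecting MixEval output; file names and structure
    # depend on the MixEval version, so this simply previews whatever JSON exists.
    import json
    from pathlib import Path

    output_dir = Path("results/agi-5")  # the --output_dir used above

    for path in sorted(output_dir.rglob("*.json")):
        with path.open() as f:
            data = json.load(f)
        print(f"== {path.relative_to(output_dir)} ==")
        print(json.dumps(data, indent=2)[:500])  # preview the first 500 characters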

For more detailed instructions, refer to the MixEval documentation.

More terms

What is Binary classification?

Binary classification is a supervised learning task in machine learning that categorizes new observations into one of two classes. The goal is to predict which of two possible classes a data instance belongs to, with the outcome represented as 1 or 0, true or false, yes or no, and so on.
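
As a concrete illustration, the sketch below trains a minimal binary classifier with scikit-learn; the synthetic dataset and the choice of logistic regression are just for demonstration.

    # Minimal binary classification example: predict one of two classes (0 or 1).
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Synthetic two-class dataset for illustration.
    X, y = make_classification(n_samples=1000, n_features=10, n_classes=2, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    predictions = model.predict(X_test)  # each prediction is 0 or 1

    print(f"Accuracy: {accuracy_score(y_test, predictions):.2f}")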


What is AI Quality Control?

AI Quality is determined by evaluating an AI system's performance, societal impact, operational compatibility, and data quality. Performance is measured by the accuracy and generalization of the AI model's predictions, along with its robustness, fairness, and privacy. Societal impact considers ethical implications, including bias and fairness. Operational compatibility ensures the AI system integrates well within its environment, and data quality is critical for the model's predictive power and reliability.
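
As a small illustration of how two of these dimensions can be quantified, the sketch below computes predictive accuracy and a simple demographic-parity gap on toy data; the data and the choice of fairness metric are illustrative assumptions, not a complete quality framework.

    # Illustrative check of two AI-quality dimensions: accuracy and a fairness gap.
    import numpy as np

    y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
    y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])
    group  = np.array(["a", "a", "a", "a", "b", "b", "b", "b"])  # e.g. a protected attribute

    accuracy = (y_true == y_pred).mean()

    # Demographic parity difference: gap in positive prediction rates between groups.
    rate_a = y_pred[group == "a"].mean()
    rate_b = y_pred[group == "b"].mean()
    parity_gap = abs(rate_a - rate_b)

    print(f"accuracy={accuracy:.2f}, demographic parity gap={parity_gap:.2f}")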

