LLM Leaderboard

Real-time Klu.ai data powers this leaderboard for evaluating LLM providers, enabling selection of the optimal API and model for your needs.

Last updated 10/18/2024

  • Most Preferred: GPT-4 Turbo (0409), 100 Klu Index
  • Largest Context: Claude 3 Opus, 200k context window
  • Most Expensive: GPT-4 32k (0314), $75 / million tokens
  • Least Expensive: Sonar Small, $0.02 / million tokens
  • Fastest TPS: Llama 3 8B, 1,211 tokens/s
  • Fastest TTFT: Gemma 2 9B, 0.21s TTFT
Model | Creator | Best For | Speed (TPS) | Benchmark Average | QUAKE | Klu Index
GPT-4 Turbo (0409) | OpenAI | Code & Reasoning | 39 | 87.70% | 24.24% | 100
o1-preview | OpenAI | Complex Reasoning | 29 | 90.70% | 39.29% | 99
GPT-4 Omni (0807) | OpenAI | AI Applications | 131 | 85.40% | 28.79% | 98
Claude 3.5 Sonnet | Anthropic | Chat & Vision | 80 | 82.25% | 31.82% | 97
Gemini Pro 1.5 | Google | Reward Model | 64 | 73.61% | 27.27% | 96
Claude 3 Opus | Anthropic | Creative Content | 23 | 77.35% | 19.70% | 91

Understanding the Klu Index Score

The Klu Index Score evaluates frontier models on accuracy, evaluation results, human preference, and performance, combining these indicators into a single score so models are easier to compare. It helps identify the models that best balance quality, cost, and speed for a specific application.
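
The exact formula behind the Klu Index is not published here, but the general idea of folding quality, preference, speed, and cost into one comparable number can be sketched as follows. The weights, normalization ranges, and field names below are illustrative assumptions, not Klu's actual methodology.

```python
# Minimal sketch of a composite model index. The normalization ranges and
# weights are illustrative assumptions, not Klu's published formula.

def normalize(value, lo, hi, invert=False):
    """Scale a raw metric to 0..1; invert for metrics where lower is better."""
    score = (value - lo) / (hi - lo)
    return 1 - score if invert else score

def composite_index(benchmark_avg, preference_rate, tps, price_per_mtok,
                    weights=(0.5, 0.2, 0.15, 0.15)):
    """Blend quality, human preference, speed, and cost into a single 0..100 score."""
    q = normalize(benchmark_avg, 40, 95)                    # benchmark average, %
    p = normalize(preference_rate, 0, 1)                    # pairwise human preference rate
    s = normalize(tps, 20, 300)                             # tokens per second
    c = normalize(price_per_mtok, 0.02, 75, invert=True)    # $ per 1M tokens, lower is better
    wq, wp, ws, wc = weights
    return round(100 * (wq * q + wp * p + ws * s + wc * c), 2)

# Example with GPT-4 Omni-like numbers from the tables below
# (the preference rate here is hypothetical).
print(composite_index(benchmark_avg=81.69, preference_rate=0.8,
                      tps=131, price_per_mtok=4.16))
```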

Powered by real-time Klu.ai data as of 10/18/2024, this LLM Leaderboard reveals key insights into use cases, performance, and quality. GPT-4 Turbo (0409) leads with a 100 Klu Index score. o1-preview excels in complex reasoning with a 99 Klu Index. GPT-4 Omni (0807) is optimal for AI applications with a speed of 131 TPS. Claude 3.5 Sonnet is best for chat and vision tasks, achieving an 82.25% benchmark average. Gemini Pro 1.5 is noted for reward modeling with a 73.61% benchmark average, while Claude 3 Opus excels in creative content with a 77.35% benchmark average.

This data enables optimal API provider and model selection based on specific needs, balancing factors like performance, context size, cost, and speed. The leaderboard compares 30+ frontier models based on real-world use, leading benchmarks, and cost vs. speed vs. quality performance.

This comparison allows researchers and developers to make informed decisions when selecting an LLM, weighing factors such as supported languages, performance, context size, speed, and cost. For teams focused on efficiency and accuracy, the leaderboard provides clear evaluations of each model's abilities, and ongoing updates with real-time data keep it relevant as new models and benchmarks emerge.
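
As a concrete illustration of that selection process, the sketch below filters leaderboard-style rows by a price ceiling and a speed floor, then ranks the survivors by Klu Index. The sample rows are drawn from the provider table further down; the thresholds and helper function are hypothetical.

```python
# Illustrative only: pick a model under cost and speed constraints, then rank
# by Klu Index. Sample rows come from the provider table below; the thresholds
# are arbitrary examples.

models = [
    {"name": "GPT-4 Omni (0807)", "klu_index": 98, "price": 4.16, "tps": 131, "ttft": 0.50},
    {"name": "Claude 3.5 Sonnet", "klu_index": 97, "price": 6.00, "tps": 80,  "ttft": 0.84},
    {"name": "Llama 3.1 70B",     "klu_index": 86, "price": 1.29, "tps": 249, "ttft": 0.44},
    {"name": "GPT-4 Omni Mini",   "klu_index": 85, "price": 0.26, "tps": 266, "ttft": 0.57},
]

def pick_model(rows, max_price, min_tps):
    """Return the highest-Klu-Index model within a price ceiling and speed floor."""
    eligible = [m for m in rows if m["price"] <= max_price and m["tps"] >= min_tps]
    return max(eligible, key=lambda m: m["klu_index"], default=None)

best = pick_model(models, max_price=2.00, min_tps=200)
print(best["name"] if best else "No model meets the constraints")  # Llama 3.1 70B
```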

Frontier Model Comparison

The comparison below shows two frontier models side by side: GPT-4 Omni (0807) and Claude 3.5 Sonnet.

GPT-4 Omni (0807)

  • Overall Average: 81.69%
  • Knowledge (MMLU): 88.70%
  • Expertise (GPQA): 53.60%
  • Vision (MMMU): 69.10%
  • Reasoning (HellaSWAG): 94.20%
  • Coding (HumanEval): 90.20%
  • Reasoning (BBHard): 91.30%
  • Easy Math (GSM8K): 89.80%
  • Hard Math (MATH): 76.60%
  • Klu Index Score: 98.00
  • Speed (Tokens/s): 130.60
  • Stream Start (Seconds): 0.50
  • Price / Million Tokens: $4.16

Claude 3.5 Sonnet

  • Overall Average: 82.25%
  • Knowledge (MMLU): 88.70%
  • Expertise (GPQA): 59.40%
  • Vision (MMMU): 68.30%
  • Reasoning (HellaSWAG): 89.00%
  • Coding (HumanEval): 92.00%
  • Reasoning (BBHard): 93.10%
  • Easy Math (GSM8K): 96.40%
  • Hard Math (MATH): 71.10%
  • Klu Index Score: 97.42
  • Speed (Tokens/s): 80.10
  • Stream Start (Seconds): 0.84
  • Price / Million Tokens: $6
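
A quick arithmetic check on the figures above: each model's Overall Average is the unweighted mean of the eight benchmark scores listed for it.

```python
# Verify that "Overall Average" equals the unweighted mean of the eight
# benchmark scores shown in the side-by-side comparison above.

gpt4o_omni = [88.70, 53.60, 69.10, 94.20, 90.20, 91.30, 89.80, 76.60]
claude_35_sonnet = [88.70, 59.40, 68.30, 89.00, 92.00, 93.10, 96.40, 71.10]

print(round(sum(gpt4o_omni) / len(gpt4o_omni), 2))            # 81.69
print(round(sum(claude_35_sonnet) / len(claude_35_sonnet), 2))  # 82.25
```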

Frontier Benchmarks Leaderboard

The hardest benchmarks, the most advanced models.

# | Model | Average | BCB (Coding) | MATH (Uni Math) | GPQA (Knowledge) | MMMU (Vision) | GAIA (Assistants) | QUAKE (Productivity)
1 | o1-preview | 57.92% | 34.5% | 94.8% | 78.3% | 78.2% | 22.42% | 39.29%
2 | o1-mini | 55.03% | 27% | 94.8% | 59.4% | 78.2% | 22.42% | 48.34%
3 | Claude 3.5 Sonnet | 47.69% | 33.1% | 71.1% | 59.4% | 68.3% | 22.42% | 31.82%
4 | GPT-4o (0806) | 46.92% | 29.1% | 76.6% | 56.1% | 69.1% | 21.82% | 28.79%
5 | Gemini 1.5 Pro (0924) | 43.99% | 31.1% | 67.7% | 46.2% | 62.2% | 10.3% | 46.43%
6 | GPT-4 Turbo (0409) | 42.26% | 35.1% | 72.2% | 48% | 63.1% | 10.91% | 24.24%
7 | Claude 3 Opus | 39.48% | 29.7% | 60.1% | 50.4% | 59.4% | 17.58% | 19.7%
8 | Gemini 1.5 Flash (0924) | 39.35% | 25% | 54.9% | 39.5% | 56.1% | 13.33% | 47.26%
9 | GPT-4o Mini | 37.78% | 27% | 70.2% | 40.2% | 59.4% | 15.15% | 14.71%
These benchmarks assess coding skills (BigCodeBench), mathematical proficiency (MATH), expert knowledge (GPQA), vision capabilities (MMMU), AI assistants (GAIA), and productivity (QUAKE). These frontier benchmarks represent some of the most challenging tasks, pushing the boundaries of current multi-modal model capabilities.

Frontier Model Leaderboard

The standard benchmarks, the leading frontier models.

# | Model | Average | MMLU (Knowledge) | GPQA (Expert) | MMMU (Vision) | HellaSWAG (Context) | HumanEval (Coding) | BBHard (Reasoning) | GSM8K (K-6 Math) | MATH (Uni Math)
1 | Claude 3.5 Sonnet | 82.25% | 88.70% | 59.4% | 68.3% | 89% | 92% | 93.10% | 96.40% | 71.10%
2 | GPT-4o (0513) | 81.69% | 88.70% | 53.6% | 69.1% | 94.20% | 90.20% | 91.30% | 89.80% | 76.60%
3 | GPT-4 Turbo (0409) | 79.10% | 86.50% | 48.0% | 63.1% | 94.20% | 90.20% | 87.60% | 91% | 72.20%
4 | Llama 3.1 (405B) | 79.01% | 88.60% | 51.1% | 64.5% | 87% | 89% | 81.3% | 96.80% | 73.80%
5 | Mistral Large 2 (0724) | 78.80% | 84% | 35.1% | | 89.20% | 92% | 87.30% | 93% | 71%
6 | Claude 3 Opus | 77.35% | 86.80% | 50.4% | 59.4% | 95.40% | 84.90% | 86.80% | 95% | 60.10%
7 | Llama 3.1 70B | 75.65% | 86% | 46.7% | 60.6% | 87% | 80.50% | 81.30% | 95.10% | 68.0%
8 | Gemini 1.5 Pro | 73.61% | 81.90% | 46.2% | 62.2% | 92.50% | 71.90% | 84% | 91.70% | 58.50%
9 | GPT-4 (0314) | 71.15% | 86.40% | 35.7% | 56.8% | 95.30% | 67% | 83.10% | 92% | 52.90%
10 | GPT-4o Mini | 70.70% | 82% | 40.2% | 59.4% | | 87.20% | | | 70.20%
11 | Claude 3 Sonnet | 69.85% | 79% | 46.4% | 53.1% | 89% | 73% | 82.90% | 92.30% | 43.10%
12 | Gemini 1.5 Flash | 68.63% | 78.90% | 39.5% | 56.1% | 81.30% | 67.50% | 89.20% | 68.80% | 67.70%
13 | Claude 3 Haiku | 66.10% | 75.20% | 40.1% | 50.2% | 85.90% | 75.90% | 73.70% | 88.90% | 38.90%
14 | Llama 3.1 8B | 64.29% | 73.0% | 32.8% | | 74.20% | 72.60% | 61% | 84.50% | 51.90%
15 | Mistral Nemo 12B | 41.30% | 68% | 8.72% | | 83.5% | | | | 5%
These benchmarks assess general knowledge (MMLU), expert knowledge (GPQA), vision capabilities (MMMU), common-sense reasoning (HellaSWAG), coding skills (HumanEval), multi-step reasoning (BBHard), and mathematical proficiency (GSM8K, MATH). Evaluating these capabilities together highlights the strengths and weaknesses of different models.

All Models by Provider

Model | Klu Index | Cost / 1M Tokens | Speed (TPS) | Speed (TTFT) | Best Benchmark

OpenAI
o1-preview | 99 | $26.25 | 29 | 30.55s | MATH 94.8
GPT-4 Turbo (0409) | 100 | $15 | 39 | 0.55s | GSM8K 94.8
GPT-4 Omni (0807) | 98 | $4.16 | 131 | 0.5s | GSM8K 93.8
GPT-4 Omni Mini (0718) | 85 | $0.26 | 266 | 0.57s | MMLU 89
GPT-4 32k (0314) | 85 | $75 | 23 | 0.55s | GSM8K 95.0
GPT-4 (0613) | 79 | $37.50 | 23 | 0.55s | OpenBookQA 95.6
GPT-4 Vision (1106) | 89 | $15 | 27 | 0.53s | GSM8K 94.8
GPT-3.5 Turbo | 70 | $0.75 | 72 | 0.34s | OpenBookQA 92.3
GPT-3.5 Turbo Instruct | 70 | $1.63 | 130 | 0.49s | MMLU 70

Azure OpenAI
GPT-4 (0613) | 79 | $37.50 | 23 | 0.57s | OpenBookQA 95.6
GPT-4 Vision Preview (1106) | 89 | $15 | 41 | 0.57s | GSM8K 94.8
GPT-4 Turbo Preview (1106) | 100 | $15 | 33 | 0.57s | GSM8K 94.8
GPT-4 Omni (0513) | 99 | $7.50 | 72 | 0.5s | GSM8K 93.8
GPT-3.5 Turbo | 70 | $0.75 | 72 | 0.34s | OpenBookQA 92.3
GPT-3.5 Turbo Instruct | 70 | $1.63 | 137 | 0.62s | GSM8K 93.8

Azure AI
Llama 3.1 405B | 89 | $7.99 | 32 | 0.79s | HumanEval 89
Llama 3.1 70B | 86 | $1.29 | 86 | 0.44s | MMLU 86
Llama 3.1 8B | 69 | $0.30 | 266 | 0.29s | MMLU 73

Mistral
Mistral Large 2 (0724) | 88 | $4.50 | 44 | 0.29s | GSM8K 93
Mistral Nemo 12B | 79 | $0.30 | 190 | 0.31s | HellaSWAG 83.5
Mistral Large | 79 | $12 | 24 | 1s | MMLU 81.2
Mistral Medium | 76 | $4.05 | 18 | 0.48s | MMLU 70.60
Mixtral 8x22B | 78 | $1.20 | 69 | 0.28s | Average 78
Mistral 8x7B | 70 | $0.50 | 103 | 0.29s | MMLU 70.60
Mistral Small | 70 | $3 | 81 | 0.72s | MMLU 70.60
Mistral 7B | 61 | $0.15 | 88 | 0.28s | MMLU 70.60

Perplexity
Sonar Large | 89 | $1 | 41 | 1.14s | MMLU 70.60
Sonar Small | 66 | $0.02 | 121 | 0.91s | MMLU 70.60

Cohere
Command R+ | 74 | $6 | 62 | 0.44s | MMLU 75.70
Command R | 62 | $0.75 | 147 | 0.35s | MMLU 68.20

Anthropic
Claude Instant | 67 | $1.90 | 89 | 0.42s | MMLU 70.60
Claude 2.1 | 70 | $12 | 28 | 0.59s | MMLU 70.60
Claude 3 Haiku | 82 | $0.50 | 120 | 0.3s | HumanEval 75.90
Claude 3 Sonnet | 82 | $6 | 54 | 0.64s | BBHard 82.90
Claude 3.5 Sonnet | 97 | $6 | 80 | 0.84s | MMLU 89
Claude 3 Opus | 91 | $30 | 23 | 1.66s | MMLU 86.80

Groq
Gemma 2 9B | 52 | $0.20 | 122 | 0.21s | MMLU 64.30
Llama 3.1 8B | 69 | $0.30 | 744 | 0.29s | MMLU 73
Llama 3.1 70B | 86 | $1.29 | 249 | 0.44s | MMLU 86
Gemma 7B | 52 | $0.07 | 1030 | 0.88s | MMLU 64.30
Llama 3 70B | 70 | $0.64 | 358 | 0.4s | MMLU 70.60
Llama 3 8B | 63 | $0.06 | 1211 | 0.36s | HellaSWAG 87
Mixtral 8x7B | 70 | $0.24 | 552 | 0.44s | MMLU 70.60

Google
Gemini Pro 1.5 | 96 | $1.25 | 64 | 1.88s | BBHard 75
Gemini Flash 1.5 | 89 | $1.25 | 89 | 1.88s | BBHard 75
Gemma 7B | 52 | $0.15 | 123 | 0.29s | MMLU 64.30
Gemma 2 27B | 83 | $0.30 | 49 | 0.49s | MMLU 75.2
Gemma 2 8B | 82 | $0.20 | 139 | 0.29s | MMLU 71.3
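
The Speed columns above can be combined into a rough end-to-end latency estimate: time to first token plus generation time for the requested output length. The sketch below does exactly that; it ignores network and queueing overhead, so treat the result as an optimistic lower bound.

```python
# Rough latency estimate from the table's Speed columns: time to first token
# plus generation time for the remaining tokens. Ignores network and queueing
# overhead, so treat the result as a lower bound.

def estimated_latency_s(ttft_s, tokens_per_s, output_tokens):
    return ttft_s + output_tokens / tokens_per_s

# 500 output tokens on two rows from the table above
print(round(estimated_latency_s(0.50, 131, 500), 2))  # GPT-4 Omni (0807): ~4.32s
print(round(estimated_latency_s(1.66, 23, 500), 2))   # Claude 3 Opus: ~23.4s
```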

Frequently asked questions