LLM Leaderboard
Real-time Klu.ai data powers this leaderboard for evaluating LLM providers, enabling selection of the optimal API and model for your needs.
Recent frontier models have improved substantially in speed and output quality for chat and code generation, including in multilingual contexts such as German, Chinese, and Hindi. Open benchmark repositories give developers a way to spot the categories where a model underperforms, but latency remains a concern, especially when processing large context windows or running comparisons between models in cost-sensitive environments. With growing demand for evaluation datasets in languages such as Spanish, French, Italian, and Arabic, benchmarking model quality and breadth against established suites remains essential.
Last updated 10/18/2024
Model | Creator | Best For | Speed (TPS) | Benchmark Average | QUAKE | Klu Index |
---|---|---|---|---|---|---|
GPT-4 Turbo (0409) | OpenAI | Code & Reasoning | 39 | 87.70% | 24.24% | 100 |
o1-preview | OpenAI | Complex Reasoning | 29 | 90.70% | 39.29% | 99 |
GPT-4 Omni (0807) | OpenAI | AI Applications | 131 | 85.40% | 28.79% | 98 |
Claude 3.5 Sonnet | Anthropic | Chat & Vision | 80 | 82.25% | 31.82% | 97 |
Gemini Pro 1.5 | Google | Reward Model | 64 | 73.61% | 27.27% | 96 |
Claude 3 Opus | Anthropic | Creative Content | 23 | 77.35% | 19.70% | 91 |
Understanding the Klu Index Score
The Klu Index Score evaluates frontier models on accuracy, evaluations, human preference, and performance. It combines these indicators into one score, making it easier to compare models. This score helps identify models that best balance quality, cost, and speed for specific applications.
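The exact weighting behind the Klu Index is not published on this page, but conceptually it is a composite of normalized component scores. The sketch below illustrates the idea; the component names and weights are assumptions, not the actual Klu formula.

```python
# Hypothetical composite-index sketch: component names and weights are
# assumptions, not the published Klu Index formula.

def composite_index(accuracy, evals, preference, performance,
                    weights=(0.3, 0.3, 0.2, 0.2)):
    """Blend normalized (0-1) component scores into a single 0-100 index."""
    components = (accuracy, evals, preference, performance)
    return 100 * sum(w * c for w, c in zip(weights, components))

# Example: strong accuracy and eval results, middling raw performance.
print(round(composite_index(0.92, 0.89, 0.85, 0.70), 1))  # 85.3
```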
Powered by real-time Klu.ai data as of 10/18/2024, this LLM Leaderboard reveals key insights into use cases, performance, and quality. GPT-4 Turbo (0409) leads with a 100 Klu Index score. o1-preview excels in complex reasoning with a 99 Klu Index. GPT-4 Omni (0807) is optimal for AI applications with a speed of 131 TPS. Claude 3.5 Sonnet is best for chat and vision tasks, achieving an 82.25% benchmark average. Gemini Pro 1.5 is noted for reward modeling with a 73.61% benchmark average, while Claude 3 Opus excels in creative content with a 77.35% benchmark average.
This data enables optimal API provider and model selection based on specific needs, balancing factors like performance, context size, cost, and speed. The leaderboard compares 30+ frontier models based on real-world use, leading benchmarks, and cost vs. speed vs. quality performance.
This comparison allows researchers and developers to make informed decisions when selecting an LLM, weighing factors such as supported languages, benchmark performance, context size, speed, and cost. For teams focused on efficiency and accuracy, the leaderboard provides clear evaluations of each model's capabilities, and ongoing updates with real-time data keep it current as new models and benchmark results arrive.
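As a concrete illustration of that selection process, the sketch below filters leaderboard figures by a cost ceiling and a minimum throughput, then takes the highest Klu Index among the survivors. The figures are copied from the tables on this page; the constraint values and the "highest index wins" rule are assumptions, not a Klu recommendation.

```python
# Minimal selection sketch using figures copied from the tables on this page.
# The constraint values and the "highest Klu Index wins" rule are assumptions.

models = [
    # (name, klu_index, cost_per_1m_tokens_usd, speed_tps)
    ("GPT-4 Turbo (0409)",    100, 15.00,  39),
    ("GPT-4 Omni (0807)",      98,  4.16, 131),
    ("Claude 3.5 Sonnet",      97,  6.00,  80),
    ("GPT-4 Omni Mini (0718)", 85,  0.26, 266),
]

def pick(candidates, max_cost, min_tps):
    """Return the highest-Klu-Index model within the cost and speed limits."""
    eligible = [m for m in candidates if m[2] <= max_cost and m[3] >= min_tps]
    return max(eligible, key=lambda m: m[1]) if eligible else None

print(pick(models, max_cost=5.00, min_tps=100))  # -> GPT-4 Omni (0807)
```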
Frontier Model Comparison
Select two models to compare their performance side by side.
GPT-4 Omni (0807)
- Overall Average: 81.69%
- Knowledge (MMLU): 88.70%
- Expertise (GPQA): 53.60%
- Vision (MMMU): 69.10%
- Reasoning (HellaSWAG): 94.20%
- Coding (HumanEval): 90.20%
- Reasoning (BBHard): 91.30%
- Easy Math (GSM8K): 89.80%
- Hard Math (MATH): 76.60%
- Klu Index Score: 98.00
- Speed (Tokens/s): 130.60
- Stream Start (Seconds): 0.50
- Price / Million Tokens: $4.16
Claude 3.5 Sonnet
- Overall Average: 82.25%
- Knowledge (MMLU): 88.70%
- Expertise (GPQA): 59.40%
- Vision (MMMU): 68.30%
- Reasoning (HellaSWAG): 89.00%
- Coding (HumanEval): 92.00%
- Reasoning (BBHard): 93.10%
- Easy Math (GSM8K): 96.40%
- Hard Math (MATH): 71.10%
- Klu Index Score: 97.42
- Speed (Tokens/s): 80.10
- Stream Start (Seconds): 0.84
- Price / Million Tokens: $6
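A minimal sketch of this side-by-side view, using the benchmark scores from the two cards above: it compares the models benchmark by benchmark and reports which one leads.

```python
# Side-by-side diff of the two cards above; scores are copied from this page.
gpt4o_0807 = {"MMLU": 88.7, "GPQA": 53.6, "MMMU": 69.1, "HellaSWAG": 94.2,
              "HumanEval": 90.2, "BBHard": 91.3, "GSM8K": 89.8, "MATH": 76.6}
claude_35_sonnet = {"MMLU": 88.7, "GPQA": 59.4, "MMMU": 68.3, "HellaSWAG": 89.0,
                    "HumanEval": 92.0, "BBHard": 93.1, "GSM8K": 96.4, "MATH": 71.1}

for bench, gpt_score in gpt4o_0807.items():
    delta = claude_35_sonnet[bench] - gpt_score
    leader = ("Claude 3.5 Sonnet" if delta > 0
              else "GPT-4 Omni (0807)" if delta < 0 else "tie")
    print(f"{bench:<10} {leader:<18} ({delta:+.1f} points)")
```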
Frontier Benchmarks Leaderboard
The hardest benchmarks, the most advanced models.
# | Model | Average | BCB (Coding) | MATH (Uni Math) | GPQA (Knowledge) | MMMU (Vision) | GAIA (Assistants) | QUAKE (Productivity) |
---|---|---|---|---|---|---|---|---|
1 | o1-preview | 57.92% | 34.5% | 94.8% | 78.3% | 78.2% | 22.42% | 39.29% |
2 | o1-mini | 55.03% | 27% | 94.8% | 59.4% | 78.2% | 22.42% | 48.34% |
3 | Claude 3.5 Sonnet | 47.69% | 33.1% | 71.1% | 59.4% | 68.3% | 22.42% | 31.82% |
4 | GPT-4o (0806) | 46.92% | 29.1% | 76.6% | 56.1% | 69.1% | 21.82% | 28.79% |
5 | Gemini 1.5 Pro (0924) | 43.99% | 31.1% | 67.7% | 46.2% | 62.2% | 10.3% | 46.43% |
6 | GPT-4 Turbo (0409) | 42.26% | 35.1% | 72.2% | 48% | 63.1% | 10.91% | 24.24% |
7 | Claude 3 Opus | 39.48% | 29.7% | 60.1% | 50.4% | 59.4% | 17.58% | 19.7% |
8 | Gemini 1.5 Flash (0924) | 39.35% | 25% | 54.9% | 39.5% | 56.1% | 13.33% | 47.26% |
9 | GPT-4o Mini | 37.78% | 27% | 70.2% | 40.2% | 59.4% | 15.15% | 14.71% |
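The Average column here appears to be the unweighted mean of the six benchmark scores in each row; recomputing it for the o1-preview row reproduces the listed 57.92%. Treat this as an inferred convention, not a documented one.

```python
# Recomputing the Average column for the o1-preview row (values from the table).
# Assumption: the average is the plain mean of the six benchmark percentages.
o1_preview = {"BCB": 34.5, "MATH": 94.8, "GPQA": 78.3,
              "MMMU": 78.2, "GAIA": 22.42, "QUAKE": 39.29}

average = sum(o1_preview.values()) / len(o1_preview)
print(f"{average:.2f}%")  # 57.92%, matching the table
```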
Frontier Model Leaderboard
The standard benchmarks, the leading frontier models.
# | Model | Average | MMLU (Knowledge) | GPQA (Expert) | MMMU (Vision) | HellaSWAG (Context) | HumanEval (Coding) | BBHard (Reasoning) | GSM8K (K6 Math) | MATH (Uni Math) |
---|---|---|---|---|---|---|---|---|---|---|
1 | Claude 3.5 Sonnet | 82.25% | 88.70% | 59.4% | 68.3% | 89% | 92% | 93.10% | 96.40% | 71.10% |
2 | GPT-4o (0513) | 81.69% | 88.70% | 53.6% | 69.1% | 94.20% | 90.20% | 91.30% | 89.80% | 76.60% |
3 | GPT-4 Turbo (0409) | 79.10% | 86.50% | 48.0% | 63.1% | 94.20% | 90.20% | 87.60% | 91% | 72.20% |
4 | Llama 3.1 (405B) | 79.01% | 88.60% | 51.1% | 64.5% | 87% | 89% | 81.3% | 96.80% | 73.80% |
5 | Mistral Large 2 (0724) | 78.80% | 84% | 35.1% | — | 89.20% | 92% | 87.30% | 93% | 71% |
6 | Claude 3 Opus | 77.35% | 86.80% | 50.4% | 59.4% | 95.40% | 84.90% | 86.80% | 95% | 60.10% |
7 | Llama 3.1 70B | 75.65% | 86% | 46.7% | 60.6% | 87% | 80.50% | 81.30% | 95.10% | 68.0% |
8 | Gemini 1.5 Pro | 73.61% | 81.90% | 46.2% | 62.2% | 92.50% | 71.90% | 84% | 91.70% | 58.50% |
9 | GPT-4 (0314) | 71.15% | 86.40% | 35.7% | 56.8% | 95.30% | 67% | 83.10% | 92% | 52.90% |
10 | GPT-4o Mini | 70.70% | 82% | 40.2% | 59.4% | — | 87.20% | — | — | 70.20% |
11 | Claude 3 Sonnet | 69.85% | 79% | 46.4% | 53.1% | 89% | 73% | 82.90% | 92.30% | 43.10% |
12 | Gemini 1.5 Flash | 68.63% | 78.90% | 39.5% | 56.1% | 81.30% | 67.50% | 89.20% | 68.80% | 67.70% |
13 | Claude 3 Haiku | 66.10% | 75.20% | 40.1% | 50.2% | 85.90% | 75.90% | 73.70% | 88.90% | 38.90% |
14 | Llama 3.1 8B | 64.29% | 73.0% | 32.8% | — | 74.20% | 72.60% | 61% | 84.50% | 51.90% |
15 | Mistral Nemo 12B | 41.30% | 68% | 8.72% | — | 83.5% | — | — | — | 5% |
All Models by Provider
Model | Klu Index | Cost / 1M Tokens | Speed (TPS) | Speed (TTFT) | Best Benchmark | |
---|---|---|---|---|---|---|
OpenAI | ||||||
o1-preview | 99 | $26.25 | 29 | 30.55s | MATH 94.8 | |
GPT-4 Turbo (0409) | 100 | $15 | 39 | 0.55s | gsm8k 94.8 | |
GPT-4 Omni (0807) | 98 | $4.16 | 131 | 0.5s | gsm8k 93.8 | |
GPT-4 Omni Mini (0718) | 85 | $0.26 | 266 | 0.57s | MMLU 89 | |
GPT-4 32k (0314) | 85 | $75 | 23 | 0.55s | gsm8k 95.0 | |
GPT-4 (0613) | 79 | $37.50 | 23 | 0.55s | openbookqa 95.6 | |
GPT-4 Vision (1106) | 89 | $15 | 27 | 0.53s | gsm8k 94.8 | |
GPT-3.5 Turbo | 70 | $0.75 | 72 | 0.34s | openbookqa 92.3 | |
GPT-3.5 Turbo Instruct | 70 | $1.63 | 130 | 0.49s | MMLU 70 | |
Azure OpenAI | ||||||
GPT-4 (0613) | 79 | $37.50 | 23 | 0.57s | openbookqa 95.6 | |
GPT-4 Vision Preview (1106) | 89 | $15 | 41 | 0.57s | gsm8k 94.8 | |
GPT-4 Turbo Preview (1106) | 100 | $15 | 33 | 0.57s | gsm8k 94.8 | |
GPT-4 Omni (0513) | 99 | $7.50 | 72 | 0.5s | gsm8k 93.8 | |
GPT-3.5 Turbo | 70 | $0.75 | 72 | 0.34s | openbookqa 92.3 | |
GPT-3.5 Turbo Instruct | 70 | $1.63 | 137 | 0.62s | gsm8k 93.8 | |
Azure AI | ||||||
Llama 3.1 405B | 89 | $7.99 | 32 | 0.79s | humaneval 89 | |
Llama 3.1 70B | 86 | $1.29 | 86 | 0.44s | MMLU 86 | |
Llama 3.1 8B | 69 | $0.30 | 266 | 0.29s | MMLU 73 | |
Mistral | ||||||
Mistral Large 2 (0724) | 88 | $4.5 | 44 | 0.29s | GSM8K 93 | |
Mistral Nemo 12B | 79 | $0.30 | 190 | 0.31s | hellaswag 83.5 | |
Mistral Large | 79 | $12 | 24 | 1s | MMLU 81.2 | |
Mistral Medium | 76 | $4.05 | 18 | 0.48s | MMLU 70.60 | |
Mixtral 8x22B | 78 | $1.20 | 69 | 0.28s | average 78 | |
Mistral 8x7B | 70 | $0.50 | 103 | 0.29s | MMLU 70.60 | |
Mistral Small | 70 | $3 | 81 | 0.72s | MMLU 70.60 | |
Mistral 7B | 61 | $0.15 | 88 | 0.28s | MMLU 70.60 | |
Perplexity | ||||||
Sonar Large | 89 | $1 | 41 | 1.14s | MMLU 70.60 | |
Sonar Small | 66 | $0.02 | 121 | 0.91s | MMLU 70.60 | |
Cohere | ||||||
Command R+ | 74 | $6 | 62 | 0.44s | MMLU 75.70 | |
Command R | 62 | $0.75 | 147 | 0.35s | MMLU 68.20 | |
Anthropic | ||||||
Claude Instant | 67 | $1.90 | 89 | 0.42s | MMLU 70.60 | |
Claude 2.1 | 70 | $12 | 28 | 0.59s | MMLU 70.60 | |
Claude 3 Haiku | 82 | $0.50 | 120 | 0.3s | humaneval 75.90 | |
Claude 3 Sonnet | 82 | $6 | 54 | 0.64s | bbhard 82.90 | |
Claude 3.5 Sonnet | 97 | $6 | 80 | 0.84s | MMLU 89 | |
Claude 3 Opus | 91 | $30 | 23 | 1.66s | MMLU 86.80 | |
Groq | ||||||
Gemma 2 9B | 52 | $0.20 | 122 | 0.21s | MMLU 64.30 | |
Llama 3.1 8B | 69 | $0.30 | 744 | 0.29s | MMLU 73 | |
Llama 3.1 70B | 86 | $1.29 | 249 | 0.44s | MMLU 86 | |
Gemma 7B | 52 | $0.07 | 1030 | 0.88s | MMLU 64.30 | |
Llama 3 70B | 70 | $0.64 | 358 | 0.4s | MMLU 70.60 | |
Llama 3 8B | 63 | $0.06 | 1211 | 0.36s | hellaswag 87 | |
Mixtral 8x7B | 70 | $0.24 | 552 | 0.44s | MMLU 70.60 | |
Google | | | | | |
Gemini Pro 1.5 | 96 | $1.25 | 64 | 1.88s | bbhard 75 | |
Gemini Flash 1.5 | 89 | $1.25 | 89 | 1.88s | bbhard 75 | |
Gemma 7B | 52 | $0.15 | 123 | 0.29s | MMLU 64.30 | |
Gemma 2 27B | 83 | $0.30 | 49 | 0.49s | MMLU 75.2 | |
Gemma 2 8B | 82 | $0.20 | 139 | 0.29s | MMLU 71.3 |
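One way to work with the provider table above is to rank models by Klu Index points per dollar (cost per million tokens), a rough cost-effectiveness signal. The figures below are copied from the table; the metric itself is an assumption for illustration, not part of the Klu Index.

```python
# Illustrative cost-effectiveness view: Klu Index points per dollar (per 1M tokens).
# Figures are copied from the table above; the metric itself is an assumption.
models = [
    # (model, klu_index, cost_per_1m_tokens_usd)
    ("GPT-4 Omni Mini (0718)", 85,  0.26),
    ("Llama 3.1 70B",          86,  1.29),
    ("GPT-4 Omni (0807)",      98,  4.16),
    ("Claude 3.5 Sonnet",      97,  6.00),
    ("Claude 3 Opus",          91, 30.00),
]

for name, index, cost in sorted(models, key=lambda m: m[1] / m[2], reverse=True):
    print(f"{name:<24} {index / cost:7.1f} index points per dollar")
```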