LLM Leaderboard

Real-time Klu.ai data powers this leaderboard for evaluating LLM providers, enabling selection of the optimal API and model for your needs.

Last updated 6/23/2025

Most Preferred

GPT-4 Turbo (0409)

100 Klu Index

Largest Context

Claude 3 Opus

200k Context Window

Most Expensive

GPT-4 32k (0314)

$75 / Million Tokens

Least Expensive

Sonar Small

$0.02 / Million Tokens

Fastest TPS

Llama 3 8B

1211 Tokens/s

Fastest TTFT

Gemma 2 9B

0.21s TTFT

Model	Creator	Best For	Speed (TPS)	Benchmark Average	QUAKE	Klu Index
GPT-4 Turbo (0409)	OpenAI	Code & Reasoning	39	87.70%	24.24%	100
o1-preview	OpenAI	Complex Reasoning	29	90.70%	39.29%	99
GPT-4 Omni (0807)	OpenAI	AI Applications	131	85.40%	28.79%	98
Claude 3.5 Sonnet	Anthropic	Chat & Vision	80	82.25%	31.82%	97
Gemini Pro 1.5	Google	Reward Model	64	73.61%	27.27%	96
Claude 3 Opus	Anthropic	Creative Content	23	77.35%	19.70%	91

Understanding the Klu Index Score

The Klu Index Score evaluates frontier models on accuracy, evaluations, human preference, and performance. It combines these indicators into one score, making it easier to compare models. This score helps identify models that best balance quality, cost, and speed for specific applications.

Powered by real-time Klu.ai data as of 6/23/2025, this LLM Leaderboard reveals key insights into use cases, performance, and quality. GPT-4 Turbo (0409) leads with a 100 Klu Index score. o1-preview excels in complex reasoning with a 99 Klu Index. GPT-4 Omni (0807) is optimal for AI applications with a speed of 131 TPS. Claude 3.5 Sonnet is best for chat and vision tasks, achieving an 82.25% benchmark average. Gemini Pro 1.5 is noted for reward modeling with a 73.61% benchmark average, while Claude 3 Opus excels in creative content with a 77.35% benchmark average.

This data enables optimal API provider and model selection based on specific needs, balancing factors like performance, context size, cost, and speed. The leaderboard compares 30+ frontier models based on real-world use, leading benchmarks, and cost vs. speed vs. quality performance.

This comprehensive comparison allows researchers and developers to make informed decisions when selecting an LLM, evaluating factors such as languages supported, performance, context size, speed, and cost. For those interested in maximizing efficiency and accuracy, the leaderboard serves as a vital resource, providing clear evaluations of the models’ abilities. Additionally, the ongoing updates and inclusion of real-time data ensure that the leaderboard remains relevant for anyone asking the right questions, improving NLP benchmarks, and optimizing their use of cutting-edge technologies.

Frontier Model Comparison

Select two models to compare their performance side by side.

GPT-4 Omni (0807)

Overall Average81.69%
Knowledge (MMLU)88.70%
Expertise (GPQA)53.60%
Vision (MMMU)69.10%
Reasoning (HellaSWAG)94.20%
Coding (HumanEval)90.20%
Reasoning (BBHard)91.30%
Easy Math (GSM8K)89.80%
Hard Math (MATH)76.60%
Klu Index Score98.00
Speed (Tokens/s)130.60
Stream Start (Seconds)0.50
Price / Million Tokens$4.16

Claude 3.5 Sonnet

Overall Average82.25%
Knowledge (MMLU)88.70%
Expertise (GPQA)59.40%
Vision (MMMU)68.30%
Reasoning (HellaSWAG)89.00%
Coding (HumanEval)92.00%
Reasoning (BBHard)93.10%
Easy Math (GSM8K)96.40%
Hard Math (MATH)71.10%
Klu Index Score97.42
Speed (Tokens/s)80.10
Stream Start (Seconds)0.84
Price / Million Tokens$6

Frontier Benchmarks Leaderboard

The hardest benchmarks, the most advanced models.

	Model	Average	BCB	MATH	GPQA	MMMU	GAIA	QUAKE
			Coding	Uni Math	Knowledge	Vision	Assistants	Productivity
1	o1-preview	57.92%	34.5%	94.8%	78.3%	78.2%	22.42%	39.29%
2	o1-mini	55.03%	27%	94.8%	59.4%	78.2%	22.42%	48.34%
3	Claude 3.5 Sonnet	47.69%	33.1%	71.1%	59.4%	68.3%	22.42%	31.82%
4	GPT-4o (0806)	46.92%	29.1%	76.6%	56.1%	69.1%	21.82%	28.79%
5	Gemini 1.5 Pro (0924)	43.99%	31.1%	67.7%	46.2%	62.2%	10.3%	46.43%
6	GPT-4 Turbo (0409)	42.26%	35.1%	72.2%	48%	63.1%	10.91%	24.24%
7	Claude 3 Opus	39.48%	29.7%	60.1%	50.4%	59.4%	17.58%	19.7%
8	Gemini 1.5 Flash (0924)	39.35%	25%	54.9%	39.5%	56.1%	13.33%	47.26%
9	GPT-4o Mini	37.78%	27%	70.2%	40.2%	59.4%	15.15%	14.71%

These benchmarks assess coding skills (BigCodeBench), mathematical proficiency (MATH), expert knowledge (GPQA), vision capabilities (MMMU), AI assistants (GAIA), and productivity (QUAKE). These frontier benchmarks represent some of the most challenging tasks, pushing the boundaries of current multi-modal model capabilities.

Frontier Model Leaderboard

The standard benchmarks, the leading frontier models.

	Model	Average	MMLU	GPQA	MMMU	HellaSWAG	HumanEval	BBHard	GSM8K	MATH
			Knowledge	Expert	Vision	Context	Coding	Reasoning	K6 Math	Uni Math
1	Claude 3.5 Sonnet	82.25%	88.70%	59.4%	68.3%	89%	92%	93.10%	96.40%	71.10%
2	GPT-4o (0513)	81.69%	88.70%	53.6%	69.1%	94.20%	90.20%	91.30%	89.80%	76.60%
3	GPT-4 Turbo (0409)	79.10%	86.50%	48.0%	63.1%	94.20%	90.20%	87.60%	91%	72.20%
4	Llama 3.1 (405B)	79.01%	88.60%	51.1%	64.5%	87%	89%	81.3%	96.80%	73.80%
5	Mistral Large 2 (0724)	78.80%	84%	35.1%	—	89.20%	92%	87.30%	93%	71%
6	Claude 3 Opus	77.35%	86.80%	50.4%	59.4%	95.40%	84.90%	86.80%	95%	60.10%
7	Llama 3.1 70B	75.65%	86%	46.7%	60.6%	87%	80.50%	81.30%	95.10%	68.0%
8	Gemini 1.5 Pro	73.61%	81.90%	46.2%	62.2%	92.50%	71.90%	84%	91.70%	58.50%
9	GPT-4 (0314)	71.15%	86.40%	35.7%	56.8%	95.30%	67%	83.10%	92%	52.90%
10	GPT-4o Mini	70.70%	82%	40.2%	59.4%	—	87.20%	—	—	70.20%
11	Claude 3 Sonnet	69.85%	79%	46.4%	53.1%	89%	73%	82.90%	92.30%	43.10%
12	Gemini 1.5 Flash	68.63%	78.90%	39.5%	56.1%	81.30%	67.50%	89.20%	68.80%	67.70%
13	Claude 3 Haiku	66.10%	75.20%	40.1%	50.2%	85.90%	75.90%	73.70%	88.90%	38.90%
14	Llama 3.1 8B	64.29%	73.0%	32.8%	—	74.20%	72.60%	61%	84.50%	51.90%
15	Mistral Nemo 12B	41.30%	68%	8.72%	—	83.5%	—	—	—	5%

These benchmarks assess general knowledge (MMLU), vision capabilities (MMMU), expert knowledge (GPQA), common-sense reasoning (HELLASWAG), coding skills (HUMANEVAL), and mathematical proficiency (GSM8K, MATH). By evaluating these capabilities, we can understand the strengths and weaknesses of different models.

All Models by Provider

Model	Klu Index	Cost / 1M Tokens	Speed (TPS)	Speed (TTFT)	Best Benchmark
OpenAI
o1-preview	99	$26.25	29	30.55s	MATH 94.8
GPT-4 Turbo (0409)	100	$15	39	0.55s	gsm8k 94.8
GPT-4 Omni (0807)	98	$4.16	131	0.5s	gsm8k 93.8
GPT-4 Omni Mini (0718)	85	$0.26	266	0.57s	MMLU 89
GPT-4 32k (0314)	85	$75	23	0.55s	gsm8k 95.0
GPT-4 (0613)	79	$37.50	23	0.55s	openbookqa 95.6
GPT-4 Vision (1106)	89	$15	27	0.53s	gsm8k 94.8
GPT-3.5 Turbo	70	$0.75	72	0.34s	openbookqa 92.3
GPT-3.5 Turbo Instruct	70	$1.63	130	0.49s	MMLU 70
Azure OpenAI
GPT-4 (0613)	79	$37.50	23	0.57s	openbookqa 95.6
GPT-4 Vision Preview (1106)	89	$15	41	0.57s	gsm8k 94.8
GPT-4 Turbo Preview (1106)	100	$15	33	0.57s	gsm8k 94.8
GPT-4 Omni (0513)	99	$7.50	72	0.5s	gsm8k 93.8
GPT-3.5 Turbo	70	$0.75	72	0.34s	openbookqa 92.3
GPT-3.5 Turbo Instruct	70	$1.63	137	0.62s	gsm8k 93.8
Azure AI
Llama 3.1 405B	89	$7.99	32	0.79s	humaneval 89
Llama 3.1 70B	86	$1.29	86	0.44s	MMLU 86
Llama 3.1 8B	69	$0.30	266	0.29s	MMLU 73
Mistral
Mistral Large 2 (0724)	88	$4.5	44	0.29s	GSM8K 93
Mistral Nemo 12B	79	$0.30	190	0.31s	hellaswag 83.5
Mistral Large	79	$12	24	1s	MMLU 81.2
Mistral Medium	76	$4.05	18	0.48s	MMLU 70.60
Mixtral 8x22B	78	$1.20	69	0.28s	average 78
Mistral 8x7B	70	$0.50	103	0.29s	MMLU 70.60
Mistral Small	70	$3	81	0.72s	MMLU 70.60
Mistral 7B	61	$0.15	88	0.28s	MMLU 70.60
Perplexity
Sonar Large	89	$1	41	1.14s	MMLU 70.60
Sonar Small	66	$0.02	121	0.91s	MMLU 70.60
Cohere
Command R+	74	$6	62	0.44s	MMLU 75.70
Command R	62	$0.75	147	0.35s	MMLU 68.20
Anthropic
Claude Instant	67	$1.90	89	0.42s	MMLU 70.60
Claude 2.1	70	$12	28	0.59s	MMLU 70.60
Claude 3 Haiku	82	$0.50	120	0.3s	humaneval 75.90
Claude 3 Sonnet	82	$6	54	0.64s	bbhard 82.90
Claude 3.5 Sonnet	97	$6	80	0.84s	MMLU 89
Claude 3 Opus	91	$30	23	1.66s	MMLU 86.80
Groq
Gemma 2 9B	52	$0.20	122	0.21s	MMLU 64.30
Llama 3.1 8B	69	$0.30	744	0.29s	MMLU 73
Llama 3.1 70B	86	$1.29	249	0.44s	MMLU 86
Gemma 7B	52	$0.07	1030	0.88s	MMLU 64.30
Llama 3 70B	70	$0.64	358	0.4s	MMLU 70.60
Llama 3 8B	63	$0.06	1211	0.36s	hellaswag 87
Mixtral 8x7B	70	$0.24	552	0.44s	MMLU 70.60
Google
Gemini Pro 1.5	96	$1.25	64	1.88s	bbhard 75
Gemini Flash 1.5	89	$1.25	89	1.88s	bbhard 75
Gemma 7B	52	$0.15	123	0.29s	MMLU 64.30
Gemma 2 27B	83	$0.30	49	0.49s	MMLU 75.2
Gemma 2 8B	82	$0.20	139	0.29s	MMLU 71.3

Frequently asked questions

When evaluating large language models (LLMs), it's crucial to consider benchmark data that showcases each model's abilities across various use cases. Our leaderboard provides a comprehensive comparison of different models, including popular choices like Anthropic Claude Haiku and OpenAI GPT-3.5 Turbo, based on essential metrics such as output quality, tokens used, and performance on specific datasets.

To ensure a fair comparison, we use a consistent set of criteria and prompts across all models. This allows us to assess their average output and benchmark their performance against one another. However, it's important to note that no single benchmark can perfectly capture an LLM's capabilities across all use cases.

When referring to our leaderboard, keep in mind that the benchmarks shown are just a starting point for comparison. The best way to determine which model suits your needs is to experiment with the models using your specific use case and dataset. If you're interested in implementing a particular dataset or benchmark for comparison, please contact us, and we'll be happy to assist you.

As we continue to add new models and datasets to our leaderboard, the breadth of our comparisons will expand, providing you with even more information to make informed decisions. While the benchmarks can help you sort and find suitable models, it's essential to understand that LLMs are constantly changing and improving. What may be the best choice today could be outperformed by a newer model tomorrow.

Ultimately, the key to finding the right LLM for your use case lies in understanding your specific requirements and comparing the available models based on relevant benchmarks and metrics. Our leaderboard aims to make this process easier by providing a comprehensive overview of the current state of LLMs and their performance across various datasets.

Klu is remote-first and global

Follow us

LLM Leaderboard

Frontier Model Comparison

GPT-4 Omni (0807)

Claude 3.5 Sonnet

Frontier Benchmarks Leaderboard

Frontier Model Leaderboard

All Models by Provider

Frequently asked questions

What is the leading LLM?

What is the Klu Index Score?

How is speed measured?

What are common benchmarks?

What is the LLM Leaderboard?

How are models ranked on the LLM Leaderboard?

Can I contribute to the LLM Leaderboard?

What are the latest advancements in open source LLMs?