
LLM Leaderboard

Real-time Klu.ai data powers this leaderboard for evaluating LLM providers, enabling selection of the optimal API and model for your needs.

Last updated 4/6/2024

Most Preferred: Claude 3 Opus (100 Klu Index)
Largest Context: Claude 3 Opus (200k context window)
Most Expensive: GPT-4 32k (0314) ($75.00 / million tokens)
Least Expensive: PPLX-7B-Online ($0.07 / million tokens)
Fastest TPS: Mistral Medium (18.4 tokens/s)
Fastest TTFT: Mistral Small (0.25s TTFT)

The LLM Leaderboard's comprehensive analysis reveals Claude 3 Opus as the top recommendation for creative content applications. With the highest Klu Index Score of 100, Claude 3 Opus excels in creativity, user preference, and overall performance. While its cost is $30.00 per million tokens, the model's exceptional quality and speed of 22.8 tokens per second make it the most favored choice for enhancing AI-driven creative solutions.

Exploring alternatives? Consider these models: GPT-4 Turbo / Vision (1106) leads for function calling and vision applications with a Klu Index of 99, offering 26.5 tokens/s at $15/M tokens. GPT-4 Turbo (0125) is recommended for AI applications, with a speed of 18.6 tokens/s at $15/M tokens. For customer chat, Claude 3 Sonnet balances speed (54.1 tokens/s) with cost at $6/M tokens. GPT-4 32k offers large-context reasoning at $75/M tokens. Additionally, Claude 3 Haiku stands out for web applications at a cost-effective $0.50/M tokens, though it may require multiple deployments to work around request limits.

Top Models by Klu Index Score

| Model | Creator | Best For | Speed (TPS) | Klu Index Score |
| --- | --- | --- | --- | --- |
| Claude 3 Opus | Anthropic | Creative Content | 22.8 | 100.00 |
| GPT-4 Turbo / Vision (1106) | OpenAI | Vision Applications | 26.5 | 99.00 |
| GPT-4 Turbo (0125) | OpenAI | AI Applications | 18.6 | 99.00 |
| Claude 3 Sonnet | Anthropic | Customer Chat | 54.1 | 87.97 |
| GPT-4 32k (0314) | OpenAI | Large-context Reasoning | 23 | 85.12 |
| Claude 3 Haiku | Anthropic | Web Apps | 120 | 83.81 |
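One way to weigh these scores against price is an "index points per dollar" ratio. The sketch below uses the figures from the table above; the ratio itself is our own illustrative metric, not part of Klu's methodology.

```python
# Rank the leaderboard rows by Klu Index points per dollar.
# Figures copied from the table above; the ratio is illustrative only.
models = [
    # (name, klu_index, usd_per_1m_tokens, tokens_per_second)
    ("Claude 3 Opus", 100.00, 30.00, 22.8),
    ("GPT-4 Turbo / Vision (1106)", 99.00, 15.00, 26.5),
    ("GPT-4 Turbo (0125)", 99.00, 15.00, 18.6),
    ("Claude 3 Sonnet", 87.97, 6.00, 54.1),
    ("GPT-4 32k (0314)", 85.12, 75.00, 23.0),
    ("Claude 3 Haiku", 83.81, 0.50, 120.0),
]

def index_per_dollar(row):
    name, index, cost, tps = row
    return index / cost

for name, index, cost, tps in sorted(models, key=index_per_dollar, reverse=True):
    print(f"{name:<30} {index / cost:8.2f} index points per $")
```

On this measure Claude 3 Haiku dominates (over 160 index points per dollar), while GPT-4 32k (0314) ranks last, which mirrors the recommendations above.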

Understanding the Klu Index Score

The Klu Index Score is a composite metric designed to evaluate frontier models across dimensions such as accuracy, benchmark evaluations, human preference, and runtime performance. By combining multiple indicators into a single, comprehensive score, it makes comparison between models easier. This score is particularly useful for identifying which models provide the best balance of quality, cost, and speed for a specific application.
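As a rough illustration of how any composite metric of this kind is built (the actual Klu Index sub-metrics and weights are not published on this page, so everything below is an invented example), sub-scores on a common scale can be combined with a weighted average:

```python
# Hypothetical composite-score sketch: the sub-metrics and weights
# are invented for illustration and are NOT Klu's actual formula.
def composite_score(metrics: dict[str, float], weights: dict[str, float]) -> float:
    """metrics: sub-scores already on a 0-100 scale; weights must sum to 1."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return sum(metrics[name] * w for name, w in weights.items())

weights = {"accuracy": 0.4, "human_preference": 0.3, "speed": 0.3}
score = composite_score(
    {"accuracy": 95.0, "human_preference": 90.0, "speed": 80.0}, weights
)
print(round(score, 2))  # 0.4*95 + 0.3*90 + 0.3*80 ≈ 89.0
```

The weighting choice is what makes such an index opinionated: shifting weight from accuracy toward speed would reorder models like Claude 3 Haiku and GPT-4 32k.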

All Models by Creator

| Model | Klu Index | Cost / 1M Tokens | Speed (TPS) | TTFT | Best Benchmark |
| --- | --- | --- | --- | --- | --- |
| Claude 3 Opus | 100 | $30.00 | 22.8 | 1.66s | HellaSwag 95.40 |
| Claude 3 Sonnet | 87.97 | $6.00 | 54.1 | 0.64s | GSM8K 92.30 |
| Claude 3 Haiku | 83.81 | $0.50 | 120 | 0.3s | GSM8K 88.90 |
| Claude 1.0 | 76.59 | $21.50 | 31.9 | 0.53s | — |
| Claude 2.0 | 72.21 | $12.00 | 31.9 | 0.53s | — |
| Claude 2.1 | 69.80 | $12.00 | 28.4 | 0.59s | — |
| Claude Instant | 67.40 | $1.90 | 89.3 | 0.42s | — |
| GPT-4 Turbo / Vision (1106) | 99.00 | $15.00 | 26.5 | 0.53s | MMMU 56.8 |
| GPT-4 Turbo (0125) | 99.00 | $15.00 | 18.6 | 0.55s | — |
| GPT-4 32k (0314) | 85.12 | $75.00 | 23 | 0.55s | HellaSwag 95.30 |
| GPT-4 (0613) | 79.21 | $37.50 | 23 | 0.55s | HellaSwag 95.30 |
| GPT-3.5 Turbo | 69.58 | $0.75 | 71.8 | 0.34s | HellaSwag 85.50 |
| GPT-3.5 Turbo Instruct | — | $1.63 | 129.7 | 0.49s | — |
| Gemini Pro 1.0 | 71.99 | $1.25 | 89 | 1.88s | HellaSwag 84.70 |
| Gemini Pro 1.5 | 73.22 | $1.25 | 89 | 1.88s | HellaSwag 84.70 |
| Gemini Ultra | — | — | — | — | GSM8K 94.40 |
| Gemma 7B | 52.30 | $0.15 | 123.3 | 0.29s | HellaSwag 81.2 |
| Mistral Large | 78.99 | $12.00 | 24 | 1s | HellaSwag 89.2 |
| Mistral Medium | 76.37 | $4.05 | 18.4 | 0.26s | — |
| Mistral 8x7B | 69.58 | $0.50 | 102.9 | 0.29s | HellaSwag 84.40 |
| Mistral Small | 69.58 | $3.00 | 81 | 0.25s | HellaSwag 84.40 |
| Mistral 7B | 60.61 | $0.15 | 88.4 | 0.28s | — |
| Mistral 8x7B | 69.58 | $0.27 | 466.5 | 0.29s | HellaSwag 84.40 |
| Llama 2 Chat 70B | 62.58 | $0.75 | 791.8 | 0.29s | HellaSwag 87 |
| Gemma 7B | 34.31 | $0.10 | 102.9 | 0.29s | HellaSwag 81.3 |
| Llama 2 Chat (70B) | 62.58 | $1.00 | 70.2 | 0.31s | HellaSwag 87 |
| Llama 2 Chat (13B) | 54.27 | $0.30 | 56.7 | 0.33s | HellaSwag 80.7 |
| Llama 2 Chat (7B) | 51.20 | $0.20 | 85.2 | 0.4s | HellaSwag 77.22 |
| Code Llama (70B) | — | $0.95 | 30.2 | 0.29s | — |
| OpenChat 3.5 | 60.39 | $0.17 | 84.2 | 0.29s | — |
| PPLX-70B Online | 59.74 | $0.70 | 53.8 | 1.14s | — |
| PPLX-7B-Online | 52.52 | $0.07 | 135.7 | 0.91s | — |
| Command | — | $1.63 | 28.8 | 0.33s | — |
| Command Light | — | $0.38 | 47.9 | 0.3s | — |
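The cost and speed columns combine into practical per-request estimates: end-to-end latency is roughly TTFT plus output tokens divided by TPS, and cost scales linearly with tokens at the listed per-million rate. A rough sketch, using the Claude 3 Haiku row from the table as the example (the 500-token completion size is an arbitrary assumption):

```python
# Back-of-the-envelope request estimate from the table's columns.
# Assumes latency ≈ TTFT + output_tokens / TPS and linear token pricing.
def estimate(cost_per_1m_usd: float, tps: float, ttft_s: float, tokens: int = 500):
    """Return (latency_seconds, cost_usd) for a completion of `tokens` tokens."""
    latency_s = ttft_s + tokens / tps
    cost_usd = tokens / 1_000_000 * cost_per_1m_usd
    return latency_s, cost_usd

# Claude 3 Haiku row: $0.50 / 1M tokens, 120 TPS, 0.3s TTFT
latency, cost = estimate(0.50, 120, 0.3)
print(f"~{latency:.1f}s and ~${cost:.5f} for a 500-token completion")
```

These are idealized figures; real latency also depends on prompt length, network overhead, and provider-side queueing, none of which the table captures.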

Frequently Asked Questions

When evaluating large language models (LLMs), it's crucial to consider benchmark data that showcases each model's abilities across various use cases. Our leaderboard provides a comprehensive comparison of different models, including popular choices like Anthropic's Claude 3 Haiku and OpenAI's GPT-3.5 Turbo, based on essential metrics such as output quality, token usage, and performance on specific datasets.

To ensure a fair comparison, we use a consistent set of criteria and prompts across all models. This allows us to assess their average output and benchmark their performance against one another. However, it's important to note that no single benchmark can perfectly capture an LLM's capabilities across all use cases.
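That "same prompts, same grading" setup can be sketched in a few lines. The `call_model` and `grade` arguments below are hypothetical stand-ins for a real API client and scoring rubric; this is not Klu's actual harness.

```python
# Minimal sketch of a consistent-criteria benchmark: every model sees
# the identical prompt set and is scored by the identical grader.
def benchmark(model_names, call_model, grade, prompts):
    """Return the average grade per model over the shared prompt set."""
    return {
        model: sum(grade(p, call_model(model, p)) for p in prompts) / len(prompts)
        for model in model_names
    }

# Stub run: a fake model that echoes the prompt, graded on non-emptiness.
demo = benchmark(
    ["echo-model"],
    call_model=lambda model, prompt: prompt,
    grade=lambda prompt, output: 1.0 if output else 0.0,
    prompts=["Summarize: ...", "Translate: ..."],
)
print(demo)  # every prompt scores 1.0, so the average is 1.0
```

Holding prompts and grading fixed is what makes the averages comparable across models; changing either between runs would invalidate the comparison.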

When referring to our leaderboard, keep in mind that the benchmarks shown are just a starting point for comparison. The best way to determine which model suits your needs is to experiment with the models using your specific use case and dataset. If you're interested in implementing a particular dataset or benchmark for comparison, please contact us, and we'll be happy to assist you.

As we continue to add new models and datasets to our leaderboard, the breadth of our comparisons will expand, providing you with even more information to make informed decisions. While the benchmarks can help you sort and find suitable models, it's essential to understand that LLMs are constantly changing and improving. What may be the best choice today could be outperformed by a newer model tomorrow.

Ultimately, the key to finding the right LLM for your use case lies in understanding your specific requirements and comparing the available models based on relevant benchmarks and metrics. Our leaderboard aims to make this process easier by providing a comprehensive overview of the current state of LLMs and their performance across various datasets.