LLM Leaderboard
This leaderboard is powered by real-time Klu.ai data for evaluating LLM providers, helping you select the optimal API and model for your needs.
Last updated 4/6/2024
The LLM Leaderboard's comprehensive analysis identifies Claude 3 Opus as the top recommendation for creative content applications. With the highest Klu Index Score of 100, Claude 3 Opus excels in creativity, user preference, and overall performance. While it costs $30.00 per million tokens, its exceptional output quality, delivered at 22.8 tokens per second, makes it the most favored choice for AI-driven creative solutions.
Exploring alternatives? Consider these models: GPT-4 Turbo / Vision (1106) leads for function calling and vision applications with a Klu Index of 99, offering 26.5 tokens/s at $15.00 per million tokens. GPT-4 Turbo (0125) is recommended for general AI applications, delivering 18.6 tokens/s at the same $15.00 per million tokens. For customer chat, Claude 3 Sonnet balances speed (54.1 tokens/s) with a cost of $6.00 per million tokens. GPT-4 32k offers large-context reasoning at $75.00 per million tokens. Finally, Claude 3 Haiku stands out for web applications at a cost-effective $0.50 per million tokens, though it may require multiple deployments to work around request rate limits.
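To see how these price and throughput figures translate into per-request numbers, here is a minimal sketch in Python. The figures are copied from the tables below and treated as a single blended per-token rate; real costs also depend on the prompt/completion split and provider-specific pricing tiers.

```python
# Rough cost/latency comparison using the leaderboard's blended pricing and TPS
# figures. Values are copied from the tables below and are illustrative only.

MODELS = {
    # name: (USD per 1M tokens, tokens per second)
    "Claude 3 Opus": (30.00, 22.8),
    "Claude 3 Sonnet": (6.00, 54.1),
    "Claude 3 Haiku": (0.50, 120.0),
    "GPT-4 Turbo (0125)": (15.00, 18.6),
}

def per_request(model: str, tokens: int = 1_000) -> tuple[float, float]:
    """Return (estimated cost in USD, estimated generation time in seconds)."""
    cost_per_million, tps = MODELS[model]
    return tokens / 1_000_000 * cost_per_million, tokens / tps

for name in MODELS:
    cost, seconds = per_request(name)
    print(f"{name}: ~${cost:.4f} and ~{seconds:.1f}s per 1,000 tokens")
```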
Top Models by Klu Index Score
Model | Creator | Best For | Speed (TPS) | Klu Index Score
---|---|---|---|---
Claude 3 Opus | Anthropic | Creative Content | 22.8 | 100.00 |
GPT-4 Turbo / Vision (1106) | OpenAI | Vision Applications | 26.5 | 99.00 |
GPT-4 Turbo (0125) | OpenAI | AI Applications | 18.6 | 99.00 |
Claude 3 Sonnet | Anthropic | Customer Chat | 54.1 | 87.97 |
GPT-4 32k (0314) | OpenAI | Large-context Reasoning | 23 | 85.12 |
Claude 3 Haiku | Anthropic | Web Apps | 120 | 83.81 |
Understanding the Klu Index Score
The Klu Index Score is a composite metric designed to evaluate frontier models across dimensions such as accuracy, evaluation results, human preference, and raw performance. It combines these indicators into a single, comprehensive score, allowing easier comparison between models. The score is particularly useful for identifying which models offer the best balance of quality, cost, and speed for a specific application.
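Klu does not publish the exact formula, but composite indices of this kind are typically weighted averages of normalized component scores. The sketch below illustrates the general idea only; the component names, weights, and numbers are assumptions, not the actual Klu Index computation.

```python
# Illustrative composite score: scale each component against the best observed
# value, then combine with weights. NOT the actual Klu Index formula; the
# components, weights, and example numbers are hypothetical.

WEIGHTS = {"accuracy": 0.4, "human_preference": 0.3, "evals": 0.2, "speed": 0.1}

def composite_index(model: dict[str, float], best: dict[str, float]) -> float:
    """Weighted average of components, each scaled against the best model's value."""
    score = 0.0
    for component, weight in WEIGHTS.items():
        score += weight * (model[component] / best[component]) * 100
    return round(score, 2)

best = {"accuracy": 0.92, "human_preference": 0.88, "evals": 0.90, "speed": 120.0}
candidate = {"accuracy": 0.89, "human_preference": 0.81, "evals": 0.86, "speed": 54.1}
print(composite_index(candidate, best))  # ~89.93 on this made-up data
```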
All Models by Creator
Model | Klu Index | Cost / 1M Tokens | Speed (TPS) | Speed (TTFT) | Best Benchmark
---|---|---|---|---|---
Anthropic | | | | |
Claude 3 Opus | 100 | $30.00 | 22.8 | 1.66s | HellaSwag 95.40
Claude 3 Sonnet | 87.97 | $6.00 | 54.1 | 0.64s | GSM8K 92.30
Claude 3 Haiku | 83.81 | $0.50 | 120 | 0.3s | GSM8K 88.90
Claude 1.0 | 76.59 | $21.50 | 31.9 | 0.53s | —
Claude 2.0 | 72.21 | $12.00 | 31.9 | 0.53s | —
Claude 2.1 | 69.80 | $12.00 | 28.4 | 0.59s | —
Claude Instant | 67.40 | $1.90 | 89.3 | 0.42s | —
OpenAI | | | | |
GPT-4 Turbo / Vision (1106) | 99.00 | $15.00 | 26.5 | 0.53s | MMMU 56.8
GPT-4 Turbo (0125) | 99.00 | $15.00 | 18.6 | 0.55s | —
GPT-4 32k (0314) | 85.12 | $75.00 | 23 | 0.55s | HellaSwag 95.30
GPT-4 (0613) | 79.21 | $37.50 | 23 | 0.55s | HellaSwag 95.30
GPT-3.5 Turbo | 69.58 | $0.75 | 71.8 | 0.34s | HellaSwag 85.50
GPT-3.5 Turbo Instruct | — | $1.63 | 129.7 | 0.49s | —
Google | | | | |
Gemini Pro 1.0 | 71.99 | $1.25 | 89 | 1.88s | HellaSwag 84.70
Gemini Pro 1.5 | 73.22 | $1.25 | 89 | 1.88s | HellaSwag 84.70
Gemini Ultra | — | — | — | — | GSM8K 94.40
Gemma 7B | 52.30 | $0.15 | 123.3 | 0.29s | HellaSwag 81.2
Mistral | | | | |
Mistral Large | 78.99 | $12.00 | 24 | 1s | HellaSwag 89.2
Mistral Medium | 76.37 | $4.05 | 18.4 | 0.26s | —
Mistral 8x7B | 69.58 | $0.50 | 102.9 | 0.29s | HellaSwag 84.40
Mistral Small | 69.58 | $3.00 | 81 | 0.25s | HellaSwag 84.40
Mistral 7B | 60.61 | $0.15 | 88.4 | 0.28s | —
Groq | | | | |
Mistral 8x7B | 69.58 | $0.27 | 466.5 | 0.29s | HellaSwag 84.40
Llama 2 Chat 70B | 62.58 | $0.75 | 791.8 | 0.29s | HellaSwag 87
Gemma 7B | 34.31 | $0.10 | 102.9 | 0.29s | HellaSwag 81.3
Meta | | | | |
Llama 2 Chat (70B) | 62.58 | $1.00 | 70.2 | 0.31s | HellaSwag 87
Llama 2 Chat (13B) | 54.27 | $0.30 | 56.7 | 0.33s | HellaSwag 80.7
Llama 2 Chat (7B) | 51.20 | $0.20 | 85.2 | 0.4s | HellaSwag 77.22
Code Llama (70B) | — | $0.95 | 30.2 | 0.29s | —
OpenChat | | | | |
OpenChat 3.5 | 60.39 | $0.17 | 84.2 | 0.29s | —
Perplexity | | | | |
PPLX-70B Online | 59.74 | $0.70 | 53.8 | 1.14s | —
PPLX-7B-Online | 52.52 | $0.07 | 135.7 | 0.91s | —
Cohere | | | | |
Command | — | $1.63 | 28.8 | 0.33s | —
Command Light | — | $0.38 | 47.9 | 0.3s | —
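If you want to narrow the table down programmatically, a shortlist helper like the following can encode a cost ceiling and a speed floor. The rows are a small hand-copied excerpt of the table above; `Row` and `shortlist` are illustrative names, not part of any Klu API.

```python
# Hypothetical helper for narrowing the leaderboard to models that fit a budget
# and latency target. The rows below are a hand-copied excerpt of the table above.

from dataclasses import dataclass

@dataclass
class Row:
    model: str
    creator: str
    klu_index: float
    cost_per_million: float  # USD per 1M tokens
    tps: float               # tokens per second

ROWS = [
    Row("Claude 3 Haiku", "Anthropic", 83.81, 0.50, 120.0),
    Row("Claude 3 Sonnet", "Anthropic", 87.97, 6.00, 54.1),
    Row("GPT-3.5 Turbo", "OpenAI", 69.58, 0.75, 71.8),
    Row("Mistral 8x7B", "Groq", 69.58, 0.27, 466.5),
]

def shortlist(max_cost: float, min_tps: float) -> list[Row]:
    """Models under the cost ceiling and above the speed floor, best Klu Index first."""
    matches = [r for r in ROWS if r.cost_per_million <= max_cost and r.tps >= min_tps]
    return sorted(matches, key=lambda r: r.klu_index, reverse=True)

for row in shortlist(max_cost=1.00, min_tps=60):
    print(row.model, row.creator, row.klu_index)
```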
Frequently asked questions
When evaluating large language models (LLMs), it's crucial to consider benchmark data that showcases each model's abilities across various use cases. Our leaderboard provides a comprehensive comparison of different models, including popular choices like Anthropic's Claude 3 Haiku and OpenAI's GPT-3.5 Turbo, based on essential metrics such as output quality, cost per token, and performance on specific datasets.
To ensure a fair comparison, we use a consistent set of criteria and prompts across all models. This allows us to assess their average output and benchmark their performance against one another. However, it's important to note that no single benchmark can perfectly capture an LLM's capabilities across all use cases.
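In practice, that kind of side-by-side evaluation amounts to sending an identical prompt set to each model and collecting the outputs for scoring. The sketch below shows the shape of such a harness; `call_model` is a placeholder you would replace with your own provider client, not a real library function.

```python
# Sketch of a side-by-side comparison: run the same prompts against each model
# and collect outputs for scoring. `call_model` is a placeholder for your own
# provider client (OpenAI, Anthropic, etc.); it is not a real library function.

PROMPTS = [
    "Summarize the plot of Hamlet in two sentences.",
    "Write a SQL query that returns the ten most recent orders.",
]

MODELS = ["claude-3-haiku", "gpt-3.5-turbo"]  # identifiers depend on your provider

def call_model(model: str, prompt: str) -> str:
    """Placeholder: swap in the actual API call for each provider."""
    raise NotImplementedError

def compare(models: list[str], prompts: list[str]) -> dict[tuple[str, str], str]:
    """Same prompts, in the same order, for every model, so outputs are directly comparable."""
    return {(m, p): call_model(m, p) for m in models for p in prompts}
```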
When referring to our leaderboard, keep in mind that the benchmarks shown are just a starting point for comparison. The best way to determine which model suits your needs is to experiment with the models using your specific use case and dataset. If you're interested in implementing a particular dataset or benchmark for comparison, please contact us, and we'll be happy to assist you.
As we continue to add new models and datasets to our leaderboard, the breadth of our comparisons will expand, providing you with even more information to make informed decisions. While the benchmarks can help you sort and find suitable models, it's essential to understand that LLMs are constantly changing and improving. What may be the best choice today could be outperformed by a newer model tomorrow.
Ultimately, the key to finding the right LLM for your use case lies in understanding your specific requirements and comparing the available models based on relevant benchmarks and metrics. Our leaderboard aims to make this process easier by providing a comprehensive overview of the current state of LLMs and their performance across various datasets.