LMSYS Chatbot Arena Leaderboard
by Stephen M. Walker II, Co-Founder / CEO
What is the LMSYS Chatbot Arena Leaderboard?
The LMSYS Chatbot Arena Leaderboard is a ranking platform that assesses the performance of large language models (LLMs) in conversational tasks. Models such as GPT-4, Claude, and Gemini are compared head-to-head by human voters, and those preferences are converted into an Elo-style rating; the leaderboard pairs this with automated benchmarks such as MT-bench and MMLU, giving a clear view of each model's strengths and weaknesses in real-world use.
The leaderboard is updated regularly, reflecting the latest advancements in AI technology. It includes models from leading organizations such as OpenAI, Anthropic, Google, and Meta, showcasing their capabilities in engaging and informative conversations.
Key features of the LMSYS Chatbot Arena Leaderboard include:
- Diverse Evaluation Metrics — The leaderboard combines several metrics: the Chatbot Arena Elo rating, derived from crowdsourced user votes, plus automated benchmarks such as MT-bench and MMLU.
- Regular Updates — The leaderboard is frequently updated to reflect the latest developments in LLM technology, ensuring that it remains a relevant and valuable resource for AI researchers and developers.
- Community Engagement — Users can participate in the evaluation process by providing feedback on chatbot interactions, contributing to the dynamic nature of the leaderboard.
LMSYS Chatbot Arena Leaderboard (September 2024)
The columns reported vary by entry: newer models list a 95% confidence interval on the Arena Elo rating and the number of recorded votes, while older entries list MT-bench and MMLU scores and knowledge cutoffs where available. Missing values are marked with —.

| Model | Arena Elo | 95% CI | Votes | MT-bench | MMLU | Knowledge Cutoff | License |
|---|---|---|---|---|---|---|---|
| o1-preview | 1355 | +12/-11 | 2991 | — | — | — | Proprietary |
| ChatGPT-4o-latest (2024-09-03) | 1335 | +5/-6 | 10213 | — | — | — | Proprietary |
| o1-mini | 1324 | +12/-9 | 3009 | — | — | — | Proprietary |
| Gemini-1.5-Pro-Exp-0827 | 1299 | +5/-4 | 28229 | — | — | — | Proprietary |
| Grok-2-08-13 | 1294 | +4/-4 | 23999 | — | — | — | Proprietary |
| GPT-4o-2024-05-13 | 1285 | +3/-3 | 90695 | — | — | — | Proprietary |
| GPT-4o-mini-2024-07-18 | 1273 | +3/-3 | 30434 | — | — | — | Proprietary |
| Claude 3.5 Sonnet | 1269 | +3/-3 | 62977 | — | — | — | Proprietary |
| Gemini-1.5-Flash-Exp-0827 | 1269 | +4/-4 | 22264 | — | — | — | Proprietary |
| Grok-2-Mini-08-13 | 1267 | +4/-5 | 22041 | — | — | — | Proprietary |
| Gemini Advanced App (2024-05-14) | 1267 | +3/-3 | 52218 | — | — | — | Proprietary |
| Meta-Llama-3.1-405b-Instruct-fp8 | 1266 | +4/-4 | 31280 | — | — | — | Llama 3.1 Community |
| Meta-Llama-3.1-405b-Instruct-bf16 | 1264 | +6/-8 | 5865 | — | — | — | Llama 3.1 Community |
| GPT-4o-2024-08-06 | 1263 | +4/-3 | 22562 | — | — | — | Proprietary |
| Gemini-1.5-Pro-001 | 1259 | +3/-3 | 80656 | — | — | — | Proprietary |
| GPT-4-Turbo-2024-04-09 | 1257 | +3/-2 | 92973 | — | — | — | Proprietary |
| GPT-4-1106-preview | 1251 | — | — | 9.40 | — | 2023/4 | Proprietary |
| Mistral-Large-2407 | 1250 | — | — | — | — | 2024/7 | Mistral Research |
| Athene-70b | 1250 | — | — | — | — | 2024/7 | CC-BY-NC-4.0 |
| Meta-Llama-3.1-70b-Instruct | 1249 | — | — | — | — | 2023/12 | Llama 3.1 Community |
| Claude 3 Opus | 1248 | — | — | 9.45 | 87.1 | — | Proprietary |
| GPT-4-0125-preview | 1245 | — | — | 9.38 | — | — | Proprietary |
| Yi-Large-preview | 1240 | — | — | — | — | — | Proprietary |
| Gemini-1.5-Flash-001 | 1227 | — | — | — | 78.9 | — | Proprietary |
| Deepseek-v2-API-0628 | 1219 | — | — | — | — | — | DeepSeek |
| Gemma-2-27b-it | 1218 | — | — | — | — | — | Gemma license |
| Yi-Large | 1212 | — | — | — | — | — | Proprietary |
| Gemini App (2024-01-24) | 1209 | — | — | — | — | — | Proprietary |
| Nemotron-4-340B-Instruct | 1209 | — | — | — | — | — | NVIDIA Open Model |
| GLM-4-0520 | 1207 | — | — | — | — | — | Proprietary |
| Llama-3-70b-Instruct | 1206 | — | — | — | 82.0 | — | Llama 3 Community |
| Claude 3 Sonnet | 1201 | — | — | 9.22 | 87.0 | — | Proprietary |
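The +12/-11 style figures are 95% confidence intervals on the Arena Elo rating, which LMSYS derives by bootstrapping over the recorded battles: resample the battle log with replacement, recompute the ratings each time, and take the spread of the results. The sketch below illustrates that idea only; the rating function is deliberately a toy (win rate rescaled to an Elo-like range), and the battle data is fabricated, whereas the production pipeline bootstraps full statistical model fits over millions of votes.

```python
# Sketch of bootstrap confidence intervals for arena-style ratings.
# Toy rating function and fabricated battles -- not the production pipeline.
import random

def rate(battles):
    """Toy rating: win rate rescaled to an Elo-like range around 1000."""
    wins, games = {}, {}
    for winner, loser in battles:
        wins[winner] = wins.get(winner, 0) + 1
        for m in (winner, loser):
            games[m] = games.get(m, 0) + 1
    return {m: 1000 + 400 * (wins.get(m, 0) / games[m] - 0.5) for m in games}

# Fabricated battle log: model-x beat model-y 60 times, lost 40 times.
battles = [("model-x", "model-y")] * 60 + [("model-y", "model-x")] * 40

scores = []
for _ in range(1000):
    resampled = random.choices(battles, k=len(battles))  # resample with replacement
    scores.append(rate(resampled)["model-x"])

scores.sort()
low, high = scores[25], scores[975]  # ~2.5th and ~97.5th percentiles
print(f"model-x: 95% CI roughly {low:.0f} to {high:.0f}")
```

Note how the interval width tracks vote volume in the table above: heavily voted models like GPT-4-Turbo-2024-04-09 (92973 votes) have tight ±3 intervals, while o1-preview (2991 votes) spans ±12.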
Vision Leaderboard
The Vision Leaderboard ranks multimodal models on image-based conversations. As in the text arena, users compare two anonymous models side by side and vote for the better response; each model's Arena Score is shown with a 95% confidence interval, alongside its organization and knowledge cutoff.
| Model | Arena Score (95% CI) | Organization | Knowledge Cutoff |
|---|---|---|---|
| Gemini-1.5-Pro-Exp-0827 | 1231 (+9/-6) | Google | 2023/11 |
| GPT-4o-2024-05-13 | 1209 (+6/-6) | OpenAI | 2023/10 |
| Gemini-1.5-Flash-Exp-0827 | 1208 (+11/-12) | Google | 2023/11 |
| Claude 3.5 Sonnet | 1191 (+6/-4) | Anthropic | 2024/4 |
| Gemini-1.5-Pro-001 | 1151 (+8/-6) | Google | 2023/11 |
| GPT-4-Turbo-2024-04-09 | 1151 (+7/-4) | OpenAI | 2023/12 |
| GPT-4o-mini-2024-07-18 | 1120 (+6/-5) | OpenAI | 2023/10 |
| Gemini-1.5-Flash-8b-Exp-0827 | 1110 (+9/-10) | Google | 2023/11 |
| Qwen2-VL-72B | 1085 (+26/-19) | Alibaba | Unknown |
| Claude 3 Opus | 1075 (+5/-6) | Anthropic | 2023/8 |
| Gemini-1.5-Flash-001 | 1072 (+7/-6) | Google | 2023/11 |
| InternVL2-26b | 1068 (+8/-7) | OpenGVLab | 2024/7 |
| Claude 3 Sonnet | 1048 (+6/-6) | Anthropic | 2023/8 |
| Yi-Vision | 1039 (+15/-15) | 01 AI | 2024/7 |
| qwen2-vl-7b-instruct | 1037 (+23/-21) | Alibaba | Unknown |
| Reka-Flash-Preview-20240611 | 1024 (+8/-6) | Reka AI | Unknown |
| Reka-Core-20240501 | 1015 (+5/-6) | Reka AI | Unknown |
| InternVL2-4b | 1010 (+9/-8) | OpenGVLab | 2024/7 |
| LLaVA-v1.6-34B | 1000 (+9/-7) | LLaVA | 2024/1 |
| Claude 3 Haiku | 1000 (+7/-6) | Anthropic | 2023/8 |
| LLaVA-OneVision-qwen2-72b-ov-sft | 992 (+16/-13) | LLaVA | 2024/8 |
| CogVLM2-llama3-chat-19b | 990 (+13/-12) | Zhipu AI | 2024/7 |
| MiniCPM-v 2_6 | 976 (+15/-13) | OpenBMB | 2024/7 |
| Phi-3.5-vision-instruct | 916 (+11/-10) | Microsoft | 2024/8 |
| Phi-3-Vision-128k-Instruct | 874 (+15/-12) | Microsoft | 2024/3 |
The leaderboard covers a wide spectrum of models from organizations including OpenAI, Anthropic, Google, Meta, Alibaba, and Reka AI, reporting Arena Elo ratings alongside confidence intervals, vote counts, and benchmark scores such as MT-bench and MMLU where available. The September 2024 update underscores the rapid pace of the field, with new models consistently raising the bar, and the LMSYS Chatbot Arena Leaderboard remains an essential resource for tracking state-of-the-art LLM capabilities.
How does the LMSYS Chatbot Arena Leaderboard work?
The LMSYS Chatbot Arena evaluates LLMs through anonymized, randomized head-to-head battles: a user submits a prompt, receives responses from two unnamed models side by side, and votes for the better one. These pairwise votes are aggregated into the Arena Elo rating, so the leaderboard reflects human preference at scale, supplemented by static benchmarks such as MT-bench and MMLU.
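For intuition, here is a minimal sketch of how pairwise votes can move Elo-style ratings. It is illustrative only: the model names, vote log, and K-factor are hypothetical, and the production leaderboard fits a Bradley-Terry model over the full battle history rather than applying sequential Elo updates.

```python
# Minimal Elo update from pairwise votes -- illustrative only.
# Model names, the vote log, and K below are hypothetical.

K = 4  # small update step, so a single vote moves ratings only slightly

def expected(r_a: float, r_b: float) -> float:
    """Elo-model probability that the first model wins."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings: dict, a: str, b: str, score_a: float) -> None:
    """score_a: 1.0 if a wins, 0.0 if b wins, 0.5 for a tie."""
    e_a = expected(ratings[a], ratings[b])
    ratings[a] += K * (score_a - e_a)
    ratings[b] += K * ((1.0 - score_a) - (1.0 - e_a))

ratings = {"model-x": 1000.0, "model-y": 1000.0}  # all models start equal
votes = [("model-x", "model-y", 1.0),   # user preferred model-x
         ("model-x", "model-y", 0.5),   # tie
         ("model-x", "model-y", 0.0)]   # user preferred model-y
for a, b, score_a in votes:
    update(ratings, a, b, score_a)
print(ratings)
```

Because sequential Elo updates depend on the order battles arrive in, LMSYS moved to the order-independent Bradley-Terry estimator; the interpretation stays the same (rating gaps map to expected win rates) but the resulting scores are more stable.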
The leaderboard is an essential resource for developers and researchers, offering insights into the strengths and weaknesses of various models. It helps identify areas for improvement and guides the development of more advanced conversational AI systems.
What is the purpose of the LMSYS Chatbot Arena Leaderboard?
The purpose of the LMSYS Chatbot Arena Leaderboard is to provide a transparent and dynamic evaluation of LLMs in conversational settings. By incorporating user feedback and automated scoring, it offers a comprehensive view of model performance, helping to drive innovation and improvement in AI technology.
The leaderboard is designed to foster collaboration and knowledge sharing among AI researchers and developers, promoting the development of more effective and engaging conversational models.
Future Directions for the LMSYS Chatbot Arena Leaderboard
Future directions for the LMSYS Chatbot Arena Leaderboard include expanding the range of evaluation metrics, incorporating more diverse conversational scenarios, and enhancing user engagement. By continuously evolving, the leaderboard aims to remain at the forefront of AI evaluation, providing valuable insights into the capabilities of the latest LLMs.