LMSYS Chatbot Arena Leaderboard
by Stephen M. Walker II, Co-Founder / CEO
What is the LMSYS Chatbot Arena Leaderboard?
The LMSYS Chatbot Arena Leaderboard is a ranking platform that assesses the performance of large language models (LLMs) in conversational tasks. Models such as GPT-4, Claude, and Gemini are compared head-to-head by human voters, and those preferences are converted into an Elo-style rating; the leaderboard pairs this with automated benchmarks such as MT-bench and MMLU, giving a clear view of each model's strengths and weaknesses in real-world use.
The leaderboard is updated regularly, reflecting the latest advancements in AI technology. It includes models from leading organizations such as OpenAI, Anthropic, Google, and Meta, showcasing their capabilities in engaging and informative conversations.
Key features of the LMSYS Chatbot Arena Leaderboard include:
- Diverse Evaluation Metrics — The leaderboard combines several metrics: the Chatbot Arena Elo rating, derived from crowdsourced user votes, plus automated benchmarks such as MT-bench and MMLU.
- Regular Updates — The leaderboard is frequently updated to reflect the latest developments in LLM technology, ensuring that it remains a relevant and valuable resource for AI researchers and developers.
- Community Engagement — Users can participate in the evaluation process by providing feedback on chatbot interactions, contributing to the dynamic nature of the leaderboard.
LMSYS Chatbot Arena Leaderboard (September 2024)
The columns reported vary by entry: newer models list a 95% confidence interval on the Arena Elo rating and the number of recorded votes, while older entries list MT-bench and MMLU scores and knowledge cutoffs where available. Missing values are marked with —.

| Model | Arena Elo | 95% CI | Votes | MT-bench | MMLU | Knowledge Cutoff | License |
|---|---|---|---|---|---|---|---|
| o1-preview | 1355 | +12/-11 | 2991 | — | — | — | Proprietary |
| ChatGPT-4o-latest (2024-09-03) | 1335 | +5/-6 | 10213 | — | — | — | Proprietary |
| o1-mini | 1324 | +12/-9 | 3009 | — | — | — | Proprietary |
| Gemini-1.5-Pro-Exp-0827 | 1299 | +5/-4 | 28229 | — | — | — | Proprietary |
| Grok-2-08-13 | 1294 | +4/-4 | 23999 | — | — | — | Proprietary |
| GPT-4o-2024-05-13 | 1285 | +3/-3 | 90695 | — | — | — | Proprietary |
| GPT-4o-mini-2024-07-18 | 1273 | +3/-3 | 30434 | — | — | — | Proprietary |
| Claude 3.5 Sonnet | 1269 | +3/-3 | 62977 | — | — | — | Proprietary |
| Gemini-1.5-Flash-Exp-0827 | 1269 | +4/-4 | 22264 | — | — | — | Proprietary |
| Grok-2-Mini-08-13 | 1267 | +4/-5 | 22041 | — | — | — | Proprietary |
| Gemini Advanced App (2024-05-14) | 1267 | +3/-3 | 52218 | — | — | — | Proprietary |
| Meta-Llama-3.1-405b-Instruct-fp8 | 1266 | +4/-4 | 31280 | — | — | — | Llama 3.1 Community |
| Meta-Llama-3.1-405b-Instruct-bf16 | 1264 | +6/-8 | 5865 | — | — | — | Llama 3.1 Community |
| GPT-4o-2024-08-06 | 1263 | +4/-3 | 22562 | — | — | — | Proprietary |
| Gemini-1.5-Pro-001 | 1259 | +3/-3 | 80656 | — | — | — | Proprietary |
| GPT-4-Turbo-2024-04-09 | 1257 | +3/-2 | 92973 | — | — | — | Proprietary |
| GPT-4-1106-preview | 1251 | — | — | 9.40 | — | 2023/4 | Proprietary |
| Mistral-Large-2407 | 1250 | — | — | — | — | 2024/7 | Mistral Research |
| Athene-70b | 1250 | — | — | — | — | 2024/7 | CC-BY-NC-4.0 |
| Meta-Llama-3.1-70b-Instruct | 1249 | — | — | — | — | 2023/12 | Llama 3.1 Community |
| Claude 3 Opus | 1248 | — | — | 9.45 | 87.1 | — | Proprietary |
| GPT-4-0125-preview | 1245 | — | — | 9.38 | — | — | Proprietary |
| Yi-Large-preview | 1240 | — | — | — | — | — | Proprietary |
| Gemini-1.5-Flash-001 | 1227 | — | — | — | 78.9 | — | Proprietary |
| Deepseek-v2-API-0628 | 1219 | — | — | — | — | — | DeepSeek |
| Gemma-2-27b-it | 1218 | — | — | — | — | — | Gemma license |
| Yi-Large | 1212 | — | — | — | — | — | Proprietary |
| Gemini App (2024-01-24) | 1209 | — | — | — | — | — | Proprietary |
| Nemotron-4-340B-Instruct | 1209 | — | — | — | — | — | NVIDIA Open Model |
| GLM-4-0520 | 1207 | — | — | — | — | — | Proprietary |
| Llama-3-70b-Instruct | 1206 | — | — | — | 82.0 | — | Llama 3 Community |
| Claude 3 Sonnet | 1201 | — | — | 9.22 | 87.0 | — | Proprietary |
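The +12/-11 style figures are 95% confidence intervals on the Arena Elo rating, which LMSYS derives by bootstrapping over the recorded battles: resample the battle log with replacement, recompute the ratings each time, and take the spread of the results. The sketch below illustrates that idea only; the rating function is deliberately a toy (win rate rescaled to an Elo-like range), and the battle data is fabricated, whereas the production pipeline bootstraps full statistical model fits over millions of votes.

```python
# Sketch of bootstrap confidence intervals for arena-style ratings.
# Toy rating function and fabricated battles -- not the production pipeline.
import random

def rate(battles):
    """Toy rating: win rate rescaled to an Elo-like range around 1000."""
    wins, games = {}, {}
    for winner, loser in battles:
        wins[winner] = wins.get(winner, 0) + 1
        for m in (winner, loser):
            games[m] = games.get(m, 0) + 1
    return {m: 1000 + 400 * (wins.get(m, 0) / games[m] - 0.5) for m in games}

# Fabricated battle log: model-x beat model-y 60 times, lost 40 times.
battles = [("model-x", "model-y")] * 60 + [("model-y", "model-x")] * 40

scores = []
for _ in range(1000):
    resampled = random.choices(battles, k=len(battles))  # resample with replacement
    scores.append(rate(resampled)["model-x"])

scores.sort()
low, high = scores[25], scores[975]  # ~2.5th and ~97.5th percentiles
print(f"model-x: 95% CI roughly {low:.0f} to {high:.0f}")
```

Note how the interval width tracks vote volume in the table above: heavily voted models like GPT-4-Turbo-2024-04-09 (92973 votes) have tight ±3 intervals, while o1-preview (2991 votes) spans ±12.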
Vision Leaderboard
The Vision Leaderboard ranks multimodal models on image-based conversations. As in the text arena, users compare two anonymous models side by side and vote for the better response; each model's Arena Score is shown with a 95% confidence interval, alongside its organization and knowledge cutoff.
| Model | Arena Score (95% CI) | Organization | Knowledge Cutoff |
|---|---|---|---|
| Gemini-1.5-Pro-Exp-0827 | 1231 (+9/-6) | Google | 2023/11 |
| GPT-4o-2024-05-13 | 1209 (+6/-6) | OpenAI | 2023/10 |
| Gemini-1.5-Flash-Exp-0827 | 1208 (+11/-12) | Google | 2023/11 |
| Claude 3.5 Sonnet | 1191 (+6/-4) | Anthropic | 2024/4 |
| Gemini-1.5-Pro-001 | 1151 (+8/-6) | Google | 2023/11 |
| GPT-4-Turbo-2024-04-09 | 1151 (+7/-4) | OpenAI | 2023/12 |
| GPT-4o-mini-2024-07-18 | 1120 (+6/-5) | OpenAI | 2023/10 |
| Gemini-1.5-Flash-8b-Exp-0827 | 1110 (+9/-10) | Google | 2023/11 |
| Qwen2-VL-72B | 1085 (+26/-19) | Alibaba | Unknown |
| Claude 3 Opus | 1075 (+5/-6) | Anthropic | 2023/8 |
| Gemini-1.5-Flash-001 | 1072 (+7/-6) | Google | 2023/11 |
| InternVL2-26b | 1068 (+8/-7) | OpenGVLab | 2024/7 |
| Claude 3 Sonnet | 1048 (+6/-6) | Anthropic | 2023/8 |
| Yi-Vision | 1039 (+15/-15) | 01 AI | 2024/7 |
| qwen2-vl-7b-instruct | 1037 (+23/-21) | Alibaba | Unknown |
| Reka-Flash-Preview-20240611 | 1024 (+8/-6) | Reka AI | Unknown |
| Reka-Core-20240501 | 1015 (+5/-6) | Reka AI | Unknown |
| InternVL2-4b | 1010 (+9/-8) | OpenGVLab | 2024/7 |
| LLaVA-v1.6-34B | 1000 (+9/-7) | LLaVA | 2024/1 |
| Claude 3 Haiku | 1000 (+7/-6) | Anthropic | 2023/8 |
| LLaVA-OneVision-qwen2-72b-ov-sft | 992 (+16/-13) | LLaVA | 2024/8 |
| CogVLM2-llama3-chat-19b | 990 (+13/-12) | Zhipu AI | 2024/7 |
| MiniCPM-v 2_6 | 976 (+15/-13) | OpenBMB | 2024/7 |
| Phi-3.5-vision-instruct | 916 (+11/-10) | Microsoft | 2024/8 |
| Phi-3-Vision-128k-Instruct | 874 (+15/-12) | Microsoft | 2024/3 |
The leaderboard covers a wide spectrum of models from organizations including OpenAI, Anthropic, Google, Meta, Alibaba, and Reka AI, reporting Arena Elo ratings alongside confidence intervals, vote counts, and benchmark scores such as MT-bench and MMLU where available. The September 2024 update underscores the rapid pace of the field, with new models consistently raising the bar, and the LMSYS Chatbot Arena Leaderboard remains an essential resource for tracking state-of-the-art LLM capabilities.
How does the LMSYS Chatbot Arena Leaderboard work?
The LMSYS Chatbot Arena evaluates LLMs through anonymized, randomized head-to-head battles: a user submits a prompt, receives responses from two unnamed models side by side, and votes for the better one. These pairwise votes are aggregated into the Arena Elo rating, so the leaderboard reflects human preference at scale, supplemented by static benchmarks such as MT-bench and MMLU.
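For intuition, here is a minimal sketch of how pairwise votes can move Elo-style ratings. It is illustrative only: the model names, vote log, and K-factor are hypothetical, and the production leaderboard fits a Bradley-Terry model over the full battle history rather than applying sequential Elo updates.

```python
# Minimal Elo update from pairwise votes -- illustrative only.
# Model names, the vote log, and K below are hypothetical.

K = 4  # small update step, so a single vote moves ratings only slightly

def expected(r_a: float, r_b: float) -> float:
    """Elo-model probability that the first model wins."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings: dict, a: str, b: str, score_a: float) -> None:
    """score_a: 1.0 if a wins, 0.0 if b wins, 0.5 for a tie."""
    e_a = expected(ratings[a], ratings[b])
    ratings[a] += K * (score_a - e_a)
    ratings[b] += K * ((1.0 - score_a) - (1.0 - e_a))

ratings = {"model-x": 1000.0, "model-y": 1000.0}  # all models start equal
votes = [("model-x", "model-y", 1.0),   # user preferred model-x
         ("model-x", "model-y", 0.5),   # tie
         ("model-x", "model-y", 0.0)]   # user preferred model-y
for a, b, score_a in votes:
    update(ratings, a, b, score_a)
print(ratings)
```

Because sequential Elo updates depend on the order battles arrive in, LMSYS moved to the order-independent Bradley-Terry estimator; the interpretation stays the same (rating gaps map to expected win rates) but the resulting scores are more stable.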
The leaderboard is an essential resource for developers and researchers, offering insights into the strengths and weaknesses of various models. It helps identify areas for improvement and guides the development of more advanced conversational AI systems.
What is the purpose of the LMSYS Chatbot Arena Leaderboard?
The purpose of the LMSYS Chatbot Arena Leaderboard is to provide a transparent and dynamic evaluation of LLMs in conversational settings. By incorporating user feedback and automated scoring, it offers a comprehensive view of model performance, helping to drive innovation and improvement in AI technology.
The leaderboard is designed to foster collaboration and knowledge sharing among AI researchers and developers, promoting the development of more effective and engaging conversational models.
Future Directions for the LMSYS Chatbot Arena Leaderboard
Future directions for the LMSYS Chatbot Arena Leaderboard include expanding the range of evaluation metrics, incorporating more diverse conversational scenarios, and enhancing user engagement. By continuously evolving, the leaderboard aims to remain at the forefront of AI evaluation, providing valuable insights into the capabilities of the latest LLMs.