LMSYS Chatbot Arena Leaderboard

by Stephen M. Walker II, Co-Founder / CEO

What is the LMSYS Chatbot Arena Leaderboard?

The LMSYS Chatbot Arena Leaderboard is a comprehensive ranking platform that assesses the performance of large language models (LLMs) in conversational tasks. It uses a combination of human feedback and automated scoring to evaluate models like GPT-4, Claude, and others, providing a clear view of their strengths and weaknesses in real-world applications.

The leaderboard is updated regularly, reflecting the latest advancements in AI technology. It includes models from leading organizations such as OpenAI, Anthropic, Google, and Meta, showcasing their capabilities in engaging and informative conversations.

Key features of the LMSYS Chatbot Arena Leaderboard include:

  • Diverse Evaluation Metrics — The leaderboard combines the Chatbot Arena Elo rating, computed from pairwise human votes, with static benchmarks such as MT-bench and MMLU.
  • Regular Updates — The leaderboard is frequently updated to reflect the latest developments in LLM technology, ensuring that it remains a relevant and valuable resource for AI researchers and developers.
  • Community Engagement — Users can participate in the evaluation process by providing feedback on chatbot interactions, contributing to the dynamic nature of the leaderboard.

LMSYS Chatbot Arena Leaderboard (September 2024)

| Model | Arena Elo (95% CI) | Votes | MT-bench | MMLU | Knowledge Cutoff | License |
|---|---|---|---|---|---|---|
| o1-preview | 1355 (+12/-11) | 2991 | — | — | — | Proprietary |
| ChatGPT-4o-latest (2024-09-03) | 1335 (+5/-6) | 10213 | — | — | — | Proprietary |
| o1-mini | 1324 (+12/-9) | 3009 | — | — | — | Proprietary |
| Gemini-1.5-Pro-Exp-0827 | 1299 (+5/-4) | 28229 | — | — | — | Proprietary |
| Grok-2-08-13 | 1294 (+4/-4) | 23999 | — | — | — | Proprietary |
| GPT-4o-2024-05-13 | 1285 (+3/-3) | 90695 | — | — | — | Proprietary |
| GPT-4o-mini-2024-07-18 | 1273 (+3/-3) | 30434 | — | — | — | Proprietary |
| Claude 3.5 Sonnet | 1269 (+3/-3) | 62977 | — | — | — | Proprietary |
| Gemini-1.5-Flash-Exp-0827 | 1269 (+4/-4) | 22264 | — | — | — | Proprietary |
| Grok-2-Mini-08-13 | 1267 (+4/-5) | 22041 | — | — | — | Proprietary |
| Gemini Advanced App (2024-05-14) | 1267 (+3/-3) | 52218 | — | — | — | Proprietary |
| Meta-Llama-3.1-405b-Instruct-fp8 | 1266 (+4/-4) | 31280 | — | — | — | Llama 3.1 Community |
| Meta-Llama-3.1-405b-Instruct-bf16 | 1264 (+6/-8) | 5865 | — | — | — | Llama 3.1 Community |
| GPT-4o-2024-08-06 | 1263 (+4/-3) | 22562 | — | — | — | Proprietary |
| Gemini-1.5-Pro-001 | 1259 (+3/-3) | 80656 | — | — | — | Proprietary |
| GPT-4-Turbo-2024-04-09 | 1257 (+3/-2) | 92973 | — | — | — | Proprietary |
| GPT-4-1106-preview | 1251 | — | 9.40 | — | 2023/4 | Proprietary |
| Mistral-Large-2407 | 1250 | — | — | — | 2024/7 | Mistral Research |
| Athene-70b | 1250 | — | — | — | 2024/7 | CC-BY-NC-4.0 |
| Meta-Llama-3.1-70b-Instruct | 1249 | — | — | — | 2023/12 | Llama 3.1 Community |
| Claude 3 Opus | 1248 | — | 9.45 | 87.1 | — | Proprietary |
| GPT-4-0125-preview | 1245 | — | 9.38 | — | — | Proprietary |
| Yi-Large-preview | 1240 | — | — | — | — | Proprietary |
| Gemini-1.5-Flash-001 | 1227 | — | — | 78.9 | — | Proprietary |
| Deepseek-v2-API-0628 | 1219 | — | — | — | — | DeepSeek |
| Gemma-2-27b-it | 1218 | — | — | — | — | Gemma license |
| Yi-Large | 1212 | — | — | — | — | Proprietary |
| Gemini App (2024-01-24) | 1209 | — | — | — | — | Proprietary |
| Nemotron-4-340B-Instruct | 1209 | — | — | — | — | NVIDIA Open Model |
| GLM-4-0520 | 1207 | — | — | — | — | Proprietary |
| Llama-3-70b-Instruct | 1206 | — | — | 82.0 | — | Llama 3 Community |
| Claude 3 Sonnet | 1201 | — | 9.22 | 87.0 | — | Proprietary |
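
Because Arena Elo is an Elo-style rating, the gap between two scores translates directly into an expected head-to-head win rate via the standard Elo expectation formula. A minimal sketch in Python, using ratings from the table above:

```python
def expected_win_rate(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation: the probability that model A beats model B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# Ratings taken from the September 2024 table above.
print(round(expected_win_rate(1355, 1335), 2))  # o1-preview vs ChatGPT-4o-latest: ~0.53
print(round(expected_win_rate(1355, 1201), 2))  # o1-preview vs Claude 3 Sonnet:  ~0.71
```

A 20-point gap thus implies only about a 53% expected win rate, which is why the overlapping confidence intervals near the top of the table matter when declaring a "best" model.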

Vision Leaderboard

The Vision Leaderboard ranks LLMs on their performance in vision-based conversations. Unlike the text leaderboard, it relies on a single metric: an Arena Score, an Elo-style rating computed from pairwise human votes on image-grounded chats, reported with a 95% confidence interval.

| Model | Arena Score | Organization | Knowledge Cutoff |
|---|---|---|---|
| Gemini-1.5-Pro-Exp-0827 | 1231 (+9/-6) | Google | 2023/11 |
| GPT-4o-2024-05-13 | 1209 (+6/-6) | OpenAI | 2023/10 |
| Gemini-1.5-Flash-Exp-0827 | 1208 (+11/-12) | Google | 2023/11 |
| Claude 3.5 Sonnet | 1191 (+6/-4) | Anthropic | 2024/4 |
| Gemini-1.5-Pro-001 | 1151 (+8/-6) | Google | 2023/11 |
| GPT-4-Turbo-2024-04-09 | 1151 (+7/-4) | OpenAI | 2023/12 |
| GPT-4o-mini-2024-07-18 | 1120 (+6/-5) | OpenAI | 2023/10 |
| Gemini-1.5-Flash-8b-Exp-0827 | 1110 (+9/-10) | Google | 2023/11 |
| Qwen2-VL-72B | 1085 (+26/-19) | Alibaba | Unknown |
| Claude 3 Opus | 1075 (+5/-6) | Anthropic | 2023/8 |
| Gemini-1.5-Flash-001 | 1072 (+7/-6) | Google | 2023/11 |
| InternVL2-26b | 1068 (+8/-7) | OpenGVLab | 2024/7 |
| Claude 3 Sonnet | 1048 (+6/-6) | Anthropic | 2023/8 |
| Yi-Vision | 1039 (+15/-15) | 01 AI | 2024/7 |
| qwen2-vl-7b-instruct | 1037 (+23/-21) | Alibaba | Unknown |
| Reka-Flash-Preview-20240611 | 1024 (+8/-6) | Reka AI | Unknown |
| Reka-Core-20240501 | 1015 (+5/-6) | Reka AI | Unknown |
| InternVL2-4b | 1010 (+9/-8) | OpenGVLab | 2024/7 |
| LLaVA-v1.6-34B | 1000 (+9/-7) | LLaVA | 2024/1 |
| Claude 3 Haiku | 1000 (+7/-6) | Anthropic | 2023/8 |
| LLaVA-OneVision-qwen2-72b-ov-sft | 992 (+16/-13) | LLaVA | 2024/8 |
| CogVLM2-llama3-chat-19b | 990 (+13/-12) | Zhipu AI | 2024/7 |
| MiniCPM-v 2_6 | 976 (+15/-13) | OpenBMB | 2024/7 |
| Phi-3.5-vision-instruct | 916 (+11/-10) | Microsoft | 2024/8 |
| Phi-3-Vision-128k-Instruct | 874 (+15/-12) | Microsoft | 2024/3 |
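
The (+9/-6)-style figures in both tables are 95% confidence intervals, which LMSYS estimates by resampling the vote log. The sketch below illustrates that bootstrap idea with a simplified online-Elo fit; the production leaderboard fits a Bradley-Terry model instead, and the `battles` list here is a hypothetical stand-in for real vote data:

```python
import random
from collections import defaultdict

def fit_elo(battles, k=4, base=1000.0):
    """One pass of online Elo updates over a list of (winner, loser) votes."""
    ratings = defaultdict(lambda: base)
    for winner, loser in battles:
        expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
        ratings[winner] += k * (1.0 - expected)
        ratings[loser] -= k * (1.0 - expected)
    return ratings

def bootstrap_ci(battles, model, rounds=1000):
    """95% interval for one model's rating by resampling votes with replacement."""
    samples = sorted(
        fit_elo(random.choices(battles, k=len(battles)))[model]
        for _ in range(rounds)
    )
    return samples[int(0.025 * rounds)], samples[int(0.975 * rounds)]

# Hypothetical vote log: each entry is one human vote (winner, loser).
battles = [("model-a", "model-b")] * 70 + [("model-b", "model-a")] * 30
low, high = bootstrap_ci(battles, "model-a")
print(f"model-a rating 95% CI: {low:.0f} to {high:.0f}")
```

Wider intervals, as for Qwen2-VL-72B above, generally just mean fewer votes have been collected for that model so far.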

The leaderboard evaluates a wide spectrum of models from organizations including OpenAI, Anthropic, Google, Meta, and Reka AI, ranking them by Arena Elo rating and, where available, MT-bench and MMLU scores. As of September 2024, new releases regularly displace the previous leaders, making the LMSYS Chatbot Arena Leaderboard a useful running record of the state of the art in conversational AI.

How does the LMSYS Chatbot Arena Leaderboard work?

The LMSYS Chatbot Arena Leaderboard evaluates LLMs through pairwise human voting: visitors chat with two anonymous models side by side and vote for the response they prefer. Those votes are aggregated into the Arena Elo rating, while static benchmarks such as MT-bench and MMLU supply complementary automated scores. This combination ensures that the leaderboard reflects both human preferences and objective performance metrics.
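
In Elo terms, each vote shifts both models' ratings in proportion to how surprising the outcome was, so beating a highly rated model earns more points than beating a weak one. A minimal sketch of a single update step, assuming a conventional K-factor of 32 rather than LMSYS's exact configuration (the site has since moved to fitting a Bradley-Terry model over the full vote log):

```python
K = 32  # step size per vote; a common Elo default, not LMSYS's exact setup

def apply_vote(ratings: dict, winner: str, loser: str) -> None:
    """Apply one human vote: shift both ratings by how surprising the result was."""
    expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
    ratings[winner] += K * (1.0 - expected)
    ratings[loser] -= K * (1.0 - expected)

ratings = {"model-a": 1200.0, "model-b": 1200.0}
apply_vote(ratings, "model-a", "model-b")  # between equals, each side moves K/2 points
print(ratings)  # {'model-a': 1216.0, 'model-b': 1184.0}
```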

The leaderboard is an essential resource for developers and researchers, offering insights into the strengths and weaknesses of various models. It helps identify areas for improvement and guides the development of more advanced conversational AI systems.

What is the purpose of the LMSYS Chatbot Arena Leaderboard?

The purpose of the LMSYS Chatbot Arena Leaderboard is to provide a transparent and dynamic evaluation of LLMs in conversational settings. By incorporating user feedback and automated scoring, it offers a comprehensive view of model performance, helping to drive innovation and improvement in AI technology.

The leaderboard is designed to foster collaboration and knowledge sharing among AI researchers and developers, promoting the development of more effective and engaging conversational models.

Future Directions for the LMSYS Chatbot Arena Leaderboard

Future directions for the LMSYS Chatbot Arena Leaderboard include expanding the range of evaluation metrics, incorporating more diverse conversational scenarios, and enhancing user engagement. By continuously evolving, the leaderboard aims to remain at the forefront of AI evaluation, providing valuable insights into the capabilities of the latest LLMs.

More terms

What is the computational complexity of common AI algorithms?

The computational complexity of common AI algorithms varies with the algorithm and with whether the model is being trained or used for prediction. For instance, making a prediction with a trained linear regression model is O(n), where n is the number of features: it is a single dot product. Training the same model with the normal equations costs O(mn^2 + n^3) for m samples, because it involves forming and inverting an n-by-n matrix. Neural networks are costlier still: a single forward pass through a fully connected layer with n inputs and n outputs takes O(n^2) operations. Higher computational complexity means an algorithm needs more time and resources to train and run, which directly affects the efficiency and practicality of the resulting AI model.
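
To make these growth rates concrete, the sketch below counts multiply-add operations for a linear-model prediction versus a single fully connected layer, both as a function of input size n (the shapes are illustrative assumptions, not a benchmark of any specific model):

```python
def linear_prediction_ops(n: int) -> int:
    """Predicting with n features is one dot product: O(n) multiply-adds."""
    return n

def dense_layer_ops(n: int) -> int:
    """A fully connected layer mapping n inputs to n outputs: O(n^2) multiply-adds."""
    return n * n

for n in (10, 100, 1000):
    print(f"n={n:>4}  linear={linear_prediction_ops(n):>7}  dense={dense_layer_ops(n):>9}")
# Growing n by 10x grows the linear cost 10x, but the dense-layer cost 100x.
```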


What is data science?

Data science is a multidisciplinary field that uses scientific methods, processes, and systems to extract knowledge and insights from data in various forms, both structured and unstructured. It combines principles and practices from fields such as mathematics, statistics, artificial intelligence, and computer engineering.

