Anthropic Claude 3

by Stephen M. Walker II, Co-Founder / CEO

Top tip

If you're already using Claude for AI tasks, you'll find Claude 3 to be a significant upgrade, though it may not offer substantial real-world advantages over GPT-4 Turbo.

Anthropic recently unveiled the Claude 3 model family, establishing new benchmarks in a variety of cognitive tasks. This lineup includes Claude 3 Haiku, Claude 3 Sonnet, and Claude 3 Opus, each model offering enhanced performance levels for different user needs in terms of intelligence, speed, and cost.

  • Claude 3 Haiku — The fastest and most compact model, designed for near-instant responsiveness, particularly suited for simple queries and requests.
  • Claude 3 Sonnet — Offers a balance between speed and intelligence, excelling in tasks that require rapid responses like knowledge retrieval or sales automation. It is twice as fast as previous Claude models.
  • Claude 3 Opus — The most intelligent model in the family, outperforming peers on benchmarks such as undergraduate and graduate-level knowledge and reasoning, basic mathematics, and more. It exhibits near-human comprehension and fluency in complex tasks.
Klu Anthropic Claude 3 Model Family Sizes

Availability

Anthropic's Claude 3 series introduces a significant leap in Claude-series capabilities, featuring multimodal understanding that processes both text and visual inputs, including photos, charts, graphs, and technical diagrams. This advancement allows for a broader application in various fields, enhancing the AI's ability to interact with and understand complex information.

Klu Claude AI Logic

Accessible in 159 countries via claude.ai and its API, the Claude 3 models, including Opus and Sonnet, have already begun to impact the market, with plans to introduce Haiku soon. Anthropic claims models stand out in performing live customer chats, auto-completions, and data extraction tasks, demonstrating superior capabilities in benchmark tests when compared to competitors like OpenAI's GPT-4. Early tests leave some room for skepticism.

Claude 3 Benchmark Performance

According to Anthropic's published data, the Claude 3 model family has redefined industry standards, outshining predecessors like GPT-4 and Gemini Ultra across various cognitive tasks. The Claude 3 Opus model, the most sophisticated in the lineup, showcases exceptional performance in key AI benchmarks such as MMLU (undergraduate-level knowledge), GPQA (graduate-level reasoning), GSM8K (basic mathematics), and HumanEval (coding), indicating its prowess in understanding complex tasks, mathematical reasoning, and coding, with near-human comprehension and fluency. Additionally, its advanced computer vision capabilities allow for effective information extraction from visual materials.

Klu Claude 3 Vision

Comparatively, Claude 3 Opus surpasses leading models from OpenAI and Google in benchmarks, setting a new standard for conversational AI. It excels in undergraduate and graduate knowledge, and grade school math, while the Claude 3 Sonnet model impresses with its ability to understand scientific diagrams, showcasing its utility for enterprise operations and data analysis.

TestClaude 3 OpusGPT-4Gemini 1.0 Ultra
Undergraduate level knowledge MMLU86.8% 5-shot86.4% 5-shot83.7% 5-shot
Graduate level reasoning GPQA, Diamond50.4% 0-shot CoT35.7% 0-shot CoT
Grade school math GSM8K95.0% 0-shot CoT92.0% 5-shot CoT94.4% Maij@32
Math problem-solving MATH60.1% 0-shot CoT52.9% 4-shot53.2% 4-shot
Multilingual math MGSM90.7% 0-shot74.5% 8-shot79.0% 8-shot
Code HumanEval84.9% 0-shot67.0% 0-shot74.4% 0-shot
Reasoning over text DROP, F1 score83.1% 3-shot80.9% 3-shot82.4% Variable shots
Mixed evaluations BIG-Bench-Hard86.8% 3-shot CoT83.1% 3-shot CoT83.6% 3-shot CoT
Knowledge Q&A ARC-Challenge96.4% 25-shot96.3% 25-shot
Common Knowledge HellaSwag95.4% 10-shot95.3% 10-shot87.8% 10-shot

Claude 3's multimodal nature, capable of processing both text and images, significantly expands its application range. Anthropic emphasizes AI safety, actively working to mitigate bias and ensure neutrality, making Claude 3 models a top choice for both enterprise and consumer applications due to their performance, safety features, and broad applicability.

Klu Claude 3 Vision
TestClaude 3 OpusGPT-4 VisionGemini 1.0 Ultra
Math & reasoning MMMU (val)59.4%56.8%59.4%
Document visual Q&A ANLS score, test89.3%88.4%90.9%
Math MathVista (testmini)50.5% CoT49.9%53.0%
Science diagrams AI2D, test88.1%78.2%79.5%
Chart Q&A Relaxed accuracy (test)80.8% 0-shot CoT78.5% 4-shot CoT80.8%

Claude 3 Model Series

The Claude 3 models, particularly Opus, excel in common AI evaluation benchmarks like MMLU, GPQA, and GSM8K, demonstrating near-human comprehension and fluency in complex tasks. These models also improve in analysis, forecasting, content creation, code generation, and multilingual communication.

The economic potential of Claude 3, particularly the Opus model, has been highlighted in its role as an economic analyst, suggesting its utility in specialized professional domains. When testing ourselves though and comparing against GPT-4 Turbo, we find little evidence to back this claim. With over 10,000 organizations already leveraging Amazon Bedrock for generative AI applications, the introduction of Claude 3 models is poised to further accelerate adoption.

Accuracy

Klu Claude 3 Hard Questions

Claude historically has issues with accuracy due to its ability to be very creative. This is most noticed when iterating on complex storylines. For Claude 3, accuracy has also been improved, with Opus showing a twofold increase in correct answers for complex, factual questions. The models will soon support citations for verifying answers.

Context length up to 1 million tokens

Klu Claude 3 Retrieval Recall

Initially, the Claude 3 models will have a 200K context window, with the potential for inputs over 1 million tokens for select customers. The models demonstrate robust recall capabilities, with Opus achieving near-perfect recall in the 'Needle In A Haystack' evaluation.

Speed

Claude 3 models deliver near-instantaneous results for live chats, auto-completions, and data extraction, with Haiku being the fastest and most cost-effective for its intelligence level. Sonnet offers double the speed of its predecessors with higher intelligence, while Opus matches their speed but with significantly enhanced intelligence.

Vision

These models also feature advanced vision capabilities, processing various visual formats and offering this capability to enterprise customers with visually encoded knowledge bases.

Instruction Following, Reduced Refusals & AI Safety

The Claude 3 models are user-friendly, adept at following complex instructions, and capable of producing structured outputs like JSON for various applications.

Claude 2 suffers from refusing to answer non-harmful prompts. This is due to the aggressive AI Safety measures deployed by Anthropic. Significant improvements have been made in reducing unnecessary refusals, with the new models showing a better understanding of prompts and less frequent refusals.

Klu Claude 3 Refusals vs Claude 2

Anthropic's commitment to responsible AI development is evident in the Claude 3 models, which prioritize reducing biases and enhancing safety across various risk scenarios. Despite significant advancements, these models currently operate at AI Safety Level 2, under continuous monitoring to evaluate and manage risk levels effectively. The design of Claude 3 models emphasizes neutrality and bias mitigation, achieving a lower bias rate compared to previous versions. They are also equipped with advanced safety features, enabling them to handle sensitive prompts with greater efficacy.

Planned updates

The initial release of the Claude 3 models features a 200K context window, with plans to extend input capabilities up to 1 million tokens for certain customers. To remain competitive with Google Gemini 1.5's upcoming release, we expect to see Anthropic continue to push their generally available context window lengths.

Upcoming enhancements include the introduction of Tool Use (function calling), interactive coding (REPL), and advanced agentic capabilities.

Anthropic claims that the intelligence of the Claude 3 models is far from reaching its peak. Updates will be rolled out frequently in the coming months, focusing on expanding the models' functionalities, especially for enterprise applications and large-scale deployments.

Integration and support for Claude 3 models are streamlined for ease of use, with availability on major platforms like Amazon Bedrock and Google Cloud's Vertex AI. This ease of integration, coupled with significant backing from Amazon and Google, underscores the models' potential for widespread application across industries.

Pricing

Each model has distinct features and costs, catering to different needs:

  • Opus is the most intelligent, ideal for complex tasks with a cost of $15 per million tokens for input and $75 for output.
  • Sonnet offers a balance of intelligence and speed, suitable for enterprise workloads at $3 per million tokens for input and $15 for output.
  • Haiku provides the fastest response for simple queries at $0.25 per million tokens for input and $1.25 for output.

Opus and Sonnet are currently available, with Haiku to follow. Sonnet powers the free experience on claude.ai, with Opus for Claude Pro subscribers. Availability extends to Amazon Bedrock and Google Cloud’s Vertex AI Model Garden.

Anthropic plans frequent updates for the Claude 3 family, introducing features for enterprise use and large-scale deployments. The company aims to align AI development with positive societal outcomes and encourages feedback to enhance Claude's utility.

More terms

What is Grouped Query Attention (GQA)?

Grouped Query Attention (GQA) is a technique used in large language models to speed up the inference time. It groups queries together and computes their attention jointly, reducing the computational complexity and making the model more efficient.

Read more

What is systems neuroscience?

Systems neuroscience in the context of artificial intelligence (AI) refers to an interdisciplinary approach that combines insights from neuroscience—the study of the nervous system and the brain—with AI development. The goal is to create AI models that can perceive, learn, and adapt in complex and dynamic environments by emulating the functions of neural assemblies and various subsystems found in biological organisms.

Read more

It's time to build

Collaborate with your team on reliable Generative AI features.
Want expert guidance? Book a 1:1 onboarding session from your dashboard.

Start for free