Anthropic Claude 3

by Stephen M. Walker II, Co-Founder / CEO

Top tip

If you're already using Claude for AI tasks, you'll find Claude 3 to be a significant upgrade, though it may not offer substantial real-world advantages over GPT-4 Turbo.

Anthropic recently unveiled the Claude 3 model family, establishing new benchmarks in a variety of cognitive tasks. This lineup includes Claude 3 Haiku, Claude 3 Sonnet, and Claude 3 Opus, each model offering enhanced performance levels for different user needs in terms of intelligence, speed, and cost.

Claude 3 Haiku — The fastest and most compact model, designed for near-instant responsiveness, particularly suited for simple queries and requests.
Claude 3 Sonnet — Offers a balance between speed and intelligence, excelling in tasks that require rapid responses like knowledge retrieval or sales automation. It is twice as fast as previous Claude models.
Claude 3 Opus — The most intelligent model in the family, outperforming peers on benchmarks such as undergraduate and graduate-level knowledge and reasoning, basic mathematics, and more. It exhibits near-human comprehension and fluency in complex tasks.

Klu Anthropic Claude 3 Model Family Sizes

Availability

Anthropic's Claude 3 series introduces a significant leap in Claude-series capabilities, featuring multimodal understanding that processes both text and visual inputs, including photos, charts, graphs, and technical diagrams. This advancement allows for a broader application in various fields, enhancing the AI's ability to interact with and understand complex information.

Accessible in 159 countries via claude.ai and its API, the Claude 3 models, including Opus and Sonnet, have already begun to impact the market, with plans to introduce Haiku soon. Anthropic claims models stand out in performing live customer chats, auto-completions, and data extraction tasks, demonstrating superior capabilities in benchmark tests when compared to competitors like OpenAI's GPT-4. Early tests leave some room for skepticism.

Claude 3 Benchmark Performance

According to Anthropic's published data, the Claude 3 model family has redefined industry standards, outshining predecessors like GPT-4 and Gemini Ultra across various cognitive tasks. The Claude 3 Opus model, the most sophisticated in the lineup, showcases exceptional performance in key AI benchmarks such as MMLU (undergraduate-level knowledge), GPQA (graduate-level reasoning), GSM8K (basic mathematics), and HumanEval (coding), indicating its prowess in understanding complex tasks, mathematical reasoning, and coding, with near-human comprehension and fluency. Additionally, its advanced computer vision capabilities allow for effective information extraction from visual materials.

Comparatively, Claude 3 Opus surpasses leading models from OpenAI and Google in benchmarks, setting a new standard for conversational AI. It excels in undergraduate and graduate knowledge, and grade school math, while the Claude 3 Sonnet model impresses with its ability to understand scientific diagrams, showcasing its utility for enterprise operations and data analysis.

Test	Claude 3 Opus	GPT-4	Gemini 1.0 Ultra
Undergraduate level knowledge MMLU	86.8% 5-shot	86.4% 5-shot	83.7% 5-shot
Graduate level reasoning GPQA, Diamond	50.4% 0-shot CoT	35.7% 0-shot CoT	—
Grade school math GSM8K	95.0% 0-shot CoT	92.0% 5-shot CoT	94.4% Maij@32
Math problem-solving MATH	60.1% 0-shot CoT	52.9% 4-shot	53.2% 4-shot
Multilingual math MGSM	90.7% 0-shot	74.5% 8-shot	79.0% 8-shot
Code HumanEval	84.9% 0-shot	67.0% 0-shot	74.4% 0-shot
Reasoning over text DROP, F1 score	83.1% 3-shot	80.9% 3-shot	82.4% Variable shots
Mixed evaluations BIG-Bench-Hard	86.8% 3-shot CoT	83.1% 3-shot CoT	83.6% 3-shot CoT
Knowledge Q&A ARC-Challenge	96.4% 25-shot	96.3% 25-shot	—
Common Knowledge HellaSwag	95.4% 10-shot	95.3% 10-shot	87.8% 10-shot

Claude 3's multimodal nature, capable of processing both text and images, significantly expands its application range. Anthropic emphasizes AI safety, actively working to mitigate bias and ensure neutrality, making Claude 3 models a top choice for both enterprise and consumer applications due to their performance, safety features, and broad applicability.

Test	Claude 3 Opus	GPT-4 Vision	Gemini 1.0 Ultra
Math & reasoning MMMU (val)	59.4%	56.8%	59.4%
Document visual Q&A ANLS score, test	89.3%	88.4%	90.9%
Math MathVista (testmini)	50.5% CoT	49.9%	53.0%
Science diagrams AI2D, test	88.1%	78.2%	79.5%
Chart Q&A Relaxed accuracy (test)	80.8% 0-shot CoT	78.5% 4-shot CoT	80.8%

Claude 3 Model Series

The Claude 3 models, particularly Opus, excel in common AI evaluation benchmarks like MMLU, GPQA, and GSM8K, demonstrating near-human comprehension and fluency in complex tasks. These models also improve in analysis, forecasting, content creation, code generation, and multilingual communication.

The economic potential of Claude 3, particularly the Opus model, has been highlighted in its role as an economic analyst, suggesting its utility in specialized professional domains. When testing ourselves though and comparing against GPT-4 Turbo, we find little evidence to back this claim. With over 10,000 organizations already leveraging Amazon Bedrock for generative AI applications, the introduction of Claude 3 models is poised to further accelerate adoption.

Accuracy

Claude historically has issues with accuracy due to its ability to be very creative. This is most noticed when iterating on complex storylines. For Claude 3, accuracy has also been improved, with Opus showing a twofold increase in correct answers for complex, factual questions. The models will soon support citations for verifying answers.

Context length up to 1 million tokens

Initially, the Claude 3 models will have a 200K context window, with the potential for inputs over 1 million tokens for select customers. The models demonstrate robust recall capabilities, with Opus achieving near-perfect recall in the 'Needle In A Haystack' evaluation.

Speed

Claude 3 models deliver near-instantaneous results for live chats, auto-completions, and data extraction, with Haiku being the fastest and most cost-effective for its intelligence level. Sonnet offers double the speed of its predecessors with higher intelligence, while Opus matches their speed but with significantly enhanced intelligence.

Vision

These models also feature advanced vision capabilities, processing various visual formats and offering this capability to enterprise customers with visually encoded knowledge bases.

Instruction Following, Reduced Refusals & AI Safety

The Claude 3 models are user-friendly, adept at following complex instructions, and capable of producing structured outputs like JSON for various applications.

Claude 2 suffers from refusing to answer non-harmful prompts. This is due to the aggressive AI Safety measures deployed by Anthropic. Significant improvements have been made in reducing unnecessary refusals, with the new models showing a better understanding of prompts and less frequent refusals.

Anthropic's commitment to responsible AI development is evident in the Claude 3 models, which prioritize reducing biases and enhancing safety across various risk scenarios. Despite significant advancements, these models currently operate at AI Safety Level 2, under continuous monitoring to evaluate and manage risk levels effectively. The design of Claude 3 models emphasizes neutrality and bias mitigation, achieving a lower bias rate compared to previous versions. They are also equipped with advanced safety features, enabling them to handle sensitive prompts with greater efficacy.

Planned updates

The initial release of the Claude 3 models features a 200K context window, with plans to extend input capabilities up to 1 million tokens for certain customers. To remain competitive with Google Gemini 1.5's upcoming release, we expect to see Anthropic continue to push their generally available context window lengths.

Upcoming enhancements include the introduction of Tool Use (function calling), interactive coding (REPL), and advanced agentic capabilities.

Anthropic claims that the intelligence of the Claude 3 models is far from reaching its peak. Updates will be rolled out frequently in the coming months, focusing on expanding the models' functionalities, especially for enterprise applications and large-scale deployments.

Integration and support for Claude 3 models are streamlined for ease of use, with availability on major platforms like Amazon Bedrock and Google Cloud's Vertex AI. This ease of integration, coupled with significant backing from Amazon and Google, underscores the models' potential for widespread application across industries.

Pricing

Each model has distinct features and costs, catering to different needs:

Opus is the most intelligent, ideal for complex tasks with a cost of $15 per million tokens for input and $75 for output.
Sonnet offers a balance of intelligence and speed, suitable for enterprise workloads at $3 per million tokens for input and $15 for output.
Haiku provides the fastest response for simple queries at $0.25 per million tokens for input and $1.25 for output.

Opus and Sonnet are currently available, with Haiku to follow. Sonnet powers the free experience on claude.ai, with Opus for Claude Pro subscribers. Availability extends to Amazon Bedrock and Google Cloud’s Vertex AI Model Garden.

Anthropic plans frequent updates for the Claude 3 family, introducing features for enterprise use and large-scale deployments. The company aims to align AI development with positive societal outcomes and encourages feedback to enhance Claude's utility.

Klu is remote-first and global

Follow us

Anthropic Claude 3

Availability

Claude 3 Benchmark Performance

Claude 3 Model Series

Accuracy

Context length up to 1 million tokens

Speed

Vision

Instruction Following, Reduced Refusals & AI Safety

Planned updates

Pricing

More terms

What are the different types of dialogue systems?

Stephen Cole Kleene

It's time to build

LLMOps

Guides

LLMs