GPT-4 Getting Worse

by Stephen M. Walker II, Co-Founder / CEO

Top tip

Automated Evaluations are now available, enabling comprehensive assessment of GPT-4's performance for your specific use cases.

Is GPT-4 Getting Worse?

The question of whether GPT-4 is getting worse is complex and multifaceted.

But, yes, it is.

Model         | Release    | Benchmarks | Retrieval | Comp Score
GPT-4         | 0314       | 3.50       | 3.94      | 3.59
GPT-4 32k     | 0314       | 3.50       | 3.94      | 3.58
GPT-4         | 0613       | 3.50       | 3.90      | 3.58
GPT-4 32k     | 0613       | 3.50       | 3.90      | 3.56
GPT-4 Turbo   | 2024-04-09 | 3.50       | 3.98      | 3.48
GPT-4o        | 2024-05-13 | 3.42       | 3.96      | 3.44
GPT-3.5 Turbo | 0301       | 3.02       | 3.89      | 3.42
GPT-3.5 Turbo | 0613       | 3.01       | 3.88      | 3.33

GPT-4's speed, vision capabilities, and ability to follow complex instructions are improving. However, when raw capability is measured across the benchmarks above, the original GPT-4 (0314) release still posts the highest composite score.
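
A comparison like the one in the table comes down to running the same prompt set against each dated model snapshot and scoring the answers. The sketch below shows a minimal version of that loop with the OpenAI Python client (v1+); the prompt set, exact-match scoring, and snapshot names are illustrative assumptions, not the harness behind the scores above, and older snapshot IDs may since have been deprecated.

```python
from openai import OpenAI  # assumes openai>=1.0 and an OPENAI_API_KEY environment variable

client = OpenAI()

# Tiny illustrative prompt set with unambiguous expected answers.
PROMPTS = [
    {"prompt": "What is 17 * 24? Reply with the number only.", "expected": "408"},
    {"prompt": "Is 10007 a prime number? Reply with Yes or No only.", "expected": "Yes"},
]

def score_snapshot(model: str) -> float:
    """Return the exact-match accuracy of one model snapshot on the prompt set."""
    correct = 0
    for case in PROMPTS:
        answer = client.chat.completions.create(
            model=model,
            temperature=0,  # keep decoding as deterministic as possible for comparability
            messages=[{"role": "user", "content": case["prompt"]}],
        ).choices[0].message.content.strip()
        correct += int(answer == case["expected"])
    return correct / len(PROMPTS)

for snapshot in ["gpt-4-0613", "gpt-4-turbo-2024-04-09"]:
    print(snapshot, score_snapshot(snapshot))
```

Pinning dated snapshots rather than the floating "gpt-4" alias is what makes this kind of release-to-release comparison repeatable.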

While some users have reported perceived declines in performance, particularly in specific tasks or contexts, it's important to consider several factors.

Model updates, changes in training data, and evolving user expectations can all influence perceptions of performance.

Additionally, OpenAI continuously works on improving and fine-tuning their models, as evidenced by the release of GPT-4 Omni (GPT-4o) on May 13, 2024.

This latest model showcases advanced multimodal capabilities, accepting text, audio, image, and video inputs and generating text, audio, and image outputs, which may address some of the concerns raised by users.

Therefore, while there may be instances where GPT-4's performance appears to fluctuate, ongoing advancements and updates aim to enhance its overall capabilities and user experience.

Reported Issues

Many users have reported experiencing issues with GPT-4 that weren't present in earlier versions:

  • Decreased reasoning capabilities and logical errors
  • Difficulty maintaining context and following instructions
  • Reduced accuracy in specialized tasks like coding
  • Less nuanced understanding of prompts

Quantitative Study

A study by researchers from Stanford and UC Berkeley ("How Is ChatGPT's Behavior Changing over Time?", Chen, Zaharia, and Zou, 2023) attempted to quantify these changes:

  • Code generation: The percentage of directly executable code dropped from 52.0% in March to 10.0% in June for GPT-4 (a rough version of this check is sketched after this list)
  • Prime number identification: GPT-4's accuracy decreased from 97.6% in March to 2.4% in June
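
The "directly executable" metric can be approximated with a simple compile check on the raw model output, as below. This is a sketch of the idea behind the study's setup, not the authors' actual harness; the task prompt, trial count, and model name are placeholders.

```python
from openai import OpenAI  # assumes openai>=1.0 and OPENAI_API_KEY set

client = OpenAI()

TASK = "Write a Python function that returns the n-th Fibonacci number."

def directly_executable_rate(model: str, trials: int = 5) -> float:
    """Fraction of raw responses that compile as Python with no cleanup."""
    ok = 0
    for _ in range(trials):
        reply = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": TASK}],
        ).choices[0].message.content
        try:
            # The criterion is that the response runs as-is, so Markdown code
            # fences wrapped around an otherwise correct answer count as a failure.
            compile(reply, "<generated>", "exec")
            ok += 1
        except SyntaxError:
            pass
    return ok / trials

print(directly_executable_rate("gpt-4-0613"))
```

The study attributed much of the June drop to exactly this kind of formatting change: later snapshots wrapped answers in code fences, which breaks any pipeline that executes the raw output.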

These changes likely stem from two sources: efficiency optimizations such as model pruning and quantization, and additional RLHF aimed at handling edge cases, improving safety, and reducing extra token output.

Possible Explanations

Several factors might contribute to these perceived and measured changes:

  1. Increased user base straining the system
  2. Modifications to improve speed at the cost of accuracy
  3. Changes in content moderation and safety measures
  4. Ongoing model updates and fine-tuning

Differing Opinions

It's important to note that experiences vary, and not all users report a decline in quality. Some find GPT-4 to be faster and more human-like in its responses, even if less accurate in certain areas.

Lack of Official Communication

OpenAI has not provided detailed explanations for these changes, and this lack of transparency has led to speculation and frustration among users trying to understand the reasons behind the perceived drop in performance.

Implications

The reported decline in quality has several implications:

  • Users may need to verify GPT-4's outputs more carefully (see the sketch after this list)
  • Businesses relying on GPT-4 APIs may need to reassess their strategies
  • There's an increased interest in open-source alternatives
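
One practical form of verification is to validate each model response against a deterministic check before trusting it, and to route failures to human review. The sketch below illustrates the pattern; the extraction task, JSON shape, and model name are illustrative assumptions rather than a prescribed setup.

```python
import json
from openai import OpenAI  # assumes openai>=1.0 and OPENAI_API_KEY set

client = OpenAI()

def extract_invoice_total(text: str, model: str = "gpt-4-0613") -> float | None:
    """Ask the model for a JSON field, then verify it parses and is plausible."""
    reply = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{
            "role": "user",
            "content": f'Return only JSON like {{"total": 12.34}} for this invoice:\n{text}',
        }],
    ).choices[0].message.content
    try:
        total = float(json.loads(reply)["total"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return None  # malformed or unexpected output: fall back to human review
    return total if total >= 0 else None

result = extract_invoice_total("Subtotal $90.00, tax $9.00, total $99.00")
print(result if result is not None else "needs manual review")
```

The same pattern generalizes: pin a dated snapshot your product was validated on, wrap responses in checks you can evaluate deterministically, and re-run the checks before migrating to a newer snapshot.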

While GPT-4 remains a powerful tool, users should be aware of its limitations and potential inconsistencies. As with any AI technology, it's crucial to approach its outputs critically and verify important information from authoritative sources.

