GPT-4 Getting Worse
by Stephen M. Walker II, Co-Founder / CEO
Top tip
Automated Evaluations are now available, enabling comprehensive assessment of GPT-4's performance for your specific use cases.
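As a minimal sketch of the idea (illustrative only, not the product's API): pin two GPT-4 snapshots, run the same prompts through both, and compare scores. The prompt set and exact-match scoring below are assumptions for demonstration.

```python
# Minimal eval sketch: score two pinned GPT-4 snapshots on the same prompts.
# Assumes the openai Python package (v1+) with OPENAI_API_KEY set; the
# prompts and exact-match scoring are illustrative placeholders.
from openai import OpenAI

client = OpenAI()

PROMPTS = [
    ("What is 17 * 24? Reply with the number only.", "408"),
    ("Is 'racecar' a palindrome? Reply yes or no.", "yes"),
]

def score(model: str) -> float:
    correct = 0
    for prompt, expected in PROMPTS:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,  # reduce run-to-run variance for comparison
        )
        answer = resp.choices[0].message.content.strip().lower()
        correct += int(answer == expected.lower())
    return correct / len(PROMPTS)

for model in ("gpt-4-0613", "gpt-4o-2024-05-13"):
    print(f"{model}: {score(model):.0%}")
```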
Is GPT-4 Getting Worse?
The question of whether GPT-4 is getting worse depends on which capability, snapshot, and benchmark you measure.
But, yes, it is.
| Model | Release | Benchmarks | Retrieval | Composite Score |
|---|---|---|---|---|
| GPT-4 | 0314 | 3.50 | 3.94 | 3.59 |
| GPT-4 32k | 0314 | 3.50 | 3.94 | 3.58 |
| GPT-4 | 0613 | 3.50 | 3.90 | 3.58 |
| GPT-4 32k | 0613 | 3.50 | 3.90 | 3.56 |
| GPT-4 Turbo | 2024-04-09 | 3.50 | 3.98 | 3.48 |
| GPT-4o | 2024-05-13 | 3.42 | 3.96 | 3.44 |
| GPT-3.5 Turbo | 0301 | 3.02 | 3.89 | 3.42 |
| GPT-3.5 Turbo | 0613 | 3.01 | 3.88 | 3.33 |
GPT-4's speed, vision capabilities, and ability to follow complex instructions keep improving. But measured across all known benchmarks, the original GPT-4 release (0314) still holds the highest composite score in the table above.
While some users have reported perceived declines in performance, particularly in specific tasks or contexts, it's important to consider several factors.
Model updates, changes in training data, and evolving user expectations can all influence perceptions of performance.
Additionally, OpenAI continuously works on improving and fine-tuning their models, as evidenced by the release of GPT-4 Omni (GPT-4o) on May 13, 2024.
This latest model showcases advanced multimodal capabilities, accepting text, audio, image, and video inputs and generating text, audio, and image outputs, which may address some of the concerns raised by users.
So while GPT-4's performance may fluctuate in places, ongoing updates aim to improve its overall capabilities and user experience.
Reported Issues
Many users have reported experiencing issues with GPT-4 that weren't present in earlier versions:
- Decreased reasoning capabilities and logical errors
- Difficulty maintaining context and following instructions
- Reduced accuracy in specialized tasks like coding
- Less nuanced understanding of prompts
Quantitative Study
A July 2023 study by researchers from Stanford and UC Berkeley (Chen, Zaharia, and Zou) attempted to quantify these changes between the March and June 2023 snapshots:
- Code generation: the percentage of directly executable code dropped from 52.0% in March 2023 to 10.0% in June 2023 for GPT-4 (a simplified version of the check is sketched after this list)
- Prime number identification: GPT-4's accuracy decreased from 97.6% in March to 2.4% in June
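The study counted a code-generation response as a pass only if the model's raw output ran as-is; much of the June drop reportedly came from GPT-4 wrapping code in markdown fences, which breaks direct execution. A simplified sketch of such a check (not the authors' actual harness):

```python
# Simplified "directly executable" check in the spirit of the study:
# run the raw model response as Python and see if it executes cleanly.
# Responses wrapped in markdown fences fail to parse, which the authors
# noted as a major cause of the June drop. A sketch, not the authors' code.
def is_directly_executable(response: str) -> bool:
    try:
        exec(compile(response, "<generated>", "exec"), {})
        return True
    except Exception:
        return False

print(is_directly_executable("print(1 + 1)"))                  # True
print(is_directly_executable("```python\nprint(1 + 1)\n```"))  # False: fences break parsing
```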
Two factors likely explain these changes: serving-efficiency optimizations such as model pruning and quantization, and additional RLHF aimed at edge cases and safety that also discourages extra token output.
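Of those techniques, quantization is the simplest to illustrate: storing weights at lower precision cuts memory use and speeds up inference, but introduces rounding error. The toy int8 round-trip below is purely illustrative; OpenAI has not disclosed how its models are served.

```python
# Toy post-training quantization: map float32 weights to int8 and back,
# showing the rounding error this kind of compression introduces.
# Purely illustrative; nothing is known about OpenAI's serving stack.
import numpy as np

weights = np.array([0.013, -0.742, 0.508, 0.291], dtype=np.float32)

scale = np.abs(weights).max() / 127            # symmetric int8 scale
quantized = np.round(weights / scale).astype(np.int8)
dequantized = quantized.astype(np.float32) * scale

print(dequantized)                             # close to, but not equal to, weights
print(np.abs(weights - dequantized).max())     # worst-case rounding error
```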
Possible Explanations
Several factors might contribute to these perceived and measured changes:
- Increased user base straining the system
- Modifications to improve speed at the cost of accuracy
- Changes in content moderation and safety measures
- Ongoing model updates and fine-tuning
Differing Opinions
It's important to note that experiences vary, and not all users report a decline in quality. Some find GPT-4 to be faster and more human-like in its responses, even if less accurate in certain areas.
Lack of Official Communication
OpenAI has not provided detailed explanations for these changes, leaving users to speculate about the reasons behind the perceived shifts in performance and fueling frustration.
Implications
The reported decline in quality has several implications:
- Users may need to verify GPT-4's outputs more carefully
- Businesses relying on GPT-4 APIs may need to reassess their strategies (for example, by pinning dated snapshots, sketched below)
- There's an increased interest in open-source alternatives
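On the practical side, one mitigation suggested by the snapshot names in the table above is to pin a dated model version rather than the floating `gpt-4` alias, so behavior changes only when you deliberately migrate and re-evaluate. A minimal sketch using OpenAI's published snapshot IDs:

```python
# Pin a dated snapshot so model behavior changes only when you choose
# to migrate, rather than whenever the floating alias is updated.
from openai import OpenAI

client = OpenAI()
PINNED_MODEL = "gpt-4-0613"  # dated snapshot; plain "gpt-4" floats to the latest

resp = client.chat.completions.create(
    model=PINNED_MODEL,
    messages=[{"role": "user", "content": "Summarize why pinning model versions matters."}],
)
print(resp.choices[0].message.content)
```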
While GPT-4 remains a powerful tool, users should be aware of its limitations and potential inconsistencies. As with any AI technology, it's crucial to approach its outputs critically and verify important information from authoritative sources.