GPT-4 Getting Worse

by Stephen M. Walker II, Co-Founder / CEO

Top tip

Automated Evaluations are now available, enabling comprehensive assessment of GPT-4's performance for your specific use cases.

Is GPT-4 Getting Worse?

The question of whether GPT-4 is getting worse is complex and multifaceted.

But, yes, it is.

Model         | Release    | Benchmarks | Retrieval | Comp Score
GPT-4         | 0314       | 3.50       | 3.94      | 3.59
GPT-4 32k     | 0314       | 3.50       | 3.94      | 3.58
GPT-4         | 0613       | 3.50       | 3.90      | 3.58
GPT-4 32k     | 0613       | 3.50       | 3.90      | 3.56
GPT-4 Turbo   | 2024-04-09 | 3.50       | 3.98      | 3.48
GPT-4o        | 2024-05-13 | 3.42       | 3.96      | 3.44
GPT-3.5 Turbo | 0301       | 3.02       | 3.89      | 3.42
GPT-3.5 Turbo | 0613       | 3.01       | 3.88      | 3.33

GPT-4's speed, vision capabilities, and ability to follow complex instructions are improving. However, when raw capability is measured across the benchmarks above, the original GPT-4 (0314) release still posts the highest composite score.
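
A comparison like the one in the table comes down to running the same prompt set against each dated model snapshot and scoring the answers. The sketch below shows a minimal version of that loop with the OpenAI Python client (v1+); the prompt set, exact-match scoring, and snapshot names are illustrative assumptions, not the harness behind the scores above, and older snapshot IDs may since have been deprecated.

```python
from openai import OpenAI  # assumes openai>=1.0 and an OPENAI_API_KEY environment variable

client = OpenAI()

# Tiny illustrative prompt set with unambiguous expected answers.
PROMPTS = [
    {"prompt": "What is 17 * 24? Reply with the number only.", "expected": "408"},
    {"prompt": "Is 10007 a prime number? Reply with Yes or No only.", "expected": "Yes"},
]

def score_snapshot(model: str) -> float:
    """Return the exact-match accuracy of one model snapshot on the prompt set."""
    correct = 0
    for case in PROMPTS:
        answer = client.chat.completions.create(
            model=model,
            temperature=0,  # keep decoding as deterministic as possible for comparability
            messages=[{"role": "user", "content": case["prompt"]}],
        ).choices[0].message.content.strip()
        correct += int(answer == case["expected"])
    return correct / len(PROMPTS)

for snapshot in ["gpt-4-0613", "gpt-4-turbo-2024-04-09"]:
    print(snapshot, score_snapshot(snapshot))
```

Pinning dated snapshots rather than the floating "gpt-4" alias is what makes this kind of release-to-release comparison repeatable.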

While some users have reported perceived declines in performance, particularly in specific tasks or contexts, it's important to consider several factors.

Model updates, changes in training data, and evolving user expectations can all influence perceptions of performance.

Additionally, OpenAI continuously works on improving and fine-tuning their models, as evidenced by the release of GPT-4 Omni (GPT-4o) on May 13, 2024.

This latest model showcases advanced multimodal capabilities, accepting text, audio, image, and video inputs and generating text, audio, and image outputs, which may address some of the concerns raised by users.

Therefore, while there may be instances where GPT-4's performance appears to fluctuate, ongoing advancements and updates aim to enhance its overall capabilities and user experience.

Reported Issues

Many users have reported experiencing issues with GPT-4 that weren't present in earlier versions:

  • Decreased reasoning capabilities and logical errors
  • Difficulty maintaining context and following instructions
  • Reduced accuracy in specialized tasks like coding
  • Less nuanced understanding of prompts

Quantitative Study

A study by researchers from Stanford and UC Berkeley ("How Is ChatGPT's Behavior Changing over Time?", Chen, Zaharia, and Zou, 2023) attempted to quantify these changes:

  • Code generation: The percentage of directly executable code dropped from 52.0% in March to 10.0% in June for GPT-4 (a rough version of this check is sketched after this list)
  • Prime number identification: GPT-4's accuracy decreased from 97.6% in March to 2.4% in June
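
The "directly executable" metric can be approximated with a simple compile check on the raw model output, as below. This is a sketch of the idea behind the study's setup, not the authors' actual harness; the task prompt, trial count, and model name are placeholders.

```python
from openai import OpenAI  # assumes openai>=1.0 and OPENAI_API_KEY set

client = OpenAI()

TASK = "Write a Python function that returns the n-th Fibonacci number."

def directly_executable_rate(model: str, trials: int = 5) -> float:
    """Fraction of raw responses that compile as Python with no cleanup."""
    ok = 0
    for _ in range(trials):
        reply = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": TASK}],
        ).choices[0].message.content
        try:
            # The criterion is that the response runs as-is, so Markdown code
            # fences wrapped around an otherwise correct answer count as a failure.
            compile(reply, "<generated>", "exec")
            ok += 1
        except SyntaxError:
            pass
    return ok / trials

print(directly_executable_rate("gpt-4-0613"))
```

The study attributed much of the June drop to exactly this kind of formatting change: later snapshots wrapped answers in code fences, which breaks any pipeline that executes the raw output.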

These changes likely stem from two sources: efficiency optimizations such as model pruning and quantization, and additional RLHF aimed at handling edge cases, improving safety, and reducing extra token output.

Possible Explanations

Several factors might contribute to these perceived and measured changes:

  1. Increased user base straining the system
  2. Modifications to improve speed at the cost of accuracy
  3. Changes in content moderation and safety measures
  4. Ongoing model updates and fine-tuning

Differing Opinions

It's important to note that experiences vary, and not all users report a decline in quality. Some find GPT-4 to be faster and more human-like in its responses, even if less accurate in certain areas.

Lack of Official Communication

OpenAI has not provided detailed explanations for these changes, and this lack of transparency has led to speculation and frustration among users trying to understand the reasons behind the perceived drop in performance.

Implications

The reported decline in quality has several implications:

  • Users may need to verify GPT-4's outputs more carefully (see the sketch after this list)
  • Businesses relying on GPT-4 APIs may need to reassess their strategies
  • There's an increased interest in open-source alternatives
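
One practical form of verification is to validate each model response against a deterministic check before trusting it, and to route failures to human review. The sketch below illustrates the pattern; the extraction task, JSON shape, and model name are illustrative assumptions rather than a prescribed setup.

```python
import json
from openai import OpenAI  # assumes openai>=1.0 and OPENAI_API_KEY set

client = OpenAI()

def extract_invoice_total(text: str, model: str = "gpt-4-0613") -> float | None:
    """Ask the model for a JSON field, then verify it parses and is plausible."""
    reply = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{
            "role": "user",
            "content": f'Return only JSON like {{"total": 12.34}} for this invoice:\n{text}',
        }],
    ).choices[0].message.content
    try:
        total = float(json.loads(reply)["total"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return None  # malformed or unexpected output: fall back to human review
    return total if total >= 0 else None

result = extract_invoice_total("Subtotal $90.00, tax $9.00, total $99.00")
print(result if result is not None else "needs manual review")
```

The same pattern generalizes: pin a dated snapshot your product was validated on, wrap responses in checks you can evaluate deterministically, and re-run the checks before migrating to a newer snapshot.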

While GPT-4 remains a powerful tool, users should be aware of its limitations and potential inconsistencies. As with any AI technology, it's crucial to approach its outputs critically and verify important information from authoritative sources.

