Evaluating 2024 Frontier Model Capabilities Pt.01
by Stephen M. Walker II, Co-Founder / CEO
Introduction
At Klu, we help thousands of teams build magical, cutting-edge LLM-powered applications, serving millions of monthly requests. We've believed in the capabilities of LLMs for years, building internal tools and apps before starting Klu.
Our platform addresses the challenges we faced obtaining optimal model performance and now supports advanced capabilities such as evaluation on RAG pipelines, fine-tuned models, function calls, and agentic vision use cases.
Despite the power of these models, we often wonder: why aren't there more breakout apps and revenue-generating hits like ChatGPT or GitHub Copilot?
To answer this question, we set out to investigate:
- What are the real-world limits of frontier models?
- How much is luck vs. the prompt vs. the model?
- When does RAG lower model performance?
- How misleading are current model benchmarks?
To do this, we created QUAKE, a private benchmark based on the most common tasks and use cases. QUAKE demonstrates that for LLMs to be commercially useful, they need to be as reliable as a college-educated person in everyday work tasks.
This includes writing good content, analyzing data, filing reports, helping customers, making decisions, or even just answering quick questions like "who's online right now?"
What we found surprised us.
Great Expectations
The economic mood surrounding GenAI has shifted significantly in recent months, from excitement to "show me the money."
After the initial excitement, sky-high valuations, and predictions of widespread disruption, reality has begun to set in: we are in the early innings of a new technology.
Many companies are finding it challenging to adopt, much less develop useful, loved, revenue-generating applications using these models.
According to a Goldman Sachs report released a few weeks back, tech giants are projected to invest over $1 trillion in AI infrastructure, and Goldman's own analysts skeptically question the return on such substantial expenditure.
Right or wrong, one thing is true: GenAI revenues are not significantly offsetting these costs.
Where is the revenue when these systems are so powerful they trigger calls for bans or regulation? Is there a flawed assumption in the public perception of frontier models?
Leading benchmarks like MMLU (Knowledge Benchmark) and MMMU (Multimodal Benchmark) show impressive results, with top models approaching or even surpassing human expert performance in some areas.
Frontier models like Claude 3.5 Sonnet (86.5) and GPT-4o (84.2) nearly match human experts (89.8) on MMLU. On vision tasks (MMMU), even the best model (GPT-4o at 69.1) lags behind humans (non-expert: 82.6, expert: 88.6). On GPQA, models outperform non-experts (22.1%) but not experts (81.3%).
However, these benchmarks do not accurately reflect real-world performance on practical tasks. If you break out 2023's GPT-4 Technical Report, even OpenAI admits their leading model shows gaps in real-world performance, stating GPT-4 is "less capable than humans in many real-world scenarios."
To better understand this gap between standardized eval performance and real-world capabilities, we created our own benchmark.
QUAKE Benchmark
We developed QUAKE with several observations in mind.
Current benchmarks don't differentiate enough between text and vision capabilities, focus heavily on multiple-choice selection, and don't test how models are actually used today. Before we have working assistants, agents, or artificial superintelligence (ASI), we wanted to explore today's real-world scenarios.
QUAKE — Qualitative Utility Assessment of Knowledge and Engineering — evaluates LLMs on realistic tasks that an average college-educated person encounters in the workplace. These tasks span various categories, including content creation, data analysis, and customer support.
This includes tasks such as identifying famous individuals, reviewing financial reports, and completing CAPTCHA-like puzzles. The human-labeled dataset spans various professional domains, including sales, health, computer science, and finance, requiring skills in vision, instruction following, reading comprehension, and analysis.
We use the Klu.ai evaluation suite to store the dataset of text and images, running hundreds of tests across six frontier models. On average, 2024 frontier models score 16% on the hard score and 28% under human review, where answers are adjusted when the model was correct but didn't match the expected format or structure.
Google's Gemini 1.5 Flash demonstrated surprising capability, achieving a 24% hard score, 34.85% from human reviewers, and 42.67% from LLM reviewers – but it fails all but one vision task.
Anthropic's Claude 3.5 Sonnet ranked second with a 31.82% score from human reviewers, particularly excelling in vision tasks.
Meanwhile, OpenAI's GPT-4o stands out with its LLM Review score of 46.14% — indicating that despite mostly incorrect answers, it came closest to overall success.
Note on scoring methodology: we use three scoring methods (a rough sketch of how they aggregate follows the list)...
- Hard Score: Percentage of error-free task completions
- Human Review: Grade-curved for minor issues
- LLM Review: LLM provides points for effort
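To make the relationship between the three scores concrete, here is a minimal sketch of how a run might be aggregated. The field names and example grades are hypothetical, not our actual evaluation pipeline.

```python
# Minimal sketch of aggregating the three QUAKE scoring views.
# Field names and example grades are hypothetical, not the Klu pipeline.
from dataclasses import dataclass

@dataclass
class TaskResult:
    exact_correct: bool   # error-free completion matching the expected output
    human_grade: float    # 0.0-1.0, reviewer grade curved for minor issues
    llm_grade: float      # 0.0-1.0, LLM judge awarding partial credit

def score_run(results: list[TaskResult]) -> dict[str, float]:
    n = len(results)
    return {
        # Hard Score: share of tasks completed with no errors at all
        "hard": sum(r.exact_correct for r in results) / n,
        # Human Review: average curved grade from human reviewers
        "human_review": sum(r.human_grade for r in results) / n,
        # LLM Review: average partial-credit grade from the LLM reviewer
        "llm_review": sum(r.llm_grade for r in results) / n,
    }

print(score_run([
    TaskResult(exact_correct=True,  human_grade=1.0, llm_grade=1.0),
    TaskResult(exact_correct=False, human_grade=0.8, llm_grade=0.9),
    TaskResult(exact_correct=False, human_grade=0.0, llm_grade=0.4),
]))  # hard ~0.33, human_review 0.60, llm_review ~0.77
```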
Our findings reveal a significant gap between benchmark performance and real-world task completion, with frontier LLMs averaging just a 28% pass rate on QUAKE tasks, underscoring the practical challenges these models face.
On several tasks, we were able to push a model's success rate from 30% to 90-100% with task-specific system messages. This underscores the importance of prompt engineering in fully leveraging model capabilities.
With QUAKE, we enable a more accurate model improvement forecast. This was made possible by backtesting 2022 and 2023 models, including Davinci-003, GPT-3.5 Turbo, GPT-4, and Claude 1. By normalizing OpenAI's data, we observe that, on average, a revised model version is released every 110 days with a performance improvement of 18.7% over the previous release.
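As a back-of-the-envelope illustration of that forecast, compounding an 18.7% lift every 110 days looks like this. The starting score and date are illustrative, not measured values.

```python
from datetime import date, timedelta

# Back-of-the-envelope QUAKE forecast: a new model revision roughly every
# 110 days, each lifting the previous QUAKE score by ~18.7% (relative).
CADENCE_DAYS = 110
LIFT = 0.187

score = 28.0                 # illustrative: ~2024 frontier average under human review
release = date(2024, 7, 1)   # illustrative starting point

for _ in range(5):
    release += timedelta(days=CADENCE_DAYS)
    score *= 1 + LIFT
    print(f"{release}: projected QUAKE score ~{score:.1f}%")
```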
2022-2026 QUAKE Performance Forecast
Given this forecast, we anticipate a November GPT-5 model release that significantly outperforms GPT-4, but a smaller bump over GPT-4o. Rumors suggest that our forecast may be underestimating the improvements in GPT-5.
In reality, interim model releases trend toward flat or negative performance (e.g., GPT-4 0314 vs. GPT-4 0613), with significant lifts between minor or major releases (e.g., GPT-3.5 vs. GPT-4, GPT-4 vs. GPT-4o).
Observations
QUAKE evaluations reveal significant insights into the strengths and weaknesses of current frontier models. These models excel in specific text or data generation but often struggle with mundane, real-world applications and consistency. They require multiple attempts or highly-specialized prompts to perform well.
This highlights the need for continued refinement with targeted improvements. We believe this significant effort is what holds most products back currently – either due to gaps in skill or data needed to make these improvements, or use cases that consistently fail despite best efforts.
Prompting to Success
Example Grid System Prompt
Generic prompts yield generic responses. Surprising? No. However, with temperature set to 1, we observe Claude 3.5 Sonnet and GPT-4o providing exceptional responses ~30% of the time, excelling in accuracy, content, and structure across various tasks.
This means that without specific prompt engineering, the majority of outputs are boring and useless, and the occasional great response appears too unpredictably to rely on.
The above system prompt with two examples increases Sonnet's vision accuracy from 25% to 94% on a single task. Priming the task and providing room to generate vastly improves performance.
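We share the full prompt in Pt.02; the sketch below shows its general shape for a hypothetical 3x3 grid task: prime the layout, give two worked examples, and leave the model room to reason before answering. The grid size, labels, and examples are invented for illustration; the payload follows the OpenAI-style chat format.

```python
# Sketch of a grid-priming system prompt with two few-shot examples.
# Grid size, labels, and examples are hypothetical; the real prompt is task-specific.
GRID_SYSTEM_PROMPT = """You are analyzing a screenshot arranged as a 3x3 grid.
Columns are labeled A-C left to right; rows are labeled 1-3 top to bottom.

First describe what you see in each cell, then answer the question.

Example 1
Question: Which cell contains the red circle?
Reasoning: A1 empty, B1 blue square, C1 empty, A2 red circle, ...
Answer: A2

Example 2
Question: Which cell contains text?
Reasoning: A1 empty, B1 contains the word "Open", ...
Answer: B1
"""

def build_messages(question: str, image_url: str) -> list[dict]:
    # Typical chat-completion payload shape with an image attachment.
    return [
        {"role": "system", "content": GRID_SYSTEM_PROMPT},
        {"role": "user", "content": [
            {"type": "text", "text": question},
            {"type": "image_url", "image_url": {"url": image_url}},
        ]},
    ]
```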
Mixture of Answers
Example Logic Output
Examining multiple outputs for the same question, we observe that most logic failures are due to premature convergence.
For example, GPT-4o (at temperature 1) correctly answers a novel logic question 30% of the time, but will select the correct answer every time if it is included in a few-shot sequence.
The model knows the correct answer, but fails to generate it a majority of the time.
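One way to exploit this observation is to sample several answers at temperature 1 and then let the model choose among its own candidates as a multiple-choice question. A rough sketch using the OpenAI Python SDK; the candidate-voting wrapper is our own framing for this post, not a Klu or OpenAI feature.

```python
from openai import OpenAI

client = OpenAI()

def mixture_of_answers(question: str, n: int = 5) -> str:
    """Sample n answers at temperature 1, then ask the model to choose among
    its own candidates -- a rough antidote to premature convergence."""
    samples = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}],
        temperature=1,
        n=n,
    )
    candidates = [c.message.content for c in samples.choices]

    # Re-present the candidates as a multiple-choice question; in our tests
    # the model reliably selects the correct option when it is on the list.
    listing = "\n".join(f"{i + 1}. {a}" for i, a in enumerate(candidates))
    pick = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"Question: {question}\n\nCandidate answers:\n{listing}\n\n"
                       "Reply with the number of the best answer only.",
        }],
        temperature=0,
    )
    choice = pick.choices[0].message.content.strip()
    index = int("".join(ch for ch in choice if ch.isdigit()) or 1) - 1
    return candidates[max(0, min(index, len(candidates) - 1))]
```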
Context Overload
LLM Review Explanation
Most models struggle with long-context tasks. Only GPT-4o and Claude 3 Opus consistently and accurately answer 10 questions drawn from 120,000 tokens of context. Additionally, retrieved documents with strong length or structure bias the models toward imitating them rather than simply using them as knowledge.
Gemini Pro with 1-2M context windows consistently failed to deliver accurate results. Gemini 1.5 Flash showed improvements in July, but failed 60% of the time with incorrect or incomplete answers.
Claude 3.5 Sonnet and similar-size Claude models continue generating tokens from the end of the context without addressing the questions. We also observe this in smaller open-source models (fine-tuned, quantized).
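For reference, a simplified version of the kind of long-context probe behind these numbers: bury known facts at random depths inside roughly 120,000 tokens of filler, ask questions whose answers are those facts, and hard-score the responses. The filler text, facts, and token heuristic below are illustrative only.

```python
# Simplified long-context probe: plant known facts inside ~120k tokens of
# filler, then check whether each answer contains the expected fact.
# Filler text, facts, and the 4-chars-per-token heuristic are illustrative.
import random

FILLER = "The quarterly report was filed without further comment. "
FACTS = {
    "What is the project codename?": ("The project codename is Bluebird.", "Bluebird"),
    "Who approved the budget?": ("The budget was approved by Dana Reyes.", "Dana Reyes"),
}

def build_context(target_tokens: int = 120_000) -> str:
    chunks = [FILLER] * ((target_tokens * 4) // len(FILLER))  # ~4 chars per token
    for sentence, _ in FACTS.values():
        chunks.insert(random.randrange(len(chunks)), sentence + " ")
    return "".join(chunks)

def hard_score(answers: dict[str, str]) -> float:
    # An answer passes only if it contains the planted fact.
    passed = sum(expected.lower() in answers[q].lower()
                 for q, (_, expected) in FACTS.items())
    return passed / len(FACTS)
```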
Vision is Vibes
Vision Task Evaluation
Vision is in its early days. None of the models can look at Slack and tell you who is online.
Vision model performance relies heavily on vibes. What does this mean? Vision transformers break images down into tiles and predict the corresponding text.
This works well for OCR or checking an entire image for "hot dog" or "not hot dog," but it performs poorly in spatial recognition, such as accurately describing a grid or the objects within certain squares.
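To make the tiling intuition concrete, here is a rough sketch of the pre-processing step (the tile size is illustrative; real models use their own tiling and patch schemes). Presence questions like "is there a hot dog anywhere?" survive this step, while grid-position questions depend on spatial relationships that are easy to lose once tiles are encoded largely independently.

```python
# Rough sketch of how a vision pipeline tiles an image before encoding.
# Tile size is illustrative; real models differ in tile/patch handling.
from PIL import Image

TILE = 512

def tile_image(path: str) -> list[tuple[int, int, Image.Image]]:
    img = Image.open(path)
    tiles = []
    for top in range(0, img.height, TILE):
        for left in range(0, img.width, TILE):
            box = (left, top, min(left + TILE, img.width), min(top + TILE, img.height))
            # (column index, row index, cropped tile)
            tiles.append((left // TILE, top // TILE, img.crop(box)))
    return tiles
```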
What Does This Mean?
Hopefully the QUAKE benchmark serves as a reality check for the hype, the calls for regulation, and the outsized expectations. Yes, you can talk with a PDF. But are the answers and conclusions correct? That's another story.
Frontier models score exceptionally high on standardized benchmarks, but a future model solving QUAKE would mark a significant milestone in monetizable tasks performed by generative models.
Today's models successfully replace low-skill, outsourced labor (such as English-language blog writers in Afghanistan), but producing world-class products requires expertise and craftsmanship, focused on small, specific outputs.
This means agentic products such as AutoGPT, Adept, Devin, Lindy, or the Rabbit R1 are unlikely to work until they are built upon future model breakthroughs.
A few closing thoughts...
- The Performance Gap — Frontier models pass only 28% of practical tasks, despite scoring in the 80th percentile on benchmarks. This gap likely explains the lack of revenue or breakout application success.
- Optimization Required — Task-specific prompts boost performance from 30% to 90-100%, emphasizing the importance of prompt engineering and optimization techniques.
- Rethinking Benchmarks — Current benchmarks do not reflect real-world use. It's time for new, better evaluations.
These insights have implications for all stakeholders in this future...
- Builders — Focus on targeted optimizations and experiments
- Businesses — GitHub Copilot is 3 years old, made by the most valuable company in the world – set realistic timelines and expectations
- Investors — Consider performance improvements over the past two years, and the 98.5% cost reduction since GPT-2
- Government Officials — Rethink a rush to regulate and don't believe the hype (yet)
- My Fellow Humans — Training on someone's data is the only way it gets good
Frontier models have immense potential, but translating their capabilities into reliable, commercially viable applications remains challenging and requires work. For the folks who want to learn more about QUAKE, we have a second part coming soon.
Coming In Pt.02
In the next part of this series, we dive deep into the technical aspects and findings from our evaluations. We provide a comprehensive technical breakdown, including detailed prompt/model specifications and the design of our experiments. We also explore the intricacies of the QUAKE dataset, highlighting its diverse range of tasks and the challenges it presents.
Technical Breakdown
- Model Specifications — We provide an in-depth look at the specifications of each model we evaluate, including architecture, training data, and performance metrics.
- Experiment Design — We explain in detail how we set up our experiments, including the methodologies and tools we use to ensure accurate and reproducible results.
- QUAKE Dataset — We explore the QUAKE dataset, discussing its composition, the variety of tasks it includes, and the specific challenges it poses for different models.
Use Cases
We examine specific use cases to identify which models excel in particular tasks. This includes a discussion on the selection criteria for these use cases and an analysis of the models that emerge as winners in each category.
- Use Case Selection — We discuss the criteria we use to select the use cases, ensuring they are representative of real-world applications and challenges.
- Use Case Winners — We identify which models excel in specific tasks, providing a clear picture of their strengths and weaknesses.
Optimization Improvements
Furthermore, we discuss various optimization improvements. We explore the impact of including text versus images in prompts, and how this affects model performance. We also cover the significant role of prompt engineering, demonstrating how tailored prompts can drastically enhance output quality. Additionally, we look into the benefits of retrieval-augmented generation (RAG) and the impact of fine-tuning on model performance.
- Text in Prompt vs. Image — We analyze how including text versus images in prompts affects model performance, with examples and performance metrics.
- Prompt Engineering Impact — We demonstrate the significant role of prompt engineering, showing how tailored prompts can drastically enhance output quality.
- Retrieval-Augmented Generation (RAG) Impact — We explore the benefits – and drawbacks – of RAG, including how it may hurt the relevance and accuracy of model outputs.
- Fine-Tuning Impact — We look into how fine-tuning models on specific datasets can enhance their performance, with case studies and performance comparisons.
Stay tuned...