
GAIA Benchmark (General AI Assistants)

by Stephen M. Walker II, Co-Founder / CEO

What is the GAIA Benchmark (General AI Assistants)?

GAIA (General AI Assistants) is a benchmark designed to evaluate the performance of AI systems. It was introduced to push the boundaries of what we expect from AI, examining not just accuracy but the ability to navigate complex, layered queries. GAIA poses real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and general tool-use proficiency.

The GAIA benchmark is composed of 466 questions designed and annotated by humans. These questions are text-based and sometimes come with a file (such as an image or a spreadsheet). They cover various assistant use cases such as daily personal tasks, science, and more. The questions are conceptually simple for humans yet challenging for most advanced AIs. For instance, human respondents obtain 92% accuracy versus 15% for GPT-4 equipped with plugins.

GAIA's benchmarking approach is distinctive in that it measures not only the 'what' of correct answers but also the 'how': how demanding the path to a correct answer is. It is akin to evaluating a student not just on the answer they give but on the work required to reach it. The benchmark categorizes questions into levels defined by the number of reasoning steps and tools required, with each subsequent level representing an increase in complexity and cognitive demand, while scoring itself remains a simple, automated accuracy measure reported per level. The tasks mimic real-world applications, testing an AI's ability to understand and operate in the human world.

GAIA's philosophy departs from the current trend in AI benchmarks, proposing instead to target tasks that are grounded in real-world interactions. It does not specify which APIs an assistant may use and instead relies on interaction with the real world. This approach is seen as revolutionary in the realm of AI assistants because it moves away from siloed, task-specific evaluation methods.

GAIA Leaderboard

| Model name | Average score (%) | Level 1 score (%) | Level 2 score (%) | Level 3 score (%) | Organisation | Model family |
|---|---|---|---|---|---|---|
| GPT4 + manually selected plugins | 14.6 | 30.3 | 9.7 | 0 | GAIA authors | GPT4 |
| GPT4 Turbo | 9.7 | 20.75 | 5.81 | 0 | GAIA authors | GPT4 |
| GPT4 | 6.06 | 15.09 | 2.33 | 0 | GAIA authors | GPT4 |
| AutoGPT4 | 4.85 | 13.21 | 0 | 3.85 | AutoGPT | AutoGPT + GPT4 |
| GPT3.5 | 4.85 | 7.55 | 4.65 | 0 | GAIA authors | GPT3 |

Example GAIA Questions

The benchmark's 466 questions span three levels of complexity, defined by the number of steps and tools required to solve the task. While conceptually simple for humans, they remain challenging for even the most advanced AI systems. The examples below illustrate the style of question GAIA asks.

A paper about AI regulation that was originally submitted to arXiv.org in June 2022 shows a figure with three axes, where each axis has a label word at both ends. Which of these words is used to describe a type of society in a Physics and Society article submitted to arXiv.org on August 11, 2016?

The evaluation in GAIA is automated, fast, and factual. Each question calls for an answer that is either a string (one or a few words), a number, or a comma-separated list of strings or floats, unless specified otherwise. There is only one correct answer, and evaluation is done via quasi-exact match between a model's answer and the ground truth.
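
To make the scoring concrete, below is a minimal sketch of a quasi-exact-match checker in Python. The normalization rules here (lowercasing, punctuation stripping, thousands-separator handling) are illustrative assumptions, not the official GAIA scorer.

```python
import re
import string

# Matches plain numbers, optionally with $ or %, and thousands separators like "1,000".
_NUM_RE = re.compile(r"^\$?\d{1,3}(,\d{3})*(\.\d+)?%?$|^\$?\d+(\.\d+)?%?$")

def normalize_str(s: str) -> str:
    # Lowercase, strip punctuation, and collapse whitespace before comparing strings.
    s = s.lower().strip().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", s)

def to_number(s: str):
    # Return a float if the string looks like a number, otherwise None.
    s = s.strip()
    if not _NUM_RE.match(s):
        return None
    return float(s.replace(",", "").replace("$", "").replace("%", ""))

def quasi_exact_match(prediction: str, ground_truth: str) -> bool:
    gold_num = to_number(ground_truth)
    if gold_num is not None:
        # Numeric answers: compare as floats so "1,000" matches "1000".
        pred_num = to_number(prediction)
        return pred_num is not None and abs(pred_num - gold_num) < 1e-9
    if "," in ground_truth:
        # Comma-separated lists: compare element by element, preserving order.
        preds = [p.strip() for p in prediction.split(",")]
        golds = [g.strip() for g in ground_truth.split(",")]
        return len(preds) == len(golds) and all(
            quasi_exact_match(p, g) for p, g in zip(preds, golds)
        )
    # String answers: case- and punctuation-insensitive exact match.
    return normalize_str(prediction) == normalize_str(ground_truth)

print(quasi_exact_match("Paris", "paris"))    # True
print(quasi_exact_match("3, 5, 7", "3,5,7"))  # True
print(quasi_exact_match("1,000", "1000"))     # True
```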

According to github, when was Regression added to the oldest closed numpy.polynomial issue that has the Regression label in MM/DD/YY?

However, GAIA does not evaluate the trace leading to the answer. Different paths could lead to the correct answer, and there is no obvious and simple way to grade these paths.

The Metropolitan Museum of Art has a portrait in its collection with an accession number of 29.100.5. Of the consecrators and co-consecrators of this portrait's subject as a bishop, what is the name of the one who never became pope?

The GAIA benchmark is designed to test the capabilities of AI systems in a way that is closer to real-world tasks, rather than focusing on tasks that are difficult for humans. The researchers behind GAIA believe that the successful resolution of GAIA would be an important milestone towards the next generation of AI.

To maintain the integrity of the benchmark, the researchers have released the questions while retaining answers to 300 of them to power a leaderboard. The remaining 166 questions and answers were released as a development set.
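
The released questions and the 166-question development split are distributed through Hugging Face. As a minimal sketch, assuming the gaia-benchmark/GAIA dataset with a "2023_all" config and a "validation" split (the dataset is gated, so an authenticated Hugging Face account is required, and the exact identifiers should be checked against the dataset card), loading the development set might look like this:

```python
# Sketch: load the GAIA development split from Hugging Face.
# Assumes `pip install datasets` and `huggingface-cli login` (the dataset is gated).
from datasets import load_dataset

# "2023_all" and "validation" are assumed config/split names; check the
# gaia-benchmark/GAIA dataset card for the exact identifiers.
gaia_dev = load_dataset("gaia-benchmark/GAIA", "2023_all", split="validation")

print(len(gaia_dev))       # expected: 166 annotated questions with answers
print(gaia_dev[0].keys())  # question text, level, ground-truth answer, optional file
```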

As of the latest data, the leading GAIA score belongs to GPT-4 with manually selected plugins, at roughly 15% average accuracy (about 30% on Level 1 questions). This performance disparity between humans and AI systems underscores the challenges that AI still faces in handling tasks that are simple for humans but complex for machines.

The goals and structure of GAIA

Designed to overcome the limitations of Large Language Model (LLM) evaluations, GAIA encompasses a range of use cases, from everyday tasks to scientific inquiries. The benchmark's philosophy contrasts with traditional AI benchmarks by focusing on tasks that are deceptively simple for humans but intricate for AI, aiming to push AI towards the next generation of capabilities.

The GAIA benchmark assesses AI systems against real-world tasks through 466 questions that test fundamental abilities like reasoning, multi-modality handling, web browsing, and tool-use proficiency. These tasks, while straightforward for humans, present significant challenges for AI. The benchmark's automated, swift, and precise evaluation process requires answers in the form of strings, numbers, or comma-separated lists.

To foster competition and progress, the GAIA leaderboard is powered by the 300 retained questions, with the remaining 166 questions serving as a development set. Currently, GPT-4 with manually selected plugins leads with roughly 15% average accuracy (about 30% on Level 1), highlighting the gap between human and AI task performance and the ongoing challenges in AI's evolution.

What are the key features of the GAIA benchmark?

The GAIA (General AI Assistants) benchmark is a framework designed to evaluate AI systems, particularly their ability to function as general assistants. Here are its key features:

  • Real-World Questions: GAIA includes 466 human-designed and annotated questions that are text-based and may include files like images or spreadsheets. These questions are intended to reflect real-world challenges.

  • Automated and Factual Evaluation: The benchmark is structured for automated, fast, and factual evaluation. Answers are expected to be strings, numbers, or lists, with only one correct answer for each question, allowing for quasi-exact match evaluation.

  • Levels of Difficulty: The benchmark is structured around three levels of difficulty, with each successive level requiring more reasoning steps, more tool use, and greater cognitive demand.

  • Fundamental Abilities: GAIA tests for fundamental abilities such as reasoning, multi-modality handling, web browsing, and tool-use proficiency. These abilities are crucial for AI systems to navigate complex, layered queries.

  • Performance Disparity: There is a significant performance gap between humans and AI on GAIA, with humans achieving 92% accuracy compared to 15% for GPT-4 equipped with plugins. This contrasts with other benchmarks where AI may outperform humans in specialized domains.

  • Focus on Process: GAIA's difficulty levels reflect the process needed to reach an answer (the number of steps and tools a question requires), even though scoring itself checks only the final answer.

  • Real-World Interaction: Unlike other benchmarks that may focus on an AI's ability to use specific APIs, GAIA emphasizes interactions with the real world, which is considered a more general and challenging approach.

  • Milestone for AI Research: Solving GAIA is seen as a significant milestone in AI research, indicating progress towards Artificial General Intelligence (AGI).

These features make GAIA a comprehensive and challenging benchmark that aims to push the boundaries of what AI systems can achieve in terms of general assistance and real-world problem-solving.

How does GAIA work?

The GAIA benchmark requires AI systems to demonstrate a variety of fundamental abilities to solve its questions effectively. These abilities include:

  • Reasoning: The capacity to process information and make inferences or deductions based on the given data.
  • Multi-modality handling: The ability to interpret and integrate information from various modalities, such as text, images, and spreadsheets.
  • Web browsing: The skill to navigate the internet to find information that can help in answering questions.
  • Tool-use proficiency: The general capability to utilize tools, which could include software applications or online services, to perform tasks or solve problems.

These abilities are essential because GAIA's questions are designed to be conceptually simple for humans but challenging for AI, requiring more than retrieval of information from training data. They demand understanding of, and operation within, the human world, simulating real-world applications and interactions.
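
To illustrate what tool-use proficiency involves in practice, here is a minimal, hypothetical sketch of the kind of tool-calling loop a GAIA-style assistant needs. The llm(), web_search(), and read_file() functions are placeholders for illustration only; GAIA itself deliberately does not prescribe any particular tools or APIs.

```python
# Hypothetical sketch of a tool-use loop for a GAIA-style assistant.
from typing import Callable

def web_search(query: str) -> str:
    # Placeholder: a real assistant would call a search engine or browser here.
    return f"[search results for: {query}]"

def read_file(path: str) -> str:
    # Placeholder: a real assistant would parse the attached image or spreadsheet.
    return f"[contents of {path}]"

TOOLS: dict[str, Callable[[str], str]] = {"web_search": web_search, "read_file": read_file}

def llm(scratchpad: str) -> str:
    # Placeholder for a model call. A real system would return either
    # "TOOL <name> <argument>" to request a tool or "ANSWER <final answer>" to stop.
    return "ANSWER example"

def answer_question(question: str, max_steps: int = 10) -> str:
    scratchpad = question
    for _ in range(max_steps):
        decision = llm(scratchpad)
        if decision.startswith("ANSWER"):
            return decision.removeprefix("ANSWER").strip()
        _, tool_name, argument = decision.split(" ", 2)
        observation = TOOLS[tool_name](argument)      # run the requested tool
        scratchpad += f"\n{decision}\n{observation}"  # feed the result back to the model
    return "no answer found"

print(answer_question("Which consecrator of the portrait's subject never became pope?"))
```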

What are some examples of multi-modality handling in GAIA benchmark questions?

Many GAIA questions require multi-modality handling, meaning the AI must interpret and integrate information from different modalities. A question's text may be accompanied by a file such as an image or a spreadsheet, and the assistant must read that attachment together with the question to produce the answer, for example extracting a value from an attached spreadsheet or identifying details visible in a supplied image.
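
As a minimal sketch of what this looks like in practice (the record fields and file handling below are illustrative assumptions, not the official GAIA schema), an assistant might fold a spreadsheet attachment into its prompt before reasoning over it:

```python
# Sketch: include a spreadsheet attachment in the prompt for a GAIA-style question.
# The record fields ("question", "file_name") are assumptions for illustration.
import pandas as pd  # reading .xlsx files also requires openpyxl

def build_prompt(record: dict) -> str:
    prompt = f"Question: {record['question']}\n"
    file_name = record.get("file_name")
    if file_name and file_name.endswith((".xlsx", ".csv")):
        # Read the attached table and include it as text so the model can reason over it.
        table = pd.read_csv(file_name) if file_name.endswith(".csv") else pd.read_excel(file_name)
        prompt += f"\nAttached file {file_name}:\n{table.to_string(index=False)}\n"
    prompt += "\nAnswer with a single string, number, or comma-separated list."
    return prompt

example = {"question": "What is the total revenue in the attached sheet?", "file_name": "sales.xlsx"}
# print(build_prompt(example))  # requires sales.xlsx to exist locally
```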

What is the difference between GAIA and other AI benchmarks?

The GAIA benchmark differs from other AI benchmarks in several key ways:

  1. Focus on Real-World Interactions: Unlike other benchmarks that focus on tasks that are difficult for humans or that test current model capabilities, GAIA focuses on real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and tool-use proficiency. These questions are conceptually simple for humans yet challenging for most advanced AIs.

  2. Evaluation of General Results: GAIA does not specify possible APIs, and relies on interactions with the real world. This contrasts with other benchmarks that risk evaluating how well the assistants have learned to use specific APIs, instead of more general results grounded in real-world interactions.

  3. Performance Disparity: GAIA questions have shown a notable performance disparity between humans and AI. For instance, human respondents obtain 92% accuracy versus 15% for GPT-4 equipped with plugins. This contrasts with the recent trend of large language models (LLMs) outperforming humans on tasks such as law or chemistry.

  4. Assessment of Cognitive Abilities: GAIA goes beyond narrow tasks to cover a broad range of human cognitive abilities, posing questions that demand reasoning, multi-step planning, and tool use rather than simple recall.

  5. Revolutionary Approach: In the realm of AI Assistants, GAIA's approach is seen as revolutionary as it moves away from the siloed, task-specific evaluation methods.

GAIA benchmarking emphasizes real-world interaction tasks, moving away from conventional AI benchmarks that focus on narrow, task-specific challenges.

What are the benefits of GAIA?

GAIA (General AI Assistants) is a benchmark designed to evaluate the performance of AI Assistants against a series of tasks and scenarios that reflect real-world challenges. Here are some of its benefits:

  1. Human-like Reasoning: GAIA aims to evaluate whether AI systems can demonstrate human-like reasoning. It poses real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and general tool-use proficiency.

  2. Shared Human Values: GAIA could help guide AI development in a direction that emphasizes shared human values such as empathy, creativity, and ethical judgment.

  3. Structured Evaluation: GAIA categorizes questions into levels, with each subsequent level representing an increase in complexity and cognitive demand.

  4. Graded Difficulty: GAIA measures not only the 'what' (correct answers) but also the 'how', in the sense that its levels reflect how many steps and tools are needed to arrive at an answer.

  5. Milestone in AI Research: Solving GAIA would represent a significant milestone in AI research. The leading GAIA score currently belongs to GPT-4 with manually selected plugins, at roughly 15% average accuracy (about 30% on Level 1 questions).

What are the limitations of GAIA?

GAIA's evaluation focuses solely on the final answer, not accounting for the various methods an AI might use to arrive at that conclusion. This means that while different approaches may yield the correct result, GAIA does not differentiate or assess the processes behind them.

GAIA also has some additional limitations:

  1. Reproducibility Issues: The capabilities of models closed behind APIs might change over time, making an evaluation done at some point in time not reproducible. For example, ChatGPT plugins and their capabilities change regularly, and are not accessible through the GPT Assistants API yet.

  2. Limited to a Single Correct Response: Because each question admits only one correct answer and only that final answer is evaluated, GAIA is robust to the randomness of token generation, but this design limits its ability to evaluate AI systems in scenarios where multiple correct responses are possible.

GAIA's design to evaluate AI through real-world interactions rather than predefined APIs presents both strengths and limitations. While it ensures that AI systems are tested on their ability to navigate and interpret the human world, it may not fully assess their proficiency with specific APIs.

Additionally, the benchmark's relevance may diminish over time, whether because the online resources its questions depend on change or because the questions eventually surface in model training data. A significant performance gap is observed in GAIA's results, with human respondents achieving 92% accuracy compared to only 15% for GPT-4 with plugins, highlighting the current limitations of AI in understanding and responding to complex, real-world tasks.
