
RAGAS

by Stephen M. Walker II, Co-Founder / CEO

RAGAS (Retrieval-Augmented Generation Assessment) provides a suite of metrics to evaluate different aspects of RAG systems without relying on ground truth human annotations. These metrics are divided into two categories: retrieval and generation.

Metrics of RAGAS

  1. Retrieval Metrics — These metrics evaluate the performance of the retrieval system. They include:

    • Context Relevancy — This measures the signal-to-noise ratio in the retrieved contexts.
    • Context Recall — This measures the ability of the retriever to retrieve all the necessary information needed to answer the question. It is calculated by using the provided ground truth answer and an LLM to check if each statement from it can be found in the retrieved context.
  2. Generation Metrics — These metrics evaluate the performance of the generation system. They include:

    • Faithfulness — This measures the factual consistency of the answer against the retrieved context, penalizing hallucinations, i.e., claims not supported by the context.
    • Answer Relevancy — This measures how directly and completely the answer addresses the question.

The harmonic mean of these four aspects gives you the RAGAS score, which is a single measure of the performance of your QA system across all the important aspects.
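As a sketch of that combination step, the four per-metric scores can be folded into one RAGAS score with a harmonic mean (the metric values below are purely illustrative):

```python
from statistics import harmonic_mean

# illustrative per-metric scores, each in [0, 1]
scores = {
    "faithfulness": 0.95,
    "answer_relevancy": 0.87,
    "context_relevancy": 0.74,
    "context_recall": 0.90,
}

# the harmonic mean punishes a single weak aspect more heavily
# than an arithmetic mean would, so no metric can hide behind the others
ragas_score = harmonic_mean(scores.values())
print(round(ragas_score, 3))  # 0.858
```

Because of the harmonic mean, driving any one metric toward zero drags the whole RAGAS score toward zero, which is usually the desired behavior for a composite quality measure.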

What is the LLM used for in RAGAS metrics?

The RAGAS library uses LLMs as judges to score key metrics such as faithfulness, answer relevancy, and context relevancy, and integrates with frameworks such as LangChain.

For faithfulness, LLMs assess the factual accuracy of answers by comparing them to the provided context. Answer relevancy is measured by determining how well the answer addresses the posed question. Context relevancy is evaluated by analyzing the signal-to-noise ratio in the retrieved contexts.

By leveraging LLMs, RAGAS measures these metrics while mitigating the biases of exact-match scoring. It employs LLMs to extract the statements made in a generated answer and verify whether each is supported by the context, enabling reference-free evaluation. This approach reduces annotation costs and gives developers and researchers a practical tool for improving the accuracy and relevance of AI systems.
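As a rough, toy sketch of that statement-level check: the real library uses an LLM both to extract statements and to verify them against the context, whereas the stand-in below counts a statement as supported only if it appears verbatim in the context.

```python
def faithfulness_score(statements, context):
    """Toy faithfulness: fraction of answer statements supported by the context.

    This substring check is a stand-in assumption for the LLM verification
    step that RAGAS actually performs.
    """
    if not statements:
        return 0.0
    supported = sum(1 for s in statements if s.lower() in context.lower())
    return supported / len(statements)

context = "Paris is the capital of France. It lies on the Seine."
statements = [
    "Paris is the capital of France",        # supported by the context
    "Paris has a population of 10 million",  # absent: a hallucinated claim
]
print(faithfulness_score(statements, context))  # 0.5
```

A score of 1.0 means every claim in the answer is grounded in the retrieved context; each unsupported claim pulls the score down proportionally.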

How to Use RAGAS

To use RAGAS, you need a set of questions and, only if you're using context recall, a reference answer for each. Most of the metrics require no labeled data, so you can run an evaluation without first building a human-annotated test dataset.
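For illustration, a single evaluation example bundles the question, the generated answer, the retrieved contexts, and (only for context recall) a ground-truth answer. The field names below follow common RAGAS usage, but treat the exact schema as an assumption:

```python
# one evaluation record; only `ground_truths` requires human labeling,
# and only the context-recall metric consumes it
sample = {
    "question": "What does RAGAS evaluate?",
    "answer": "RAGAS evaluates retrieval and generation quality in RAG pipelines.",
    "contexts": [
        "RAGAS provides reference-free metrics for RAG systems,"
        " covering both retrieval and generation."
    ],
    "ground_truths": ["RAGAS evaluates RAG pipelines."],
}

# the reference-free metrics need only these three fields
reference_free_fields = {"question", "answer", "contexts"}
print(reference_free_fields.issubset(sample))  # True
```

In practice you would collect many such records and score them in batch, but the per-record shape stays the same.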

Here's a Python code snippet showing how to use RAGAS for evaluation:

from ragas.metrics import faithfulness, answer_relevancy, context_relevancy, context_recall
from ragas.langchain import RagasEvaluatorChain

# make an evaluator chain for each metric
eval_chains = {
    m.name: RagasEvaluatorChain(metric=m)
    for m in [faithfulness, answer_relevancy, context_relevancy, context_recall]
}

# `result` is the output dict of a LangChain QA chain, e.g.:
# result = qa_chain({"query": "..."})

# run each evaluation and print its score
for name, eval_chain in eval_chains.items():
    score_name = f"{name}_score"
    print(f"{score_name}: {eval_chain(result)[score_name]}")

In this code, RagasEvaluatorChain creates an evaluator chain for each metric. Calling the evaluator chain (via its __call__() method) with the output of the QA chain runs the corresponding evaluation.

RAGAS is a powerful framework for evaluating RAG pipelines, providing actionable metrics with minimal annotated data, at lower cost, and faster than human evaluation. It helps developers ensure their QA systems are robust and ready for deployment.

More terms

What is cognitive computing?

Cognitive computing refers to the development of computer systems that can simulate human thought processes, including perception, reasoning, learning, and problem-solving. These systems use artificial intelligence techniques such as machine learning, natural language processing, and data analytics to process large amounts of information and make decisions based on patterns and relationships within the data. Cognitive computing is often used in applications such as healthcare, finance, and customer service, where it can help humans make more informed decisions by providing insights and recommendations based on complex data analysis.


What is game theory?

Game theory in the context of artificial intelligence (AI) is a mathematical framework used to model and analyze the strategic interactions between different agents, where an agent can be any entity capable of making decisions, such as a computer program or a robot. In AI, game theory is particularly relevant for multi-agent systems, where multiple AI agents interact with each other, each seeking to maximize their own utility or payoff.
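As a minimal illustration of such strategic interaction, consider two agents facing a prisoner's-dilemma-style payoff matrix (the payoffs below are the standard textbook values):

```python
# payoffs[(a1, a2)] = (payoff to agent 1, payoff to agent 2)
# C = cooperate, D = defect; standard prisoner's dilemma values
payoffs = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}

def best_response(opponent_action):
    """Agent 1's utility-maximizing reply to a fixed opponent action."""
    return max(["C", "D"], key=lambda a: payoffs[(a, opponent_action)][0])

# defection is the best response to either opponent action, so (D, D)
# is the unique Nash equilibrium even though (C, C) pays both agents more
print(best_response("C"), best_response("D"))  # D D
```

Multi-agent AI systems use exactly this kind of analysis, at larger scale, to predict how self-interested agents will behave and to design mechanisms that align their incentives.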

