Retrieval-Augmented Generation (RAG) systems are only as strong as their ability to retrieve the right context and generate answers grounded in that context. While building a RAG pipeline is one challenge, evaluating whether it is actually performing well is another.

This is where RAGAS (Retrieval Augmented Generation Assessment) becomes useful.

RAGAS provides a set of evaluation metrics that help measure both the retriever quality and the generator quality in a RAG pipeline. Instead of relying on vague judgment, these metrics give you a structured way to identify exactly where your system is failing and what needs improvement.

In this article, we will walk through the key RAGAS metrics, what each one measures, how to interpret them, and how to debug low scores.

Why RAG Evaluation Matters

In a RAG system, a poor answer may happen because:

the retriever fetched irrelevant chunks,
the retriever missed critical information,
the LLM hallucinated,
the answer did not directly address the question,
or the model got distracted by noisy context.

Without proper metrics, it is difficult to know which part of the pipeline is causing the issue.

RAGAS helps break this problem into measurable components.

The 6 Core RAGAS Metrics

1. Faithfulness

Faithfulness checks whether the generated answer stays grounded in the retrieved context.
Its primary purpose is to detect hallucinations — situations where the LLM introduces facts that are not supported by the source material.

Simple way to think about it

Imagine a journalist writing an article based only on interview notes. If the journalist adds details that were never mentioned in the notes, those details are unfaithful to the source. The same logic applies here.

What it measures

Whether answer claims are supported by retrieved context
Whether the model is inventing unsupported details

Ideal score

1.0

Formula

Faithfulness = supported claims / total claims in the answer
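
The ratio above can be sketched in a few lines of Python. This is a minimal illustration, not the RAGAS implementation: it assumes each claim in the answer has already been judged as supported or unsupported (in practice an LLM makes that call), and the function name `faithfulness_score` is hypothetical.

```python
def faithfulness_score(claim_supported):
    """Return the fraction of answer claims that the retrieved context supports.

    claim_supported: list of booleans, one per claim extracted from the answer.
    The verdicts are assumed to come from a separate judging step.
    """
    if not claim_supported:
        raise ValueError("need at least one claim to score")
    return sum(claim_supported) / len(claim_supported)


# Hypothetical verdicts for a three-claim answer: two supported, one invented.
verdicts = [True, True, False]
score = faithfulness_score(verdicts)
print(round(score, 3))  # 0.667
```

A single unsupported claim in a short answer moves the score a lot, which is why short answers tend to show large swings in faithfulness.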

If the score is low

A low faithfulness score usually means the model is hallucinating.

What to improve

Strengthen the prompt to enforce context grounding
Reduce temperature
Use a stronger LLM
Limit generation freedom when strict factuality is required

2. Answer Relevancy

Answer Relevancy measures whether the generated answer actually addresses the user’s question.

An answer may be factually correct and still be irrelevant.

Example

If someone asks, “What is the capital of France?” and the answer is “The Eiffel Tower is beautiful,” the statement may be true, but it does not answer the question.

How RAGAS approaches this

RAGAS works backward from the answer:

It generates hypothetical questions that the answer could be responding to
It embeds those generated questions and the original question
It scores relevancy as the mean cosine similarity between the original question and the generated questions
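
The similarity step can be sketched as follows. This is an illustration under assumptions: the three-dimensional vectors are toy stand-ins for real embeddings, the question-generation step is skipped, and the function names are hypothetical rather than part of the RAGAS API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def answer_relevancy(question_vec, generated_question_vecs):
    """Mean similarity between the original question and the questions
    generated back from the answer."""
    sims = [cosine_similarity(question_vec, v) for v in generated_question_vecs]
    return sum(sims) / len(sims)


# Toy embeddings: one generated question matches the original exactly,
# the other points in an unrelated direction.
original = [1.0, 0.0, 0.0]
generated = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
score = answer_relevancy(original, generated)
print(score)  # 0.5
```

An off-topic answer produces generated questions that sit far from the original question in embedding space, which pulls the mean down.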

What it measures

Whether the answer is on-topic
Whether the answer addresses the intent of the question

Ideal score

1.0

If the score is low

The answer is likely drifting away from the actual question.

What to improve

Review your prompt template
Ensure question intent is clearly preserved
Validate whether the system properly handles different query types such as what, why, when, and who

3. Context Precision

Context Precision measures whether the most relevant retrieved chunks are ranked higher in the retrieval results.

This is not just about retrieving relevant content. It is about retrieving it in the right order.

Simple way to think about it

Imagine a librarian gives you five books to answer a question. Context Precision asks whether the most useful book was placed at the top of the stack.

What it measures

Ranking quality of retrieved chunks
Whether top-ranked chunks are the most useful

How it works

Each retrieved chunk is checked for relevance
Precision@k is calculated at each rank k that holds a relevant chunk, and these values are averaged
More relevant chunks near the top produce a higher score
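
A minimal sketch of this rank-aware calculation, assuming each retrieved chunk has already been labeled relevant or not (the labels below are hypothetical, and `context_precision` is an illustrative name rather than the library's own function):

```python
def context_precision(relevance):
    """Average of precision@k over the ranks k that hold a relevant chunk.

    relevance: list of booleans in retrieval order (index 0 = top-ranked).
    """
    hits = 0
    total = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / k  # precision@k at this relevant rank
    return total / hits if hits else 0.0


# Same four chunks, two orderings: the noisy chunk last versus first.
noise_last = [True, True, True, False]
noise_first = [False, True, True, True]

print(context_precision(noise_last))             # 1.0
print(round(context_precision(noise_first), 3))  # 0.639
```

Both orderings contain the same three relevant chunks, yet the score drops when the irrelevant chunk is ranked first. That ordering effect is what separates this metric from a plain relevance ratio.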

Ideal score

1.0

If the score is low

It usually means irrelevant chunks are being ranked too high.

What to improve

Improve your embedding model
Add a re-ranker
Tune similarity thresholds
Improve chunk quality and chunk boundaries

4. Context Recall

Context Recall measures whether the retriever fetched all the necessary information required to answer the question.

This metric focuses on retrieval completeness.

Simple way to think about it

Suppose you are studying from a textbook for an exam. Context Recall asks whether you covered all the chapters needed to answer the exam questions, or whether you missed some important sections.

What it measures

Whether all required supporting information was retrieved
Whether the retriever missed important facts

How it works

Break the reference answer into claims
Check whether each claim is supported by the retrieved context
Compute: reference claims supported by the context / total reference claims
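
The three steps above can be sketched like this. The claim splitting is done by hand, and `keyword_judge` is a deliberately crude, hypothetical stand-in for the LLM judgment that decides whether a claim is supported; it is here only so the example runs on its own.

```python
def context_recall(reference_claims, contexts, is_supported):
    """Fraction of reference claims that the retrieved contexts support."""
    if not reference_claims:
        raise ValueError("need at least one reference claim")
    found = sum(1 for claim in reference_claims if is_supported(claim, contexts))
    return found / len(reference_claims)


def keyword_judge(claim, contexts):
    """Toy judge: a claim counts as supported when every word of it
    appears (as a substring) in a single retrieved chunk."""
    words = claim.lower().split()
    return any(all(w in chunk.lower() for w in words) for chunk in contexts)


claims = ["tower in paris", "constructed from 1887 to 1889"]
one_chunk = [
    "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France.",
]
two_chunks = one_chunk + [
    "The tower was constructed from 1887 to 1889 as the centerpiece of the 1889 World's Fair.",
]

print(context_recall(claims, one_chunk, keyword_judge))   # 0.5
print(context_recall(claims, two_chunks, keyword_judge))  # 1.0
```

Retrieving the second chunk is what lifts recall from 0.5 to 1.0, which mirrors the effect of increasing top-k when the missing fact lives in a lower-ranked chunk.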

Ideal score

1.0

If the score is low

The retriever is likely missing important information.

What to improve

Increase top-k retrieval
Improve chunking strategy
Use better embeddings
Expand retrieval scope
Consider hybrid retrieval if semantic search alone is insufficient

5. Entity Recall

Entity Recall (listed as Context Entities Recall in the RAGAS documentation) checks whether the retrieved context contains the important entities mentioned in the reference answer.

These entities may include:

people
organizations
places
dates
products
events

Simple way to think about it

If the correct answer mentions Einstein, 1905, and Princeton, did your retrieved documents include those entities?

What it measures

Whether key named entities were successfully retrieved
Whether important contextual anchors are present in the retrieval output
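
As a rough sketch, entity recall reduces to a set overlap once the entities have been extracted. The extraction step itself is assumed to have happened already, and the context entities below are made up for illustration.

```python
def entity_recall(reference_entities, context_entities):
    """Share of reference-answer entities that also appear in the context."""
    reference = {e.lower() for e in reference_entities}
    context = {e.lower() for e in context_entities}
    if not reference:
        raise ValueError("reference answer has no entities to match")
    return len(reference & context) / len(reference)


# Entities from the reference answer versus entities found in the
# retrieved chunks (hypothetical values).
reference_entities = ["Einstein", "1905", "Princeton"]
context_entities = ["Einstein", "1905", "Bern", "Zurich"]

score = entity_recall(reference_entities, context_entities)
print(round(score, 3))  # 0.667
```

Extra entities in the context (Bern, Zurich) do not hurt the score. Only the missing reference entity (Princeton) does.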

Ideal score

1.0

If the score is low

Important entities are not being surfaced by retrieval.

What to improve

Use entity-aware chunking
Add keyword-based retrieval alongside semantic retrieval
Improve metadata filtering
Increase retrieval coverage for entity-heavy questions

6. Noise Sensitivity

Noise Sensitivity measures how often the model makes incorrect claims when the retrieved context contains irrelevant information.

This metric reflects the robustness of the overall RAG system.

Simple way to think about it

Imagine taking an open-book exam, but someone mixed random Wikipedia pages into your notes. Noise Sensitivity measures how often those irrelevant pages confuse you into writing wrong answers.

Important note

Unlike the other metrics, lower is better here.

0.0 = excellent robustness
1.0 = poor robustness

What it measures

Whether irrelevant chunks confuse the model
Whether noisy context causes wrong claims in the answer
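
A simplified sketch of the score, assuming every claim in the answer has already been checked against the reference answer and labeled correct or incorrect. The labels below are hypothetical, and this sketch does not trace which chunk caused each error.

```python
def noise_sensitivity(claim_is_correct):
    """Fraction of answer claims that are incorrect. Lower is better."""
    if not claim_is_correct:
        raise ValueError("need at least one claim to score")
    incorrect = sum(1 for ok in claim_is_correct if not ok)
    return incorrect / len(claim_is_correct)


# Four claims in the answer; one was wrong after noisy chunks were mixed in.
labels = [True, True, True, False]
score = noise_sensitivity(labels)
print(score)  # 0.25
```

Comparing this score on the same questions with and without the noisy chunks in the context is one way to see how much of the error rate the noise accounts for.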

Ideal score

0.0

If the score is high

The model is overly influenced by irrelevant context.

What to improve

Filter noisy chunks before generation
Use re-ranking
Improve prompt robustness
Reduce overly broad retrieval
Strengthen chunk quality controls

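Quick Reference Table

Metric | Evaluates | Question it answers | Ideal score
Faithfulness | Generator | Are the answer's claims supported by the retrieved context? | 1.0
Answer Relevancy | Generator | Does the answer address the question? | 1.0
Context Precision | Retriever | Are the most relevant chunks ranked highest? | 1.0
Context Recall | Retriever | Was all the necessary information retrieved? | 1.0
Entity Recall | Retriever | Are the reference answer's key entities in the context? | 1.0
Noise Sensitivity | Full system | Does irrelevant context cause wrong claims? | 0.0 (lower is better)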

How to Interpret These Metrics Together

One of the biggest mistakes in RAG evaluation is looking at only one metric.

A system may have:

high context recall but low faithfulness,
high answer relevancy but poor context precision,
or strong retrieval but high noise sensitivity.

Each metric tells only part of the story.

A practical grouping

Generator-focused metrics

Faithfulness
Answer Relevancy

Retriever-focused metrics

Context Precision
Context Recall
Entity Recall

System robustness metric

Noise Sensitivity

This grouping helps you quickly identify whether the issue is in:

generation,
retrieval,
or end-to-end robustness.
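
The grouping can be turned into a small triage helper. This is a sketch under assumptions: the metric keys, the function name, and both thresholds (0.7 and 0.3) are illustrative choices, not values recommended by RAGAS.

```python
# Metric groups as described above; the key names are hypothetical.
GROUPS = {
    "generation": ["faithfulness", "answer_relevancy"],
    "retrieval": ["context_precision", "context_recall", "entity_recall"],
}


def triage(scores, low=0.7, noisy=0.3):
    """Map metric scores to the pipeline stages that look weak.

    low:   higher-is-better metrics below this are flagged (assumed cutoff).
    noisy: noise sensitivity above this is flagged (assumed cutoff).
    """
    flagged = {}
    for stage, names in GROUPS.items():
        weak = [n for n in names if scores.get(n, 1.0) < low]
        if weak:
            flagged[stage] = weak
    if scores.get("noise_sensitivity", 0.0) > noisy:
        flagged["robustness"] = ["noise_sensitivity"]
    return flagged


scores = {
    "faithfulness": 0.55,
    "answer_relevancy": 0.90,
    "context_precision": 0.85,
    "context_recall": 0.90,
    "entity_recall": 0.80,
    "noise_sensitivity": 0.10,
}
print(triage(scores))  # {'generation': ['faithfulness']}
```

Here retrieval looks healthy and only faithfulness is flagged, so the prompt and generation settings are the first place to look.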

Debugging Guide: What to Do When Scores Are Low

Example: Evaluating a Simple RAG Sample

Let us take a simple example.

Question

What is the Eiffel Tower and where is it located?

Generated Response

“The Eiffel Tower is a famous iron lattice tower located in Paris, France. It was built in 1889.”

Reference Answer

“The Eiffel Tower is a wrought-iron lattice tower in Paris, France. It was constructed from 1887 to 1889.”

Retrieved Contexts

The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France.
The tower was constructed from 1887 to 1889 as the centerpiece of the 1889 World's Fair.
The Eiffel Tower is named after Gustave Eiffel, whose company designed and built the tower.
Paris is known for its café culture and fashion industry.

In this example:

Faithfulness should be reasonably high because the answer is supported by context
Answer Relevancy should be high because the response addresses the question directly
Context Precision depends on whether the most relevant chunks are ranked at the top
Context Recall checks whether all facts needed for the reference are present
Entity Recall checks whether key entities such as Eiffel Tower and Paris are present
Noise Sensitivity measures whether the unrelated Paris sentence causes confusion
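
For reference, the sample can be written down as plain Python data. The dictionary keys are assumptions made for this sketch, the relevance labels are assigned by hand, and the substring check is a crude stand-in for real entity extraction.

```python
sample = {
    "question": "What is the Eiffel Tower and where is it located?",
    "response": (
        "The Eiffel Tower is a famous iron lattice tower located in "
        "Paris, France. It was built in 1889."
    ),
    "reference": (
        "The Eiffel Tower is a wrought-iron lattice tower in Paris, France. "
        "It was constructed from 1887 to 1889."
    ),
    "retrieved_contexts": [
        "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France.",
        "The tower was constructed from 1887 to 1889 as the centerpiece of the 1889 World's Fair.",
        "The Eiffel Tower is named after Gustave Eiffel, whose company designed and built the tower.",
        "Paris is known for its café culture and fashion industry.",
    ],
}

# Hand-assigned relevance of each chunk to the question, in retrieval order.
relevant = [True, True, True, False]
noise_share = relevant.count(False) / len(relevant)

# Entities picked by hand from the reference answer.
reference_entities = ["Eiffel Tower", "Paris", "France", "1887", "1889"]
joined = " ".join(sample["retrieved_contexts"])
found = [e for e in reference_entities if e in joined]
entity_coverage = len(found) / len(reference_entities)

print(noise_share)      # 0.25
print(entity_coverage)  # 1.0
```

With the noisy chunk ranked last and every reference entity present in the context, the retrieval side of this sample looks healthy.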

The sample mixes three useful chunks with one irrelevant chunk, so it exercises the retrieval, generation, and robustness metrics at once.

Key Insights

1. Faithfulness and Answer Relevancy evaluate the Generator

These tell you whether the LLM is producing grounded and question-focused answers.

2. Context Precision, Context Recall, and Entity Recall evaluate the Retriever

These help assess retrieval quality, completeness, and entity coverage.

3. Noise Sensitivity evaluates robustness

This tells you how well the full system handles irrelevant context.

4. Intermediate evaluation is critical

Looking only at final answers is not enough. You need to inspect retrieval quality and grounding behavior separately.

5. The best evaluation comes from combining metrics

No single metric fully explains RAG quality. The real value comes from using these signals together.

Final Thoughts

If you are building enterprise-grade RAG systems, these metrics can help you move from guesswork to measurable quality improvement. They not only show whether a RAG pipeline is working, but also pinpoint whether the problem lies in retrieval, ranking, generation, or robustness.

That makes RAGAS not just an evaluation tool, but a debugging framework for building better GenAI applications.

Code implementing all of these metrics: https://github.com/LeelaPrasadG/rag_langchain/blob/main/12_RAGAS_Metrics_Deep_Dive.ipynb

Reference: https://medium.com/@danushidk507/evaluation-with-ragas-873a574b86a9

Sunday, 8 March 2026

A Practical Guide to Faithfulness, Relevancy, Retrieval Quality, and Robustness
