Retrieval-Augmented Generation (RAG) systems are only as strong as their ability to retrieve the right context and generate answers grounded in that context. While building a RAG pipeline is one challenge, evaluating whether it is actually performing well is another.
This is where RAGAS (Retrieval Augmented Generation Assessment) becomes extremely useful.
RAGAS provides a set of evaluation metrics that help measure both the retriever quality and the generator quality in a RAG pipeline. Instead of relying on vague judgment, these metrics give you a structured way to identify exactly where your system is failing and what needs improvement.
In this article, we will walk through the key RAGAS metrics, what each one measures, how to interpret them, and how to debug low scores.
Why RAG Evaluation Matters
In a RAG system, a poor answer may happen because:
- the retriever fetched irrelevant chunks,
- the retriever missed critical information,
- the LLM hallucinated,
- the answer did not directly address the question,
- or the model got distracted by noisy context.
Without proper metrics, it is difficult to know which part of the pipeline is causing the issue.
RAGAS helps break this problem into measurable components.
The 6 Core RAGAS Metrics
1. Faithfulness
Faithfulness checks whether the generated answer stays grounded in the retrieved context.
Its primary purpose is to detect hallucinations: situations where the LLM introduces facts that are not supported by the source material.
Simple way to think about it
Imagine a journalist writing an article based only on interview notes. If the journalist adds details that were never mentioned in the notes, those details are unfaithful to the source. The same logic applies here.
What it measures
- Whether answer claims are supported by retrieved context
- Whether the model is inventing unsupported details
Ideal score
1.0
Formula
Faithfulness = answer claims supported by the retrieved context / total claims in the answer
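The ratio above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the RAGAS implementation: it assumes the answer has already been split into claims and that each claim already has a supported/unsupported verdict (in practice an LLM judge would supply these). The function name `faithfulness_score` and the sample verdicts are hypothetical.

```python
def faithfulness_score(claim_supported: list[bool]) -> float:
    """Fraction of answer claims that the retrieved context supports."""
    if not claim_supported:
        raise ValueError("the answer must contain at least one claim")
    # True counts as 1 and False as 0, so sum() gives the supported count.
    return sum(claim_supported) / len(claim_supported)


# Hypothetical verdicts: four claims in the answer, one of them unsupported.
verdicts = [True, True, True, False]
score = faithfulness_score(verdicts)
print(f"faithfulness = {score:.2f}")
```

A single unsupported claim out of four already pulls the score down to 0.75, which is why short answers are especially sensitive to one hallucinated detail.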
If the score is low
A low faithfulness score usually means the model is hallucinating.
What to improve
- Strengthen the prompt to enforce context grounding
- Reduce temperature
- Use a stronger LLM
- Limit generation freedom when strict factuality is required
2. Answer Relevancy
Answer Relevancy measures whether the generated answer actually addresses the user’s question.
An answer may be factually correct and still be irrelevant.
Example
If someone asks, “What is the capital of France?” and the answer is “The Eiffel Tower is beautiful,” the statement may be true, but it does not answer the question.
How RAGAS approaches this
RAGAS uses a smart reverse-engineering method:
- It generates hypothetical questions from the answer
- It compares those generated questions with the original question using embeddings
- It computes similarity to determine how relevant the answer is
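The three steps above can be sketched as follows. This is a toy illustration under strong assumptions: the "embedding" is a simple bag-of-words count rather than a real embedding model, and the generated questions are written by hand instead of produced by an LLM. The names `embed`, `cosine`, and `answer_relevancy` are hypothetical and not part of any library.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def answer_relevancy(question: str, generated_questions: list[str]) -> float:
    """Mean similarity between the original and the generated questions."""
    if not generated_questions:
        raise ValueError("need at least one generated question")
    q_vec = embed(question)
    sims = [cosine(q_vec, embed(g)) for g in generated_questions]
    return sum(sims) / len(sims)


question = "What is the capital of France?"
# Hand-written stand-ins for questions an LLM might derive from each answer.
from_good_answer = ["What is the capital of France?",
                    "Which city is the capital of France?"]
from_off_topic_answer = ["Is the Eiffel Tower beautiful?",
                         "What does the Eiffel Tower look like?"]

on_topic = answer_relevancy(question, from_good_answer)
off_topic = answer_relevancy(question, from_off_topic_answer)
print(f"on-topic: {on_topic:.2f}, off-topic: {off_topic:.2f}")
```

Questions derived from an on-topic answer land close to the original question, while questions derived from the Eiffel Tower answer share only a few filler words with it, so the second score is much lower.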
What it measures
- Whether the answer is on-topic
- Whether the answer addresses the intent of the question
Ideal score
1.0
If the score is low
The answer is likely drifting away from the actual question.
What to improve
- Review your prompt template
- Ensure question intent is clearly preserved
- Validate whether the system properly handles different query types such as what, why, when, and who
3. Context Precision
Context Precision measures whether the most relevant retrieved chunks are ranked higher in the retrieval results.
This is not just about retrieving relevant content. It is about retrieving it in the right order.
Simple way to think about it
Imagine a librarian gives you five books to answer a question. Context Precision asks whether the most useful book was placed at the top of the stack.
What it measures
- Ranking quality of retrieved chunks
- Whether top-ranked chunks are the most useful
How it works
- Each retrieved chunk is checked for relevance
- Precision is calculated at different ranks
- More relevant chunks near the top produce a higher score
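A minimal sketch of the rank-aware calculation described above, assuming each retrieved chunk already carries a binary relevance verdict (1 = relevant, 0 = not) in retrieval order. It averages precision@k over the ranks that hold a relevant chunk, which is one common way to reward relevant chunks near the top. The function name `context_precision` and the example rankings are hypothetical.

```python
def context_precision(relevance: list[int]) -> float:
    """Mean of precision@k over the ranks k that hold a relevant chunk."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            # precision@rank = relevant chunks seen so far / chunks seen so far
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


# Same two relevant chunks, ranked first versus ranked last.
relevant_first = context_precision([1, 1, 0, 0])
relevant_last = context_precision([0, 0, 1, 1])
print(f"relevant first: {relevant_first:.2f}, relevant last: {relevant_last:.2f}")
```

Both rankings contain the same chunks, yet the score drops sharply when the relevant ones sit at the bottom, which is exactly the ordering effect this metric is meant to expose.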
Ideal score
1.0
If the score is low
It usually means irrelevant chunks are being ranked too high.
What to improve
- Improve your embedding model
- Add a re-ranker
- Tune similarity thresholds
- Improve chunk quality and chunk boundaries
4. Context Recall
Context Recall measures whether the retriever fetched all the necessary information required to answer the question.
This metric focuses on retrieval completeness.
Simple way to think about it
Suppose you are studying from a textbook for an exam. Context Recall asks whether you covered all the chapters needed to answer the exam questions, or whether you missed some important sections.
What it measures
- Whether all required supporting information was retrieved
- Whether the retriever missed important facts
How it works
- Break the reference answer into claims
- Check whether each claim is supported by the retrieved context
- Compute: reference claims supported by the context / total reference claims
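The steps above can be sketched as a small function that takes the reference claims, the retrieved chunks, and a judge that decides whether a claim is supported. The judge used here is a deliberately naive keyword check, standing in for the LLM-based judgment a real evaluation would use. The names `context_recall` and `keyword_judge`, and the sample strings, are hypothetical.

```python
from typing import Callable


def context_recall(
    reference_claims: list[str],
    contexts: list[str],
    is_supported: Callable[[str, list[str]], bool],
) -> float:
    """Fraction of reference claims that the retrieved context supports."""
    if not reference_claims:
        raise ValueError("the reference must contain at least one claim")
    found = sum(1 for claim in reference_claims if is_supported(claim, contexts))
    return found / len(reference_claims)


def keyword_judge(claim: str, contexts: list[str]) -> bool:
    """Toy judge: every word of the claim appears in one single chunk."""
    words = set(claim.lower().split())
    return any(words <= set(chunk.lower().split()) for chunk in contexts)


claims = [
    "tower in paris",
    "constructed from 1887 to 1889",
    "designed by gustave eiffel",
]
chunks = [
    "The tower in Paris is made of iron",
    "It was constructed from 1887 to 1889",
]
recall = context_recall(claims, chunks, keyword_judge)
print(f"context recall = {recall:.2f}")
```

The third claim has no supporting chunk, so recall is two out of three; retrieving one more chunk about the designer would bring it to 1.0.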
Ideal score
1.0
If the score is low
The retriever is likely missing important information.
What to improve
- Increase top-k retrieval
- Improve chunking strategy
- Use better embeddings
- Expand retrieval scope
- Consider hybrid retrieval if semantic search alone is insufficient
5. Entity Recall
Entity Recall checks whether the retrieved context contains the important entities mentioned in the reference answer.
These entities may include:
- people
- organizations
- places
- dates
- products
- events
Simple way to think about it
If the correct answer mentions Einstein, 1905, and Princeton, did your retrieved documents include those entities?
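The Einstein example can be turned into a short calculation. This sketch assumes the entities have already been extracted from the reference answer and from the retrieved context (a real pipeline would use an LLM or an NER model for that step), and it scores the overlap as a fraction of the reference entities. The function name `entity_recall` and the retrieved entity set are hypothetical.

```python
def entity_recall(reference_entities: set[str], context_entities: set[str]) -> float:
    """Fraction of reference entities that also appear in the retrieved context."""
    if not reference_entities:
        raise ValueError("the reference must contain at least one entity")
    return len(reference_entities & context_entities) / len(reference_entities)


reference = {"Einstein", "1905", "Princeton"}
# Hypothetical entities found in the retrieved chunks: Princeton is missing.
retrieved = {"Einstein", "1905", "Bern", "Zurich"}

score = entity_recall(reference, retrieved)
print(f"entity recall = {score:.2f}")
```

Extra entities in the context (Bern, Zurich) do not affect the score; only the missing reference entity does.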
What it measures
- Whether key named entities were successfully retrieved
- Whether important contextual anchors are present in the retrieval output
Ideal score
1.0
If the score is low
Important entities are not being surfaced by retrieval.
What to improve
- Use entity-aware chunking
- Add keyword-based retrieval alongside semantic retrieval
- Improve metadata filtering
- Increase retrieval coverage for entity-heavy questions
6. Noise Sensitivity
Noise Sensitivity measures how much irrelevant information in the retrieved context causes the model to make mistakes.
This metric reflects the robustness of the overall RAG system.
Simple way to think about it
Imagine taking an open-book exam, but someone mixed random Wikipedia pages into your notes. Noise Sensitivity measures how often those irrelevant pages confuse you into writing wrong answers.
Important note
Unlike the other metrics, lower is better here.
- 0.0 = excellent robustness
- 1.0 = poor robustness
What it measures
- Whether irrelevant chunks confuse the model
- Whether noisy context causes wrong claims in the answer
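One simplified reading of this metric is the share of answer claims that are both wrong and traceable to an irrelevant chunk. The sketch below computes that share. It is an assumption about how the score could be derived, not a description of the RAGAS implementation: the `Claim` structure, its two flags, and the example claims are all hypothetical, and in practice the flags would come from an LLM judge comparing the answer with the reference and the retrieved chunks.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    text: str
    correct: bool     # agrees with the reference answer
    from_noise: bool  # traceable to an irrelevant retrieved chunk


def noise_sensitivity(claims: list[Claim]) -> float:
    """Share of answer claims that are wrong because of noisy context."""
    if not claims:
        raise ValueError("the answer must contain at least one claim")
    harmful = sum(1 for c in claims if c.from_noise and not c.correct)
    return harmful / len(claims)


# Hypothetical judged claims for a single answer.
answer_claims = [
    Claim("The tower is in Paris", correct=True, from_noise=False),
    Claim("It is made of wrought iron", correct=True, from_noise=False),
    Claim("It was completed in 1889", correct=True, from_noise=False),
    Claim("It was built as a fashion museum", correct=False, from_noise=True),
]
score = noise_sensitivity(answer_claims)
print(f"noise sensitivity = {score:.2f}")
```

Here one claim out of four was pulled in from a noisy chunk and is wrong, so the score is 0.25; a fully robust answer would score 0.0.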
Ideal score
0.0
If the score is high
The model is overly influenced by irrelevant context.
What to improve
- Filter noisy chunks before generation
- Use re-ranking
- Improve prompt robustness
- Reduce overly broad retrieval
- Strengthen chunk quality controls
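Quick Reference Table
Metric | Evaluates | What it measures | Ideal score
Faithfulness | Generator | Whether answer claims are supported by the retrieved context | 1.0
Answer Relevancy | Generator | Whether the answer addresses the question | 1.0
Context Precision | Retriever | Whether the most relevant chunks are ranked highest | 1.0
Context Recall | Retriever | Whether all required information was retrieved | 1.0
Entity Recall | Retriever | Whether key entities from the reference appear in the context | 1.0
Noise Sensitivity | Full system | Whether irrelevant context causes wrong claims | 0.0 (lower is better)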
How to Interpret These Metrics Together
One of the biggest mistakes in RAG evaluation is looking at only one metric.
A system may have:
- high context recall but low faithfulness,
- high answer relevancy but poor context precision,
- or strong retrieval but high noise sensitivity.
Each metric tells only part of the story.
A practical grouping
Generator-focused metrics
- Faithfulness
- Answer Relevancy
Retriever-focused metrics
- Context Precision
- Context Recall
- Entity Recall
System robustness metric
- Noise Sensitivity
This grouping helps you quickly identify whether the issue is in:
- generation,
- retrieval,
- or end-to-end robustness.
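As a rough illustration, the grouping can be turned into a small triage helper that maps a set of scores to the part of the pipeline worth inspecting first. Everything here is an assumption made for the sketch: the function name `diagnose`, the snake_case metric keys, and the thresholds (0.7 for "low", 0.3 for "too noisy") are hypothetical and should be tuned to your own data.

```python
GENERATOR_METRICS = ("faithfulness", "answer_relevancy")
RETRIEVER_METRICS = ("context_precision", "context_recall", "entity_recall")


def diagnose(scores: dict[str, float], low: float = 0.7, noisy: float = 0.3) -> list[str]:
    """Return the pipeline stages whose metrics fall outside the thresholds."""
    suspects = []
    if any(scores[m] < low for m in GENERATOR_METRICS):
        suspects.append("generation")
    if any(scores[m] < low for m in RETRIEVER_METRICS):
        suspects.append("retrieval")
    # Noise sensitivity is the only metric where a higher value is worse.
    if scores["noise_sensitivity"] > noisy:
        suspects.append("robustness")
    return suspects


# Hypothetical scores: retrieval looks fine, but the answer is poorly grounded.
example_scores = {
    "faithfulness": 0.40,
    "answer_relevancy": 0.90,
    "context_precision": 0.85,
    "context_recall": 0.90,
    "entity_recall": 0.80,
    "noise_sensitivity": 0.10,
}
print(diagnose(example_scores))
```

With these scores the helper points only at generation, matching the "high context recall but low faithfulness" pattern mentioned above.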
Debugging Guide: What to Do When Scores Are Low
Example: Evaluating a Simple RAG Sample
Let us take a simple example.
Question
What is the Eiffel Tower and where is it located?
Generated Response
“The Eiffel Tower is a famous iron lattice tower located in Paris, France. It was built in 1889.”
Reference Answer
“The Eiffel Tower is a wrought-iron lattice tower in Paris, France. It was constructed from 1887 to 1889.”
Retrieved Contexts
- The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France.
- The tower was constructed from 1887 to 1889 as the centerpiece of the 1889 World's Fair.
- The Eiffel Tower is named after Gustave Eiffel, whose company designed and built the tower.
- Paris is known for its café culture and fashion industry.
In this example:
- Faithfulness should be reasonably high because the answer is supported by context
- Answer Relevancy should be high because the response addresses the question directly
- Context Precision depends on whether the most relevant chunks are ranked at the top
- Context Recall checks whether all facts needed for the reference are present
- Entity Recall checks whether key entities such as Eiffel Tower and Paris are present
- Noise Sensitivity measures whether the unrelated Paris sentence causes confusion
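To make two of these checks concrete, the sample can be written as a plain Python dictionary and scored by hand. This is a sketch with hand-labeled inputs, not output from RAGAS: the dictionary keys, the entity list, and the relevance labels (the first two chunks treated as relevant to the reference, the last two as not) are assumptions made for illustration.

```python
sample = {
    "question": "What is the Eiffel Tower and where is it located?",
    "response": "The Eiffel Tower is a famous iron lattice tower located in "
                "Paris, France. It was built in 1889.",
    "reference": "The Eiffel Tower is a wrought-iron lattice tower in Paris, "
                 "France. It was constructed from 1887 to 1889.",
    "contexts": [
        "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France.",
        "The tower was constructed from 1887 to 1889 as the centerpiece of the 1889 World's Fair.",
        "The Eiffel Tower is named after Gustave Eiffel, whose company designed and built the tower.",
        "Paris is known for its café culture and fashion industry.",
    ],
}

# Entity recall: hand-picked reference entities, matched by substring.
reference_entities = ["Eiffel Tower", "Paris", "France", "1887", "1889"]
all_context = " ".join(sample["contexts"])
entity_recall = sum(e in all_context for e in reference_entities) / len(reference_entities)

# Context precision: hand-labeled relevance per chunk, in retrieval order.
relevance = [1, 1, 0, 0]
hits, precision_sum = 0, 0.0
for rank, is_relevant in enumerate(relevance, start=1):
    if is_relevant:
        hits += 1
        precision_sum += hits / rank
context_precision = precision_sum / hits if hits else 0.0

print(f"entity recall = {entity_recall:.2f}")
print(f"context precision = {context_precision:.2f}")
```

Under these labels both values come out at 1.0: every reference entity appears somewhere in the retrieved chunks, and the two relevant chunks are ranked ahead of the others. Moving the café sentence to the top of the list would lower context precision without changing entity recall.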
This example works well for demonstration because it mixes useful context with a small amount of noise (the final chunk about Paris).
Key Insights
1. Faithfulness and Answer Relevancy evaluate the Generator
These tell you whether the LLM is producing grounded and question-focused answers.
2. Context Precision, Context Recall, and Entity Recall evaluate the Retriever
These help assess retrieval quality, completeness, and entity coverage.
3. Noise Sensitivity evaluates robustness
This tells you how well the full system handles irrelevant context.
4. Intermediate evaluation is critical
Looking only at final answers is not enough. You need to inspect retrieval quality and grounding behavior separately.
5. The best evaluation comes from combining metrics
No single metric fully explains RAG quality. The real value comes from using these signals together.
Final Thoughts
If you are building enterprise-grade RAG systems, these metrics can help you move from guesswork to measurable quality improvement. They not only show whether a RAG pipeline is working, but also pinpoint whether the problem lies in retrieval, ranking, generation, or robustness.
That makes RAGAS not just an evaluation tool, but a debugging framework for building better GenAI applications.
Code implementing all of these metrics: https://github.com/LeelaPrasadG/rag_langchain/blob/main/12_RAGAS_Metrics_Deep_Dive.ipynb
Reference: https://medium.com/@danushidk507/evaluation-with-ragas-873a574b86a9