Saturday, 16 May 2026

Building a ReAct Agent with LangGraph & LangSmith

In this post, I walk through building a ReAct (Reasoning + Acting) agent using LangGraph and Groq's openai/gpt-oss-120b model, where the LLM dynamically decides when to call a Wikipedia tool to answer factual questions. 


The agent is wired as a stateful graph — looping between reasoning and tool use until it has a confident answer. To trace every LLM call, tool invocation, and token usage in real time, LangSmith is integrated by passing a LangChainTracer directly via config={"callbacks": [tracer]} — a more reliable approach than relying on environment variables in Jupyter notebooks. 


Check out the full code and setup on GitHub.


Sourcehttps://www.youtube.com/watch?v=3h4Wc19LhL8

Sunday, 3 May 2026

Mastering LLM Specialization: Fine-Tuning, LoRA, and QLoRA

In the rapidly evolving landscape of Artificial Intelligence, a base model is often just the starting point. To transform a general-purpose LLM into a domain expert, we look toward Fine-Tuning.


What is Fine-Tuning?

Fine-tuning is the process of training a pre-existing LLM to achieve specialized capabilities. While a base model has broad knowledge, fine-tuning allows you to tailor its tone, behavior, and specific knowledge to meet niche domain needs.

Key Benefits:

  • Nuanced Communication: Effectively handles emotions, cultural slangs, and specific dialects.

  • Brand Alignment: A company chatbot can be fine-tuned to represent a specific brand voice (e.g., showing high empathy or professional rigor).

  • Specialized Knowledge: Enhances the model’s deep understanding of proprietary or industry-specific data.

Example: Using Llama-3.2-1B as a base model and fine-tuning it specifically to follow commands results in the Llama-3.2-1B-Instruct model.


RAG vs. Fine-Tuning

Most enterprises choose between—or combine—Retrieval-Augmented Generation (RAG) and Fine-Tuning.

FeatureRAGFine-Tuning
ProsEasy to update knowledge; handles dynamic data; lower cost.Consistent style; faster runtime; tailored for specific tasks.
ConsSlower (requires retrieval step); depends on retrieval quality.Expensive and time-consuming to retrain.

The Industry Standard: Many enterprises now utilize a hybrid approach, leveraging RAG for dynamic data access and Fine-Tuning for behavioral consistency.


PEFT & LoRA: Efficiency Reimagined

PEFT (Parameter-Efficient Fine-Tuning) is a technique that keeps the original model parameters frozen and only adds/trains new, smaller layers.

LoRA (Low-Rank Adaptation) is the most prominent PEFT technique. It represents updates to the model weights mathematically as:

W' = W_{frozen} + \Delta W$

Instead of updating the entire weight matrix, LoRA learns a "low-rank" representation of the changes, drastically reducing the number of trainable parameters.


Understanding Quantization

Quantization is the process of converting high-precision numbers (like Float32) into lower-precision formats (like Int8 or 4-bit) to reduce memory and compute requirements.

The Impact on Memory (e.g., Llama 7B):

  • At 4 Bytes per parameter: $7  Billion \times 4 Bytes = 28 GB

  • At 1 Byte per parameter: $7 Billion \times 1 Byte = 7 GB

This efficiency is vital for edge devices with memory constraints, such as drones or mobile hardware.


QLoRA: The Best of Both Worlds

QLoRA (Quantized Low-Rank Adaptation) combines quantization with LoRA to save memory without sacrificing intelligence. It relies on three major innovations:

  1. 4-bit NormalFloat (NF4): Compresses weights to 4-bit using a mathematical distribution optimized for neural networks.

  2. Double Quantization: Quantizes the quantization constants themselves, saving an additional 0.5 bits per parameter.

  3. Paged Optimizers: Uses CPU RAM as a buffer during GPU memory spikes to prevent training crashes.


FeatureLoRAQLoRA
Model Precision16-bit (Half-precision)4-bit (Compressed)
Memory (7B Model)~28 GB VRAM~12–16 GB VRAM
HardwareA100 or high-end WorkstationConsumer GPUs (RTX 3090/4090)
Accuracy95-100% of full fine-tuning~90-95% (Very close)
Best ForMaximum accuracy on enterprise GPUsBudget-conscious scaling or 70B models

Sunday, 19 April 2026

Deploy Production-Ready AI Agent with AWS Bedrock AgentCore & LangGraph


Here is a LangGraph-powered RAG agent with persistent short- and long-term memory, deployed as a containerized runtime on AWS Bedrock AgentCore, that answers user queries by retrieving context from a FAISS vector store and personalizing responses using conversation history across sessions.

Please follow the code and ReadMe for the Implementation 

https://github.com/LeelaPrasadG/Bedrock_lang_RAG_agentcore

Important Points:

  1. From local script to cloud agent in minutes — With agentcore configure + agentcore launch, a LangGraph agent running on your laptop becomes a fully managed, auto-scaled runtime on AWS — no infrastructure code needed.
  2. Two deployment paths, one codebase — The same 01_agentcore_runtime.py can be deployed as a direct code deploy (fast prototyping) or a Docker container (production). One flag switches between them: --deployment-type container.
  3. Memory that actually persists — AgentCoreMemorySaver gives short-term per-thread history; AgentCoreMemoryStore gives long-term semantic memory across sessions. Together they let the agent remember what a user said three conversations ago — something vanilla LLMs can't do.
  4. MicroVM isolation per session — Every unique runtimeSessionId spins up a fresh MicroVM on AWS. This means tenant isolation and clean state without any extra work from the developer.
  5. @app.entrypoint is the only contract — Bedrock AgentCore only cares about one decorator. Everything else — HTTP server, routing, container lifecycle — is handled by the runtime. Your business logic stays clean.
  6. Invoke from anywhere via boto3 — Once deployed, any Python app can call your agent using client.invoke_agent_runtime() — no special SDK, no API Gateway setup, just standard AWS credentials.
  7. Built-in observability out of the box — The auto-generated Dockerfile installs aws-opentelemetry-distro and wraps the entrypoint with opentelemetry-instrument. Traces flow to AWS X-Ray automatically.
  8. FAISS + OpenAI Embeddings for RAG — The FAQ knowledge base uses FAISS for local vector search, keeping retrieval fast and cost-free at runtime. Only the final LLM call incurs API cost.

Referencehttps://www.youtube.com/watch?v=cTBGIKAckKE&t=2193s


Sunday, 29 March 2026

Building Enterprise RAG: .NET, Semantic Kernel, and Weaviate


As LLMs evolve toward GPT-5.4, the challenge for .NET developers isn't just "connecting to an AI"—it's building a reliable, secure, and scalable Retrieval-Augmented Generation (RAG) pipeline.

In this post, I’ll walk through a demo implementation using Semantic Kernel to orchestrate the flow, Weaviate as the vector memory, and a critical layer of Data Sanitization.

The Architecture

A production-ready RAG application consists of three main stages:

  1. Sanitization: Cleaning raw data to remove noise and protect PII.

  2. Ingestion: Embedding the clean data and storing it in Weaviate.

  3. Orchestration: Using Semantic Kernel to retrieve context and generate answers via GPT-5.4.


1. The Gateway: Data Sanitization

Before data ever touches a vector database, it must be "sanitized." This prevents "Garbage In, Garbage Out" and ensures compliance.

Why Sanitize?

  • Lower Costs: Removing HTML/boilerplate reduces token usage.

  • Better Accuracy: Cleaner text leads to higher-quality embeddings.

  • Security: Redacting PII ensures sensitive data isn't leaked to the LLM.

C#
// Simple Sanitization Utility
public string Sanitize(string rawText) {
    var clean = Regex.Replace(rawText, "<.*?>", string.Empty); // Remove HTML
    clean = Regex.Replace(clean, @"\s+", " "); // Normalize whitespace
    return clean.Trim();
}

2. The Memory: Weaviate + Semantic Kernel

Weaviate provides a highly scalable vector store that integrates seamlessly with .NET via the Semantic Kernel connectors. By using AddWeaviateVectorStore, we can perform sub-second semantic searches across millions of documents.

C#
var kernel = Kernel.CreateBuilder()
    .AddOpenAIChatCompletion("gpt-5.4", apiKey)
    .AddWeaviateVectorStore(endpoint, apiKey)
    .Build();

3. The RAG Flow in Action

The magic happens when Semantic Kernel acts as the "brain," retrieving only the most relevant snippets from Weaviate to ground the GPT-5.4 response in your private data.

Key Benefits of this Stack:

  • Type Safety: Leverage C#’s strong typing for AI plugins.

  • Dependency Injection: Seamlessly integrate AI services into your existing ASP.NET Core apps.

  • Performance: Weaviate’s efficiency paired with the native speed of .NET 8/9.


Source Code: https://github.com/LeelaPrasadG/RAG_Weaviate_SemanticKernel_CSharp

Sunday, 8 March 2026

A Practical Guide to Faithfulness, Relevancy, Retrieval Quality, and Robustness


Retrieval-Augmented Generation (RAG) systems are only as strong as their ability to retrieve the right context and generate answers grounded in that context. While building a RAG pipeline is one challenge, evaluating whether it is actually performing well is another.

This is where RAGAS(Retrieval Augmented Generation Assessment) becomes extremely useful.

RAGAS provides a set of evaluation metrics that help measure both the retriever quality and the generator quality in a RAG pipeline. Instead of relying on vague judgment, these metrics give you a structured way to identify exactly where your system is failing and what needs improvement.

In this article, we will walk through the key RAGAS metrics, what each one measures, how to interpret them, and how to debug low scores.


Why RAG Evaluation Matters

In a RAG system, a poor answer may happen because:

  • the retriever fetched irrelevant chunks,

  • the retriever missed critical information,

  • the LLM hallucinated,

  • the answer did not directly address the question,

  • or the model got distracted by noisy context.

Without proper metrics, it is difficult to know which part of the pipeline is causing the issue.

RAGAS helps break this problem into measurable components.


The 6 Core RAGAS Metrics

1. Faithfulness

Faithfulness checks whether the generated answer stays grounded in the retrieved context.
Its primary purpose is to detect hallucinations — situations where the LLM introduces facts that are not supported by the source material.

Simple way to think about it

Imagine a journalist writing an article based only on interview notes. If the journalist adds details that were never mentioned in the notes, those details are unfaithful to the source. The same logic applies here.

What it measures

  • Whether answer claims are supported by retrieved context

  • Whether the model is inventing unsupported details

Ideal score

1.0

Formula

Supported claims / Total claims

If the score is low

A low faithfulness score usually means the model is hallucinating.

What to improve

  • Strengthen the prompt to enforce context grounding

  • Reduce temperature

  • Use a stronger LLM

  • Limit generation freedom when strict factuality is required


2. Answer Relevancy

Answer Relevancy measures whether the generated answer actually addresses the user’s question.

An answer may be factually correct and still be irrelevant.

Example

If someone asks, “What is the capital of France?” and the answer is “The Eiffel Tower is beautiful,” the statement may be true, but it does not answer the question.

How RAGAS approaches this

RAGAS uses a smart reverse-engineering method:

  1. It generates hypothetical questions from the answer

  2. It compares those generated questions with the original question using embeddings

  3. It computes similarity to determine how relevant the answer is

What it measures

  • Whether the answer is on-topic

  • Whether the answer addresses the intent of the question

Ideal score

1.0

If the score is low

The answer is likely drifting away from the actual question.

What to improve

  • Review your prompt template

  • Ensure question intent is clearly preserved

  • Validate whether the system properly handles different query types such as what, why, when, and who


3. Context Precision

Context Precision measures whether the most relevant retrieved chunks are ranked higher in the retrieval results.

This is not just about retrieving relevant content. It is about retrieving it in the right order.

Simple way to think about it

Imagine a librarian gives you five books to answer a question. Context Precision asks whether the most useful book was placed at the top of the stack.

What it measures

  • Ranking quality of retrieved chunks

  • Whether top-ranked chunks are the most useful

How it works

  • Each retrieved chunk is checked for relevance

  • Precision is calculated at different ranks

  • More relevant chunks near the top produce a higher score

Ideal score

1.0

If the score is low

It usually means irrelevant chunks are being ranked too high.

What to improve

  • Improve your embedding model

  • Add a re-ranker

  • Tune similarity thresholds

  • Improve chunk quality and chunk boundaries


4. Context Recall

Context Recall measures whether the retriever fetched all the necessary information required to answer the question.

This metric focuses on retrieval completeness.

Simple way to think about it

Suppose you are studying from a textbook for an exam. Context Recall asks whether you covered all the chapters needed to answer the exam questions, or whether you missed some important sections.

What it measures

  • Whether all required supporting information was retrieved

  • Whether the retriever missed important facts

How it works

  1. Break the reference answer into claims

  2. Check whether each claim is supported by the retrieved context

  3. Compute: claims found / total claims

Ideal score

1.0

If the score is low

The retriever is likely missing important information.

What to improve

  • Increase top-k retrieval

  • Improve chunking strategy

  • Use better embeddings

  • Expand retrieval scope

  • Consider hybrid retrieval if semantic search alone is insufficient


5. Entity Recall

Entity Recall checks whether the retrieved context contains the important entities mentioned in the reference answer.

These entities may include:

  • people

  • organizations

  • places

  • dates

  • products

  • events

Simple way to think about it

If the correct answer mentions Einstein, 1905, and Princeton, did your retrieved documents include those entities?

What it measures

  • Whether key named entities were successfully retrieved

  • Whether important contextual anchors are present in the retrieval output

Ideal score

1.0

If the score is low

Important entities are not being surfaced by retrieval.

What to improve

  • Use entity-aware chunking

  • Add keyword-based retrieval alongside semantic retrieval

  • Improve metadata filtering

  • Increase retrieval coverage for entity-heavy questions


6. Noise Sensitivity

Noise Sensitivity measures how much irrelevant information in the retrieved context causes the model to make mistakes.

This metric reflects the robustness of the overall RAG system.

Simple way to think about it

Imagine taking an open-book exam, but someone mixed random Wikipedia pages into your notes. Noise Sensitivity measures how often those irrelevant pages confuse you into writing wrong answers.

Important note

Unlike the other metrics, lower is better here.

  • 0.0 = excellent robustness

  • 1.0 = poor robustness

What it measures

  • Whether irrelevant chunks confuse the model

  • Whether noisy context causes wrong claims in the answer

Ideal score

0.0

If the score is high

The model is overly influenced by irrelevant context.

What to improve

  • Filter noisy chunks before generation

  • Use re-ranking

  • Improve prompt robustness

  • Reduce overly broad retrieval

  • Strengthen chunk quality controls


    Quick Reference Table

 

How to Interpret These Metrics Together

One of the biggest mistakes in RAG evaluation is looking at only one metric.

A system may have:

  • high context recall but low faithfulness,

  • high answer relevancy but poor context precision,

  • or strong retrieval but high noise sensitivity.

Each metric tells only part of the story.

A practical grouping

Generator-focused metrics

  • Faithfulness

  • Answer Relevancy

Retriever-focused metrics

  • Context Precision

  • Context Recall

  • Entity Recall

System robustness metric

  • Noise Sensitivity

This grouping helps you quickly identify whether the issue is in:

  • generation,

  • retrieval,

  • or end-to-end robustness.

   Debugging Guide: What to Do When Scores Are Low

     


Example: Evaluating a Simple RAG Sample

Let us take a simple example.

Question

What is the Eiffel Tower and where is it located?

Generated Response

“The Eiffel Tower is a famous iron lattice tower located in Paris, France. It was built in 1889.”

Reference Answer

“The Eiffel Tower is a wrought-iron lattice tower in Paris, France. It was constructed from 1887 to 1889.”

Retrieved Contexts

  1. The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France.

  2. The tower was constructed from 1887 to 1889 as the centerpiece of the 1889 World's Fair.

  3. The Eiffel Tower is named after Gustave Eiffel, whose company designed and built the tower.

  4. Paris is known for its café culture and fashion industry.

In this example:

  • Faithfulness should be reasonably high because the answer is supported by context

  • Answer Relevancy should be high because the response addresses the question directly

  • Context Precision depends on whether the most relevant chunks are ranked at the top

  • Context Recall checks whether all facts needed for the reference are present

  • Entity Recall checks whether key entities such as Eiffel Tower and Paris are present

  • Noise Sensitivity measures whether the unrelated Paris sentence causes confusion

This is a great example because it includes both useful context and a small amount of noise, making it ideal for demonstration.


Key Insights

1. Faithfulness and Answer Relevancy evaluate the Generator

These tell you whether the LLM is producing grounded and question-focused answers.

2. Context Precision, Context Recall, and Entity Recall evaluate the Retriever

These help assess retrieval quality, completeness, and entity coverage.

3. Noise Sensitivity evaluates robustness

This tells you how well the full system handles irrelevant context.

4. Intermediate evaluation is critical

Looking only at final answers is not enough. You need to inspect retrieval quality and grounding behavior separately.

5. The best evaluation comes from combining metrics

No single metric fully explains RAG quality. The real value comes from using these signals together.


Final Thoughts:

If you are building enterprise-grade RAG systems, these metrics can help you move from guesswork to measurable quality improvement. They not only show whether a RAG pipeline is working, but also pinpoint whether the problem lies in retrieval, ranking, generation, or robustness.

That makes RAGAS not just an evaluation tool, but a debugging framework for building better GenAI applications.


Git code for the Implementation of all these Metriceshttps://github.com/LeelaPrasadG/rag_langchain/blob/main/12_RAGAS_Metrics_Deep_Dive.ipynb


Referencehttps://medium.com/@danushidk507/evaluation-with-ragas-873a574b86a9

Building a ReAct Agent with LangGraph & LangSmith

In this post, I walk through building a ReAct (Reasoning + Acting) agent using LangGraph and Groq's openai/gpt-oss-120b model, where the...