Tuesday, 30 December 2025

A Simple RAG Example Using LangChain

A RAG implementation can be orchestrated in LangChain using LCEL (LangChain Expression Language).

In LCEL, the steps of a sequence are chained together with the pipe (|) operator.


Before jumping into the RAG + FAISS + LangChain example, here is a smaller example that shows how LangChain invokes one chain inside another.


pip install langchain langchain-openai

export OPENAI_API_KEY="your-api-key"

Python Code
This code shows a chain that takes a topic, generates a paragraph, and then summarizes it:
import os
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# 1. Initialize the model
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0.7)

# 2. Define the first prompt template (generate content)
prompt_generator = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant that generates a short paragraph on a given topic."),
    ("human", "{topic}")
])

# 3. Define the second prompt template (summarize content)
prompt_summarizer = ChatPromptTemplate.from_messages([
    ("system", "Summarize the following content concisely in one sentence."),
    ("human", "{content}")
])

# 4. Define the output parser
output_parser = StrOutputParser()

# 5. Create the chains using the pipe (|) operator
# First chain: generate content
generate_chain = prompt_generator | llm | output_parser

# Second chain: summarize the output of the first chain
full_chain = {"content": generate_chain} | prompt_summarizer | llm | output_parser

# 6. Invoke the chain
response = full_chain.invoke({"topic": "The importance of the internet"})

print(response)

RAG + FAISS + LangChain

This article walks through a real-world pipeline that starts with PDF documents and ends with accurate, context-aware LLM responses.


The code for this RAG + FAISS + LangChain implementation is available at https://github.com/LeelaPrasadG/rag_langchain/blob/main/simple_rag_langchain.ipynb


PDF → Text → Chunks → Embeddings → Vector DB
                                       ↓
          User Query → Embedding → Retrieval → LLM → Answer


1️⃣ Reading Data from PDFs

The first step is extracting raw text from PDFs. Once extracted, the raw text is not immediately suitable for LLMs due to context length limits. This is where chunking becomes critical.

2️⃣ Text Chunking Strategy

Why Chunking Matters

LLMs and embedding models have context size constraints. Sending entire documents:

  • Increases cost

  • Reduces retrieval precision

  • Causes irrelevant context leakage

Chunking solves this by splitting documents into semantically meaningful pieces.


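Chunk Size and Overlap

  • Chunk Size: Number of characters per chunk

    • Example: 1024 characters

    • Roughly 250 tokens (about 150–200 English words), assuming about 4 characters per token

  • Chunk Overlap: Number of characters shared by adjacent chunks

    • Recommended: 10–15% of the chunk size

    • Preserves context between adjacent chunks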
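The size and overlap settings above can be illustrated with a minimal sliding-window chunker. This is a sketch in plain Python, not the splitter used in the notebook; the function name `chunk_text` and the 128-character overlap (12.5% of 1024) are assumptions chosen for illustration.

```python
def chunk_text(text, chunk_size=1024, overlap=128):
    """Split text into fixed-size character windows that overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(2500))  # 2500 characters of dummy text
chunks = chunk_text(sample, chunk_size=1024, overlap=128)
print([len(c) for c in chunks])
# The tail of one chunk is repeated at the head of the next one
print(chunks[0][-128:] == chunks[1][:128])
```

Each window starts 896 characters after the previous one, so the last 128 characters of a chunk reappear at the start of the next.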


RecursiveCharacterTextSplitter

A commonly used strategy is RecursiveCharacterTextSplitter, which:

  • Attempts paragraph-level splits first

  • Falls back to sentence or character-level splits

  • Maintains semantic continuity

This approach balances context preservation with retrieval accuracy.
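The "coarse separators first, finer separators as a fallback" idea can be sketched as follows. This is a simplified illustration written for this post, not LangChain's implementation of RecursiveCharacterTextSplitter; it ignores overlap, and the function name `recursive_split` and the separator list are assumptions.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", ". ", " ", "")):
    """Split on the coarsest separator first; recurse with finer ones for oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard split by character count
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces into one chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # This piece alone is too big: try the next, finer separator
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


text = "Para one sentence A. Sentence B.\n\nPara two is short.\n\n" + "word " * 30
pieces = recursive_split(text, chunk_size=40)
for p in pieces:
    print(len(p), repr(p))
```

The two short paragraphs survive as whole chunks, while the long run of words is only broken at spaces once paragraph and sentence separators fail to produce small enough pieces.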


3️⃣ Creating Embeddings from Chunks

Once chunks are created, each chunk is passed to an Embedding Model.

What Are Embeddings?

Embeddings are numeric vector representations of text that capture semantic meaning.


Embedding Models

  • OpenAIEmbeddings()

    • API-based

    • Usage is billed

  • Other providers may include open-source or managed alternatives

Once embeddings are generated:

  • They can be stored locally or in a Vector DB

  • Re-embedding is NOT required every time

  • This avoids repeated API costs and latency
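A minimal sketch of why re-embedding is not needed: cache each vector under a hash of the chunk text, so that a repeated chunk never triggers a second call. The class `EmbeddingCache` and the function `toy_embed` are hypothetical stand-ins; `toy_embed` only imitates a billed embedding API.

```python
import hashlib


class EmbeddingCache:
    """Remember embeddings by content hash so each chunk is embedded once."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn
        self._store = {}
        self.misses = 0  # number of real (billable) embedding calls

    def embed(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]


def toy_embed(text):
    # Hypothetical stand-in for an API call: [length, number of spaces]
    return [float(len(text)), float(text.count(" "))]


cache = EmbeddingCache(toy_embed)
for chunk in ["first chunk", "second chunk", "first chunk"]:
    cache.embed(chunk)
print(cache.misses)  # only two distinct chunks were actually embedded
```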

4️⃣ Storing Embeddings in a Vector Database

Vector Databases store:

  • Embeddings (vectors)

  • Metadata (document ID, page number, source)

They are optimized for fast similarity search, not traditional SQL queries.

Once stored, embeddings can be:

  • Loaded from the DB

  • Reused across multiple sessions

  • Shared across applications
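The "store once, load in a later session" idea can be sketched with plain JSON files. This is only an illustration of vectors stored together with metadata; a real vector store such as FAISS has its own persistence format, and the helper names `save_index` and `load_index` as well as the sample records are assumptions.

```python
import json
import os
import tempfile


def save_index(records, path):
    """Write vectors and their metadata to disk."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f)


def load_index(path):
    """Read the stored vectors back, e.g. in a later session."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


records = [
    {"vector": [0.1, 0.9], "metadata": {"source": "guide.pdf", "page": 1}},
    {"vector": [0.7, 0.2], "metadata": {"source": "guide.pdf", "page": 2}},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "index.json")
    save_index(records, path)
    reloaded = load_index(path)  # no embedding model is called here

print(reloaded[1]["metadata"])
```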

5️⃣ Retrieval Using Semantic Similarity

Query Flow
  1. User submits a question

  2. Question → Embedding model

  3. Query embedding compared with stored embeddings

  4. Similarity calculated, commonly using Cosine Similarity (the exact metric depends on how the vector store is configured)

  5. Top-K chunks retrieved

retrieved_docs = retriever.invoke(test_query)

Why Cosine Similarity?

  • Measures semantic closeness

  • Works well in high-dimensional spaces

  • Scale-invariant (magnitude doesn’t distort meaning)
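A small sketch of cosine similarity in plain Python, showing the scale-invariance point above. The vectors are made-up values, not real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


query = [1.0, 2.0, 3.0]
doc = [2.0, 4.0, 6.5]
scaled_doc = [10 * x for x in doc]  # same direction, ten times the magnitude

print(cosine_similarity(query, doc))
print(cosine_similarity(query, scaled_doc))  # same score up to rounding: only direction matters
```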


6️⃣ Passing Context to the LLM

The LLM receives three inputs:

  1. User Query

  2. Retrieved Context (Top-K Chunks)

  3. Prompt Instructions

Context + Query + Prompt → LLM → Final Answer


7️⃣ invoke vs Batch Calls in LLMs

invoke()

  • Used for single queries

  • Common in:

    • Interactive apps

    • Chat interfaces

    • Development and testing

Batch Calls

  • Used in production-scale systems

  • Ideal for:

    • 10,000+ requests

    • Offline processing

    • Cost and memory optimization

Potential benefits (these depend on the provider and the workload):

  • Lower memory footprint

  • Better throughput

  • Reduced overhead per request

📌 Rule of Thumb

  • invoke → real-time, user-facing

  • batch → high-volume, backend processing
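The difference in calling pattern can be sketched with the standard library alone. LangChain runnables expose both `invoke` and `batch`; the code below only mimics that pattern with a thread pool and is not LangChain's implementation. `fake_llm` and `max_concurrency` are hypothetical names used for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_llm(prompt):
    """Hypothetical stand-in for a model call."""
    return f"answer to: {prompt}"


def invoke(prompt):
    # One input, one output: suits an interactive request
    return fake_llm(prompt)


def batch(prompts, max_concurrency=4):
    # Many inputs processed concurrently; map() keeps the input order
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        return list(pool.map(fake_llm, prompts))


single = invoke("What is RAG?")
many = batch(["q1", "q2", "q3"])
print(single)
print(many)
```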


RunnablePassthrough:

This comes into the picture when a chain is created. Below is the snippet from the code at https://github.com/LeelaPrasadG/rag_langchain/blob/main/simple_rag_langchain.ipynb

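# RunnablePassthrough forwards its input unchanged
from langchain_core.runnables import RunnablePassthrough

# Define the prompt template for the RAG system
# This tells the LLM how to use the retrieved context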
system_prompt = (
    "You are a helpful assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer the question. "
    "If you don't know the answer based on the context, say that you don't know. "
    "Keep the answer concise and accurate.\n\n"
    "Context: {context}\n\n"
    "Question: {question}"
)

# Create the prompt template
prompt = ChatPromptTemplate.from_template(system_prompt)

# Build the RAG chain using LangChain 1.0+ LCEL (LangChain Expression Language)
# This uses the pipe operator (|) to chain components together

rag_chain = (
    {
        "context": retriever,               # Retrieve the top-k documents for the question
        "question": RunnablePassthrough()   # Pass the question through unchanged
    }
    | prompt           # Format with prompt template
    | llm              # Generate answer with LLM
    | StrOutputParser() # Parse output to string
)

The dict at the start of the chain receives the user's question and hands it to both entries. The retriever under "context" needs the question to fetch the top-k documents from the vector DB.
RunnablePassthrough() under "question" forwards the same question unchanged, so that it reaches the prompt along with the retrieved documents.
Important point: look at this line in the code:
prompt = ChatPromptTemplate.from_template(system_prompt)

system_prompt contains two placeholders, {context} and {question}. The retriever returns the top-k documents from the vector DB, and these replace {context}.
The | prompt step in the chain only substitutes the {context} and {question} values in system_prompt.
The filled-in prompt is then passed to the LLM to generate the response.
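What the leading dict does can be emulated in a few lines of plain Python: the same input is handed to every entry, and the results are collected under the entry's key before the template is filled. This is a sketch of the behaviour, not LangChain code; `fake_retriever`, `passthrough`, and `run_parallel` are hypothetical names.

```python
def fake_retriever(question):
    """Hypothetical retriever: returns canned chunks instead of querying a vector DB."""
    return ["FAISS is a similarity search library.", "It indexes dense vectors."]


def passthrough(value):
    """Return the input unchanged, like RunnablePassthrough."""
    return value


def run_parallel(steps, value):
    """Call every step with the same input and collect the results by key."""
    return {key: step(value) for key, step in steps.items()}


TEMPLATE = "Context: {context}\n\nQuestion: {question}"

inputs = run_parallel(
    {
        "context": lambda q: "\n".join(fake_retriever(q)),  # needs the question
        "question": passthrough,                             # forwards the question
    },
    "What is FAISS?",
)
prompt_text = TEMPLATE.format(**inputs)  # the "| prompt" step: fill the placeholders
print(prompt_text)
```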

The output parser in this example is StrOutputParser(), the most basic output parser in LangChain.
It takes the raw output from the language model and returns it as a plain string, without changing it or trying to structure it.

Other output parsers can be chosen depending on the need. A few of them are described at https://medium.com/data-and-beyond/output-parsers-in-langchain-b2e0db20880f

LangChain overview

LangChain is an open source framework with a pre-built agent architecture and integrations for any model or tool, so you can build agents that adapt as fast as the ecosystem evolves.

LangChain is the easiest way to start building agents and applications powered by LLMs. With under 10 lines of code, you can connect to OpenAI, Anthropic, Google, and more. LangChain provides a pre-built agent architecture and model integrations to help you get started quickly and seamlessly incorporate LLMs into your agents and applications.

A LangChain sequence is chained with the pipe (|) operator; this syntax is called LCEL (LangChain Expression Language).


RunnableLambda


A LangChain runnable is a protocol that allows you to create and invoke custom chains. It's designed to sequence tasks, taking the output of one call and feeding it as input to the next, making it suitable for straightforward, linear tasks where each step directly builds upon the previous one. RunnableLambda wraps a plain Python function so that it can take part in such a chain.

Example:

# Simple LCEL Example: String Transformation Chain
from langchain_core.runnables import RunnableLambda

# Create simple transformation functions
def uppercase(text: str) -> str:
    """Convert text to uppercase"""
    print(f"  Step 1: uppercase → {text.upper()}")
    return text.upper()

def add_prefix(text: str) -> str:
    """Add a prefix to text"""
    result = f"RESULT: {text}"
    print(f"  Step 2: add_prefix → {result}")
    return result

def add_emoji(text: str) -> str:
    """Add emoji to text"""
    result = f"✅ {text}"
    print(f"  Step 3: add_emoji → {result}")
    return result

# Create runnables (components that can be chained)
uppercase_runnable = RunnableLambda(uppercase)
prefix_runnable = RunnableLambda(add_prefix)
emoji_runnable = RunnableLambda(add_emoji)

# Build the chain using LCEL
chain = uppercase_runnable | prefix_runnable | emoji_runnable

# Execute the chain
print("Input: 'hello langchain'")
print("\nProcessing:")
result = chain.invoke("hello langchain")
print(f"\nFinal Output: {result}")


Input: 'hello langchain'

Processing:
  Step 1: uppercase → HELLO LANGCHAIN
  Step 2: add_prefix → RESULT: HELLO LANGCHAIN
  Step 3: add_emoji → ✅ RESULT: HELLO LANGCHAIN

Final Output: ✅ RESULT: HELLO LANGCHAIN

Explanation: When chain.invoke("hello langchain") is called, the input text "hello langchain" is first passed to uppercase_runnable, which calls the function uppercase.
The output of that function is passed to the next runnable in the chain, prefix_runnable, and its output is in turn passed to emoji_runnable. In this way, RunnableLambda
helps cascade the output of one function call into the next call in the sequence.
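To see why the pipe operator can build a chain at all, here is a minimal emulation in plain Python using `__or__`. It is a sketch of the idea only, not how LangChain implements runnables; the class name `Step` is hypothetical.

```python
class Step:
    """Tiny stand-in for a runnable: wraps a function and supports the | operator."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # a | b builds a new Step that runs a, then feeds its output to b
        return Step(lambda value: other.invoke(self.invoke(value)))


chain = Step(str.upper) | Step(lambda t: f"RESULT: {t}") | Step(lambda t: t + "!")
result = chain.invoke("hello langchain")
print(result)
```

Python evaluates `a | b | c` left to right, so the first two steps are combined into one Step, which is then combined with the third.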

Thursday, 25 December 2025

Retrieval-Augmented Generation (RAG): A Practical Architecture Guide with Diagrams

Large Language Models (LLMs) are powerful—but they come with a fundamental limitation: they only know what they were trained on. Retrieval-Augmented Generation (RAG) solves this by combining external knowledge retrieval with LLM-based generation, enabling responses that are grounded, up-to-date, and domain-specific.

This article walks through a simple yet complete RAG architecture, explaining each stage clearly with diagrams and real-world intuition.


🔷 High-Level RAG Architecture


At a high level, RAG consists of five major stages:

  1. Data Ingestion

  2. Vector Databases

  3. Retrieval

  4. Augmentation

  5. Generation


1️⃣ Data Ingestion: From Raw Data to Chunks

Supported Data Types

RAG systems can ingest data from multiple modalities:

  • Text: PDF, HTML, JSON, CSV, XML, DOCX, PPT

  • Images: JPG, PNG, GIF

  • Audio

  • Video

Ingestion Flow

  1. Documents are crawled or loaded

  2. Content is cleaned and normalized

  3. Documents are split into chunks

  4. Each chunk is sent to an Embedding Model

  5. Resulting embeddings are stored in a Vector Database
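The ingestion flow above can be sketched end to end in plain Python. Everything here is a toy: `toy_embed` counts vowels instead of calling an embedding model, the "vector database" is a list, and all function names are assumptions made for this example.

```python
import re


def clean(text):
    """Normalize whitespace (step 2)."""
    return re.sub(r"\s+", " ", text).strip()


def split_into_chunks(text, size):
    """Fixed-size character chunks (step 3)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def toy_embed(chunk):
    """Hypothetical embedding: vowel counts instead of a real model (step 4)."""
    return [chunk.lower().count(v) for v in "aeiou"]


def ingest(documents, chunk_size=40):
    """Load -> clean -> chunk -> embed -> store (step 5)."""
    index = []  # stands in for the vector database
    for source, raw in documents.items():
        for n, chunk in enumerate(split_into_chunks(clean(raw), chunk_size)):
            index.append({
                "vector": toy_embed(chunk),
                "text": chunk,
                "metadata": {"source": source, "chunk": n},
            })
    return index


docs = {"notes.txt": "RAG   combines retrieval\nwith generation.  " * 3}
index = ingest(docs)
for record in index:
    print(record["metadata"], record["vector"])
```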

📌 Why chunking?
LLMs have context length limits. Chunking ensures:

  • Better semantic matching

  • Faster retrieval

  • Higher answer accuracy


2️⃣ Vector Databases & Embeddings

What is a Vector?

A vector is a numeric representation of data in high-dimensional space.

Example:

"My name is Leela" → [0.8, 0.9, 0.5, 0.3]

Text, images, or audio are transformed into numbers that capture semantic meaning.


How Text Becomes Numbers

Traditional and modern techniques include:

  • One-Hot Encoding

  • Bag of Words

  • Co-Occurrence Matrix

  • TF-IDF

  • Word2Vec

  • Transformer-based Embeddings (modern standard)

An Embedding Model converts text into dense vectors.
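The first few techniques in the list can be shown on a three-sentence corpus. This is a minimal sketch; the TF-IDF function uses one common unsmoothed variant (term count times log(N / document frequency)), and libraries often apply different smoothing or normalization.

```python
import math
from collections import Counter

corpus = ["the cat sat", "the dog sat", "the cat ran"]
vocab = sorted({word for doc in corpus for word in doc.split()})
# vocab is ['cat', 'dog', 'ran', 'sat', 'the']


def one_hot(word):
    """A 1 in the word's vocabulary position, 0 elsewhere."""
    return [1 if word == v else 0 for v in vocab]


def bag_of_words(doc):
    """Word counts per vocabulary entry; word order is lost."""
    counts = Counter(doc.split())
    return [counts[v] for v in vocab]


def tf_idf(doc):
    """Down-weight words that appear in many documents."""
    counts = Counter(doc.split())
    vector = []
    for v in vocab:
        doc_freq = sum(1 for d in corpus if v in d.split())
        vector.append(counts[v] * math.log(len(corpus) / doc_freq))
    return vector


print(one_hot("dog"))
print(bag_of_words("the cat sat"))
print(tf_idf("the cat sat"))  # "the" occurs in every document, so its weight is 0
```

These vectors are sparse and grow with the vocabulary, which is the contrast with the dense vectors produced by an embedding model.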


Popular Embedding Models

Open Source

  • Hugging Face models

  • Nomic

Managed / Paid

  • Gemini Embedding Model

  • OpenAI text-embedding-3-small

  • Amazon Titan Embeddings


Vector Databases

Open Source

  • FAISS

  • ChromaDB

Managed / Paid

  • Qdrant (also available as open source)

  • Zilliz

These databases store embeddings and allow fast semantic similarity search.


3️⃣ Retrieval: Finding the Right Context

Dense Passage Retrieval (DPR)

DPR uses Bi-Encoders:

  • Query Encoder → converts the user question into an embedding

  • Passage Encoder → converts document chunks into embeddings, which are precomputed and stored in the vector DB

Both embeddings are compared using similarity metrics.


Similarity Techniques

  • Cosine Similarity (most common)

  • Dot Product

  • Euclidean Distance
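The three measures can be compared on the same pair of vectors. The sketch below uses made-up 2-D vectors and also checks a standard identity: for unit-length vectors, the dot product equals the cosine similarity, and the squared Euclidean distance equals 2 − 2·cosine, so all three produce the same ranking.

```python
import math


def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    return dot_product(a, b) / (math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b)))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot_product(v, v))
    return [x / norm for x in v]


a, b = [3.0, 4.0], [4.0, 3.0]
ua, ub = normalize(a), normalize(b)

print(dot_product(a, b), euclidean_distance(a, b), cosine_similarity(a, b))
# After normalization the dot product matches the cosine similarity
print(dot_product(ua, ub), euclidean_distance(ua, ub) ** 2)
```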


Retrieval Example

Query Embedding → Q1

Stored Documents:

  • V11 → Similarity 0.4

  • V12 → Similarity 0.6

  • V13 → Similarity 0.8

V13 is the most relevant document and is retrieved.
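Selecting the best matches from these scores is a top-K selection. A minimal sketch using the three similarity values from the example; the helper name `top_k` is an assumption.

```python
import heapq

# Similarity of the query Q1 to each stored document, as in the example
scores = {"V11": 0.4, "V12": 0.6, "V13": 0.8}


def top_k(scores, k):
    """Return the k (document, score) pairs with the highest similarity."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


print(top_k(scores, 1))
print(top_k(scores, 2))
```

With k=1 only V13 is returned; with k=2 the retriever would also pass V12 to the LLM as additional context.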


Extractive vs Abstractive Retrieval

  • Extractive: Returns exact text from source

  • Abstractive: Rephrases or summarizes while preserving meaning


Parametric vs Non-Parametric Memory

Memory Type      Description
Parametric       Knowledge stored inside model weights (e.g., BART, GPT pre-training)
Non-Parametric   External knowledge (Vector DB, documents, APIs)

📌 RAG combines both:

  • Parametric → language understanding

  • Non-parametric → factual grounding


4️⃣ Augmentation: Building the Final Prompt

Augmentation combines:

  1. User Query

  2. Retrieved Documents (Context)

  3. Prompt Instructions

Example prompt structure:

System Prompt: You are a domain expert…
Context: Retrieved documents
User Question: ...

This step controls:

  • Tone

  • Structure

  • Output format

  • Level of detail
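A minimal sketch of the augmentation step as plain string assembly. The function name `build_prompt`, the numbering of context passages, and the sample instructions are assumptions for illustration; real systems usually rely on a prompt template class instead.

```python
def build_prompt(question, documents, instructions):
    """Combine instructions, retrieved context, and the user question into one prompt."""
    context = "\n\n".join(
        f"[{i}] {doc}" for i, doc in enumerate(documents, start=1)
    )
    return f"{instructions}\n\nContext:\n{context}\n\nUser Question: {question}"


prompt = build_prompt(
    question="What does chunk overlap do?",
    documents=[
        "Chunk overlap repeats text between adjacent chunks.",
        "Overlap preserves context across chunk boundaries.",
    ],
    # The instructions are where tone, structure, format, and detail are set
    instructions="Answer in two sentences, using only the context.",
)
print(prompt)
```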


5️⃣ Generation: LLM Produces the Answer

The LLM generates text using three inputs:

  1. Query

  2. Retrieved Context

  3. Prompt Instructions

The output is:

  • Grounded in enterprise data

  • More accurate

  • Less prone to hallucination


🔁 End-to-End Retrieval Flow

  1. User submits a query

  2. Query → embedding model

  3. Vector DB performs similarity search

  4. Top-K documents retrieved

  5. Optional re-ranking

  6. Context + query + prompt → LLM

  7. Final response generated
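The seven steps can be traced in a toy pipeline. This is a sketch under heavy assumptions: `embed` is a word-count vector over a tiny vocabulary rather than a real embedding model, `rerank` is a simple word-overlap heuristic, and `fake_llm` only reports the prompt size instead of calling a model.

```python
import math
from collections import Counter

DOCS = [
    "FAISS is a library for vector similarity search",
    "Chunk overlap preserves context between chunks",
    "Cosine similarity compares the angle between vectors",
]
VOCAB = sorted({w for d in DOCS for w in d.lower().split()})


def embed(text):
    """Toy embedding: word counts over VOCAB (step 2)."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in VOCAB]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = math.sqrt(sum(x * x for x in a)), math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


INDEX = [(embed(d), d) for d in DOCS]  # stands in for the vector DB


def retrieve(query, k=2):
    """Similarity search and top-K selection (steps 3 and 4)."""
    q = embed(query)
    ranked = sorted(INDEX, key=lambda item: cosine(q, item[0]), reverse=True)
    return [text for _, text in ranked[:k]]


def rerank(query, docs):
    """Optional second pass (step 5): order by number of shared words."""
    q_words = set(query.lower().split())
    return sorted(docs, key=lambda d: len(q_words & set(d.lower().split())), reverse=True)


def fake_llm(prompt):
    """Hypothetical model call (step 7)."""
    return f"(model output for a {len(prompt)}-character prompt)"


def answer(query, k=2):
    docs = rerank(query, retrieve(query, k))
    prompt = ("Answer using only the context.\n\nContext:\n"
              + "\n".join(docs) + f"\n\nQuestion: {query}")  # step 6
    return docs, fake_llm(prompt)


docs, response = answer("what is cosine similarity")
print(docs)
print(response)
```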


🧠 Types of LLM Usage in RAG

1. Pre-Training (From Scratch)

  • Build a foundation model

  • Massive compute and data required

  • Key papers: COG, TIGER


2. Fine-Tuning

Base Model + Domain Data = Specialized Model
  • Key papers: Self-RAG, FLARE


3. Inference-Time RAG (Most Common)

Pretrained LLM + Vector DB + Retriever
  • Key papers: CRAG, Iter-RETGEN


🚀 Why RAG Matters

  • Keeps LLMs up-to-date

  • Enables enterprise & domain-specific AI

  • Reduces hallucinations

  • Avoids costly full model retraining

  • Improves explainability & trust


🧩 Final Thought

RAG is not just an architecture—it’s a bridge between static language models and living enterprise knowledge. As data grows and domains evolve, RAG enables AI systems to stay accurate, scalable, and context-aware.

If you’re building production-grade GenAI systems, RAG is no longer optional—it’s foundational.

Building a ReAct Agent with LangGraph & LangSmith

In this post, I walk through building a ReAct (Reasoning + Acting) agent using LangGraph and Groq's openai/gpt-oss-120b model, where the...