Artificial Intelligence by Leela Prasad: A simple RAG Example using Langchain

A RAG implementation can easily be orchestrated in LangChain using LCEL (LangChain Expression Language).

In LCEL, the steps of a chain are connected with the pipe operator (|), so each step's output becomes the next step's input.
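To make the pipe idea concrete, here is a minimal plain-Python sketch of how `|` can compose steps. It is not LangChain's implementation: the `Step` class and the stub functions are hypothetical names used only for illustration, and no model is called.

```python
class Step:
    """Toy stand-in for an LCEL runnable: wraps a function and supports `|`."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        # Run this step on a single input
        return self.func(value)

    def __or__(self, other):
        # `a | b` builds a new Step that runs a, then feeds its output to b
        return Step(lambda value: other.invoke(self.invoke(value)))


# Hypothetical stand-ins for a prompt template, a model, and an output parser
to_prompt = Step(lambda topic: f"Write a short paragraph about {topic}.")
fake_llm = Step(lambda prompt: f"  [model output for: {prompt}]  ")
to_clean_text = Step(str.strip)

chain = to_prompt | fake_llm | to_clean_text
print(chain.invoke("the internet"))
```

The same left-to-right reading applies to real LCEL chains: whatever the left side returns is what the right side receives.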

Before jumping into the RAG + FAISS + LangChain example, here is a smaller example that shows how LangChain invokes one chain inside another.

pip install langchain langchain-openai

export OPENAI_API_KEY="your-api-key"

Python Code

This code shows a chain that takes a topic, generates a paragraph, and then summarizes it:

import os
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# 1. Initialize the model
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0.7)

# 2. Define the first prompt template (generate content)
prompt_generator = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant that generates a short paragraph on a given topic."),
    ("human", "{topic}")
])

# 3. Define the second prompt template (summarize content)
prompt_summarizer = ChatPromptTemplate.from_messages([
    ("system", "Summarize the following content concisely in one sentence."),
    ("human", "{content}")
])

# 4. Define the output parser
output_parser = StrOutputParser()

# 5. Create the chains using the pipe (|) operator
# First chain: generate content
generate_chain = prompt_generator | llm | output_parser

# Second chain: summarize the output of the first chain
full_chain = {"content": generate_chain} | prompt_summarizer | llm | output_parser

# 6. Invoke the chain
response = full_chain.invoke({"topic": "The importance of the internet"})

print(response)

RAG + FAISS + Langchain

This article walks through a real-world pipeline that starts with PDF documents and ends with accurate, context-aware LLM responses.

The code for the RAG + FAISS + LangChain implementation is available at https://github.com/LeelaPrasadG/rag_langchain/blob/main/simple_rag_langchain.ipynb

PDF → Text → Chunks → Embeddings → Vector DB

↓

User Query → Embedding → Retrieval → LLM → Answer

1️⃣ Reading Data from PDFs

The first step is extracting raw text from PDFs. Once extracted, the raw text is not immediately suitable for LLMs due to context length limits. This is where chunking becomes critical.

2️⃣ Text Chunking Strategy

Why Chunking Matters

LLMs and embedding models have context size constraints. Sending entire documents:

Increases cost
Reduces retrieval precision
Causes irrelevant context leakage

Chunking solves this by splitting documents into semantically meaningful pieces.

Chunk Size

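Chunk Size: Number of characters per chunk
- Example: 1024 characters
- Roughly 200–250 tokens for English text (the word count is somewhat lower)

Chunk Overlap: Number of characters repeated between adjacent chunks
- Recommended: 10–15% of the chunk size
- Preserves context between adjacent chunks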
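The effect of chunk size and overlap can be sketched with a simple sliding window over characters. This is a simplified illustration, not the splitter used in the notebook; `chunk_text` is a hypothetical helper, and the 128-character overlap is an assumed value (12.5% of 1024).

```python
def chunk_text(text, chunk_size=1024, chunk_overlap=128):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(2500))  # 2500 characters of dummy text
chunks = chunk_text(sample)
print([len(c) for c in chunks])
print(chunks[0][-128:] == chunks[1][:128])  # adjacent chunks share the overlap
```

The last 128 characters of each chunk reappear at the start of the next one, which is what keeps a sentence that straddles a boundary retrievable from either side.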

RecursiveCharacterTextSplitter:

A commonly used strategy is RecursiveCharacterTextSplitter, which:

Attempts paragraph-level splits first
Falls back to line, word, and finally character-level splits
Keeps related text together as far as the chunk size allows

This approach balances context preservation with retrieval accuracy.
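The fallback behaviour can be sketched in plain Python. This is a simplified illustration of the recursive idea only: it ignores chunk overlap and differs in detail from LangChain's RecursiveCharacterTextSplitter. The function name `recursive_split` and the separator order are assumptions made for this sketch.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most chunk_size characters,
    trying coarse separators (paragraphs) before finer ones (words, characters)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small parts into one chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # This part alone is too long: retry with the next, finer separator
            chunks.extend(recursive_split(part, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


text = (
    "Para one is short.\n\n"
    "Para two is a bit longer than the first one.\n\n"
    "Three."
)
for piece in recursive_split(text, chunk_size=30):
    print(repr(piece))
```

Short paragraphs survive intact, and only the paragraph that exceeds the limit is broken down further, at word boundaries rather than mid-word.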

3️⃣ Creating Embeddings from Chunks

Once chunks are created, each chunk is passed to an Embedding Model.

What Are Embeddings?

Embeddings are numeric vector representations of text that capture semantic meaning.

Embedding Models

OpenAIEmbeddings()
- API-based
- Usage is billed
Other providers may include open-source or managed alternatives

Once embeddings are generated:

They can be stored locally or in a Vector DB
Re-embedding is NOT required every time
This avoids repeated API costs and latency
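The "embed once, reuse later" idea can be sketched with a small on-disk cache. Everything here is hypothetical: `fake_embed` stands in for a billed embedding API call, and the JSON file stands in for a vector store saved to disk (LangChain's FAISS wrapper offers `save_local` and `load_local` for the real thing).

```python
import hashlib
import json
import os
import tempfile


def fake_embed(text):
    """Hypothetical stand-in for a billed embedding API call."""
    fake_embed.calls += 1
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [b / 255 for b in digest[:4]]  # tiny 4-dimensional dummy vector


fake_embed.calls = 0  # counts how often the "API" was hit


def embed_with_cache(chunks, cache_path):
    """Return one vector per chunk, calling fake_embed only for unseen chunks."""
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path, "r", encoding="utf-8") as f:
            cache = json.load(f)
    for chunk in chunks:
        if chunk not in cache:
            cache[chunk] = fake_embed(chunk)
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump(cache, f)
    return [cache[chunk] for chunk in chunks]


cache_file = os.path.join(tempfile.mkdtemp(), "embeddings.json")
docs = ["RAG combines retrieval with generation.", "FAISS stores vectors."]

first = embed_with_cache(docs, cache_file)   # embeds both chunks
second = embed_with_cache(docs, cache_file)  # served entirely from the cache
print(fake_embed.calls)
```

The second call reads the vectors back from disk, so the embedding function is not invoked again for chunks that were already processed.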

4️⃣ Storing Embeddings in a Vector Database

Vector Databases store:

Embeddings (vectors)
Metadata (document ID, page number, source)

They are optimized for fast similarity search, not traditional SQL queries.

Once stored, embeddings can be:

Loaded from the DB
Reused across multiple sessions
Shared across applications

5️⃣ Retrieval Using Semantic Similarity

Query Flow

User submits a question
Question → Embedding model
Query embedding compared with stored embeddings
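Similarity calculated using Cosine Similarity (LangChain's FAISS store defaults to Euclidean (L2) distance, which gives the same ranking when the embeddings are unit-length)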
Top-K chunks retrieved

retrieved_docs = retriever.invoke(test_query)
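What the retriever does conceptually can be sketched as a top-K search over stored vectors and their metadata. This is a toy in-memory illustration with hand-written 3-dimensional vectors, not FAISS; the `store` contents, file names, and the `retrieve` helper are all made up for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Each entry keeps the vector, the chunk text, and metadata about its origin
store = [
    {"vector": [0.9, 0.1, 0.0], "text": "FAISS is a similarity search library.",
     "metadata": {"source": "a.pdf", "page": 1}},
    {"vector": [0.1, 0.9, 0.0], "text": "Chunk overlap preserves context.",
     "metadata": {"source": "a.pdf", "page": 2}},
    {"vector": [0.0, 0.2, 0.9], "text": "Batch calls suit offline jobs.",
     "metadata": {"source": "b.pdf", "page": 5}},
]


def retrieve(query_vector, k=2):
    """Return the k stored entries most similar to the query vector."""
    ranked = sorted(store, key=lambda e: cosine(query_vector, e["vector"]), reverse=True)
    return ranked[:k]


query_vector = [0.8, 0.3, 0.0]  # pretend this came from the embedding model
for entry in retrieve(query_vector):
    print(entry["metadata"], entry["text"])
```

A real vector database replaces the full sort with an index built for fast nearest-neighbour search, but the inputs and outputs have the same shape.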

Why Cosine Similarity?

Measures semantic closeness
Works well in high-dimensional spaces
Scale-invariant (magnitude doesn’t distort meaning)
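The scale-invariance point can be written out as a short derivation. The symbols q (query embedding), d (chunk embedding), and c (a positive scaling factor) are notation introduced here for illustration.

```latex
\cos(\mathbf{q}, \mathbf{d})
  = \frac{\mathbf{q} \cdot \mathbf{d}}{\lVert \mathbf{q} \rVert \, \lVert \mathbf{d} \rVert},
\qquad
\cos(c\,\mathbf{q}, \mathbf{d})
  = \frac{c\,(\mathbf{q} \cdot \mathbf{d})}{c\,\lVert \mathbf{q} \rVert \, \lVert \mathbf{d} \rVert}
  = \cos(\mathbf{q}, \mathbf{d})
  \quad (c > 0)

% For unit-length vectors, squared Euclidean distance is a decreasing function of cosine similarity:
\lVert \mathbf{q} - \mathbf{d} \rVert^{2}
  = \lVert \mathbf{q} \rVert^{2} + \lVert \mathbf{d} \rVert^{2} - 2\,\mathbf{q} \cdot \mathbf{d}
  = 2 - 2\cos(\mathbf{q}, \mathbf{d})
  \quad \text{when } \lVert \mathbf{q} \rVert = \lVert \mathbf{d} \rVert = 1
```

The second identity is why ranking by smallest L2 distance and ranking by largest cosine similarity return the same top-K chunks when the embeddings are normalized.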

6️⃣ Passing Context to the LLM

The LLM receives three inputs:

User Query
Retrieved Context (Top-K Chunks)
Prompt Instructions

Context + Query + Prompt Instructions → LLM → Final Answer
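How the three inputs end up in a single prompt can be sketched with ordinary string formatting. This is an illustration only: `PROMPT_TEMPLATE`, `build_prompt`, and the sample chunks are assumptions, and joining chunks with blank lines is one common choice rather than a requirement.

```python
PROMPT_TEMPLATE = (
    "Use the following pieces of retrieved context to answer the question. "
    "If you don't know the answer based on the context, say that you don't know.\n\n"
    "Context: {context}\n\n"
    "Question: {question}"
)


def build_prompt(question, chunks):
    """Combine prompt instructions, retrieved chunks, and the user query."""
    context = "\n\n".join(chunks)  # one blank line between retrieved chunks
    return PROMPT_TEMPLATE.format(context=context, question=question)


retrieved_chunks = [
    "FAISS is a library for similarity search over dense vectors.",
    "LangChain can wrap a FAISS index as a retriever.",
]
print(build_prompt("What is FAISS?", retrieved_chunks))
```

The resulting string is what the model actually sees, so inspecting it is a quick way to debug poor answers.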

7️⃣ invoke vs Batch Calls in LLMs

`invoke()`

Used for single queries
Common in:
- Interactive apps
- Chat interfaces
- Development and testing

Batch Calls

Used in production-scale systems
Ideal for:
- 10,000+ requests
- Offline processing
- Cost and memory optimization

Benefits:

Lower memory footprint
Better throughput
Reduced overhead per request

📌 Rule of Thumb

invoke → real-time, user-facing
batch → high-volume, backend processing
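The difference between the two calling styles can be sketched with the standard library. In LangChain, runnables expose `invoke()` and `batch()`, and the default `batch()` runs inputs concurrently with an optional `max_concurrency` setting; the functions below are hypothetical stand-ins that imitate that shape without calling any model.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_llm_call(question):
    """Hypothetical stand-in for one network round trip to an LLM."""
    return f"answer to: {question}"


def invoke(question):
    """Single request: the caller waits for exactly one answer."""
    return fake_llm_call(question)


def batch(questions, max_concurrency=4):
    """Many requests: run up to max_concurrency calls at once, keep input order."""
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        return list(pool.map(fake_llm_call, questions))


print(invoke("What is RAG?"))

questions = [f"question {i}" for i in range(10)]
answers = batch(questions, max_concurrency=3)
print(answers[0], "...", answers[-1])
```

Capping concurrency matters in practice because provider rate limits usually apply per account, not per request.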

RunnablePassthrough:

RunnablePassthrough comes into the picture when the chain is created. Below is the relevant snippet from the code at https://github.com/LeelaPrasadG/rag_langchain/blob/main/simple_rag_langchain.ipynb

# Define the prompt template for the RAG system
# This tells the LLM how to use the retrieved context
system_prompt = (
    "You are a helpful assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer the question. "
    "If you don't know the answer based on the context, say that you don't know. "
    "Keep the answer concise and accurate.\n\n"
    "Context: {context}\n\n"
    "Question: {question}"
)

# Create the prompt template
prompt = ChatPromptTemplate.from_template(system_prompt)

# Build the RAG chain using LangChain 1.0+ LCEL (LangChain Expression Language)
# This uses the pipe operator (|) to chain components together
# `retriever` and `llm` are assumed to be defined earlier in the notebook
from langchain_core.runnables import RunnablePassthrough

rag_chain = (
    {
        "context": retriever,               # Retrieve the top-k docs for the question
        "question": RunnablePassthrough(),  # Pass the question through unchanged
    }
    | prompt             # Fill {context} and {question} in the prompt template
    | llm                # Generate answer with LLM
    | StrOutputParser()  # Parse output to string
)

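The input to the chain is the question. The "context" branch needs the question so that the retriever can fetch the top-k documents from the vector DB.

The question also needs to reach the prompt along with those top-k documents. That is the job of RunnablePassthrough(): it forwards its input unchanged.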
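The fan-out performed by the dictionary step can be sketched in plain Python: the same input goes to every branch, and the results are collected under the branch keys. The helpers `fake_retriever`, `passthrough`, and `run_parallel` are hypothetical stand-ins, not LangChain APIs.

```python
def fake_retriever(question):
    """Hypothetical retriever: returns chunks related to the question."""
    return [f"chunk about {question}", f"another chunk about {question}"]


def passthrough(value):
    """Return the input unchanged, mirroring what RunnablePassthrough does."""
    return value


def run_parallel(branches, value):
    """Send the same input to every branch and collect the results by key."""
    return {key: branch(value) for key, branch in branches.items()}


prompt_inputs = run_parallel(
    {"context": fake_retriever, "question": passthrough},
    "What is FAISS?",
)
print(prompt_inputs)
```

The resulting dictionary has exactly the two keys the prompt template expects, which is why the next step in the chain can fill both placeholders.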

Important point: look at this line of code

prompt = ChatPromptTemplate.from_template(system_prompt)

The retriever returns the top-k documents from the vector DB, and the {context} placeholder is replaced with those documents.

{question} is also part of system_prompt. The | prompt step in the chain simply substitutes the {context} and {question} values into system_prompt. The filled-in prompt is then passed to the LLM to generate the response.

The output parser in this example is StrOutputParser(). StrOutputParser is the most basic output parser in LangChain.

It takes the raw output from the language model and returns it as a plain string, without trying to structure it.
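The role of an output parser can be sketched without LangChain: the model returns a message object, and the parser decides what shape the caller receives. `FakeMessage`, `parse_str`, and `parse_comma_list` are hypothetical names used only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class FakeMessage:
    """Hypothetical stand-in for the message object a chat model returns."""
    content: str


def parse_str(message):
    """String-style parser: hand back the text as-is."""
    return message.content


def parse_comma_list(message):
    """Structured-style parser: turn 'a, b, c' into a Python list."""
    return [item.strip() for item in message.content.split(",")]


reply = FakeMessage(content="FAISS, Chroma, Pinecone")
print(parse_str(reply))
print(parse_comma_list(reply))
```

Swapping the last step of a chain is therefore enough to change the output type without touching the prompt or the model.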

The output parser can be chosen based on need. A few other output parsers are described at https://medium.com/data-and-beyond/output-parsers-in-langchain-b2e0db20880f

Reference: https://www.youtube.com/watch?v=hKiTFJy2ijU


Tuesday, 30 December 2025
