A RAG implementation can be orchestrated with LangChain using LCEL (LangChain Expression Language).
In LCEL, the steps of a LangChain sequence are chained together with the pipe operator (|).
Before jumping into the RAG + FAISS + LangChain example, here is another example that illustrates LangChain usage by invoking one chain inside another.
pip install langchain langchain-openai
export OPENAI_API_KEY="your-api-key"
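The idea of piping steps together, and of reusing one chain as a step inside another, can be sketched without any API key. What follows is a toy stand-in in plain Python, not LangChain code: the `Step` class, `fake_llm`, and the other names are hypothetical, and the sketch only mimics how overloading `|` can build a sequence.

```python
class Step:
    """Toy stand-in for a chainable unit (hypothetical, not a LangChain class)."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        # Run this step on a single input.
        return self.func(value)

    def __or__(self, other):
        # `a | b` builds a new step that feeds a's output into b.
        return Step(lambda value: other.invoke(self.invoke(value)))


# Inner chain: normalise a topic string, then build a prompt from it.
clean = Step(lambda topic: topic.strip().lower())
to_prompt = Step(lambda topic: f"Write one sentence about {topic}.")
inner_chain = clean | to_prompt

# Outer chain: reuse the inner chain as its first step, then call a fake model.
fake_llm = Step(lambda prompt: f"[model reply to: {prompt}]")
outer_chain = inner_chain | fake_llm

result = outer_chain.invoke("  Vector Databases ")
print(result)
```

Because `inner_chain` is itself a `Step`, it can be dropped into `outer_chain` like any other step, which is the "chain inside another chain" pattern in miniature.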
RAG + FAISS + LangChain
This article walks through a real-world pipeline that starts with PDF documents and ends with accurate, context-aware LLM responses.
The code implementation of RAG + FAISS + LangChain is available at https://github.com/LeelaPrasadG/rag_langchain/blob/main/simple_rag_langchain.ipynb
PDF → Text → Chunks → Embeddings → Vector DB
↓
User Query → Embedding → Retrieval → LLM → Answer
Why Chunking Matters
LLMs and embedding models have context size constraints. Sending entire documents:
- Increases cost
- Reduces retrieval precision
- Causes irrelevant context leakage
Chunking solves this by splitting documents into semantically meaningful pieces.
Chunk Size
- Chunk Size: Number of characters per chunk
  - Example: 1024 characters
  - Roughly 200–250 tokens (somewhat fewer words) for typical English text
- Chunk Overlap:
  - Recommended: 10–15%
  - Preserves context between adjacent chunks
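To make chunk size and overlap concrete, here is a minimal sketch of a fixed-window splitter. It is a simplification written for this explanation, not the splitter used in the linked notebook; the function name `chunk_text` and the 128-character overlap (12.5% of 1024) are illustrative assumptions.

```python
def chunk_text(text, chunk_size=1024, overlap=128):
    """Split text into fixed-size character windows that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# 2500 characters of dummy text, so the window positions are easy to check.
sample = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(sample, chunk_size=1024, overlap=128)
print([len(c) for c in chunks])

# The tail of one chunk is repeated at the head of the next one.
print(chunks[0][-128:] == chunks[1][:128])
```

The repeated tail is what lets a sentence that straddles a chunk boundary still appear whole in at least one chunk.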
A commonly used strategy is RecursiveCharacterTextSplitter, which:
- Attempts paragraph-level splits first
- Falls back to line-, word-, and finally character-level splits (with the default separators)
- Helps maintain semantic continuity
This approach balances context preservation with retrieval accuracy.
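The "try coarse separators first, then fall back" idea can be sketched in a few lines. This is a simplified illustration under my own assumptions (the function name `recursive_split`, the separator tuple, and no overlap handling), not the actual RecursiveCharacterTextSplitter implementation.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ")):
    """Split text into pieces of at most chunk_size characters,
    trying the coarsest separator first."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # Nothing left to split on: fall back to a hard character-level cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate          # keep merging small parts together
            continue
        if current:
            chunks.append(current)       # flush what has been merged so far
        if len(part) <= chunk_size:
            current = part
        else:
            # This part alone is too big: retry it with finer separators.
            chunks.extend(recursive_split(part, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


text = "First paragraph line one\nFirst paragraph line two\n\nSecond paragraph"
pieces = recursive_split(text, chunk_size=30)
for piece in pieces:
    print(repr(piece))
```

The first paragraph is too long for one chunk, so it is split again at the line break, while the short second paragraph stays intact.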
3️⃣ Creating Embeddings from Chunks
Once chunks are created, each chunk is passed to an Embedding Model.
What Are Embeddings?
Embeddings are numeric vector representations of text that capture semantic meaning.
Embedding Models
- OpenAIEmbeddings()
  - API-based
  - Usage is billed
- Other providers may include open-source or managed alternatives
Once embeddings are generated:
- They can be stored locally or in a Vector DB
- Re-embedding is NOT required every time
- This avoids repeated API costs and latency
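One way to picture "embed once, reuse later" is a small on-disk cache keyed by a hash of each chunk. The sketch below is an assumption-laden illustration: `fake_embed` is a hypothetical stand-in for a billed embedding API (it just derives numbers from a hash), and the JSON file layout is invented for this example.

```python
import hashlib
import json
import os
import tempfile

calls = {"count": 0}  # counts how often the "API" is actually hit


def fake_embed(text):
    """Hypothetical stand-in for a billed embedding API call."""
    calls["count"] += 1
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255 for byte in digest[:8]]  # 8 deterministic pseudo-features


def chunk_key(chunk):
    # Identical chunk text always maps to the same cache key.
    return hashlib.sha256(chunk.encode("utf-8")).hexdigest()


def embed_with_cache(chunks, cache_path):
    """Embed chunks, reusing vectors already saved in the JSON cache file."""
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path, "r", encoding="utf-8") as fh:
            cache = json.load(fh)
    for chunk in chunks:
        if chunk_key(chunk) not in cache:
            cache[chunk_key(chunk)] = fake_embed(chunk)  # only new chunks are embedded
    with open(cache_path, "w", encoding="utf-8") as fh:
        json.dump(cache, fh)
    return [cache[chunk_key(chunk)] for chunk in chunks]


cache_file = os.path.join(tempfile.mkdtemp(), "embeddings.json")
docs = ["chunk one", "chunk two"]
first = embed_with_cache(docs, cache_file)
second = embed_with_cache(docs, cache_file)  # served entirely from the cache
print(calls["count"])
```

The second call returns the same vectors without touching `fake_embed`, which is where the saving in cost and latency would come from with a real provider.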
4️⃣ Storing Embeddings in a Vector Database
Vector Databases store:
- Embeddings (vectors)
- Metadata (document ID, page number, source)
They are optimized for fast similarity search, not traditional SQL queries.
Once stored, embeddings can be:
- Loaded from the DB
- Reused across multiple sessions
- Shared across applications
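As a rough mental model, a vector store is a collection of (vector, metadata) records that can be persisted and reloaded. The sketch below is a toy built on that assumption: `TinyVectorStore`, `Record`, and the `report.pdf` metadata are hypothetical, and a real vector database such as FAISS adds the indexing that makes similarity search fast.

```python
import json
import os
import tempfile
from dataclasses import dataclass, asdict


@dataclass
class Record:
    vector: list      # the embedding
    metadata: dict    # e.g. document ID, page number, source


class TinyVectorStore:
    """Hypothetical in-memory store; real vector DBs add indexing for fast search."""

    def __init__(self):
        self.records = []

    def add(self, vector, metadata):
        self.records.append(Record(vector, metadata))

    def save(self, path):
        # Persist every record as plain JSON.
        with open(path, "w", encoding="utf-8") as fh:
            json.dump([asdict(r) for r in self.records], fh)

    @classmethod
    def load(cls, path):
        # Rebuild a store from a file written by save().
        store = cls()
        with open(path, "r", encoding="utf-8") as fh:
            for item in json.load(fh):
                store.add(item["vector"], item["metadata"])
        return store


store = TinyVectorStore()
store.add([0.1, 0.9], {"doc_id": "report.pdf", "page": 3})
store.add([0.8, 0.2], {"doc_id": "report.pdf", "page": 7})

path = os.path.join(tempfile.mkdtemp(), "store.json")
store.save(path)

# A later session (or another application) can reload the same vectors.
reloaded = TinyVectorStore.load(path)
print(len(reloaded.records), reloaded.records[1].metadata)
```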
5️⃣ Retrieval Using Semantic Similarity
- User submits a question
- Question → Embedding model
- Query embedding compared with stored embeddings
- Similarity calculated using Cosine Similarity
- Top-K chunks retrieved
Why Cosine Similarity?
- Measures semantic closeness
- Works well in high-dimensional spaces
- Scale-invariant (magnitude doesn’t distort meaning)
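The retrieval steps and the scale-invariance property can be sketched with plain Python. The three-dimensional vectors below are made up for readability (real embeddings have hundreds or thousands of dimensions), and `top_k` is a brute-force scan written for this example rather than how FAISS searches internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, stored, k=2):
    """Return the k (score, chunk) pairs most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), chunk) for chunk, vec in stored]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Made-up 3-dimensional vectors standing in for real embeddings.
stored = [
    ("FAISS stores vectors", [0.9, 0.1, 0.0]),
    ("Chunk overlap keeps context", [0.1, 0.9, 0.1]),
    ("Vector search finds neighbours", [0.8, 0.3, 0.1]),
]
query = [1.0, 0.2, 0.0]

results = top_k(query, stored, k=2)
for score, chunk in results:
    print(f"{score:.3f}  {chunk}")

# Scale invariance: stretching the query does not change the score.
scaled = [10 * x for x in query]
same = math.isclose(
    cosine_similarity(query, stored[0][1]),
    cosine_similarity(scaled, stored[0][1]),
)
print(same)
```

Only the direction of a vector matters here, so a query vector that is ten times longer still ranks the chunks identically.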
6️⃣ Passing Context to the LLM
The LLM receives three inputs:
- User Query
- Retrieved Context (Top-K Chunks)
- Prompt Instructions
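A simple way to see how these three inputs come together is to assemble them into one prompt string. The template wording and the `build_prompt` helper below are illustrative assumptions, not the prompt used in the linked notebook.

```python
def build_prompt(query, context_chunks, instructions):
    """Combine instructions, retrieved chunks, and the user query into one prompt."""
    # Number the chunks so the model (and the reader) can tell them apart.
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(context_chunks, start=1)
    )
    return (
        f"{instructions}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )


prompt = build_prompt(
    query="What does chunk overlap do?",
    context_chunks=[
        "Chunk overlap preserves context between adjacent chunks.",
        "A 10-15% overlap is a common recommendation.",
    ],
    instructions="Answer using only the context below. If the answer is not there, say so.",
)
print(prompt)
```

Putting the instructions first and the question last is one common layout; the order is a design choice worth experimenting with.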
7️⃣ invoke vs Batch Calls in LLMs
invoke()
- Used for single queries
- Common in:
  - Interactive apps
  - Chat interfaces
  - Development and testing
Batch Calls
- Used in production-scale systems
- Ideal for:
  - 10,000+ requests
  - Offline processing
  - Cost and memory optimization
- Benefits:
  - Lower memory footprint
  - Better throughput
  - Reduced overhead per request
📌 Rule of Thumb
- invoke → real-time, user-facing
- batch → high-volume, backend processing
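The difference in call shape can be sketched with a toy pair of functions. LangChain runnables do expose `invoke` and `batch` methods, but the code below does not use them: the two functions are hypothetical stand-ins, and running the batch over a thread pool is an assumption made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def invoke(question):
    """Hypothetical single call; a real chain would contact the LLM here."""
    return f"answer to: {question}"


def batch(questions, max_concurrency=4):
    """Run invoke() over many inputs in parallel, keeping the input order."""
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        return list(pool.map(invoke, questions))


# Real-time, user-facing: one question at a time.
print(invoke("What is FAISS?"))

# High-volume, backend processing: many questions in one call.
questions = [f"question {i}" for i in range(10)]
answers = batch(questions)
print(len(answers), answers[0])
```

Capping concurrency (here with `max_concurrency`) is one way to stay within provider rate limits when the input list grows large.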