Tuesday, 13 January 2026

External Index Retrievers & Multi Retrievers

 What are External Index Retrievers?


External Index Retrievers search over external data sources (e.g., the internet, academic databases, knowledge bases)
rather than a local vector store.


Some use cases need a combination of data from a Vector DB, the latest data from the Internet, and the LLM's existing knowledge and functionality.

This can be achieved with Multi Retrievers. The context is created by combining data from the 2 sources below.

1. Data from Vector DB
2. Fetch latest data from Internet search
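
To make the idea concrete, here is a minimal plain-Python sketch of combining the two sources into one context. `search_vector_db` and `search_internet` are hypothetical stand-ins that return hard-coded strings; in a real application they would be a vector store retriever and an internet search retriever.

```python
def search_vector_db(query: str) -> list[str]:
    """Hypothetical stand-in for a vector store retriever."""
    return [
        "RAG combines retrieval with generation",
        "Vector databases store embeddings",
    ]


def search_internet(query: str) -> list[str]:
    """Hypothetical stand-in for an internet search retriever."""
    return [
        "Latest article: RAG pipelines in production",
        "RAG combines retrieval with generation",  # duplicate of a local hit
    ]


def build_context(query: str) -> str:
    """Merge results from both sources, dropping exact duplicates."""
    merged: list[str] = []
    seen: set[str] = set()
    for text in search_vector_db(query) + search_internet(query):
        if text not in seen:
            seen.add(text)
            merged.append(text)
    return "\n".join(merged)


query = "How does RAG work?"
context = build_context(query)
# The LLM then answers from this context plus its own knowledge.
prompt = f"Answer using this context:\n{context}\n\nQuestion: {query}"
print(prompt)
```

Dropping exact duplicates is the simplest merge rule; a real setup might also rank or weight the two sources.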

Here are the external retrievers available for each requirement.

Sunday, 11 January 2026

Retrieval Strategies from Vector DB

Retrievers are created from the Vector DB. The default strategy is similarity search (commonly described as cosine similarity), selected with search_type="similarity".


This is the basic retriever; there are a few others, such as "MMR" (maximal marginal relevance).

Is this the same as cosine similarity?

In most common configurations, yes, it is the same as or highly related to cosine similarity, but the exact mathematical metric depends on your Vector Database settings.
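
One reason the two are "highly related": for unit-length vectors, squared Euclidean (L2) distance and cosine similarity are tied by d² = 2 − 2·cos, so both give the same ranking. The sketch below checks this with made-up toy vectors (an assumption for illustration; real embeddings have many more dimensions and are not always normalized).

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def l2_squared(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


query_vec = normalize([1.0, 2.0, 0.5])
doc_vecs = {
    "doc_a": normalize([1.0, 1.8, 0.7]),
    "doc_b": normalize([3.0, 0.2, 0.1]),
    "doc_c": normalize([0.9, 2.1, 0.4]),
}

# Higher cosine = more similar; lower distance = more similar.
by_cosine = sorted(doc_vecs, key=lambda n: cosine(query_vec, doc_vecs[n]), reverse=True)
by_l2 = sorted(doc_vecs, key=lambda n: l2_squared(query_vec, doc_vecs[n]))

print(by_cosine)
print(by_l2)
```

If the stored vectors are not normalized, the two rankings can differ, which is why the metric configured in the Vector DB matters.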

from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_core.documents import Document

# Create sample vectorstore
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

docs = [
    Document(page_content="LangChain is a framework for LLM applications", metadata={"topic": "langchain"}),
    Document(page_content="RAG combines retrieval with generation", metadata={"topic": "rag"}),
    Document(page_content="Vector databases store embeddings", metadata={"topic": "vectors"}),
    Document(page_content="Transformers use attention mechanisms", metadata={"topic": "transformers"}),
    Document(page_content="FAISS is a similarity search library", metadata={"topic": "vectors"}),
]

vectorstore = FAISS.from_documents(docs, embeddings)
print("✅ Vector store created")

# Create retriever
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 3}  # Return top 3 results
)

# Use retriever
query = "How does RAG work?"
results = retriever.invoke(query)

print(f"Query: {query}\n")
print("Results:")
for i, doc in enumerate(results, 1):
    print(f"{i}. {doc.page_content}")
    print(f"   Topic: {doc.metadata['topic']}\n")
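
The retriever above uses search_type="similarity". For comparison, here is a plain-Python sketch of the idea behind "MMR": pick documents one at a time, rewarding relevance to the query and penalizing similarity to documents already picked. This is not LangChain's implementation; the function, the toy 2-D vectors, and the `lambda_mult` parameter name are chosen for illustration.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Return indices of k documents balancing relevance and diversity."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:

        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Penalize similarity to anything already selected.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query_vec = [1.0, 0.0]
doc_vecs = [
    [0.9, 0.1],    # 0: very relevant
    [0.9, 0.12],   # 1: near-duplicate of 0
    [0.6, -0.6],   # 2: less relevant, but different
]

print(mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5))  # [0, 2]
print(mmr(query_vec, doc_vecs, k=2, lambda_mult=1.0))  # [0, 1]
```

With `lambda_mult=1.0` the diversity penalty disappears and the result matches plain similarity ranking; lower values skip the near-duplicate.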

Wednesday, 7 January 2026

Embeddings and Vector Representation

 

Embeddings convert text into numbers (vectors) that capture meaning.

Think of it like a GPS coordinate:
- "dog" → [0.2, 0.8, 0.1, ...] (1536 numbers)
- "cat" → [0.3, 0.7, 0.2, ...] (close to "dog"!)
- "car" → [0.9, 0.1, 0.8, ...] (far from "dog")

Similar meanings = Similar vectors!


OpenAI Embeddings

from langchain_openai import OpenAIEmbeddings

# Initialize
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Create embedding for a query
query = "What is machine learning?"
query_vector = embeddings.embed_query(query)

print(f"Query: {query}")
print(f"Vector dimensions: {len(query_vector)}")
print(f"First 5 values: {query_vector[:5]}")

# Embed multiple documents
docs = [
    "Machine learning is a subset of AI",
    "Deep learning uses neural networks",
    "The weather is sunny today"
]

doc_vectors = embeddings.embed_documents(docs)
print(f"\nEmbedded {len(doc_vectors)} documents")

Query: What is machine learning?
Vector dimensions: 1536
First 5 values: [-0.002476818859577179, -0.012755980715155602, -0.006645360495895147, -0.03157883137464523, 0.028759293258190155]

Embedded 3 documents

Google Gemini Embeddings
from langchain_google_genai import GoogleGenerativeAIEmbeddings

The current top-ranked embedding models are listed on the MTEB leaderboard: https://huggingface.co/spaces/mteb/leaderboard. The entries marked Unknown are not open source.

Cosine Similarity is usually used for identifying similarity between 2 vectors.
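
Cosine similarity is the dot product of two vectors divided by the product of their lengths: cos(a, b) = (a · b) / (|a| · |b|). A small sketch using the first three numbers of the "dog", "cat", and "car" vectors from the GPS analogy above (toy values, not real embeddings):

```python
import math


def cosine_similarity(a, b):
    """cos(a, b) = (a . b) / (|a| * |b|); 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


dog = [0.2, 0.8, 0.1]
cat = [0.3, 0.7, 0.2]
car = [0.9, 0.1, 0.8]

print(f"dog vs cat: {cosine_similarity(dog, cat):.2f}")
print(f"dog vs car: {cosine_similarity(dog, car):.2f}")
```

The dog/cat score comes out much higher than the dog/car score, which is the "similar meanings = similar vectors" idea in numbers.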


Text Splitting

Why is Text Splitting Important?


Imagine you have a 200-page book and someone asks: *"What did the author say about machine learning on page 87?"*

Problem: LLMs have a limited "attention span" (context window). For example, at their original release:
- GPT-3.5-Turbo: ~4,000 tokens (~16,000 characters)
- GPT-4: ~8,000 tokens (~32,000 characters)
- You can't fit a whole book in one query!

Solution: Split the book into smaller chunks:
1. Each chunk is small enough for the LLM
2. Search finds the relevant chunks (like page 87)
3. Only send those chunks to the LLM
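
A rough back-of-the-envelope check of the numbers above. The ratio of about 4 characters per token is implied by the figures in this section (16,000 characters for 4,000 tokens); the page size and helper names are made up for illustration.

```python
import math

CHARS_PER_TOKEN = 4  # rough ratio implied by the figures above (assumption)


def estimate_tokens(text: str) -> int:
    """Very rough token estimate from character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def chunks_needed(text: str, chunk_size_chars: int) -> int:
    """How many non-overlapping chunks of this size cover the text."""
    return math.ceil(len(text) / chunk_size_chars)


# A 200-page book at roughly 2,000 characters per page (made-up figure).
book = "x" * (200 * 2_000)

print(estimate_tokens(book))          # 100000
print(estimate_tokens(book) > 4_000)  # True: far beyond a ~4,000-token window
print(chunks_needed(book, 1_000))     # 400
```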


The Challenge

If the text is split at arbitrary positions:
```
❌ BAD SPLIT:
Chunk 1: "The transformer architecture revolutionized NLP. It uses self-att"
Chunk 2: "ention mechanisms to process sequences in parallel. This allows..."
```

The word "attention" is cut in half! 😱

**Good splitters** respect boundaries (paragraphs, sentences, words):
```
✅ GOOD SPLIT:
Chunk 1: "The transformer architecture revolutionized NLP. It uses self-attention mechanisms."
Chunk 2: "Self-attention allows the model to process sequences in parallel. This improves speed..."
```
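
The difference can be reproduced in a few lines of plain Python. This is only a sketch: `split_fixed` and `split_sentences` are illustrative helpers, and the sentence regex is deliberately simple (it would mis-split abbreviations such as "e.g.").

```python
import re

text = (
    "The transformer architecture revolutionized NLP. "
    "It uses self-attention mechanisms to process sequences in parallel. "
    "This improves speed."
)


def split_fixed(text: str, size: int) -> list[str]:
    """Cut every `size` characters, ignoring word and sentence boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def split_sentences(text: str) -> list[str]:
    """Cut only after sentence-ending punctuation followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


fixed_chunks = split_fixed(text, 65)
sentence_chunks = split_sentences(text)

print(fixed_chunks[0])   # ends in "self-att": the word is cut in half
print(sentence_chunks)   # three complete sentences
```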


RecursiveCharacterTextSplitter is a good default splitter for most cases.

How it works:
1. Tries to split on double newlines (\n\n) → paragraphs
2. If chunks still too big, splits on single newlines (\n) → lines
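3. If still too big, splits on a period followed by a space (". ") → sentences (this separator is not in the default list, so it has to be passed via separators)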
4. If still too big, splits on spaces ( ) → words
5. Last resort: splits on characters

This preserves meaning as much as possible!
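
The steps above can be sketched in plain Python. This is a simplified illustration of the idea, not LangChain's implementation: it has no overlap, it drops the separator at chunk boundaries, and it assumes the separator list ends with "" as the last resort.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", ". ", " ", "")):
    """Simplified recursive splitting: try coarse separators first."""
    if len(text) <= chunk_size:
        return [text]
    sep, rest = separators[0], separators[1:]
    if sep == "":
        # Last resort: hard cut on characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too big: recurse with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, rest))
            current = ""
    if current:
        chunks.append(current)
    return chunks


text = (
    "Paragraph one is short.\n\n"
    "Paragraph two is longer. It has two sentences that will not fit together."
)
chunks = recursive_split(text, chunk_size=40)
for chunk in chunks:
    print(repr(chunk))
```

The first paragraph survives intact; only the paragraph that is too large gets broken down further, first by sentence and then by word.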

from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import TextLoader

# Path to the text file to load (hypothetical; point this at your own file)
txt_path = "sample.txt"

# Load the document
loader = TextLoader(txt_path)
documents = loader.load()

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,        # Maximum chunk size in characters
    chunk_overlap=200,      # Characters shared between consecutive chunks
    length_function=len,    # Measure length in characters
    separators=["\n\n", "\n", ". ", " ", ""]  # Try these in order
)

# Split the document
chunks = splitter.split_documents(documents)

print(f"✂️ Split into {len(chunks)} chunks\n")


Text Splitters for RAG

If you’re building a Retrieval-Augmented Generation (RAG) application, text splitting is one of the most important (and most underestimated) steps in the pipeline.

Even with a great embedding model and a strong vector database, bad chunking can ruin retrieval quality:

  • relevant info gets cut in half,

  • context becomes too long or too short,

  • citations become inaccurate,

  • LLM hallucinates because the retrieved chunks are incomplete.

This article breaks down the most useful splitter types (especially in LangChain), explains when to use which, and includes Python code with comments you can directly reuse.


Why Text Splitting Matters in RAG

RAG works like this:

  1. Load documents (PDF/HTML/CSV/JSON, etc.)

  2. Split into chunks

  3. Embed each chunk → vectors

  4. Store in vector DB

  5. Retrieve top-k chunks per query

  6. Feed retrieved context into LLM → answer
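
Here is a toy end-to-end version of these six steps in plain Python, so the flow is visible without any API keys. Everything in it is a stand-in: the bag-of-words `embed` function replaces a real embedding model, an in-memory list replaces the vector DB, and the final prompt is printed instead of being sent to an LLM.

```python
import math
import re
from collections import Counter

# 1. Load: in a real app this would come from a document loader.
raw_docs = [
    "LangChain is a framework for LLM applications. "
    "RAG combines retrieval with generation.",
    "Vector databases store embeddings. FAISS is a similarity search library.",
]

# 2. Split: one chunk per sentence (a stand-in for a real text splitter).
chunks = [s for doc in raw_docs for s in re.split(r"(?<=\.)\s+", doc)]


# 3. Embed: toy bag-of-words vector (a stand-in for an embedding model).
def embed(text: str) -> Counter:
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# 4. Store: an in-memory list of (chunk, vector) pairs plays the vector DB.
store = [(chunk, embed(chunk)) for chunk in chunks]


# 5. Retrieve: rank stored chunks by similarity to the query, keep top k.
def retrieve(query: str, k: int = 2) -> list[str]:
    query_vec = embed(query)
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


# 6. Generate: build the prompt that would be sent to the LLM.
query = "What does RAG combine?"
context = "\n".join(retrieve(query))
prompt = f"Context:\n{context}\n\nQuestion: {query}"
print(prompt)
```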

Splitting is the bridge between raw documents and searchable semantic units. Your goal is:

✅ chunks that are semantically coherent
✅ chunks that fit within embedding limits
✅ chunks that align with retrieval patterns
✅ minimal lost context (use overlap strategically)

Chunk Size: What to Use in Real Projects?

There’s no single perfect size, but these defaults work well:

  • Chunk size: 800–1200 characters (often ~200–300 tokens)

  • Chunk overlap: 10–15% (commonly 100–200 characters)

Why overlap? Because retrieval may pull one chunk but not its neighbor. Overlap reduces the risk of losing key transitions, definitions, or conclusions that sit at a chunk boundary.
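
A minimal sketch of what overlap means mechanically, using fixed-size character windows on the alphabet so the shared characters are easy to see. `split_with_overlap` is an illustrative helper, not a LangChain function.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Fixed-size windows; each chunk repeats the tail of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    # Stop once the remaining text is already covered by the previous chunk.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


alphabet = "abcdefghijklmnopqrstuvwxyz"
overlap_chunks = split_with_overlap(alphabet, chunk_size=10, chunk_overlap=3)
print(overlap_chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last 3 characters of the one before it, so a phrase that straddles a boundary still appears whole in at least one chunk.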


RecursiveJsonSplitter splits JSON while preserving structure.

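from langchain_text_splitters import RecursiveJsonSplitter

# Sample nested JSON (hypothetical data, for illustration only)
json_data = {
    "user": {"name": "Ada", "address": {"city": "London", "zip": "NW1"}},
    "orders": [{"id": 1, "item": "book"}, {"id": 2, "item": "pen"}],
}

# Create splitter
json_splitter = RecursiveJsonSplitter(
    max_chunk_size=1000,  # Upper bound on chunk size, in characters
    min_chunk_size=100,   # Soft lower bound before a new chunk is started
)

# Split into JSON-formatted strings; convert_lists=True turns lists
# into index-keyed dicts so they can be split as well
json_chunks = json_splitter.split_text(
    json_data=json_data,
    convert_lists=True,
)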
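
To show what "preserving structure" can mean, here is a small plain-Python sketch: leaves are grouped into chunks, and every chunk is itself a nested dict that keeps the full key path of each value. This is my own simplified illustration, not the library's algorithm; lists are treated as single values, and a single oversized value still gets its own chunk.

```python
import json


def flatten(data: dict, path: tuple = ()):
    """Yield (key_path, leaf_value) pairs for a nested dict."""
    for key, value in data.items():
        if isinstance(value, dict) and value:
            yield from flatten(value, path + (key,))
        else:
            yield path + (key,), value


def set_path(target: dict, path: tuple, value) -> None:
    """Write value into target at the nested key path, creating parents."""
    for key in path[:-1]:
        target = target.setdefault(key, {})
    target[path[-1]] = value


def split_json(data: dict, max_chunk_size: int) -> list[dict]:
    """Group leaves into nested dicts whose JSON text fits max_chunk_size."""
    chunks, current = [], {}
    for path, value in flatten(data):
        candidate = json.loads(json.dumps(current))  # cheap deep copy
        set_path(candidate, path, value)
        if current and len(json.dumps(candidate)) > max_chunk_size:
            # Adding this leaf would overflow: close the chunk, start a new one.
            chunks.append(current)
            current = {}
            set_path(current, path, value)
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


data = {
    "user": {"name": "Ada", "address": {"city": "London", "zip": "NW1"}},
    "plan": "pro",
}
json_parts = split_json(data, max_chunk_size=60)
for part in json_parts:
    print(json.dumps(part))
```

Note how the second chunk still nests "zip" under "user" → "address", so each chunk can be read on its own.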

When You Need a Custom Splitter

You’ll want a custom splitter in many cases. A few examples:

  • you have a domain-specific format (EHR notes, bank statements, logs, invoices)

  • a document has repeated templates (forms, contracts)

  • you need chunking aligned to user queries (“policy section”, “procedure step”, “definition”)

  • you must preserve strict boundaries (like JSON objects or CSV rows)
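
As one example of the "policy section" case, here is a sketch of a custom splitter that cuts a document at its section headings and keeps each heading as metadata. The `Section N:` heading format, the sample text, and the `split_by_section` helper are all assumptions made for illustration.

```python
import re

policy_text = """Section 1: Eligibility
Employees are eligible after 90 days.

Section 2: Claims
Submit claims within 30 days.
Attach all receipts.

Section 3: Definitions
A claim is a request for reimbursement."""


def split_by_section(text: str) -> list[dict]:
    """Return one chunk per 'Section N: Title' block, title kept as metadata."""
    # Split right before every line that starts a new section.
    blocks = re.split(r"(?m)^(?=Section \d+:)", text)
    chunks = []
    for block in blocks:
        block = block.strip()
        if not block:
            continue
        heading, _, body = block.partition("\n")
        chunks.append({"page_content": body.strip(), "metadata": {"section": heading}})
    return chunks


section_chunks = split_by_section(policy_text)
for chunk in section_chunks:
    print(chunk["metadata"]["section"], "->", repr(chunk["page_content"]))
```

Because each chunk maps to exactly one section, a question like "what is the claims procedure?" can be answered from a single retrieved chunk.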


For implementations of the different kinds of splitters, see the notebook on GitHub: https://github.com/LeelaPrasadG/rag_langchain/blob/main/03_Text_Splitting_Strategies.ipynb

Monday, 5 January 2026

Document Loaders

This article covers the Document Loaders available for different file types.

The code for the Document Loaders below is in this notebook: https://github.com/LeelaPrasadG/rag_langchain/blob/main/02_Document_Loaders.ipynb

Document Loaders are tools that help you convert files (PDFs, CSVs, web pages, etc.) into **Document objects** that LangChain can work with.

Think of them as translators:
Input: Files in various formats (PDF, CSV, JSON, HTML)
Output: Standardized Document objects with text content and metadata

Why is this important?

Every RAG application needs to:
1. 📥 Load data from various sources
2. 🔄 Convert it to a standard format
3. 📊 Extract metadata (source, page number, etc.)
4. 🎯 Prepare it for embedding and retrieval

Document Loaders handle all of this automatically.

2 types of load methods

`.load()`: Load all documents at once (returns list[Document])
`.lazy_load()`: Load documents one at a time (generator, memory efficient)

Tip: Use lazy_load() for PDFs > 100 pages to save memory
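
The difference between the two is the usual list-versus-generator trade-off. A plain-Python sketch, where `ToyLoader` is a hypothetical class over in-memory "pages" (not a LangChain loader) and documents are plain dicts:

```python
from typing import Iterator


class ToyLoader:
    """Hypothetical loader over in-memory 'pages'; not a LangChain class."""

    def __init__(self, pages: list[str]):
        self.pages = pages

    def lazy_load(self) -> Iterator[dict]:
        # Yield one document at a time; nothing is built up front.
        for number, text in enumerate(self.pages):
            yield {"page_content": text, "metadata": {"page": number}}

    def load(self) -> list[dict]:
        # Materialize every document in memory at once.
        return list(self.lazy_load())


loader = ToyLoader([f"Text of page {n}" for n in range(500)])

all_docs = loader.load()
print(len(all_docs))  # 500

# With lazy_load() we can stop early without building the other pages.
for doc in loader.lazy_load():
    if "page 42" in doc["page_content"]:
        print(doc["metadata"])  # {'page': 42}
        break
```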


A few commonly used Document Loaders:


Document Loaders convert files into standardized Document objects
from langchain_core.documents import Document

PyPDFLoader loads PDF files (1 document per page)
from langchain_community.document_loaders import PyPDFLoader

PyPDFLoader is used to load PDF files. It:
- Extracts text from each page
- Creates one Document per page
- Automatically adds source and page number to metadata

CSVLoader loads CSV data (1 document per row)
from langchain_community.document_loaders import CSVLoader

CSVLoader converts each row of a CSV file into a separate Document.

Use cases:
- Product catalogs
- Customer records
- Any tabular data
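
To picture "one Document per row", here is a sketch using only Python's standard csv module. The sample data and the dict-based document shape are assumptions for illustration; the real CSVLoader returns Document objects.

```python
import csv
import io

csv_text = "name,price\nLaptop,999\nMouse,25\n"

row_documents = []
for row_number, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
    # Render the row as "column: value" lines, one document per row.
    content = "\n".join(f"{column}: {value}" for column, value in row.items())
    row_documents.append({"page_content": content, "metadata": {"row": row_number}})

for document in row_documents:
    print(document)
```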

JSONLoader uses jq syntax to extract data from JSON
from langchain_community.document_loaders import JSONLoader

JSONLoader extracts data from JSON files using jq syntax (a query language for JSON).
It can also extract specific fields from the JSON. See the example in my GitHub notebook linked above.

Use cases:
- API responses
- Configuration files
- Structured data exports
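
For intuition, a jq expression such as `.messages[].content` means "take the content field of every item in messages". A plain-Python sketch of the same extraction, with made-up sample data and an illustrative metadata field:

```python
import json

raw = (
    '{"messages": ['
    '{"sender": "a", "content": "Hello"}, '
    '{"sender": "b", "content": "Hi there"}]}'
)
data = json.loads(raw)

# Plain-Python equivalent of the jq expression `.messages[].content`.
message_documents = [
    {"page_content": message["content"], "metadata": {"position": index}}
    for index, message in enumerate(data["messages"], start=1)
]

for document in message_documents:
    print(document)
```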

WebBaseLoader scrapes web pages (static HTML only) - Widely used
from langchain_community.document_loaders import WebBaseLoader

WebBaseLoader scrapes web pages and extracts text content.

Important: Only works with static HTML. For JavaScript-rendered sites, you'd need Playwright or Selenium.
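
The static-HTML limitation is easy to see with Python's built-in HTML parser. In this sketch (my own illustration, not how WebBaseLoader is implemented) the text that is already in the HTML gets extracted, while the empty `<div id="app">` that JavaScript would fill in a browser contributes nothing.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


html = """<html><head><title>Demo</title><script>render()</script></head>
<body><h1>RAG basics</h1><p>Loaders turn pages into text.</p>
<div id="app"></div></body></html>"""

parser = TextExtractor()
parser.feed(html)
page_content = "\n".join(parser.parts)
print(page_content)
```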

TextLoader handles plain text files
from langchain_community.document_loaders import TextLoader

For simple text files, use TextLoader.

DirectoryLoader batch processes multiple files
from langchain_community.document_loaders import DirectoryLoader

DirectoryLoader loads all files from a directory automatically.

Perfect for:
- Loading entire document libraries
- Processing multiple files at once
- Building knowledge bases
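
The underlying idea is a glob over a folder plus one loader call per file. A standard-library sketch, where `load_directory` is a hypothetical helper, documents are plain dicts, and the files are created in a temporary directory so the example is self-contained:

```python
import tempfile
from pathlib import Path


def load_directory(root: Path, pattern: str = "**/*.txt") -> list[dict]:
    """Read every file matching pattern under root into a document dict."""
    documents = []
    for path in sorted(root.glob(pattern)):
        documents.append(
            {
                "page_content": path.read_text(encoding="utf-8"),
                "metadata": {"source": str(path)},
            }
        )
    return documents


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.txt").write_text("first file", encoding="utf-8")
    (root / "sub").mkdir()
    (root / "sub" / "b.txt").write_text("second file", encoding="utf-8")
    (root / "notes.md").write_text("ignored", encoding="utf-8")

    dir_documents = load_directory(root)
    print([Path(d["metadata"]["source"]).name for d in dir_documents])
    # ['a.txt', 'b.txt']
```

The glob pattern decides which files are picked up, so the Markdown file is skipped here.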

✅ All loaders return **Document** objects with `page_content` and `metadata`


