What are External Index Retrievers?
Consider a use case that needs a combination of a vector DB, the latest data from the internet, and the LLM's existing knowledge and functionality.
Retrievers are available from the vector DB. The default is similarity search (commonly cosine similarity), represented as search_type="similarity".
This is the basic retriever, and there are a few others, such as "MMR" (Maximal Marginal Relevance).
Is this the same as cosine similarity?
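One way to see how MMR differs from plain similarity ranking is a small sketch in plain Python. This is not the vector DB's or LangChain's implementation: the function names, the toy 2-D vectors, and the `lambda_mult` weight are assumptions chosen for illustration. Similarity search ranks only by closeness to the query, while MMR also penalizes candidates that resemble results already picked.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def similarity_search(query, docs, k=2):
    """Return the indices of the k docs most similar to the query."""
    ranked = sorted(range(len(docs)), key=lambda i: cosine(query, docs[i]), reverse=True)
    return ranked[:k]

def mmr_search(query, docs, k=2, lambda_mult=0.5):
    """Greedy Maximal Marginal Relevance: trade relevance against redundancy."""
    selected = []
    candidates = list(range(len(docs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query, docs[i])
            # Redundancy is the highest similarity to anything already selected.
            redundancy = max((cosine(docs[i], docs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy embeddings (hypothetical): doc 1 is a near-duplicate of doc 0.
docs = [
    [1.0, 0.0],
    [0.99, 0.1],
    [0.6, 0.8],
]
query = [1.0, 0.2]

print(similarity_search(query, docs))  # the two near-duplicates rank highest
print(mmr_search(query, docs))         # the second pick is the more distinct doc
```

With these toy vectors, similarity search returns the two near-duplicate documents, while MMR keeps the best match and then prefers the less redundant one.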
If you’re building a Retrieval-Augmented Generation (RAG) application, text splitting is one of the most important (and most underestimated) steps in the pipeline.
Even with a great embedding model and a strong vector database, bad chunking can ruin retrieval quality:
relevant info gets cut in half,
context becomes too long or too short,
citations become inaccurate,
the LLM hallucinates because the retrieved chunks are incomplete.
This article breaks down the most useful splitter types (especially in LangChain), explains when to use which, and includes Python code with comments you can directly reuse.
RAG works like this:
Load documents (PDF/HTML/CSV/JSON, etc.)
Split into chunks
Embed each chunk → vectors
Store in vector DB
Retrieve top-k chunks per query
Feed retrieved context into LLM → answer
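The six steps above can be sketched end to end in plain Python. This is a toy stand-in, not a real pipeline: the in-memory documents, the sentence splitter, the word-count "embedding", and the list used as a store are all assumptions made so the example runs without any external library. In practice an embedding model, a vector DB, and an LLM call would replace them.

```python
import math
import re
from collections import Counter

def load_documents():
    """Step 1: load documents (here, hard-coded strings instead of files)."""
    return [
        "Text splitting cuts documents into chunks. "
        "Overlap keeps context shared between neighboring chunks.",
        "Vector databases store embeddings. "
        "Retrieval returns the top-k chunks for a query.",
    ]

def split(text):
    """Step 2: split into chunks (naively, one sentence per chunk)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def embed(text):
    """Step 3: embed a chunk (toy bag-of-words vector, not a real model)."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_store(documents):
    """Step 4: store (chunk, vector) pairs; a list stands in for the vector DB."""
    return [(chunk, embed(chunk)) for doc in documents for chunk in split(doc)]

def retrieve(query, store, k=2):
    """Step 5: retrieve the top-k chunks for the query."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(query, context_chunks):
    """Step 6: assemble the prompt that would be sent to the LLM."""
    context = "\n".join(context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

store = build_store(load_documents())
question = "Where are embeddings stored in vector databases?"
top_chunks = retrieve(question, store, k=1)
print(build_prompt(question, top_chunks))
```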
Splitting is the bridge between raw documents and searchable semantic units. Your goal is:
✅ chunks that are semantically coherent
✅ chunks that fit within embedding limits
✅ chunks that align with retrieval patterns
✅ minimal lost context (use overlap strategically)
There’s no single perfect size, but these defaults work well:
Chunk size: 800–1200 characters (often ~200–300 tokens)
Chunk overlap: 10–15% (commonly 100–200 characters)
Why overlap? Because retrieval may pull one chunk but not its neighbor — overlap prevents losing key transitions, definitions, or conclusions.
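To make the size and overlap numbers concrete, here is a minimal fixed-size character splitter. It is a plain-Python sketch, not LangChain's splitter: the function name and the default values (1000 characters, 150 overlap) are assumptions picked from the ranges above.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=150):
    """Split text into fixed-size character chunks that share an overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks

sample = "abcdefghij" * 250  # 2500 characters of filler text
chunks = split_with_overlap(sample)

for i, chunk in enumerate(chunks):
    print(i, len(chunk))
```

Each chunk begins with the last 150 characters of the previous one, so a sentence that straddles a boundary appears whole in at least one chunk when it is shorter than the overlap.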
You’ll want a custom splitter in cases like these:
you have a domain-specific format (EHR notes, bank statements, logs, invoices)
a document has repeated templates (forms, contracts)
you need chunking aligned to user queries (“policy section”, “procedure step”, “definition”)
you must preserve strict boundaries (like JSON objects or CSV rows)
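As an illustration of the last case, a custom splitter can group whole JSON objects into chunks and never cut one in the middle. This is a sketch under stated assumptions: the function name, the `max_chars` budget, and the sample records are hypothetical, and the input is assumed to be a JSON array of objects.

```python
import json

def split_json_records(raw_json, max_chars=150):
    """Pack whole JSON records into chunks of roughly max_chars each."""
    records = json.loads(raw_json)
    if not isinstance(records, list):
        raise ValueError("expected a JSON array of records")

    chunks, current, current_len = [], [], 0
    for record in records:
        line = json.dumps(record, sort_keys=True)
        # Start a new chunk if adding this record would exceed the budget.
        if current and current_len + len(line) + 1 > max_chars:
            chunks.append("\n".join(current))
            current, current_len = [], 0
        # A record larger than max_chars still stays whole, in its own chunk.
        current.append(line)
        current_len += len(line) + 1
    if current:
        chunks.append("\n".join(current))
    return chunks

# Hypothetical records standing in for domain data such as notes or log entries.
raw = json.dumps([{"id": i, "note": "x" * 50} for i in range(5)])
record_chunks = split_json_records(raw)

for chunk in record_chunks:
    print(len(chunk.splitlines()), "records in chunk")
```

Because every line in a chunk is a complete JSON object, each chunk can be parsed back on its own, which keeps metadata such as record IDs attached to the text being embedded.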
This article covers the different kinds of document loaders available for different file types.
The code for these document loaders is in the notebook: https://github.com/LeelaPrasadG/rag_langchain/blob/main/02_Document_Loaders.ipynb
In this post, I walk through building a ReAct (Reasoning + Acting) agent using LangGraph and Groq's openai/gpt-oss-120b model, where the...