Why Is Text Splitting Important?

Imagine you have a 200-page book and someone asks: *"What did the author say about machine learning on page 87?"*

Problem: LLMs have a limited "attention span" (context window). The original versions of these models, for example, accept roughly:
- GPT-3.5-Turbo: ~4,000 tokens (~16,000 characters)
- GPT-4: ~8,000 tokens (~32,000 characters)
- At those sizes, you can't fit a whole book in one query!

Solution: Split the book into smaller chunks:
1. Each chunk is small enough for the LLM
2. Search finds the relevant chunks (like page 87)
3. Only send those chunks to the LLM
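
The size arithmetic above can be checked with a rough rule of thumb. This is a minimal sketch, assuming about 4 characters per token (the ratio implied by the figures above; real tokenizers vary with language and content). The function names and the 2,000-characters-per-page book are illustrative assumptions, not part of any library.

```python
import math

CHARS_PER_TOKEN = 4  # assumption: rough average for English text


def estimate_tokens(text: str) -> int:
    """Estimate the token count of text from its character length."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def fits_in_context(text: str, context_tokens: int) -> bool:
    """Return True if the estimated token count fits the context window."""
    return estimate_tokens(text) <= context_tokens


def chunks_needed(text: str, chunk_chars: int) -> int:
    """Number of non-overlapping chunks of chunk_chars needed to cover text."""
    return math.ceil(len(text) / chunk_chars)


# Hypothetical book: 200 pages at roughly 2,000 characters per page
book = "x" * (200 * 2000)

print(estimate_tokens(book))        # estimated tokens for the whole book
print(fits_in_context(book, 4000))  # does it fit a ~4,000-token window?
print(chunks_needed(book, 1000))    # how many 1,000-character chunks it takes
```

For an exact count, swap the heuristic for the tokenizer that matches your model.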


The Challenge

If the text is split at arbitrary positions:
```
❌ BAD SPLIT:
Chunk 1: "The transformer architecture revolutionized NLP. It uses self-att"
Chunk 2: "ention mechanisms to process sequences in parallel. This allows..."
```

The word "attention" is cut in half! 😱

**Good splitters** respect boundaries (paragraphs, sentences, words):
```
✅ GOOD SPLIT:
Chunk 1: "The transformer architecture revolutionized NLP. It uses self-attention mechanisms."
Chunk 2: "Self-attention allows the model to process sequences in parallel. This improves speed..."
```
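
The difference between the two splits can be reproduced in a few lines. This is a minimal sketch, not a production splitter: `naive_split` and `sentence_split` are hypothetical helper names, and the sentence detection is a simple regex that assumes sentences end in `.`, `!`, or `?` followed by whitespace.

```python
import re

text = (
    "The transformer architecture revolutionized NLP. "
    "It uses self-attention mechanisms to process sequences in parallel."
)


def naive_split(text: str, width: int) -> list[str]:
    """Cut every `width` characters, ignoring word and sentence boundaries."""
    return [text[i:i + width] for i in range(0, len(text), width)]


def sentence_split(text: str, max_chars: int) -> list[str]:
    """Pack whole sentences into chunks of at most max_chars where possible.

    A single sentence longer than max_chars is kept whole as its own chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


naive_chunks = naive_split(text, 65)
sentence_chunks = sentence_split(text, 80)

for chunk in naive_chunks:
    print(repr(chunk))  # the first chunk stops in the middle of "self-attention"

for chunk in sentence_chunks:
    print(repr(chunk))  # each chunk ends at a sentence boundary
```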

RecursiveCharacterTextSplitter is a good default splitter for most general-purpose text.

How it works:

1. Tries to split on double newlines (\n\n) → paragraphs
2. If chunks are still too big, splits on single newlines (\n) → lines
3. If still too big, splits on a period followed by a space (". ") → sentences
4. If still too big, splits on spaces ( ) → words
5. Last resort: splits on characters

This preserves meaning as much as possible!
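
To make the fallback order concrete, here is a minimal pure-Python sketch of the idea. It is not LangChain's implementation: it ignores overlap, drops the separator at chunk boundaries, and `recursive_split` is a hypothetical name used only for illustration.

```python
def recursive_split(text: str, chunk_size: int, separators: list[str]) -> list[str]:
    """Split text into pieces of at most chunk_size characters,
    trying the coarsest separator first and falling back to finer ones."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, chunk_size, finer)

    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate          # keep packing pieces together
            continue
        if current:
            chunks.append(current)       # flush what has been packed so far
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # This piece alone is too big: retry with finer separators
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = "Para one is short.\n\nPara two is a bit longer. It has two sentences.\n\nTiny."
pieces = recursive_split(sample, 30, ["\n\n", "\n", ". ", " ", ""])
for piece in pieces:
    print(len(piece), repr(piece))
```

Short paragraphs survive intact, and only the paragraph that exceeds the limit is broken down further at sentence level.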

```
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import TextLoader

# Path to the text file to split (placeholder; point this at your own file)
txt_path = "sample.txt"

# Load the document
loader = TextLoader(txt_path)
documents = loader.load()

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,        # Maximum chunk size in characters
    chunk_overlap=200,      # Characters shared between consecutive chunks
    length_function=len,    # How to measure length
    separators=["\n\n", "\n", ". ", " ", ""],  # Try these in order
)

# Split the document
chunks = splitter.split_documents(documents)
print(f"✂️ Split into {len(chunks)} chunks\n")
```

Text Splitters for RAG

If you’re building a Retrieval-Augmented Generation (RAG) application, text splitting is one of the most important (and most underestimated) steps in the pipeline.

Even with a great embedding model and a strong vector database, bad chunking can ruin retrieval quality:

- relevant info gets cut in half
- context becomes too long or too short
- citations become inaccurate
- LLM hallucinates because the retrieved chunks are incomplete

This article breaks down the most useful splitter types (especially in LangChain), explains when to use which, and includes Python code with comments you can directly reuse.

Why Text Splitting Matters in RAG

RAG works like this:

1. Load documents (PDF/HTML/CSV/JSON, etc.)
2. Split into chunks
3. Embed each chunk → vectors
4. Store in vector DB
5. Retrieve top-k chunks per query
6. Feed retrieved context into LLM → answer
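
The steps above can be sketched end to end with the standard library only. This is a toy illustration under loud assumptions: word-count vectors stand in for a real embedding model, a Python list stands in for the vector DB, the document is a hypothetical inline string, and the final LLM call is left out. All function names are made up for this sketch.

```python
import math
import re
from collections import Counter


def split(text: str) -> list[str]:
    """Step 2 (simplified): one chunk per sentence."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def embed(text: str) -> Counter:
    """Step 3 (toy stand-in): a word-count vector, not a learned embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, store: list[tuple[str, Counter]], k: int) -> list[str]:
    """Step 5: return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


# Step 1: "load" a document (hypothetical inline text)
document = (
    "Transformers use self-attention to process sequences in parallel. "
    "Convolutional networks slide filters over images. "
    "Vector databases store embeddings for similarity search."
)

chunks = split(document)                            # Step 2
store = [(chunk, embed(chunk)) for chunk in chunks]  # Steps 3-4: in-memory store
question = "How do transformers use self-attention?"
top = retrieve(question, store, k=1)                # Step 5

# Step 6: build the prompt that would be sent to an LLM (call omitted)
prompt = "Answer using only this context:\n" + "\n".join(top) + f"\n\nQuestion: {question}"
print(prompt)
```

Because retrieval only ever sees chunks, whatever the splitter cuts apart in step 2 cannot be put back together in step 5.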

Splitting is the bridge between raw documents and searchable semantic units. Your goal is:

- ✅ chunks that are semantically coherent
- ✅ chunks that fit within embedding limits
- ✅ chunks that align with retrieval patterns
- ✅ minimal lost context (use overlap strategically)

Chunk Size: What to Use in Real Projects?

There’s no single perfect size, but these defaults work well:

- Chunk size: 800–1200 characters (often ~200–300 tokens)
- Chunk overlap: 10–15% (commonly 100–200 characters)

Why overlap? Retrieval may pull one chunk but not its neighbor, so overlap keeps key transitions, definitions, and conclusions from being lost at a chunk boundary.
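
Overlap is easiest to see with a plain sliding window. This is a minimal sketch that cuts at fixed character positions (no boundary awareness) so that only the overlap mechanics are on display; `split_with_overlap` is a hypothetical helper, and the sizes are shrunk so the output stays readable.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Slide a window of chunk_size characters, advancing by
    chunk_size - chunk_overlap so neighbors share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return chunks


text = (
    "Overlap repeats the tail of one chunk at the head of the next, "
    "so a definition that straddles a boundary appears whole in at least one chunk."
)
overlapped = split_with_overlap(text, chunk_size=40, chunk_overlap=10)

for chunk in overlapped:
    print(repr(chunk))
```

The trade-off is storage and embedding cost: with 10 characters of overlap on 40-character chunks, about a quarter of the text is stored twice.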

RecursiveJsonSplitter splits JSON while preserving structure.

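```
from langchain_text_splitters import RecursiveJsonSplitter

# Hypothetical sample input; replace with your own parsed JSON (a dict)
json_data = {
    "patient": {"name": "A. Example", "visits": ["2025-01-05", "2025-03-12"]},
}

# Create splitter
json_splitter = RecursiveJsonSplitter(
    max_chunk_size=1000,
    min_chunk_size=100,
)

# Split into JSON-formatted strings; convert_lists=True turns lists into
# index-keyed dicts so they can be split as well
json_chunks = json_splitter.split_text(
    json_data=json_data,
    convert_lists=True,
)
```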
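
To show what "preserving structure" means, here is a minimal pure-Python sketch of the idea: walk the nested dict, and place each leaf into the current chunk under its full key path, starting a new chunk when the serialized size would exceed the limit. This is an illustration only and is not how `RecursiveJsonSplitter` is implemented; `split_json_sketch` and `set_nested` are hypothetical names, and the sample data is made up.

```python
import copy
import json


def set_nested(target: dict, path: list[str], value) -> None:
    """Write value into target at the nested key path, creating dicts as needed."""
    for key in path[:-1]:
        target = target.setdefault(key, {})
    target[path[-1]] = value


def split_json_sketch(data: dict, max_chunk_size: int) -> list[dict]:
    """Split a nested dict into smaller dicts that keep each value's key path."""
    chunks: list[dict] = [{}]

    def walk(node, path: list[str]) -> None:
        if isinstance(node, dict) and node:
            for key, value in node.items():
                walk(value, path + [key])
            return
        # Leaf value: try adding it to the current chunk
        candidate = copy.deepcopy(chunks[-1])
        set_nested(candidate, path, node)
        if chunks[-1] and len(json.dumps(candidate)) > max_chunk_size:
            chunks.append({})                  # current chunk is full
            set_nested(chunks[-1], path, node)
        else:
            chunks[-1] = candidate

    for key, value in data.items():
        walk(value, [key])
    return [chunk for chunk in chunks if chunk]


record = {
    "patient": {"name": "A. Example", "age": 42},
    "visits": {
        "first": {"reason": "checkup"},
        "second": {"reason": "follow-up"},
    },
}

json_parts = split_json_sketch(record, max_chunk_size=60)
for part in json_parts:
    print(json.dumps(part))
```

Each printed chunk is valid JSON on its own and still says where its values came from, which a character-based splitter cannot guarantee.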

When You Need a Custom Splitter

You’ll want a custom splitter in many cases. Here are a few:
- you have a domain-specific format (EHR notes, bank statements, logs, invoices)
- a document has repeated templates (forms, contracts)
- you need chunking aligned to user queries (“policy section”, “procedure step”, “definition”)
- you must preserve strict boundaries (like JSON objects or CSV rows)
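
As an illustration of the "policy section" case, a custom splitter can be as small as a regex over headings. This is a minimal sketch, assuming a hypothetical document format in which every section starts with a line like `Section 1:`; the policy text and the `split_by_section` helper are invented for the example.

```python
import re

# Assumption: each section begins at the start of a line with "Section <n>:"
HEADING_RE = re.compile(r"^Section \d+:", re.MULTILINE)


def split_by_section(text: str) -> list[str]:
    """Return one chunk per section, plus any preamble before the first heading."""
    starts = [match.start() for match in HEADING_RE.finditer(text)]
    if not starts:
        return [text.strip()] if text.strip() else []

    chunks = []
    preamble = text[:starts[0]].strip()
    if preamble:
        chunks.append(preamble)
    bounds = starts + [len(text)]
    for begin, end in zip(bounds, bounds[1:]):
        chunks.append(text[begin:end].strip())
    return chunks


policy = """Acme Travel Policy

Section 1: Definitions
A "trip" is any travel over 50 miles.

Section 2: Procedure
Step 1: submit a request.
Step 2: wait for approval.
"""

sections = split_by_section(policy)
for section in sections:
    print(repr(section))
```

If a section is still too large for your chunk size, its text can be passed on to a general-purpose splitter as a second stage.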

For implementations of the different splitters, see the notebook on GitHub: https://github.com/LeelaPrasadG/rag_langchain/blob/main/03_Text_Splitting_Strategies.ipynb

Artificial Intelligence by Leela Prasad

Wednesday, 7 January 2026

Text Splitting
