Artificial Intelligence by Leela Prasad: Document Loaders

This article covers the different kinds of Document Loaders that are available for different file types.

The code for the Document Loaders below is in this notebook: https://github.com/LeelaPrasadG/rag_langchain/blob/main/02_Document_Loaders.ipynb

Document Loaders are tools that help you convert files (PDFs, CSVs, web pages, etc.) into **Document objects** that LangChain can work with.

Think of them as translators:
- Input: Files in various formats (PDF, CSV, JSON, HTML)
- Output: Standardized Document objects with text content and metadata

Why is this important?

Every RAG application needs to:
1. 📥 Load data from various sources
2. 🔄 Convert it to a standard format
3. 📊 Extract metadata (source, page number, etc.)
4. 🎯 Prepare it for embedding and retrieval

Document Loaders handle all of this automatically.

There are two load methods:

- `.load()`: Loads all documents at once (returns `list[Document]`)
- `.lazy_load()`: Loads documents one at a time (a generator, so it is memory efficient)

Tip: Use `lazy_load()` for PDFs with more than 100 pages to save memory.
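
The practical difference between the two methods can be sketched with two small helpers. This is a minimal sketch: `take_documents` and `count_documents` are illustrative names, not LangChain functions, and they assume only that the loader exposes a `lazy_load()` generator as described above.

```python
from itertools import islice


def take_documents(loader, limit):
    """Return at most `limit` documents, reading them lazily.

    `loader` is any object with a LangChain-style `lazy_load()` method.
    islice stops pulling from the generator once `limit` items have
    arrived, so the remaining pages or rows are never produced.
    """
    return list(islice(loader.lazy_load(), limit))


def count_documents(loader):
    """Count documents one at a time instead of building the full list."""
    return sum(1 for _ in loader.lazy_load())
```

With a real loader this might be called as `take_documents(PyPDFLoader("big_report.pdf"), 5)`, where `big_report.pdf` is a placeholder file name.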

A few commonly used Document Loaders:
✅ Document Loaders convert files into standardized Document objects
from langchain_core.documents import Document
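
A Document can also be built by hand, which is handy for quick experiments. The sketch below is only illustrative: the helper names `make_document` and `describe` and the sample values are assumptions, not code from the notebook.

```python
def make_document(text, source):
    """Build a Document by hand, the same shape every loader returns."""
    # Imported inside the function so the helper can be defined even
    # where langchain-core is not installed yet.
    from langchain_core.documents import Document

    return Document(page_content=text, metadata={"source": source})


def describe(doc):
    """Summarize a Document as '<source>: <first 40 characters>'."""
    source = doc.metadata.get("source", "unknown")
    return f"{source}: {doc.page_content[:40]}"
```

For example, `describe(make_document("hello world", "a.txt"))` produces a one-line summary built from the two fields every Document carries.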

✅ PyPDFLoader loads PDF files (1 document per page)
from langchain_community.document_loaders import PyPDFLoader

PyPDFLoader is used to load PDF files. It:
- Extracts text from each page
- Creates one Document per page
- Automatically adds source and page number to metadata
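
The behaviour listed above can be sketched as follows. This assumes the `langchain-community` and `pypdf` packages are installed; `load_pdf_pages` and `page_lengths` are illustrative helper names, and any file path passed in is the caller's own.

```python
def load_pdf_pages(pdf_path):
    """Load a PDF with PyPDFLoader, giving one Document per page."""
    # Imported lazily; requires langchain-community and pypdf.
    from langchain_community.document_loaders import PyPDFLoader

    return PyPDFLoader(pdf_path).load()


def page_lengths(docs):
    """Map each page number found in the metadata to its character count."""
    return {doc.metadata.get("page"): len(doc.page_content) for doc in docs}
```

Calling `page_lengths(load_pdf_pages("sample.pdf"))`, with `sample.pdf` as a placeholder, gives a quick check that every page produced some text. Page numbers in the metadata typically start at 0.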

✅ CSVLoader loads CSV data (1 document per row)
from langchain_community.document_loaders import CSVLoader

CSVLoader converts each row of a CSV file into a separate Document.

Use cases:
- Product catalogs
- Customer records
- Any tabular data
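
As a rough sketch of the row-to-Document mapping, the example below writes a tiny product catalog and loads it. The file contents and the helper names `write_sample_csv` and `load_csv_rows` are made up for illustration; the loader call assumes `langchain-community` is installed.

```python
import csv


def write_sample_csv(path):
    """Write a two-row product catalog to `path` (illustrative data)."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["name", "price"])
        writer.writerow(["Widget", "9.99"])
        writer.writerow(["Gadget", "19.50"])


def load_csv_rows(path):
    """Load a CSV with CSVLoader; each data row becomes one Document."""
    # Imported lazily; requires langchain-community.
    from langchain_community.document_loaders import CSVLoader

    return CSVLoader(file_path=path, encoding="utf-8").load()
```

Each returned Document holds one row as `column: value` lines in `page_content`, with the file path and row index in `metadata`.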

✅ JSONLoader uses jq syntax to extract data from JSON
from langchain_community.document_loaders import JSONLoader

JSONLoader extracts data from JSON files using jq syntax (a query language for JSON).
It can also extract specific fields from the JSON. See the example in the GitHub notebook linked above.

Use cases:
- API responses
- Configuration files
- Structured data exports
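
To make the jq idea concrete, here is a small sketch that pulls one field out of a nested JSON file. The sample payload and the helper names are invented for illustration, and the loader needs both `langchain-community` and the `jq` Python package.

```python
import json


def write_sample_json(path):
    """Write a small chat export to `path` (illustrative data)."""
    payload = {
        "messages": [
            {"sender": "alice", "content": "Hello"},
            {"sender": "bob", "content": "Hi there"},
        ]
    }
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(payload, handle)


def load_message_contents(path):
    """Create one Document per message, keeping only its `content` field."""
    # Imported lazily; requires langchain-community and the jq package.
    from langchain_community.document_loaders import JSONLoader

    # `.messages[].content` walks the array and selects each content string.
    loader = JSONLoader(file_path=path, jq_schema=".messages[].content")
    return loader.load()
```

Changing the jq expression, for example to `.messages[].sender`, would select a different field from the same file.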

✅ WebBaseLoader scrapes web pages (static HTML only) - Widely used
from langchain_community.document_loaders import WebBaseLoader

WebBaseLoader scrapes web pages and extracts text content.

Important: Only works with static HTML. For JavaScript-rendered sites, you'd need Playwright or Selenium.

✅ TextLoader handles plain text files
from langchain_community.document_loaders import TextLoader

For simple text files, use TextLoader.

✅ DirectoryLoader batch processes multiple files
from langchain_community.document_loaders import DirectoryLoader

DirectoryLoader loads all files from a directory automatically.

Perfect for:
- Loading entire document libraries
- Processing multiple files at once
- Building knowledge bases
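
A minimal sketch of batch loading is shown below. It pairs DirectoryLoader with TextLoader so that only plain text files are read; the folder contents and helper names are assumptions made for the example, and `langchain-community` must be installed.

```python
from pathlib import Path


def write_sample_notes(folder):
    """Create two .txt files plus one file the glob pattern should skip."""
    root = Path(folder)
    (root / "a.txt").write_text("First note", encoding="utf-8")
    (root / "b.txt").write_text("Second note", encoding="utf-8")
    (root / "ignore.log").write_text("not loaded", encoding="utf-8")


def load_text_folder(folder):
    """Load every .txt file under `folder`, one Document per file."""
    # Imported lazily; requires langchain-community.
    from langchain_community.document_loaders import DirectoryLoader, TextLoader

    loader = DirectoryLoader(
        str(folder),
        glob="**/*.txt",                      # only match .txt files
        loader_cls=TextLoader,                # loader used for each match
        loader_kwargs={"encoding": "utf-8"},  # passed through to TextLoader
    )
    return loader.load()
```

Setting `loader_cls` explicitly keeps the example free of extra parsing dependencies; a different loader class can be swapped in for other file types.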

✅ All loaders return **Document** objects with `page_content` and `metadata`

Monday, 5 January 2026
