Thursday, 25 December 2025

Retrieval-Augmented Generation (RAG): A Practical Architecture Guide with Diagrams

Large Language Models (LLMs) are powerful—but they come with a fundamental limitation: they only know what they were trained on. Retrieval-Augmented Generation (RAG) solves this by combining external knowledge retrieval with LLM-based generation, enabling responses that are grounded, up-to-date, and domain-specific.

This article walks through a simple yet complete RAG architecture, explaining each stage clearly with diagrams and real-world intuition.


🔷 High-Level RAG Architecture


At a high level, RAG consists of five major stages:

  1. Data Ingestion

  2. Vector Databases

  3. Retrieval

  4. Augmentation

  5. Generation


1️⃣ Data Ingestion: From Raw Data to Chunks

Supported Data Types

RAG systems can ingest data from multiple modalities:

  • Text: PDF, HTML, JSON, CSV, XML, DOCX, PPT

  • Images: JPG, PNG, GIF

  • Audio

  • Video

Ingestion Flow

  1. Documents are crawled or loaded

  2. Content is cleaned and normalized

  3. Documents are split into chunks

  4. Each chunk is sent to an Embedding Model

  5. Resulting embeddings are stored in a Vector Database

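📌 Why chunking?
LLMs and embedding models have context length limits, so long documents cannot be embedded or prompted whole. Splitting them into focused chunks supports: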

  • Better semantic matching

  • Faster retrieval

  • Higher answer accuracy
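
Step 3 of the ingestion flow can be sketched as a fixed-size chunker with overlap. This is a minimal illustration rather than a recommended production splitter: the function name `chunk_text` and the `chunk_size` / `overlap` defaults are assumptions chosen for the example, and sizes are counted in words rather than model tokens.

```python
def chunk_text(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into word-based chunks that overlap by `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some words twice.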


2️⃣ Vector Databases & Embeddings

What is a Vector?

A vector is a numeric representation of data in high-dimensional space.

Example:

"My name is Leela" → [0.8, 0.9, 0.5, 0.3]

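Text, images, or audio are transformed into numbers that capture semantic meaning. The four values above are illustrative; real embedding models output hundreds or thousands of dimensions.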


How Text Becomes Numbers

Traditional and modern techniques include:

  • One-Hot Encoding

  • Bag of Words

  • Co-Occurrence Matrix

  • TF-IDF

  • Word2Vec

  • Transformer-based Embeddings (modern standard)

An Embedding Model converts text into dense vectors.
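
To make one of the older techniques in the list concrete, here is a small TF-IDF sketch in plain Python. It is illustrative only: the helper name `tfidf_vectors`, the whitespace tokenizer, and the particular smoothed IDF formula are assumptions, and real libraries differ in their normalization details. Note that these vectors are sparse and as long as the vocabulary, unlike the dense vectors an embedding model produces.

```python
import math
from collections import Counter

def tfidf_vectors(docs: list[str]) -> tuple[list[str], list[list[float]]]:
    """Return (vocabulary, one TF-IDF vector per document)."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF (an assumed variant): rarer terms get larger weights.
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1.0 for tok in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1  # avoid division by zero for empty documents
        vectors.append([counts[tok] / total * idf[tok] for tok in vocab])
    return vocab, vectors
```

A word that appears in every document (such as "the") ends up with the lowest weight, while a word unique to one document dominates that document's vector.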


Popular Embedding Models

Open Source

  • Hugging Face models

  • Nomic

Managed / Paid

  • Gemini Embedding Model

  • OpenAI text-embedding-3-small

  • Amazon Titan Embeddings


Vector Databases

Open Source

  • FAISS (a similarity-search library rather than a full database)

  • ChromaDB

Managed / Paid

  • Qdrant Cloud (the Qdrant engine itself is open source)

  • Zilliz Cloud (managed Milvus)

These databases store embeddings and allow fast semantic similarity search.
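
Conceptually, a vector database stores (id, vector) pairs and answers nearest-neighbor queries. The class below is a toy, in-memory stand-in that uses brute-force cosine similarity. `TinyVectorStore` and its methods are hypothetical names, not the API of any product listed above; real systems typically rely on approximate indexes to stay fast at scale.

```python
import math

class TinyVectorStore:
    """Brute-force in-memory store: fine for a demo, too slow for large corpora."""

    def __init__(self) -> None:
        self._items: list[tuple[str, list[float]]] = []

    def add(self, item_id: str, vector: list[float]) -> None:
        self._items.append((item_id, vector))

    @staticmethod
    def _cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def search(self, query: list[float], k: int = 3) -> list[tuple[str, float]]:
        """Return the k stored ids most similar to `query`, best first."""
        scored = [(item_id, self._cosine(query, vec)) for item_id, vec in self._items]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

store = TinyVectorStore()
store.add("a", [1.0, 0.0])
store.add("b", [0.0, 1.0])
store.add("c", [1.0, 1.0])
print([item_id for item_id, _ in store.search([1.0, 0.1], k=2)])  # ['a', 'c']
```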


3️⃣ Retrieval: Finding the Right Context

Dense Passage Retrieval (DPR)

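DPR uses a bi-encoder, meaning two separate encoders:

  • Query Encoder → converts the user question into an embedding at query time

  • Passage Encoder → converts each chunk into an embedding at indexing time; these are the embeddings already stored in the vector DB

The query embedding is then compared against the stored passage embeddings using a similarity metric.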


Similarity Techniques

  • Cosine Similarity (most common)

  • Dot Product

  • Euclidean Distance
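
The three metrics are short enough to write out directly. The sketch below uses plain Python lists and hypothetical function names; the two example vectors are made up for illustration.

```python
import math

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product of the two vectors after scaling each to unit length."""
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0

def euclidean_distance(a: list[float], b: list[float]) -> float:
    """Straight-line distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a, b = [3.0, 4.0], [4.0, 3.0]    # illustrative vectors
print(dot(a, b))                 # 24.0
print(cosine_similarity(a, b))   # 0.96
print(euclidean_distance(a, b))  # about 1.414
```

For vectors normalized to unit length, cosine similarity and dot product coincide, and Euclidean distance produces the same ranking, so the choice mostly matters when vector lengths vary.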


Retrieval Example

Query Embedding → Q1

Stored Documents:

  V11 → Similarity 0.4
  V12 → Similarity 0.6
  V13 → Similarity 0.8

V13 is the most relevant document and is retrieved.
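
The selection step in this example amounts to sorting by score and keeping the best K. Here is a small sketch using the scores from the example above; the function name `top_k` is an assumption, and K = 1 reproduces the single-document case.

```python
import heapq

def top_k(scores: dict[str, float], k: int = 1) -> list[tuple[str, float]]:
    """Return the k (doc_id, similarity) pairs with the highest similarity."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])

scores = {"V11": 0.4, "V12": 0.6, "V13": 0.8}
print(top_k(scores))       # [('V13', 0.8)]
print(top_k(scores, k=2))  # [('V13', 0.8), ('V12', 0.6)]
```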


Extractive vs Abstractive Retrieval

  • Extractive: Returns exact text from source

  • Abstractive: Rephrases or summarizes while preserving meaning


Parametric vs Non-Parametric Memory

Memory Type       Description
Parametric        Knowledge stored inside model weights (e.g., BART, GPT pre-training)
Non-Parametric    External knowledge (Vector DB, documents, APIs)

📌 RAG combines both:

  • Parametric → language understanding

  • Non-parametric → factual grounding


4️⃣ Augmentation: Building the Final Prompt

Augmentation combines:

  1. User Query

  2. Retrieved Documents (Context)

  3. Prompt Instructions

Example prompt structure:

System Prompt: You are a domain expert…
Context: Retrieved documents
User Question: ...

This step controls:

  • Tone

  • Structure

  • Output format

  • Level of detail
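
Augmentation is mostly string assembly. The sketch below follows the example prompt structure above; `build_prompt`, the passage numbering, and the sample question, context, and instructions are all hypothetical and chosen only for illustration.

```python
def build_prompt(question: str, contexts: list[str], instructions: str) -> str:
    """Assemble instructions, numbered context passages, and the question."""
    numbered = "\n".join(f"[{i}] {text}" for i, text in enumerate(contexts, start=1))
    return (
        f"System Prompt: {instructions}\n\n"
        f"Context:\n{numbered}\n\n"
        f"User Question: {question}"
    )

# Made-up inputs for demonstration only.
prompt = build_prompt(
    question="What is our refund window?",
    contexts=["Refunds are accepted within 30 days of purchase."],
    instructions="You are a domain expert. Answer only from the context; "
                 "if the answer is not there, say you do not know.",
)
print(prompt)
```

Numbering the passages is a design choice that lets the instructions ask the model to cite which passage supports each claim.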


5️⃣ Generation: LLM Produces the Answer

The LLM generates text using three inputs:

  1. Query

  2. Retrieved Context

  3. Prompt Instructions

The output is:

  • Grounded in enterprise data

  • More accurate

  • Less prone to hallucination


🔁 End-to-End Retrieval Flow

  1. User submits a query

  2. Query → embedding model

  3. Vector DB performs similarity search

  4. Top-K documents retrieved

  5. Optional re-ranking

  6. Context + query + prompt → LLM

  7. Final response generated
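
The whole flow can be sketched in a few functions. Everything here is a stand-in under stated assumptions: `embed` uses word counts instead of a real embedding model (so it matches on shared words, not meaning), `generate` is a placeholder for an actual LLM call, re-ranking is omitted, and all names are hypothetical.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)  # a missing key in a Counter counts as 0
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Steps 2-4: embed the query, score every document, keep the top k."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def generate(prompt: str) -> str:
    """Placeholder for a real LLM call (hypothetical; swap in your client)."""
    return f"(model output for a prompt of {len(prompt)} characters)"

def answer(query: str, docs: list[str], k: int = 2) -> str:
    context = "\n".join(retrieve(query, docs, k))  # steps 2-4
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"  # step 6
    return generate(prompt)  # step 7

# Made-up corpus for demonstration only.
docs = [
    "chunks are embedded and stored in a vector database",
    "the retriever returns the top k most similar chunks",
    "bananas are rich in potassium",
]
print(retrieve("which chunks does the retriever return", docs, k=1))
print(answer("which chunks does the retriever return", docs))
```

In a real system, `embed` would call an embedding model, the sorted scan would be a vector database query, and `generate` would call the LLM.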


🧠 Types of LLM Usage in RAG

1. Pre-Training (From Scratch)

  • Build a foundation model

  • Massive compute and data required

  • Key papers: COG, TIGER


2. Fine-Tuning

Base Model + Domain Data = Specialized Model

  • Key papers: Self-RAG, FLARE


3. Inference-Time RAG (Most Common)

Pretrained LLM + Vector DB + Retriever

  • Key papers: CRAG, Iter-RETGEN


🚀 Why RAG Matters

  • Keeps LLMs up-to-date

  • Enables enterprise & domain-specific AI

  • Reduces hallucinations

  • Avoids costly full model retraining

  • Improves explainability & trust


🧩 Final Thought

RAG is not just an architecture—it’s a bridge between static language models and living enterprise knowledge. As data grows and domains evolve, RAG enables AI systems to stay accurate, scalable, and context-aware.

If you’re building production-grade GenAI systems, RAG is no longer optional—it’s foundational.
