Artificial Intelligence by Leela Prasad: Retrieval-Augmented Generation (RAG): A Practical Architecture Guide with Diagrams

Large Language Models (LLMs) are powerful, but they have a fundamental limitation: they only know what was in their training data. Anything published after the training cutoff, or private to your organization, is invisible to them. Retrieval-Augmented Generation (RAG) addresses this by combining external knowledge retrieval with LLM-based generation, enabling responses that are grounded, up-to-date, and domain-specific.

This article walks through a simple yet complete RAG architecture, explaining each stage clearly with diagrams and real-world intuition.

🔷 High-Level RAG Architecture

At a high level, RAG consists of five major stages:

Data Ingestion
Vector Databases & Embeddings
Retrieval
Augmentation
Generation

1️⃣ Data Ingestion: From Raw Data to Chunks

Supported Data Types

RAG systems can ingest data from multiple modalities:

Text: PDF, HTML, JSON, CSV, XML, DOCX, PPT
Images: JPG, PNG, GIF
Audio
Video

Ingestion Flow

Documents are crawled or loaded
Content is cleaned and normalized
Documents are split into chunks
Each chunk is sent to an Embedding Model
Resulting embeddings are stored in a Vector Database
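
The "cleaned and normalized" step in the flow above can be sketched as follows. This is a minimal illustration that assumes HTML-like input; the function name clean_text and the specific rules (tag stripping, entity unescaping, Unicode NFKC normalization, whitespace collapsing) are assumptions chosen for the example, not a prescribed pipeline. A production system would more likely use a real HTML or PDF parser than a regular expression.

```python
import html
import re
import unicodedata


def clean_text(raw: str) -> str:
    """Normalize raw document text before chunking (illustrative rules only)."""
    text = re.sub(r"<[^>]+>", " ", raw)          # drop HTML-like tags (crude)
    text = html.unescape(text)                   # "&amp;" -> "&", "&nbsp;" -> NBSP
    text = unicodedata.normalize("NFKC", text)   # unify equivalent Unicode forms
    text = re.sub(r"\s+", " ", text)             # collapse runs of whitespace
    return text.strip()


print(clean_text("<p>Hello&nbsp;&amp;   welcome</p>\n"))
```

Tags are stripped before entities are unescaped so that escaped angle brackets in ordinary text (such as "a &lt; b") are not mistaken for markup.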

📌 Why chunking?
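LLMs and embedding models both have input length limits, so long documents cannot be embedded or prompted whole. Chunking gives:

Better semantic matching: each embedding represents one focused passage instead of an entire document
Faster retrieval and generation: only the few relevant passages are passed to the LLM, not whole documents
Higher answer accuracy: less irrelevant text competes with the passage that holds the answer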
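
One common way to chunk is a fixed-size window with some overlap, so that a sentence cut at a boundary still appears whole in a neighboring chunk. The sketch below splits on whitespace-separated words; the function name chunk_words and the default sizes are assumptions for illustration, and real systems often split on tokens, sentences, or document structure instead.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into word-based chunks that share `overlap` words with the previous chunk."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap          # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break                        # the last window already reached the end
    return chunks


for chunk in chunk_words("a b c d e f g h i j", chunk_size=4, overlap=1):
    print(chunk)
```

Larger chunks keep more context together but make each embedding less specific; the right size depends on the documents and the embedding model, so it is usually tuned empirically.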

2️⃣ Vector Databases & Embeddings

What is a Vector?

A vector is a numeric representation of data in high-dimensional space.

Example:


"My name is Leela"
→ [0.8, 0.9, 0.5, 0.3]

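Text, images, or audio are transformed into numbers that capture semantic meaning. The four values above are only illustrative; real embedding models produce vectors with hundreds or thousands of dimensions.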

How Text Becomes Numbers

Traditional and modern techniques include:

One-Hot Encoding
Bag of Words
Co-Occurrence Matrix
TF-IDF
Word2Vec
Transformer-based Embeddings (modern standard)

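An Embedding Model converts text into dense vectors. The first four techniques above produce sparse, count-based vectors, while Word2Vec and transformer models learn dense vectors in which texts with similar meaning land close together.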
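
To make the older, count-based techniques concrete, here is a small sketch of Bag of Words and TF-IDF in plain Python. The helper names (tokenize, bag_of_words, tf_idf) are hypothetical, the tokenizer is a naive lowercase split, and the IDF formula is the plain log(N / df) form; libraries typically use smoothed variants, so their numbers will differ.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    return text.lower().split()          # naive tokenizer, for illustration only


def bag_of_words(docs: list[str]):
    """Return the sorted vocabulary and one count vector per document."""
    vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
    vectors = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


def tf_idf(docs: list[str]):
    """Weight each count by how rare its term is across the collection."""
    vocab, counts = bag_of_words(docs)
    n_docs = len(docs)
    doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n_docs / df) for df in doc_freq]
    weighted = []
    for row in counts:
        total = sum(row) or 1            # avoid dividing by zero for empty docs
        weighted.append([(c / total) * w for c, w in zip(row, idf)])
    return vocab, weighted


vocab, weights = tf_idf(["the cat sat", "the dog sat down"])
print(vocab)
print(weights[0])
```

Terms that appear in every document ("the", "sat") get weight zero, which shows why these vectors capture word overlap but not meaning; dense embeddings are meant to close that gap.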

Popular Embedding Models

Open Source

Models hosted on Hugging Face (for example, the sentence-transformers family)
Nomic Embed

Managed / Paid

Gemini Embedding Model
OpenAI text-embedding-3-small
Amazon Titan Embeddings

Vector Databases

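Open Source

FAISS (a similarity-search library rather than a full database)
ChromaDB
Qdrant (also offered as a managed cloud service)

Managed / Paid

Zilliz Cloud (managed Milvus)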

These databases store embeddings and allow fast semantic similarity search.
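
Conceptually, a vector database offers two operations: add an embedding with its payload, and search for the nearest stored embeddings. The class below is a toy, in-memory sketch of that interface using brute-force cosine similarity. The class and method names are hypothetical and do not correspond to the API of FAISS, ChromaDB, Qdrant, or any other product; real systems typically use approximate nearest-neighbor indexes to stay fast at scale.

```python
import math


class InMemoryVectorStore:
    """Toy vector store: keeps everything in a list and scans it on every search."""

    def __init__(self):
        self._items = []                 # (doc_id, vector, text) tuples

    def add(self, doc_id: str, vector: list[float], text: str) -> None:
        self._items.append((doc_id, list(vector), text))

    @staticmethod
    def _cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    def search(self, query_vector: list[float], k: int = 3):
        """Return the k best (score, doc_id, text) tuples, highest score first."""
        scored = [
            (self._cosine(query_vector, vec), doc_id, text)
            for doc_id, vec, text in self._items
        ]
        scored.sort(key=lambda item: item[0], reverse=True)
        return scored[:k]


store = InMemoryVectorStore()
store.add("a", [1.0, 0.0], "about topic X")
store.add("b", [0.0, 1.0], "about topic Y")
store.add("c", [1.0, 1.0], "about both")
for score, doc_id, text in store.search([1.0, 0.1], k=2):
    print(doc_id, round(score, 3), text)
```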

3️⃣ Retrieval: Finding the Right Context

Dense Passage Retrieval (DPR)

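DPR uses a bi-encoder: two separate encoders trained so that a question and the passages that answer it land close together in vector space.

Query Encoder → converts the user question into an embedding at query time
Passage Encoder → converts each chunk into an embedding at ingestion time; these are the embeddings already stored in the vector DB

The query embedding is then compared against the stored passage embeddings using a similarity metric.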

Similarity Techniques

Cosine Similarity (most common)
Dot Product
Euclidean Distance (a distance, so smaller means more similar)
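
The three metrics can be written in a few lines of plain Python. This is a sketch for intuition, with hypothetical function names; production code would normally rely on a vectorized library or on the vector database itself.

```python
import math


def _check(a: list[float], b: list[float]) -> None:
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")


def dot_product(a: list[float], b: list[float]) -> float:
    _check(a, b)
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product divided by both vector lengths: compares direction only."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot_product(a, b) / (norm_a * norm_b)


def euclidean_distance(a: list[float], b: list[float]) -> float:
    _check(a, b)
    return math.dist(a, b)               # straight-line distance (Python 3.8+)


a, b = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]   # b points the same way as a, but is twice as long
print(dot_product(a, b))
print(cosine_similarity(a, b))
print(euclidean_distance(a, b))
```

The example vectors point in the same direction, so cosine similarity treats them as essentially identical even though their Euclidean distance is not zero. When all vectors are normalized to unit length, the three metrics produce the same ranking.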

Retrieval Example


Query Embedding → Q1

Stored Documents:
V11 → Similarity 0.4
V12 → Similarity 0.6
V13 → Similarity 0.8

✅ V13 is the most relevant document and is retrieved.
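
Selecting the best matches from the scores above is a top-K selection. A minimal sketch, using the example's similarity values and a hypothetical helper name top_k:

```python
import heapq

# Similarity of each stored document to the query Q1, from the example above.
scores = {"V11": 0.4, "V12": 0.6, "V13": 0.8}


def top_k(similarities: dict[str, float], k: int) -> list[tuple[str, float]]:
    """Return the k (doc_id, score) pairs with the highest similarity, best first."""
    return heapq.nlargest(k, similarities.items(), key=lambda pair: pair[1])


print(top_k(scores, 1))   # [('V13', 0.8)]
print(top_k(scores, 2))   # [('V13', 0.8), ('V12', 0.6)]
```

With K = 1 only V13 is returned; in practice K is usually larger so the LLM sees several candidate passages.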

Extractive vs Abstractive Retrieval

Extractive: Returns exact text spans from the source
Abstractive: Rephrases or summarizes the retrieved text while preserving meaning (in RAG, this is what the generation step does)

Parametric vs Non-Parametric Memory

Memory Type	Description
Parametric	Knowledge stored inside model weights (e.g., BART, GPT pre-training)
Non-Parametric	External knowledge (Vector DB, documents, APIs)

📌 RAG combines both:

Parametric → language understanding
Non-parametric → factual grounding

4️⃣ Augmentation: Building the Final Prompt

Augmentation combines:

User Query
Retrieved Documents (Context)
Prompt Instructions

Example prompt structure:


System Prompt: You are a domain expert…
Context: Retrieved documents
User Question: ...
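
The prompt structure above can be assembled with a small helper. This is a sketch: the function name build_prompt, the numbering of context passages, and the default system prompt wording are assumptions for illustration, since the original system prompt is not given in full.

```python
# Hypothetical default; replace with the instructions your application needs.
DEFAULT_SYSTEM_PROMPT = (
    "You are a domain expert. Answer using only the context provided. "
    "If the context does not contain the answer, say so."
)


def build_prompt(question: str, contexts: list[str],
                 system_prompt: str = DEFAULT_SYSTEM_PROMPT) -> str:
    """Combine instructions, retrieved passages, and the user question into one prompt."""
    numbered = "\n".join(f"[{i}] {text}" for i, text in enumerate(contexts, start=1))
    return (
        f"System Prompt: {system_prompt}\n\n"
        f"Context:\n{numbered}\n\n"
        f"User Question: {question}"
    )


print(build_prompt(
    "What does RAG combine?",
    ["RAG combines retrieval with generation.", "Chunks are embedded and stored."],
))
```

Numbering the passages makes it possible to ask the model to cite which passage supports each claim.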

This step controls:

Tone
Structure
Output format
Level of detail

5️⃣ Generation: LLM Produces the Answer

The LLM generates text using three inputs:

Query
Retrieved Context
Prompt Instructions

The output is:

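Grounded in the retrieved enterprise data
More accurate on questions the retrieved context covers
Less prone to hallucination, though not immune when retrieval returns irrelevant or incomplete context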

🔁 End-to-End Retrieval Flow

User submits a query
Query → embedding model
Vector DB performs similarity search
Top-K documents retrieved
Optional re-ranking
Context + query + prompt → LLM
Final response generated
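
The whole flow can be sketched end to end in a few dozen lines. Everything here is a stand-in chosen for illustration: embed is a toy word-count "embedding" rather than a real model, retrieval is a brute-force scan rather than a vector DB query, stub_llm replaces a real LLM call, and the optional re-ranking step is omitted.

```python
import math
from collections import Counter
from typing import Callable


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return Counter(cleaned.split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[term] * b[term] for term in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Embed the query, score every chunk, and keep the top k."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda chunk: cosine(query_vec, embed(chunk)), reverse=True)
    return ranked[:k]


def answer(query: str, chunks: list[str], llm: Callable[[str], str], k: int = 2) -> str:
    """Retrieve context, build the augmented prompt, and hand it to the LLM."""
    context = retrieve(query, chunks, k)
    prompt = "Context:\n" + "\n".join(context) + "\n\nUser Question: " + query
    return llm(prompt)


def stub_llm(prompt: str) -> str:
    """Placeholder for a real LLM call: echoes the first context line."""
    return "Based on the context: " + prompt.splitlines()[1]


chunks = [
    "FAISS and ChromaDB store embeddings.",
    "Chunking splits documents into smaller pieces.",
    "Cosine similarity compares two vectors.",
]
print(answer("Why split documents with chunking?", chunks, stub_llm, k=1))
```

Passing the LLM in as a callable keeps retrieval and generation decoupled, so either side can be swapped for a real embedding model, vector database, or model API without changing the other.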

🧠 Types of LLM Usage in RAG

1. Pre-Training (From Scratch)

Build a foundation model
Massive compute and data required
Key papers: COG, TIGER

2. Fine-Tuning


Base Model
   +
Domain Data
   =
Specialized Model

Key papers: Self-RAG, FLARE

3. Inference-Time RAG (Most Common)


Pretrained LLM
   +
Vector DB
   +
Retriever

Key papers: CRAG, Iter-RETGEN

🚀 Why RAG Matters

Keeps LLMs up-to-date
Enables enterprise & domain-specific AI
Reduces hallucinations
Avoids costly full model retraining
Improves explainability & trust, since answers can point back to their source passages

🧩 Final Thought

RAG is not just an architecture—it’s a bridge between static language models and living enterprise knowledge. As data grows and domains evolve, RAG enables AI systems to stay accurate, scalable, and context-aware.

If you’re building production-grade GenAI systems, RAG is no longer optional—it’s foundational.


Thursday, 25 December 2025
