Artificial Intelligence by Leela Prasad

Sunday, 7 June 2026

Significance of Fallback Mechanism

The Core Concept: Reliability in Production While many focus heavily on choosing the most powerful models when building AI agents, the most critical requirement in enterprise production environments is reliability. AI systems must continue functioning seamlessly even when a model fails, becomes slow, or returns uncertain results.

The Risk of Single-Model Dependency

Early AI systems often relied on a single model for every request.
This is a massive operational risk in a production system. If the model provider (like OpenAI or Anthropic) experiences an API outage or latency spikes, the entire application stops functioning.

What is Fallback Architecture? A fallback architecture introduces backup pathways into the system. If the primary model fails or produces a low-quality, unreliable response, the system automatically redirects the request to an alternative model or workflow. This ensures service continuity, fault tolerance, and high uptime.

A Simple Fallback Pipeline

The user request is sent to the primary model.
The system evaluates the response to check if it is valid, accurate, and returned within an acceptable latency window.
If it fails these checks, the system automatically triggers a fallback model to process the request and deliver the final response to the user.

Common Fallback Strategies There are several ways to implement fallback architectures in AI systems:

Provider Fallback: Switching to a different vendor completely (e.g., if OpenAI's GPT fails, the system immediately switches to Anthropic's Claude). This prevents single-vendor dependency.
Model Size Fallback: Switching between large and small models (e.g., routing a failed query from a small model to a massive reasoning model to fix the error).
Validation Fallback: Using a secondary model specifically to verify the output of the primary model.
Tool Fallback: Automatically switching away from the AI model entirely and using deterministic APIs or hardcoded tools to complete the task.
Human-in-the-Loop (HITL) Escalation: In high-risk environments (like healthcare, legal, or finance), if the automated AI response is insufficient or dangerous, the task is escalated to a human expert for final judgment and accountability.

The Golden Rule of System Design I conclude with a fundamental principle for AI architects: Never depend on a single model for critical systems. Always design with built-in redundancy, validation mechanisms, and fallback pathways.

SLM: Small Language Models

What are SLMs?

Lightweight Efficiency: Small Language Models (SLMs) are optimized for efficiency rather than massive parameter sizes. They require less memory, consume less processing power, and produce responses much faster than Large Language Models (LLMs).
The LLM vs. SLM Divide: While LLMs are built for complex reasoning, multi-step problem solving, and planning, they are computationally heavy and expensive. SLMs are built for speed and handling simpler, routine tasks that do not require deep logical reasoning.

Ideal Tasks for SLMs If an enterprise used a massive LLM for every single action, the system would become incredibly slow and expensive. Instead, SLMs are perfectly suited for:

Routing requests to the correct agent.
Classifying inputs and determining user intent.
Extracting structured information from text.
Categorizing documents and summarizing short messages.

Enterprise Architecture & "Model Cascading"

Enterprise Scalability: Organizations processing thousands or millions of requests daily rely on SLMs to handle routine operations cheaply, reducing the workload and latency on their expensive large models.
Model Cascading Strategy: Modern agentic systems use a layered architectural pattern called "Model Cascading." When a user sends a request, it hits an SLM first. If the task is simple, the SLM handles it immediately. Only if the task requires deep reasoning does the SLM escalate the request to the larger, more expensive LLM.

Popular SLM Options Several highly capable small models are commonly used for routing and classification in modern pipelines, including:

Phi (Microsoft)
Gemma (Google)
Lightweight variants of Llama (Meta) and Mistral

The Golden Rule of AI Architecture The most important takeaway for designing AI systems is to always use the smallest model capable of solving the task. Large models should be strictly reserved for complex decision-making, while SLMs handle the high-speed, high-volume routing.

Saturday, 6 June 2026

Architecture - Choose right model for right task

Choose the Right Model When selecting an LLM for an agentic system, architects evaluate 5 key factors:

Reasoning Capability: Does the task need deep logic, or just simple classification?
Performance/Latency: Does the application require extremely fast responses?
Cost Considerations: Is the model affordable to run at scale (inference and token costs)?
Data Privacy: Does the data need to remain on-premise without touching external clouds?
Enterprise Integration: How well does the model fit into the existing workflow?

There is No Single Best Model A "one-size-fits-all" model does not exist. Instead of seeking a universal best model, modern AI systems use a Combined Pipeline (Model Stack). For instance, smaller models handle cheap, fast tasks like routing and classification, while large, expensive models are reserved strictly for deep reasoning and planning.

Based on the architectural principles outlined above, the key is to balance cost, speed, and intelligence.

For the below example architectural flow, lets identify the Model for each of the AI Agents.

1. Smaller, Fast Models (e.g., Llama 3 8B, Claude Haiku, GPT-4o-mini)

These models are used for "cheap, fast tasks like routing and classification." In the architecture, they handle structural, repetitive, or low-complexity tasks where speed is critical and the risk of hallucination is managed by strict constraints.

The Orchestrator / Router: Before any deep thinking happens, the stateful LangGraph orchestrator needs to classify the user's intent (e.g., "Is this a risk query or a daily status update?"). A small, fast model can instantly classify the intent and route the request to the correct agent.
Integration Analysts (Jira & ServiceNow Agents): These agents have a single, dedicated job: extract structured signals from external tickets. Because this task is purely data extraction and mapping (often into JSON schemas), a small model is perfectly capable and significantly cheaper.
The Communication Agent: This agent's job is to take raw technical logs and dependencies and translate them into a clear, easy-to-read summary for stakeholders. Summarization is a standard capability that small models handle exceptionally well at low latency.
Governance & Evaluator Agents (Initial Pass): Small models can act as fast classifiers to check if an output violates a corporate policy, contains Personally Identifiable Information (PII), or fails a basic compliance rule before passing it along.

2. Large, Expensive Models (e.g., GPT-4, Claude 3.5 Sonnet)

These models are reserved strictly for "deep reasoning and planning." They are invoked only when the system needs to process ambiguity, synthesize complex contexts, or make high-stakes decisions.

The Planner Agent: When a Program Manager asks a vague question like, "Look into the next sprint and tell me the risks," the Planner Agent must break that ambiguous request down into a logical sequence of actionable steps without executing anything. This requires high-level cognitive planning, making it the perfect job for a large model.
Risk & Dependency Agents: These agents run parallel analyses to figure out "what could go wrong," detect hidden upstream/downstream blockers, and evaluate cross-team impacts. Identifying hidden correlations across hundreds of Jira tickets requires the deep, sophisticated reasoning capabilities of a massive frontier model.
The Action Agent: This agent safely executes tasks, but only after synthesizing all the risk analyses and dependency checks. Because its decisions have actual consequences, it requires the high accuracy and reasoning power of a large model.

The Architectural Takeaway

By designing the system this way, the architecture protects operational costs. The system uses a cheap $0.10/million token model to pull the Jira tickets, format the data, and summarize the final output, and only pays the premium $10.00/million token cost for the specific nodes (Risk, Dependency, Planner) that actually need to "think" about the data.

Modern LLMs comparision

The Rapid Growth of LLMs The field of Large Language Models (LLMs) has evolved dramatically, with multiple organizations developing powerful models. This has created a rich ecosystem with diverse capabilities, meaning AI architects now have plenty of options but must carefully evaluate which model fits their specific needs.

The 4 Major LLM Families each with unique strengths and design philosophies:

GPT (OpenAI):
- Strengths: Exceptional reasoning capabilities, high-quality natural language generation, and excellent support for tool usage and workflows.
- Use Cases: Complex problem-solving, planning, and agent-based enterprise systems.
- Trade-off: Can be expensive to operate at a large scale.
Claude (Anthropic):
- Strengths: Designed with a heavy focus on safety, alignment, and responsible AI. It excels at processing extremely long contexts.
- Use Cases: Document-heavy enterprise workflows, knowledge assistants, and analyzing massive policy materials.
Llama (Meta):
- Strengths: It is an open-source (open-weight) model, providing maximum flexibility and control.
- Use Cases: On-premise deployment, local infrastructure running, and custom fine-tuning to keep sensitive data strictly private.
- Eg: Llama 3.1 8B - Best for chatbots to serve thousands of concurrent users cost-efficiently.
  Llama 3.3 70B - If response quality is critical — e.g., internal knowledge assistants, HR/legal Q&A
Qwen (Alibaba):
- Strengths: Highly optimized for strong multilingual capabilities.
- Use Cases: Global applications, international AI ecosystems, and conversational agents that must operate across diverse languages.

Choose the Right Model When selecting an LLM for an agentic system, architects evaluate 5 key factors:

Reasoning Capability: Does the task need deep logic, or just simple classification?
Performance/Latency: Does the application require extremely fast responses?
Cost Considerations: Is the model affordable to run at scale (inference and token costs)?
Data Privacy: Does the data need to remain on-premise without touching external clouds?
Enterprise Integration: How well does the model fit into the existing workflow?

Key Takeaway: There is No Single Best Model A "one-size-fits-all" model does not exist. Instead of seeking a universal best model, modern AI systems use a Combined Pipeline (Model Stack). For instance, smaller models handle cheap, fast tasks like routing and classification, while large, expensive models are reserved strictly for deep reasoning and planning.

Sunday, 31 May 2026

Guardrail Placement: A Multi-Layered Approach

Most people feel, placing guardrails exclusively at the output stage. This wastes API compute money on bad queries, exposes the system to dangerous reasoning states, and adds noticeable latency

I Propose a Multi-Layered Framework:

Layer 1 (Input/Pre-processing): Implement cheap, fast checks (regex or light classifiers) for prompt injections, toxicity, and PII before spending money on the core agent execution

Layer 2 (Tool Level): Verify parameters before a tool is called, restricting dangerous commands (like blocking "DROP" SQL commands or capping row returns).

Layer 3 (Reasoning/Execution): Monitor the agent while it runs to prevent infinite loops, excessive tool calls, or runaway costs

Layer 4 (Output): Serve as the final defense to ensure no PII or system prompts leaked, accepting higher latency only as a final check.

Gliner API's can be used for Guard Rails, it detects Personally Identifiable Information in text. RegEx can also be used in the Inputlayer

Cost Optimization for AI Applications

Cost Optimization: Analyzing Before Optimizing

First step, Dig into Logs and Analyze the cost breakdown. Upon Analyzing logs, new insights can be found.

Eg: After cost breakdown Analysis if below is the finding. $8000/month

Model cost: $2,400 - 30%

Embeddings: $2000 - 25%

Vector Databases: $1200 - 15%

Wasted Compute: $2400 - 30%

Cost Breakdown Analysis: You must identify where the money is going by analyzing:

Embedding Costs: Are you embedding redundant metadata (headers, footers) unnecessarily? For some applications, Metadata is key and this cannot be applicable.
Vector Database Costs: Are you over-indexing or storing unnecessary historical document versions? Eg: If 10 versions of a document are stored in Vector DB and when only the latest version is considered during the retrieval. It's unnecessary to store older versions and it can be replaced with latest document.
LLM Inference Costs: Analyze token distribution (input context length vs. output response length). If the response is longer than necessary, this can be taken care by adjusting the prompt as necessary.

Architectural Optimizations: You can cut 40-70% of costs before even touching the model choice by implementing :

Semantic Caching: Deduplicating similar queries (e.g., password resets) rather than repeatedly processing them. Semantic Caching is a strategy used to optimize performance and reduce costs in LLM-powered applications by storing and retrieving meanings (vectors) rather than exact text strings. Semantic caching leverages your existing Vector Database (or a specialized cache layer) to perform "similarity lookups" before calling the LLM.
Input Filtering: Using simple rules for basic greetings instead of full LLM reasoning.
Context Window Reduction: Limiting retrieved documents to only the top 3 most relevant results instead of 10.
Batching: Processing document embeddings in large batches rather than one at a time. Most embedding APIs support Batching. Eg: Embed 100 chunks as a Batch per API call.

Tiered Model Routing: Only after architecture is optimized should you route simple tasks to cheaper models (like GPT-3.5) and reserve expensive models (GPT-4) strictly for complex reasoning.

Saturday, 16 May 2026

Building a ReAct Agent with LangGraph & LangSmith

In this post, I walk through building a ReAct (Reasoning + Acting) agent using LangGraph and Groq's openai/gpt-oss-120b model, where the LLM dynamically decides when to call a Wikipedia tool to answer factual questions.

The agent is wired as a stateful graph — looping between reasoning and tool use until it has a confident answer. To trace every LLM call, tool invocation, and token usage in real time, LangSmith is integrated by passing a LangChainTracer directly via config={"callbacks": [tracer]} — a more reliable approach than relying on environment variables in Jupyter notebooks.

Check out the full code and setup on GitHub.

Source: https://www.youtube.com/watch?v=3h4Wc19LhL8