Traditional RAG is a one-shot process: retrieve documents, generate answer, done. Agentic RAG breaks this limitation—agents can evaluate retrieval quality, reformulate queries, and iterate until they find what they need. Combined with human-in-the-loop patterns, this yields systems that are both autonomous and controllable.
The Retrieval Paradox
Retrieval-Augmented Generation promised to solve hallucination by grounding LLM responses in external documents. The idea was elegant: instead of relying on potentially outdated training data, fetch relevant documents at query time and use them as context.
But a fundamental tension emerged. RAG systems are only as good as their retrieval step. If the vector search returns irrelevant documents, the LLM either ignores them (hallucinating anyway) or incorporates misleading information (hallucinating with false confidence).
Traditional RAG treats retrieval as a single, infallible step. Query goes in, documents come out, generation happens. There’s no feedback loop, no quality check, no opportunity to try again. This retrieve-once assumption breaks down in practice because:
- Queries are often ambiguous or poorly phrased
- Embedding models don’t perfectly capture semantic similarity
- The right answer might require information from multiple retrieval strategies
- Sometimes no relevant documents exist, and the system should say so
Agentic RAG addresses this by adding agency to the retrieval process itself. The agent can examine what it retrieved, judge quality, reformulate queries, and iterate until it has what it needs—or explicitly acknowledge when it doesn’t.
The Limits of Static RAG
Standard RAG pipelines follow a fixed path:
graph LR
A[Query] --> B[Retrieve]
B --> C[Generate]
C --> D[Answer]
classDef blueClass fill:#4A90E2,stroke:#333,stroke-width:2px,color:#fff
class A,B,C,D blueClass
Problems emerge when:
- Retrieved documents don’t answer the question
- The query is ambiguous or poorly phrased
- Multiple retrieval attempts are needed
- The agent needs to reason about which sources to use
Agentic RAG addresses these with retrieval loops, self-correction, and intelligent source selection.
Agentic RAG Patterns
The shift from static to agentic RAG isn’t just about adding retry logic—it’s about giving the system the capacity to reason about its own retrieval. Three patterns dominate this space:
| Pattern | Description | When to Use |
|---|---|---|
| Self-Correcting | Evaluate retrieval quality, reformulate and retry if poor | Default for most applications |
| Multi-Source | Route to different stores based on query type | When you have specialized knowledge bases |
| Adaptive Retrieval | Dynamically adjust k, similarity threshold, or strategy | High-precision requirements |
Self-Correcting Retrieval
The self-correction pattern uses the LLM as a judge of its own retrieval. This creates a feedback loop: retrieve → grade → reformulate → retrieve again. The key insight is that LLMs are surprisingly good at evaluating whether documents are relevant to a question, even when they can’t answer the question directly.
The core pattern involves four nodes: retrieve, grade, reformulate, and generate. The grading step is the key innovation—it uses the LLM to evaluate whether retrieved documents actually help answer the query:
The grading node is an ordinary function, grade_relevance, that takes the current RAG state and returns an updated relevance score.
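A minimal sketch of that node, assuming a RAGState with query, documents, relevance_score, attempts, and answer fields and an OpenAI chat model as the judge (the field names, model choice, and prompt wording are illustrative, not a fixed API):

```python
from typing import List, TypedDict

from langchain_openai import ChatOpenAI


class RAGState(TypedDict):
    query: str
    documents: List[str]
    relevance_score: float
    attempts: int
    answer: str


llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)


def grade_relevance(state: RAGState) -> dict:
    """LLM-as-judge: score how well the retrieved documents answer the query."""
    docs = "\n\n".join(state["documents"]) or "(no documents retrieved)"
    prompt = (
        "On a scale from 0.0 to 1.0, how relevant are these documents to the "
        f"question?\n\nQuestion: {state['query']}\n\nDocuments:\n{docs}\n\n"
        "Reply with only the number."
    )
    reply = llm.invoke(prompt).content.strip()
    try:
        score = float(reply)
    except ValueError:
        score = 0.0  # unparseable reply: treat as irrelevant so the loop retries
    return {"relevance_score": score}
```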
The routing logic is straightforward: if documents are relevant (score ≥ 0.7), proceed to generation. Otherwise, reformulate the query and try again—up to a maximum number of attempts to prevent infinite loops.
graph TD
A[START] --> B[Retrieve]
B --> C[Grade Relevance]
C --> D{Score >= 0.7?}
D -->|Yes| E[Generate]
D -->|No| F{Max Attempts?}
F -->|No| G[Reformulate Query]
G --> B
F -->|Yes| E
E --> H[END]
classDef blueClass fill:#4A90E2,stroke:#333,stroke-width:2px,color:#fff
classDef orangeClass fill:#F39C12,stroke:#333,stroke-width:2px,color:#fff
classDef greenClass fill:#27AE60,stroke:#333,stroke-width:2px,color:#fff
class A,H greenClass
class B,E,G blueClass
class C,D,F orangeClass
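Wiring this loop in LangGraph might look like the sketch below, which continues from the grade_relevance function and RAGState above; retrieve, reformulate_query, and generate are stand-in nodes, and MAX_ATTEMPTS is an assumed retry cap:

```python
from langgraph.graph import END, START, StateGraph

MAX_ATTEMPTS = 3  # assumed cap on reformulation retries


# Stand-in nodes; real versions would call a vector store and the LLM.
def retrieve(state: RAGState) -> dict:
    return {"documents": ["..."], "attempts": state.get("attempts", 0) + 1}


def reformulate_query(state: RAGState) -> dict:
    return {"query": state["query"] + " (rephrased)"}


def generate(state: RAGState) -> dict:
    return {"answer": "..."}


def route_after_grading(state: RAGState) -> str:
    if state["relevance_score"] >= 0.7 or state["attempts"] >= MAX_ATTEMPTS:
        return "generate"      # good enough, or out of retries: answer anyway
    return "reformulate"       # rewrite the query and try retrieval again


builder = StateGraph(RAGState)
builder.add_node("retrieve", retrieve)
builder.add_node("grade", grade_relevance)
builder.add_node("reformulate", reformulate_query)
builder.add_node("generate", generate)
builder.add_edge(START, "retrieve")
builder.add_edge("retrieve", "grade")
builder.add_conditional_edges("grade", route_after_grading)
builder.add_edge("reformulate", "retrieve")
builder.add_edge("generate", END)
graph = builder.compile()
```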
Multi-Source RAG
Real knowledge bases aren’t monolithic. Documentation lives in one place, code examples in another, API references in a third. Multi-source RAG routes queries to the appropriate stores:
A classify_query node inspects the question and records which store should serve it.
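One possible shape for this router, assuming three stores (docs, code_examples, api_reference) behind separate retrieval nodes; the labels, state fields, and node names are illustrative:

```python
from typing import List, TypedDict

from langchain_openai import ChatOpenAI

SOURCES = {"docs", "code_examples", "api_reference"}


class MultiSourceState(TypedDict):
    query: str
    source: str
    documents: List[str]


llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)


def classify_query(state: MultiSourceState) -> dict:
    """Decide which knowledge base should serve this query."""
    prompt = (
        "Classify the query into exactly one of: docs, code_examples, api_reference.\n"
        f"Query: {state['query']}\nAnswer with just the label."
    )
    label = llm.invoke(prompt).content.strip().lower()
    return {"source": label if label in SOURCES else "docs"}  # default to docs


def route_to_source(state: MultiSourceState) -> str:
    # Used with add_conditional_edges: each label points at a different
    # retrieval node (semantic search for docs, keyword search for code, ...).
    return f"retrieve_{state['source']}"
```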
This pattern particularly helps when different sources require different retrieval strategies—documentation might use semantic search while code examples might benefit from keyword matching.
Human-in-the-Loop Patterns
Autonomous agents are powerful, but autonomy comes with risk. An agent that can send emails can send wrong emails. One that can modify databases can corrupt data. One that can execute code can introduce vulnerabilities.
The question isn’t whether to add human oversight, but where and when. Too much oversight defeats the purpose of automation; too little creates liability. Human-in-the-loop patterns provide structured ways to insert human judgment at critical decision points while letting the agent handle routine operations independently.
The Spectrum of Autonomy
Different actions warrant different levels of oversight:
| Risk Level | Examples | Pattern |
|---|---|---|
| Low | Search, read, analyze | Fully autonomous |
| Medium | Draft content, propose changes | Review optional |
| High | Send emails, modify records | Require approval |
| Critical | Delete data, financial transactions | Multi-person approval |
LangGraph’s interrupt_before mechanism pauses execution at specified nodes, saves state, and waits for external input. This isn’t just a simple pause—the full execution context is preserved, allowing the human to inspect what led to this point and make an informed decision.
Basic Interrupt
The implementation requires a checkpointer (for state persistence) and the interrupt_before parameter specifying which nodes should pause for approval:
The graph is compiled with an interrupt before the execute node, so the agent stops and waits before carrying out the action.
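A sketch of that flow, assuming a builder whose graph contains an execute node; the node name, thread_id, and input message are placeholders:

```python
from langgraph.checkpoint.memory import MemorySaver

checkpointer = MemorySaver()

# Pause just before the "execute" node so a human can approve the pending action.
graph = builder.compile(checkpointer=checkpointer, interrupt_before=["execute"])

config = {"configurable": {"thread_id": "order-42"}}

# Runs until the interrupt and stops with "execute" as the next node.
graph.invoke({"messages": [("user", "Cancel order 42")]}, config)
print(graph.get_state(config).next)   # ('execute',)

# After the human approves, resume the same thread by passing None as the input.
graph.invoke(None, config)
```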
The thread_id in the config is crucial—it links the resume call to the paused execution. Without it, the system wouldn’t know which interrupted workflow to continue.
Conditional Human Review
Requiring approval for every action defeats the purpose of automation. Smart systems route based on risk level:
A HIGH_RISK list names the action types that always require sign-off, such as delete, send_email, transfer_funds, and modify_database; every proposed action is checked against it.
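A possible version of that check, continuing from the interrupt example above and assuming the state tracks a pending_action string and the graph has plan, auto_approve, and wait_approval nodes (all names are illustrative):

```python
HIGH_RISK = ["delete", "send_email", "transfer_funds", "modify_database"]


def assess_risk(state: dict) -> str:
    """Route the proposed action by risk level."""
    action = state["pending_action"]          # e.g. "send_email:customer@example.com"
    if any(keyword in action for keyword in HIGH_RISK):
        return "wait_approval"                # pause here for a human
    return "auto_approve"                     # low risk: keep going


builder.add_conditional_edges("plan", assess_risk)

# Only the approval node is interruptible; low-risk paths never pause.
graph = builder.compile(checkpointer=checkpointer, interrupt_before=["wait_approval"])
```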
This creates a fork in the workflow: low-risk actions flow through auto-approve and continue without interruption, while high-risk actions hit the wait_approval node and pause for human review.
Checkpointing for Persistence
Agents operate over time. A research agent might spend minutes gathering information before synthesizing a report. A customer service agent maintains context across a multi-message conversation. A workflow agent might pause for human approval for hours before resuming.
Without persistence, all this state lives in memory. If the process crashes, the server restarts, or the user closes their browser—everything is lost. Checkpointing solves this by serializing agent state at each step, enabling:
- Pause/Resume: Stop execution and continue later, even on a different machine
- Crash Recovery: Restart failed executions from the last successful state
- Time Travel: Inspect or branch from any historical state
- Multi-Turn Conversations: Maintain context across user sessions
The checkpoint contains everything needed to reconstruct the agent’s position: current state values, next node to execute, and the thread identifier linking this execution to its history.
Checkpointer Options
LangGraph provides two main checkpointer implementations:
| Checkpointer | Use Case | Persistence |
|---|---|---|
| MemorySaver | Development, testing | In-memory only |
| SqliteSaver | Production, persistence | Survives restarts |
In development, the in-memory saver is fast and needs no setup; production deployments swap in the SQLite-backed saver without changing the graph.
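Roughly, assuming the langgraph-checkpoint-sqlite package is installed and a builder from the earlier examples; the database path is arbitrary:

```python
import sqlite3

from langgraph.checkpoint.memory import MemorySaver
from langgraph.checkpoint.sqlite import SqliteSaver  # pip install langgraph-checkpoint-sqlite

# Development: in-memory, fast, but state disappears when the process exits.
dev_graph = builder.compile(checkpointer=MemorySaver())

# Production: SQLite-backed, so paused or long-running threads survive restarts.
conn = sqlite3.connect("checkpoints.db", check_same_thread=False)
prod_graph = builder.compile(checkpointer=SqliteSaver(conn))
```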
Thread Isolation and Time Travel
Each thread_id maintains completely independent state. This enables concurrent users without state collision:
Two users on two different thread IDs never see each other's state, and any thread's checkpoint history can be inspected or resumed from an earlier point.
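A sketch of both ideas, reusing a compiled graph with a checkpointer; the thread IDs and messages are placeholders:

```python
# Two users, two threads, no interference.
alice = {"configurable": {"thread_id": "alice"}}
bob = {"configurable": {"thread_id": "bob"}}

graph.invoke({"messages": [("user", "My project is called Atlas")]}, alice)
graph.invoke({"messages": [("user", "What is my project called?")]}, bob)
# Bob's thread has no knowledge of Alice's "Atlas" message.

# Time travel: list the checkpoints for one thread and resume from an older one.
history = list(graph.get_state_history(alice))   # newest first
earlier = history[-1]                            # oldest snapshot for this thread
graph.invoke(None, earlier.config)               # re-run (or branch) from that point
```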
Observability with LangSmith
Agents are notoriously difficult to debug. Unlike traditional software where you can trace execution through function calls and stack traces, agent behavior emerges from the interaction between prompts, model responses, and tool results. A bug might manifest as the agent choosing the wrong tool, misinterpreting a response, or getting stuck in a loop—none of which produce traditional error messages.
Observability means capturing enough information to understand what happened and why. For agents, this requires tracing every LLM call (with inputs and outputs), every tool invocation, every state transition, and every routing decision. LangSmith provides purpose-built infrastructure for this, but the principles apply regardless of tooling:
- Trace hierarchies: See how high-level operations decompose into sub-steps
- Input/output pairs: Inspect exactly what the model saw and produced
- Latency breakdown: Identify which steps are slow
- Token usage: Track costs per operation
- Feedback collection: Gather human ratings for continuous improvement
Enabling LangSmith tracing requires just environment variables:
Set them in the shell or in code before the agent makes its first call.
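For example (the project name is arbitrary; the API key comes from your LangSmith account):

```python
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "<your-langsmith-api-key>"
os.environ["LANGCHAIN_PROJECT"] = "agentic-rag"   # optional: group traces by project

# From here on, every LLM call, tool call, and graph step is traced automatically.
```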
For custom instrumentation without LangSmith, use callbacks:
Subclassing BaseCallbackHandler from langchain_core.callbacks provides hooks on every LLM call's start and end.
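A minimal timing handler as an illustration; the handler class and attaching it via the config dict are standard LangChain patterns, while the graph and its input are assumed from the earlier examples:

```python
import time

from langchain_core.callbacks import BaseCallbackHandler


class TimingCallback(BaseCallbackHandler):
    """Print latency for every LLM call the agent makes."""

    def __init__(self) -> None:
        self.starts = {}

    def on_llm_start(self, serialized, prompts, *, run_id, **kwargs):
        self.starts[run_id] = time.perf_counter()

    def on_llm_end(self, response, *, run_id, **kwargs):
        started = self.starts.pop(run_id, None)
        if started is not None:
            print(f"LLM call {run_id} took {time.perf_counter() - started:.2f}s")


# Attach per invocation through the standard config dict.
result = graph.invoke(
    {"messages": [("user", "Summarize today's tickets")]},
    config={"callbacks": [TimingCallback()]},
)
```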
Agent Evaluation
Testing agents requires fundamentally different approaches than traditional software testing. With conventional code, you can define exact expected outputs for given inputs. With agents, the “correct” output depends on model behavior, which is inherently non-deterministic and can change with model updates.
This doesn’t mean agents can’t be tested—it means we need layered testing strategies:
| Level | Tests | Purpose |
|---|---|---|
| Unit | Individual nodes in isolation | Verify data transformations |
| Integration | Node interactions with mocked LLMs | Test routing and state flow |
| Component | Real LLM calls with controlled inputs | Validate prompt effectiveness |
| End-to-End | Full agent on realistic scenarios | Confirm overall behavior |
| Regression | Golden dataset with known-good outputs | Detect behavior drift |
The key insight is that agent tests should often check for properties rather than exact outputs. Does the response mention the relevant topic? Is the retrieved document actually relevant? Did the agent use the expected tool? These property-based assertions remain valid even when the exact wording changes.
Testing Strategies
Component tests exercise a single node with a handcrafted state and assert on properties of the result.
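Two illustrative pytest-style tests, reusing grade_relevance and route_after_grading from the earlier sketches; the thresholds and state fields are the assumed ones:

```python
# Component test: a single node with a handcrafted state and a real (or stubbed) LLM.
def test_grade_relevance_flags_irrelevant_docs():
    state = {
        "query": "How do I configure checkpointing in LangGraph?",
        "documents": ["A recipe for banana bread."],
        "relevance_score": 0.0,
        "attempts": 1,
        "answer": "",
    }
    result = grade_relevance(state)
    # Property-based assertion: a bound, not an exact string match.
    assert result["relevance_score"] < 0.7


# Unit test: routing logic needs no LLM at all.
def test_routing_respects_max_attempts():
    state = {"query": "q", "documents": [], "relevance_score": 0.2,
             "attempts": 3, "answer": ""}
    # Once the retry budget is spent, we must go to generation, not loop forever.
    assert route_after_grading(state) == "generate"
```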
The key difference from traditional testing: we check for properties (contains relevant keywords, stays within bounds) rather than exact outputs (equals this specific string).
Key Takeaways
- Agentic RAG iterates: Self-correcting retrieval with query reformulation outperforms single-shot approaches.
- Grade retrieval quality: Use LLM-as-judge to evaluate document relevance and decide whether to retry.
- Human-in-the-loop adds control: Use interrupt_before to pause for approval on high-risk actions.
- Checkpointing enables persistence: Save state for resume, debugging, and multi-turn conversations.
- Observability is essential: LangSmith tracing and custom callbacks help diagnose production issues.
- Test agents differently: Component tests for individual nodes, end-to-end tests for complete flows.
Next: Multi-Agent Architecture with LangGraph - We’ll explore orchestrator patterns, agent communication, and coordinating specialized agents.