Agentic RAG and Human-in-the-Loop with LangGraph

Traditional RAG is a one-shot process: retrieve documents, generate an answer, done. Agentic RAG breaks this limitation—agents can evaluate retrieval quality, reformulate queries, and iterate until they find what they need. Combined with human-in-the-loop patterns, this yields systems that are both autonomous and controllable.

The Retrieval Paradox

Retrieval-Augmented Generation promised to solve hallucination by grounding LLM responses in external documents. The idea was elegant: instead of relying on potentially outdated training data, fetch relevant documents at query time and use them as context.

But a fundamental tension emerged. RAG systems are only as good as their retrieval step. If the vector search returns irrelevant documents, the LLM either ignores them (hallucinating anyway) or incorporates misleading information (hallucinating with false confidence).

Traditional RAG treats retrieval as a single, infallible step. Query goes in, documents come out, generation happens. There’s no feedback loop, no quality check, no opportunity to try again. This retrieve-once assumption breaks down in practice because:

  • Queries are often ambiguous or poorly phrased
  • Embedding models don’t perfectly capture semantic similarity
  • The right answer might require information from multiple retrieval strategies
  • Sometimes no relevant documents exist, and the system should say so

Agentic RAG addresses this by adding agency to the retrieval process itself. The agent can examine what it retrieved, judge quality, reformulate queries, and iterate until it has what it needs—or explicitly acknowledge when it doesn’t.

The Limits of Static RAG

Standard RAG pipelines follow a fixed path:

graph LR
    A[Query] --> B[Retrieve]
    B --> C[Generate]
    C --> D[Answer]

    classDef blueClass fill:#4A90E2,stroke:#333,stroke-width:2px,color:#fff

    class A,B,C,D blueClass

Problems emerge when:

  • Retrieved documents don’t answer the question
  • The query is ambiguous or poorly phrased
  • Multiple retrieval attempts are needed
  • The agent needs to reason about which sources to use

Agentic RAG addresses these with retrieval loops, self-correction, and intelligent source selection.

Agentic RAG Patterns

The shift from static to agentic RAG isn’t just about adding retry logic—it’s about giving the system the capacity to reason about its own retrieval. Three patterns dominate this space:

| Pattern | Description | When to Use |
| --- | --- | --- |
| Self-Correcting | Evaluate retrieval quality, reformulate and retry if poor | Default for most applications |
| Multi-Source | Route to different stores based on query type | When you have specialized knowledge bases |
| Adaptive Retrieval | Dynamically adjust k, similarity threshold, or strategy | High-precision requirements |
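
The first two patterns get full treatment in the sections below. Adaptive retrieval is the easiest to sketch: widen or tighten the search as attempts accumulate. The retriever call and state fields below are illustrative assumptions, not a fixed API:

def adaptive_retrieve(state: dict) -> dict:
    """Widen the search on each retry: more documents, looser threshold."""
    k = 4 + 2 * state["attempts"]                        # fetch more docs per retry
    threshold = max(0.5, 0.8 - 0.1 * state["attempts"])  # relax the similarity cutoff

    # `vector_store` is assumed to be a LangChain vector store configured elsewhere
    results = vector_store.similarity_search_with_relevance_scores(state["query"], k=k)
    kept = [doc for doc, score in results if score >= threshold]
    return {"documents": "\n\n".join(doc.page_content for doc in kept)}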

Self-Correcting Retrieval

The self-correction pattern uses the LLM as a judge of its own retrieval. This creates a feedback loop: retrieve → grade → reformulate → retrieve again. The key insight is that LLMs are surprisingly good at evaluating whether documents are relevant to a question, even when they can’t answer the question directly.

The core pattern involves four nodes: retrieve, grade, reformulate, and generate. The grading step is the key innovation—it uses the LLM to evaluate whether retrieved documents actually help answer the query:

def grade_relevance(state: RAGState) -> dict:
    """LLM-as-judge for retrieval quality."""
    prompt = f"""Rate relevance of these documents to the query (0.0-1.0):
Query: {state['query']}
Documents: {state['documents'][:2000]}"""

    score = float(llm.invoke(prompt).content)
    return {"relevance_score": score}

def should_retry(state: RAGState) -> str:
    """Route based on relevance score."""
    if state["relevance_score"] >= 0.7:
        return "generate"
    if state["attempts"] >= 3:
        return "generate"  # Give up after max attempts
    return "reformulate"

The routing logic is straightforward: if documents are relevant (score ≥ 0.7), proceed to generation. Otherwise, reformulate the query and try again—up to a maximum number of attempts to prevent infinite loops.

graph TD
    A[START] --> B[Retrieve]
    B --> C[Grade Relevance]
    C --> D{Score >= 0.7?}
    D -->|Yes| E[Generate]
    D -->|No| F{Max Attempts?}
    F -->|No| G[Reformulate Query]
    G --> B
    F -->|Yes| E
    E --> H[END]

    classDef blueClass fill:#4A90E2,stroke:#333,stroke-width:2px,color:#fff
    classDef orangeClass fill:#F39C12,stroke:#333,stroke-width:2px,color:#fff
    classDef greenClass fill:#27AE60,stroke:#333,stroke-width:2px,color:#fff

    class A,H greenClass
    class B,E,G blueClass
    class C,D,F orangeClass
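
Wiring that loop into a LangGraph StateGraph might look like the sketch below. Only grade_relevance and should_retry come from the snippet above; the RAGState schema, the retrieve, reformulate, and generate nodes, and the llm and vector_store objects are assumptions filled in for illustration:

from typing import TypedDict

from langgraph.graph import StateGraph, START, END

# `llm` and `vector_store` are assumed to be configured elsewhere
# (e.g. a chat model and a Chroma/FAISS vector store).

class RAGState(TypedDict):
    query: str
    documents: str        # concatenated retrieved text
    relevance_score: float
    attempts: int
    generation: str

def retrieve(state: RAGState) -> dict:
    docs = vector_store.similarity_search(state["query"], k=4)
    return {"documents": "\n\n".join(d.page_content for d in docs)}

def reformulate(state: RAGState) -> dict:
    better = llm.invoke(f"Rewrite this search query to be more specific: {state['query']}")
    return {"query": better.content, "attempts": state["attempts"] + 1}

def generate(state: RAGState) -> dict:
    answer = llm.invoke(
        f"Answer using only this context:\n{state['documents']}\n\nQuestion: {state['query']}"
    )
    return {"generation": answer.content}

builder = StateGraph(RAGState)
builder.add_node("retrieve", retrieve)
builder.add_node("grade", grade_relevance)
builder.add_node("reformulate", reformulate)
builder.add_node("generate", generate)

builder.add_edge(START, "retrieve")
builder.add_edge("retrieve", "grade")
builder.add_conditional_edges(
    "grade", should_retry,
    {"generate": "generate", "reformulate": "reformulate"},
)
builder.add_edge("reformulate", "retrieve")
builder.add_edge("generate", END)

rag_agent = builder.compile()
result = rag_agent.invoke({"query": "How does Python's garbage collector work?", "attempts": 0})

The conditional edge map ties should_retry's return values to node names, which is what turns the grading step into a loop.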

Multi-Source RAG

Real knowledge bases aren’t monolithic. Documentation lives in one place, code examples in another, API references in a third. Multi-source RAG routes queries to the appropriate stores:

def classify_query(state: MultiSourceState) -> dict:
    """LLM classifies query to route to appropriate store."""
    prompt = f"""Classify: {state['query']}
Options: docs, code, api, all"""
    return {"source": llm.invoke(prompt).content.strip()}

def route_to_source(state: MultiSourceState) -> list[str]:
    """Route to one or multiple retrieval stores."""
    return ["docs", "code", "api"] if state["source"] == "all" else [state["source"]]

This pattern particularly helps when different sources require different retrieval strategies—documentation might use semantic search while code examples might benefit from keyword matching.
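
One way to wire this in LangGraph is sketched below. classify_query and route_to_source come from the snippet above; the MultiSourceState schema, the per-store retrieval nodes, and the docs_store, code_store, api_store, and llm objects are illustrative assumptions. The Annotated reducer on documents lets parallel branches merge their results instead of colliding:

import operator
from typing import Annotated, TypedDict

from langgraph.graph import StateGraph, START, END

class MultiSourceState(TypedDict):
    query: str
    source: str
    documents: Annotated[list[str], operator.add]  # merged across parallel branches
    generation: str

def make_retriever(store):
    """Build a retrieval node bound to one vector store."""
    def node(state: MultiSourceState) -> dict:
        docs = store.similarity_search(state["query"], k=4)
        return {"documents": [d.page_content for d in docs]}
    return node

def generate_answer(state: MultiSourceState) -> dict:
    context = "\n\n".join(state["documents"])
    answer = llm.invoke(f"Answer from this context:\n{context}\n\nQuestion: {state['query']}")
    return {"generation": answer.content}

builder = StateGraph(MultiSourceState)
builder.add_node("classify", classify_query)
builder.add_node("docs", make_retriever(docs_store))   # docs_store, code_store, api_store are
builder.add_node("code", make_retriever(code_store))   # assumed vector stores configured elsewhere
builder.add_node("api", make_retriever(api_store))
builder.add_node("generate", generate_answer)

builder.add_edge(START, "classify")
# When route_to_source returns a list (the "all" case), LangGraph fans out
# to every listed retrieval node in parallel.
builder.add_conditional_edges("classify", route_to_source, ["docs", "code", "api"])
for source in ("docs", "code", "api"):
    builder.add_edge(source, "generate")
builder.add_edge("generate", END)

multi_source_agent = builder.compile()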

Human-in-the-Loop Patterns

Autonomous agents are powerful, but autonomy comes with risk. An agent that can send emails can send wrong emails. One that can modify databases can corrupt data. One that can execute code can introduce vulnerabilities.

The question isn’t whether to add human oversight, but where and when. Too much oversight defeats the purpose of automation; too little creates liability. Human-in-the-loop patterns provide structured ways to insert human judgment at critical decision points while letting the agent handle routine operations independently.

The Spectrum of Autonomy

Different actions warrant different levels of oversight:

| Risk Level | Examples | Pattern |
| --- | --- | --- |
| Low | Search, read, analyze | Fully autonomous |
| Medium | Draft content, propose changes | Review optional |
| High | Send emails, modify records | Require approval |
| Critical | Delete data, financial transactions | Multi-person approval |

LangGraph’s interrupt_before mechanism pauses execution at specified nodes, saves state, and waits for external input. This isn’t just a simple pause—the full execution context is preserved, allowing the human to inspect what led to this point and make an informed decision.

Basic Interrupt

The implementation requires a checkpointer (for state persistence) and the interrupt_before parameter specifying which nodes should pause for approval:

# Compile with interrupt before the execute node
checkpointer = MemorySaver()
agent = graph.compile(
    checkpointer=checkpointer,
    interrupt_before=["execute"]  # Pause before this node
)

# Run until interrupt point
config = {"configurable": {"thread_id": "approval-123"}}
result = agent.invoke(initial_state, config=config)

# Workflow pauses here—human reviews proposed action
print(f"Proposed: {result['action']} with {result['action_input']}")

# Resume with approval decision
final = agent.invoke({"approved": True}, config=config)

The thread_id in the config is crucial—it links the resume call to the paused execution. Without it, the system wouldn’t know which interrupted workflow to continue.
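
If the reviewer rejects the action instead of approving it, a common alternative to passing the decision as input is to patch the saved state with update_state and then resume with a None input, which continues from the paused node rather than starting over. A minimal sketch, reusing the agent and config from above:

# Record the reviewer's decision directly in the checkpointed state ...
agent.update_state(config, {"approved": False})

# ... then resume; a None input means "continue from the interrupt"
# rather than starting a fresh run.
final = agent.invoke(None, config=config)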

Conditional Human Review

Requiring approval for every action defeats the purpose of automation. Smart systems route based on risk level:

HIGH_RISK = ["delete", "send_email", "transfer_funds", "modify_database"]

def should_require_approval(state: ApprovalState) -> str:
    """Route high-risk to human, low-risk to auto-approve."""
    if state["action"] in HIGH_RISK:
        return "require_approval"
    return "auto_approve"

# Only interrupt before wait_approval node, not auto_approve
agent = graph.compile(
    checkpointer=checkpointer,
    interrupt_before=["wait_approval"]  # Only high-risk actions pause
)

This creates a fork in the workflow: low-risk actions flow through auto-approve and continue without interruption, while high-risk actions hit the wait_approval node and pause for human review.
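
A sketch of that fork is below. The ApprovalState schema and the plan_action, wait_approval, auto_approve, and execute nodes are illustrative stand-ins; the important part is the conditional edge map, which ties should_require_approval's labels to the node names referenced by interrupt_before:

from typing import TypedDict

from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.memory import MemorySaver

class ApprovalState(TypedDict):
    action: str
    action_input: dict
    approved: bool
    result: str

def plan_action(state: ApprovalState) -> dict:
    # Placeholder planner; a real agent would choose the action with an LLM.
    return {"action": "send_email", "action_input": {"to": "customer@example.com"}}

def wait_approval(state: ApprovalState) -> dict:
    # No-op node; it exists only so interrupt_before has somewhere to pause.
    return {}

def auto_approve(state: ApprovalState) -> dict:
    return {"approved": True}

def execute(state: ApprovalState) -> dict:
    # Placeholder executor for the approved action.
    return {"result": f"executed {state['action']}"}

builder = StateGraph(ApprovalState)
builder.add_node("plan_action", plan_action)
builder.add_node("wait_approval", wait_approval)
builder.add_node("auto_approve", auto_approve)
builder.add_node("execute", execute)

builder.add_edge(START, "plan_action")
# Map the router's labels onto the two approval paths.
builder.add_conditional_edges("plan_action", should_require_approval, {
    "require_approval": "wait_approval",
    "auto_approve": "auto_approve",
})
builder.add_edge("wait_approval", "execute")
builder.add_edge("auto_approve", "execute")
builder.add_edge("execute", END)

checkpointer = MemorySaver()
agent = builder.compile(
    checkpointer=checkpointer,
    interrupt_before=["wait_approval"],  # only the high-risk path pauses
)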

Checkpointing for Persistence

Agents operate over time. A research agent might spend minutes gathering information before synthesizing a report. A customer service agent maintains context across a multi-message conversation. A workflow agent might pause for human approval for hours before resuming.

Without persistence, all this state lives in memory. If the process crashes, the server restarts, or the user closes their browser—everything is lost. Checkpointing solves this by serializing agent state at each step, enabling:

  1. Pause/Resume: Stop execution and continue later, even on a different machine
  2. Crash Recovery: Restart failed executions from the last successful state
  3. Time Travel: Inspect or branch from any historical state
  4. Multi-Turn Conversations: Maintain context across user sessions

The checkpoint contains everything needed to reconstruct the agent’s position: current state values, next node to execute, and the thread identifier linking this execution to its history.
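
A compiled graph exposes that snapshot through get_state. Reusing the agent and config from the interrupt example, the printed values below are of course illustrative:

snapshot = agent.get_state(config)

print(snapshot.values)                               # current state values
print(snapshot.next)                                 # next node(s) to run, e.g. ('execute',)
print(snapshot.config["configurable"]["thread_id"])  # thread identifier for this run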

Checkpointer Options

LangGraph provides two main checkpointer implementations:

| Checkpointer | Use Case | Persistence |
| --- | --- | --- |
| MemorySaver | Development, testing | In-memory only |
| SqliteSaver | Production, persistence | Survives restarts |

# Development: in-memory (fast, no persistence)
from langgraph.checkpoint.memory import MemorySaver
agent = graph.compile(checkpointer=MemorySaver())

# Production: SQLite (or PostgreSQL, Redis)
from langgraph.checkpoint.sqlite import SqliteSaver
agent = graph.compile(checkpointer=SqliteSaver.from_conn_string("agent.db"))

Thread Isolation and Time Travel

Each thread_id maintains completely independent state. This enables concurrent users without state collision:

# Two users, two threads, no interference
agent.invoke(state, config={"configurable": {"thread_id": "user-001"}})
agent.invoke(state, config={"configurable": {"thread_id": "user-002"}})

# Time travel: inspect and branch from any historical state
history = list(agent.get_state_history(config))
earlier_state = history[2].config  # two checkpoints back (history is newest-first)
agent.invoke(new_input, config=earlier_state)  # Branch from there

Observability with LangSmith

Agents are notoriously difficult to debug. Unlike traditional software where you can trace execution through function calls and stack traces, agent behavior emerges from the interaction between prompts, model responses, and tool results. A bug might manifest as the agent choosing the wrong tool, misinterpreting a response, or getting stuck in a loop—none of which produce traditional error messages.

Observability means capturing enough information to understand what happened and why. For agents, this requires tracing every LLM call (with inputs and outputs), every tool invocation, every state transition, and every routing decision. LangSmith provides purpose-built infrastructure for this, but the principles apply regardless of tooling:

  • Trace hierarchies: See how high-level operations decompose into sub-steps
  • Input/output pairs: Inspect exactly what the model saw and produced
  • Latency breakdown: Identify which steps are slow
  • Token usage: Track costs per operation
  • Feedback collection: Gather human ratings for continuous improvement

Enabling LangSmith tracing requires just environment variables:

import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-api-key"
os.environ["LANGCHAIN_PROJECT"] = "my-agent"

# All invocations now traced automatically
result = agent.invoke(state, config={"tags": ["production", "v2.0"]})

For custom instrumentation without LangSmith, use callbacks:

from langchain_core.callbacks import BaseCallbackHandler

class AgentLogger(BaseCallbackHandler):
    def on_chain_start(self, serialized, inputs, **kwargs):
        print(f"Chain started: {serialized.get('name')}")

    def on_tool_end(self, output, **kwargs):
        print(f"Tool returned: {output[:100]}...")

# Attach callbacks to any invocation
agent.invoke(state, config={"callbacks": [AgentLogger()]})

Agent Evaluation

Testing agents requires fundamentally different approaches than traditional software testing. With conventional code, you can define exact expected outputs for given inputs. With agents, the “correct” output depends on model behavior, which is inherently non-deterministic and can change with model updates.

This doesn’t mean agents can’t be tested—it means we need layered testing strategies:

| Level | Tests | Purpose |
| --- | --- | --- |
| Unit | Individual nodes in isolation | Verify data transformations |
| Integration | Node interactions with mocked LLMs | Test routing and state flow |
| Component | Real LLM calls with controlled inputs | Validate prompt effectiveness |
| End-to-End | Full agent on realistic scenarios | Confirm overall behavior |
| Regression | Golden dataset with known-good outputs | Detect behavior drift |

The key insight is that agent tests should often check for properties rather than exact outputs. Does the response mention the relevant topic? Is the retrieved document actually relevant? Did the agent use the expected tool? These property-based assertions remain valid even when the exact wording changes.

Testing Strategies

# Component test: verify individual node behavior
def test_relevance_grading():
    relevant_state = {"query": "Python GC", "documents": [gc_doc]}
    irrelevant_state = {"query": "Python GC", "documents": [weather_doc]}

    assert grade_relevance(relevant_state)["relevance_score"] >= 0.7
    assert grade_relevance(irrelevant_state)["relevance_score"] < 0.5

# End-to-end test: verify complete flow
def test_rag_agent_finds_answer():
    result = rag_agent.invoke({"query": "What is LangGraph?"})

    # Property-based assertion: check for relevant keywords, not exact text
    assert any(word in result["generation"].lower()
               for word in ["agent", "stateful", "graph"])
    assert result["attempts"] <= 3  # Respects max attempts
The key difference from traditional testing: we check for properties (contains relevant keywords, stays within bounds) rather than exact outputs (equals this specific string).
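
For the integration level in the table above, where routing is exercised without real model calls, a small stub LLM is usually enough. This sketch assumes the nodes live in a hypothetical my_rag_module that reads a module-level llm, which pytest's monkeypatch can swap out:

from types import SimpleNamespace

# Hypothetical module holding the nodes and the module-level `llm` they call.
from my_rag_module import grade_relevance, should_retry

class StubLLM:
    """Returns a canned response so routing can be tested without an API call."""
    def __init__(self, reply: str):
        self.reply = reply

    def invoke(self, prompt):
        return SimpleNamespace(content=self.reply)

def test_low_relevance_routes_to_reformulate(monkeypatch):
    # Force the grader to report poor retrieval quality.
    monkeypatch.setattr("my_rag_module.llm", StubLLM("0.2"))

    state = {"query": "Python GC", "documents": "weather report", "attempts": 0}
    state.update(grade_relevance(state))

    assert state["relevance_score"] < 0.7
    assert should_retry(state) == "reformulate"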

Key Takeaways

  1. Agentic RAG iterates: Self-correcting retrieval with query reformulation outperforms single-shot approaches.

  2. Grade retrieval quality: Use LLM-as-judge to evaluate document relevance and decide whether to retry.

  3. Human-in-the-loop adds control: Use interrupt_before to pause for approval on high-risk actions.

  4. Checkpointing enables persistence: Save state for resume, debugging, and multi-turn conversations.

  5. Observability is essential: LangSmith tracing and custom callbacks help diagnose production issues.

  6. Test agents differently: Component tests for individual nodes, end-to-end tests for complete flows.


Next: Multi-Agent Architecture with LangGraph - We’ll explore orchestrator patterns, agent communication, and coordinating specialized agents.
