Agentic RAG and Human-in-the-Loop with LangGraph

Traditional RAG is a one-shot process: retrieve documents, generate an answer, done. Agentic RAG breaks this limitation—agents can evaluate retrieval quality, reformulate queries, and iterate until they find what they need. Combined with human-in-the-loop patterns, this yields systems that are both autonomous and controllable.

The Retrieval Paradox

Retrieval-Augmented Generation promised to solve hallucination by grounding LLM responses in external documents. The idea was elegant: instead of relying on potentially outdated training data, fetch relevant documents at query time and use them as context.

But a fundamental tension emerged. RAG systems are only as good as their retrieval step. If the vector search returns irrelevant documents, the LLM either ignores them (hallucinating anyway) or incorporates misleading information (hallucinating with false confidence).

Traditional RAG treats retrieval as a single, infallible step. Query goes in, documents come out, generation happens. There’s no feedback loop, no quality check, no opportunity to try again. This retrieve-once assumption breaks down in practice because:

  • Queries are often ambiguous or poorly phrased
  • Embedding models don’t perfectly capture semantic similarity
  • The right answer might require information from multiple retrieval strategies
  • Sometimes no relevant documents exist, and the system should say so

Agentic RAG addresses this by adding agency to the retrieval process itself. The agent can examine what it retrieved, judge quality, reformulate queries, and iterate until it has what it needs—or explicitly acknowledge when it doesn’t.

The Limits of Static RAG

Standard RAG pipelines follow a fixed path:

graph LR
    A[Query] --> B[Retrieve]
    B --> C[Generate]
    C --> D[Answer]

    classDef blueClass fill:#4A90E2,stroke:#333,stroke-width:2px,color:#fff

    class A,B,C,D blueClass

Problems emerge when:

  • Retrieved documents don’t answer the question
  • The query is ambiguous or poorly phrased
  • Multiple retrieval attempts are needed
  • The agent needs to reason about which sources to use

Agentic RAG addresses these with retrieval loops, self-correction, and intelligent source selection.

Agentic RAG Patterns

The shift from static to agentic RAG isn’t just about adding retry logic—it’s about giving the system the capacity to reason about its own retrieval. Three patterns dominate this space:

| Pattern | Description | When to Use |
| --- | --- | --- |
| Self-Correcting | Evaluate retrieval quality, reformulate and retry if poor | Default for most applications |
| Multi-Source | Route to different stores based on query type | When you have specialized knowledge bases |
| Adaptive Retrieval | Dynamically adjust k, similarity threshold, or strategy | High-precision requirements |
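
The first two patterns get full treatment in the sections below. Adaptive retrieval is the easiest to sketch: widen or tighten the search as attempts accumulate. The retriever call and state fields below are illustrative assumptions, not a fixed API:

def adaptive_retrieve(state: dict) -> dict:
    """Widen the search on each retry: more documents, looser threshold."""
    k = 4 + 2 * state["attempts"]                        # fetch more docs per retry
    threshold = max(0.5, 0.8 - 0.1 * state["attempts"])  # relax the similarity cutoff

    # `vector_store` is assumed to be a LangChain vector store configured elsewhere
    results = vector_store.similarity_search_with_relevance_scores(state["query"], k=k)
    kept = [doc for doc, score in results if score >= threshold]
    return {"documents": "\n\n".join(doc.page_content for doc in kept)}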

Self-Correcting Retrieval

The self-correction pattern uses the LLM as a judge of its own retrieval. This creates a feedback loop: retrieve → grade → reformulate → retrieve again. The key insight is that LLMs are surprisingly good at evaluating whether documents are relevant to a question, even when they can’t answer the question directly.

The core pattern involves four nodes: retrieve, grade, reformulate, and generate. The grading step is the key innovation—it uses the LLM to evaluate whether retrieved documents actually help answer the query:

def grade_relevance(state: RAGState) -> dict:
    """LLM-as-judge for retrieval quality."""
    prompt = f"""Rate relevance of these documents to the query (0.0-1.0):
Query: {state['query']}
Documents: {state['documents'][:2000]}"""

    score = float(llm.invoke(prompt).content)
    return {"relevance_score": score}

def should_retry(state: RAGState) -> str:
    """Route based on relevance score."""
    if state["relevance_score"] >= 0.7:
        return "generate"
    if state["attempts"] >= 3:
        return "generate"  # Give up after max attempts
    return "reformulate"

The routing logic is straightforward: if documents are relevant (score ≥ 0.7), proceed to generation. Otherwise, reformulate the query and try again—up to a maximum number of attempts to prevent infinite loops.

graph TD
    A[START] --> B[Retrieve]
    B --> C[Grade Relevance]
    C --> D{Score >= 0.7?}
    D -->|Yes| E[Generate]
    D -->|No| F{Max Attempts?}
    F -->|No| G[Reformulate Query]
    G --> B
    F -->|Yes| E
    E --> H[END]

    classDef blueClass fill:#4A90E2,stroke:#333,stroke-width:2px,color:#fff
    classDef orangeClass fill:#F39C12,stroke:#333,stroke-width:2px,color:#fff
    classDef greenClass fill:#27AE60,stroke:#333,stroke-width:2px,color:#fff

    class A,H greenClass
    class B,E,G blueClass
    class C,D,F orangeClass
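
Wiring that loop into a LangGraph StateGraph might look like the sketch below. Only grade_relevance and should_retry come from the snippet above; the RAGState schema, the retrieve, reformulate, and generate nodes, and the llm and vector_store objects are assumptions filled in for illustration:

from typing import TypedDict

from langgraph.graph import StateGraph, START, END

# `llm` and `vector_store` are assumed to be configured elsewhere
# (e.g. a chat model and a Chroma/FAISS vector store).

class RAGState(TypedDict):
    query: str
    documents: str        # concatenated retrieved text
    relevance_score: float
    attempts: int
    generation: str

def retrieve(state: RAGState) -> dict:
    docs = vector_store.similarity_search(state["query"], k=4)
    return {"documents": "\n\n".join(d.page_content for d in docs)}

def reformulate(state: RAGState) -> dict:
    better = llm.invoke(f"Rewrite this search query to be more specific: {state['query']}")
    return {"query": better.content, "attempts": state["attempts"] + 1}

def generate(state: RAGState) -> dict:
    answer = llm.invoke(
        f"Answer using only this context:\n{state['documents']}\n\nQuestion: {state['query']}"
    )
    return {"generation": answer.content}

builder = StateGraph(RAGState)
builder.add_node("retrieve", retrieve)
builder.add_node("grade", grade_relevance)
builder.add_node("reformulate", reformulate)
builder.add_node("generate", generate)

builder.add_edge(START, "retrieve")
builder.add_edge("retrieve", "grade")
builder.add_conditional_edges(
    "grade", should_retry,
    {"generate": "generate", "reformulate": "reformulate"},
)
builder.add_edge("reformulate", "retrieve")
builder.add_edge("generate", END)

rag_agent = builder.compile()
result = rag_agent.invoke({"query": "How does Python's garbage collector work?", "attempts": 0})

The conditional edge map ties should_retry's return values to node names, which is what turns the grading step into a loop.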

Multi-Source RAG

Real knowledge bases aren’t monolithic. Documentation lives in one place, code examples in another, API references in a third. Multi-source RAG routes queries to the appropriate stores:

def classify_query(state: MultiSourceState) -> dict:
    """LLM classifies query to route to appropriate store."""
    prompt = f"""Classify: {state['query']}
Options: docs, code, api, all"""
    return {"source": llm.invoke(prompt).content.strip()}

def route_to_source(state: MultiSourceState) -> list[str]:
    """Route to one or multiple retrieval stores."""
    return ["docs", "code", "api"] if state["source"] == "all" else [state["source"]]

This pattern particularly helps when different sources require different retrieval strategies—documentation might use semantic search while code examples might benefit from keyword matching.
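
One way to wire this in LangGraph is sketched below. classify_query and route_to_source come from the snippet above; the MultiSourceState schema, the per-store retrieval nodes, and the docs_store, code_store, api_store, and llm objects are illustrative assumptions. The Annotated reducer on documents lets parallel branches merge their results instead of colliding:

import operator
from typing import Annotated, TypedDict

from langgraph.graph import StateGraph, START, END

class MultiSourceState(TypedDict):
    query: str
    source: str
    documents: Annotated[list[str], operator.add]  # merged across parallel branches
    generation: str

def make_retriever(store):
    """Build a retrieval node bound to one vector store."""
    def node(state: MultiSourceState) -> dict:
        docs = store.similarity_search(state["query"], k=4)
        return {"documents": [d.page_content for d in docs]}
    return node

def generate_answer(state: MultiSourceState) -> dict:
    context = "\n\n".join(state["documents"])
    answer = llm.invoke(f"Answer from this context:\n{context}\n\nQuestion: {state['query']}")
    return {"generation": answer.content}

builder = StateGraph(MultiSourceState)
builder.add_node("classify", classify_query)
builder.add_node("docs", make_retriever(docs_store))   # docs_store, code_store, api_store are
builder.add_node("code", make_retriever(code_store))   # assumed vector stores configured elsewhere
builder.add_node("api", make_retriever(api_store))
builder.add_node("generate", generate_answer)

builder.add_edge(START, "classify")
# When route_to_source returns a list (the "all" case), LangGraph fans out
# to every listed retrieval node in parallel.
builder.add_conditional_edges("classify", route_to_source, ["docs", "code", "api"])
for source in ("docs", "code", "api"):
    builder.add_edge(source, "generate")
builder.add_edge("generate", END)

multi_source_agent = builder.compile()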

Human-in-the-Loop Patterns

Autonomous agents are powerful, but autonomy comes with risk. An agent that can send emails can send wrong emails. One that can modify databases can corrupt data. One that can execute code can introduce vulnerabilities.

The question isn’t whether to add human oversight, but where and when. Too much oversight defeats the purpose of automation; too little creates liability. Human-in-the-loop patterns provide structured ways to insert human judgment at critical decision points while letting the agent handle routine operations independently.

The Spectrum of Autonomy

Different actions warrant different levels of oversight:

| Risk Level | Examples | Pattern |
| --- | --- | --- |
| Low | Search, read, analyze | Fully autonomous |
| Medium | Draft content, propose changes | Review optional |
| High | Send emails, modify records | Require approval |
| Critical | Delete data, financial transactions | Multi-person approval |

LangGraph’s interrupt_before mechanism pauses execution at specified nodes, saves state, and waits for external input. This isn’t just a simple pause—the full execution context is preserved, allowing the human to inspect what led to this point and make an informed decision.

Basic Interrupt

The implementation requires a checkpointer (for state persistence) and the interrupt_before parameter specifying which nodes should pause for approval:

# Compile with interrupt before the execute node
checkpointer = MemorySaver()
agent = graph.compile(
    checkpointer=checkpointer,
    interrupt_before=["execute"]  # Pause before this node
)

# Run until interrupt point
config = {"configurable": {"thread_id": "approval-123"}}
result = agent.invoke(initial_state, config=config)

# Workflow pauses here—human reviews proposed action
print(f"Proposed: {result['action']} with {result['action_input']}")

# Resume with approval decision
final = agent.invoke({"approved": True}, config=config)

The thread_id in the config is crucial—it links the resume call to the paused execution. Without it, the system wouldn’t know which interrupted workflow to continue.
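
If the reviewer rejects the action instead of approving it, a common alternative to passing the decision as input is to patch the saved state with update_state and then resume with a None input, which continues from the paused node rather than starting over. A minimal sketch, reusing the agent and config from above:

# Record the reviewer's decision directly in the checkpointed state ...
agent.update_state(config, {"approved": False})

# ... then resume; a None input means "continue from the interrupt"
# rather than starting a fresh run.
final = agent.invoke(None, config=config)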

Conditional Human Review

Requiring approval for every action defeats the purpose of automation. Smart systems route based on risk level:

HIGH_RISK = ["delete", "send_email", "transfer_funds", "modify_database"]

def should_require_approval(state: ApprovalState) -> str:
    """Route high-risk to human, low-risk to auto-approve."""
    if state["action"] in HIGH_RISK:
        return "require_approval"
    return "auto_approve"

# Only interrupt before wait_approval node, not auto_approve
agent = graph.compile(
    checkpointer=checkpointer,
    interrupt_before=["wait_approval"]  # Only high-risk actions pause
)

This creates a fork in the workflow: low-risk actions flow through auto-approve and continue without interruption, while high-risk actions hit the wait_approval node and pause for human review.
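
A sketch of that fork is below. The ApprovalState schema and the plan_action, wait_approval, auto_approve, and execute nodes are illustrative stand-ins; the important part is the conditional edge map, which ties should_require_approval's labels to the node names referenced by interrupt_before:

from typing import TypedDict

from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.memory import MemorySaver

class ApprovalState(TypedDict):
    action: str
    action_input: dict
    approved: bool
    result: str

def plan_action(state: ApprovalState) -> dict:
    # Placeholder planner; a real agent would choose the action with an LLM.
    return {"action": "send_email", "action_input": {"to": "customer@example.com"}}

def wait_approval(state: ApprovalState) -> dict:
    # No-op node; it exists only so interrupt_before has somewhere to pause.
    return {}

def auto_approve(state: ApprovalState) -> dict:
    return {"approved": True}

def execute(state: ApprovalState) -> dict:
    # Placeholder executor for the approved action.
    return {"result": f"executed {state['action']}"}

builder = StateGraph(ApprovalState)
builder.add_node("plan_action", plan_action)
builder.add_node("wait_approval", wait_approval)
builder.add_node("auto_approve", auto_approve)
builder.add_node("execute", execute)

builder.add_edge(START, "plan_action")
# Map the router's labels onto the two approval paths.
builder.add_conditional_edges("plan_action", should_require_approval, {
    "require_approval": "wait_approval",
    "auto_approve": "auto_approve",
})
builder.add_edge("wait_approval", "execute")
builder.add_edge("auto_approve", "execute")
builder.add_edge("execute", END)

checkpointer = MemorySaver()
agent = builder.compile(
    checkpointer=checkpointer,
    interrupt_before=["wait_approval"],  # only the high-risk path pauses
)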

Checkpointing for Persistence

Agents operate over time. A research agent might spend minutes gathering information before synthesizing a report. A customer service agent maintains context across a multi-message conversation. A workflow agent might pause for human approval for hours before resuming.

Without persistence, all this state lives in memory. If the process crashes, the server restarts, or the user closes their browser—everything is lost. Checkpointing solves this by serializing agent state at each step, enabling:

  1. Pause/Resume: Stop execution and continue later, even on a different machine
  2. Crash Recovery: Restart failed executions from the last successful state
  3. Time Travel: Inspect or branch from any historical state
  4. Multi-Turn Conversations: Maintain context across user sessions

The checkpoint contains everything needed to reconstruct the agent’s position: current state values, next node to execute, and the thread identifier linking this execution to its history.
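
A compiled graph exposes that snapshot through get_state. Reusing the agent and config from the interrupt example, the printed values below are of course illustrative:

snapshot = agent.get_state(config)

print(snapshot.values)                               # current state values
print(snapshot.next)                                 # next node(s) to run, e.g. ('execute',)
print(snapshot.config["configurable"]["thread_id"])  # thread identifier for this run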

Checkpointer Options

LangGraph provides two main checkpointer implementations:

| Checkpointer | Use Case | Persistence |
| --- | --- | --- |
| MemorySaver | Development, testing | In-memory only |
| SqliteSaver | Production, persistence | Survives restarts |

# Development: in-memory (fast, no persistence)
from langgraph.checkpoint.memory import MemorySaver
agent = graph.compile(checkpointer=MemorySaver())

# Production: SQLite (or PostgreSQL, Redis)
from langgraph.checkpoint.sqlite import SqliteSaver
agent = graph.compile(checkpointer=SqliteSaver.from_conn_string("agent.db"))

Thread Isolation and Time Travel

Each thread_id maintains completely independent state. This enables concurrent users without state collision:

# Two users, two threads, no interference
agent.invoke(state, config={"configurable": {"thread_id": "user-001"}})
agent.invoke(state, config={"configurable": {"thread_id": "user-002"}})

# Time travel: inspect and branch from any historical state
history = list(agent.get_state_history(config))
earlier_state = history[2].config  # two checkpoints back (history is newest-first)
agent.invoke(new_input, config=earlier_state)  # Branch from there

Observability with LangSmith

Agents are notoriously difficult to debug. Unlike traditional software where you can trace execution through function calls and stack traces, agent behavior emerges from the interaction between prompts, model responses, and tool results. A bug might manifest as the agent choosing the wrong tool, misinterpreting a response, or getting stuck in a loop—none of which produce traditional error messages.

Observability means capturing enough information to understand what happened and why. For agents, this requires tracing every LLM call (with inputs and outputs), every tool invocation, every state transition, and every routing decision. LangSmith provides purpose-built infrastructure for this, but the principles apply regardless of tooling:

  • Trace hierarchies: See how high-level operations decompose into sub-steps
  • Input/output pairs: Inspect exactly what the model saw and produced
  • Latency breakdown: Identify which steps are slow
  • Token usage: Track costs per operation
  • Feedback collection: Gather human ratings for continuous improvement

Enabling LangSmith tracing requires just environment variables:

import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-api-key"
os.environ["LANGCHAIN_PROJECT"] = "my-agent"

# All invocations now traced automatically
result = agent.invoke(state, config={"tags": ["production", "v2.0"]})

For custom instrumentation without LangSmith, use callbacks:

from langchain_core.callbacks import BaseCallbackHandler

class AgentLogger(BaseCallbackHandler):
    def on_chain_start(self, serialized, inputs, **kwargs):
        print(f"Chain started: {serialized.get('name')}")

    def on_tool_end(self, output, **kwargs):
        print(f"Tool returned: {output[:100]}...")

# Attach callbacks to any invocation
agent.invoke(state, config={"callbacks": [AgentLogger()]})

Agent Evaluation

Testing agents requires fundamentally different approaches than traditional software testing. With conventional code, you can define exact expected outputs for given inputs. With agents, the “correct” output depends on model behavior, which is inherently non-deterministic and can change with model updates.

This doesn’t mean agents can’t be tested—it means we need layered testing strategies:

| Level | Tests | Purpose |
| --- | --- | --- |
| Unit | Individual nodes in isolation | Verify data transformations |
| Integration | Node interactions with mocked LLMs | Test routing and state flow |
| Component | Real LLM calls with controlled inputs | Validate prompt effectiveness |
| End-to-End | Full agent on realistic scenarios | Confirm overall behavior |
| Regression | Golden dataset with known-good outputs | Detect behavior drift |

The key insight is that agent tests should often check for properties rather than exact outputs. Does the response mention the relevant topic? Is the retrieved document actually relevant? Did the agent use the expected tool? These property-based assertions remain valid even when the exact wording changes.

Testing Strategies

# Component test: verify individual node behavior
def test_relevance_grading():
    relevant_state = {"query": "Python GC", "documents": [gc_doc]}
    irrelevant_state = {"query": "Python GC", "documents": [weather_doc]}

    assert grade_relevance(relevant_state)["relevance_score"] >= 0.7
    assert grade_relevance(irrelevant_state)["relevance_score"] < 0.5

# End-to-end test: verify complete flow
def test_rag_agent_finds_answer():
    result = rag_agent.invoke({"query": "What is LangGraph?"})

    # Property-based assertion: check for relevant keywords, not exact text
    assert any(word in result["generation"].lower()
               for word in ["agent", "stateful", "graph"])
    assert result["attempts"] <= 3  # Respects max attempts
The key difference from traditional testing: we check for properties (contains relevant keywords, stays within bounds) rather than exact outputs (equals this specific string).
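
For the integration level in the table above, where routing is exercised without real model calls, a small stub LLM is usually enough. This sketch assumes the nodes live in a hypothetical my_rag_module that reads a module-level llm, which pytest's monkeypatch can swap out:

from types import SimpleNamespace

# Hypothetical module holding the nodes and the module-level `llm` they call.
from my_rag_module import grade_relevance, should_retry

class StubLLM:
    """Returns a canned response so routing can be tested without an API call."""
    def __init__(self, reply: str):
        self.reply = reply

    def invoke(self, prompt):
        return SimpleNamespace(content=self.reply)

def test_low_relevance_routes_to_reformulate(monkeypatch):
    # Force the grader to report poor retrieval quality.
    monkeypatch.setattr("my_rag_module.llm", StubLLM("0.2"))

    state = {"query": "Python GC", "documents": "weather report", "attempts": 0}
    state.update(grade_relevance(state))

    assert state["relevance_score"] < 0.7
    assert should_retry(state) == "reformulate"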

Key Takeaways

  1. Agentic RAG iterates: Self-correcting retrieval with query reformulation outperforms single-shot approaches.

  2. Grade retrieval quality: Use LLM-as-judge to evaluate document relevance and decide whether to retry.

  3. Human-in-the-loop adds control: Use interrupt_before to pause for approval on high-risk actions.

  4. Checkpointing enables persistence: Save state for resume, debugging, and multi-turn conversations.

  5. Observability is essential: LangSmith tracing and custom callbacks help diagnose production issues.

  6. Test agents differently: Component tests for individual nodes, end-to-end tests for complete flows.


Next: Multi-Agent Architecture with LangGraph - We’ll explore orchestrator patterns, agent communication, and coordinating specialized agents.
