Traditional RAG (Retrieval-Augmented Generation) follows a fixed pattern: query in, documents out, response generated. But what if the agent could decide when and how to retrieve? Agentic RAG gives agents control over their own knowledge acquisition. In this post, I’ll explore this dynamic approach to retrieval, then tackle the equally important question: how do we know if our agents actually work?
The Limits of Traditional RAG
Standard RAG follows a rigid pipeline:
```mermaid
flowchart LR
    Q[Query] --> E[Embed]
    E --> S[Search]
    S --> D[Documents]
    D --> G[Generate]
    G --> R[Response]
    style S fill:#e3f2fd
```
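To make the rigidity concrete, here is a minimal sketch of the fixed pipeline. It assumes a vector store and chat model are already set up (the `vectorstore` and `llm` names are placeholders, in the same spirit as the snippets later in this post):

```python
def traditional_rag(question: str) -> str:
    # Every query takes the same path, whether it needs retrieval or not.
    docs = vectorstore.similarity_search(question, k=5)       # embed + search
    context = "\n\n".join(doc.page_content for doc in docs)   # collect documents
    prompt = (
        f"Answer the question using the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm.invoke(prompt).content                         # generate
```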
This works for straightforward questions but fails when:
The query needs reformulation for better retrieval
Multiple retrieval steps are needed
Some queries don’t need retrieval at all
The initial results are insufficient
Agentic RAG: The Agent Decides
Agentic RAG shifts control to the agent. Retrieval becomes a tool the agent chooses to use, not a mandatory step:
```mermaid
flowchart TD
    Q[Query] --> A{Agent}
    A -->|Needs Info| R[Retrieve]
    R --> D[Documents]
    D --> A
    A -->|Needs More| R
    A -->|Knows Answer| G[Generate]
    A -->|Wrong Query| RQ[Reformulate]
    RQ --> R
    style A fill:#fff3e0
    style R fill:#e3f2fd
```
The agent can:
Skip retrieval when it already knows the answer
Reformulate queries for better search results
Retrieve iteratively until it has enough information
Evaluate results and decide if more searching is needed
The first building block is retrieval exposed as a tool the agent can call:

```python
@tool  # tool decorator from your agent framework (e.g. langchain_core.tools)
def search_knowledge_base(query: str, num_results: int = 5) -> str:
    """
    Search the knowledge base for relevant information.

    Args:
        query: Search query - be specific for better results
        num_results: Number of documents to retrieve (default 5)

    Returns:
        Relevant documents from the knowledge base
    """
    docs = vectorstore.similarity_search(query, k=num_results)

    results = []
    for i, doc in enumerate(docs):
        results.append(f"[Document {i+1}]")
        results.append(f"Source: {doc.metadata.get('source', 'Unknown')}")
        results.append(f"Content: {doc.page_content}")
        results.append("---")

    return "\n".join(results)
```
The agent itself is a loop: call the model, execute any tools it requests, feed the results back, and stop when it answers directly:

```python
class AgenticRAG:
    def __init__(self):
        self.tools = [search_knowledge_base]
        self.tool_map = {t.name: t for t in self.tools}

    def query(self, question: str) -> str:
        messages = [
            {
                "role": "system",
                "content": """You are a knowledgeable assistant with access to a knowledge base.

Use these strategies:
1. If you're confident in your answer, respond directly
2. If you need specific facts, search the knowledge base
3. If initial results are insufficient, reformulate and search again
4. If the query is ambiguous, clarify before searching
5. Always cite your sources when using retrieved information

Be strategic about when to search - don't search unnecessarily."""
            },
            {"role": "user", "content": question}
        ]

        while True:
            # _call_llm sends the conversation plus the tool schemas to the model
            message = self._call_llm(messages)
            if not message.tool_calls:
                return message.content

            messages.append(message)  # keep the assistant's tool-call turn in the history
            for tool_call in message.tool_calls:
                # _execute_tool looks up the requested tool in self.tool_map and runs it
                result = self._execute_tool(tool_call)
                messages.append({
                    "role": "tool",
                    "tool_call_id": tool_call.id,
                    "content": result
                })
```
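Usage is then a single call. A minimal sketch, assuming the knowledge base has already been indexed into `vectorstore`:

```python
rag = AgenticRAG()

# General knowledge - the agent should answer directly, no retrieval
print(rag.query("What is the capital of France?"))

# Company-specific - the agent should search, possibly more than once
print(rag.query("What is our refund policy for annual subscriptions?"))
```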
When initial results come back thin, reformulation can itself be a tool:

```python
@tool
def reformulate_query(
    original_query: str,
    search_results: str,
    reason: str
) -> str:
    """
    Create a better search query when initial results are insufficient.

    Args:
        original_query: The original search query
        search_results: What was found (or not found)
        reason: Why the results were insufficient

    Returns:
        Improved search query
    """
    prompt = f"""
    Original query: {original_query}
    Results obtained: {search_results[:500]}
    Problem: {reason}

    Generate a better search query to find the needed information.
    Return only the new query, nothing else.
    """

    # Ask the model for a rewritten query and return just the text
    return llm.invoke(prompt).content.strip()
```
Retrieval quality also improves with a hybrid approach: run semantic and keyword search side by side, then merge the two ranked lists with reciprocal rank fusion. These methods live on the retriever class (assume `from typing import Dict, List` and a BM25-style `self.keyword_index`):

```python
def _keyword_search(self, query: str, k: int) -> List[Dict]:
    # BM25 or similar keyword search
    results = self.keyword_index.search(query, limit=k)
    return [
        {"content": r.text, "score": r.score, "source": "keyword"}
        for r in results
    ]

def _merge_results(
    self,
    semantic: List[Dict],
    keyword: List[Dict],
    k: int
) -> List[Dict]:
    # Reciprocal rank fusion: each list contributes 1 / (rank + 60) per document
    scores = {}
    for rank, result in enumerate(semantic):
        key = result["content"][:100]  # Use prefix as key
        scores[key] = scores.get(key, 0) + 1 / (rank + 60)

    for rank, result in enumerate(keyword):
        key = result["content"][:100]
        scores[key] = scores.get(key, 0) + 1 / (rank + 60)

    # Sort by combined score
    all_results = semantic + keyword
    unique = {r["content"][:100]: r for r in all_results}
    sorted_keys = sorted(scores.keys(), key=lambda x: scores[x], reverse=True)

    return [unique[key] for key in sorted_keys[:k] if key in unique]
```
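Tying the two halves together, a public entry point might look like the sketch below (the `hybrid_search` name and the `self.vectorstore` attribute are illustrative):

```python
def hybrid_search(self, query: str, k: int = 5) -> List[Dict]:
    # Normalize semantic hits to the same dict shape the merger expects
    semantic = [
        {"content": doc.page_content, "score": None, "source": "semantic"}
        for doc in self.vectorstore.similarity_search(query, k=k)
    ]
    keyword = self._keyword_search(query, k)
    return self._merge_results(semantic, keyword, k)
```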
Agent Evaluation: Why It Matters
Building agents is one thing - knowing if they work is another. Without evaluation, you’re flying blind. Agent evaluation presents unique challenges:
Non-determinism: The same input may produce different outputs - see the repeat-run sketch after this list
Multi-step workflows: Errors compound across steps
Tool usage: Must evaluate both tool selection and execution
Quality is subjective: What makes a “good” response?
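Non-determinism in particular is worth measuring directly: run each input several times and check how often the agent agrees with itself. A minimal sketch, using exact-match agreement as a stand-in for whatever similarity check you prefer:

```python
from collections import Counter

def consistency_check(agent, question: str, runs: int = 5) -> float:
    """Fraction of runs that agree with the most common answer."""
    answers = [agent.query(question) for _ in range(runs)]
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / runs

# 1.0 means perfectly stable; 0.2 means five different answers in five runs
```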
Evaluation Dimensions
Evaluate agents across multiple dimensions:
```mermaid
flowchart TD
    E[Agent Evaluation]
    E --> A[Accuracy]
    E --> R[Reliability]
    E --> T[Tool Usage]
    E --> L[Latency]
    E --> C[Cost]
    A --> A1[Correct answers]
    R --> R1[Consistent behavior]
    T --> T1[Right tool selection]
    L --> L1[Response time]
    C --> C1[Token/API costs]
    style E fill:#fff3e0
```
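The evaluation snippets below lean on a small harness built around a few simple types. Here is a condensed sketch of those types; the exact fields are illustrative, but they match how `TestCase`, `EvalMetric`, `EvalResult`, and `AgentEvaluator` are used in the rest of this section:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Optional

class EvalMetric(Enum):
    CONTAINS = "contains"        # expected strings appear in the output
    SEMANTIC = "semantic"        # LLM-judged similarity to a reference answer
    TOOL_USAGE = "tool_usage"    # the right tools were called
    CUSTOM = "custom"            # caller-supplied evaluator

@dataclass
class TestCase:
    name: str
    input: str
    expected_contains: List[str] = field(default_factory=list)
    expected_tools: List[str] = field(default_factory=list)
    metric: EvalMetric = EvalMetric.CONTAINS
    custom_evaluator: Optional[Callable] = None  # (output, test_case) -> (passed, score)

@dataclass
class EvalResult:
    name: str
    passed: bool
    score: float
    latency_ms: float

class AgentEvaluator:
    def __init__(self, agent):
        self.agent = agent

    def run_evaluation(self, test_cases: List[TestCase]) -> List[EvalResult]:
        # For each case: run the agent, time it, record which tools it used,
        # and score the output against the chosen metric (details omitted here).
        ...
```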
Semantic checks hand the comparison to an LLM judge:

```python
def _semantic_similarity(self, output: str, expected: str) -> tuple[bool, float]:
    """Use LLM to judge semantic similarity"""
    prompt = f"""
    Compare these two responses for semantic equivalence.
    Rate similarity from 0.0 to 1.0.

    Expected: {expected}
    Actual: {output}

    Return only a number between 0.0 and 1.0.
    """

    score = float(llm.invoke(prompt).content.strip())  # parse the judge's number
    return score >= 0.8, score  # the pass threshold is a tuning choice
```

With the harness in place, a test suite is just data:
```python
# Define test cases
test_cases = [
    TestCase(
        name="simple_factual",
        input="What is the capital of France?",
        expected_contains=["Paris"],
        metric=EvalMetric.CONTAINS
    ),
    TestCase(
        name="requires_search",
        input="What are our company's refund policies?",
        expected_tools=["search_knowledge_base"],
        expected_contains=["refund", "days"],
        metric=EvalMetric.TOOL_USAGE
    ),
    TestCase(
        name="complex_reasoning",
        input="Compare our Q3 and Q4 sales performance",
        expected_tools=["search_knowledge_base"],
        metric=EvalMetric.CUSTOM,
        custom_evaluator=lambda output, tc: (
            "Q3" in output and "Q4" in output and len(output) > 100,
            0.8 if "Q3" in output and "Q4" in output else 0.3
        )
    )
]

# Run evaluation
evaluator = AgentEvaluator(my_agent)
results = evaluator.run_evaluation(test_cases)

# Summary statistics
passed = sum(1 for r in results if r.passed)
avg_score = sum(r.score for r in results) / len(results)
avg_latency = sum(r.latency_ms for r in results) / len(results)
```
For subjective quality, an LLM judge can score responses against explicit criteria:

```python
import json

def llm_judge(
    question: str,
    response: str,
    criteria: List[str]
) -> dict:
    """
    Use LLM to evaluate response quality.

    Args:
        question: Original user question
        response: Agent's response
        criteria: List of evaluation criteria

    Returns:
        Scores for each criterion
    """
    criteria_str = "\n".join(f"- {c}" for c in criteria)

    prompt = f"""
    Evaluate this response on a scale of 1-5 for each criterion.

    Question: {question}
    Response: {response}

    Criteria:
    {criteria_str}

    Return JSON with scores for each criterion and brief justification:
    {{"criterion_name": {{"score": 1-5, "reason": "..."}}, ...}}
    """

    # Ask the judge model and parse its JSON reply
    return json.loads(llm.invoke(prompt).content)
```

```python
# Usage
criteria = [
    "accuracy: factual correctness",
    "completeness: addresses all parts of the question",
    "clarity: easy to understand",
    "relevance: stays on topic"
]

scores = llm_judge(
    question="How do I reset my password?",
    response=agent_response,
    criteria=criteria
)
```
Finally, production agents need ongoing monitoring. A monitor that keeps one record per handled query (success flag, latency, tools used) can summarize recent behavior:

```python
def get_stats(self, last_n: int = 100) -> dict:
    # self.records holds one dict per handled query:
    # {"success": bool, "latency_ms": float, "tools_used": [...]}
    recent = self.records[-last_n:]

    return {
        "total_queries": len(recent),
        "success_rate": sum(1 for m in recent if m["success"]) / len(recent),
        "avg_latency_ms": sum(m["latency_ms"] for m in recent) / len(recent),
        "tool_usage": self._count_tools(recent)
    }

def _count_tools(self, records: List[dict]) -> dict:
    counts = {}
    for r in records:
        for tool in r["tools_used"]:
            counts[tool] = counts.get(tool, 0) + 1
    return counts
```
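These methods would sit on a monitor class that wraps the agent and appends one record per call. A rough sketch of that wrapper, with illustrative names (`AgentMonitor`, and a `last_tools_used` attribute the agent is assumed to expose):

```python
import time

class AgentMonitor:
    def __init__(self, agent):
        self.agent = agent
        self.records = []

    def query(self, question: str) -> str:
        start = time.time()
        try:
            answer = self.agent.query(question)
            success = True
        except Exception:
            answer, success = "", False
        self.records.append({
            "success": success,
            "latency_ms": (time.time() - start) * 1000,
            # assumes the agent tracks which tools it called on the last run
            "tools_used": getattr(self.agent, "last_tools_used", []),
        })
        return answer
```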
Key Takeaways
Agentic RAG
Agent controls retrieval: Let the agent decide when and how to search
Iterative refinement: Allow query reformulation and multiple retrieval attempts
Hybrid approaches: Combine semantic and keyword search for better coverage
Skip when possible: Don’t retrieve unnecessarily
Evaluation
Multi-dimensional: Evaluate accuracy, reliability, tool usage, latency, and cost
Test systematically: Build comprehensive test suites with clear expected outcomes
Use LLM judges: For subjective quality assessment
Monitor continuously: Production agents need ongoing observation
With agentic RAG, agents become smarter about knowledge acquisition. With proper evaluation, you gain confidence that they actually work. This completes our exploration of single-agent capabilities. In the next post, we’ll step into multi-agent territory - designing systems where multiple specialized agents collaborate.
This is Part 11 of my series on building intelligent AI systems. Next: designing multi-agent architectures.