Agentic RAG and Agent Evaluation Strategies

Traditional RAG (Retrieval-Augmented Generation) follows a fixed pattern: query in, documents out, response generated. But what if the agent could decide when and how to retrieve? Agentic RAG gives agents control over their own knowledge acquisition. In this post, I’ll explore this dynamic approach to retrieval, then tackle the equally important question: how do we know if our agents actually work?

The Limits of Traditional RAG

Standard RAG follows a rigid pipeline:

flowchart LR
    Q[Query] --> E[Embed]
    E --> S[Search]
    S --> D[Documents]
    D --> G[Generate]
    G --> R[Response]

    style S fill:#e3f2fd

This works for straightforward questions but fails when:

  • The query needs reformulation for better retrieval
  • Multiple retrieval steps are needed
  • Some queries don’t need retrieval at all
  • The initial results are insufficient

Agentic RAG: The Agent Decides

Agentic RAG shifts control to the agent. Retrieval becomes a tool the agent chooses to use, not a mandatory step:

flowchart TD
    Q[Query] --> A{Agent}
    A -->|Needs Info| R[Retrieve]
    R --> D[Documents]
    D --> A
    A -->|Needs More| R
    A -->|Knows Answer| G[Generate]
    A -->|Wrong Query| RQ[Reformulate]
    RQ --> R

    style A fill:#fff3e0
    style R fill:#e3f2fd

The agent can:

  1. Skip retrieval when it already knows the answer
  2. Reformulate queries for better search results
  3. Retrieve iteratively until it has enough information
  4. Evaluate results and decide if more searching is needed

Implementing Agentic RAG

The Retrieval Tool

First, wrap your vector store as a tool:

from langchain_core.tools import tool
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# Initialize vector store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma(
    collection_name="knowledge_base",
    embedding_function=embeddings,
    persist_directory="./chroma_db"
)

@tool
def search_knowledge_base(query: str, num_results: int = 5) -> str:
    """
    Search the knowledge base for relevant information.

    Args:
        query: Search query - be specific for better results
        num_results: Number of documents to retrieve (default 5)

    Returns:
        Relevant documents from the knowledge base
    """
    docs = vectorstore.similarity_search(query, k=num_results)

    results = []
    for i, doc in enumerate(docs):
        results.append(f"[Document {i+1}]")
        results.append(f"Source: {doc.metadata.get('source', 'Unknown')}")
        results.append(f"Content: {doc.page_content}")
        results.append("---")

    return "\n".join(results)
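Before wiring the tool into an agent, it can be exercised directly. A minimal sanity-check sketch, assuming documents have already been added to the Chroma collection:

# Quick sanity check: call the tool directly (assumes the collection
# above already contains documents)
print(search_knowledge_base.invoke({"query": "refund policy", "num_results": 3}))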

The Agentic RAG Agent

from openai import OpenAI
import json

client = OpenAI()

class AgenticRAG:
    def __init__(self):
        self.tools = [search_knowledge_base]
        self.tool_map = {t.name: t for t in self.tools}

    def query(self, question: str) -> str:
        messages = [
            {
                "role": "system",
                "content": """You are a knowledgeable assistant with access to a
knowledge base. Use these strategies:

1. If you're confident in your answer, respond directly
2. If you need specific facts, search the knowledge base
3. If initial results are insufficient, reformulate and search again
4. If the query is ambiguous, clarify before searching
5. Always cite your sources when using retrieved information

Be strategic about when to search - don't search unnecessarily."""
            },
            {"role": "user", "content": question}
        ]

        max_iterations = 5
        for _ in range(max_iterations):
            response = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=messages,
                tools=self._get_tool_schemas(),
                temperature=0
            )

            message = response.choices[0].message

            # No tool calls means the model is answering directly
            if not message.tool_calls:
                return message.content

            messages.append(message)

            for tool_call in message.tool_calls:
                result = self._execute_tool(tool_call)
                messages.append({
                    "role": "tool",
                    "tool_call_id": tool_call.id,
                    "content": result
                })

        return "Unable to find sufficient information."

    def _execute_tool(self, tool_call) -> str:
        func_name = tool_call.function.name
        func_args = json.loads(tool_call.function.arguments)
        return str(self.tool_map[func_name].invoke(func_args))

    def _get_tool_schemas(self) -> list:
        return [
            {
                "type": "function",
                "function": {
                    "name": tool.name,
                    "description": tool.description,
                    "parameters": tool.args_schema.schema()
                }
            }
            for tool in self.tools
        ]
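Using the class is a one-liner per question. A minimal usage sketch, assuming OPENAI_API_KEY is set and the knowledge base is populated:

# Minimal usage sketch (assumes OPENAI_API_KEY is set and the knowledge
# base contains relevant documents)
rag = AgenticRAG()
answer = rag.query("What does our warranty cover for water damage?")
print(answer)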

Query Reformulation

Add a dedicated reformulation tool for complex queries:

@tool
def reformulate_query(
    original_query: str,
    search_results: str,
    reason: str
) -> str:
    """
    Create a better search query when initial results are insufficient.

    Args:
        original_query: The original search query
        search_results: What was found (or not found)
        reason: Why the results were insufficient

    Returns:
        Improved search query
    """
    prompt = f"""
    Original query: {original_query}
    Results obtained: {search_results[:500]}
    Problem: {reason}

    Generate a better search query to find the needed information.
    Return only the new query, nothing else.
    """

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )

    return response.choices[0].message.content
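For the agent to call this, it needs to be registered alongside the retrieval tool; a minimal sketch of the change to the constructor shown earlier:

# In AgenticRAG.__init__, expose both tools to the model
self.tools = [search_knowledge_base, reformulate_query]
self.tool_map = {t.name: t for t in self.tools}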

Hybrid Search Strategies

Combine multiple retrieval approaches:

from typing import List, Dict

class HybridRetriever:
    def __init__(self, vectorstore, keyword_index):
        self.vectorstore = vectorstore
        self.keyword_index = keyword_index

    def search(
        self,
        query: str,
        method: str = "hybrid",
        k: int = 5
    ) -> List[Dict]:
        """
        Search using different strategies.

        Args:
            query: Search query
            method: "semantic", "keyword", or "hybrid"
            k: Number of results
        """
        if method == "semantic":
            return self._semantic_search(query, k)
        elif method == "keyword":
            return self._keyword_search(query, k)
        else:
            # Hybrid: combine both
            semantic_results = self._semantic_search(query, k)
            keyword_results = self._keyword_search(query, k)
            return self._merge_results(semantic_results, keyword_results, k)

    def _semantic_search(self, query: str, k: int) -> List[Dict]:
        docs = self.vectorstore.similarity_search_with_score(query, k=k)
        return [
            {"content": doc.page_content, "score": score, "source": "semantic"}
            for doc, score in docs
        ]

    def _keyword_search(self, query: str, k: int) -> List[Dict]:
        # BM25 or similar keyword search
        results = self.keyword_index.search(query, limit=k)
        return [
            {"content": r.text, "score": r.score, "source": "keyword"}
            for r in results
        ]

    def _merge_results(
        self,
        semantic: List[Dict],
        keyword: List[Dict],
        k: int
    ) -> List[Dict]:
        # Reciprocal rank fusion: each document earns 1 / (rank + 60) from
        # every result list it appears in, and the contributions are summed
        scores = {}
        for rank, result in enumerate(semantic):
            key = result["content"][:100]  # Use content prefix as dedup key
            scores[key] = scores.get(key, 0) + 1 / (rank + 60)

        for rank, result in enumerate(keyword):
            key = result["content"][:100]
            scores[key] = scores.get(key, 0) + 1 / (rank + 60)

        # Sort by combined score and return the top-k unique documents
        all_results = semantic + keyword
        unique = {r["content"][:100]: r for r in all_results}
        sorted_keys = sorted(scores.keys(), key=lambda key: scores[key], reverse=True)

        return [unique[key] for key in sorted_keys[:k] if key in unique]
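To let the agent choose a retrieval strategy, the retriever can itself be wrapped as a tool. A minimal sketch, assuming a HybridRetriever instance named hybrid_retriever has already been built from your vector store and keyword index:

@tool
def hybrid_search(query: str, method: str = "hybrid") -> str:
    """Search the knowledge base using 'semantic', 'keyword', or 'hybrid' retrieval."""
    # Assumes hybrid_retriever = HybridRetriever(vectorstore, keyword_index)
    hits = hybrid_retriever.search(query, method=method, k=5)
    return "\n---\n".join(f"[{h['source']}] {h['content']}" for h in hits)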

Agent Evaluation: Why It Matters

Building agents is one thing - knowing if they work is another. Without evaluation, you’re flying blind. Agent evaluation presents unique challenges:

  1. Non-determinism: The same input may produce different outputs (one mitigation is sketched after this list)
  2. Multi-step workflows: Errors compound across steps
  3. Tool usage: Must evaluate both tool selection and execution
  4. Quality is subjective: What makes a “good” response?
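One pragmatic response to the first challenge is to run each check several times and report a pass rate instead of a single pass/fail. A minimal sketch, assuming an agent object whose query method returns a string:

# Minimal sketch: repeat a check to account for non-deterministic outputs
# (assumes `agent.query(...)` returns a string)
def pass_rate(agent, question: str, must_contain: str, runs: int = 5) -> float:
    passes = sum(
        must_contain.lower() in agent.query(question).lower()
        for _ in range(runs)
    )
    return passes / runs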

Evaluation Dimensions

Evaluate agents across multiple dimensions:

flowchart TD
    E[Agent Evaluation]
    E --> A[Accuracy]
    E --> R[Reliability]
    E --> T[Tool Usage]
    E --> L[Latency]
    E --> C[Cost]

    A --> A1[Correct answers]
    R --> R1[Consistent behavior]
    T --> T1[Right tool selection]
    L --> L1[Response time]
    C --> C1[Token/API costs]

    style E fill:#fff3e0
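The framework below covers most of these dimensions directly; cost is easy to fold in as well, since chat completion responses report token usage. A minimal sketch, with per-token prices as placeholder assumptions:

# Minimal sketch: estimate the cost of a single call from reported token
# usage (prices are placeholders - substitute your model's actual rates)
INPUT_PRICE_PER_1K = 0.00015   # assumed placeholder, USD per 1K prompt tokens
OUTPUT_PRICE_PER_1K = 0.0006   # assumed placeholder, USD per 1K completion tokens

def call_cost(response) -> float:
    usage = response.usage  # populated on chat.completions responses
    return (usage.prompt_tokens * INPUT_PRICE_PER_1K
            + usage.completion_tokens * OUTPUT_PRICE_PER_1K) / 1000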

Building an Evaluation Framework

Test Case Structure

from dataclasses import dataclass
from typing import List, Optional, Callable
from enum import Enum

class EvalMetric(Enum):
    EXACT_MATCH = "exact_match"
    CONTAINS = "contains"
    SEMANTIC_SIMILARITY = "semantic_similarity"
    TOOL_USAGE = "tool_usage"
    CUSTOM = "custom"

@dataclass
class TestCase:
    """Single test case for agent evaluation"""
    name: str
    input: str
    expected_output: Optional[str] = None
    expected_tools: Optional[List[str]] = None
    expected_contains: Optional[List[str]] = None
    metric: EvalMetric = EvalMetric.CONTAINS
    custom_evaluator: Optional[Callable] = None

@dataclass
class EvalResult:
    test_case: TestCase
    actual_output: str
    passed: bool
    score: float
    latency_ms: float
    tools_used: List[str]
    error: Optional[str] = None

The Evaluator

import time
from typing import List
from openai import OpenAI

client = OpenAI()

class AgentEvaluator:
    def __init__(self, agent):
        self.agent = agent

    def run_evaluation(self, test_cases: List[TestCase]) -> List[EvalResult]:
        results = []
        for tc in test_cases:
            result = self._evaluate_single(tc)
            results.append(result)
            print(f"{'✓' if result.passed else '✗'} {tc.name}: {result.score:.2f}")
        return results

    def _evaluate_single(self, tc: TestCase) -> EvalResult:
        start_time = time.time()

        try:
            output = self.agent.query(tc.input)
            latency = (time.time() - start_time) * 1000

            passed, score = self._check_result(tc, output)

            return EvalResult(
                test_case=tc,
                actual_output=output,
                passed=passed,
                score=score,
                latency_ms=latency,
                tools_used=getattr(self.agent, 'last_tools_used', [])
            )
        except Exception as e:
            return EvalResult(
                test_case=tc,
                actual_output="",
                passed=False,
                score=0.0,
                latency_ms=0,
                tools_used=[],
                error=str(e)
            )

    def _check_result(self, tc: TestCase, output: str) -> tuple[bool, float]:
        if tc.metric == EvalMetric.EXACT_MATCH:
            passed = output.strip() == tc.expected_output.strip()
            return passed, 1.0 if passed else 0.0

        elif tc.metric == EvalMetric.CONTAINS:
            if not tc.expected_contains:
                return True, 1.0
            matches = sum(1 for term in tc.expected_contains if term.lower() in output.lower())
            score = matches / len(tc.expected_contains)
            return score >= 0.8, score

        elif tc.metric == EvalMetric.SEMANTIC_SIMILARITY:
            return self._semantic_similarity(output, tc.expected_output)

        elif tc.metric == EvalMetric.TOOL_USAGE:
            tools_used = getattr(self.agent, 'last_tools_used', [])
            if not tc.expected_tools:
                return True, 1.0
            matches = sum(1 for t in tc.expected_tools if t in tools_used)
            score = matches / len(tc.expected_tools)
            return score >= 0.8, score

        elif tc.metric == EvalMetric.CUSTOM:
            return tc.custom_evaluator(output, tc)

        return False, 0.0

    def _semantic_similarity(self, output: str, expected: str) -> tuple[bool, float]:
        """Use LLM to judge semantic similarity"""
        prompt = f"""
        Compare these two responses for semantic equivalence.
        Rate similarity from 0.0 to 1.0.

        Expected: {expected}
        Actual: {output}

        Return only a number between 0.0 and 1.0.
        """

        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            temperature=0
        )

        try:
            score = float(response.choices[0].message.content.strip())
            return score >= 0.7, score
        except ValueError:
            return False, 0.0
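Note that the evaluator reads a last_tools_used attribute from the agent, which the AgenticRAG class above does not yet set. A minimal sketch of the bookkeeping to add inside AgenticRAG.query so the TOOL_USAGE metric has something to inspect:

# In AgenticRAG.query, before the iteration loop starts:
self.last_tools_used = []

# Inside the tool-call loop, before executing each call:
self.last_tools_used.append(tool_call.function.name)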

Using the Evaluator

# Define test cases
test_cases = [
    TestCase(
        name="simple_factual",
        input="What is the capital of France?",
        expected_contains=["Paris"],
        metric=EvalMetric.CONTAINS
    ),
    TestCase(
        name="requires_search",
        input="What are our company's refund policies?",
        expected_tools=["search_knowledge_base"],
        expected_contains=["refund", "days"],
        metric=EvalMetric.TOOL_USAGE
    ),
    TestCase(
        name="complex_reasoning",
        input="Compare our Q3 and Q4 sales performance",
        expected_tools=["search_knowledge_base"],
        metric=EvalMetric.CUSTOM,
        custom_evaluator=lambda output, tc: (
            "Q3" in output and "Q4" in output and len(output) > 100,
            0.8 if "Q3" in output and "Q4" in output else 0.3
        )
    )
]

# Run evaluation
evaluator = AgentEvaluator(my_agent)
results = evaluator.run_evaluation(test_cases)

# Summary statistics
passed = sum(1 for r in results if r.passed)
avg_score = sum(r.score for r in results) / len(results)
avg_latency = sum(r.latency_ms for r in results) / len(results)

print(f"\nResults: {passed}/{len(results)} passed")
print(f"Average score: {avg_score:.2f}")
print(f"Average latency: {avg_latency:.0f}ms")

LLM-as-Judge Evaluation

For subjective quality assessment, use an LLM as evaluator:

def llm_judge(
    question: str,
    response: str,
    criteria: List[str]
) -> dict:
    """
    Use LLM to evaluate response quality.

    Args:
        question: Original user question
        response: Agent's response
        criteria: List of evaluation criteria

    Returns:
        Scores for each criterion
    """
    criteria_str = "\n".join(f"- {c}" for c in criteria)

    prompt = f"""
    Evaluate this response on a scale of 1-5 for each criterion.

    Question: {question}
    Response: {response}

    Criteria:
    {criteria_str}

    Return JSON with scores for each criterion and brief justification:
    {{"criterion_name": {{"score": 1-5, "reason": "..."}}, ...}}
    """

    result = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0
    )

    return json.loads(result.choices[0].message.content)


# Usage
criteria = [
    "accuracy: factual correctness",
    "completeness: addresses all parts of the question",
    "clarity: easy to understand",
    "relevance: stays on topic"
]

scores = llm_judge(
    question="How do I reset my password?",
    response=agent_response,
    criteria=criteria
)
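The judge's per-criterion scores can be folded back into the evaluator as a CUSTOM metric. A minimal sketch, assuming the JSON shape requested in the prompt above and treating an average score of 4 or higher as a pass:

# Minimal sketch: aggregate the judge's 1-5 scores into the (passed, score)
# tuple expected by TestCase.custom_evaluator (assumes each JSON value has
# a "score" field, as requested in the prompt above)
def judge_evaluator(output: str, tc: TestCase) -> tuple[bool, float]:
    judged = llm_judge(tc.input, output, criteria)
    avg = sum(v["score"] for v in judged.values()) / len(judged)
    return avg >= 4.0, avg / 5.0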

Continuous Monitoring

Production agents need ongoing monitoring:

from datetime import datetime
from typing import List
import logging

class AgentMonitor:
    def __init__(self, agent_name: str):
        self.agent_name = agent_name
        self.metrics = []
        self.logger = logging.getLogger(f"agent.{agent_name}")

    def log_interaction(
        self,
        query: str,
        response: str,
        latency_ms: float,
        tools_used: List[str],
        success: bool
    ):
        record = {
            "timestamp": datetime.now().isoformat(),
            "query": query,
            "response_length": len(response),
            "latency_ms": latency_ms,
            "tools_used": tools_used,
            "success": success
        }

        self.metrics.append(record)
        self.logger.info(f"Query processed in {latency_ms:.0f}ms, success={success}")

    def get_summary(self, last_n: int = 100) -> dict:
        recent = self.metrics[-last_n:]
        if not recent:
            return {}

        return {
            "total_queries": len(recent),
            "success_rate": sum(1 for m in recent if m["success"]) / len(recent),
            "avg_latency_ms": sum(m["latency_ms"] for m in recent) / len(recent),
            "tool_usage": self._count_tools(recent)
        }

    def _count_tools(self, records: List[dict]) -> dict:
        counts = {}
        for r in records:
            for tool in r["tools_used"]:
                counts[tool] = counts.get(tool, 0) + 1
        return counts
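In production, each call gets wrapped with the monitor. A minimal usage sketch, assuming the AgenticRAG agent from earlier and the last_tools_used bookkeeping described above:

import time

monitor = AgentMonitor("support_rag")

start = time.time()
answer = rag.query("How long does shipping take?")
monitor.log_interaction(
    query="How long does shipping take?",
    response=answer,
    latency_ms=(time.time() - start) * 1000,
    tools_used=getattr(rag, "last_tools_used", []),
    success=bool(answer)
)

print(monitor.get_summary())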

Key Takeaways

Agentic RAG

  1. Agent controls retrieval: Let the agent decide when and how to search
  2. Iterative refinement: Allow query reformulation and multiple retrieval attempts
  3. Hybrid approaches: Combine semantic and keyword search for better coverage
  4. Skip when possible: Don’t retrieve unnecessarily

Evaluation

  1. Multi-dimensional: Evaluate accuracy, reliability, tool usage, latency, and cost
  2. Test systematically: Build comprehensive test suites with clear expected outcomes
  3. Use LLM judges: For subjective quality assessment
  4. Monitor continuously: Production agents need ongoing observation

With agentic RAG, agents become smarter about knowledge acquisition. With proper evaluation, you gain confidence that they actually work. This completes our exploration of single-agent capabilities. In the next post, we’ll step into multi-agent territory - designing systems where multiple specialized agents collaborate.


This is Part 11 of my series on building intelligent AI systems. Next: designing multi-agent architectures.
