Agentic RAG and Agent Evaluation Strategies

Traditional RAG (Retrieval-Augmented Generation) follows a fixed pattern: query in, documents out, response generated. But what if the agent could decide when and how to retrieve? Agentic RAG gives agents control over their own knowledge acquisition. In this post, I’ll explore this dynamic approach to retrieval, then tackle the equally important question: how do we know if our agents actually work?

The Limits of Traditional RAG

Standard RAG follows a rigid pipeline:

flowchart LR
    Q[Query] --> E[Embed]
    E --> S[Search]
    S --> D[Documents]
    D --> G[Generate]
    G --> R[Response]

    style S fill:#e3f2fd

This works for straightforward questions but fails when:

  • The query needs reformulation for better retrieval
  • Multiple retrieval steps are needed
  • Some queries don’t need retrieval at all
  • The initial results are insufficient

Agentic RAG: The Agent Decides

Agentic RAG shifts control to the agent. Retrieval becomes a tool the agent chooses to use, not a mandatory step:

flowchart TD
    Q[Query] --> A{Agent}
    A -->|Needs Info| R[Retrieve]
    R --> D[Documents]
    D --> A
    A -->|Needs More| R
    A -->|Knows Answer| G[Generate]
    A -->|Wrong Query| RQ[Reformulate]
    RQ --> R

    style A fill:#fff3e0
    style R fill:#e3f2fd

The agent can:

  1. Skip retrieval when it already knows the answer
  2. Reformulate queries for better search results
  3. Retrieve iteratively until it has enough information
  4. Evaluate results and decide if more searching is needed

Implementing Agentic RAG

The Retrieval Tool

First, wrap your vector store as a tool:

from langchain_core.tools import tool
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# Initialize vector store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma(
    collection_name="knowledge_base",
    embedding_function=embeddings,
    persist_directory="./chroma_db"
)

@tool
def search_knowledge_base(query: str, num_results: int = 5) -> str:
    """
    Search the knowledge base for relevant information.

    Args:
        query: Search query - be specific for better results
        num_results: Number of documents to retrieve (default 5)

    Returns:
        Relevant documents from the knowledge base
    """
    docs = vectorstore.similarity_search(query, k=num_results)

    results = []
    for i, doc in enumerate(docs):
        results.append(f"[Document {i+1}]")
        results.append(f"Source: {doc.metadata.get('source', 'Unknown')}")
        results.append(f"Content: {doc.page_content}")
        results.append("---")

    return "\n".join(results)
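Before wiring the tool into an agent, it can be exercised directly. A minimal sanity-check sketch, assuming documents have already been added to the Chroma collection:

# Quick sanity check: call the tool directly (assumes the collection
# above already contains documents)
print(search_knowledge_base.invoke({"query": "refund policy", "num_results": 3}))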

The Agentic RAG Agent

from openai import OpenAI
import json

client = OpenAI()

class AgenticRAG:
    def __init__(self):
        self.tools = [search_knowledge_base]
        self.tool_map = {t.name: t for t in self.tools}

    def query(self, question: str) -> str:
        messages = [
            {
                "role": "system",
                "content": """You are a knowledgeable assistant with access to a
knowledge base. Use these strategies:

1. If you're confident in your answer, respond directly
2. If you need specific facts, search the knowledge base
3. If initial results are insufficient, reformulate and search again
4. If the query is ambiguous, clarify before searching
5. Always cite your sources when using retrieved information

Be strategic about when to search - don't search unnecessarily."""
            },
            {"role": "user", "content": question}
        ]

        max_iterations = 5
        for _ in range(max_iterations):
            response = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=messages,
                tools=self._get_tool_schemas(),
                temperature=0
            )

            message = response.choices[0].message

            # No tool calls means the model is answering directly
            if not message.tool_calls:
                return message.content

            messages.append(message)

            for tool_call in message.tool_calls:
                result = self._execute_tool(tool_call)
                messages.append({
                    "role": "tool",
                    "tool_call_id": tool_call.id,
                    "content": result
                })

        return "Unable to find sufficient information."

    def _execute_tool(self, tool_call) -> str:
        func_name = tool_call.function.name
        func_args = json.loads(tool_call.function.arguments)
        return str(self.tool_map[func_name].invoke(func_args))

    def _get_tool_schemas(self) -> list:
        return [
            {
                "type": "function",
                "function": {
                    "name": tool.name,
                    "description": tool.description,
                    "parameters": tool.args_schema.schema()
                }
            }
            for tool in self.tools
        ]
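Using the class is a one-liner per question. A minimal usage sketch, assuming OPENAI_API_KEY is set and the knowledge base is populated:

# Minimal usage sketch (assumes OPENAI_API_KEY is set and the knowledge
# base contains relevant documents)
rag = AgenticRAG()
answer = rag.query("What does our warranty cover for water damage?")
print(answer)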

Query Reformulation

Add a dedicated reformulation tool for complex queries:

@tool
def reformulate_query(
    original_query: str,
    search_results: str,
    reason: str
) -> str:
    """
    Create a better search query when initial results are insufficient.

    Args:
        original_query: The original search query
        search_results: What was found (or not found)
        reason: Why the results were insufficient

    Returns:
        Improved search query
    """
    prompt = f"""
    Original query: {original_query}
    Results obtained: {search_results[:500]}
    Problem: {reason}

    Generate a better search query to find the needed information.
    Return only the new query, nothing else.
    """

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )

    return response.choices[0].message.content
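For the agent to call this, it needs to be registered alongside the retrieval tool; a minimal sketch of the change to the constructor shown earlier:

# In AgenticRAG.__init__, expose both tools to the model
self.tools = [search_knowledge_base, reformulate_query]
self.tool_map = {t.name: t for t in self.tools}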

Hybrid Search Strategies

Combine multiple retrieval approaches:

from typing import List, Dict

class HybridRetriever:
    def __init__(self, vectorstore, keyword_index):
        self.vectorstore = vectorstore
        self.keyword_index = keyword_index

    def search(
        self,
        query: str,
        method: str = "hybrid",
        k: int = 5
    ) -> List[Dict]:
        """
        Search using different strategies.

        Args:
            query: Search query
            method: "semantic", "keyword", or "hybrid"
            k: Number of results
        """
        if method == "semantic":
            return self._semantic_search(query, k)
        elif method == "keyword":
            return self._keyword_search(query, k)
        else:
            # Hybrid: combine both
            semantic_results = self._semantic_search(query, k)
            keyword_results = self._keyword_search(query, k)
            return self._merge_results(semantic_results, keyword_results, k)

    def _semantic_search(self, query: str, k: int) -> List[Dict]:
        docs = self.vectorstore.similarity_search_with_score(query, k=k)
        return [
            {"content": doc.page_content, "score": score, "source": "semantic"}
            for doc, score in docs
        ]

    def _keyword_search(self, query: str, k: int) -> List[Dict]:
        # BM25 or similar keyword search
        results = self.keyword_index.search(query, limit=k)
        return [
            {"content": r.text, "score": r.score, "source": "keyword"}
            for r in results
        ]

    def _merge_results(
        self,
        semantic: List[Dict],
        keyword: List[Dict],
        k: int
    ) -> List[Dict]:
        # Reciprocal rank fusion: each document earns 1 / (rank + 60) from
        # every result list it appears in, and the contributions are summed
        scores = {}
        for rank, result in enumerate(semantic):
            key = result["content"][:100]  # Use content prefix as dedup key
            scores[key] = scores.get(key, 0) + 1 / (rank + 60)

        for rank, result in enumerate(keyword):
            key = result["content"][:100]
            scores[key] = scores.get(key, 0) + 1 / (rank + 60)

        # Sort by combined score and return the top-k unique documents
        all_results = semantic + keyword
        unique = {r["content"][:100]: r for r in all_results}
        sorted_keys = sorted(scores.keys(), key=lambda key: scores[key], reverse=True)

        return [unique[key] for key in sorted_keys[:k] if key in unique]
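To let the agent choose a retrieval strategy, the retriever can itself be wrapped as a tool. A minimal sketch, assuming a HybridRetriever instance named hybrid_retriever has already been built from your vector store and keyword index:

@tool
def hybrid_search(query: str, method: str = "hybrid") -> str:
    """Search the knowledge base using 'semantic', 'keyword', or 'hybrid' retrieval."""
    # Assumes hybrid_retriever = HybridRetriever(vectorstore, keyword_index)
    hits = hybrid_retriever.search(query, method=method, k=5)
    return "\n---\n".join(f"[{h['source']}] {h['content']}" for h in hits)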

Agent Evaluation: Why It Matters

Building agents is one thing - knowing if they work is another. Without evaluation, you’re flying blind. Agent evaluation presents unique challenges:

  1. Non-determinism: The same input may produce different outputs (one mitigation is sketched after this list)
  2. Multi-step workflows: Errors compound across steps
  3. Tool usage: Must evaluate both tool selection and execution
  4. Quality is subjective: What makes a “good” response?
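One pragmatic response to the first challenge is to run each check several times and report a pass rate instead of a single pass/fail. A minimal sketch, assuming an agent object whose query method returns a string:

# Minimal sketch: repeat a check to account for non-deterministic outputs
# (assumes `agent.query(...)` returns a string)
def pass_rate(agent, question: str, must_contain: str, runs: int = 5) -> float:
    passes = sum(
        must_contain.lower() in agent.query(question).lower()
        for _ in range(runs)
    )
    return passes / runs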

Evaluation Dimensions

Evaluate agents across multiple dimensions:

flowchart TD
    E[Agent Evaluation]
    E --> A[Accuracy]
    E --> R[Reliability]
    E --> T[Tool Usage]
    E --> L[Latency]
    E --> C[Cost]

    A --> A1[Correct answers]
    R --> R1[Consistent behavior]
    T --> T1[Right tool selection]
    L --> L1[Response time]
    C --> C1[Token/API costs]

    style E fill:#fff3e0
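The framework below covers most of these dimensions directly; cost is easy to fold in as well, since chat completion responses report token usage. A minimal sketch, with per-token prices as placeholder assumptions:

# Minimal sketch: estimate the cost of a single call from reported token
# usage (prices are placeholders - substitute your model's actual rates)
INPUT_PRICE_PER_1K = 0.00015   # assumed placeholder, USD per 1K prompt tokens
OUTPUT_PRICE_PER_1K = 0.0006   # assumed placeholder, USD per 1K completion tokens

def call_cost(response) -> float:
    usage = response.usage  # populated on chat.completions responses
    return (usage.prompt_tokens * INPUT_PRICE_PER_1K
            + usage.completion_tokens * OUTPUT_PRICE_PER_1K) / 1000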

Building an Evaluation Framework

Test Case Structure

from dataclasses import dataclass
from typing import List, Optional, Callable
from enum import Enum

class EvalMetric(Enum):
    EXACT_MATCH = "exact_match"
    CONTAINS = "contains"
    SEMANTIC_SIMILARITY = "semantic_similarity"
    TOOL_USAGE = "tool_usage"
    CUSTOM = "custom"

@dataclass
class TestCase:
    """Single test case for agent evaluation"""
    name: str
    input: str
    expected_output: Optional[str] = None
    expected_tools: Optional[List[str]] = None
    expected_contains: Optional[List[str]] = None
    metric: EvalMetric = EvalMetric.CONTAINS
    custom_evaluator: Optional[Callable] = None

@dataclass
class EvalResult:
    test_case: TestCase
    actual_output: str
    passed: bool
    score: float
    latency_ms: float
    tools_used: List[str]
    error: Optional[str] = None

The Evaluator

import time
from typing import List
from openai import OpenAI

client = OpenAI()

class AgentEvaluator:
    def __init__(self, agent):
        self.agent = agent

    def run_evaluation(self, test_cases: List[TestCase]) -> List[EvalResult]:
        results = []
        for tc in test_cases:
            result = self._evaluate_single(tc)
            results.append(result)
            print(f"{'✓' if result.passed else '✗'} {tc.name}: {result.score:.2f}")
        return results

    def _evaluate_single(self, tc: TestCase) -> EvalResult:
        start_time = time.time()

        try:
            output = self.agent.query(tc.input)
            latency = (time.time() - start_time) * 1000

            passed, score = self._check_result(tc, output)

            return EvalResult(
                test_case=tc,
                actual_output=output,
                passed=passed,
                score=score,
                latency_ms=latency,
                tools_used=getattr(self.agent, 'last_tools_used', [])
            )
        except Exception as e:
            return EvalResult(
                test_case=tc,
                actual_output="",
                passed=False,
                score=0.0,
                latency_ms=0,
                tools_used=[],
                error=str(e)
            )

    def _check_result(self, tc: TestCase, output: str) -> tuple[bool, float]:
        if tc.metric == EvalMetric.EXACT_MATCH:
            passed = output.strip() == tc.expected_output.strip()
            return passed, 1.0 if passed else 0.0

        elif tc.metric == EvalMetric.CONTAINS:
            if not tc.expected_contains:
                return True, 1.0
            matches = sum(1 for term in tc.expected_contains if term.lower() in output.lower())
            score = matches / len(tc.expected_contains)
            return score >= 0.8, score

        elif tc.metric == EvalMetric.SEMANTIC_SIMILARITY:
            return self._semantic_similarity(output, tc.expected_output)

        elif tc.metric == EvalMetric.TOOL_USAGE:
            tools_used = getattr(self.agent, 'last_tools_used', [])
            if not tc.expected_tools:
                return True, 1.0
            matches = sum(1 for t in tc.expected_tools if t in tools_used)
            score = matches / len(tc.expected_tools)
            return score >= 0.8, score

        elif tc.metric == EvalMetric.CUSTOM:
            return tc.custom_evaluator(output, tc)

        return False, 0.0

    def _semantic_similarity(self, output: str, expected: str) -> tuple[bool, float]:
        """Use LLM to judge semantic similarity"""
        prompt = f"""
        Compare these two responses for semantic equivalence.
        Rate similarity from 0.0 to 1.0.

        Expected: {expected}
        Actual: {output}

        Return only a number between 0.0 and 1.0.
        """

        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            temperature=0
        )

        try:
            score = float(response.choices[0].message.content.strip())
            return score >= 0.7, score
        except ValueError:
            return False, 0.0
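Note that the evaluator reads a last_tools_used attribute from the agent, which the AgenticRAG class above does not yet set. A minimal sketch of the bookkeeping to add inside AgenticRAG.query so the TOOL_USAGE metric has something to inspect:

# In AgenticRAG.query, before the iteration loop starts:
self.last_tools_used = []

# Inside the tool-call loop, before executing each call:
self.last_tools_used.append(tool_call.function.name)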

Using the Evaluator

# Define test cases
test_cases = [
    TestCase(
        name="simple_factual",
        input="What is the capital of France?",
        expected_contains=["Paris"],
        metric=EvalMetric.CONTAINS
    ),
    TestCase(
        name="requires_search",
        input="What are our company's refund policies?",
        expected_tools=["search_knowledge_base"],
        expected_contains=["refund", "days"],
        metric=EvalMetric.TOOL_USAGE
    ),
    TestCase(
        name="complex_reasoning",
        input="Compare our Q3 and Q4 sales performance",
        expected_tools=["search_knowledge_base"],
        metric=EvalMetric.CUSTOM,
        custom_evaluator=lambda output, tc: (
            "Q3" in output and "Q4" in output and len(output) > 100,
            0.8 if "Q3" in output and "Q4" in output else 0.3
        )
    )
]

# Run evaluation
evaluator = AgentEvaluator(my_agent)
results = evaluator.run_evaluation(test_cases)

# Summary statistics
passed = sum(1 for r in results if r.passed)
avg_score = sum(r.score for r in results) / len(results)
avg_latency = sum(r.latency_ms for r in results) / len(results)

print(f"\nResults: {passed}/{len(results)} passed")
print(f"Average score: {avg_score:.2f}")
print(f"Average latency: {avg_latency:.0f}ms")

LLM-as-Judge Evaluation

For subjective quality assessment, use an LLM as evaluator:

def llm_judge(
    question: str,
    response: str,
    criteria: List[str]
) -> dict:
    """
    Use LLM to evaluate response quality.

    Args:
        question: Original user question
        response: Agent's response
        criteria: List of evaluation criteria

    Returns:
        Scores for each criterion
    """
    criteria_str = "\n".join(f"- {c}" for c in criteria)

    prompt = f"""
    Evaluate this response on a scale of 1-5 for each criterion.

    Question: {question}
    Response: {response}

    Criteria:
    {criteria_str}

    Return JSON with scores for each criterion and brief justification:
    {{"criterion_name": {{"score": 1-5, "reason": "..."}}, ...}}
    """

    result = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0
    )

    return json.loads(result.choices[0].message.content)


# Usage
criteria = [
    "accuracy: factual correctness",
    "completeness: addresses all parts of the question",
    "clarity: easy to understand",
    "relevance: stays on topic"
]

scores = llm_judge(
    question="How do I reset my password?",
    response=agent_response,
    criteria=criteria
)
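The judge's per-criterion scores can be folded back into the evaluator as a CUSTOM metric. A minimal sketch, assuming the JSON shape requested in the prompt above and treating an average score of 4 or higher as a pass:

# Minimal sketch: aggregate the judge's 1-5 scores into the (passed, score)
# tuple expected by TestCase.custom_evaluator (assumes each JSON value has
# a "score" field, as requested in the prompt above)
def judge_evaluator(output: str, tc: TestCase) -> tuple[bool, float]:
    judged = llm_judge(tc.input, output, criteria)
    avg = sum(v["score"] for v in judged.values()) / len(judged)
    return avg >= 4.0, avg / 5.0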

Continuous Monitoring

Production agents need ongoing monitoring:

from datetime import datetime
from typing import List
import logging

class AgentMonitor:
    def __init__(self, agent_name: str):
        self.agent_name = agent_name
        self.metrics = []
        self.logger = logging.getLogger(f"agent.{agent_name}")

    def log_interaction(
        self,
        query: str,
        response: str,
        latency_ms: float,
        tools_used: List[str],
        success: bool
    ):
        record = {
            "timestamp": datetime.now().isoformat(),
            "query": query,
            "response_length": len(response),
            "latency_ms": latency_ms,
            "tools_used": tools_used,
            "success": success
        }

        self.metrics.append(record)
        self.logger.info(f"Query processed in {latency_ms:.0f}ms, success={success}")

    def get_summary(self, last_n: int = 100) -> dict:
        recent = self.metrics[-last_n:]
        if not recent:
            return {}

        return {
            "total_queries": len(recent),
            "success_rate": sum(1 for m in recent if m["success"]) / len(recent),
            "avg_latency_ms": sum(m["latency_ms"] for m in recent) / len(recent),
            "tool_usage": self._count_tools(recent)
        }

    def _count_tools(self, records: List[dict]) -> dict:
        counts = {}
        for r in records:
            for tool in r["tools_used"]:
                counts[tool] = counts.get(tool, 0) + 1
        return counts
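In production, each call gets wrapped with the monitor. A minimal usage sketch, assuming the AgenticRAG agent from earlier and the last_tools_used bookkeeping described above:

import time

monitor = AgentMonitor("support_rag")

start = time.time()
answer = rag.query("How long does shipping take?")
monitor.log_interaction(
    query="How long does shipping take?",
    response=answer,
    latency_ms=(time.time() - start) * 1000,
    tools_used=getattr(rag, "last_tools_used", []),
    success=bool(answer)
)

print(monitor.get_summary())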

Key Takeaways

Agentic RAG

  1. Agent controls retrieval: Let the agent decide when and how to search
  2. Iterative refinement: Allow query reformulation and multiple retrieval attempts
  3. Hybrid approaches: Combine semantic and keyword search for better coverage
  4. Skip when possible: Don’t retrieve unnecessarily

Evaluation

  1. Multi-dimensional: Evaluate accuracy, reliability, tool usage, latency, and cost
  2. Test systematically: Build comprehensive test suites with clear expected outcomes
  3. Use LLM judges: For subjective quality assessment
  4. Monitor continuously: Production agents need ongoing observation

With agentic RAG, agents become smarter about knowledge acquisition. With proper evaluation, you gain confidence that they actually work. This completes our exploration of single-agent capabilities. In the next post, we’ll step into multi-agent territory - designing systems where multiple specialized agents collaborate.


This is Part 11 of my series on building intelligent AI systems. Next: designing multi-agent architectures.
