This post continues our coverage of Google’s agent whitepaper series, following Introduction to Agents, Agent Tools & MCP, and Context Engineering. This fourth installment tackles perhaps the most critical challenge: how do we know if an agent is actually good?
Source: Agent Quality (PDF) by Google, November 2025
Why Agent Quality Demands a New Approach
Traditional software testing is deterministic: given input X, expect output Y. A function either passes or fails. But AI agents break this model entirely. Their non-deterministic nature means the same input can produce different (yet equally valid) outputs. An agent can pass 100 unit tests and still fail catastrophically in production - not because of a bug in the code, but because of a flaw in its judgment.
The whitepaper frames this shift powerfully:
Traditional software verification asks: “Did we build the product right?” Modern AI evaluation must ask: “Did we build the right product?”
The Delivery Truck vs Formula 1 Analogy
Think of traditional software as a delivery truck. Quality assurance is a checklist: Did the engine start? Did it follow the fixed route? Was it on time?
An AI agent is more like a Formula 1 race car. Success depends on dynamic judgment. Evaluation cannot be a simple checklist - it requires continuous telemetry to judge every decision, from fuel consumption to braking strategy.
```mermaid
flowchart TB
subgraph Traditional["Traditional Software (Delivery Truck)"]
direction LR
T1["Fixed Input"] --> T2["Deterministic Logic"] --> T3["Expected Output"]
end
subgraph Agent["AI Agent (F1 Race Car)"]
direction LR
A1["User Intent"] --> A2["Dynamic Reasoning"] --> A3["Variable Paths"]
A3 --> A4["Adaptive Output"]
end
classDef blueClass fill:#4A90E2,stroke:#333,stroke-width:2px,color:#fff
classDef orangeClass fill:#F39C12,stroke:#333,stroke-width:2px,color:#fff
classDef greenClass fill:#27AE60,stroke:#333,stroke-width:2px,color:#fff
class T1,T2,T3 blueClass
class A1,A2,A3,A4 orangeClass
```
Agent Failure Modes
Unlike traditional software that crashes with explicit errors, agents fail silently. The system keeps running, API calls return 200 OK, outputs look plausible - but they’re profoundly wrong.
| Failure Mode | Description | Example |
|---|---|---|
| Algorithmic Bias | Operationalizes biases from training data | Financial agent over-penalizing loan applications based on zip codes |
| Factual Hallucination | Produces plausible but invented information | Research tool generating false historical dates with high confidence |
| Concept Drift | Performance degrades as real-world data changes | Fraud detection missing new attack patterns |
| Emergent Behaviors | Develops unanticipated strategies | Finding loopholes in system rules, engaging in proxy wars with other bots |
The Evolution: From ML to Multi-Agent Systems
Each stage of AI evolution adds a new layer of evaluative complexity:
```mermaid
flowchart LR
ML["Traditional ML
───
Clear metrics:
F1, RMSE"]
LLM["Passive LLMs
───
No simple metrics
Probabilistic output"]
RAG["LLM + RAG
───
Multi-component
Retrieval + Generation"]
Agent["LLM Agents
───
Planning + Tools
+ Memory"]
MAS["Multi-Agent
───
Emergent behaviors
System-level failures"]
ML --> LLM --> RAG --> Agent --> MAS
classDef blueClass fill:#4A90E2,stroke:#333,stroke-width:2px,color:#fff
classDef orangeClass fill:#F39C12,stroke:#333,stroke-width:2px,color:#fff
classDef greenClass fill:#27AE60,stroke:#333,stroke-width:2px,color:#fff
classDef purpleClass fill:#9B59B6,stroke:#333,stroke-width:2px,color:#fff
classDef redClass fill:#E74C3C,stroke:#333,stroke-width:2px,color:#fff
class ML blueClass
class LLM orangeClass
class RAG greenClass
class Agent purpleClass
class MAS redClass
```
What makes agents uniquely challenging:
- Planning & Multi-Step Reasoning: Non-determinism compounds at every step. A small word choice in Step 1 can send the agent down a completely different path by Step 4
- Tool Use: Actions depend on external, uncontrollable world state
- Memory: Behavior evolves based on past interactions - same input today may produce different results than yesterday
The Four Pillars of Agent Quality
The whitepaper establishes a framework for holistic agent evaluation:
```mermaid
flowchart LR
E["🎯 Effectiveness
Goal Achievement"]
EF["⚡ Efficiency
Operational Cost"]
R["🛡️ Robustness
Reliability"]
S["🔒 Safety
Trustworthiness"]
classDef blueClass fill:#4A90E2,stroke:#333,stroke-width:2px,color:#fff
classDef orangeClass fill:#F39C12,stroke:#333,stroke-width:2px,color:#fff
classDef greenClass fill:#27AE60,stroke:#333,stroke-width:2px,color:#fff
classDef redClass fill:#E74C3C,stroke:#333,stroke-width:2px,color:#fff
class E blueClass
class EF orangeClass
class R greenClass
class S redClass
```
| Pillar | Core Question | Example Metrics |
|---|---|---|
| Effectiveness | Did the agent achieve the user’s actual intent? | PR acceptance rate, conversion rate, session completion |
| Efficiency | Did it solve the problem well? | Token count, latency, trajectory steps, API costs |
| Robustness | How does it handle adversity? | Graceful degradation, retry success, error recovery |
| Safety & Alignment | Does it operate within ethical boundaries? | Bias metrics, prompt injection resistance, data leakage prevention |
The critical insight: you cannot measure any of these pillars if you only see the final answer. You need visibility into the entire decision-making process.
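One lightweight way to keep all four pillars visible in an evaluation harness is a per-task scorecard. The sketch below is a minimal illustration; the field names and thresholds are my assumptions, not something prescribed by the whitepaper:

```python
from dataclasses import dataclass, field

@dataclass
class PillarScorecard:
    """Illustrative per-task record covering all four pillars at once."""
    task_id: str
    goal_achieved: bool        # Effectiveness: did the agent meet the user's intent?
    total_tokens: int          # Efficiency: cost proxy
    latency_ms: float          # Efficiency: end-to-end latency
    unrecovered_errors: int    # Robustness: failures the agent could not work around
    safety_flags: list = field(default_factory=list)  # Safety: policy/classifier hits

    def passes(self, max_tokens: int = 8000, max_latency_ms: float = 15000) -> bool:
        # A task only "passes" if every pillar is satisfied, not just effectiveness.
        return (self.goal_achieved
                and self.total_tokens <= max_tokens
                and self.latency_ms <= max_latency_ms
                and self.unrecovered_errors == 0
                and not self.safety_flags)
```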
The “Outside-In” Evaluation Framework
Evaluation must be a top-down, strategic process. Start with the only metric that ultimately matters - real-world success - before diving into technical details.
Stage 1: End-to-End Evaluation (Black Box)
The first question: Did the agent achieve the user’s goal effectively?
Before analyzing any internal thought or tool call, evaluate the final performance:
- Task Success Rate: Binary or graded score of goal completion
- User Satisfaction: Thumbs up/down, CSAT scores
- Overall Quality: Accuracy, completeness metrics
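As a minimal sketch of how these black-box signals might be aggregated over an evaluation run (the record format below is an assumption for illustration):

```python
def summarize_black_box(results: list) -> dict:
    """Aggregate end-to-end outcomes; each record is assumed to carry
    'success' (bool) and optionally 'csat' (a 1-5 user rating)."""
    total = len(results)
    successes = sum(1 for r in results if r["success"])
    ratings = [r["csat"] for r in results if r.get("csat") is not None]
    return {
        "task_success_rate": successes / total if total else 0.0,
        "avg_csat": sum(ratings) / len(ratings) if ratings else None,
        "n_tasks": total,
    }

# Example: two of three evaluation tasks reached the user's goal.
print(summarize_black_box([
    {"success": True, "csat": 5},
    {"success": True},
    {"success": False, "csat": 2},
]))
```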
If the agent scores 100% at this stage, your work may be done. But when failures occur, we need to open the box.
Stage 2: Trajectory Evaluation (Glass Box)
Once a failure is identified, analyze the agent’s entire execution trajectory:
| Component | What to Evaluate | Common Failures |
|---|---|---|
| LLM Planning | Core reasoning quality | Hallucinations, off-topic responses, repetitive loops |
| Tool Selection | Right tool for the task | Wrong tool, hallucinated tool names, unnecessary calls |
| Tool Parameterization | Correct arguments | Missing params, wrong types, malformed JSON |
| Response Interpretation | Understanding tool output | Misinterpreting data, missing error states |
| RAG Performance | Retrieval quality | Irrelevant docs, outdated info, ignored context |
| Efficiency | Resource allocation | Excessive API calls, high latency, redundant work |
| Multi-Agent Dynamics | Inter-agent coordination | Communication loops, role conflicts |
The trace moves us from “the final answer is wrong” to “the final answer is wrong because…”.
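Here is a minimal sketch of one such trajectory check, comparing the tools an agent actually called against a reference trajectory; the trace format and tool names are illustrative assumptions:

```python
def check_tool_trajectory(trace: list, expected_tools: list) -> dict:
    """Compare the ordered tool calls in a trace against an expected sequence.
    Each trace step is assumed to look like {"type": "tool_call", "name": ...}."""
    actual = [step["name"] for step in trace if step.get("type") == "tool_call"]
    return {
        "exact_match": actual == expected_tools,
        "missing_tools": [t for t in expected_tools if t not in actual],
        "unexpected_tools": [t for t in actual if t not in expected_tools],
        "n_calls": len(actual),  # efficiency signal: redundant calls inflate this
    }

# Example: the agent skipped the retrieval step and called search twice.
trace = [
    {"type": "tool_call", "name": "web_search"},
    {"type": "tool_call", "name": "web_search"},
    {"type": "llm_response", "text": "..."},
]
print(check_tool_trajectory(trace, expected_tools=["retrieve_docs", "web_search"]))
```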
The Evaluators: Methods of Judgment
Knowing what to evaluate is half the battle. The other half is how to judge it.
Automated Metrics
Fast and reproducible, useful for regression testing:
| Metric Type | Examples | Use Case |
|---|---|---|
| String-based | ROUGE, BLEU | Comparing to reference text |
| Embedding-based | BERTScore, cosine similarity | Semantic closeness |
| Task-specific | TruthfulQA | Domain benchmarks |
Limitation: Metrics are efficient but shallow - they capture surface similarity, not deeper reasoning or user value.
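For illustration, a minimal embedding-based gate of the kind used in regression testing; the embedding model itself is out of scope here, so the vectors are assumed to come from whichever encoder you already use:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Embedding-based metric: semantic closeness of a candidate vs. a reference."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def passes_regression(candidate_vec: np.ndarray, reference_vec: np.ndarray,
                      threshold: float = 0.85) -> bool:
    """CI/CD-style gate: flag outputs that drift too far from the golden answer."""
    return cosine_similarity(candidate_vec, reference_vec) >= threshold

# Toy 3-d vectors; real embeddings would have a few hundred dimensions.
print(passes_regression(np.array([0.9, 0.1, 0.0]), np.array([1.0, 0.0, 0.0])))  # True
```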
LLM-as-a-Judge
Use a powerful model (like Gemini) to evaluate another agent’s outputs. Provide:
- The agent’s output
- Original prompt
- Golden answer (if available)
- Detailed evaluation rubric
The whitepaper's example rubric opens with: "You are an expert evaluator for a customer support chatbot."
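Below is a minimal sketch of how such a judge call could be assembled; everything beyond that first rubric line, along with the `call_judge_model` helper, is an illustrative assumption rather than the whitepaper's actual prompt:

```python
from typing import Optional

JUDGE_RUBRIC = """You are an expert evaluator for a customer support chatbot.
Given the original user prompt, the chatbot's answer, and (if provided) a golden
answer, rate the answer from 1-5 on correctness, helpfulness, and tone.
Return JSON: {"score": <1-5>, "reasoning": "<one short paragraph>"}."""

def build_judge_prompt(user_prompt: str, agent_output: str,
                       golden: Optional[str] = None) -> str:
    parts = [JUDGE_RUBRIC,
             f"\nUser prompt:\n{user_prompt}",
             f"\nAgent output:\n{agent_output}"]
    if golden:
        parts.append(f"\nGolden answer:\n{golden}")
    return "\n".join(parts)

# call_judge_model is a stand-in for your LLM client (e.g. a Gemini API call):
# verdict = call_judge_model(build_judge_prompt(prompt, output, golden))
```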
Best Practice: Use pairwise comparison over single-scoring. A high “win rate” is more reliable than noisy absolute scores.
Agent-as-a-Judge
While LLMs score final responses, agents require evaluation of their full reasoning and actions. The emerging Agent-as-a-Judge paradigm uses one agent to evaluate another’s execution trace:
- Plan quality: Was the plan logical and feasible?
- Tool use: Were right tools chosen and applied correctly?
- Context handling: Did agent use prior information effectively?
Human-in-the-Loop (HITL)
Automation provides scale, but struggles with deep subjectivity and domain knowledge. HITL is essential for:
- Domain Expertise: Medical, legal, financial accuracy
- Interpreting Nuance: Tone, creativity, complex ethical alignment
- Creating the Golden Set: Establishing benchmarks before automation
```mermaid
flowchart TB
subgraph Methods["Evaluation Methods"]
direction LR
AM["Automated
Metrics"]
LJ["LLM-as-a
-Judge"]
AJ["Agent-as-a
-Judge"]
HI["Human-in
-the-Loop"]
end
AM --> |"Scale"| LJ --> |"Process"| AJ --> |"Ground Truth"| HI
classDef blueClass fill:#4A90E2,stroke:#333,stroke-width:2px,color:#fff
classDef orangeClass fill:#F39C12,stroke:#333,stroke-width:2px,color:#fff
classDef greenClass fill:#27AE60,stroke:#333,stroke-width:2px,color:#fff
classDef purpleClass fill:#9B59B6,stroke:#333,stroke-width:2px,color:#fff
class AM blueClass
class LJ orangeClass
class AJ greenClass
class HI purpleClass
```
| Evaluator | Scalability | Nuance | Cost | Best For |
|---|---|---|---|---|
| Automated Metrics | High | Low | Low | Regression testing, CI/CD gates |
| LLM-as-a-Judge | High | Medium | Medium | Rapid iteration, A/B testing |
| Agent-as-a-Judge | Medium | High | Medium | Process evaluation, trajectory analysis |
| Human-in-the-Loop | Low | Very High | High | Ground truth, domain expertise, edge cases |
Observability: The Three Pillars
You cannot judge a process you cannot see. The whitepaper uses a compelling analogy:
Line Cook vs Gourmet Chef
Traditional Software is a Line Cook: Rigid recipe card - toast bun 30 seconds, grill patty 90 seconds, add cheese. Monitoring is a checklist.
AI Agent is a Gourmet Chef in a Mystery Box Challenge: Given a goal and ingredients, no single correct recipe. We need to understand the reasoning - why pair raspberries with basil? How did they adapt when out of sugar?
This is the shift from monitoring (is it running?) to observability (is it thinking effectively?).
The Three Pillars
```mermaid
flowchart TB
subgraph Observability["Agent Observability"]
direction LR
L["📝 Logs
The Agent's Diary
───
What happened"]
T["🔗 Traces
Following Footsteps
───
Why it happened"]
M["📊 Metrics
Health Report
───
How well"]
end
classDef blueClass fill:#4A90E2,stroke:#333,stroke-width:2px,color:#fff
classDef orangeClass fill:#F39C12,stroke:#333,stroke-width:2px,color:#fff
classDef greenClass fill:#27AE60,stroke:#333,stroke-width:2px,color:#fff
class L blueClass
class T orangeClass
class M greenClass
```
Pillar 1: Logs - The Agent’s Diary
Timestamped entries recording discrete events. A structured JSON format captures:
- Prompt/response pairs
- Intermediate reasoning steps (chain of thought)
- Tool calls with inputs/outputs/errors
- State changes
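As a minimal sketch of what one such structured entry might look like when emitted from Python (the field names are illustrative assumptions, not a fixed schema):

```python
import json
import time
import uuid

def log_agent_event(event_type: str, payload: dict, trace_id: str) -> None:
    """Emit one structured, timestamped log entry for a discrete agent event."""
    entry = {
        "timestamp": time.time(),
        "trace_id": trace_id,        # ties this entry to a full trace (see next pillar)
        "event_type": event_type,    # e.g. "llm_call", "tool_call", "state_change"
        "payload": payload,
    }
    print(json.dumps(entry))         # in production, ship to your logging backend

trace_id = str(uuid.uuid4())
log_agent_event("tool_call", {
    "tool": "get_order_status",
    "input": {"order_id": "A-1043"},
    "output": {"status": "shipped"},
    "error": None,
}, trace_id)
```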
Pillar 2: Traces - Following the Agent’s Footsteps
If logs are diary entries, traces are the narrative thread connecting them. They follow a single task from initial query to final answer.
Consider a failure where a user gets a nonsensical answer:
- Isolated Logs show only disconnected errors: `ERROR: RAG search failed` and `ERROR: LLM response failed validation`
- A Trace reveals the causal chain: User Query → RAG Search (failed) → Faulty Tool Call (null input) → LLM Error (confused) → Incorrect Answer
Modern tracing uses OpenTelemetry with:
- Spans: Named operations (llm_call, tool_execution)
- Attributes: Metadata (prompt_id, latency_ms, token_count)
- Context Propagation: Links spans via trace_id
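A minimal sketch using the OpenTelemetry Python API (assuming the `opentelemetry-api` package is installed; exporter setup and error handling are omitted, and the span and attribute names are illustrative):

```python
from opentelemetry import trace

tracer = trace.get_tracer("agent.runtime")

def answer_query(user_query: str) -> str:
    # One root span per task: every child span shares its trace_id via context propagation.
    with tracer.start_as_current_span("agent_task") as task_span:
        task_span.set_attribute("user_query", user_query)

        with tracer.start_as_current_span("rag_search") as rag_span:
            docs = []                              # imagine the retrieval failed here
            rag_span.set_attribute("num_docs", len(docs))

        with tracer.start_as_current_span("llm_call") as llm_span:
            llm_span.set_attribute("token_count", 512)
            return "..."                           # model call omitted in this sketch
```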
Pillar 3: Metrics - The Agent’s Health Report
Aggregated scores derived from logs and traces. Split into two categories:
System Metrics (Vital Signs):
| Metric | Derivation | Purpose |
|---|---|---|
| Latency P50/P99 | Aggregate duration_ms from traces | User experience |
| Error Rate | % of traces with error=true | System health |
| Tokens per Task | Average token_count | Cost management |
| Task Completion Rate | % reaching success span | Effectiveness |
Quality Metrics (Decision Quality):
| Metric | Evaluation Method | Purpose |
|---|---|---|
| Correctness | Golden set comparison, LLM-as-Judge | Accuracy |
| Trajectory Adherence | Compare to ideal path | Process quality |
| Safety Score | RAI classifiers | Responsibility |
| Helpfulness | User feedback, LLM-as-Judge | User value |
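A minimal sketch of deriving a few of the system metrics above from raw trace records; the record shape is an assumption:

```python
import math
import statistics

def system_metrics(traces: list) -> dict:
    """Each trace record is assumed to carry 'duration_ms', 'error', and 'token_count'."""
    durations = sorted(t["duration_ms"] for t in traces)
    p99_idx = max(0, math.ceil(0.99 * len(durations)) - 1)  # nearest-rank approximation
    return {
        "latency_p50_ms": statistics.median(durations),
        "latency_p99_ms": durations[p99_idx],
        "error_rate": sum(1 for t in traces if t["error"]) / len(traces),
        "avg_tokens_per_task": statistics.mean(t["token_count"] for t in traces),
    }
```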
The Agent Quality Flywheel
The whitepaper synthesizes everything into a continuous improvement cycle:
```mermaid
flowchart TB
D["1️⃣ Define Quality
───
Four Pillars:
Effectiveness, Efficiency,
Robustness, Safety"]
I["2️⃣ Instrument Visibility
───
Logs, Traces
Observability Foundation"]
E["3️⃣ Evaluate Process
───
Outside-In Assessment
LLM-Judge + HITL"]
F["4️⃣ Feedback Loop
───
Failures → Regression Tests
Continuous Improvement"]
D --> I --> E --> F --> D
classDef blueClass fill:#4A90E2,stroke:#333,stroke-width:2px,color:#fff
classDef orangeClass fill:#F39C12,stroke:#333,stroke-width:2px,color:#fff
classDef greenClass fill:#27AE60,stroke:#333,stroke-width:2px,color:#fff
classDef purpleClass fill:#9B59B6,stroke:#333,stroke-width:2px,color:#fff
class D blueClass
class I orangeClass
class E greenClass
class F purpleClass
```
Each step builds on the previous:
- Define Quality: Establish concrete targets with the Four Pillars
- Instrument Visibility: Build observability into the architecture from day one
- Evaluate Process: Apply hybrid judgment (automated + human)
- Feedback Loop: Convert failures into regression tests
Every production failure, when captured and annotated, becomes a permanent test case in the “Golden” evaluation set. Every failure makes the system smarter.
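A minimal sketch of that last step, turning an annotated production failure into a golden-set regression case; the file layout and field names are assumptions for illustration:

```python
import json
from pathlib import Path

GOLDEN_SET = Path("golden_set.jsonl")

def add_regression_case(user_query: str, failing_trace_id: str,
                        expected_behavior: str, annotator: str) -> None:
    """Append an annotated production failure to the golden evaluation set,
    so every future eval run re-checks this exact scenario."""
    case = {
        "input": user_query,
        "source_trace_id": failing_trace_id,     # links back to the original failure
        "expected_behavior": expected_behavior,  # written by the human annotator
        "annotator": annotator,
        "tags": ["production_failure", "regression"],
    }
    with GOLDEN_SET.open("a", encoding="utf-8") as f:
        f.write(json.dumps(case) + "\n")

add_regression_case(
    user_query="Cancel my order and refund me to the original card",
    failing_trace_id="trace-8f21c4",  # illustrative id of the incident being converted
    expected_behavior="Agent must call cancel_order before issuing any refund",
    annotator="support-ops",
)
```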
Three Core Principles
The whitepaper distills its guidance into three foundational principles:
Principle 1: Evaluation as Architecture
Agent quality is an architectural pillar, not a final testing phase.
You don’t build a Formula 1 car and bolt on sensors afterward. Design agents to be “evaluatable-by-design” - instrumented from the first line of code.
Principle 2: Trajectory is Truth
The final answer is merely the last sentence of a long story. The true measure of logic, safety, and efficiency lies in the end-to-end “thought process” - the trajectory. This is only possible through deep observability.
Principle 3: Human as Arbiter
Automation is the tool for scale; humanity is the source of truth. An AI can help grade the test, but a human writes the rubric and decides what an ‘A+’ really means.
Key Takeaways
- Non-determinism breaks traditional QA: Agents fail silently with subtle quality degradations, not explicit crashes
- Four Pillars define quality: Effectiveness, Efficiency, Robustness, and Safety - you cannot measure any without process visibility
- Outside-In evaluation: Start with end-to-end success, then dive into trajectory analysis when failures occur
- Hybrid judgment is essential: Combine automated metrics, LLM-as-a-Judge, and Human-in-the-Loop
- Three Pillars of Observability: Logs (what), Traces (why), Metrics (how well)
- Quality Flywheel drives improvement: Define → Instrument → Evaluate → Feedback
- Build for evaluation from day one: Quality is an architectural choice, not an afterthought
Connecting the Series
This whitepaper builds on concepts from our agentic AI coverage:
| Previous Post | Connection |
|---|---|
| Introduction to Agents | Four Pillars extend the core architecture |
| Agent Tools & MCP | Tool evaluation in trajectory analysis |
| Context Engineering | Memory/session evaluation patterns |
| Anatomy of an AI Agent | Quality framework for building blocks |
The future of AI is agentic. Its success is determined by quality. Organizations that treat agent quality as an afterthought will be stuck in a cycle of promising demos and failed deployments. Those who invest in rigorous, architecturally integrated evaluation will deploy truly transformative, enterprise-grade AI systems.