Summary: Google's Agent Quality & Evaluation Framework

This post continues our coverage of Google’s agent whitepaper series, following Introduction to Agents, Agent Tools & MCP, and Context Engineering. This fourth installment tackles perhaps the most critical challenge: how do we know if an agent is actually good?

Source: Agent Quality (PDF) by Google, November 2025

Why Agent Quality Demands a New Approach

Traditional software testing is deterministic: given input X, expect output Y. A function either passes or fails. But AI agents break this model entirely. Their non-deterministic nature means the same input can produce different (yet equally valid) outputs. An agent can pass 100 unit tests and still fail catastrophically in production - not because of a bug in the code, but because of a flaw in its judgment.

The whitepaper frames this shift powerfully:

Traditional software verification asks: “Did we build the product right?” Modern AI evaluation must ask: “Did we build the right product?”

The Delivery Truck vs Formula 1 Analogy

Think of traditional software as a delivery truck. Quality assurance is a checklist: Did the engine start? Did it follow the fixed route? Was it on time?

An AI agent is more like a Formula 1 race car. Success depends on dynamic judgment. Evaluation cannot be a simple checklist - it requires continuous telemetry to judge every decision, from fuel consumption to braking strategy.

flowchart TB
    subgraph Traditional["Traditional Software (Delivery Truck)"]
        direction LR
        T1["Fixed Input"] --> T2["Deterministic Logic"] --> T3["Expected Output"]
    end

    subgraph Agent["AI Agent (F1 Race Car)"]
        direction LR
        A1["User Intent"] --> A2["Dynamic Reasoning"] --> A3["Variable Paths"]
        A3 --> A4["Adaptive Output"]
    end

    classDef blueClass fill:#4A90E2,stroke:#333,stroke-width:2px,color:#fff
    classDef orangeClass fill:#F39C12,stroke:#333,stroke-width:2px,color:#fff
    classDef greenClass fill:#27AE60,stroke:#333,stroke-width:2px,color:#fff

    class T1,T2,T3 blueClass
    class A1,A2,A3,A4 orangeClass

Agent Failure Modes

Unlike traditional software that crashes with explicit errors, agents fail silently. The system keeps running, API calls return 200 OK, outputs look plausible - but they’re profoundly wrong.

| Failure Mode | Description | Example |
|---|---|---|
| Algorithmic Bias | Operationalizes biases from training data | Financial agent over-penalizing loan applications based on zip codes |
| Factual Hallucination | Produces plausible but invented information | Research tool generating false historical dates with high confidence |
| Concept Drift | Performance degrades as real-world data changes | Fraud detection missing new attack patterns |
| Emergent Behaviors | Develops unanticipated strategies | Finding loopholes in system rules, engaging in proxy wars with other bots |

The Evolution: From ML to Multi-Agent Systems

Each stage of AI evolution adds a new layer of evaluative complexity:

flowchart LR
    ML["Traditional ML
───
Clear metrics:
F1, RMSE"]
    LLM["Passive LLMs
───
No simple metrics
Probabilistic output"]
    RAG["LLM + RAG
───
Multi-component
Retrieval + Generation"]
    Agent["LLM Agents
───
Planning + Tools
+ Memory"]
    MAS["Multi-Agent
───
Emergent behaviors
System-level failures"]

    ML --> LLM --> RAG --> Agent --> MAS

    classDef blueClass fill:#4A90E2,stroke:#333,stroke-width:2px,color:#fff
    classDef orangeClass fill:#F39C12,stroke:#333,stroke-width:2px,color:#fff
    classDef greenClass fill:#27AE60,stroke:#333,stroke-width:2px,color:#fff
    classDef purpleClass fill:#9B59B6,stroke:#333,stroke-width:2px,color:#fff
    classDef redClass fill:#E74C3C,stroke:#333,stroke-width:2px,color:#fff

    class ML blueClass
    class LLM orangeClass
    class RAG greenClass
    class Agent purpleClass
    class MAS redClass

What makes agents uniquely challenging:

  1. Planning & Multi-Step Reasoning: Non-determinism compounds at every step. A small word choice in Step 1 can send the agent down a completely different path by Step 4
  2. Tool Use: Actions depend on external, uncontrollable world state
  3. Memory: Behavior evolves based on past interactions - same input today may produce different results than yesterday

The Four Pillars of Agent Quality

The whitepaper establishes a framework for holistic agent evaluation:

flowchart LR
    E["🎯 Effectiveness
Goal Achievement"]
    EF["⚡ Efficiency
Operational Cost"]
    R["🛡️ Robustness
Reliability"]
    S["🔒 Safety
Trustworthiness"]

    classDef blueClass fill:#4A90E2,stroke:#333,stroke-width:2px,color:#fff
    classDef orangeClass fill:#F39C12,stroke:#333,stroke-width:2px,color:#fff
    classDef greenClass fill:#27AE60,stroke:#333,stroke-width:2px,color:#fff
    classDef redClass fill:#E74C3C,stroke:#333,stroke-width:2px,color:#fff

    class E blueClass
    class EF orangeClass
    class R greenClass
    class S redClass

| Pillar | Core Question | Example Metrics |
|---|---|---|
| Effectiveness | Did the agent achieve the user's actual intent? | PR acceptance rate, conversion rate, session completion |
| Efficiency | Did it solve the problem well? | Token count, latency, trajectory steps, API costs |
| Robustness | How does it handle adversity? | Graceful degradation, retry success, error recovery |
| Safety & Alignment | Does it operate within ethical boundaries? | Bias metrics, prompt injection resistance, data leakage prevention |

The critical insight: you cannot measure any of these pillars if you only see the final answer. You need visibility into the entire decision-making process.

The “Outside-In” Evaluation Framework

Evaluation must be a top-down, strategic process. Start with the only metric that ultimately matters - real-world success - before diving into technical details.

Stage 1: End-to-End Evaluation (Black Box)

The first question: Did the agent achieve the user’s goal effectively?

Before analyzing any internal thought or tool call, evaluate the final performance:

  • Task Success Rate: Binary or graded score of goal completion
  • User Satisfaction: Thumbs up/down, CSAT scores
  • Overall Quality: Accuracy, completeness metrics

If the agent scores 100% at this stage, the work may be done. But when failures occur, we need to open the box.

Stage 2: Trajectory Evaluation (Glass Box)

Once a failure is identified, analyze the agent’s entire execution trajectory:

| Component | What to Evaluate | Common Failures |
|---|---|---|
| LLM Planning | Core reasoning quality | Hallucinations, off-topic responses, repetitive loops |
| Tool Selection | Right tool for the task | Wrong tool, hallucinated tool names, unnecessary calls |
| Tool Parameterization | Correct arguments | Missing params, wrong types, malformed JSON |
| Response Interpretation | Understanding tool output | Misinterpreting data, missing error states |
| RAG Performance | Retrieval quality | Irrelevant docs, outdated info, ignored context |
| Efficiency | Resource allocation | Excessive API calls, high latency, redundant work |
| Multi-Agent Dynamics | Inter-agent coordination | Communication loops, role conflicts |

The trace moves us from “the final answer is wrong” to “the final answer is wrong because…”.
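
To make one trajectory check concrete, here is a minimal sketch: verifying that the expected tool calls appear, in order, inside a recorded trace. The tool names and the list-of-strings trace format are illustrative assumptions, not a schema from the whitepaper.

```python
# Minimal sketch: trajectory adherence as an in-order subsequence check.
# The trace format (a list of tool-call names) is an illustrative assumption.
def trajectory_adheres(trace_tool_calls, expected_sequence):
    """True if expected_sequence appears in order within the recorded trace."""
    remaining = iter(trace_tool_calls)
    return all(any(step == call for call in remaining) for step in expected_sequence)

recorded = ["lookup_order", "get_shipping_status", "format_reply"]  # from a trace
expected = ["lookup_order", "get_shipping_status"]                  # ideal path
print(trajectory_adheres(recorded, expected))  # True: the ideal path was followed
```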

The Evaluators: Methods of Judgment

Knowing what to evaluate is half the battle. The other half is how to judge it.

Automated Metrics

Fast and reproducible, useful for regression testing:

| Metric Type | Examples | Use Case |
|---|---|---|
| String-based | ROUGE, BLEU | Comparing to reference text |
| Embedding-based | BERTScore, cosine similarity | Semantic closeness |
| Task-specific | TruthfulQA | Domain benchmarks |

Limitation: Metrics are efficient but shallow - they capture surface similarity, not deeper reasoning or user value.
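
As a small illustration of the embedding-based row above, semantic closeness can be reduced to a cosine similarity between embedding vectors. The vectors below are placeholder values standing in for the output of whatever embedding model you use.

```python
# Illustrative sketch: cosine similarity between an agent answer and a
# reference answer, given precomputed embedding vectors (placeholder values).
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

answer_vec = [0.12, 0.48, 0.31]     # e.g. embedding of the agent's answer
reference_vec = [0.10, 0.52, 0.29]  # e.g. embedding of the golden answer
print(f"semantic similarity: {cosine_similarity(answer_vec, reference_vec):.3f}")
```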

LLM-as-a-Judge

Use a powerful model (like Gemini) to evaluate another agent’s outputs. Provide:

  • The agent’s output
  • Original prompt
  • Golden answer (if available)
  • Detailed evaluation rubric
You are an expert evaluator for a customer support chatbot.

[User Query]
"Hi, my order #12345 hasn't arrived yet."

[Answer A]
"I can see that order #12345 is currently out for delivery
and should arrive by 5 PM today."

[Answer B]
"Order #12345 is on the truck. It will be there by 5."

Compare them on correctness, helpfulness, and tone. Output
your decision as JSON with "winner" (A/B/tie) and "rationale".

Best Practice: Prefer pairwise comparison over single-answer scoring. A high “win rate” is more reliable than noisy absolute scores.
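
A hedged sketch of how pairwise judging could be wired up, reusing the rubric above; call_judge_model is a hypothetical wrapper around whatever LLM client you use, and the win rate counts how often the candidate (Answer A) beats the baseline (Answer B).

```python
# Hedged sketch: pairwise LLM-as-a-Judge scoring over a set of test cases.
# call_judge_model is a hypothetical wrapper around your LLM client.
import json

JUDGE_PROMPT = """You are an expert evaluator for a customer support chatbot.

[User Query]
{query}

[Answer A]
{answer_a}

[Answer B]
{answer_b}

Compare them on correctness, helpfulness, and tone. Output your decision as
JSON with "winner" (A/B/tie) and "rationale"."""

def pairwise_win_rate(cases, call_judge_model):
    """Fraction of cases where the candidate (Answer A) beats the baseline (B)."""
    wins = 0
    for case in cases:  # each case: {"query": ..., "answer_a": ..., "answer_b": ...}
        verdict = json.loads(call_judge_model(JUDGE_PROMPT.format(**case)))
        wins += verdict["winner"] == "A"
    return wins / len(cases)
```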

Agent-as-a-Judge

While LLM-as-a-Judge scores final responses, agents also require evaluation of their full reasoning and actions. The emerging Agent-as-a-Judge paradigm uses one agent to evaluate another’s execution trace (a minimal sketch follows the list):

  • Plan quality: Was the plan logical and feasible?
  • Tool use: Were the right tools chosen and applied correctly?
  • Context handling: Did the agent use prior information effectively?
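
A minimal sketch of what this could look like: serialize the recorded spans into a trace transcript and ask a judge model to score the plan, tool use, and context handling. The span fields and the call_judge_model helper are assumptions for illustration, not an API from the whitepaper.

```python
# Hedged sketch: judging an execution trace, not just the final answer.
# Span fields and call_judge_model are illustrative assumptions.
TRACE_RUBRIC = """Evaluate the agent's execution trace against its goal.
Score 1-5 for plan_quality, tool_use, and context_handling. Return JSON.

[Goal]
{goal}

[Trace]
{trace}"""

def judge_trajectory(goal, spans, call_judge_model):
    trace_text = "\n".join(
        f"{i + 1}. {span['name']} args={span.get('args')} result={span.get('result')}"
        for i, span in enumerate(spans)
    )
    return call_judge_model(TRACE_RUBRIC.format(goal=goal, trace=trace_text))
```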

Human-in-the-Loop (HITL)

Automation provides scale, but struggles with deep subjectivity and domain knowledge. HITL is essential for:

  • Domain Expertise: Medical, legal, financial accuracy
  • Interpreting Nuance: Tone, creativity, complex ethical alignment
  • Creating the Golden Set: Establishing benchmarks before automation

flowchart TB
    subgraph Methods["Evaluation Methods"]
        direction LR
        AM["Automated
Metrics"]
        LJ["LLM-as-a
-Judge"]
        AJ["Agent-as-a
-Judge"]
        HI["Human-in
-the-Loop"]
    end

    AM --> |"Scale"| LJ --> |"Process"| AJ --> |"Ground Truth"| HI

    classDef blueClass fill:#4A90E2,stroke:#333,stroke-width:2px,color:#fff
    classDef orangeClass fill:#F39C12,stroke:#333,stroke-width:2px,color:#fff
    classDef greenClass fill:#27AE60,stroke:#333,stroke-width:2px,color:#fff
    classDef purpleClass fill:#9B59B6,stroke:#333,stroke-width:2px,color:#fff

    class AM blueClass
    class LJ orangeClass
    class AJ greenClass
    class HI purpleClass

| Evaluator | Scalability | Nuance | Cost | Best For |
|---|---|---|---|---|
| Automated Metrics | High | Low | Low | Regression testing, CI/CD gates |
| LLM-as-a-Judge | High | Medium | Medium | Rapid iteration, A/B testing |
| Agent-as-a-Judge | Medium | High | Medium | Process evaluation, trajectory analysis |
| Human-in-the-Loop | Low | Very High | High | Ground truth, domain expertise, edge cases |

Observability: The Three Pillars

You cannot judge a process you cannot see. The whitepaper uses a compelling analogy:

Line Cook vs Gourmet Chef

Traditional Software is a Line Cook: Rigid recipe card - toast bun 30 seconds, grill patty 90 seconds, add cheese. Monitoring is a checklist.

AI Agent is a Gourmet Chef in a Mystery Box Challenge: Given a goal and a set of ingredients, there is no single correct recipe. We need to understand the reasoning - why pair raspberries with basil? How did the chef adapt when out of sugar?

This is the shift from monitoring (is it running?) to observability (is it thinking effectively?).

The Three Pillars

flowchart TB
    subgraph Observability["Agent Observability"]
        direction LR
        L["📝 Logs
The Agent's Diary
───
What happened"]
        T["🔗 Traces
Following Footsteps
───
Why it happened"]
        M["📊 Metrics
Health Report
───
How well"]
    end

    classDef blueClass fill:#4A90E2,stroke:#333,stroke-width:2px,color:#fff
    classDef orangeClass fill:#F39C12,stroke:#333,stroke-width:2px,color:#fff
    classDef greenClass fill:#27AE60,stroke:#333,stroke-width:2px,color:#fff

    class L blueClass
    class T orangeClass
    class M greenClass

Pillar 1: Logs - The Agent’s Diary

Timestamped entries recording discrete events. A structured JSON format captures:

  • Prompt/response pairs
  • Intermediate reasoning steps (chain of thought)
  • Tool calls with inputs/outputs/errors
  • State changes
{
  "timestamp": "2025-07-10T15:26:13.778Z",
  "level": "DEBUG",
  "component": "google_adk.models.google_llm",
  "event": "LLM Request",
  "system_instruction": "You roll dice and answer questions...",
  "contents": [
    {"role": "user", "text": "Roll a 6 sided dice"},
    {"role": "model", "function_call": {"name": "roll_die", "args": {"sides": 6}}},
    {"role": "user", "function_response": {"result": 2}}
  ],
  "functions_available": ["roll_die", "check_prime"]
}

Pillar 2: Traces - Following the Agent’s Footsteps

If logs are diary entries, traces are the narrative thread connecting them. They follow a single task from initial query to final answer.

Consider a failure where a user gets a nonsensical answer:

  • Isolated Logs: ERROR: RAG search failed and ERROR: LLM response failed validation
  • A Trace reveals the causal chain: User Query → RAG Search (failed) → Faulty Tool Call (null input) → LLM Error (confused) → Incorrect Answer

Modern tracing uses OpenTelemetry with:

  • Spans: Named operations (llm_call, tool_execution)
  • Attributes: Metadata (prompt_id, latency_ms, token_count)
  • Context Propagation: Links spans via trace_id (see the sketch after this list)
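
A minimal sketch of this instrumentation in Python, assuming the standard opentelemetry-sdk package; the span names (agent_task, rag_search, llm_call) and attribute values are illustrative.

```python
# Minimal OpenTelemetry sketch: one root span per task, child spans per step,
# all linked by a shared trace_id. Span names and attributes are illustrative.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent.observability.demo")

def answer_query(user_query: str) -> str:
    with tracer.start_as_current_span("agent_task") as task_span:  # root span
        task_span.set_attribute("user.query", user_query)

        with tracer.start_as_current_span("rag_search") as rag_span:
            rag_span.set_attribute("retrieved_docs", 3)

        with tracer.start_as_current_span("llm_call") as llm_span:
            llm_span.set_attribute("token_count", 512)
            llm_span.set_attribute("latency_ms", 840)

        return "final answer"

answer_query("Where is my order #12345?")
```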

Pillar 3: Metrics - The Agent’s Health Report

Aggregated scores derived from logs and traces. Split into two categories:

System Metrics (Vital Signs):

| Metric | Derivation | Purpose |
|---|---|---|
| Latency P50/P99 | Aggregate duration_ms from traces | User experience |
| Error Rate | % of traces with error=true | System health |
| Tokens per Task | Average token_count | Cost management |
| Task Completion Rate | % reaching success span | Effectiveness |
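
A hedged sketch of deriving these vital signs from exported trace records; the record fields (duration_ms, error, token_count, success) mirror the attributes above, but the schema itself is an assumption, not one prescribed by the whitepaper.

```python
# Hedged sketch: aggregating system metrics from a batch of trace records.
# The record schema is an illustrative assumption.
def percentile(values, pct):
    ordered = sorted(values)
    idx = min(len(ordered) - 1, round(pct / 100 * (len(ordered) - 1)))
    return ordered[idx]

traces = [
    {"duration_ms": 820, "error": False, "token_count": 510, "success": True},
    {"duration_ms": 1430, "error": True, "token_count": 690, "success": False},
    {"duration_ms": 910, "error": False, "token_count": 450, "success": True},
]

durations = [t["duration_ms"] for t in traces]
print(f"latency p50={percentile(durations, 50)}ms p99={percentile(durations, 99)}ms")
print(f"error rate: {sum(t['error'] for t in traces) / len(traces):.0%}")
print(f"tokens per task: {sum(t['token_count'] for t in traces) / len(traces):.0f}")
print(f"task completion rate: {sum(t['success'] for t in traces) / len(traces):.0%}")
```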

Quality Metrics (Decision Quality):

| Metric | Evaluation Method | Purpose |
|---|---|---|
| Correctness | Golden set comparison, LLM-as-Judge | Accuracy |
| Trajectory Adherence | Compare to ideal path | Process quality |
| Safety Score | RAI classifiers | Responsibility |
| Helpfulness | User feedback, LLM-as-Judge | User value |

The Agent Quality Flywheel

The whitepaper synthesizes everything into a continuous improvement cycle:

flowchart TB
    D["1️⃣ Define Quality
───
Four Pillars:
Effectiveness, Efficiency,
Robustness, Safety"]
    I["2️⃣ Instrument Visibility
───
Logs, Traces
Observability Foundation"]
    E["3️⃣ Evaluate Process
───
Outside-In Assessment
LLM-Judge + HITL"]
    F["4️⃣ Feedback Loop
───
Failures → Regression Tests
Continuous Improvement"]

    D --> I --> E --> F --> D

    classDef blueClass fill:#4A90E2,stroke:#333,stroke-width:2px,color:#fff
    classDef orangeClass fill:#F39C12,stroke:#333,stroke-width:2px,color:#fff
    classDef greenClass fill:#27AE60,stroke:#333,stroke-width:2px,color:#fff
    classDef purpleClass fill:#9B59B6,stroke:#333,stroke-width:2px,color:#fff

    class D blueClass
    class I orangeClass
    class E greenClass
    class F purpleClass

Each step builds on the previous:

  1. Define Quality: Establish concrete targets with the Four Pillars
  2. Instrument Visibility: Build observability into the architecture from day one
  3. Evaluate Process: Apply hybrid judgment (automated + human)
  4. Feedback Loop: Convert failures into regression tests

Every production failure, when captured and annotated, becomes a permanent test case in the “Golden” evaluation set. Every failure makes the system smarter.
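
One way this could look in practice, as a hedged sketch: each annotated failure lives in a golden_set.jsonl file and gets replayed against the agent on every release. run_agent and judge_equivalent are hypothetical stand-ins for your own agent entry point and LLM-as-a-Judge (or human-written) scorer.

```python
# Hedged sketch: replaying the golden set of captured failures as regression
# tests. run_agent and judge_equivalent are hypothetical stand-ins.
import json

def load_golden_set(path="golden_set.jsonl"):
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

def regression_failures(run_agent, judge_equivalent, path="golden_set.jsonl"):
    """Return the ids of golden cases the current agent build still gets wrong."""
    failed = []
    for case in load_golden_set(path):
        answer = run_agent(case["query"])
        if not judge_equivalent(answer, case["expected"]):
            failed.append(case["id"])
    return failed
```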

Three Core Principles

The whitepaper distills its guidance into three foundational principles:

Principle 1: Evaluation as Architecture

Agent quality is an architectural pillar, not a final testing phase.

You don’t build a Formula 1 car and bolt on sensors afterward. Design agents to be “evaluatable-by-design” - instrumented from the first line of code.

Principle 2: Trajectory is Truth

The final answer is merely the last sentence of a long story. The true measure of logic, safety, and efficiency lies in the end-to-end “thought process” - the trajectory. This is only possible through deep observability.

Principle 3: Human as Arbiter

Automation is the tool for scale; humanity is the source of truth. An AI can help grade the test, but a human writes the rubric and decides what an ‘A+’ really means.

Key Takeaways

  1. Non-determinism breaks traditional QA: Agents fail silently with subtle quality degradations, not explicit crashes
  2. Four Pillars define quality: Effectiveness, Efficiency, Robustness, and Safety - you cannot measure any without process visibility
  3. Outside-In evaluation: Start with end-to-end success, then dive into trajectory analysis when failures occur
  4. Hybrid judgment is essential: Combine automated metrics, LLM-as-a-Judge, and Human-in-the-Loop
  5. Three Pillars of Observability: Logs (what), Traces (why), Metrics (how well)
  6. Quality Flywheel drives improvement: Define → Instrument → Evaluate → Feedback
  7. Build for evaluation from day one: Quality is an architectural choice, not an afterthought

Connecting the Series

This whitepaper builds on concepts from our agentic AI coverage:

| Previous Post | Connection |
|---|---|
| Introduction to Agents | Four Pillars extend the core architecture |
| Agent Tools & MCP | Tool evaluation in trajectory analysis |
| Context Engineering | Memory/session evaluation patterns |
| Anatomy of an AI Agent | Quality framework for building blocks |

The future of AI is agentic, and its success is determined by quality. Organizations that treat agent quality as an afterthought will be stuck in a cycle of promising demos and failed deployments. Those who invest in rigorous, architecturally integrated evaluation will deploy truly transformative, enterprise-grade AI systems.
