This post continues our coverage of Google’s agent whitepaper series, following Introduction to Agents, Agent Tools & MCP, and Context Engineering. This fourth installment tackles perhaps the most critical challenge: how do we know if an agent is actually good?
Source: Agent Quality (PDF) by Google, November 2025
Why Agent Quality Demands a New Approach
Traditional software testing is deterministic: given input X, expect output Y. A function either passes or fails. But AI agents break this model entirely. Their non-deterministic nature means the same input can produce different (yet equally valid) outputs. An agent can pass 100 unit tests and still fail catastrophically in production - not because of a bug in the code, but because of a flaw in its judgment.
The whitepaper frames this shift powerfully:
Traditional software verification asks: “Did we build the product right?” Modern AI evaluation must ask: “Did we build the right product?”
The Delivery Truck vs Formula 1 Analogy
Think of traditional software as a delivery truck. Quality assurance is a checklist: Did the engine start? Did it follow the fixed route? Was it on time?
An AI agent is more like a Formula 1 race car. Success depends on dynamic judgment. Evaluation cannot be a simple checklist - it requires continuous telemetry to judge every decision, from fuel consumption to braking strategy.
```mermaid
flowchart TB
subgraph Traditional["Traditional Software (Delivery Truck)"]
direction LR
T1["Fixed Input"] --> T2["Deterministic Logic"] --> T3["Expected Output"]
end
subgraph Agent["AI Agent (F1 Race Car)"]
direction LR
A1["User Intent"] --> A2["Dynamic Reasoning"] --> A3["Variable Paths"]
A3 --> A4["Adaptive Output"]
end
classDef blueClass fill:#4A90E2,stroke:#333,stroke-width:2px,color:#fff
classDef orangeClass fill:#F39C12,stroke:#333,stroke-width:2px,color:#fff
classDef greenClass fill:#27AE60,stroke:#333,stroke-width:2px,color:#fff
class T1,T2,T3 blueClass
class A1,A2,A3,A4 orangeClass
```
Agent Failure Modes
Unlike traditional software that crashes with explicit errors, agents fail silently. The system keeps running, API calls return 200 OK, outputs look plausible - but they’re profoundly wrong.
| Failure Mode | Description | Example |
|---|---|---|
| Algorithmic Bias | Operationalizes biases from training data | Financial agent over-penalizing loan applications based on zip codes |
| Factual Hallucination | Produces plausible but invented information | Research tool generating false historical dates with high confidence |
| Concept Drift | Performance degrades as real-world data changes | Fraud detection missing new attack patterns |
| Emergent Behaviors | Develops unanticipated strategies | Finding loopholes in system rules, engaging in proxy wars with other bots |
The Evolution: From ML to Multi-Agent Systems
Each stage of AI evolution adds a new layer of evaluative complexity:
```mermaid
flowchart LR
ML["Traditional ML
───
Clear metrics:
F1, RMSE"]
LLM["Passive LLMs
───
No simple metrics
Probabilistic output"]
RAG["LLM + RAG
───
Multi-component
Retrieval + Generation"]
Agent["LLM Agents
───
Planning + Tools
+ Memory"]
MAS["Multi-Agent
───
Emergent behaviors
System-level failures"]
ML --> LLM --> RAG --> Agent --> MAS
classDef blueClass fill:#4A90E2,stroke:#333,stroke-width:2px,color:#fff
classDef orangeClass fill:#F39C12,stroke:#333,stroke-width:2px,color:#fff
classDef greenClass fill:#27AE60,stroke:#333,stroke-width:2px,color:#fff
classDef purpleClass fill:#9B59B6,stroke:#333,stroke-width:2px,color:#fff
classDef redClass fill:#E74C3C,stroke:#333,stroke-width:2px,color:#fff
class ML blueClass
class LLM orangeClass
class RAG greenClass
class Agent purpleClass
class MAS redClass
```
What makes agents uniquely challenging:
- Planning & Multi-Step Reasoning: Non-determinism compounds at every step. A small word choice in Step 1 can send the agent down a completely different path by Step 4
- Tool Use: Actions depend on external, uncontrollable world state
- Memory: Behavior evolves based on past interactions - same input today may produce different results than yesterday
The Four Pillars of Agent Quality
The whitepaper establishes a framework for holistic agent evaluation:
```mermaid
flowchart LR
E["🎯 Effectiveness
Goal Achievement"]
EF["⚡ Efficiency
Operational Cost"]
R["🛡️ Robustness
Reliability"]
S["🔒 Safety
Trustworthiness"]
classDef blueClass fill:#4A90E2,stroke:#333,stroke-width:2px,color:#fff
classDef orangeClass fill:#F39C12,stroke:#333,stroke-width:2px,color:#fff
classDef greenClass fill:#27AE60,stroke:#333,stroke-width:2px,color:#fff
classDef redClass fill:#E74C3C,stroke:#333,stroke-width:2px,color:#fff
class E blueClass
class EF orangeClass
class R greenClass
class S redClass
```
| Pillar | Core Question | Example Metrics |
|---|---|---|
| Effectiveness | Did the agent achieve the user’s actual intent? | PR acceptance rate, conversion rate, session completion |
| Efficiency | Did it solve the problem well? | Token count, latency, trajectory steps, API costs |
| Robustness | How does it handle adversity? | Graceful degradation, retry success, error recovery |
| Safety & Alignment | Does it operate within ethical boundaries? | Bias metrics, prompt injection resistance, data leakage prevention |
The critical insight: you cannot measure any of these pillars if you only see the final answer. You need visibility into the entire decision-making process.
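One lightweight way to keep all four pillars visible in an evaluation harness is a per-task scorecard. The sketch below is a minimal illustration; the field names and thresholds are my assumptions, not something prescribed by the whitepaper:

```python
from dataclasses import dataclass, field

@dataclass
class PillarScorecard:
    """Illustrative per-task record covering all four pillars at once."""
    task_id: str
    goal_achieved: bool        # Effectiveness: did the agent meet the user's intent?
    total_tokens: int          # Efficiency: cost proxy
    latency_ms: float          # Efficiency: end-to-end latency
    unrecovered_errors: int    # Robustness: failures the agent could not work around
    safety_flags: list = field(default_factory=list)  # Safety: policy/classifier hits

    def passes(self, max_tokens: int = 8000, max_latency_ms: float = 15000) -> bool:
        # A task only "passes" if every pillar is satisfied, not just effectiveness.
        return (self.goal_achieved
                and self.total_tokens <= max_tokens
                and self.latency_ms <= max_latency_ms
                and self.unrecovered_errors == 0
                and not self.safety_flags)
```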
The “Outside-In” Evaluation Framework
Evaluation must be a top-down, strategic process. Start with the only metric that ultimately matters - real-world success - before diving into technical details.
Stage 1: End-to-End Evaluation (Black Box)
The first question: Did the agent achieve the user’s goal effectively?
Before analyzing any internal thought or tool call, evaluate the final performance:
- Task Success Rate: Binary or graded score of goal completion
- User Satisfaction: Thumbs up/down, CSAT scores
- Overall Quality: Accuracy, completeness metrics
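As a minimal sketch of how these black-box signals might be aggregated over an evaluation run (the record format below is an assumption for illustration):

```python
def summarize_black_box(results: list) -> dict:
    """Aggregate end-to-end outcomes; each record is assumed to carry
    'success' (bool) and optionally 'csat' (a 1-5 user rating)."""
    total = len(results)
    successes = sum(1 for r in results if r["success"])
    ratings = [r["csat"] for r in results if r.get("csat") is not None]
    return {
        "task_success_rate": successes / total if total else 0.0,
        "avg_csat": sum(ratings) / len(ratings) if ratings else None,
        "n_tasks": total,
    }

# Example: two of three evaluation tasks reached the user's goal.
print(summarize_black_box([
    {"success": True, "csat": 5},
    {"success": True},
    {"success": False, "csat": 2},
]))
```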
If the agent scores 100% at this stage, your work may be done. But when failures occur, we need to open the box.
Stage 2: Trajectory Evaluation (Glass Box)
Once a failure is identified, analyze the agent’s entire execution trajectory:
| Component | What to Evaluate | Common Failures |
|---|---|---|
| LLM Planning | Core reasoning quality | Hallucinations, off-topic responses, repetitive loops |
| Tool Selection | Right tool for the task | Wrong tool, hallucinated tool names, unnecessary calls |
| Tool Parameterization | Correct arguments | Missing params, wrong types, malformed JSON |
| Response Interpretation | Understanding tool output | Misinterpreting data, missing error states |
| RAG Performance | Retrieval quality | Irrelevant docs, outdated info, ignored context |
| Efficiency | Resource allocation | Excessive API calls, high latency, redundant work |
| Multi-Agent Dynamics | Inter-agent coordination | Communication loops, role conflicts |
The trace moves us from “the final answer is wrong” to “the final answer is wrong because…”.
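Here is a minimal sketch of one such trajectory check, comparing the tools an agent actually called against a reference trajectory; the trace format and tool names are illustrative assumptions:

```python
def check_tool_trajectory(trace: list, expected_tools: list) -> dict:
    """Compare the ordered tool calls in a trace against an expected sequence.
    Each trace step is assumed to look like {"type": "tool_call", "name": ...}."""
    actual = [step["name"] for step in trace if step.get("type") == "tool_call"]
    return {
        "exact_match": actual == expected_tools,
        "missing_tools": [t for t in expected_tools if t not in actual],
        "unexpected_tools": [t for t in actual if t not in expected_tools],
        "n_calls": len(actual),  # efficiency signal: redundant calls inflate this
    }

# Example: the agent skipped the retrieval step and called search twice.
trace = [
    {"type": "tool_call", "name": "web_search"},
    {"type": "tool_call", "name": "web_search"},
    {"type": "llm_response", "text": "..."},
]
print(check_tool_trajectory(trace, expected_tools=["retrieve_docs", "web_search"]))
```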
The Evaluators: Methods of Judgment
Knowing what to evaluate is half the battle. The other half is how to judge it.
Automated Metrics
Fast and reproducible, useful for regression testing:
| Metric Type | Examples | Use Case |
|---|---|---|
| String-based | ROUGE, BLEU | Comparing to reference text |
| Embedding-based | BERTScore, cosine similarity | Semantic closeness |
| Task-specific | TruthfulQA | Domain benchmarks |
Limitation: Metrics are efficient but shallow - they capture surface similarity, not deeper reasoning or user value.
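For illustration, a minimal embedding-based gate of the kind used in regression testing; the embedding model itself is out of scope here, so the vectors are assumed to come from whichever encoder you already use:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Embedding-based metric: semantic closeness of a candidate vs. a reference."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def passes_regression(candidate_vec: np.ndarray, reference_vec: np.ndarray,
                      threshold: float = 0.85) -> bool:
    """CI/CD-style gate: flag outputs that drift too far from the golden answer."""
    return cosine_similarity(candidate_vec, reference_vec) >= threshold

# Toy 3-d vectors; real embeddings would have a few hundred dimensions.
print(passes_regression(np.array([0.9, 0.1, 0.0]), np.array([1.0, 0.0, 0.0])))  # True
```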
LLM-as-a-Judge
Use a powerful model (like Gemini) to evaluate another agent’s outputs. Provide:
- The agent’s output
- Original prompt
- Golden answer (if available)
- Detailed evaluation rubric
The whitepaper's example rubric opens with: "You are an expert evaluator for a customer support chatbot."
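Below is a minimal sketch of how such a judge call could be assembled; everything beyond that first rubric line, along with the `call_judge_model` helper, is an illustrative assumption rather than the whitepaper's actual prompt:

```python
from typing import Optional

JUDGE_RUBRIC = """You are an expert evaluator for a customer support chatbot.
Given the original user prompt, the chatbot's answer, and (if provided) a golden
answer, rate the answer from 1-5 on correctness, helpfulness, and tone.
Return JSON: {"score": <1-5>, "reasoning": "<one short paragraph>"}."""

def build_judge_prompt(user_prompt: str, agent_output: str,
                       golden: Optional[str] = None) -> str:
    parts = [JUDGE_RUBRIC,
             f"\nUser prompt:\n{user_prompt}",
             f"\nAgent output:\n{agent_output}"]
    if golden:
        parts.append(f"\nGolden answer:\n{golden}")
    return "\n".join(parts)

# call_judge_model is a stand-in for your LLM client (e.g. a Gemini API call):
# verdict = call_judge_model(build_judge_prompt(prompt, output, golden))
```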
Best Practice: Use pairwise comparison over single-scoring. A high “win rate” is more reliable than noisy absolute scores.
Agent-as-a-Judge
While LLMs score final responses, agents require evaluation of their full reasoning and actions. The emerging Agent-as-a-Judge paradigm uses one agent to evaluate another’s execution trace:
- Plan quality: Was the plan logical and feasible?
- Tool use: Were right tools chosen and applied correctly?
- Context handling: Did agent use prior information effectively?
Human-in-the-Loop (HITL)
Automation provides scale, but struggles with deep subjectivity and domain knowledge. HITL is essential for:
- Domain Expertise: Medical, legal, financial accuracy
- Interpreting Nuance: Tone, creativity, complex ethical alignment
- Creating the Golden Set: Establishing benchmarks before automation
```mermaid
flowchart TB
subgraph Methods["Evaluation Methods"]
direction LR
AM["Automated
Metrics"]
LJ["LLM-as-a
-Judge"]
AJ["Agent-as-a
-Judge"]
HI["Human-in
-the-Loop"]
end
AM --> |"Scale"| LJ --> |"Process"| AJ --> |"Ground Truth"| HI
classDef blueClass fill:#4A90E2,stroke:#333,stroke-width:2px,color:#fff
classDef orangeClass fill:#F39C12,stroke:#333,stroke-width:2px,color:#fff
classDef greenClass fill:#27AE60,stroke:#333,stroke-width:2px,color:#fff
classDef purpleClass fill:#9B59B6,stroke:#333,stroke-width:2px,color:#fff
class AM blueClass
class LJ orangeClass
class AJ greenClass
class HI purpleClass
```
| Evaluator | Scalability | Nuance | Cost | Best For |
|---|---|---|---|---|
| Automated Metrics | High | Low | Low | Regression testing, CI/CD gates |
| LLM-as-a-Judge | High | Medium | Medium | Rapid iteration, A/B testing |
| Agent-as-a-Judge | Medium | High | Medium | Process evaluation, trajectory analysis |
| Human-in-the-Loop | Low | Very High | High | Ground truth, domain expertise, edge cases |
Observability: The Three Pillars
You cannot judge a process you cannot see. The whitepaper uses a compelling analogy:
Line Cook vs Gourmet Chef
Traditional Software is a Line Cook: Rigid recipe card - toast bun 30 seconds, grill patty 90 seconds, add cheese. Monitoring is a checklist.
AI Agent is a Gourmet Chef in a Mystery Box Challenge: Given a goal and ingredients, no single correct recipe. We need to understand the reasoning - why pair raspberries with basil? How did they adapt when out of sugar?
This is the shift from monitoring (is it running?) to observability (is it thinking effectively?).
The Three Pillars
```mermaid
flowchart TB
subgraph Observability["Agent Observability"]
direction LR
L["📝 Logs
The Agent's Diary
───
What happened"]
T["🔗 Traces
Following Footsteps
───
Why it happened"]
M["📊 Metrics
Health Report
───
How well"]
end
classDef blueClass fill:#4A90E2,stroke:#333,stroke-width:2px,color:#fff
classDef orangeClass fill:#F39C12,stroke:#333,stroke-width:2px,color:#fff
classDef greenClass fill:#27AE60,stroke:#333,stroke-width:2px,color:#fff
class L blueClass
class T orangeClass
class M greenClass
```
Pillar 1: Logs - The Agent’s Diary
Timestamped entries recording discrete events. A structured JSON format captures:
- Prompt/response pairs
- Intermediate reasoning steps (chain of thought)
- Tool calls with inputs/outputs/errors
- State changes
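As a minimal sketch of what one such structured entry might look like when emitted from Python (the field names are illustrative assumptions, not a fixed schema):

```python
import json
import time
import uuid

def log_agent_event(event_type: str, payload: dict, trace_id: str) -> None:
    """Emit one structured, timestamped log entry for a discrete agent event."""
    entry = {
        "timestamp": time.time(),
        "trace_id": trace_id,        # ties this entry to a full trace (see next pillar)
        "event_type": event_type,    # e.g. "llm_call", "tool_call", "state_change"
        "payload": payload,
    }
    print(json.dumps(entry))         # in production, ship to your logging backend

trace_id = str(uuid.uuid4())
log_agent_event("tool_call", {
    "tool": "get_order_status",
    "input": {"order_id": "A-1043"},
    "output": {"status": "shipped"},
    "error": None,
}, trace_id)
```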
Pillar 2: Traces - Following the Agent’s Footsteps
If logs are diary entries, traces are the narrative thread connecting them. They follow a single task from initial query to final answer.
Consider a failure where a user gets a nonsensical answer:
- Isolated Logs show only disconnected errors: `ERROR: RAG search failed` and `ERROR: LLM response failed validation`
- A Trace reveals the causal chain: User Query → RAG Search (failed) → Faulty Tool Call (null input) → LLM Error (confused) → Incorrect Answer
Modern tracing uses OpenTelemetry with:
- Spans: Named operations (llm_call, tool_execution)
- Attributes: Metadata (prompt_id, latency_ms, token_count)
- Context Propagation: Links spans via trace_id
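A minimal sketch using the OpenTelemetry Python API (assuming the `opentelemetry-api` package is installed; exporter setup and error handling are omitted, and the span and attribute names are illustrative):

```python
from opentelemetry import trace

tracer = trace.get_tracer("agent.runtime")

def answer_query(user_query: str) -> str:
    # One root span per task: every child span shares its trace_id via context propagation.
    with tracer.start_as_current_span("agent_task") as task_span:
        task_span.set_attribute("user_query", user_query)

        with tracer.start_as_current_span("rag_search") as rag_span:
            docs = []                              # imagine the retrieval failed here
            rag_span.set_attribute("num_docs", len(docs))

        with tracer.start_as_current_span("llm_call") as llm_span:
            llm_span.set_attribute("token_count", 512)
            return "..."                           # model call omitted in this sketch
```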
Pillar 3: Metrics - The Agent’s Health Report
Aggregated scores derived from logs and traces. Split into two categories:
System Metrics (Vital Signs):
| Metric | Derivation | Purpose |
|---|---|---|
| Latency P50/P99 | Aggregate duration_ms from traces | User experience |
| Error Rate | % of traces with error=true | System health |
| Tokens per Task | Average token_count | Cost management |
| Task Completion Rate | % reaching success span | Effectiveness |
Quality Metrics (Decision Quality):
| Metric | Evaluation Method | Purpose |
|---|---|---|
| Correctness | Golden set comparison, LLM-as-Judge | Accuracy |
| Trajectory Adherence | Compare to ideal path | Process quality |
| Safety Score | RAI classifiers | Responsibility |
| Helpfulness | User feedback, LLM-as-Judge | User value |
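A minimal sketch of deriving a few of the system metrics above from raw trace records; the record shape is an assumption:

```python
import math
import statistics

def system_metrics(traces: list) -> dict:
    """Each trace record is assumed to carry 'duration_ms', 'error', and 'token_count'."""
    durations = sorted(t["duration_ms"] for t in traces)
    p99_idx = max(0, math.ceil(0.99 * len(durations)) - 1)  # nearest-rank approximation
    return {
        "latency_p50_ms": statistics.median(durations),
        "latency_p99_ms": durations[p99_idx],
        "error_rate": sum(1 for t in traces if t["error"]) / len(traces),
        "avg_tokens_per_task": statistics.mean(t["token_count"] for t in traces),
    }
```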
The Agent Quality Flywheel
The whitepaper synthesizes everything into a continuous improvement cycle:
```mermaid
flowchart TB
D["1️⃣ Define Quality
───
Four Pillars:
Effectiveness, Efficiency,
Robustness, Safety"]
I["2️⃣ Instrument Visibility
───
Logs, Traces
Observability Foundation"]
E["3️⃣ Evaluate Process
───
Outside-In Assessment
LLM-Judge + HITL"]
F["4️⃣ Feedback Loop
───
Failures → Regression Tests
Continuous Improvement"]
D --> I --> E --> F --> D
classDef blueClass fill:#4A90E2,stroke:#333,stroke-width:2px,color:#fff
classDef orangeClass fill:#F39C12,stroke:#333,stroke-width:2px,color:#fff
classDef greenClass fill:#27AE60,stroke:#333,stroke-width:2px,color:#fff
classDef purpleClass fill:#9B59B6,stroke:#333,stroke-width:2px,color:#fff
class D blueClass
class I orangeClass
class E greenClass
class F purpleClass
```
Each step builds on the previous:
- Define Quality: Establish concrete targets with the Four Pillars
- Instrument Visibility: Build observability into the architecture from day one
- Evaluate Process: Apply hybrid judgment (automated + human)
- Feedback Loop: Convert failures into regression tests
Every production failure, when captured and annotated, becomes a permanent test case in the “Golden” evaluation set. Every failure makes the system smarter.
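A minimal sketch of that last step, turning an annotated production failure into a golden-set regression case; the file layout and field names are assumptions for illustration:

```python
import json
from pathlib import Path

GOLDEN_SET = Path("golden_set.jsonl")

def add_regression_case(user_query: str, failing_trace_id: str,
                        expected_behavior: str, annotator: str) -> None:
    """Append an annotated production failure to the golden evaluation set,
    so every future eval run re-checks this exact scenario."""
    case = {
        "input": user_query,
        "source_trace_id": failing_trace_id,     # links back to the original failure
        "expected_behavior": expected_behavior,  # written by the human annotator
        "annotator": annotator,
        "tags": ["production_failure", "regression"],
    }
    with GOLDEN_SET.open("a", encoding="utf-8") as f:
        f.write(json.dumps(case) + "\n")

add_regression_case(
    user_query="Cancel my order and refund me to the original card",
    failing_trace_id="trace-8f21c4",  # illustrative id of the incident being converted
    expected_behavior="Agent must call cancel_order before issuing any refund",
    annotator="support-ops",
)
```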
Three Core Principles
The whitepaper distills its guidance into three foundational principles:
Principle 1: Evaluation as Architecture
Agent quality is an architectural pillar, not a final testing phase.
You don’t build a Formula 1 car and bolt on sensors afterward. Design agents to be “evaluatable-by-design” - instrumented from the first line of code.
Principle 2: Trajectory is Truth
The final answer is merely the last sentence of a long story. The true measure of logic, safety, and efficiency lies in the end-to-end “thought process” - the trajectory. This is only possible through deep observability.
Principle 3: Human as Arbiter
Automation is the tool for scale; humanity is the source of truth. An AI can help grade the test, but a human writes the rubric and decides what an ‘A+’ really means.
Key Takeaways
- Non-determinism breaks traditional QA: Agents fail silently with subtle quality degradations, not explicit crashes
- Four Pillars define quality: Effectiveness, Efficiency, Robustness, and Safety - you cannot measure any without process visibility
- Outside-In evaluation: Start with end-to-end success, then dive into trajectory analysis when failures occur
- Hybrid judgment is essential: Combine automated metrics, LLM-as-a-Judge, and Human-in-the-Loop
- Three Pillars of Observability: Logs (what), Traces (why), Metrics (how well)
- Quality Flywheel drives improvement: Define → Instrument → Evaluate → Feedback
- Build for evaluation from day one: Quality is an architectural choice, not an afterthought
Connecting the Series
This whitepaper builds on concepts from our agentic AI coverage:
| Previous Post | Connection |
|---|---|
| Introduction to Agents | Four Pillars extend the core architecture |
| Agent Tools & MCP | Tool evaluation in trajectory analysis |
| Context Engineering | Memory/session evaluation patterns |
| Anatomy of an AI Agent | Quality framework for building blocks |
The future of AI is agentic. Its success is determined by quality. Organizations that treat agent quality as an afterthought will be stuck in a cycle of promising demos and failed deployments. Those who invest in rigorous, architecturally integrated evaluation will deploy truly transformative, enterprise-grade AI systems.