Summary: Google's AgentOps - From Prototype to Production

This post continues our coverage of Google’s agent whitepaper series, following Introduction to Agents, Agent Tools & MCP, Context Engineering, and Agent Quality & Evaluation. This fifth installment tackles the critical challenge: how do we move agents from demo to production?

Source: Prototype to Production (PDF) by Google, November 2025

The “Last Mile” Production Gap

You can spin up an AI agent prototype in minutes. But turning that demo into a trusted, production-grade system? That’s where roughly 80% of the effort is spent - not on agent intelligence, but on infrastructure, security, and validation.

The whitepaper opens with a powerful statement:

> Building an agent is easy. Trusting it is hard.

What Can Go Wrong

| Failure Scenario | Root Cause |
| --- | --- |
| Customer service agent gives products away for free | Missing guardrails |
| User accesses a confidential database through the agent | Improper authentication |
| A weekend traffic spike generates a massive consumption bill | No monitoring configured |
| An agent that worked yesterday suddenly fails | No continuous evaluation |

These aren’t just technical problems - they’re business failures. Traditional DevOps and MLOps principles help, but agents introduce unique challenges that require a new operational discipline.

Why Agents Are Different

Unlike traditional software, agents act autonomously, maintain state across interactions, and follow dynamic execution paths:

| Challenge | Description |
| --- | --- |
| Dynamic Tool Orchestration | Agent trajectory assembled on the fly - requires versioning, access control, observability |
| Scalable State Management | Memory across interactions needs secure, consistent management at scale |
| Unpredictable Cost & Latency | Different paths to answers make cost and response time hard to predict |

```mermaid
flowchart TB
    subgraph Production["Production-Grade Agent (80% Effort)"]
        direction LR
        P1["Infrastructure"] --> P2["Security"] --> P3["Validation"] --> P4["Monitoring"]
    end

    subgraph Prototype["Prototype Agent (20% Effort)"]
        direction LR
        A1["Prompts"] --> A2["Tools"] --> A3["Logic"]
    end

    classDef blueClass fill:#4A90E2,stroke:#333,stroke-width:2px,color:#fff
    classDef orangeClass fill:#F39C12,stroke:#333,stroke-width:2px,color:#fff

    class A1,A2,A3 blueClass
    class P1,P2,P3,P4 orangeClass
```

People and Process

Technology alone isn’t enough. Behind every production-grade agent is a well-orchestrated team of specialists.

Traditional MLOps Teams

| Team | Responsibilities |
| --- | --- |
| Cloud Platform | Infrastructure, security, access control, least-privilege roles |
| Data Engineering | Data pipelines, ingestion, preparation, quality standards |
| Data Science & MLOps | Model experimentation, training, CI/CD automation |
| ML Governance | Compliance, transparency, accountability oversight |

GenAI-Specific Roles

| Role | Focus |
| --- | --- |
| Prompt Engineers | Craft prompts with domain expertise, define expected model behavior |
| AI Engineers | Scale GenAI to production - evaluation, guardrails, RAG/tool integration |
| DevOps/App Developers | Build front-end interfaces integrating with the GenAI backend |

The scale of your organization influences these roles. Smaller companies may have individuals wearing multiple hats, while mature organizations have specialized teams.

Evaluation as a Quality Gate

Traditional software tests are insufficient for systems that reason and adapt. Evaluating an agent requires assessing the entire trajectory of reasoning and actions - not just the final answer.

Two Implementation Approaches

1. Manual “Pre-PR” Evaluation

For teams starting their evaluation journey:

  • AI Engineer runs evaluation suite locally before PR
  • Performance report linked in PR description
  • Reviewer assesses code AND behavioral changes

2. Automated In-Pipeline Gate

For mature teams:

  • Evaluation harness integrated into CI/CD
  • Failing evaluation automatically blocks deployment
  • Metrics like “tool call success rate” or “helpfulness” must meet thresholds

```mermaid
flowchart LR
    C["Code Change"] --> E["Evaluation Suite"]
    E --> D{Pass?}
    D -->|Yes| P["Deploy to Production"]
    D -->|No| B["Block & Fix"]
    B --> C

    classDef blueClass fill:#4A90E2,stroke:#333,stroke-width:2px,color:#fff
    classDef greenClass fill:#27AE60,stroke:#333,stroke-width:2px,color:#fff
    classDef redClass fill:#E74C3C,stroke:#333,stroke-width:2px,color:#fff

    class C,E blueClass
    class P greenClass
    class B redClass
```
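To make the gate concrete, here is a minimal sketch of what an in-pipeline check might look like, assuming the evaluation harness has already written its aggregate scores to a JSON file. The file name, metric keys, and thresholds are illustrative assumptions, not details from the whitepaper:

```python
# eval_gate.py - illustrative CI quality gate (names and thresholds are assumptions)
import json
import sys

# Thresholds the team agrees on; failing either blocks the deploy.
MIN_TOOL_CALL_SUCCESS = 0.95
MIN_HELPFULNESS = 0.80

def main() -> int:
    # Assume the evaluation harness wrote its results to disk in a prior CI step.
    with open("eval_results.json") as f:
        results = json.load(f)

    tool_success = results["tool_call_success_rate"]
    helpfulness = results["helpfulness"]

    print(f"tool_call_success_rate={tool_success:.2f} (min {MIN_TOOL_CALL_SUCCESS})")
    print(f"helpfulness={helpfulness:.2f} (min {MIN_HELPFULNESS})")

    # A non-zero exit code fails the CI step and blocks the merge/deploy.
    if tool_success < MIN_TOOL_CALL_SUCCESS or helpfulness < MIN_HELPFULNESS:
        print("Evaluation gate FAILED - blocking deployment.")
        return 1
    print("Evaluation gate passed.")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```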

The CI/CD Pipeline

An AI agent is a composite system - code, prompts, tool definitions, and configuration files. The CI/CD pipeline helps teams collaborate, manage complexity, and ensure quality through staged testing.

Three-Phase Pipeline

```mermaid
flowchart LR
    subgraph Phase1["Phase 1: Pre-Merge (CI)"]
        direction TB
        PR["Pull Request"] --> UT["Unit Tests"]
        UT --> LI["Linting"]
        LI --> EV["Evaluation Suite"]
    end

    subgraph Phase2["Phase 2: Post-Merge (CD)"]
        direction TB
        MG["Merge"] --> BU["Build"]
        BU --> ST["Deploy to Staging"]
        ST --> LT["Load Tests"]
    end

    subgraph Phase3["Phase 3: Production"]
        direction TB
        AP["Approval"] --> PD["Deploy to Prod"]
        PD --> MO["Monitor"]
    end

    Phase1 --> Phase2 --> Phase3

    classDef blueClass fill:#4A90E2,stroke:#333,stroke-width:2px,color:#fff
    classDef orangeClass fill:#F39C12,stroke:#333,stroke-width:2px,color:#fff
    classDef greenClass fill:#27AE60,stroke:#333,stroke-width:2px,color:#fff

    class PR,UT,LI,EV blueClass
    class MG,BU,ST,LT orangeClass
    class AP,PD,MO greenClass
```

| Phase | Trigger | Purpose | Key Activities |
| --- | --- | --- | --- |
| Pre-Merge (CI) | Pull request | Rapid feedback, gatekeep main branch | Unit tests, linting, evaluation suite |
| Post-Merge (CD) | Merge to main | Operational readiness validation | Deploy to staging, load testing, integration tests |
| Production | Manual approval | Safe release | Human sign-off, artifact promotion, monitoring |

Enabling Technologies

  • Infrastructure as Code (IaC): Tools like Terraform ensure environments are identical, repeatable, and version-controlled
  • Automated Testing: Frameworks like Pytest handle agent-specific artifacts - conversation histories, tool logs, reasoning traces
  • Secrets Management: API keys managed via services like Google Cloud Secret Manager
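
As a concrete example of the last point, a tool can read its API key from Secret Manager at runtime via the google-cloud-secret-manager client library, rather than from source control. A minimal sketch; the project and secret IDs are placeholders:

```python
# Fetching an API key at startup instead of hard-coding it (sketch).
# Requires the google-cloud-secret-manager package; IDs below are placeholders.
from google.cloud import secretmanager

def get_secret(project_id: str, secret_id: str, version: str = "latest") -> str:
    client = secretmanager.SecretManagerServiceClient()
    name = f"projects/{project_id}/secrets/{secret_id}/versions/{version}"
    response = client.access_secret_version(request={"name": name})
    return response.payload.data.decode("UTF-8")

# Usage: the agent's tool reads the key at runtime, never from the repo.
# weather_api_key = get_secret("my-project", "weather-api-key")
```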

Safe Rollout Strategies

Rather than switching 100% of users at once, minimize risk through gradual rollouts with careful monitoring.

| Strategy | Description | Use Case |
| --- | --- | --- |
| Canary | Start with 1% of users, scale up or roll back | Early detection of issues |
| Blue-Green | Two identical environments, instant switch | Zero downtime, instant recovery |
| A/B Testing | Compare versions on real metrics | Data-driven decisions |
| Feature Flags | Deploy code, control release dynamically | Test with select users first |

All strategies require rigorous versioning of every component - code, prompts, model endpoints, tool schemas, memory structures, evaluation datasets. This enables instant rollback to known-good states.
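
One lightweight way to picture this is a release manifest that pins every component together, so a rollback means redeploying a known-good manifest. The fields and values below are illustrative assumptions:

```python
# A pinned "release manifest" (illustrative): every component the agent
# depends on is versioned together and rolled back as a unit.
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentRelease:
    code_commit: str           # git SHA of the agent code
    prompt_version: str        # versioned system prompt
    model_endpoint: str        # pinned model version, never "latest"
    tool_schema_version: str   # versioned tool definitions
    eval_dataset_version: str  # the dataset this release was validated against

CURRENT = AgentRelease(
    code_commit="9f3c2ab",
    prompt_version="support-prompt-v12",
    model_endpoint="gemini-2.5-flash",  # placeholder; pin what you deploy
    tool_schema_version="tools-v4",
    eval_dataset_version="eval-2025-11-01",
)
```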

On Google Cloud:

  • Deploy agents using Agent Engine or Cloud Run
  • Use Cloud Load Balancing for traffic management across versions

Building Security from the Start

Agents reason and act autonomously, creating unique risks that require security embedded from day one.

Agent-Specific Risks

| Risk | Description |
| --- | --- |
| Prompt Injection | Users trick agents into unintended actions |
| Data Leakage | Sensitive information exposed through responses or tool usage |
| Memory Poisoning | False information in memory corrupts future interactions |

Three Layers of Defense

```mermaid
flowchart TB
    subgraph Layer1["Layer 1: Policy & System Instructions"]
        direction LR
        PO["Define Policies"] --> SI["System Instructions"]
    end

    subgraph Layer2["Layer 2: Guardrails & Filtering"]
        direction LR
        IF["Input Filtering"] --> OF["Output Filtering"]
        OF --> HI["HITL Escalation"]
    end

    subgraph Layer3["Layer 3: Continuous Assurance"]
        direction LR
        RE["Rigorous Evaluation"] --> RT["RAI Testing"]
        RT --> RD["Red Teaming"]
    end

    Layer1 --> Layer2 --> Layer3

    classDef blueClass fill:#4A90E2,stroke:#333,stroke-width:2px,color:#fff
    classDef orangeClass fill:#F39C12,stroke:#333,stroke-width:2px,color:#fff
    classDef greenClass fill:#27AE60,stroke:#333,stroke-width:2px,color:#fff

    class PO,SI blueClass
    class IF,OF,HI orangeClass
    class RE,RT,RD greenClass
```

| Layer | Components | Purpose |
| --- | --- | --- |
| Policy & System Instructions | Defined behaviors, system instructions | Agent's "constitution" |
| Guardrails & Filtering | Input filters, output filters, HITL escalation | Hard-stop enforcement |
| Continuous Assurance | Vertex AI Evaluation, RAI testing, red teaming | Ongoing validation |
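
As a rough illustration of Layer 2, the sketch below screens user input before it ever reaches the model. The patterns, length cutoff, and routing labels are illustrative assumptions, not a complete defense against prompt injection:

```python
# Layer 2 sketch: a naive input guardrail that hard-blocks obvious injection
# attempts and escalates borderline cases to a human review queue.
import re

BLOCKLIST = [
    r"ignore (all|previous) instructions",
    r"reveal your system prompt",
]

def screen_input(user_message: str) -> str:
    for pattern in BLOCKLIST:
        if re.search(pattern, user_message, re.IGNORECASE):
            return "block"           # hard stop before the model sees it
    if len(user_message) > 8000:     # unusually long inputs go to a human
        return "escalate_to_hitl"
    return "allow"
```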

Operations in Production

Once live, the focus shifts to keeping the system reliable, cost-effective, and safe. This requires a continuous operational loop.

Observe - Act - Evolve

```mermaid
flowchart LR
    O["Observe<br/>Logs, Traces, Metrics"] --> A["Act<br/>Performance, Cost, Security"]
    A --> E["Evolve<br/>Learn & Improve"]
    E --> O

    classDef blueClass fill:#4A90E2,stroke:#333,stroke-width:2px,color:#fff
    classDef orangeClass fill:#F39C12,stroke:#333,stroke-width:2px,color:#fff
    classDef greenClass fill:#27AE60,stroke:#333,stroke-width:2px,color:#fff

    class O blueClass
    class A orangeClass
    class E greenClass
```

Observe: Three Pillars of Observability

| Pillar | Purpose | What It Captures |
| --- | --- | --- |
| Logs | Factual diary of events | Tool calls, errors, decisions |
| Traces | Causal narrative | Why the agent took certain actions |
| Metrics | Aggregated report card | Performance, cost, health at scale |

On Google Cloud: Cloud Trace, Cloud Logging, Cloud Monitoring, with ADK providing built-in trace integration.
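
Whatever the backend, the working habit is the same: emit structured, correlated events. A minimal sketch of one structured log line per tool call, keyed by a trace ID so logs, traces, and metrics can be joined later; the field names are illustrative:

```python
# Sketch: one structured log line per tool call (field names are assumptions).
import json
import logging
import time

logger = logging.getLogger("agent.tools")
logging.basicConfig(level=logging.INFO)

def log_tool_call(tool_name, trace_id, fn, *args, **kwargs):
    """Run a tool and emit one structured log line, success or failure."""
    start = time.monotonic()
    status = "error"
    try:
        result = fn(*args, **kwargs)
        status = "ok"
        return result
    finally:
        logger.info(json.dumps({
            "event": "tool_call",
            "tool": tool_name,
            "trace_id": trace_id,  # join key across logs, traces, metrics
            "status": status,
            "latency_ms": round((time.monotonic() - start) * 1000),
        }))

# Usage: log_tool_call("get_weather", "trace-123", get_weather, "Paris")
```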

Act: Operational Levers

Managing System Health:

| Goal | Strategy |
| --- | --- |
| Scale | Stateless containers, async processing, externalized state (Agent Engine Sessions or AlloyDB/Cloud SQL) |
| Speed | Parallel execution, aggressive caching, smaller models for routine tasks |
| Reliability | Retry with exponential backoff, idempotent tools |
| Cost | Shorter prompts, cheaper models for easy tasks, request batching |
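
For the reliability row, a minimal retry-with-backoff wrapper might look like the sketch below. Note it is only safe around idempotent tools, where a retried call cannot double-apply an action:

```python
# Sketch: retry with exponential backoff plus jitter for a flaky tool call.
import random
import time

def with_retries(fn, *args, max_attempts=4, base_delay=0.5, **kwargs):
    """Call fn with exponential backoff; only safe for idempotent tools."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(*args, **kwargs)
        except Exception:
            if attempt == max_attempts:
                raise
            # 0.5s, 1s, 2s, ... plus jitter to avoid synchronized retries
            time.sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.1))
```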

Managing Risk - Security Response Playbook:

  1. Contain: Circuit breaker via feature flag to disable affected tool
  2. Triage: Route suspicious requests to HITL review queue
  3. Resolve: Develop patch, deploy through CI/CD pipeline
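
Step 1 of the playbook might look like the following sketch, where `flags.is_enabled()` stands in for whatever feature-flag service you actually use (an assumption, not a specific API):

```python
# Sketch of step 1 (Contain): a feature-flag circuit breaker around a tool.
def call_tool_guarded(flags, tool_name, fn, *args, **kwargs):
    if not flags.is_enabled(tool_name):
        # An operator flipped the flag - fail gracefully instead of letting
        # a compromised or buggy tool keep acting.
        return {"status": "unavailable",
                "message": f"{tool_name} is temporarily disabled."}
    return fn(*args, **kwargs)
```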

Evolve: Learning from Production

Turn observations into durable improvements:

  1. Analyze Production Data: Identify trends in user behavior, success rates, security incidents
  2. Update Evaluation Datasets: Transform failures into test cases
  3. Refine and Deploy: Commit improvements, trigger automated pipeline

This creates a virtuous cycle where the agent continuously improves with every user interaction.
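
A sketch of step 2, assuming failures are logged as structured traces; the trace fields and the eval-case schema are illustrative assumptions:

```python
# Sketch: turning a logged production failure into a regression eval case.
import json

def failure_to_eval_case(trace: dict) -> dict:
    return {
        "input": trace["user_message"],
        "expected_behavior": trace["corrected_answer"],  # filled in at triage
        "forbidden": trace.get("bad_tool_calls", []),    # must not be repeated
        "source": f"incident:{trace['trace_id']}",
    }

# Appended to the versioned eval dataset, the failure can never silently regress.
# with open("eval_dataset.jsonl", "a") as f:
#     f.write(json.dumps(failure_to_eval_case(trace)) + "\n")
```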

A2A Protocol: Agent-to-Agent Interoperability

As organizations scale to dozens of specialized agents, a new challenge emerges: these agents can’t collaborate. The Agent2Agent (A2A) protocol solves this.

MCP vs A2A

| Aspect | MCP | A2A |
| --- | --- | --- |
| Purpose | Tool integration | Agent collaboration |
| Interaction | Stateless function calls | Complex, stateful delegation |
| Use Case | "Do this specific thing" | "Achieve this complex goal" |
| Example | Fetch weather data | Analyze churn and recommend strategies |

Agent Cards

Agent Cards are standardized JSON specifications that act as a business card for each agent:

```json
{
  "name": "check_prime_agent",
  "version": "1.0.0",
  "description": "An agent specialized in checking whether numbers are prime",
  "capabilities": {},
  "securitySchemes": {
    "agent_oauth_2_0": {
      "type": "oauth2"
    }
  },
  "defaultInputModes": ["text/plain"],
  "defaultOutputModes": ["application/json"],
  "skills": [
    {
      "id": "prime_checking",
      "name": "Prime Number Checking",
      "description": "Check if numbers are prime using efficient algorithms",
      "tags": ["mathematical", "computation", "prime"]
    }
  ],
  "url": "http://localhost:8001/a2a/check_prime_agent"
}
```
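
Because the card lives at a well-known URL, any client can discover an agent's capabilities with a plain HTTP GET. A minimal sketch using the card URL from the example above:

```python
# Sketch: fetching and inspecting an Agent Card before wiring the agent in.
import json
from urllib.request import urlopen

card_url = "http://localhost:8001/a2a/check_prime_agent/.well-known/agent-card.json"
with urlopen(card_url) as resp:
    card = json.load(resp)

print(card["name"], card["version"])
for skill in card["skills"]:
    print(f"- {skill['name']}: {skill['description']}")
```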

Exposing an Agent via A2A

Using Google’s Agent Development Kit (ADK):

```python
from google.adk.agents import Agent  # import added for completeness
from google.adk.a2a.utils.agent_to_a2a import to_a2a

# Your existing agent
root_agent = Agent(
    name='hello_world_agent',
    # ... your agent code ...
)

# Make it A2A-compatible with a single call
a2a_app = to_a2a(root_agent, port=8001)

# Serve with uvicorn:
#   uvicorn agent:a2a_app --host localhost --port 8001
```

Consuming a Remote A2A Agent

```python
from google.adk.agents.remote_a2a_agent import RemoteA2aAgent

prime_agent = RemoteA2aAgent(
    name="prime_agent",
    description="Agent that handles checking if numbers are prime.",
    agent_card="http://localhost:8001/a2a/check_prime_agent/.well-known/agent-card.json"
)
```

Hierarchical Agent Composition

```python
from google.adk.agents import Agent  # imports added for completeness
from google.adk.agents.remote_a2a_agent import RemoteA2aAgent

# Local sub-agent for dice rolling
roll_agent = Agent(
    name="roll_agent",
    instruction="You are an expert at rolling dice."
)

# Remote A2A agent for prime checking
prime_agent = RemoteA2aAgent(
    name="prime_agent",
    agent_card="http://localhost:8001/.well-known/agent-card.json"
)

# Root orchestrator combining both
root_agent = Agent(
    name="root_agent",
    instruction="Delegate rolling dice to roll_agent, prime checking to prime_agent.",
    sub_agents=[roll_agent, prime_agent]
)
```

How A2A and MCP Work Together

A2A and MCP are complementary protocols operating at different abstraction levels:

```mermaid
flowchart TB
    U["User"] --> CA["Client/Router Agent"]

    CA -->|A2A| SA["Specialized Agent A"]
    CA -->|A2A| SB["Specialized Agent B"]
    CA -->|A2A| SC["Specialized Agent C"]

    SA --> MCP1["MCP Server X"]
    SB --> MCP2["MCP Server Y"]
    SC --> API["API Hub Z"]

    classDef blueClass fill:#4A90E2,stroke:#333,stroke-width:2px,color:#fff
    classDef orangeClass fill:#F39C12,stroke:#333,stroke-width:2px,color:#fff
    classDef greenClass fill:#27AE60,stroke:#333,stroke-width:2px,color:#fff

    class U,CA blueClass
    class SA,SB,SC orangeClass
    class MCP1,MCP2,API greenClass
```

Auto Repair Shop Analogy

  1. User-to-Agent (A2A): Customer tells Shop Manager “My car is rattling”
  2. Agent-to-Agent (A2A): Shop Manager delegates to Mechanic agent
  3. Agent-to-Tool (MCP): Mechanic uses scan_vehicle_for_error_codes(), get_repair_procedure()
  4. Agent-to-Agent (A2A): Mechanic contacts Parts Supplier agent for availability

A2A facilitates conversational, task-oriented interactions. MCP provides standardized plumbing for specific tools.

Registry Architectures

When you reach thousands of tools and agents across different teams, you face a discovery problem that demands systematic solutions.

When to Build Registries

| Registry Type | Build When | Benefits |
| --- | --- | --- |
| Tool Registry | Tool discovery becomes a bottleneck, security requires centralized auditing | Curated lists, avoid duplicates, audit access |
| Agent Registry | Multiple teams need to discover and reuse agents | Reduce redundant work, enable delegation |

Tool Registry Patterns

  • Generalist agents: Access full catalog (trade speed for scope)
  • Specialist agents: Predefined subsets (higher performance)
  • Dynamic agents: Query registry at runtime (adapt to new tools)

Registries offer discovery and governance at the cost of maintenance. Consider starting without one and building only when scale demands it.
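
To make the dynamic pattern concrete, here is a toy in-memory registry; a real deployment would back this with a service, but the lookup-by-capability idea is the same. All names here are illustrative:

```python
# Sketch: a dynamic agent querying a tool registry at runtime.
class ToolRegistry:
    def __init__(self):
        self._tools = {}  # name -> (callable, tags)

    def register(self, name, fn, tags):
        self._tools[name] = (fn, set(tags))

    def find(self, tag):
        """Return tools matching a capability tag - the discovery step."""
        return {n: fn for n, (fn, tags) in self._tools.items() if tag in tags}

registry = ToolRegistry()
registry.register("get_weather", lambda city: f"sunny in {city}", ["weather"])

# A dynamic agent asks the registry instead of hard-coding its toolset:
tools_for_task = registry.find("weather")
```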

The AgentOps Lifecycle

The complete reference architecture assembles all pillars into a cohesive system:

```mermaid
flowchart TB
    subgraph Dev["Development Environment"]
        direction LR
        EX["Experimentation"] --> AG["AI Agent"]
        AG --> AP["Application"]
    end

    subgraph Stage["Staging Environment"]
        direction LR
        DE["Deploy"] --> TE["Auto Tests"]
        TE --> SI["Agent Simulation"]
    end

    subgraph Prod["Production Environment"]
        direction LR
        DP["Deploy A/B"] --> OB["Observability"]
        OB --> SE["Security/RAI"]
    end

    subgraph Gov["AI Governance"]
        direction LR
        RE["Repositories"] --> CI["CI/CD"]
        CI --> AR["Agent Registry"]
        AR --> TR["Tool Registry"]
    end

    Dev --> Stage --> Prod
    Gov --> Dev
    Gov --> Stage
    Gov --> Prod

    classDef blueClass fill:#4A90E2,stroke:#333,stroke-width:2px,color:#fff
    classDef orangeClass fill:#F39C12,stroke:#333,stroke-width:2px,color:#fff
    classDef greenClass fill:#27AE60,stroke:#333,stroke-width:2px,color:#fff
    classDef purpleClass fill:#9B59B6,stroke:#333,stroke-width:2px,color:#fff

    class EX,AG,AP blueClass
    class DE,TE,SI orangeClass
    class DP,OB,SE greenClass
    class RE,CI,AR,TR purpleClass
```

The lifecycle flows from developer inner loop (rapid prototyping) through pre-production (evaluation gates) to production (observability and evolution), all governed by centralized AI governance.

Key Takeaways

  1. The “Last Mile” Gap is real: 80% of effort goes to infrastructure, security, and validation - not agent intelligence

  2. People and Process matter: Technology alone isn’t enough - coordinate Cloud Platform, Data Engineering, MLOps, and GenAI-specific roles

  3. Evaluation gates are non-negotiable: No agent reaches production without passing comprehensive quality checks

  4. Three-phase CI/CD: Pre-merge validation, post-merge staging, gated production deployment

  5. Safe rollouts reduce risk: Canary, Blue-Green, A/B, Feature Flags - all require rigorous versioning

  6. Security from day one: Three layers - Policy/System Instructions, Guardrails/Filtering, Continuous Assurance

  7. Observe - Act - Evolve: Continuous operational loop turns every user interaction into improvement

  8. A2A complements MCP: MCP for tools, A2A for agent collaboration - use both in layered architecture

  9. Build registries when needed: Start simple, add centralized discovery when scale demands it

  10. Velocity is the real value: Mature AgentOps enables deploying improvements in hours, not weeks

Connecting the Series

This whitepaper builds on concepts from our agentic AI coverage:

| Previous Post | Connection |
| --- | --- |
| Introduction to Agents | Agent architecture foundation for operations |
| Agent Tools & MCP | MCP complements A2A for tool integration |
| Context Engineering | Session/memory requires externalized state management |
| Agent Quality & Evaluation | Evaluation as a quality gate in the CI/CD pipeline |

The future is not just building better individual agents, but orchestrating sophisticated multi-agent systems that learn and collaborate. AgentOps is the foundation that makes this possible.
