This post continues our coverage of Google’s agent whitepaper series, following Introduction to Agents, Agent Tools & MCP, Context Engineering, and Agent Quality & Evaluation. This fifth installment tackles the critical challenge: how do we move agents from demo to production?
Source: Prototype to Production (PDF) by Google, November 2025
The “Last Mile” Production Gap
You can spin up an AI agent prototype in minutes. But turning that demo into a trusted, production-grade system? That’s where roughly 80% of the effort is spent - not on agent intelligence, but on infrastructure, security, and validation.
The whitepaper opens with a powerful statement:
> Building an agent is easy. Trusting it is hard.
What Can Go Wrong
| Failure Scenario | Root Cause |
|---|---|
| Customer service agent gives products away free | Missing guardrails |
| User accesses confidential database through agent | Improper authentication |
| Massive consumption bill accumulates over a weekend | No monitoring configured |
| Agent that worked yesterday suddenly fails | No continuous evaluation |
These aren’t just technical problems - they’re business failures. Traditional DevOps and MLOps principles help, but agents introduce unique challenges that require a new operational discipline.
Why Agents Are Different
Unlike traditional software, agents interact autonomously, maintain state, and follow dynamic execution paths:
| Challenge | Description |
|---|---|
| Dynamic Tool Orchestration | Agent trajectory assembled on the fly - requires versioning, access control, observability |
| Scalable State Management | Memory across interactions needs secure, consistent management at scale |
| Unpredictable Cost & Latency | Different paths to answers make cost and response time hard to predict |
```mermaid
flowchart TB
subgraph Production["Production-Grade Agent (80% Effort)"]
direction LR
P1["Infrastructure"] --> P2["Security"] --> P3["Validation"] --> P4["Monitoring"]
end
subgraph Prototype["Prototype Agent (20% Effort)"]
direction LR
A1["Prompts"] --> A2["Tools"] --> A3["Logic"]
end
classDef blueClass fill:#4A90E2,stroke:#333,stroke-width:2px,color:#fff
classDef orangeClass fill:#F39C12,stroke:#333,stroke-width:2px,color:#fff
class A1,A2,A3 blueClass
class P1,P2,P3,P4 orangeClass
```
People and Process
Technology alone isn’t enough. Behind every production-grade agent is a well-orchestrated team of specialists.
Traditional MLOps Teams
| Team | Responsibilities |
|---|---|
| Cloud Platform | Infrastructure, security, access control, least-privilege roles |
| Data Engineering | Data pipelines, ingestion, preparation, quality standards |
| Data Science & MLOps | Model experimentation, training, CI/CD automation |
| ML Governance | Compliance, transparency, accountability oversight |
GenAI-Specific Roles
| Role | Focus |
|---|---|
| Prompt Engineers | Craft prompts with domain expertise, define expected model behavior |
| AI Engineers | Scale GenAI to production - evaluation, guardrails, RAG/tool integration |
| DevOps/App Developers | Build front-end interfaces integrating with GenAI backend |
The scale of your organization influences these roles. Smaller companies may have individuals wearing multiple hats, while mature organizations have specialized teams.
Evaluation as a Quality Gate
Traditional software tests are insufficient for systems that reason and adapt. Evaluating an agent requires assessing the entire trajectory of reasoning and actions - not just the final answer.
Two Implementation Approaches
1. Manual “Pre-PR” Evaluation
For teams starting their evaluation journey:
- AI Engineer runs evaluation suite locally before PR
- Performance report linked in PR description
- Reviewer assesses code AND behavioral changes
2. Automated In-Pipeline Gate
For mature teams:
- Evaluation harness integrated into CI/CD
- Failing evaluation automatically blocks deployment
- Metrics like “tool call success rate” or “helpfulness” must meet thresholds
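For teams wiring this gate into CI, here is a minimal sketch of what the check could look like: a script run after the evaluation harness that fails the build when aggregate metrics fall short. The report file name, metric names, and thresholds are illustrative assumptions, not prescriptions from the whitepaper.

```python
# eval_gate.py - illustrative CI quality gate; file name, metric names, and
# thresholds are assumptions for this sketch.
import json
import sys

THRESHOLDS = {"tool_call_success_rate": 0.95, "helpfulness": 0.80}

def main(report_path: str = "eval_report.json") -> int:
    # The evaluation harness is assumed to have written aggregate metrics here.
    with open(report_path) as f:
        metrics = json.load(f)

    failures = [
        f"{name}: {metrics.get(name, 0.0):.2f} < {minimum:.2f}"
        for name, minimum in THRESHOLDS.items()
        if metrics.get(name, 0.0) < minimum
    ]
    if failures:
        print("Evaluation gate FAILED:\n  " + "\n  ".join(failures))
        return 1  # a non-zero exit code blocks the deployment stage
    print("Evaluation gate passed.")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```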
```mermaid
flowchart LR
C["Code Change"] --> E["Evaluation Suite"]
E --> D{Pass?}
D -->|Yes| P["Deploy to Production"]
D -->|No| B["Block & Fix"]
B --> C
classDef blueClass fill:#4A90E2,stroke:#333,stroke-width:2px,color:#fff
classDef greenClass fill:#27AE60,stroke:#333,stroke-width:2px,color:#fff
classDef redClass fill:#E74C3C,stroke:#333,stroke-width:2px,color:#fff
class C,E blueClass
class P greenClass
class B redClass
```
The CI/CD Pipeline
An AI agent is a composite system - code, prompts, tool definitions, and configuration files. The CI/CD pipeline helps teams collaborate, manage complexity, and ensure quality through staged testing.
Three-Phase Pipeline
```mermaid
flowchart LR
subgraph Phase1["Phase 1: Pre-Merge (CI)"]
direction TB
PR["Pull Request"] --> UT["Unit Tests"]
UT --> LI["Linting"]
LI --> EV["Evaluation Suite"]
end
subgraph Phase2["Phase 2: Post-Merge (CD)"]
direction TB
MG["Merge"] --> BU["Build"]
BU --> ST["Deploy to Staging"]
ST --> LT["Load Tests"]
end
subgraph Phase3["Phase 3: Production"]
direction TB
AP["Approval"] --> PD["Deploy to Prod"]
PD --> MO["Monitor"]
end
Phase1 --> Phase2 --> Phase3
classDef blueClass fill:#4A90E2,stroke:#333,stroke-width:2px,color:#fff
classDef orangeClass fill:#F39C12,stroke:#333,stroke-width:2px,color:#fff
classDef greenClass fill:#27AE60,stroke:#333,stroke-width:2px,color:#fff
class PR,UT,LI,EV blueClass
class MG,BU,ST,LT orangeClass
class AP,PD,MO greenClass
```
| Phase | Trigger | Purpose | Key Activities |
|---|---|---|---|
| Pre-Merge (CI) | Pull Request | Rapid feedback, gatekeep main branch | Unit tests, linting, evaluation suite |
| Post-Merge (CD) | Merge to main | Operational readiness validation | Deploy to staging, load testing, integration tests |
| Production | Manual approval | Safe release | Human sign-off, artifact promotion, monitoring |
Enabling Technologies
- Infrastructure as Code (IaC): Tools like Terraform ensure environments are identical, repeatable, and version-controlled
- Automated Testing: Frameworks like Pytest handle agent-specific artifacts - conversation histories, tool logs, reasoning traces
- Secrets Management: API keys managed via services like Google Cloud Secret Manager
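As a concrete illustration of testing agent-specific artifacts, here is a minimal Pytest sketch that asserts on a trajectory (which tools were called, and in what order) rather than only on the final answer. The `run_agent` helper is a stand-in assumption, not an API from the whitepaper.

```python
# test_agent_trajectory.py - illustrative Pytest sketch; run_agent is a stand-in.
def run_agent(prompt: str) -> dict:
    # Stand-in for invoking the real agent (e.g. a local ADK runner or a staging
    # endpoint) and capturing its trajectory: final answer plus ordered tool calls.
    return {
        "final_answer": "Your refund for order 1234 has been initiated.",
        "tool_calls": [{"name": "lookup_order"}, {"name": "check_refund_policy"}],
    }

def test_refund_request_follows_expected_trajectory():
    trajectory = run_agent("I want a refund for order 1234")
    tool_names = [call["name"] for call in trajectory["tool_calls"]]
    # Assert on the reasoning/action path, not just the final text.
    assert tool_names == ["lookup_order", "check_refund_policy"]
    assert "refund" in trajectory["final_answer"].lower()
```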
Safe Rollout Strategies
Rather than switching 100% of users at once, minimize risk through gradual rollouts with careful monitoring.
| Strategy | Description | Use Case |
|---|---|---|
| Canary | Start with 1% of users, scale up or roll back | Early detection of issues |
| Blue-Green | Two identical environments, instant switch | Zero downtime, instant recovery |
| A/B Testing | Compare versions on real metrics | Data-driven decisions |
| Feature Flags | Deploy code, control release dynamically | Test with select users first |
All strategies require rigorous versioning of every component - code, prompts, model endpoints, tool schemas, memory structures, evaluation datasets. This enables instant rollback to known-good states.
On Google Cloud:
- Deploy agents using Agent Engine or Cloud Run
- Use Cloud Load Balancing for traffic management across versions
Building Security from the Start
Agents reason and act autonomously, creating unique risks that require security embedded from day one.
Agent-Specific Risks
| Risk | Description |
|---|---|
| Prompt Injection | Users trick agents into unintended actions |
| Data Leakage | Sensitive information exposed through responses or tool usage |
| Memory Poisoning | False information in memory corrupts future interactions |
Three Layers of Defense
```mermaid
flowchart TB
subgraph Layer1["Layer 1: Policy & System Instructions"]
direction LR
PO["Define Policies"] --> SI["System Instructions"]
end
subgraph Layer2["Layer 2: Guardrails & Filtering"]
direction LR
IF["Input Filtering"] --> OF["Output Filtering"]
OF --> HI["HITL Escalation"]
end
subgraph Layer3["Layer 3: Continuous Assurance"]
direction LR
RE["Rigorous Evaluation"] --> RT["RAI Testing"]
RT --> RD["Red Teaming"]
end
Layer1 --> Layer2 --> Layer3
classDef blueClass fill:#4A90E2,stroke:#333,stroke-width:2px,color:#fff
classDef orangeClass fill:#F39C12,stroke:#333,stroke-width:2px,color:#fff
classDef greenClass fill:#27AE60,stroke:#333,stroke-width:2px,color:#fff
class PO,SI blueClass
class IF,OF,HI orangeClass
class RE,RT,RD greenClass
```
| Layer | Components | Purpose |
|---|---|---|
| Policy & System Instructions | Defined behaviors, System Instructions | Agent’s “constitution” |
| Guardrails & Filtering | Input filters, Output filters, HITL escalation | Hard-stop enforcement |
| Continuous Assurance | Vertex AI Evaluation, RAI testing, Red teaming | Ongoing validation |
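To make Layer 2 concrete, here is a minimal sketch of input and output filters with a human-in-the-loop escalation decision. The patterns and verdict values are illustrative assumptions, not a particular library's API.

```python
# Illustrative Layer 2 guardrails; patterns and verdicts are assumptions.
import re

INJECTION_PATTERNS = [
    r"ignore (all|previous) instructions",
    r"reveal .* system prompt",
]
SENSITIVE_OUTPUT_PATTERNS = [r"\b\d{16}\b"]  # e.g. bare 16-digit card-like numbers

def check_input(user_message: str) -> str:
    """Return 'allow' or 'escalate' before the message reaches the model."""
    for pattern in INJECTION_PATTERNS:
        if re.search(pattern, user_message, re.IGNORECASE):
            return "escalate"  # route to a human-in-the-loop review queue
    return "allow"

def check_output(agent_response: str) -> str:
    """Hard-stop filter applied to the agent's answer before it is returned."""
    for pattern in SENSITIVE_OUTPUT_PATTERNS:
        if re.search(pattern, agent_response):
            return "[Response withheld pending review]"
    return agent_response
```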
Operations in Production
Once live, the focus shifts to keeping the system reliable, cost-effective, and safe. This requires a continuous operational loop.
Observe - Act - Evolve
```mermaid
flowchart LR
O["Observe<br/>Logs, Traces, Metrics"] --> A["Act<br/>Performance, Cost, Security"]
A --> E["Evolve<br/>Learn & Improve"]
E --> O
classDef blueClass fill:#4A90E2,stroke:#333,stroke-width:2px,color:#fff
classDef orangeClass fill:#F39C12,stroke:#333,stroke-width:2px,color:#fff
classDef greenClass fill:#27AE60,stroke:#333,stroke-width:2px,color:#fff
class O blueClass
class A orangeClass
class E greenClass
```
Observe: Three Pillars of Observability
| Pillar | Purpose | What It Captures |
|---|---|---|
| Logs | Factual diary of events | Tool calls, errors, decisions |
| Traces | Causal narrative | Why agent took certain actions |
| Metrics | Aggregated report card | Performance, cost, health at scale |
On Google Cloud: Cloud Trace, Cloud Logging, Cloud Monitoring, with ADK providing built-in trace integration.
Act: Operational Levers
Managing System Health:
| Goal | Strategy |
|---|---|
| Scale | Stateless containers, async processing, externalized state (Agent Engine Sessions or AlloyDB/Cloud SQL) |
| Speed | Parallel execution, aggressive caching, smaller models for routine tasks |
| Reliability | Retry with exponential backoff, idempotent tools (see the sketch after this table) |
| Cost | Shorter prompts, cheaper models for easy tasks, request batching |
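As a minimal illustration of the Reliability row, here is a retry wrapper with exponential backoff and jitter around a tool call; retries like this are only safe when the tool is idempotent. The function and its parameters are illustrative, not from the whitepaper.

```python
# Illustrative retry-with-exponential-backoff wrapper for an idempotent tool call.
import random
import time

def call_with_retries(tool_fn, *args, max_attempts=4, base_delay=0.5, **kwargs):
    for attempt in range(1, max_attempts + 1):
        try:
            return tool_fn(*args, **kwargs)
        except Exception:
            # In practice, retry only transient errors (timeouts, 429s, 5xx).
            if attempt == max_attempts:
                raise
            # Exponential backoff with jitter: ~0.5s, 1s, 2s, ...
            time.sleep(base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.1))
```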
Managing Risk - Security Response Playbook:
- Contain: Circuit breaker via feature flag to disable affected tool (see the sketch after this list)
- Triage: Route suspicious requests to HITL review queue
- Resolve: Develop patch, deploy through CI/CD pipeline
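As a minimal illustration of the Contain step, the sketch below uses a feature flag as a circuit breaker that refuses calls to an affected tool without a redeploy. The flag store and dispatcher are illustrative assumptions, not a specific service's API.

```python
# Illustrative circuit breaker via feature flag; DISABLED_TOOLS stands in for
# whatever feature-flag service or config system is actually in use.
DISABLED_TOOLS = {"issue_refund"}  # flipped by operators during an incident

def dispatch_tool(name: str, handler, **kwargs):
    if name in DISABLED_TOOLS:
        # Fail safe: refuse the action and return a message the agent can relay.
        return {"status": "unavailable",
                "message": f"The '{name}' action is temporarily disabled."}
    return {"status": "ok", "result": handler(**kwargs)}
```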
Evolve: Learning from Production
Turn observations into durable improvements:
- Analyze Production Data: Identify trends in user behavior, success rates, security incidents
- Update Evaluation Datasets: Transform failures into test cases
- Refine and Deploy: Commit improvements, trigger automated pipeline
This creates a virtuous cycle where the agent continuously improves with every user interaction.
A2A Protocol: Agent-to-Agent Interoperability
As organizations scale to dozens of specialized agents, a new challenge emerges: these agents can’t collaborate. The Agent2Agent (A2A) protocol solves this.
MCP vs A2A
| Aspect | MCP | A2A |
|---|---|---|
| Purpose | Tool integration | Agent collaboration |
| Interaction | Stateless function calls | Complex, stateful delegation |
| Use Case | “Do this specific thing” | “Achieve this complex goal” |
| Example | Fetch weather data | Analyze churn and recommend strategies |
Agent Cards
Agent Cards are standardized JSON specifications that act as a business card for each agent, declaring its identity, endpoint, and skills so other agents can discover and call it. A minimal sketch of such a card follows; the field names are paraphrased from the public A2A specification and the values are illustrative:
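```json
{
  "name": "churn_analysis_agent",
  "description": "Analyzes customer churn and recommends retention strategies.",
  "url": "https://agents.example.com/churn/a2a",
  "version": "1.0.0",
  "capabilities": { "streaming": true },
  "defaultInputModes": ["text/plain"],
  "defaultOutputModes": ["text/plain"],
  "skills": [
    {
      "id": "analyze_churn",
      "name": "Analyze churn",
      "description": "Identifies churn drivers and proposes retention actions.",
      "tags": ["analytics", "retention"]
    }
  ]
}
```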
Exposing an Agent via A2A
Using Google’s Agent Development Kit (ADK), an existing agent can be exposed as an A2A server with the `to_a2a` helper from `google.adk.a2a.utils.agent_to_a2a`. Below is a minimal sketch of the typical wiring; the agent definition, model name, and port are illustrative assumptions rather than details from the whitepaper:
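```python
from google.adk.agents import Agent
from google.adk.a2a.utils.agent_to_a2a import to_a2a

# An ordinary ADK agent; name, model, and instruction are illustrative.
root_agent = Agent(
    name="churn_analyst",
    model="gemini-2.0-flash",
    instruction="Analyze customer churn data and recommend retention strategies.",
)

# Wrap the agent in an A2A-compatible server app so other agents can discover
# it via its agent card and delegate tasks to it over A2A.
a2a_app = to_a2a(root_agent, port=8001)

# Serve it with an ASGI server, e.g.: uvicorn my_module:a2a_app --port 8001
```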
Consuming a Remote A2A Agent
On the consuming side, `RemoteA2aAgent` (from `google.adk.agents.remote_a2a_agent`) lets a local agent treat a remote A2A agent like any other sub-agent by pointing it at that agent’s card. A minimal sketch, with the card URL as an illustrative assumption:
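```python
from google.adk.agents.remote_a2a_agent import RemoteA2aAgent

# Local proxy for an agent running elsewhere, addressed via its agent card
# (the host and well-known path shown here are illustrative).
churn_agent = RemoteA2aAgent(
    name="churn_analysis_agent",
    description="Remote agent that analyzes churn and recommends retention strategies.",
    agent_card="https://agents.example.com/churn/.well-known/agent.json",
)
```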
Hierarchical Agent Composition
Local and remote agents can then be composed under a single root agent that delegates to them as sub-agents. The sketch below follows the whitepaper’s dice-rolling example; the remote prime-checking agent, the model name, and the card URL are illustrative assumptions:
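```python
import random

from google.adk.agents import Agent
from google.adk.agents.remote_a2a_agent import RemoteA2aAgent

# Local sub-agent for dice rolling
def roll_die(sides: int) -> int:
    """Roll a die with the given number of sides."""
    return random.randint(1, sides)

roll_agent = Agent(
    name="roll_agent",
    model="gemini-2.0-flash",
    instruction="Handle dice-rolling requests with the roll_die tool.",
    tools=[roll_die],
)

# Remote sub-agent consumed over A2A (card URL is illustrative).
prime_agent = RemoteA2aAgent(
    name="prime_agent",
    description="Remote agent that checks whether numbers are prime.",
    agent_card="http://localhost:8001/.well-known/agent.json",
)

# The root agent delegates to local and remote sub-agents alike.
root_agent = Agent(
    name="root_agent",
    model="gemini-2.0-flash",
    instruction="Roll dice via roll_agent; delegate prime checks to prime_agent.",
    sub_agents=[roll_agent, prime_agent],
)
```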
How A2A and MCP Work Together
A2A and MCP are complementary protocols operating at different abstraction levels:
```mermaid
flowchart TB
U["User"] --> CA["Client/Router Agent"]
CA -->|A2A| SA["Specialized Agent A"]
CA -->|A2A| SB["Specialized Agent B"]
CA -->|A2A| SC["Specialized Agent C"]
SA --> MCP1["MCP Server X"]
SB --> MCP2["MCP Server Y"]
SC --> API["API Hub Z"]
classDef blueClass fill:#4A90E2,stroke:#333,stroke-width:2px,color:#fff
classDef orangeClass fill:#F39C12,stroke:#333,stroke-width:2px,color:#fff
classDef greenClass fill:#27AE60,stroke:#333,stroke-width:2px,color:#fff
class U,CA blueClass
class SA,SB,SC orangeClass
class MCP1,MCP2,API greenClass
```
Auto Repair Shop Analogy
- User-to-Agent (A2A): Customer tells Shop Manager “My car is rattling”
- Agent-to-Agent (A2A): Shop Manager delegates to Mechanic agent
- Agent-to-Tool (MCP): Mechanic uses `scan_vehicle_for_error_codes()`, `get_repair_procedure()`
- Agent-to-Agent (A2A): Mechanic contacts Parts Supplier agent for availability
A2A facilitates conversational, task-oriented interactions. MCP provides standardized plumbing for specific tools.
Registry Architectures
When you reach thousands of tools and agents across different teams, you face a discovery problem that demands systematic solutions.
When to Build Registries
| Registry Type | Build When | Benefits |
|---|---|---|
| Tool Registry | Tool discovery bottleneck, security requires centralized auditing | Curated lists, avoid duplicates, audit access |
| Agent Registry | Multiple teams need to discover and reuse agents | Reduce redundant work, enable delegation |
Tool Registry Patterns
- Generalist agents: Access full catalog (trade speed for scope)
- Specialist agents: Predefined subsets (higher performance)
- Dynamic agents: Query registry at runtime (adapt to new tools)
Registries offer discovery and governance at the cost of maintenance. Consider starting without one and building only when scale demands it.
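A hypothetical sketch of the dynamic pattern above: the agent queries a registry at runtime to narrow thousands of tools down to a task-relevant subset. The `ToolRegistry` client and its methods are illustrative assumptions, not an existing API.

```python
# Hypothetical tool-registry client; class names and methods are illustrative.
from dataclasses import dataclass

@dataclass
class ToolSpec:
    name: str
    description: str
    endpoint: str

class ToolRegistry:
    """Stand-in for a centralized registry offering search and governance."""
    def __init__(self, catalog: list[ToolSpec]):
        self._catalog = catalog

    def search(self, query: str, limit: int = 5) -> list[ToolSpec]:
        # A real registry would combine semantic search with access-control
        # checks and audit logging; simple substring matching stands in here.
        matches = [t for t in self._catalog if query.lower() in t.description.lower()]
        return matches[:limit]

# At runtime the agent loads only a task-relevant subset of tool definitions
# into its context instead of the full catalog.
registry = ToolRegistry([
    ToolSpec("get_invoice", "Fetch an invoice by id", "https://tools.example.com/invoice"),
    ToolSpec("get_weather", "Fetch current weather for a city", "https://tools.example.com/weather"),
])
relevant_tools = registry.search("invoice")
```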
The AgentOps Lifecycle
The complete reference architecture assembles all pillars into a cohesive system:
```mermaid
flowchart TB
subgraph Dev["Development Environment"]
direction LR
EX["Experimentation"] --> AG["AI Agent"]
AG --> AP["Application"]
end
subgraph Stage["Staging Environment"]
direction LR
DE["Deploy"] --> TE["Auto Tests"]
TE --> SI["Agent Simulation"]
end
subgraph Prod["Production Environment"]
direction LR
DP["Deploy A/B"] --> OB["Observability"]
OB --> SE["Security/RAI"]
end
subgraph Gov["AI Governance"]
direction LR
RE["Repositories"] --> CI["CI/CD"]
CI --> AR["Agent Registry"]
AR --> TR["Tool Registry"]
end
Dev --> Stage --> Prod
Gov --> Dev
Gov --> Stage
Gov --> Prod
classDef blueClass fill:#4A90E2,stroke:#333,stroke-width:2px,color:#fff
classDef orangeClass fill:#F39C12,stroke:#333,stroke-width:2px,color:#fff
classDef greenClass fill:#27AE60,stroke:#333,stroke-width:2px,color:#fff
classDef purpleClass fill:#9B59B6,stroke:#333,stroke-width:2px,color:#fff
class EX,AG,AP blueClass
class DE,TE,SI orangeClass
class DP,OB,SE greenClass
class RE,CI,AR,TR purpleClass
```
The lifecycle flows from developer inner loop (rapid prototyping) through pre-production (evaluation gates) to production (observability and evolution), all governed by centralized AI governance.
Key Takeaways
- The “Last Mile” Gap is real: 80% of effort goes to infrastructure, security, and validation - not agent intelligence
- People and Process matter: Technology alone isn’t enough - coordinate Cloud Platform, Data Engineering, MLOps, and GenAI-specific roles
- Evaluation gates are non-negotiable: No agent reaches production without passing comprehensive quality checks
- Three-phase CI/CD: Pre-merge validation, post-merge staging, gated production deployment
- Safe rollouts reduce risk: Canary, Blue-Green, A/B, Feature Flags - all require rigorous versioning
- Security from day one: Three layers - Policy/System Instructions, Guardrails/Filtering, Continuous Assurance
- Observe - Act - Evolve: Continuous operational loop turns every user interaction into improvement
- A2A complements MCP: MCP for tools, A2A for agent collaboration - use both in layered architecture
- Build registries when needed: Start simple, add centralized discovery when scale demands it
- Velocity is the real value: Mature AgentOps enables deploying improvements in hours, not weeks
Connecting the Series
This whitepaper builds on concepts from our agentic AI coverage:
| Previous Post | Connection |
|---|---|
| Introduction to Agents | Agent architecture foundation for operations |
| Agent Tools & MCP | MCP complements A2A for tool integration |
| Context Engineering | Session/memory requires externalized state management |
| Agent Quality & Evaluation | Evaluation as quality gate in CI/CD pipeline |
The future is not just building better individual agents, but orchestrating sophisticated multi-agent systems that learn and collaborate. AgentOps is the foundation that makes this possible.
References
- Prototype to Production (PDF) - Google, November 2025
- Agent Starter Pack - Google Cloud Platform
- Vertex AI Evaluation
- Agent Development Kit (ADK)
- A2A Protocol Specification
- Model Context Protocol (MCP)
- Google’s Secure AI Agents Approach
- Google Secure AI Framework (SAIF)
- Cloud Run
- Vertex AI Agent Engine
- AgentOps: Operationalize AI Agents (Video)