This post continues our coverage of Google’s agent whitepaper series, following Introduction to Agents, Agent Tools & MCP, Context Engineering, and Agent Quality & Evaluation. This fifth installment tackles the critical challenge: how do we move agents from demo to production?
Source: Prototype to Production (PDF) by Google, November 2025
The “Last Mile” Production Gap
You can spin up an AI agent prototype in minutes. But turning that demo into a trusted, production-grade system? That’s where roughly 80% of the effort is spent - not on agent intelligence, but on infrastructure, security, and validation.
The whitepaper opens with a powerful statement:
> Building an agent is easy. Trusting it is hard.
What Can Go Wrong
| Failure Scenario | Root Cause |
|---|---|
| Customer service agent gives products away free | Missing guardrails |
| User accesses confidential database through agent | Improper authentication |
| Massive consumption bill accumulates over a weekend | No monitoring configured |
| Agent that worked yesterday suddenly fails | No continuous evaluation |
These aren’t just technical problems - they’re business failures. Traditional DevOps and MLOps principles help, but agents introduce unique challenges that require a new operational discipline.
Why Agents Are Different
Unlike traditional software, agents interact autonomously, maintain state, and follow dynamic execution paths:
| Challenge | Description |
|---|---|
| Dynamic Tool Orchestration | Agent trajectory assembled on the fly - requires versioning, access control, observability |
| Scalable State Management | Memory across interactions needs secure, consistent management at scale |
| Unpredictable Cost & Latency | Different paths to answers make cost and response time hard to predict |
```mermaid
flowchart TB
subgraph Production["Production-Grade Agent (80% Effort)"]
direction LR
P1["Infrastructure"] --> P2["Security"] --> P3["Validation"] --> P4["Monitoring"]
end
subgraph Prototype["Prototype Agent (20% Effort)"]
direction LR
A1["Prompts"] --> A2["Tools"] --> A3["Logic"]
end
classDef blueClass fill:#4A90E2,stroke:#333,stroke-width:2px,color:#fff
classDef orangeClass fill:#F39C12,stroke:#333,stroke-width:2px,color:#fff
class A1,A2,A3 blueClass
class P1,P2,P3,P4 orangeClass
```
People and Process
Technology alone isn’t enough. Behind every production-grade agent is a well-orchestrated team of specialists.
Traditional MLOps Teams
| Team | Responsibilities |
|---|---|
| Cloud Platform | Infrastructure, security, access control, least-privilege roles |
| Data Engineering | Data pipelines, ingestion, preparation, quality standards |
| Data Science & MLOps | Model experimentation, training, CI/CD automation |
| ML Governance | Compliance, transparency, accountability oversight |
GenAI-Specific Roles
| Role | Focus |
|---|---|
| Prompt Engineers | Craft prompts with domain expertise, define expected model behavior |
| AI Engineers | Scale GenAI to production - evaluation, guardrails, RAG/tool integration |
| DevOps/App Developers | Build front-end interfaces integrating with GenAI backend |
The scale of your organization influences these roles. Smaller companies may have individuals wearing multiple hats, while mature organizations have specialized teams.
Evaluation as a Quality Gate
Traditional software tests are insufficient for systems that reason and adapt. Evaluating an agent requires assessing the entire trajectory of reasoning and actions - not just the final answer.
Two Implementation Approaches
1. Manual “Pre-PR” Evaluation
For teams starting their evaluation journey:
- AI Engineer runs evaluation suite locally before PR
- Performance report linked in PR description
- Reviewer assesses code AND behavioral changes
2. Automated In-Pipeline Gate
For mature teams:
- Evaluation harness integrated into CI/CD
- Failing evaluation automatically blocks deployment
- Metrics like “tool call success rate” or “helpfulness” must meet thresholds
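For teams wiring this gate into CI, here is a minimal sketch of what the check could look like: a script run after the evaluation harness that fails the build when aggregate metrics fall short. The report file name, metric names, and thresholds are illustrative assumptions, not prescriptions from the whitepaper.

```python
# eval_gate.py - illustrative CI quality gate; file name, metric names, and
# thresholds are assumptions for this sketch.
import json
import sys

THRESHOLDS = {"tool_call_success_rate": 0.95, "helpfulness": 0.80}

def main(report_path: str = "eval_report.json") -> int:
    # The evaluation harness is assumed to have written aggregate metrics here.
    with open(report_path) as f:
        metrics = json.load(f)

    failures = [
        f"{name}: {metrics.get(name, 0.0):.2f} < {minimum:.2f}"
        for name, minimum in THRESHOLDS.items()
        if metrics.get(name, 0.0) < minimum
    ]
    if failures:
        print("Evaluation gate FAILED:\n  " + "\n  ".join(failures))
        return 1  # a non-zero exit code blocks the deployment stage
    print("Evaluation gate passed.")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```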
```mermaid
flowchart LR
C["Code Change"] --> E["Evaluation Suite"]
E --> D{Pass?}
D -->|Yes| P["Deploy to Production"]
D -->|No| B["Block & Fix"]
B --> C
classDef blueClass fill:#4A90E2,stroke:#333,stroke-width:2px,color:#fff
classDef greenClass fill:#27AE60,stroke:#333,stroke-width:2px,color:#fff
classDef redClass fill:#E74C3C,stroke:#333,stroke-width:2px,color:#fff
class C,E blueClass
class P greenClass
class B redClass
```
The CI/CD Pipeline
An AI agent is a composite system - code, prompts, tool definitions, and configuration files. The CI/CD pipeline helps teams collaborate, manage complexity, and ensure quality through staged testing.
Three-Phase Pipeline
```mermaid
flowchart LR
subgraph Phase1["Phase 1: Pre-Merge (CI)"]
direction TB
PR["Pull Request"] --> UT["Unit Tests"]
UT --> LI["Linting"]
LI --> EV["Evaluation Suite"]
end
subgraph Phase2["Phase 2: Post-Merge (CD)"]
direction TB
MG["Merge"] --> BU["Build"]
BU --> ST["Deploy to Staging"]
ST --> LT["Load Tests"]
end
subgraph Phase3["Phase 3: Production"]
direction TB
AP["Approval"] --> PD["Deploy to Prod"]
PD --> MO["Monitor"]
end
Phase1 --> Phase2 --> Phase3
classDef blueClass fill:#4A90E2,stroke:#333,stroke-width:2px,color:#fff
classDef orangeClass fill:#F39C12,stroke:#333,stroke-width:2px,color:#fff
classDef greenClass fill:#27AE60,stroke:#333,stroke-width:2px,color:#fff
class PR,UT,LI,EV blueClass
class MG,BU,ST,LT orangeClass
class AP,PD,MO greenClass
```
| Phase | Trigger | Purpose | Key Activities |
|---|---|---|---|
| Pre-Merge (CI) | Pull Request | Rapid feedback, gatekeep main branch | Unit tests, linting, evaluation suite |
| Post-Merge (CD) | Merge to main | Operational readiness validation | Deploy to staging, load testing, integration tests |
| Production | Manual approval | Safe release | Human sign-off, artifact promotion, monitoring |
Enabling Technologies
- Infrastructure as Code (IaC): Tools like Terraform ensure environments are identical, repeatable, and version-controlled
- Automated Testing: Frameworks like Pytest handle agent-specific artifacts - conversation histories, tool logs, reasoning traces
- Secrets Management: API keys managed via services like Google Cloud Secret Manager
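As a concrete illustration of testing agent-specific artifacts, here is a minimal Pytest sketch that asserts on a trajectory (which tools were called, and in what order) rather than only on the final answer. The `run_agent` helper is a stand-in assumption, not an API from the whitepaper.

```python
# test_agent_trajectory.py - illustrative Pytest sketch; run_agent is a stand-in.
def run_agent(prompt: str) -> dict:
    # Stand-in for invoking the real agent (e.g. a local ADK runner or a staging
    # endpoint) and capturing its trajectory: final answer plus ordered tool calls.
    return {
        "final_answer": "Your refund for order 1234 has been initiated.",
        "tool_calls": [{"name": "lookup_order"}, {"name": "check_refund_policy"}],
    }

def test_refund_request_follows_expected_trajectory():
    trajectory = run_agent("I want a refund for order 1234")
    tool_names = [call["name"] for call in trajectory["tool_calls"]]
    # Assert on the reasoning/action path, not just the final text.
    assert tool_names == ["lookup_order", "check_refund_policy"]
    assert "refund" in trajectory["final_answer"].lower()
```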
Safe Rollout Strategies
Rather than switching 100% of users at once, minimize risk through gradual rollouts with careful monitoring.
| Strategy | Description | Use Case |
|---|---|---|
| Canary | Start with 1% of users, scale up or roll back | Early detection of issues |
| Blue-Green | Two identical environments, instant switch | Zero downtime, instant recovery |
| A/B Testing | Compare versions on real metrics | Data-driven decisions |
| Feature Flags | Deploy code, control release dynamically | Test with select users first |
All strategies require rigorous versioning of every component - code, prompts, model endpoints, tool schemas, memory structures, evaluation datasets. This enables instant rollback to known-good states.
On Google Cloud:
- Deploy agents using Agent Engine or Cloud Run
- Use Cloud Load Balancing for traffic management across versions
Building Security from the Start
Agents reason and act autonomously, creating unique risks that require security embedded from day one.
Agent-Specific Risks
| Risk | Description |
|---|---|
| Prompt Injection | Users trick agents into unintended actions |
| Data Leakage | Sensitive information exposed through responses or tool usage |
| Memory Poisoning | False information in memory corrupts future interactions |
Three Layers of Defense
```mermaid
flowchart TB
subgraph Layer1["Layer 1: Policy & System Instructions"]
direction LR
PO["Define Policies"] --> SI["System Instructions"]
end
subgraph Layer2["Layer 2: Guardrails & Filtering"]
direction LR
IF["Input Filtering"] --> OF["Output Filtering"]
OF --> HI["HITL Escalation"]
end
subgraph Layer3["Layer 3: Continuous Assurance"]
direction LR
RE["Rigorous Evaluation"] --> RT["RAI Testing"]
RT --> RD["Red Teaming"]
end
Layer1 --> Layer2 --> Layer3
classDef blueClass fill:#4A90E2,stroke:#333,stroke-width:2px,color:#fff
classDef orangeClass fill:#F39C12,stroke:#333,stroke-width:2px,color:#fff
classDef greenClass fill:#27AE60,stroke:#333,stroke-width:2px,color:#fff
class PO,SI blueClass
class IF,OF,HI orangeClass
class RE,RT,RD greenClass
```
| Layer | Components | Purpose |
|---|---|---|
| Policy & System Instructions | Defined behaviors, System Instructions | Agent’s “constitution” |
| Guardrails & Filtering | Input filters, Output filters, HITL escalation | Hard-stop enforcement |
| Continuous Assurance | Vertex AI Evaluation, RAI testing, Red teaming | Ongoing validation |
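To make Layer 2 concrete, here is a minimal sketch of input and output filters with a human-in-the-loop escalation decision. The patterns and verdict values are illustrative assumptions, not a particular library's API.

```python
# Illustrative Layer 2 guardrails; patterns and verdicts are assumptions.
import re

INJECTION_PATTERNS = [
    r"ignore (all|previous) instructions",
    r"reveal .* system prompt",
]
SENSITIVE_OUTPUT_PATTERNS = [r"\b\d{16}\b"]  # e.g. bare 16-digit card-like numbers

def check_input(user_message: str) -> str:
    """Return 'allow' or 'escalate' before the message reaches the model."""
    for pattern in INJECTION_PATTERNS:
        if re.search(pattern, user_message, re.IGNORECASE):
            return "escalate"  # route to a human-in-the-loop review queue
    return "allow"

def check_output(agent_response: str) -> str:
    """Hard-stop filter applied to the agent's answer before it is returned."""
    for pattern in SENSITIVE_OUTPUT_PATTERNS:
        if re.search(pattern, agent_response):
            return "[Response withheld pending review]"
    return agent_response
```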
Operations in Production
Once live, the focus shifts to keeping the system reliable, cost-effective, and safe. This requires a continuous operational loop.
Observe - Act - Evolve
```mermaid
flowchart LR
O["Observe<br/>Logs, Traces, Metrics"] --> A["Act<br/>Performance, Cost, Security"]
A --> E["Evolve<br/>Learn & Improve"]
E --> O
classDef blueClass fill:#4A90E2,stroke:#333,stroke-width:2px,color:#fff
classDef orangeClass fill:#F39C12,stroke:#333,stroke-width:2px,color:#fff
classDef greenClass fill:#27AE60,stroke:#333,stroke-width:2px,color:#fff
class O blueClass
class A orangeClass
class E greenClass
```
Observe: Three Pillars of Observability
| Pillar | Purpose | What It Captures |
|---|---|---|
| Logs | Factual diary of events | Tool calls, errors, decisions |
| Traces | Causal narrative | Why agent took certain actions |
| Metrics | Aggregated report card | Performance, cost, health at scale |
On Google Cloud: Cloud Trace, Cloud Logging, Cloud Monitoring, with ADK providing built-in trace integration.
Act: Operational Levers
Managing System Health:
| Goal | Strategy |
|---|---|
| Scale | Stateless containers, async processing, externalized state (Agent Engine Sessions or AlloyDB/Cloud SQL) |
| Speed | Parallel execution, aggressive caching, smaller models for routine tasks |
| Reliability | Retry with exponential backoff, idempotent tools (see the sketch after this table) |
| Cost | Shorter prompts, cheaper models for easy tasks, request batching |
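As a minimal illustration of the Reliability row, here is a retry wrapper with exponential backoff and jitter around a tool call; retries like this are only safe when the tool is idempotent. The function and its parameters are illustrative, not from the whitepaper.

```python
# Illustrative retry-with-exponential-backoff wrapper for an idempotent tool call.
import random
import time

def call_with_retries(tool_fn, *args, max_attempts=4, base_delay=0.5, **kwargs):
    for attempt in range(1, max_attempts + 1):
        try:
            return tool_fn(*args, **kwargs)
        except Exception:
            # In practice, retry only transient errors (timeouts, 429s, 5xx).
            if attempt == max_attempts:
                raise
            # Exponential backoff with jitter: ~0.5s, 1s, 2s, ...
            time.sleep(base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.1))
```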
Managing Risk - Security Response Playbook:
- Contain: Circuit breaker via feature flag to disable affected tool (see the sketch after this list)
- Triage: Route suspicious requests to HITL review queue
- Resolve: Develop patch, deploy through CI/CD pipeline
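As a minimal illustration of the Contain step, the sketch below uses a feature flag as a circuit breaker that refuses calls to an affected tool without a redeploy. The flag store and dispatcher are illustrative assumptions, not a specific service's API.

```python
# Illustrative circuit breaker via feature flag; DISABLED_TOOLS stands in for
# whatever feature-flag service or config system is actually in use.
DISABLED_TOOLS = {"issue_refund"}  # flipped by operators during an incident

def dispatch_tool(name: str, handler, **kwargs):
    if name in DISABLED_TOOLS:
        # Fail safe: refuse the action and return a message the agent can relay.
        return {"status": "unavailable",
                "message": f"The '{name}' action is temporarily disabled."}
    return {"status": "ok", "result": handler(**kwargs)}
```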
Evolve: Learning from Production
Turn observations into durable improvements:
- Analyze Production Data: Identify trends in user behavior, success rates, security incidents
- Update Evaluation Datasets: Transform failures into test cases
- Refine and Deploy: Commit improvements, trigger automated pipeline
This creates a virtuous cycle where the agent continuously improves with every user interaction.
A2A Protocol: Agent-to-Agent Interoperability
As organizations scale to dozens of specialized agents, a new challenge emerges: these agents can’t collaborate. The Agent2Agent (A2A) protocol solves this.
MCP vs A2A
| Aspect | MCP | A2A |
|---|---|---|
| Purpose | Tool integration | Agent collaboration |
| Interaction | Stateless function calls | Complex, stateful delegation |
| Use Case | “Do this specific thing” | “Achieve this complex goal” |
| Example | Fetch weather data | Analyze churn and recommend strategies |
Agent Cards
Agent Cards are standardized JSON specifications that act as a business card for each agent, declaring its identity, endpoint, and skills so other agents can discover and call it. A minimal sketch of such a card follows; the field names are paraphrased from the public A2A specification and the values are illustrative:
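```json
{
  "name": "churn_analysis_agent",
  "description": "Analyzes customer churn and recommends retention strategies.",
  "url": "https://agents.example.com/churn/a2a",
  "version": "1.0.0",
  "capabilities": { "streaming": true },
  "defaultInputModes": ["text/plain"],
  "defaultOutputModes": ["text/plain"],
  "skills": [
    {
      "id": "analyze_churn",
      "name": "Analyze churn",
      "description": "Identifies churn drivers and proposes retention actions.",
      "tags": ["analytics", "retention"]
    }
  ]
}
```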
Exposing an Agent via A2A
Using Google’s Agent Development Kit (ADK), an existing agent can be exposed as an A2A server with the `to_a2a` helper from `google.adk.a2a.utils.agent_to_a2a`. Below is a minimal sketch of the typical wiring; the agent definition, model name, and port are illustrative assumptions rather than details from the whitepaper:
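```python
from google.adk.agents import Agent
from google.adk.a2a.utils.agent_to_a2a import to_a2a

# An ordinary ADK agent; name, model, and instruction are illustrative.
root_agent = Agent(
    name="churn_analyst",
    model="gemini-2.0-flash",
    instruction="Analyze customer churn data and recommend retention strategies.",
)

# Wrap the agent in an A2A-compatible server app so other agents can discover
# it via its agent card and delegate tasks to it over A2A.
a2a_app = to_a2a(root_agent, port=8001)

# Serve it with an ASGI server, e.g.: uvicorn my_module:a2a_app --port 8001
```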
Consuming a Remote A2A Agent
On the consuming side, `RemoteA2aAgent` (from `google.adk.agents.remote_a2a_agent`) lets a local agent treat a remote A2A agent like any other sub-agent by pointing it at that agent’s card. A minimal sketch, with the card URL as an illustrative assumption:
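```python
from google.adk.agents.remote_a2a_agent import RemoteA2aAgent

# Local proxy for an agent running elsewhere, addressed via its agent card
# (the host and well-known path shown here are illustrative).
churn_agent = RemoteA2aAgent(
    name="churn_analysis_agent",
    description="Remote agent that analyzes churn and recommends retention strategies.",
    agent_card="https://agents.example.com/churn/.well-known/agent.json",
)
```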
Hierarchical Agent Composition
Local and remote agents can then be composed under a single root agent that delegates to them as sub-agents. The sketch below follows the whitepaper’s dice-rolling example; the remote prime-checking agent, the model name, and the card URL are illustrative assumptions:
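```python
import random

from google.adk.agents import Agent
from google.adk.agents.remote_a2a_agent import RemoteA2aAgent

# Local sub-agent for dice rolling
def roll_die(sides: int) -> int:
    """Roll a die with the given number of sides."""
    return random.randint(1, sides)

roll_agent = Agent(
    name="roll_agent",
    model="gemini-2.0-flash",
    instruction="Handle dice-rolling requests with the roll_die tool.",
    tools=[roll_die],
)

# Remote sub-agent consumed over A2A (card URL is illustrative).
prime_agent = RemoteA2aAgent(
    name="prime_agent",
    description="Remote agent that checks whether numbers are prime.",
    agent_card="http://localhost:8001/.well-known/agent.json",
)

# The root agent delegates to local and remote sub-agents alike.
root_agent = Agent(
    name="root_agent",
    model="gemini-2.0-flash",
    instruction="Roll dice via roll_agent; delegate prime checks to prime_agent.",
    sub_agents=[roll_agent, prime_agent],
)
```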
How A2A and MCP Work Together
A2A and MCP are complementary protocols operating at different abstraction levels:
```mermaid
flowchart TB
U["User"] --> CA["Client/Router Agent"]
CA -->|A2A| SA["Specialized Agent A"]
CA -->|A2A| SB["Specialized Agent B"]
CA -->|A2A| SC["Specialized Agent C"]
SA --> MCP1["MCP Server X"]
SB --> MCP2["MCP Server Y"]
SC --> API["API Hub Z"]
classDef blueClass fill:#4A90E2,stroke:#333,stroke-width:2px,color:#fff
classDef orangeClass fill:#F39C12,stroke:#333,stroke-width:2px,color:#fff
classDef greenClass fill:#27AE60,stroke:#333,stroke-width:2px,color:#fff
class U,CA blueClass
class SA,SB,SC orangeClass
class MCP1,MCP2,API greenClass
```
Auto Repair Shop Analogy
- User-to-Agent (A2A): Customer tells Shop Manager “My car is rattling”
- Agent-to-Agent (A2A): Shop Manager delegates to Mechanic agent
- Agent-to-Tool (MCP): Mechanic uses `scan_vehicle_for_error_codes()`, `get_repair_procedure()`
- Agent-to-Agent (A2A): Mechanic contacts Parts Supplier agent for availability
A2A facilitates conversational, task-oriented interactions. MCP provides standardized plumbing for specific tools.
Registry Architectures
When you reach thousands of tools and agents across different teams, you face a discovery problem that demands systematic solutions.
When to Build Registries
| Registry Type | Build When | Benefits |
|---|---|---|
| Tool Registry | Tool discovery bottleneck, security requires centralized auditing | Curated lists, avoid duplicates, audit access |
| Agent Registry | Multiple teams need to discover and reuse agents | Reduce redundant work, enable delegation |
Tool Registry Patterns
- Generalist agents: Access full catalog (trade speed for scope)
- Specialist agents: Predefined subsets (higher performance)
- Dynamic agents: Query registry at runtime (adapt to new tools)
Registries offer discovery and governance at the cost of maintenance. Consider starting without one and building only when scale demands it.
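A hypothetical sketch of the dynamic pattern above: the agent queries a registry at runtime to narrow thousands of tools down to a task-relevant subset. The `ToolRegistry` client and its methods are illustrative assumptions, not an existing API.

```python
# Hypothetical tool-registry client; class names and methods are illustrative.
from dataclasses import dataclass

@dataclass
class ToolSpec:
    name: str
    description: str
    endpoint: str

class ToolRegistry:
    """Stand-in for a centralized registry offering search and governance."""
    def __init__(self, catalog: list[ToolSpec]):
        self._catalog = catalog

    def search(self, query: str, limit: int = 5) -> list[ToolSpec]:
        # A real registry would combine semantic search with access-control
        # checks and audit logging; simple substring matching stands in here.
        matches = [t for t in self._catalog if query.lower() in t.description.lower()]
        return matches[:limit]

# At runtime the agent loads only a task-relevant subset of tool definitions
# into its context instead of the full catalog.
registry = ToolRegistry([
    ToolSpec("get_invoice", "Fetch an invoice by id", "https://tools.example.com/invoice"),
    ToolSpec("get_weather", "Fetch current weather for a city", "https://tools.example.com/weather"),
])
relevant_tools = registry.search("invoice")
```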
The AgentOps Lifecycle
The complete reference architecture assembles all pillars into a cohesive system:
```mermaid
flowchart TB
subgraph Dev["Development Environment"]
direction LR
EX["Experimentation"] --> AG["AI Agent"]
AG --> AP["Application"]
end
subgraph Stage["Staging Environment"]
direction LR
DE["Deploy"] --> TE["Auto Tests"]
TE --> SI["Agent Simulation"]
end
subgraph Prod["Production Environment"]
direction LR
DP["Deploy A/B"] --> OB["Observability"]
OB --> SE["Security/RAI"]
end
subgraph Gov["AI Governance"]
direction LR
RE["Repositories"] --> CI["CI/CD"]
CI --> AR["Agent Registry"]
AR --> TR["Tool Registry"]
end
Dev --> Stage --> Prod
Gov --> Dev
Gov --> Stage
Gov --> Prod
classDef blueClass fill:#4A90E2,stroke:#333,stroke-width:2px,color:#fff
classDef orangeClass fill:#F39C12,stroke:#333,stroke-width:2px,color:#fff
classDef greenClass fill:#27AE60,stroke:#333,stroke-width:2px,color:#fff
classDef purpleClass fill:#9B59B6,stroke:#333,stroke-width:2px,color:#fff
class EX,AG,AP blueClass
class DE,TE,SI orangeClass
class DP,OB,SE greenClass
class RE,CI,AR,TR purpleClass
```
The lifecycle flows from developer inner loop (rapid prototyping) through pre-production (evaluation gates) to production (observability and evolution), all governed by centralized AI governance.
Key Takeaways
- The “Last Mile” Gap is real: 80% of effort goes to infrastructure, security, and validation - not agent intelligence
- People and Process matter: Technology alone isn’t enough - coordinate Cloud Platform, Data Engineering, MLOps, and GenAI-specific roles
- Evaluation gates are non-negotiable: No agent reaches production without passing comprehensive quality checks
- Three-phase CI/CD: Pre-merge validation, post-merge staging, gated production deployment
- Safe rollouts reduce risk: Canary, Blue-Green, A/B, Feature Flags - all require rigorous versioning
- Security from day one: Three layers - Policy/System Instructions, Guardrails/Filtering, Continuous Assurance
- Observe - Act - Evolve: Continuous operational loop turns every user interaction into improvement
- A2A complements MCP: MCP for tools, A2A for agent collaboration - use both in layered architecture
- Build registries when needed: Start simple, add centralized discovery when scale demands it
- Velocity is the real value: Mature AgentOps enables deploying improvements in hours, not weeks
Connecting the Series
This whitepaper builds on concepts from our agentic AI coverage:
| Previous Post | Connection |
|---|---|
| Introduction to Agents | Agent architecture foundation for operations |
| Agent Tools & MCP | MCP complements A2A for tool integration |
| Context Engineering | Session/memory requires externalized state management |
| Agent Quality & Evaluation | Evaluation as quality gate in CI/CD pipeline |
The future is not just building better individual agents, but orchestrating sophisticated multi-agent systems that learn and collaborate. AgentOps is the foundation that makes this possible.
References
- Prototype to Production (PDF) - Google, November 2025
- Agent Starter Pack - Google Cloud Platform
- Vertex AI Evaluation
- Agent Development Kit (ADK)
- A2A Protocol Specification
- Model Context Protocol (MCP)
- Google’s Secure AI Agents Approach
- Google Secure AI Framework (SAIF)
- Cloud Run
- Vertex AI Agent Engine
- AgentOps: Operationalize AI Agents (Video)