{"slug": "your-agent-demo-works-your-agent-doesn-t", "title": "Your Agent Demo Works. Your Agent Doesn't.", "summary": "Addy Osmani and Shubham Saboo from Google Cloud published five patterns for building reliable, stateful AI agents, addressing the common failure of agents that work in demos but break in production. The patterns include checkpoint-and-resume, delegated approval with human-in-the-loop, memory-layered context, ambient processing, and fleet orchestration. They also cover interoperability protocols A2A and MCP to enable agent-to-agent and agent-to-tool communication.", "body_md": "Most agent architectures are quietly broken. They look great in a demo — single turn, clean task, instant response — and then fall apart the moment you ask them to do anything real. Like process a week of insurance claims. Or run a multi-day sales sequence. Or reconcile data across systems that don't share a clock.\n\nThe reason is simple, and nobody talks about it: **most agents are stateless under the hood.** They reconstruct context from scratch on every interaction. The reasoning chain that made the last decision make sense? Gone. The soft signals, the confidence gradients, the partial progress? All gone. You get a polite LLM that pretends to know what's going on.\n\nAddy Osmani and Shubham Saboo from Google Cloud just published five patterns for fixing this. I read the whole thing. Here's what actually matters.\n\n**1. Checkpoint-and-Resume.** Treat your agent like a long-running server, not a request handler. Checkpoint progress every N units of work — not every unit (wasteful), not just at the end (risky). If your agent dies on document 201 of 1,000, you resume at 201, not at zero. The code sample in the article checkpoints every 50 docs, which is a reasonable default. Adjust based on how expensive each unit is.\n\n**2. Delegated Approval (HITL that doesn't suck).** Most human-in-the-loop setups are awful. 
You serialize state to JSON, fire a webhook, and pray someone checks the inbox. When they respond hours later, the agent has to deserialize and re-establish context from scratch — which defeats the entire point. The fix: pause the agent in place. Keep the full execution state intact. Zero compute while waiting. Sub-second cold start when it resumes. Also: build a unified approval queue. Not Slack. Not email. A structured inbox with \"Needs input,\" \"Errors,\" \"Completed.\"\n\n**3. Memory-Layered Context.** Long-term memory plus working memory, kept distinct. Long-term is the knowledge base that accumulates across sessions. Working memory is low-latency, high-accuracy, right-now. This is the pattern most teams underestimate — because of **memory drift**. An agent that \"learns\" from a few atypical interactions can start applying bad shortcuts broadly. And when multiple agents share memory pools, you get data leakage between workflows. The kind that's hard to detect and impossible to explain to compliance. So you need: cryptographic agent identity (IAM for agents), a centralized registry, and a governance layer that blocks bad writes before they happen — not after.\n\n**4. Ambient Processing.** Some agents don't wait to be asked. They watch Pub/Sub streams, BigQuery rows, support tickets. They react continuously. The key architectural call here: **don't hardcode policies into the agent.** Externalize them. When compliance rules change, you update once at the governance layer and every ambient agent in the fleet picks up the new rules. No redeploys. No drift. No agent running an outdated version of your rules while another runs the new one.\n\n**5. Fleet Orchestration.** A coordinator agent delegates to specialist agents. Each specialist has its own identity, its own tool permissions, its own registry entry. 
This is the coordinator/worker pattern from distributed systems, but defined declaratively through graph-based workflows — so the structure is enforced by the framework, not by an LLM that might decide to shortcut the whole thing. The win: you can update specialists independently. A bad scoring agent doesn't take down the rest of the fleet.\n\nThe article also covers A2A and MCP — the interoperability protocols. A2A is how agents talk to other agents (think of it as an OpenAPI spec for agent-to-agent calls). MCP is how agents talk to tools and data. Together they mean your Python coordinator can delegate to a Go specialist without anyone negotiating a custom integration. Each org keeps its own governance boundaries. The protocol is the interface; the backend is swappable.\n\nHere's the part I want to highlight: **memory drift is the scariest problem that almost nobody is talking about.**\n\nEveryone's obsessed with prompt engineering and tool calling. Nobody's asking: what is my agent *remembering*, and how is that changing its behavior over time? When an agent accumulates experience across days and weeks, it starts behaving less like the code you wrote and more like the sum of its interactions. If a few of those interactions were weird, edge-case, or adversarial — your agent has quietly learned something you didn't intend.\n\nThis is why the governance piece isn't optional. It's load-bearing. You need agent identity, a registry, and policy enforcement at the boundary. Treat agents like microservices — because that's what they are, eventually.\n\nThe other thing worth saying: the diagnostic question from the article is the right one. **What's the longest uninterrupted unit of work your agent needs to perform?** If it's minutes, you don't need long-running agents. If it's hours or days, you need all of this, and the patterns compose. 
A compliance system might use checkpointing for processing, delegated approval for review gates, layered memory for cross-session knowledge, and fleet orchestration to coordinate the specialists.\n\nThe companies building isolated, stateless agents today will be refactoring in twelve months. The ones building with persistence, governance, and interoperability in mind will be compounding their advantage every day.\n\nThat's the bet. And it's the right one.", "url": "https://wpnews.pro/news/your-agent-demo-works-your-agent-doesn-t", "canonical_source": "https://dev.to/archit_aggarwal_5310522d5/your-agent-demo-works-your-agent-doesnt-88l", "published_at": "2026-06-20 07:19:05+00:00", "updated_at": "2026-06-20 07:36:52.379691+00:00", "lang": "en", "topics": ["ai-agents", "large-language-models", "ai-infrastructure", "developer-tools", "ai-research"], "entities": ["Google Cloud", "Addy Osmani", "Shubham Saboo", "A2A", "MCP"], "alternates": {"html": "https://wpnews.pro/news/your-agent-demo-works-your-agent-doesn-t", "markdown": "https://wpnews.pro/news/your-agent-demo-works-your-agent-doesn-t.md", "text": "https://wpnews.pro/news/your-agent-demo-works-your-agent-doesn-t.txt", "jsonld": "https://wpnews.pro/news/your-agent-demo-works-your-agent-doesn-t.jsonld"}}