Picture this. Dinner guests arriving in an hour. Four people, each capable, each assigned a job.
One handles the grill. One sets the table. One makes the salad. One runs the music.
Thirty minutes in: the grill isn't heating because nobody opened the propane valve. The salad person is waiting on ingredients that were never passed over. The appetizers are cold because the reheating was supposed to happen ten minutes ago. The DJ paired to the wrong speaker and is now blasting techno into the baby's room.
No one was incompetent. Everyone knew their job. The whole thing fell apart because there was no system for coordination.
That is a near-perfect description of most multi-agent AI systems running in production today.
Each agent is capable — a coder, a researcher, a planner, a writer. But without shared memory, deliberate orchestration, and proper state management, capable agents produce incoherent results. The failure isn't in the intelligence of the individual agents. It's in the architecture that's supposed to make them a team.
This post breaks down the five-layer architecture that separates production multi-agent systems from expensive demos — and names the specific failure modes you will hit if you skip any of them.
Before the architecture, the failure taxonomy. These three problems appear, in some combination, in nearly every multi-agent system that didn't make it to production:
The Chaos Problem. No orchestration means agents act in parallel without coordination. One agent fetches data while another modifies it. One writes a response while another has already decided the query requires escalation. The outputs contradict each other, or worse — they corrupt shared state.
The Amnesia Problem. Agents can't access context from previous steps in the workflow. Each call starts fresh. An agent that just retrieved customer history has no way to pass that context to the agent writing the response — unless you explicitly build the memory layer. Most teams don't, until it's too late.
The Black Box Problem. Something goes wrong. You have no trace of which agent made which decision, what state the system was in, or what inputs triggered the failure. You can't reproduce it. You can't fix it. You can only watch it happen again.
If any of these sound familiar from your own experiments, keep reading — the architecture below is designed to close all three gaps.
Here's the framework: five layers that must all be functional before a multi-agent system can deliver consistent value in production. Think of them as load-bearing walls. You can skip one in a prototype. You cannot skip one in production.
┌─────────────────────────────────────────────────────┐
│  Layer 1: Orchestration                             │
│  Orchestrator · Classifier · Agent Registry         │
├─────────────────────────────────────────────────────┤
│  Layer 2: Knowledge                                 │
│  Source Bases (RAG) · Vector DBs                    │
├─────────────────────────────────────────────────────┤
│  Layer 3: Agents                                    │
│  Specialized Agents · MCP Client · Local/Remote     │
├─────────────────────────────────────────────────────┤
│  Layer 4: Storage                                   │
│  Conversation History · Agent State · Registry DB   │
├─────────────────────────────────────────────────────┤
│  Layer 5: Integration & Observability               │
│  MCP Server · External Tools · Trace · Evals        │
└─────────────────────────────────────────────────────┘
The orchestration layer is the component that kills the dinner party chaos problem. Without it, you have a group chat where everyone shouts simultaneously. With it, you have a conductor who decides who plays, when, and with what information.
The orchestrator is responsible for classifying each request and deciding which agent handles it, in what order, and with what context.
The Agent Registry is the orchestrator's phonebook. It knows what agents exist, what capabilities each one exposes, and whether each agent is currently available. At small scale (2–3 agents), this is trivial. At production scale with dozens of specialized agents, a governed registry is the only way to keep routing reliable without hard-coding every path.
Microsoft's Agent Framework (MAF) — a fusion of Semantic Kernel and AutoGen — implements this pattern. But the concepts apply regardless of framework. LangGraph's node-based routing, CrewAI's role-based delegation, and custom orchestrators all need to solve the same problem: deterministic routing with dynamic capability discovery.
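To make the registry-plus-routing idea concrete, here is a minimal Python sketch. It is not the API of any framework named above. `AgentRecord`, `AgentRegistry`, `classify`, and `route` are hypothetical names, and the keyword classifier is a stand-in for whatever model or rules a real orchestrator would use.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class AgentRecord:
    """One registry entry: what an agent can do and whether it is usable."""
    name: str
    capabilities: List[str]
    handler: Callable[[str], str]
    available: bool = True


class AgentRegistry:
    """Hypothetical registry: lookup is by capability, not by agent name."""

    def __init__(self) -> None:
        self._agents: Dict[str, AgentRecord] = {}

    def register(self, record: AgentRecord) -> None:
        self._agents[record.name] = record

    def find(self, capability: str) -> Optional[AgentRecord]:
        # Iterate in sorted order so the same inputs always pick the same agent.
        for name in sorted(self._agents):
            record = self._agents[name]
            if record.available and capability in record.capabilities:
                return record
        return None


def classify(query: str) -> str:
    """Toy keyword classifier (an assumption; real systems may use a model)."""
    text = query.lower()
    if "invoice" in text or "refund" in text:
        return "finance"
    if "bug" in text or "stack trace" in text:
        return "coding"
    return "research"


def route(registry: AgentRegistry, query: str) -> str:
    """Classify the query, then dispatch to whichever agent offers that capability."""
    capability = classify(query)
    agent = registry.find(capability)
    if agent is None:
        return f"escalate: no available agent for '{capability}'"
    return agent.handler(query)


registry = AgentRegistry()
registry.register(AgentRecord("fin-1", ["finance"], lambda q: f"[fin-1] {q}"))
registry.register(AgentRecord("dev-1", ["coding"], lambda q: f"[dev-1] {q}"))

print(route(registry, "Customer wants a refund on invoice 88"))
print(route(registry, "Summarize competitor pricing"))  # no research agent yet

# Adding an agent changes what route() can reach without editing route().
registry.register(AgentRecord("res-1", ["research"], lambda q: f"[res-1] {q}"))
print(route(registry, "Summarize competitor pricing"))
```

The point of the sketch is the last three lines: a new capability becomes routable by registering it, with no hard-coded path added to the orchestrator.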
Agents need two kinds of knowledge access: domain-specific content and semantic search over unstructured data.
Source Bases are where you store the specialized content that transforms general-purpose AI responses into expert answers. Policy documents. Product FAQs. Regulatory guidelines. Internal runbooks. The implementation varies — knowledge graphs, document repositories, fine-tuned models — but the goal is consistent: give agents the specific information they need to be right about your domain, not just right in general.
Vector databases enable semantic search over that content. When a support agent searches "issues with login after password reset," vector search understands the semantic relationship between authentication state and credential management. Keyword matching doesn't.
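The mechanics behind that claim can be shown with a toy example. The sketch below ranks documents by cosine similarity. The three-dimensional vectors are hand-assigned stand-ins for real embeddings (the axes loosely mean "authentication", "billing", and "shipping"), so treat it as an illustration of the ranking step only, not of how an embedding model or a vector database actually works.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hand-assigned toy vectors (assumption): a real system would get these
# from an embedding model and store them in a vector database.
DOCS: Dict[str, List[float]] = {
    "Credential changes can invalidate active sessions": [0.9, 0.1, 0.0],
    "How to update a payment method": [0.1, 0.9, 0.1],
    "Tracking a delayed parcel": [0.0, 0.1, 0.9],
}


def search(query_vec: List[float], top_k: int = 1) -> List[Tuple[float, str]]:
    """Return the top_k documents ranked by similarity to the query vector."""
    scored = [(cosine(query_vec, vec), text) for text, vec in DOCS.items()]
    scored.sort(reverse=True)
    return scored[:top_k]


# Pretend embedding of "issues with login after password reset".
query_vec = [0.8, 0.2, 0.0]
best = search(query_vec)[0][1]
print(best)
```

Note that the winning document shares no keywords with the query. It wins because its vector points in the same direction, which is the behavior keyword matching cannot reproduce.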
The critical retrieval decision that most teams get wrong: RAG vs. MCP is not a style preference. It is a functional distinction.
Use RAG when:
- Content is static or semi-static (policy docs, FAQs, guides)
- Search relevance is the primary quality lever
- You need to synthesize across multiple documents
Use MCP when:
- The agent needs real-time system state
- The operation writes or modifies data
- You need live API access (inventory, CRM, ERP)
"How many units of SKU-123 are in stock right now?" is not a search question. It's an API call to your ERP. Routing it through RAG produces a stale answer. Routing it through an MCP tool call produces the live value.
The mistake teams make: reaching for RAG everywhere because it's simpler to set up, then spending three months debugging why the agent keeps giving wrong inventory data. The fix isn't better embeddings. The problem is the wrong retrieval pattern.
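The decision rule above is simple enough to write down as code. This is a sketch under the assumption that each request can be tagged with whether it needs live state or writes data. `Request`, `Retrieval`, and `choose_retrieval` are hypothetical names, and how you derive those two flags is the hard part that this example leaves out.

```python
from dataclasses import dataclass
from enum import Enum


class Retrieval(Enum):
    RAG = "rag"  # search over static or semi-static content
    MCP = "mcp"  # live tool or API call


@dataclass(frozen=True)
class Request:
    description: str
    needs_live_state: bool = False
    writes_data: bool = False


def choose_retrieval(req: Request) -> Retrieval:
    """Live reads and any write go through a tool call; everything else is search."""
    if req.needs_live_state or req.writes_data:
        return Retrieval.MCP
    return Retrieval.RAG


stock = Request("How many units of SKU-123 are in stock right now?", needs_live_state=True)
policy = Request("What is the return policy for opened items?")

print(choose_retrieval(stock).value)
print(choose_retrieval(policy).value)
```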
Each agent in the system is a specialist. A finance agent. A coding agent. A research agent. A customer-facing support agent. Each is fine-tuned or prompted for its domain, with access to the relevant subset of the knowledge layer.
Agents communicate with external tools via MCP Client — a standardized interface that handles authentication, manages connections, and formats requests consistently regardless of the target tool. This abstraction is what lets you swap out the underlying tool (say, switching from one search provider to another) without rewriting every agent that uses it.
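Here is a rough sketch of why that abstraction pays off. It does not use the real MCP SDK. `SearchTool`, `ToolClient`, `ProviderA`, `ProviderB`, and `ResearchAgent` are all hypothetical, and the client here only normalizes the request, whereas a real client would also handle authentication and connections.

```python
from typing import List, Protocol


class SearchTool(Protocol):
    """Anything with a search() method can sit behind the client."""
    def search(self, query: str) -> List[str]: ...


class ProviderA:
    def search(self, query: str) -> List[str]:
        return [f"provider-a result for '{query}'"]


class ProviderB:
    def search(self, query: str) -> List[str]:
        return [f"provider-b result for '{query}'"]


class ToolClient:
    """Stand-in for the client layer: one place to format requests."""

    def __init__(self, tool: SearchTool) -> None:
        self._tool = tool

    def call(self, query: str) -> List[str]:
        # Collapse whitespace and lowercase so every tool sees the same shape.
        normalized = " ".join(query.split()).lower()
        return self._tool.search(normalized)


class ResearchAgent:
    """The agent only knows about the client, never about a specific provider."""

    def __init__(self, client: ToolClient) -> None:
        self._client = client

    def answer(self, question: str) -> str:
        return self._client.call(question)[0]


agent_a = ResearchAgent(ToolClient(ProviderA()))
agent_b = ResearchAgent(ToolClient(ProviderB()))  # provider swapped, agent code unchanged
print(agent_a.answer("  Vector   DB pricing "))
print(agent_b.answer("  Vector   DB pricing "))
```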
Where each agent runs, local or remote, is the architectural decision most teams don't think about until something goes wrong.
Local agents run in the same execution environment as the orchestrator. They communicate in-memory. They inherit the orchestrator's trust context. Fast, low-latency, straightforward to reason about.
Remote agents operate across a network boundary. They might live in a different security zone, be owned by a different team, or be an external service. This creates five security requirements that don't apply to local agents:
1. Authentication: Verify the remote agent's identity before accepting its outputs
2. Authorization: Enforce what data and tools the remote agent can access
3. Trust boundary: Never assume a remote agent has the same permissions as the orchestrator
4. Data in transit: Encrypt everything crossing the network boundary
5. Audit: Log every cross-boundary call with identity and payload
Think of local agents as colleagues sharing an office — implicit trust, fast coordination. Remote agents are external contractors calling in. They need a badge, credentials, and an access review before you hand them anything sensitive.
The Agent-to-Agent (A2A) protocol handles the standardized communication pattern for remote agents. Microsoft Entra Agent Identity provides the identity infrastructure on Azure. But the discipline is organizational, not just technical: you need policy decisions about which agents can call which other agents before you write a single line of orchestration code.
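As an illustration of what those checks look like at the boundary, here is a small sketch using only Python's standard library. It is not A2A or Entra. The shared-secret table, the allow-list, and the function names are assumptions made for the example. It covers authentication, authorization, the trust boundary, and audit. Encryption in transit is assumed to be handled by TLS and is not shown.

```python
import hashlib
import hmac
import json
from typing import Dict, List, Set

# Hypothetical per-agent secrets and permissions; in practice these would
# come from an identity provider and a policy store, not module constants.
AGENT_KEYS: Dict[str, bytes] = {"vendor-research": b"demo-secret-rotate-me"}
ALLOWED_TOOLS: Dict[str, Set[str]] = {"vendor-research": {"web_search"}}
AUDIT_LOG: List[dict] = []


def _digest(key: bytes, payload: dict) -> str:
    body = json.dumps(payload, sort_keys=True).encode()
    return hmac.new(key, body, hashlib.sha256).hexdigest()


def sign(agent_id: str, payload: dict) -> str:
    """What the remote agent does before sending a call."""
    return _digest(AGENT_KEYS[agent_id], payload)


def accept_call(agent_id: str, payload: dict, signature: str) -> bool:
    """Authenticate, authorize, and audit one cross-boundary call."""
    key = AGENT_KEYS.get(agent_id)
    # Authentication: the signature must match the secret we hold for this agent.
    authenticated = key is not None and hmac.compare_digest(
        _digest(key, payload), signature
    )
    # Authorization: permissions are per remote agent, never inherited.
    authorized = authenticated and payload.get("tool") in ALLOWED_TOOLS.get(agent_id, set())
    # Audit: log every attempt, accepted or not, with identity and payload.
    AUDIT_LOG.append({
        "agent": agent_id,
        "payload": payload,
        "authenticated": authenticated,
        "authorized": authorized,
    })
    return authorized


good = {"tool": "web_search", "query": "propane valve specs"}
print(accept_call("vendor-research", good, sign("vendor-research", good)))

bad = {"tool": "crm_write", "record": "c-42"}
print(accept_call("vendor-research", bad, sign("vendor-research", bad)))
```

The second call is correctly signed and still rejected, because being who you say you are does not grant access to every tool.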
The storage layer is what kills the amnesia problem. It is also, consistently, the layer teams underbuild first.
A production multi-agent system requires three distinct types of persistent storage:
Conversation History — Every interaction, decision, and intermediate output across the workflow. This is what lets an agent in step 7 know what the agent in step 2 found. Without it, each agent starts from zero. With it, context accumulates across the full workflow.
Agent State — The operational status and working configuration of each agent instance. If an agent crashes mid-task, agent state is what lets it recover — or lets a replacement agent pick up exactly where it left off. Without this, a transient failure means restarting the entire workflow.
Registry Storage — Persistent metadata about what agents exist, what capabilities they expose, what their current health status is, and what their recent performance looks like. This is what backs the Agent Registry in Layer 1.
The typical failure pattern: teams build agent state in memory. Works fine in development. Works fine in testing. Falls apart the first time an agent crashes in production, because the state was ephemeral and the workflow can't resume.
Build persistent storage from day one. Retrofitting it into a production system that was designed around ephemeral state is significantly harder than building it correctly at the start.
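A minimal sketch of what "persistent from day one" can mean, using SQLite from Python's standard library. The table layout and the `open_store`, `checkpoint`, and `resume` helpers are assumptions for illustration. A production system would likely use a shared database rather than a local file.

```python
import json
import os
import sqlite3
import tempfile
from typing import Optional, Tuple


def open_store(path: str) -> sqlite3.Connection:
    """Open (or create) the on-disk store for agent state."""
    conn = sqlite3.connect(path)
    with conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS agent_state ("
            "workflow_id TEXT, agent TEXT, step INTEGER, state TEXT, "
            "PRIMARY KEY (workflow_id, agent))"
        )
    return conn


def checkpoint(conn: sqlite3.Connection, workflow_id: str, agent: str,
               step: int, state: dict) -> None:
    """Persist the agent's latest step and working state, replacing any older row."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO agent_state VALUES (?, ?, ?, ?)",
            (workflow_id, agent, step, json.dumps(state)),
        )


def resume(conn: sqlite3.Connection, workflow_id: str,
           agent: str) -> Optional[Tuple[int, dict]]:
    """Return (step, state) for the agent, or None if nothing was saved."""
    row = conn.execute(
        "SELECT step, state FROM agent_state WHERE workflow_id = ? AND agent = ?",
        (workflow_id, agent),
    ).fetchone()
    return None if row is None else (row[0], json.loads(row[1]))


path = os.path.join(tempfile.mkdtemp(), "agents.db")
conn = open_store(path)
checkpoint(conn, "wf-1", "researcher", 2, {"customer_id": "c-42", "history_found": True})
conn.close()  # simulate the process dying

conn2 = open_store(path)  # a replacement process opens the same store
restored = resume(conn2, "wf-1", "researcher")
print(restored)
```

Because the checkpoint lives on disk rather than in the crashed process, the replacement picks up at step 2 instead of restarting the workflow.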
The integration and observability layer is what kills the black box problem. It is also, almost universally, treated as an afterthought, then desperately retrofitted after the first production incident that nobody could debug.
MCP Server — the standardized interface that external tools expose to your agents. Databases, APIs, web search, calculators, code execution environments. The MCP Server pattern means agents interact with external tools through a consistent interface, with authentication and audit controls baked in, rather than through a proliferation of custom integrations that each have their own auth model and failure mode.
Observability means real-time visibility into every agent action in the system: which agent made which decision, what state the system was in, and what inputs triggered it.
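As a rough sketch of the minimum useful trace, the example below records one span per agent action, with inputs, output, and any error, all tied together by a shared trace ID. `Span` and `Tracer` are hypothetical names, not the API of a tracing library. In production you would more likely export to an existing tracing backend.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Span:
    """One agent action: who did what, with which inputs, and what came back."""
    trace_id: str
    agent: str
    action: str
    inputs: Dict[str, Any]
    output: Any = None
    error: str = ""
    started_at: float = field(default_factory=time.time)


class Tracer:
    def __init__(self) -> None:
        self.trace_id = uuid.uuid4().hex  # shared by every span in this workflow
        self.spans: List[Span] = []

    def record(self, agent: str, action: str, inputs: Dict[str, Any],
               fn: Callable[..., Any]) -> Any:
        """Run fn(**inputs) and keep a span whether it succeeds or fails."""
        span = Span(self.trace_id, agent, action, inputs)
        self.spans.append(span)
        try:
            span.output = fn(**inputs)
            return span.output
        except Exception as exc:
            span.error = repr(exc)
            raise

    def dump(self) -> str:
        """One JSON object per line, ready for a log pipeline."""
        return "\n".join(json.dumps(asdict(s), default=str) for s in self.spans)


tracer = Tracer()
tracer.record("researcher", "lookup", {"customer_id": "c-42"},
              lambda customer_id: {"tier": "gold"})
try:
    # The writer receives a missing tier and fails; the span keeps the evidence.
    tracer.record("writer", "draft", {"tier": None}, lambda tier: tier.upper())
except AttributeError:
    pass

print(tracer.dump())
```

After the failure, the trace still shows which agent failed, on which action, and with exactly which inputs, which is what makes the incident reproducible.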
Evals (Evaluation Layer) — the feedback loop that makes your system better over time. How accurately are agents completing their assigned tasks? Where are they making errors? What types of inputs cause failures? This data feeds back into the orchestration layer and the knowledge layer, enabling continuous improvement.
Without evals: you know your system is broken when users tell you.
With evals: you know your system is degrading before users notice.
The evaluation layer is how you close the loop between production behavior and system improvement. Without it, you're not iterating on a system — you're waiting for complaints.
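A bare-bones version of that loop fits in a few lines. The sketch below assumes exact-match scoring against a small labeled set and a fixed baseline with a tolerance. `EvalCase`, `run_evals`, `is_degrading`, and the toy agent are hypothetical, and real eval suites usually need fuzzier scoring than string equality.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    expected: str


def run_evals(agent: Callable[[str], str], cases: List[EvalCase]) -> float:
    """Fraction of cases where the agent's answer matches the expected one."""
    if not cases:
        raise ValueError("need at least one eval case")
    passed = sum(
        1 for c in cases
        if agent(c.prompt).strip().lower() == c.expected.lower()
    )
    return passed / len(cases)


def is_degrading(current: float, baseline: float, tolerance: float = 0.05) -> bool:
    """Flag a score that fell below the baseline by more than the tolerance."""
    return current < baseline - tolerance


# Toy agent (assumption): a lookup table with one deliberately wrong answer.
ANSWERS: Dict[str, str] = {
    "capital of france?": "Paris",
    "2 + 2?": "4",
    "color of the sky?": "green",
    "opposite of hot?": "cold",
}


def toy_agent(prompt: str) -> str:
    return ANSWERS.get(prompt.lower(), "")


cases = [
    EvalCase("Capital of France?", "paris"),
    EvalCase("2 + 2?", "4"),
    EvalCase("Color of the sky?", "blue"),
    EvalCase("Opposite of hot?", "cold"),
]

score = run_evals(toy_agent, cases)
print(score, is_degrading(score, baseline=0.9))
```

Run on a schedule against production-like inputs, a check like `is_degrading` is what turns "users are complaining" into an alert that fires first.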
What makes this five-layer model compelling isn't novelty. It's that it solves the concrete problems that show up in real production deployments:
Scalability — New agents can be added without rewriting orchestration logic. The registry discovers new capabilities automatically. The routing classifier routes to them without hardcoded rules.
Debuggability — Proper observability and persistent state mean that when something fails, you can trace exactly what happened. Every agent action is logged. Every state transition is recorded. Failure is reproducible.
Reliability — Persistent agent state means individual failures don't cascade. A crashed agent can be restarted and resume where it left off. The supervisor pattern in the orchestration layer catches local failures before they propagate.
Flexibility — Local and remote agent separation means different parts of the system can scale independently based on load and security requirements. The knowledge layer can be updated without touching the agent layer.
The dinner party didn't fail because the guests were bad at cooking. It failed because there was no system — no shared plan, no handoff protocol, no one tracking dependencies.
Multi-agent AI systems fail the same way, for the same reason. Not because the models are weak. Because the architecture isn't there.
The question worth sitting with: which of these five layers is the weakest link in the system you're currently building?
If you've shipped a multi-agent system and hit one of these failure modes — or found a pattern that works better than what's described here — I want to hear about it. Drop it in the comments.