Multi-Agent Systems in Production: When One Agent Isn't Enough and How We Coordinate Them

A developer team accidentally built a multi-agent system when a single-agent monolith grew unmanageable. They now use patterns like supervisor-worker, pipeline, and event-driven coordination to manage complexity, with structured data contracts between agents to avoid token waste and quality degradation.

We built our first "multi-agent system" by accident. What started as a single agent that could research a topic, draft a report, check it against source data, and send a summary email had grown into a 2,000-token system prompt and a function list so long that the model kept forgetting tools existed. It wasn't a system; it was a monolith pretending to be intelligent.

Breaking it apart into coordinated agents fixed most of the problems. It also introduced a new category of problems we hadn't thought about. Here's what we actually learned.

The temptation to add more agents is real, but the overhead isn't free. Every agent boundary you add is a place where context can get lost, latency increases, and errors compound.

One agent is the right call when:

You need multiple agents when:

The key question we ask: is this one job or a pipeline of jobs? If you'd describe it to a human as "first do X, then Y takes that and does Z", you probably have a pipeline, not a single task.

Supervisor-worker. A thin orchestrator agent decides what needs doing, dispatches to specialised worker agents, and stitches the results together. The workers are narrow: they do one thing and don't need to know about the rest of the workflow. This is our most common pattern. The supervisor's system prompt stays small because it's routing, not reasoning. The workers' prompts can be highly optimised for their specific job.

Pipeline. Each agent's output is the next agent's input. There is no orchestrator, just a chain. We use this for document processing: extract → chunk → summarise → classify. Each step is independent enough that we can swap out or retrain one without touching the others.

Event-driven. Agents subscribe to events rather than being called directly. An intake agent processes a new customer request and emits an event; a triage agent picks it up, classifies it, and emits another; a response agent drafts the reply.
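The intake, triage, and response flow described above can be sketched with a tiny in-process event bus. This is an illustration only: the `EventBus` class, the event names, and the keyword-based classifier are all hypothetical stand-ins for a real broker and real LLM calls.

```python
from collections import defaultdict, deque
from typing import Callable


class EventBus:
    """Hypothetical in-process bus; in production a message broker plays this role."""

    def __init__(self) -> None:
        self._handlers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)
        self._queue: deque[tuple[str, dict]] = deque()

    def subscribe(self, event_type: str, handler: Callable[[dict], None]) -> None:
        self._handlers[event_type].append(handler)

    def emit(self, event_type: str, payload: dict) -> None:
        self._queue.append((event_type, payload))

    def run(self) -> None:
        # Drain the queue; handlers may emit follow-up events while we drain.
        while self._queue:
            event_type, payload = self._queue.popleft()
            for handler in self._handlers[event_type]:
                handler(payload)


bus = EventBus()
replies: list[dict] = []


def intake_agent(payload: dict) -> None:
    # Stand-in for an LLM call that normalises the raw request.
    bus.emit("request.received", {"text": payload["text"].strip()})


def triage_agent(payload: dict) -> None:
    # Stand-in classifier; a real agent would ask a model for the category.
    category = "billing" if "invoice" in payload["text"].lower() else "general"
    bus.emit("request.triaged", {**payload, "category": category})


def response_agent(payload: dict) -> None:
    # Stand-in for the agent that drafts the reply.
    replies.append({"category": payload["category"], "draft": f"Re: {payload['text']}"})


# No agent calls another directly; each one only knows the events it listens for.
bus.subscribe("request.new", intake_agent)
bus.subscribe("request.received", triage_agent)
bus.subscribe("request.triaged", response_agent)

bus.emit("request.new", {"text": "  Where is my invoice?  "})
bus.run()
```

Because each agent only knows event names, a new subscriber (an audit logger, say) can be added without touching the existing three.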
We use this with Celery and Redis when the steps can happen asynchronously and we don't need the full chain to complete before responding to the user.

Here's a simplified version of how we implement the supervisor pattern. The orchestrator Celery task manages the workflow; individual agent tasks do the actual LLM calls.

```python
# tasks/orchestrator.py
from celery import chain

from .agents import extract_data_task, analyse_data_task, draft_report_task

# `app` is the project's Celery application instance, defined elsewhere.


@app.task(bind=True, max_retries=3)
def run_report_pipeline(self, document_id: int, user_id: int):
    """
    Supervisor: extract → analyse → draft, with error isolation at each step.
    """
    try:
        # Build the pipeline as a Celery chain; each task receives the
        # previous task's return value as its first argument.
        pipeline = chain(
            extract_data_task.s(document_id),
            analyse_data_task.s(user_id=user_id),
            draft_report_task.s(user_id=user_id),
        )
        result = pipeline.apply_async()
        return {"pipeline_id": result.id, "status": "started"}
    except Exception as exc:
        # Only dispatch errors (e.g. the broker is unreachable) land here.
        # Retry with exponential backoff before giving up.
        raise self.retry(exc=exc, countdown=2 ** self.request.retries)
```

Each agent task is responsible for its own LLM call and its own error handling. The orchestrator doesn't need to know what model each agent uses, or whether agent two calls a tool. It only cares about the shape of the data passing between steps.

The naïve approach is to pass the full output of each agent directly into the next. This breaks down fast: LLM outputs are verbose, and feeding 3,000 tokens of analysis into a drafter that only needs 5 key facts wastes tokens and degrades quality.

We use a structured intermediate format (a plain Python dataclass or Pydantic model) as the contract between agents. Each agent's output is validated against this schema before it's passed downstream.
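For the plain-dataclass option, a minimal standard-library sketch might look like the following. The `ExtractionHandoff` class and `validate_handoff` helper are hypothetical names used for illustration. Dataclasses don't check types on their own, so the checks are written by hand.

```python
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class ExtractionHandoff:
    """Hypothetical contract between an extraction agent and the next step."""

    document_id: int
    key_facts: list[str]
    confidence_score: float
    warnings: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Dataclasses do not validate types, so check the boundary by hand.
        if not isinstance(self.document_id, int):
            raise TypeError("document_id must be an int")
        if not all(isinstance(fact, str) for fact in self.key_facts):
            raise TypeError("key_facts must be a list of strings")
        if len(self.key_facts) > 10:
            raise ValueError("key_facts is capped at 10 items")
        if not 0.0 <= self.confidence_score <= 1.0:
            raise ValueError("confidence_score must be between 0 and 1")


def validate_handoff(raw: dict) -> dict:
    """Validate an agent's raw output and return a plain dict for the next agent."""
    return asdict(ExtractionHandoff(**raw))


# A well-formed payload passes through with defaults filled in.
ok = validate_handoff(
    {"document_id": 7, "key_facts": ["Revenue rose"], "confidence_score": 0.9}
)

# A malformed payload fails at the boundary, not three steps later.
try:
    validate_handoff({"document_id": 7, "key_facts": [], "confidence_score": 1.7})
    rejected = False
except ValueError:
    rejected = True
```

The Pydantic models that follow express the same kind of contract with type validation built in.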
```python
from typing import Optional

from pydantic import BaseModel


class ExtractionResult(BaseModel):
    document_id: int
    key_facts: list[str]            # Max 10 bullet points
    raw_data_summary: str           # Under 500 chars
    confidence_score: float         # 0–1
    extraction_warnings: list[str]  # Anything the agent flagged


class AnalysisResult(BaseModel):
    document_id: int
    findings: list[str]
    risk_flags: list[str]
    recommended_actions: list[str]
    analysis_notes: Optional[str] = None


# In the extraction agent task.
# `app`, `call_llm`, and `get_document_text` are project helpers not shown here.
@app.task
def extract_data_task(document_id: int) -> dict:
    raw_output = call_llm(
        system="You are a data extraction specialist...",
        user=get_document_text(document_id),
        response_format=ExtractionResult,  # Structured output enforced
    )
    result = ExtractionResult.model_validate(raw_output)
    return result.model_dump()  # Celery serialises as dict
```

Enforcing the schema at the boundary means your analysis agent never has to guess what the extraction agent gave it. When something breaks, the error is at the boundary where it belongs, not buried three steps later.

The hardest part of multi-agent systems is failure handling. In a monolithic agent, one failure terminates one task. In a pipeline, a failure in step two means you've wasted step one and need to decide whether to retry from the start or from step two.

Our approach: a PipelineRun model with status fields for each step. This lets us resume partial pipelines and gives us visibility into where things are breaking.
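To make the resume behaviour concrete, here is a small in-memory sketch of checkpoint-and-resume. It assumes per-step status and result fields like the ones named above. The `new_run` and `resume` helpers, the `"done"` and `"failed"` status strings, and the fake agents are hypothetical, not the production code.

```python
from typing import Any, Callable

STEPS = ("extraction", "analysis", "draft")


def new_run() -> dict[str, Any]:
    """In-memory stand-in for a persisted run record (hypothetical shape)."""
    run: dict[str, Any] = {"error_detail": ""}
    for step in STEPS:
        run[f"{step}_status"] = "pending"
        run[f"{step}_result"] = None
    return run


def resume(run: dict[str, Any], handlers: dict[str, Callable], initial_input: Any) -> dict[str, Any]:
    """Run the steps in order, skipping any step that already has a checkpoint."""
    run["error_detail"] = ""
    data = initial_input
    for step in STEPS:
        if run[f"{step}_status"] == "done":
            data = run[f"{step}_result"]  # reuse the checkpoint, no new LLM call
            continue
        try:
            data = handlers[step](data)
        except Exception as exc:
            run[f"{step}_status"] = "failed"
            run["error_detail"] = f"{step}: {exc}"
            return run
        run[f"{step}_status"] = "done"
        run[f"{step}_result"] = data
    return run


calls = {"extraction": 0, "analysis": 0, "draft": 0}


def extraction(doc_id: int) -> dict:
    calls["extraction"] += 1
    return {"facts": [f"fact from doc {doc_id}"]}


def analysis(extracted: dict) -> dict:
    calls["analysis"] += 1
    if calls["analysis"] == 1:
        raise RuntimeError("model timeout")  # simulated transient failure
    return {"findings": extracted["facts"]}


def draft(analysed: dict) -> dict:
    calls["draft"] += 1
    return {"report": "; ".join(analysed["findings"])}


handlers = {"extraction": extraction, "analysis": analysis, "draft": draft}

run = new_run()
resume(run, handlers, 42)  # first attempt fails at the analysis step
first_attempt_status = run["analysis_status"]
first_attempt_error = run["error_detail"]
resume(run, handlers, 42)  # second attempt reuses the extraction checkpoint
```

The second call never re-runs extraction, which is the point: a failure in step two costs only step two. The Django model that follows persists the same kind of per-step fields.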
```python
# models.py
from django.db import models

# `Document` is the project's existing document model, defined elsewhere.


class PipelineRun(models.Model):
    document = models.ForeignKey(Document, on_delete=models.CASCADE)
    status = models.CharField(max_length=20, default='pending')

    # Checkpointed results per step
    extraction_result = models.JSONField(null=True)
    analysis_result = models.JSONField(null=True)
    draft_result = models.JSONField(null=True)

    # Step-level status
    extraction_status = models.CharField(max_length=20, default='pending')
    analysis_status = models.CharField(max_length=20, default='pending')
    draft_status = models.CharField(max_length=20, default='pending')

    error_detail = models.TextField(blank=True)
    created_at = models.DateTimeField(auto_now_add=True)
    updated_at = models.DateTimeField(auto_now=True)
```

This makes debugging a failed pipeline actually feasible. You open the admin, find the PipelineRun, see which step failed, and read the error. Without this, you're parsing Celery logs hoping something tells you what happened.

Multi-agent architectures solve real problems: context overflow, specialisation, parallelism, and failure isolation. But they introduce coordination overhead that a single agent doesn't have. You're trading simplicity for scalability and resilience.

The things this doesn't solve: it won't fix a poorly designed system prompt on an individual agent, it won't save you if your task decomposition is wrong, and it adds latency. Every agent boundary is a round-trip to an LLM.

Start with one agent. Add a second when you have a clear reason, not because it sounds more impressive. The moment you're debugging why agent three hallucinated because agent two gave it a vague extraction result, you'll appreciate the value of simple.

We run multi-agent pipelines in production for document processing, automated research workflows, and customer triage. They work well, but every one of them started life as a single agent that we only split apart when we had a concrete reason.
Lycore builds production AI systems https://www.lycore.com/ai-development-services/ for businesses — we design and implement multi-agent pipelines, RAG systems, and LLM integrations that hold up in production. Get in touch https://www.lycore.com/contact-us/ if you want to talk through your use case.