The Event-Driven Operator: Agents That Wake on Events, Not Cron A new event-driven operator pattern for AI agents replaces polling loops with message-based wake-up, deterministic processing, and durable state checkpointing, eliminating three structural failure modes: CPU waste on empty queues, missed events under load, and total state loss on process death. The pattern requires a message queue, idempotent handler, and dead letter queue to achieve execution durability, where every agent wake cycle is recoverable through checkpointed state. This architectural shift addresses the fundamental failure of polling-based agents, which lose in-flight state and miss event sequences under production conditions. The Event-Driven Operator: Agents That Wake on Events, Not Cron Polling-based agents fail by construction — not occasionally, but structurally. The event-driven operator pattern eliminates that failure class by separating wake-up, execution, and recovery. A polling loop is not an agent architecture. It is a busy-wait with ambitions. The agent sits in a while True loop, hammering a database or API for work that arrives once per minute — burning CPU on the absence of events, missing events when load spikes, and losing all in-flight state the moment the process dies. This is not a tuning problem. The failure is structural. The correct pattern has existed in distributed systems for decades. Agents should wake on message, process deterministically, checkpoint state to durable storage, and return to idle. This is the event-driven operator pattern. It is the production counterpart to durable execution https://arizenai.com/durable-execution/ . The only novel question is how to apply it to the specific failure modes of agentic workloads — where a "task" may span multiple LLM calls, tool invocations, and branching decisions across minutes or hours. TL;DR — Key Takeaways: - Polling agents fail on three structural dimensions: CPU waste on empty queues, missed events under load, and total state loss on process death. - The Principle of Execution Durability states: every agent wake cycle must be recoverable — meaning state is checkpointed before the cycle completes, not after. - The production pattern is: message queue Kafka or SQS + idempotent handler + dead letter queue. All three are required; any two is insufficient. - Idempotency is not optional decoration — it is the mechanism that makes retry safe, which is what makes durability possible. - Exactly-once semantics and strict ordering are where this pattern breaks. The validator asymmetry principle https://arizenai.com/validator-asymmetry-principle/ applies: verifying idempotency is cheaper than guaranteeing it generatively. The post covers both boundary conditions. Why Polling Fails by Construction Three independent failure modes https://arizenai.com/agent-failure-bestiary/ converge in the polling pattern, and they compound under production load. The first is the thundering herd . When multiple agent processes share a polling interval — even slightly staggered — they converge on the same task store simultaneously during bursts. The store sees a spike of read traffic at the moment it is already processing the event that triggered the burst. This is not a race condition in the traditional sense; it is a feedback loop baked into the architecture. The busier the system, the more agents poll, the worse the contention. This is chaos conserved https://arizenai.com/conservation-of-chaos/ , not eliminated — redistributed from one structural flaw to another. The second is missed events under load . A polling agent checks for work at interval T . If two events arrive between polls, the agent sees only the state at poll time — not the sequence of transitions that produced it. For agents whose behavior depends on event ordering a document that was uploaded, then immediately deleted, then re-uploaded , the polling model produces a fundamentally incorrect view of the world. The agent acts on a snapshot, not a history. The third is state loss on process death . A polling agent holds its execution context in process memory. When the process dies — OOM kill, deployment restart, spot instance preemption — that context is gone. There is no recovery path. The task either re-enters the queue if the queue was durable and the agent hadn't ACKed or disappears silently if the agent had already dequeued it . Most polling implementations fall into the second category, because the dequeue and the processing happen in the same unguarded block. These three modes are not edge cases. They are the steady-state behavior of polling under production conditions. The fix is not a shorter poll interval or a smarter backoff — those are local optimizations on a globally broken pattern. Polling-based agents fail by construction because they conflate task discovery with task ownership. Event-driven agents separate the two: the queue owns the task until the handler explicitly commits completion. The Principle of Execution Durability I define this formally because it needs a name to be reasoned about precisely: The Principle of Execution Durability: Every agent wake cycle must be recoverable from its last checkpoint without re-executing side effects. Durability is not a property of the storage layer — it is a property of the execution protocol. The implication is specific: checkpointing must happen before the cycle is considered complete, and every operation with external side effects must be idempotent. If the process dies after a tool call but before the checkpoint, the handler must be able to re-execute that tool call safely on restart. This is why idempotency is not optional decoration on top of durability — it is the mechanism that makes durability possible. This principle connects directly to the Probabilistic State Machine https://arizenai.com/probabilistic-state-machine/ model. An agentic workflow is not a deterministic function — it is a state machine where transitions are probabilistic governed by LLM outputs but where the state itself must be deterministic and recoverable. The event-driven pattern is what makes that recovery possible. Without it, the state machine has no durable substrate to run on. The Production Pattern: Queue, Handler, DLQ The architecture has three required components. Omitting any one breaks the durability guarantee. The message queue Kafka or SQS provides durability of the event itself. The message exists independently of any consumer process. A consumer can die, restart, and re-consume the same message. Kafka retains messages by offset; SQS retains them by visibility timeout. Both provide the same guarantee: the event is not lost when the consumer fails. The idempotent handler is the agent's execution unit. It consumes one message, performs all necessary work including LLM calls and tool invocations , checkpoints final state to durable storage, and only then ACKs the message. The ACK is the commit. If the process dies before the ACK, the queue redelivers. Idempotency ensures redelivery is safe — the handler checks whether the work was already done via a state key in Redis or Postgres before executing. The dead letter queue DLQ is where messages go after exceeding the maximum retry count. Without a DLQ, a poison message — one that consistently causes handler failure — will block the queue indefinitely or be silently dropped. The DLQ makes failure visible and replayable. It is the difference between "the agent stopped working" and "three messages failed with this specific error at 14:32". The following handler implements this pattern against Kafka using confluent-kafka-python , with idempotency keyed on message id and state checkpointed to Postgres before commit: python import json import logging from confluent kafka import Consumer, KafkaError, Producer from psycopg2 import connect as pg connect logger = logging.getLogger name MAX RETRIES = 3 DLQ TOPIC = "agent-tasks-dlq" def build consumer config: dict - Consumer: return Consumer { "bootstrap.servers": config "bootstrap servers" , "group.id": config "group id" , "auto.offset.reset": "earliest", "enable.auto.commit": False, manual commit = explicit ACK } def is already processed conn, message id: str - bool: with conn.cursor as cur: cur.execute "SELECT 1 FROM agent checkpoints WHERE message id = %s", message id, return cur.fetchone is not None def checkpoint conn, message id: str, result: dict - None: with conn.cursor as cur: cur.execute """INSERT INTO agent checkpoints message id, result, completed at VALUES %s, %s, NOW ON CONFLICT message id DO NOTHING""", message id, json.dumps result conn.commit def route to dlq producer: Producer, message: dict, error: str - None: payload = dict message payload "dlq error" = error producer.produce DLQ TOPIC, json.dumps payload .encode producer.flush def run agent handler config: dict - None: consumer = build consumer config producer = Producer {"bootstrap.servers": config "bootstrap servers" } conn = pg connect config "postgres dsn" consumer.subscribe config "topic" try: while True: msg = consumer.poll timeout=1.0 if msg is None: continue if msg.error : if msg.error .code = KafkaError. PARTITION EOF: logger.error "Kafka error: %s", msg.error continue payload = json.loads msg.value .decode message id = payload "message id" retry count = payload.get "retry count", 0 Idempotency gate — safe to re-consume after restart if is already processed conn, message id : consumer.commit message=msg continue try: result = execute agent task payload LLM calls, tool use checkpoint conn, message id, result durable state before ACK consumer.commit message=msg explicit ACK = commit except Exception as exc: logger.exception "Handler failed for %s", message id if retry count = MAX RETRIES: route to dlq producer, payload, str exc consumer.commit message=msg remove from main queue else: Requeue with incremented retry count — do NOT commit Kafka will redeliver on next poll after visibility timeout pass finally: consumer.close conn.close Three design decisions in this handler are load-bearing. First, enable.auto.commit: False — automatic commit would ACK the message before processing completes, destroying the durability guarantee. Second, the idempotency check runs before any work begins — not after, not around the LLM call. Third, the DLQ route commits the offset on the main topic, removing the poison message from normal processing flow while preserving it for inspection. Polling vs. Event-Driven Across Failure Dimensions | Failure Dimension | Polling Agent | Event-Driven Agent | |---|---|---| | CPU under idle load | Constant burn empty polls | Zero blocked on poll | | Missed events under spike | Structural — snapshot, not history | None — queue buffers all events | | State on process death | Lost in-memory context | Recoverable checkpoint + redelivery | | Poison message handling | Silent drop or infinite loop | DLQ after MAX RETRIES | Where This Pattern Breaks The event-driven pattern eliminates the three polling failure modes. It introduces two constraints that polling does not have. Ordering guarantees. Kafka preserves order within a partition; SQS standard queues do not. If agent tasks must be processed in strict causal order — task B depends on the completed state of task A — the queue topology must enforce this. With Kafka, this means keying messages to the same partition by a causal identifier e.g., document id . With SQS, it means using FIFO queues, which carry throughput limits 3,000 messages per second with batching as of current AWS documentation . Architectures that require global ordering across all tasks cannot use this pattern without additional coordination. Exactly-once semantics. The pattern above provides at-least-once delivery. The idempotency gate makes re-execution safe, but it does not make it invisible — downstream systems that are not idempotent will see duplicate calls. Kafka Transactions introduced in Kafka 0.11 provide exactly-once semantics within the Kafka ecosystem, but only when both the consumer and producer are Kafka clients and the handler does not touch external systems. The moment an LLM call or a database write enters the picture, exactly-once becomes a distributed coordination problem, not a queue configuration. The honest answer is that most agentic workloads should design for at-least-once and make every external operation idempotent — not attempt to achieve exactly-once at the infrastructure layer. Exactly-once delivery for agentic workloads is not a queue configuration problem. It is a distributed coordination problem that requires every external operation in the handler to be idempotent. The queue can only guarantee delivery semantics for the message itself. Frequently Asked Questions Can I use this pattern with SQS if I need strict ordering? SQS FIFO queues provide ordering within a message group, with a throughput ceiling documented by AWS at 3,000 messages per second with batching enabled. For most agentic workloads, that ceiling is not the constraint — the LLM call latency is. If your ordering requirement is causal task B after task A for the same document , SQS FIFO with a MessageGroupId keyed to the document identifier is sufficient. If you require total global ordering across all tasks regardless of source, you need Kafka with a single partition — which serializes all processing and eliminates horizontal scale. That trade-off is almost never worth making. How does this pattern interact with LangGraph's checkpointing? LangGraph provides its own persistence layer via checkpointers SqliteSaver , PostgresSaver . The event-driven pattern and LangGraph checkpointing are complementary, not redundant. The queue handles durability of the task event — ensuring the agent wakes exactly when work arrives and recovers if the process dies before starting. LangGraph 's checkpointer handles durability of the graph execution state — the intermediate node outputs within a single task run. A complete production architecture uses both: the queue guarantees the agent receives the task; the graph checkpointer guarantees the agent can resume mid-execution after a failure inside the task. The deeper question this pattern raises is about agent identity across restarts. An event-driven agent that checkpoints state and resumes after process death is, in a meaningful sense, a different process instance running the same logical agent. The Probabilistic State Machine model handles this cleanly — the state is the agent, not the process. But that framing has implications for how we think about agent memory, context windows, and the boundary between a "resumed" execution and a "new" one with access to prior state. That boundary is not as clean as the infrastructure pattern suggests. Related Reading Durable Execution for Agentic Workflows https://arizenai.com/durable-execution/ — Why agents need crash-resilient state before they can wake on events The Probabilistic State Machine https://arizenai.com/probabilistic-state-machine/ — The formal model for agent state transitions that event-driven operators implement The Principle of Conservation of Chaos https://arizenai.com/conservation-of-chaos/ — Event-driven architectures redistribute chaos — they don't eliminate it This post explores ideas from Production-Ready AI Agents — the Three Pillars framework Observability, Reliability, Security for shipping AI systems that stay in production. Learn more about the book → https://arizenai.com/ai-agents/