The Event-Driven Operator: Agents That Wake on Events, Not Cron

wpnews.pro

Polling-based agents fail by construction — not occasionally, but structurally. The event-driven operator pattern eliminates that failure class by separating wake-up, execution, and recovery.

A polling loop is not an agent architecture. It is a busy-wait with ambitions. The agent sits in a while True

loop, hammering a database or API for work that arrives once per minute — burning CPU on the absence of events, missing events when load spikes, and losing all in-flight state the moment the process dies. This is not a tuning problem. The failure is structural.

The correct pattern has existed in distributed systems for decades. Agents should wake on message, process deterministically, checkpoint state to durable storage, and return to idle. This is the event-driven operator pattern. It is the production counterpart to durable execution. The only novel question is how to apply it to the specific failure modes of agentic workloads — where a "task" may span multiple LLM calls, tool invocations, and branching decisions across minutes or hours.

TL;DR — Key Takeaways:

Polling agents fail on three structural dimensions: CPU waste on empty queues, missed events under load, and total state loss on process death.
The Principle of Execution Durability states: every agent wake cycle must be recoverable — meaning state is checkpointed before the cycle completes, not after. - The production pattern is: message queue (Kafka or SQS) + idempotent handler + dead letter queue. All three are required; any two is insufficient.
Idempotency is not optional decoration — it is the mechanism that makes retry safe, which is what makes durability possible.
Exactly-once semantics and strict ordering are where this pattern breaks. The validator asymmetry principleapplies: verifying idempotency is cheaper than guaranteeing it generatively. The post covers both boundary conditions.

Why Polling Fails by Construction #

Three independent failure modes converge in the polling pattern, and they compound under production load.

The first is the thundering herd. When multiple agent processes share a polling interval — even slightly staggered — they converge on the same task store simultaneously during bursts. The store sees a spike of read traffic at the moment it is already processing the event that triggered the burst. This is not a race condition in the traditional sense; it is a feedback loop baked into the architecture. The busier the system, the more agents poll, the worse the contention. This is chaos conserved, not eliminated — redistributed from one structural flaw to another.

The second is missed events under load. A polling agent checks for work at interval T. If two events arrive between polls, the agent sees only the state at poll time — not the sequence of transitions that produced it. For agents whose behavior depends on event ordering (a document that was uploaded, then immediately deleted, then re-uploaded), the polling model produces a fundamentally incorrect view of the world. The agent acts on a snapshot, not a history.

The third is state loss on process death. A polling agent holds its execution context in process memory. When the process dies — OOM kill, deployment restart, spot instance preemption — that context is gone. There is no recovery path. The task either re-enters the queue (if the queue was durable and the agent hadn't ACKed) or disappears silently (if the agent had already dequeued it). Most polling implementations fall into the second category, because the dequeue and the processing happen in the same unguarded block.

These three modes are not edge cases. They are the steady-state behavior of polling under production conditions. The fix is not a shorter poll interval or a smarter backoff — those are local optimizations on a globally broken pattern.

Polling-based agents fail by construction because they conflate task discovery with task ownership. Event-driven agents separate the two: the queue owns the task until the handler explicitly commits completion.

The Principle of Execution Durability #

I define this formally because it needs a name to be reasoned about precisely:

The Principle of Execution Durability: Every agent wake cycle must be recoverable from its last checkpoint without re-executing side effects. Durability is not a property of the storage layer — it is a property of the execution protocol.

The implication is specific: checkpointing must happen before the cycle is considered complete, and every operation with external side effects must be idempotent. If the process dies after a tool call but before the checkpoint, the handler must be able to re-execute that tool call safely on restart. This is why idempotency is not optional decoration on top of durability — it is the mechanism that makes durability possible.

This principle connects directly to the Probabilistic State Machine model. An agentic workflow is not a deterministic function — it is a state machine where transitions are probabilistic (governed by LLM outputs) but where the state itself must be deterministic and recoverable. The event-driven pattern is what makes that recovery possible. Without it, the state machine has no durable substrate to run on.

The Production Pattern: Queue, Handler, DLQ #

The architecture has three required components. Omitting any one breaks the durability guarantee.

The message queue (Kafka or SQS) provides durability of the event itself. The message exists independently of any consumer process. A consumer can die, restart, and re-consume the same message. Kafka retains messages by offset; SQS retains them by visibility timeout. Both provide the same guarantee: the event is not lost when the consumer fails.

The idempotent handler is the agent's execution unit. It consumes one message, performs all necessary work (including LLM calls and tool invocations), checkpoints final state to durable storage, and only then ACKs the message. The ACK is the commit. If the process dies before the ACK, the queue redelivers. Idempotency ensures redelivery is safe — the handler checks whether the work was already done (via a state key in Redis or Postgres) before executing.

The dead letter queue (DLQ) is where messages go after exceeding the maximum retry count. Without a DLQ, a poison message — one that consistently causes handler failure — will block the queue indefinitely or be silently dropped. The DLQ makes failure visible and replayable. It is the difference between "the agent stopped working" and "three messages failed with this specific error at 14:32".

The following handler implements this pattern against Kafka using confluent-kafka-python

, with idempotency keyed on message_id

and state checkpointed to Postgres before commit:

import json
import logging
from confluent_kafka import Consumer, KafkaError, Producer
from psycopg2 import connect as pg_connect

logger = logging.getLogger(__name__)

MAX_RETRIES = 3
DLQ_TOPIC = "agent-tasks-dlq"

def build_consumer(config: dict) -> Consumer:
    return Consumer({
        "bootstrap.servers": config["bootstrap_servers"],
        "group.id": config["group_id"],
        "auto.offset.reset": "earliest",
        "enable.auto.commit": False,  # manual commit = explicit ACK
    })

def is_already_processed(conn, message_id: str) -> bool:
    with conn.cursor() as cur:
        cur.execute(
            "SELECT 1 FROM agent_checkpoints WHERE message_id = %s",
            (message_id,)
        )
        return cur.fetchone() is not None

def checkpoint(conn, message_id: str, result: dict) -> None:
    with conn.cursor() as cur:
        cur.execute(
            """INSERT INTO agent_checkpoints (message_id, result, completed_at)
               VALUES (%s, %s, NOW())
               ON CONFLICT (message_id) DO NOTHING""",
            (message_id, json.dumps(result))
        )
    conn.commit()

def route_to_dlq(producer: Producer, message: dict, error: str) -> None:
    payload = dict(message)
    payload["dlq_error"] = error
    producer.produce(DLQ_TOPIC, json.dumps(payload).encode())
    producer.flush()

def run_agent_handler(config: dict) -> None:
    consumer = build_consumer(config)
    producer = Producer({"bootstrap.servers": config["bootstrap_servers"]})
    conn = pg_connect(config["postgres_dsn"])

    consumer.subscribe([config["topic"]])

    try:
        while True:
            msg = consumer.poll(timeout=1.0)
            if msg is None:
                continue
            if msg.error():
                if msg.error().code() != KafkaError._PARTITION_EOF:
                    logger.error("Kafka error: %s", msg.error())
                continue

            payload = json.loads(msg.value().decode())
            message_id = payload["message_id"]
            retry_count = payload.get("retry_count", 0)

            if is_already_processed(conn, message_id):
                consumer.commit(message=msg)
                continue

            try:
                result = execute_agent_task(payload)      # LLM calls, tool use
                checkpoint(conn, message_id, result)      # durable state before ACK
                consumer.commit(message=msg)              # explicit ACK = commit

            except Exception as exc:
                logger.exception("Handler failed for %s", message_id)
                if retry_count >= MAX_RETRIES:
                    route_to_dlq(producer, payload, str(exc))
                    consumer.commit(message=msg)          # remove from main queue
                else:
                    pass

    finally:
        consumer.close()
        conn.close()

Three design decisions in this handler are load-bearing. First, enable.auto.commit: False

— automatic commit would ACK the message before processing completes, destroying the durability guarantee. Second, the idempotency check runs before any work begins — not after, not around the LLM call. Third, the DLQ route commits the offset on the main topic, removing the poison message from normal processing flow while preserving it for inspection.

Polling vs. Event-Driven Across Failure Dimensions #

Failure Dimension	Polling Agent	Event-Driven Agent
CPU under idle load	Constant burn (empty polls)	Zero (blocked on `poll()` )
Missed events under spike	Structural — snapshot, not history	None — queue buffers all events
State on process death	Lost (in-memory context)	Recoverable (checkpoint + redelivery)
Poison message handling	Silent drop or infinite loop	DLQ after `MAX_RETRIES`

Where This Pattern Breaks #

The event-driven pattern eliminates the three polling failure modes. It introduces two constraints that polling does not have.

Ordering guarantees. Kafka preserves order within a partition; SQS standard queues do not. If agent tasks must be processed in strict causal order — task B depends on the completed state of task A — the queue topology must enforce this. With Kafka, this means keying messages to the same partition by a causal identifier (e.g., document_id

). With SQS, it means using FIFO queues, which carry throughput limits (3,000 messages per second with batching as of current AWS documentation). Architectures that require global ordering across all tasks cannot use this pattern without additional coordination.

Exactly-once semantics. The pattern above provides at-least-once delivery. The idempotency gate makes re-execution safe, but it does not make it invisible — downstream systems that are not idempotent will see duplicate calls. Kafka Transactions (introduced in Kafka 0.11) provide exactly-once semantics within the Kafka ecosystem, but only when both the consumer and producer are Kafka clients and the handler does not touch external systems. The moment an LLM call or a database write enters the picture, exactly-once becomes a distributed coordination problem, not a queue configuration. The honest answer is that most agentic workloads should design for at-least-once and make every external operation idempotent — not attempt to achieve exactly-once at the infrastructure layer.

Exactly-once delivery for agentic workloads is not a queue configuration problem. It is a distributed coordination problem that requires every external operation in the handler to be idempotent. The queue can only guarantee delivery semantics for the message itself.

Frequently Asked Questions #

Can I use this pattern with SQS if I need strict ordering?

SQS FIFO queues provide ordering within a message group, with a throughput ceiling documented by AWS at 3,000 messages per second with batching enabled. For most agentic workloads, that ceiling is not the constraint — the LLM call latency is. If your ordering requirement is causal (task B after task A for the same document), SQS FIFO with a MessageGroupId

keyed to the document identifier is sufficient. If you require total global ordering across all tasks regardless of source, you need Kafka with a single partition — which serializes all processing and eliminates horizontal scale. That trade-off is almost never worth making.

How does this pattern interact with LangGraph's checkpointing?

LangGraph

provides its own persistence layer via checkpointers (SqliteSaver

, PostgresSaver

). The event-driven pattern and LangGraph

checkpointing are complementary, not redundant. The queue handles durability of the task event — ensuring the agent wakes exactly when work arrives and recovers if the process dies before starting. LangGraph

's checkpointer handles durability of the graph execution state — the intermediate node outputs within a single task run. A complete production architecture uses both: the queue guarantees the agent receives the task; the graph checkpointer guarantees the agent can resume mid-execution after a failure inside the task.

The deeper question this pattern raises is about agent identity across restarts. An event-driven agent that checkpoints state and resumes after process death is, in a meaningful sense, a different process instance running the same logical agent. The Probabilistic State Machine model handles this cleanly — the state is the agent, not the process. But that framing has implications for how we think about agent memory, context windows, and the boundary between a "resumed" execution and a "new" one with access to prior state. That boundary is not as clean as the infrastructure pattern suggests.

Durable Execution for Agentic Workflows— Why agents need crash-resilient state before they can wake on eventsThe Probabilistic State Machine— The formal model for agent state transitions that event-driven operators implementThe Principle of Conservation of Chaos— Event-driven architectures redistribute chaos — they don't eliminate it

This post explores ideas from Production-Ready AI Agents — the Three Pillars framework (Observability, Reliability, Security) for shipping AI systems that stay in production.

Learn more about the book →

source & further reading

arizenai.com — original article Red Teaming: Breaking Your Own Agent Before the Internet Does Judgment Compression: The Missing Layer in AI System Design The Agent Failure Bestiary: Sycophancy, Drift, and Recursion Loops