Polling-based agents fail by construction β not occasionally, but structurally. The event-driven operator pattern eliminates that failure class by separating wake-up, execution, and recovery.
A polling loop is not an agent architecture. It is a busy-wait with ambitions. The agent sits in a while True
loop, hammering a database or API for work that arrives once per minute β burning CPU on the absence of events, missing events when load spikes, and losing all in-flight state the moment the process dies. This is not a tuning problem. The failure is structural.
The correct pattern has existed in distributed systems for decades. Agents should wake on message, process deterministically, checkpoint state to durable storage, and return to idle. This is the event-driven operator pattern. It is the production counterpart to durable execution. The only novel question is how to apply it to the specific failure modes of agentic workloads β where a "task" may span multiple LLM calls, tool invocations, and branching decisions across minutes or hours.
TL;DR β Key Takeaways:
- Polling agents fail on three structural dimensions: CPU waste on empty queues, missed events under load, and total state loss on process death.
- The Principle of Execution Durability states: every agent wake cycle must be recoverable β meaning state is checkpointed before the cycle completes, not after. - The production pattern is: message queue (Kafka or SQS) + idempotent handler + dead letter queue. All three are required; any two is insufficient.
- Idempotency is not optional decoration β it is the mechanism that makes retry safe, which is what makes durability possible.
- Exactly-once semantics and strict ordering are where this pattern breaks. The validator asymmetry principleapplies: verifying idempotency is cheaper than guaranteeing it generatively. The post covers both boundary conditions.
Why Polling Fails by Construction #
Three independent failure modes converge in the polling pattern, and they compound under production load.
The first is the thundering herd. When multiple agent processes share a polling interval β even slightly staggered β they converge on the same task store simultaneously during bursts. The store sees a spike of read traffic at the moment it is already processing the event that triggered the burst. This is not a race condition in the traditional sense; it is a feedback loop baked into the architecture. The busier the system, the more agents poll, the worse the contention. This is chaos conserved, not eliminated β redistributed from one structural flaw to another.
The second is missed events under load. A polling agent checks for work at interval T. If two events arrive between polls, the agent sees only the state at poll time β not the sequence of transitions that produced it. For agents whose behavior depends on event ordering (a document that was uploaded, then immediately deleted, then re-uploaded), the polling model produces a fundamentally incorrect view of the world. The agent acts on a snapshot, not a history.
The third is state loss on process death. A polling agent holds its execution context in process memory. When the process dies β OOM kill, deployment restart, spot instance preemption β that context is gone. There is no recovery path. The task either re-enters the queue (if the queue was durable and the agent hadn't ACKed) or disappears silently (if the agent had already dequeued it). Most polling implementations fall into the second category, because the dequeue and the processing happen in the same unguarded block.
These three modes are not edge cases. They are the steady-state behavior of polling under production conditions. The fix is not a shorter poll interval or a smarter backoff β those are local optimizations on a globally broken pattern.
Polling-based agents fail by construction because they conflate task discovery with task ownership. Event-driven agents separate the two: the queue owns the task until the handler explicitly commits completion.
The Principle of Execution Durability #
I define this formally because it needs a name to be reasoned about precisely:
The Principle of Execution Durability: Every agent wake cycle must be recoverable from its last checkpoint without re-executing side effects. Durability is not a property of the storage layer β it is a property of the execution protocol.
The implication is specific: checkpointing must happen before the cycle is considered complete, and every operation with external side effects must be idempotent. If the process dies after a tool call but before the checkpoint, the handler must be able to re-execute that tool call safely on restart. This is why idempotency is not optional decoration on top of durability β it is the mechanism that makes durability possible.
This principle connects directly to the Probabilistic State Machine model. An agentic workflow is not a deterministic function β it is a state machine where transitions are probabilistic (governed by LLM outputs) but where the state itself must be deterministic and recoverable. The event-driven pattern is what makes that recovery possible. Without it, the state machine has no durable substrate to run on.
The Production Pattern: Queue, Handler, DLQ #
The architecture has three required components. Omitting any one breaks the durability guarantee.
The message queue (Kafka or SQS) provides durability of the event itself. The message exists independently of any consumer process. A consumer can die, restart, and re-consume the same message. Kafka retains messages by offset; SQS retains them by visibility timeout. Both provide the same guarantee: the event is not lost when the consumer fails.
The idempotent handler is the agent's execution unit. It consumes one message, performs all necessary work (including LLM calls and tool invocations), checkpoints final state to durable storage, and only then ACKs the message. The ACK is the commit. If the process dies before the ACK, the queue redelivers. Idempotency ensures redelivery is safe β the handler checks whether the work was already done (via a state key in Redis or Postgres) before executing.
The dead letter queue (DLQ) is where messages go after exceeding the maximum retry count. Without a DLQ, a poison message β one that consistently causes handler failure β will block the queue indefinitely or be silently dropped. The DLQ makes failure visible and replayable. It is the difference between "the agent stopped working" and "three messages failed with this specific error at 14:32".
The following handler implements this pattern against Kafka using confluent-kafka-python
, with idempotency keyed on message_id
and state checkpointed to Postgres before commit:
import json
import logging
from confluent_kafka import Consumer, KafkaError, Producer
from psycopg2 import connect as pg_connect
logger = logging.getLogger(__name__)
MAX_RETRIES = 3
DLQ_TOPIC = "agent-tasks-dlq"
def build_consumer(config: dict) -> Consumer:
return Consumer({
"bootstrap.servers": config["bootstrap_servers"],
"group.id": config["group_id"],
"auto.offset.reset": "earliest",
"enable.auto.commit": False, # manual commit = explicit ACK
})
def is_already_processed(conn, message_id: str) -> bool:
with conn.cursor() as cur:
cur.execute(
"SELECT 1 FROM agent_checkpoints WHERE message_id = %s",
(message_id,)
)
return cur.fetchone() is not None
def checkpoint(conn, message_id: str, result: dict) -> None:
with conn.cursor() as cur:
cur.execute(
"""INSERT INTO agent_checkpoints (message_id, result, completed_at)
VALUES (%s, %s, NOW())
ON CONFLICT (message_id) DO NOTHING""",
(message_id, json.dumps(result))
)
conn.commit()
def route_to_dlq(producer: Producer, message: dict, error: str) -> None:
payload = dict(message)
payload["dlq_error"] = error
producer.produce(DLQ_TOPIC, json.dumps(payload).encode())
producer.flush()
def run_agent_handler(config: dict) -> None:
consumer = build_consumer(config)
producer = Producer({"bootstrap.servers": config["bootstrap_servers"]})
conn = pg_connect(config["postgres_dsn"])
consumer.subscribe([config["topic"]])
try:
while True:
msg = consumer.poll(timeout=1.0)
if msg is None:
continue
if msg.error():
if msg.error().code() != KafkaError._PARTITION_EOF:
logger.error("Kafka error: %s", msg.error())
continue
payload = json.loads(msg.value().decode())
message_id = payload["message_id"]
retry_count = payload.get("retry_count", 0)
if is_already_processed(conn, message_id):
consumer.commit(message=msg)
continue
try:
result = execute_agent_task(payload) # LLM calls, tool use
checkpoint(conn, message_id, result) # durable state before ACK
consumer.commit(message=msg) # explicit ACK = commit
except Exception as exc:
logger.exception("Handler failed for %s", message_id)
if retry_count >= MAX_RETRIES:
route_to_dlq(producer, payload, str(exc))
consumer.commit(message=msg) # remove from main queue
else:
pass
finally:
consumer.close()
conn.close()
Three design decisions in this handler are load-bearing. First, enable.auto.commit: False
β automatic commit would ACK the message before processing completes, destroying the durability guarantee. Second, the idempotency check runs before any work begins β not after, not around the LLM call. Third, the DLQ route commits the offset on the main topic, removing the poison message from normal processing flow while preserving it for inspection.
Polling vs. Event-Driven Across Failure Dimensions #
| Failure Dimension | Polling Agent | Event-Driven Agent |
|---|---|---|
| CPU under idle load | Constant burn (empty polls) | Zero (blocked on poll() ) |
| Missed events under spike | Structural β snapshot, not history | None β queue buffers all events |
| State on process death | Lost (in-memory context) | Recoverable (checkpoint + redelivery) |
| Poison message handling | Silent drop or infinite loop | DLQ after MAX_RETRIES |
Where This Pattern Breaks #
The event-driven pattern eliminates the three polling failure modes. It introduces two constraints that polling does not have.
Ordering guarantees. Kafka preserves order within a partition; SQS standard queues do not. If agent tasks must be processed in strict causal order β task B depends on the completed state of task A β the queue topology must enforce this. With Kafka, this means keying messages to the same partition by a causal identifier (e.g., document_id
). With SQS, it means using FIFO queues, which carry throughput limits (3,000 messages per second with batching as of current AWS documentation). Architectures that require global ordering across all tasks cannot use this pattern without additional coordination.
Exactly-once semantics. The pattern above provides at-least-once delivery. The idempotency gate makes re-execution safe, but it does not make it invisible β downstream systems that are not idempotent will see duplicate calls. Kafka Transactions (introduced in Kafka 0.11) provide exactly-once semantics within the Kafka ecosystem, but only when both the consumer and producer are Kafka clients and the handler does not touch external systems. The moment an LLM call or a database write enters the picture, exactly-once becomes a distributed coordination problem, not a queue configuration. The honest answer is that most agentic workloads should design for at-least-once and make every external operation idempotent β not attempt to achieve exactly-once at the infrastructure layer.
Exactly-once delivery for agentic workloads is not a queue configuration problem. It is a distributed coordination problem that requires every external operation in the handler to be idempotent. The queue can only guarantee delivery semantics for the message itself.
Frequently Asked Questions #
Can I use this pattern with SQS if I need strict ordering?
SQS FIFO queues provide ordering within a message group, with a throughput ceiling documented by AWS at 3,000 messages per second with batching enabled. For most agentic workloads, that ceiling is not the constraint β the LLM call latency is. If your ordering requirement is causal (task B after task A for the same document), SQS FIFO with a MessageGroupId
keyed to the document identifier is sufficient. If you require total global ordering across all tasks regardless of source, you need Kafka with a single partition β which serializes all processing and eliminates horizontal scale. That trade-off is almost never worth making.
How does this pattern interact with LangGraph's checkpointing?
LangGraph
provides its own persistence layer via checkpointers (SqliteSaver
, PostgresSaver
). The event-driven pattern and LangGraph
checkpointing are complementary, not redundant. The queue handles durability of the task event β ensuring the agent wakes exactly when work arrives and recovers if the process dies before starting. LangGraph
's checkpointer handles durability of the graph execution state β the intermediate node outputs within a single task run. A complete production architecture uses both: the queue guarantees the agent receives the task; the graph checkpointer guarantees the agent can resume mid-execution after a failure inside the task.
The deeper question this pattern raises is about agent identity across restarts. An event-driven agent that checkpoints state and resumes after process death is, in a meaningful sense, a different process instance running the same logical agent. The Probabilistic State Machine model handles this cleanly β the state is the agent, not the process. But that framing has implications for how we think about agent memory, context windows, and the boundary between a "resumed" execution and a "new" one with access to prior state. That boundary is not as clean as the infrastructure pattern suggests.
Related Reading #
Durable Execution for Agentic Workflowsβ Why agents need crash-resilient state before they can wake on eventsThe Probabilistic State Machineβ The formal model for agent state transitions that event-driven operators implementThe Principle of Conservation of Chaosβ Event-driven architectures redistribute chaos β they don't eliminate it
This post explores ideas from Production-Ready AI Agents β the Three Pillars framework (Observability, Reliability, Security) for shipping AI systems that stay in production.