You probably don't need event-driven architecture



Event-driven is what everyone reaches for, and for a lot of work it's right. For expensive, stateful jobs that aren't in a hurry, a boring polling loop usually beats it.

One of my agents slept through its own work, and I didn't notice for hours.

A message had come in for it. The thing it was waiting on had finished an hour earlier, so it was free to run. It didn't, and nothing told me why. No error, no crash, nothing in the logs. The work sat there ready and nobody picked it up. The agent was up the whole time. It never woke up.

Those are the bugs I hate, the ones that don't page you at all. You only find them when someone asks why nothing happened.

Quick context, because the lesson has nothing to do with what my agents actually do: I run a bunch of programs that mostly sit idle and now and then wake up to do something expensive. Each one is backed by an LLM, so a wake is a model call or two, real money and a few real seconds. Swap "agent" for "any slow, costly job that touches state you care about" and nothing changes. A nightly export, a VM that takes a minute to boot, some rate-limited API you can only poke so often. Same story.

Something has to decide when each of these wakes up. That's the whole post. And the answer that took me way too long and a lot of deleted code to accept is a dumb loop on a timer. I only got there after building the clever version first and watching it fall over.

I want to be exact about the order, because it's easy to hear this as crawling back to where I started. My first version did have a timer, a bad one, and I ripped it out on purpose, because reacting the moment something happens is obviously better, right? Message lands, handle it now. Dependency finishes, wake the thing now. No wasted work. So the order went: bad timer, then a clean event-driven version I was proud of, then back to a timer. It felt like going backwards. It was right anyway.

To be clear, this isn't true for most things. Event-driven is the right call for plenty of work, anything cheap and latency-sensitive especially. But for the kind that's expensive to run, holds state, and isn't in a hurry, you usually don't need it, and that kind is more common than people admit. That's what this post is about.

The usual version of this meme ends in something baroque and clever. Mine ends in a loop.

Where it broke

Reacting the moment something happens sounds simple. It isn't, because by the time you react, the world has usually moved out from under the event.

The event-driven version grew the way these always grow, one reasonable patch at a time. Signals arrived in bursts, so I added something to squash a flurry into one wake instead of ten. An agent could get stuck waking itself, so I added a rate limit. An agent's own actions echoed back as new signals and woke it again, so I added a filter to ignore its own echo. Every one was a sensible fix to a real problem. And that's the trap. I was so busy patching symptoms I never asked whether the thing I kept patching was the problem. The pile of patches was the answer, and I stared at it for months.

The silent failure I opened with wasn't a one-off. There were a handful, and they all rhyme. A signal shows up for an agent whose situation has quietly changed, gets routed nowhere, and vanishes. A signal that matters gets mistaken for the agent's own echo and dropped, so the one wake I needed is the one the system ate. A wake fires at an agent that's already mid-reply to a person, and now two things are writing to the same place and stomping each other.

You've felt that last one even if you've never touched a scheduler. You ask a chatbot something, then send a second message before it's done. Now two answers are being written into one conversation. Which wins? In my system, nothing decided. They raced.

None of these is a bug in the logic. The rules were fine. Every one lived in the gap between a signal firing and a busy, expensive agent being ready for it. Races, dropped messages, stale assumptions. You don't get those from your business logic. You get them from reacting.

So I fixed them, all of them. And then I wrote a watchdog. If an agent woke a few times in a row and found nothing to do, it would step in and calm it down, because the event system could fire an agent at nothing, over and over, burning money on calls that did nothing. I had written a program whose whole job was to babysit my scheduler and protect me from it. You don't write that for a system that works.

What I did instead

Instead of another patch, I stopped reacting altogether.

I deleted the event-driven path and dropped in a loop. Every sixty seconds it wakes up, walks the agents, asks each one "anything to do here?", and if so, does it. That's the whole scheduler.
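A minimal sketch of a loop like that, in Python. The `Agent` class, its `has_work` and `run` methods, and the `tick` and `run_forever` names are all made up for illustration; only the sixty-second beat comes from the description above.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Agent:
    """Hypothetical stand-in for one slow, costly, stateful job."""
    name: str
    pending: List[str] = field(default_factory=list)  # work waiting for this agent
    handled: List[str] = field(default_factory=list)  # work it has already done

    def has_work(self) -> bool:
        return bool(self.pending)

    def run(self) -> None:
        # In a real system this is the expensive part (a model call or two).
        self.handled.extend(self.pending)
        self.pending.clear()


def tick(agents: List[Agent]) -> int:
    """One pass of the loop: ask each agent "anything to do here?"."""
    woken = 0
    for agent in agents:
        if agent.has_work():
            agent.run()
            woken += 1
    return woken


def run_forever(
    agents: List[Agent],
    interval_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """The whole scheduler: one pass, then wait for the next beat."""
    while True:
        tick(agents)
        sleep(interval_seconds)
```

Agents are visited one at a time inside `tick`, so two wakes for the same agent can't overlap as long as only one loop is running.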

The events didn't go away. They stopped being triggers and became data. Before, a signal fired and something reacted right then. Now a signal gets written to a list and waits, and the next time the loop comes around it reads the list and handles whatever's there. Your second chatbot message doesn't race the first anymore, it gets in line. Same information. It stopped interrupting.
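Here is roughly what "signals become data" can look like, as a sketch. `Inbox`, `record`, `drain`, and `handle_beat` are hypothetical names, and an in-memory dict stands in for whatever durable store a real system would use.

```python
from collections import deque
from typing import Deque, Dict, List


class Inbox:
    """Hypothetical per-agent list of signals waiting for the next beat."""

    def __init__(self) -> None:
        self._queues: Dict[str, Deque[str]] = {}

    def record(self, agent_id: str, signal: str) -> None:
        # Recording a signal never wakes anything. It only appends.
        self._queues.setdefault(agent_id, deque()).append(signal)

    def drain(self, agent_id: str) -> List[str]:
        # Called by the loop on its beat: take everything, in arrival order.
        return list(self._queues.pop(agent_id, deque()))


def handle_beat(inbox: Inbox, agent_id: str, transcript: List[str]) -> int:
    """One wake handles the whole batch, so messages queue instead of racing."""
    signals = inbox.drain(agent_id)
    transcript.extend(signals)
    return len(signals)
```

The second chatbot message from earlier becomes a second `record` call. It lands behind the first one, and the next beat sees both in order.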

And the machinery evaporated. The burst-squasher, the rate limit, the echo filter, the watchdog, the special cases, the bookkeeping tracking who was owed a wake, all deleted, because every piece of it only existed to survive reacting in real time, and I'd stopped reacting in real time. I deleted far more than I added.

The watchdog is the part I keep thinking about. The thing I'd been proud of wasn't needed anymore. A loop that runs once a minute can't spin. An idle agent wakes, sees nothing, goes back to sleep, once a minute, harmless. The protection I'd carefully built came for free the moment I stopped reacting.

All the complexity of the reacting design lives in the gap between the signal and the thing that has to act on it. The loop empties that gap. Signals become notes in a list, and one steady beat clears them.

Why polling, not events

So why does reacting instantly fall apart here when it works everywhere else? Because it was built for a different kind of work, and I dragged it somewhere it didn't belong.

For a web server, reacting the instant a request lands is exactly right. A request is cheap, it brings its own context, there are thousands a second, and the whole game is to answer each one now, in parallel, in milliseconds. A wasted reaction costs nothing, so you optimize for never missing one.

That is not my world. My work is the opposite on every axis that counts. It's expensive. A wake is a model call or two, so a wasted wake is a line item, not a rounding error. It's one at a time. Each agent has one running transcript that every wake writes onto, so fire two at once and you get two replies tangled into one history that makes sense to nobody. And it's in no rush. Nobody is watching a spinner for a background agent, so a minute of lag before it starts is invisible. I spent a lot of effort shaving a delay to zero that nobody could perceive.

There's a fourth, and it's the sharpest. Every big model provider caches the front of your prompt now, the standing instructions plus the conversation so far, and if your next call starts with the same text they charge a fraction for that part. The discount only kicks in when the start of the request is identical to last time. A scheduler that reacts to every change keeps rebuilding and resending, so that cached prefix keeps shifting and the discount evaporates. The calm loop leaves things alone between beats, so the prefix stays put and you keep it. You know this shape from a database that's only fast once its cache is warm. Reacting to everything keeps it cold.
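To make the prefix point concrete, here is a toy comparison. It is an illustration only: real providers match on tokens under their own rules, not on characters, and the prompt strings and the "status line at the top" rebuild are assumptions of mine, not what any particular scheduler does.

```python
import os


def shared_prefix_len(previous: str, current: str) -> int:
    """Length of the identical leading text shared by two prompts."""
    return len(os.path.commonprefix([previous, current]))


# Hypothetical prompt parts.
instructions = "You are a careful assistant.\n"
history = "user: export the report\n"
previous = instructions + history

# Calm loop: the next call only appends to what was sent before.
appended = previous + "user: also email it\n"

# Reactive rebuild (assumed): fresh state gets written at the front.
rebuilt = "status: 3 signals pending\n" + instructions + history

kept_when_appending = shared_prefix_len(previous, appended)
kept_when_rebuilding = shared_prefix_len(previous, rebuilt)
```

Appending keeps the whole previous prompt as a shared prefix. Changing anything near the front leaves almost nothing to reuse, however much of the later text is the same.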

None of those four is about AI, which is the point. Wherever they line up (expensive, one writer, no rush, a cache worth keeping warm), reacting to every change is the wrong reflex, whatever the work is.

So when each reaction is expensive and a slow one is invisible, you want a few deliberate, well-timed reactions, not a thousand instant ones. The exact opposite of the web request that event-driven was built around. I'd grabbed the wrong tool and paid for it one bug at a time.

And none of this is new. There's a name for it. Reacting the moment something changes is edge-triggered. You catch the instant of change, and if you miss it, it's gone, which is exactly the lost wake I opened with. Checking the current state on a beat is level-triggered. Miss a beat and the state is still there next time. The most trusted infrastructure most of us run all day works the second way. Kubernetes doesn't treat a "server died" firehose as the thing to act on and panic. Its controllers run control loops: each pass looks at how things are versus how they should be, and nudges. A change notification only tells it to look sooner, so a missed one gets picked up on a later pass. It's a loop. The thing I was too clever to reach for is the thing the pros already trust.
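The difference fits in a few lines. This is a toy sketch, not how Kubernetes or any real scheduler is implemented, and `EdgeTriggered`, `LevelTriggered`, `on_ready`, and `beat` are names invented here.

```python
from typing import Callable, List, Set


class EdgeTriggered:
    """Acts only at the instant of change. A missed notification is gone."""

    def __init__(self) -> None:
        self.listening = True
        self.woken: List[str] = []

    def on_ready(self, job: str) -> None:
        if self.listening:
            self.woken.append(job)
        # Otherwise the notification is dropped and nothing remembers it.


class LevelTriggered:
    """Looks at the current state on each beat and acts on what it finds."""

    def __init__(self, ready_jobs: Callable[[], Set[str]]) -> None:
        self.ready_jobs = ready_jobs  # reads the state as it is right now
        self.done: Set[str] = set()
        self.woken: List[str] = []

    def beat(self) -> None:
        for job in sorted(self.ready_jobs() - self.done):
            self.woken.append(job)
            self.done.add(job)


# A job becomes ready while the edge-triggered side isn't listening.
state: Set[str] = set()
edge = EdgeTriggered()
level = LevelTriggered(lambda: state)

edge.listening = False
state.add("export")
edge.on_ready("export")  # lost wake: fired at a moment nobody was ready
edge.listening = True

level.beat()  # the state is still there, so the next beat picks it up
```

After this runs, `edge.woken` is empty and `level.woken` holds the job. Nothing in the edge-triggered version is wrong line by line. It just had one chance.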

The catch

I won't pretend the loop is free. It's slower. An agent can take up to a minute to notice new work, unless I nudge it for the rare thing that genuinely can't wait, which I kept as a small escape hatch instead of the whole design. So instant reaction isn't gone. It runs a narrow corner now instead of the whole place. The loop also does a little pointless work, an idle agent waking once a minute to confirm there's nothing to do. For my work that's fine. For yours, maybe not, and that's your call.
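One way to build that kind of escape hatch, sketched with Python's `threading.Event`. The `Beat` class and its method names are hypothetical. The idea is that a nudge only cuts the wait short; the pass that follows still reads the current state, same as any other beat.

```python
import threading


class Beat:
    """Hypothetical timer that can be cut short by a nudge."""

    def __init__(self, interval_seconds: float = 60.0) -> None:
        self.interval_seconds = interval_seconds
        self._nudge = threading.Event()

    def nudge(self) -> None:
        # For the rare thing that can't wait: record the work first, then nudge.
        self._nudge.set()

    def wait_for_next_beat(self) -> bool:
        """Block until the interval elapses or a nudge arrives.

        Returns True if a nudge ended the wait early, False on a normal beat.
        """
        nudged = self._nudge.wait(timeout=self.interval_seconds)
        # A nudge that lands right here is cleared too, which is fine:
        # the pass about to run reads the state that nudge was pointing at.
        self._nudge.clear()
        return nudged
```

The loop body stays the same either way: call `wait_for_next_beat()`, then do one ordinary pass over the agents. The nudge carries no payload, so it can't be mistaken for the work itself.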

But look at the trade. I gave up instant reactions and a class of silent failures that could cost anything, and got back failures that are loud and cheap. A pointless wake costs nothing and shows up on a graph. A lost wake costs you the work and shows up nowhere. I'll take the first one every time.

Strip out my situation and the rule generalizes. When you have to wake something expensive that holds state and isn't in a hurry, don't let events pull the trigger. Let them pile up. You pull the trigger, on a steady beat, by looking at how things actually are. Keep instant reaction for the few cases that truly can't wait.

The rewrite was easy. The hard part was admitting the primitive answer was right. A loop on a timer is the thing you write in your first week, and reaching for it after years of supposedly knowing better felt like losing. It wasn't. It's been boring ever since, no watchdog, no special cases, no silent bug I find out about a week late. And boring is about the highest praise there is.


source & further reading

openacme.org — original article
