Every guardrail in my agent-built project was earned from a real failure, not designed up front. A cost test for when to build one, when to wait, and when to retire it.
On a Wednesday in late May I caught a bug by reading. The project's glossary — the
canonical list of the domain terms my coding agent is required to use — had drifted
from the domain model I actually carried in my head. Nothing flagged it. No test failed, no check fired, no compiler complained. I noticed because I happened to be
reading the file and the words were wrong.
What I typed next is the whole argument in one line. Not "fix the glossary," not
"I'll be more careful," but a question aimed at the toolchain instead of the error:
the GLOSSARY already drifted away from my domain model vision and I would like
to prevent this in future refactors
That reframes the task. The drift stops being an error to correct once
and becomes a signal about the toolchain: if I caught a class of drift by reading, an enforcement axis is missing. The fix isn't to read harder next time.
I did not sit down and design a governance system, and that's the most transferable
thing here. Every check in this project — a couple dozen now, wired as pre-commit
hooks — was born from a particular drift I'd already been bitten by. The ordering is
the point. The failure came first; the gate came second, shaped to the exact failure.
The clearest place this showed up was renaming. (I told the measurement side of
this story in a companion piece, Rails, Not Rules.) The detail that stung was where the residue hid. In one rename, the leftover
instances of the old term weren't in some forgotten corner of the codebase. They
were in the specification documents, the very files being rewritten to drive the
rename in the first place. The agent, under my direction, was editing the spec to
declare the new vocabulary while leaving the old vocabulary scattered through the
same spec. The document asserting the rename was itself out of compliance with it.
You cannot out-discipline that, and watching the diff doesn't save you either: no
amount of review reliably catches the contradiction being introduced in the very
pass that's concentrating on the change. Only a check that reads the document back
to you after the fact will, and that's what eventually went in.
It helps to be clear about why renames recur at all, because it isn't sloppiness.
The glossary is the written home of what Domain-Driven Design (DDD) calls the
ubiquitous language (UL): the single vocabulary that runs from a domain expert's
sentence to a class name. Eric Evans is emphatic that this language is not written
once and frozen; it is continuously distilled, sharpened every time the team's
understanding improves. Each sharpening is a rename. So the renames are the system
working as designed, not a mess to stamp out — which is exactly why the residue
problem is permanent and worth a gate rather than a resolution to be tidier.
After that, a rename was never just "use the new word now." It became three things
that ship together: rename the term, add a glossary entry recording the old word as
retired, and add a check keyed to that retired word so it can't silently return.
The rename and the rail land in the same commit. Skip the rail and you're trusting
that every future sweep will be perfect, which is the assumption that just failed.
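A minimal sketch of the third piece, the check keyed to retired words, assuming a plain mapping from each retired word to its replacement. The terms, the `RETIRED_TERMS` name, and `find_retired_terms` are hypothetical illustrations, not the project's actual hook.

```python
import re

# Hypothetical retired vocabulary: old word -> replacement recorded in the glossary.
RETIRED_TERMS = {"ticket": "case", "operator": "agent"}


def find_retired_terms(text, retired=RETIRED_TERMS):
    """Return (line_number, retired_word, replacement) for each whole-word hit."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for old, new in retired.items():
            # \b restricts the match to whole words, so "ticketing" is not flagged.
            if re.search(rf"\b{re.escape(old)}\b", line, flags=re.IGNORECASE):
                hits.append((lineno, old, new))
    return hits


spec = "A case is opened by the customer.\nThe operator closes the ticket."
for lineno, old, new in find_retired_terms(spec):
    print(f"line {lineno}: retired term '{old}', use '{new}'")
```

Wired as a pre-commit hook, a non-empty result would fail the commit. Whole-word matching is a design choice in this sketch: it keeps false positives low at the price of missing inflections such as plurals.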
So far this is a tidy story: let each gate be earned by a real, observed drift,
and your instruction file never bloats into the 200-line document nobody, human or
model, can hold in their head. A rule you add from imagination is a guess about a
failure that may never come. A rule you add from a failure you actually hit is paid
for. It has a body behind it. The failure log is the design document.
I believed that cleanly for about a month. Then I had to argue with my own agent
about it, and the rule came out more interesting than it went in.
The trouble with "wait for the drift" is that some drifts you cannot afford to
observe even once. If the first occurrence is a deleted production table, or
customer data in a log line, or a fabricated citation in something already
published, "let it go wrong cheaply once" is a contradiction. There is no cheap
once.
I put this to the agent as a proposed amendment: build a gate proactively when the
failure mode is already well-attested and the check is cheap and
low-false-positive, or when the first occurrence is expensive or irreversible. Stay
reactive when the rule is judgment-heavy, the convention is still in flux, or the
drift is cheap to catch and fix once. The reasoning is mundane risk math, not
ideology: you pre-pay for a gate exactly when the expected cost of waiting exceeds
the standing cost of the gate.
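The risk math can be written down as a one-line comparison. This is a sketch under stated assumptions: the function name, the parameters, and the numbers are all illustrative, and each input is a rough estimate in the same unit over the same horizon.

```python
def should_pregate(p_failure, cost_of_failure, gate_standing_cost):
    """Return True when the expected cost of waiting exceeds the gate's standing cost.

    All three inputs are rough estimates in one unit (hours, say) over one
    horizon. The names are illustrative assumptions, not a measured model.
    """
    expected_cost_of_waiting = p_failure * cost_of_failure
    return expected_cost_of_waiting > gate_standing_cost


# Irreversible failure: unlikely but very expensive, so the gate is pre-paid.
print(should_pregate(p_failure=0.05, cost_of_failure=400, gate_standing_cost=5))
# Cosmetic slip: likely but trivial to fix by hand, so staying reactive wins.
print(should_pregate(p_failure=0.8, cost_of_failure=0.25, gate_standing_cost=5))
```

The first call prints `True` and the second `False`. The estimates will always be soft; the point of the sketch is only that both sides of the inequality have to be named before the decision is made.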
This repo runs on that amendment, and I can point at two receipts. The writing
project, the one these essays live in, has a check that blocks any cited URL not
logged in an evidence file. I built it before it had earned itself in the usual
reactive way, because the failure it guards is the expensive kind and it had
already happened once: a fabricated citation reached a draft. A published
fabrication doesn't get a cheap first occurrence. So the gate went in proactively,
licensed by the cost test: is the failure costly enough, and is the check low
enough in false positives?
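A check of this kind could be sketched as follows, assuming the evidence file is plain text that contains each logged URL verbatim. The function names and the URL pattern are hypothetical; the real check's file format is not described here.

```python
import re

# Deliberately simple URL matcher; stops at whitespace and closing brackets.
URL_PATTERN = re.compile(r"https?://[^\s)>\]]+")


def extract_urls(text):
    """Return URLs in order of appearance, with trailing punctuation trimmed."""
    return [url.rstrip(".,;:") for url in URL_PATTERN.findall(text)]


def unlogged_urls(draft_text, evidence_text):
    """Return cited URLs from the draft that are absent from the evidence log."""
    logged = set(extract_urls(evidence_text))
    missing = []
    for url in extract_urls(draft_text):
        if url not in logged and url not in missing:
            missing.append(url)
    return missing


draft = "See https://example.com/a. Also (https://example.org/b)."
evidence = "https://example.com/a  verified by hand\n"
print(unlogged_urls(draft, evidence))
```

A non-empty list would block the commit. Note what the sketch does not prove: a logged URL is a URL someone recorded, not a source that actually supports the sentence citing it.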
The second receipt is the other half of the amendment, the well-attested-and-cheap
case rather than the irreversible one. The same writing project runs a prose check on these very drafts, flagging the small set of mechanical tells that mark
machine-generated text. That failure mode wasn't hypothetical and it wasn't
expensive-once; it was a known, recurring, deterministic pattern I'd have to
correct on every essay forever. Well-attested, cheap to detect, low false-positive:
that's the second carve-out, and it's the reason this paragraph can't lean on more
than two em-dashes without the gate stopping the commit. The check is the essay's
own argument applied to the essay. The rule "gates are earned" survives both times,
with named carve-outs for failures you can't afford to rehearse and failures you've
already seen enough times to name precisely.
Between "earn it from a failure" and "stay reactive" sits the move that does most
of the real work, and it's the one I'd most want a reader to steal. A lot of rules
look un-gateable because the thing they protect is a meaning, and a script can't
read meaning. "Don't restate the protocol prose in two places" isn't grep-able.
Neither is "don't leave a dangling reference." The reflex is to give up and write
the rule into an instructions file as a wish. The better move is to stop checking
the rule and check a structural proxy for it: a syntactic shadow the real rule
casts, one a script can see and that almost never flags an innocent.
It works more often than it has any right to. "Don't duplicate the protocol in
prose" became "no narrative text in column zero of these files," because the
protocol lived in indented blocks and unindented prose was the tell. "No dangling
references" became "every TODO( carries a live task slug." The em-dash rule
guarding these very drafts is the same trick: I can't gate "don't write like a
language model," but I can gate "no paragraph carries more than two em-dashes,"
which is a measurable shadow of the thing I actually want. Each time, a rule I'd
have sworn was judgment-only turned out to cast a shape a grep could catch.
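The em-dash proxy is small enough to sketch in full. This version assumes paragraphs are separated by blank lines and uses the two-per-paragraph threshold stated above; the function and constant names are hypothetical.

```python
EM_DASH = "\u2014"
MAX_PER_PARAGRAPH = 2  # threshold taken from the rule as stated in the essay


def paragraphs_over_limit(text, limit=MAX_PER_PARAGRAPH):
    """Return (paragraph_number, em_dash_count) for each paragraph over the limit."""
    # Assumption: paragraphs are separated by blank lines.
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    offenders = []
    for number, paragraph in enumerate(paragraphs, start=1):
        count = paragraph.count(EM_DASH)
        if count > limit:
            offenders.append((number, count))
    return offenders


sample = (
    "One \u2014 two \u2014 three.\n\n"
    "Four \u2014 five \u2014 six \u2014 seven."
)
print(paragraphs_over_limit(sample))
```

The check counts characters and nothing else, which is exactly what makes it deterministic and cheap to run on every commit.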
The proxy is never the rule, though. A structural shadow is an approximation by
construction, so every proxy gate ships with a built-in gap between what it checks
and what you wish it proved. That gap is the next failure.
Here is the part I got wrong, and the agent caught.
I had been treating "cheap to write" as most of what makes a gate worth building.
The agent pushed back on two fronts. First, cheap has to mean cheap to maintain,
not cheap to write. A gate is a standing liability, not a one-time cost. Every
legitimate evolution of the convention now has to route through the check, and you
pay an update tax and a false-positive-triage tax for the life of the gate. The
authoring cost is the small number.
The second point is the one that changed how I think. The agent put it more
sharply than I had:
Once a green gate exists, people stop eyeballing the thing the gate appears to
cover. So a cheap-but-approximate gate over a real invariant can be worse than
no gate — it converts active vigilance into misplaced trust.
The dominant hidden failure of a cheap gate isn't noise. It's false confidence. A
cheap check usually only approximates the invariant you care about. A text search
proves a token is absent; it does not prove a concept was actually renamed
everywhere it lives. A
fast compile passes against a stale cache. A sub-agent reports "tests passed" and
you don't re-run them. In every case the green checkmark covers less than it
appears to.
An outside read lands in the same place. Birgitta Böckeler, whose harness
vocabulary I lean on below, notes that test feedback is weaker than it looks once
"the agent also generated the tests." A verifier the generator authored is not
independent of the thing it checks, so a passing suite certifies less than it
seems to — the same gap, one layer up.
The rename gate from earlier is a perfect specimen — a structural proxy of exactly
the kind I just praised, which makes it the more humbling, because it's one I'm
proud of. It greps for the retired word and passes when the word is gone. But "the
old word is absent" and "the concept was correctly migrated" are not the same claim. You can delete every instance of the old term and still
have left the idea it named half-translated, split across two new words that
should have been one, or attached to the wrong entity. The gate is green and the
migration is wrong, and now nobody's reading the diff with the old suspicion,
because the check says it's handled. The check is real and worth having. It just
proves a narrower thing than its green checkmark advertises, and the discipline is
to keep knowing the difference instead of letting the green absolve you.
That's the trap in one move. You were watching the thing carefully when nothing
claimed to watch it for you. Now something claims to, so you look away, and the gap
between what the check proves and what it looks like it proves is exactly where
the next bug lives. When those two diverge and the invariant matters, that argues
against the cheap gate, not for it.
This isn't an argument against gates. It's an argument for knowing which of your
green checkmarks are load-bearing and which are decorative.
The other thing I had been collapsing was the choice itself. I'd been treating it
as binary: gate the thing, or stay reactive and fix it by hand when it breaks.
There are at least four rungs, and most failures want one of the middle two.
The distinction isn't mine — manufacturing got there decades ago. Shigeo Shingo's
poka-yoke, the mistake-proofing of the Toyota Production System, already splits a control that makes the error
impossible from a warning that only signals it. The two middle rungs below are
those two ideas wearing software clothes. Birgitta Böckeler's
harness engineering gives the orthogonal axis: guides that steer the agent before it acts, sensors
that observe after. A blocking gate is a sensor with teeth; a default path is a
guide that leaves nothing to sense.
The strongest rung isn't a gate at all. It's the default path — Shingo's
control type: make the wrong artifact hard to author in the first place. A
template, a generator, a structured section that only has slots for the right
shape. There's nothing to false-positive on, because you're not checking after the
fact, you're removing the way to get it wrong. When a failure is common but a
precise gate would be noisy, this beats the gate.
Below the blocking gate sits the tripwire — Shingo's warning type: warn without
blocking, or ask for confirmation before an irreversible step. This is the right
reach for failures that are catastrophic on first occurrence but genuinely hard to
gate cleanly — the data deletion, the leak, the fabrication. You don't need a perfect detector. A loud "are
you sure, here's what you're about to overwrite" buys most of the protection at a
fraction of the false-positive cost. The mistake is letting "we can't build a clean
gate for this" collapse into "so we stay fully reactive" on a failure you can't
take twice.
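A tripwire can be as small as a confirmation wrapper. The sketch below is an assumption about shape, not a description of an existing tool: `tripwire` and its parameters are hypothetical, and the `confirm` argument exists so the prompt can be stubbed in tests.

```python
def tripwire(action, targets, confirm=input):
    """Warn, list what is about to be affected, and proceed only on an explicit 'yes'."""
    print(f"About to {action}. This cannot be undone.")
    for target in targets:
        print(f"  will overwrite: {target}")
    answer = confirm("Type 'yes' to continue: ")
    return answer.strip().lower() == "yes"


# Demo with a stubbed answer so the sketch runs non-interactively.
proceed = tripwire("drop tables", ["orders", "customers"], confirm=lambda prompt: "no")
print("proceeding" if proceed else "aborted")
```

There is no detector here at all, only a pause and a list of what is at stake, which is why the false-positive cost stays low.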
The rung that doesn't work is the one that looks like discipline: a script everyone
is supposed to run but nothing enforces. It's neither a gate nor a default path,
so under any real deadline it just gets skipped. A rule that lives only in prose is
a preference, however firmly phrased.
So the ladder, top to bottom: blocking gate, default path, tripwire, stay reactive.
"Should I gate this" was always the wrong question. The question is which rung the
failure earns.
Because a gate is a standing liability, "earned" is a real filter, and a filter
that only ever says yes isn't filtering. The same cost test that licenses a
proactive gate also tells you when to leave something ungated, and I find I reach
for it in that direction about as often. A concrete one: I recently changed a documentation convention across the project. A
pure convention change, no behavior, no domain term retired. The reflex in a repo
that leans this hard on enforcement is to add a check for the new convention. What
I wrote instead was an instruction to not:
Don't add a gate or script — this is a convention change only; the enforcement
hook stays deferred per the cost test.
The convention isn't load-bearing, a check on it would generate churn every time
the convention itself evolves, and the failure if someone gets it wrong is
cosmetic. No gate. Not yet, maybe not ever.
The same logic runs in reverse, against gates that already exist. A check family
only grows: one iteration of mine added six at once, and nothing in the workflow
ever pushes the other way. The tidy instinct is a recurring "prune the gates"
ritual at every close. That instinct is right about the pressure and wrong about
the mechanism, for two reasons that separate a gate from a stale paragraph of
prose. A doc paragraph taxes every session, so doc-bloat is pure, constant rent; a
redundant gate taxes only when it runs, so its bloat is mostly latent. And a stale
doc is just noise, while a gate is a load-bearing invariant whose removal risks
silently dropping a correctness check no one notices is gone. So pruning earns more
care and less frequency than tidying prose, not the same recurring slot. A full
portfolio review every close would usually find nothing, and a ritual that usually
finds nothing is the skipped-discipline trap again, dressed as housekeeping.
Two corollaries fall out. "Merge these two similar gates" defaults to no: two
precise checks beat one fuzzy merged one, and a merge is a design act with its own
failure mode, not janitorial cleanup. And the counter-pressure that is warranted
should itself be mechanized rather than left to willpower — a redundancy meta-check
that reads the coupling each gate already declares in its header and flags real
overlap, fired on a threshold or a phase boundary, never once per commit. Even
pruning gates wants a rung, not a resolution to be tidier.
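One way such a meta-check might look, assuming each gate script declares its coupling in a header comment. The `# couples:` header format, the gate names, and both function names are invented for this sketch; the project's real header convention is not shown in this essay.

```python
import re
from itertools import combinations

# Hypothetical header convention: a comment line such as "# couples: GLOSSARY.md, spec/".
COUPLES_LINE = re.compile(r"^#[ \t]*couples:[ \t]*(.+)$", re.MULTILINE)


def declared_coupling(script_text):
    """Return the set of artifacts a gate script declares it couples."""
    match = COUPLES_LINE.search(script_text)
    if match is None:
        return frozenset()
    parts = (part.strip() for part in match.group(1).split(","))
    return frozenset(part for part in parts if part)


def overlapping_gates(scripts):
    """Given {gate_name: source_text}, return pairs whose declared coupling is identical."""
    couplings = {name: declared_coupling(text) for name, text in scripts.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(couplings), 2)
        if couplings[a] and couplings[a] == couplings[b]
    ]


gates = {
    "check-a": "#!/bin/sh\n# couples: GLOSSARY.md, spec/\n",
    "check-b": "# couples: spec/, GLOSSARY.md\n",
    "check-c": "# couples: README.md\n",
}
print(overlapping_gates(gates))
```

The sketch only flags pairs for a human to look at. Identical declared coupling is a hint of redundancy, not proof, and since a merge defaults to no, the output is a review list rather than a deletion list.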
A good gate even pays a rent rebate. Every rule you load into the agent's
always-present instructions is standing context it has to carry, and that context
isn't free. A deterministic check lets you delete the paragraph of prose that
used to beg the model to remember a rule and replace it with a script plus a
reference read only when it fires. The gate does the remembering, so the context
window doesn't have to — fewer gates can mean a heavier prompt, not a lighter one.
The recursion is the part I like best: the system diagnosed the bias on itself.
Everything in this project leans toward enforcement over discipline. The agent
named the hazard better than I had:
the whole check-* script family leans hard toward enforce-over-discipline, so
a fresh session inherits a "when in doubt, add a check" texture with no stated
boundary. That absence is a latent gap.
A bias toward gating with nothing written down about when not to. The
honest response, by this essay's own logic, is not to immediately write the
boundary rule into the always-loaded instructions. That would be exactly the
proactive over-gating the rule warns against. The boundary rule hasn't failed yet.
So we left it ungated, with a concrete trigger: the first time any session ships a
gate that turns out high-false-positive or merely approximate, that's the
observed failure, and the boundary rule gets promoted then — and into a
read-on-demand reference, not the always-loaded file, because it's judgment-grade
material rather than a per-session directive. The meta-rule about earning gates is
itself being made to earn its place. The failure log is the design document, all
the way up.
This worked for one person, on one roughly 150,000-line codebase, where I own the
glossary and the domain model is a single coherent thing in a single head. "Earned
from failure" is cheap when the failure is yours and you hit it the same day you build the gate. At team scale the picture is harder: the standing context cost
compounds, the false-positive triage falls on people who didn't feel the original
pain, and "let it go wrong cheaply once" is a different proposition when the once is
someone else's afternoon. I don't have evidence for the team case. I have evidence
for the solo case and a suspicion the principles survive the move, which is not the same thing.
One tempting claim I'll resist. I had a hunch, watching the sessions, that the agent
resists proposing gates proactively — that it fixes the immediate issue and only
designs the enforcement when pushed. There's a clean-looking receipt for it. Asked
to prevent a recurrence after I'd approved a fix, the agent replied:
Approved. Let me make the fix first, then design the enforcement check.
Fix first, gate second, only once prompted. But when I went looking for the
pattern, the support was mixed. In the flow of a task it does tend to fix first and
gate when asked. Hand it the same question as a matter of policy, though, and it
argues for more proaction than I'd proposed, not less. So I won't dress that up as
a clean behavioral finding. It's a real tension I don't yet understand well enough
to assert.
None of this is specific to my domain. Any agent operating on a governed system —
a customer-service platform, a billing pipeline, anything where words have to mean
the same thing across many surfaces — accumulates the same class of drift, and the
same four rungs apply.
Picture the Contact Center case concretely, since it's the domain I know best. A
single concept gets named in a routing rule, a reporting dashboard, an agent-facing
script, and the schema of the system that ties them together. Rename it in three of those
four and the fourth keeps quietly emitting the old word, so the dashboard and the
router disagree about what they're counting, and nothing crashes. That's the drift,
and it's invisible to every test that checks behavior rather than vocabulary. The
gate that couples the surfaces is the only thing that catches a disagreement no
single component is wrong about. That coupling pattern, applied in both directions,
is the subject of a companion essay, Couple Both Ways.
The transferable move is to stop treating your instruction file as the place where
correctness lives. Prose is where preferences live. Correctness lives in the gates,
default paths, and tripwires you build, each one shaped to a failure specific
enough to point at. You don't govern an agent by
predicting how it'll go wrong. You let it go wrong cheaply where you can, and you
convert each real miss into the cheapest rung that makes it impossible to miss that
way again.
This essay was written by directing a coding agent over the project it describes; I direct and judge, the agent drafts and argues back. The argument-back, in this case, is most of beats four and five.
I build governed agent systems at the intersection of Contact Center software and AI. If that's a problem you're chewing on, I'm reachable on LinkedIn.
Published at vasyltretiakov.dev.