Gates Earned From Failure: a cost test for agent guardrails

wpnews.pro

Every guardrail in my agent-built project was earned from a real failure, not designed up front. A cost test for when to build one, when to wait, and when to retire it.

On a Wednesday in late May I caught a bug by reading. The project's glossary — the

canonical list of the domain terms my coding agent is required to use — had drifted

from the domain model I actually carried in my head. Nothing flagged it. No test failed, no check fired, no compiler complained. I noticed because I happened to be

reading the file and the words were wrong.

What I typed next is the whole argument in one line. Not "fix the glossary," not

"I'll be more careful," but a question aimed at the toolchain instead of the error:

the GLOSSARY already drifted away from my domain model vision and I would like

to prevent this in future refactors

That reframes the task. The drift stops being an error to correct once

and becomes a signal about the toolchain: if I caught a class of drift by reading, an enforcement axis is missing. The fix isn't to read harder next time.

I did not sit down and design a governance system, and that's the most transferable

thing here. Every check in this project — a couple dozen now, wired as pre-commit

hooks — was born from a particular drift I'd already been bitten by. The ordering is

the point. The failure came first; the gate came second, shaped to the exact failure.

The clearest place this showed up was renaming. (I told the measurement side of

this story in a companion piece, Rails, Not Rules: The detail that stung was where the residue hid. In one rename, the leftover

instances of the old term weren't in some forgotten corner of the codebase. They

were in the specification documents, the very files being rewritten to drive the

rename in the first place. The agent, under my direction, was editing the spec to

declare the new vocabulary while leaving the old vocabulary scattered through the

same spec. The document asserting the rename was itself out of compliance with it.

You cannot out-discipline that, and watching the diff doesn't save you either: no

amount of review reliably catches the contradiction being introduced in the very

pass that's concentrating on the change. Only a check that reads the document back

to you after the fact will, and that's what eventually went in.

It helps to be clear about why renames recur at all, because it isn't sloppiness.

The glossary is the written home of what Domain-Driven Design (DDD) calls the

ubiquitous language (UL): the single vocabulary that runs from a domain expert's

sentence to a class name. Eric Evans is emphatic that this language is not written

once and frozen; it is continuously distilled, sharpened every time the team's

understanding improves. Each sharpening is a rename. So the renames are the system

working as designed, not a mess to stamp out — which is exactly why the residue

problem is permanent and worth a gate rather than a resolution to be tidier.

After that, a rename was never just "use the new word now." It became three things

that ship together: rename the term, add a glossary entry recording the old word as

retired, and add a check keyed to that retired word so it can't silently return.

The rename and the rail land in the same commit. Skip the rail and you're trusting

that every future sweep will be perfect, which is the assumption that just failed.

So far this is a tidy story: let each gate be earned by a real, observed drift,

and your instruction file never bloats into the 200-line document nobody, human or

model, can hold in their head. A rule you add from imagination is a guess about a

failure that may never come. A rule you add from a failure you actually hit is paid

for. It has a body behind it. The failure log is the design document.

I believed that cleanly for about a month. Then I had to argue with my own agent

about it, and the rule came out more interesting than it went in.

The trouble with "wait for the drift" is that some drifts you cannot afford to

observe even once. If the first occurrence is a deleted production table, or

customer data in a log line, or a fabricated citation in something already

published, "let it go wrong cheaply once" is a contradiction. There is no cheap

once.

I put this to the agent as a proposed amendment: build a gate proactively when the

failure mode is already well-attested and the check is cheap and

low-false-positive, or when the first occurrence is expensive or irreversible. Stay

reactive when the rule is judgment-heavy, the convention is still in flux, or the

drift is cheap to catch and fix once. The reasoning is mundane risk math, not

ideology: you pre-pay for a gate exactly when the expected cost of waiting exceeds

the standing cost of the gate.

This repo runs on that amendment, and I can point at two receipts. The writing

project, the one these essays live in, has a check that blocks any cited URL not

logged in an evidence file. I built it before it had earned itself in the usual

reactive way, because the failure it guards is the expensive kind and it had

already happened once: a fabricated citation reached a draft. A published

fabrication doesn't get a cheap first occurrence. So the gate went in proactively,

licensed by the cost test: is the failure high enough cost, and is the check free

of false positives?

The second receipt is the other half of the amendment, the well-attested-and-cheap

case rather than the irreversible one. The same writing project runs a prose check on these very drafts, flagging the small set of mechanical tells that mark

machine-generated text. That failure mode wasn't hypothetical and it wasn't

expensive-once; it was a known, recurring, deterministic pattern I'd have to

correct on every essay forever. Well-attested, cheap to detect, low false-positive:

that's the second carve-out, and it's the reason this paragraph can't lean on more

than two em-dashes without the gate stopping the commit. The check is the essay's

own argument applied to the essay. The rule "gates are earned" survives both times,

with named carve-outs for failures you can't afford to rehearse and failures you've

already seen enough times to name precisely.

Between "earn it from a failure" and "stay reactive" sits the move that does most

of the real work, and it's the one I'd most want a reader to steal. A lot of rules

look un-gateable because the thing they protect is a meaning, and a script can't

read meaning. "Don't restate the protocol prose in two places" isn't grep-able.

Neither is "don't leave a dangling reference." The reflex is to give up and write

the rule into an instructions file as a wish. The better move is to stop checking

the rule and check a structural proxy for it: a syntactic shadow the real rule

casts, one a script can see and that almost never flags an innocent.

It works more often than it has any right to. "Don't duplicate the protocol in

prose" became "no narrative text in column zero of these files," because the

protocol lived in indented blocks and unindented prose was the tell. "No dangling

references" became "every TODO(

carries a live task slug." The em-dash rule

guarding these very drafts is the same trick: I can't gate "don't write like a

language model," but I can gate "no paragraph carries more than two em-dashes,"

which is a measurable shadow of the thing I actually want. Each time, a rule I'd

have sworn was judgment-only turned out to cast a shape a grep could catch.

The proxy is never the rule, though. A structural shadow is an approximation by

construction, so every proxy gate ships with a built-in gap between what it checks

and what you wish it proved. That gap is the next failure.

Here is the part I got wrong, and the agent caught.

I had been treating "cheap to write" as most of what makes a gate worth building.

The agent pushed back on two fronts. First, cheap has to mean cheap to maintain,

not cheap to write. A gate is a standing liability, not a one-time cost. Every

legitimate evolution of the convention now has to route through the check, and you

pay an update tax and a false-positive-triage tax for the life of the gate. The

authoring cost is the small number.

The second point is the one that changed how I think. The agent put it more

sharply than I had:

Once a green gate exists, people stop eyeballing the thing the gate appears to

cover. So a cheap-but-approximate gate over a real invariant can be worse than

no gate — it converts active vigilance into misplaced trust.

The dominant hidden failure of a cheap gate isn't noise. It's false confidence. A

cheap check usually only approximates the invariant you care about. A text

search proves a token is

absent; it does not prove a concept was actually renamed everywhere it lives. A

fast compile passes against a stale cache. A sub-agent reports "tests passed" and

you don't re-run them. In every case the green checkmark covers less than it

appears to.

An outside read lands in the same place. Birgitta Böckeler, whose harness

vocabulary I lean on below, notes that test feedback is weaker than it looks once

"the agent also generated the tests." A verifier the generator authored is not

independent of the thing it checks, so a passing suite certifies less than it

seems to — the same gap, one layer up.

The rename gate from earlier is a perfect specimen — a structural proxy of exactly

the kind I just praised, which makes it the more humbling, because it's one I'm

proud of. It greps for the retired word and passes when the word is gone. But "the

old word is absent" and "the concept was correctly migrated" are not the same claim. You can delete every instance of the old term and still

have left the idea it named half-translated, split across two new words that

should have been one, or attached to the wrong entity. The gate is green and the

migration is wrong, and now nobody's reading the diff with the old suspicion,

because the check says it's handled. The check is real and worth having. It just

proves a narrower thing than its green checkmark advertises, and the discipline is

to keep knowing the difference instead of letting the green absolve you.

That's the trap in one move. You were watching the thing carefully when nothing

claimed to watch it for you. Now something claims to, so you look away, and the gap

between what the check proves and what it looks like it proves is exactly where

the next bug lives. When those two diverge and the invariant matters, that argues

against the cheap gate, not for it.

This isn't an argument against gates. It's an argument for knowing which of your

green checkmarks are load-bearing and which are decorative.

The other thing I had been collapsing was the choice itself. I'd been treating it

as binary: gate the thing, or stay reactive and fix it by hand when it breaks.

There are at least four rungs, and most failures want one of the middle two.

The distinction isn't mine — manufacturing got there decades ago. Shigeo Shingo's

poka-yoke, the mistake-proofing of the Toyota Production System, already splits a control that makes the error

impossible from a warning that only signals it. The two middle rungs below are

those two ideas wearing software clothes. Birgitta Böckeler's

harness engineering gives the orthogonal axis: guides that steer the agent before it acts, sensors

that observe after. A blocking gate is a sensor with teeth; a default path is a

guide that leaves nothing to sense.

The strongest rung isn't a gate at all. It's the default path — Shingo's

control type: make the wrong artifact hard to author in the first place. A

template, a generator, a structured section that only has slots for the right

shape. There's nothing to false-positive on, because you're not checking after the

fact, you're removing the way to get it wrong. When a failure is common but a

precise gate would be noisy, this beats the gate.

Below the blocking gate sits the tripwire — Shingo's warning type: warn without

blocking, or ask for confirmation before an irreversible step. This is the right

reach for failures that are catastrophic on first occurrence but genuinely hard to

gate cleanly — the data deletion, the leak, the fabrication. You don't need a perfect detector. A loud "are

you sure, here's what you're about to overwrite" buys most of the protection at a

fraction of the false-positive cost. The mistake is letting "we can't build a clean

gate for this" collapse into "so we stay fully reactive" on a failure you can't

take twice.

The rung that doesn't work is the one that looks like discipline: a script everyone

is supposed to run but nothing enforces. It's neither a gate nor a default path,

so under any real deadline it just gets skipped. A rule that lives only in prose is

a preference, however firmly phrased.

So the ladder, top to bottom: blocking gate, default path, tripwire, stay reactive.

"Should I gate this" was always the wrong question. The question is which rung the

failure earns.

Because a gate is a standing liability, "earned" is a real filter, and a filter

that only ever says yes isn't filtering. The same cost test that licenses a

proactive gate also tells you when to leave something ungated, and I find I reach

for it in that direction about as often. A concrete one: I recently changed a documentation convention across the project. A

pure convention change, no behavior, no domain term retired. The reflex in a repo

that leans this hard on enforcement is to add a check for the new convention. What

I wrote instead was an instruction to not:

Don't add a gate or script — this is a convention change only; the enforcement

hook stays deferred per the cost test.

The convention isn't load-bearing, a check on it would generate churn every time

the convention itself evolves, and the failure if someone gets it wrong is

cosmetic. No gate. Not yet, maybe not ever.

The same logic runs in reverse, against gates that already exist. A check family

only grows: one iteration of mine added six at once, and nothing in the workflow

ever pushes the other way. The tidy instinct is a recurring "prune the gates"

ritual at every close. That instinct is right about the pressure and wrong about

the mechanism, for two reasons that separate a gate from a stale paragraph of

prose. A doc paragraph taxes every session, so doc-bloat is pure, constant rent; a

redundant gate taxes only when it runs, so its bloat is mostly latent. And a stale

doc is just noise, while a gate is a load-bearing invariant whose removal risks

silently dropping a correctness check no one notices is gone. So pruning earns more

care and less frequency than tidying prose, not the same recurring slot. A full

portfolio review every close would usually find nothing, and a ritual that usually

finds nothing is the skipped-discipline trap again, dressed as housekeeping.

Two corollaries fall out. "Merge these two similar gates" defaults to no: two

precise checks beat one fuzzy merged one, and a merge is a design act with its own

failure mode, not janitorial cleanup. And the counter-pressure that is warranted

should itself be mechanized rather than left to willpower — a redundancy meta-check

that reads the coupling each gate already declares in its header and flags real

overlap, fired on a threshold or a phase boundary, never once per commit. Even

pruning gates wants a rung, not a resolution to be tidier.

A good gate even pays a rent rebate. Every rule you load into the agent's

always-present instructions is standing context it has to carry, and that context

isn't free. A deterministic check lets you delete the paragraph of prose that

used to beg the model to remember a rule and replace it with a script plus a

reference read only when it fires. The gate does the remembering, so the context

window doesn't have to — fewer gates can mean a heavier prompt, not a lighter one.

The recursion is the part I like best: the system diagnosed the bias on itself.

Everything in this project leans toward enforcement over discipline. The agent

named the hazard better than I had:

the whole

check-*

script family leans hard toward enforce-over-discipline, so

a fresh session inherits a "when in doubt, add a check" texture with no stated

boundary. That absence is a latent gap.

A bias toward gating with nothing written down about when not to. The

honest response, by this essay's own logic, is not to immediately write the

boundary rule into the always-loaded instructions. That would be exactly the

proactive over-gating the rule warns against. The boundary rule hasn't failed yet.

So we left it ungated, with a concrete trigger: the first time any session ships a

gate that turns out high-false-positive or merely approximate, that's the

observed failure, and the boundary rule gets promoted then — and into a

read-on-demand reference, not the always-loaded file, because it's judgment-grade

material rather than a per-session directive. The meta-rule about earning gates is

itself being made to earn its place. The failure log is the design document, all

the way up.

This worked for one person, on one roughly 150,000-line codebase, where I own the

glossary and the domain model is a single coherent thing in a single head. "Earned

from failure" is cheap when the failure is yours and you hit it the same day you build the gate. At team scale the picture is harder: the standing context cost

compounds, the false-positive triage falls on people who didn't feel the original

pain, and "let it go wrong cheaply once" is a different proposition when the once is

someone else's afternoon. I don't have evidence for the team case. I have evidence

for the solo case and a suspicion the principles survive the move, which is not the same thing.

One tempting claim I'll resist. I had a hunch, watching the sessions, that the agent

resists proposing gates proactively — that it fixes the immediate issue and only

designs the enforcement when pushed. There's a clean-looking receipt for it. Asked

to prevent a recurrence after I'd approved a fix, the agent replied:

Approved. Let me make the fix first, then design the enforcement check.

Fix first, gate second, only once prompted. But when I went looking for the

pattern, the support was mixed. In the flow of a task it does tend to fix first and

gate when asked. Hand it the same question as a matter of policy, though, and it

argues for more proaction than I'd proposed, not less. So I won't dress that up as

a clean behavioral finding. It's a real tension I don't yet understand well enough

to assert.

None of this is specific to my domain. Any agent operating on a governed system —

a customer-service platform, a billing pipeline, anything where words have to mean

the same thing across many surfaces — accumulates the same class of drift, and the

same four rungs apply.

Picture the Contact Center case concretely, since it's the domain I know best. A

single concept gets named in a routing rule, a reporting dashboard, an agent-facing

script, and the schema of the system that ties them together. Rename it in three of those

four and the fourth keeps quietly emitting the old word, so the dashboard and the

router disagree about what they're counting, and nothing crashes. That's the drift,

and it's invisible to every test that checks behavior rather than vocabulary. The

gate that couples the surfaces is the only thing that catches a disagreement no

single component is wrong about. That coupling pattern, applied in both directions,

is the subject of a companion essay, Couple Both Ways. The transferable move is to stop treating your instruction file as the place where

correctness lives. Prose is where preferences live. Correctness lives in the gates,

default paths, and tripwires you build, each one shaped to a failure specific

enough to point at. You don't govern an agent by

predicting how it'll go wrong. You let it go wrong cheaply where you can, and you

convert each real miss into the cheapest rung that makes it impossible to miss that

way again.

This essay was written by directing a coding agent over the project it describes; I direct and judge, the agent drafts and argues back. The argument-back, in this case, is most of beats four and five.

I build governed agent systems at the intersection of Contact Center software and AI. If that's a problem you're chewing on, I'm reachable on LinkedIn.

Published at vasyltretiakov.dev.

source & further reading

dev.to — original article Looking to Collaborate with Developers on AI, Web, or Startup Projects I Wrote Integration Tests for My MCP Failure Library. Here's the Pattern That Caught 3 Hidden Bugs. Why I Believe Vibe Coding Is Becoming a Real Engineering skills

Gates Earned From Failure: a cost test for agent guardrails

Run your AI side-project on zahid.host