Your Fuzzer Is Only as Smart as Its Oracle


A migration my schema tool generated passed every check I had. The final schema matched the target exactly: `convergence`, green. Then I looked at the plan it took to get there: `DROP TABLE; CREATE TABLE`. On a table with data. The destination was right; the path would have erased a production database.

The test was green because I was checking the wrong thing. I had an oracle for where it ended up and none for how it got there.

That gap is most of what this post is about. I build developer tools — SDKs, a compiler, a declarative schema-management system — the kind of software where one wrong edge case ships to everyone downstream. And I spent a long time trying to get coding agents to test this stuff for me. Claude Code, Codex — with enough prompting and the right skills you can get something that looks like a test suite. What I could never get was coverage that followed the real dev flow without quietly skipping the case that mattered. It's convincing right up until you check whether it checked anything.

What finally worked better was semantic fuzzing: constrained, deterministic random generation, run against properties instead of a reference output, with the agent writing the generators and domain rules rather than playing tester. It catches a lot. And the more I run the harness, the more it improves.

But the interesting part isn't the fuzzing. AI made the building of the harness cheap — generators, adapters, the long tail of domain rules. It did not make the harness run itself correctly, and it did not write the part that decides what "correct" means when you have no second implementation to diff against. That part was always the real work, and it still is.

So this is a post about oracles wearing a fuzzing costume. As usual: notes on where my thinking has drifted, not advice.

Where I've landed for now (and expect to revise):

- Randomness doesn't find bugs. The oracle does; randomness just walks it there.
- So the value isn't the generator. It's the law you check.
- The laws worth having need no reference implementation: they're algebraic relations the tool must satisfy by its own logic.
- AI dropped the cost of the harness. The scarce skill moved from headcount to seeing what the law is.

"I fuzz it" says almost nothing on its own. At one end of the dial is crash fuzzing — uniform noise at a parser, asking only "did it fall over?" At the other is reference-model testing — diffing every output against a full second implementation you now have to build and trust. Semantic fuzzing sits in between, and the whole game is what you do with the middle.

Three properties make it work. Generation is distribution-constrained (biased toward inputs with real structure — schemas and migration sequences that could plausibly exist — not uniform noise), deterministic (every case carries a seed, so a failure replays exactly), and shrinkable (a failing case minimizes itself to the smallest thing that still breaks).
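A minimal sketch of those three properties, under stated assumptions: the `Op` shape, the `lcg` generator, `genOps`, `shrink`, and the `hitsBug` predicate are all hypothetical names made up for illustration, not the API of any real tool. The generator only indexes or drops columns that currently exist (distribution-constrained), a seed fully determines the sequence (deterministic), and the shrinker greedily deletes steps while the failure persists (shrinkable).

```typescript
// Hypothetical migration step; a real tool would carry far more detail.
type Op =
  | { kind: "addColumn"; column: string }
  | { kind: "addIndex"; column: string }
  | { kind: "dropColumn"; column: string };

// Tiny linear congruential generator: same seed, same stream of numbers.
function lcg(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296;
  };
}

// Distribution-constrained: indexes and drops only target columns that exist.
function genOps(seed: number, length: number): Op[] {
  const rand = lcg(seed);
  const live: string[] = [];
  const ops: Op[] = [];
  for (let i = 0; i < length; i++) {
    const r = rand();
    const target = live[Math.floor(rand() * live.length)];
    if (target === undefined || r < 0.5) {
      const column = `c${i}`;
      live.push(column);
      ops.push({ kind: "addColumn", column });
    } else if (r < 0.8) {
      ops.push({ kind: "addIndex", column: target });
    } else {
      live.splice(live.indexOf(target), 1);
      ops.push({ kind: "dropColumn", column: target });
    }
  }
  return ops;
}

// Greedy shrinker: drop one step at a time for as long as the case still fails.
function shrink<T>(input: T[], fails: (xs: T[]) => boolean): T[] {
  let current = input;
  let progress = true;
  while (progress) {
    progress = false;
    for (let i = 0; i < current.length; i++) {
      const candidate = current.slice(0, i).concat(current.slice(i + 1));
      if (fails(candidate)) {
        current = candidate;
        progress = true;
        break;
      }
    }
  }
  return current;
}

// Stand-in for "a law failed": an indexed column is dropped later on.
function hitsBug(ops: Op[]): boolean {
  return ops.some(
    (a, i) =>
      a.kind === "addIndex" &&
      ops.slice(i + 1).some((b) => b.kind === "dropColumn" && b.column === a.column),
  );
}
```

Replaying a failure is then just `genOps(seed, length)` with the recorded seed, and `shrink(ops, hitsBug)` reduces any failing sequence to the two steps that matter. A production shrinker would also re-validate that each candidate is still a plausible sequence; this sketch skips that.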

That's property-based testing's family tree, and I'd rather say so than pretend it's new. The goal isn't maximum randomness — it's productive randomness: reproducible, minimizable, aimed where bugs live. The one thing it refuses to do is what reference-model testing does — build a second implementation to check the answer. It checks relations the answer must obey instead, which is the move the rest of this post turns on.

A fuzzer is never smarter than its oracle.

The generator is just search. It produces inputs; it has no idea what's wrong with any of them. The thing that decides this is a bug is the oracle — and whatever the oracle can't recognize, the fuzzer cannot find, however many cases you burn.

Point a billion inputs at a crash oracle and you get crashes. You won't get a migration that converges to the right state via a catastrophic path, or a rollback that silently drops a constraint. Those don't crash. They're wrong semantically, and a crash oracle is blind to exactly that.

The trap here is reaching for a reference implementation — "I'll build a correct model and diff against it." But a full reference model of a schema engine is a second schema engine, with its own bugs to maintain. You doubled the work and got a thing you also can't trust.

The way out: stop asking for a reference, ask for laws — relations the tool must satisfy regardless of the "right answer."

| Law | Relation checked | Reference needed? | Axis |
| --- | --- | --- | --- |
| convergence | `apply(spec)` → introspect → residual drift `== 0` | no | correctness |
| idempotency | `apply(spec)` twice → 2nd run is a no-op | no | correctness |
| rollback | `base → target → base` → back to base, exactly | no | correctness |
| safety | destructive change → must surface hazard / require approval | no | safety |

None needs the correct schema in advance; each is true by the meaning of the operation. `convergence` doesn't ask "is this schema right?", it asks "did you reach the state you claimed?" That's the metamorphic move: don't verify the answer, verify a relation the answer must obey.
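To make the correctness laws concrete, here is a toy sketch against an in-memory model. Everything in it is an assumption for illustration: `Schema`, `Step`, `plan`, `run`, `canonical`, and `brokenLaws` are hypothetical names, and a schema is reduced to a map from table name to column names. The point is only that none of the three checks consults a second implementation; each compares the tool's output with something the tool itself claimed.

```typescript
// Toy model: table name -> column names. All names are hypothetical.
type Schema = Record<string, string[]>;
type Step = {
  op: "createTable" | "dropTable" | "addColumn" | "dropColumn";
  table: string;
  column?: string;
};

// The "tool under test": computes the steps that turn `from` into `to`.
function plan(from: Schema, to: Schema): Step[] {
  const steps: Step[] = [];
  for (const table of Object.keys(to)) {
    const have: string[] | undefined = from[table];
    const want: string[] = to[table] ?? [];
    if (have === undefined) steps.push({ op: "createTable", table });
    for (const column of want) {
      if ((have ?? []).indexOf(column) === -1) steps.push({ op: "addColumn", table, column });
    }
    for (const column of have ?? []) {
      if (want.indexOf(column) === -1) steps.push({ op: "dropColumn", table, column });
    }
  }
  for (const table of Object.keys(from)) {
    if (to[table] === undefined) steps.push({ op: "dropTable", table });
  }
  return steps;
}

// Executes steps against a copy of the database and returns the new state.
function run(db: Schema, steps: Step[]): Schema {
  const next: Schema = {};
  for (const t of Object.keys(db)) next[t] = (db[t] ?? []).slice();
  for (const s of steps) {
    const cols: string[] = next[s.table] ?? [];
    if (s.op === "dropTable") delete next[s.table];
    else if (s.op === "createTable") next[s.table] = cols;
    else if (s.op === "addColumn" && s.column !== undefined) next[s.table] = cols.concat([s.column]);
    else if (s.op === "dropColumn") next[s.table] = cols.filter((c) => c !== s.column);
  }
  return next;
}

// Order-insensitive fingerprint, standing in for "introspect and compare".
function canonical(s: Schema): string {
  return JSON.stringify(Object.keys(s).sort().map((t) => [t, (s[t] ?? []).slice().sort()]));
}

// Returns the names of the laws that do NOT hold for this base/target pair.
function brokenLaws(base: Schema, target: Schema): string[] {
  const broken: string[] = [];
  const applied = run(base, plan(base, target));
  if (canonical(applied) !== canonical(target)) broken.push("convergence");
  if (plan(applied, target).length !== 0) broken.push("idempotency");
  const rolledBack = run(applied, plan(applied, base));
  if (canonical(rolledBack) !== canonical(base)) broken.push("rollback");
  return broken;
}
```

Feeding generated `base`/`target` pairs into `brokenLaws` and treating any non-empty result as a failure is the whole oracle for this axis. Note that this toy has no notion of data, so it cannot see the path problem from the opening; that needs a separate law.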

Note `safety` as a peer, not a footnote. "Reached the right state" and "took a safe path" are different axes: the `DROP TABLE; CREATE TABLE` from the opening satisfies convergence and still erases your data. The oracle must check the path, not just the destination. This is the law that would have caught my green-but-catastrophic migration.
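A path-checking law can be sketched in a few lines. This is an assumption-laden illustration, not the author's implementation: `PlanStep`, the `approved` flag, `rowCounts`, and `safetyViolations` are hypothetical, and "destructive" is narrowed here to dropping a table or column that still holds rows.

```typescript
// Hypothetical structured plan; a real planner would emit richer steps.
type PlanStep =
  | { kind: "createTable"; table: string }
  | { kind: "addColumn"; table: string; column: string }
  | { kind: "dropTable"; table: string; approved?: boolean }
  | { kind: "dropColumn"; table: string; column: string; approved?: boolean };

// Safety law: a destructive step on a table with data must carry explicit approval.
// `rowCounts` is assumed to come from the database the plan would run against.
function safetyViolations(plan: PlanStep[], rowCounts: Record<string, number>): string[] {
  const violations: string[] = [];
  plan.forEach((step, i) => {
    if (step.kind !== "dropTable" && step.kind !== "dropColumn") return;
    const rows = rowCounts[step.table] ?? 0;
    if (rows > 0 && step.approved !== true) {
      violations.push(`step ${i}: ${step.kind} on "${step.table}" (${rows} rows) without approval`);
    }
  });
  return violations;
}

// The plan from the opening: right destination, catastrophic path.
const dropAndRecreate: PlanStep[] = [
  { kind: "dropTable", table: "users" },
  { kind: "createTable", table: "users" },
];
console.log(safetyViolations(dropAndRecreate, { users: 3 }));
```

The check looks only at the plan and the current row counts, so it runs before anything is applied and needs no knowledge of what the "right" schema is.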

One generator feeds a pipeline; every stage is a place to hang a law.

 generate schema + mutation sequence
            │
            ▼
   IR / shadow DB                ── convergence, idempotency  (in-process)
            │
            ▼
   diff + online-migration plan  ── safety, plan honesty      (in-process)
            │
            ▼
   apply to Docker Postgres/MySQL── locks, extensions, races  (Docker)
            │
            ▼
   introspect + review           ── drift == 0

Run that across thousands of seeds and you're not testing "does `ADD COLUMN` work." You're hitting combinations a human suite never reaches, especially extension combinations. Base PostgreSQL/MySQL is the easy part; bugs hide where `citext` meets a generated column meets a partial index meets an old trigger. Enumerating those by hand used to need a team. Generated, it's a distribution you tune.
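"A distribution you tune" can be as small as a table of per-feature probabilities. The sketch below is hypothetical throughout: the `Feature` names echo the examples in the text, but the weights, `lcg`, and `sampleFeatures` are invented for illustration.

```typescript
// Features whose combinations we want to exercise; names follow the examples above.
type Feature = "citext" | "generatedColumn" | "partialIndex" | "legacyTrigger";

const FEATURES: Feature[] = ["citext", "generatedColumn", "partialIndex", "legacyTrigger"];

// Assumed weights: the probability that a generated schema includes each feature.
const defaultWeights: Record<Feature, number> = {
  citext: 0.3,
  generatedColumn: 0.4,
  partialIndex: 0.5,
  legacyTrigger: 0.2,
};

// Seeded generator so that a feature combination replays from its seed.
function lcg(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296;
  };
}

// One independent draw per feature; raising a weight makes its combinations more common.
function sampleFeatures(seed: number, weights: Record<Feature, number>): Feature[] {
  const rand = lcg(seed);
  return FEATURES.filter((feature) => rand() < weights[feature]);
}
```

Tuning then means editing `defaultWeights`: if failures cluster around one feature, raise its weight so more seeds land on its combinations.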

But be honest about which layer answers:

| Layer | Runtime | Answers | Can't answer |
| --- | --- | --- | --- |
| In-process | PGlite / `node:sqlite` | convergence, idempotency, real SQL exec | multi-connection, lock fidelity, full extension catalog |
| Docker | real Postgres / MySQL | locks, isolation, extensions, version matrix | provider parity, hosted-backend quirks |
| Live | real cloud / hosted | IAM/KMS enforcement, audit immutability | (bulk testing: too slow/expensive) |

The same generated case fans out: cheap laws run in-process by the thousand; expensive laws that need a real engine run in Docker, sampled. The layer is chosen per property, not per pipeline. Putting a lock-contention law in the in-process tier wouldn't make it fast — it would make it lie.
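One way to express "chosen per property, not per pipeline" is to attach the layer to the law itself. This is a sketch under assumptions: the `Law` record, the `everyNthSeed` sampling knob, and the specific rates are hypothetical, picked only to show cheap laws running on every seed and expensive ones being sampled.

```typescript
type Layer = "in-process" | "docker" | "live";

// Each law declares the cheapest layer that can honestly answer it,
// plus how often to pay for it (1 = every seed). Rates are made up.
type Law = { name: string; layer: Layer; everyNthSeed: number };

const laws: Law[] = [
  { name: "convergence", layer: "in-process", everyNthSeed: 1 },
  { name: "idempotency", layer: "in-process", everyNthSeed: 1 },
  { name: "lock-contention", layer: "docker", everyNthSeed: 20 },
  { name: "extension-matrix", layer: "docker", everyNthSeed: 50 },
  { name: "audit-immutability", layer: "live", everyNthSeed: 1000 },
];

// For one generated case (identified by its seed), decide which laws run where.
function schedule(seed: number): Record<Layer, string[]> {
  const byLayer: Record<Layer, string[]> = { "in-process": [], docker: [], live: [] };
  for (const law of laws) {
    if (seed % law.everyNthSeed === 0) byLayer[law.layer].push(law.name);
  }
  return byLayer;
}
```

Because the sampling is a pure function of the seed, a Docker-tier failure replays with the same seed exactly like an in-process one.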
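
"Run it more, it gets better" is true, but not because green accumulates. A suite at 100% pass proves nothing about whether it can catch anything. The flywheel only turns toward value if, on every failure, you do two things: minimize the failing case into the corpus, then inject a fault and check that the oracle notices: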

fault injected ──▶ does a law fail?
                     ├─ yes → oracle bites. trust the green.
                     └─ no  → green is decorative. fix the oracle.

That second step is the one I see skipped most, and the one that keeps the whole thing from becoming theater. An oracle you never test is a claim, not a check. Honest flywheel: minimize failures into the corpus, and verify the verifier. Do that and it genuinely compounds.
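Verifying the verifier can be sketched as a small mutation check: hand the laws deliberately broken variants of the tool and list the ones that still come out green. The names below (`Outcome`, `Migrate`, `mutants`, `survivors`) and both injected faults are hypothetical, chosen to mirror the examples in this post.

```typescript
// A migration result: the final columns plus the steps taken to get there.
type Outcome = { columns: string[]; steps: string[] };
type Migrate = (from: string[], to: string[]) => Outcome;
type Law = (from: string[], to: string[], out: Outcome) => boolean; // true = law holds

const healthy: Migrate = (from, to) => ({
  columns: to.slice().sort(),
  steps: to
    .filter((c) => from.indexOf(c) === -1)
    .map((c) => `addColumn ${c}`)
    .concat(from.filter((c) => to.indexOf(c) === -1).map((c) => `dropColumn ${c}`)),
});

// Injected faults: each mutant is wrong in a way some law ought to notice.
const mutants: { name: string; migrate: Migrate }[] = [
  {
    name: "forgets-drops", // wrong destination: stale columns are left behind
    migrate: (from, to) => ({
      columns: from.concat(to.filter((c) => from.indexOf(c) === -1)).sort(),
      steps: to.filter((c) => from.indexOf(c) === -1).map((c) => `addColumn ${c}`),
    }),
  },
  {
    name: "drop-and-recreate", // right destination, catastrophic path
    migrate: (_from, to) => ({ columns: to.slice().sort(), steps: ["dropTable", "createTable"] }),
  },
];

const convergence: Law = (_from, to, out) => out.columns.join(",") === to.slice().sort().join(",");
const noTableDrop: Law = (_from, _to, out) => out.steps.indexOf("dropTable") === -1;

// Mutants that pass every law on every case: each survivor is a blind spot in the oracle.
function survivors(laws: Law[], cases: [string[], string[]][]): string[] {
  return mutants
    .filter((m) =>
      cases.every(([from, to]) => {
        const out = m.migrate(from, to);
        return laws.every((law) => law(from, to, out));
      }),
    )
    .map((m) => m.name);
}

const cases: [string[], string[]][] = [[["id", "legacy"], ["id", "email"]]];
console.log(survivors([convergence], cases)); // the path-blind oracle misses one mutant
console.log(survivors([convergence, noTableDrop], cases)); // adding the path law kills it
```

An empty survivor list is what earns the right to trust a green run. The healthy implementation should also pass every law, otherwise the oracle is merely noisy rather than sharp.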

A few years ago this pipeline was a staffing decision: generators, shadow DB, migration planner, Docker orchestration, per-dialect adapters — a team and a quarter. The barrier wasn't ideas; it was implementation labor, and that priced most people out of serious simulation testing.

That wall came down, and I can point to where. A per-dialect adapter — the layer that maps my IR onto one database's quirks — used to be a multi-day slog of reading docs and discovering edge cases by getting burned. The last one I added came out in an afternoon: I described the IR and the target's introspection format, the agent drafted the adapter and a first pass of its quirk-rules, and I spent my time reviewing rather than typing. Parsing is a commodity, local execution is cheap, and the long tail of boilerplate writes itself faster than I can spec it.

But notice exactly what got cheap. Building the harness did. Knowing which relations are actually invariant did not. Routing each law to a layer that can honestly answer it did not. Checking that the checker still bites did not. The agent wrote the adapter; it could not tell me whether `safety` belonged on the same axis as `convergence`, or whether my green meant anything. The scarce skill went from "can you build the harness" to "can you see what the law is." The implementation got democratized. The judgment didn't.

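Local simulation kills a startling fraction of bugs for almost no money. It does not replace a remote environment: IAM/KMS enforcement, audit immutability, provider parity, and hosted-backend quirks only show up against a real cloud or hosted backend.

These live on a rare live tier, used as calibration anchors, not bulk testing. Local says the logic is right; only live says the platform agrees.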

I came in through a practical door — semantic fuzzing catches a lot of bugs cheaply — and walked out somewhere narrower: the random part was never the point. The oracle was, and AI made everything except the oracle cheap enough to finally see that.

In the AI era, rigor is what pays off: it compresses a vague situation into a few hard facts an agent can act on. Here that takes its sharpest form. Rigor is what turns blind randomness into a bug-finding machine: the law does the finding; the noise just does the walking.

The implementation barrier is down. Knowing what "correct" even means for the thing you built is the whole job now.

Which brings me back to that green migration with `DROP TABLE` hiding in its path. The fuzzer didn't miss it because the randomness was weak. It missed it because I hadn't yet written the law that calls erasing your data a bug. The day I did, generation found the case in minutes; it had been walking past it the whole time, waiting for me to say it mattered. That's the job now: not running more cases, but learning to name the failures worth catching, one law at a time.

Source: dev.to (original article)