{"slug": "your-fuzzer-is-only-as-smart-as-its-oracle", "title": "Your Fuzzer Is Only as Smart as Its Oracle", "summary": "A developer building schema-management tools discovered that their migration test passed despite using a destructive DROP TABLE; CREATE TABLE path, because the oracle only checked the final state, not the process. They advocate for semantic fuzzing—constrained, deterministic random generation against algebraic properties rather than reference outputs—as a more effective testing approach. The key insight is that a fuzzer's value depends entirely on its oracle, and AI has lowered the cost of building test harnesses but not of defining what correctness means.", "body_md": "A migration my schema tool generated passed every check I had. The final schema matched the target exactly — `convergence`, green. Then I looked at the plan it took to get there: `DROP TABLE; CREATE TABLE`. On a table with data. The destination was right; the path would have erased a production database.\n\nThe test was green because I was checking the wrong thing. I had an oracle for *where it ended up* and none for *how it got there*.\n\nThat gap is most of what this post is about. I build developer tools — SDKs, a compiler, a declarative schema-management system — the kind of software where one wrong edge case ships to everyone downstream. And I spent a long time trying to get coding agents to test this stuff for me. Claude Code, Codex — with enough prompting and the right skills you can get something that *looks* like a test suite. What I could never get was coverage that followed the real dev flow without quietly skipping the case that mattered. It's convincing right up until you check whether it checked anything.\n\nWhat finally worked better was **semantic fuzzing**: constrained, deterministic random generation, run against *properties* instead of a reference output, with the agent writing the generators and domain rules rather than playing tester. 
It catches a lot. And the more I run the harness, the more it improves.\n\nBut the interesting part isn't the fuzzing. AI made the *building* of the harness cheap — generators, adapters, the long tail of domain rules. It did **not** make the harness run itself correctly, and it did not write the part that decides what \"correct\" means when you have no second implementation to diff against. That part was always the real work, and it still is.\n\nSo this is a post about oracles wearing a fuzzing costume. As usual: notes on where my thinking has drifted, not advice.\n\nWhere I've landed for now (and expect to revise):\n\n- Randomness doesn't find bugs. The oracle does; randomness just walks it there.\n- So the value isn't the generator. It's the law you check.\n- The laws worth having need no reference implementation — they're algebraic relations the tool must satisfy by its own logic.\n- AI dropped the cost of the harness. The scarce skill moved from headcount to seeing what the law is.\n\n\"I fuzz it\" says almost nothing on its own. At one end of the dial is crash fuzzing — uniform noise at a parser, asking only \"did it fall over?\" At the other is reference-model testing — diffing every output against a full second implementation you now have to build and trust. Semantic fuzzing sits in between, and the whole game is what you do with the middle.\n\nThree properties make it work. Generation is **distribution-constrained** (biased toward inputs with real structure — schemas and migration sequences that could plausibly exist — not uniform noise), **deterministic** (every case carries a seed, so a failure replays exactly), and **shrinkable** (a failing case minimizes itself to the smallest thing that still breaks).\n\nThat's property-based testing's family tree, and I'd rather say so than pretend it's new. The goal isn't *maximum* randomness — it's *productive* randomness: reproducible, minimizable, aimed where bugs live. 
The one thing it refuses to do is what reference-model testing does — build a second implementation to check the answer. It checks *relations* the answer must obey instead, which is the move the rest of this post turns on.\n\n**A fuzzer is never smarter than its oracle.**\n\nThe generator is just search. It produces inputs; it has no idea what's wrong with any of them. The thing that decides *this is a bug* is the oracle — and whatever the oracle can't recognize, the fuzzer cannot find, however many cases you burn.\n\nPoint a billion inputs at a crash oracle and you get crashes. You won't get a migration that converges to the right state via a catastrophic path, or a rollback that silently drops a constraint. Those don't crash. They're wrong *semantically*, and a crash oracle is blind to exactly that.\n\nThe trap here is reaching for a reference implementation — \"I'll build a correct model and diff against it.\" But a full reference model of a schema engine *is* a second schema engine, with its own bugs to maintain. You doubled the work and got a thing you also can't trust.\n\nThe way out: stop asking for a reference, ask for **laws** — relations the tool must satisfy regardless of the \"right answer.\"\n\n| Law | Relation checked | Reference needed? | Axis |\n|---|---|---|---|\n| convergence | `apply(spec)` → introspect → residual drift `== 0` | no | correctness |\n| idempotency | `apply(spec)` twice → 2nd run is a no-op | no | correctness |\n| rollback | `base → target → base` → back to base, exactly | no | correctness |\n| safety | destructive change → must surface hazard / require approval | no | safety |\n\nNone needs the correct schema in advance — each is true *by the meaning of the operation*. `convergence` doesn't ask \"is this schema right?\", it asks \"did you reach the state you claimed?\" That's the metamorphic move: don't verify the answer, verify a *relation* the answer must obey.\n\nNote `safety` as a peer, not a footnote. 
\"Reached the right state\" and \"took a safe path\" are different axes — the `DROP TABLE; CREATE TABLE` from the opening satisfies convergence and still erases your data. The oracle must check the path, not just the destination. This is the law that would have caught my green-but-catastrophic migration.\n\nOne generator feeds a pipeline; every stage is a place to hang a law.\n\n```\n generate schema + mutation sequence\n            │\n            ▼\n   IR / shadow DB                ── convergence, idempotency  (in-process)\n            │\n            ▼\n   diff + online-migration plan  ── safety, plan honesty      (in-process)\n            │\n            ▼\n   apply to Docker Postgres/MySQL── locks, extensions, races  (Docker)\n            │\n            ▼\n   introspect + review           ── drift == 0\n```\n\nRun that across thousands of seeds and you're not testing \"does `ADD COLUMN` work.\" You're hitting combinations a human suite never reaches — especially **extension combinations**. Base PostgreSQL/MySQL is the easy part; bugs hide where `citext` meets a generated column meets a partial index meets an old trigger. Enumerating those by hand used to need a team. Generated, it's a distribution you tune.\n\nBut be honest about *which layer answers*:\n\n| Layer | Runtime | Answers | Can't answer |\n|---|---|---|---|\n| In-process | PGlite / `node:sqlite` | convergence, idempotency, real SQL exec | multi-connection, lock fidelity, full extension catalog |\n| Docker | real Postgres / MySQL | locks, isolation, extensions, version matrix | provider parity, hosted-backend quirks |\n| Live | real cloud / hosted | IAM/KMS enforcement, audit immutability | (bulk testing — too slow/expensive) |\n\nThe same generated case **fans out**: cheap laws run in-process by the thousand; expensive laws that need a real engine run in Docker, sampled. The layer is chosen per *property*, not per pipeline. 
Putting a lock-contention law in the in-process tier wouldn't make it fast — it would make it lie.\n\n\"Run it more, it gets better\" is true — but not because green accumulates. A suite at 100% pass proves nothing about whether it can *catch* anything. The flywheel only turns toward value if, on every failure, you do two things: minimize the failing case into the corpus, then inject a fault and check that a law fails:\n\n```\nfault injected ──▶ does a law fail?\n                     ├─ yes → oracle bites. trust the green.\n                     └─ no  → green is decorative. fix the oracle.\n```\n\nThat second step is the one I see skipped most, and the one that keeps the whole thing from becoming theater. An oracle you never test is a claim, not a check. Honest flywheel: *minimize failures into the corpus, and verify the verifier.* Do that and it genuinely compounds.\n\nA few years ago this pipeline was a staffing decision: generators, shadow DB, migration planner, Docker orchestration, per-dialect adapters — a team and a quarter. The barrier wasn't ideas; it was implementation labor, and that priced most people out of serious simulation testing.\n\nThat wall came down, and I can point to where. A per-dialect adapter — the layer that maps my IR onto one database's quirks — used to be a multi-day slog of reading docs and discovering edge cases by getting burned. The last one I added came out in an afternoon: I described the IR and the target's introspection format, the agent drafted the adapter and a first pass of its quirk-rules, and I spent my time reviewing rather than typing. Parsing is a commodity, local execution is cheap, and the long tail of boilerplate writes itself faster than I can spec it.\n\nBut notice exactly *what* got cheap. Building the harness did. Knowing which relations are actually invariant did not. Routing each law to a layer that can honestly answer it did not. Checking that the checker still bites did not. 
The agent wrote the adapter; it could not tell me whether `safety` belonged on the same axis as `convergence`, or whether my green meant anything. The scarce skill went from \"can you build the harness\" to \"can you see what the law is.\" The implementation got democratized. The judgment didn't.\n\nLocal simulation kills a startling fraction of bugs for almost no money. It does not replace a remote environment. The checks no local layer can answer, such as IAM/KMS enforcement, audit immutability, provider parity, and hosted-backend quirks, live on a rare **live tier**, used as calibration anchors, not bulk testing. Local says *the logic is right*; only live says *the platform agrees*.\n\nI came in through a practical door — semantic fuzzing catches a lot of bugs cheaply — and walked out somewhere narrower: the random part was never the point. The oracle was, and AI made everything *except* the oracle cheap enough to finally see that.\n\nIn the AI era, rigor is what pays off: it compresses a vague situation into a few hard facts an agent can act on. Here that takes its sharpest form. Rigor is what turns blind randomness into a bug-finding machine: the law does the finding; the noise just does the walking.\n\nThe implementation barrier is down. Knowing what \"correct\" even means for the thing you built is the whole job now.\n\nWhich brings me back to that green migration with `DROP TABLE` hiding in its path. The fuzzer didn't miss it because the randomness was weak. It missed it because I hadn't yet written the law that calls erasing your data a bug. The day I did, generation found the case in minutes — it had been walking past it the whole time, waiting for me to say it mattered. 
That's the job now: not running more cases, but learning to name the failures worth catching, one law at a time.", "url": "https://wpnews.pro/news/your-fuzzer-is-only-as-smart-as-its-oracle", "canonical_source": "https://dev.to/gyu07/your-fuzzer-is-only-as-smart-as-its-oracle-4mbd", "published_at": "2026-06-24 14:07:23+00:00", "updated_at": "2026-06-24 14:09:19.146107+00:00", "lang": "en", "topics": ["developer-tools", "ai-agents"], "entities": ["Claude Code", "Codex"], "alternates": {"html": "https://wpnews.pro/news/your-fuzzer-is-only-as-smart-as-its-oracle", "markdown": "https://wpnews.pro/news/your-fuzzer-is-only-as-smart-as-its-oracle.md", "text": "https://wpnews.pro/news/your-fuzzer-is-only-as-smart-as-its-oracle.txt", "jsonld": "https://wpnews.pro/news/your-fuzzer-is-only-as-smart-as-its-oracle.jsonld"}}