{"slug": "your-ai-writes-tests-that-can-never-fail", "title": "Your AI Writes Tests That Can Never Fail", "summary": "A developer warns that AI-generated tests often pass without actually verifying behavior, creating a false sense of security. The solution is mutation testing, which checks if tests can detect code changes. Tools like Gremlins automate this process to ensure tests are meaningful.", "body_md": "You ask the AI for tests. It hands you twelve, all green. CI passes. You merge. Three days later a bug ships, on a function those tests were supposed to cover. You reopen the test file and it clicks: it ran, it passed, and it tested nothing.\n\nA green test isn't a proof. It's a hypothesis. And an AI, left to its own devices, is very good at writing hypotheses that can never be disproved.\n\nTake a dead-simple function, a flat 10-euro discount on totals above 100 euros:\n\n```\nfunc Discount(total int) int {\n    if total > 100 {\n        return total - 10\n    }\n    return total\n}\n```\n\nHere's the kind of test an AI produces when you ask \"write me a test for this\" with no further framing:\n\n```\nfunc TestDiscount(t *testing.T) {\n    got := Discount(150)\n    if got < 0 {\n        t.Errorf(\"result should not be negative\")\n    }\n}\n```\n\nThis test is green. It does run the discount branch (so your coverage climbs). But look at the assertion: `got < 0` stays false for any non-negative result, so almost nothing you do to `Discount` can trip it. Replace `total - 10` with `total + 10`, with `total * 2`, with `42`: the test stays green. It doesn't check behavior, it checks that the lights are on.\n\nThe trap is that this phantom test inflates your coverage. Coverage counts lines *executed*, not assertions that *bite*. A line crossed by a test that asserts nothing useful counts as much as a line genuinely verified. So a 90% coverage report can hide half a suite of tests that will never fail, even if you break the code on purpose.\n\nThat's exactly an LLM's playground. 
Its reward signal is \"the tests pass\". Not \"the tests catch a bug\". With no external oracle to stop it, it drifts toward the shortest path to green: soft assertions, mocks that test themselves, cases that never exercise the risky branch.\n\nThe counter is one move, and it's as old as TDD: before trusting a test, check that it knows how to fail. Mutate the line it's meant to protect, rerun, and expect to see it go red. If it stays green, it's vacant.\n\nOn our function, I change the discount for one second:\n\n```\n// temporary mutation: - becomes +\nreturn total + 10\n```\n\nThe phantom test stays green. Verdict: bin it. Here's the one that earns your trust:\n\n```\nfunc TestDiscount(t *testing.T) {\n    if got := Discount(150); got != 140 {\n        t.Errorf(\"Discount(150) = %d, want 140\", got)\n    }\n}\n```\n\nWith the same mutation, `Discount(150)` returns 160 and the test goes red instantly. It bites. That's a test: not one that passes, one that knows why it might not.\n\nDoing this by hand on every test doesn't scale. That's precisely what **mutation testing** automates: the tool applies hundreds of small mutations to your code (a `>` that becomes `>=`, a `+` that becomes `-`, a gutted `return`) and reruns your suite after each one. Every mutation that makes no test go red is a *surviving mutant*: a hole your tests can't see.\n\nIn Go, [gremlins](https://github.com/go-gremlins/gremlins) does the job:\n\n```\ngo install github.com/go-gremlins/gremlins/cmd/gremlins@latest\ngremlins unleash ./...\n```\n\nIt gives you a *mutation score*: the percentage of mutants killed. Where coverage tells you \"this line is crossed\", the mutation score tells you \"this line is actually tested\". The two numbers measure different things, and it's the second that counts.\n\nWhen I let an agent write code and its tests, I don't let it declare itself done. 
Before any review, an objective gate runs: build, lint, test suite, then a red-check on the critical tests. The agent mutates the target line itself, checks the test goes red, restores it. A test still green after mutation gets rewritten, not negotiated. The LLM doesn't get a vote on \"does this actually test something\": the mutation decides, it only observes.\n\nThe rule that falls out is simple: no generated test enters the suite without proving it can fail. The cost is tiny, the payoff huge, because a vacant test is worse than no test. A missing test, you can see. A vacant one lulls you.\n\nWe've learned to distrust AI-written code, so we review it. We still extend blind trust to the tests it writes, because they're green. But green doesn't prove itself: a test is only worth the red it's able to produce. Until you've watched a test fail at least once, you don't have a test, you have a decoration.", "url": "https://wpnews.pro/news/your-ai-writes-tests-that-can-never-fail", "canonical_source": "https://dev.to/ohugonnot/your-ai-writes-tests-that-can-never-fail-3i57", "published_at": "2026-06-28 12:38:00+00:00", "updated_at": "2026-06-28 13:04:34.004526+00:00", "lang": "en", "topics": ["artificial-intelligence", "developer-tools", "machine-learning"], "entities": ["Gremlins", "Go"], "alternates": {"html": "https://wpnews.pro/news/your-ai-writes-tests-that-can-never-fail", "markdown": "https://wpnews.pro/news/your-ai-writes-tests-that-can-never-fail.md", "text": "https://wpnews.pro/news/your-ai-writes-tests-that-can-never-fail.txt", "jsonld": "https://wpnews.pro/news/your-ai-writes-tests-that-can-never-fail.jsonld"}}