Source: dev.to · Topic: artificial-intelligence

Your AI Writes Tests That Can Never Fail

A developer warns that AI-generated tests often pass without actually verifying behavior, creating a false sense of security. The solution is mutation testing, which checks if tests can detect code changes. Tools like Gremlins automate this process to ensure tests are meaningful.

Published Jun 28, 2026 · 4 min read

You ask the AI for tests. It hands you twelve, all green. CI passes. You merge. Three days later a bug ships, on a function those tests were supposed to cover. You reopen the test file and it clicks: it ran, it passed, and it tested nothing.

A green test isn't a proof. It's a hypothesis. And an AI, left to its own devices, is very good at writing hypotheses that can never be disproved.

Take a dead-simple function, a 10-euro discount on totals above 100 euros:

func Discount(total int) int {
    if total > 100 {
        return total - 10
    }
    return total
}

Here's the kind of test an AI produces when you ask "write me a test for this" with no further framing:

func TestDiscount(t *testing.T) {
    got := Discount(150)
    if got < 0 {
        t.Errorf("result should not be negative")
    }
}

This test is green. It does run the discount branch (so your coverage climbs). But look at the assertion: got < 0 can only be true if Discount(150) returns a negative number, and no plausible change to the function does that. Replace total - 10 with total + 10, with total * 2, with 42: the test stays green. It doesn't check behavior, it checks that the lights are on.
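To make that concrete, here is a minimal sketch that runs the phantom assertion against the original function and the three replacements named above. The helper names (discountFn, withDiscountBranch, phantomPasses, variants) are made up for this illustration, and the mutants are written by hand as function values rather than produced by a tool.

```go
package main

import "fmt"

// discountFn is the shape shared by the original function and each hand-written mutant.
type discountFn func(total int) int

// withDiscountBranch builds a Discount-like function whose "total > 100"
// branch is replaced by the given body. Hypothetical helper for this sketch.
func withDiscountBranch(branch func(total int) int) discountFn {
	return func(total int) int {
		if total > 100 {
			return branch(total)
		}
		return total
	}
}

// phantomPasses reproduces the weak assertion: it only goes red on a negative result.
func phantomPasses(f discountFn) bool {
	return !(f(150) < 0)
}

type variant struct {
	name string
	fn   discountFn
}

// variants lists the original branch body and the three replacements from the text.
func variants() []variant {
	return []variant{
		{"total - 10 (original)", withDiscountBranch(func(t int) int { return t - 10 })},
		{"total + 10", withDiscountBranch(func(t int) int { return t + 10 })},
		{"total * 2", withDiscountBranch(func(t int) int { return t * 2 })},
		{"42", withDiscountBranch(func(t int) int { return 42 })},
	}
}

func main() {
	for _, v := range variants() {
		fmt.Printf("%-22s phantom test green: %v\n", v.name, phantomPasses(v.fn))
	}
}
```

Every variant reports green, including the three that are plainly wrong.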

The trap is that this phantom test inflates your coverage. Coverage counts lines executed, not assertions that bite. A line crossed by a test that asserts nothing useful counts as much as a line genuinely verified. So a 90% coverage report can hide half a suite of tests that will never fail, even if you break the code on purpose.

That's exactly an LLM's playground. Its reward signal is "the tests pass". Not "the tests catch a bug". With no external oracle to stop it, it drifts toward the shortest path to green: soft assertions, mocks that test themselves, cases that never exercise the risky branch.

The counter is one move, and it's as old as TDD: before trusting a test, check that it knows how to fail. Mutate the line it's meant to protect, rerun, and expect to see it go red. If it stays green, it's vacant.

On our function, I change the discount for one second:

// temporary mutation: - becomes +
return total + 10

The phantom test stays green. Verdict: bin it. Here's the one that earns your trust:

func TestDiscount(t *testing.T) {
    if got := Discount(150); got != 140 {
        t.Errorf("Discount(150) = %d, want 140", got)
    }
}

With the same mutation, Discount(150) returns 160 and the test goes red instantly. It bites. That's a test: not one that passes, but one that knows why it might not.

Doing this by hand on every test doesn't scale. That's precisely what mutation testing automates: the tool applies hundreds of small mutations to your code (a > that becomes >=, a + that becomes -, a gutted return) and reruns your suite after each one. Every mutation that makes no test go red is a surviving mutant: a hole your tests can't see.
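The loop a mutation tool runs can be sketched in a few lines. This is an illustration only, not how any particular tool is implemented: the three mutants are hand-written function values, "gutted return" is approximated by always returning 0, and the names (testCase, suitePasses, mutants, survivors) are hypothetical. It also shows something the single exact test on Discount(150) misses: the > to >= mutant survives until a boundary case at 100 joins the suite.

```go
package main

import "fmt"

type discountFn func(total int) int

// testCase is one exact expectation: f(in) must equal want.
type testCase struct{ in, want int }

// suitePasses reports whether every case holds for f (the suite is green).
func suitePasses(f discountFn, cases []testCase) bool {
	for _, c := range cases {
		if f(c.in) != c.want {
			return false
		}
	}
	return true
}

type mutant struct {
	name string
	fn   discountFn
}

// mutants returns hand-written stand-ins for the mutations named in the text.
func mutants() []mutant {
	return []mutant{
		{"> becomes >=", func(total int) int {
			if total >= 100 {
				return total - 10
			}
			return total
		}},
		{"- becomes +", func(total int) int {
			if total > 100 {
				return total + 10
			}
			return total
		}},
		// Assumption: a "gutted return" is modeled as returning the zero value.
		{"gutted return", func(total int) int { return 0 }},
	}
}

// survivors returns the names of mutants for which the suite stays green.
func survivors(cases []testCase) []string {
	var out []string
	for _, m := range mutants() {
		if suitePasses(m.fn, cases) {
			out = append(out, m.name)
		}
	}
	return out
}

func main() {
	single := []testCase{{150, 140}}
	withBoundary := []testCase{{150, 140}, {100, 100}}
	fmt.Println("survivors, single case:  ", survivors(single))
	fmt.Println("survivors, with boundary:", survivors(withBoundary))
}
```

With only the 150 case, the >= mutant still returns 140 and slips through. Adding the 100 case kills it, because the mutant returns 90 where the original returns 100.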

In Go, gremlins does the job:

go install github.com/go-gremlins/gremlins/cmd/gremlins@latest
gremlins unleash ./...

It gives you a mutation score: the percentage of mutants killed. Where coverage tells you "this line is crossed", the mutation score tells you "this line is actually tested". A line has to be covered before a mutant on it can be killed, but the reverse doesn't hold: high coverage says nothing about how many mutants die. It's the second number that counts.
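The arithmetic behind the score is simple enough to sketch. The function name mutationScore and the figures in main are illustrative, not taken from any tool's output, and real tools differ on what goes in the denominator (for example, whether mutants on uncovered lines or timed-out runs are counted), so treat this as the basic idea rather than a tool's exact formula.

```go
package main

import "fmt"

// mutationScore returns the percentage of mutants killed by the suite.
// With zero mutants there is nothing to measure, so it reports 0.
func mutationScore(killed, total int) float64 {
	if total == 0 {
		return 0
	}
	return 100 * float64(killed) / float64(total)
}

func main() {
	// Hypothetical run: three mutants on a function whose lines are all executed.
	// A suite that kills none of them and a suite that kills all of them
	// would show the same line coverage.
	fmt.Printf("kills 0 of 3: %.0f%%\n", mutationScore(0, 3))
	fmt.Printf("kills 3 of 3: %.0f%%\n", mutationScore(3, 3))
}
```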

When I let an agent write code and its tests, I don't let it declare itself done. Before any review, an objective gate runs: build, lint, test suite, then a red-check on the critical tests. The agent mutates the target line itself, checks the test goes red, restores it. A test still green after mutation gets rewritten, not negotiated. The LLM doesn't get a vote on "does this actually test something": the mutation decides, it only observes.
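The red-check itself boils down to two conditions: the test is green on the untouched code, and red on the mutated code. Here is a minimal sketch of that rule. It is not the gate described above, which mutates the real source line and restores it; this version stands in function values for the original and the mutant, and the names (check, redCheck, phantom, exact) are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

type discountFn func(total int) int

// check is a test expressed as a function: nil means green, an error means red.
type check func(f discountFn) error

// redCheck accepts a test only if it is green on the original and red on the mutant.
func redCheck(c check, original, mutant discountFn) error {
	if err := c(original); err != nil {
		return fmt.Errorf("test is red on unmodified code: %w", err)
	}
	if err := c(mutant); err == nil {
		return errors.New("test stayed green under mutation: vacant")
	}
	return nil
}

// original mirrors Discount from the article.
func original(total int) int {
	if total > 100 {
		return total - 10
	}
	return total
}

// plusMutant is the temporary mutation from the article: - becomes +.
func plusMutant(total int) int {
	if total > 100 {
		return total + 10
	}
	return total
}

// phantom is the weak assertion: it only fails on a negative result.
func phantom(f discountFn) error {
	if got := f(150); got < 0 {
		return errors.New("result should not be negative")
	}
	return nil
}

// exact is the assertion that pins the expected value.
func exact(f discountFn) error {
	if got := f(150); got != 140 {
		return fmt.Errorf("Discount(150) = %d, want 140", got)
	}
	return nil
}

func main() {
	fmt.Println("phantom:", redCheck(phantom, original, plusMutant))
	fmt.Println("exact:  ", redCheck(exact, original, plusMutant))
}
```

The phantom check is rejected as vacant and the exact one is accepted. The first condition matters too: a test that is already red on the original proves nothing by being red on the mutant.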

The rule that falls out is simple: no generated test enters the suite without proving it can fail. The cost is tiny, the payoff huge, because a vacant test is worse than no test. A missing test is a gap you can see. A vacant one lulls you.

We've learned to distrust AI-written code, so we review it. We still extend blind trust to the tests it writes, because they're green. But green doesn't prove itself: a test is only worth the red it's able to produce. Until you've watched a test fail at least once, you don't have a test, you have a decoration.
