# Green Tests Don't Mean Better Software

> Source: <https://dev.to/reporails/green-tests-dont-mean-better-software-59of>
> Published: 2026-06-30 17:15:05+00:00

You spent the week on a refactor because it was supposed to make the next change cheaper. The tests go green, the PR merges, and that's the last anyone thinks about it. Nobody comes back to check whether the next change actually got cheaper. The green bar said the code conforms to its spec, everyone read that as done, and the reason you did the work in the first place went unmeasured.

Your CI is green. The test asserts `assert response.status == 200`

, and it passes. It does not assert that the caching layer you just shipped cut p99 latency, or that the reworded onboarding step lifted activation, or that the prompt change made the agent follow instructions more often.

The green bar answered a question you never asked.

What did that green actually tell you? That the change conforms to the spec — the inputs map to the asserted outputs, the contract holds, nothing you wrote a test for regressed. That is correctness. It is real and it is necessary. It told you nothing about whether the system got better. You shipped because the bar was green, and the bar was never measuring the thing you shipped the change for.

Those latency, activation, and adherence numbers are effects. CI does not measure effects. It measures conformance, and then we read the green bar as if it answered the other question too.

The silent assumption is that "passes" and "better" are the same axis — that a change which is correct is, by that fact, an improvement. They are not the same axis. A perfectly correct change can make the system worse, and a perfectly correct change can leave every metric exactly where it was; the suite goes green for both. Green is necessary for "better" and nowhere near sufficient for it, and almost no team instruments the difference.

The industry already named both halves of this. It named them as rivals.

[Spec-Driven Development](https://github.com/github/spec-kit) says: write the specification first, then build to it, then prove the build conforms. The test suite is the executable form of the spec. Correctness is the whole game. SDD is disciplined, auditable, and it is what most engineering orgs actually run. As a flow it terminates the moment the bar goes green:

Notice where it stops. It stops at *correct*. Nothing in that loop asks whether the shipped change improved anything.

[Hypothesis-Driven Development](https://www.thoughtworks.com/insights/articles/how-implement-hypothesis-driven-development) asks the opposite question. Thoughtworks framed it as a triple: *we believe* `X`

*will result in outcome* `Y`

*; we will know we are right when we see measurable signal* `Z`

. The unit of progress is not a passing build — it is validated learning. You predict an effect, you ship, you measure, and the measurement tells you whether you were right. PMI's expectation-management literature says the same thing from a different room: an expectation is a managed object with an owner and a due date, tracked and surfaced early rather than discovered late. As a flow it begins where SDD ends — at the prediction, and it runs past the ship:

These get pitched as competing philosophies — spec-first versus hypothesis-first, prove-correct versus prove-better, pick a camp. Look at the two flows again. One ends at *correct*. The other starts at *predicted* and ends at *better*. They answer two different questions, and the pick-a-camp framing assumes they answer one. Thoughtworks HDD and PMI expectation-management are not novel claims I am making here; they are prior art, decades of it, and I am citing them as validation. What none of them did was notice the two flows were never aimed at the same question.

Two different questions means two different axes. You cannot answer one by measuring the other — and a single pass/fail bar is built to answer only the first. Correct sits on one axis; better sits on the one perpendicular to it.

One axis is spec-conformance: does the change do what it was specified to do? Pass or fail, and `pytest`

already answers it. The other axis is effect-verdict: did the change move the metric it was supposed to move? Confirmed, refuted, or not yet known — and nothing in your pipeline answers it today.

Lay them on a 2×2.

```
|                         | **effect confirmed**     | **effect refuted / unmeasured** |
|-------------------------|--------------------------|---------------------------------|
| **spec passes (green)** | shipped and works        | **green but no better**         |
| **spec fails (red)**    | blocked (correctly)      | blocked (correctly)             |
```

Three of those quadrants are familiar. Red blocks the merge, whichever way the effect would have gone. Green-and-confirmed is the win you wanted. The quadrant nobody names is the top-right: **green but no better**. The change is correct, it conforms to spec, and it shipped — and it did not improve the system, or you never checked, which from the system's point of view is the same thing.

That quadrant is where most shipped changes actually live, and it is invisible because the only instrument pointed at it is the green bar, which is pointed at the wrong axis. This is the reusable mental model: stop asking "did it pass" as if it were one question. It is two questions on two axes, and you are only instrumenting one of them.

Make it concrete with a case every Python team recognizes. You adopt the hexagonal layout — Ports and Adapters. A pure core (`contract/`

, `dto/`

, `policy/`

) imports only the standard library. Adapters wrap the I/O. Subsystems compose the adapters. The dependency rule is one sentence: dependencies point inward, and the pure core points at nothing.

You enforce it with an architecture test that runs on every commit:

``` python
def test_core_purity():
    for module in pure_core_modules():
        assert no_imports(module, {"yaml", "requests", "subprocess", "open"})
```

It is green. Every pure module is import-clean. The dependency graph conforms to the rule — proven correct, mechanically, on every push.

Now ask the other question. You did not adopt the hexagonal layout for its own sake. You adopted it on a claim: changes would get cheaper, blast radius would shrink, the core would be testable without a single mock. Did that happen? `test_core_purity`

cannot tell you. It measures the shape of the import graph, not whether the shape paid off. Green here means *conformant*. A conformant architecture that made nothing cheaper is the green-but-no-better quadrant with a `.py`

file pointing straight at it.

That is the whole thesis in one test file. The architecture test lives on the spec axis. The promise that sold the refactor lives on the effect axis. They do not touch, and only one of them has an instrument.

If "better" is its own axis, it needs its own instrument. That instrument is an expectation — the HDD flow above, made into a first-class artifact.

An expectation is a change-scoped, falsifiable, deadline-carrying prediction. It is the HDD triple — believe `X`

, expect outcome `Y`

, know it by criterion `Z`

— turned into an artifact that travels with the change instead of living in a planning doc nobody reopens. It carries a baseline, a bound measurement view — the query, tied to that change, that reads the metric — a threshold, and a due date. Concretely: *the caching change should cut p99 latency below 200ms, measured against last week's baseline, re-checked Friday.* That sentence is the whole artifact, and it rides attached to the diff.

Here is the part that is genuinely new, and the part I want to be precise about: the verdict is **measured by the system**.

HDD and PMI both already told you to make a falsifiable prediction with an owner and a deadline. That is not the advance. The advance is that the prediction binds to a real measurement view, and a system reads that view and stamps the verdict itself. The human who made the prediction does not come back and grade it. The human leaves the verification loop.

This is not a dashboard alert. A dashboard alert fires on a metric crossing a line, untethered to any one change. An expectation binds one prediction to one change and grades *that prediction* — the dashboard tells you latency rose; the expectation tells you the change you predicted would cut it did the opposite.

The verdict vocabulary is three words, and the discipline is in the third:

`confirmed`

`refuted`

`inconclusive`

That third word is the one most dashboards quietly drop. A prediction you cannot yet grade is not a pass. Keeping `inconclusive`

distinct from `confirmed`

is what stops the green-but-no-better quadrant from refilling under a different name.

The selector routes by the change's *effect*, not its *shape*. A one-line diff can carry a large measurable effect and owe an expectation; a thousand-line refactor can be pure conformance and owe none — and you pick the lane when you ship, not after.

Run the selector over that same hexagonal project. Re-homing a module to clear a `test_subsystem_isolation`

failure is a shape change. It restores the boundary, its effect resists measurement, it owes nothing past the green test. The refactor that *promised* cheaper changes is the other lane — it made a measurable claim, so it owes an expectation: *adding the next artifact class touches three files or fewer, re-checked when the next one lands.* Same codebase, two changes, two lanes.

This is what the spec-versus-hypothesis fight always missed. The selector does not pick a winner between SDD and HDD. It is conditional. It runs SDD's discipline on changes whose effect resists measurement and HDD's discipline on changes whose effect can be measured, and it picks per change. A discipline that reconciles two camps looks like a routing rule that knows when each camp is right.

There is one trap to name: do not attach an expectation to an unmeasurable change just to satisfy the selector. A prediction with no real threshold is conformance wearing a costume, and it teaches everyone to ignore the verdicts. If the effect resists measurement, declare none. The honesty of the `refuted`

and `inconclusive`

verdicts depends on the selector being willing to say "this change owes nothing."

This is the part that has to survive contact with a real backlog. The tempting first cut is to flag every shipped change as owing a prediction. Try it and it floods on contact: the renames, the prose-clarity passes, the reorganizations — the bulk of any change history — all light up red against a rule they cannot satisfy, because there was never a metric for them to move. The flood is the proof. A selector that cannot say "this change owes nothing" does not enforce the discipline; it discredits it. The conditional design is not a softening of the rule — it is the only version of the rule that does not collapse the first time you run it over changes that already shipped.

A coordination system shipped exactly this layer, in working code. A change to a governed surface declares an expectation: predicted delta, bound measurement view, due date. The system re-measures when the due date passes, stamps `confirmed`

/ `refuted`

/ `inconclusive`

against the threshold, auto-resolves the confirmed ones silently, and surfaces the refuted and inconclusive ones the next session, one line each. The selector decides which changes owe a prediction and which fall back to conformance. The point here is only that the pairing — a governance selector plus an auto-measured verdict — is buildable today, because it has been built. The mechanism is the proof; the product is not the subject.

`confirmed`

is the minority verdict
What running it actually teaches is the part the diagram cannot. Before the verdicts come back, you assume most predictions will land — you shipped the change believing it would help, so of course the measurement will agree.

It does not. Once an instrument is actually pointed at the effect axis, `confirmed`

turns out to be the minority verdict. Most changes come back `inconclusive`

or `refuted`

: the metric did not move past the threshold, or there was no scalar to read in the window at all.

The first time a batch of verdicts surfaces and almost none of them are `confirmed`

, the 2×2 stops being a diagram. The green-but-no-better quadrant is not a rare corner you occasionally fall into — it is where most changes sit, and you only learn that the moment you can finally see it.

`inconclusive`

is the common case
`inconclusive`

is the verdict you underestimate most. "We didn't measure" happens far more often than any prediction would lead you to expect — the window was too short, the metric needed traffic that had not arrived, the effect was real but not yet legible.

The entire value of the third word is that it refuses to round itself up. Without it, every one of those un-graded predictions quietly refiles as a pass, and the quadrant refills under a name that looks like success. The discipline lives in holding `inconclusive`

open, in plain sight, until there is a scalar — and discovering, run after run, how often that takes longer than you thought.

`confirmed`

is not permanent
Even the `confirmed`

ones are not closed for good. A threshold set too loose can confirm a change that later regresses, and a silent auto-resolve would bury exactly that. A confirmed verdict earns its silence; it does not earn permanent trust.

One discipline already exists for this. [SkillOpt](https://arxiv.org/abs/2605.23904) (Yang et al., 2026) accepts a self-authored skill edit only when it improves a *held-out* validation split — not the data the edit was tuned against. The held-out split is the guard against exactly that failure: a verdict that reads `confirmed`

because the threshold was set on the same signal the change was shaped to move. An auto-measured verdict is only as honest as the separation between what tunes the change and what grades it.

The claim worth taking away is the framework, not any one implementation: a **governance selector** that routes each change to the right discipline, paired with an **auto-measured verdict** that proves the effect without a human grading it, is a reconciliation of spec-driven and hypothesis-driven development. It is not a third methodology stacked beside the two you have; it tells you which of the two you already know to apply, and then it does the grading.

That reconciliation stops being optional the moment the machine ships the change.

When a human writes every diff, the green-but-no-better quadrant is a slow tax — improvements that were not improvements, paid for in drift you notice eventually. When an AI-assisted or autonomous system writes and ships changes, there is no human in the loop reading the effect at all. The tests still run, so correctness still gets checked; nothing measures the effect, so nothing catches it slipping. Unmeasured-effect is precisely where a self-running system's regressions hide — each change correct, each change green, the system quietly getting worse along the one axis nobody instrumented.

The practice fits on one line. Before you merge the next change that makes a measurable claim, write down what you expect it to move, where that gets measured, and when you will look — then let the verdict come back to you instead of going to find it.

Green tells you the machine built the thing right. Only a measured verdict tells you the thing was worth building. If you are handing more of the building to machines, that second instrument is the one you cannot afford to leave off.

[Reporails](https://reporails.com) — deterministic instruction diagnostics and governance for the rules, prompts, and agent files that steer AI coding agents.