{"slug": "when-pytest-said-passed-it-was-lying", "title": "When pytest Said \"Passed,\" It Was Lying", "summary": "A developer building a coding agent called ForgeFlow discovered that their pytest test suite reported 186 passed tests even though the tests were running in the wrong Python virtual environment. The error went unnoticed because the wrong environment had the same packages at different versions, causing no failures. This incident highlights the danger of trusting green checkmarks without verifying the test environment.", "body_md": "*Part of the ForgeFlow series — building a coding agent that runs its execution loop locally on an M5 Max, and writing down what actually breaks. Planning runs on Claude; code generation runs on a local model via Ollama, test-driven inside a Docker sandbox.*\n\nFor a few days, I made decisions on top of a number that wasn't true.\n\nThe number was `186 passed`\n\n. It came out of pytest, green, at the bottom of the terminal, the way it had dozens of times before. I trusted it the way you trust a number that has never been wrong before. Then I found out the run had been measured inside the wrong environment, and the green had very little to do with the code I thought I was checking.\n\nTo be fair to the tool: pytest wasn't wrong. It answered exactly the question I handed it — it just wasn't the question I meant to ask. This post is about that gap. Not a bug in a test, but a bug in *how I measured the tests*. It turned out to be one of the more uncomfortable lessons in the project so far, because it sat underneath everything else. If the floor is tilted, every measurement you take on top of it inherits the tilt, and you don't see it, because the floor looks like the floor.\n\nI'm writing it down mostly because I suspect I'm not the only person who has trusted a green checkmark that didn't earn it.\n\nForgeFlow is a coding agent that runs a test-driven loop. Plan, write a failing test, write code, run the tests, decide what happened, repeat. Because so much of the system's behavior is judged by \"did the tests pass,\" I keep a **baseline**: a known set of test files that should report a known set of numbers. Before and after almost any change to the engine, I re-run the baseline and compare. If the counts move when they shouldn't, something is wrong.\n\nOne thing to be precise about, because it matters later: the agent *executes the code it generates* inside a Docker sandbox, but this baseline — the engine's own test suite — I was running from my host shell. Two different executions. The sandbox was fine. The host shell was the problem.\n\nThe baseline is the closest thing the project has to a source of truth about its own health. That's exactly why this hurt.\n\nOne afternoon I was moving between two things on the same machine — the agent's own codebase, and an unrelated project I'd been poking at earlier. I ran the baseline. Green. `186 passed`\n\n. I noted it, moved on, and built the next decision on top of it.\n\nWhat I didn't notice was a single line of state that had carried over from the earlier work.\n\nI'd left a different project's virtual environment active.\n\nThat's the whole bug, mechanically. The shell still had another project's `VIRTUAL_ENV`\n\nset, so when I ran pytest, it resolved `pytest`\n\nthrough *that* environment, and Python resolved imports against that environment's installed packages.\n\nHere's the question a careful reader asks immediately: if it was the wrong environment, why didn't it just fail? Why no `ModuleNotFoundError`\n\n, no loud red collection error?\n\nBecause nothing was missing. The polluted environment happened to have the same packages installed — only at different versions. So nothing errored out. The tests collected, ran, and passed; they just passed against a version matrix the baseline doesn't assume. And that's the genuinely unsettling part: **if a dependency had been entirely absent, I'd have gotten a loud error and caught it in seconds.** The danger was precisely that everything was *present* — present and subtly wrong. Wrong in the one way that doesn't announce itself.\n\nThe problem is that \"green\" had stopped meaning what I read it as. I read it as *\"the code is correct in the environment it's meant to run in.\"* What it actually meant was *\"the code passed in whatever environment happened to be active.\"* Those are different sentences. For a few days I couldn't tell them apart, because the terminal prints the same word for both.\n\nHere's the part worth sitting with: **nothing failed.** A failing test is a gift — it's loud, it points at itself, you go fix it. This didn't fail. It passed, and the passing was the problem. The signal I rely on to catch mistakes was itself the mistake, wearing the costume of everything being fine.\n\nWhen I finally caught it — by comparing against a run from a fresh shell using the project's intended environment, and noticing the counts didn't line up — I reduced the lesson to a sentence I've kept since:\n\nWhether the tests pass and whether the\n\nmeasurement of the tests is trustworthyare two separate questions, and I had been treating them as one.\n\nAlmost all of my testing discipline had been aimed at the first question. I had careful tests. What I didn't have was anything checking the second — the integrity of the act of measuring. The environment the measurement runs in is an input to the result, and I'd been treating it as a constant when it was actually a variable I'd left lying around.\n\nIt's easy to invest heavily in test *correctness* while leaving test *measurement integrity* implicit — to treat it as the environment's job, or the tooling's job, rather than something to check directly. Plenty of teams do handle it, with lockfiles, containerized test runs, hermetic builds. I just wasn't one of them at this layer, for this particular command, on this particular day.\n\nTwo things, deliberately small.\n\n**A guard before the measurement, not just inside it.** The cheapest fix is a single check that runs before the baseline: confirm the environment is the one I think it is. In my case, simply asserting that no foreign virtual environment was active would have caught it — the testing equivalent of checking the floor before you measure the wall. It's almost embarrassingly simple, and it would have caught this in one second. (A stricter guard wouldn't stop at the `VIRTUAL_ENV`\n\nvariable; it would also check `sys.executable`\n\n, the resolved `pytest`\n\npath, and the expected project root. The variable was the obvious giveaway here, but it's the weakest of the four.)\n\n**A separate set of checks for the measurement itself.** Beyond the one-line guard, I added a small, dedicated set of tests whose only job is to protect the *invariants of how I measure* — not the features, the measurement. They're counted separately from the normal baseline on purpose, so they can't be quietly folded into the same number they're supposed to be watching. The exact count is secondary. What matters is that \"is my measurement honest\" became something the system checks for me, instead of something I assume.\n\nNeither of these is clever. That's sort of the lesson. The failure wasn't subtle once I saw it; it was invisible only because I'd never thought to look there.\n\nI want to be careful not to inflate this into a grand principle.\n\nThis is one incident, on one machine, caused by one careless bit of leftover state. It doesn't prove that everyone's test suites are secretly lying, and it doesn't prove you need an elaborate measurement-verification layer. For a small throwaway script, a clean shell and a moment of attention is the entire fix, and the machinery would be overkill.\n\nTo be clear about which part scales: environment hygiene matters at any size — it's the *guards and dedicated checks* that scale with how much you're betting on the number. I'm betting a lot on mine; the agent makes real decisions off these signals. So for me the machinery earned its place. The hygiene would have been worth it regardless of project size.\n\nI'm also aware the \"fix\" mostly moves the trust down one level. Now I trust the guard. If the guard is wrong, I'm back where I started. There's no absolute bottom here — just a level low enough that I'm willing to stop and call it ground. I picked one. I could be wrong about whether it's low enough.\n\nIf I had to compress it: a green test run answers \"did the code pass?\" It does *not* answer \"did I measure that in the environment I meant to?\" The second question has its own failure mode, and because the failure mode is *passing*, your normal instincts — chase the red, fix what's loud — never fire.\n\nThe verification pyramid most of us picture has tests at the bottom. I'd now put one more layer underneath it: *the environment the tests run in.* When that layer shifts, every green light above it is reporting on an environment you're not actually in.\n\nI'd like to know how other people handle this. Do you guard your test environment explicitly, or rely on convention and attention? Have you been bitten by a *passing* result that turned out to be measured wrong — and if so, how did you finally catch it? I caught mine by luck and a mismatched count. I'd rather not depend on luck next time, and I suspect some of you have better answers than I do.\n\n*Next in the series: a quality gate in the same system blocked code 198 times — and why I was wrong to call that \"working.\"*", "url": "https://wpnews.pro/news/when-pytest-said-passed-it-was-lying", "canonical_source": "https://dev.to/josephyeo/when-pytest-said-passed-it-was-lying-43p7", "published_at": "2026-06-20 13:05:33+00:00", "updated_at": "2026-06-20 13:37:09.412352+00:00", "lang": "en", "topics": ["developer-tools", "ai-agents", "machine-learning"], "entities": ["ForgeFlow", "pytest", "Claude", "Ollama", "Docker", "M5 Max"], "alternates": {"html": "https://wpnews.pro/news/when-pytest-said-passed-it-was-lying", "markdown": "https://wpnews.pro/news/when-pytest-said-passed-it-was-lying.md", "text": "https://wpnews.pro/news/when-pytest-said-passed-it-was-lying.txt", "jsonld": "https://wpnews.pro/news/when-pytest-said-passed-it-was-lying.jsonld"}}