{"slug": "tests-pass-design-breaks-why-tdd-can-t-hold-the-line-on-design-intent", "title": "Tests Pass, Design Breaks: Why TDD Can't Hold the Line on Design Intent", "summary": "A Forward Deployed Engineer working on AI adoption reports that TDD cannot guarantee design correctness, especially when AI generates tests. The engineer describes an incident where a function signature change broke callers despite all tests passing, because the tests themselves were based on a misunderstood design intent. The engineer warns that AI accelerates test generation but does not verify whether tests express the correct design intent, leading to a growing quality debt.", "body_md": "There is a popular misconception that if you do TDD, your design also stays correct. That if the tests pass, quality is guaranteed. In AI-assisted development, this misconception is the kind that quietly accumulates — the more tests you have, the more invisible damage builds up underneath.\n\nHere is what happened today.\n\nA function called `safe_post.py` had its signature changed. Two arguments, `notify_sh` and `doctor_sh`, were removed. The test suite passed in full.\n\nBut the callers were still using the old signature. They were silently broken.\n\nWhy did the tests pass? Because the test code itself was using the old signature. The tests had been written (by AI) at a time when the design intent was already misunderstood. The misunderstanding was baked into the tests from the start.\n\nTests passing and the design being correct are two different things.\n\n\"All tests pass\" tells you only one thing: the implementation matches what the tests expect. Whether the tests express the right design intent is a separate question.\n\nLet me restate the TDD definition.\n\nRed → Green → Refactor. Write a test. Write the implementation that passes the test. Refactor.\n\nIn this loop, what the test verifies is whether the implementation meets the test's expectation. 
That is one verification — and only one.\n\nWhat TDD does not verify is whether the test itself correctly expresses the design intent.\n\nThe structure looks like this:\n\n```\nDesign intent  →  Tests  (← this link is not verified)\n                    ↓\n                  Implementation  (← this link is verified by tests)\n```\n\nIf the person writing the tests misunderstands the design intent, the tests will pass and the design will still be wrong. Machine learning engineer Hamel Husain calls this the \"Gulf of Specification\" — the gap between what you intended to measure and what your metric actually measures. Optimize hard against a flawed metric and you optimize hard in the wrong direction. The same dynamic plays out in TDD.\n\nThis is not a critique of TDD. It is a statement that TDD, by its structure, cannot solve this particular problem.\n\n\"Then review the tests,\" is the natural counter. Yes — but how do you review the review?\n\n```\nDesign intent  →  Tests  →  Implementation\n                    ↑\n               Human reviews (does it express intent?)\n                    ↑\n               Who reviews the reviewer?\n                    ↑\n               ... (infinite regress)\n```\n\nThe only way out of this snowball is to design a terminator for the review chain. And the terminator must, eventually, be a human.\n\nThe problem is that AI accelerates this loop. AI writes the implementation quickly, writes the tests quickly, makes them pass quickly. The faster the AI side moves, the more \"is this test expressing intent?\" work piles up on the human side. The paradox is sharp: the more you automate, the more confirmation work humans inherit.\n\nAs a Forward Deployed Engineer working on AI adoption in the field, I run into this paradox often. 
The pattern goes: \"AI made our development faster\" — then a few weeks later — \"but the design is getting more tangled.\"\n\nWhen speed goes up, the share of time allocated to design review goes down in relative terms. When the number of tests goes up, the cost of asking \"is this test correct?\" goes up with it. Use AI without being aware of this, and the speed benefit converts itself into a quality cost.\n\nAI is good at writing tests. \"Write tests for this code\" — a few seconds, and you have a plausible test file.\n\nThat is exactly where the problem is.\n\nThe tests AI writes tend to be \"tests reverse-engineered from the implementation.\" They describe what the code currently does. This is excellent for verifying \"implementation against tests.\" It is nearly useless for verifying \"tests against design intent.\"\n\nThe reason is simple: AI does not know the design intent. Unless it is in the context, AI reads the implementation, observes the behavior, and turns that behavior into tests. It converts \"this is how it currently behaves\" into a test, not \"this is how it was supposed to behave.\"\n\nThe `safe_post.py` story is exactly this. The tests had been written against the old signature. Nobody noticed. The tests faithfully verified that the implementation matched a now-outdated assumption. After the signature changed, the tests stayed where they were.\n\nWhat makes this nasty is that the discovery is delayed.\n\nNormal bugs are caught the moment the implementation fails the test. But \"tests don't express the design intent\" bugs only surface when the actual runtime behavior diverges from what was intended. From the test output, everything looks fine.\n\nIn the `safe_post.py` case, the fact that callers were using the old signature didn't surface until the code path actually ran. 
From the test suite alone, the answer was \"all green.\"\n\nThe only way to stop the snowball is to separate what can be machine-verified from what cannot.\n\nMachine-verifiable:\n\nNot machine-verifiable:\n\nHumans only confirm the second category. Everything in the first goes to machines.\n\nIf you skip this split and march forward under the belief that \"more tests = more safety,\" every new test adds another item to the \"do I trust this test?\" pile. Confirmation cost grows linearly with test count.\n\nIn the `safe_post.py` case, the signature change was something a type checker could have caught. With Python type annotations, `mypy` could have pointed straight at the caller using the old signature.\n\nA different layer from TDD. A different mechanism. Widening the machine-verifiable surface is the realistic way to keep design integrity intact. Be explicit about which range tests own, which range the type checker owns, and which range humans own.\n\nTo shrink the human surface, externalize design intent as context.\n\nWhen asking AI to write tests, lead with the intent. Not \"write tests for this function\" but \"this function's responsibility is X and Y; it does not handle Z; please write tests that verify those two.\" When you change a signature, write: \"this function's responsibility now excludes the notification side; tests should reflect that exclusion.\"\n\nEven then, misunderstandings happen. But the divergence between intent and generated test is smaller than when you hand AI nothing but implementation code.\n\nTo be clear: I am not against TDD.\n\nTests are necessary. Automated tests are the only practical way to verify boundary conditions. 
They are one mechanism that can flag \"did this signature change break the callers?\" — provided the prerequisite holds, that the tests themselves correctly express the design intent.\n\nThe problem is the belief that \"if you do TDD, your design is also safe.\"\n\nTDD is a tool that raises implementation quality. It is not a tool that verifies design intent. Use it with that distinction in mind, and TDD becomes a powerful weapon. Confuse the two and you get a state where \"confidence rises but the actual coverage of quality assurance shrinks.\"\n\nIn AI-assisted development this distinction matters more, not less. The faster AI can generate tests, the more the gap between \"tests written\" and \"intent verified\" widens — unless you deliberately design the mechanism that closes it.\n\nA practical organizing frame:\n\n**Layer 1: implementation correctness (TDD)**\n\nTests carry expectations; the implementation must satisfy them. Red/Green/Refactor. The layer AI is best at.\n\n**Layer 2: design integrity (types / static analysis)**\n\nSignature consistency, type matching, contracts with callers. Type checkers and linters do this. Machine-owned.\n\n**Layer 3: alignment with design intent (humans)**\n\nWhether the test truly expresses \"why this should behave this way.\" Whether the implementation's \"why\" matches the design intent. Humans only.\n\nWhen AI accelerates test generation, Layers 1 and 2 stay machine-owned. Build the discipline of having humans confirm only Layer 3. That is the realistic design for keeping speed and quality together.\n\nThe conversation broadens slightly from here.\n\nAs AI-assisted development accelerates, the value of being able to articulate design intent rises.\n\nThe cost of writing code has dropped. The cost of writing tests has dropped. Both can be generated in seconds. But \"what should we build?\" and \"why does this design have to look like this?\" — these AI does not figure out for you. 
More precisely: unless you put the intent into the context, AI defaults to \"the design inferred from the current implementation.\"\n\nA person who can verbalize design intent gives AI more concrete instructions. \"This function's responsibility is X and Y. Z is out of scope. Tests should verify these two.\" Hand AI that, and the gap between intent and generated tests shrinks.\n\nA person whose design intent lives only in their own head hands AI nothing concrete. Every confirmation step boomerangs back to the human. When the design intent is not verbalized, the faster AI goes, the more confirmation cost the human inherits.\n\nI see this pattern more often in the field now: \"we introduced AI, development sped up, but quality confirmation has become exhausting.\" The \"exhausting\" part is mostly the design-intent verbalization gap. Speed exposes what was tacit.\n\nTDD does not guarantee design intent for the same reason AI does not guarantee design intent. Both are tools that process what is written. Design intent, unless humans put it into writing, lives nowhere a machine can read it.\n\nA concrete question: where should you put it?\n\n**In code, via test names.** Not in comments, in the test name itself. The test name is the place to say \"what should this implementation be doing, and why.\" `test_safe_post_handles_missing_file` says less than `test_safe_post_completes_without_notify_when_notify_sh_is_absent`. The longer name carries the intent.\n\n**In documents, via ADRs (Architecture Decision Records).** Why you chose this design, what alternatives existed, the assumptions behind the choice. You do not need perfection. 
A single paragraph — \"the current signature is X and Y for these two reasons\" — drastically lowers the cost of judging a future signature change.\n\n**In conversation, via PR comments and issue threads.** A code review comment that carries design intent becomes a future tracer for \"why is it like this?\"\n\nThe common move across all three: externalize design intent. Do not keep it in your head. Put it where a machine can reference it.\n\nThere is no shortcut to verifying design intent. The region machines cannot handle stays with humans.\n\nWhat you can do is shrink the human region. Automate the machine-verifiable side aggressively. Confirm only what is left.\n\nNot \"tests pass, so we are correct.\" But \"did I confirm that the tests express the design intent correctly?\"\n\nTDD is a powerful tool. Use it with a clear sense of what it covers and what it does not. Without that distinction, the faster AI development gets, the more quietly things break underneath.\n\nThat is the lesson from today.\n\n*This post was adapted (not literally translated) from a Japanese original at nomuraya-hub.pages.dev. 
I am the same author writing under different pen names — \"nomuraya / shimajima / 中翔\" — depending on the medium.*", "url": "https://wpnews.pro/news/tests-pass-design-breaks-why-tdd-can-t-hold-the-line-on-design-intent", "canonical_source": "https://dev.to/nomurasan/tests-pass-design-breaks-why-tdd-cant-hold-the-line-on-design-intent-42ml", "published_at": "2026-06-27 00:21:34+00:00", "updated_at": "2026-06-27 00:34:01.787551+00:00", "lang": "en", "topics": ["developer-tools", "ai-tools", "ai-safety", "machine-learning", "large-language-models"], "entities": ["Hamel Husain", "Forward Deployed Engineer"], "alternates": {"html": "https://wpnews.pro/news/tests-pass-design-breaks-why-tdd-can-t-hold-the-line-on-design-intent", "markdown": "https://wpnews.pro/news/tests-pass-design-breaks-why-tdd-can-t-hold-the-line-on-design-intent.md", "text": "https://wpnews.pro/news/tests-pass-design-breaks-why-tdd-can-t-hold-the-line-on-design-intent.txt", "jsonld": "https://wpnews.pro/news/tests-pass-design-breaks-why-tdd-can-t-hold-the-line-on-design-intent.jsonld"}}