My MCP server had 32 green tests. Not one of them had ever called a tool.

A developer discovered that their open-source Safari MCP project had 32 passing tests, none of which actually tested the tool's behavior. The test suite only verified tool existence and schema validity, leaving critical security logic untested. After writing behavioral tests for tab ownership rules, the developer refactored safely, highlighting the danger of green builds that provide false confidence.

For weeks my CI badge was green and I believed it.

Safari MCP (https://github.com/achiya-automation/safari-mcp) is an open-source tool that lets an AI coding agent drive a real, logged-in Safari: click, type, read the page, switch tabs. It registers 96 tools. The test suite ran on every push across three Node versions and came back 32 passed, 0 failed.

Green is green. I shipped on it.

Then I went to extract a chunk of index.js into its own module, and while staring at the diff I asked a question I should have asked months earlier:

Which of these 32 tests would fail if I broke the security boundary?

The answer was none of them.

I read the suite line by line. Two of the tests carried almost all the weight:

- "server starts and lists all registered tools": boots the server, asserts the tool count.
- "valid schemas + unique names": every tool has a schema, no duplicate names.

The rest were string-escaping and JS-injection helpers. All useful. All real. And all of them answered the same kind of question: does the thing exist and is it shaped correctly?

Not one of them answered: does calling it do the right thing?

It's the test-suite equivalent of checking into a hotel by confirming the building has 96 doors with correct room numbers, and never once trying a key in a lock.

The most security-critical code in Safari MCP is tab ownership. The rule is simple to say and easy to get subtly wrong: the agent may only touch tabs it opened.
It must never navigate, click, or read a tab the human opened: that's someone's half-written email, their banking session, their unsaved work.

That logic lived in a tangle of module-local state: a map of owned tabs, a TTL so stale entries expire, a blank-URL sentinel for tabs mid-load, and a matcher that decides whether https://app.example.com/org is "the same" tab as https://app.example.com/org-evil.

Read that last one again. /org vs /org-evil. If the matcher is even slightly too loose (a startsWith where it needed a path-boundary check), the agent could decide it "owns" a look-alike tab and start typing into it.

There was not a single test exercising that comparison.

The suite was 100% green the whole time the security boundary had zero behavioral coverage. A regression there wouldn't have turned CI red. It would have turned CI green and wrong, the worst color a test suite can be.

A red build is honest. It stops you. The failure is the feature.

A green build that proves nothing gives you the feeling of safety without the substance, and you make decisions on that feeling. You refactor confidently. You merge contributor PRs confidently. You tell users the boundary holds. Every one of those is a small bet placed on a test that was never actually watching the thing you care about.

This is the same failure mode I keep running into in this project, wearing a different costume each time. The pattern is always the same: the system doesn't fail loudly. It quietly does less, and the signal you're trusting keeps saying "fine."

Before extracting anything, I wrote the tests that should have existed from day one: behavioral tests that call the ownership logic and assert on its decisions, including whether /org matches /org-evil (the path-boundary case).

Nine of them. The suite went from 32 to 41 (it's 46 today, after a later round for macOS compatibility). More importantly: now if I loosen that matcher by one character, a test goes red and names the boundary in the failure message.
The security rule finally has a tripwire.

Only then did I do the refactor (extract the state layer into its own module), and the new tests held across the move, which is exactly the confidence I'd been pretending to have.

Counting your tools is not testing your tools. Schema validation is not behavior. A green suite tells you what it checks, and stays silent about everything it doesn't, in the most reassuring tone possible.

So the question to ask of any test suite, especially one you've been trusting:

What is the single worst thing that could break in this codebase, and would a test go red if it did?

If the honest answer is "no," your CI badge isn't lying. You just never asked it the right question.

Safari MCP is open source. The ownership tests are in test/ownership-state.test.mjs if you want to see what "test the boundary" looks like in practice. More on what I'm building at achiya-automation.com.

What's the security boundary in your project that your test suite has never once exercised?