Breaking Build: Kiro and Claude delivered exactly what I asked, and it wasn't what I wanted

A developer building with AI agents at Agentis Lux discovered that their deployed scanner returned the same score of 62 for every site, including a false finding about a checkout button on a site without one. The pipeline had run only once in May, deploying old code, while newer changes sat untested in the repo. The developer realized the agents executed instructions literally, exposing a gap between intention and instruction, a pattern echoed in Anthropic's talk at the AWS Summit.

Building in public means showing the part where the robots did great work on the wrong thing. The deploy on Agentis Lux succeeded. Green check, no errors, site live. I scanned my own site to grab a "before" shot for a before-and-after, and the scanner handed back a score of 62. It handed back 62 for the next site too. And the next one. Same score, same findings, every time, including a finding about a "checkout button" on a site that has no checkout button. The build worked. It was running a version of the scanner I'd written weeks ago and abandoned. Everything I'd built since then was sitting in the repo, merged, tested, and not deployed. The deploy pipeline had run exactly once, in May, and never again. AND I NEVER NOTICED So the live site was a confident, well-tested, fully-green stub. Technically , nothing went wrong. That's the part I keep mulling over...and over...and over. I build with AI agents. I direct, they generate. One agent writes the infrastructure, another audits it, I make the calls and merge. It's fast and it's good, and the failure mode is not what I expected. I expected the agents to make mistakes. They mostly don't. What they do instead is build exactly what I asked for, correctly, when what I asked for wasn't what I wanted. The bug isn't in the code. The bug is in the gap between my instruction and my intention, and the agent fills that gap with whatever's most literally true. This exact thing, context engineering, came up at Anthropic's talk at the AWS Summit https://dev.to/earlgreyhot1701d/aws-summit-los-angeles-2026-why-am-i-always-learning-the-hard-way-46lb . A human orchestrator, in this case...me, pushes back. "You said deploy, but the pipeline hasn't run since May, did you mean redeploy the current code?" An agent says "deploy succeeded" because the deploy did, in fact, succeed. It answered the question I asked. I asked the wrong question that sat clearly in my blind spot. I hit this four times on one project in about a week. Same shape every time. The stub that shipped. The 62 that came back for every single site, the Groundhog Day score. The infrastructure was real, the tests were green, the deploy worked. It just deployed code I'd left behind. "Is it deployed" was true. "Is the thing I built deployed" was the question I forgot to ask. Lesson: Don't assume. The three doors, one of them real. My scanner takes three kinds of input: a URL, a code repo, an API spec. The interface showed three tabs for them. Clean, obvious, exactly what the design implied. Only the URL one was wired up. The other two were built to the spec I gave, which described three tabs, and I'd later decided to ship only URL scanning first and never updated the interface to match. So a visitor clicks "API spec," types something in, and hits a polite wall. The tabs were correct. My scope had moved and the tabs hadn't heard about it. Lesson: Kiro and Claude can't read my mind The findings only an engineer could read. My whole audience is people who build with AI and may not know what a <ul is. The scanner's findings said things like "repeated sibling elements not wrapped in ul or ol." That is a correct finding. It is also useless to the person I built the tool for. I'd asked for accurate, technical, no-fluff findings. I got them. I forgot to ask "can my actual user read this." Lesson: Don't forget you're building for the end user, a real person, not a theoretical one. The card that rendered nothing. A social card route, built, deployed, working. I saved the image and got a zero-byte file. The route fetched three fonts from the web, and when one came back empty instead of failing outright, the image renderer got fed garbage and produced nothing. The catch block that was supposed to handle font failures never fired, because the fetch didn't fail. It "succeeded" with an empty hand. The error handling was correct for the error it was watching for. The actual failure walked in through the one door nobody was watching. Lesson: Don't skip testing the live workflow. Every one of these passed its own test. The deploy deployed. The tabs matched the spec. The findings were accurate. The card route ran. If I'd trusted "it works," all four would have shipped. The thing that caught them was not better prompting and not a smarter agent. It was me looking at the actual output and asking a more simplified question than the agent was capable of asking. Not "did it run." "Is this the thing I wanted." A 62 on every site is suspicious if you bother to scan a second site. Three tabs are a trap if you click the ones you didn't finish. A finding is useless if you read it as your own user instead of as the engineer who wrote it. Agents optimize for what you said. The whole job of the human in the loop is to keep checking what you said against what you meant, because the agent can't see the difference and you're the only one who can. This reads like I haven't learned my own lessons that I've been writing about. So, yes and no? The agents did weeks of real work in days. The audit agent caught real bugs the tests missed. The infrastructure is solid. I would not give that back. But there's a reason the model is "I direct, they generate" and not "they build, I watch." Direction is not a one-time instruction you hand off. It's the continuous act of holding the work up against intent and saying "close, but that's not it." The agents are extraordinary at "exactly what you asked." Knowing what to ask, and noticing when the answer is technically perfect and quietly wrong, is the part that's still mine. The deploy succeeded. Not the deployment I thought it was. And now I know to look twice. All four of these are from building Agentis Lux, an agent-readiness scanner. Yes, a tool that tells other people what agents can't read shipped a stub, hid a broken tab, and rendered an empty card. It's in the open if you want to watch me keep catching myself: https://github.com/earlgreyhot1701D/perseus-clew https://github.com/earlgreyhot1701D/perseus-clew . AI assisted. Human approved. Powered by NLP. Built with Kiro, Claude, and a lot of looking at the actual output.