Why I'm building a place to practice catching AI's bugs

A developer is building Loupe, a platform for practicing the skill of finding bugs in AI-generated code. The project was inspired by the realization that AI coding agents often produce code that passes tests but is still incorrect, and that developers are not trained to verify such code. Loupe presents users with real AI-written code that runs and passes tests, challenging them to identify the parts that are quietly wrong.

Two moments made me start building Loupe. Neither was some grand realization — they were just things that kept happening until I couldn't ignore them. The first happened several times a day. I'd hand a feature to my AI coding agent — Claude Code, Codex, whichever — and minutes later it would come back: done, tests passing. Green checkmarks everywhere. But when I actually read what it had written, I kept noticing the same thing: the tests weren't written to check the feature. They were written to pass. They cleared the bar, let the agent declare victory, and moved on. The code ran. The demo worked. Whether it was actually correct was a separate question — and nobody, human or machine, had bothered to ask it. The second happened in a meeting. A teammate and I agreed to build a specific feature. Some time later, they told me it was done. No docs, no walkthrough, just "done." So I asked the most basic questions I could think of. How many endpoints does it have? What happens in this edge case? Silence. They couldn't answer. The feature was "finished," and the person who shipped it couldn't describe what they'd shipped. Both times, the exact same thought hit me: we're about to put this in front of real users, and not one person has actually verified it. I couldn't let it go, because it isn't a one-off — it's the shape of how we build now. Writing code became almost free; AI produces it in seconds. But reading code, understanding it, and being able to say with confidence "yes, this is correct" didn't get any easier. If anything we do it less than we used to, because "the tests pass" feels like permission to stop looking. The cheaper generation gets, the more code flows past unread. And the failures from that gap aren't loud. A crash you catch instantly. The dangerous bug is the one that runs perfectly and just returns the wrong number — the refund that's slightly too large, the query that quietly drops a row, the check that never actually runs. It passes the test. It demos fine. It's wrong. And it ships, because everything looked green. So I'm building Loupe: a place to practice the one skill all of this quietly demands and almost nobody trains. You get real, AI-written code that runs and passes its tests — and your job is to find the part that's quietly wrong. Not to write it faster. To read it and catch what everyone else waved through. Because "it runs" and "it's right" are not the same sentence. Someone still has to know the difference — and I'd like more of us to be that someone. This is the first post in a series where I'll build this in the open: the decisions, the mistakes, the early users, the real bugs. Follow along if that's your kind of problem. → theloupe.dev