How we stopped our AI assistant from hallucinating bug fixes

The team behind LightShield, a SIEM built by LS-SIEM LLP, developed qa-probe, an open-source tool that keeps AI coding assistants from hallucinating bug fixes by giving them ground-truth evidence. The tool analyzes source code, probes live endpoints, and classifies root causes with calibrated confidence, so assistants can reason from evidence rather than guess from status codes. Released under Apache-2.0, qa-probe supports frameworks such as FastAPI, Express, and Next.js, and integrates over MCP with tools such as Claude and Cursor.

Cover: a real qa-probe run against our own stack, cropped to the summary - internal product detail withheld.

We are building LightShield, a SIEM that is in active demo right now. We built most of it pair-programming with an AI coding assistant wired in over MCP - it ran our stack, read the errors, and patched its own code. For a small team that is a superpower. Until an endpoint failed.

Here is the loop we kept hitting. A route returns a 500, or a 404, or an empty response. The assistant looks at the status code and announces the cause with total confidence. Then it rewrites a handler that was never broken - because a status code is not a cause, and it had nothing else to go on. So it guessed, and it guessed wrong, and the diff made things worse.

The thing is, that empty response had at least six possible causes. Same symptom, six different fixes. We could bisect to the real one. The AI could not - it had no ground truth, so it manufactured one.

qa-probe gives it that ground truth. It analyzes the app, probes the live endpoints, and classifies each failure with a root cause and a fix hint. Three decoupled, cached phases:

    qa-probe analyze   # parse source + OpenAPI -> route graph
    qa-probe probe     # hit live endpoints (HTTP/SSE/WS), record evidence
    qa-probe report    # classify root cause -> HTML / Markdown / JSON / AI-context

or just:

    qa-probe run

It has adapters for FastAPI, Express, Next.js, tRPC, GraphQL, and a generic fallback, so it discovers your routes instead of you hand-listing them.

Each result carries the evidence (the real request, a bounded response sample, the timing), a root cause from ~25 categories, and a calibrated confidence - high, medium, or none. When it cannot tell, it returns none instead of bluffing. No neural network, no black box - transparent rules plus per-endpoint stat memory, so you can always read why it landed on a verdict. An AI consuming this needs to verify the claim, not trust a vibe.

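
To make the "transparent rules, none instead of bluffing" idea concrete, here is a minimal sketch of a rule-based classifier over recorded evidence. It is not qa-probe's actual code or schema: the Evidence and Verdict shapes, the two rules, the category names, the fix hints, and the localhost URLs are all hypothetical, invented for illustration.

```typescript
// Illustrative only: these types and rules are assumptions, not qa-probe's real API.
type Confidence = "high" | "medium" | "none";

interface Evidence {
  request: { method: string; url: string };
  status: number;
  bodySample: string; // bounded sample of the response body
  durationMs: number;
}

interface Verdict {
  rootCause: string | null;
  confidence: Confidence;
  fixHint: string | null;
  matchedRule: string | null; // which rule fired, so the verdict can be audited
}

interface Rule {
  name: string;
  rootCause: string;
  confidence: Exclude<Confidence, "none">;
  fixHint: string;
  matches: (e: Evidence) => boolean;
}

// Two made-up rules; a real rule set would be larger and tuned per framework.
const rules: Rule[] = [
  {
    name: "unhandled-exception-trace",
    rootCause: "unhandled_exception",
    confidence: "high",
    fixHint: "Read the stack trace in the response sample before editing the handler.",
    matches: (e) => e.status >= 500 && /Traceback|\n\s+at /.test(e.bodySample),
  },
  {
    name: "empty-collection-on-200",
    rootCause: "empty_data_source",
    confidence: "medium",
    fixHint: "Check whether the backing table has rows before touching the route.",
    matches: (e) => e.status === 200 && e.bodySample.trim() === "[]",
  },
];

// First matching rule wins; if nothing matches, report "none" rather than guess.
function classify(e: Evidence): Verdict {
  for (const rule of rules) {
    if (rule.matches(e)) {
      return {
        rootCause: rule.rootCause,
        confidence: rule.confidence,
        fixHint: rule.fixHint,
        matchedRule: rule.name,
      };
    }
  }
  return { rootCause: null, confidence: "none", fixHint: null, matchedRule: null };
}

const emptyList: Evidence = {
  request: { method: "GET", url: "http://localhost:8000/items" },
  status: 200,
  bodySample: "[]",
  durationMs: 12,
};

const bare404: Evidence = {
  request: { method: "GET", url: "http://localhost:8000/missing" },
  status: 404,
  bodySample: "",
  durationMs: 3,
};

console.log(classify(emptyList));
console.log(classify(bare404));
```

The design point is the last return statement: a bare 404 with no body matches no rule, so the sketch reports confidence "none" and names no cause. Because every verdict records the rule that produced it, a human or an assistant can check the claim against the evidence instead of taking it on faith.
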
Running qa-probe mcp exposes 8 tools to Claude, Cursor, or any MCP client. The assistant stopped reasoning from a status code and started reasoning from evidence: "empty database, high confidence, here is the response that proves it." It seeded the DB instead of rewriting the handler. It fixed the right layer. The guessing basically stopped.

It helped us debug faster. It helped the AI more - because an AI is only as good as the evidence you hand it, and "the endpoint is failing" is not evidence.

It is early and it is open, and help is welcome. One housekeeping note: contributions are sign-off based (DCO) - commit with git commit -s so the project's licensing stays clean. That is the only hoop.

We built it for ourselves. It worked well enough that we cleaned it up and released it under Apache-2.0.

    npm i -g qa-probe

Built by LS-SIEM LLP. If you run it against your own API, I would genuinely like to know what it found - that feedback is how the rules get sharper.