AI didn’t invent code drift. It made it harder to ignore.

A developer building a real frontend/backend system with AI assistance discovered that AI agents gradually lose coherence as systems grow complex, causing silent misalignments between code layers. This led to the creation of Scarab, a diagnostic tool designed to keep repository context stable during AI coding sessions. The developer argues that code drift is not new, but AI accelerates it to the point where discrepancies between a repo's claims, actual behavior, tests, and runtime can no longer be ignored.

I started noticing this while building a real frontend/backend system with AI assistance. The problem was not that the AI could not code. It could. It could solve local problems, patch visible failures, and make reasonable suggestions. The problem was that as the system became more complex, the agent would start losing the thread. A change in one layer would quietly violate an assumption in another. A fix would target the nearest visible symptom instead of the deeper boundary that had failed. The repo would still build. The UI might still render. The tests might still pass. But the system was slowly falling out of alignment with itself. That is what led me to build Scarab. Not as another coding agent. Not as a magic repair tool. Not as something that tells a developer how to fix their repo. Scarab began as a diagnostic response to a much simpler question: How do you keep repo context steady while an AI coding agent is working inside it? The more field testing I’ve done, the more I think this problem is bigger than “AI drift.” Human-built repos drift too. Docs get stale. Tests prove only part of the truth. Runtime behavior contradicts the intended architecture. A patch makes the red turn green without making the system more coherent. AI did not invent that problem. It just made it faster, louder, and much harder to ignore. The deeper issue is not only whether code works. It is whether the repo’s evidence layers still agree. What does the repo claim? What does the code actually do? What do the tests really prove? What does the runtime contradict? What baseline is the repair preserving? That is the space I’m interested in: deterministic diagnostics before repair, so developers and AI agents are not just arguing from vibes. I wrote the longer theory behind this here: https://scarabsystems.substack.com/p/ai-didnt-invent-code-drift-it-made?r=8isus0 https://scarabsystems.substack.com/p/ai-didnt-invent-code-drift-it-made?r=8isus0