How To Use AI Safely In Legacy Codebases

A developer has outlined three rules for safely using AI agents in legacy codebases: tests first, small diffs, and business rule discovery before any change. The approach warns that AI agents, when let loose on old systems, can silently break critical business logic by reasoning from incomplete information. The developer advocates for characterization tests that pin down current behavior before any AI-generated modification is allowed.

You've inherited a system that nobody wants to touch. It's ten years old. There are class files with three thousand lines. The naming convention changed sometime around year four. The original author left in year six. The test coverage is somewhere between "lol" and "we have one test, it imports the framework". And now leadership has discovered AI agents and would like you to "modernize" all of it by Q3.

You know what happens next if you let the AI loose with no rules. It rewrites a payroll calculation it doesn't understand. It "cleans up" a strange-looking conditional that was actually compensating for a bug in a downstream system. It moves a function into a new helper file and quietly drops a side effect the original method had. It produces a 4,000-line diff that looks beautiful and breaks the second-largest revenue stream in the company.

This isn't an AI problem. It's a legacy problem. Legacy code is full of decisions that look like accidents but aren't. AI sees the surface and reasons about the surface. Legacy reasoning lives underneath the surface.

There is a way to use AI well here. It just isn't the same way you use it on a greenfield React app. It's slower, more disciplined, and more rewarding when you get it right.

This piece is about that discipline, built around three rules that, in my opinion, are the difference between "AI made my legacy codebase better" and "AI made my legacy codebase someone else's problem".

The three rules are: tests first, small diffs, and business rule discovery before any change. Let's break each one down, after a look at why legacy code trips up AI in the first place.

Modern AI agents are extraordinarily good at code that looks like what the internet has a lot of. React components. Express handlers. CRUD endpoints with normal naming. Anything where the code's behavior is mostly inferable from its shape. Legacy code is the opposite. Its shape lies to you.
A function called calculateTotal might recalculate the total and mutate the invoice in place and emit an audit event and set a flag that decides whether the customer gets a refund. None of that is in the name. None of it is in the type signature, because there often isn't one. The only reason you know is that someone got fired three years ago for not knowing.

When an AI agent looks at calculateTotal, it sees the name. It sees the body. It does not see the seven downstream consumers that depend on the side effect, the one cron job that breaks if the flag isn't set, or the SOC 2 auditor who reads the audit log every quarter.

This is the core failure mode. The model is not lying or hallucinating. It is reasoning correctly from incomplete information. The problem is that in legacy systems, almost all information is incomplete. The behavior of the system is encoded across the code, the database, the operations runbook, the support tickets from 2019, the Slack DM from a former engineer, and the institutional knowledge of three people who happen to still work there.

Letting AI loose in that environment without rails is not "moving fast". It's gambling. And the bets are silent until the next month-end close. The three rules below exist to put rails around that gambling. They don't slow you down. They slow the AI down, which is a feature.

The first rule: no AI-generated change goes into a legacy codebase before a test pins down the current behavior of the area you're touching.

This isn't about quality. It's about leverage. A test gives you something to compare the AI's output against. Without it, you have no signal except "the diff looks plausible", which is the worst possible signal in a legacy environment.

The tests you write here are not unit tests in the textbook sense. They are characterization tests, a term from Michael Feathers' Working Effectively with Legacy Code. A characterization test doesn't verify that the function does the right thing.
It verifies that the function does the current thing. Whatever the code does today, the test asserts that exactly.

The loop is short: record the function's current behavior as tests, confirm those tests pass, make the change, and run the same tests again.

This is a place where AI can actually help with the boring part. Generating the characterization tests themselves is a great AI task. The model is excellent at "here's a function, write me a dozen test cases that cover its visible inputs and outputs". You're not asking it to understand the code. You're asking it to enumerate.

A short example, deliberately language-agnostic:

characterization-test-loop.txt

    You: Here is the function calculateLatePenalty from billing/penalties. Generate 15 test cases. For each, give me realistic inputs and the output the function actually returns when run. Do not improve the function. Do not flag bugs. Just record its current behavior.
    AI: (writes 15 cases)
    You: (run the cases as tests, confirm they pass)
    You: Good. Now refactor calculateLatePenalty to use the new RateProvider interface. Do not change any case's output.
    AI: (writes the refactor)
    You: (run the tests again)

The interesting thing about this loop is what it does to your trust calibration. You stop trusting the AI's diff because it looks clean. You start trusting it because it survived a contract that you wrote first.

For untestable code (code with hard-wired database access, network calls, time-dependent behavior, or global state), you have a harder problem. The honest answer is that you may need to introduce a seam first (a single dependency injection point, or wrapping a now() call in a function you can stub) before any characterization is possible. That seam itself becomes the first small AI-assisted change, with its own tests around the seam point. Yes, it's slow. The alternative is silently broken billing.

Warning: Do not let the AI rewrite the test if it fails after a refactor. That's the entire point of the test. If the test fails, either the refactor is wrong or the test was wrong to start with. Investigate.
Don't make the symptom go away by deleting the assertion.

The second rule is the one that's hardest to follow because the AI is genuinely capable of producing huge changes, and huge changes feel productive. They aren't.

In a legacy codebase, diff size is risk. Every line the AI touches is a line that has to be reviewed by a human who understands the system better than the AI does. The number of lines a human can review carefully per hour is roughly constant. So a 200-line diff is reviewable. A 2,000-line diff is not. It gets skimmed, and skimming a legacy refactor is how you ship the bug.

The discipline is to cap every AI-assisted change at one reviewable unit of work. My own rough rule of thumb: under 300 lines changed, touching no more than three files, with a single named goal. If the AI wants to do more, you split the task.

This sounds obvious. The reason it's hard is that the AI doesn't volunteer to stop. You ask it to "modernize the order service" and it will gladly return a 1,800-line diff that touches twelve files and "incidentally" reformats two of them. You have to be the one who keeps the leash short.

Here are concrete prompts that keep diffs small:

prompts-that-cap-diff-size.txt

    ✗ "Refactor the order service to use the new pricing engine."
    ✓ "Replace only the call to OldPricing::calculate inside OrderService::computeSubtotal. Use NewPricing::price. Do not change any other method. Do not reformat unrelated code. Return only the changed lines."

    ✗ "Clean up the user authentication module."
    ✓ "In auth/Login.php, extract the password comparison (lines 142-156) into a private method. Keep behavior identical. Do not touch any other file."

    ✗ "Add error handling everywhere it's missing."
    ✓ "In PaymentGateway::charge, wrap the third-party call (line 88) in a try/catch. Log the exception via the existing Logger::error and rethrow. No other changes."
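The size cap described above can also be checked mechanically before a human spends any review time. What follows is a minimal Python sketch, not part of the original workflow: it assumes you feed it the text produced by `git diff --numstat` (one `added<TAB>deleted<TAB>path` row per file), and the names `parse_numstat`, `within_budget`, `MAX_LINES`, and `MAX_FILES` are hypothetical. The thresholds simply mirror the rule of thumb of under 300 changed lines and at most three files.

```python
from dataclasses import dataclass

# Assumed thresholds, taken from the rule of thumb in the text.
MAX_LINES = 300
MAX_FILES = 3


@dataclass
class DiffStats:
    files: int
    lines_changed: int


def parse_numstat(numstat: str) -> DiffStats:
    """Summarize `git diff --numstat` output (added<TAB>deleted<TAB>path per row)."""
    files = 0
    lines = 0
    for row in numstat.splitlines():
        if not row.strip():
            continue
        added, deleted, _path = row.split("\t", 2)
        files += 1
        # Binary files report "-" for both counts: count the file, not the lines.
        if added != "-":
            lines += int(added)
        if deleted != "-":
            lines += int(deleted)
    return DiffStats(files=files, lines_changed=lines)


def within_budget(stats: DiffStats,
                  max_lines: int = MAX_LINES,
                  max_files: int = MAX_FILES) -> bool:
    """True if the diff is small enough to hand to a human reviewer."""
    return stats.lines_changed < max_lines and stats.files <= max_files


if __name__ == "__main__":
    sample = "12\t4\tbilling/penalties.py\n-\t-\tassets/logo.png\n"
    stats = parse_numstat(sample)
    print(stats, "OK" if within_budget(stats) else "split this change")
```

A check like this only covers size. The "single named goal" part of the rule still needs a human to judge.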
The pattern in those prompts is the same every time: name the file, name the function, name the line range, and explicitly forbid drift. The forbidding is what most engineers skip, and it's what makes the difference. "Do not touch any other file" is a clause the model generally takes seriously.

There's a second benefit to small diffs that nobody talks about: when something breaks, your bisect surface is tiny. If a one-file, twenty-line AI-assisted commit broke production, you know exactly where to look. If the offending commit is a 2,000-line "modernization" that touched twelve files across three modules, you're going to spend Friday night reading diffs instead of having dinner.

Small diffs also let you build a rhythm: tests pass, small diff, human review, commit, repeat. Each loop is maybe fifteen minutes. After a week of that you've shipped fifty small improvements with confidence, where the alternative, one giant AI refactor PR, is still sitting in review three weeks later because nobody wants to be the person who approved it.

The third rule is the one that separates senior engineers from everyone else using AI on legacy code.

Most legacy functions are not really code. They are encoded business rules. The function calculateShippingDiscount is not really about discount math. It is about a policy negotiated with a logistics partner in 2018, modified by a regulator in 2021, and patched by a panicked engineer in 2023 to cover an edge case nobody could reproduce. The current code is a record of all those decisions, not an explanation of them.

When the AI looks at calculateShippingDiscount, it sees the math. It does not see the policy. So if you say "simplify this function", what it produces is a cleaner version of the math that may quietly violate the policy.

The fix is not to be smarter about prompts. The fix is to extract the business rules before you refactor, and confirm them with a human who actually knows the rule.
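To make that failure mode concrete, here is a hypothetical Python sketch. Nothing in it comes from a real system: the function bodies, the rates, and the region code "NO" are all invented for illustration. The first function stands in for the legacy code, with one branch that looks like leftover clutter but encodes a made-up regulatory cap. The second is the kind of "simplification" that preserves the visible math and drops the policy.

```python
def calculate_shipping_discount(order_total: float, region: str, is_partner: bool) -> float:
    """Hypothetical legacy version: every branch stands in for a past decision."""
    discount = 0.0
    if order_total >= 100:
        discount = 0.10
    if is_partner:
        discount = 0.15
    # Looks like an accident next to the branches above, but in this
    # invented scenario it is a regulatory cap: region "NO" never gets
    # more than 5%.
    if region == "NO" and discount > 0.05:
        discount = 0.05
    return round(order_total * discount, 2)


def simplified_shipping_discount(order_total: float, is_partner: bool) -> float:
    """A plausible cleanup that treats the region check as dead code."""
    rate = 0.15 if is_partner else (0.10 if order_total >= 100 else 0.0)
    return round(order_total * rate, 2)


if __name__ == "__main__":
    # Same answer on the common path, so the cleanup looks safe.
    print(calculate_shipping_discount(200, "US", False),
          simplified_shipping_discount(200, False))
    # Different answer on the path the policy was written for.
    print(calculate_shipping_discount(200, "NO", True),
          simplified_shipping_discount(200, True))
```

The two versions agree on the common path and disagree only where the policy applies, which is why a diff review alone is unlikely to catch the difference.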
The process I use looks like this: have the AI list the rules the code implies, review that list with someone who knows the domain, record the confirmed rules next to the code, and only then refactor.

A short example of the kind of prompt that works:

business-rule-extraction-prompt.txt

    Here is the function calculateShippingDiscount. Do not refactor it. Do not propose improvements. Do not flag bugs. Read it and produce a numbered list of the business rules implied by the code. Each rule should be one sentence, written in plain English a product manager could read. If the code suggests a rule but you are not sure, label that rule with UNCERTAIN and explain what's unclear.

The AI is genuinely excellent at this task. It can read three hundred lines of nested conditionals and give you back twelve sentences a human can review. That kind of summary, run past a real domain expert, will surface the dead rules, the misunderstood rules, and the load-bearing rules that the next refactor must preserve.

A real moment of value here: more than once, I have run this loop and watched the domain expert react with something like, "wait, we stopped doing that in 2022, why is the code still checking the flag?". That single sentence is worth more than any refactor. The AI didn't find the bug. The conversation between the AI's summary and a human who remembers the policy found the bug.

Tip: Save the business rule list, with the provenance notes, in the repo, near the code: a RULES.md next to the file, or a long docblock on the function. The next person who refactors this code in 2028 will thank you. So will the AI agent of 2028 trying to read it.

Three things to watch out for when doing rule extraction:

Don't accept the AI's confidence level at face value. If a rule is implied by a single conditional branch that looks dead, the AI may state it as a confident rule. That's where the UNCERTAIN label matters. Force the model to mark its own confidence.

Don't let the AI propose changes during this step. Rule extraction and refactoring are two different jobs.
If you let them happen in the same prompt, the rule list will quietly omit the rules the AI plans to "improve away", and you'll lose the very thing you needed to preserve.

Don't skip the human conversation. The AI cannot tell a current rule from a dead rule. Only a human with institutional memory can. If there is no such human left, that itself is a finding, and your refactor needs to be much more conservative.

Tests first, small diffs, business rule discovery. Each rule does a different job, and they reinforce each other. A real working loop, for any one change, chains them together: confirm the business rules for the area, pin its current behavior with characterization tests, make one small, tightly scoped change, rerun the tests, get a human review, and commit.

Each loop is small enough that if any single step goes wrong, the blast radius is limited to one area you just covered with tests. The cumulative effect, over weeks and months, is enormous. A legacy codebase that nobody wanted to touch becomes a codebase where every part that was touched is now better-tested, better-documented, and better-understood than it was before.

That last part is the quiet win. Using AI safely in legacy code doesn't just produce safer changes. It produces more knowledge about the system as a side effect. The characterization tests are documentation. The business rule lists are documentation. The small commits with clear messages are documentation. None of that existed before you started.

A short list of things I have seen go badly, and that the three rules above are designed to prevent:

Pasting a whole class file in and asking for a "refactor". The AI will return something that compiles, looks clean, and is wrong in a dozen subtle ways. You cannot review it. Your colleagues cannot review it. Don't do this.

Letting the AI "fix" tests that broke after a refactor. If a characterization test was passing yesterday and is failing today, the AI's refactor is the suspect, not the test. Letting the AI edit the test makes the failure invisible. This is one of the few absolute rules in the loop.

Doing rule extraction and refactoring in the same prompt.
The two jobs corrupt each other. Extraction needs to be conservative and skeptical; refactoring needs to be assertive. Mixing them produces a refactor that quietly skips the rules the model didn't fully grasp.

Trusting "the diff looks fine" as a review. In a legacy codebase, no diff looks fine to a sufficiently careful reviewer. If it looks fine to you, you either understand the area very deeply, or you're not looking hard enough. Reread it.

Treating AI as a senior engineer. It isn't. It is an extremely capable, extremely fast, extremely confident intern who has read every book but has never been on call for your system. Brief it like an intern. Review its output like an intern. Give it small, well-defined tasks like an intern. The output is fantastic in that mode.

The teams I see succeeding with AI in legacy systems are not the ones who let the agent loose with the biggest prompt and the most context. They are the ones who treat the AI as a force multiplier on a careful, disciplined process they would have followed anyway.

Tests first means the AI has a contract to satisfy. Small diffs means a human can actually check the work. Business rule discovery means the AI is changing code in a space where the rules are explicit, not implicit.

That is not slowing AI down. That is letting AI do what it's actually good at, generating code quickly inside a tight, well-defined contract, without letting it do what it's bad at, which is owning the consequences of a change to a system it doesn't understand.

The legacy codebase didn't get the way it is overnight. It won't get safe overnight either. But every loop you complete makes the next one easier, and after enough loops, the codebase nobody wanted to touch starts to feel like one you actually enjoy working in.

That's the goal. Not "AI rewrote our legacy system". Just: "we made this code a little safer this week, and we'll do it again next week, and we have the tests to prove it both times".
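As a concrete footnote to "we have the tests to prove it", the tests-first loop can be sketched in a few lines of plain Python. Everything here is an assumption made for illustration: the penalty logic, the case list, and the helper names `record` and `check` are hypothetical, and a real project would more likely use its existing test framework. Note that `today` is passed in as a parameter, which is the kind of seam that makes time-dependent code testable.

```python
from datetime import date


def calculate_late_penalty(amount_cents: int, due: date, today: date) -> int:
    """Stand-in for the legacy function (invented logic, amounts in cents)."""
    days_late = (today - due).days
    if days_late <= 0:
        return 0
    penalty = amount_cents * 2 // 100        # 2% once the invoice is late
    if days_late > 30:
        penalty += amount_cents * 1 // 100   # extra 1% after 30 days
    return min(penalty, 5000)                # capped at 5000 cents


def tidy_late_penalty(amount_cents: int, due: date, today: date) -> int:
    """A clean-looking refactor that merges the two percentages into one."""
    days_late = max((today - due).days, 0)
    rate_pct = 0 if days_late == 0 else (3 if days_late > 30 else 2)
    return min(amount_cents * rate_pct // 100, 5000)


DUE = date(2024, 1, 31)
CASES = [
    (10_000, DUE, date(2024, 1, 31)),     # paid on the due date
    (10_000, DUE, date(2024, 2, 1)),      # one day late
    (10_000, DUE, date(2024, 3, 15)),     # more than 30 days late
    (170, DUE, date(2024, 3, 15)),        # small odd amount, very late
    (1_000_000, DUE, date(2024, 3, 15)),  # large enough to hit the cap
]


def record(fn, cases):
    """Capture whatever fn returns today, right or wrong."""
    return [fn(*case) for case in cases]


def check(fn, cases, golden):
    """Return (case, expected, actual) for every case that no longer matches."""
    mismatches = []
    for case, expected in zip(cases, golden):
        actual = fn(*case)
        if actual != expected:
            mismatches.append((case, expected, actual))
    return mismatches


# Step 1: pin the current behavior before any change is made.
GOLDEN = record(calculate_late_penalty, CASES)

if __name__ == "__main__":
    # Step 2: run the refactor against the recorded behavior.
    for case, expected, actual in check(tidy_late_penalty, CASES, GOLDEN):
        print(f"behavior changed for {case}: was {expected}, now {actual}")
```

In this sketch the refactor reads as equivalent, yet it changes the result for the 170-cent case because the integer rounding now happens once instead of twice. That is exactly the kind of drift a recorded case catches and a visual diff review tends to miss.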
Originally published at nazarboyko.com.