Migrating from Claude to DeepSeek without breaking everything

Firetiger migrated its agent fleet from Claude to DeepSeek, achieving a 62% reduction in inference costs from $606K/yr to $231K/yr. The migration required extensive prompt engineering and evaluation to close a quality gap, with DeepSeek 4.0 Pro scoring 80% accuracy with reasoning versus Claude Sonnet 4.6's 94%.

Firetiger runs a fleet of agents that investigate incidents, monitor deployments, and dig through telemetry on behalf of our customers. Every one of those agents is, at its core, a loop around an LLM, which means our single biggest cost of goods sold is inference. The models we choose to power our agents affect our whole business: the product experiences we can deliver, how much we need to charge customers, and how we handle increased load and scale.

Claudes of various flavors and vintages have been our models of choice for the past year, primarily served through AWS Bedrock. Recently, with adoption and costs growing, we looked more seriously at migrating to an open source, lower cost model. Improving token economics lets us build even more product experiences, building on the success of things like Change Monitors and Service Monitors.

Switching from Claude to DeepSeek in production, we realized a 62% reduction in real dollar spend for the first three agent types migrated, going from roughly $606K/yr on Anthropic to $231K/yr on DeepSeek today.

After initial explorations, we put our focus on the DeepSeek v4 model family as our cheaper date.

Just swap the model

s/claude/deepseek/g

A naive plan for this change: point our API client at a different endpoint, save a bunch of money, go home. In a simple chat product you might even get away with that plan.

Alas. Our agents run long in both time and scope: multi-step investigations with dozens of tool calls. Small behavioral differences between models compound over time and would have a huge impact on overall agent trajectories and product quality.

We formed three hypotheses the experiment needed to prove for the migration to succeed:

- DeepSeek models are capable of our tasks without a heroic amount of effort. The metrics that matter: Task completion accuracy.
- DeepSeek models will be cheaper than Claude, measured on a cost per task basis.
- DeepSeek powered agents will accomplish their tasks and behave in ways similar to Claude powered agents. The metrics that matter: Steps to completion, time to completion, and error rate.

To start small, we scoped our initial experiment to three representative tasks: user-defined tasks centered on exploring their own product, plan generation for monitoring code changes ("given this PR, come up with a plan to monitor the deployment and make sure it does what it's supposed to and doesn't break anything"), and post-hoc root cause analysis.

Measuring accuracy required good evals. Ours come in two flavors. Static evals run locally against stored data and mostly target specific capabilities where we've seen models struggle. Living eval datasets update every week: interesting agent sessions get detected, analyzed, and promoted into the dataset as older cases expire.

With questions and metrics in place, we flipped our model string and API endpoint for a subset of workloads.

Step one: measuring and closing the quality gap

We started by swapping models and running task completion evals. DeepSeek 4.0 Pro scored 65% without reasoning and 80% with it, against our baseline Sonnet 4.6's 94%. These were respectable scores for a drop-in replacement, but nowhere near shippable. Given the early data, we decided that using reasoning with DeepSeek was non-negotiable.

From here, we ran a self-improvement loop that modified the prompt and tool descriptions. Reviewing each turn and each identified issue showed us where DeepSeek struggles relative to Claude. DeepSeek is clearly a capable model, but a few patterns showed up.

First: when creating a plan to monitor PR changes going to production, it had trouble finding the secondary and tertiary effects the code would have. Call it "inferring non-local dependencies from a local artifact".

When our agents look at a diff, they see:

- What the code does.
- What the code calls.
- What calls the code (found with grep).

What they don't see, but need to reason about:

- Who reads the data this code produces.
- Who depends on the timing this code controls.
- What invariants other systems assume about this code's behavior.

Claude tended to ideate about that second list unprompted, while DeepSeek needed to be told to think carefully about second order effects. Here's the actual change the loop made to the planning prompt:

    @@ "Understand the changes" section
     Then:
     - Read affected files to understand context
     - Trace code paths to identify what services or components are affected
     - Pay attention to the PR title, description, and user's request for specific concerns
    +
    + Before researching telemetry, complete this two-step trace:
    + 1. List what this change produces or alters.
    + 2. For each, name what depends on it that does NOT appear in the diff.
    + Those off-page dependents are the plan. If a candidate check
    + only references things the diff touches, you are monitoring
    + the producer, not the change.

Second pattern: DeepSeek would often ignore additional context it was given, from other agents or from the user, and start from scratch, redoing work and reprioritizing subsets of the task. In one case, the PR description spelled out exactly which changes mattered, and DeepSeek reprioritized based on lines changed per section instead. The model ignored context in the artifact it was analyzing and fell back on a crude heuristic.

The fix involved reframing how context is presented. Our original prompt was hedged: it told the model notes were "NOT authoritative", and DeepSeek read that hedge as permission to discard them entirely. The fix:

    @@ "Research available telemetry" section
    - - Use read notes to read the "general" and "deploy" notes for
    -   system knowledge. Notes are general learnings, NOT the
    -   authoritative definition of this PR.
    + - Flows, notes, and attached indicators are directional
    +   context — what humans and prior agents have already
    +   established about this codebase. Anchor on them. The exact
    +   numbers may have drifted; the shape of what they describe
    +   usually hasn't. Verify what matters for this PR, but don't
    +   re-investigate from scratch what the context already
    +   establishes.
    + - The exception: if a note describes a different PR with
    +   similar keywords, ignore it. Notes are general; this PR is
    +   specific.

Third: during investigations, the model would get stuck in local minima, diving deep into a particular hole and getting lost there while the actual root cause sat elsewhere.

None of these are exotic failures, but they’d be easy to miss if the test methodology stopped at eyeballing a few responses.

With these fixes in place, DeepSeek v4 Pro with reasoning scored 92% on our eval set, within two points of Claude Sonnet 4.6's 94% baseline. Critically, the prompt changes did not change Sonnet 4.6's performance, allowing us to transition between model families without further special cases or IF MODEL == X DO Y type logic.

At this point, we’d proven to ourselves that DeepSeek can reliably accomplish what we need it to. Next: how does it get there, at what cost, and by what path?

Lies, damn lies, and evals

When we dug into cost per task data, the numbers hadn't dropped nearly as far as the per-token pricing spread between the models would suggest. Two reasons became apparent:

- DeepSeek had a tendency to inflate its own output along the way, writing notes to itself and producing large plans and larger conclusions.
- In addition to per-model differences in how tokens are consumed and emitted, in the process of optimizing our prompts, we'd increased both input and output token usage.

Another round of prompt optimization, plus validation on token count for key outputs, brought numbers back in line with the spend curve we originally envisioned.

We next focused on time and number of steps required to complete tasks.
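
A comparison of time and steps between two models can be sketched roughly as follows. This is an illustration under stated assumptions, not our eval harness: the `Run` record, its field names, the thresholds, and the sample numbers are all hypothetical. The one design choice worth noting is that only tasks both models passed are compared, so the ratios describe how a model reached a correct answer rather than whether it did.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One agent session on one eval task (hypothetical record shape)."""
    task_id: str
    model: str
    passed: bool
    steps: int
    seconds: float
    cost_usd: float


def trajectory_ratios(runs, baseline, candidate):
    """Per-task candidate/baseline ratios for steps, time, and cost.

    Tasks that either model failed are skipped: accuracy is measured
    separately, and this view is only about the path to a right answer.
    """
    by_key = {(r.task_id, r.model): r for r in runs}
    ratios = {}
    for (task_id, model), base in by_key.items():
        if model != baseline:
            continue
        cand = by_key.get((task_id, candidate))
        if cand is None or not (base.passed and cand.passed):
            continue
        ratios[task_id] = {
            "steps": cand.steps / base.steps,
            "seconds": cand.seconds / base.seconds,
            "cost": cand.cost_usd / base.cost_usd,
        }
    return ratios


def flag_regressions(ratios, max_steps=1.5, max_seconds=1.5):
    """Task ids where a passing candidate run was much slower or longer."""
    return sorted(
        task_id
        for task_id, r in ratios.items()
        if r["steps"] > max_steps or r["seconds"] > max_seconds
    )


# Made-up sample data, for illustration only.
runs = [
    Run("rca-1", "claude", True, steps=20, seconds=300.0, cost_usd=1.00),
    Run("rca-1", "deepseek", True, steps=40, seconds=1200.0, cost_usd=0.50),
    Run("plan-1", "claude", True, steps=10, seconds=120.0, cost_usd=0.40),
    Run("plan-1", "deepseek", True, steps=11, seconds=130.0, cost_usd=0.15),
]
ratios = trajectory_ratios(runs, baseline="claude", candidate="deepseek")
print(flag_regressions(ratios))  # ['rca-1']
```

In this made-up data, `rca-1` passes on both models and is cheaper on the candidate, yet it is still flagged because it takes twice the steps and four times the wall-clock time.
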
In one example, a DeepSeek investigation on an RCA task arrived at the right conclusion with full marks from the eval but took 4x as long and 2x as many steps. Looking at tool usage, we found that DeepSeek reached for internet lookups far more often than Claude did, and in this case spent ten minutes in a rabbit hole reading Go documentation, most of which had nothing to do with the ultimate root cause. If we'd only looked at eval score, we would have shipped a markedly poorer product experience.

Comparing error rates caught more issues: hallucinated tool arguments, odd choices in which tools got used, and task strategies we'd never seen from Claude.

Changing models and tool calling behavior also caused us to re-examine our agent authorization strategy and policy. Previously, we'd taken implicit dependencies on Claude's behaviors and tendencies. Switching models forced us to add defense in depth to our sandboxing and isolation primitives.

As a final check, we replayed a subset of user sessions on DeepSeek and analyzed differences in behavior, rather than quality, to make sure nothing else would cause issues.

Taking this all to production

In our test benches, we’d kept model behavior predictable, kept quality high, and cut cost significantly, so we called the experiment a success. That meant we were good to move a subset of our production agent operations to DeepSeek, continuing to measure quality, cost, and other metrics as we went.

We started by migrating agents that ran with poor cache hit rates on Claude, because the per-token pricing spread matters more there than it does for agents with high hit rates, which already see very cheap cached token reads. If at least 30% of an agent's Anthropic prompt tokens are cache writes, and the same prompts produce fewer than 2x the tokens on V4 Pro, it's a good candidate to migrate.
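
That rule of thumb is easy to encode as a check. The sketch below is one possible reading, not our production code: the function name and parameters are hypothetical, and it assumes "produce fewer than 2x the tokens" means total tokens consumed for the same prompts on each model.

```python
def is_migration_candidate(
    cache_write_tokens: int,
    prompt_tokens: int,
    claude_tokens: int,
    deepseek_tokens: int,
    min_cache_write_share: float = 0.30,
    max_token_inflation: float = 2.0,
) -> bool:
    """Apply the two-part heuristic to one agent type's usage totals.

    cache_write_tokens / prompt_tokens: share of Anthropic prompt tokens
        billed as cache writes (assumed to come from usage reports).
    deepseek_tokens / claude_tokens: how many more tokens the same
        prompts consume on the candidate model (assumed interpretation).
    """
    if prompt_tokens <= 0 or claude_tokens <= 0:
        return False  # not enough data to judge

    cache_write_share = cache_write_tokens / prompt_tokens
    token_inflation = deepseek_tokens / claude_tokens

    return (
        cache_write_share >= min_cache_write_share
        and token_inflation < max_token_inflation
    )


# 40% cache writes, 1.5x token inflation: candidate.
print(is_migration_candidate(400_000, 1_000_000, 1_200_000, 1_800_000))
# 10% cache writes: caching on Claude is already working well, so skip.
print(is_migration_candidate(100_000, 1_000_000, 1_200_000, 1_800_000))
```

Both thresholds are left as parameters because the 30% and 2x cutoffs depend on the pricing spread between the two providers.
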
In production, we realized 62% reductions in real dollar spend for the first three agent types migrated, going from roughly $606K/yr on Anthropic to $231K/yr on DeepSeek today.

One wrinkle we didn't anticipate: cache tuning with DeepSeek was more complicated in prod than we’d thought. These three agents were the right candidates partly because they had our lowest cache hit rates on Bedrock + Claude. With DeepSeek, we're currently seeing 30-40% hit rates, and if caching matched what we saw on Anthropic, the same workloads would run at $181K/yr, a 70% reduction. We're working with our inference provider, Baseten, to close that gap.

If you're considering this migration yourself, the one-sentence version of our advice is this: the eval score is the start of the work, not the end. Getting to the right answer reliably tells you the migration is possible; getting there efficiently, by a path you understand, is the finish line.