Okay, But I'm Still Using It

wpnews.pro

Where I'm actually deploying AI coding tools at my next organization, and why.

If you read my last post, you know I'm not a believer in the uncritical sense. Context rot is real. The confidence-to-accuracy gap is real. The autonomous decision-making that nobody asked for is real. I've got the debugging hours to prove it. So why am I walking into my next organization with a plan to deploy Claude Code on day one?

Because "not ready to replace your engineering team" and "genuinely useful in specific, bounded ways" are not mutually exclusive. The mistake I see people making is treating this as binary: either AI is going to 10x everything or it's overhyped garbage. That's not what I found. What I found is a tool with a real ceiling that happens to be above the floor for a lot of valuable work. The question worth asking isn't "is it good?" It's "what is it actually good at, and how do you keep it from screwing up the rest?"

Here's where I've landed.

Every time I join a new organization, I spend the first few weeks trying to get oriented in a codebase I've never seen. Some of it is reading. Some of it is asking questions. Historically, that means interrupting a new teammate I barely know, who has their own work to do and may or may not actually remember how that part of the codebase works.

Claude Code changes that equation. The ability to ask questions in plain language ("where does this feature actually live?", "what's touching this module?", "why does this exist?") and get back a reasonably accurate answer on demand is genuinely valuable. Not because it's always right, but because it's always available and it gives you a starting point that you can verify.

I want to be clear: this is less valuable for engineers who've been living in a codebase for two years. They already have that map in their heads. But for someone walking in fresh, having an agent available on demand to answer questions is a real accelerant, and it removes a recurring tax on the people who would otherwise be answering them.

It also changes the complexity of what I can hand a new hire on day one. Before, I'd rely on simple, low-stakes tasks to get someone oriented. Now I can hand a new employee something meaningfully complex, tell them to ask Claude where to start, and let them engage with the real work earlier. That's good for the new hire and good for the team.

Claude is a legitimately effective code reviewer. With one caveat that I think gets chronically glossed over: you have to explicitly ask it to be critical.

Left to its own devices, it defaults to cheerleading. It wants to approve things. Ask it for "a code review" and you'll get something along the lines of "This is awesome, ship it!" but not much in the way of critical feedback. Ask it to be highly critical and something different happens. It starts finding the things you actually want: violations of established patterns, scaling concerns, places where the logic doesn't hold.
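One way to keep the critical framing from depending on each engineer's memory is to put it in a shared prompt template. The following is a minimal sketch, not a recommended wording: the helper name `build_review_prompt`, the `CRITICAL_FOCUS` list, and the exact phrasing are all illustrative assumptions, with the focus areas taken from the ones named above.

```python
# Hypothetical helper: the names and prompt wording are illustrative assumptions.
CRITICAL_FOCUS = (
    "violations of established patterns",
    "scaling concerns",
    "places where the logic doesn't hold",
)


def build_review_prompt(diff: str, critical: bool = True) -> str:
    """Build a review prompt; the critical variant asks for problems explicitly."""
    if not critical:
        # The plain request that tends to produce cheerleading.
        return f"Please do a code review of this diff:\n\n{diff}"
    focus = "\n".join(f"- {item}" for item in CRITICAL_FOCUS)
    return (
        "Review this diff as a highly critical reviewer. "
        "Do not praise the code. List concrete problems only, "
        "and look specifically for:\n"
        f"{focus}\n\n"
        f"Diff:\n{diff}"
    )
```

The resulting string can be handed to whichever agent does the review; the point is only that the request for criticism is written down once instead of retyped from memory.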

I've been experimenting with running two separate Claude instances, one as the coding agent and one as the review agent. Watching them interact is genuinely entertaining. The review agent has no loyalty to the coding agent's decisions and will say so plainly: this isn't going to scale, this violates the practices we established, this was a poor decision. It's more candid than most human code reviews, because there's no social friction involved.

The failure mode to watch for: the review agent can inherit the same context rot as the coding agent. I had a situation where I'd established linting rules early in the process. The coding agent stopped running the linter before creating PRs. The review agent didn't catch it either. I found out when it hit the CI pipeline and failed. I had to go scold both of them.
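A possible guard against this failure mode is to stop relying on either agent to remember the linter and run it as a deterministic step before any PR is opened. The sketch below assumes a Python project that uses flake8 and pytest; those commands, `DEFAULT_GATES`, and `run_gates` are placeholders to swap for the project's real tooling.

```python
import subprocess
import sys
from typing import Sequence

# Assumed example gates; replace with the project's actual lint and test commands.
DEFAULT_GATES = [
    [sys.executable, "-m", "flake8", "."],
    [sys.executable, "-m", "pytest", "-q"],
]


def run_gates(commands: Sequence[Sequence[str]]) -> list[str]:
    """Run every command and return the ones that exited non-zero."""
    failures = []
    for cmd in commands:
        result = subprocess.run(list(cmd), capture_output=True, text=True)
        if result.returncode != 0:
            failures.append(" ".join(cmd))
    return failures
```

Calling `run_gates(DEFAULT_GATES)` from a pre-push hook or a wrapper script, and refusing to open the PR when the returned list is non-empty, moves the check out of the agents' context entirely.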

Use it as a first pass, not a final gate. Have engineers run Claude's review before sending code to a human reviewer. It'll catch a meaningful percentage of issues. It does not replace the human reviewer, and you shouldn't position it that way.

Test generation is the one that doesn't get enough attention. Engineers chronically underinvest in testing, not because they don't value it, but because it's tedious and it competes with shipping. The result is codebases with inadequate coverage that make everything else harder.

Claude is good at test generation, but the most useful thing about it isn't the tests it produces. It's what happens when it struggles. Hard-to-test code is almost always a symptom: a god class, too many responsibilities in one place, hidden dependencies. When you ask Claude to write tests for something and it can't do it cleanly, that's information. The struggle surfaces design problems that were already there.

I'd have engineers use it as a two-part workflow. Ask Claude to write tests for the thing you just built. If it can't do it cleanly, treat that as a signal that the code needs to be reconsidered before it ships. You're not just generating tests. You're using the test generation process as a lightweight design review.
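To make "hard to test is a design signal" concrete, here is a small illustrative example of one symptom from the list above, a hidden dependency. The functions are invented for illustration and are not from any real codebase.

```python
from datetime import datetime, timezone
from typing import Callable


# Hard to test: the clock is a hidden dependency, so the result
# depends on when the test happens to run.
def greeting_hidden() -> str:
    hour = datetime.now(timezone.utc).hour
    return "Good morning" if hour < 12 else "Good afternoon"


# Easier to test: the clock is passed in, so a test can pin the time.
def greeting(now: Callable[[], datetime]) -> str:
    return "Good morning" if now().hour < 12 else "Good afternoon"


def test_greeting() -> None:
    def morning() -> datetime:
        return datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)

    def afternoon() -> datetime:
        return datetime(2024, 1, 1, 15, 0, tzinfo=timezone.utc)

    assert greeting(morning) == "Good morning"
    assert greeting(afternoon) == "Good afternoon"
```

A test for `greeting_hidden` would need to patch the clock or accept a time-dependent result. That kind of awkwardness, whether a human or an agent runs into it, is the signal to reconsider the design.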

Every engineering organization has a graveyard of internal tools that nobody built because they weren't customer-facing, weren't on the roadmap, and never made it past "someone should probably do that someday." Scripts to automate reporting. Dashboards for operational visibility. Small utilities that would save hours every week if they existed.

The metrics dashboard I described in my last post was exactly this. Thirty minutes of manual weekly work, automated in eight hours. Not customer-facing, and not something I could have justified pulling an engineer away from a project with actual stakeholders waiting on it. But real value, delivered.
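The payback on those numbers is easy to check. This is a back-of-the-envelope sketch that assumes the eight hours is the full build cost and the thirty minutes is the full weekly saving, and that ignores any later maintenance time; the function name is made up for illustration.

```python
def break_even_weeks(build_hours: float, saved_hours_per_week: float) -> float:
    """Weeks until the time saved equals the time spent building."""
    if saved_hours_per_week <= 0:
        raise ValueError("saved_hours_per_week must be positive")
    return build_hours / saved_hours_per_week


# Eight hours to build, thirty minutes saved each week.
weeks = break_even_weeks(build_hours=8, saved_hours_per_week=0.5)
print(weeks)  # 16.0
```

Under those assumptions the tool pays for itself in about sixteen weeks, and only if the build time can be found somewhere other than a project with stakeholders waiting.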

This is where the current limitations matter least. Internal tooling can tolerate roughness. The users are technical. The stakes of a runtime error are lower. If the agent makes a weird autonomous decision about a script that three people use internally, you catch it, you fix it, you move on.

Use internal tooling as a proving ground. Let engineers develop their instincts for how to work with these tools: where to trust it, where to verify, how to prompt it effectively. Do it on things where the cost of being wrong is low before you do it on things where it isn't.

Production code is where the conversation usually goes sideways in both directions. People either say "never let AI touch production" or "just ship whatever it generates." Both are wrong.

There's a class of production work where the risk profile is manageable, and the common thread is this: the problem is well-defined, the contract is explicit, and failure is predictable and debuggable.

Third-party integrations are the clearest example. When you're integrating with an external service, you're coding to a specification someone else wrote. The inputs and outputs are defined. The edge cases are documented. There aren't a lot of hidden performance traps or security landmines if you stay within the bounds of the API contract. This is work that can consume a senior engineer's time without really requiring senior judgment. That's exactly what you want to offload.

Dependency upgrades are another one. Rote work that someone has to do but nobody wants to do. Well-understood expected behavior. The test suite tells you whether it worked. I'd let Claude handle those, with guardrails. My rule of thumb: if updating a dependency requires touching more than three files beyond the version bump itself, stop. Something more complex is happening and it needs a human making the call.
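The three-file rule of thumb is simple enough to enforce mechanically. The sketch below is one possible reading of it: the set of manifest and lockfile names in `VERSION_BUMP_FILES` is an assumption to adjust per ecosystem, and `needs_human` is a hypothetical helper name.

```python
from pathlib import PurePosixPath
from typing import Iterable

# Assumed manifest and lockfile names that count as "the version bump itself".
VERSION_BUMP_FILES = {
    "package.json", "package-lock.json",
    "requirements.txt", "pyproject.toml", "poetry.lock",
    "go.mod", "go.sum",
    "Cargo.toml", "Cargo.lock",
}


def needs_human(changed_files: Iterable[str], limit: int = 3) -> bool:
    """True when the upgrade touches more than `limit` files beyond the bump."""
    extra = [
        path for path in changed_files
        if PurePosixPath(path).name not in VERSION_BUMP_FILES
    ]
    return len(extra) > limit
```

Fed the list of paths changed on the upgrade branch, a check like this gives the agent an explicit stop condition instead of leaving the escalation decision to its own judgment.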

The opportunity cost argument matters here. Every hour a senior engineer spends on an integration spec or a dependency upgrade is an hour they're not spending on genuinely novel problems: the architecture decisions, the hard bugs, the performance investigations that actually require experience and judgment. Offloading the well-defined work isn't laziness. It's resource allocation.

The non-negotiables are proper monitoring, scrutiny before anything hits production, and clear criteria for when to escalate. AI-generated production code isn't inherently reckless. Unreviewed, unmonitored AI-generated production code is.

I want to be honest about the fact that I'm walking into this with a plan, not a guarantee. Everything I've described is based on experiments I ran on my own projects, not on deploying this across an organization at scale. Some of it will work the way I expect. Some of it won't. I'll probably find new failure modes I haven't encountered yet.

What I'm not going to do is either throw the tools at my team without a framework for using them, or keep them locked up because they're imperfect. The engineers I'm going to be working with are going to be using these tools with or without my guidance. My job is to give them a smarter way to engage with them than "just try stuff and see what happens."

If my thinking on this changes once I've actually run it at scale, I'll write about that too.

I write about engineering leadership and team health in the Engineering Health newsletter on LinkedIn. Search "Engineering Health" to find it.

source & further reading

dev.to: original article
