# The x3.16 Developer | Part 1

> Source: <https://dev.to/amirarad/the-x316-developer-part-1-2jc4>
> Published: 2026-05-28 19:44:19+00:00

Taking the time to figure out what has value beyond this specific task and feeding it back into your setup is the single highest-leverage thing you can do.

Everyone's chasing the x10 developer these days. AI can make you ten times more productive, they say. Ship ten times faster. Do the work of a team.

When I started using AI for coding, the obvious move was to go faster: wire the best LLM to the codebase, give it a task suitable in size and complexity, and watch it change the code faster than I could keep track of. That didn't work. The model would produce things that looked right, but I'd spend more time correcting and retracing than it would have taken me to build them myself. Every six months a new model or tool would come out that made things better, but nothing really closed the gap. I think I got to x1.25, maybe.

Slowly, I fell back to old principles. TDD, acceptance tests, static analysis: structural constraints around the risky areas. Task after task I'd retrospect, fix, generalize and automate the things supporting the model's coding. And it started working. Not because the model got better, but because everything around it got better. Work quality went up, and speed followed without my optimizing for it directly. At some point the attention required per task dropped enough that I could run two at the same time. Then three. My current peak is three coding tasks and a design brainstorm, all running at the same time.

Here's how I think about it. x10 breaks down into two axes: scale up, where each task gets done faster, and scale out, where you can run more tasks concurrently. The two multiply, so if you can only reach x√10 on each axis, you still get x10 total. √10 ≈ 3.16. That's where the title comes from.
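The arithmetic can be checked in a few lines. This is a sketch that assumes the two axes multiply independently, which is a simplification; `total_speedup` is a name made up for illustration.

```python
import math


def total_speedup(scale_up: float, scale_out: float) -> float:
    """Combined multiplier, assuming the two axes multiply independently."""
    return scale_up * scale_out


# Each axis only needs to reach the square root of the target.
per_axis = math.sqrt(10)

print(round(per_axis, 2))                           # 3.16
print(round(total_speedup(per_axis, per_axis), 2))  # 10.0
```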

What follows is the method. How I think about the system, how I optimize it, and what the feedback loops look like. The specific tools and workflows get their own article.

The first thing to understand about LLMs is that they're basically trained to predict what comes next in a sentence, and then fine-tuned to be helpful. Which means they have this deep pull toward giving you something that *looks like* a good answer. Not necessarily *is* a good answer, just looks like one. I think of that as a very powerful engine without a sense of direction.

When you work with a person, they have domain knowledge, they have stakes in the outcome, they understand the problem. An LLM doesn't have any of that. It just has this pull toward whatever feels most helpful right now, message by message. And a lot of the time that looks like substance, but it's not always actually substance.

This is more or less how hallucinations happen. A hallucinated answer sounds knowledgeable, confident, relevant. It passes your immediate quality check, because that's what the model is designed for. It's not that the model is trying to deceive you, it's just how the system works.

If you're just chatting with it, asking questions, brainstorming, it works fine most of the time, because what *sounds* right and what *is* right are mostly aligned. The problems start when things get complex, and what *sounds* right and what *is* right drift further apart. That pull toward passing as helpful becomes counterproductive the more complex the domain gets.

This took me a while to internalize: the model, with all its intelligence, is the least trustworthy part of the system. Engineers instinctively treat the LLM as the core and wrap uncertainty around it. But it's actually a statistical guessing box with some randomness on top. It has no built-in sense of what is correct. Everything around it has to compensate for that.

So if the LLM is the engine, there's all this stuff around it that makes it actually usable. System prompts, tools, verification steps, context management, guardrails. In agent engineering they call this the harness. It's basically everything between the model and reality.

When an agent fails, the reflex is to rewrite the prompt. Add more detail, be more specific. But agent behavior comes from the whole system, not just the prompt. Structural fixes at the harness level regularly outperform prompt tweaks by an order of magnitude.

Tejas Kumar recently took a 2023-era model, GPT-3.5 Turbo, and had it successfully complete a multi-step browser task through harness engineering alone ([watch his excellent talk here](https://www.youtube.com/watch?v=C_GG5g38vLU)). He then continued to explain that the harness has moving parts:

That last one closes the gap between "appeared successful" and "was successful". Without it, you're trusting the model to judge its own output. And the model is biased toward telling you everything went great.
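One way to avoid trusting the model's own verdict is to check the outcome independently of what it reports. Below is a minimal sketch of that idea, not anything taken from the talk: `StepResult`, `run_with_verification`, the always-confident agent and the `state` dict are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StepResult:
    claimed_success: bool  # what the model reports about its own work
    verified: bool         # what an independent check actually observed


def run_with_verification(
    act: Callable[[], bool],
    verify: Callable[[], bool],
) -> StepResult:
    """Run one agent step, then check the outcome without asking the model."""
    claimed = act()
    return StepResult(claimed_success=claimed, verified=verify())


# Hypothetical world state that the check inspects directly.
state = {"file_written": False}


def overconfident_agent() -> bool:
    # Reports success even though it changed nothing.
    return True


def file_really_written() -> bool:
    return state["file_written"]


result = run_with_verification(overconfident_agent, file_really_written)
print(result)  # StepResult(claimed_success=True, verified=False)
```

In a real setup the check would be something like running the test suite or inspecting the page, anything that reads the actual state instead of the model's summary of it.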

Tejas Kumar may be speaking in the context of fully autonomous agents, but this is not an autonomous car. You're at the center of the operation. The engine does the heavy lifting, but you decide where it goes, when it stops, and whether what it produced is what you needed. The harness is set up around that. It doesn't need to handle everything on its own; it needs to make your interventions cheap and your oversight easy. Every automation starts with doing the task manually and learning the process, then automating piece by piece.

So we have an engine and a car. But to get somewhere we need the driver.

The driver is as much a part of the system as the engine and the car. What you choose to check and where you save your energy, how you review the model's output, when you decide to interrupt, how you define tasks - these are skills and intuitions that differ from person to person and change over time. That's the driver part.

Right now general-purpose AI-coding feels a lot like the early days of cars. The technology works, but it's not something you can just use without thinking about it. If you owned a car in 1905 you carried a toolbox and you knew your way around the engine, because driving and maintenance weren't separate things yet.

That's roughly where we are with coding harnesses. You're not just using the tool, you're also the person who tweaks and maintains it. And the really good results aren't commoditized yet. You have to *make* them by yourself, *around* yourself.

So this is what we optimize. The engine is fixed. The harness is engineerable. The driver improves through practice and feedback, and also improves the harness. And the interfaces between layers, that's where most of the value lives and where most things go wrong.

So there's this idea from manufacturing that became very popular. In the 1950s Toyota couldn't compete with American manufacturers on volume or capital. So they focused on their process instead. They took ideas from people like Deming about quality and waste reduction and built them into how they worked, at every level, continuously. The approach is called Kaizen. The core of it is that you keep finding and removing waste from your process, over and over, and the improvements compound.

Some principles that translate directly:

There's also something I know as the Cult of Done (by Bre Pettis and Kio Stark), which says "Done is the engine of more." Roughly: ship the imperfect thing, learn from it, ship the next thing better. The temptation with AI tools is to grab more land every iteration, do more, extend further. That causes drift. Forcing small completions counters that.

Now bring it back to improving the harness:

Taking the time to figure out what has value beyond this specific task and feeding it back into your setup is the single highest-leverage thing you can do. It's not glamorous. It's you as a mechanic, working on the car. It's you as a driver, learning. *And it has nothing to do with the LLM.*

It's what compounds.

*Immediate feedback from Driver to Harness, during a task.*

My default fix for a task gone wrong is usually to restart it. But I prefer local repair over global restart whenever possible, and for that I need early detection at an actionable point. A restart costs you everything: context, progress, warmup. A local repair costs almost nothing if you catch the problem early.

In Toyota's factories, any worker could pull the andon cord to stop the production line when they spotted a defect. That's not a failure of the system; it's the system working as intended. The defect gets fixed at the source instead of propagating downstream.

Same thing when you're running a task. The agent is generating code and starts drifting, adding unnecessary abstraction, approaching something in a way that'll cost you later. You interrupt. Not to start over. To make a local repair.

Sometimes it's a prompt adjustment. Sometimes you realize the agent needs context it doesn't have, or a tool is leading it astray. Sometimes the fix is removing something, not adding something.

The instinct is to let it finish. But that's how defects propagate. It's always more expensive later. Pull the cord when you first feel something is off, even if you're not sure - just to ask it to explain why it is doing what you think may be wrong, or how it plans to handle some challenge you suspect is going to cause issues. This habit has two great side-effects, and you can always tell it to "resume" once you are satisfied that things are going well.

Side effect 1: you learn. Little by little, you get a sense for the red flags. This learning will slowly translate into increased speed and less wasted attention, and later you could even feed some of that learned wisdom into the harness as guardrails, system prompts, and validations.

Side effect 2: the LLM gets grounding. As with people, it is sometimes good for an LLM to reflect on why it is doing what it is doing. Once it has stated its reasoning or plan, the chance of it drifting away from that in the near future drops significantly.

But if the answers are not good enough, that's when this habit *really* pays off - it saves time that would otherwise have been wasted on letting it wander and bump around the problem, reading the garbage result, and so on. That's the scale-up axis going from x1 to x1.05 right there.

The subtle part is learning to distinguish "wrong" from "different than I expected." Sometimes the agent takes a path you wouldn't have taken, and it works fine. The cord is for defects, not preferences. Learning that boundary is part of the driver skill.
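Red flags you keep noticing can eventually move from your head into the harness. The sketch below shows one possible shape for that, under invented assumptions: `Step`, `touches_too_many_files`, `run_until_flag` and the threshold of 5 files are all hypothetical, and a real signal would come from whatever your tooling exposes.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass
class Step:
    description: str    # what the agent says it is doing
    files_touched: int  # hypothetical signal taken from the diff


# A red flag returns a reason string when it fires, otherwise None.
RedFlag = Callable[[Step], Optional[str]]


def touches_too_many_files(step: Step) -> Optional[str]:
    limit = 5  # arbitrary threshold, tune to your own codebase
    if step.files_touched > limit:
        return f"{step.files_touched} files touched (limit {limit})"
    return None


def run_until_flag(
    steps: Iterable[Step], flags: List[RedFlag]
) -> Tuple[List[Step], Optional[str]]:
    """Accept steps until a red flag fires; return accepted steps and the reason."""
    accepted: List[Step] = []
    for step in steps:
        for flag in flags:
            reason = flag(step)
            if reason is not None:
                # Pull the cord: stop here instead of letting the run continue.
                return accepted, f"{step.description}: {reason}"
        accepted.append(step)
    return accepted, None


steps = [
    Step("add endpoint", 2),
    Step("introduce plugin framework", 12),
    Step("write tests", 1),
]
accepted, paused_on = run_until_flag(steps, [touches_too_many_files])
print(len(accepted))  # 1
print(paused_on)      # introduce plugin framework: 12 files touched (limit 5)
```

A flag like this only pauses the run so you can ask for an explanation. Whether it is a defect or just a different path is still your call.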

*Feedback from completed task back to Harness, between tasks*

Task done. Feature shipped, bug fixed, thing works. There's a mess on the workbench. Failed approaches, temporary hooks, discovered patterns, workarounds, model behaviors that only surfaced under these conditions.

Most of this gets swept into the bin. Task done, next task. This is where most people leave compound improvement on the table.

After a task, I ask: does anything here have value beyond this specific task?

Sometimes no. Clean up, move on. But often: a hook you added, was the underlying problem specific to this task, or general? A prompting pattern that worked, can you encode it into the system prompt? A model behavior you discovered, does it change how you structure future tasks? A tool you built mid-task, is a minimal version worth keeping permanently?

Options from cheapest to most expensive: log it (just capture, zero processing), generalize it (make the specific solution apply to the class), or investigate it (spend time on something weird, was it noise or signal?).
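The cheapest option, logging, can be as small as one structured line per finding. Here is a minimal sketch, assuming a JSON-lines format; `Action`, `Finding`, `to_jsonl_line` and the example note are all hypothetical and only illustrate the three options above.

```python
import json
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    # Ordered from cheapest to most expensive.
    LOG = 1
    GENERALIZE = 2
    INVESTIGATE = 3


@dataclass
class Finding:
    task: str
    note: str
    action: Action


def to_jsonl_line(finding: Finding) -> str:
    """Serialize one finding as a single JSON line, ready to append to a log."""
    return json.dumps(
        {
            "task": finding.task,
            "note": finding.note,
            "action": finding.action.name.lower(),
        }
    )


finding = Finding(
    task="example-task",
    note="agent skipped the failing test instead of fixing it",
    action=Action.LOG,
)
line = to_jsonl_line(finding)
print(line)
```

Appending such lines to a plain file keeps capture close to zero effort, and deciding whether to generalize or investigate can wait until a pattern shows up.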

Sometimes the right answer is "not worth any investment." That's fine. The Cult of Done applies to the feedback loop too. But don't skip it entirely.

Not all improvements come from the loop. Some arrive sideways. You're debugging something unrelated and notice a model behavior that changes how you structure tasks. You read about someone else's setup and realize it solves a problem you hadn't named yet. You can't schedule these, but you can avoid suppressing them. When something unexpected happens during a task, give it thirty seconds before you correct it.

What compounds: system prompts that evolved from empty to opinionated operating manuals. Not designed top-down, but built from dozens of small synthesis cycles, each one encoding a specific problem into a structural solution.

That's the first part: how the approach works in general. The next part will be more about the practical assets and habits I've collected for scaling up using the methods described here, and how I scale out to several tasks at once. It builds on what's here: a harness you trust and a feedback loop that compounds. You can't run three cars at once if you don't trust the one you're driving.

I don't think the x10 developer is someone who found a better tool than everyone else. It's someone who keeps tuning their setup.

Try this once. After your next completed task, before you move to the next one, stop and ask whether anything you just did or learned has value beyond that specific task. You don't need a system for it, and you don't need to do it every time. You can also use an LLM for that. Just the question, once, and see what's there.
