# Coding is solved. The factory isn't.

> Source: <https://dev.to/souliane/coding-is-solved-the-factory-isnt-18i3>
> Published: 2026-06-05 09:23:20+00:00

Highly opinionated, based on my personal experience. Not a prescription —

just notes from what I keep figuring out while dogfooding my own setup. I'm

scratching the surface, with a lot left to learn.

I'm building a multi-repo personal code factory. I don't spec it up front: I dogfood it day by day — using it, and asking for improvements or fixes when something breaks. The architectural decisions still can't be made blindly by the models, so daily use is how the system finds its shape. Two qualifiers about scope, then four claims.

**Personal means local — for better and worse.** I call this a *personal* code factory because it has no business running anywhere but on my laptop. There's no auth layer, no audit log, no sandbox between the agent and my git remotes. It has my GitHub tokens, my GitLab tokens, my Slack credentials, my pass store.

The other side of that coin is that local makes it *stealth*. I can use it without bothering anyone in the company. I don't have to ask DevOps for anything, there's no IT-security review to clear, no team-practice committee to harmonize with, no central infra to wait on — it just automates things I would otherwise do by hand. That removes every external bottleneck, and it's what lets me experiment fast.

The downside is the same thing said the other way: because it's mine, it doesn't help anyone else, and it's nowhere near as efficient as something that would run on GitLab or Slack directly. This is a POC. If it turns out to work, the right move is to promote it to actual company infrastructure.

**Multi-repo means a particular kind of hard.** I run it on multi-repo because that's what my work looks like. If your code lives in a single repo, a lot of what I describe in this series either disappears or shows up differently. I'm not claiming the multi-repo case is the interesting one — just that it's the one I have.

Coding is solved — Cherny's phrase, and I think he's right. It took four things, and they're not the same thing.

The **model**: capable enough to write the code.

The **harness**: what lets the model act instead of just emitting text — read the repo, run the tests, iterate, fix. (Mine is Claude Code, but the principle isn't tied to it.)

A layer of **deterministic constraints**: checks that keep the output converging toward quality instead of tech debt. I work in Python, so for me that's [ruff](https://docs.astral.sh/ruff/), [ty](https://github.com/astral-sh/ty), [tach](https://github.com/gauge-sh/tach) run through [prek](https://github.com/j178/prek), plus [gitleaks](https://github.com/gitleaks/gitleaks) and a stack of project-specific hooks. Different language, different tools — the constraint is the point, not the toolchain.

And **skills**: written guidance that gives the model the business and project knowledge to make the right call in *this* codebase, not a generic one.

Take any one of the four away and it stops working. What none of them guarantees is that the architecture is *right* — and that is the next claim.

The factory around it isn't solved. I don't think you can specify it up front.

There are two ways to get a system that builds and ships software for you. One: write the spec — every edge case, every failure mode, every integration — hand it to an agent, let it build. Two: use it every day and fix what breaks.

I don't believe in the first — at least I wouldn't try it. A spec for a system that builds, reviews, and ships software ends up being more or less the system itself: you don't find out which edges bite until they bite. And the architectural calls inside it still can't be made blindly by the models, so the spec would have to make all of them in advance — that's the part I don't see working yet.

That leaves the second way.

That leaves dogfooding — using the thing every day, fixing what breaks, keeping it running tomorrow.

Dogfooding fuses three things into one loop that no spec can: verifying the system works, improving it where it's wrong, and keeping it running long enough to do both. The first two are the same act — you verify by trying to use it, and the parts that don't work are the parts you fix.

Making that verification less manual split into two halves. The proactive half is a test suite that checks whether the agent *behaves* as intended — did it reach for the right tool, did it avoid the wrong one — so a behavior regression shows up as a red test instead of going unnoticed for days. I'm only starting on these: a handful of behavioral scenarios plus the deterministic checks around them, noisy enough that I don't lean on them yet. The reactive half is a runtime hook that catches a bad action as it happens and refuses it — the backstop for when the agent misbehaves anyway. I lean on those far more today. But every backstop I need is something the proactive half didn't catch in time. If the evals and the agent were good enough, the gates would be dead weight. They aren't yet, so I keep both.

The third thing in the loop is the precondition. **Self-improvement and resilience are two sides of the same coin.** A system that shuts down can't keep improving itself. If I had to pick which matters more, it's resilience — improvement stops the moment the loop stops. You don't get either by specifying them. You get both by running the thing every day and refusing to let it stay broken.

So who orchestrates the loop? That's the last claim.

Orchestration looks like the part that stays human: holding the big picture, deciding what gets attention first, noticing when two threads are about the same thing, deciding what to keep and what to drop.

In [teatree](https://github.com/souliane/teatree) most of it already runs without me. One orchestrator with the big picture, not a swarm — it arbitrates and hands the actual work to sub-agents. What still needs me is basically troubleshooting and steering, and I assume that the loop can't be fully closed as long as the behavioral evals are missing.

I'll try to publish roughly one post a week. Each one is the thing I keep getting wrong and trying to get less wrong:

I'll change my mind about some of this between now and the last post. That's the point.
