Vibe coding is not a level. It's an axis.

A developer argues that 'vibe coding' autonomy levels (L0-L5) are incomplete without a second axis: operator discipline, which measures how much work survives session boundaries as inspectable state. The developer claims that high operator discipline at low autonomy levels outperforms low discipline at high autonomy over time, based on personal experience with decision stores and persona files.

Karpathy gave us vibe coding: "see stuff, say stuff, run stuff, copy and paste stuff, and it mostly works." Since then, the industry has kept trying to turn it into a tidy autonomy ladder: Level 0, Level 1, all the way up to fully autonomous development.

That ladder is useful. It is also incomplete. It measures one thing: how much of the building you delegate to AI. But two people can delegate the same amount and get radically different outcomes. One compounds. The other accumulates entropy. Same autonomy level. Different operating system.

That's the missing axis: operator discipline. By operator discipline I mean one thing: how much of your work survives the session boundary as inspectable state.

The autonomy ladder (inspired by Karpathy, reinforced by recent writing on AI-assisted development, and repeated in a dozen industry variants) measures one vertical: how much of the work you direct the model to own, and how fluent you are at directing that delegation. Each step is a skill ladder inside one domain: building software. You climb by getting better at prompts, decomposition, code review at speed, and tolerance for non-determinism. This is real and worth measuring. It's just not the only axis.

Here's the question the vertical can't answer. Two developers are both at Level 4. One ships features that compound: the codebase gets cleaner, their operating context gets sharper, their next prompt does more with less. The other ships features that decay: the codebase grows entropy, their trust in the model degrades, every new prompt is a fresh negotiation.

Same vibe coding level. Different outcomes. What's the difference? It's not skill at building. It's how the person relates to the tool over time.

Some maps name fragments of this: trust, verification, code review burden, the "perception–action gap" between knowing AI code can be wrong and being able to actually catch it. Those are real and worth reading.
But they tend to live as caveats inside the autonomy story, not as a second axis with its own structure. So let me try to draw the axis directly.

A small concrete example, since the abstraction needs one. For about three months I kept re-explaining the same architecture decision to the model every few sessions. Each time it would respectfully suggest the alternative I'd already rejected. Each time I'd argue it down again. The work felt fine in any single session. Over a month it was exhausting.

Then I started writing those decisions down in a separate store, with a status field: proposed → accepted → locked. Once a decision is locked, the model is told not to relitigate it without an explicit unlock. The relitigation stopped. The work got calmer. The codebase started moving in one direction instead of wobbling.

Nothing about my vibe coding level changed. What changed was that a decision became a piece of state instead of a thing I had to defend live.

That's the axis. Not "are you good at prompting" but how much of your context is a state machine, versus how much is reconstructed from scratch each session.

If autonomy is L0–L5 and operator discipline is Low/High, you get twelve cells. The diagonal that matters isn't "low everything → high everything." It's the cross-axis claim:

L1 + High operator discipline outperforms L5 + Low operator discipline over any time horizon longer than a sprint.

Three sample cells:

The claim is that the second axis dominates the first over time. I think it's right. It's testable. If you've watched two equally fluent AI users diverge over six months, you've already seen the pattern.

I'll describe what I personally run, not as the right answer, but so you have something concrete to disagree with.

A persona file the model loads each session: identity, communication preferences, hard rules, things that previously caused friction. Updated when a session reveals a new edge case.

Three append-only stores.
Decisions have a lifecycle: proposed → accepted → locked. Threads are active workstreams, each with current step, blocker, and next action. Notes are atomic facts with source-anchoring: every fact carries provenance (which email, which call, which file, which line).

A capture habit. Decisions go into the store the same turn they happen, not as a post-session recap. Recaps drift. Live captures don't.

Locked decisions stop the death-by-second-guessing loop. Source-anchoring removes one easy path to hallucination: the model is less likely to confidently restate a "fact" when the workflow forces provenance into view.

None of this is novel architecture. The novelty is that it's written down and enforced, not implied. It's a state machine, not a prompt trick.

Whatever your autonomy level, you can be high or low on this. That's the axis.

Discipline doesn't beat fluency. They multiply. An L1 user with high discipline still moves slower than an L4 user with high discipline. The autonomy ladder isn't wrong. It's real and worth climbing.

What I am claiming: the map has two axes, and most of the public conversation has been about one of them.

If "more AI" hasn't translated into "more leverage" for you, the answer might not be a smarter model. It might be the axis you weren't measuring.

What does your operator discipline look like? What's captured as state, what's reconstructed every session? Curious to hear concrete setups in the comments, especially ones that disagree with mine.

-- Mike
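
For readers who want something concrete to poke at, the decision lifecycle and append-only capture described above could be sketched roughly as follows. This is a minimal Python illustration, not the setup the post describes: the names `DecisionStore`, `Event`, `propose`, `advance`, `unlock`, and `locked_summary`, the in-memory list standing in for a store, and the sample decision ID and source path are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    PROPOSED = "proposed"
    ACCEPTED = "accepted"
    LOCKED = "locked"


# Forward transitions only: proposed -> accepted -> locked.
NEXT_STATUS = {Status.PROPOSED: Status.ACCEPTED, Status.ACCEPTED: Status.LOCKED}


@dataclass(frozen=True)
class Event:
    """One immutable log entry; status changes append new entries."""
    decision_id: str
    status: Status
    text: str
    source: str  # provenance: which email, call, file, or line


@dataclass
class DecisionStore:
    """Hypothetical append-only decision store (in memory for illustration)."""
    log: list = field(default_factory=list)

    def current(self, decision_id):
        """Return the most recent event for a decision, or None if unknown."""
        for event in reversed(self.log):
            if event.decision_id == decision_id:
                return event
        return None

    def propose(self, decision_id, text, source):
        if self.current(decision_id) is not None:
            raise ValueError(f"{decision_id} already exists")
        self.log.append(Event(decision_id, Status.PROPOSED, text, source))

    def advance(self, decision_id, source):
        """Move a decision one step forward; locked decisions refuse."""
        latest = self.current(decision_id)
        if latest is None:
            raise KeyError(decision_id)
        if latest.status is Status.LOCKED:
            raise PermissionError(f"{decision_id} is locked; unlock it explicitly first")
        self.log.append(
            Event(decision_id, NEXT_STATUS[latest.status], latest.text, source)
        )

    def unlock(self, decision_id, source):
        """Explicitly reopen a locked decision by appending an 'accepted' event."""
        latest = self.current(decision_id)
        if latest is None or latest.status is not Status.LOCKED:
            raise ValueError(f"{decision_id} is not locked")
        self.log.append(Event(decision_id, Status.ACCEPTED, latest.text, source))

    def locked_summary(self):
        """Text that could be loaded at session start to head off relitigation."""
        ids = []
        for event in self.log:
            if event.decision_id not in ids:
                ids.append(event.decision_id)
        lines = []
        for decision_id in ids:
            latest = self.current(decision_id)
            if latest.status is Status.LOCKED:
                lines.append(
                    f"LOCKED {decision_id}: {latest.text} (source: {latest.source})"
                )
        return "\n".join(lines)


# Illustrative usage with made-up values.
store = DecisionStore()
store.propose("D-001", "Keep the monolith", source="notes/architecture.md")
store.advance("D-001", source="notes/architecture.md")  # proposed -> accepted
store.advance("D-001", source="notes/architecture.md")  # accepted -> locked
print(store.locked_summary())
```

The point of the sketch is the shape, not the storage: every status change is a new entry carrying its own source, nothing is edited in place, and a locked decision can only change through a deliberate `unlock` call. A real setup would presumably persist the log to a file and feed the summary into the session context.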