The New Code: Why Specifications Will Replace Programming

The agents were doing exactly what I told them to. That was the problem.

I'd built a pipeline where AI agents could take a spec file, implement a feature, run the tests, review the result, and commit — without me writing a line of code. It mostly worked. Dozens of features shipped. But I kept reviewing the output and feeling like something was off. Not broken. Just subtly wrong in a way that was hard to name.

I spent a while blaming the models. Then the prompts. Then the validation steps. Eventually I had to sit with the obvious: the agents were implementing exactly what I'd written. My specs were underspecified. The bottleneck was always me, at the planning stage.

There's something that feels right about vibe coding. You're operating at the level of intent — describing what you want and letting the model handle the mechanics. That part is genuinely useful.

But watch what most people do with the output:

Traditional development:
Source code  →  Compiler  →  Binary
(keep the source; regenerate binary anytime)

Vibe coding done wrong:
Prompt  →  LLM  →  Generated code
(delete the prompt; commit the code)

You've shredded the source and carefully version-controlled the binary.

The prompt — your structured description of what you wanted, why, and what "correct" meant — is the valuable artifact. The generated code is what compiles from it. When you discard the prompt and commit only the output, you've lost the thing that actually mattered.

The practical consequence shows up six months later: you're staring at code you committed and spending twenty minutes reverse-engineering your own intent. The spec would have been a thirty-second read.

I built what I call an SDLC (Software Development Lifecycle) harness — a system where instead of writing code directly, you write a spec describing what needs to be built, and AI agents handle the implementation, testing, review, and documentation.

The spec is the source. The code is what gets compiled from it.

Simple idea. The interesting part is figuring out how to run that pipeline efficiently. I made some expensive mistakes along the way.

My original design ran every task through the full pipeline — implement, test, review, document, wrap-up — independently, in an isolated environment, in parallel:

tasks.md
   │
   ├── Task 1 (isolated)  →  implement → test → review → doc → wrap  [~200k tokens]
   ├── Task 2 (isolated)  →  implement → test → review → doc → wrap  [~200k tokens]
   └── Task 3 (isolated)  →  implement → test → review → doc → wrap  [~200k tokens]
                                                          ────────────────────────────
                                                          Total: ~200k × N tasks

On paper: thorough. In practice: around 200,000 tokens per task. A five-task spec burned through a million tokens before I'd seen any integrated output.
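The arithmetic behind that figure is easy to write down. This is a minimal sketch, assuming the rough 200k-token per-task figure applies uniformly to every task; the function name is made up for illustration.

```javascript
// Rough cost model for the original design: every task pays for the
// full pipeline on its own, so cost grows linearly with the task count.
// The 200k default is the approximate per-task figure quoted above.
function fullPipelineTokens(taskCount, tokensPerTask = 200_000) {
  if (!Number.isInteger(taskCount) || taskCount < 0) {
    throw new RangeError("taskCount must be a non-negative integer");
  }
  return taskCount * tokensPerTask;
}

console.log(fullPipelineTokens(5)); // 1000000
```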

The waste was structural. Setup phases that didn't vary between tasks ran N times. Per-task reviews could only see one task's changes — they missed integration problems anyway. Documentation ran in isolation before anything was assembled.

And the per-task review gave me false confidence: it caught issues within a task, but couldn't catch whether the integrated result actually worked. I was paying for isolation on tasks that were mostly sequential.

The redesign comes from a simpler question: what actually needs isolation, and what can be shared?

The implement step benefits from a fresh context per task — no bleed between unrelated changes. Everything else — setup, testing, review, documentation — can run once over the integrated result.

That produces a clean ladder of tools:

  /patch        trivial fix, no tests needed
      ↓
  /sdlc-task    one unit → implement → fast-test → fix → commit
      ↓
  /sdlc-run     one whole spec, full lifecycle, in-place
      ↓
  /sdlc-flow    one whole spec, worktree isolation, produces a PR
      ↓
  /sdlc-block   block-level orchestrator, branch train of PRs

Pick the rung that matches the scope. A trivial hotfix doesn't need a review stage. A whole feature spec does. Multiple independent feature blocks running in parallel need the orchestrator.
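The selection rule can be sketched as a small function. The command names come from the ladder above; the input shape (`blockCount`, `trivial`, `wholeSpec`, `needsPr`) and the function itself are assumptions for illustration, not part of the harness.

```javascript
// Hypothetical helper: map a description of the work to the cheapest
// rung that covers it. Only the returned command names are from the
// article; the option names are invented for this sketch.
function pickEngine({ blockCount = 1, trivial = false, wholeSpec = false, needsPr = false } = {}) {
  if (blockCount > 1) return "/sdlc-block"; // several feature blocks
  if (wholeSpec) return needsPr ? "/sdlc-flow" : "/sdlc-run";
  return trivial ? "/patch" : "/sdlc-task";
}

console.log(pickEngine({ trivial: true }));                   // /patch
console.log(pickEngine({ wholeSpec: true, needsPr: true }));  // /sdlc-flow
console.log(pickEngine({ blockCount: 3 }));                   // /sdlc-block
```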

The biggest change is in the orchestrator. The original version orchestrated individual tasks. The rebuilt version operates at the block level — each block is a complete feature-sized unit of work that runs its own full pipeline in its own branch:

  master-plan.md
        │
  sdlc-block orchestrator
        │
    Phase 1  ─────────────────────── (parallel)
    ├── Block A  →  sdlc-flow  →  PR #1
    └── Block B  →  sdlc-flow  →  PR #2
        │
    (after Phase 1 merges)
        │
    Phase 2
    └── Block C  →  sdlc-flow  →  PR #3
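One way to derive those phases from block dependencies is to group blocks into waves. This is a sketch, not the harness's actual wave computation; the `{ id, dependsOn }` input shape is an assumption, and so is the reading that Block C depends on both A and B.

```javascript
// Group blocks into waves: a block joins the first wave in which all of
// its dependencies have already been scheduled. Blocks in the same wave
// can run in parallel. The input shape is assumed for illustration.
function computeWaves(blocks) {
  const remaining = new Map(blocks.map((b) => [b.id, b.dependsOn ?? []]));
  const done = new Set();
  const waves = [];

  while (remaining.size > 0) {
    const ready = [...remaining.keys()].filter((id) =>
      remaining.get(id).every((dep) => done.has(dep))
    );
    if (ready.length === 0) {
      // Nothing can start: a cycle, or a dependency that is not in the plan.
      throw new Error("Unresolvable dependencies among: " + [...remaining.keys()].join(", "));
    }
    for (const id of ready) remaining.delete(id);
    for (const id of ready) done.add(id);
    waves.push(ready);
  }
  return waves;
}

const waves = computeWaves([
  { id: "A" },
  { id: "B" },
  { id: "C", dependsOn: ["A", "B"] },
]);
console.log(JSON.stringify(waves)); // [["A","B"],["C"]]
```

Marking blocks as done only after the whole wave is collected keeps a block from landing in the same wave as something it depends on.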

And inside each sdlc-flow, the lean design runs one fresh implement agent per task, then one consolidated back-half over the integrated result:

  shared setup (once)
        │
    Task 1  →  fresh implement agent
    Task 2  →  fresh implement agent
    Task 3  →  fresh implement agent
        │
  test → review → fix → document → wrap-up
  (once, over the integrated result)

Per-task agent isolation, at roughly the cost of running the rest of the pipeline once.
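To make the difference concrete, the two designs can be compared by how many stage runs each one schedules. This sketch counts stage invocations only, not tokens; the stage names follow the diagrams above, while the function names and the plan shape are invented for illustration.

```javascript
// Stage lists taken from the two diagrams in the article.
const OLD_STAGES = ["implement", "test", "review", "document", "wrap-up"];
const LEAN_BACK_HALF = ["test", "review", "fix", "document", "wrap-up"];

// Original design: every task runs every stage in isolation.
function planPerTask(tasks) {
  return tasks.flatMap((task) =>
    OLD_STAGES.map((stage) => ({ stage, scope: task }))
  );
}

// Lean design: shared setup once, one fresh implement agent per task,
// then the back half once over the integrated result.
function planLean(tasks) {
  return [
    { stage: "setup", scope: "shared" },
    ...tasks.map((task) => ({ stage: "implement", scope: task })),
    ...LEAN_BACK_HALF.map((stage) => ({ stage, scope: "integrated" })),
  ];
}

const tasks = ["Task 1", "Task 2", "Task 3"];
console.log(planPerTask(tasks).length); // 15
console.log(planLean(tasks).length);    // 9
```

Each extra task adds five stage runs to the original plan and only one to the lean plan, which is where the saving comes from.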

Faster and cheaper, yes. Better results, no.

What made the results better was fixing the specs.

I'd been optimizing execution while planning was still the bottleneck. The agents were reading underspecified tasks, making reasonable assumptions, and producing technically correct results that missed the actual intent. No amount of pipeline tuning changes that.

You can't fix a bad spec with a better pipeline.

The most interesting thing I've done with this pipeline: I used it to redesign itself.

Here's a condensed version of the master plan I wrote before touching a line of code. The goal was to redesign all four SDLC engines — and the spec-driven pipeline was the tool I used to do it:


## Goal

Four engines ship in the harness, but they no longer earn their place
as built. The most expensive one burns ~200k tokens per task running a
full review+test+document cycle that's mostly redundant when tasks are
sequential. Usage data: almost nobody runs it anymore — the simpler
single-spec runner has become the actual default.

This redesign gives each engine a distinct scope × ceremony tier,
gives every run a committed token-accounted state trail, and
rationalizes the planning commands that feed them.

## Architecture

| Engine       | Scope            | When to use                          |
|--------------|------------------|--------------------------------------|
| /patch       | Trivial hotfix   | No tests needed                      |
| /sdlc-task   | One small unit   | Implement → test → fix → commit      |
| /sdlc-run    | One whole spec   | Full lifecycle, in-place             |
| /sdlc-flow   | One whole spec   | Worktree isolation + PR              |
| /sdlc-block  | A full roadmap   | Block-level orchestration, PRs       |

---

## Phase 0 — Foundation

### Block A — Unified token telemetry
**What:** Every engine writes a committed state file with a token
usage block after each phase. Persists what was previously
render-only output that vanished when a run ended.

**Why:** Token costs were invisible between runs. You couldn't tell
which stage was expensive without watching the live output scroll by.

**Acceptance criteria:**
- sdlc-run writes a committed state file with a tokens roll-up
  after each phase
- sdlc-flow persists per-task token usage into its committed state
- No engine references a gitignored breadcrumb file
- node --check clean on both engines

### Block B — Lean single-unit engine
**What:** Rewrite the task-level engine into a lean single-unit
runner: implement → fast validation → fix loop (≤3 attempts) →
commit. Delete the heavy stages and coupling flags.

**Why:** Makes small work cheap enough to be worth a dedicated
engine, and gives the trivial /patch command an intermediate rung
instead of jumping straight to the full-spec runner.

**Acceptance criteria:**
- Engine runs only implement → test → fix → commit
- Coupling flags are gone (grep-clean)
- Committed state file with tokens is written
- node --check clean

---

## Phase 1 — The Headline Change

### Block A — Rewrite the orchestrator as a block-level engine
**What:** Replace the task-level wave machine with block-level
orchestration. Reads a master-plan file, computes dependency waves
at block granularity, fans out one sdlc-flow per block in its own
worktree branch, produces a PR per block by default. Delete the
legacy task-level execution engine and its orphaned schema file.

**Why:** The old design's merge-conflict failure mode was structural
— tasks sharing a worktree conflicting on shared files. Blocks are
independent by construction. This eliminates the failure mode
entirely, and reuses the proven single-spec runner as the inner
engine rather than duplicating its logic.

**Files:**
- Modified: sdlc-block.js — full rewrite (keep wave computation,
  config , traced-agent wrapper; add plan-file input,
  block-level fan-out, branch-train, two-level committed state)
- Deleted: execution-plan.schema.json — no remaining consumer
  after the task-level machine is removed

**Interfaces / shared surface:** Reads master-plan-format files
(the same format /generate-master-plan and /plan produce). Invokes
sdlc-flow as the inner engine via the workflow() primitive. The
committed block-orchestration-state.json — with child-flow token
roll-up — is the resume signal and the human review artifact.

**Out of scope:** The /review-PR and /merge-train commands for
human-gated branch-train merging (Block B). The per-block close-out
quality gate (Block C). The harness config schema rewrite for the
new block.* keys (Phase 3). This block may read provisional keys
and leave the schema update for Phase 3.

**Acceptance criteria:**
- Orchestrator reads a master-plan-format file and fans out
  sdlc-flow per independent block
- Default opens a PR per block; --auto-merge merges in
  dependency order
- Committed block-orchestration-state.json written with child
  token roll-up
- All legacy task-level code is gone (grep-clean for removed
  symbols: runTaskWorktree, runTaskInPlace, --from test,
  --verify-depth)
- node --check clean on the engine file

A few things worth noting about how this spec works.

The goal section captures why this is happening before describing what to build. "Usage data: almost nobody runs it" is load-bearing context — without that, an agent might preserve backward compatibility with a behavior nobody is actually using.

The Files field is a commitment, not a description. The agent knows exactly which files to touch and which to leave alone. Anything not listed is out of scope by default.
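Treating the Files field as a commitment also makes it checkable. A minimal sketch, assuming you already have the list of paths a change touched; the function name and the `sdlc-flow.js` path in the example are hypothetical.

```javascript
// Given the paths a diff touched and the paths the spec's Files field
// declared, return anything that was changed outside that commitment.
function filesOutsideSpec(touchedPaths, declaredPaths) {
  const declared = new Set(declaredPaths);
  return touchedPaths.filter((path) => !declared.has(path));
}

const declaredFiles = ["sdlc-block.js", "execution-plan.schema.json"];
const touchedFiles = ["sdlc-block.js", "sdlc-flow.js"]; // second path is made up
console.log(JSON.stringify(filesOutsideSpec(touchedFiles, declaredFiles)));
// ["sdlc-flow.js"]
```

In practice the touched list could come from something like `git diff --name-only`; an empty result means the change stayed inside the declared files.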

The Interfaces / shared surface field tells the agent what this block exposes to the blocks that come after it. That's the contract other blocks depend on — changing it mid-implementation breaks downstream work.

The Out of scope field does something most specs skip entirely: it names the things you are not building. This matters because an agent trying to be helpful will often implement adjacent things that seem obviously related. Explicit out-of-scope entries prevent that drift.
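A spec block can be checked for these fields before any agent sees it. The field labels below are copied from the spec excerpt; treating all six as required is an assumption of this sketch rather than a rule of the harness, and the function name is invented.

```javascript
// Field labels as they appear in the spec excerpt ("**What:**", etc.).
const REQUIRED_FIELDS = [
  "What",
  "Why",
  "Files",
  "Interfaces / shared surface",
  "Out of scope",
  "Acceptance criteria",
];

// Return the labels that a block's markdown text does not contain.
function missingFields(blockMarkdown, required = REQUIRED_FIELDS) {
  return required.filter((field) => !blockMarkdown.includes(`**${field}:**`));
}

const block = [
  "**What:** Rewrite the task-level engine.",
  "**Why:** Makes small work cheap.",
  "**Acceptance criteria:**",
  "- node --check clean",
].join("\n");

console.log(JSON.stringify(missingFields(block)));
// ["Files","Interfaces / shared surface","Out of scope"]
```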

And the acceptance criteria are verifiable against the diff. Not "the engine works better" — "these specific symbols are gone, this file is committed, node --check passes." These become the exact checklist the review stage runs.
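The "grep-clean" criterion is a good example of a check that needs no judgment. Here is a rough sketch, assuming file contents are already loaded into an object keyed by path; the function name and the sample file body are made up, while the symbol names come from the spec above.

```javascript
// Check the "grep-clean" criterion: none of the removed symbols may
// appear in any file. `files` maps a path to its text content; how the
// content gets loaded is left out of this sketch.
function findLeftoverSymbols(files, removedSymbols) {
  const hits = [];
  for (const [path, text] of Object.entries(files)) {
    text.split("\n").forEach((line, index) => {
      for (const symbol of removedSymbols) {
        if (line.includes(symbol)) hits.push({ path, line: index + 1, symbol });
      }
    });
  }
  return hits;
}

const leftovers = findLeftoverSymbols(
  { "sdlc-block.js": "function run() {\n  return runTaskInPlace();\n}" },
  ["runTaskWorktree", "runTaskInPlace", "--verify-depth"]
);
console.log(JSON.stringify(leftovers));
// [{"path":"sdlc-block.js","line":2,"symbol":"runTaskInPlace"}]
```

An empty array means the criterion passes; anything else gives the review stage an exact file and line to point at.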

The practical test: can you hand this spec to a smart engineer with zero prior context and have them build the right thing? If not, the spec isn't done, and an AI agent will fail the same way.

The flow that works for me now:

  Feature idea or requirement
          │
     /plan or /generate-master-plan
     (mini-roadmap: what, why, blocks, dependencies)
          │
     /generate-tasks
     (executable spec: tasks + acceptance criteria + decisions)
          │
     pick the right engine rung
          │
     AI agents implement

The planning step is where the hard questions happen: what does done actually look like, what are the real constraints, where are the edge cases. These are cheaper to answer before implementation starts than after.

When I rush through planning, I pay for it in review cycles. The root cause is almost always an underspecified task — and the fix is always "write a clearer spec," not "use a better model."

When I slow down at planning, implementations usually ship clean on the first pass.

The scarce skill isn't writing code. It isn't prompting. It's writing specifications that fully capture intent — clear enough that a stateless agent can make the right call without asking follow-up questions.

This is hard. It requires thinking through your assumptions before you start. It requires distinguishing between "I have a vague sense of what I want" and "I can articulate what I want precisely enough for someone else to act on."

But it transfers. The same discipline that makes a good agent spec makes a better architecture decision record, a clearer PR description, a more useful design doc. It makes you a better collaborator regardless of whether the person on the other side is human or AI.

The tooling is changing fast. The underlying skill isn't.

Start with the spec.

If this was useful, I write about building production AI and agentic systems at learn-agentic-ai.com — including hands-on learning paths available in both English and Brazilian Portuguese. Come build something real.

Source: dev.to (original article)
