TL;DR — When the spec is good, implementation needs less model. I started using a top-tier model to write the spec and a cheaper, faster one to implement it — still using the strong model, just spending it on the spec instead of the implementation. The gain isn't some magic prompt phrasing; it's the context: explicit business rules, audited project constraints, a defined output contract. That's systems engineering — the discipline of anyone who's kept real software alive, whatever their stack.
Every backend dev knows the scene: the Swagger is out of date, the last hotfix shipped without a unit test, and the README.md documents a command nobody's used in six months. The code works. The docs lie. And the gap between the two is exactly where AI, and we, start to go wrong.
I've spent the last few months developing with AI for real inside production projects, not tutorial greenfield. My takeaway was less about which model to use and more about a shift that already has a name: the move from prompt engineering to context engineering.
The difference isn't semantic. Prompt engineering treats the problem as writing — finding the magic phrase. Context engineering treats it as what it always was: a systems engineering problem. And it's where my backend background applied most directly — though anyone who's kept a real system alive has the same instinct.
Let me start with the evidence, because that's what made me take this seriously.
My reflex, for a long time, was to reach for the strongest model for everything — more expensive, smarter, fewer errors. Makes sense on paper.
In practice, I saw something else. When the task's specification is well done — explicit business rules, audited project constraints, a defined output format — the model capability needed for implementation drops sharply. Enough to split the work by stage: I started using a top-tier model (currently Opus) to write the spec, and a cheaper, faster model (Sonnet) to implement from it. I didn't drop the strong model — I moved it: off the implementation, onto the spec.
The comparison that actually matters isn't cheap implementer vs. expensive implementer. It's (a) a strong model implementing with no structured spec vs. (b) a strong model writing the spec and a cheaper model implementing from it. In my projects, (b) was both cheaper and more accurate. The accuracy doesn't come from the cheap model; the spec carries the precision. With unlimited budget you could run the strong model on both stages, and that's the ceiling. The useful finding is narrower: good context lets a cheaper implementer beat the strong model left to implement without it, while saving on the highest-volume stage.
Be precise about what that does and doesn't show, though. Sonnet isn't a weak model: it sits a notch below Opus, and it's far from a tiny model. So this isn't evidence that some minimal model can implement anything. It's evidence that a good spec lowers the capability bar a task demands, and in my case that bar dropped below top-tier. How far it drops depends on the task.
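To make the cost side of the split concrete, here is a back-of-the-envelope sketch. Every number in it is a made-up placeholder: the prices are not real Opus or Sonnet pricing, and the token volumes are assumed. Only the shape of the arithmetic matters: the spec is a small share of the tokens, the implementation is the bulk.

```typescript
// Hypothetical price per million tokens for a model tier (placeholder values).
interface ModelTier {
  name: string;
  usdPerMillionTokens: number;
}

// One stage of the workflow: which tier runs it and how many tokens it consumes.
interface Stage {
  tier: ModelTier;
  tokens: number;
}

// Total cost of a task: sum of each stage's tokens times its tier's price.
function taskCostUsd(stages: Stage[]): number {
  return stages.reduce(
    (sum, s) => sum + (s.tokens / 1000000) * s.tier.usdPerMillionTokens,
    0,
  );
}

const strong: ModelTier = { name: "strong", usdPerMillionTokens: 50 };
const cheap: ModelTier = { name: "cheap", usdPerMillionTokens: 10 };

// Assumed volumes: the spec is small, the implementation is the bulk.
const specTokens = 20000;
const implTokens = 200000;

// Baseline: the strong model implements directly, with no spec stage.
const strongOnly = taskCostUsd([{ tier: strong, tokens: implTokens }]);

// Split: the strong model writes the spec, the cheap model implements from it.
const split = taskCostUsd([
  { tier: strong, tokens: specTokens },
  { tier: cheap, tokens: implTokens },
]);

console.log({ strongOnly, split });
```

With these placeholder values the strong-only run costs about 10 USD and the split about 3 USD. The spec stage adds a cost, but it is paid on the small volume; the saving lands on the large one. This says nothing about accuracy, which depends on the spec itself.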
I want to be honest about the scope of this claim, because it easily turns into misleading clickbait. It's not a universal recipe. The exact split depends on task complexity, codebase maturity, and how good your patterns already are. Apply this to a messy project with no specs and the cheap model will hallucinate, and you'll conclude I was wrong — when what was missing is exactly the half that makes the claim true.
What is generalizable is the principle behind it:
Context quality lowers the model capability a task requires.
The Opus→Sonnet split is just concrete evidence that the principle works. And it explains, as a bonus, why the hype migrated so fast from "prompt engineering" to "context engineering": the marginal gain from investing in context is bigger than from investing in the phrasing.
Here's a trap I also fell into: thinking more context is better context. It isn't. Dumping the whole repo into the window doesn't help the model infer better — it just inflates cost and dilutes signal. Good context isn't volume, it's precision. In practice, what moves the needle:
README.md
Notice none of this is "prompt technique." It's engineering hygiene. AI just made the cost of skipping that hygiene visible and immediate.
A thesis without a method is a motivational talk. So here's the how.
I treat every repeatable step of my flow as a skill: a structured prompt with three fixed parts (Instructions, Constraints, and Output Format). They live versioned in .md files, reachable from both the editor and the agent. When I catch myself asking the model for the same thing a third time across different projects, I formalize it into a skill.
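One way to picture a skill is as a small data structure rendered to Markdown. This is a minimal sketch, not the author's actual tooling: the `Skill` field names, the `renderSkill` helper, and the example wording are all illustrative assumptions. Only the three sections come from the text.

```typescript
// A skill with its three fixed parts, plus a name and one-line summary.
// Field names are illustrative assumptions.
interface Skill {
  name: string;
  summary: string;
  instructions: string[];
  constraints: string[];
  outputFormat: string;
}

// Render the skill as the Markdown that would live in a versioned .md file.
function renderSkill(skill: Skill): string {
  const numbered = skill.instructions.map((step, i) => `${i + 1}. ${step}`);
  const bullets = skill.constraints.map((rule) => `- ${rule}`);
  return [
    `# ${skill.name}`,
    skill.summary,
    "## Instructions",
    ...numbered,
    "## Constraints",
    ...bullets,
    "## Output Format",
    skill.outputFormat,
  ].join("\n");
}

// Placeholder content loosely based on the commit-message step of the flow.
const commitMessage: Skill = {
  name: "commit-message",
  summary: "Writes a Conventional Commits message from the diff.",
  instructions: ["Read the provided diff.", "Summarize the change in one line."],
  constraints: ["DON'T describe changes that are not in the diff."],
  outputFormat: "type(scope): subject",
};

console.log(renderSkill(commitMessage));
```

Keeping the structure typed makes it hard to ship a skill that is missing its Constraints or Output Format section, which is the part most easily forgotten when prompts are written freehand.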
The flow of a task in an existing project looks like this:
extract task description
→ spec-writer (generates the spec from the informal ticket)
→ [review the spec] (human in the loop — the step I don't delegate)
→ implement
→ ai-output-verifier (red-team what the AI generated)
→ code-reviewer (senior-level critical review)
→ commit-message (Conventional Commits from the diff)
→ pr-description (description in the team's format)
→ push
→ review by another model on the PR
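The flow above can be sketched as an ordered list of stages where any stage can halt the run. This is an illustration under assumptions: `runPipeline`, `stub`, and `humanReview` are hypothetical names, the stages are stand-ins that only pass text along, and the human review is modelled as a plain boolean.

```typescript
// Outcome of one stage: it either passes output on or halts the flow.
type StageResult =
  | { ok: true; output: string }
  | { ok: false; reason: string };

interface PipelineStage {
  name: string;
  run: (input: string) => StageResult;
}

interface PipelineReport {
  completed: string[];
  haltedAt: string | null;
  reason: string | null;
}

// Run stages in order and stop at the first one that does not pass.
function runPipeline(stages: PipelineStage[], ticket: string): PipelineReport {
  const completed: string[] = [];
  let current = ticket;
  for (const stage of stages) {
    const result = stage.run(current);
    if (!result.ok) {
      return { completed, haltedAt: stage.name, reason: result.reason };
    }
    completed.push(stage.name);
    current = result.output;
  }
  return { completed, haltedAt: null, reason: null };
}

// Stand-in for a model-backed skill: it just tags the text with its name.
function stub(name: string): PipelineStage {
  return { name, run: (input) => ({ ok: true, output: `${input} -> ${name}` }) };
}

// The human gate: nothing downstream runs until the spec is approved.
function humanReview(approved: boolean): PipelineStage {
  return {
    name: "review-spec",
    run: (input) =>
      approved
        ? { ok: true, output: input }
        : { ok: false, reason: "spec not approved by a human" },
  };
}

const report = runPipeline(
  [stub("spec-writer"), humanReview(false), stub("implement"), stub("ai-output-verifier")],
  "informal ticket text",
);
console.log(report);
```

In this run the report shows `spec-writer` completed and the flow halted at `review-spec`, so the implement stage never sees an unreviewed spec.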
Three of these skills do the heavy lifting. Let me open all three.
spec-writer
The heart of the pipeline. It turns an informal ticket (natural language, bullets, a Slack thread) into an implementation-ready spec. The detail that makes the difference isn't the template — it's the Constraints Audit: before describing what the code should do, the skill forces an inspection of what the project already has configured that could invalidate the model's output.
Turns an informal ticket description into an implementation-ready SDD spec.
## Instructions
1. Read the provided description.
2. Identify ambiguities BEFORE writing the spec. If there's at least 1
critical doubt, STOP and ask.
3. Project Constraints Audit (mandatory). Before describing what the code
should do, audit what the project ALREADY HAS configured that could
invalidate the output. Run concrete inspection — don't assume:
   - Schema (DB writes): inspect the table. NOT NULL, defaults,
     UNIQUE/FK, encoding. Sample existing data to see the expected
     format of special fields.
   - Transpile/syntax (JS/TS): Babel rules may differ across
     .js, .jsx, .ts, .tsx. Confirm modern syntax (?., ??) IN THE SAME
     file type that will be modified.
   - TypeScript config: target, strict, noImplicitAny.
   - Lint config: rules that could reject the output.
   - Dependencies in use: suggest libs ALREADY present, don't add
     new ones without justification.
   - Existing patterns: grep for a similar implementation and follow the style.
4. Suggest at least 3 non-obvious edge cases from the original ticket.
## Constraints
- DON'T invent business rules. If info is missing, mark it as TODO.
- DON'T assume default stack/config. Use what the Constraints Audit
found. If transpile doesn't support ?. in .jsx, the spec says
"use traditional conditionals in .jsx".
- Keep the spec short: one screen max.
Why this matters: most of the hallucinations I've seen weren't the model "inventing" from nothing. They were the model assuming a reasonable default that the specific project didn't follow: ?. in a file the project's Babel doesn't transpile, an insert that violates a NOT NULL nobody mentioned. The Constraints Audit kills that entire class of error before the first line of code.
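A minimal sketch of how audit findings might become explicit spec lines. The `AuditFindings` shape and `specConstraints` function are hypothetical, and the findings are hand-written sample values; in a real run they would come from inspecting the Babel config, the schema, and package.json.

```typescript
// File types the audit distinguishes for the transpile check.
type FileType = ".js" | ".jsx" | ".ts" | ".tsx";

// What a Constraints Audit might record (hypothetical shape).
interface AuditFindings {
  // Whether ?. and ?? are transpiled for each file type.
  modernSyntaxSupported: Record<FileType, boolean>;
  // Columns that a DB write must always fill.
  notNullColumns: string[];
  // Dependencies already present in the project.
  installedLibs: string[];
}

// Turn findings into explicit lines for the spec, so the implementing
// model never has to guess a default.
function specConstraints(findings: AuditFindings, target: FileType): string[] {
  const lines: string[] = [];
  if (!findings.modernSyntaxSupported[target]) {
    lines.push(`use traditional conditionals in ${target} (no ?. or ??)`);
  }
  for (const column of findings.notNullColumns) {
    lines.push(`writes must set ${column} (NOT NULL)`);
  }
  if (findings.installedLibs.length > 0) {
    lines.push(`prefer libs already present: ${findings.installedLibs.join(", ")}`);
  }
  return lines;
}

// Sample findings, invented for illustration.
const findings: AuditFindings = {
  modernSyntaxSupported: { ".js": true, ".jsx": false, ".ts": true, ".tsx": true },
  notNullColumns: ["orders.currency"],
  installedLibs: ["zod"],
};

console.log(specConstraints(findings, ".jsx"));
```

The point of the shape is that each finding ends up as a sentence in the spec. A constraint that stays in someone's head is exactly the default the model will guess wrong.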
ai-output-verifier
Sycophancy is the tendency of LLMs to agree with, flatter, and validate the user at the expense of accuracy: the model tells you what you want to hear instead of acting as a neutral truth-seeker. In a flow where AI generates a lot of code, this is the hidden tax: the code looks right, it passed the tests (which the same AI wrote), and nobody checked whether the tests test anything.
This skill formalizes distrust into a checklist. I run the checks in order and stop at the first red flag:
Red-team of AI-generated output. Fights sycophancy, fabrication, and the
"code that looks right but isn't" pattern.
### Check 1: Fabricated references
Did the AI cite file:line, a function, or an API? Open each one and confirm
it exists AND does what it claimed. Red flag: a plausibly-named but
nonexistent function, a lib version different from the installed one.
### Check 2: Tests that don't test
Does each expect validate BEHAVIOR, or just that it "ran without throwing"?
Would the test fail if you commented out the implementation? Red flag:
expect(result).toBeDefined(), mocks that return the expected result
instead of simulating real behavior.
### Check 3: Happy path syndrome
Does it work with empty input (null, undefined, "")? Under concurrency?
With the external integration down? Red flag: no try/catch on I/O, no
input validation.
### Check 4: Adjectives without evidence
Did the AI use "robust", "performant", "secure", "production-ready"? Is
there a benchmark or technical argument to back it? Red flag: "more
performant" with no comparison.
### Check 5: Sycophancy detection
Did you push the AI in a direction and it changed its mind too fast,
without a new technical argument? If so, ask explicitly: "what's the
evidence for this new position?"
### Check 6: Plan vs delivery
Did the AI reference files/specs/paths as if they exist — in summaries,
conclusions, or next steps? For each path, run `ls` or open the file: does
it exist? Watch verb tense — "X becomes the destination in Y/Z/" is a plan
dressed as delivery. In the final summary, separate "I created" (verified
by `ls`) from "I suggest creating" (just a plan).
### Check 7: Confidence on inferences
Did the AI make claims it couldn't have verified directly? It must not give
the same affirmative authority to what it SAW in the code and what it
INFERRED from folder structure. Demand a level: High (directly observed) /
Medium (inferred from structure) / Low (speculative).
## Constraints
- DON'T mark "verified" if any check landed on "maybe". Go back to the code.
- DON'T rely on the AI's tests alone. They came from the same source that
might have the bug.
- DON'T skip a check because "it's a small change". Sycophancy scales —
small changes are where AI cheats most.
The metric that started to matter to me isn't "how much code per day." It's how much bad code I kept from reaching production. This skill is the instrument for that.
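Check 2 is easy to see in miniature. The sketch below uses a made-up `applyDiscount` function and a deliberately broken stand-in for it, then runs a weak check and a behavioral check against both. Plain functions stand in for a test framework's `expect` so the example stays self-contained.

```typescript
type DiscountFn = (price: number, percent: number) => number;

// Hypothetical function under test: applies a percentage discount.
const applyDiscount: DiscountFn = (price, percent) =>
  price - (price * percent) / 100;

// Broken stand-in, as if the implementation had been commented out.
const applyDiscountBroken: DiscountFn = (price, _percent) => price;

// Weak check: only proves the call returned something.
// This is the plain-function equivalent of expect(result).toBeDefined().
function weakCheck(fn: DiscountFn): boolean {
  return fn(200, 25) !== undefined;
}

// Behavioral check: pins the expected value, so a broken implementation fails.
function behaviorCheck(fn: DiscountFn): boolean {
  return fn(200, 25) === 150;
}

console.log({
  weakOnReal: weakCheck(applyDiscount),
  weakOnBroken: weakCheck(applyDiscountBroken),
  behaviorOnReal: behaviorCheck(applyDiscount),
  behaviorOnBroken: behaviorCheck(applyDiscountBroken),
});
```

The weak check passes for both implementations, which is the red flag: it cannot tell working code from no code. Only the behavioral check fails on the broken version.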
code-reviewer
Short and direct. Critical review at a senior backend Node/TS level, focused on correctness and security — not style, which Prettier handles.
Critical code review at a senior backend Node/TS level. Focus on
correctness, security, maintainability — not style.
## Instructions
1. Read the provided diff.
2. Classify each issue as Blocker, Major, or Minor.
3. For each Blocker/Major, give a concrete example of the bug (input that
breaks it, race condition scenario).
4. Suggest a refactor ONLY when it reduces real complexity. Not for aesthetics.
5. End with a Verdict: Ship / Fix and ship / Block.
## Constraints
- Assume Node 20+, TypeScript strict, ESLint configured.
- Don't comment on formatting.
- Consider concurrency, DB transactions, idempotency, promise leaks,
silent error handling.
- If it's an HTTP endpoint: input validation, authz, rate limit, PII logging.
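The classification and verdict steps can be sketched as types. The mapping from severities to a verdict is an assumption here: the skill names the three severities and the three verdicts but does not say how one leads to the other. `ReviewIssue`, `verdictFor`, and `missingExamples` are hypothetical names, and the sample issues are invented.

```typescript
type Severity = "Blocker" | "Major" | "Minor";
type Verdict = "Ship" | "Fix and ship" | "Block";

interface ReviewIssue {
  severity: Severity;
  summary: string;
  // Concrete failing scenario; the skill asks for one on Blocker and Major.
  example?: string;
}

// Assumed mapping: any Blocker blocks, any Major means fix first, else ship.
function verdictFor(issues: ReviewIssue[]): Verdict {
  if (issues.some((i) => i.severity === "Blocker")) return "Block";
  if (issues.some((i) => i.severity === "Major")) return "Fix and ship";
  return "Ship";
}

// Issues that break instruction 3: Blocker or Major with no concrete example.
function missingExamples(issues: ReviewIssue[]): ReviewIssue[] {
  return issues.filter((i) => i.severity !== "Minor" && !i.example);
}

// Invented sample review.
const issues: ReviewIssue[] = [
  {
    severity: "Major",
    summary: "Retry is not idempotent",
    example: "Two retries of the same request create two rows",
  },
  { severity: "Major", summary: "Error swallowed in catch block" },
  { severity: "Minor", summary: "Variable name is unclear" },
];

console.log(verdictFor(issues), missingExamples(issues).length);
```

Here the verdict is "Fix and ship", and one Major issue is flagged for lacking a concrete failing scenario, which is a cheap way to catch a review that asserts a bug without showing it.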
Notice the pattern across all three: Instructions + Constraints + Output Format, always. The Constraints are the part that matters most, because that's where you encode the "do NOT" — and the "do NOT" is what keeps the AI on the rails.
The Achilles' heel of all this is documentation aging. A spec is a snapshot of the past; the code keeps changing. The direction that feels right to me is living documentation, generated and maintained on every change, not in a heroic quarterly effort. Tools are already aiming exactly at this, and while they mature, you can approximate it: take a snapshot of the project and ask the model to generate a context file from the code + README.md.
How I see this evolving inside a repo: alongside the README.md, a _Skills/ folder with the team's structured prompts and anchor files (CLAUDE.md, CODEX.md) that list the available skills and point to the project's CONTEXT.md. The model enters the repo and already knows what to read, in what order.
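A sketch of what generating such an anchor file could look like. The layout, the reading order (README.md first, then CONTEXT.md), and the `renderAnchorFile` helper are assumptions for illustration; the text above only says the anchor file lists the skills and points to CONTEXT.md.

```typescript
interface SkillEntry {
  file: string; // file name inside _Skills/
  purpose: string; // one-line description
}

// Build the text of an anchor file (e.g. CLAUDE.md) that tells the model
// what to read first and which skills exist. The layout is an assumption.
function renderAnchorFile(contextPath: string, skills: SkillEntry[]): string {
  const skillLines = skills.map((s) => `- _Skills/${s.file}: ${s.purpose}`);
  return [
    "Read in this order:",
    "1. README.md",
    `2. ${contextPath}`,
    "",
    "Available skills:",
    ...skillLines,
  ].join("\n");
}

const anchor = renderAnchorFile("CONTEXT.md", [
  { file: "spec-writer.md", purpose: "turn an informal ticket into a spec" },
  { file: "ai-output-verifier.md", purpose: "red-team AI-generated output" },
  { file: "code-reviewer.md", purpose: "senior-level critical review" },
]);

console.log(anchor);
```

Generating the file from the skill list, instead of editing it by hand, keeps the index from drifting out of date the same way the README does.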
And there's a layer beyond this one. Context engineering is about what you put into a model call; the moment you wrap those calls in a system — quality gates, retries, an independent reviewer, persistent memory — you've crossed into what the field now calls harness engineering: the discipline of everything that isn't the model. The skill pipeline above is already a harness in miniature. That's where I go next.
None of this is exotic. It's the old engineering discipline — document decisions, keep patterns, write tests that test — with a new, impossible-to-ignore incentive: now the cost of skipping those steps shows up immediately, in the form of an agent that hallucinates. AI didn't change what good development is. It just stopped letting us pretend that shortcuts are free.
This is the flow I run today. The skills evolve every week as I catch the AI failing in new ways — if you run something similar and hit different walls, I'd want to know which.