{"slug": "context-engineering-is-engineering-work-not-prompt-writing", "title": "Context engineering is engineering work — not prompt-writing", "summary": "A developer reports that using a top-tier AI model to write a detailed specification and a cheaper model to implement from that spec yields better accuracy and lower cost than using the strong model alone without a structured spec. The approach, called context engineering, treats AI development as a systems engineering problem where explicit business rules and output contracts reduce the capability needed for implementation. The finding is based on production projects and is not a universal recipe, but the principle that context quality lowers required model capability is generalizable.", "body_md": "TL;DR — When the spec is good, implementation needs less model. I started using a top-tier model to write the spec and a cheaper, faster one to implement it — still using the strong model, just spending it on the spec instead of the implementation. The gain isn't some magic prompt phrasing; it's the context: explicit business rules, audited project constraints, a defined output contract. That's systems engineering — the discipline of anyone who's kept real software alive, whatever their stack.\n\nEvery backend dev knows the scene: the Swagger is out of date, the last hotfix shipped without a unit test, and the `README.md` documents a command nobody's used in six months. The code works. The docs lie. And the gap between the two is exactly where AI — and we — start to go wrong.\n\nI've spent the last few months developing with AI *for real* inside production projects, not tutorial greenfield. My takeaway was less about which model to use and more about a shift that already has a name: the move from **prompt engineering** to **context engineering**.\n\nThe difference isn't semantic. Prompt engineering treats the problem as writing — finding the magic phrase. 
Context engineering treats it as what it always was: **a systems engineering problem**. And it's where my backend background applied most directly — though anyone who's kept a real system alive has the same instinct.\n\nLet me start with the evidence, because that's what made me take this seriously.\n\nMy reflex, for a long time, was to reach for the strongest model for everything — more expensive, smarter, fewer errors. Makes sense on paper.\n\nIn practice, I saw something else. When the task's *specification* is well done — explicit business rules, audited project constraints, a defined output format — the model capability needed for **implementation** drops sharply. Enough to split the work by stage: I started using a top-tier model (currently Opus) to **write the spec**, and a cheaper, faster model (Sonnet) to **implement** from it. I didn't drop the strong model — I moved it: off the implementation, onto the spec.\n\nThe comparison that actually matters isn't cheap implementer vs. expensive implementer. It's *strong model implementing with no structured spec* vs. *strong model writing the spec, cheaper model implementing from it* — and the second is both cheaper and more accurate. The accuracy isn't coming from the cheap model — it's that the spec carries the precision. With unlimited budget you could run the strong model on both stages, and that's the ceiling; the useful finding is narrower: good context makes a cheaper implementer beat the strong model left to implement without it — while saving on the highest-volume stage.\n\nBe precise about what that does and doesn't show, though. Sonnet isn't a weak model — it's a notch below Opus, not a tiny one. So this isn't evidence that some minimal model can implement anything; it's that a good spec lowers the *minimum* capability a task demands, and in my case that minimum dropped below top-tier. 
How far it drops depends on the task.\n\nI want to be honest about the scope of this claim, because it easily turns into misleading clickbait. **It's not a universal recipe.** The exact split depends on task complexity, codebase maturity, and how good your patterns already are. Apply this to a messy project with no specs and the cheap model will hallucinate, and you'll conclude I was wrong — when what was missing is exactly the half that makes the claim true.\n\nWhat **is** generalizable is the principle behind it:\n\nContext quality lowers the model capability a task requires.\n\nThe Opus→Sonnet split is just concrete evidence that the principle works. And it explains, as a bonus, why the hype migrated so fast from \"prompt engineering\" to \"context engineering\": the marginal gain from investing in context is bigger than from investing in the phrasing.\n\nHere's a trap I also fell into: thinking **more** context is better context. It isn't. Dumping the whole repo into the window doesn't help the model infer better — it just inflates cost and dilutes signal. Good context isn't volume, it's **precision**. In practice, what moves the needle:\n\n`README.md`\n\nNotice none of this is \"prompt technique.\" It's engineering hygiene. AI just made the cost of skipping that hygiene visible and immediate.\n\nA thesis without a method is a motivational talk. So here's the how.\n\nI treat every repeatable step of my flow as a **skill**: a structured prompt with three fixed parts — *Instructions*, *Constraints*, and *Output Format*. They live versioned in `.md` files, reachable from both the editor and the agent. 
When I catch myself asking the model for the same thing a third time across different projects, I formalize it into a skill.\n\nThe flow of a task in an existing project looks like this:\n\n```\nextract task description\n   → spec-writer        (generates the spec from the informal ticket)\n   → [review the spec]  (human in the loop — the step I don't delegate)\n   → implement\n   → ai-output-verifier (red-team what the AI generated)\n   → code-reviewer      (senior-level critical review)\n   → commit-message     (Conventional Commits from the diff)\n   → pr-description     (description in the team's format)\n   → push\n   → review by another model on the PR\n```\n\nThree of these skills do the heavy lifting. Let me open all three.\n\n`spec-writer`\n\nThe heart of the pipeline. It turns an informal ticket (natural language, bullets, a Slack thread) into an implementation-ready spec. The detail that makes the difference isn't the template — it's the **Constraints Audit**: before describing what the code *should do*, the skill forces an inspection of what the project *already has* configured that could invalidate the model's output.\n\n```\n# Skill: Spec Writer\n\nTurns an informal ticket description into an implementation-ready SDD spec.\n\n## Instructions\n1. Read the provided description.\n2. Identify ambiguities BEFORE writing the spec. If there's at least 1\n   critical doubt, STOP and ask.\n3. Project Constraints Audit (mandatory). Before describing what the code\n   should do, audit what the project ALREADY HAS configured that could\n   invalidate the output. Run concrete inspection — don't assume:\n   - Schema (DB writes): inspect the table. NOT NULL, defaults,\n     UNIQUE/FK, encoding. Sample existing data to see the expected\n     format of special fields.\n   - Transpile/syntax (JS/TS): Babel rules may differ across\n     .js, .jsx, .ts, .tsx. Confirm modern syntax (?., ??) 
IN THE SAME\n     file type that will be modified.\n   - TypeScript config: target, strict, noImplicitAny.\n   - Lint config: rules that could reject the output.\n   - Dependencies in use: suggest libs ALREADY present, don't add\n     new ones without justification.\n   - Existing patterns: grep for a similar implementation and follow the style.\n4. Suggest at least 3 non-obvious edge cases from the original ticket.\n\n## Constraints\n- DON'T invent business rules. If info is missing, mark it as TODO.\n- DON'T assume default stack/config. Use what the Constraints Audit\n  found. If transpile doesn't support ?. in .jsx, the spec says\n  \"use traditional conditionals in .jsx\".\n- Keep the spec short: one screen max.\n```\n\nWhy this matters: most of the hallucinations I've seen weren't the model \"inventing\" from nothing. They were the model **assuming a reasonable default** that the specific project didn't follow — `?.` in a file that project's Babel doesn't transpile, an insert that violates a `NOT NULL` nobody mentioned. The Constraints Audit kills that entire class of error before the first line of code.\n\n`ai-output-verifier`\n\nSycophancy is the tendency of LLMs to agree with, flatter, and validate the user at the expense of accuracy — the model tells you what you want to hear instead of acting as a neutral truth-seeker. In a flow where AI generates a lot of code, this is the **hidden tax**: the code *looks* right, *passed* the tests (which the same AI wrote), and nobody checked whether the tests test anything.\n\nThis skill formalizes distrust into a checklist. I run the checks in order and stop at the first red flag:\n\n```\n# Skill: AI Output Verifier\n\nRed-team of AI-generated output. Fights sycophancy, fabrication, and the\n\"code that looks right but isn't\" pattern.\n\n### Check 1: Fabricated references\nDid the AI cite file:line, a function, or an API? Open each one and confirm\nit exists AND does what it claimed. 
Red flag: a plausibly-named but\nnonexistent function, a lib version different from the installed one.\n\n### Check 2: Tests that don't test\nDoes each expect validate BEHAVIOR, or just that it \"ran without throwing\"?\nWould the test fail if you commented out the implementation? Red flag:\nexpect(result).toBeDefined(), mocks that return the expected result\ninstead of simulating real behavior.\n\n### Check 3: Happy path syndrome\nDoes it work with empty input (null, undefined, \"\")? Under concurrency?\nWith the external integration down? Red flag: no try/catch on I/O, no\ninput validation.\n\n### Check 4: Adjectives without evidence\nDid the AI use \"robust\", \"performant\", \"secure\", \"production-ready\"? Is\nthere a benchmark or technical argument to back it? Red flag: \"more\nperformant\" with no comparison.\n\n### Check 5: Sycophancy detection\nDid you push the AI in a direction and it changed its mind too fast,\nwithout a new technical argument? If so, ask explicitly: \"what's the\nevidence for this new position?\"\n\n### Check 6: Plan vs delivery\nDid the AI reference files/specs/paths as if they exist — in summaries,\nconclusions, or next steps? For each path, run `ls` or open the file: does\nit exist? Watch verb tense — \"X becomes the destination in Y/Z/\" is a plan\ndressed as delivery. In the final summary, separate \"I created\" (verified\nby `ls`) from \"I suggest creating\" (just a plan).\n\n### Check 7: Confidence on inferences\nDid the AI make claims it couldn't have verified directly? It must not give\nthe same affirmative authority to what it SAW in the code and what it\nINFERRED from folder structure. Demand a level: High (directly observed) /\nMedium (inferred from structure) / Low (speculative).\n\n## Constraints\n- DON'T mark \"verified\" if any check landed on \"maybe\". Go back to the code.\n- DON'T rely on the AI's tests alone. 
They came from the same source that\n  might have the bug.\n- DON'T skip a check because \"it's a small change\". Sycophancy scales —\n  small changes are where AI cheats most.\n```\n\nThe metric that started to matter to me isn't \"how much code per day.\" It's **how much bad code I kept from reaching production**. This skill is the instrument for that.\n\n`code-reviewer`\n\nShort and direct. Critical review at a senior backend Node/TS level, focused on correctness and security — not style, which Prettier handles.\n\n```\n# Skill: Code Reviewer\n\nCritical code review at a senior backend Node/TS level. Focus on\ncorrectness, security, maintainability — not style.\n\n## Instructions\n1. Read the provided diff.\n2. Classify each issue as Blocker, Major, or Minor.\n3. For each Blocker/Major, give a concrete example of the bug (input that\n   breaks it, race condition scenario).\n4. Suggest a refactor ONLY when it reduces real complexity. Not for aesthetics.\n5. End with a Verdict: Ship / Fix and ship / Block.\n\n## Constraints\n- Assume Node 20+, TypeScript strict, ESLint configured.\n- Don't comment on formatting.\n- Consider concurrency, DB transactions, idempotency, promise leaks,\n  silent error handling.\n- If it's an HTTP endpoint: input validation, authz, rate limit, PII logging.\n```\n\nNotice the pattern across all three: **Instructions + Constraints + Output Format**, always. The Constraints are the part that matters most, because that's where you encode the \"do NOT\" — and the \"do NOT\" is what keeps the AI on the rails.\n\nThe Achilles' heel of all this is documentation aging. A spec is a snapshot of the past; the code keeps changing. The direction that feels right to me is **living documentation** — generated and maintained on every change, not in a heroic quarterly effort. 
Tools are already aiming exactly at this, and while they mature, you can approximate it: take a snapshot of the project and ask the model to generate a context file from the code + `README.md`.\n\nHow I see this evolving inside a repo: alongside the `README.md`, a `_Skills/` folder with the team's structured prompts and anchor files (`CLAUDE.md`, `CODEX.md`) that list the available skills and point to the project's `CONTEXT.md`. The model enters the repo and already knows what to read, in what order.\n\nAnd there's a layer beyond this one. Context engineering is about what you put *into* a model call; the moment you wrap those calls in a system — quality gates, retries, an independent reviewer, persistent memory — you've crossed into what the field now calls *harness engineering*: the discipline of everything that isn't the model. The skill pipeline above is already a harness in miniature. That's where I go next.\n\nNone of this is exotic. It's the old engineering discipline — document decisions, keep patterns, write tests that test — with a new, impossible-to-ignore incentive: now the cost of skipping those steps shows up immediately, in the form of an agent that hallucinates. AI didn't change what good development is. It just stopped letting us pretend that shortcuts are free.\n\n*This is the flow I run today. 
The skills evolve every week as I catch the AI failing in new ways — if you run something similar and hit different walls, I'd want to know which.*", "url": "https://wpnews.pro/news/context-engineering-is-engineering-work-not-prompt-writing", "canonical_source": "https://dev.to/pablofelps/context-engineering-is-engineering-work-not-prompt-writing-203g", "published_at": "2026-06-25 21:37:40+00:00", "updated_at": "2026-06-25 21:42:48.767369+00:00", "lang": "en", "topics": ["artificial-intelligence", "large-language-models", "developer-tools", "ai-products", "ai-research"], "entities": ["Opus", "Sonnet", "Anthropic"], "alternates": {"html": "https://wpnews.pro/news/context-engineering-is-engineering-work-not-prompt-writing", "markdown": "https://wpnews.pro/news/context-engineering-is-engineering-work-not-prompt-writing.md", "text": "https://wpnews.pro/news/context-engineering-is-engineering-work-not-prompt-writing.txt", "jsonld": "https://wpnews.pro/news/context-engineering-is-engineering-work-not-prompt-writing.jsonld"}}