I want to be upfront about something before we get into it. None of the frameworks in this article is mine. The ideas here come from two people who have been thinking about this stuff way harder and longer than I have — and they deserve full credit before I say another word.
Dan Shapiro — CEO of Glowforge, Wharton Research Fellow, and the person who gave this whole conversation a vocabulary. His blog post “The Five Levels: from Spicy Autocomplete to the Dark Factory” is the conceptual spine of everything I’m about to say. Read the original. It’s short, sharp, and will make you uncomfortable in the best way. danshapiro.com
Nate B. Jones — AI strategist, zero-hype practitioner, and the person whose YouTube channel made me realize I had been fooling myself about where I actually sat on this ladder. His video “The 5 Levels of AI Coding (Why Most of You Won’t Make It Past Level 2)” is what triggered this entire newsletter. natebjones.com — Watch the video
This newsletter, The Level 5 Engineer, is my public learning log. I’m a Senior Software Engineer and a Tech Lead, currently somewhere between Level 2 and Level 3 on a good day (the levels are the ones in this newsletter’s title). The goal is Level 5. I’m documenting the climb in real time: the frameworks, the tools, the mindset shifts, and the moments where I realize I’ve been doing it wrong. If you’re on a similar journey, pull up a chair.
Layer 1 is complete. Eight issues, a working order management API, Pact contracts, a CI/CD pipeline, and a spec audit framework. The specification layer is done.
Layer 2 starts here. And it begins with a question that sounds simple until you think about it: why do you keep rewriting the same prompts?
If you've been using AI seriously for more than a few weeks, you have a collection of prompts that work. You've refined them. You copy them between sessions. You paste them into Claude Code at the start of a task and the agent does the right thing.
That feels like a system. It isn't.
Here's what copying a prompt actually does: it copies the words. It doesn't copy the contract. The agent reads the words, interprets them in the context of this session, and makes a series of decisions that aren't in the prompt. Different sessions, different context, different decisions — even with the same words. You won't notice until two agents produce incompatible outputs from the same prompt and you have to figure out which one is right.
A skill is different. A skill specifies what to produce, not just what to consider. It has a version, an output contract, and a routing signal. It gets better over time and the improvements persist. It's the difference between a note you wrote to yourself and infrastructure your whole team — human and agent — can depend on.
I reviewed the entire order-api project to find the best prompt-to-skill conversion candidate. Three instructions surfaced:
The test-run verification sequence (pytest tests/steps/ -v && pytest tests/pact/ -v && python scripts/can_i_deploy.py) appears in every session. Rejected: it's a procedure, not a judgment call. Any agent can run three commands.
The findings file protocol appears in CLAUDE.md and has been followed since Issue #3. Rejected — it describes a format and cadence, not a methodology.
The Gherkin scenario quality evaluation — the methodology for deciding whether a scenario is well-formed before accepting or writing it — appeared across Issues #5, #7, and #8. Every time, the agent re-derived the same judgment framework from scratch. This is the winner.
Why: it encodes judgment, not procedure. Whether a step is UNDERSPECIFIED or LEAKY ABSTRACTION is a reasoning call. Its output drives everything downstream — every implementation session depends on the scenarios being well-formed. A bad scenario written in a planning session becomes broken step definitions two sessions later.
And here's the uncomfortable detail: the timeout ambiguity that was fixed in Issue #8 ("And the response is returned within 12 seconds") was introduced in Issue #2. Three sessions inherited it silently before it was caught. A quality evaluation skill running in Issue #2 would have caught it before it was ever committed.
Here's the current prompt as it would be pasted into a session:
Before writing or accepting a Gherkin scenario, check that it is well-formed. A well-formed scenario describes behavior from the caller's perspective, not from the implementation. Each step should be specific enough that only one implementation can satisfy it. Check for: vague quantities, counts that could be read as total or additional, time bounds without a start anchor, mechanism claims without the mechanism, and internal field names leaking into the spec. If the scenario has these problems, rewrite it before proceeding.
Four weaknesses:
No output format. The prompt implies the agent will either rewrite or proceed — but doesn't say what to return. A list of problems? An annotated version? The corrected Gherkin? Two agents will produce two different output shapes. A downstream agent consuming either has to parse whichever shape it receives.
No taxonomy. "Internal field names" means something specific — but the prompt doesn't define what internal means vs external. Two agents draw that line differently.
No handling for partial problems. "Rewrite before proceeding" — does that mean rewrite every step, or only the problematic ones?
Context-dependence. The prompt was written for reviewing an existing scenario. Used in a planning context ("write a new scenario for DELETE /orders/{id}"), an agent either ignores the check or applies it mid-draft. The two strategies produce different levels of revision.
The fundamental problem: the prompt describes what to look for but not what to produce. The agent invents the output format. When the output format is invented, it is never the same twice.
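One way to picture "what to produce" is a fixed return shape that every review fills in the same way. The sketch below is a minimal illustration, not the skill's actual contract: the names `DebtClass`, `Finding`, and `ScenarioReview` are hypothetical, the enum holds only the debt classes this article names, and the classification used in the usage example is an assumption made for the demo.

```python
from dataclasses import dataclass, field
from enum import Enum


class DebtClass(Enum):
    # Only the classes named in the article; the full taxonomy is not shown there.
    UNDERSPECIFIED = "UNDERSPECIFIED"
    LEAKY_ABSTRACTION = "LEAKY ABSTRACTION"
    IMPLICIT_FLOW = "IMPLICIT FLOW"


@dataclass
class Finding:
    step: str              # the original step text
    debt_class: DebtClass  # which kind of problem the step has
    rewrite: str           # the proposed replacement step


@dataclass
class ScenarioReview:
    scenario_name: str
    findings: list[Finding] = field(default_factory=list)

    @property
    def verdict(self) -> str:
        # A scenario with no findings is accepted as written.
        return "REWRITE" if self.findings else "ACCEPT"

    def render(self) -> str:
        # Same layout every time, so a downstream agent parses one shape.
        lines = [f"{self.verdict}: {self.scenario_name}"]
        for f in self.findings:
            lines.append(f"- [{f.debt_class.value}] {f.step!r} -> {f.rewrite!r}")
        return "\n".join(lines)


# Usage: the debt class assigned here is illustrative, not taken from the article.
review = ScenarioReview(
    "Cancel a confirmed order before payment capture",
    [Finding(
        "Then the order is cancelled",
        DebtClass.UNDERSPECIFIED,
        'Then the response body contains a "status" field with value "CANCELLED"',
    )],
)
print(review.render())
```

Whatever the real contract looks like, the point is that the shape is decided once, in the skill, rather than reinvented by each session.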
Same input. Same endpoint. Prompt version first, skill version second.
The task: Write a Gherkin scenario for DELETE /orders/{order_id} that cancels a confirmed order. The order can only be cancelled if payment has not been captured yet.
Task A — prompt version output:
Scenario: Cancel a confirmed order before payment capture
Given a confirmed order exists with order ID "order-123"
And payment has not been captured for the order
When the user sends DELETE /orders/order-123
Then the order is cancelled
And the response status code is 200
And the response includes a confirmation message
Six implicit decisions the agent made that weren't in the prompt:
Total implicit decisions: 6. All silent.
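Some of that silence is mechanically visible. The sketch below is a crude stand-in, not the skill's diagnostic: it flags outcome steps (a Then and the And/But steps that follow it) that contain neither a quoted value nor a number, on the assumption that such steps leave the assertion to the implementer. The function name `vague_outcome_steps` and the heuristic itself are hypothetical.

```python
import re

# Keywords that start a Gherkin step.
STEP_KEYWORDS = ("Given", "When", "Then", "And", "But")


def vague_outcome_steps(scenario: str) -> list[str]:
    """Return outcome steps that contain no quoted value and no number."""
    flagged = []
    section = None  # the most recent Given/When/Then keyword seen
    for raw in scenario.splitlines():
        line = raw.strip()
        keyword = line.split(" ", 1)[0] if line else ""
        if keyword not in STEP_KEYWORDS:
            continue  # skips the "Scenario:" header and blank lines
        if keyword in ("Given", "When", "Then"):
            section = keyword
        has_literal = re.search(r'"[^"]+"|\b\d+\b', line) is not None
        if section == "Then" and not has_literal:
            flagged.append(line)
    return flagged


TASK_A = """\
Scenario: Cancel a confirmed order before payment capture
Given a confirmed order exists with order ID "order-123"
And payment has not been captured for the order
When the user sends DELETE /orders/order-123
Then the order is cancelled
And the response status code is 200
And the response includes a confirmation message
"""

for step in vague_outcome_steps(TASK_A):
    print(step)
```

Run against the prompt-version scenario, this flags "Then the order is cancelled" and "And the response includes a confirmation message". A regex cannot make the judgment calls the skill makes, but it shows the kind of gap the prompt let through.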
Task B — skill version output:
Scenario: Confirmed order is cancelled when payment has not yet been captured
Given an order was created via POST /orders with order ID "order-del-001" and status "CONFIRMED"
And the payment gateway has not captured payment for order "order-del-001"
When the client sends DELETE /orders/order-del-001
Then the response HTTP status is 200
And the response body contains a "status" field with value "CANCELLED"
And the response body contains an "order_id" field with value "order-del-001"
And the payment gateway receives no capture or charge requests for order "order-del-001"
Scenario: Cancellation is rejected when payment has already been captured
Given an order was created via POST /orders with order ID "order-del-002" and status "CONFIRMED"
And the payment gateway has already captured payment for order "order-del-002"
When the client sends DELETE /orders/order-del-002
Then the response HTTP status is 422
And the response body contains an "error" field with value "PAYMENT_ALREADY_CAPTURED"
And the order status remains "CONFIRMED"
Two implicit decisions — both surfaced explicitly:
Total implicit decisions: 2. Both visible.
| What changed | Classification |
|---|---|
| "confirmed order exists" → "created via POST /orders with status CONFIRMED" | SKILL CONSTRAINT |
| "payment has not been captured" → "the payment gateway has not captured payment for order-del-001" | SKILL CONSTRAINT |
| "the order is cancelled" → HTTP status + "status" field with value "CANCELLED" | QUALITY DELTA |
| "confirmation message" → specific field name and value | QUALITY DELTA |
| (absent) → "payment gateway receives no capture requests" | SKILL CONSTRAINT |
| (absent) → full second scenario for failure case | QUALITY DELTA |
Six meaningful differences. Three skill constraints, three quality deltas, six prompt ambiguities eliminated.
1. Version control
A prompt has no version. When you improve it, you copy the new text into the next session. The old version exists in your clipboard history or a chat transcript from three weeks ago. You cannot diff it. You cannot pin a session to it. You cannot see what changed between the prompt that worked and the prompt that produced the wrong output.
The Gherkin quality skill lives in docs/skills/gherkin-scenario-quality.md. When Issue #8 added the IMPLICIT FLOW debt class, the skill got a one-line update:
+| IMPLICIT FLOW | A step that implies a follow-up flow that is not specced anywhere |
Every session after that commit uses the updated skill. Every session before it used the previous version. git blame tells you exactly when IMPLICIT FLOW was added and which issue prompted it. With a prompt, "skill v1.1" means nothing. There is only "the prompt I'm using today."
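Because the skill is a file, the change is diffable with ordinary tooling. The sketch below uses Python's standard `difflib` to reproduce the one-line update in memory. The table header and the first two rows are placeholders: the article names those debt classes but does not show their descriptions, so they are elided here.

```python
import difflib

# Placeholder rows: the header text and the "..." descriptions are assumptions,
# since the article does not show the skill file's existing contents.
skill_v1 = [
    "| Debt class | Meaning |",
    "|---|---|",
    "| UNDERSPECIFIED | ... |",
    "| LEAKY ABSTRACTION | ... |",
]

# The one-line update quoted in the article.
skill_v2 = skill_v1 + [
    "| IMPLICIT FLOW | A step that implies a follow-up flow that is not specced anywhere |",
]

diff = list(difflib.unified_diff(
    skill_v1,
    skill_v2,
    fromfile="gherkin-scenario-quality.md (before)",
    tofile="gherkin-scenario-quality.md (after)",
    lineterm="",
))
print("\n".join(diff))
```

In the real repo, git does this for you. The sketch only shows that a versioned file gives you a before and an after to compare, which a pasted prompt never does.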
2. Output contract
The skill specifies exactly what it must return:
# Assumption: comments

The downstream dependency is the step definition author. When tests/steps/test_order_creation.py implements "And the payment gateway received exactly one charge request", the phrases "exactly one", "charge request", and "payment gateway" are all actionable. When it implements "And the response includes a confirmation message", the author must invent an assertion. That invention is where test coverage becomes unreliable.
The output contract is the interface between the agent that writes scenarios and the agent that implements from them.
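To make "actionable" concrete, here is a minimal sketch of how a step written to the contract maps directly onto an assertion, while a vague step gives the implementer nothing to parse. It is plain Python rather than a real BDD framework binding, and `FIELD_STEP`, `assertion_for`, `check`, and the sample response body are all hypothetical.

```python
import re

# Matches steps shaped like the skill-version outcome steps, e.g.
#   the response body contains a "status" field with value "CANCELLED"
FIELD_STEP = re.compile(
    r'the response body contains an? "([^"]+)" field with value "([^"]+)"'
)


def assertion_for(step: str):
    """Return (field, expected_value) if the step pins down an assertion, else None."""
    match = FIELD_STEP.search(step)
    return match.groups() if match else None


def check(step: str, response_body: dict) -> bool:
    """Evaluate a field step against a response body; raise if the step is not actionable."""
    parsed = assertion_for(step)
    if parsed is None:
        raise ValueError(f"step does not say what to assert: {step!r}")
    field, expected = parsed
    return response_body.get(field) == expected


# Hypothetical response body for the successful cancellation.
body = {"order_id": "order-del-001", "status": "CANCELLED"}

print(check('And the response body contains a "status" field with value "CANCELLED"', body))
print(assertion_for("And the response includes a confirmation message"))
```

The first call prints True; the second prints None, because nothing in "a confirmation message" names a field or a value. That None is the decision the step definition author would otherwise have to make alone.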
3. Routing signal description
The skill's description line:
Evaluate and produce well-formed Gherkin scenarios for the order-api project using the five-question debt diagnostic and output contract.
It names the artifact type, the project, the method, and the output. An agent knows exactly when to use this skill and what it will receive.
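Those four things can be checked mechanically. The sketch below is a simple substring check, assuming the terms in `REQUIRED` are a fair reading of what the description names for each role. The function `missing_routing_terms` and the shortened contrast description are hypothetical.

```python
def missing_routing_terms(description: str, required: dict[str, str]) -> list[str]:
    """Return the roles (artifact, project, ...) whose term is absent from the description."""
    text = description.lower()
    return [role for role, term in required.items() if term.lower() not in text]


# The four things a description should name. The specific term chosen for
# each role is this sketch's reading of the skill's description line.
REQUIRED = {
    "artifact": "Gherkin",
    "project": "order-api",
    "method": "debt diagnostic",
    "output": "output contract",
}

skill_description = (
    "Evaluate and produce well-formed Gherkin scenarios for the order-api project "
    "using the five-question debt diagnostic and output contract."
)
shortened = "Evaluate Gherkin scenarios."  # hypothetical trimmed-down variant

print(missing_routing_terms(skill_description, REQUIRED))
print(missing_routing_terms(shortened, REQUIRED))
```

The full description leaves nothing missing. The shortened variant keeps the artifact type but drops the project, the method, and the output, which is exactly the information an agent needs to decide whether the skill applies.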
A bad description for the same skill:
Help with writing tests and checking scenarios for the project.
"Tests" matches pytest, Pact contracts, unit tests, and Gherkin. "The project" matches any repo. No methodology named means two agents doing "help with writing tests" produce incompatible outputs — which is exactly the problem the skill exists to solve.
If both the prompt and the skill produce output that works, the difference is this:
The prompt produces output that passes today's tests. The skill produces output that a different agent can implement tomorrow without making any decisions you didn't make.
That's why copying prompts isn't enough. The words travel. The contract doesn't.
Next issue: The 3-Tier Skill Architecture in Practice — mapping your skills to the right tier and why Tier 2 is where individual expertise becomes organizational leverage.
This article was written with the assistance of AI tools.