One structured prompt format. Two identical tasks. Same model. Unstructured: 1,240 tokens per run. Structured (with an explicit schema): 847 tokens per run. That's a 32% reduction. It's real, it's repeatable, and it shows up in cost logs. But it's also the easy part.
The harder part is knowing whether those saved tokens actually translate to better answers on YOUR task. And knowing when structure helps and when it's just overhead.
I spent the last month running the same prompts against Claude Sonnet 4.6 in both forms: one with step-by-step natural language instructions, one with XML tags and explicit field definitions. Code generation tasks, reasoning tasks, multi-step workflows. Here's what the patterns actually show.
When you send a model a request in plain English, the model has to infer the shape you want. It's flexible. It's also ambiguous.
Write a function that validates user email addresses and returns helpful error messages.
The model will deliver SOMETHING. Maybe a function with inline validation. Maybe a helper class. Maybe a regex comment. Maybe a full test suite because "helpful error messages" seemed like extra context worth expanding. You got an answer, but you didn't specify the answer format.
Over five runs with Sonnet 4.6, the same unstructured prompt produced three different architectural shapes:
All correct. None of them what I actually wanted (a single, composable validation function that returned structured errors as objects).
Total tokens across five runs: 6,200. Average per run: 1,240.
Same task, now with explicit format:
Write a JavaScript function: validateEmail()
Requirements:
- Input: string (email address)
- Output: { valid: boolean, error: string | null }
- Implementation: regex-based validation only
- Error messages: return null if valid, specific error reason if invalid
Error categories:
- "missing_at": no @ symbol found
- "invalid_domain": domain lacks . or has no TLD
- "invalid_local": local part contains invalid characters
Return example:
{ valid: true, error: null }
{ valid: false, error: "invalid_domain" }
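If you send specs like this often, it can help to generate the prompt from a plain object so the field order never drifts between runs. A minimal sketch: `buildPrompt` and the shape of `emailSpec` are hypothetical names of my own, not part of any SDK, and the layout simply mirrors the prompt above.

```javascript
// Hypothetical helper: renders a spec object into the plain-text layout used above.
function buildPrompt(spec) {
  const lines = [
    `Write a ${spec.language} function: ${spec.name}()`,
    'Requirements:',
    ...spec.requirements.map((req) => `- ${req}`),
    'Error categories:',
    ...Object.entries(spec.errors).map(([code, meaning]) => `- "${code}": ${meaning}`),
    'Return example:',
    ...spec.examples,
  ];
  return lines.join('\n');
}

// The same spec as the prompt above, expressed as data.
const emailSpec = {
  language: 'JavaScript',
  name: 'validateEmail',
  requirements: [
    'Input: string (email address)',
    'Output: { valid: boolean, error: string | null }',
    'Implementation: regex-based validation only',
    'Error messages: return null if valid, specific error reason if invalid',
  ],
  errors: {
    missing_at: 'no @ symbol found',
    invalid_domain: 'domain lacks . or has no TLD',
    invalid_local: 'local part contains invalid characters',
  },
  examples: [
    '{ valid: true, error: null }',
    '{ valid: false, error: "invalid_domain" }',
  ],
};

console.log(buildPrompt(emailSpec));
```

Keeping the spec as data also means you can diff two prompt versions field by field instead of eyeballing strings.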
Over five runs with the same model, every output had the same shape. No factory functions, no classes, no extra bells and whistles. It did exactly what was asked.
Total tokens across five runs: 4,235. Average per run: 847.
32% reduction. No ambiguity. Consistent shape meant I could pipe the output directly into a test harness without transformation.
Here's what that actually looked like:
function validateEmail(email) {
  // Without an @ there is no local/domain split to inspect.
  const atIndex = email.indexOf('@');
  if (atIndex === -1) {
    return { valid: false, error: 'missing_at' };
  }

  // The domain needs a dot followed by a non-empty TLD.
  const domain = email.substring(atIndex + 1);
  const lastDot = domain.lastIndexOf('.');
  if (lastDot === -1 || lastDot === domain.length - 1) {
    return { valid: false, error: 'invalid_domain' };
  }

  // Check for invalid characters in local part.
  // Dots are legal there (first.last@example.com), so "." is not in this set.
  const localPart = email.substring(0, atIndex);
  const invalidChars = /[<>()\\[\],;:\s]/;
  if (invalidChars.test(localPart)) {
    return { valid: false, error: 'invalid_local' };
  }

  return { valid: true, error: null };
}
Every structured run produced this exact shape. Unstructured runs generated the same logic but wrapped it differently.
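Because the shape is fixed, a test harness can check it mechanically before looking at the logic at all. Here's a rough sketch of that kind of check. `checkResultShape` and `ERROR_CODES` are names I made up for illustration; the rules come straight from the spec in the structured prompt.

```javascript
// Error codes allowed by the structured prompt's spec.
const ERROR_CODES = ['missing_at', 'invalid_domain', 'invalid_local'];

// Returns a list of problems; an empty list means the result matches the spec.
function checkResultShape(result) {
  if (result === null || typeof result !== 'object') {
    return ['result is not an object'];
  }
  const problems = [];
  if (typeof result.valid !== 'boolean') {
    problems.push('valid must be a boolean');
  }
  if (result.valid === true && result.error !== null) {
    problems.push('error must be null when valid is true');
  }
  if (result.valid === false && !ERROR_CODES.includes(result.error)) {
    problems.push('error must be a known code when valid is false');
  }
  // Anything beyond { valid, error } means the output drifted from the spec.
  const extraKeys = Object.keys(result).filter((key) => key !== 'valid' && key !== 'error');
  if (extraKeys.length > 0) {
    problems.push(`unexpected keys: ${extraKeys.join(', ')}`);
  }
  return problems;
}

console.log(checkResultShape({ valid: true, error: null }));
console.log(checkResultShape({ valid: false, error: 'too_short' }));
```

The first call returns an empty list. The second reports one problem, because `too_short` is not one of the three codes in the spec.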
Here's the tricky part: tokens aren't the full story.
The unstructured versions were objectively MORE flexible. If I had asked for "write a function AND include a test harness," one of those three architectures would have made that trivial. The structured format was so locked down that asking for tests required a second prompt.
The benchmark-friendly metric (tokens saved) is real. The useful metric (does this output feed directly into my pipeline?) is context-specific. The two can give different answers, and how much weight each deserves depends on the task.
Code generation tasks: structure wins hard. You have a format spec. You want the model to follow it. Tokens drop, consistency rises.
I ran the same comparison on five reasoning tasks (writing essays, analyzing text, brainstorming). The token savings were still there (29% average), but a quality tradeoff appeared. Structured prompts locked the reasoning into tighter paths. Some essays came out more formulaic. Not worse, just more constrained.
The model hit a schema compliance target instead of exploring the actual reasoning space.
For code: schema compliance IS the target. For reasoning: sometimes the messiness is the point.
Using current pricing (Sonnet 4.6 input at $3/1M, output at $15/1M), average input tokens 2,000, average output 800:
Unstructured approach:
Structured approach:
Difference: $0.0006 per 100 calls. On pricing, it's noise. On latency (fewer output tokens = faster), it matters more.
If your task regularly outputs 4,000 tokens, suddenly the math shifts. Structured formats that cut 4,000-token outputs by 30% actually save something you notice.
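To see how that scales, here's a small calculator you can plug your own numbers into. It's a sketch under stated assumptions: the prices are the ones quoted above, the 2,000 input tokens stay the same in both cases, and the 30% cut lands entirely on output tokens. `costPerCall` and `PRICE_PER_MTOK` are illustrative names.

```javascript
// USD per million tokens, from the pricing quoted above.
const PRICE_PER_MTOK = { input: 3, output: 15 };

// Cost of one call in USD for the given token counts.
function costPerCall(inputTokens, outputTokens, price = PRICE_PER_MTOK) {
  return (inputTokens * price.input + outputTokens * price.output) / 1e6;
}

// Assumed scenario: 4,000 output tokens unstructured, 30% fewer (2,800) structured.
const baselineCost = costPerCall(2000, 4000);
const structuredCost = costPerCall(2000, 2800);
const savedPer1000Calls = (baselineCost - structuredCost) * 1000;

console.log(`saved per 1,000 calls: $${savedPer1000Calls.toFixed(2)}`);
```

Under those assumptions the gap is about $18 per 1,000 calls. Output tokens dominate because they cost five times as much as input tokens at these prices.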
What's interesting is what the output patterns reveal about how models parse instructions.
Models trained on massive code datasets have seen thousands of function specifications. When you send a structured spec (name, input type, output type, constraints), you're probably hitting a pattern the model has seen many times before. It copies the shape. Fast, consistent, fewer tokens.
When you send natural language, the model has to build context from scratch. It's slower, fuzzier, more creative. For code, that's overhead. For reasoning, that's sometimes the whole point.
The models aren't "reasoning through" the unstructured prompt. They're doing pattern matching on a less constrained pattern set. Which is fine. Just know that's what's happening. The structured version isn't necessarily smarter. It's just aimed at a narrower target.
If you're optimizing cost on code generation at scale:
If you're working on reasoning or analysis:
The people telling you "always structure your prompts" are right about code. They're also copying advice from a code-heavy community. Test it on your task. The benchmark lift doesn't predict real utility. Your data does.
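Testing it on your own task mostly comes down to logging token counts per run for each prompt variant and comparing averages. A minimal sketch, with `averageTokens` and `reduction` as illustrative helper names; the totals plugged in at the bottom are the ones from my five-run email validation comparison.

```javascript
// Average tokens per run, given a list of per-run token counts.
function averageTokens(tokenCounts) {
  if (tokenCounts.length === 0) {
    throw new Error('need at least one run');
  }
  const total = tokenCounts.reduce((sum, count) => sum + count, 0);
  return total / tokenCounts.length;
}

// Fractional reduction from baseline to candidate (0.3 means 30% fewer tokens).
function reduction(baselineAvg, candidateAvg) {
  return (baselineAvg - candidateAvg) / baselineAvg;
}

// Totals reported earlier: 6,200 and 4,235 tokens across five runs each.
const unstructuredAvg = 6200 / 5;
const structuredAvg = 4235 / 5;

console.log(`${(reduction(unstructuredAvg, structuredAvg) * 100).toFixed(1)}%`);
```

That prints 31.7%, which is where the rounded 32% comes from. Token averages only cover the cost side, so pair them with whatever quality check fits your task.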
Tags: #ai #tutorial #javascript #optimization