Why Your AI Agent Keeps Making the Same Mistakes (And a Structured Fix)


Source: dev.to · Published: 2026-05-20T10:12Z

This article presents a structured solution (the "PITFALL Rules") for organizing error-related knowledge in AI agent skill files, addressing the common problem of agents repeatedly making the same mistakes. The author argues that simply expanding context windows is insufficient because of the "lost in the middle" effect, and instead advocates decision tables grouped by priority so that agents can quickly diagnose and fix errors. In the author's A/B tests, the structured format reduced reasoning steps and removed ambiguity compared to traditional flat-list documentation.


I ran into this while maintaining production agent skills. As far as I can tell, nobody has systematically addressed how to organize error-related knowledge inside skills. Maybe someone has and I just haven't found it, but in everything I looked at, this angle was missing. Here's what worked for me.

I looked around the usual places: official forums, technical blogs, developer communities. One pattern stands out: everyone checks whether skill files resolve correctly and whether content is duplicated. But nobody asks the follow-up question: when an agent hits an error mid-execution, can it efficiently find the correct fix in SKILL.md or its reference files?

Fair question: if context windows keep growing, do we even need structure? Partially. Bigger contexts solve reading, because the model can ingest everything. But they don't solve attention. Anthropic's own context engineering research documents the "lost in the middle" effect: critical details buried in a long flat list get overlooked. Every irrelevant item your agent has to scan past is noise competing with the signal it actually needs. Think of it this way: you could give someone a 500-page manual or a 1-page troubleshooting card. They can read both, but the card is faster. This is an efficiency problem, not a capability problem.

After banging my head against this for a while, I landed on five rules for organizing error documentation in skills (the PITFALL Rules).

Group by priority. The specific category names depend on what your skill handles. Key principle: anomaly diagnosis always goes first, because that's what agents need when something breaks.

Instead of prose like "sometimes the task times out because there's too much data and you should split it", write this:

Symptom	Diagnosis	Fix
Task timeout, no output written	Check if output file exists	Don't retry same config. Split input and rerun
Same input fails ≥3 times	Persistent bottleneck	Bypass delegation, process directly in main session
An agent can scan the Symptom column, find a match, and read the Fix — typically 1-2 steps instead of reading through 30 prose items.
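To make the scan-and-match idea concrete, here is a minimal Python sketch that treats the table above as data and looks up a fix by symptom. The `PitfallRow` class, the `lookup` function, the keyword-overlap matcher, and the 0.5 "partial match" threshold are all assumptions made for illustration; a real agent matches symptoms by reading the table, not by calling code like this.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass(frozen=True)
class PitfallRow:
    symptom: str
    diagnosis: str
    fix: str


# The two rows from the table above, transcribed as data.
TABLE = [
    PitfallRow(
        symptom="Task timeout, no output written",
        diagnosis="Check if output file exists",
        fix="Don't retry same config. Split input and rerun",
    ),
    PitfallRow(
        symptom="Same input fails ≥3 times",
        diagnosis="Persistent bottleneck",
        fix="Bypass delegation, process directly in main session",
    ),
]


def _keywords(text: str) -> Set[str]:
    """Lowercase word set with punctuation stripped (a deliberately crude matcher)."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return set(cleaned.split())


def lookup(observed: str, table: List[PitfallRow]) -> Tuple[str, Optional[PitfallRow]]:
    """Return ("full" | "partial" | "none", best row) based on symptom keyword overlap."""
    observed_words = _keywords(observed)
    best_row, best_score = None, 0.0
    for row in table:
        symptom_words = _keywords(row.symptom)
        score = len(symptom_words & observed_words) / len(symptom_words)
        if score > best_score:
            best_row, best_score = row, score
    if best_score == 1.0:
        return "full", best_row
    if best_score >= 0.5:  # hypothetical threshold for a "partial match"
        return "partial", best_row
    return "none", None
```

The three-way result mirrors the behavior the table is meant to encourage: a full match yields the Fix directly, a partial match is flagged as partial instead of being forced, and no match means the table needs a new row.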
This applies to operation-manual-style skills — ones with detailed step-by-step instructions and inline warnings (⚠️). If your skill body already explains a gotcha right next to the relevant operation, the pitfall section should give a one-line cross-reference only. Pitfalls cover blind spots the body doesn't address.
For reference-style or API-doc-style skills where the body doesn't include inline warnings, this rule doesn't apply — all error knowledge goes in the pitfall section.
Before adding any item: Is the section already categorized? → Which category? → Does it need a decision table? → Does it duplicate an existing item? → Too many items? Consider splitting to a separate file.
After refactoring, check 7 dimensions: categorized, decision tables for diagnostics, no flat lists >5, no duplicates, <50% body-text overlap, no prose narratives, no information loss.
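Three of those seven dimensions (flat lists over 5 items, duplicates, and body-text overlap) lend themselves to a mechanical check. The Python sketch below is one possible approach, not the author's tooling: it assumes pitfall items are written as `- ` bullet lines, and it measures "overlap" as the share of an item's words that also appear in the skill body, which is an assumed definition since the article does not say how overlap is computed. All function names are hypothetical.

```python
import re
from typing import List, Set


def _words(text: str) -> Set[str]:
    """Lowercase alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def longest_flat_list(section: str) -> int:
    """Length of the longest run of consecutive '- ' bullet lines."""
    longest = current = 0
    for line in section.splitlines():
        if line.lstrip().startswith("- "):
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


def duplicate_items(items: List[str]) -> List[str]:
    """Items whose word set already appeared earlier in the list."""
    seen, dupes = set(), []
    for item in items:
        key = frozenset(_words(item))
        if key in seen:
            dupes.append(item)
        seen.add(key)
    return dupes


def body_overlap(item: str, body: str) -> float:
    """Fraction of the item's words that also occur in the skill body."""
    item_words = _words(item)
    if not item_words:
        return 0.0
    return len(item_words & _words(body)) / len(item_words)


def lint(section: str, body: str) -> List[str]:
    """Report violations of the three mechanically checkable dimensions."""
    items = [
        line.lstrip()[2:]
        for line in section.splitlines()
        if line.lstrip().startswith("- ")
    ]
    problems = []
    if longest_flat_list(section) > 5:
        problems.append("flat list longer than 5 items")
    for dupe in duplicate_items(items):
        problems.append(f"duplicate item: {dupe}")
    for item in items:
        if body_overlap(item, body) >= 0.5:
            problems.append(f"item mostly restates the body: {item}")
    return problems
```

The remaining dimensions (categorization, decision tables for diagnostics, no prose narratives, no information loss) need a human or model reviewer; a word-level check like this cannot judge them.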
I wanted data, not just intuition. So I ran A/B tests.
I picked three real errors from production agent skills. For each error, I constructed two isolated contexts — one group of agents only saw the old flat-list pitfall documentation, another only saw the new structured version. Same error description, same prompt. Multiple runs per version to reduce randomness.
Each test run is scored on 4 dimensions (Diagnosis, Fix, Wrong Suggestions, Steps), combined into one number:

Overall Score = Diagnosis + Fix − Wrong Suggestions − (Steps > 3 ? 1 : 0). Max: 4. Min: -2.
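The scoring formula translates directly into a few lines of Python. In this sketch the function name and the sample runs are hypothetical, and the per-dimension ranges are not enforced because the article only states the overall bounds (max 4, min -2); the example values assume Diagnosis and Fix each top out at 2.

```python
from statistics import mean


def overall_score(diagnosis: int, fix: int, wrong_suggestions: int, steps: int) -> int:
    """Overall Score = Diagnosis + Fix - Wrong Suggestions - (1 if Steps > 3 else 0)."""
    step_penalty = 1 if steps > 3 else 0
    return diagnosis + fix - wrong_suggestions - step_penalty


# Hypothetical runs for illustration only; these are not results from the article.
runs = [
    {"diagnosis": 2, "fix": 2, "wrong_suggestions": 0, "steps": 3},
    {"diagnosis": 2, "fix": 1, "wrong_suggestions": 1, "steps": 5},
]
scores = [overall_score(**run) for run in runs]
average = mean(scores)
```

Averaging per-run scores this way is how a figure such as "3.0 avg" can arise even when every individual score is an integer.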
Scenario A
New version: Decision table placed as the first category; the Fix column directly says "split to 1 item per task".
Shared behavior: Both versions correctly diagnosed the root cause: a dense analysis workload combined with a 2-item batch config.
Old version's quirk: Misunderstood "batch" as a 2-item unit and suggested splitting into 2 sub-agents instead of 4; averaged 3.7 reasoning steps.
New version's advantage: No ambiguity in the Fix column; averaged 3.0 reasoning steps.
Score: Flat list 3.0 avg → Structured 4.0 avg (+1.0)
Scenario B
New version: Consolidated into one decision table, with symptom → diagnosis → fix at a glance.
Shared behavior: Both versions tried to match the error against known patterns.
Old version's quirk: Diagnosed the error as "session ID changed after navigation", which was plausible but wrong and was assembled from fragments across 4 sections.
New version's advantage: Found a partial match against the decision table, explicitly flagged it as partial, and gave a conservative fix (stop and report) plus a meta-suggestion to add a new row for this scenario.
Score: Flat list 1.0 → Structured 3.0 (+2.0)
Scenario C
New version: Placed under a "Database" category with a fixed position.
Shared behavior: Both versions scored perfectly, with diagnosis and fix both correct.
Old version's quirk: The item was well-written enough to find regardless of position.
New version's advantage: Category placement makes the location more predictable, but in this case it didn't matter.
Score: Flat list 4.0 → Structured 4.0 (0)

Three findings:

Scattered information = biggest win. Scenario B went from 4 sections of 24 items to 1 decision table — a +2.0 improvement. When error knowledge is scattered across a document, agents waste steps piecing together clues.
Decision tables remove ambiguity. The flat-list version described "occasionally succeeds, occasionally times out." An agent misunderstood and gave the wrong fix. The decision table directly says what to do — no room for misinterpretation.
Decision tables make agents honest. When no table row perfectly matched the symptoms, the agent said "partial match" instead of forcing a wrong answer. Flat lists don't encourage this kind of honesty.

The whole project is up on my repo, feedback welcome: github.com/seanyan1984/skill-pitfalls

It's framework-agnostic: the rules work for any markdown-based skill or prompt documentation. If your error knowledge is growing wild in your agent skills, give it a try.
