{"slug": "ai-agents-and-persistent-context-what-design-md-teaches-us", "title": "AI Agents and Persistent Context: What design.md Teaches Us", "summary": "A GitHub repository called design.md, which provides AI agents with a persistent design document, has gained over 1,400 stars. The approach addresses context fragmentation in agent development by offering a single source of truth that persists across sessions. The repository also emphasizes observability in agent workflows, including systematic benchmarking with the CUA Benchmark.", "body_md": "A GitHub repository called design.md has been trending recently, accumulating over 1,400 stars. The concept is straightforward: provide AI agents with a persistent design document they can reference throughout their work.\n\nThis approach addresses a practical challenge in agent development that many teams encounter.\n\nThe Context Challenge\n\nWhen working on complex tasks, AI agents need to understand the broader picture. What's the architecture? What constraints exist? What approaches have been tried before?\n\nTypically, agents get context from:\n\nCurrent conversation (limited window)\n\nCode comments (often outdated)\n\nDocumentation (if it exists)\n\nThe issue is that this context is fragmented and temporary. When the conversation moves forward, earlier context disappears. When documentation is outdated, agents make incorrect assumptions.\n\nA design.md provides a single source of truth that persists across sessions.\n\nWhat Belongs in design.md\n\nAn effective design.md answers these questions:\n\nBeyond feature lists, document the core purpose. Why does this project exist? 
What problem does it solve?\n\nDocument major choices and their rationale:\n\n\"PostgreSQL was chosen over MongoDB because ACID guarantees are required for financial transactions\"\n\n\"Microservices architecture was adopted because components have different scaling requirements\"\n\nList technical constraints (performance requirements, browser support), business constraints (budget, timeline), and regulatory constraints (GDPR, HIPAA).\n\nDocument failed approaches to prevent agents from suggesting rejected solutions.\n\nKnown issues, technical debt, and areas needing improvement help agents prioritize work.\n\nHow Agents Use design.md\n\nWhen starting a task, agents can:\n\nRead design.md to understand context\n\nMake decisions aligned with the documented architecture\n\nAvoid solutions that violate constraints\n\nReference design.md in reasoning\n\nThis leads to more coherent and consistent work. Agents work within a broader framework rather than just reacting to immediate tasks.\n\nKeeping design.md Updated\n\nThe main risk with design.md is that it becomes outdated. Effective practices include:\n\nMake it part of the workflow\n\nUpdate design.md immediately when making significant architectural decisions. Waiting until \"later\" means it never happens.\n\nVersion control it\n\nKeep design.md in the repository. During PR reviews, check if design.md needs updating.\n\nReview it regularly\n\nSchedule periodic reviews (monthly or quarterly) to ensure the document reflects current reality.\n\nLet agents help\n\nAgents can assist in maintaining design.md by:\n\nSuggesting updates when they notice inconsistencies\n\nSummarizing changes from recent commits\n\nFlagging outdated information\n\nObservability in Agent Workflows\n\nEven with a good design.md, observing what agents actually do is important. This is particularly relevant for GUI agents interacting with complex interfaces.\n\nConsider a GUI agent tasked with \"fill out this form and submit it\". 
The agent needs to:\n\nLocate form fields\n\nEnter correct data\n\nHandle validation errors\n\nSubmit the form\n\nVerify success\n\nEach step can fail in different ways. Without observability, only the final result is visible: success or failure. The reason for failure remains unknown.\n\nBuilding Observable Workflows\n\nGood observability includes:\n\nRecord each action:\n\nWhat was observed (screenshots, DOM state)\n\nWhat decision was made\n\nWhat actually happened\n\nWhether it matched expectations\n\nTrack:\n\nSuccess rate per task type\n\nAverage steps to completion\n\nTime per step\n\nFailure modes\n\nWhen things go wrong, categorize errors:\n\nPerception errors (agent didn't see the right element)\n\nDecision errors (agent chose the wrong action)\n\nExecution errors (action failed due to external factors)\n\nThis data helps identify where improvements are needed.\n\nSystematic Benchmarking\n\nCUA Benchmark provides systematic observability through:\n\n100 test cases across 5 different web applications\n\nStandardized task definitions\n\nAutomated result verification\n\nDetailed performance metrics\n\nRunning agents against CUA Benchmark provides quantitative data:\n\nOverall success rate\n\nSuccess rate by task type\n\nAverage steps per task\n\nCommon failure points\n\nThis data is valuable for iterative improvement. Instead of guessing what to optimize, specific areas where agents struggle can be identified and addressed.\n\nA Practical Example: Mano-AFK\n\nMano-AFK is an open-source autonomous application builder that demonstrates these principles. The workflow includes:\n\nReceiving a natural language description of what to build\n\nGenerating a PRD (Product Requirements Document)\n\nWriting the code\n\nDeploying to a test environment\n\nRunning tests (lint, API, E2E)\n\nAuto-fixing any issues\n\nDelivering the final application\n\nThroughout this process, the agent references rules.md and preferences.md files to maintain consistency across projects. 
These files provide persistent context that guides decisions.\n\nCUA Benchmark results for Mano-AFK:\n\nW8A16 quantization: 58.0% accuracy\n\nW8A8 quantization (Cider): 54.0% accuracy, but faster inference (~1,453 tok/s prefill)\n\nThese numbers show that the W8A8 version is 4.0 percentage points less accurate but faster at inference. Depending on the use case, one might be preferred over the other.\n\nWithout systematic benchmarking, this data wouldn't exist. Only vague impressions like \"it works sometimes\" or \"it's kind of slow\" would remain.\n\nPractical Recommendations\n\nWhen building agent workflows, these practices have proven effective:\n\nBefore writing agent code, document architecture, constraints, and key decisions. This document guides both human developers and AI agents.\n\nDon't add logging later. Design agent workflows to be observable from the start. Every step should produce some form of output that can be inspected.\n\nEstablish a benchmark suite early. Run it regularly. Track metrics over time. This provides objective data on whether changes are improvements or regressions.\n\nWhen low success rates are observed, examine failure modes. Instead of making broad changes, identify specific failure patterns and address them directly.\n\nWhether through design.md, rules.md, or other mechanisms, ensure agents have access to persistent context. Conversation history is too ephemeral for complex projects.\n\nMoving Forward\n\nAI agent engineering is still in its early stages. Best practices are still being figured out, but two things are becoming clear:\n\nAgents need persistent context to do good work (design.md)\n\nAgent workflows need systematic observability to improve (benchmarks and logging)\n\nThese aren't advanced techniques. 
They're foundational practices that make everything else work better.\n\nIf you're interested in seeing these principles in action, Mano-AFK ([https://github.com/Mininglamp-AI/Mano-AFK](https://github.com/Mininglamp-AI/Mano-AFK)) is an open-source autonomous application builder that uses persistent context files and systematic benchmarking to improve agent reliability.\n\nFor those working on GUI agents, Mano-P ([https://github.com/Mininglamp-AI/Mano-P](https://github.com/Mininglamp-AI/Mano-P)) implements think-act-verify loops and online reinforcement learning, achieving a 58.2% success rate on the OSWorld benchmark (specialized models category).\n\nBoth projects are Apache 2.0 licensed. Stars and contributions are welcome.", "url": "https://wpnews.pro/news/ai-agents-and-persistent-context-what-design-md-teaches-us", "canonical_source": "https://dev.to/mininglamp/ai-agents-and-persistent-context-what-designmd-teaches-us-4l9b", "published_at": "2026-06-26 09:45:07+00:00", "updated_at": "2026-06-26 10:03:56.113016+00:00", "lang": "en", "topics": ["ai-agents", "developer-tools", "large-language-models"], "entities": ["GitHub", "design.md", "CUA Benchmark"], "alternates": {"html": "https://wpnews.pro/news/ai-agents-and-persistent-context-what-design-md-teaches-us", "markdown": "https://wpnews.pro/news/ai-agents-and-persistent-context-what-design-md-teaches-us.md", "text": "https://wpnews.pro/news/ai-agents-and-persistent-context-what-design-md-teaches-us.txt", "jsonld": "https://wpnews.pro/news/ai-agents-and-persistent-context-what-design-md-teaches-us.jsonld"}}