AI Agents and Persistent Context: What design.md Teaches Us

A GitHub repository called design.md, which gives AI agents a persistent design document, has gained over 1,400 stars. The approach addresses context fragmentation in agent development by offering a single source of truth that persists across sessions. This article also covers observability in agent workflows, including systematic benchmarking with the CUA Benchmark.

A GitHub repository called design.md has been trending recently, accumulating over 1,400 stars. The concept is straightforward: provide AI agents with a persistent design document they can reference throughout their work. This approach addresses a practical challenge in agent development that many teams encounter.

The Context Challenge

When working on complex tasks, AI agents need to understand the broader picture. What's the architecture? What constraints exist? What approaches have been tried before?

Typically, agents get context from:

- Current conversation (limited window)
- Code comments (often outdated)
- Documentation (if it exists)

The issue is that this context is fragmented and temporary. When the conversation moves forward, earlier context disappears. When documentation is outdated, agents make incorrect assumptions.

A design.md provides a single source of truth that persists across sessions.

What Belongs in design.md

An effective design.md covers the following:

- Core purpose. Beyond feature lists, document why the project exists and what problem it solves.
- Major decisions and their rationale, for example:
  - "PostgreSQL was chosen over MongoDB because ACID guarantees are required for financial transactions"
  - "Microservices architecture was adopted because components have different scaling requirements"
- Constraints: technical (performance requirements, browser support), business (budget, timeline), and regulatory (GDPR, HIPAA).
- Failed approaches, so that agents do not suggest solutions that were already rejected.
- Known issues, technical debt, and areas needing improvement, which help agents prioritize work.

How Agents Use design.md

When starting a task, agents can:

- Read design.md to understand context
- Make decisions aligned with the documented architecture
- Avoid solutions that violate constraints
- Reference design.md in their reasoning

This leads to more coherent and consistent work. Agents work within a broader framework rather than just reacting to immediate tasks.
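
The first of those steps, reading design.md before acting, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the design.md repository's own tooling: the function names `load_design_context` and `build_prompt`, the prompt wording, and the idea of prepending the document to the task are all hypothetical choices made for this example.

```python
from pathlib import Path


def load_design_context(repo_root: str, filename: str = "design.md") -> str:
    """Return the design document's text, or an empty string if it is missing."""
    path = Path(repo_root) / filename
    return path.read_text(encoding="utf-8") if path.is_file() else ""


def build_prompt(task: str, design_context: str) -> str:
    """Place the persistent design context ahead of the per-session task.

    Hypothetical prompt layout: the wording below is an assumption,
    not a format required by any particular agent framework.
    """
    if not design_context.strip():
        # No design document available: fall back to the bare task.
        return f"Task:\n{task}"
    return (
        "Project design document (follow its decisions and constraints):\n"
        f"{design_context.strip()}\n\n"
        f"Task:\n{task}"
    )
```

Because the document is read from the repository at the start of every task, the same decisions and constraints reach the agent in each session, independent of how much conversation history is still in the window.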

Keeping design.md Updated

The main risk with design.md is that it becomes outdated. Effective practices include:

- Make it part of the workflow. Update design.md immediately when making significant architectural decisions. Waiting until "later" means it never happens.
- Version control it. Keep design.md in the repository. During PR reviews, check whether design.md needs updating.
- Review it regularly. Schedule periodic reviews (monthly or quarterly) to ensure the document reflects current reality.
- Let agents help. Agents can assist in maintaining design.md by:
  - Suggesting updates when they notice inconsistencies
  - Summarizing changes from recent commits
  - Flagging outdated information

Observability in Agent Workflows

Even with a good design.md, observing what agents actually do is important. This is particularly relevant for GUI agents interacting with complex interfaces.

Consider a GUI agent tasked with "fill out this form and submit it". The agent needs to:

1. Locate form fields
2. Enter correct data
3. Handle validation errors
4. Submit the form
5. Verify success

Each step can fail in different ways. Without observability, only the final result is visible: success or failure. The reason for failure remains unknown.

Building Observable Workflows

Good observability includes the following.

Record each action:

- What was observed (screenshots, DOM state)
- What decision was made
- What actually happened
- Whether it matched expectations

Track:

- Success rate per task type
- Average steps to completion
- Time per step
- Failure modes

When things go wrong, categorize errors:

- Perception errors: the agent didn't see the right element
- Decision errors: the agent chose the wrong action
- Execution errors: the action failed due to external factors

This data helps identify where improvements are needed.
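
The per-step record, the tracked metrics, and the three error categories above map naturally onto a small data model. The following is a sketch under assumptions: `StepRecord`, `TaskRun`, `ErrorKind`, and `summarize` are hypothetical names introduced for illustration, and "average steps to completion" is interpreted here as the mean step count over successful runs only.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class ErrorKind(Enum):
    PERCEPTION = "perception"  # agent didn't see the right element
    DECISION = "decision"      # agent chose the wrong action
    EXECUTION = "execution"    # action failed due to external factors


@dataclass
class StepRecord:
    observation: str           # e.g. a screenshot path or DOM snapshot id
    decision: str              # the action the agent chose
    outcome: str               # what actually happened
    matched_expectation: bool
    duration_s: float
    error: Optional[ErrorKind] = None


@dataclass
class TaskRun:
    task_type: str
    succeeded: bool
    steps: list = field(default_factory=list)  # list of StepRecord


def summarize(runs):
    """Aggregate the four tracked metrics from a list of TaskRun objects."""
    outcomes_by_type = defaultdict(list)
    for run in runs:
        outcomes_by_type[run.task_type].append(run.succeeded)
    # Assumption: only successful runs count toward "steps to completion".
    completed = [len(r.steps) for r in runs if r.succeeded]
    steps = [s for r in runs for s in r.steps]
    return {
        "success_rate_by_type": {
            t: sum(v) / len(v) for t, v in outcomes_by_type.items()
        },
        "avg_steps_to_completion": (
            sum(completed) / len(completed) if completed else None
        ),
        "avg_time_per_step_s": (
            sum(s.duration_s for s in steps) / len(steps) if steps else None
        ),
        "failure_modes": Counter(
            s.error.value for s in steps if s.error is not None
        ),
    }
```

Keeping the error category on the individual step, rather than on the run as a whole, is what makes it possible to say where in a workflow a failure occurred and not only that it occurred.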

Systematic Benchmarking

CUA Benchmark provides systematic observability through:

- 100 test cases across 5 different web applications
- Standardized task definitions
- Automated result verification
- Detailed performance metrics

Running agents against CUA Benchmark provides quantitative data:

- Overall success rate
- Success rate by task type
- Average steps per task
- Common failure points

This data is valuable for iterative improvement. Instead of guessing what to optimize, teams can identify and address the specific areas where agents struggle.

A Practical Example: Mano-AFK

Mano-AFK is an open-source autonomous application builder that demonstrates these principles. The workflow includes:

1. Receiving a natural language description of what to build
2. Generating a PRD (Product Requirements Document)
3. Writing the code
4. Deploying to a test environment
5. Running tests (lint, API, E2E)
6. Auto-fixing any issues
7. Delivering the final application

Throughout this process, the agent references rules.md and preferences.md files to maintain consistency across projects. These files provide persistent context that guides decisions.

CUA Benchmark results for Mano-AFK:

- W8A16 quantization: 58.0% accuracy
- W8A8 quantization (Cider): 54.0% accuracy, but faster inference (~1,453 tok/s prefill)

These numbers show that the W8A8 version gives up 4.0 percentage points of accuracy in exchange for significantly faster inference. Depending on the use case, one might be preferred over the other.

Without systematic benchmarking, this data wouldn't exist. Only vague impressions like "it works sometimes" or "it's kind of slow" would remain.

Practical Recommendations

When building agent workflows, these practices have proven effective:

- Before writing agent code, document architecture, constraints, and key decisions. This document guides both human developers and AI agents.
- Don't add logging later. Design agent workflows to be observable from the start. Every step should produce some form of output that can be inspected.
- Establish a benchmark suite early. Run it regularly. Track metrics over time. This provides objective data on whether changes are improvements or regressions.
- When low success rates are observed, examine failure modes. Instead of making broad changes, identify specific failure patterns and address them directly.
- Whether through design.md, rules.md, or other mechanisms, ensure agents have access to persistent context. Conversation history is too ephemeral for complex projects.

Moving Forward

AI agent engineering is still in its early stages. Best practices are still being figured out, but two things are becoming clear:

- Agents need persistent context to do good work (design.md)
- Agent workflows need systematic observability to improve (benchmarks and logging)

These aren't advanced techniques. They're foundational practices that make everything else work better.

If you're interested in seeing these principles in action, Mano-AFK (https://github.com/Mininglamp-AI/Mano-AFK) is an open-source autonomous application builder that uses persistent context files and systematic benchmarking to improve agent reliability.

For those working on GUI agents, Mano-P (https://github.com/Mininglamp-AI/Mano-P) implements think-act-verify loops and online reinforcement learning, achieving a 58.2% success rate on the OSWorld benchmark (specialized models category).

Both projects are Apache 2.0 licensed. Stars and contributions are welcome.