{"slug": "stop-loading-your-entire-instruction-system-into-every-session", "title": "Stop Loading Your Entire Instruction System Into Every Session", "summary": "A developer modularized their AI assistant's instruction system by splitting a monolithic file into a lean entry point and specialized modules, reducing token costs and improving signal-to-noise ratio. The entry point acts as a router, loading only relevant modules per task, such as persona, structure, or workflows.", "body_md": "Most people talk about better prompts. Hardly anyone talks about what happens before every prompt: the instructions the assistant loads into the context before the actual work begins.\n\nDepending on the system, you pay for that in different ways: input tokens, latency, reduced available context, or simply more noise in the assistant's active instructions. Even if the financial cost is partly reduced through prompt caching, the cognitive cost remains: the assistant still has to operate inside a larger instruction environment.\n\nAt some point, my setup had become one single, constantly growing instruction file. System structure, assistant personality, workflows, session rules, special cases: everything was in one file. And everything was loaded into the context on every interaction, no matter whether I was solving a complex task or just asking a quick question.\n\nThat is roughly like starting every phone call by reading the entire employee handbook before getting to the actual topic.\n\nA monolithic instruction file has two costs that become unpleasant when combined:\n\n**The baseline gets expensive.**\n\nMost of the file is irrelevant to the concrete task. Still, it sits in the active context. Depending on the system, that means token cost, latency, less room for the real task, or all of them at once.\n\n**The signal-to-noise ratio drops.**\n\nThe more rules and special cases you add, the more the currently relevant part gets diluted. 
More context does not automatically mean more competence.\n\nBoth scale in the wrong direction: the more mature your setup becomes, the heavier and less precise it gets, as long as everything lives in one file.\n\nNot all instructions are needed all the time.\n\nI always need the assistant's personality and basic operating principles. I only need the exact structure of my project system when I actually navigate through it. I only need the session-end rules at the end of a session, and never before that.\n\nA writing task does not need filesystem navigation rules.\n\nA quick reasoning task does not need session-close workflows.\n\nA debugging session does not need publishing guidelines.\n\nIf that is true, it makes no sense to keep everything loaded permanently.\n\nI split the one large file into a lean entry point plus specialized modules:\n\n```\n.config/\n├── instructions.md   -> compact entry point, always loaded\n├── persona.md        -> personality, tone, behavior\n├── structure.md      -> system structure, only relevant for navigation\n└── workflows.md      -> session workflows, only relevant when needed\n```\n\nThe main instruction file is now intentionally small. It contains the minimum that really has to be present in every session, plus clear references: which module is responsible for what, and when it should be loaded.\n\nThe detail modules are not active by default. They are accessible, but they only become part of the context when the task requires them.\n\nThat distinction matters. The full instruction set is not magically present for free. It is only available if the assistant knows that a module exists, recognizes that it is relevant, and loads it at the right moment.\n\nSo modularization does not mean: same context, lower cost.\n\nIt means: smaller baseline, with more responsibility placed on routing and loading.\n\nIn my setup, the entry point acts as a router. It does not contain all detailed rules. 
It contains short loading rules such as:\n\n```\nIf the task involves navigating the project system,\nload structure.md before answering.\n\nIf the task involves ending or reviewing a session,\nload workflows.md before making recommendations.\n\nIf the task is a quick standalone question,\ndo not load additional modules unless needed.\n```\n\nThis is simple, but it is also the fragile part of the system. If the entry point is vague, the assistant may fail to load the right module. If it is too broad, it loads too much again and the benefit disappears.\n\nThe quality of the entry point determines the quality of the whole architecture.\n\nIn my setup, the baseline token load per session dropped by around **60-80%**.\n\nI measured this by comparing the files that were previously loaded unconditionally at session start with the files that are now loaded unconditionally. The important number is not the total size of all available instructions. It is the size of the always-loaded baseline.\n\n```\nBefore modularization:\n\nAlways loaded:\ninstructions.md\npersona.md\nstructure.md\nworkflows.md\n\nBaseline load:\n~4,800 tokens\n\nAfter modularization:\n\nAlways loaded:\ninstructions.md\npersona.md\n\nBaseline load:\n~1,450 tokens\n\nReduction:\n69.8%\n```\n\nThe full instruction set still exists, but it is no longer active by default. It becomes active only when needed.\n\nThe trick is not compressing individual instructions. The trick is separating **baseline load** from **on-demand load**.\n\nSo you are not optimizing the total size of your instructions. You are optimizing which part of them must always be present. And that is surprisingly little.\n\nPrompt caching can reduce the financial cost of repeated baseline instructions in some systems. But it does not remove the context-budget cost, the latency implications in every environment, or the signal-to-noise problem. 
A cached irrelevant instruction is still an irrelevant instruction in the active instruction set.\n\nThis is not a free lunch.\n\n**Indirection:**\n\nThe assistant sometimes has to take an extra step to load the right module. That is slightly slower and creates the risk that the right module is not loaded.\n\n**Routing errors:**\n\nIf the assistant does not recognize that a task requires a module, it may answer with incomplete instructions. This is the main operational risk.\n\n**Maintenance:**\n\nMore files mean more places that can drift apart. If the entry point promises something that no longer exists in a module, you have a silent consistency problem.\n\n**Rule conflicts:**\n\nModules can contradict each other or the entry point. You need a precedence rule: general instructions define the default, specialized modules override them only within their domain, and explicit user instructions still have to be handled according to the system's hierarchy.\n\n**Onboarding:**\n\nAn outsider first has to understand the loading logic before the system becomes readable. A single file is trivial to understand.\n\nThe real trade-off is this:\n\nYou reduce baseline cost, but you give up permanent availability. You move complexity from runtime context into structure, routing, and maintenance.\n\nIt is worth it if:\n\nIt is not worth it if your entire setup fits into a few hundred tokens anyway. In that case, modularization is premature optimization: you trade real simplicity for imagined efficiency.\n\nIt is also not worth it if the assistant cannot reliably access the modules when needed. A small always-loaded file plus inaccessible detail files is not an architecture. It is missing context with extra steps.\n\nThe biggest lever for instruction cost is rarely a better prompt. It is the question of what you force the assistant to carry into every interaction, and what it should only load when needed.\n\nSeparate baseline load from on-demand load. 
Keep the entry point small and turn it into a precise router. Leave the details where they only create cost when they are actually needed.\n\nIn my case, that meant 60-80% less baseline load. But the important part is not just the savings. It is the trade-off: less permanent context, more deliberate loading.\n\nThat is the architecture I actually want. Not less instruction, but less unnecessary instruction in the room at the wrong time.", "url": "https://wpnews.pro/news/stop-loading-your-entire-instruction-system-into-every-session", "canonical_source": "https://dev.to/ben-witt/significantly-fewer-context-tokens-through-a-modular-instruction-architecture-2g70", "published_at": "2026-06-17 08:00:00+00:00", "updated_at": "2026-06-17 08:21:42.401301+00:00", "lang": "en", "topics": ["large-language-models", "developer-tools", "ai-agents"], "entities": [], "alternates": {"html": "https://wpnews.pro/news/stop-loading-your-entire-instruction-system-into-every-session", "markdown": "https://wpnews.pro/news/stop-loading-your-entire-instruction-system-into-every-session.md", "text": "https://wpnews.pro/news/stop-loading-your-entire-instruction-system-into-every-session.txt", "jsonld": "https://wpnews.pro/news/stop-loading-your-entire-instruction-system-into-every-session.jsonld"}}