# Stop Loading Your Entire Instruction System Into Every Session

> Source: <https://dev.to/ben-witt/significantly-fewer-context-tokens-through-a-modular-instruction-architecture-2g70>
> Published: 2026-06-17 08:00:00+00:00

Most people talk about better prompts. Hardly anyone talks about what happens before every prompt: the instructions the assistant loads into the context before the actual work begins.

Depending on the system, you pay for that in different ways: input tokens, latency, reduced available context, or simply more noise in the assistant's active instructions. Even if the financial cost is partly reduced through prompt caching, the cognitive cost remains: the assistant still has to operate inside a larger instruction environment.

At some point, my setup had become one single, constantly growing instruction file. System structure, assistant personality, workflows, session rules, special cases: everything was in one file. And everything was loaded into the context on every interaction, no matter whether I was solving a complex task or just asking a quick question.

That is roughly like starting every phone call by reading the entire employee handbook before getting to the actual topic.

A monolithic instruction file has two costs that become unpleasant when combined:

**The baseline gets expensive.**

Most of the file is irrelevant to the concrete task. Still, it sits in the active context. Depending on the system, that means token cost, latency, less room for the real task, or all of them at once.

**The signal-to-noise ratio drops.**

The more rules and special cases you add, the more the currently relevant part gets diluted. More context does not automatically mean more competence.

Both scale in the wrong direction: the more mature your setup becomes, the heavier and less precise it gets, as long as everything lives in one file.

Not all instructions are needed all the time.

I always need the assistant's personality and basic operating principles. I only need the exact structure of my project system when I actually navigate through it. I only need the session-end rules at the end of a session, and never before that.

A writing task does not need filesystem navigation rules.

A quick reasoning task does not need session-close workflows.

A debugging session does not need publishing guidelines.

If that is true, it makes no sense to keep everything loaded permanently.

I split the one large file into a lean entry point plus specialized modules:

``` php
.config/
├── instructions.md   -> compact entry point, always loaded
├── persona.md        -> personality, tone, behavior
├── structure.md      -> system structure, only relevant for navigation
└── workflows.md      -> session workflows, only relevant when needed
```

The main instruction file is now intentionally small. It contains the minimum that really has to be present in every session, plus clear references: which module is responsible for what, and when it should be loaded.

The detail modules are not active by default. They are accessible, but they only become part of the context when the task requires them.

That distinction matters. The full instruction set is not magically present for free. It is only available if the assistant knows that a module exists, recognizes that it is relevant, and loads it at the right moment.

So modularization does not mean: same context, lower cost.

It means: smaller baseline, with more responsibility placed on routing and loading.

In my setup, the entry point acts as a router. It does not contain all detailed rules. It contains short loading rules such as:

```
If the task involves navigating the project system,
load structure.md before answering.

If the task involves ending or reviewing a session,
load workflows.md before making recommendations.

If the task is a quick standalone question,
do not load additional modules unless needed.
```

This is simple, but it is also the fragile part of the system. If the entry point is vague, the assistant may fail to load the right module. If it is too broad, it loads too much again and the benefit disappears.

The quality of the entry point determines the quality of the whole architecture.

In my setup, the baseline token load per session dropped by around **60-80%**.

I measured this by comparing the files that were previously loaded unconditionally at session start with the files that are now loaded unconditionally. The important number is not the total size of all available instructions. It is the size of the always-loaded baseline.

```
Before modularization:

Always loaded:
instructions.md
persona.md
structure.md
workflows.md

Baseline load:
~4,800 tokens

After modularization:

Always loaded:
instructions.md
persona.md

Baseline load:
~1,450 tokens

Reduction:
69.8%
```

The full instruction set still exists, but it is no longer active by default. It becomes active only when needed.

The trick is not compressing individual instructions. The trick is separating **baseline load** from **on-demand load**.

So you are not optimizing the total size of your instructions. You are optimizing which part of them must always be present. And that is surprisingly little.

Prompt caching can reduce the financial cost of repeated baseline instructions in some systems. But it does not remove the context-budget cost, the latency implications in every environment, or the signal-to-noise problem. A cached irrelevant instruction is still an irrelevant instruction in the active instruction set.

This is not a free lunch.

**Indirection:**

The assistant sometimes has to take an extra step to load the right module. That is slightly slower and creates the risk that the right module is not loaded.

**Routing errors:**

If the assistant does not recognize that a task requires a module, it may answer with incomplete instructions. This is the main operational risk.

**Maintenance:**

More files mean more places that can drift apart. If the entry point promises something that no longer exists in a module, you have a silent consistency problem.

**Rule conflicts:**

Modules can contradict each other or the entry point. You need a precedence rule: general instructions define the default, specialized modules override them only within their domain, and explicit user instructions still have to be handled according to the system's hierarchy.

**Onboarding:**

An outsider first has to understand the loading logic before the system becomes readable. A single file is trivial to understand.

The real trade-off is this:

You reduce baseline cost, but you give up permanent availability. You move complexity from runtime context into structure, routing, and maintenance.

It is worth it if:

It is not worth it if your entire setup fits into a few hundred tokens anyway. In that case, modularization is premature optimization: you trade real simplicity for imagined efficiency.

It is also not worth it if the assistant cannot reliably access the modules when needed. A small always-loaded file plus inaccessible detail files is not an architecture. It is missing context with extra steps.

The biggest lever for instruction cost is rarely a better prompt. It is the question of what you force the assistant to carry into every interaction, and what it should only load when needed.

Separate baseline load from on-demand load. Keep the entry point small and turn it into a precise router. Leave the details where they only create cost when they are actually needed.

In my case, that meant 60-80% less baseline load. But the important part is not just the savings. It is the trade-off: less permanent context, more deliberate loading.

That is the architecture I actually want. Not less instruction, but less unnecessary instruction in the room at the wrong time.