Microsoft FastContext: a Repo-Explorer Subagent Cuts Coding-Agent Tokens 60%: Explorer-Subagent Context Offloading

Microsoft released FastContext, a system that trains a dedicated explorer subagent to handle repository exploration for coding agents. By offloading read-only searches and returning compact file-line citations instead of full files, FastContext reports up to 60% fewer tokens and up to a 5.5% higher resolution rate when plugged into the Mini-SWE-Agent harness.

What: Microsoft's FastContext paper trains a dedicated explorer subagent, a 4B-30B model the main coding agent calls to find code. The explorer issues read-only searches and returns compact file-line citations instead of dumping files into the main context.

Why: Reading and searching a repository is the biggest single drain on a coding agent. In GPT-5.4 traces it consumed 56.2% of tool-use turns and 46.5% of the main agent's tokens, so moving that work off the main agent is where the token budget is won.

vs prior: A normal coding agent greps and reads files itself, so every raw file lands in its own context window and crowds out the actual coding. FastContext offloads exploration to a separate subagent that returns only citations: the evidence, not the haystack.

[Diagram: "A reference librarian you send into the stacks." One code question, two paths. Read it yourself (baseline): haul every file into your own context; ✗ ~18,000 tokens bury the desk before you code. Send a librarian (FastContext): the explorer greps the stacks and hands back an index card; ✓ ~480 tokens of citations, and the desk stays clear.]

- Explorer subagent: A separate model the main agent delegates a sub-task to. Here its one job is exploration: take a natural-language query, search the repo, and hand back what it found. It never writes code.
- Context offloading: Keeping the bulky, raw evidence out of the main agent's context window and bringing back only a compact result. The reading still happens, just not in the context that has to do the reasoning.
- Read / Glob / Grep: The three read-only tools an explorer uses. Read opens a file, Glob matches file names by pattern, and Grep searches file contents. None of them change anything, so running many at once is safe.
- File-line citation: A pointer of the form path/to/file.ts:88-104, the exact place the answer lives. Returning the citation instead of the whole file is what keeps the result compact.
- SFT (supervised fine-tuning): Training a model on example pairs of query → good exploration so it imitates them. It's the first of FastContext's two training stages.
- Task-grounded RL: Reinforcement learning where the reward isn't "did the search look reasonable" but "did the exploration actually help solve the downstream task". It tunes the explorer toward evidence that the main agent can act on.
- Mini-SWE-Agent: A small open-source coding-agent harness. FastContext was plugged into it to measure the end-to-end effect on real software-engineering tasks.
- Token budget: The total tokens an agent spends on a task, which is what you pay for in cost and latency. Exploration is the largest single share of it, which is why offloading it moves the number so much.

The news. On June 15, 2026, Microsoft released FastContext, a system that attacks the most expensive thing a coding agent does: finding the right code. Analyzing GPT-5.4 trajectories, the authors found reading and searching accounted for 56.2% of tool-use turns and 46.5% of the main agent's total tokens. FastContext trains dedicated 4B-30B exploration models that the main agent queries in natural language; the explorer fires read-only Read / Glob / Grep calls in parallel and returns focused file-line citations. Plugged into Mini-SWE-Agent, it reports up to +5.5% resolution rate and up to 60% fewer tokens. Weights are open on Hugging Face.

Picture yourself at a small desk in a vast library, trying to answer one question. The naive way is to walk the stacks yourself, haul every promising book back, and stack them on the desk. Within a dozen volumes the desk is buried, the early books slide onto the floor, and you can't even see the question anymore. The desk is the bottleneck, and you filled it with raw material you mostly didn't need.
A coding agent does exactly this when it greps and reads files itself: every file it opens lands in its own context window, and long before it starts writing the fix, the window is full of source it skimmed once and will never look at again.

That's not a small inefficiency; it's the inefficiency. When FastContext's authors traced real GPT-5.4 coding runs, reading and searching the repository accounted for 56.2% of tool-use turns and 46.5% of the main agent's tokens. Nearly half of the main agent's token budget goes to finding code, not changing it. And exploration is among the most context-poisoning kinds of work: it pulls in big, low-signal blobs of text whose only useful output is usually a single line number.

So FastContext stops doing the exploring on the main desk. It sends a librarian into the stacks. The main agent delegates a natural-language query ("where is the retry budget enforced?") to a separate explorer subagent, a 4B-30B model trained for exactly this. The explorer reads, globs, and greps its way through the repo in parallel read-only calls, then hands back not an armful of files but an index card: scheduler/retry.go:88-104, the exact evidence. The main agent's desk stays clear, holding citations instead of haystacks. The reading happened, but the bulk never touched the context that has to reason. Because the explorer only ever uses read-only tools, running a swarm of those searches at once is safe by construction.

The explorer earns its accuracy in two training stages. First, supervised fine-tuning teaches it to imitate good exploration traces. Then task-grounded RL rewards it not for searches that merely look thorough but for evidence that actually lets the main agent solve the downstream task. A scout that brings back the wrong shelf is worse than useless, so the reward is tied to the outcome, not the search.
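To make the "index card" idea concrete, here is a minimal sketch of the tool side of an explorer: glob for candidate files, grep them in parallel, and return only file-line citations. It is an illustration under stated assumptions, not FastContext's implementation. The names `Citation`, `grep_file`, and `explore`, the padding of two lines around each match, and the thread-pool size are all hypothetical, and the trained 4B-30B model that decides which searches to run is not modeled at all.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Citation:
    """A pointer of the form path:start-end (hypothetical helper type)."""
    path: str
    start: int
    end: int

    def __str__(self) -> str:
        return f"{self.path}:{self.start}-{self.end}"


def grep_file(path: Path, root: Path, pattern: re.Pattern, pad: int = 2) -> list[Citation]:
    """Read-only grep over one file: return padded line ranges around matches."""
    try:
        lines = path.read_text(encoding="utf-8", errors="ignore").splitlines()
    except OSError:
        return []  # unreadable file: nothing to cite
    rel = path.relative_to(root).as_posix()
    cites: list[Citation] = []
    for n, line in enumerate(lines, start=1):
        if not pattern.search(line):
            continue
        start, end = max(1, n - pad), min(len(lines), n + pad)
        if cites and start <= cites[-1].end + 1:
            # Merge windows that overlap or touch the previous citation.
            cites[-1] = Citation(rel, cites[-1].start, max(cites[-1].end, end))
        else:
            cites.append(Citation(rel, start, end))
    return cites


def explore(root: str, glob: str, regex: str, max_workers: int = 8) -> list[str]:
    """Glob, then grep all matching files in parallel; return citations only."""
    root_path = Path(root)
    pattern = re.compile(regex)
    files = sorted(p for p in root_path.glob(glob) if p.is_file())
    # Every call is read-only, so fanning out across threads is safe.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_file = list(pool.map(lambda p: grep_file(p, root_path, pattern), files))
    return [str(c) for cites in per_file for c in cites]
```

A call such as `explore(".", "**/*.go", r"retryBudget")` would hand the main agent a short list of strings like `scheduler/retry.go:86-90` rather than the files themselves. The file contents are read inside `grep_file` and discarded, which is the offloading step in miniature.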
| Who reads the repo | What lands in the main context | Cost |
|---|---|---|
| Main agent itself (baseline) | every file it opens: raw source | ~46.5% of tokens spent exploring |

Where does a 60% cut actually come from? Walk through one task (token counts here are illustrative; the paper reports the percentages, not these absolute numbers). Say solving a bug needs evidence from 12 files averaging 1,500 tokens each. A baseline agent that reads them all carries 18,000 tokens of raw source in its working context, and that's before it writes a line. FastContext's explorer reads the same 12 files in its own scratch context, then returns 12 citations at ~40 tokens each, or ~480 tokens. The main agent now reasons over ~480 tokens instead of 18,000, a roughly 37.5× lighter exploration footprint on the desk that matters. Multiply that across a long task where exploration was already 46.5% of the budget, and a headline 60% token reduction stops looking surprising. It's just the haystack never landing on the desk.

Explorer-subagent context offloading is a pattern where a coding agent doesn't search the codebase itself but delegates the search to a separate "explorer" model. The explorer reads and greps files in its own context, then returns only compact pointers (file paths and line ranges) to the main agent. The bulky raw source never enters the main agent's context window, which is what frees up its budget for the actual coding. FastContext trains that explorer with SFT plus task-grounded RL at 4B-30B scale.

It cuts tokens because finding code is the dominant cost. In FastContext's analysis of GPT-5.4 traces, reading and searching was 56.2% of tool-use turns and 46.5% of the main agent's tokens. Most of that text is low-signal: its only useful output is a line number. Offloading the reading to a subagent that returns citations instead of files removes the haystack from the main context, which is where the up-to-60% token reduction comes from.
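The arithmetic in the one-task walk-through can be checked in a few lines. This uses the article's illustrative figures (12 files, about 1,500 tokens per file, about 40 tokens per citation), not numbers from the paper, and the function name `exploration_footprint` is made up for this sketch.

```python
def exploration_footprint(n_files: int, tokens_per_file: int, tokens_per_citation: int):
    """Compare raw-source tokens with citation tokens in the main context."""
    raw = n_files * tokens_per_file        # baseline: every file lands in context
    cited = n_files * tokens_per_citation  # offloaded: one citation per file
    return raw, cited, raw / cited


raw, cited, ratio = exploration_footprint(12, 1500, 40)
print(raw, cited, ratio)  # 18000 480 37.5
```

The ratio covers only the exploration footprint in the main context. The explorer still spends its own tokens reading the files, so this is not a whole-task saving.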
FastContext and SearchSwarm both reduce context pressure through delegation, but at different layers. SearchSwarm bakes task-decomposition-and-delegation into one model's weights via supervised fine-tuning, so a single model delegates by reflex. FastContext keeps two separate agents at inference time: a general main agent plus a specialized read-only explorer it calls for context. One trains the behavior into a model; the other architects it into the system.

Originally posted on Learn AI Visually: https://learnaivisually.com/ai-explained/fastcontext-explorer-subagent-offloading