LLMs have limited context windows. When an agent session grows too long, VS Code Copilot Chat uses compaction to summarize older history while keeping recent work verbatim. I traced exactly how it works from the source.
Note
The standalone `microsoft/vscode-copilot-chat` repo was archived in May 2026. The agent code now lives in the main `microsoft/vscode` repo under `extensions/copilot/`, and it has changed since the standalone repo froze. Everything below is traced against the current `extensions/copilot` source.
Source files (all under `extensions/copilot/`):
- `src/extension/intents/node/agentIntent.ts`: budget math, trigger logic
- `src/extension/prompts/node/agent/summarizedConversationHistory.tsx`: prompt, LLM call, history selection, re-insertion
- `src/extension/prompts/node/agent/backgroundSummarizer.ts`: the async state machine and thresholds
- `src/extension/prompts/node/agent/simpleSummarizedHistoryPrompt.tsx`: the Simple-mode fallback
Note
Microsoft moves fast. Treat line numbers as approximate and the thresholds as true the day I looked.
First, what is @vscode/prompt-tsx? #
Compaction makes no sense until you understand the renderer underneath it. `@vscode/prompt-tsx` (the repo pins `^0.4.0-alpha.8`) is a budget-aware prompt renderer. Instead of concatenating strings into a `messages` array, you author the entire prompt as a tree of TSX components, each a `PromptElement`, and the renderer flattens it into the `ChatMessage[]` that goes to the model, fitting it to the token budget for you.
Here’s the actual agent prompt, lightly trimmed (`agentPrompt.tsx`):
return <>
  {baseInstructions}
  <AgentConversationHistory flexGrow={1} priority={700} promptContext={ctx} />
  <AgentUserMessage flexGrow={2} priority={900} {...userMessageProps} />
  <ChatToolCalls flexGrow={2} priority={899} toolCallRounds={ctx.toolCallRounds}
    truncateAt={maxToolResultLength} />
</>;
Two numbers on each element do the work:
- `priority`: when the rendered tree is over budget, the renderer drops the lowest-priority content first. The current user message (`900`) and the latest tool calls (`899`) outrank older history (`700`), which outranks boilerplate system instructions. Same-priority siblings are pruned in declaration order.
- `flexGrow`: controls how *leftover* budget is distributed. A `flexGrow` element renders *after* its siblings and receives whatever budget they didn’t use. So the user message and tool calls (`flexGrow={2}`) get first claim on the window; conversation history (`flexGrow={1}`) fills the remainder.
Every element’s `render(state, sizing)` is handed a `PromptSizing` with `tokenBudget` and an async `countTokens()`, so an element can measure itself and trim to fit. That’s literally how tool results cap themselves at 50% of the window (`truncateAt={maxToolResultLength}` above), and there are helpers for the common cases:
<TokenLimit max={2000}>{/* hard cap on a subtree */}</TokenLimit>
<PrioritizedList priority={p} descending={false}>{rounds}</PrioritizedList>
`TokenLimit` caps a subtree; `PrioritizedList` assigns its children descending (or ascending) priorities, which is exactly how conversation rounds are ranked so the oldest get pruned first; `TextChunk` keeps as much of a long string as fits.
Pruning is graceful, until it isn’t. When even the highest-priority content won’t fit, prompt-tsx throws a **`BudgetExceededError`**. That exception is the signal the agent loop catches to escalate from “just prune” to “compact the history”, which is where the rest of this post begins.
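To make the prune-then-throw behavior concrete, here is a minimal sketch of priority pruning. It is not the prompt-tsx API: `Piece`, `pruneToBudget`, and the token counts are hypothetical, and the local `BudgetExceededError` class only mimics the real one. It drops the lowest-priority piece first, breaks ties in declaration order, and throws once only the highest-priority piece is left and still doesn’t fit.

```typescript
// Hypothetical, simplified model of priority pruning. Not the prompt-tsx API.
interface Piece {
  name: string;
  priority: number; // higher = more important
  tokens: number; // pretend token cost
}

class BudgetExceededError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "BudgetExceededError";
    // Keeps `instanceof` working even when compiled to older targets.
    Object.setPrototypeOf(this, BudgetExceededError.prototype);
  }
}

function pruneToBudget(pieces: Piece[], budget: number): Piece[] {
  const kept = pieces.slice();
  let total = kept.reduce((sum, p) => sum + p.tokens, 0);

  while (total > budget) {
    // Nothing left to drop except the most important piece: give up.
    if (kept.length <= 1) {
      throw new BudgetExceededError(`need ${total} tokens, budget is ${budget}`);
    }
    // Find the lowest priority; on ties the earliest-declared piece goes first.
    let lowest = 0;
    for (let i = 1; i < kept.length; i++) {
      if (kept[i].priority < kept[lowest].priority) lowest = i;
    }
    total -= kept[lowest].tokens;
    kept.splice(lowest, 1);
  }
  return kept;
}
```

With history at 700, the user message at 900, and tool calls at 899, history is the first thing to go; shrink the budget far enough and the error is the only outcome left.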
Overview #
Compaction is the escalation when prompt-tsx’s pruning isn’t enough. There are two automatic triggers, both now on by default, plus the user-initiated `/compact` command:
| Path | Trigger | Default? |
|---|---|---|
| Foreground summarize | `BudgetExceededError` while rendering (hard overflow) | ✅ on |
| Background compaction | post-render context ≥ ~80% (jittered 0.78–0.82 with a warm cache; ≥0.90 cold-cache emergency) | ✅ on |
| Manual `/compact` | user runs the command | n/a |
Note
This is the biggest change since the standalone repo. The old build gated background compaction behind a `chat.backgroundCompaction` experiment and had a separate `≥85%` “proactive inline” experiment. Both flags are gone. Background compaction now ships on by default (gated only by the master summarization switch), and the inline mechanism it used became how background compaction works rather than a separate path.
When it triggers #
The budget is computed from config and the effective context size, then trimmed for tools:
const baseBudget = Math.min(
  summarizeThresholdTokens ?? effectiveMaxTokens,
  effectiveMaxTokens // clamp: never above the real max
);
// 10% safety margin on the message portion when tools are present
const messageBudget = Math.max(1, Math.floor((baseBudget - toolTokens) * 0.9));
Two subtleties: `effectiveMaxTokens` honors the user’s Context Size picker (not just the model’s raw `modelMaxPromptTokens`), and the threshold config is interpreted as either a ratio (`0–1`, a fraction of the window) or an absolute token count (`≥100`); a value in the ambiguous gap between the two throws.
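Here is a small sketch of how the ratio-or-absolute interpretation could feed the budget math above. The function names are mine, and the exact boundary handling (whether `0` and `1` themselves count as ratios, and which error type is thrown) is an assumption; only the `0–1` and `≥100` ranges and the formula come from the source excerpt.

```typescript
// Sketch only: names and boundary handling are assumptions, not the real source.
function resolveThresholdTokens(
  configured: number | undefined,
  effectiveMaxTokens: number,
): number | undefined {
  if (configured === undefined) return undefined; // no override configured
  if (configured > 0 && configured <= 1) {
    // Ratio: a fraction of the effective context window.
    return Math.floor(configured * effectiveMaxTokens);
  }
  if (configured >= 100) {
    // Absolute token count.
    return configured;
  }
  // Anything else is ambiguous: is 50 a ratio or a token count?
  throw new RangeError(`ambiguous summarize threshold: ${configured}`);
}

function computeMessageBudget(
  configured: number | undefined,
  effectiveMaxTokens: number,
  toolTokens: number,
): number {
  const summarizeThresholdTokens = resolveThresholdTokens(configured, effectiveMaxTokens);
  const baseBudget = Math.min(
    summarizeThresholdTokens ?? effectiveMaxTokens,
    effectiveMaxTokens, // clamp: never above the real max
  );
  // Mirrors the excerpt: 10% safety margin after subtracting tool tokens.
  return Math.max(1, Math.floor((baseBudget - toolTokens) * 0.9));
}
```

For a 100,000-token window with 10,000 tokens of tool definitions, no override leaves 81,000 tokens for messages, while a `0.5` ratio leaves 36,000.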
Foreground is the safety net — it catches the hard overflow and recovers:
} catch (e) {
  if (e instanceof BudgetExceededError && summarizationEnabled) {
    result = await renderWithSummarization(`budget exceeded(${e.message})`);
  }
}
Background compaction exists to avoid the cost of “render, fail, re-render.” After each turn renders, it checks the post-render context ratio and, if you’re over the threshold, kicks off summarization in the background so the compacted result is ready to apply on the next render. Two details make it cache-friendly:
- Cache-warmth gating + jitter. It only fires at the normal ~`0.80` threshold when the prompt cache is warm (a completed tool-call round this turn), and it *jitters* the exact trigger across `0.78–0.82` so it doesn’t always fire at the same boundary. With a cold cache it waits for the `≥0.90` emergency line. (Thresholds live in `BackgroundSummarizationThresholds`.)
- Apply-min ratio. A finished background summary is discarded if the context ratio has since dropped below `0.65`, e.g. you switched to a larger-context model and no longer need it.
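The two rules above boil down to a pair of small predicates. This is a sketch under assumptions: the numbers are the ones quoted in this post, but the object shape, the function names, and the injectable `random` parameter are hypothetical and not taken from `BackgroundSummarizationThresholds`.

```typescript
// Values quoted in this post; the names and shape are hypothetical.
const thresholds = {
  jitterMin: 0.78,
  jitterMax: 0.82,
  coldCacheEmergency: 0.9,
  applyMin: 0.65,
};

// Decide whether to kick off a background summary after a render.
function shouldStartBackgroundCompaction(
  contextRatio: number,
  cacheWarm: boolean,
  random: () => number = Math.random,
): boolean {
  if (!cacheWarm) {
    // Cold cache: wait for the emergency line.
    return contextRatio >= thresholds.coldCacheEmergency;
  }
  // Warm cache: pick a trigger somewhere in the jitter band.
  const span = thresholds.jitterMax - thresholds.jitterMin;
  const trigger = thresholds.jitterMin + random() * span;
  return contextRatio >= trigger;
}

// Decide whether a finished background summary is still worth applying.
function shouldApplyBackgroundSummary(currentRatio: number): boolean {
  return currentRatio >= thresholds.applyMin;
}
```

Passing `random` in makes the jitter testable: with `() => 0` the trigger sits at the bottom of the band, with `() => 1` at the top.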
Mechanically, background compaction doesn’t make a separate “summarize this” request. It folds a “compact now” instruction into a forked copy of the same render (for prompt-cache parity) and parses a `<summary>` block back out of the model’s reply.
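Pulling the `<summary>` block back out of a reply can be as simple as the sketch below. I haven’t reproduced the real parser here; `extractSummary` and its regex are an illustration of the idea, assuming a single well-formed block.

```typescript
// Illustrative only: returns the text inside the first <summary>...</summary>
// pair, or undefined when the reply has no such block.
function extractSummary(reply: string): string | undefined {
  const match = /<summary>([\s\S]*?)<\/summary>/.exec(reply);
  return match ? match[1].trim() : undefined;
}
```

Returning `undefined` rather than an empty string lets the caller tell “the model didn’t comply” apart from “the summary was empty”.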
How it works #
1. Pick the cut point. Render history in reverse (newest first), keeping recent rounds verbatim.
2. Exclude the overflowing round. With multiple tool rounds, drop the last one; it’s what pushed the prompt over the limit.
3. Stop at the previous summary. Walking back, break at the first round that already has a `.summary`. Compaction *compounds* instead of re-summarizing from scratch.
4. Generate the summary. Call the LLM with the structured format below.
5. Re-insert. Wrap the summary in a `<conversation-summary>` user message that replaces the older turns; store it as turn metadata so it survives the next turn.
// Excluding the round that blew the budget:
if (toolCallRounds && toolCallRounds.length > 1) {
  toolCallRounds = toolCallRounds.slice(0, -1); // last round overflowed
  summarizedToolCallRoundId = toolCallRounds.at(-1)!.id; // summarize from the prior one
}
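The selection steps above can be sketched as one function over a flat list of rounds. The `Round` and `Selection` types are hypothetical simplifications, and including the round that carries the previous summary in the result (so its text can be folded into the new one) is my assumption about what “stop at the previous summary” means in practice.

```typescript
// Hypothetical, flattened model of tool-call rounds.
interface Round {
  id: string;
  summary?: string; // set on the round where an earlier compaction landed
}

interface Selection {
  toSummarize: Round[]; // oldest first
  summarizedRoundId: string | undefined; // last round covered by the new summary
  excluded: Round | undefined; // the round that overflowed, if any
}

function selectRoundsToSummarize(rounds: Round[]): Selection {
  let candidates = rounds;
  let excluded: Round | undefined;

  // With more than one round, the last one is what blew the budget.
  if (candidates.length > 1) {
    excluded = candidates[candidates.length - 1];
    candidates = candidates.slice(0, -1);
  }

  // Walk newest to oldest and stop at the previous summary, which already
  // stands in for everything older (assumption: that round is included).
  const toSummarize: Round[] = [];
  for (let i = candidates.length - 1; i >= 0; i--) {
    toSummarize.unshift(candidates[i]);
    if (candidates[i].summary !== undefined) break;
  }

  const last = candidates.length > 0 ? candidates[candidates.length - 1] : undefined;
  return { toSummarize, summarizedRoundId: last ? last.id : undefined, excluded };
}
```

Because the walk stops at the prior summary, each compaction only pays for the rounds added since the last one.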
What gets kept vs summarized #
- **Recent rounds** → kept verbatim, so the model picks up mid-task.
- **The round that overflowed** → excluded from the summary.
- **Everything before the previous summary** → already represented; not re-included.
- **Tool results** → truncated at `maxToolResultLength`, which the agent loop sets to 50% of the model’s max prompt tokens (`modelMaxPromptTokens * 0.5`), keeping the head and tail of long outputs (a 40/60 split) and dropping the middle. (The flat `2000` you may see in the source is a *different* feature, the panel’s chat-summary renderer, not in-loop compaction.)
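Head-and-tail truncation is easy to picture with a sketch. Two loud assumptions: this version counts characters rather than tokens, and I’m guessing the split gives 40% to the head and 60% to the tail, since the post only establishes “40/60”. The function name and the marker text are made up.

```typescript
// Illustrative middle-truncation. Character-based, with an assumed 40% head.
function truncateMiddle(text: string, maxLength: number, headFraction = 0.4): string {
  if (text.length <= maxLength) return text;

  const marker = "\n[...truncated...]\n";
  // Space left for real content once the marker is accounted for.
  const room = Math.max(0, maxLength - marker.length);
  const headLength = Math.floor(room * headFraction);
  const tailLength = room - headLength;

  const head = text.slice(0, headLength);
  const tail = text.slice(text.length - tailLength);
  return head + marker + tail;
}
```

Keeping more of the tail fits tool output well: the end of a build log or test run usually holds the error the model needs next. Note that if `maxLength` is smaller than the marker itself, this sketch returns just the marker.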
The summary format #
The summarization is a real LLM call on the same model the conversation uses, at `temperature: 0` (streaming is intentionally not disabled; there’s even a regression test asserting the request doesn’t force `stream: false`). The prompt doesn’t ask for “a summary”. It demands an 8-section handoff document, emitted inside a `<summary>` block after a separate `<analysis>` block:
1. Conversation Overview — objectives, session context, intent evolution
2. Technical Foundation — core tech, frameworks, architectural patterns
3. Codebase Status — each file touched: purpose, state, key code
4. Problem Resolution — issues hit, solutions, debugging context
5. Progress Tracking — done vs. partially done vs. validated
6. Active Work State — exactly what was being worked on last
7. Recent Operations — last commands, tool results, pre-summary state
8. Continuation Plan — pending tasks and the immediate next step
There are two modes: Full (sends tool definitions with `tool_choice: 'none'`) and Simple. If Full fails, it falls back to Simple.
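The fallback itself is a plain try/catch. In this sketch the `summarize` callback stands in for the real LLM call, and both the `Mode` type and `summarizeWithFallback` are hypothetical names; treating any thrown error as a reason to fall back is also my simplification.

```typescript
type Mode = "full" | "simple";

// Try Full first; on any failure, retry once in Simple mode.
// `summarize` is a stand-in for the actual model request.
async function summarizeWithFallback(
  summarize: (mode: Mode) => Promise<string>,
): Promise<{ mode: Mode; summary: string }> {
  try {
    return { mode: "full", summary: await summarize("full") };
  } catch (_err) {
    // If Simple also fails, that error propagates to the caller.
    return { mode: "simple", summary: await summarize("simple") };
  }
}
```

Returning the mode alongside the text makes it easy to log how often the fallback actually fires.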
The breadcrumb #
Compaction is lossy — the verbatim history is gone. So the summary carries a pointer to the full transcript on disk:
summary += `\nIf you need specific details from before compaction (such as exact
code snippets, error messages, tool results, or content you previously generated),
use the ${ToolName.ReadFile} tool to look up the full uncompacted conversation
transcript at: "${transcriptPath}"`;
// ...then appends the transcript's current line count and an example call
This breadcrumb is added whenever the session has an on-disk transcript path (no experiment flag; that gate existed in the old build but is gone now). It’s appended exactly once, at summary-creation time, and baked into the frozen summary text, so later renders replay it verbatim, preserving Anthropic’s prompt cache. The summary is the fast path; the transcript file is the escape hatch the model reads only when it needs an exact detail. Same instinct as progressive disclosure (see “Stop Bloating Your CLAUDE.md: Progressive Disclosure for AI Coding Tools”): cheap index in context, expensive detail on demand.
Model-specific gotchas #
| Model | Handling |
|---|---|
| Opus (`claude-opus*`) | Extra instruction: do not call tools, only write text. |
| Anthropic + thinking | Last thinking block preserved and re-attached as the first thinking block after the summary. |
| Anthropic + `tool_search` | Client-side `tool_search` tool-use/result pairs stripped before the call, or Anthropic 400s. |
| Gemini | Orphaned `function_call`s (whose results got pruned) stripped, or it 400s. |
| GPT-4.1 | A “keep going” reminder is appended after the summary. |
| Prompt caching | Summary “baked” once so later renders don’t bust Anthropic’s cache. |
| PreCompact hook | Fires before summarizing; can archive the transcript. Errors never block. |
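The Gemini row is the easiest of these to show in code. Below is a sketch of stripping calls whose results were pruned; the `Part` shape is a made-up stand-in for the real message types, and matching calls to results by a `callId` field is an assumption.

```typescript
// Hypothetical message parts, not the real Copilot or Gemini types.
type Part =
  | { kind: "text"; text: string }
  | { kind: "function_call"; callId: string; name: string }
  | { kind: "function_result"; callId: string; output: string };

// Drop every function_call that no longer has a matching function_result.
function stripOrphanedCalls(parts: Part[]): Part[] {
  const answered = new Set<string>();
  for (const part of parts) {
    if (part.kind === "function_result") answered.add(part.callId);
  }
  return parts.filter(
    (part) => part.kind !== "function_call" || answered.has(part.callId),
  );
}
```

The two passes matter: a result can only be matched once the whole list has been seen, so filtering while scanning would wrongly drop calls that appear before their results.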
Settings #
| Config key | Default | What it does |
|---|---|---|
| `chat.summarizeAgentConversationHistory.enabled` | on | Master switch: gates both foreground and background compaction. |
| `chat.advanced.summarizeAgentConversationHistoryThreshold` | model max | Lower the budget that triggers compaction (ratio `0–1` or absolute tokens `≥100`). |
| `chat.advanced.agentHistorySummarizationMode` | auto | Force `simple` (or `full`) summary mode. |
| `chat.conversationCompaction.usePrismCompaction` | experiment | Route compaction through a separate “Prism” trajectory-compaction model. |
| `chat.conversationCompaction.prismModelFilter` | (model list) | Which models the Prism path applies to; falls back to the agent endpoint otherwise. |
Note
Gone since the standalone repo: `chat.backgroundCompaction` and `chat.advanced.agentHistorySummarizationInline`. Background compaction is no longer a separate experiment; it’s on by default under the master switch.
Compared to Claude Code #
Same problem, different taste. Claude Code’s `/compact` also summarizes history into a block; see “The Four Types of Memory for AI Agents (and How Claude Code Implements Each)”. Copilot leans harder on two ideas worth stealing if you build your own loop (like the one in “Building Your Own Coding Agent from Scratch”): prune by priority before you summarize, and make the lossy summary recoverable with a breadcrumb to the raw history.