{"slug": "how-vs-code-copilot-chat-compacts-your-conversation", "title": "How VS Code Copilot Chat Compacts Your Conversation", "summary": "VS Code Copilot Chat uses a compaction mechanism to summarize older conversation history when the agent session exceeds the LLM's context window, triggered automatically or via the /compact command. The process relies on the budget-aware prompt renderer @vscode/prompt-tsx, which prunes low-priority content and escalates to compaction when pruning fails.", "body_md": "LLMs have limited context windows. When an agent session grows too long, VS Code Copilot Chat uses **compaction** to summarize older history while keeping recent work verbatim. I traced exactly how it works from the source.\n\nNote\n\nThe standalone `microsoft/vscode-copilot-chat` repo was **archived in May 2026**. The agent code now lives in the main [microsoft/vscode](https://github.com/microsoft/vscode) repo under `extensions/copilot/`, and it has changed since the standalone repo froze. Everything below is traced against the current `extensions/copilot` source.\n\n**Source files** (all under `extensions/copilot/`):\n\n- `src/extension/intents/node/agentIntent.ts`: budget math, trigger logic\n- `src/extension/prompts/node/agent/summarizedConversationHistory.tsx`: prompt, LLM call, history selection, re-insertion\n- `src/extension/prompts/node/agent/backgroundSummarizer.ts`: the async state machine and thresholds\n- `src/extension/prompts/node/agent/simpleSummarizedHistoryPrompt.tsx`: the Simple-mode fallback\n\nNote\n\nMicrosoft moves fast. Treat line numbers as approximate and the thresholds as true the day I looked.\n\n## First, what is `@vscode/prompt-tsx`?\n\nCompaction makes no sense until you understand the renderer underneath it. [@vscode/prompt-tsx](https://github.com/microsoft/vscode-prompt-tsx) (the repo pins `^0.4.0-alpha.8`) is a **budget-aware prompt renderer**. 
Instead of concatenating strings into a `messages` array, you author the entire prompt as a tree of TSX components (a `PromptElement`), and the renderer flattens it into the `ChatMessage[]` that goes to the model, *fitting it to the token budget for you*.\n\nHere’s the actual agent prompt, lightly trimmed (`agentPrompt.tsx`):\n\n```\nreturn <>\n  {baseInstructions}\n  <AgentConversationHistory flexGrow={1} priority={700} promptContext={ctx} />\n  <AgentUserMessage    flexGrow={2} priority={900} {...userMessageProps} />\n  <ChatToolCalls       flexGrow={2} priority={899} toolCallRounds={ctx.toolCallRounds}\n                       truncateAt={maxToolResultLength} />\n</>;\n```\n\nTwo numbers on each element do the work:\n\n- `priority`: when the rendered tree is over budget, the renderer drops the **lowest-priority** content first. The current user message (`900`) and the latest tool calls (`899`) outrank older history (`700`), which outranks boilerplate system instructions. Same-priority siblings are pruned in declaration order.\n- `flexGrow`: controls how *leftover* budget is distributed. A `flexGrow` element renders **after** its siblings and receives whatever budget they didn’t use. So the user message and tool calls (`flexGrow={2}`) get first claim on the window; conversation history (`flexGrow={1}`) fills the remainder.\n\nEvery element’s `render(state, sizing)` is handed a `PromptSizing` with `tokenBudget` and an async `countTokens()`, so an element can measure itself and trim to fit. 
That’s literally how tool results cap themselves at 50% of the window (`truncateAt={maxToolResultLength}` above), and there are helpers for the common cases:\n\n```\n<TokenLimit max={2000}>{/* hard cap on a subtree */}</TokenLimit>\n<PrioritizedList priority={p} descending={false}>{rounds}</PrioritizedList>\n```\n\n`TokenLimit` caps a subtree; `PrioritizedList` assigns a list of children descending (or ascending) priorities, which is exactly how conversation rounds are ranked so the oldest get pruned first; `TextChunk` keeps as much of a long string as fits.\n\nPruning is graceful until it isn’t. When even the highest-priority content won’t fit, prompt-tsx throws a **BudgetExceededError**. That exception is the signal the agent loop catches to escalate from “just prune” to “compact the history”, which is where the rest of this post begins.\n\n## Overview\n\nCompaction is the escalation when prompt-tsx’s pruning isn’t enough. There are two automatic triggers (**both on by default now**) plus the user-initiated `/compact` command:\n\n| Path | Trigger | Default? |\n|---|---|---|\n| Foreground summarize | `BudgetExceededError` while rendering (hard overflow) | ✅ on |\n| Background compaction | post-render context ≥ ~80% (jittered `0.78–0.82` with a warm cache; ≥`0.90` cold-cache emergency) | ✅ on |\n| Manual `/compact` | user runs the command | n/a |\n\nNote\n\nThis is the biggest change since the standalone repo. The old build gated background compaction behind a `chat.backgroundCompaction` experiment and had a separate `≥85%` “proactive inline” experiment. 
**Both flags are gone.** Background compaction now ships on by default (gated only by the master summarization switch), and the inline mechanism it used became *how* background compaction works rather than a separate path.\n\n## When it triggers\n\nThe budget is computed from config and the effective context size, then trimmed for tools:\n\n```js\nconst baseBudget = Math.min(\n  summarizeThresholdTokens ?? effectiveMaxTokens,\n  effectiveMaxTokens                     // clamp: never above the real max\n);\n// 10% safety margin on the message portion when tools are present\nconst messageBudget = Math.max(1, Math.floor((baseBudget - toolTokens) * 0.9));\n```\n\nTwo subtleties: `effectiveMaxTokens` honors the user’s Context Size picker (not just the model’s raw `modelMaxPromptTokens`), and the threshold config is interpreted as either a **ratio** (`0–1`, a fraction of the window) or an **absolute token count** (`≥100`); a value in the ambiguous gap in between throws.\n\n**Foreground** is the safety net. It catches the hard overflow and recovers:\n\n```\n} catch (e) {\n  if (e instanceof BudgetExceededError && summarizationEnabled) {\n    result = await renderWithSummarization(`budget exceeded(${e.message})`);\n  }\n}\n```\n\n**Background compaction** exists to avoid the cost of “render, fail, re-render.” After each turn renders, it checks the post-render context ratio and, if you’re over the threshold, kicks off summarization *in the background* so the compacted result is ready to apply on the next render. Two details make it cache-friendly:\n\n- **Cache-warmth gating + jitter.** It only fires at the normal ~`0.80` threshold when the prompt cache is warm (a completed tool-call round this turn), and it *jitters* the exact trigger across `0.78–0.82` so it doesn’t always fire at the same boundary. With a cold cache it waits for the `≥0.90` emergency line. 
(Thresholds live in `BackgroundSummarizationThresholds`.)\n- **Apply-min ratio.** A finished background summary is discarded if the context ratio has since dropped below `0.65`, e.g. because you switched to a larger-context model and no longer need it.\n\nMechanically, background compaction doesn’t make a separate “summarize this” request. It folds a “compact now” instruction into a forked copy of the *same* render (for prompt-cache parity) and parses a `<summary>` block back out of the model’s reply.\n\n## How it works\n\n1. **Pick the cut point.** Render history in reverse (newest first), keeping recent rounds verbatim.\n2. **Exclude the overflowing round.** With multiple tool rounds, drop the last one, since it’s what pushed over the limit.\n3. **Stop at the previous summary.** Walking back, break at the first round that already has a `.summary`. Compaction *compounds* instead of re-summarizing from scratch.\n4. **Generate the summary.** Call the LLM with the structured format below.\n5. **Re-insert.** Wrap the summary in a `<conversation-summary>` user message that replaces the older turns; store it as turn metadata so it survives the next turn.\n\n```\n// Excluding the round that blew the budget:\nif (toolCallRounds && toolCallRounds.length > 1) {\n  toolCallRounds = toolCallRounds.slice(0, -1);          // last round overflowed\n  summarizedToolCallRoundId = toolCallRounds.at(-1)!.id; // summarize from the prior one\n}\n```\n\n## What gets kept vs summarized\n\n- **Recent rounds** → kept verbatim, so the model picks up mid-task.\n- **The round that overflowed** → excluded from the summary.\n- **Everything before the previous summary** → already represented; not re-included.\n- **Tool results** → truncated at `maxToolResultLength`, which the agent loop sets to **50% of the model’s max prompt tokens** (`modelMaxPromptTokens * 0.5`), keeping the head and tail of long outputs (a 40/60 split) and dropping the middle. 
(The flat `2000` you may see in the source is a *different* feature, the panel’s chat-summary renderer, not in-loop compaction.)\n\n## The summary format\n\nThe summarization is a real LLM call on the **same model the conversation uses**, at `temperature: 0` (streaming is intentionally *not* disabled; there’s even a regression test asserting the request doesn’t force `stream: false`). The prompt doesn’t ask for “a summary”; it demands an 8-section handoff document, emitted inside a `<summary>` block after a separate `<analysis>` block:\n\n```\n1. Conversation Overview   — objectives, session context, intent evolution\n2. Technical Foundation    — core tech, frameworks, architectural patterns\n3. Codebase Status         — each file touched: purpose, state, key code\n4. Problem Resolution      — issues hit, solutions, debugging context\n5. Progress Tracking       — done vs. partially done vs. validated\n6. Active Work State       — exactly what was being worked on last\n7. Recent Operations       — last commands, tool results, pre-summary state\n8. Continuation Plan       — pending tasks and the immediate next step\n```\n\nThere are two modes: **Full** (sends tool definitions with `tool_choice: 'none'`) and **Simple**. If Full fails, it falls back to Simple.\n\n## The breadcrumb\n\nCompaction is lossy: the verbatim history is gone. So the summary carries a pointer to the full transcript on disk:\n\n```\nsummary += `\\nIf you need specific details from before compaction (such as exact\ncode snippets, error messages, tool results, or content you previously generated),\nuse the ${ToolName.ReadFile} tool to look up the full uncompacted conversation\ntranscript at: \"${transcriptPath}\"`;\n// ...then appends the transcript's current line count and an example call\n```\n\nThis breadcrumb is added whenever the session has an on-disk transcript path (there is no experiment flag; that gate existed in the old build but is gone now). 
It’s appended exactly once, at summary-creation time, and *baked* into the frozen summary text, so later renders replay it verbatim, preserving Anthropic’s prompt cache. The summary is the fast path; the transcript file is the escape hatch the model reads only when it needs an exact detail. Same instinct as [progressive disclosure](/posts/stop-bloating-your-claude-md-progressive-disclosure-ai-coding-tools/): cheap index in context, expensive detail on demand.\n\n## Model-specific gotchas\n\n| Model | Handling |\n|---|---|\n| Opus (`claude-opus*`) | Extra instruction: do not call tools, only write text. |\n| Anthropic + thinking | Last thinking block preserved and re-attached as the first thinking block after the summary. |\n| Anthropic + `tool_search` | Client-side `tool_search` tool-use/result pairs stripped before the call, or Anthropic 400s. |\n| Gemini | Orphaned `function_call`s (whose results got pruned) stripped, or it 400s. |\n| GPT-4.1 | A “keep going” reminder is appended after the summary. |\n| Prompt caching | Summary “baked” once so later renders don’t bust Anthropic’s cache. |\n| PreCompact hook | Fires before summarizing; can archive the transcript. Errors never block. |\n\n## Settings\n\n| Config key | Default | What it does |\n|---|---|---|\n| `chat.summarizeAgentConversationHistory.enabled` | on | Master switch; gates both foreground and background compaction. |\n| `chat.advanced.summarizeAgentConversationHistoryThreshold` | model max | Lower the budget that triggers compaction (ratio `0–1` or absolute tokens `≥100`). |\n| `chat.advanced.agentHistorySummarizationMode` | auto | Force `simple` (or `full`) summary mode. 
|\n| `chat.conversationCompaction.usePrismCompaction` | experiment | Route compaction through a separate “Prism” trajectory-compaction model. |\n| `chat.conversationCompaction.prismModelFilter` | (model list) | Which models the Prism path applies to; falls back to the agent endpoint otherwise. |\n\nNote\n\nGone since the standalone repo: `chat.backgroundCompaction` and `chat.advanced.agentHistorySummarizationInline`. Background compaction is no longer a separate experiment; it’s on by default under the master switch.\n\n## Compared to Claude Code\n\nSame problem, different taste. Claude Code’s `/compact` also summarizes history into a block; see [The Four Types of Memory for AI Agents](/posts/four-types-memory-coding-agents-claude-code/). Copilot leans harder on two ideas worth stealing if you build your own loop ([like this](/posts/building-your-own-coding-agent-from-scratch/)
): **prune by priority before you summarize**, and **make the lossy summary recoverable** with a breadcrumb to the raw history.", "url": "https://wpnews.pro/news/how-vs-code-copilot-chat-compacts-your-conversation", "canonical_source": "https://alexop.dev/posts/how-vscode-copilot-chat-conversation-compaction-works/", "published_at": "2026-06-27 00:00:00+00:00", "updated_at": "2026-06-27 09:09:27.308124+00:00", "lang": "en", "topics": ["large-language-models", "ai-tools", "developer-tools"], "entities": ["VS Code", "Copilot Chat", "Microsoft", "@vscode/prompt-tsx"], "alternates": {"html": "https://wpnews.pro/news/how-vs-code-copilot-chat-compacts-your-conversation", "markdown": "https://wpnews.pro/news/how-vs-code-copilot-chat-compacts-your-conversation.md", "text": "https://wpnews.pro/news/how-vs-code-copilot-chat-compacts-your-conversation.txt", "jsonld": "https://wpnews.pro/news/how-vs-code-copilot-chat-compacts-your-conversation.jsonld"}}