# How VS Code Copilot Chat Compacts Your Conversation

> Source: <https://alexop.dev/posts/how-vscode-copilot-chat-conversation-compaction-works/>
> Published: 2026-06-27 00:00:00+00:00

LLMs have limited context windows. When an agent session grows too long, VS Code Copilot Chat uses **compaction** to summarize older history while keeping recent work verbatim. I traced exactly how it works from the source.

Note

The standalone `microsoft/vscode-copilot-chat` repo was **archived in May 2026**. The agent code now lives in the main [microsoft/vscode](https://github.com/microsoft/vscode) repo under `extensions/copilot/`, and it has changed since the standalone repo froze. Everything below is traced against the current `extensions/copilot` source.

**Source files** (all under `extensions/copilot/`):

- `src/extension/intents/node/agentIntent.ts`: budget math, trigger logic
- `src/extension/prompts/node/agent/summarizedConversationHistory.tsx`: prompt, LLM call, history selection, re-insertion
- `src/extension/prompts/node/agent/backgroundSummarizer.ts`: the async state machine and thresholds
- `src/extension/prompts/node/agent/simpleSummarizedHistoryPrompt.tsx`: the Simple-mode fallback

Note

Microsoft moves fast. Treat line numbers as approximate and the thresholds as true the day I looked.

## First, what is `@vscode/prompt-tsx`?

Compaction makes no sense until you understand the renderer underneath it. [@vscode/prompt-tsx](https://github.com/microsoft/vscode-prompt-tsx) (the repo pins `^0.4.0-alpha.8`) is a **budget-aware prompt renderer**. Instead of concatenating strings into a `messages` array, you author the entire prompt as a tree of TSX components (a `PromptElement`), and the renderer flattens it into the `ChatMessage[]` that goes to the model, *fitting it to the token budget for you*.

Here’s the actual agent prompt, lightly trimmed (`agentPrompt.tsx`):

```
return <>
  {baseInstructions}
  <AgentConversationHistory flexGrow={1} priority={700} promptContext={ctx} />
  <AgentUserMessage    flexGrow={2} priority={900} {...userMessageProps} />
  <ChatToolCalls       flexGrow={2} priority={899} toolCallRounds={ctx.toolCallRounds}
                       truncateAt={maxToolResultLength} />
</>;
```

Two numbers on each element do the work:

- `priority`: when the rendered tree is over budget, the renderer drops the **lowest-priority** content first. The current user message (`900`) and the latest tool calls (`899`) outrank older history (`700`), which outranks boilerplate system instructions. Same-priority siblings are pruned in declaration order.
- `flexGrow`: controls how *leftover* budget is distributed. A `flexGrow` element renders **after** its siblings and receives whatever budget they didn’t use. So the user message and tool calls (`flexGrow={2}`) get first claim on the window; conversation history (`flexGrow={1}`) fills the remainder.
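To make the `priority` rule concrete, here is a toy pruner. It is a sketch only, not prompt-tsx's algorithm: the real renderer works on a component tree and counts tokens asynchronously. `Piece`, `pruneToBudget`, and the token counts are hypothetical names and numbers chosen for illustration.

```typescript
// Toy model of priority-based pruning. Not the prompt-tsx implementation.
interface Piece {
  name: string;
  priority: number; // higher = more important
  tokens: number;   // pretend token count
}

// Drop the lowest-priority pieces until the total fits the budget.
// Ties go to the piece declared first, mirroring "declaration order".
function pruneToBudget(pieces: Piece[], budget: number): Piece[] {
  const kept = [...pieces];
  let total = kept.reduce((sum, p) => sum + p.tokens, 0);
  while (total > budget && kept.length > 0) {
    let victim = 0;
    for (let i = 1; i < kept.length; i++) {
      if (kept[i].priority < kept[victim].priority) victim = i;
    }
    total -= kept[victim].tokens;
    kept.splice(victim, 1);
  }
  return kept;
}

const pieces: Piece[] = [
  { name: "history", priority: 700, tokens: 600 },
  { name: "userMessage", priority: 900, tokens: 200 },
  { name: "toolCalls", priority: 899, tokens: 300 },
];
const survivors = pruneToBudget(pieces, 550).map((p) => p.name);
console.log(survivors); // history (priority 700) is dropped first
```

With a budget of 550 tokens, only the 600-token history has to go; the user message and tool calls survive untouched.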

Every element’s `render(state, sizing)` is handed a `PromptSizing` with `tokenBudget` and an async `countTokens()`, so an element can measure itself and trim to fit. That’s literally how tool results cap themselves at 50% of the window (`truncateAt={maxToolResultLength}` above), and there are helpers for the common cases:

```
<TokenLimit max={2000}>{/* hard cap on a subtree */}</TokenLimit>
<PrioritizedList priority={p} descending={false}>{rounds}</PrioritizedList>
```

`TokenLimit` caps a subtree; `PrioritizedList` assigns a list of children descending (or ascending) priorities, which is exactly how conversation rounds are ranked so the oldest get pruned first; `TextChunk` keeps as much of a long string as fits.

Pruning is graceful, until it isn’t. When even the highest-priority content won’t fit, prompt-tsx throws a **BudgetExceededError**. That exception is the signal the agent loop catches to escalate from “just prune” to “compact the history”, which is where the rest of this post begins.

## Overview

Compaction is the escalation when prompt-tsx’s pruning isn’t enough. There are two automatic triggers (**both on by default now**) plus the user-initiated `/compact` command:

| Path | Trigger | Default? |
|---|---|---|
| Foreground summarize | `BudgetExceededError` while rendering (hard overflow) | ✅ on |
| Background compaction | post-render context ≥ ~80% (jittered `0.78–0.82` with a warm cache; ≥`0.90` cold-cache emergency) | ✅ on |
| Manual `/compact` | user runs the command | n/a |

Note

This is the biggest change since the standalone repo. The old build gated background compaction behind a `chat.backgroundCompaction` experiment and had a separate `≥85%` “proactive inline” experiment. **Both flags are gone.** Background compaction now ships on by default (gated only by the master summarization switch), and the inline mechanism it used became *how* background compaction works rather than a separate path.

## When it triggers

The budget is computed from config and the effective context size, then trimmed for tools:

``` js
const baseBudget = Math.min(
  summarizeThresholdTokens ?? effectiveMaxTokens,
  effectiveMaxTokens                     // clamp: never above the real max
);
// 10% safety margin on the message portion when tools are present
const messageBudget = Math.max(1, Math.floor((baseBudget - toolTokens) * 0.9));
```

Two subtleties: `effectiveMaxTokens` honors the user’s Context Size picker (not just the model’s raw `modelMaxPromptTokens`), and the threshold config is interpreted as either a **ratio** (`0–1`, a fraction of the window) or an **absolute token count** (`≥100`); a value in the ambiguous gap between them throws.
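A minimal sketch of how that threshold interpretation could look, combined with the budget arithmetic from the snippet above. `resolveThresholdTokens` and `messageBudget` are hypothetical names, and the exact boundary handling (whether `0` and `1` count as ratios, the error message) is an assumption rather than something taken from the source.

```typescript
// Hypothetical helper: names and boundary handling are assumptions.
function resolveThresholdTokens(
  threshold: number | undefined,
  effectiveMaxTokens: number,
): number {
  if (threshold === undefined) return effectiveMaxTokens; // default: model max
  if (threshold > 0 && threshold <= 1) {
    // Ratio: a fraction of the effective window.
    return Math.floor(threshold * effectiveMaxTokens);
  }
  if (threshold >= 100) {
    // Absolute token count, clamped so it never exceeds the real max.
    return Math.min(threshold, effectiveMaxTokens);
  }
  // Anything else (for example 50) is neither a ratio nor a plausible count.
  throw new Error(`Ambiguous summarize threshold: ${threshold}`);
}

function messageBudget(baseBudget: number, toolTokens: number): number {
  // Reserve the tool tokens, then keep a 10% safety margin.
  return Math.max(1, Math.floor((baseBudget - toolTokens) * 0.9));
}

const base = resolveThresholdTokens(0.5, 128000); // half the window
console.log(base, messageBudget(base, 4000));
```

Here a ratio of `0.5` on a 128,000-token window gives a base budget of 64,000 tokens, and reserving 4,000 tool tokens leaves 54,000 for messages after the margin.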

**Foreground** is the safety net — it catches the hard overflow and recovers:

```
} catch (e) {
  if (e instanceof BudgetExceededError && summarizationEnabled) {
    result = await renderWithSummarization(`budget exceeded(${e.message})`);
  }
}
```

**Background compaction** exists to avoid the cost of “render, fail, re-render.” After each turn renders, it checks the post-render context ratio and, if you’re over the threshold, kicks off summarization *in the background* so the compacted result is ready to apply on the next render. Two details make it cache-friendly:

- **Cache-warmth gating + jitter.** It only fires at the normal ~`0.80` threshold when the prompt cache is warm (a completed tool-call round this turn), and it *jitters* the exact trigger across `0.78–0.82` so it doesn’t always fire at the same boundary. With a cold cache it waits for the `≥0.90` emergency line. (Thresholds live in `BackgroundSummarizationThresholds`.)
- **Apply-min ratio.** A finished background summary is discarded if the context ratio has since dropped below `0.65`, e.g. because you switched to a larger-context model and no longer need it.
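The two rules above can be sketched as a pair of small predicates. The numbers come from this post; the function names, the `Thresholds` object, and the idea of drawing the jitter on every check are assumptions for illustration (the real `BackgroundSummarizationThresholds` may be shaped differently).

```typescript
// Values from the post; the shape of this object is hypothetical.
const Thresholds = {
  warmMin: 0.78,
  warmMax: 0.82,
  coldEmergency: 0.9,
  applyMin: 0.65,
} as const;

// `rand` is injectable so the jitter can be tested; it defaults to Math.random.
function shouldStartBackgroundCompaction(
  contextRatio: number,
  cacheWarm: boolean,
  rand: () => number = Math.random,
): boolean {
  // Cold cache: wait for the emergency line.
  if (!cacheWarm) return contextRatio >= Thresholds.coldEmergency;
  // Warm cache: pick a trigger somewhere in the jitter band.
  const jittered =
    Thresholds.warmMin + rand() * (Thresholds.warmMax - Thresholds.warmMin);
  return contextRatio >= jittered;
}

// A finished summary is only worth applying if the context is still fairly full.
function shouldApplyFinishedSummary(currentRatio: number): boolean {
  return currentRatio >= Thresholds.applyMin;
}

console.log(shouldStartBackgroundCompaction(0.85, true, () => 0.5));
console.log(shouldStartBackgroundCompaction(0.85, false));
console.log(shouldApplyFinishedSummary(0.6));
```

Injecting `rand` keeps the jitter out of the way when you want deterministic behavior, which is a design choice of this sketch rather than something the source is known to do.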

Mechanically, background compaction doesn’t make a separate “summarize this” request. It folds a “compact now” instruction into a forked copy of the *same* render (for prompt-cache parity) and parses a `<summary>` block back out of the model’s reply.
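Pulling the `<summary>` block out of a reply could be as simple as the sketch below. `extractSummary` is a hypothetical helper, and the regular expression is an assumption; the real extraction logic may handle streaming, nesting, or malformed output differently.

```typescript
// Hypothetical parser; the real extraction logic may differ.
function extractSummary(reply: string): string | undefined {
  // Non-greedy match so only the first <summary>...</summary> pair is taken.
  const match = /<summary>([\s\S]*?)<\/summary>/.exec(reply);
  const text = match?.[1]?.trim();
  return text ? text : undefined; // treat an empty block as "no summary"
}

const reply =
  "<analysis>notes</analysis>\n<summary>\nFixed the parser bug.\n</summary>";
console.log(extractSummary(reply)); // "Fixed the parser bug."
```

Returning `undefined` for a missing or empty block lets the caller decide whether to retry or fall back, instead of silently storing an empty summary.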

## How it works

1. **Pick the cut point.** Render history in reverse (newest first), keeping recent rounds verbatim.
2. **Exclude the overflowing round.** With multiple tool rounds, drop the last one, since it’s what pushed over the limit.
3. **Stop at the previous summary.** Walking back, break at the first round that already has a `.summary`. Compaction *compounds* instead of re-summarizing from scratch.
4. **Generate the summary.** Call the LLM with the structured format below.
5. **Re-insert.** Wrap the summary in a `<conversation-summary>` user message that replaces the older turns; store it as turn metadata so it survives the next turn.

```
// Excluding the round that blew the budget:
if (toolCallRounds && toolCallRounds.length > 1) {
  toolCallRounds = toolCallRounds.slice(0, -1);          // last round overflowed
  summarizedToolCallRoundId = toolCallRounds.at(-1)!.id; // summarize from the prior one
}
```
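The first three steps can be sketched as one selection function. This is a simplified model under stated assumptions: `Round` and `roundsToSummarize` are hypothetical, and including the round that carries the previous summary (so its text stands in for everything older) is my reading of “compounds”, not a confirmed detail.

```typescript
interface Round {
  id: string;
  summary?: string; // set on the round where a previous compaction ended
}

// Returns the rounds a new summary would need to cover, oldest first.
function roundsToSummarize(rounds: Round[]): Round[] {
  // Exclude the round that overflowed, when there is more than one.
  const candidates = rounds.length > 1 ? rounds.slice(0, -1) : rounds;
  const selected: Round[] = [];
  // Walk newest to oldest and stop at the first round that carries a summary.
  for (let i = candidates.length - 1; i >= 0; i--) {
    const round = candidates[i];
    selected.unshift(round);
    if (round.summary !== undefined) break; // earlier rounds are already covered
  }
  return selected;
}

const history: Round[] = [
  { id: "r1" },
  { id: "r2", summary: "earlier summary" },
  { id: "r3" },
  { id: "r4" },
  { id: "r5" }, // the round that blew the budget
];
console.log(roundsToSummarize(history).map((r) => r.id)); // r2, r3, r4
```

Round `r1` never reaches the summarizer again because `r2` already carries a summary that covers it, which is what keeps repeated compactions cheap.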

## What gets kept vs summarized

- **Recent rounds** → kept verbatim, so the model picks up mid-task.
- **The round that overflowed** → excluded from the summary.
- **Everything before the previous summary** → already represented; not re-included.
- **Tool results** → truncated at `maxToolResultLength`, which the agent loop sets to **50% of the model’s max prompt tokens** (`modelMaxPromptTokens * 0.5`), keeping the head and tail of long outputs (a 40/60 split) and dropping the middle. (The flat `2000` you may see in the source is a *different* feature, the panel’s chat-summary renderer, not in-loop compaction.)
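Head-and-tail truncation is easy to get subtly wrong, so here is a small sketch. It counts characters rather than tokens, and it assumes the 40/60 split means 40% of the budget for the head and 60% for the tail; the post does not say which side gets which share. `truncateMiddle` and the marker text are hypothetical.

```typescript
// Character-based stand-in for token-based truncation.
// Assumption: 40% of the budget goes to the head and 60% to the tail.
function truncateMiddle(
  text: string,
  maxChars: number,
  marker = "\n[...]\n",
): string {
  if (text.length <= maxChars) return text; // already fits
  const room = Math.max(0, maxChars - marker.length);
  const headLen = Math.floor(room * 0.4);
  const tailLen = room - headLen;
  // slice(-0) would return the whole string, so guard the empty-tail case.
  const tail = tailLen > 0 ? text.slice(-tailLen) : "";
  return text.slice(0, headLen) + marker + tail;
}

const longOutput = "a".repeat(40) + "b".repeat(100) + "c".repeat(60);
const truncated = truncateMiddle(longOutput, 107);
console.log(truncated.length); // 107
```

Keeping more of the tail than the head fits tool output such as build logs, where the final lines usually carry the error.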

## The summary format

The summarization is a real LLM call on the **same model the conversation uses**, at `temperature: 0` (streaming is intentionally *not* disabled; there’s even a regression test asserting the request doesn’t force `stream: false`). The prompt doesn’t ask for “a summary”; it demands an 8-section handoff document, emitted inside a `<summary>` block after a separate `<analysis>` block:

```
1. Conversation Overview   — objectives, session context, intent evolution
2. Technical Foundation    — core tech, frameworks, architectural patterns
3. Codebase Status         — each file touched: purpose, state, key code
4. Problem Resolution      — issues hit, solutions, debugging context
5. Progress Tracking       — done vs. partially done vs. validated
6. Active Work State       — exactly what was being worked on last
7. Recent Operations       — last commands, tool results, pre-summary state
8. Continuation Plan       — pending tasks and the immediate next step
```

There are two modes: **Full** (sends tool definitions with `tool_choice: 'none'`) and **Simple**. If Full fails, it falls back to Simple.

## The breadcrumb

Compaction is lossy — the verbatim history is gone. So the summary carries a pointer to the full transcript on disk:

```
summary += `\nIf you need specific details from before compaction (such as exact
code snippets, error messages, tool results, or content you previously generated),
use the ${ToolName.ReadFile} tool to look up the full uncompacted conversation
transcript at: "${transcriptPath}"`;
// ...then appends the transcript's current line count and an example call
```

This breadcrumb is added whenever the session has an on-disk transcript path (no experiment flag; that gate existed in the old build but is gone now). It’s appended exactly once, at summary-creation time, and *baked* into the frozen summary text, so later renders replay it verbatim, preserving Anthropic’s prompt cache. The summary is the fast path; the transcript file is the escape hatch the model reads only when it needs an exact detail. Same instinct as [progressive disclosure](/posts/stop-bloating-your-claude-md-progressive-disclosure-ai-coding-tools/): cheap index in context, expensive detail on demand.

## Model-specific gotchas

| Model | Handling |
|---|---|
| Opus (`claude-opus*`) | Extra instruction: do not call tools, only write text. |
| Anthropic + thinking | Last thinking block preserved and re-attached as the first thinking block after the summary. |
| Anthropic + `tool_search` | Client-side `tool_search` tool-use/result pairs stripped before the call, or Anthropic 400s. |
| Gemini | Orphaned `function_call`s (whose results got pruned) stripped, or it 400s. |
| GPT-4.1 | A “keep going” reminder is appended after the summary. |
| Prompt caching | Summary “baked” once so later renders don’t bust Anthropic’s cache. |
| PreCompact hook | Fires before summarizing; can archive the transcript. Errors never block. |

## Settings

| Config key | Default | What it does |
|---|---|---|
| `chat.summarizeAgentConversationHistory.enabled` | on | Master switch: gates both foreground and background compaction. |
| `chat.advanced.summarizeAgentConversationHistoryThreshold` | model max | Lower the budget that triggers compaction (ratio `0–1` or absolute tokens `≥100`). |
| `chat.advanced.agentHistorySummarizationMode` | auto | Force `simple` (or `full`) summary mode. |
| `chat.conversationCompaction.usePrismCompaction` | experiment | Route compaction through a separate “Prism” trajectory-compaction model. |
| `chat.conversationCompaction.prismModelFilter` | (model list) | Which models the Prism path applies to; falls back to the agent endpoint otherwise. |

Note

Gone since the standalone repo: `chat.backgroundCompaction` and `chat.advanced.agentHistorySummarizationInline`. Background compaction is no longer a separate experiment; it’s on by default under the master switch.

## Compared to Claude Code

Same problem, different taste. Claude Code’s `/compact` also summarizes history into a block (see [The Four Types of Memory for AI Agents](/posts/four-types-memory-coding-agents-claude-code/)). Copilot leans harder on two ideas worth stealing if you build your own loop ([like this](/posts/building-your-own-coding-agent-from-scratch/)): **prune by priority before you summarize**, and **make the lossy summary recoverable** with a breadcrumb to the raw history.
