{"slug": "claude-code-uses-prompt-caching", "title": "Claude Code uses prompt caching", "summary": "Anthropic's Claude Code uses prompt caching to avoid reprocessing unchanged parts of API requests, organized by layers of system prompt, project context, and conversation. Actions like switching models, changing effort level, or toggling fast mode invalidate the cache, causing slower and more expensive subsequent requests.", "body_md": "[disable it](#disable-prompt-caching). It is still useful to know how prompt caching works, because some actions invalidate the cache and make the next response slower and more expensive while it rebuilds. This page covers which actions those are, why some settings wait for a restart to apply, and how to check cache performance when usage looks high.\n\n## How the cache is organized\n\nEach time you send a message in Claude Code, it makes a new API request. The model doesn’t remember anything between requests, so Claude Code re-sends the full context: the system prompt, your project context, every prior message and tool result, and your new message. New content is appended at the end, which means most of each request is identical to the one before it.\n\nPrompt caching is how the API avoids reprocessing the part that didn’t change. The API caches by matching the start of each request, called the prefix, against content it recently processed. On a normal turn, the prefix is the entire previous request and only the latest exchange is new. The match is exact, so a change anywhere in the prefix recomputes everything after it. There is no per-file or per-segment caching. See [how prompt caching works](https://platform.claude.com/docs/en/build-with-claude/prompt-caching#how-prompt-caching-works) in the API reference for the underlying mechanism.
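\n\nThe prefix rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how the API is implemented: each request is modelled as an ordered list of text pieces, and the helper name `shared_prefix_len` and the sample contents are hypothetical. The comparison stops at the first difference, which mirrors why a change early in a request leaves nothing after it reusable.

```python
from typing import Sequence


def shared_prefix_len(previous: Sequence[str], current: Sequence[str]) -> int:
    '''Count the leading pieces that are identical in both requests.'''
    count = 0
    for old, new in zip(previous, current):
        if old != new:
            break  # exact match required: stop at the first difference
        count += 1
    return count


# Hypothetical request contents, in the order they would be sent.
turn_1 = ['system prompt', 'project context', 'user: fix the bug']

# A normal next turn only appends to the end of the previous request.
turn_2 = turn_1 + ['assistant: done', 'user: add a test']

# Same turn, but an early piece changed (assumed example: new system prompt).
turn_2_edited = ['system prompt v2'] + turn_2[1:]

cached = shared_prefix_len(turn_1, turn_2)
print(cached, len(turn_2) - cached)  # 3 pieces reused, 2 new

missed = shared_prefix_len(turn_1, turn_2_edited)
print(missed, len(turn_2_edited) - missed)  # 0 pieces reused, 5 recomputed
```

In this toy model, the normal turn reuses all three earlier pieces and pays only for the two new ones, while the edited system prompt reuses nothing.\n\n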
To get the most out of prefix matching, Claude Code orders each request so content that rarely changes between turns comes first:\n\n| Layer | Content | Changes when |\n|---|---|---|\n| System prompt | Core instructions, tool definitions, output style | The set of loaded tool definitions changes, or Claude Code is upgraded |\n| Project context | CLAUDE.md, auto memory, unscoped rules | Session starts, or after `/clear` or `/compact` |\n| Conversation | Your messages, Claude’s responses, tool results | Every turn |\n\n[Plan mode](/docs/en/permission-modes#analyze-before-you-edit-with-plan-mode) and [skill loading](/docs/en/skills), for example, append their instructions as conversation messages, so the cached prefix stays intact.\n\nTwo settings aren’t part of the prompt text at all, so they don’t appear in the layer table, but both are part of the cache key:\n\n- **Model**: each model has its own cache. Switching models recomputes the entire request even when the content is identical. See [Switching models](#switching-models) below.\n- **Effort level**: each effort level has its own cache for the same model. Changing it mid-session recomputes the entire request, and Claude Code asks you to confirm before applying the change. See [Changing effort level](#changing-effort-level) below.\n\n### Where the cache lives\n\nCaching happens server-side, in whichever infrastructure serves your model.
Where that is depends on how you authenticate:\n\n- **API key, Claude subscription, or [Claude Platform on AWS](/docs/en/claude-platform-on-aws)**: the cache lives in Anthropic’s infrastructure, accessed through the [Claude API](https://platform.claude.com/docs)\n- **Bedrock or Vertex AI**: the cache lives in your cloud provider’s serving infrastructure\n- **Foundry**: requests route to Anthropic’s infrastructure\n- **Custom `ANTHROPIC_BASE_URL` or [LLM gateway](/docs/en/llm-gateway)**: the cache lives wherever your requests are forwarded, and whether caching works depends on the gateway\n\nSee [data usage](/docs/en/data-usage). Wherever the cache lives, entries expire after a period of inactivity, and [Cache lifetime](#cache-lifetime) below covers the TTL and how to extend it.\n\n## Actions that invalidate the cache\n\nThese actions cause the next request to miss part or all of the cache. You see a one-time slower, more expensive turn, after which the new prefix is cached. Most of them are avoidable mid-task once you know they have a cost. A model switch can feel free until you notice the slower turn that follows.\n\n- [Switching models](#switching-models)\n- [Changing effort level](#changing-effort-level)\n- [Turning on fast mode](#turning-on-fast-mode)\n- [Connecting or disconnecting an MCP server](#connecting-or-disconnecting-an-mcp-server)\n- [Enabling or disabling a plugin](#enabling-or-disabling-a-plugin)\n- [Denying an entire tool](#denying-an-entire-tool)\n- [Compacting the conversation](#compacting-the-conversation)\n- [Upgrading Claude Code](#upgrading-claude-code)\n\n### Switching models\n\nEach model has its own cache. Switching with [`/model`](/docs/en/model-config#setting-your-model) means the next request reads the entire conversation history with no cache hits, even though the content is identical.
The [`opusplan` model setting](/docs/en/model-config#opusplan-model-setting) resolves to Opus during plan mode and Sonnet during execution, so each plan-mode toggle is a model switch and starts a fresh cache.\n\n[Automatic model fallback](/docs/en/model-config#automatic-model-fallback) on Fable 5 is also a model switch. When a safety classifier flags a request, Claude Code re-runs it on the default Opus model and the session continues there.\n\n### Changing effort level\n\nThe cache is keyed by [effort level](/docs/en/model-config#adjust-effort-level) as well as model, so switching with `/effort` means the next request reads the entire conversation history with no cache hits. Once a conversation has started, Claude Code shows a confirmation dialog before applying an effort change that would invalidate the cache. A change that resolves to the same level already in effect, such as setting the model’s default explicitly, skips the dialog and keeps the cache.\n\n### Turning on fast mode\n\nEnabling [fast mode](/docs/en/fast-mode) adds a request header that is part of the cache key, so the next request reads the entire conversation history with no cache hits. Those uncached input tokens are billed at [fast mode rates](/docs/en/fast-mode#understand-the-cost-tradeoff), which is why turning it on at the start of a session costs less than turning it on deep into a long one. Enabling fast mode from a non-Opus model also [switches your model](#switching-models), which starts a fresh cache on its own.\n\nThe cost applies once per conversation. After the first fast mode turn, Claude Code keeps sending the header and varies only the request’s speed setting, which is not part of the cache key.
Turning fast mode off, the [automatic fallback to standard speed](/docs/en/fast-mode#handle-rate-limits) after a rate limit, and turning it back on later all keep the cache. `/clear` and `/compact` reset this, since they rebuild the cache at those points anyway.\n\nKeeping the header across toggles requires Claude Code v2.1.86 or later. On earlier versions, every fast mode toggle and rate-limit fallback invalidates the cache.\n\n### Connecting or disconnecting an MCP server\n\nTool definitions sit in the system prompt layer, so the cache invalidates when the set of tool definitions in the request changes between turns. Toggling the [advisor tool](/docs/en/advisor) is an exception: its definition sits after the cache breakpoint, so enabling or disabling `/advisor` keeps the cached prefix intact. Whether an [MCP server](/docs/en/mcp) change does this depends on whether its tools are deferred by [tool search](/docs/en/mcp#scale-with-mcp-tool-search) or loaded into the prefix:\n\n- **Deferred tools**, the default on supported models: a server connecting, disconnecting, or changing its tool list only appends new content and doesn’t disturb anything already cached.\n- **Tools loaded into the prefix**: any change to them invalidates the cache. This happens when [tool search is unavailable or disabled](/docs/en/mcp#configure-tool-search), such as on Haiku models, on Vertex AI, or with a custom `ANTHROPIC_BASE_URL` gateway. It also happens for a server or tool marked `alwaysLoad`, and for definitions kept upfront by [threshold-based loading](/docs/en/mcp#configure-tool-search).\n\nA server [reconnects automatically after a transient failure](/docs/en/mcp#automatic-reconnection). A connected server can also push a [dynamic tool update](/docs/en/mcp#dynamic-tool-updates) that changes its tool list. Editing your MCP config does not by itself change the cache.
The new config takes effect only after a restart, which is when the server connects or disconnects.\n\n### Enabling or disabling a plugin\n\n[Plugins](/docs/en/plugins) bundle several component types, and the cost of a change depends on which components the plugin provides. Skills, commands, agents, hooks, LSP servers, monitors, and themes never invalidate the cache: anything they add to the request is appended after the existing conversation, so the next request pays for the new content but still reads everything before it from the cache.\n\nThe exception is a plugin that provides [MCP servers](/docs/en/plugins-reference#mcp-servers). Enabling or disabling one follows the same rules as [connecting or disconnecting an MCP server](#connecting-or-disconnecting-an-mcp-server): the cache survives when the server’s tools are deferred, and the next request re-reads the entire conversation when they load into the prefix.\n\nPlugin changes apply when you run [`/reload-plugins`](/docs/en/discover-plugins#apply-plugin-changes-without-restarting) or start a new session. The cost, whether appended announcements or a full re-read, shows up on the first turn after the reload, not when you run `/plugin install`, `/plugin enable`, or `/plugin disable`. As of v2.1.163, when a reload would trigger the full re-read, `/reload-plugins` shows a warning and does not apply the reload. Pass `--force` to apply anyway.\n\nDisabling a plugin you enabled earlier in the session restores the previous request shape. If that prefix is still within its [cache lifetime](#cache-lifetime), the next request reads the older cache entry instead of rebuilding.\n\n### Denying an entire tool\n\nAdding a bare tool name like `Bash` or `WebFetch` as a [deny rule](/docs/en/permissions#manage-permissions) removes that tool from Claude’s context entirely.
Built-in tool definitions load into the system prompt layer, so adding or removing one of these rules mid-session invalidates the cache. The change takes effect on the next turn whether you add it through `/permissions` or by [editing a settings file directly](/docs/en/settings#when-edits-take-effect).\n\nOnly a deny rule that matches in the tool-name position has this effect: a bare tool name, the equivalent `Bash(*)` form, or a [tool-name glob](/docs/en/permissions#tool-name-wildcards) like `\"*\"`. A glob that matches only MCP tools, such as `\"mcp__*\"`, removes those tools the same way but leaves the cache intact when the matched tools are [deferred](#connecting-or-disconnecting-an-mcp-server), the default, since deferred definitions were never in the cached prefix. Scoped deny rules like `Bash(rm *)`, and all allow and ask rules, don’t change which tools Claude sees. Claude Code checks them when Claude attempts a call, leaving the prefix intact.\n\n### Compacting the conversation\n\n[Compaction](/docs/en/context-window#what-survives-compaction) replaces your message history with a summary. By design, this invalidates the conversation layer, since the next request has a new, shorter history that doesn’t share a prefix with the old one. Claude Code reuses the system prompt layer and reloads project context from disk, which cache-hits only if CLAUDE.md and memory are unchanged since the session started.\n\nTo produce the summary, Claude Code sends a one-off request with the same system prompt, tools, and history as your conversation, plus a summarization instruction appended as a final user message. Because it shares your prefix, that request reads the existing cache rather than reprocessing the full history. Most of compaction’s time goes to generating the summary, not to a cache miss.
The turn that follows rebuilds the conversation cache only for the much shorter summary, so the post-compaction turn is not the slow part.\n\n### Upgrading Claude Code\n\nA new Claude Code version typically updates the system prompt or tool definitions, so the first request after an upgrade rebuilds the cache from the top. [Auto-update](/docs/en/setup#auto-updates) downloads new versions in the background but applies them on the next launch, never mid-session, so you see this as an uncached first turn after restarting rather than a surprise during a session. Set `DISABLE_AUTOUPDATER=1` to control when upgrades apply.\n\n[Resuming a session](/docs/en/sessions#resume-a-session) after an upgrade reprocesses the entire conversation history with no cache hits, since the history now sits behind a different system prompt. The cost scales with how long the resumed conversation is, so the first turn back into a long session can be the most expensive request you send.\n\n## Actions that keep the cache\n\nThese actions either append to the end of the conversation or don’t touch the request at all. Some of them, such as editing CLAUDE.md or changing output style, are also why a setting change waits for a restart to apply.\n\n- [Editing files in your repository](#editing-files-in-your-repository)\n- [Editing CLAUDE.md mid-session](#editing-claude-md-mid-session)\n- [Changing output style](#changing-output-style)\n- [Changing permission mode](#changing-permission-mode)\n- [Invoking skills and commands](#invoking-skills-and-commands)\n- [Running `/recap`](#running-%2Frecap)\n- [Rewinding the conversation](#rewinding-the-conversation)\n- [Spawning a subagent](#subagents-and-the-cache)\n\n### Editing files in your repository\n\nFile contents enter context only when Claude reads them, and reads append to the conversation. Editing a file Claude previously read does not retroactively change the earlier read in history.
Instead, Claude Code appends a `<system-reminder>` noting the file changed, and Claude re-reads it if needed.\n\n### Editing CLAUDE.md mid-session\n\nYour project-root and user-level CLAUDE.md files are read once at session start and held in memory. Editing them mid-session does not invalidate the cache, but the edit also doesn’t apply. Claude keeps working with the version that was loaded at session start. The new content loads on the next `/clear`, `/compact`, or restart.\n\n[Nested CLAUDE.md files in subdirectories](/docs/en/memory) and [rules with `paths:` frontmatter](/docs/en/memory#path-specific-rules) load later, when Claude first reads a matching file. Editing one before it loads does take effect. After it loads, the content is part of the conversation history, so a mid-session edit doesn’t retroactively change it.\n\n### Changing output style\n\n[Output style](/docs/en/output-styles) is part of the system prompt, which Claude Code reads once at session start. Changing it via `/config` or the `outputStyle` setting mid-session does not invalidate the cache, but the change also doesn’t apply. Claude keeps using the style that was loaded at session start. The new style loads on the next `/clear` or restart.\n\n### Changing permission mode\n\nSwitching between [permission modes](/docs/en/permission-modes), such as from default to accept edits, does not change the system prompt or tool definitions, so mode changes are cache-safe. The exception is plan mode with the [`opusplan` model setting](/docs/en/model-config#opusplan-model-setting), which switches the model between Opus and Sonnet as you enter or leave plan mode. That makes the mode toggle a [model switch](#switching-models).\n\n### Invoking skills and commands\n\n[Skills](/docs/en/skills) and [commands](/docs/en/commands) inject their instructions as user messages at the point of invocation.
Nothing earlier in the conversation changes.\n\n### Running `/recap`\n\n[`/recap`](/docs/en/interactive-mode#session-recap) generates a summary for display in your terminal. Unlike `/compact`, it appends the summary as command output rather than replacing your message history, so the cached prefix stays intact.\n\n### Rewinding the conversation\n\n[`/rewind`](/docs/en/checkpointing) truncates your conversation back to an earlier turn. The remaining history is the same content the cache was built from at that point, and the system prompt and project context layers are unchanged, so the next request hits the earlier cache entry. Every turn since then has read through that prefix, which kept the entry warm even if the original turn was longer ago than the TTL.\n\nRestoring file checkpoints alongside the conversation has no separate effect on the cache. File contents enter context only when Claude reads them, the same as [editing files in your repository](#editing-files-in-your-repository).\n\n## Cache lifetime\n\nCached prefixes expire after a period of inactivity. Each request that hits the cache resets the timer, so the cache stays warm as long as you keep working. After a long enough gap, the next request recomputes the full input and re-establishes the cache, which is why the first turn back after stepping away can be noticeably slower.\n\nThe time to live (TTL) controls how long a gap the cache survives. The API offers two: a five-minute TTL, and a [one-hour TTL](https://platform.claude.com/docs/en/build-with-claude/prompt-caching#1-hour-cache-duration) that keeps the cache warm through longer breaks but [bills cache writes at a higher rate](https://platform.claude.com/docs/en/build-with-claude/prompt-caching#pricing). Claude Code picks the TTL for you based on how you authenticate, and you can override it with environment variables.\n\n### On a Claude subscription\n\nOn a Claude subscription, Claude Code requests the one-hour TTL automatically.
Usage is included in your plan rather than billed per token, so the longer TTL costs you nothing extra and only affects how long your cache stays warm. If you’ve gone over your plan’s usage limit and Claude Code is drawing on [usage credits](https://support.claude.com/en/articles/12429409-extra-usage-for-paid-claude-plans), you are billed for that usage, so Claude Code automatically drops the TTL to five minutes.\n\n### On an API key or third-party provider\n\nOn an API key, Bedrock, Vertex, Foundry, or Claude Platform on AWS, you pay the per-token rates, so the TTL stays at the cheaper five minutes by default. To opt into the [one-hour TTL](https://platform.claude.com/docs/en/build-with-claude/prompt-caching#1-hour-cache-duration), set `ENABLE_PROMPT_CACHING_1H=1`.\n\nOn Bedrock, prompt caching support, minimum cacheable prefix length, and one-hour TTL availability all vary by model. If cache token counts stay at zero, check [supported models, regions, and limits](https://docs.aws.amazon.com/bedrock/latest/userguide/prompt-caching.html#prompt-caching-models) in the Bedrock documentation.\n\n### Override the TTL\n\nSet `FORCE_PROMPT_CACHING_5M=1` to force the five-minute TTL regardless of authentication. This is useful when you’re debugging cache behavior, comparing the two TTLs, or overriding an `ENABLE_PROMPT_CACHING_1H` set in [managed settings](/docs/en/settings#settings-files).\n\n## Cache scope\n\nIn Claude Code, the cache is effectively scoped to one machine and directory. The system prompt embeds the working directory, platform, shell, OS version, and auto-memory paths, so two sessions in different directories build different prefixes and miss each other’s cache. That includes worktrees of the same repository, since each worktree has its own working directory. Sessions you run in parallel in the same directory build matching prefixes and read each other’s cache.
Sequential sessions share the prefix only when the git status snapshot at startup matches, since the system prompt also captures branch and recent commits.\n\nThe underlying API cache is broader. Caches are isolated between organizations, and on some providers, [between workspaces within an organization](https://platform.claude.com/docs/en/build-with-claude/prompt-caching#cache-storage-and-sharing). Within those boundaries, any two requests with the same model and prefix read the same cache. For Agent SDK callers running fleets of automated processes, see [improve prompt caching across users and machines](/docs/en/agent-sdk/modifying-system-prompts#improve-prompt-caching-across-users-and-machines) to suppress the per-machine sections of the system prompt and share the cache across machines.\n\n## Check cache performance\n\nCache performance shows up as two token counts the API reports on every response. The most direct way to watch them live is a [statusline script](/docs/en/statusline) that reads the `current_usage` object:\n\n| Field | Meaning |\n|---|---|\n| `cache_creation_input_tokens` | Tokens written to the cache on this turn, billed at the cache write rate |\n| `cache_read_input_tokens` | Tokens served from cache on this turn, billed at roughly 10% of the standard input rate |\n\nThe [actions that invalidate the cache](#actions-that-invalidate-the-cache) section lists the usual causes. For visibility across an organization, the OpenTelemetry exporter reports cache read and creation tokens per user and session. See [Monitor usage](/docs/en/monitoring-usage) for the metric and event attribute reference.\n\n## Subagents and the cache\n\nA [subagent](/docs/en/sub-agents) starts its own conversation with its own system prompt and tool set, separate from the parent’s. It builds its own cache, starting with no cache hits on its first call and warming up across its own turns.
Subagents use the five-minute TTL even on a subscription, since the automatic one-hour TTL applies to the main conversation. The parent’s cache is unaffected. From the parent’s side, the subagent’s call and result append to the conversation, leaving the parent’s prefix intact.\n\nA [fork](/docs/en/sub-agents#fork-the-current-conversation), by contrast, inherits the parent’s system prompt, tools, and conversation history exactly, so its first request reads the parent’s cache. The compaction summarization call described in [Compacting the conversation](#compacting-the-conversation) uses the same prefix-sharing approach.\n\n## Disable prompt caching\n\nDisabling caching is occasionally useful when debugging caching behavior with a specific model or provider. To turn it off, set one of these environment variables to `1`:\n\n| Variable | Effect |\n|---|---|\n| `DISABLE_PROMPT_CACHING` | Disable for all models |\n| `DISABLE_PROMPT_CACHING_HAIKU` | Disable for Haiku only |\n| `DISABLE_PROMPT_CACHING_SONNET` | Disable for Sonnet only |\n| `DISABLE_PROMPT_CACHING_OPUS` | Disable for Opus only |\n| `DISABLE_PROMPT_CACHING_FABLE` | Disable for Fable only |\n\n[TTL variables](#cache-lifetime) in the `env` block of [managed settings](/docs/en/settings#settings-files).
For normal use, leave caching enabled.\n\n## Related resources\n\n- [Lessons from building Claude Code: Prompt caching is everything](https://claude.com/blog/lessons-from-building-claude-code-prompt-caching-is-everything): the design rationale for plan mode, deferred tool loading, and compaction\n- [Explore the context window](/docs/en/context-window): what loads into context and when\n- [Reduce token usage](/docs/en/costs#reduce-token-usage): strategies beyond caching for managing context size\n- [Track and reduce costs](/docs/en/agent-sdk/cost-tracking): cache token tracking and TTL configuration for Agent SDK callers\n- [Prompt caching](https://platform.claude.com/docs/en/build-with-claude/prompt-caching): the underlying API mechanism, breakpoints, and pricing", "url": "https://wpnews.pro/news/claude-code-uses-prompt-caching", "canonical_source": "https://code.claude.com/docs/en/prompt-caching", "published_at": "2026-07-01 06:02:40+00:00", "updated_at": "2026-07-01 06:19:43.309476+00:00", "lang": "en", "topics": ["large-language-models", "ai-tools", "ai-infrastructure"], "entities": ["Claude Code", "Anthropic", "AWS", "Bedrock", "Vertex AI", "Foundry", "Claude API"], "alternates": {"html": "https://wpnews.pro/news/claude-code-uses-prompt-caching", "markdown": "https://wpnews.pro/news/claude-code-uses-prompt-caching.md", "text": "https://wpnews.pro/news/claude-code-uses-prompt-caching.txt", "jsonld": "https://wpnews.pro/news/claude-code-uses-prompt-caching.jsonld"}}