{"slug": "prompt-caching-claude-platform-docs", "title": "Prompt Caching – Claude Platform Docs", "summary": "Anthropic has introduced prompt caching for its Claude API, allowing developers to resume from specific prefixes in prompts to reduce processing time and costs. The feature supports automatic and explicit caching with a 5-minute default cache lifetime, and a 1-hour option at additional cost. Prompt caching is available on all active Claude models and introduces a new pricing structure with multipliers for cache writes and hits.", "body_md": "Prompt caching optimizes your API usage by letting requests resume from specific prefixes in your prompts. This significantly reduces processing time and costs for repetitive tasks or prompts with consistent elements.\n\nThis feature is eligible for [Zero Data Retention (ZDR)](/docs/en/build-with-claude/api-and-data-retention). When your organization has a ZDR arrangement, data sent through this feature is not stored after the API response is returned.\n\nThere are two ways to enable prompt caching:\n\n- **Automatic caching:** Add a single `cache_control` field at the top level of your request. The system automatically applies the cache breakpoint to the last cacheable block and moves it forward as conversations grow. 
Best for multi-turn conversations where the growing message history should be cached automatically.\n- **Explicit cache breakpoints:** Place `cache_control` directly on individual content blocks for fine-grained control over exactly what gets cached.\n\nThe simplest way to start is with automatic caching:\n\n```\nimport anthropic\n\nclient = anthropic.Anthropic()\n\nresponse = client.messages.create(\n    model=\"claude-opus-4-8\",\n    max_tokens=1024,\n    cache_control={\"type\": \"ephemeral\"},\n    system=\"You are an AI assistant tasked with analyzing literary works. Your goal is to provide insightful commentary on themes, characters, and writing style.\",\n    messages=[\n        {\n            \"role\": \"user\",\n            \"content\": \"Analyze the major themes in 'Pride and Prejudice'.\",\n        }\n    ],\n)\nprint(response.usage.model_dump_json())\n```\n\nWith automatic caching, the system caches all content up to and including the last cacheable block. On subsequent requests with the same prefix, cached content is reused automatically.\n\nWhen you send a request with prompt caching enabled:\n\nThis is especially useful for:\n\nBy default, the cache has a 5-minute lifetime. The cache is refreshed for no additional cost each time the cached content is used.\n\nIf you find that 5 minutes is too short, Anthropic also offers a 1-hour cache duration [at additional cost](#pricing).\n\nFor more information, see [1-hour cache duration](#1-hour-cache-duration).\n\n**Prompt caching caches the full prefix**\n\nPrompt caching references the entire prompt (`tools`, `system`, and `messages`, in that order) up to and including the block designated with `cache_control`.\n\nPrompt caching introduces a new pricing structure. 
The table below shows the price per million tokens for each supported model:\n\n| Model | Base Input Tokens | 5m Cache Writes | 1h Cache Writes | Cache Hits & Refreshes | Output Tokens |\n|---|---|---|---|---|---|\n| Claude Fable 5 | $10 / MTok | $12.50 / MTok | $20 / MTok | $1 / MTok | $50 / MTok |\n| Claude Mythos 5 (\n|\n\nThe table above reflects the following pricing multipliers for prompt caching:\n\nThese multipliers stack with other pricing modifiers such as the Batch API discount and data residency. See [pricing](/docs/en/about-claude/pricing) for full details.\n\nPrompt caching (both automatic and explicit) is supported on all [active Claude models](/docs/en/about-claude/models/overview).\n\nAutomatic caching is the simplest way to enable prompt caching. Instead of placing `cache_control`\n\non individual content blocks, add a single `cache_control`\n\nfield at the top level of your request body. The system automatically applies the cache breakpoint to the last cacheable block.\n\n```\nclient = anthropic.Anthropic()\n\nresponse = client.messages.create(\n    model=\"claude-opus-4-8\",\n    max_tokens=1024,\n    cache_control={\"type\": \"ephemeral\"},\n    system=\"You are a helpful assistant that remembers our conversation.\",\n    messages=[\n        {\"role\": \"user\", \"content\": \"My name is Alex. I work on machine learning.\"},\n        {\n            \"role\": \"assistant\",\n            \"content\": \"Nice to meet you, Alex! How can I help with your ML work today?\",\n        },\n        {\"role\": \"user\", \"content\": \"What did I say I work on?\"},\n    ],\n)\nprint(response.usage.model_dump_json())\n```\n\nWith automatic caching, the cache point moves forward automatically as conversations grow. 
Each new request caches everything up to the last cacheable block, and previous content is read from cache.\n\n| Request | Content | Cache behavior |\n|---|---|---|\n| Request 1 | System + User(1) + Asst(1) + User(2) ◀ cache | Everything written to cache |\n| Request 2 | System + User(1) + Asst(1) + User(2) + Asst(2) + User(3) ◀ cache | System through User(2) read from cache; Asst(2) + User(3) written to cache |\n| Request 3 | System + User(1) + Asst(1) + User(2) + Asst(2) + User(3) + Asst(3) + User(4) ◀ cache | System through User(3) read from cache; Asst(3) + User(4) written to cache |\n\nThe cache breakpoint automatically moves to the last cacheable block in each request, so you don't need to update any `cache_control`\n\nmarkers as the conversation grows.\n\nBy default, automatic caching uses a 5-minute TTL. You can specify a 1-hour TTL at 2x the base input token price:\n\n```\n{ \"cache_control\": { \"type\": \"ephemeral\", \"ttl\": \"1h\" } }\n```\n\nAutomatic caching is compatible with [explicit cache breakpoints](#explicit-cache-breakpoints). When used together, the automatic cache breakpoint uses one of the 4 available breakpoint slots.\n\nThis lets you combine both approaches. For example, use an explicit breakpoint to cache your system prompt, while automatic caching handles the conversation:\n\n```\n{\n  \"model\": \"claude-opus-4-8\",\n  \"max_tokens\": 1024,\n  \"cache_control\": { \"type\": \"ephemeral\" },\n  \"system\": [\n    {\n      \"type\": \"text\",\n      \"text\": \"You are a helpful assistant.\",\n      \"cache_control\": { \"type\": \"ephemeral\" }\n    }\n  ],\n  \"messages\": [{ \"role\": \"user\", \"content\": \"What are the key terms?\" }]\n}\n```\n\nAutomatic caching uses the same underlying caching infrastructure. 
Pricing, minimum token thresholds, context ordering requirements, and the 20-block lookback window all apply the same as with explicit breakpoints.\n\n`cache_control`\n\nwith the same TTL, automatic caching is a no-op.`cache_control`\n\nwith a different TTL, the API returns a 400 error.Automatic caching is available on the Claude API, [Claude Platform on AWS](/docs/en/build-with-claude/claude-platform-on-aws), and [Microsoft Foundry](/docs/en/build-with-claude/claude-in-microsoft-foundry). Bedrock and Google Cloud do not support automatic caching.\n\nFor more control over caching, you can place `cache_control`\n\ndirectly on individual content blocks. This is useful when you need to cache different sections that change at different frequencies, or need fine-grained control over exactly what gets cached.\n\nPlace static content (tool definitions, system instructions, context, examples) at the beginning of your prompt. Mark the end of the reusable content for caching using the `cache_control`\n\nparameter.\n\nCache prefixes are created in the following order: `tools`\n\n, `system`\n\n, then `messages`\n\n. This order forms a hierarchy where each level builds upon the previous ones.\n\nYou can use just one cache breakpoint at the end of your static content, and the system will automatically find the longest prefix that a prior request already wrote to the cache. Understanding how this works helps you optimize your caching strategy.\n\n**Three core principles:**\n\n**Cache writes happen only at your breakpoint.** Marking a block with `cache_control`\n\nwrites exactly one cache entry: a hash of the prefix ending at that block. The system does not write entries for any earlier position. 
Because the hash is cumulative, covering everything up to and including the breakpoint, changing any block at or before the breakpoint produces a different hash on the next request.\n\n**Cache reads look backward for entries that prior requests wrote.** On each request the system computes the prefix hash at your breakpoint and checks for a matching cache entry. If none exists, it walks backward one block at a time, checking whether the prefix hash at each earlier position matches something already in the cache. It is looking for prior writes, not for stable content.\n\n**The lookback window is 20 blocks.** The system checks at most 20 positions per breakpoint, counting the breakpoint itself as the first. If the system finds no matching entry in that window, checking stops (or resumes from the next explicit breakpoint, if any).\n\n**Example: Lookback in a growing conversation**\n\nYou append new blocks each turn and set `cache_control`\n\non the final block of each request:\n\n**Common mistake: Breakpoint on content that changes every request**\n\nYour prompt has a large static system context (blocks 1 through 5) followed by a per-request block containing a timestamp and the user message (block 6). You set `cache_control`\n\non block 6:\n\nThe lookback does not find stable content behind your breakpoint and cache it. It finds entries that prior requests already wrote, and writes happen only at breakpoints. Move `cache_control`\n\nto block 5, the last block that stays the same across requests, and every subsequent request reads the cached prefix. [Automatic caching](#automatic-caching) hits the same trap: it places the breakpoint on the last cacheable block, which in this structure is the one that changes every request, so use an explicit breakpoint on block 5 instead.\n\n**Key takeaway:** Place `cache_control`\n\non the last block whose prefix is identical across the requests you want to share a cache. 
In a growing conversation the final block works as long as each turn adds fewer than 20 blocks: earlier content never changes, so the next request's lookback finds the prior write. For a prompt with a varying suffix (timestamps, per-request context, the incoming message), place the breakpoint at the end of the static prefix, not on the varying block.\n\nYou can define up to 4 cache breakpoints if you want to:\n\n**Important limitation:** The lookback can only find entries that earlier requests already wrote. If a growing conversation pushes your breakpoint 20 or more blocks past the last write, the lookback window misses it. Add a second breakpoint closer to that position from the start so a write accumulates there before you need it.\n\n**Cache breakpoints themselves don't add any cost.** You are only charged for:\n\nAdding more `cache_control`\n\nbreakpoints doesn't increase your costs - you still pay the same amount based on what content is actually cached and read. The breakpoints simply give you control over what sections can be cached independently.\n\nOn the Claude API, [Claude Platform on AWS](/docs/en/build-with-claude/claude-platform-on-aws), [Google Cloud](/docs/en/build-with-claude/claude-on-vertex-ai), and [Microsoft Foundry](/docs/en/build-with-claude/claude-in-microsoft-foundry), the minimum cacheable prompt length is:\n\nModel availability varies by platform, and so can the minimum for newly released models: on [Amazon Bedrock](/docs/en/build-with-claude/claude-in-amazon-bedrock), the minimum cacheable prompt length for Claude Fable 5 and Claude Mythos 5 is 1,024 tokens.\n\nShorter prompts cannot be cached, even if marked with `cache_control`\n\n. Any requests to cache fewer than this number of tokens will be processed without caching, and no error is returned. 
To verify whether a prompt was cached, check the response usage [fields](/docs/en/build-with-claude/prompt-caching#tracking-cache-performance): if both `cache_creation_input_tokens`\n\nand `cache_read_input_tokens`\n\nare 0, the prompt was not cached (likely because it did not meet the minimum length requirement).\n\nIf your prompt falls just short of the minimum for your model and platform, expanding the cached content to reach the threshold is often worthwhile. Cache reads cost significantly less than uncached input tokens, so reaching the minimum can reduce costs for frequently reused prompts.\n\n[Bedrock](/docs/en/build-with-claude/claude-in-amazon-bedrock) is an AWS-operated platform. On Bedrock, see the [Bedrock prompt caching documentation](https://docs.aws.amazon.com/bedrock/latest/userguide/prompt-caching.html) for the per-model minimums, failure behavior, and usage-field names that apply.\n\nFor concurrent requests, note that a cache entry only becomes available after the first response begins. If you need cache hits for parallel requests, wait for the first response before sending subsequent requests.\n\nCurrently, \"ephemeral\" is the only supported cache type, which by default has a 5-minute lifetime.\n\nMost blocks in the request can be cached. This includes:\n\n`tools`\n\narray`system`\n\narray`messages.content`\n\narray, for both user and assistant turns`messages.content`\n\narray, in user turns`messages.content`\n\narray, in both user and assistant turnsEach of these elements can be cached, either automatically or by marking them with `cache_control`\n\n.\n\nWhile most request blocks can be cached, there are some exceptions:\n\nThinking blocks cannot be cached directly with `cache_control`\n\n. However, thinking blocks CAN be cached alongside other content when they appear in previous assistant turns. 
When cached this way, they DO count as input tokens when read from cache.\n\nSub-content blocks (like [citations](/docs/en/build-with-claude/citations)) themselves cannot be cached directly. Instead, cache the top-level block.\n\nIn the case of citations, the top-level document content blocks that serve as the source material for citations can be cached. This allows you to use prompt caching with citations effectively by caching the documents that citations will reference.\n\nEmpty text blocks cannot be cached.\n\nModifications to cached content can invalidate some or all of the cache.\n\nAs described in [Structuring your prompt](#structuring-your-prompt), the cache follows the hierarchy: `tools`\n\n→ `system`\n\n→ `messages`\n\n. Changes at each level invalidate that level and all subsequent levels.\n\nThe following table shows which parts of the cache are invalidated by different types of changes. ✘ indicates that the cache is invalidated, while ✓ indicates that the cache remains valid.\n\n| What changes | Tools cache | System cache | Messages cache | Impact |\n|---|---|---|---|---|\nTool definitions | ✘ | ✘ | ✘ | Modifying tool definitions (names, descriptions, parameters) invalidates the entire cache |\nWeb search toggle | ✓ | ✘ | ✘ | Enabling/disabling web search modifies the system prompt |\nCitations toggle | ✓ | ✘ | ✘ | Enabling/disabling citations modifies the system prompt |\nSpeed setting | ✓ | ✘ | ✘ | Switching between\n`speed: \"fast\"` and standard speed |\n\n`tool_choice`\n\nparameter only affect message blocksOn Claude Opus 4.8, you can add a new system instruction partway through a conversation without invalidating the system or message caches. Append a `{\"role\": \"system\"}`\n\nmessage to `messages`\n\ninstead of editing the top-level `system`\n\nfield, so the cached prefix stays unchanged. 
See [Mid-conversation system messages](/docs/en/build-with-claude/mid-conversation-system-messages).\n\nMonitor cache performance using these API response fields, within `usage` in the response (or the `message_start` event if [streaming](/docs/en/build-with-claude/streaming)):\n\n- `cache_creation_input_tokens`: Number of tokens written to the cache when creating a new entry.\n- `cache_read_input_tokens`: Number of tokens retrieved from the cache for this request.\n- `input_tokens`: Number of input tokens which were not read from or used to create a cache (that is, tokens after the last cache breakpoint).\n\n**Understanding the token breakdown**\n\nThe `input_tokens` field represents only the tokens that come **after the last cache breakpoint** in your request, not all the input tokens you sent.\n\nTo calculate total input tokens:\n\n```\ntotal_input_tokens = cache_read_input_tokens + cache_creation_input_tokens + input_tokens\n```\n\n**Spatial explanation:**\n\n- `cache_read_input_tokens` = tokens before breakpoint already cached (reads)\n- `cache_creation_input_tokens` = tokens before breakpoint being cached now (writes)\n- `input_tokens` = tokens after your last breakpoint (not eligible for cache)\n\n**Example:** If you have a request with 100,000 tokens of cached content (read from cache), 0 tokens of new content being cached, and 50 tokens in your user message (after the cache breakpoint):\n\n- `cache_read_input_tokens`: 100,000\n- `cache_creation_input_tokens`: 0\n- `input_tokens`: 50\n\nThis is important for understanding both costs and rate limits, as `input_tokens` will typically be much smaller than your total input when using caching effectively.\n\nWhen using [extended thinking](/docs/en/build-with-claude/extended-thinking) with prompt caching, thinking blocks have special behavior:\n\n**Automatic caching alongside other content**: While thinking blocks cannot be explicitly marked with `cache_control`, they get cached as part of the request content 
when you make subsequent API calls with tool results. This commonly happens during tool use when you pass thinking blocks back to continue the conversation.\n\n**Input token counting**: When thinking blocks are read from cache, they count as input tokens in your usage metrics. This is important for cost calculation and token budgeting.\n\n**Cache invalidation patterns**:\n\n`cache_control`\n\nmarkersFor more details on cache invalidation, see [What invalidates the cache](#what-invalidates-the-cache).\n\n**Example with tool use**:\n\n```\nRequest 1: User: \"What's the weather in Paris?\"\nResponse: [thinking_block_1] + [tool_use block 1]\n\nRequest 2:\nUser: [\"What's the weather in Paris?\"],\nAssistant: [thinking_block_1] + [tool_use block 1],\nUser: [tool_result_1, cache=True]\nResponse: [thinking_block_2] + [text block 2]\n# Request 2 caches its request content (not the response)\n# The cache includes: user message, thinking_block_1, tool_use block 1, and tool_result_1\n\nRequest 3:\nUser: [\"What's the weather in Paris?\"],\nAssistant: [thinking_block_1] + [tool_use block 1],\nUser: [tool_result_1, cache=True],\nAssistant: [thinking_block_2] + [text block 2],\nUser: [Text response, cache=True]\n# On earlier Opus/Sonnet and all Haiku models, non-tool-result user block causes prior thinking blocks to be stripped; on Opus 4.5+/Sonnet 4.6+ they are kept\n```\n\nOn earlier Opus/Sonnet models and all Haiku models, all previous thinking blocks are removed from context at this point. On Opus 4.5+ and Sonnet 4.6+, prior thinking blocks are kept by default and remain part of the cached prefix.\n\nFor more detailed information, see the [extended thinking documentation](/docs/en/build-with-claude/extended-thinking#understanding-thinking-block-caching-behavior).\n\nAs of February 5, 2026, prompt caching uses [workspace](/docs/en/manage-claude/workspaces)-level isolation instead of organization-level isolation. 
Caches are isolated per workspace, ensuring data separation between workspaces within the same organization. This applies to the Claude API, Claude Platform on AWS, and Microsoft Foundry; Bedrock and Google Cloud maintain organization-level cache isolation. If you use multiple workspaces, review your caching strategy to account for this difference.\n\n**Organization and workspace isolation:** Caches are isolated between organizations. Different organizations never share caches, even if they use identical prompts. As of February 5, 2026, caches are also isolated per workspace within an organization on the Claude API, Claude Platform on AWS, and Microsoft Foundry; Bedrock and Google Cloud continue to use organization-level isolation only.\n\n**Exact matching:** Cache hits require 100% identical prompt segments, including all text and images up to and including the block marked with cache control.\n\n**Output token generation:** Prompt caching has no effect on output token generation. The response you receive is identical to what you would get if prompt caching were not used.\n\nTo optimize prompt caching performance:\n\nTailor your prompt caching strategy to your scenario:\n\nIf experiencing unexpected behavior:\n\n[Cache diagnostics](/docs/en/build-with-claude/cache-diagnostics) (beta) has the API compare consecutive requests and report exactly where the prompt prefix diverged, which automatically handles many of the steps in this list.\n\n`cache_control`\n\nmarkers are in the same locations`tool_choice`\n\nand image usage remain consistent between calls`tool_use`\n\ncontent blocks have stable ordering as some languages (for example, Swift, Go) randomize key order during JSON conversion, breaking cachesChanges to `tool_choice`\n\nor the presence/absence of images anywhere in the prompt will invalidate the cache, requiring a new cache entry to be created. 
For more details on cache invalidation, see [What invalidates the cache](#what-invalidates-the-cache).\n\nIf you find that 5 minutes is too short, Anthropic also offers a 1-hour cache duration [at additional cost](#pricing).\n\nThe 1-hour cache duration is available on the Claude API, [Claude Platform on AWS](/docs/en/build-with-claude/claude-platform-on-aws), [Amazon Bedrock](/docs/en/build-with-claude/claude-in-amazon-bedrock), [Amazon Bedrock (legacy)](/docs/en/build-with-claude/claude-on-amazon-bedrock-legacy), [Google Cloud](/docs/en/build-with-claude/claude-on-vertex-ai), and [Microsoft Foundry](/docs/en/build-with-claude/claude-in-microsoft-foundry).\n\nTo use the extended cache, include `ttl`\n\nin the `cache_control`\n\ndefinition like this:\n\n```\n\"cache_control\": {\n  \"type\": \"ephemeral\",\n  \"ttl\": \"1h\"\n}\n```\n\nThe response will include detailed cache information like the following:\n\n```\n{\n  \"usage\": {\n    \"input_tokens\": 2048,\n    \"cache_read_input_tokens\": 1800,\n    \"cache_creation_input_tokens\": 248,\n    \"output_tokens\": 503,\n\n    \"cache_creation\": {\n      \"ephemeral_5m_input_tokens\": 148,\n      \"ephemeral_1h_input_tokens\": 100\n    }\n  }\n}\n```\n\nNote that the current `cache_creation_input_tokens`\n\nfield equals the sum of the values in the `cache_creation`\n\nobject.\n\nIf you see `ephemeral_5m_input_tokens`\n\nwrites you didn't request while using server tools such as web search, see [this guide on prompt caching and tool use](/docs/en/agents-and-tools/tool-use/tool-use-with-prompt-caching#server-tool-results-are-cached-automatically).\n\nIf you have prompts that are used at a regular cadence (that is, system prompts that are used more frequently than every 5 minutes), continue to use the 5-minute cache, since this will continue to be refreshed at no additional charge.\n\nThe 1-hour cache is best used in the following scenarios:\n\nThe 5-minute and 1-hour cache behave the same with respect to latency. 
You will generally see improved time-to-first-token for long documents.\n\nYou can use both 1-hour and 5-minute cache controls in the same request, but with an important constraint: cache entries with a longer TTL must appear before those with a shorter TTL (that is, a 1-hour cache entry must appear before any 5-minute cache entries).\n\nWhen mixing TTLs, the API determines three billing locations in your prompt:\n\n- `A`: The token count at the highest cache hit (or 0 if no hits).\n- `B`: The token count at the highest 1-hour `cache_control` block after `A` (or equal to `A` if none exist).\n- `C`: The token count at the last `cache_control` block.\n\nIf `B` and/or `C` are larger than `A`, they will necessarily be cache misses, because `A` is the highest cache hit.\n\nYou'll be charged for:\n\n- Cache read tokens for `A`.\n- 1-hour cache write tokens for `(B - A)`.\n- 5-minute cache write tokens for `(C - B)`.\n\nHere are 3 examples. This depicts the input tokens of 3 requests, each of which has different cache hits and cache misses. Each has a different calculated pricing, shown in the colored boxes, as a result.\n\nCache pre-warming lets you load your system prompt or tool definitions into the prompt cache before a user triggers a real request. This eliminates the cache-miss latency penalty on the first user interaction, reducing time-to-first-token (TTFT) for latency-sensitive applications.\n\nSet `max_tokens: 0` in your request. The API reads your prompt into the model and writes the cache at any `cache_control` breakpoint, then returns immediately without generating any output. The response has an empty `content` array, `stop_reason: \"max_tokens\"`, and a fully populated `usage` block.\n\nPlace the `cache_control` breakpoint on the last block that is shared with the follow-up request (typically your system prompt or tool definitions), not on the placeholder user message. Otherwise the cache entry is keyed to the placeholder and the follow-up request won't hit it. 
This means using an [explicit cache breakpoint](#explicit-cache-breakpoints) rather than [automatic caching](#automatic-caching), since automatic caching places the breakpoint on the last block, which here is the placeholder. The placeholder user message can be any string with non-whitespace content (the examples here use `\"warmup\"`\n\n); its content is read into the model but never answered.\n\nA pre-warm request incurs a **cache write** charge if the prefix is not already cached, the same as any other request. Check `usage.cache_creation_input_tokens`\n\nin the response to confirm a write occurred. Zero output tokens are billed.\n\n```\nclient = anthropic.Anthropic()\n\n# Fire this before users arrive to warm the shared system-prompt cache.\nprewarm = client.messages.create(\n    model=\"claude-opus-4-8\",\n    max_tokens=0,\n    system=[\n        {\n            \"type\": \"text\",\n            \"text\": \"You are an expert software engineer with deep knowledge of distributed systems...\",\n            \"cache_control\": {\"type\": \"ephemeral\"},\n        }\n    ],\n    messages=[{\"role\": \"user\", \"content\": \"warmup\"}],\n)\nprint(prewarm.stop_reason)  # \"max_tokens\"\nprint(prewarm.content)  # []\nprint(prewarm.usage)\n```\n\nThe API returns an empty `content`\n\narray:\n\n```\n{\n  \"id\": \"msg_01XFDUDYJgAACzvnptvVoYEL\",\n  \"type\": \"message\",\n  \"role\": \"assistant\",\n  \"content\": [],\n  \"model\": \"claude-opus-4-8\",\n  \"stop_reason\": \"max_tokens\",\n  \"stop_sequence\": null,\n  \"usage\": {\n    \"input_tokens\": 8,\n    \"cache_creation_input_tokens\": 5120,\n    \"cache_read_input_tokens\": 0,\n    \"cache_creation\": {\n      \"ephemeral_5m_input_tokens\": 5120,\n      \"ephemeral_1h_input_tokens\": 0\n    },\n    \"iterations\": [\n      {\n        \"input_tokens\": 8,\n        \"output_tokens\": 0,\n        \"cache_read_input_tokens\": 0,\n        \"cache_creation_input_tokens\": 5120,\n        \"cache_creation\": {\n          
\"ephemeral_5m_input_tokens\": 5120,\n          \"ephemeral_1h_input_tokens\": 0\n        },\n        \"type\": \"message\"\n      }\n    ],\n    \"output_tokens\": 0,\n    \"service_tier\": \"standard\",\n    \"inference_geo\": \"global\"\n  }\n}\n```\n\nFire a pre-warm request when your application starts (or on a scheduled interval), then send real user requests after the pre-warm completes:\n\n```\nclient = anthropic.Anthropic()\n\nSYSTEM_PROMPT = [\n    {\n        \"type\": \"text\",\n        \"text\": \"You are an expert software engineer with deep knowledge of distributed systems...\",\n        \"cache_control\": {\"type\": \"ephemeral\"},\n    }\n]\n\ndef prewarm_cache() -> None:\n    \"\"\"Call this at application startup or on a scheduled interval.\"\"\"\n    client.messages.create(\n        model=\"claude-opus-4-8\",\n        max_tokens=0,\n        system=SYSTEM_PROMPT,\n        messages=[{\"role\": \"user\", \"content\": \"warmup\"}],\n    )\n\ndef respond(user_message: str) -> anthropic.types.Message:\n    \"\"\"The real user request; benefits from a warm cache.\"\"\"\n    return client.messages.create(\n        model=\"claude-opus-4-8\",\n        max_tokens=1024,\n        system=SYSTEM_PROMPT,\n        messages=[{\"role\": \"user\", \"content\": user_message}],\n    )\n\n# Warm the cache before any user traffic arrives.\nprewarm_cache()\n\n# Later, when the user submits a message, the system-prompt prefix is already cached.\nresponse = respond(\"How do I implement a binary search tree?\")\nprint(response.content[0].text)\n```\n\nKeep in mind that the cache TTL still applies. For the default 5-minute cache, send a new pre-warm request at least every 5 minutes to keep the cache warm. 
For longer gaps between user requests, use the [1-hour cache duration](#1-hour-cache-duration) instead.\n\nA `max_tokens: 0`\n\nrequest is rejected with an `invalid_request_error`\n\nif any of the following are set, since each implies output that a zero-token budget cannot produce:\n\n`stream: true`\n\n`thinking.type: \"enabled\"`\n\n)`output_config.format`\n\n)`tool_choice`\n\nof `{\"type\": \"tool\", ...}`\n\nor `{\"type\": \"any\"}`\n\n`max_tokens: 0`\n\nis also rejected inside a [Message Batches](/docs/en/build-with-claude/batch-processing) request. Pre-warming targets time-to-first-token, which does not apply to batch processing, and a cache entry written during batch processing would likely expire before the follow-up request runs.\n\nBefore `max_tokens: 0`\n\nwas available, some applications used `max_tokens: 1`\n\nwarm-up calls to achieve the same effect. The `max_tokens: 0`\n\napproach is preferred: no output is produced, so there is no single-token reply to discard, no output tokens are billed, and the intent of the request is unambiguous.\n\nTo help you get started with prompt caching, the [prompt caching cookbook](https://platform.claude.com/cookbook/misc-prompt-caching) provides detailed examples and best practices.\n\nThe following code snippets showcase various prompt caching patterns. These examples demonstrate how to implement caching in different scenarios, helping you understand the practical applications of this feature:\n\nPrompt caching (both automatic and explicit) is ZDR eligible. Anthropic does not store the raw text of your prompts or Claude's responses.\n\nKV (key-value) cache representations and cryptographic hashes of cached content are held in memory only and are not stored at rest. Cached entries have a minimum lifetime of 5 minutes (standard) or 1 hour (extended), after which they are promptly, though not immediately, deleted. 
Cache entries are isolated between organizations and, on the Claude API, Claude Platform on AWS, and Microsoft Foundry, between workspaces within an organization.\n\nFor ZDR eligibility across all features, see [API and data retention](/docs/en/manage-claude/api-and-data-retention).", "url": "https://wpnews.pro/news/prompt-caching-claude-platform-docs", "canonical_source": "https://platform.claude.com/docs/en/build-with-claude/prompt-caching", "published_at": "2026-07-01 06:01:56+00:00", "updated_at": "2026-07-01 06:19:51.698283+00:00", "lang": "en", "topics": ["large-language-models", "ai-products", "ai-tools", "ai-infrastructure"], "entities": ["Anthropic", "Claude", "Claude Opus 4", "Claude Fable 5", "Claude Mythos 5"], "alternates": {"html": "https://wpnews.pro/news/prompt-caching-claude-platform-docs", "markdown": "https://wpnews.pro/news/prompt-caching-claude-platform-docs.md", "text": "https://wpnews.pro/news/prompt-caching-claude-platform-docs.txt", "jsonld": "https://wpnews.pro/news/prompt-caching-claude-platform-docs.jsonld"}}