{"slug": "fitting-llm-reply-suggestions-into-every-provider-s-prompt-cache-without-output", "title": "Fitting LLM Reply Suggestions Into Every Provider's Prompt Cache — Without Structured Output", "summary": "A developer implemented inline markers embedded in LLM responses to generate reply suggestions for a voice roleplay chat, avoiding structured output, stream interruption, and cache invalidation. The system appends `{{SUGGEST: option1 | option2 | option3}}` at the end of each AI response, then extracts and strips the markers server-side before sending the cleaned text to the user and storing it in history. This approach preserves prompt caching across providers like Grok and Anthropic by keeping the conversation prefix unchanged, while treating suggestions as ephemeral UI elements that do not pollute future context.", "body_md": "I wanted to add **reply suggestions** to a voice roleplay chat — the classic UX where three \"you could say this next\" chips appear under each AI response. Sounds simple. But when your chat is built around streaming and prompt caching, every obvious approach turns out to be a bad fit.\n\nI ended up going with the unglamorous move of **embedding inline markers in the response and stripping them out afterward**. The path to that decision was interesting enough to write up.\n\n*What I wanted to build: three \"you could say this\" chips per AI response — no structured output, no stream interruption, no cache invalidation.*\n\nKeeping token costs down in an LLM chat comes down to caching, and every provider does it differently.\n\n- `system + full history + user` every time and ride the server's `prompt_cache_hit_tokens` etc.).\n- The `x-grok-conv-id` header ties requests to the same conversation, keeping them pinned to the cache.\n\nThe common thread: the conversation prefix (persona + history) should be reused as much as possible. 
Anything that disturbs that prefix hurts both cost and latency.\n\nThe natural-looking approach to fetching three suggestions would be structured output, something like `{\"reply\": \"...\", \"suggestions\": [\"...\", \"...\", \"...\"]}`. I ruled it out for two reasons: latency and a conflict with streaming.\n\n**A. Separate API call to generate suggestions**\n\nFire a second request after the main turn. The prefix would likely hit the cache again, but there's an extra round-trip, and maintaining cache consistency — across Grok's conv-id, implicit prefix caches, etc. — becomes your problem.\n\n**B. Structured output, bundled in the main turn**\n\nNo second request, so cache consistency is trivial. But I ruled it out for the reasons above (latency + streaming conflict).\n\n**C. Inline markers, bundled in the main turn** *(chosen)*\n\nAsk the model to append `{{SUGGEST: option1 | option2 | option3}}` at the very end of its response, and extract it server-side.\n\n- `{{SUGGEST}}` trickles out at the end. Generation finishes while the user is listening.\n- The chat already has markers such as `{{SHOW: label}}`, `{{POSE: ...}}`, and `{{IMAGE: ...}}`, plus a pipeline for extracting and stripping them. Suggestions are just one more entry in that system. Design stays consistent.\n\nThe important part: once extracted, the marker must be **removed from both the TTS/display text and the DB history**. 
Suggestions are ephemeral UI scaffolding, not part of the character's actual speech — leaving them in history would pollute context for future turns.\n\n```\n// Assumes the regex and once_cell crates as dependencies\nuse once_cell::sync::Lazy;\nuse regex::Regex;\n\n// Extract {{SUGGEST: a | b | c}} and remove it entirely from the body\nstatic RE_SUGGEST: Lazy<Regex> =\n    Lazy::new(|| Regex::new(r\"(?is)\\{\\{\\s*SUGGEST\\s*:\\s*([\\s\\S]*?)\\}\\}\").unwrap());\n\nfn extract_suggest(text: &str) -> (String, Vec<String>) {\n    match RE_SUGGEST.captures(text) {\n        Some(cap) => {\n            // Split the captured payload on '|', keeping at most three non-empty options\n            let suggestions = cap[1]\n                .split('|')\n                .map(|s| s.trim().to_string())\n                .filter(|s| !s.is_empty())\n                .take(3)\n                .collect();\n            // Strip every SUGGEST marker from the text, not just the first match\n            let clean = RE_SUGGEST.replace_all(text, \"\").trim().to_string();\n            (clean, suggestions)\n        }\n        None => (text.to_string(), Vec::new()),\n    }\n}\n```\n\nThis is where the existing **\"store annotated / display clean\"** separation pays off. In this chat:\n\n- The `ai_text` returned to the client (display + TTS) is fully stripped of all markers.\n- The history stored in the DB keeps the `{{SHOW}}`/`{{POSE}}` markers (so the model keeps seeing its own canonical format in history and continues using it correctly).\n- `{{SUGGEST}}` is different from `{{SHOW}}`/`{{POSE}}` — **it doesn't go back into the DB at all**. It's ephemeral. 
Choosing per marker whether to persist or discard let suggestions slot in cleanly without touching anything else.\n\nOn the prompt side, it's just one extra block gated by a feature flag in the persona config:\n\n```\nAt the very end of your response, add exactly three short replies the user\nmight say next, in this format:\n{{SUGGEST: option1 | option2 | option3}}\n- Always place it last (after any {{SHOW}}/{{POSE}} markers)\n- Write each option in first person, casual, short\n- Vary the direction: one enthusiastic, one deflecting, one asking a question back\n```\n\nImplicit prefix caches hit when the token sequence at the start of a request matches a previously seen prefix. The marker approach simply **generates suggestions as part of the current turn's response** — the next turn's input prefix (system + history) is identical to what it would be in a plain conversation. The prefix keeps hitting the cache normally. The suggestions never touch the prefix at all. That's a quiet but important property.\n\nThe costs: output tokens increase by a few dozen, and occasionally the model mangles the marker format (same risk level as `{{SHOW}}`/`{{POSE}}`). 
Both are acceptable.\n\n*This chat is part of kotonia, a voice roleplay product running multilingual TTS × lip-sync avatars on a local GPU.*", "url": "https://wpnews.pro/news/fitting-llm-reply-suggestions-into-every-provider-s-prompt-cache-without-output", "canonical_source": "https://dev.to/shinji_shimizu_bb51276a5e/fitting-llm-reply-suggestions-into-every-providers-prompt-cache-without-structured-output-18f6", "published_at": "2026-05-31 09:51:11+00:00", "updated_at": "2026-05-31 10:12:36.547061+00:00", "lang": "en", "topics": ["large-language-models", "generative-ai", "ai-products", "ai-tools", "ai-infrastructure"], "entities": ["Grok"], "alternates": {"html": "https://wpnews.pro/news/fitting-llm-reply-suggestions-into-every-provider-s-prompt-cache-without-output", "markdown": "https://wpnews.pro/news/fitting-llm-reply-suggestions-into-every-provider-s-prompt-cache-without-output.md", "text": "https://wpnews.pro/news/fitting-llm-reply-suggestions-into-every-provider-s-prompt-cache-without-output.txt", "jsonld": "https://wpnews.pro/news/fitting-llm-reply-suggestions-into-every-provider-s-prompt-cache-without-output.jsonld"}}