Fitting LLM Reply Suggestions Into Every Provider's Prompt Cache — Without Structured Output

A developer implemented inline markers embedded in LLM responses to generate reply suggestions for a voice roleplay chat, avoiding structured output, stream interruption, and cache invalidation. The system appends `{{SUGGEST: option1 | option2 | option3}}` at the end of each AI response, then extracts and strips the markers server-side before sending the cleaned text to the user and storing it in history. This approach preserves prompt caching across providers like Grok and Anthropic by keeping the conversation prefix unchanged, while treating suggestions as ephemeral UI elements that do not pollute future context.

I wanted to add reply suggestions to a voice roleplay chat: the classic UX where three "you could say this next" chips appear under each AI response. Sounds simple. But when your chat is built around streaming and prompt caching, every obvious approach turns out to be a bad fit. I ended up going with the unglamorous move of embedding inline markers in the response and stripping them out afterward. The path to that decision was interesting enough to write up.

What I wanted to build: three "you could say this" chips per AI response, with no structured output, no stream interruption, and no cache invalidation.

Keeping token costs down in an LLM chat comes down to caching, and every provider does it differently. With implicit prefix caching, you send system + full history + user every time and ride the server's prompt cache (visible as cache-hit tokens, etc.). With Grok, the `x-grok-conv-id` header ties requests to the same conversation, keeping them pinned to the cache.

The common thread: the conversation prefix (persona + history) should be reused as much as possible. Anything that disturbs that prefix hurts both cost and latency.

The natural-looking approach to fetching three suggestions would be structured output, something like `{"reply": "...", "suggestions": ["...", "...", "..."]}`. I ruled it out for two reasons: latency and a conflict with streaming.

The options, side by side:

A. Separate API call to generate suggestions. Fire a second request after the main turn. The prefix would likely hit the cache again, but there's an extra round-trip, and maintaining cache consistency (across Grok's conv-id, implicit prefix caches, etc.) becomes your problem.

B. Structured output, bundled in the main turn. No second request, so cache consistency is trivial. But it is ruled out for the reasons above (latency + streaming conflict).

C. Inline markers, bundled in the main turn (chosen). Ask the model to append `{{SUGGEST: option1 | option2 | option3}}` at the very end of its response, and extract it server-side.

Option C keeps streaming intact: the spoken body streams first and `{{SUGGEST}}` trickles out at the end, so generation finishes while the user is listening.
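The article does not show how the stream itself is handled, so the following is only a sketch of one way the "marker trickles out at the end" behaviour could work: forward text to the client as it arrives, but hold back anything that is, or might be the start of, a `{{SUGGEST` marker. The names `SuggestHoldback`, `push`, and `finish` are hypothetical, and matching is exact-case for brevity (the regex used for extraction is more lenient).

```rust
const OPEN: &str = "{{SUGGEST";

/// Splits a streamed response into text that is safe to forward now and a
/// held-back tail that contains (or may be starting) the suggestion marker.
#[derive(Default)]
struct SuggestHoldback {
    pending: String, // text not yet released to the client
    in_marker: bool, // true once "{{SUGGEST" has been seen
}

impl SuggestHoldback {
    /// Feed one streamed chunk; returns the text that can be forwarded now.
    fn push(&mut self, chunk: &str) -> String {
        self.pending.push_str(chunk);
        if self.in_marker {
            // Everything after the marker opening stays held back.
            return String::new();
        }
        if let Some(pos) = self.pending.find(OPEN) {
            // Release the text before the marker, keep the marker itself.
            self.in_marker = true;
            let tail = self.pending.split_off(pos);
            return std::mem::replace(&mut self.pending, tail);
        }
        // A chunk may end mid-marker (e.g. "{{SU"), so keep the longest
        // suffix that is a proper prefix of OPEN and release the rest.
        let keep = longest_partial_suffix(&self.pending);
        let tail = self.pending.split_off(self.pending.len() - keep);
        std::mem::replace(&mut self.pending, tail)
    }

    /// Call at end of stream; returns the held-back tail for extraction.
    fn finish(self) -> String {
        self.pending
    }
}

/// Length of the longest suffix of `s` that is a proper prefix of OPEN.
fn longest_partial_suffix(s: &str) -> usize {
    (1..OPEN.len())
        .rev()
        .find(|&n| s.ends_with(&OPEN[..n]))
        .unwrap_or(0)
}
```

With this shape, the TTS and display path never sees marker text, and the tail returned by `finish` can be handed to whatever extraction step the server already runs.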
The chat already has a family of inline markers, `{{SHOW: label}}`, `{{POSE: ...}}`, and `{{IMAGE: ...}}`, plus a pipeline for extracting and stripping them. Suggestions are just one more entry in that system, so the design stays consistent.

The important part: once extracted, the marker must be removed from both the TTS/display text and the DB history. Suggestions are ephemeral UI scaffolding, not part of the character's actual speech; leaving them in history would pollute context for future turns.

    // Assumed imports: the original snippet does not show them.
    use once_cell::sync::Lazy;
    use regex::Regex;

    // Extract {{SUGGEST: a | b | c}} and remove it entirely from the body
    static RE_SUGGEST: Lazy<Regex> =
        Lazy::new(|| Regex::new(r"(?is)\{\{\s*SUGGEST\s*:\s*([\s\S]*?)\}\}").unwrap());

    fn extract_suggest(text: &str) -> (String, Vec<String>) {
        match RE_SUGGEST.captures(text) {
            Some(cap) => {
                // Split on '|', drop empty entries, keep at most three options
                let suggestions = cap[1]
                    .split('|')
                    .map(|s| s.trim().to_string())
                    .filter(|s| !s.is_empty())
                    .take(3)
                    .collect();
                let clean = RE_SUGGEST.replace_all(text, "").trim().to_string();
                (clean, suggestions)
            }
            None => (text.to_string(), Vec::new()),
        }
    }

This is where the existing "store annotated / display clean" separation pays off. In this chat, the `ai_text` returned to the client (display + TTS) is fully stripped of all markers. The history stored in the DB keeps the `{{SHOW}}` / `{{POSE}}` markers so the model keeps seeing its own canonical format in history (and continues using it correctly).

`{{SUGGEST}}` is different from `{{SHOW}}` / `{{POSE}}`: it doesn't go back into the DB at all. It's ephemeral. Choosing per marker whether to persist or discard let suggestions slot in cleanly without touching anything else.
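As an illustration of the per-marker persist-or-discard idea, here is a minimal sketch that turns one raw model response into three outputs: clean display text, history text that keeps only persistent markers, and the suggestion list. The names `Persist`, `policy`, `Turn`, and `split_turn` are hypothetical, and the hand-rolled scanner (standard library only, no whitespace normalization) is a simplification, not the chat's actual pipeline.

```rust
/// What to do with a marker when writing the turn to history.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Persist {
    Keep,
    Discard,
}

/// Hypothetical policy table: which markers survive into stored history.
fn policy(name: &str) -> Persist {
    match name {
        "SUGGEST" => Persist::Discard, // ephemeral UI scaffolding
        _ => Persist::Keep,            // SHOW, POSE, IMAGE, ...
    }
}

struct Turn {
    display: String,          // sent to the client (display + TTS): no markers
    history: String,          // stored in the DB: persistent markers only
    suggestions: Vec<String>, // shown as chips, never stored
}

fn split_turn(raw: &str) -> Turn {
    let mut display = String::new();
    let mut history = String::new();
    let mut suggestions = Vec::new();
    let mut rest = raw;
    while let Some(start) = rest.find("{{") {
        // An unterminated marker is left in place as plain text.
        let Some(len) = rest[start..].find("}}") else { break };
        let end = start + len + 2;
        let marker = &rest[start..end]; // e.g. "{{POSE: wave}}"
        let inner = &marker[2..marker.len() - 2];
        let (name, body) = inner.split_once(':').unwrap_or((inner, ""));
        let name = name.trim().to_uppercase();

        // Text before the marker goes to both outputs.
        display.push_str(&rest[..start]);
        history.push_str(&rest[..start]);
        if policy(&name) == Persist::Keep {
            history.push_str(marker);
        }
        if name == "SUGGEST" {
            suggestions = body
                .split('|')
                .map(|s| s.trim().to_string())
                .filter(|s| !s.is_empty())
                .take(3)
                .collect();
        }
        rest = &rest[end..];
    }
    display.push_str(rest);
    history.push_str(rest);
    Turn {
        display: display.trim().to_string(),
        history: history.trim().to_string(),
        suggestions,
    }
}
```

Adding a new marker type in this sketch means adding one arm to `policy`; the display path strips every marker regardless.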
On the prompt side, it's just one extra block gated by a feature flag in the persona config:

    At the very end of your response, add exactly three short replies the user might say next, in this format:
    {{SUGGEST: option1 | option2 | option3}}
    - Always place it last (after any {{SHOW}}/{{POSE}} markers)
    - Write each option in first person, casual, short
    - Vary the direction: one enthusiastic, one deflecting, one asking a question back

Implicit prefix caches hit when the token sequence at the start of a request matches a previously seen prefix. The marker approach simply generates suggestions as part of the current turn's response, so the next turn's input prefix (system + history) is identical to what it would be in a plain conversation. The prefix keeps hitting the cache normally. The suggestions never touch the prefix at all. That's a quiet but important property.

The costs: output tokens increase by a few dozen, and occasionally the model mangles the marker format (same risk level as `{{SHOW}}` / `{{POSE}}`). Both are acceptable.

This chat is part of kotonia, a voice roleplay product running multilingual TTS × lip-sync avatars on a local GPU.
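To make the prefix argument above concrete, here is a small sketch that builds two consecutive requests and checks that the first is an exact prefix of the second. `Msg`, `build_request`, and `shared_prefix_len` are hypothetical names, and comparing whole messages is only a stand-in for the token-level matching a real prompt cache performs.

```rust
#[derive(Clone, PartialEq, Debug)]
struct Msg {
    role: &'static str,
    content: String,
}

/// system + full history + new user message, sent on every turn.
fn build_request(system: &str, history: &[Msg], user: &str) -> Vec<Msg> {
    let mut req = vec![Msg { role: "system", content: system.to_string() }];
    req.extend_from_slice(history);
    req.push(Msg { role: "user", content: user.to_string() });
    req
}

/// Number of leading messages two requests share: a rough proxy for the
/// prefix a cache could reuse.
fn shared_prefix_len(a: &[Msg], b: &[Msg]) -> usize {
    a.iter().zip(b).take_while(|(x, y)| x == y).count()
}

/// Simulates two turns. The assistant reply is stored with its
/// {{SUGGEST}} marker already stripped, so history only ever grows at the end.
fn two_turns() -> (Vec<Msg>, Vec<Msg>) {
    let system = "persona + suggestion instructions";
    let mut history: Vec<Msg> = Vec::new();

    let req1 = build_request(system, &history, "hello");

    history.push(Msg { role: "user", content: "hello".to_string() });
    history.push(Msg { role: "assistant", content: "Hi! Ready to go?".to_string() });

    // The user taps a suggestion chip; it arrives as an ordinary user message.
    let req2 = build_request(system, &history, "Sure!");
    (req1, req2)
}
```

In this sketch every message of the first request reappears unchanged at the start of the second, which is the condition an implicit prefix cache needs in order to hit.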