Fitting LLM Reply Suggestions Into Every Provider's Prompt Cache, Without Structured Output

A developer implemented inline markers embedded in LLM responses to generate reply suggestions for a voice roleplay chat, avoiding structured output, stream interruption, and cache invalidation. The system appends `{{SUGGEST: option1 | option2 | option3}}` at the end of each AI response, then extracts and strips the markers server-side before sending the cleaned text to the user and storing it in history. This approach preserves prompt caching across providers like Grok and Anthropic by keeping the conversation prefix unchanged, while treating suggestions as ephemeral UI elements that do not pollute future context.

I wanted to add reply suggestions to a voice roleplay chat: the classic UX where three "you could say this next" chips appear under each AI response. Sounds simple. But when your chat is built around streaming and prompt caching, every obvious approach turns out to be a bad fit. I ended up going with the unglamorous move of embedding inline markers in the response and stripping them out afterward. The path to that decision was interesting enough to write up.

What I wanted to build: three "you could say this" chips per AI response, with no structured output, no stream interruption, and no cache invalidation.

Keeping token costs down in an LLM chat comes down to caching, and every provider does it differently.

- Send system + full history + user every time and ride the server's prompt cache (hit tokens etc.).
- The `x-grok-conv-id` header ties requests to the same conversation, keeping them pinned to the cache.

The common thread: the conversation prefix (persona + history) should be reused as much as possible. Anything that disturbs that prefix hurts both cost and latency.

The natural-looking approach to fetching three suggestions would be something like `{"reply": "...", "suggestions": ["...", "...", "..."]}`. I ruled it out for two reasons: latency and a conflict with streaming.

A.
Separate API call to generate suggestions

Fire a second request after the main turn. The prefix would likely hit the cache again, but there's an extra round-trip, and maintaining cache consistency (across Grok's conv-id, implicit prefix caches, etc.) becomes your problem.

B. Structured output, bundled in the main turn

No second request, so cache consistency is trivial. But it is ruled out for the reasons above (latency + streaming conflict).

C. Inline markers, bundled in the main turn (chosen)

Ask the model to append `{{SUGGEST: option1 | option2 | option3}}` at the very end of its response, and extract it server-side.

- `{{SUGGEST}}` trickles out at the end. Generation finishes while the user is listening.
- Existing markers: `{{SHOW: label}}`, `{{POSE: ...}}`, and `{{IMAGE: ...}}`, plus a pipeline for extracting and stripping them. Suggestions are just one more entry in that system. Design stays consistent.

The important part: once extracted, the marker must be removed from both the TTS/display text and the DB history. Suggestions are ephemeral UI scaffolding, not part of the character's actual speech, and leaving them in history would pollute context for future turns.

// Extract {{SUGGEST: a | b | c}} and remove it entirely from the body
static RE_SUGGEST: Lazy
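
To make the extract-and-strip step concrete, here is a minimal sketch of what it could look like. It is not the post's implementation: it swaps the regex for plain `std` string search so it compiles with no external crates, and the function name `extract_suggestions` is hypothetical. It assumes the model emits at most one marker, at the end of the reply.

```rust
/// Hypothetical helper (not from the original post): splits a raw model
/// reply into (clean_text, suggestions). Uses plain string search instead
/// of a regex so that it depends only on the standard library.
pub fn extract_suggestions(raw: &str) -> (String, Vec<String>) {
    const OPEN: &str = "{{SUGGEST:";
    const CLOSE: &str = "}}";

    // Look for the last marker, since the prompt asks for it at the very end.
    let start = match raw.rfind(OPEN) {
        Some(i) => i,
        // No marker at all: the whole reply is body text.
        None => return (raw.trim().to_string(), Vec::new()),
    };
    let inner_start = start + OPEN.len();

    let inner_len = match raw[inner_start..].find(CLOSE) {
        Some(n) => n,
        // Unterminated marker (for example a truncated generation):
        // drop the partial marker rather than leak it into TTS or history.
        None => return (raw[..start].trim().to_string(), Vec::new()),
    };

    // Split "a | b | c" into trimmed, non-empty options.
    let suggestions: Vec<String> = raw[inner_start..inner_start + inner_len]
        .split('|')
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .map(String::from)
        .collect();

    // Rebuild the body without the marker.
    let marker_end = inner_start + inner_len + CLOSE.len();
    let mut clean = String::with_capacity(raw.len());
    clean.push_str(&raw[..start]);
    clean.push_str(&raw[marker_end..]);

    (clean.trim().to_string(), suggestions)
}
```

In this sketch the first tuple element is what would go to TTS, the display, and the DB history, while the second goes only to the UI chips. That split is what keeps the stored conversation prefix free of suggestion text.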