Dropping your entire Markdown documentation folder into an LLM prompt sounds easy - until you see the API bill. Large contexts mean large costs, especially when users ask repetitive or highly specific questions.
When building the documentation assistant for my project, **LinkShift.app** (a programmable redirect and link-mapping platform running on the edge), I knew the learning curve would be steep for users dealing with Regex, Liquid templates, and edge routing rules. Instead of taking the easy route and watching my API budget melt, I designed a multi-tier, ultra-low-cost AI agent architecture.
Here is how I solved token bloat and kept response times blazing fast.
Instead of throwing a massive model at the full chat history and documentation for every single query, the system filters the request through three distinct phases:
    User Request  -> [1. Receptionist (gpt-5.4-nano)]  -> Intent Filtering & File Routing
                                                                        |
                                                                        v
    User Response <- [3. Response Gen (gpt-5.4-mini)]  <- [2. Inject Relevant Files (Usually 3-6)]
Feeding raw Markdown files dynamically to an LLM is incredibly inefficient.
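When a user asks a question, the request doesn't touch the main, more expensive LLM right away. The first line of defense is a `gpt-5.4-nano` model acting as a "receptionist." It handles two critical tasks: intent filtering (deciding whether the request is a genuine documentation question at all) and file routing (picking the documentation files needed to answer it).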
The result? We only pass a fraction of the total documentation into the next stage.
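A minimal sketch of what such a receptionist step could look like, not the actual LinkShift implementation. The file index, the JSON reply shape, and the `call_model` callable (a stand-in for whatever client sends a prompt to the small model and returns its text) are all assumptions made for illustration.

```python
import json
from typing import Callable, List, Optional

# Hypothetical file index: file name -> one-line description shown to the small model.
DOC_INDEX = {
    "regex-rules.md": "Matching incoming URLs with regular expressions",
    "liquid-templates.md": "Building redirect targets with Liquid templates",
    "edge-routing.md": "How routing rules are evaluated on the edge",
    "billing.md": "Plans, limits, and invoices",
}

MAX_FILES = 6  # the post mentions that usually 3-6 files get injected


def build_receptionist_prompt(question: str) -> str:
    """Ask the small model for an intent verdict plus a list of file names."""
    index = "\n".join(f"- {name}: {desc}" for name, desc in DOC_INDEX.items())
    return (
        "You route questions for a documentation assistant.\n"
        'Reply with JSON only: {"on_topic": true|false, "files": ["name.md", ...]}\n'
        f"Available files:\n{index}\n\n"
        f"Question: {question}"
    )


def route(question: str, call_model: Callable[[str], str]) -> Optional[List[str]]:
    """Return the files to inject, or None when the request should be refused."""
    raw = call_model(build_receptionist_prompt(question))
    try:
        verdict = json.loads(raw)
    except json.JSONDecodeError:
        return None  # unparseable reply: fail closed, never reach the larger model
    if not isinstance(verdict, dict) or verdict.get("on_topic") is not True:
        return None
    files = verdict.get("files")
    if not isinstance(files, list):
        return None
    # Keep only names that exist in the index, so the model cannot invent a file.
    known = [name for name in files if isinstance(name, str) and name in DOC_INDEX]
    return known[:MAX_FILES]
```

Sending the small model only a one-line description per file, rather than the file contents, is what keeps this first call cheap. Failing closed on anything unexpected means a malformed or off-topic request never costs a call to the larger model.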
Only now does the slightly heavier model, `gpt-5.4-mini`, enter the scene. It ingests the user's query and only the specific files isolated by the receptionist, then compiles a high-quality answer grounded in that documentation.
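As a rough illustration of the injection step, the sketch below assembles the prompt for the larger model from only the selected files. The function name, the prompt wording, and the in-memory `docs` dictionary are hypothetical; the real system may load and format files differently.

```python
from typing import Dict, List


def build_answer_prompt(question: str, selected: List[str], docs: Dict[str, str]) -> str:
    """Build the prompt for the larger model from the routed files only."""
    # Skip names that are not in the docs store instead of raising a KeyError.
    sections = [f"### {name}\n{docs[name]}" for name in selected if name in docs]
    parts = [
        "Answer using only the documentation below. "
        "If it does not cover the question, say so.",
        "\n\n".join(sections),
        f"Question: {question}",
    ]
    return "\n\n".join(parts)


if __name__ == "__main__":
    docs = {
        "regex-rules.md": "Rules are matched top to bottom.",
        "billing.md": "Invoices are issued monthly.",
    }
    # Only regex-rules.md ends up in the prompt; billing.md is left out.
    print(build_answer_prompt("In what order are rules matched?", ["regex-rules.md"], docs))
```

Telling the model to admit when the injected files do not cover the question is one common way to reduce made-up answers, though it is not a guarantee.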
The Chat History Hack:
Keeping full chat logs in memory quickly bloats the context window. To bypass this, every time `gpt-5.4-mini` responds, it also generates a single-sentence micro-summary of the conversation so far. On the next turn, I inject only this micro-summary instead of the entire chat history.
This keeps the essential context available, the answers fast, and the API bill down to pennies.
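The micro-summary loop might be sketched like this, assuming the model is asked to return its answer and the updated summary together as JSON. The `Conversation` class, the reply shape, and `call_model` (a stand-in for the real API client) are illustrative assumptions, not the author's code.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class Conversation:
    summary: str = ""  # the only piece of history carried between turns


def build_turn_prompt(conv: Conversation, question: str) -> str:
    """Inject the one-sentence summary instead of the full chat log."""
    return (
        'Reply with JSON only: {"answer": "...", '
        '"summary": "one sentence covering the whole conversation so far"}\n'
        f"Summary of the conversation so far: {conv.summary or '(none yet)'}\n"
        f"Question: {question}"
    )


def ask(conv: Conversation, question: str, call_model: Callable[[str], str]) -> str:
    """Run one turn and replace the stored summary with the model's new one."""
    raw = call_model(build_turn_prompt(conv, question))
    try:
        reply = json.loads(raw)
    except json.JSONDecodeError:
        return raw  # not JSON: show the text as-is and keep the previous summary
    if not isinstance(reply, dict):
        return raw
    new_summary = reply.get("summary")
    if isinstance(new_summary, str) and new_summary.strip():
        conv.summary = new_summary.strip()
    answer = reply.get("answer")
    return answer if isinstance(answer, str) else raw
```

Because each turn overwrites the summary rather than appending to it, the prompt size stays roughly constant no matter how long the conversation runs. The trade-off is that details the model leaves out of its one sentence are gone for later turns.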
The best part about this whole setup? I spent days obsessing over this architecture, refining prompts, and stress-testing edge cases - despite currently having exactly zero users (free or paid).
It’s the classic indie hacker / software engineer trap: building a hyper-optimized, infinitely scalable infrastructure for massive traffic before making a single dollar.
On the bright side, the system is bulletproof, safe from wallet-draining exploits, and ready for whatever comes next.
If you want to test it out, try to break the receptionist guardrail, or just see how it handles technical queries, feel free to play with it here: linkshift.app/docs.
How do you handle context costs in your own LLM projects? Do you use a similar routing system, or do you prefer standard vector databases (RAG)? Let’s discuss in the comments!