{"slug": "how-to-cheat-llm-context-a-lightweight-ai-doc-assistant-architecture", "title": "How to Cheat LLM Context: A Lightweight AI Doc Assistant Architecture", "summary": "A developer building LinkShift.app, a programmable redirect and link-mapping platform, created a multi-tier AI agent architecture to avoid high API costs from large LLM context windows. The system uses a lightweight \"receptionist\" model to filter user queries and route only relevant documentation files to a slightly heavier model for response generation, while a micro-summary of chat history replaces full logs to keep context intact. This approach reduces token usage and API bills to pennies per query, even with zero users currently testing the platform.", "body_md": "Dropping your entire Markdown documentation folder into an LLM prompt sounds easy - until you see the API bill. Large contexts mean large costs, especially when users ask repetitive or highly specific questions.\n\nWhen building the documentation assistant for my project, **LinkShift.app** (a programmable redirect and link-mapping platform running on the edge), I knew the learning curve would be steep for users dealing with Regex, Liquid templates, and edge routing rules. Instead of taking the easy route and watching my API budget melt, I designed a multi-tier, ultra-low-cost AI agent architecture.\n\nHere is how I solved token bloat and kept response times blazing fast.\n\nInstead of throwing a massive model at the full chat history and documentation for every single query, the system filters the request through three distinct phases:\n\n```text\nUser Request -> [1. Receptionist (gpt-5.4-nano)] -> Intent Filtering & File Routing\n                                                                  |\n                                                                  v\nUser Response <- [3. Response Gen (gpt-5.4-mini)] <- [2. 
Inject Relevant Files (Usually 3-6)]\n```\n\nFeeding raw Markdown files dynamically to an LLM is incredibly inefficient.\n\nWhen a user asks a question, the query doesn't touch the main, more expensive LLM right away. The first line of defense is a `gpt-5.4-nano` model acting as a \"receptionist.\" It handles two critical tasks: intent filtering and file routing, that is, picking the documentation files relevant to the query (usually 3-6).\n\nThe result? We only pass a fraction of the total documentation into the next stage.\n\nOnly now does the slightly heavier model, `gpt-5.4-mini`, enter the scene. It ingests the user's query and only the specific files isolated by the receptionist to compile a high-quality, hallucination-free answer.\n\n**The Chat History Hack:**\n\nKeeping full chat logs in memory quickly bloats the context window. To bypass this, every time `gpt-5.4-mini` responds, it also generates a single-sentence micro-summary of the conversation so far. On the next turn, I inject *only* this micro-summary instead of the entire chat history.\n\nThis keeps the context intact, the answers fast, and the API bill down to pennies per query.\n\nThe best part about this whole setup? I spent days obsessing over this architecture, refining prompts, and stress-testing edge cases - despite currently having exactly **zero users** (free or paid).\n\nIt’s the classic indie hacker / software engineer trap: building a hyper-optimized, infinitely scalable infrastructure for massive traffic before making a single dollar.\n\nOn the bright side, the system is bulletproof, safe from wallet-draining exploits, and ready for whatever comes next.\n\nIf you want to test it out, try to break the receptionist guardrail, or just see how it handles technical queries, feel free to play with it here: [linkshift.app/docs](https://linkshift.app/docs).\n\n**How do you handle context costs in your own LLM projects? Do you use a similar routing system, or do you prefer standard vector databases (RAG)? 
Let’s discuss in the comments!**", "url": "https://wpnews.pro/news/how-to-cheat-llm-context-a-lightweight-ai-doc-assistant-architecture", "canonical_source": "https://dev.to/p-zielinski/how-to-cheat-llm-context-a-lightweight-ai-doc-assistant-architecture-3hl1", "published_at": "2026-06-02 19:48:40+00:00", "updated_at": "2026-06-02 20:11:57.865689+00:00", "lang": "en", "topics": ["large-language-models", "ai-agents", "ai-tools", "ai-infrastructure", "ai-products"], "entities": ["LinkShift.app", "gpt-5.4-nano", "gpt-5.4-mini"], "alternates": {"html": "https://wpnews.pro/news/how-to-cheat-llm-context-a-lightweight-ai-doc-assistant-architecture", "markdown": "https://wpnews.pro/news/how-to-cheat-llm-context-a-lightweight-ai-doc-assistant-architecture.md", "text": "https://wpnews.pro/news/how-to-cheat-llm-context-a-lightweight-ai-doc-assistant-architecture.txt", "jsonld": "https://wpnews.pro/news/how-to-cheat-llm-context-a-lightweight-ai-doc-assistant-architecture.jsonld"}}