cd /news/large-language-models/how-to-cheat-llm-context-a-lightweig… · home topics large-language-models article
[ARTICLE · art-19239] src=dev.to pub= topic=large-language-models verified=true sentiment=↑ positive

How to Cheat LLM Context: A Lightweight AI Doc Assistant Architecture

A developer building LinkShift.app, a programmable redirect and link-mapping platform, created a multi-tier AI agent architecture to avoid high API costs from large LLM context windows. The system uses a lightweight "receptionist" model to filter user queries and route only relevant documentation files to a slightly heavier model for response generation, while a micro-summary of chat history replaces full logs to keep context intact. This approach reduces token usage and API bills to pennies per query, even with zero users currently testing the platform.

read2 min publishedJun 2, 2026

Dropping your entire Markdown documentation folder into an LLM prompt sounds easy - until you see the API bill. Large contexts mean large costs, especially when users ask repetitive or highly specific questions.

When building the documentation assistant for my project, ** LinkShift.app** (a programmable redirect and link-mapping platform running on the edge), I knew the learning curve would be steep for users dealing with Regex, Liquid templates, and edge routing rules. Instead of taking the easy route and watching my API budget melt, I designed a multi-tier, ultra-low-cost AI agent architecture.

Here is how I solved token bloat and kept response times blazing fast.

Instead of throwing a massive model at the full chat history and documentation for every single query, the system filters the request through three distinct phases:

User Request -> [1. Receptionist (gpt-5.4-nano)] -> Intent Filtering & File Routing
                                                                  |
                                                                  v
User Response <- [3. Response Gen (gpt-5.4-mini)] <- [2. Inject Relevant Files (Usually 3-6)]

Feeding raw Markdown files dynamically to an LLM is incredibly inefficient.

gpt-5.4-nano

model.When a user asks a question, it doesn't touch the main, more expensive LLM right away. The first line of defense is a gpt-5.4-nano

model acting as a "receptionist." It handles two critical tasks:

The result? We only pass a fraction of the total documentation into the next stage.

Only now does the slightly heavier model, gpt-5.4-mini

, enter the scene. It ingests the user's query and only the specific files isolated by the receptionist to compile a high-quality, hallucination-free answer.

The Chat History Hack:

Keeping full chat logs in memory quickly bloats the context window. To bypass this, every time gpt-5.4-mini

responds, it also generates a single-sentence micro-summary of the conversation so far. On the next turn, I inject only this micro-summary instead of the entire chat history.

This keeps the context perfectly intact, answers lightning-fast, and the API bill down to literally pennies.

The best part about this whole setup? I spent days obsessing over this architecture, refining prompts, and stress-testing edge cases - despite currently having exactly zero users (free or paid).

It’s the classic indie hacker / software engineer trap: building a hyper-optimized, infinitely scalable infrastructure for massive traffic before making a single dollar.

On the bright side, the system is bulletproof, safe from wallet-draining exploits, and ready for whatever comes next.

If you want to test it out, try to break the receptionist guardrail, or just see how it handles technical queries, feel free to play with it here: linkshift.app/docs.

How do you handle context costs in your own LLM projects? Do you use a similar routing system, or do you prefer standard vector databases (RAG)? Let’s discuss in the comments!

── more in #large-language-models 4 stories · sorted by recency
sponsored brought to you by zahid.host 4,200+ EU-deployed projects
reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain — perfect for shipping the agent you just read about.

$git push zahid main
Live at https://your-agent.zahid.host
Get free account → Pricing
from €0/mo · no card required
LIVE [news/how-to-cheat-llm-con…] indexed:0 read:2min 2026-06-02 ·