How to Cheat LLM Context: A Lightweight AI Doc Assistant Architecture

wpnews.pro

cd /news/large-language-models/how-to-cheat-llm-context-a-lightweig… · home › topics › large-language-models › article

[ARTICLE · art-19239] src=dev.to ↗ pub=2026-06-02T19:48Z topic=large-language-models verified=true sentiment=↑ positive

How to Cheat LLM Context: A Lightweight AI Doc Assistant Architecture

A developer building LinkShift.app, a programmable redirect and link-mapping platform, created a multi-tier AI agent architecture to avoid high API costs from large LLM context windows. The system uses a lightweight "receptionist" model to filter user queries and route only relevant documentation files to a slightly heavier model for response generation, while a micro-summary of chat history replaces full logs to keep context intact. This approach reduces token usage and API bills to pennies per query, even with zero users currently testing the platform.

read2 min views21 publishedJun 2, 2026

Dropping your entire Markdown documentation folder into an LLM prompt sounds easy - until you see the API bill. Large contexts mean large costs, especially when users ask repetitive or highly specific questions.

When building the documentation assistant for my project, ** LinkShift.app** (a programmable redirect and link-mapping platform running on the edge), I knew the learning curve would be steep for users dealing with Regex, Liquid templates, and edge routing rules. Instead of taking the easy route and watching my API budget melt, I designed a multi-tier, ultra-low-cost AI agent architecture.

Here is how I solved token bloat and kept response times blazing fast.

Instead of throwing a massive model at the full chat history and documentation for every single query, the system filters the request through three distinct phases:

User Request -> [1. Receptionist (gpt-5.4-nano)] -> Intent Filtering & File Routing
                                                                  |
                                                                  v
User Response <- [3. Response Gen (gpt-5.4-mini)] <- [2. Inject Relevant Files (Usually 3-6)]

Feeding raw Markdown files dynamically to an LLM is incredibly inefficient.

gpt-5.4-nano

model.When a user asks a question, it doesn't touch the main, more expensive LLM right away. The first line of defense is a gpt-5.4-nano

model acting as a "receptionist." It handles two critical tasks:

The result? We only pass a fraction of the total documentation into the next stage.

Only now does the slightly heavier model, gpt-5.4-mini

, enter the scene. It ingests the user's query and only the specific files isolated by the receptionist to compile a high-quality, hallucination-free answer.

The Chat History Hack:

Keeping full chat logs in memory quickly bloats the context window. To bypass this, every time gpt-5.4-mini

responds, it also generates a single-sentence micro-summary of the conversation so far. On the next turn, I inject only this micro-summary instead of the entire chat history.

This keeps the context perfectly intact, answers lightning-fast, and the API bill down to literally pennies.

The best part about this whole setup? I spent days obsessing over this architecture, refining prompts, and stress-testing edge cases - despite currently having exactly zero users (free or paid).

It’s the classic indie hacker / software engineer trap: building a hyper-optimized, infinitely scalable infrastructure for massive traffic before making a single dollar.

On the bright side, the system is bulletproof, safe from wallet-draining exploits, and ready for whatever comes next.

If you want to test it out, try to break the receptionist guardrail, or just see how it handles technical queries, feel free to play with it here: linkshift.app/docs.

How do you handle context costs in your own LLM projects? Do you use a similar routing system, or do you prefer standard vector databases (RAG)? Let’s discuss in the comments!

source & further reading

dev.to — original article I Think AI Product Manager is Not Just “PM Knowing How to Use AI” The Agentic Harness: The Part That Actually Keeps Agents Reliable Building a Low-Cost AI Brainrot Video Pipeline on Cloudflare

~/api · this article 200

$curl api.wpnews.pro/v1/news/how-to-cheat-llm-context…

Read original on dev.to → dev.to/p-zielinski/how-to-cheat-llm-context-a-li…

mentioned entities

LinkShift.app

gpt-5.4-nano

gpt-5.4-mini

metadata

slughow-to-cheat-llm-context-a-lightweight-ai-doc-assistant-architecture

topic#large-language-models

secondary4 topics

sentimentpositive

canonicaldev.to

navigation

← prevMicrosoft's MAI Thinking 1 surfa…

next →Martin Scorsese Feels the Power …

── more in #large-language-models 4 stories · sorted by recency

github.com · 29 Jun · #large-language-models

Show HN: Dotdotduck – open-source Web Agent SDK

businessinsider.com · 18 Jul · #large-language-models

I interned at OpenAI in San Francisco. Here's how landed it, what it was like, and what I thought of Sam Altman.

wired.com · 18 Jul · #large-language-models

How Google’s New Gemini Rates Work and How to Track Your Usage

fastcompany.com · 18 Jul · #large-language-models

The World Cup tech innovations that will outlast the tournament

── more on @linkshift.app 3 stories trending now

wpnews · 26 May · #ai-agents

Think, Durable Objects, and the Real Shape of AI Applications

wpnews · 8 Jul · #large-language-models

Gemini 3.5 Pro Delayed to July 17: Architectural Rebuild Explained

wpnews · 8 Jul · #ai-chips

D-Matrix launches Corsair AI inference platform, challenging Nvidia’s GPU dominance

sponsored brought to you by zahid.host 4,200+ EU-deployed projects

reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain — perfect for shipping the agent you just read about.

$git push zahid main

→ Live at https://your-agent.zahid.host ✓

Get free account → Pricing

from €0/mo · no card required