How I Cut Agent Token Usage by 89% Without Touching the Agent


Source: dev.to · Published: 2026-06-05T13:10Z · Topic: ai-agents


A developer built a Go proxy called Trooper that sits between AI agents and large language models, reducing token usage by 89% without modifying the agent itself. The proxy replaces full conversation history with a structured "SITREP" — a situation report capturing intent, decisions, constraints, and open questions — after the first few turns of a session. In a real 15-turn debugging session, Trooper cut token consumption from 10,820 tokens per request to just 1,157 tokens while maintaining the model's ability to answer questions about earlier decisions.

4 min read · Published Jun 5, 2026

Every time your agent calls an LLM, it sends the full conversation history.

Turn 20 includes turns 1–19. Turn 50 includes turns 1–49. Nobody notices because it's happening inside the agent, silently, on every single request.
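To make the growth concrete, here is a minimal sketch of the arithmetic. It assumes every turn adds the same number of tokens (`tokensPerTurn` is a hypothetical average, not a measured value), so real sessions will differ, but the shape is the same: total input tokens grow with the square of the turn count.

```go
package main

import "fmt"

// cumulativeInputTokens returns the total input tokens sent across a session
// when every request replays the full history. tokensPerTurn is a
// hypothetical fixed average; real turns vary in size.
func cumulativeInputTokens(turns, tokensPerTurn int) int {
	total := 0
	for turn := 1; turn <= turns; turn++ {
		// Request number `turn` carries all earlier turns plus the new one.
		total += turn * tokensPerTurn
	}
	return total
}

func main() {
	for _, n := range []int{10, 20, 50} {
		fmt.Printf("%d turns -> %d input tokens in total\n", n, cumulativeInputTokens(n, 500))
	}
}
```

Doubling the session length roughly quadruples the total, which is why long sessions are where replay hurts most.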

I noticed it while building Trooper - a Go proxy that sits between agents and LLMs. I was watching token counts climb across a long debugging session and realised the agent was replaying the same context over and over. Most of it was noise.

The model didn't need a transcript. It needed state.

After a few turns, most of what matters in a session falls into four categories: intent, decisions, constraints, and open questions.

That's it. Everything else — the back and forth, the verbose LLM responses explaining things, the repeated context — is replay. The model doesn't need it again.

I added structured session memory to Trooper. After enough turns, Trooper's local Llama model generates a SITREP — a situation report — from the user messages in the session.

It looks like this:

INTENT: Build a RAG pipeline with ChromaDB and nomic-embed-text

DECISIONS: Use cosine similarity over MMR — focused queries not broad;
           Chunk size 256, overlap 30 — locked;
           Pure vector search — ChromaDB no hybrid support;
           Top k set to 5

CONSTRAINTS: Node 18 locked — platform team constraint, no exceptions;
             Re-ranking ruled out — latency jumped 200ms to 800ms

OPEN: Poor recall on technical queries — nomic-embed-text struggles with domain jargon;
      Evaluating bge-small as alternative
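One way to hold a snapshot like this in memory is a small struct with one field per category. The sketch below is an illustration only: the `SITREP` type and its `Render` method are hypothetical names, not taken from Trooper's source, and the output is a single-line-per-section variant of the layout above.

```go
package main

import (
	"fmt"
	"strings"
)

// SITREP is a hypothetical in-memory form of the four-field state snapshot.
type SITREP struct {
	Intent      string
	Decisions   []string
	Constraints []string
	Open        []string
}

// Render flattens the snapshot into labelled lines, joining items with "; ".
// Empty sections are skipped.
func (s SITREP) Render() string {
	var b strings.Builder
	if s.Intent != "" {
		fmt.Fprintf(&b, "INTENT: %s\n", s.Intent)
	}
	sections := []struct {
		label string
		items []string
	}{
		{"DECISIONS", s.Decisions},
		{"CONSTRAINTS", s.Constraints},
		{"OPEN", s.Open},
	}
	for _, sec := range sections {
		if len(sec.items) > 0 {
			fmt.Fprintf(&b, "%s: %s\n", sec.label, strings.Join(sec.items, "; "))
		}
	}
	return b.String()
}

func main() {
	report := SITREP{
		Intent:      "Build a RAG pipeline with ChromaDB and nomic-embed-text",
		Decisions:   []string{"Chunk size 256, overlap 30", "Top k set to 5"},
		Constraints: []string{"Node 18 locked"},
		Open:        []string{"Evaluating bge-small as alternative"},
	}
	fmt.Print(report.Render())
}
```

Keeping the fields separate, rather than storing one blob of text, makes it possible to update a single category without regenerating the rest.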

From that point forward, every request to the LLM sends:

Anchor (first 2 turns verbatim)
+ SITREP (structured state)
+ Tail (last N turns verbatim)

Instead of the full history.
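The anchor and tail windows can be cut from a history slice with a few bounds checks. This is a sketch under assumptions: it counts individual messages rather than user/assistant turn pairs, and `splitHistory` and `clamp` are hypothetical helpers, not Trooper functions.

```go
package main

import "fmt"

// Message mirrors one chat entry as a role/content pair.
type Message struct {
	Role    string
	Content string
}

// clamp limits v to the range [lo, hi].
func clamp(v, lo, hi int) int {
	if v < lo {
		return lo
	}
	if v > hi {
		return hi
	}
	return v
}

// splitHistory returns the first anchorN messages and the last tailN messages.
// The two windows never overlap: on a short history the tail shrinks.
func splitHistory(history []Message, anchorN, tailN int) (anchor, tail []Message) {
	anchorN = clamp(anchorN, 0, len(history))
	rest := history[anchorN:]
	tailN = clamp(tailN, 0, len(rest))
	return history[:anchorN], rest[len(rest)-tailN:]
}

func main() {
	history := make([]Message, 10)
	for i := range history {
		history[i] = Message{Role: "user", Content: fmt.Sprintf("message %d", i+1)}
	}
	anchor, tail := splitHistory(history, 2, 3)
	fmt.Println("anchor:", len(anchor), "tail:", len(tail), "dropped:", len(history)-len(anchor)-len(tail))
}
```

Everything between the two windows is what the SITREP stands in for.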

From a real 15-turn session:

Full history:    10,820 tokens per request
With Trooper:     1,157 tokens per request
Reduction:             89%

Visible live on the dashboard.
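The percentage follows directly from the two token counts above. A small sketch of the calculation (`reductionPercent` is a hypothetical helper name):

```go
package main

import "fmt"

// reductionPercent reports how much smaller `after` is than `before`, in percent.
func reductionPercent(before, after int) float64 {
	if before <= 0 {
		return 0
	}
	return float64(before-after) / float64(before) * 100
}

func main() {
	// (10,820 - 1,157) / 10,820 = 9,663 / 10,820
	fmt.Printf("%.1f%%\n", reductionPercent(10820, 1157)) // prints "89.3%"
}
```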

The question that mattered was whether the model still works from the SITREP alone. Token savings are worthless if the model loses coherence.

To test it: I took the auto-generated SITREP, opened a completely fresh chat with no history, and asked questions about decisions made in the original session.

Questions:

Result: All four answered correctly. The model worked entirely from the SITREP. No history. No context bleed.

That's the claim: structured state is sufficient for the model to continue reasoning correctly — and it costs 89% less to send.

Trooper is a Go proxy — one binary, no SDK, no instrumentation. You point your existing agent at it by changing one URL.

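# Before: the agent calls Anthropic directly
export ANTHROPIC_BASE_URL=https://api.anthropic.com

# After: the agent calls Trooper, which forwards to the LLM
export ANTHROPIC_BASE_URL=http://localhost:3000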

Nothing else changes. Trooper intercepts every request, maintains session state, and when the SITREP is ready, rewrites the messages array before forwarding to the LLM.
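The intercept-and-rewrite step can be sketched with Go's standard reverse proxy. This is not Trooper's actual code: `rewriteMessages`, `newProxy`, and the `compactFor` hook are hypothetical names, and the sketch assumes a JSON request body with a top-level `messages` array.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httputil"
	"net/url"
)

// Message is one entry in the request's "messages" array.
type Message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

// rewriteMessages swaps the "messages" field of a JSON request body for
// the compacted list and leaves every other field untouched.
func rewriteMessages(body []byte, compact []Message) ([]byte, error) {
	var envelope map[string]json.RawMessage
	if err := json.Unmarshal(body, &envelope); err != nil {
		return nil, err
	}
	raw, err := json.Marshal(compact)
	if err != nil {
		return nil, err
	}
	envelope["messages"] = raw
	return json.Marshal(envelope)
}

// newProxy forwards requests to upstream, rewriting each body on the way.
// compactFor is a hypothetical hook that returns the compacted messages for
// a request, or nil to forward the request unchanged.
func newProxy(upstream *url.URL, compactFor func(*http.Request) []Message) http.Handler {
	proxy := httputil.NewSingleHostReverseProxy(upstream)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		body, err := io.ReadAll(r.Body)
		if err != nil {
			http.Error(w, "cannot read body", http.StatusBadRequest)
			return
		}
		if compact := compactFor(r); compact != nil {
			// If the body is not valid JSON, forward it as received.
			if rewritten, err := rewriteMessages(body, compact); err == nil {
				body = rewritten
			}
		}
		// Restore the body and fix the length so the upstream sees valid framing.
		r.Body = io.NopCloser(bytes.NewReader(body))
		r.ContentLength = int64(len(body))
		r.Host = upstream.Host
		proxy.ServeHTTP(w, r)
	})
}

func main() {
	in := []byte(`{"model":"m","messages":[{"role":"user","content":"old"}]}`)
	out, err := rewriteMessages(in, []Message{{Role: "user", Content: "new"}})
	if err != nil {
		fmt.Println("rewrite failed:", err)
		return
	}
	fmt.Println(string(out))
}
```

To serve it, the handler returned by `newProxy` could be passed to `http.ListenAndServe("localhost:3000", ...)` with the upstream URL parsed from `https://api.anthropic.com`. Decoding into `map[string]json.RawMessage` means fields the proxy does not know about pass through unmodified.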

The SITREP is built by a local Llama 3.1 8B model running via Ollama: fast, private, no cloud cost. The extraction happens asynchronously in the background, so the main request path is not blocked.
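A common Go pattern for this kind of non-blocking extraction is a goroutine that writes its result under a mutex. The sketch below assumes that shape; `Session`, `Summarizer`, and `ExtractAsync` are hypothetical names, and the summarizer is a plain function standing in for the call to the local model.

```go
package main

import (
	"fmt"
	"sync"
)

// Summarizer turns user messages into a SITREP string. It is a hypothetical
// stand-in for the call to the local model.
type Summarizer func(userMessages []string) string

// Session holds the latest SITREP behind a mutex so the request path can
// read it while a background goroutine writes it.
type Session struct {
	mu         sync.Mutex
	sitrep     string
	extracting bool
}

// SITREP returns the current snapshot, or "" if none is ready yet.
func (s *Session) SITREP() string {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.sitrep
}

// ExtractAsync starts at most one background extraction and returns at once.
// The returned channel is closed when the extraction finishes; it is nil
// when another extraction is already running.
func (s *Session) ExtractAsync(summarize Summarizer, userMessages []string) <-chan struct{} {
	s.mu.Lock()
	if s.extracting {
		s.mu.Unlock()
		return nil
	}
	s.extracting = true
	s.mu.Unlock()

	// Copy the input so later appends by the caller cannot race with the goroutine.
	msgs := append([]string(nil), userMessages...)
	done := make(chan struct{})
	go func() {
		defer close(done)
		result := summarize(msgs) // the slow call runs outside the lock
		s.mu.Lock()
		s.sitrep = result
		s.extracting = false
		s.mu.Unlock()
	}()
	return done
}

func main() {
	var s Session
	done := s.ExtractAsync(func(m []string) string {
		return "INTENT: " + m[0]
	}, []string{"Build a RAG pipeline"})
	<-done // a real request path would not wait; it would use whatever SITREP() returns
	fmt.Println(s.SITREP())
}
```

Until the first extraction completes, `SITREP()` returns an empty string and the proxy can simply forward the full history.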

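// GetTripleAnchor assembles what gets sent to the LLM:
// anchor turns, then the SITREP (once ready), then the tail turns.
func (s *SessionStore) GetTripleAnchor(sessionID string) []map[string]string {
    // Assumption: per-session state lives in a map field named sessions.
    state, ok := s.sessions[sessionID]
    if !ok {
        return nil
    }
    // Copy the anchor so the appends below never write into the stored slice.
    payload := append([]map[string]string{}, state.Anchor...)
    if state.SITREP != "" {
        payload = append(payload, map[string]string{
            "role":    "system",
            "content": fmt.Sprintf("[STATE_SITREP: %s]", state.SITREP),
        })
    }
    return append(payload, state.Tail...)
}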

The dashboard shows the compression ratio live:

HISTORY COMPRESSED    89%
TOKENS SAVED          459
CONFIDENCE            100%

Most summarisation tools compress what was said. The SITREP extracts what matters for the next action.

Copilot's context compaction summarises the full conversation — useful for humans in long chats. The SITREP is structured specifically for agents: decisions, constraints, open loops, ruled-out paths. Not a narrative summary. A state snapshot.

The result is that subsequent turns stay coherent on intent without replaying noise. More relevant for agents running repeated structured workflows than for general chat.

The SITREP works best for structured agentic workflows — debugging sessions, research pipelines, multi-step build tasks. For open-ended creative work where tangential context might become important later, you'd want a larger tail window or higher fidelity compression.

The tail window is configurable. You can keep more raw context for less structured sessions.

The compression is the latest addition. Trooper also:

x_force_local

/recovery/{session_id} tells you exactly where to resume

All from one URL change.

We tend to treat conversation history as memory. But a transcript is a log. Memory is state.

Humans don't replay every prior conversation before making a decision. They carry forward conclusions, constraints, unresolved questions, and relevant context — a structured snapshot, not a full transcript.

Long-running agents may need to do the same. Not because of token costs — though that helps — but because state is a better abstraction for agent memory than history.

The SITREP is an experiment in that direction.

github.com/shouvik12/trooper — Go, MIT, zero dependencies beyond Ollama.


Read original on dev.to → dev.to/shouvik12/how-i-cut-agent-token-usage-by-…
