Source: phroneses.com

The Cost Curve of Unchecked LLM Context

Unchecked token usage in LLM-assisted engineering workflows can inflate costs from $48 to $978 per engineer per month for high-end models, as context accumulation from system prompts, chat history, RAG retrieval, and debugging logs multiplies token counts per call by over 20 times.

Published Jun 16, 2026 · 7 min read

LLM use can get expensive if per‑call token usage goes unchecked.

There is a contradiction in LLM‑assisted engineering: you must provide enough context for quality, yet the workflows that supply this context also tend to inflate token usage beyond what the model can meaningfully use.

Below is a concrete, realistic example for one developer, showing exactly how costs can balloon when normal workflow patterns accumulate more context than the model can effectively handle.

This article is about unchecked token growth in engineering workflows and the cost multipliers that follow.

Unchecked growth could cost you thousands per engineer a month

Imagine an engineer building an agent to analyse logs. The system has two major subsystems: a log processor with a UI, and a remote runtime LLM that works on the log contents surfaced by the log processor and reports back to the UI what the log contains and what it means.

During development, the engineer is iterating, debugging, and refining the solution. To do this, the engineer is using AI support within their engineering workflow.

To support this development work, the LLM in the engineering workflow is:

  • analysing log fragments
  • suggesting new extraction queries
  • reflecting on errors
  • retrying after failures
  • running multistep loops

Initial token cost

Initially, the token count per call might be:

  • System prompt: 300 tokens
  • User input: 200 tokens
  • Expected output: 300 tokens

This is a total of 800 tokens.

| Model tier | Tokens per call | Price per million tokens | Cost per call |
|---|---|---|---|
| Cheap | 800 | $0.20 | $0.00016 |
| Medium | 800 | $3.50 | $0.0028 |
| High‑end | 800 | $10.00 | $0.008 |

All of this is cheap, predictable, and stable.
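The per-call figures in the table come from one formula: cost = tokens × price per million ÷ 1,000,000. Here is a minimal Python sketch of it. Like the table, it assumes a single blended price for input and output tokens, although real providers usually price the two separately. The tier names and prices are the article's illustrative figures, not any specific vendor's price list.

```python
# Illustrative prices per million tokens (the article's figures, not a real vendor list).
PRICE_PER_MILLION = {"cheap": 0.20, "medium": 3.50, "high-end": 10.00}


def cost_per_call(tokens: int, price_per_million: float) -> float:
    """Cost of one call, assuming one blended price for input and output tokens."""
    return tokens * price_per_million / 1_000_000


# Reproduce the "Cost per call" column for an 800-token call.
for tier, price in PRICE_PER_MILLION.items():
    print(f"{tier:>8}: ${cost_per_call(800, price):.5f} per 800-token call")
```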

Unchecked token growth during engineering

Unchecked token usage can lead to:

  • system prompt growth
  • chat history accumulation
  • a RAG component dumping too much context
  • retries repeating the entire prompt
  • additional tools added to the engineering environment, generating more tokens
  • output schemas becoming verbose
  • debugging prints being included in context
  • logs being pasted directly into prompts

Costs balloon

Assume that an engineer makes 300 LLM calls per 8-hour day to develop the above log agent. Rounding up, that is 38 per hour.

Over time, without checks, each part of a call can grow:

  • System prompt: 2,500 tokens
  • Full chat history, kept for context and continuity: 4,000 tokens
  • RAG over‑retrieval dumping six log files: 8,000 tokens
  • Output schema: 1,000 tokens
  • Safety boilerplate in the prompt: 500 tokens
  • LLM output, unchanged: 300 tokens

This is a total of 16,300 tokens per LLM call.
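Adding up the components makes it clear where the growth comes from. The sketch below uses the article's token counts for both scenarios. The dictionary keys are just labels chosen for illustration.

```python
# Token counts per call for the two scenarios (the article's illustrative figures).
checked = {"system prompt": 300, "user input": 200, "output": 300}
unchecked = {
    "system prompt": 2_500,
    "chat history": 4_000,
    "RAG retrieval": 8_000,
    "output schema": 1_000,
    "safety boilerplate": 500,
    "output": 300,
}

checked_total = sum(checked.values())
unchecked_total = sum(unchecked.values())

print(f"checked:    {checked_total:,} tokens per call")
print(f"unchecked:  {unchecked_total:,} tokens per call")
print(f"multiplier: {unchecked_total / checked_total:.1f}x")

# Largest contributors first, to show where trimming pays off most.
for part, tokens in sorted(unchecked.items(), key=lambda kv: kv[1], reverse=True):
    print(f"  {part:<18} {tokens:>6,} ({tokens / unchecked_total:.0%})")
```

The ratio comes out at roughly 20.4x, which is the "more than twenty times" figure. RAG over-retrieval alone accounts for about half of the unchecked call.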

For the cheap, medium, and high-end LLM engineering models, this is the cost:

| Tier | Cost per 16,300‑token call | Daily cost (300 calls, 8h) | Monthly cost (20 days) |
|---|---|---|---|
| Cheap | $0.00326 | $0.978 | $19.56 |
| Medium | $0.05705 | $17.115 | $342.30 |
| High‑end | $0.163 | $48.90 | $978.00 |

The cost per call is low. The monthly cost soon adds up.

For comparison, here is the cost when token usage stays at 800 tokens per call:

| Tier | Cost per 800‑token call | Daily cost (300 calls, 8h) | Monthly cost (20 days) |
|---|---|---|---|
| Cheap | $0.00016 | $0.048 | $0.96 |
| Medium | $0.0028 | $0.84 | $16.80 |
| High‑end | $0.008 | $2.40 | $48.00 |

When using more than twenty times the tokens per call, the high-end model cost balloons from $48 per month to $978.

As a chart, the trend becomes quite clear.
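Both monthly columns can be reproduced by multiplying the per-call cost by calls per day and working days. The sketch below assumes, as the text does, 300 calls per day and 20 working days per month, along with one blended price per tier.

```python
# Illustrative prices per million tokens (the article's figures).
PRICE_PER_MILLION = {"cheap": 0.20, "medium": 3.50, "high-end": 10.00}
CALLS_PER_DAY = 300  # assumption from the text: 300 calls per 8-hour day
WORKING_DAYS = 20    # assumption from the text: 20 working days per month


def monthly_cost(tokens_per_call: int, price_per_million: float,
                 calls_per_day: int = CALLS_PER_DAY,
                 working_days: int = WORKING_DAYS) -> float:
    """Monthly spend for one engineer at a fixed token count per call."""
    per_call = tokens_per_call * price_per_million / 1_000_000
    return per_call * calls_per_day * working_days


# Checked (800 tokens) versus unchecked (16,300 tokens) for each tier.
for tier, price in PRICE_PER_MILLION.items():
    checked = monthly_cost(800, price)
    unchecked = monthly_cost(16_300, price)
    print(f"{tier:>8}: ${checked:,.2f} -> ${unchecked:,.2f} per month")
```

Cost is linear in tokens per call, calls per day and working days. So an engineer making 600 calls a day with unchecked context, `monthly_cost(16_300, 10.00, calls_per_day=600)`, would cost about $1,956 a month on the high-end tier.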

Are 300 LLM calls per day realistic?

Yes, although it depends on the work being done.

Code generation

When using an interactive coding assistant, every on-screen code completion, every refactor to improve the implementation, and every explanation of what part of the code does is a call to the LLM. An hour of active coding could generate 40 to 80 LLM calls.

Agentic assistance

Agentic workflows typically operate as a plan, act, observe, revise pipeline. Suppose the engineer instructs an agent to "add logging so that a division by zero error is logged and all tests pass". The LLM first generates a plan: read the code file containing the division, inspect the tests, modify the function to add logging in the right place, run the tests, and, if they fail, revise the change. The agent then carries out the plan, observes the outcome, and identifies any revisions. This pipeline may involve 3 to 10 calls.
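An agentic pipeline of 3 to 10 calls can cost far more than 3 to 10 times a single call. That happens when each call resends the whole conversation, as in the "full chat history" pattern described earlier. The toy model below illustrates the effect. It is not a real agent framework: `AgentRun` is a hypothetical class, and the step sizes (a pasted source file, pasted test output) are assumed numbers chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class AgentRun:
    """Toy model of a plan-act-observe-revise loop that resends full history."""
    system_prompt_tokens: int = 300
    history: list[int] = field(default_factory=list)  # tokens added per step
    calls: list[int] = field(default_factory=list)    # total tokens per call

    def call(self, new_tokens: int, output_tokens: int = 300) -> None:
        # Each call resends the system prompt plus everything said so far.
        sent = self.system_prompt_tokens + sum(self.history) + new_tokens
        self.calls.append(sent + output_tokens)
        # The new input and the model's reply both join the history.
        self.history += [new_tokens, output_tokens]


run = AgentRun()
run.call(200)    # plan: read the file, inspect tests, add logging
run.call(1_500)  # act: paste in the source file with the division code
run.call(2_000)  # observe: paste the failing test output
run.call(200)    # revise: ask for a corrected change
run.call(2_000)  # observe: paste the test output again

for step, tokens in enumerate(run.calls, start=1):
    print(f"call {step}: {tokens:,} tokens")
print(f"total: {sum(run.calls):,} tokens over {len(run.calls)} calls")
```

With these assumed sizes, the first call matches the 800-token baseline. The fifth call is already 7,700 tokens, and the five calls total 21,400 tokens, about 27 times a single baseline call.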

Testing

If the tests in a suite call out to an LLM, a single test-suite run can easily make 100 calls.

Teams use LLMs within test suites to:

  • evaluate natural‑language behaviour
  • judge style, tone, or reasoning
  • compare outputs that are not byte‑identical
  • classify correctness when rules are fuzzy
  • score answers in educational or assessment systems
  • validate agent behaviour
  • check explanations or rationales

Documentation

When the engineer needs to document or review code, each comment, rewrite or suggestion is an LLM call.

In these environments, 300 calls per day is realistic. Depending on your organisation, it may be on the low side.
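The ranges above can be combined into a rough daily estimate. In the sketch below, the per-activity ranges come from the text. The hours of coding, number of agent tasks, test runs and documentation calls are hypothetical assumptions for one engineer's day.

```python
# Ranges from the text; hours and counts are hypothetical assumptions.
coding_hours = 4
coding_calls_per_hour = (40, 80)  # interactive assistant
agent_tasks = 6
calls_per_agent_task = (3, 10)    # plan-act-observe-revise
test_runs_with_llm = 1
calls_per_test_run = 100
doc_calls = 20                    # comments, rewrites, suggestions


def daily_calls(per_hour: int, per_task: int) -> int:
    """Total LLM calls in one day for a given coding rate and agent-task size."""
    return (coding_hours * per_hour
            + agent_tasks * per_task
            + test_runs_with_llm * calls_per_test_run
            + doc_calls)


low = daily_calls(coding_calls_per_hour[0], calls_per_agent_task[0])
high = daily_calls(coding_calls_per_hour[1], calls_per_agent_task[1])
print(f"estimated calls per day: {low} to {high}")
```

Under these assumptions the estimate is 298 to 500 calls per day. Even the low end lands close to the 300 calls used in the cost tables.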

Why so many calls?

LLMs are stateless and probabilistic. They have a limited context window, no persistent memory and no internal world‑model, so they understand neither your code nor the system that will execute it.

This design means that complex tasks — such as adding logging to the right place in code — must be decomposed into many short, repeated calls.

Multiple calls are a necessity to achieve quality collaboration from an LLM.

Not only are multiple calls needed, but each call also needs a minimum amount of information. Because the model lacks a world‑model, results without this context would be significantly worse: no one wants an LLM hallucinating code because of its probabilistic nature.

However, passing too much information decreases the quality of LLM output.

This is because of how LLMs are built: an LLM will compress, blur, and mis‑prioritise information when the prompt becomes too large or too dense.

A prompt is too dense when it contains more information than the model can meaningfully separate, prioritise, or reason over. Prompt density is not only about prompt length.

If the prompt contains too many facts, instructions, examples, or files, the LLM cannot assign stable importance weights, so it treats everything as equally relevant. The result will be vague, averaged, or generic output.

If an LLM is presented with multiple coding styles, conventions, or code patterns and they appear close together, the model cannot decide which pattern to follow, so it blends them. The result is inconsistent naming, mixed styles, or contradictory behaviour.

For prompts, less is more.

With an LLM, too little context leads to poor results. With enough context, results will be good. With too much context, you will have poor results and high cost.

Conclusion

Even though $10 per million tokens does not sound like much, it is the rate of usage that matters, and a high rate of usage may simply come from the type of work your engineers are performing.

A minimum level of usage is required to get good results from an LLM, because context must be restated on every call. However, this restating, combined with engineering workflow pressures, can increase the number of tokens used per call.

Unchecked token growth is both expensive and counter-productive.

Organisations must size token usage appropriately, balancing cost and quality.

A system without a world‑model forces engineers to restate context, but restating too much context degrades quality and inflates cost. The job of engineering leadership is to enforce context discipline and to design workflows that keep teams on the efficient part of the curve.

The goal is not to minimise tokens but to control context so that the model receives only what it can meaningfully use.
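One way to make context discipline concrete is a per-call budget check that runs before a prompt is sent. Below is a minimal sketch. It assumes the prompt is assembled from named components whose token counts are already known. The component caps and the total budget are hypothetical values loosely based on the article's figures, not a recommended standard.

```python
# Hypothetical per-component caps, loosely based on the article's "checked" call.
CAPS = {
    "system prompt": 500,
    "chat history": 1_000,
    "RAG retrieval": 1_500,
    "output schema": 200,
}
TOTAL_BUDGET = 3_000  # hypothetical ceiling for the whole call


def over_budget(components: dict[str, int]) -> list[str]:
    """Return readable violations; an empty list means the call is within budget."""
    problems = [
        f"{name}: {tokens:,} > {CAPS[name]:,}"
        for name, tokens in components.items()
        if name in CAPS and tokens > CAPS[name]
    ]
    total = sum(components.values())
    if total > TOTAL_BUDGET:
        problems.append(f"total: {total:,} > {TOTAL_BUDGET:,}")
    return problems


unchecked_call = {
    "system prompt": 2_500, "chat history": 4_000, "RAG retrieval": 8_000,
    "output schema": 1_000, "safety boilerplate": 500, "output": 300,
}
for problem in over_budget(unchecked_call):
    print(problem)
```

The check reports problems instead of silently truncating. That way, engineers can see which part of the context grew and decide what to drop, such as older chat history or surplus retrieved log files.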

