[ARTICLE · art-22137] src=dev.to pub= topic=large-language-models verified=true sentiment=· neutral

What You Should Know About Tokens, Context, and AI Cost

Tokens are the fundamental unit of text that AI models process, breaking down words, symbols, and spacing into small pieces for reading and writing. The context window, which can now reach up to 1 million tokens in some models, determines how much text—including messages, files, logs, and chat history—the model can hold at once. However, larger contexts and output tokens, which often cost five to eight times more than input tokens, can significantly increase both cost and latency, making concise prompts and shorter responses a key factor in managing AI expenses.

7 min read · published Jun 5, 2026

Most of us use AI coding tools in a very normal way.

We paste an error, ask for a fix, paste a file, ask again, run a command, paste the output, and keep going. After some time, we get a message saying something like "you are out of tokens" or "you have reached your message limit".

Most of the time, the reason is tokens.

A token is a small piece of text the model reads or writes.

It can be a word, part of a word, a symbol, or spacing depending on the language and context. The model does not see text exactly like we do. It breaks everything into tokens first.

So when you send a message, you are sending input tokens. When the model replies, it creates output tokens. If your coding agent reads files, terminal logs, docs, diffs, and old chat history, that can also become input tokens.
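To get a feel for the numbers, here is a minimal Python sketch that guesses a token count from character length. It assumes the common rule of thumb of about 4 characters per token for English text. Real tokenizers differ by model and by language, and `CHARS_PER_TOKEN` and `estimate_tokens` are illustrative names, not part of any provider's API.

```python
import math

# Assumption: roughly 4 characters per token for English prose.
# Real tokenizers vary by model, language, and content (code often differs).
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Return a rough token estimate based on character count."""
    if not text:
        return 0
    return math.ceil(len(text) / CHARS_PER_TOKEN)


prompt = "Fix the failing test in src/auth/session.ts"
reply = "The session object can be undefined before login. Add a null check."

input_tokens = estimate_tokens(prompt)
output_tokens = estimate_tokens(reply)
print(f"input ~{input_tokens} tokens, output ~{output_tokens} tokens")
```

For real budgeting, the provider's own tokenizer or usage report is the number to trust. This estimate is only good for spotting order-of-magnitude problems.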

The context window is the amount of text the model can keep in view at one time.

It includes your message, the previous conversation, files, tool output, system instructions, project rules, and the model's own reply.

Some models can hold a lot now. 200K tokens is already common in many coding workflows. Some newer models can go near 1M tokens. That sounds huge, and it is huge. But it does not mean you should always use it.

Roughly speaking, 1M tokens can be hundreds of pages of text. It can be a big part of a codebase, many docs, or long chat history. But the model still has to read through that text. More context can mean more cost, more waiting, and more chances for the important thing to get buried.

A rough mental model:

| Context size | What it might hold |
| --- | --- |
| 32K tokens | A few files, a long bug report, or a small feature discussion |
| 128K tokens | Many files, long logs, or a decent chunk of project docs |
| 200K tokens | A large debugging session with files, logs, and history |
| 1M tokens | Hundreds of pages, big docs, or a large slice of a codebase |

This is not exact. Different languages, code, spacing, and tokenizers change the count. But it gives you the idea.
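Since the context window has to hold every part listed earlier, including the model's own reply, a quick budget check can help. The sketch below adds up made-up token counts and compares them with the window sizes above. The part names, the counts, and `RESERVED_FOR_REPLY` are all assumptions for illustration.

```python
# Hypothetical token counts for one request; every number here is made up.
context_parts = {
    "system_instructions": 1_500,
    "project_rules": 2_000,
    "chat_history": 40_000,
    "files": 60_000,
    "tool_output": 30_000,
    "current_message": 500,
}

RESERVED_FOR_REPLY = 8_000  # assumption: room kept free for the model's answer


def remaining_tokens(parts: dict[str, int], window: int, reserved: int) -> int:
    """Tokens left in the window after all parts and the reserved reply space."""
    return window - sum(parts.values()) - reserved


for window in (32_000, 128_000, 200_000, 1_000_000):
    left = remaining_tokens(context_parts, window, RESERVED_FOR_REPLY)
    status = "fits" if left >= 0 else "does not fit"
    print(f"{window:>9,} token window: {status} (margin {left:,})")
```

In this made-up session, chat history, files, and tool output dominate the budget, so those are the first places to trim.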

Large context is useful, but it is not free.

AI agents are powerful because they can read files and run commands. But that also means they can send a lot of text back into the conversation.

These things usually waste tokens:

Two people can ask the same question and pay very different costs. One sends a clean error and one file path. The other sends the whole repo, full logs, and old attempts. The second one usually pays more and may get a worse answer.

Input tokens are what you send to the model. Output tokens are what the model writes back.

In many models, output tokens cost much more than input tokens.

That matters a lot for coding agents because they do not just answer with a small paragraph. They think, call tools, explain, write code, run commands, and sometimes produce long patches. Reasoning tokens can also be counted as output tokens in many pricing systems.

Here is a simple pricing snapshot checked on June 4, 2026. These prices can change, so always check the official page before making a serious budget.

| Provider / model | Input per 1M tokens | Output per 1M tokens | What to notice |
| --- | --- | --- | --- |
| OpenAI GPT-5.4 | $2.50 | $15.00 | Output is 6x input |
| OpenAI GPT-5.4 mini | $0.75 | $4.50 | Still 6x input |
| OpenAI GPT-5.3 Codex | $1.75 | $14.00 | Output is 8x input |
| Claude Sonnet 4.6 | $3.00 | $15.00 | Output is 5x input |
| Claude Haiku 4.5 | $1.00 | $5.00 | Cheaper, but same 5x pattern |

This is why "make the answer shorter" is not just about readability. It can save real money.

It is also why output-heavy work can surprise you. If the agent writes long explanations, repeats full files, or prints large patches again and again, the expensive side of the bill grows fast.
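To make the effect of shorter answers concrete, here is a small sketch that prices one request using the snapshot prices from the table. The token counts are hypothetical, and `request_cost` is an illustrative helper that ignores caching, credits, and any other discounts.

```python
# Prices in USD per 1M tokens, copied from the pricing snapshot in the article.
PRICES = {
    "OpenAI GPT-5.4": (2.50, 15.00),
    "OpenAI GPT-5.4 mini": (0.75, 4.50),
    "OpenAI GPT-5.3 Codex": (1.75, 14.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "Claude Haiku 4.5": (1.00, 5.00),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request, ignoring caching and other discounts."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical request: the same 20,000-token input, long vs. short answer.
for model in PRICES:
    long_answer = request_cost(model, 20_000, 8_000)
    short_answer = request_cost(model, 20_000, 2_000)
    print(f"{model}: long ${long_answer:.4f}, short ${short_answer:.4f}")
```

The input side is identical in both cases, so the whole difference comes from the output tokens.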

For Codex users, there is also a credit-based view. Current Codex pricing maps credits to input, cached input, and output tokens. So the same idea still applies: long answers and output-heavy tasks cost more.

Imagine a coding task uses:

The output is only one fourth of the tokens here, but it can cost more than the input.

Now imagine running many agents, long sessions, CI fixes, reviews, and retries. That small number can grow quietly.
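Here is a worked version of that arithmetic as a sketch. The token counts are illustrative, picked only so that output is one fourth of the total, and the prices are the Claude Sonnet 4.6 row from the pricing snapshot. The number of runs per day and working days are also assumptions.

```python
# Hypothetical task: token counts are illustrative, chosen so that output
# is one fourth of all tokens. Prices are the Claude Sonnet 4.6 snapshot row.
INPUT_TOKENS = 30_000
OUTPUT_TOKENS = 10_000
INPUT_PRICE_PER_M = 3.00
OUTPUT_PRICE_PER_M = 15.00

input_cost = INPUT_TOKENS / 1_000_000 * INPUT_PRICE_PER_M
output_cost = OUTPUT_TOKENS / 1_000_000 * OUTPUT_PRICE_PER_M
task_cost = input_cost + output_cost

output_token_share = OUTPUT_TOKENS / (INPUT_TOKENS + OUTPUT_TOKENS)
output_cost_share = output_cost / task_cost

print(f"input ${input_cost:.2f}, output ${output_cost:.2f}, total ${task_cost:.2f}")
print(f"output is {output_token_share:.0%} of tokens but {output_cost_share:.1%} of cost")

# Assumption: 50 such tasks per day over 22 working days.
RUNS_PER_DAY = 50
WORKING_DAYS = 22
monthly_cost = task_cost * RUNS_PER_DAY * WORKING_DAYS
print(f"monthly estimate: ${monthly_cost:.2f}")
```

A single task costs well under a dollar here, but the monthly figure shows how quickly repetition adds up.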

For coding, I like this rule: Give the agent enough context to act, but not enough noise to get lost.

Instead of this:

Here are 900 lines of backend notes, frontend rules, deployment steps, test logs, and old failed attempts. Please figure it out.

Try this:

The failing test is UserService.test.ts. The error is "Cannot read property id of undefined". It started after changing src/auth/session.ts. Please inspect that path first and run the relevant test.

That is usually much more useful.

If the agent needs more, it can ask or inspect the repo.

Files like AGENTS.md or CLAUDE.md are useful. They help coding agents understand how to work inside a repo.

But even those files should be short. Your AGENTS.md or CLAUDE.md should point the agent to the right place, not paste every detail into every session.

A good version:

- Run npm test after changing shared code.
- For backend rules, read docs/backend.md.
- For frontend rules, read docs/frontend.md.
- For deployment, read docs/deploy.md only when needed.

A noisy version:

Here are all backend rules, all frontend rules, every deployment step, every exception, every old note, and every edge case. Load this every time.

The second one feels helpful, but it can make every task more expensive before the agent even starts.

These habits help a lot:

Instead of sending 2,000 lines of test output, send the failing test name, error message, and file path.

For example:

- Command: npm test
- Failing test: UserService should return current user
- Error: Cannot read property id of undefined
- File: src/auth/session.ts

That gives the agent a direction without forcing it to read a wall of noise.
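One way to get from a wall of test output to those few lines is a small filter. The sketch below is an assumption-heavy illustration: the sample log is made up and not tied to any specific test runner, and the `FAIL_LINE` and `ERROR_LINE` patterns are hypothetical and would need tuning for real output.

```python
import re

# Assumption: illustrative patterns only; adjust them for your test runner.
FAIL_LINE = re.compile(r"^\s*FAIL\b")
ERROR_LINE = re.compile(r"Error\b|^\s+at\s")


def trim_test_output(output: str, max_lines: int = 20) -> str:
    """Keep failure headers, the line right after each, errors, and stack frames."""
    kept = []
    keep_next = 0  # lines of context still to keep after a FAIL line
    for line in output.splitlines():
        if FAIL_LINE.search(line):
            kept.append(line)
            keep_next = 1
        elif keep_next > 0 or ERROR_LINE.search(line):
            kept.append(line)
            keep_next = max(0, keep_next - 1)
    return "\n".join(kept[:max_lines])


# Made-up log: 100 passing lines around one failure.
raw_output = "\n".join(
    ["PASS src/utils/date.test.ts"] * 50
    + [
        "FAIL src/auth/UserService.test.ts",
        "  UserService should return current user",
        "    TypeError: Cannot read property id of undefined",
        "      at getCurrentUser (src/auth/session.ts:42:18)",
    ]
    + ["PASS src/ui/button.test.ts"] * 50
)

trimmed = trim_test_output(raw_output)
print(trimmed)
print(f"{len(raw_output.splitlines())} lines reduced to {len(trimmed.splitlines())}")
```

The `max_lines` cap is a safety net so that a run with hundreds of failures still produces a short message.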

For Claude Code, the main file is CLAUDE.md.

Keep CLAUDE.md short. Put important commands and project rules there. Link to deeper docs instead of copying them.

Good things to include:

Things to avoid:

The point is to guide the agent, not overload every session.

For Codex, project instructions can come from files like AGENTS.md.

The same rule applies: keep it focused.

One setting I would watch closely is project_doc_max_bytes. It controls how much project instruction text Codex can load from agent docs. If your AGENTS.md is large, reducing this can stop every session from starting with too much text.
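Before changing that setting, it can help to know how big the instruction file actually is. This sketch measures text in UTF-8 bytes and compares it with a limit. The 32 KiB value is only an example limit assumed for illustration, and `instruction_size_report` is a hypothetical helper, not part of Codex.

```python
# Assumption: 32 KiB is used here only as an example limit; check your own
# Codex configuration for the actual project_doc_max_bytes value.
EXAMPLE_MAX_BYTES = 32 * 1024


def instruction_size_report(text: str, max_bytes: int = EXAMPLE_MAX_BYTES) -> dict:
    """Report the UTF-8 size of instruction text against a byte limit."""
    size = len(text.encode("utf-8"))
    return {
        "bytes": size,
        "limit": max_bytes,
        "over_limit": size > max_bytes,
        "bytes_over": max(0, size - max_bytes),
    }


# Made-up instruction text: one focused rule vs. a long pasted dump.
short_doc = "Run npm test after changing shared code.\n"
noisy_doc = "Every backend rule, frontend rule, and deployment step...\n" * 1000

print(instruction_size_report(short_doc))
print(instruction_size_report(noisy_doc))
```

In a real repo you would pass in the contents of AGENTS.md, for example with `pathlib.Path("AGENTS.md").read_text(encoding="utf-8")`.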

The best setup is usually:

For OpenCode, instructions and agents can also grow quickly.

Common places where context can expand:

- AGENTS.md
- .opencode/
- .opencode/agents/

Keep global rules short. Keep project rules focused on commands, conventions, and where to look.

If you have long reusable instructions, move them into skills or separate docs instead of loading them every time. If you have different jobs, use separate agents instead of one giant instruction file.

I also made a small tool called [token-optimizer](https://www.npmjs.com/package/token-optimizer).

The goal is simple: reduce noisy command output before it gets sent back into an AI coding agent.

It is not meant to replace good prompting. It just helps with one common problem: terminal output, logs, diffs, and command results can get very large very quickly.

The tool tries to keep the useful parts, like:

That can make agentic coding sessions cleaner, reduce repeated noise, and help the agent focus on the actual issue.
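The article does not show how token-optimizer works internally, so the following is not its implementation. It is a separate, minimal sketch of one general way to reduce repeated noise: collapsing runs of identical consecutive lines into a single line with a count. The function name and the sample output are made up for illustration.

```python
from itertools import groupby


def collapse_repeats(output: str) -> str:
    """Collapse runs of identical consecutive lines into one line with a count."""
    collapsed = []
    for line, run in groupby(output.splitlines()):
        count = sum(1 for _ in run)
        collapsed.append(line if count == 1 else f"{line}  [repeated {count}x]")
    return "\n".join(collapsed)


# Made-up command output with heavy repetition.
noisy = "\n".join(
    ["Downloading dependencies..."] * 40
    + ["warning: retrying connection"] * 3
    + ["error: build failed in src/auth/session.ts"]
)
print(collapse_repeats(noisy))
```

This keeps the order of events and the one line that matters while dropping the bulk of the repetition.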

Each device, repo, model, and tool can behave differently. A small library, a frontend app, a backend service, and a large monorepo all need different settings.

Start small. Watch what the agent keeps reading. Remove repeated logs and repeated docs. Increase limits only when the agent is clearly missing useful context.

The goal is not to use the smallest possible context.

The goal is to use the smallest useful context.

Clean context usually means better answers, lower cost, and fewer strange turns in the middle of a task.
