# What You Should Know About Tokens, Context, and AI Cost

> Source: <https://dev.to/edisonpappi/what-you-should-know-about-tokens-context-and-ai-cost-57b2>
> Published: 2026-06-05 03:30:00+00:00

Most of us use AI coding tools in a very normal way.

We paste an error, ask for a fix, paste a file, ask again, run a command, paste the output, and keep going. After some time, we get a message saying something like `you are out of tokens` or `you have reached your message limit`.

Most of the time, the reason is tokens.

A token is a small piece of text the model reads or writes.

It can be a word, part of a word, a symbol, or spacing depending on the language and context. The model does not see text exactly like we do. It breaks everything into tokens first.

So when you send a message, you are sending input tokens. When the model replies, it creates output tokens. If your coding agent reads files, terminal logs, docs, diffs, and old chat history, all of that becomes input tokens too.
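If you want a quick feel for how big a message is before sending it, a character-based estimate is often enough. This is a minimal sketch that assumes roughly 4 characters per token for English text. That ratio is a common rule of thumb, not a real tokenizer, and `estimate_tokens` is a hypothetical helper name.

```python
import math

# Assumption: roughly 4 characters per token for English text.
# Real tokenizers differ by model, language, and content (code, spacing).
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Return a rough token estimate for `text` (not an exact count)."""
    if not text:
        return 0
    return math.ceil(len(text) / CHARS_PER_TOKEN)


message = "The failing test is UserService.test.ts."
print(estimate_tokens(message))
```

For exact numbers, use the tokenizer or token-counting endpoint that your model provider documents.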

The context window is the amount of text the model can keep in view at one time.

It includes your message, the previous conversation, files, tool output, system instructions, project rules, and the model's own reply.

Some models can hold a lot now. 200K tokens is already common in many coding workflows. Some newer models can go near 1M tokens. That sounds huge, and it is huge. But it does not mean you should always use it.

Roughly speaking, 1M tokens can be hundreds of pages of text, and for plain English prose it is often well over a thousand. It can be a big part of a codebase, many docs, or long chat history. But the model still has to read through that text. More context can mean more cost, more waiting, and more chances for the important thing to get buried.

A rough mental model:

| Context size | What it might hold |
|---|---|
| 32K tokens | A few files, a long bug report, or a small feature discussion |
| 128K tokens | Many files, long logs, or a decent chunk of project docs |
| 200K tokens | A large debugging session with files, logs, and history |
| 1M tokens | Hundreds of pages, big docs, or a large slice of a codebase |

This is not exact. Different languages, code, spacing, and tokenizers change the count. But it gives you the idea.
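To see where a context window goes, it can help to add up the pieces. The sketch below is only an illustration: the `ContextBudget` class is hypothetical, and every token count in it is an invented number, not a measurement.

```python
from dataclasses import dataclass


@dataclass
class ContextBudget:
    window_tokens: int       # total context window of the model
    reserved_for_reply: int  # tokens kept free for the model's own reply

    def remaining(self, parts: dict[str, int]) -> int:
        """Tokens left after all input parts and the reply reserve."""
        used = sum(parts.values())
        return self.window_tokens - self.reserved_for_reply - used


# Hypothetical session: every number here is an illustrative assumption.
parts = {
    "system_and_project_rules": 3_000,
    "chat_history": 40_000,
    "files": 60_000,
    "tool_output": 90_000,
}
budget = ContextBudget(window_tokens=200_000, reserved_for_reply=8_000)
left = budget.remaining(parts)
biggest = max(parts, key=parts.get)
print(f"remaining: {left} tokens, largest part: {biggest}")
```

In this made-up session the window is already overdrawn, and tool output is the largest part, so that is the first place to trim.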

Large context is useful, but it is not free.

AI agents are powerful because they can read files and run commands. But that also means they can send a lot of text back into the conversation.

These things usually waste tokens:

Two people can ask the same question and pay very different costs. One sends a clean error and one file path. The other sends the whole repo, full logs, and old attempts. The second one usually pays more and may get a worse answer.

Input tokens are what you send to the model. Output tokens are what the model writes back.

In many models, output tokens cost much more than input tokens.

That matters a lot for coding agents because they do not just answer with a small paragraph. They think, call tools, explain, write code, run commands, and sometimes produce long patches. Reasoning tokens can also be counted as output tokens in many pricing systems.

Here is a simple pricing snapshot checked on June 4, 2026. These prices can change, so always check the official page before making a serious budget.

| Provider / model | Input per 1M tokens | Output per 1M tokens | What to notice |
|---|---|---|---|
| OpenAI GPT-5.4 | $2.50 | $15.00 | Output is 6x input |
| OpenAI GPT-5.4 mini | $0.75 | $4.50 | Still 6x input |
| OpenAI GPT-5.3 Codex | $1.75 | $14.00 | Output is 8x input |
| Claude Sonnet 4.6 | $3.00 | $15.00 | Output is 5x input |
| Claude Haiku 4.5 | $1.00 | $5.00 | Cheaper, but same 5x pattern |

This is why "make the answer shorter" is not just about readability. It can save real money.

It is also why output-heavy work can surprise you. If the agent writes long explanations, repeats full files, or prints large patches again and again, the expensive side of the bill grows fast.

For Codex users, there is also a credit-based view. Current Codex pricing maps credits to input, cached input, and output tokens. So the same idea still applies: long answers and output-heavy tasks cost more.

Imagine a coding task uses:

The output is only one fourth of the tokens here, but it can cost more than the input.
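Here is one way to check that claim with the prices from the snapshot table. The token counts are my own illustrative assumption (300K input and 100K output, so output is one fourth of the total), and `task_cost` is a hypothetical helper. It ignores cached-input discounts and any other pricing tiers.

```python
# USD per 1M tokens (input, output), copied from the pricing snapshot table.
# Prices change; treat these as a snapshot, not current rates.
PRICES = {
    "OpenAI GPT-5.4": (2.50, 15.00),
    "OpenAI GPT-5.4 mini": (0.75, 4.50),
    "OpenAI GPT-5.3 Codex": (1.75, 14.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "Claude Haiku 4.5": (1.00, 5.00),
}


def task_cost(model: str, input_tokens: int, output_tokens: int) -> tuple[float, float]:
    """Return (input_cost, output_cost) in USD for one task."""
    input_price, output_price = PRICES[model]
    return (
        input_tokens / 1_000_000 * input_price,
        output_tokens / 1_000_000 * output_price,
    )


# Hypothetical task: 300K input tokens and 100K output tokens,
# so output is one fourth of the total.
for model in PRICES:
    input_cost, output_cost = task_cost(model, 300_000, 100_000)
    print(f"{model}: input ${input_cost:.2f}, output ${output_cost:.2f}")
```

With these assumed counts, the output side costs more than the input side for every model in the table. The general pattern: when output is priced at 5x input, output becomes the bigger cost as soon as it exceeds one fifth of the input tokens.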

Now imagine running many agents, long sessions, CI fixes, reviews, and retries. That small number can grow quietly.

For coding, I like this rule:

Give the agent enough context to act, but not enough noise to get lost.

Instead of this:

Here are 900 lines of backend notes, frontend rules, deployment steps, test logs, and old failed attempts. Please figure it out.

Try this:

The failing test is `UserService.test.ts`. The error is `Cannot read property id of undefined`. It started after changing `src/auth/session.ts`. Please inspect that path first and run the relevant test.

That is usually much more useful.

If the agent needs more, it can ask or inspect the repo.

Files like `AGENTS.md` or `CLAUDE.md` are useful. They help coding agents understand how to work inside a repo.

But even those files should be short. Your `AGENTS.md` or `CLAUDE.md` should point the agent to the right place, not paste every detail into every session.

A good version:

Run `npm test` after changing shared code. For backend rules, read `docs/backend.md`. For frontend rules, read `docs/frontend.md`. For deployment, read `docs/deploy.md` only when needed.

A noisy version:

Here are all backend rules, all frontend rules, every deployment step, every exception, every old note, and every edge case. Load this every time.

The second one feels helpful, but it can make every task more expensive before the agent even starts.

These habits help a lot:

Instead of sending 2,000 lines of test output, send the failing test name, error message, and file path.

For example:

Command: `npm test`

Failing test: `UserService should return current user`

Error: `Cannot read property id of undefined`

File: `src/auth/session.ts`

That gives the agent a direction without forcing it to read a wall of noise.
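If you do this often, a tiny filter can pull those lines out of a long log for you. This is a rough sketch, not a general parser: the marker pattern and the sample log are assumptions I made up for illustration, and real test runners format failures differently.

```python
import re

# Assumption: failures show up on lines containing one of these markers,
# and stack frames start with "at". Adjust the pattern to your test runner.
INTERESTING = re.compile(r"FAIL|Error|✕|^\s*at ")


def summarize_log(log: str, max_lines: int = 20) -> str:
    """Keep only lines that look like failures, capped at `max_lines`."""
    kept = [line.strip() for line in log.splitlines() if INTERESTING.search(line)]
    return "\n".join(kept[:max_lines])


# Hypothetical test output, invented for illustration.
sample_log = """\
PASS src/utils/date.test.ts
PASS src/utils/string.test.ts
FAIL src/services/UserService.test.ts
  ✕ UserService should return current user
    TypeError: Cannot read property id of undefined
      at getSession (src/auth/session.ts:42:18)
PASS src/api/health.test.ts
"""

print(summarize_log(sample_log))
```

The passing tests disappear, and what is left is the failing file, the test name, the error, and the first stack frame.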

For Claude Code, the main file is `CLAUDE.md`.

Keep `CLAUDE.md` short. Put important commands and project rules there. Link to deeper docs instead of copying them.

Good things to include:

Things to avoid:

The point is to guide the agent, not overload every session.

For Codex, project instructions can come from files like `AGENTS.md`.

The same rule applies: keep it focused.

One setting I would watch closely is `project_doc_max_bytes`. It controls how much project instruction text Codex can load from agent docs. If your `AGENTS.md` is large, reducing this can stop every session from starting with too much text.
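Before touching any limit, it is worth checking how large your instruction files actually are. The sketch below just compares file sizes against a byte budget. The 32 KiB figure is an assumption chosen for illustration, and `instruction_file_report` is a hypothetical helper, so check your tool's documentation for its real limit and for how it handles oversized files.

```python
from pathlib import Path

# Assumption: a 32 KiB budget chosen for illustration. Check your tool's
# documentation for its actual limit and how it treats oversized files.
BUDGET_BYTES = 32 * 1024


def instruction_file_report(path: Path, budget_bytes: int = BUDGET_BYTES) -> str:
    """Report how much of a byte budget an instruction file uses."""
    if not path.exists():
        return f"{path}: not found"
    size = path.stat().st_size
    status = "over budget" if size > budget_bytes else "ok"
    return f"{path}: {size} bytes of {budget_bytes} ({status})"


for name in ("AGENTS.md", "CLAUDE.md"):
    print(instruction_file_report(Path(name)))
```

Run it from the repo root. A file that is close to or over the budget is a good candidate for moving details into linked docs.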

The best setup is usually:

For OpenCode, instructions and agents can also grow quickly.

Common places where context can expand:

`AGENTS.md`

`.opencode/`

`.opencode/agents/`

Keep global rules short. Keep project rules focused on commands, conventions, and where to look.

If you have long reusable instructions, move them into skills or separate docs instead of loading them every time. If you have different jobs, use separate agents instead of one giant instruction file.

I also made a small tool called [token-optimizer](https://www.npmjs.com/package/token-optimizer).

The goal is simple: reduce noisy command output before it gets sent back into an AI coding agent.

It is not meant to replace good prompting. It just helps with one common problem: terminal output, logs, diffs, and command results can get very large very quickly.

The tool tries to keep the useful parts, like:

That can make agentic coding sessions cleaner, reduce repeated noise, and help the agent focus on the actual issue.
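To make the idea concrete, here is a generic sketch of two common noise-reduction steps: collapsing repeated lines and keeping only the head and tail of long output. This is not how token-optimizer is implemented. The function names and the sample output are hypothetical and only illustrate the general technique.

```python
from itertools import groupby


def collapse_repeats(lines: list[str]) -> list[str]:
    """Replace runs of identical consecutive lines with one annotated line."""
    out = []
    for line, run in groupby(lines):
        count = sum(1 for _ in run)
        out.append(line if count == 1 else f"{line}  [repeated {count}x]")
    return out


def head_tail(lines: list[str], head: int = 5, tail: int = 5) -> list[str]:
    """Keep the first `head` and last `tail` lines, noting what was dropped."""
    if len(lines) <= head + tail:
        return lines
    dropped = len(lines) - head - tail
    marker = f"... {dropped} lines omitted ..."
    return lines[:head] + [marker] + lines[len(lines) - tail:]


# Hypothetical command output, invented for illustration.
raw = (
    ["installing deps"]
    + ["warning: peer dependency missing"] * 40
    + [f"compiled module {i}" for i in range(30)]
    + ["error: build failed in src/auth/session.ts"]
)

compact = head_tail(collapse_repeats(raw), head=3, tail=3)
print("\n".join(compact))
```

In this example, 72 lines of output shrink to 7, and the final error line survives.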

Each device, repo, model, and tool can behave differently. A small library, a frontend app, a backend service, and a large monorepo all need different settings.

Start small. Watch what the agent keeps reading. Remove repeated logs and repeated docs. Increase limits only when the agent is clearly missing useful context.

The goal is not to use the smallest possible context.

The goal is to use the smallest useful context.

Clean context usually means better answers, lower cost, and fewer strange turns in the middle of a task.
