[ARTICLE · art-28730] src=dev.to ↗ pub= topic=developer-tools verified=true sentiment=↑ positive

Your LLM prompt doesn't fit? Pack it by priority (zero dependencies)

A developer built contextcram, a zero-dependency Python library that prioritizes and packs context pieces (system prompts, chat history, documents) into a fixed token budget for LLM applications. It assigns each piece a priority and a strategy (drop, truncate, trim) to ensure important content is retained while fitting within the model's context window. The library also includes a reserve parameter to prevent the prompt from consuming all tokens needed for the model's reply.

3 min read · published Jun 15, 2026

Every RAG app and agent eventually hits the same wall: you have more stuff than fits in the model's context window — a system prompt, chat history, retrieved documents, tool output — and a fixed token budget.

The usual "fix" is to truncate the whole blob at the end, which means you chop off whatever happened to be last: sometimes a doc, sometimes half your system prompt. You drop the wrong things.

I got tired of rewriting that logic in every project, so I built contextcram, a tiny, zero-dependency library that treats this as a packing problem.

Give each piece of context a priority and a strategy for what should happen if it doesn't fit. Set a token budget. contextcram assembles the largest in-budget context that keeps the important parts.

pip install contextcram
from contextcram import Packer

ctx = (
    Packer(budget=8000)
    .add(system_prompt, priority="required")                 # never dropped
    .add(chat_history, priority="high", strategy="trim")     # drop oldest turns
    .add(retrieved_docs, priority="medium", strategy="drop") # all-or-nothing
    .add(tool_output, priority="low", strategy="truncate")   # cut to fit
    .fit()
)

print(ctx.text)          # the assembled, in-budget context
print(ctx.used_tokens)   # e.g. 7840
print(ctx.dropped_names) # what didn't make the cut
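
To make the idea concrete, here is a minimal sketch of priority-first packing written from scratch. It is not contextcram's source: the names Piece, pack, estimate_tokens and PRIORITY_RANK are hypothetical, the four-characters-per-token estimate is an assumption, and every piece is treated as all-or-nothing. A simple greedy pass like this is easy to reason about, though it is not guaranteed to find the largest possible fit.

```python
from dataclasses import dataclass

# Hypothetical names throughout; an illustration, not the library's code.
PRIORITY_RANK = {"required": 0, "high": 1, "medium": 2, "low": 3}


@dataclass
class Piece:
    name: str
    text: str
    priority: str


def estimate_tokens(text: str) -> int:
    """Rough estimate, assuming about four characters per token."""
    return max(1, len(text) // 4)


def pack(pieces: list[Piece], budget: int) -> tuple[list[Piece], list[str]]:
    """Admit pieces in priority order, then return the kept ones
    in their original insertion order plus the names that were dropped."""
    order = sorted(range(len(pieces)), key=lambda i: PRIORITY_RANK[pieces[i].priority])
    kept: set[int] = set()
    dropped: list[str] = []
    used = 0
    for i in order:
        cost = estimate_tokens(pieces[i].text)
        if used + cost <= budget:
            kept.add(i)
            used += cost
        else:
            dropped.append(pieces[i].name)
    return [p for i, p in enumerate(pieces) if i in kept], dropped


pieces = [
    Piece("docs", "d" * 400, "medium"),     # ~100 tokens
    Piece("system", "s" * 80, "required"),  # ~20 tokens
    Piece("tool", "t" * 200, "low"),        # ~50 tokens
]
kept, dropped = pack(pieces, budget=130)
print([p.name for p in kept], dropped)  # ['docs', 'system'] ['tool']
```

The low-priority tool output is the one that goes, even though it was added last only by coincidence; position in the prompt no longer decides what survives.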

When an optional item doesn't fully fit, its strategy decides what happens:

Strategy        Behavior
drop            Include it whole, or not at all
truncate        Cut from the end, keep the head (default)
truncate_head   Cut from the start, keep the tail
trim            For lists (e.g. messages): drop oldest first
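
The four behaviors in the table can be sketched as small standalone functions. These are illustrations only, not the library's implementation: the function names mirror the strategy names for readability, `room` is a hypothetical parameter meaning "tokens still available", and token counts again assume roughly four characters per token.

```python
CHARS_PER_TOKEN = 4  # assumed heuristic, not an exact tokenizer


def estimate_tokens(text: str) -> int:
    return len(text) // CHARS_PER_TOKEN


def drop(text: str, room: int) -> str:
    """All-or-nothing: return the text whole, or nothing."""
    return text if estimate_tokens(text) <= room else ""


def truncate(text: str, room: int) -> str:
    """Keep the head, cut from the end."""
    return text[: max(room, 0) * CHARS_PER_TOKEN]


def truncate_head(text: str, room: int) -> str:
    """Keep the tail, cut from the start."""
    return text[-room * CHARS_PER_TOKEN:] if room > 0 else ""


def trim(messages: list[str], room: int) -> list[str]:
    """Drop the oldest messages (front of the list) until the rest fits."""
    kept = list(messages)
    while kept and sum(estimate_tokens(m) for m in kept) > room:
        kept.pop(0)
    return kept


text = "abcdefghijkl"                   # ~3 tokens
print(drop(text, 2))                    # '' (does not fit whole)
print(truncate(text, 2))                # 'abcdefgh'
print(truncate_head(text, 2))           # 'efghijkl'
print(trim(["a" * 12, "b" * 8, "c" * 4], 3))  # ['bbbbbbbb', 'cccc']
```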

required items are always kept; if they alone blow the budget, you get a clear BudgetExceeded error instead of a silently mangled prompt.
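
The fail-loudly behavior is worth copying even in hand-rolled code. Below is a rough sketch of that check under stated assumptions: the BudgetExceeded class here is a local stand-in (the article does not show the real exception's constructor or attributes), check_required is a hypothetical helper, and tokens are estimated at four characters each.

```python
class BudgetExceeded(Exception):
    """Local stand-in for illustration; not imported from contextcram."""


def check_required(required_texts: list[str], budget: int) -> int:
    """Raise if the required items alone exceed the budget;
    otherwise return the room left for optional items."""
    needed = sum(len(t) // 4 for t in required_texts)  # assumed ~4 chars/token
    if needed > budget:
        raise BudgetExceeded(
            f"required items need ~{needed} tokens, but the budget is {budget}"
        )
    return budget - needed


print(check_required(["x" * 400], budget=150))  # 50 tokens left for optional items

try:
    check_required(["x" * 400], budget=50)
except BudgetExceeded as exc:
    print(f"refusing to build the prompt: {exc}")
```

Running the check before any optional item is considered means a too-large system prompt surfaces as an exception at build time, not as a confusing model response later.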

Two recurring annoyances solved in one line:

from contextcram import Packer

packer = Packer(model="gpt-4o", reserve=2000)
print(packer.full_budget)  # 128000
print(packer.budget)       # 126000  <- what you actually pack into

reserve= kills the classic "the prompt fit, but there's no room left for the model to answer" bug. Tie it to your max_tokens and you can't get it wrong.
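
The arithmetic behind reserve is just a subtraction, shown here as a plain-Python sketch. packing_budget and MAX_REPLY_TOKENS are hypothetical names, and the 128,000-token window is the gpt-4o figure from the example above.

```python
CONTEXT_WINDOW = 128_000   # gpt-4o window, as printed by full_budget above
MAX_REPLY_TOKENS = 2_000   # hypothetical cap on the model's answer


def packing_budget(context_window: int, reserve: int) -> int:
    """Tokens available for the prompt once room for the reply is set aside."""
    if not 0 <= reserve < context_window:
        raise ValueError("reserve must be non-negative and smaller than the window")
    return context_window - reserve


budget = packing_budget(CONTEXT_WINDOW, reserve=MAX_REPLY_TOKENS)
print(budget)  # 126000
```

Using one constant for both the reserve and the max_tokens value sent with the request keeps the two numbers from drifting apart.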

from contextcram import Packer
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage

llm = ChatOpenAI(model="gpt-4o")
docs = [d.page_content for d in retriever.invoke(question)]
history = [f"{m.type}: {m.content}" for m in memory.messages]

ctx = (
    Packer(model="gpt-4o", reserve=1500)
    .add(SYSTEM_PROMPT, priority="required")
    .add(history, priority="high", strategy="trim")
    .add("\n\n".join(docs), priority="medium", strategy="drop")
    .fit()
)

response = llm.invoke([SystemMessage(ctx.text), HumanMessage(question)])

Need exact token counts? Pass tokenizer=tiktoken_tokenizer("gpt-4o"), or wrap any tokenizer (Hugging Face, llama.cpp) with a one-line CallableTokenizer. The default is a fast characters-per-token heuristic so there are no required dependencies.
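
For readers curious what a pluggable tokenizer can look like, here is a small sketch of the pattern: a dependency-free heuristic counter plus an adapter around any encode function. The class names (Tokenizer, HeuristicTokenizer, EncodeFnTokenizer) and the count method are hypothetical and do not describe contextcram's actual interfaces; the default of four characters per token is likewise an assumption.

```python
import math
from typing import Callable, Protocol, Sequence


class Tokenizer(Protocol):
    def count(self, text: str) -> int: ...


class HeuristicTokenizer:
    """Dependency-free estimate based on text length."""

    def __init__(self, chars_per_token: float = 4.0) -> None:
        self.chars_per_token = chars_per_token

    def count(self, text: str) -> int:
        return math.ceil(len(text) / self.chars_per_token)


class EncodeFnTokenizer:
    """Adapts any encode(text) -> sequence-of-ids function to the same interface."""

    def __init__(self, encode: Callable[[str], Sequence[int]]) -> None:
        self._encode = encode

    def count(self, text: str) -> int:
        return len(self._encode(text))


# Toy whitespace "encoder" standing in for a real tokenizer's encode method.
toy = EncodeFnTokenizer(lambda s: list(range(len(s.split()))))

print(HeuristicTokenizer().count("a" * 10))  # 3
print(toy.count("fit this prompt"))          # 3
```

Rounding up in the heuristic errs on the side of overestimating, which is the safer direction when the goal is to stay under a hard limit.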

Honest answer: the concept isn't new. Priompt (and its Python port) and Character.AI's Prompt Poet do priority-based context assembly too — and they're more powerful (component models, cache-aware truncation, templating).

contextcram deliberately trades features for simplicity and zero dependencies: the core API is the single chain Packer(...).add(...).fit(). If you want the smallest possible helper that does one thing, fitting prioritized pieces into a budget, this is it.

pip install contextcram

It's MIT-licensed, fully typed (mypy --strict), and tested across Python 3.10–3.13. I'd genuinely love feedback on the API and the default strategies, so open an issue or drop a comment.
