{"slug": "your-llm-prompt-doesn-t-fit-pack-it-by-priority-zero-dependencies", "title": "Your LLM prompt doesn't fit? Pack it by priority (zero dependencies)", "summary": "A developer built contextcram, a zero-dependency Python library that prioritizes and packs context pieces (system prompts, chat history, documents) into a fixed token budget for LLM applications. It assigns each piece a priority and a strategy (drop, truncate, trim) to ensure important content is retained while fitting within the model's context window. The library also includes a reserve parameter to prevent the prompt from consuming all tokens needed for the model's reply.", "body_md": "Every RAG app and agent eventually hits the same wall: you have **more stuff than fits** in the model's context window (a system prompt, chat history, retrieved documents, tool output) and a fixed token budget.\n\nThe usual \"fix\" is to truncate the whole blob at the end. Which means you randomly chop off whatever happened to be last: sometimes a doc, sometimes half your system prompt. You drop the *wrong* things.\n\nI got tired of rewriting that logic in every project, so I built **contextcram**, a tiny, zero-dependency library that treats this as a\n\nGive each piece of context a **priority** and a **strategy** for what should happen if it doesn't fit. Set a token budget. `contextcram` assembles the largest in-budget context that keeps the important parts.\n\n```\npip install contextcram\n```\n\n``` python\nfrom contextcram import Packer\n\nctx = (\n    Packer(budget=8000)\n    .add(system_prompt, priority=\"required\")                 # never dropped\n    .add(chat_history, priority=\"high\", strategy=\"trim\")     # drop oldest turns\n    .add(retrieved_docs, priority=\"medium\", strategy=\"drop\") # all-or-nothing\n    .add(tool_output, priority=\"low\", strategy=\"truncate\")   # cut to fit\n    .fit()\n)\n\nprint(ctx.text)          # the assembled, in-budget context\nprint(ctx.used_tokens)   # e.g. 
7840\nprint(ctx.dropped_names) # what didn't make the cut\n```\n\nWhen an optional item doesn't fully fit, its `strategy` decides what happens:\n\n| Strategy | Behavior |\n|---|---|\n| `drop` | Include it whole, or not at all |\n| `truncate` | Cut from the end, keep the head (default) |\n| `truncate_head` | Cut from the start, keep the tail |\n| `trim` | For lists (e.g. messages): drop oldest first |\n\n`required` items are always kept; if they alone blow the budget, you get a clear `BudgetExceeded` error instead of a silently mangled prompt.\n\nTwo recurring annoyances solved in one line:\n\n``` python\nfrom contextcram import Packer\n\n# Budget pulled from the model; hold back 2k tokens for the reply\npacker = Packer(model=\"gpt-4o\", reserve=2000)\nprint(packer.full_budget)  # 128000\nprint(packer.budget)       # 126000  <- what you actually pack into\n```\n\n`reserve=` kills the classic *\"the prompt fit, but there's no room left for the model to answer\"* bug. Tie it to your `max_tokens` and you can't get it wrong.\n\n``` python\nfrom contextcram import Packer\nfrom langchain_openai import ChatOpenAI\nfrom langchain_core.messages import SystemMessage, HumanMessage\n\n# Assumes retriever, memory, question, and SYSTEM_PROMPT are defined elsewhere in your app\nllm = ChatOpenAI(model=\"gpt-4o\")\ndocs = [d.page_content for d in retriever.invoke(question)]\nhistory = [f\"{m.type}: {m.content}\" for m in memory.messages]\n\nctx = (\n    Packer(model=\"gpt-4o\", reserve=1500)\n    .add(SYSTEM_PROMPT, priority=\"required\")\n    .add(history, priority=\"high\", strategy=\"trim\")\n    .add(\"\\n\\n\".join(docs), priority=\"medium\", strategy=\"drop\")\n    .fit()\n)\n\nresponse = llm.invoke([SystemMessage(ctx.text), HumanMessage(question)])\n```\n\nNeed exact token counts? Pass `tokenizer=tiktoken_tokenizer(\"gpt-4o\")`, or wrap any tokenizer (Hugging Face, llama.cpp) with a one-line `CallableTokenizer`. 
The default is a fast characters-per-token heuristic so there are **no required dependencies**.\n\nHonest answer: the *concept* isn't new. [Priompt](https://github.com/anysphere/priompt) (and its Python port) and Character.AI's [Prompt Poet](https://pypi.org/project/prompt-poet/) do priority-based context assembly too, and they're more powerful (component models, cache-aware truncation, templating).\n\n`contextcram` deliberately trades features for **simplicity and zero dependencies**: `Packer(...).add(...).fit()`.\n\nIf you want the smallest possible helper that does one thing (fit prioritized pieces into a budget), this is it.\n\n```\npip install contextcram\n```\n\nIt's MIT-licensed, fully typed (`mypy --strict`), and tested across Python 3.10–3.13. I'd genuinely love feedback on the API and the default strategies: open an issue or drop a comment.", "url": "https://wpnews.pro/news/your-llm-prompt-doesn-t-fit-pack-it-by-priority-zero-dependencies", "canonical_source": "https://dev.to/wael_rahhal_790f328ac4301/your-llm-prompt-doesnt-fit-pack-it-by-priority-zero-dependencies-nhg", "published_at": "2026-06-15 23:09:00+00:00", "updated_at": "2026-06-15 23:47:06.001022+00:00", "lang": "en", "topics": ["developer-tools", "large-language-models", "natural-language-processing"], "entities": ["contextcram", "Priompt", "Prompt Poet", "Character.AI", "OpenAI", "LangChain", "Hugging Face", "llama.cpp"], "alternates": {"html": "https://wpnews.pro/news/your-llm-prompt-doesn-t-fit-pack-it-by-priority-zero-dependencies", "markdown": "https://wpnews.pro/news/your-llm-prompt-doesn-t-fit-pack-it-by-priority-zero-dependencies.md", "text": "https://wpnews.pro/news/your-llm-prompt-doesn-t-fit-pack-it-by-priority-zero-dependencies.txt", "jsonld": "https://wpnews.pro/news/your-llm-prompt-doesn-t-fit-pack-it-by-priority-zero-dependencies.jsonld"}}