60% Fable cost cut by converting code to images and having the model OCR it

A new open-source proxy tool called pxpipe cuts Claude Code input token costs by roughly 60% by converting dense text content such as system prompts, tool documentation, and history into compact PNG images before sending requests to the API. The tool exploits the fact that an image's token cost is fixed by its pixel dimensions: dense content packs roughly 3.1 characters per image token versus about 1 character per text token, which yields a 59–70% reduction in end-to-end billing on Fable workloads.

Cut Claude Code's input tokens by rendering bulky context as images: the same system prompt, tool docs, and history, in a fraction of the tokens.

An image's token cost is fixed by its pixel dimensions, not by how much text is inside it. Dense content (code, JSON, tool output) packs ~3.1 chars per image-token vs ~1 char per text-token on real Claude Code traffic. pxpipe is a local proxy that exploits that gap: it rewrites the bulky parts of your request (system prompt, tool docs, older history) into compact PNGs before the request leaves your machine.

Savings are workload-dependent: pxpipe wins on token-dense content and leaves sparse or small requests untouched, so these are measured snapshots, not constants.

The primary, durable result is input-token reduction: dense system prompts, tool docs, and history go in as compact images instead of text (the example render is ≈25k text tokens rendered as ≈2.7k image tokens), with every request measured against its own `count_tokens` counterfactual. Dollars are downstream of that: at current Fable list prices the token cut lands as a ~59–70% lower end-to-end bill (~72–74% on compressed requests; full pricing math in the FAQ). But list prices can change tomorrow and the token count won't, so tokens, not dollars, are the number to watch. Reproduce both from `~/.pxpipe/events.jsonl`.

This is what the model sees instead of text: ~48k characters of system prompt + tool docs (this repo's own README, FINDINGS, and source), ≈25k tokens as text, ≈2.7k image tokens as this page. Produced by the real `transformRequest` pipeline: whitespace-minified, reflowed into full rows with ↵ marking original newlines, OCR instruction banner co-rendered on top. The model reads renders like this at 100/100 on a clean eval (see benchmarks).

Fable 5 demo (the default, 100/100 reader): Fable-AB-Demo.mp4

Both demos with both panes on Fable 5 (plain left, pxpipe right). Fable reads what Opus can't.
The imaged phrase-count that Opus refuses (see the Opus demo below): the pxpipe arm counts the exact token 10/10 across 39 imaged filler files (matches grep ground truth line-for-line) and gets the multi-step ledger arithmetic right (8037 → … → 15,021). Same answers, ~7× cheaper. Session totals after both demos: plain $42.21, context 96% full (964.5k/1M), one task away from forced compaction, vs pxpipe $6.06 with context to spare (73.5k/1M).

Honest caveat, visible in the clip: the pxpipe arm answered the count first and needed one follow-up nudge to also print the ledger balance in the requested one-line format; the plain arm followed the format on the first try. Legibility is solved on Fable; single-reply format compliance is the remaining rough edge.

Opus 4.8 demo (Opus disabled by default): Opus-AB-Demo.mp4

Side-by-side: plain Claude (left) vs pxpipe (right), both on Opus 4.8 (opt-in; pxpipe is tuned for Fable, see the Fable clip above).

Demo 1, fix a failing test suite: both pass; the dashboard shows pxpipe cut the request to a fraction of the tokens (real, server-measured context/token reduction).

Demo 2, a big file-context (40 files, ~382k tokens) plus a math question and a "count this phrase" task: the math answer (a small text needle) reads on both. The phrase-count needs reading the imaged filler, so pxpipe-on-Opus can't read it and honestly surfaces that it won't fabricate a number (the documented lossy limit: exact values stay text). Plain, meanwhile, bogs down counting file-by-file.

```sh
npx pxpipe-proxy                                   # proxy on 127.0.0.1:47821
ANTHROPIC_BASE_URL=http://localhost:47821 claude   # point Claude Code at it
```

Open http://127.0.0.1:47821/ for a live dashboard: tokens saved, per-session stats, every text→image conversion side by side, a global kill switch, and runtime model chips including GPT 5.6 and GPT 5.5. Nothing else changes.
Responses stream normally; pxpipe only compresses the request (your context going up), never the model's output. Recent turns stay text; the system prompt, tool docs, and older bulk history are imaged.

It is lossy. pxpipe is a gist tier, not a lossless store. In a needle-in-haystack eval, exact 12-char hex strings inside dense imaged content came back 0/15 on Opus and 13/15 on Fable 5, and the failure mode is silent confabulation: a plausible wrong value, not an error. Anything you need back byte-exact (IDs, hashes, secrets, exact numbers) must stay text. Recent turns do; a dedicated verbatim-risk guard is not built yet.

Exact-recall escape hatch. pxpipe only images Fable requests (`PXPIPE_MODELS=claude-fable-5`), so any subagent on a non-Fable model passes through as text. Route work that needs byte-exact values to one: globally with `CLAUDE_CODE_SUBAGENT_MODEL=claude-sonnet-4-6`, or per-agent with `model: sonnet` in the agent frontmatter. It reads from source (file/JSONL), not the imaged history. This covers exact-recall you route on purpose; it does not catch a silent misread you did not expect. That is the unbuilt guard above.

Does it break real work? Parity in what we measured: a 10-instance SWE-bench Lite pilot (the easy subset) resolved 10/10 on both arms, pxpipe ON at $27 vs OFF at $54 token-equivalent, and 19 SWE-bench Pro pairs (harder, long-horizon) resolved 14/19 ON vs 15/19 OFF at −60% per-request. Verdicts agree on 18/19, and the single split (one ON fail) re-resolved 3/3 when replicated, i.e. run-to-run agentic variance, not compression. Small n; details and caveats below.

Savings are workload-dependent. It wins on token-dense content (~1 char/token: code, JSON, hashes) and loses money on sparse English prose (~3.5 chars/token). The built-in gate only images content where the math wins, calibrated against N=391 production rows.
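As a rough illustration of that gate, the sketch below compares a block's text-token cost with its estimated image cost. It assumes the full-page figures reported elsewhere in this README (≈4,761 vision tokens and ≈92,000 chars per 1928×1928 page) and bills every page at the full-page price, which is a simplification. The names `sketchGate`, `PAGE_VISION_TOKENS`, and `PAGE_CHAR_CAPACITY` are hypothetical and are not pxpipe's API.

```typescript
// Full-page figures quoted in this README; approximate, measured values.
const PAGE_VISION_TOKENS = 4761;  // ≈ cost of one 1928×1928 image
const PAGE_CHAR_CAPACITY = 92000; // ≈ chars packed onto one page

interface GateDecision {
  textTokens: number;
  imageTokens: number;
  image: boolean;
}

// Hypothetical helper (not pxpipe's API). Assumes every page is billed
// at the full-page price, which overstates the cost of partial pages.
function sketchGate(chars: number, charsPerToken: number): GateDecision {
  const textTokens = Math.ceil(chars / charsPerToken);
  const pages = Math.max(1, Math.ceil(chars / PAGE_CHAR_CAPACITY));
  const imageTokens = pages * PAGE_VISION_TOKENS;
  return { textTokens, imageTokens, image: imageTokens < textTokens };
}

// ~48k chars at the observed 1.91 chars/token vs. a small prose block.
const dense = sketchGate(48000, 1.91);
const small = sketchGate(6000, 3.5);
console.log(dense, small);
```

Under these assumptions the large dense block clears the gate, while the small block stays text because a single page costs more than its text tokens would.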
Model scope: one `PXPIPE_MODELS` CSV controls which model bases get imaged across both families. The default is `claude-fable-5,gpt-5.6` (GPT 5.5 is opt-in; it degrades on imaged context). Set `PXPIPE_MODELS=off` to disable imaging entirely, or use `~/.config/pxpipe/config.json` with `{ "models": "off" }` (or a list). For GPT, pxpipe keeps tool definitions in native JSON (only verbose schema prose moves into the image) so tool-calling stays reliable; unlike the Claude path, the GPT path does not add or depend on Anthropic `cache_control` prompt-cache markers. The dashboard chips can flip any model live without changing client configs. Opus 4.7/4.8 was the original Claude scope but misread ~7% of renders (10200 → 9400), so it was turned off by default once Fable 5 hit 100/100 with identical image billing. Opt it back in at your own risk via `PXPIPE_MODELS` or the dashboard chips. Everything else passes through untouched.

Measured with novel random-number problems the model cannot have memorized:

| test | N | text | pxpipe image | tokens |
|---|---|---|---|---|
| novel arithmetic, claude-fable-5 | 100 | 100% | 100% | −38% |
| novel arithmetic, claude-opus-4-8 | 100 | 100% | 93% | −38% |
| gist recall A/B (decisions, values, paths, names, negations; with distractors; 15k-45k char sessions), Fable 5 | 98/arm | 98/98 | 98/98 | - |
| state tracking (value mutated 3x, final/first/count), Fable 5 | 18/arm | 18/18 | 18/18 | - |
| confabulation on never-stated facts (lower is better), Fable 5 | 16/arm | 0/16 | 0/16 | - |
| verbatim 12-char hex recall, dense render, Opus | 15 | 15/15 | 0/15 | - |
| verbatim 12-char hex recall, dense render, Fable 5 | 15 | - | 13/15 | - |

10 SWE-bench Lite instances, Claude Code + Fable 5, paired runs through pxpipe (ON vs OFF), graded with the official swebench Docker harness:

| | pxpipe ON | OFF |
|---|---|---|
| resolved | 10/10 | 10/10 |
| request size vs own uncompressed body | −65% | ±0 |

The −65% is per-request (`count_tokens` probe of each body before compression), so it
has no turn-count confound. n=10/arm; Lite skews easy. Run totals, receipts, caveats: eval/swe-bench/ (/teamchong/pxpipe/blob/main/eval/swe-bench).

19 completed pairs across two runs (2 dropped: checkout failed both arms), same setup, official SWE-bench Pro-os Docker harness:

| | pxpipe ON | OFF |
|---|---|---|
| resolved | 14/19 | 15/19 |
| request size vs own uncompressed body | −60% | ±0 |

Verdicts agree on 18/19 (three instances failed both arms, one with byte-identical patches across arms). The single split (navidrome, ON fail) was replicated 3x on the ON arm: all three runs produced an identical patch and resolved, so the original loss was run-to-run agentic variance, not compression. Receipts: eval/swe-bench-pro/ (/teamchong/pxpipe/blob/main/eval/swe-bench-pro).

We also ran GSM8K: 96% imaged. But GSM8K is in training data, so the model recalls memorized answers through its own misreads, inflating the score; we lead with the clean novel-number eval instead. Reproduce: eval/gsm8k/ · eval/needle-haystack/ · eval/gist-recall/ · full analysis in FINDINGS.md.

Is the headline end-to-end, or only on the requests you touched?

End-to-end, the whole bill. Most compression tools report savings only on the input slice they touched, which flatters the number. The end-to-end denominator is every production request: the small ones pxpipe correctly left untouched, all cache writes and reads, and all output tokens (which the proxy never compresses). On a 13,709-request snapshot that was 59% ($100 → ~$41); a later 8,904-compressed-request trace measured ~70%. Compressed-only runs higher (~72–74%) and is quoted separately, never as the headline. The exact figure is workload-dependent; reproduce it on your own log.

How is the math measured?

Both sides of the same request, at the same moment.
For every `/v1/messages` POST the proxy fires a free `count_tokens` probe on the original uncompressed body (the counterfactual) in parallel with the real forward, and reads Anthropic's actually-billed `usage` block off the response. Both land in the same row of `~/.pxpipe/events.jsonl`, so there is no turn-count or run-to-run confound. Dollar conversion uses Fable 5 list ratios: input ×1.0, cache write ×1.25, cache read ×0.1, output ×5. Cache pricing is applied identically to both sides, so the caching discount cancels and cannot be double-counted as "savings". Re-derive it yourself from the events log: the formula and field names are documented in `src/core/baseline.ts`.

What does it actually compress?

Three kinds of input blocks, each behind a profitability gate:

- large `tool_result` bodies (file reads, command output, logs) above ~6k chars of token-dense content
- older collapsed history: turns behind the live tail get re-rendered as image pages, recent turns always stay text
- the static system prompt + tool docs slab

Everything else passes through byte-identical: your messages, recent turns, the model's output (it is the response, the proxy never touches it), sparse prose, and anything too small to win. Non-Fable models pass through entirely.

Has it ever failed for real, outside the benchmarks?

Yes, once in weeks of daily use: the model recalled a person's name from imaged chat history and got it confidently wrong. No error, just a plausible wrong name. That is the documented failure mode: exact strings in imaged content are not byte-safe. Coding sessions tolerate this because the agent re-reads files before editing; pure chat recall has no such check.

tool_result string ──► wrap at 1928px-wide columns ──► pack ~92,000 chars/page ──► PNG

The proxy intercepts `/v1/messages`, rewrites eligible bulk history into image blocks, splices them back (cache-friendly: static prefix preserved, so prompt caching keeps working), and forwards.
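The list-price weighting from "How is the math measured?" above can be sketched as a weighted sum over token classes. This is a minimal illustration, not the code in `src/core/baseline.ts`: the `Usage` shape, the `EventRow` pairing, and the sample numbers are all assumptions made for the example, and only the four ratios come from the text.

```typescript
// List-price ratios quoted above, relative to one input token.
const WEIGHTS = { input: 1.0, cacheWrite: 1.25, cacheRead: 0.1, output: 5 };

// Hypothetical token breakdown; the real field names may differ.
interface Usage {
  input: number;
  cacheWrite: number;
  cacheRead: number;
  output: number;
}

// Hypothetical pairing of counterfactual and billed usage for one request.
interface EventRow {
  baseline: Usage; // what the uncompressed body would have cost
  billed: Usage;   // what was actually billed
}

// Cost expressed in input-token equivalents.
function weightedCost(u: Usage): number {
  return (
    u.input * WEIGHTS.input +
    u.cacheWrite * WEIGHTS.cacheWrite +
    u.cacheRead * WEIGHTS.cacheRead +
    u.output * WEIGHTS.output
  );
}

// End-to-end saving: every row counts, including untouched requests
// and output tokens, so the denominator is the whole bill.
function endToEndSaving(rows: EventRow[]): number {
  let base = 0;
  let actual = 0;
  for (const r of rows) {
    base += weightedCost(r.baseline);
    actual += weightedCost(r.billed);
  }
  return base === 0 ? 0 : 1 - actual / base;
}

// Illustrative rows only: one compressed request, one left untouched.
const rows: EventRow[] = [
  {
    baseline: { input: 25000, cacheWrite: 0, cacheRead: 0, output: 400 },
    billed: { input: 2700, cacheWrite: 0, cacheRead: 0, output: 400 },
  },
  {
    baseline: { input: 1000, cacheWrite: 0, cacheRead: 0, output: 200 },
    billed: { input: 1000, cacheWrite: 0, cacheRead: 0, output: 200 },
  },
];
const saving = endToEndSaving(rows);
console.log(saving);
```

Because output tokens and untouched requests sit in both totals, the end-to-end figure comes out lower than the input-only reduction on the compressed request, which is the distinction the FAQ draws.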
Per-request events log to `~/.pxpipe/events.jsonl`.

The economics: a 1928×1928 image costs ≈4,761 vision tokens and holds up to ≈92,000 chars (≈48,000 text tokens at the observed density), so plain text is cheaper only when it runs denser than ~19 chars/token. Claude Code transcripts are far below that (observed 1.91 chars/token, N=391). The runtime estimator (`estimateImageCount` plus a chars/token gate) decides per-request; sparse prose is left as text.

Same engine, no proxy. Render text → PNGs, or run the full cache-safe transform:

```js
import { renderTextToPngs, transformAnthropicMessages } from "pxpipe";

const imgs = await renderTextToPngs(toolResultText); // RenderedImage[]

const { body, applied, info } = await transformAnthropicMessages({
  body: requestBytes,
  model: "claude-fable-5",
});
```

`options.keepSharp(block)` pins blocks as text (override the heuristic for IDs, hashes, paths); `options.emitRecoverable` returns the originals of imaged blocks so a stateful caller can recover them. These are the two halves of the fidelity contract for the lossy limitation below. Runtime is pure-JS (Node and edge/Workers); `@napi-rs/canvas` is build-time only. Full API, types, and constants: `src/core/index.ts`.

```sh
pnpm install && pnpm test   # 376 tests
pnpm run build              # regenerates dist/
```

- Lossy: see "the honest part" above. Verbatim recall from images is unreliable.
- Render latency: encoding PNGs adds time to large requests before they leave (partly offset by the model ingesting fewer tokens). Responses stream normally.
- ASCII/Latin-1 well tested; CJK works but conservatively.
- Runtime is pure-JS: runs on Node and edge/Workers. `@napi-rs/canvas` is a build-time-only dev dep (regenerating the glyph atlas), not a runtime dep.
- Fable 5 only.

Everything above is measured. Everything here is not. These are hypotheses, not claims; they ship as numbers with an n or they get cut.

Sharper glyphs. The 13/15 verbatim gap is partly font legibility, not just the model.
A per-char confusion matrix across render styles is paused mid-run (`eval/glyph-matrix/`); if a zero-cost style lowers read error, the gate compresses harder at the same fidelity.

Effective context. Dense text costs ~3x fewer tokens as images. If that holds in the live window and not just the bill, 1M tokens holds ~2x the real content. Open question: can a task needing ~2M raw context run inside Fable's 1M once the bulk is imaged?

Less active text, sharper model. Long contexts degrade reasoning as they fill. Imaging old bulk shrinks what the model actively reads while keeping it reachable. Hypothesis: same information, smaller active context, better long-task accuracy.

One bet: longer effective context and a sharper model on long tasks, from the same Fable 5. Numbers or retraction, no hype between.

MIT.