# 60% Fable cost cut by converting code to images and having the model OCR it

> Source: <https://github.com/teamchong/pxpipe>
> Published: 2026-07-03 15:50:49+00:00

**Cut Claude Code's input tokens by rendering bulky context as images — the same system prompt, tool docs, and history, in a fraction of the tokens.**

An image's token cost is fixed by its pixel dimensions, not by how much text is inside it. Dense content (code, JSON, tool output) packs ~3.1 chars per image-token vs ~1 char per text-token on real Claude Code traffic. pxpipe is a local proxy that exploits that gap: it rewrites the bulky parts of your request (system prompt, tool docs, older history) into compact PNGs before the request leaves your machine.

Savings are **workload-dependent**: pxpipe wins on token-dense content and
leaves sparse/small requests untouched, so these are measured snapshots, not
constants. The primary, durable result is **input-token reduction**: dense
system prompts, tool docs, and history go in as compact images instead of text
(the example render below is ≈25k text tokens rendered as ≈2.7k image tokens),
with every request measured against its own `count_tokens` counterfactual.
**Dollars are downstream of that**: at current Fable list prices the token cut
lands as a **~59–70% lower end-to-end bill** (~72–74% on compressed requests;
full pricing math in the FAQ). But list prices can change tomorrow and the
token count won't, so tokens, not dollars, are the number to watch. Reproduce
both from `~/.pxpipe/events.jsonl`.

This is what the model sees instead of text:

*~48k characters of system prompt + tool docs (this repo's own README,
FINDINGS, and source), ≈25k tokens as text, ≈2.7k image tokens as this page.
Produced by the real transformRequest pipeline: whitespace-minified, reflowed
into full rows with ↵ marking original newlines, OCR instruction banner
co-rendered on top. The model reads renders like this at 100/100 on a clean
eval (see benchmarks).*

**Fable 5 demo (the default, 100/100 reader):**

*Fable-AB-Demo.mp4: both demos, with both panes on Fable 5 (plain left, pxpipe right).*

**Fable reads what Opus can't.** The imaged phrase-count that Opus refuses (see the Opus demo below): the pxpipe arm counts the exact token **10/10** across 39 imaged filler files (matches `grep` ground truth line-for-line) and gets the multi-step ledger arithmetic right (8037 → … → 15,021). **Same answers, ~7× cheaper.** Session totals after both demos: plain **$42.21**, context **96% full** (964.5k/1M, one task away from forced compaction) vs pxpipe **$6.06** with context to spare (73.5k/1M).

**Honest caveat, visible in the clip:** the pxpipe arm answered the count first and needed one follow-up nudge to also print the ledger balance in the requested one-line format; the plain arm followed the format on the first try. Legibility is solved on Fable; single-reply format compliance is the remaining rough edge.

**Opus 4.8 demo (Opus disabled by default):**

*Opus-AB-Demo.mp4: side-by-side, plain Claude (left) vs pxpipe (right), both on Opus 4.8 (opt-in; pxpipe is tuned for Fable, see the Fable clip above).*

**Demo 1, fix a failing test suite:** both pass; the dashboard shows pxpipe cut the request to a fraction of the tokens (real, server-measured **context/token reduction**).

**Demo 2, a big file-context (40 files, ~382k tokens) plus a math question and a "count this phrase" task:** the math answer (a small **text** needle) reads on both. The phrase-count needs reading the **imaged** filler, so pxpipe-on-Opus can't read it and **honestly surfaces that it won't fabricate a number** (the documented lossy limit: exact values stay text). Plain, meanwhile, bogs down counting file-by-file.

```
npx pxpipe-proxy                                  # proxy on 127.0.0.1:47821
ANTHROPIC_BASE_URL=http://localhost:47821 claude  # point Claude Code at it
```

Open [http://127.0.0.1:47821/](http://127.0.0.1:47821/) for a live dashboard: tokens saved, per-session
stats, every text→image conversion side by side, a global kill switch, and
runtime model chips including GPT 5.6 and GPT 5.5.

Nothing else changes. Responses stream normally; pxpipe only compresses the
*request* (your context going up), never the model's output. Recent turns stay
text; the system prompt, tool docs, and older bulk history are imaged.

**It is lossy.** pxpipe is a *gist* tier, not a lossless store. In a
needle-in-haystack eval, exact 12-char hex strings inside dense imaged content
came back **0/15** on Opus and 13/15 on Fable 5, and the failure mode is
*silent confabulation*: a plausible wrong value, not an error. Anything you
need back byte-exact (IDs, hashes, secrets, exact numbers) must stay text.
Recent turns do; a dedicated verbatim-risk guard is not built yet.

**Exact-recall escape hatch.** pxpipe only images Fable requests
(`PXPIPE_MODELS=claude-fable-5`), so any subagent on a non-Fable model passes
through as text. Route work that needs byte-exact values to one: globally with
`CLAUDE_CODE_SUBAGENT_MODEL=claude-sonnet-4-6`, or per-agent with
`model: sonnet` in the agent frontmatter. It reads from source (file/JSONL),
not the imaged history. This covers exact-recall you route on purpose; it does
**not** catch a silent misread you did not expect. That is the unbuilt guard
above.

**Does it break real work?** Parity in what we measured: a 10-instance
SWE-bench Lite pilot (the easy subset) resolved **10/10 on both arms**,
pxpipe ON at $27 vs OFF at $54 token-equivalent, and 19 SWE-bench Pro
pairs (harder, long-horizon) resolved **14/19 ON vs 15/19 OFF** at
**-60% per-request**: verdicts agree on 18/19, and the single split
(one ON fail) re-resolved 3/3 when replicated, i.e. run-to-run agentic
variance, not compression. Small n, details and caveats below.

**Savings are workload-dependent.** It wins on token-dense content
(~1 char/token: code, JSON, hashes) and *loses money* on sparse English prose
(~3.5 chars/token). The built-in gate only images content where the math wins,
calibrated against N=391 production rows.

**Model scope:** one `PXPIPE_MODELS` CSV controls which model bases get imaged
across both families; the default is `claude-fable-5,gpt-5.6` (GPT 5.5 is
opt-in; it degrades on imaged context). Set `PXPIPE_MODELS=off` to disable
imaging entirely, or use `~/.config/pxpipe/config.json` with
`{ "models": "off" }` (or a list). For GPT, pxpipe keeps tool definitions in
native JSON (only verbose schema prose moves into the image) so tool-calling
stays reliable; unlike the Claude path, the GPT path does not add or depend on
Anthropic `cache_control` prompt-cache markers. The dashboard chips can flip
any model live without changing client configs. Opus 4.7/4.8 was the original
Claude scope but misread ~7% of renders (`10200` → `9400`), so it was turned
off by default once Fable 5 hit 100/100 with identical image billing. Opt it
back in at your own risk via `PXPIPE_MODELS` or the dashboard chips. Everything
else passes through untouched.

Measured with novel random-number problems the model cannot have memorized:

| test | N | text | pxpipe (image) | tokens |
|---|---|---|---|---|
| novel arithmetic, `claude-fable-5` | 100 | 100% | 100% | −38% |
| novel arithmetic, `claude-opus-4-8` | 100 | 100% | 93% | −38% |
| gist recall A/B (decisions, values, paths, names, negations; with distractors; 15k-45k char sessions), Fable 5 | 98/arm | 98/98 | 98/98 | - |
| state tracking (value mutated 3x, final/first/count), Fable 5 | 18/arm | 18/18 | 18/18 | - |
| confabulation on never-stated facts (lower is better), Fable 5 | 16/arm | 0/16 | 0/16 | - |
| verbatim 12-char hex recall, dense render, Opus | 15 | 15/15 | 0/15 | - |
| verbatim 12-char hex recall, dense render, Fable 5 | 15 | - | 13/15 | - |

10 SWE-bench Lite instances, Claude Code + Fable 5, paired runs through
pxpipe ON vs OFF, graded with the official `swebench` Docker harness:

| | pxpipe ON | OFF |
|---|---|---|
| resolved | 10/10 | 10/10 |
| request size vs own uncompressed body | −65% | ±0 |

The −65% is per-request (`count_tokens` probe of each body before
compression), so it has no turn-count confound. n=10/arm, Lite skews easy.
Run totals, receipts, caveats: [eval/swe-bench/](/teamchong/pxpipe/blob/main/eval/swe-bench).

19 completed pairs across two runs (2 dropped: checkout failed both
arms), same setup, official `SWE-bench_Pro-os` Docker harness:

| | pxpipe ON | OFF |
|---|---|---|
| resolved | 14/19 | 15/19 |
| request size vs own uncompressed body | −60% | ±0 |

Verdicts agree on 18/19 (three instances failed both arms, one with
byte-identical patches across arms). The single split (navidrome, ON
fail) was replicated 3x on the ON arm: all three runs produced an
identical patch and **resolved**, so the original loss was run-to-run
agentic variance, not compression. Receipts:
[eval/swe-bench-pro/](/teamchong/pxpipe/blob/main/eval/swe-bench-pro).

We also ran GSM8K: 96% imaged. But GSM8K is in the training data, so the model
can recall memorized answers through its own misreads, which inflates the
score; we lead with the clean novel-number eval instead. Reproduce:
`eval/gsm8k/` · `eval/needle-haystack/` · `eval/gist-recall/` · full analysis
in `FINDINGS.md`.

**Is the headline end-to-end, or only on the requests you touched?**
End-to-end, the whole bill. Most compression tools report savings only on
the input slice they touched, which flatters the number. The end-to-end
denominator is *every* production request: the small ones pxpipe correctly
left untouched, all cache writes and reads, and all output tokens (which the
proxy never compresses). On a 13,709-request snapshot that was 59% ($100 →
~$41); a later 8,904-compressed-request trace measured ~70%. Compressed-only
runs higher (~72–74%) and is quoted separately, never as the headline. The
exact figure is workload-dependent — reproduce it on your own log.
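
The gap between the two denominators is easy to see in a small sketch. The row shape here (`baseline`, `actual`, `compressed`) is hypothetical and is not the schema of `~/.pxpipe/events.jsonl`, and the numbers are invented to show the arithmetic, not measured.

```js
// Hypothetical per-request rows: cost-weighted tokens for the uncompressed
// counterfactual (baseline) and for what was actually billed (actual).
// Field names are illustrative, not the real events.jsonl schema.
const rows = [
  { baseline: 1000, actual: 300, compressed: true },
  { baseline: 500, actual: 150, compressed: true },
  { baseline: 200, actual: 200, compressed: false }, // too small: passed through
  { baseline: 300, actual: 300, compressed: false }, // sparse prose: passed through
];

// Fraction saved over a set of rows: 1 - sum(actual) / sum(baseline).
function savings(subset) {
  const baseline = subset.reduce((sum, r) => sum + r.baseline, 0);
  const actual = subset.reduce((sum, r) => sum + r.actual, 0);
  return baseline === 0 ? 0 : 1 - actual / baseline;
}

const endToEnd = savings(rows); // every request in the denominator
const compressedOnly = savings(rows.filter((r) => r.compressed));
```

With these invented rows the compressed-only figure comes out at 70% and the end-to-end figure at 52.5%: untouched requests dilute the savings, which is why the two numbers are quoted separately.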

**How is the math measured?**
Both sides of the same request, at the same moment. For every `/v1/messages`
POST the proxy fires a free `count_tokens` probe on the original uncompressed
body (the counterfactual) in parallel with the real forward, and reads
Anthropic's actually-billed usage block off the response. Both land in the
same row of `~/.pxpipe/events.jsonl`, so there is no turn-count or
run-to-run confound. Dollar conversion uses Fable 5 list ratios: input ×1.0,
cache write ×1.25, cache read ×0.1, output ×5. Cache pricing is applied
identically to both sides, so the caching discount cancels and cannot be
double-counted as "savings". Re-derive it yourself from the events log: the
formula and field names are documented in `src/core/baseline.ts`.
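
As a rough illustration of the weighting, the sketch below converts a usage block into input-token equivalents using the ratios quoted above. It is not the code in `src/core/baseline.ts`: the function name and the example numbers are made up, the field names are those of the Anthropic Messages API usage block, and the example ignores caching because how cache reads and writes are attributed to the counterfactual is not described here.

```js
// Fable 5 list-price ratios quoted above, relative to one input token.
const WEIGHTS = { input: 1.0, cacheWrite: 1.25, cacheRead: 0.1, output: 5 };

// Cost of one request in input-token equivalents. Field names follow the
// Anthropic Messages API usage block; missing fields count as zero.
function tokenEquivalentCost(usage) {
  return (
    (usage.input_tokens ?? 0) * WEIGHTS.input +
    (usage.cache_creation_input_tokens ?? 0) * WEIGHTS.cacheWrite +
    (usage.cache_read_input_tokens ?? 0) * WEIGHTS.cacheRead +
    (usage.output_tokens ?? 0) * WEIGHTS.output
  );
}

// Made-up example: the same request with and without compression.
// Output tokens match on both sides because the proxy never touches output.
const counterfactual = { input_tokens: 20000, output_tokens: 500 };
const billed = { input_tokens: 4000, output_tokens: 500 };

const saved = 1 - tokenEquivalentCost(billed) / tokenEquivalentCost(counterfactual);
```

Because output is weighted ×5 and is never compressed, an 80% cut in input tokens in this example lands as roughly a 71% cut in weighted cost.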

**What does it actually compress?**
Three kinds of *input* blocks, each behind a profitability gate:

- large `tool_result` bodies (file reads, command output, logs) above ~6k chars of token-dense content
- older collapsed history: turns behind the live tail get re-rendered as image pages, recent turns always stay text
- the static system prompt + tool docs slab

Everything else passes through byte-identical: your messages, recent turns, the model's output (it is the response, the proxy never touches it), sparse prose, and anything too small to win. Non-Fable models pass through entirely.

**Has it ever failed for real, outside the benchmarks?**
Yes, once in weeks of daily use: the model recalled a person's name from
imaged chat history and got it confidently wrong. No error, just a
plausible wrong name. That is the documented failure mode: exact strings
in imaged content are not byte-safe. Coding sessions tolerate this because
the agent re-reads files before editing; pure chat recall has no such check.

```
tool_result string ──► wrap at 1928px-wide columns ──► pack ~92,000 chars/page ──► PNG[]
```

The proxy intercepts `/v1/messages`, rewrites eligible bulk history into image
blocks, splices them back cache-friendly (static prefix preserved, so prompt
caching keeps working), and forwards. Per-request events log to
`~/.pxpipe/events.jsonl`.

The economics: a 1928×1928 image costs ≈4,761 vision tokens and holds up to
≈92,000 chars (≈48,000 text tokens at the observed density), so plain text is
cheaper *only* when it runs denser than ~19 chars/token. Claude Code transcripts
are far below that (observed 1.91 chars/token, N=391). The runtime estimator
(`estimateImageCount`) plus a chars/token gate decides per-request; sparse
prose is left as text.
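
The break-even arithmetic can be reproduced from the two constants in the paragraph above. The sketch below is a simplified stand-in, not the logic of `estimateImageCount`: `shouldImage` is a hypothetical name, and it assumes every page is billed at the full 1928×1928 price, which overstates the cost of a partly filled page.

```js
// Constants quoted in the text above.
const PAGE_IMAGE_TOKENS = 4761; // vision tokens for one 1928×1928 page
const PAGE_CAPACITY_CHARS = 92000; // max chars packed into one page

// Chars carried per image token on a full page. Text only wins if it is
// denser than this.
const breakEvenCharsPerToken = PAGE_CAPACITY_CHARS / PAGE_IMAGE_TOKENS;

// Hypothetical gate: image the content only if the pages cost fewer tokens
// than the text would. Assumes full-page pricing for every page.
function shouldImage(charCount, textTokens) {
  const pages = Math.ceil(charCount / PAGE_CAPACITY_CHARS);
  const imageTokens = pages * PAGE_IMAGE_TOKENS;
  return imageTokens < textTokens;
}

const denseTranscript = shouldImage(92000, 48000); // ~1.9 chars/token
const sparseProse = shouldImage(92000, 4000); // 23 chars/token
```

The first call returns `true` (one page at 4,761 tokens against 48,000 text tokens) and the second returns `false`, since text at 23 chars/token is already past the ~19 chars/token break-even.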

Same engine, no proxy. Render text → PNGs, or run the full cache-safe transform:

``` js
import { renderTextToPngs, transformAnthropicMessages } from "pxpipe";

const imgs = await renderTextToPngs(toolResultText);            // RenderedImage[]
const { body, applied, info } = await transformAnthropicMessages({
  body: requestBytes,
  model: "claude-fable-5",
});
```

`options.keepSharp(block)` pins blocks as text (override the heuristic for IDs,
hashes, paths); `options.emitRecoverable` returns the originals of imaged blocks
so a stateful caller can recover them. These are the two halves of the fidelity
contract for the lossy limitation below. Runtime is pure-JS (Node and
edge/Workers); `@napi-rs/canvas` is build-time only. Full API, types, and
constants: `src/core/index.ts`.
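
One way a caller might use `keepSharp` is with a predicate that pins any block containing a string that must come back byte-exact. The sketch below rests on assumptions: that the callback receives a content block carrying a string `text` or `content` field, and that returning `true` keeps the block as text. Neither is confirmed here, so check `src/core/index.ts` before relying on it. The pattern is illustrative and is not pxpipe's built-in heuristic.

```js
// Runs of 12 or more hex characters: hashes, commit SHAs, hex IDs.
// Illustrative only; tune the pattern to the values you need back exactly.
const HEX_ID = /\b[0-9a-f]{12,}\b/i;

// Assumed contract: return true to pin the block as text.
function keepSharp(block) {
  const text =
    typeof block.text === "string" ? block.text
    : typeof block.content === "string" ? block.content
    : "";
  return HEX_ID.test(text);
}

const pinned = keepSharp({ type: "text", text: "commit 3f9a1c2b7d4e failed" });
const imaged = keepSharp({ type: "text", text: "please refactor the parser" });
```

Here `pinned` is `true` and `imaged` is `false`. A predicate like this only covers patterns you anticipate; it does not replace the unbuilt verbatim-risk guard.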

```
pnpm install && pnpm test     # 376 tests
pnpm run build                # regenerates dist/
```

- **Lossy**: see "the honest part" above. Verbatim recall from images is unreliable.
- Render latency: encoding PNGs adds time to large requests before they leave (partly offset by the model ingesting fewer tokens). Responses stream normally.
- ASCII/Latin-1 well tested; CJK works but conservatively.
- Runtime is pure-JS: runs on Node and edge/Workers. `@napi-rs/canvas` is a build-time-only dev dep (regenerating the glyph atlas), not a runtime dep.
- Fable 5 only.

Everything above is measured. Everything here is not. These are hypotheses, not claims; they ship as numbers with an n or they get cut.

**Sharper glyphs.** The 13/15 verbatim gap is partly font legibility, not just the model. A per-char confusion matrix across render styles is paused mid-run (`eval/glyph-matrix/`); if a zero-cost style lowers read error, the gate compresses harder at the same fidelity.

**Effective context.** Dense text takes ~3x fewer tokens as images. If that holds in the live window and not just the bill, 1M tokens holds ~2x the real content. Open question: can a task needing ~2M raw context run inside Fable's 1M once the bulk is imaged?

**Less active text, sharper model.** Long contexts degrade reasoning as they fill. Imaging old bulk shrinks what the model actively reads while keeping it reachable. Hypothesis: same information, smaller active context, better long-task accuracy.

One bet: longer effective context and a sharper model on long tasks, from the same Fable 5. Numbers or retraction, no hype between.

MIT.
