Snapcompact: SoTA Compaction – Instant, Local, Free. Pick 3

A 1568×1568 PNG fits about 40,000 characters of text in a 6×10 pixel font. That’s ~10,000 tokens worth of text, billed by Anthropic’s pixel formula as 3,279 image tokens. Do you see where I’m going with this?
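The arithmetic behind those numbers can be checked with a small sketch. It assumes whole glyph cells tile the image from the top-left corner, uses Anthropic's documented approximation of image tokens as pixels divided by 750 (rounding up is my assumption, chosen to match the 3,279 figure), and uses the common rule of thumb of about 4 characters per text token. The function names are illustrative, not taken from the article's code.

```python
import math

IMAGE_SIDE = 1568          # pixels per side, from the article
CELL_W, CELL_H = 6, 10     # glyph cell of the 6x10 font


def grid_capacity(side: int, cell_w: int, cell_h: int) -> int:
    """Characters that fit when whole glyph cells tile a square image."""
    return (side // cell_w) * (side // cell_h)


def image_tokens(width: int, height: int) -> int:
    """Approximate image tokens as pixels / 750, rounded up (assumed rounding)."""
    return math.ceil(width * height / 750)


chars = grid_capacity(IMAGE_SIDE, CELL_W, CELL_H)
tokens = image_tokens(IMAGE_SIDE, IMAGE_SIDE)
text_tokens = chars / 4  # rule of thumb: ~4 characters per text token

print(chars)                           # 40716
print(tokens)                          # 3279
print(round(text_tokens / tokens, 1))  # 3.1
```

The ratio on the last line is where "a third of the input price" comes from, before any output-side costs are counted.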

Snapcompact: when the context window fills up, render it into dense pixel-font bitmaps and hand those back as images. “A picture is worth a thousand words” turns out to be quite literally true — the model reads it back near-verbatim, at a third of the input price. Obligatory benchmark:

This started as a joke (“free token glitch lol”). Then I benchmarked it, identified where it went wrong, cracked open Qwen’s attention layer, fixed the issues, benchmarked it again — and here I am writing it up, because it generalized remarkably well to frontier models.

¶ 0x0: The Case Against Compaction

I am not a big fan of compaction. In every single harness, including my own, I’ve always felt like it “crippled” the model to the point where you would have been better off with a completely new session.

Eliding tool results is an okay alternative (instant, deterministic), but sometimes not really sufficient. It also occasionally confuses the model about tool calling. LLMs complete stories; if half of your story is `[elided...]`, how confident do you think it will be about using them?

Handoffs are as good as it gets — but unlike a plan, you don’t usually steer the handoffs, and when you don’t, agents waste precious context writing an unnecessarily detailed diary, followed by a TODO list that practically begs the next agent to declare the goal impossible and ship an “MVP” instead.

My thinking was essentially that if you need compaction often, you’re doing something wrong: the plan either has scope creep, or should have been explicitly orchestrated via subagents so that the main agent could stay responsible for the entire scope.

However, spoiled by the 1M context window, these days I often hit the 500k mark by the end of a session — a mortal sin in my book a few months ago. But long-horizon tasks do better when one coherent agent drives the plan uninterrupted, and that easily reaches those levels even with aggressive delegation.

So there I was, staring at the 5h usage limit bar going red while this thing grinned back at me, thinking: maybe I should compact regularly…

Fine. But if I have to compact, it’s going to lose nothing.

¶ 0x1: A Stupid Experiment

It began with a 328KB session log and a simple question: what if I just printed this thing out and started the session with it?

Attempt one was maximally greedy: Tom Thumb, a 3×5 pixel font, 122,696 characters in a single image.

I sent it to a fresh agent session, zero explanation, and got back:

The image appears to be pure noise with random pixels, which suggests it might be corrupted or a file that’s been misnamed as PNG.

Fair. Attempt two used the X11 `6x10` font (glyphs actually designed for that cell size), 40,716 characters, with each text row cycling through six colors. Same model, and there it was:

- It identified the session’s topic and quoted me back verbatim.
- It named 18 identifiers from the log with 100% recall.
- Asked about a single assignment in the bottom-most row of the image, where the log cuts off, it hedged (“I’d be guessing — possibly `0`”) and guessed the state right.

10k tokens of text, carried by 3,279 image tokens, recalled with near-perfect precision. Okay. Now I’m invested.

¶ 0x2: Optimizing the Fonts

How small can the font go? I swept some font configurations and asked the model to transcribe fixed regions, scoring edit similarity against ground truth:

| font | px²/char | chars/image | transcription | identifiers read |
|---|---|---|---|---|
| 8×13 | 104 | 23,520 | 1.00 | 20/20 |
| 6×10 | 60 | 40,716 | 0.79 | 20/20 |
| 5×8 | 40 | 61,348 | 0.37 | 17/19 |
| 5×7 | 35 | 70,112 | 0.30 | 10/20 |
| 4×6 | 24 | 102,312 | 0.02 | 9/20 |

The cliff is sharp and it sits around 35–40 px² per character. Above it, exact transcription degrades but identifier-level recall stays weirdly strong: the model can’t reproduce every byte, but it reads the names. Below it, nothing.
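The article does not say which edit-similarity metric produced the transcription column, so the following is only a sketch of how such a score could be computed. It swaps in `difflib.SequenceMatcher.ratio()` from the Python standard library as a stand-in, and the helper names and sample strings are hypothetical.

```python
from difflib import SequenceMatcher


def edit_similarity(truth: str, transcript: str) -> float:
    """Similarity in [0, 1]; 1.0 means the transcript matches exactly."""
    return SequenceMatcher(None, truth, transcript, autojunk=False).ratio()


def identifiers_read(identifiers: list[str], transcript: str) -> tuple[int, int]:
    """Count how many known identifiers appear verbatim in the transcript."""
    hits = sum(1 for name in identifiers if name in transcript)
    return hits, len(identifiers)


# Hypothetical ground truth and a transcript with a few misread glyphs.
truth = "def render_chunk(text, font): return pack(text, font)"
garbled = "def render_chunk(tcxt, f0nt): retvrn pack(text, font)"

print(round(edit_similarity(truth, garbled), 2))
print(identifiers_read(["render_chunk", "pack"], garbled))  # (2, 2)
```

The toy example shows the same split as the table: a transcript can lose points on byte-level similarity while every identifier is still read correctly.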

The funny thing is, this section was worse than useless — this exact optimization comes back to bite us in a bit.

¶ 0x3: Thinking…

Anecdotes about my own log don’t generalize, so let’s get a proper benchmark: SQuAD v1.1, extractive questions with gold answers. The harness packs passages into chunks sized to each technique’s carrying capacity, samples 30 questions per chunk spread evenly (so answers land at every image row, top to bottom), and runs every technique over the same corpus:

- **text**: the corpus passed verbatim; the ceiling,
- **handoff**: a simple handoff prompt,
- **compact**: provider-side compaction where available, a summarization call otherwise,
- **img-{font}-{variant}**: snapcompact, where the variant is **bw** (plain black-on-white) or **sent** (glyph ink cycles color per sentence).

Scores are SQuAD F1; models are told to answer UNREADABLE when they can’t extract the fact.
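For readers who have not met the metric, here is a minimal sketch of token-level SQuAD F1, following the normalization steps of the standard SQuAD v1.1 evaluation (lowercase, strip punctuation, drop English articles, split on whitespace). It is not the article's harness, and the example answers are made up.

```python
import re
import string
from collections import Counter


def normalize(answer: str) -> list[str]:
    """Lowercase, strip punctuation and English articles, split on whitespace."""
    answer = answer.lower()
    answer = "".join(ch for ch in answer if ch not in string.punctuation)
    answer = re.sub(r"\b(a|an|the)\b", " ", answer)
    return answer.split()


def f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall against one gold answer."""
    pred, ref = normalize(prediction), normalize(gold)
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def squad_f1(prediction: str, gold_answers: list[str]) -> float:
    """Best F1 over all gold answers for a question."""
    return max(f1(prediction, gold) for gold in gold_answers)


print(squad_f1("the Denver Broncos", ["Denver Broncos"]))  # 1.0
print(squad_f1("UNREADABLE", ["Denver Broncos"]))          # 0.0
print(round(squad_f1("Broncos", ["Denver Broncos"]), 3))   # 0.667
```

An `UNREADABLE` reply shares no tokens with the gold answer, so it scores zero, which is why a technique that loses facts shows up so starkly in the table.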

| technique | fable-5 | opus-4.8 | gpt-5.5 | gemini-3.5-flash |
|---|---|---|---|---|
| text (ceiling) | 0.904 $0.4984 | 0.911 $0.6367 | 0.861 $0.0847 | 0.898 $0.0577 |
| handoff | 0.540 $1.2241 | 0.248 $1.0065 | 0.368 $0.2386 | 0.889 $0.1759 |
| compact | 0.406 $0.9427 | 0.000 $0.7576 | 0.896 $0.3393 | 0.000 $0.0420 |
| img-6×10-sent | 0.882 $0.6400 | 0.601 $0.2430 | 0.822 $0.2452 | 0.805 $0.0970 |
| img-6×10-bw | 0.856 $0.7568 | 0.652 $0.2369 | 0.792 $0.3026 | 0.767 $0.1135 |
| img-5×8-sent | 0.773 $0.4532 | 0.409 $0.1626 | 0.751 $0.1819 | 0.738 $0.1006 |
| img-5×8-bw | 0.830 $0.6866 | 0.425 $0.1619 | 0.778 $0.2359 | 0.674 $0.0941 |

For a first attempt not bad… wait, muh token savings, how is this more expensive? Let’s have a look at this other table.

Tokens: input/output/thinking

| technique | fable-5 | opus-4.8 | gpt-5.5 | gemini-3.5-flash |
|---|---|---|---|---|
| text (ceiling) | 37,793 / 2,410 / 1,435 | 37,793 / 931 / 0 | 17,761 / 2,998 / 2,298 | 24,535 / 10,745 / 9,983 |
| handoff | 49,363 / 14,609 / 3,717 | 43,237 / 4,773 / 0 | 25,032 / 11,710 / 4,835 | 49,387 / 36,577 / 11,959 |
| compact | 45,130 / 9,828 / 3,248 | 40,436 / 2,014 / 0 | 45,562 / 15,329 / 1,032 | 26,151 / 6,582 / 5,422 |
| img-6×10-sent | 11,816 / 10,437 / 9,483 | 11,816 / 877 / 0 | 10,188 / 14,049 / 13,368 | 4,991 / 23,491 / 22,764 |
| img-6×10-bw | 11,816 / 12,772 / 11,782 | 11,816 / 796 / 0 | 10,188 / 17,639 / 16,958 | 4,991 / 27,616 / 26,879 |
| img-5×8-sent | 7,955 / 7,474 / 6,897 | 7,955 / 577 / 0 | 4,519 / 10,773 / 10,336 | 3,355 / 24,659 / 24,196 |
| img-5×8-bw | 7,955 / 12,141 / 11,559 | 7,955 / 568 / 0 | 6,823 / 13,892 / 13,463 | 3,355 / 23,014 / 22,542 |

A few conclusions:

- It does work: on fable, 0.86–0.96 F1 across every corpus length I tested, carrying the same information for a third of the input price. Amazing.
- The input savings aren’t free: models decode dense images by reasoning about them, and that thinking costs ~5× the output tokens of the text condition (in this example).

At Anthropic’s output pricing the decode tax can eat the input savings in a single pass. This is a nitpick at 40k tokens — (a) nobody compacts at that range, (b) the decode happens once, not every turn — but still: suboptimal.
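A rough way to see the decode tax is to ask at what output-to-input price ratio the image carrier stops paying for itself. The sketch below uses the fable-5 token counts from the table (text versus img-6×10-sent) and treats the "output" column as the billed output. The price ratio of 5 and the multi-turn model (context re-billed as input every turn with no prompt caching, decode paid once) are my assumptions for illustration, and the result is not meant to reproduce the dollar figures in the table.

```python
def break_even_price_ratio(text_in, text_out, img_in, img_out):
    """Output/input price ratio above which the image carrier costs more in one pass."""
    input_saved = text_in - img_in
    extra_output = img_out - text_out
    return input_saved / extra_output


def net_saving(turns, price_ratio, text_in, text_out, img_in, img_out):
    """Saving in input-token equivalents after `turns` reads of the context.

    Assumes the context is billed as input on every turn (no prompt caching)
    and the extra decode output is paid only once.
    """
    return turns * (text_in - img_in) - price_ratio * (img_out - text_out)


FABLE_TEXT = (37_793, 2_410)    # input, output tokens for the text condition
FABLE_IMG = (11_816, 10_437)    # input, output tokens for img-6x10-sent

ratio = break_even_price_ratio(*FABLE_TEXT, *FABLE_IMG)
print(round(ratio, 2))                               # 3.24
print(net_saving(1, 5.0, *FABLE_TEXT, *FABLE_IMG))   # -14158.0
print(net_saving(2, 5.0, *FABLE_TEXT, *FABLE_IMG))   # 11819.0
```

Under these assumptions the image carrier loses on a single pass whenever output tokens cost more than about 3.2× input tokens, and comes out ahead from the second read onward.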

The baselines mostly confirm why this is worth doing at all. Prose compaction is a fact shredder: on compacted context, Gemini answered UNREADABLE 240 times out of 240, Opus 209 — the summaries preserve what you were doing, not what you knew. Two exceptions: OpenAI’s opaque server-side compaction retains nearly everything (but they might just be skipping it, who knows?), and Gemini’s handoff documents disobey the prompt’s spirit and write down the trivia, lol.

Anyhow — whether the technique works is an empirical property of each model’s vision stack, and you have to test it. So now we’re gonna have to learn how the vision stack actually works.

¶ 0x4: Two Carriers, One State

The stronger claim, the one that makes snapcompact a memory format instead of a party trick, is that the model thinks the same with either carrier. We know it reads the image; the question is whether the internal result is text-shaped.

Setup, on a local Qwen2.5-VL-7B-Instruct: take one SQuAD chunk and twelve questions over it. Run each question twice: once with the chunk as plain text in the prompt, once with the chunk as a 1568² bitmap — and capture the hidden state at the last prompt token, the model’s “about to answer” summary, at every decoder layer.

Raw states look similar for boring reasons (same template, same model), so the comparison subtracts each carrier’s per-layer mean — anything that survives centering is content, not carrier. Then three measurements:

- **Matched pairs** (same question, text ↔ image): cosine **0.66** at layer 19. **Mismatched pairs** (different questions): −0.06. The state encodes which question against which content, not which input format.
- **Cross-carrier retrieval**: for every text run, find the nearest image run. From layer 2 onward it’s the same question **12 out of 12 times**.
- **Representational geometry**: the 12×12 question-similarity matrix computed inside the text carrier correlates with the image carrier’s at **r = 0.94** by layer 1, settling to **0.85** at the final layer. The two carriers print the same relational structure almost immediately; what deepens with depth is the per-question state fusion.
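The first two measurements can be sketched in plain Python. This is not the article's probe code: real inputs would be hidden states captured from the model at one layer, whereas here they are replaced by synthetic vectors (shared per-question content plus a carrier-specific offset and some noise), so the printed numbers say nothing about Qwen. The representational-geometry correlation is left out to keep the sketch short.

```python
import math
import random


def center(states):
    """Subtract the per-dimension mean across questions (one carrier, one layer)."""
    dims = len(states[0])
    mean = [sum(s[d] for s in states) / len(states) for d in range(dims)]
    return [[s[d] - mean[d] for d in range(dims)] for s in states]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def compare_carriers(text_states, image_states):
    """Return mean matched cosine, mean mismatched cosine, and retrieval hits."""
    text_c, image_c = center(text_states), center(image_states)
    n = len(text_c)
    sims = [[cosine(t, i) for i in image_c] for t in text_c]
    matched = sum(sims[q][q] for q in range(n)) / n
    mismatched = sum(sims[q][r] for q in range(n) for r in range(n) if q != r) / (n * (n - 1))
    # Retrieval: for each text run, is the nearest image run the same question?
    retrieved = sum(1 for q in range(n) if max(range(n), key=lambda r: sims[q][r]) == q)
    return matched, mismatched, retrieved


# Synthetic stand-in for hidden states (assumption, not real model data).
rng = random.Random(0)
dims, questions = 64, 12
content = [[rng.gauss(0, 1) for _ in range(dims)] for _ in range(questions)]
text_offset = [rng.gauss(0, 3) for _ in range(dims)]
image_offset = [rng.gauss(0, 3) for _ in range(dims)]
text_states = [[c + o + rng.gauss(0, 0.3) for c, o in zip(q, text_offset)] for q in content]
image_states = [[c + o + rng.gauss(0, 0.3) for c, o in zip(q, image_offset)] for q in content]

matched, mismatched, retrieved = compare_carriers(text_states, image_states)
print(f"matched={matched:.2f} mismatched={mismatched:.2f} retrieval={retrieved}/{questions}")
```

Centering is what removes the large carrier offsets in the toy data; without it, every text state would look most similar to every other text state regardless of the question.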

Behaviorally, both carriers generate the same answers. That’s the property the pricing math cashes in on: the PNG isn’t a picture of your context — it converges to being your context.

If pixels become text inside the model, you can ask *where*. The instrument is a logit lens: at every layer, take the hidden state of the visual token covering the answer word, push it through the final RMSNorm and the LM head, and check the top-1 vocabulary entry. **Lock-on** is the first layer whose top-1 is a BPE piece of the answer.

For the baseline 8×13 rendering, the patch containing the tail of “spectacular” decodes as CJK noise for seventeen layers, passes through *letter-shaped* noise around L18 (`ALLERY`, `IGHL`: strokes assembling into orthography), and flips to `acular` at L24, climbing to p=0.39 by the last layer.
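The lock-on rule itself is simple once the per-layer top-1 tokens are in hand. The sketch below covers only that final step. The trace is a hypothetical reconstruction of the example above (placeholder noise tokens, a 28-entry list, 1-based layer numbers) and the BPE split of "spectacular" is assumed, so none of it is real logit-lens output.

```python
from typing import Optional


def lock_on_layer(top1_by_layer: list[str], answer_pieces: set[str]) -> Optional[int]:
    """Return the 1-based index of the first layer whose top-1 token is an answer piece."""
    for layer, token in enumerate(top1_by_layer, start=1):
        if token.strip() in answer_pieces:
            return layer
    return None


# Hypothetical trace: noise for 17 layers, letter-shaped noise, then the answer piece.
trace = ["<noise>"] * 17 + ["ALLERY", "IGHL"] + ["<noise>"] * 4 + ["acular"] * 5
pieces = {"spect", "acular"}  # assumed BPE pieces of "spectacular"

print(lock_on_layer(trace, pieces))  # 24
```

In a real run, `trace` would come from pushing each layer's hidden state for the chosen visual token through the final norm and LM head and taking the argmax.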

This matters because layers before lock-on are spent turning pixels into words, and layers after are free to work with them. So I decided to be a simpleton. Attention accumulates evidence, yes? Repeat the lines, and the read should get stronger?

Another thing that’s very obvious once you know how the vision tokens work — Qwen slices the image into 28×28-pixel windows, one visual token each — is that the font size we picked at the beginning of this article, purely by yolo, won’t fly. At 6×10, every token window holds fragments of ~13 glyphs smeared across three text rows.

Less overlapping garbage in a token’s window: less thinking required. Simple.
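The window arithmetic is easy to reproduce. This sketch takes the 28×28 window size from the text; the aligned cell sizes (7×14, 14×28, 28×28) are my inference from the chars-per-token figures, not something the article states, and the function names are illustrative.

```python
WINDOW = 28  # pixels per side of one visual-token window, per the article


def chars_per_token(cell_w: int, cell_h: int, repeat: int = 1) -> float:
    """Average distinct characters per window; repeat=2 draws every line twice."""
    return WINDOW * WINDOW / (cell_w * cell_h * repeat)


def is_aligned(cell_w: int, cell_h: int) -> bool:
    """True when glyph cells tile the window exactly, so no glyph straddles two tokens."""
    return WINDOW % cell_w == 0 and WINDOW % cell_h == 0


print(round(chars_per_token(6, 10), 1))            # 13.1
print(round(chars_per_token(8, 13), 1))            # 7.5
print(round(chars_per_token(8, 13, repeat=2), 1))  # 3.8
print(chars_per_token(7, 14), is_aligned(7, 14))   # 8.0 True
print(is_aligned(6, 10))                           # False
```

Neither 6 nor 10 divides 28, so at 6×10 the glyph grid and the token grid never line up, which is the "fragments of ~13 glyphs" problem in numbers.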

| condition | lock-on | peak p(answer) | chars/visual token |
|---|---|---|---|
| base 8×13 | L24 | 0.39 | 7.5 |
| repeat lines ×2, colored | L23 | 0.94 | 3.75 |
| aligned, 4×2 chars/token | L24 | 0.74 | 8.0 |
| aligned, 2×1 chars/token | L23 | 0.99 | 2.0 |
| aligned, 1 char/token | L22 | 0.99 | 1.0 |
| aligned + repeated | L23 | 1.00 | 1.0 |

The depth refused to move (significantly): L24 to L22, best case. That’s consistent with the OCR-routing literature, in which the depth where vision becomes text is an architectural property. But the confidence is fully controllable: line repetition alone took the decode from 0.39 to 0.94 while still carrying 3.75 chars per visual token. The format can’t make the model read sooner; it can make the reading unambiguous. Grug model not think much.

Now at this point, while doing lit review, I noticed this is not really a new idea: DeepSeek-OCR trained a custom encoder for optical context compression, and Karpathy riffed on pixels maybe beating tokens as an input medium.

In the agent-harness context, though, it seems like people saw the amount of thinking being wasted and called it a day. But would you look at that? The grug-brain optimization we made for Qwen generalizes remarkably well to frontier models!

- **Dense-text bitmaps as context carriers**, carefully adjusted, do very well. Synthetic pixel-font renderings at the legibility floor, benchmarked (billing formulas, silent downscales, and adaptive-thinking costs included) against the compaction strategies agents actually ship. Go team PNG!
- **Lock-on you can drive to certainty** on a 7B open model. From p=0.39 to p=1.00 with such a small change is significant. Yes, repetition doubles the pixel area, so we could have had much more significant savings than a mere ~3×, but that’s still cheaper than text, now with near-perfect recall. Hell, return tool results as PNGs if you want.

Here it is, running live in the harness:

The harness wins again: nothing about the models changed; we changed the context around them.

Eval harness, font renderer, per-question records, white-box probes: omp. `uv run final.py` reproduces the API grid (~$35 cold, free from cache after); the representation runs need a local GPU with Qwen2.5-VL-7B.

Coming very soon to oh-my-pi!

source & further reading

blog.can.ac (original article)
