{"slug": "snapcompact-sota-compaction-instant-local-free-pick-3", "title": "Snapcompact: SoTA Compaction – Instant, Local, Free. Pick 3", "summary": "A developer discovered that rendering text into dense pixel-font bitmaps and feeding them as images to large language models achieves near-perfect recall at a third of the input token cost, effectively creating a free compaction method. The technique, called Snapcompact, was benchmarked against frontier models and generalized well, offering a way to compress context windows without losing information.", "body_md": "# Snapcompact: SoTA Compaction — Instant, Local, Free. Pick 3\n\nA 1568×1568 PNG fits about 40,000 characters of text in a 6×10 pixel font. That’s ~10,000 tokens worth of text, billed by Anthropic’s pixel formula as 3,279 image tokens. Do you see where I’m going with this?\n\n**Snapcompact**: when the context window fills up, render it into dense pixel-font bitmaps and hand those back as images. “A picture is worth a thousand words” turns out to be quite literally true — the model reads it back near-verbatim, at a third of the input price. Obligatory benchmark:\n\nThis started as a joke (“free token glitch lol”). Then I benchmarked it, identified where it went wrong, cracked open Qwen’s attention layer, fixed the issues, benchmarked it again — and here I am writing it up, because it generalized remarkably well to frontier models.\n\n## 0x0: The Case Against Compaction\n\nI am not a big fan of compaction. In every single harness, including my own, I’ve always felt like it “crippled” the model to the point where you would have been better off with a completely new session.\n\nEliding tool results is an okay alternative — instant, deterministic — but sometimes not really sufficient. It also occasionally confuses the model about tool calling. 
LLMs complete stories; if half of your story is `[elided...]`, how confident do you think it will be about using them?\n\nHandoffs are as good as it gets — but unlike a plan, you don’t usually steer the handoffs, and when you don’t, agents waste precious context writing an unnecessarily detailed diary, followed by a TODO list that practically begs the next agent to declare the goal impossible and ship an “MVP” instead.\n\nMy thinking was essentially that if you need compaction often, you’re doing something wrong: the plan either has scope creep, or should have been explicitly orchestrated via subagents so that the main agent could stay responsible for the entire scope.\n\nHowever, spoiled by the 1M context window, these days I often hit the 500k mark by the end of a session — a mortal sin in my book a few months ago. But long-horizon tasks do better when one coherent agent drives the plan uninterrupted, and that easily reaches those levels even with aggressive delegation.\n\nSo there I was, staring at the 5h usage limit bar going red while this thing grinned back at me, thinking: maybe I should compact regularly…\n\nFine. But if I have to compact, it’s going to lose nothing.\n\n## 0x1: A Stupid Experiment\n\nIt began with a 328KB session log and a simple question: what if I just printed this thing out and started the session with it?\n\nAttempt one was maximally greedy: [Tom Thumb](https://robey.lag.net/2010/01/23/tiny-monospace-font.html), a 3×5 pixel font, 122,696 characters in a single image.\n\nI sent it to a fresh agent session, zero explanation, and got back:\n\n> The image appears to be pure noise with random pixels, which suggests it might be corrupted or a file that’s been misnamed as PNG.\n\nFair. Attempt two used the X11 `6x10` font (glyphs actually designed for that cell size), 40,716 characters, with each text row cycling through six colors. 
Same model, and there it was:\n\n- It identified the session’s topic and **quoted me back verbatim**.\n- It named 18 identifiers from the log with 100% recall.\n- Asked about a single assignment in the bottom-most row of the image, where the log cuts off, it hedged (“I’d be guessing — possibly `0`”) — and guessed the state right.\n\n10k tokens of text, carried by 3,279 image tokens, recalled with near-perfect precision. Okay. Now I’m invested.\n\n## 0x2: Optimizing the Fonts\n\nHow small can the font go? I swept some font configurations and asked the model to transcribe fixed regions, scoring edit similarity against ground truth:\n\n| font | px²/char | chars/image | transcription | identifiers read |\n|---|---|---|---|---|\n| 8×13 | 104 | 23,520 | 1.00 | 20/20 |\n| 6×10 | 60 | 40,716 | 0.79 | 20/20 |\n| 5×8 | 40 | 61,348 | 0.37 | 17/19 |\n| 5×7 | 35 | 70,112 | 0.30 | 10/20 |\n| 4×6 | 24 | 102,312 | 0.02 | 9/20 |\n\nThe cliff is sharp and it sits around **35–40 px² per character**. Above it, exact transcription degrades but *identifier-level* recall stays weirdly strong: the model can’t reproduce every byte, but it reads the names. Below it, nothing.\n\nThe funny thing is, this section was worse than useless — this exact optimization comes back to bite us in a bit.\n\n## 0x3: Thinking…\n\nAnecdotes about my own log don’t generalize, so let’s get a proper benchmark: SQuAD v1.1, extractive questions with gold answers. 
The harness packs passages into chunks sized to each technique’s carrying capacity, samples 30 questions per chunk spread evenly (so answers land at every image row, top to bottom), and runs every technique over the same corpus:\n\n- **text** — the corpus passed verbatim; the ceiling,\n- **handoff** — a simple handoff prompt,\n- **compact** — provider-side compaction where available, a summarization call otherwise,\n- **img-{font}-{variant}** — snapcompact, where the variant is **bw** (plain black-on-white) or **sent** (glyph ink cycles color per sentence).\n\nEach cell shows SQuAD F1 followed by the run’s cost in USD; models are told to answer UNREADABLE when they can’t extract the fact.\n\n| technique | fable-5 | opus-4.8 | gpt-5.5 | gemini-3.5-flash |\n|---|---|---|---|---|\n| text (ceiling) | 0.904 $0.4984 | 0.911 $0.6367 | 0.861 $0.0847 | 0.898 $0.0577 |\n| handoff | 0.540 $1.2241 | 0.248 $1.0065 | 0.368 $0.2386 | 0.889 $0.1759 |\n| compact | 0.406 $0.9427 | 0.000 $0.7576 | 0.896 $0.3393 | 0.000 $0.0420 |\n| img-6×10-sent | 0.882 $0.6400 | 0.601 $0.2430 | 0.822 $0.2452 | 0.805 $0.0970 |\n| img-6×10-bw | 0.856 $0.7568 | 0.652 $0.2369 | 0.792 $0.3026 | 0.767 $0.1135 |\n| img-5×8-sent | 0.773 $0.4532 | 0.409 $0.1626 | 0.751 $0.1819 | 0.738 $0.1006 |\n| img-5×8-bw | 0.830 $0.6866 | 0.425 $0.1619 | 0.778 $0.2359 | 0.674 $0.0941 |\n\nFor a first attempt, not bad… wait, muh token savings, how is this more expensive? 
Let’s have a look at this other table.\n\n### Tokens: input/output/thinking\n\n| technique | fable-5 | opus-4.8 | gpt-5.5 | gemini-3.5-flash |\n|---|---|---|---|---|\n| text (ceiling) | 37,793 / 2,410 / 1,435 | 37,793 / 931 / 0 | 17,761 / 2,998 / 2,298 | 24,535 / 10,745 / 9,983 |\n| handoff | 49,363 / 14,609 / 3,717 | 43,237 / 4,773 / 0 | 25,032 / 11,710 / 4,835 | 49,387 / 36,577 / 11,959 |\n| compact | 45,130 / 9,828 / 3,248 | 40,436 / 2,014 / 0 | 45,562 / 15,329 / 1,032 | 26,151 / 6,582 / 5,422 |\n| img-6×10-sent | 11,816 / 10,437 / 9,483 | 11,816 / 877 / 0 | 10,188 / 14,049 / 13,368 | 4,991 / 23,491 / 22,764 |\n| img-6×10-bw | 11,816 / 12,772 / 11,782 | 11,816 / 796 / 0 | 10,188 / 17,639 / 16,958 | 4,991 / 27,616 / 26,879 |\n| img-5×8-sent | 7,955 / 7,474 / 6,897 | 7,955 / 577 / 0 | 4,519 / 10,773 / 10,336 | 3,355 / 24,659 / 24,196 |\n| img-5×8-bw | 7,955 / 12,141 / 11,559 | 7,955 / 568 / 0 | 6,823 / 13,892 / 13,463 | 3,355 / 23,014 / 22,542 |\n\nA few conclusions:\n\n- It does work: on fable, 0.86–0.96 F1 across every corpus length I tested, carrying the same information for a third of the input price. Amazing.\n- The input savings aren’t free: models decode dense images by *reasoning* about them, and that thinking costs ~5× the output tokens of the text condition (in this example).\n\nAt Anthropic’s output pricing the decode tax can eat the input savings in a single pass. This is a nitpick at 40k tokens — (a) nobody compacts at that range, (b) the decode happens once, not every turn — but still: suboptimal.\n\nThe baselines mostly confirm why this is worth doing at all. Prose compaction is a fact shredder: on compacted context, Gemini answered UNREADABLE **240 times out of 240**, Opus 209 — the summaries preserve what you were *doing*, not what you *knew*. 
Two exceptions: OpenAI’s opaque server-side compaction retains nearly everything (but they might just be skipping it, who knows?), and Gemini’s handoff documents disobey the prompt’s spirit and write down the trivia, lol.\n\nAnyhow — whether the technique works is an empirical property of each model’s vision stack, and you have to test it. So now we’re gonna have to learn how the vision stack actually works.\n\n## 0x4: Two Carriers, One State\n\nThe stronger claim, the one that makes snapcompact a memory format instead of a party trick, is that the model *thinks* the same with either carrier. We know it reads the image; the question is whether the internal result is text-shaped.\n\nSetup, on a local Qwen2.5-VL-7B-Instruct: take one SQuAD chunk and twelve questions over it. Run each question twice: once with the chunk as plain text in the prompt, once with the chunk as a 1568² bitmap — and capture the hidden state at the **last prompt token**, the model’s “about to answer” summary, at every decoder layer.\n\nRaw states look similar for boring reasons (same template, same model), so the comparison subtracts each carrier’s per-layer mean — anything that survives centering is content, not carrier. Then three measurements:\n\n- **Matched pairs** (same question, text ↔ image): cosine **0.66** at layer 19. **Mismatched pairs** (different questions): **−0.06**. The state encodes *which question against which content*, not which input format.\n- **Cross-carrier retrieval**: for every text run, find the nearest image run. From layer 2 onward it’s the same question **12 out of 12 times**.\n- **Representational geometry**: the 12×12 question-similarity matrix computed inside the text carrier correlates with the image carrier’s at **r = 0.94** by layer 1, settling to **0.85** at the final layer. 
The two carriers print the same relational structure almost immediately; what deepens with depth is the per-question state fusion.\n\nBehaviorally, both carriers generate the same answers. That’s the property the pricing math cashes in on: the PNG isn’t a picture *of* your context — it converges to *being* your context.\n\nIf pixels become text inside the model, you can ask *where*. The instrument is a logit lens: at every layer, take the hidden state of the visual token covering the answer word, push it through the final RMSNorm and the LM head, and check the top-1 vocabulary entry. **Lock-on** is the first layer whose top-1 is a BPE piece of the answer.\n\nFor the baseline 8×13 rendering, the patch containing the tail of “spectacular” decodes as CJK noise for seventeen layers, passes through *letter-shaped* noise around L18 (`ALLERY`, `IGHL` — strokes assembling into orthography), and flips to `acular` at **L24**, climbing to p=0.39 by the last layer.\n\nThis matters because layers before lock-on are spent turning pixels into words, and layers after are free to work with them. So I decided to be a simpleton. Attention accumulates evidence, yes? Repeat the lines, and the read should get stronger?\n\nAnother thing that’s very obvious once you know how the vision tokens work — Qwen slices the image into 28×28-pixel windows, one visual token each — is that the font size we picked at the beginning of this article, purely by yolo, won’t fly. At 6×10, every token window holds fragments of ~13 glyphs smeared across three text rows.\n\nLess overlapping garbage in a token’s window: less thinking required. 
Simple.\n\n| condition | lock-on | peak p(answer) | chars/visual token |\n|---|---|---|---|\n| base 8×13 | L24 | 0.39 | 7.5 |\n| repeat lines ×2, colored | L23 | 0.94 | 3.75 |\n| aligned, 4×2 chars/token | L24 | 0.74 | 8.0 |\n| aligned, 2×1 chars/token | L23 | 0.99 | 2.0 |\n| aligned, 1 char/token | L22 | 0.99 | 1.0 |\n| aligned + repeated | L23 | 1.00 | 1.0 |\n\nThe depth refused to move (significantly): L24 to L22, best case. That’s consistent with the OCR-routing literature: [where vision becomes text is an architectural property](https://arxiv.org/abs/2602.22918). But the *confidence* is fully controllable: line repetition alone took the decode from 0.39 to 0.94 while still carrying 3.75 chars per visual token. The format can’t make the model read sooner; it can make the reading unambiguous. Grug model not think much.\n\nNow at this point, while doing lit review, I noticed this is not really a new idea: [DeepSeek-OCR](https://arxiv.org/abs/2510.18234) trained a custom encoder for optical context compression, and [Karpathy riffed](https://x.com/karpathy/status/1980397031542989305) on pixels maybe beating tokens as an input medium.\n\nIn the agent-harness context, though, it seems like people saw the amount of thinking being wasted and called it a day. But would you look at that? The grug-brain optimization we made for Qwen generalizes remarkably well to frontier models!\n\n- **Dense-text bitmaps as context carriers, carefully adjusted, do very well.** Synthetic pixel-font renderings at the legibility floor, benchmarked — billing formulas, silent downscales, and adaptive-thinking costs included — against the compaction strategies agents actually ship. Go team PNG!\n- **Lock-on you can drive to certainty on a 7B open model.** From p=0.39 to p=1.00 with such a small change is significant. Yes, repetition doubles the pixel area — we could have had much more significant savings than a *mere* ~3× — but that’s still cheaper than text, now with near-perfect recall. 
Hell, return tool results as PNGs if you want.\n\nHere it is, running live in the harness:\n\nThe harness wins again: nothing about the models changed; we changed the context around them.\n\n*Eval harness, font renderer, per-question records, white-box probes: omp — `uv run final.py` reproduces the API grid (~$35 cold, free from cache after); the representation runs need a local GPU with Qwen2.5-VL-7B.*\n\nComing very soon to [oh-my-pi](https://omp.sh)!", "url": "https://wpnews.pro/news/snapcompact-sota-compaction-instant-local-free-pick-3", "canonical_source": "https://blog.can.ac/2026/06/10/snapcompact/", "published_at": "2026-06-13 18:49:55+00:00", "updated_at": "2026-06-13 19:17:22.378989+00:00", "lang": "en", "topics": ["large-language-models", "artificial-intelligence", "ai-research", "ai-tools", "developer-tools"], "entities": ["Snapcompact", "Anthropic", "Qwen", "Tom Thumb", "X11"], "alternates": {"html": "https://wpnews.pro/news/snapcompact-sota-compaction-instant-local-free-pick-3", "markdown": "https://wpnews.pro/news/snapcompact-sota-compaction-instant-local-free-pick-3.md", "text": "https://wpnews.pro/news/snapcompact-sota-compaction-instant-local-free-pick-3.txt", "jsonld": "https://wpnews.pro/news/snapcompact-sota-compaction-instant-local-free-pick-3.jsonld"}}