{"slug": "llm-wire-format-benchmark-which-format-can-ai-actually-read-and-write", "title": "LLM Wire Format Benchmark: Which Format Can AI Actually Read and Write?", "summary": "A developer benchmarked 10 AI models across three wire formats (GCF, TOON, and JSON) and found that GCF achieved 100% comprehension accuracy on frontier models like Claude Sonnet and Gemini 3.5 Flash and 5/5 valid generation on most frontier models, while JSON broke at 500 records and TOON scored 0/5 on generation for several models, including Claude Opus and GPT-5.4. The GCF format also used 79% fewer tokens than JSON, with no trade-off between cost and accuracy.", "body_md": "Every LLM wire format claims token savings. Nobody proves whether AI models can actually comprehend the format at scale, or produce valid output in it.\n\nWe ran 23 comprehension evals across 10 models and 3 providers. We ran generation evals across 11 models. Deterministic ground truth. No LLM judge. Reproducible from one command.\n\nJSON breaks at 500 records. GPT-5.5 returns empty strings. It can't even attempt an answer. Opus miscounts 500 as 356 and then spends 143 lines manually enumerating symbols to verify its own wrong answer. The format designed for \"human readability\" is incomprehensible to the systems actually reading it.\n\nTOON can't produce valid output. Claude Opus, the most capable model on the planet, scores 0/5 on TOON generation. GPT-5.4: 0/5. GPT-5.4-mini: 0/5. Gemini 3.1 Flash Lite: 0/5. The error is always the same: `toon: cannot assign string to int`. The model writes \"target\" in the distance column. TOON expects `0`. Every model fails the same way because the format's design forces an unnatural encoding step that models cannot perform unprompted.\n\nGCF wins both dimensions on every model tested. 100% comprehension on Claude Sonnet, Gemini 2.5 Pro, Gemini 3.1 Pro, and Gemini 3.5 Flash. 5/5 valid generation on nearly every frontier model (GPT-5.5 scores 4-5/5). Zero prior training. 
The format didn't exist until we built it, and every model speaks it natively.\n\nA 500-symbol, 200-edge code graph. Encoded in GCF, TOON, and JSON. 13 structured extraction questions. The model gets the payload and a question. No format instructions. No system prompt. No hints.\n\n| Model | Runs | GCF avg | TOON avg | JSON avg | GCF margin |\n|---|---|---|---|---|---|\n| Claude Opus 4.6 | 2 | 96.2% | 84.6% | 73.1% | +11.6 vs TOON |\n| Claude Sonnet 4.6 | 2 | 100% | 73.1% | 53.8% | +26.9 vs TOON |\n| Claude Haiku 4.5 | 2 | 96.2% | 69.2% | 57.7% | +27.0 vs TOON |\n| GPT-5.5 | 5 | 84.1% | 67.7% | 45.8% | +16.4 vs TOON |\n| GPT-5.4 | 4 | 76.4% | 56.0% | 44.1% | +20.4 vs TOON |\n| GPT-5.4-mini | 2 | 71.8% | 64.1% | 54.2% | +7.7 vs TOON |\n| Gemini 2.5 Flash | 3 | 80.6% | 54.6% | 57.0% | +26.0 vs TOON |\n| Gemini 2.5 Pro | 1 | 100% | 76.9% | 58.3% | +23.1 vs TOON |\n| Gemini 3.1 Pro | 1 | 100% | 76.9% | 46.2% | +23.1 vs TOON |\n| Gemini 3.5 Flash | 1 | 100% | 61.5% | 46.2% | +38.5 vs TOON |\n\nGCF beats both TOON and JSON on every model from every provider. No exceptions. TOON beats JSON on every model except Gemini 2.5 Flash (54.6% vs 57.0%). Four models achieve 100%: Claude Sonnet, Gemini 2.5 Pro, Gemini 3.1 Pro, Gemini 3.5 Flash.\n\n| Format | Tokens | vs JSON |\n|---|---|---|\n| GCF | 11,090 | 79% fewer |\n| TOON | 16,378 | 69% fewer |\n| JSON | 53,341 | baseline |\n\nGCF is the cheapest format. It's also the most accurate. Usually you trade cost for quality. Not here.\n\nAt 8 symbols, JSON scores 100%. Everything works. At 500 symbols, it falls apart.\n\n**GPT-5.5 returns empty strings.** Not wrong answers. Nothing. The model receives 53,341 tokens of `{\"qualifiedName\": \"...\", \"kind\": \"...\", \"score\": ..., \"provenance\": \"...\", \"distance\": ...}` repeated 500 times and cannot produce any response. Ask \"how many symbols?\" and it returns `\"\"`. The attention mechanism drowns in 2,500 identical field-name tokens.\n\n**Claude Opus miscounts 500 as 356.** Then it tries to verify by manually listing symbols. 
143 lines of chain-of-thought enumeration. Burns output tokens. Still gets the wrong answer. The most capable model in the world cannot count JSON objects because the structural noise overwhelms the signal.\n\n**Every model fails distance filtering.** \"How many symbols have distance 0?\" requires parsing 500 JSON objects, reading the `distance` field on each, and counting matches. Correct answer: 166. Opus answers 200 (it read the edge count instead). GPT-5.4 answers 300-404. GPT-5.4-mini answers 300.\n\nJSON repeats `\"qualified_name\":`, `\"kind\":`, `\"score\":`, `\"provenance\":`, `\"distance\":` on every single record. That's 2,500 structurally identical tokens carrying zero semantic content. They exist for human readability. The consumer isn't a human.\n\nTOON does better than JSON on counting. It gets symbol_count=500 correct. But it fails on anything that requires filtering by column value.\n\n**Distance grouping fails on every model.** \"How many targets (distance 0)?\" requires scanning 500 TOON rows and filtering by the last column. Correct answer: 166.\n\nThe answers are wildly inconsistent across runs. The models aren't wrong in a systematic way; they're guessing. TOON has no section headers for distance groups. The only way to answer \"how many targets?\" is to scan every row and count. At 500 rows, models give up and guess round numbers.\n\n**Attention decays by row 500.** \"What kind is the last symbol?\" should be trivial. With TOON, multiple models answer \"method\" instead of \"interface\". By the time the model reaches row 500 of a flat table, attention has diluted to noise.\n\nGCF answers are structural, not computational.\n\n\"How many symbols?\" Read the header: `symbols=500`. Done.\n\n\"How many edges?\" Read the section header: `## edges [200]`. Done.\n\n\"How many targets?\" Count lines in `## targets`. The section boundary gives the grouping for free. No column filtering. 
No scanning 500 rows.\n\n\"What kind is the last symbol?\" The last line in `## extended` is the last symbol. The model reads the last line of the last section. No attention decay across 500 flat rows.\n\nGCF median error magnitude: **4** (off-by-one tokenization artifacts).\n\nTOON median error magnitude: **53** (comprehension failure).\n\nJSON median error magnitude: **56** (structural overwhelm).\n\nOne design decision creates this gap: hierarchical sections vs flat tabular. GCF groups data by category. TOON and JSON present flat lists and force the model to compute groupings from raw values. At scale, that computation fails.\n\nWe asked every model to produce structured output in each format. 3-line primer in the prompt. Output validated through the real decoder. No hand-holding.\n\n| Model | GCF | TOON (natural) | JSON |\n|---|---|---|---|\n| Claude Opus 4.6 | 5/5 | 0/5 | 5/5 |\n| Claude Sonnet 4.6 | 5/5 | 2-3/5 | 5/5 |\n| Claude Haiku 4.5 | 5/5 | 1-3/5 | 5/5 |\n| GPT-5.5 | 4-5/5 | 1-2/5 | 5/5 |\n| GPT-5.4 | 5/5 | 0/5 | 5/5 |\n| GPT-5.4-mini | 5/5 | 0/5 | 5/5 |\n| Gemini 2.5 Pro | 5/5 | 1/5 | 5/5 |\n| Gemini 3.1 Pro | 5/5 | 0/5 | 5/5 |\n| Gemini 3.1 Flash Lite | 4-5/5 | 0/5 | 4/5 |\n| Gemini 3.5 Flash | 3/5 | 1/5 | 3/5 |\n| Gemini 2.5 Flash | 2-3/5 | 0-4/5 | 0-3/5 |\n\nNo model has ever been trained on GCF. It didn't exist before we built it. Yet every frontier model (Opus, Sonnet, GPT-5.5, Gemini 2.5 Pro, Gemini 3.1 Pro) produces valid, decoder-parseable output on first exposure with a 3-line primer.\n\nTOON has been published for months. It has documentation, examples, a playground, SDK implementations. And Claude Opus scores 0/5. Gemini 3.1 Pro scores 0/5. 
GPT-5.4 scores 0/5.\n\nEvery TOON generation failure produces the same error:\n\n```\nINVALID: symbols: index 0: distance: toon: cannot assign string to int\n```\n\nThe model writes:\n\n```\nsymbols[5]{name,kind,score,provenance,distance}:\n  pkg/api.HandleRequest,function,0.95,lsp_resolved,target\n```\n\nTOON expects:\n\n```\nsymbols[5]{name,kind,score,provenance,distance}:\n  pkg/api.HandleRequest,function,0.95,lsp_resolved,0\n```\n\nThe model is told \"this symbol is a target.\" It writes `target`. TOON's decoder rejects it because it expects the integer `0`. The model would need to know, unprompted, that \"target\" maps to 0, \"related\" maps to 1, \"extended\" maps to 2. No model does this.\n\nThis isn't a training problem. This is a design flaw. TOON's flat tabular format encodes semantic categories as integers. The model has to perform a mapping step that has no structural cue in the format itself. When does a column value need to be an integer? When is a string acceptable? TOON gives no signal. The model guesses wrong.\n\nGCF expresses distance through section placement:\n\n```\n## targets\n@0 fn pkg.HandleRequest 0.95 lsp_resolved\n## related\n@1 type pkg.ProcessResponse 0.74 ast_inferred\n## extended\n@2 method pkg.ValidateConfig 0.52 structural\n```\n\nThe model is told \"this symbol is a target.\" It writes it in `## targets`. No integer mapping. No encoding step. The format aligns with how the model naturally expresses grouped data. Sections are categories. That's how markdown works. That's how every model already thinks.\n\nWhen we explicitly pre-encode distances as integers in the prompt (\"distance 0\" instead of \"target\"), TOON passes. 
But this means the caller must know TOON's internal encoding and pre-process every field before the model can write valid output.\n\n| Format | Prompt style | Valid | 100-symbol output |\n|---|---|---|---|\n| GCF | natural labels | 5/5 | 5,984 B |\n| TOON | hand-held (integers) | 5/5 | 8,336 B |\n| TOON | natural labels | 0/5 | invalid |\n| JSON | natural labels | 5/5 | 16,121 B |\n\nGCF works with natural language. TOON requires a preprocessing step. And even with that step, GCF output is 28% smaller.\n\nNo model has seen GCF before. The format is days old. And yet:\n\nThis happens because GCF is aligned with patterns LLMs already understand:\n\n- `## section_name` is a markdown header. Every model knows this.\n- `@0 fn pkg.Auth 0.78 lsp_resolved` is positional. One token per field. No ambiguity.\n- `@1<@0 calls` is 4 tokens. Self-contained. No nested objects.\n\nThe format was designed for the machine's native expression patterns. TOON was designed for human readability. JSON was designed for human readability. Neither format was designed for the reader that's actually doing the work.\n\nWe forked TOON's benchmark repository, added a GCF formatter, and ran their datasets with their tokenizer and their methodology.\n\n| Dataset | GCF | TOON | Result |\n|---|---|---|---|\n| Semi-uniform event logs | 108,158 | 154,032 | GCF 30% smaller |\n| E-commerce orders | 61,593 | 73,246 | GCF 16% smaller |\n| Deeply nested config | 616 | 618 | GCF 0.3% smaller |\n| Employee records | 49,055 | 49,966 | GCF 2% smaller |\n| Analytics time-series | 8,398 | 9,127 | GCF 8% smaller |\n| GitHub repos | 8,576 | 8,744 | GCF 2% smaller |\n\nTOON's home turf. TOON's datasets. TOON's methodology. GCF wins every single one.\n\nEven on flat tabular employee records, the dataset TOON was literally designed for, GCF is smaller. The gap is small (2%) but it exists. On semi-uniform data where structures vary, the gap widens to 30%.\n\nGCF has a feature no other format supports: session statefulness. 
Symbols seen in prior tool calls are referenced by ID instead of re-serialized.\n\nFirst call: full payload. Second call: only new symbols, plus `@ref` IDs for previously seen ones. By the 5th call in a conversation: **92.7% token savings.**\n\nTOON and JSON re-serialize everything on every call. There is no mechanism for cross-call deduplication. Every tool response pays full price regardless of what the model already knows.\n\nThis is where GCF's advantage compounds over a session. The per-call savings (32% vs TOON and 79% vs JSON on the code-graph payload) multiply across 5-10 tool calls in a typical agent interaction.\n\nThe eval is open source. Every result is committed. Every log file is in the repository.\n\n```\n# Comprehension (any provider)\ncd gcf-go/eval\nGOWORK=off EVAL_BACKEND=openai OPENAI_API_KEY=... EVAL_MODEL=gpt-5.5 \\\n  go test -run TestComprehension -v -timeout 0\n\n# Generation\ncd gcf/eval\npython3 generation_gcf_eval.py\npython3 generation_toon_eval.py\n\n# Token efficiency (TOON's benchmark)\ncd toon && git checkout gcf-comparison && cd benchmarks && pnpm install && pnpm benchmark:tokens\n```\n\nRun it yourself. 
The numbers don't change.", "url": "https://wpnews.pro/news/llm-wire-format-benchmark-which-format-can-ai-actually-read-and-write", "canonical_source": "https://dev.to/daynablackwell/llm-wire-format-benchmark-which-format-can-ai-actually-read-and-write-1lob", "published_at": "2026-06-07 00:11:45+00:00", "updated_at": "2026-06-07 00:42:02.538211+00:00", "lang": "en", "topics": ["large-language-models", "artificial-intelligence", "ai-research", "ai-tools", "ai-infrastructure"], "entities": ["Claude Opus", "GPT-5.5", "GPT-5.4", "Gemini 3.1 Flash Lite", "Claude Sonnet", "Gemini 2.5 Pro", "Gemini 3.1 Pro", "Gemini 3.5 Flash"], "alternates": {"html": "https://wpnews.pro/news/llm-wire-format-benchmark-which-format-can-ai-actually-read-and-write", "markdown": "https://wpnews.pro/news/llm-wire-format-benchmark-which-format-can-ai-actually-read-and-write.md", "text": "https://wpnews.pro/news/llm-wire-format-benchmark-which-format-can-ai-actually-read-and-write.txt", "jsonld": "https://wpnews.pro/news/llm-wire-format-benchmark-which-format-can-ai-actually-read-and-write.jsonld"}}