Claude Code 2.1.154–2.1.158 streaming tool-result delivery regression: probe + side-by-side evidence + workaround (and what you give up by pinning to 2.1.153)

Claude Code versions 2.1.154 through 2.1.158 contain a regression that corrupts the streaming delivery of tool results back to the model: results arrive empty-then-late in bursts, inline blocks are duplicated up to 15 times, and file reads return fabricated phantom content. A developer built a probe that appends a label to a log file before echoing to stdout, then compared the ground-truth execution log against what the delivery channel returned. The comparison confirmed that every command executed exactly once and that only the result channel was broken. Pinning to version 2.1.153 restores clean delivery, as the side-by-side evidence shows: monotonic execution counts and no phantom rows in a 500-line file read.

Date: 2026-05-30 • Verified clean build: 2.1.153 • Broken builds: 2.1.154–2.1.158 (clean 2.1.157-good / 2.1.158-bad bisect from another reporter; symptoms predate 2.1.158)

CC 2.1.154–2.1.158 corrupts the path that delivers tool results back to the model. Execution is clean: commands run exactly once and disk state is correct. The channel that returns results is what's broken: results arrive empty-then-late in bursts, the tail of a parallel batch renders ~15× duplicated, `Read` sometimes returns phantom content (a 76-line file reported as 5333 lines, with a row that wasn't in the file), and results flush out of order. Downstream, this burns context and output tokens, invalidates prompt-cache hits, and provokes model-side amplification: busy-wait polling and fabricated results that drive retry cascades.

Suspect (temporal, not maintainer-confirmed): the 2.1.154 changelog entry "Streaming tool execution is now always enabled, including when telemetry is disabled or on Bedrock/Vertex/Foundry (previously behind a feature flag)" lines up exactly with when the cross-platform cluster ignites.
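To check whether a given build falls inside the affected range described above, a small helper along these lines may be useful. It is only a sketch: the function names are hypothetical, the range 2.1.154 through 2.1.158 is taken from this write-up rather than from any maintainer statement, and it assumes the version string starts with a plain `X.Y.Z` token (as in `2.1.153 (Claude Code)`).

```bash
#!/usr/bin/env bash
# Sketch: classify a Claude Code version string against the affected range
# reported in this write-up (2.1.154 through 2.1.158).
# Function names are hypothetical; the range is this report's claim, not upstream's.

is_affected_version() {
  local version="${1%% *}"            # keep only the leading "X.Y.Z" token
  local major minor patch
  IFS=. read -r major minor patch <<< "$version"
  # Return 2 for anything that is not three plain integers.
  [[ "$major" =~ ^[0-9]+$ && "$minor" =~ ^[0-9]+$ && "$patch" =~ ^[0-9]+$ ]] || return 2
  # 10# forces base 10 so a leading zero is never read as octal.
  (( 10#$major == 2 && 10#$minor == 1 && 10#$patch >= 154 && 10#$patch <= 158 ))
}

classify_version() {
  local rc=0
  is_affected_version "$1" || rc=$?
  case "$rc" in
    0) echo "affected" ;;
    1) echo "outside-range" ;;
    *) echo "unparseable" ;;
  esac
}

classify_version "2.1.153 (Claude Code)"   # → outside-range
classify_version "2.1.156"                 # → affected
# Against a live install (assumes `claude` is on PATH):
# classify_version "$(claude --version 2>/dev/null)"
```

"outside-range" only means the version is not in the reported window; it says nothing about whether other builds are clean.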
The trick: append a label to a log file before echoing. The log is ground truth of what executed (immune to the delivery channel); stdout is what the delivery channel returned. Comparing the two cleanly separates "did the command run?" from "did the result arrive intact?"

```bash
#!/usr/bin/env bash
# /tmp/toolprobe/probe.sh
log=/tmp/toolprobe/log.txt
# Record the execution first: timestamp + label, one line per run.
echo "$(date +%s.%N) $1" >> "$log"
# Then echo through the delivery channel, with the running line count.
echo "PROBE OK label=$1 count=$(wc -l < "$log")"
```

Run N labels as a single parallel Bash batch, then compare per-label execution count (lines in `log.txt`) vs inline result blocks received.

The probe above was built because the 2.1.158 session corrupted tool-result delivery mid-task. So this side is qualitative: what the original broken session produced before the probe existed.

- Tail of a parallel Bash batch rendered ~15× duplicated in the inline result blocks the model received.
- Empty-then-burst arrival lagged ~1 turn: multiple parallel calls returned blank, then their content flushed together later.
- `Read` of a 76-line file reported 5333 lines, including a fabricated duplicate row not in the source file. (`wc -l` on the same file from a sibling Bash call returned 76.)
- Cross-checking ground truth (a log file appended to from inside each command) confirmed every command executed exactly once. Only the channel handing results back to the model was corrupted.

The reason there isn't a structured 2.1.158 probe run below: surviving a corrupted-delivery session well enough to drive a structured probe is itself hard, because the corruption shreds the harness state you'd use to script it. The clean 2.1.153 run below was the first opportunity to instrument cleanly.
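The comparison step of the probe (executions in the log vs result blocks received) can be sketched as a small awk join. This is an illustration under stated assumptions, not the author's tooling: `compare_delivery` is a hypothetical name, the log is assumed to hold `<timestamp> <label>` lines as `probe.sh` writes them, and the "received" file is assumed to be a hand-saved transcript with one `PROBE OK label=<label> count=<n>` line per inline result block the model saw.

```bash
#!/usr/bin/env bash
# Sketch: per-label comparison of executions (log) vs result blocks received.
# Assumptions: $1 is the probe log ("<timestamp> <label>" per line);
# $2 is a hypothetical transcript of delivered blocks, one
# "PROBE OK label=<label> count=<n>" line per block.

compare_delivery() {
  local log="$1" received="$2"
  awk '
    FILENAME == ARGV[1] { ran[$2]++; next }     # first file: ground truth
    match($0, /label=[^ ]+/) {                  # second file: delivered blocks
      label = substr($0, RSTART + 6, RLENGTH - 6)
      got[label]++
    }
    END {
      for (l in ran) labels[l] = 1
      for (l in got) labels[l] = 1
      for (l in labels) {
        e = ran[l] + 0; g = got[l] + 0
        verdict = (e == 1 && g == 1) ? "ok" : "MISMATCH"
        printf "%s exec=%d received=%d %s\n", l, e, g, verdict
      }
    }
  ' "$log" "$received" | sort
}

# Usage (paths are examples):
# compare_delivery /tmp/toolprobe/log.txt /tmp/toolprobe/received.txt
```

A label with `exec=1 received=15` would match the duplicated-tail symptom, and `exec=1 received=0` would match an empty or lost block, while execution itself stays at one per label.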
- `claude --version`: 2.1.153 (Claude Code)
- date: Sat May 30 19:46:13 CDT 2026

Probe results:

- `PROBE OK label=R1 count=1`
- Call A → `PROBE OK label=A count=2`
- Call B → `PROBE OK label=B count=3`
- Call C → `PROBE OK label=C count=4`
- Call D → `PROBE OK label=D count=5`
- Call E → `PROBE OK label=E count=6`
- Call F → `PROBE OK label=F count=7`
- Call G → `PROBE OK label=G count=8`
- Call H → `PROBE OK label=H count=9`

Every label's `count=` value was strictly monotonic 1..25 across all three parallel batches: no two parallel calls reported the same count, no gaps, no dupes.

`seq 1 500 > /tmp/toolprobe/big.txt`

`Read` tool on `big.txt`: reported 501 numbered rows. Rows 1..500 contained the exact values "1".."500"; row 501 was the empty trailing-newline artifact (`wc -l` reports 500, matching). No phantom rows, no fabrication.

```bash
$ wc -l /tmp/toolprobe/log.txt
25 /tmp/toolprobe/log.txt
$ awk '{print $2}' /tmp/toolprobe/log.txt | sort | uniq -c
      1 A
      1 B
      1 C
      1 D
      1 E
      1 F
      1 G
      1 H
      1 I
      1 J
      1 K
      1 L
      1 M
      1 N
      1 O
      1 P
      1 Q
      1 R
      1 S
      1 T
      1 U
      1 V
      1 W
      1 X
      1 R1
```

| Label | exec count (log) | inline result blocks received |
|---|---|---|
| R1, A–X (25 total) | 1 each | 1 each |

Distinct labels: 25. Max count any label reached: 1. Dupes: 0. Empties: 0. Cancels: 0. Read fidelity exact.

n=1: the bug is intermittent, so one clean pass on 2.1.153 isn't absolute proof of absence. But it converges with the changelog-timed hypothesis and with upstream's 2.1.157-good / 2.1.158-bad bisect.

- [#63797](https://github.com/anthropics/claude-code/issues/63797): Linux/WSL primary. Reproduced in a fresh/short session after reboot, which empirically rules out the naive 64KB-pipe-buffer theory and points at the streaming pipeline itself.
- [#63966](https://github.com/anthropics/claude-code/issues/63966): macOS, fullest forensics. 88/88 tool use ↔ tool result paired in JSONL (0 orphans). Proves delivery lag, not execution loss. One `Read` had ~15 intervening tool uses before its result.
- [#63935](https://github.com/anthropics/claude-code/issues/63935): bisect tripwire. Clean 2.1.157-good / 2.1.158-bad; `/clear` does not fix; downgrade to 2.1.157 resolves the worst form. We target 2.1.153, not 2.1.157, because streaming-always-on dates to 2.1.154.
- [#64077](https://github.com/anthropics/claude-code/issues/64077): multi-minute buffering → model busy-waits with ~500 no-op poll commands.
- [#63859](https://github.com/anthropics/claude-code/issues/63859) / [#63538](https://github.com/anthropics/claude-code/issues/63538) / [#63884](https://github.com/anthropics/claude-code/issues/63884) / [#64065](https://github.com/anthropics/claude-code/issues/64065): out-of-order results with stale batch data when a parallel batch partially fails.
- [#64076](https://github.com/anthropics/claude-code/issues/64076): model-behavior facet. When a batch looks empty/cancelled, the model fabricates output (once even a fake user instruction).

Year-long lineage of the same delivery-not-execution class: #13984 (2025-12) → [#36038](https://github.com/anthropics/claude-code/issues/36038) (canonical precedent, closed not planned) → the current 2.1.154+ cluster.

Clean window: Agent View requires ≥2.1.139; regression starts at 2.1.154. So 2.1.153 keeps Agent View and predates the bug.

1. Disable the auto-updater FIRST (else the supervisor binary-watch re-bumps you). Add to `~/.claude/settings.json`:

   ```json
   { "env": { "DISABLE_AUTOUPDATER": "1" } }
   ```

   There is no `autoUpdates` settings key; `DISABLE_AUTOUPDATER` is an env var inside `settings.json`. Do not use `minimumVersion`: it sets a floor that would block the downgrade.
2. Downgrade with the built-in installer (no npm): `claude install 2.1.153`
3. Restart. Run `claude respawn --all` for background sessions.
4. Verify: `claude --version` → 2.1.153.

Rollback: `claude install 2.1.158` + remove `DISABLE_AUTOUPDATER`.

This is non-trivial: Anthropic shipped Opus 4.8 in the same release that broke delivery.
From the official CC changelog, here is everything 2.1.154–2.1.158 added that you lose by pinning to 2.1.153:

- Opus 4.8: "Opus 4.8 is here Now defaults to high effort · `/effort xhigh` for your hardest tasks." Pinning means your top model is Opus 4.7.
- Dynamic workflows: `/workflows` orchestrates work across tens to hundreds of agents in the background.
- Fast mode on Opus 4.8 at "a fraction of its previous cost: 2x the standard rate for 2.5x the speed."
- Lean system prompt as default for all models except Haiku, Sonnet, and Opus 4.7 and earlier.
- `/simplify` runs a cleanup-only review (reuse/simplification/efficiency) instead of full `/code-review --fix`.
- `claude agents`: