Claude Code 2.1.154–2.1.158 streaming tool-result delivery regression — probe + side-by-side evidence + workaround (and what you give up by pinning to 2.1.153)

Claude Code versions 2.1.154 through 2.1.158 contain a regression that corrupts the streaming delivery of tool results back to the model, causing results to arrive empty-then-late in bursts, duplicate inline blocks up to 15 times, and fabricate phantom content in file reads. A developer built a probe that appends labels to a log file before echoing stdout, comparing ground-truth execution data against what the delivery channel returned, confirming every command executed exactly once while only the result channel was broken. Pinning to version 2.1.153 restores clean delivery, as demonstrated by side-by-side evidence showing monotonic execution counts and no phantom rows in a 500-line file read.

Date: 2026-05-30 • Verified clean build: 2.1.153 • Broken builds: 2.1.154–2.1.158 clean 2.1.157-good / 2.1.158-bad bisect from another reporter; symptoms predate 2.1.158 CC 2.1.154–2.1.158 corrupts the path that delivers tool results back to the model. Execution is clean — commands run exactly once and disk state is correct. The channel that returns results is what's broken: results arrive empty-then-late in bursts, the tail of a parallel batch renders ~15× duplicated, Read sometimes returns phantom content a 76-line file reported as 5333 lines, with a row that wasn't in the file , and results flush out of order. Downstream: burns context + output tokens, invalidates prompt-cache hits, and provokes model-side amplification — busy-wait polling and fabricated results that drive retry cascades. Suspect temporal, not maintainer-confirmed : the 2.1.154 changelog entry "Streaming tool execution is now always enabled, including when telemetry is disabled or on Bedrock/Vertex/Foundry previously behind a feature flag " lines up exactly with when the cross-platform cluster ignites. The trick: append a label to a log file before echoing. The log is ground truth of what executed immune to the delivery channel ; stdout is what the delivery channel returned. Comparing the two cleanly separates "did the command run?" from "did the result arrive intact?" bash /usr/bin/env bash /tmp/toolprobe/probe.sh log=/tmp/toolprobe/log.txt echo "$ date +%s.%N $1" "$log" echo "PROBE OK label=$1 count=$ wc -l < "$log" " Run N labels as a single parallel Bash batch, then compare per-label execution count lines in log.txt vs inline result blocks received. The probe above was built because the 2.1.158 session corrupted tool-result delivery mid-task. So this side is qualitative — what the original broken session produced before the probe existed: Tail of a parallel Bash batch rendered ~15× duplicated in the inline result blocks the model received. Empty-then-burst arrival lagged ~1 turn — multiple parallel calls returned blank, then their content flushed together later.including a fabricated duplicate row not in the source file. Read of a 76-line file reported 5333 lines wc -l on the same file from a sibling Bash call returned 76.- Cross-checking ground truth a log file appended to from inside each command confirmed every command executed exactly once. Only the channel handing results back to the model was corrupted. The reason there isn't a structured 2.1.158 probe run below: surviving a corrupted-delivery session well enough to drive a structured probe is itself hard — the corruption shreds the harness state you'd use to script it. The clean 2.1.153 run below was the first opportunity to instrument cleanly. claude --version: 2.1.153 Claude Code date: Sat May 30 19:46:13 CDT 2026 PROBE OK label=R1 count=1 Call A → PROBE OK label=A count=2 Call B → PROBE OK label=B count=3 Call C → PROBE OK label=C count=4 Call D → PROBE OK label=D count=5 Call E → PROBE OK label=E count=6 Call F → PROBE OK label=F count=7 Call G → PROBE OK label=G count=8 Call H → PROBE OK label=H count=9 Every label's count= value was strictly monotonic 1..25 across all three parallel batches — no two parallel calls reported the same count, no gaps, no dupes. seq 1 500 /tmp/toolprobe/big.txt Read tool on big.txt : reported 501 numbered rows. Rows 1..500 contained the exact values "1".."500"; row 501 was empty trailing-newline artifact — wc -l reports 500, matching . No phantom rows, no fabrication. bash $ wc -l /tmp/toolprobe/log.txt 25 /tmp/toolprobe/log.txt $ awk '{print $2}' /tmp/toolprobe/log.txt | sort | uniq -c 1 A 1 B 1 C 1 D 1 E 1 F 1 G 1 H 1 I 1 J 1 K 1 L 1 M 1 N 1 O 1 P 1 Q 1 R 1 S 1 T 1 U 1 V 1 W 1 X 1 R1 | Label | exec count log | inline result blocks received | |---|---|---| | R1, A–X 25 total | 1 each | 1 each | Distinct labels: 25. Max count any label reached: 1. Dupes: 0. Empties: 0. Cancels: 0. Read fidelity exact. n=1 — the bug is intermittent, so one clean pass on 2.1.153 isn't absolute proof of absence. But it converges with the changelog-timed hypothesis and with upstream's 2.1.157-good / 2.1.158-bad bisect. — Linux/WSL primary. 63797 https://github.com/anthropics/claude-code/issues/63797 Reproduced fresh/short session after reboot , which empirically rules out the naive 64KB-pipe-buffer theory and points at the streaming pipeline itself.— macOS, fullest forensics. 63966 https://github.com/anthropics/claude-code/issues/63966 88/88 Proves delivery lag, not execution loss. One tool use ↔ tool result paired in JSONL 0 orphans . Read had ~15 intervening tool uses before its result.— bisect tripwire. Clean 63935 https://github.com/anthropics/claude-code/issues/63935 2.1.157-good / 2.1.158-bad ; /clear does not fix; downgrade to 2.1.157 resolves the worst form. We target 2.1.153 not 2.1.157 because streaming-always-on dates to 2.1.154.— multi-minute buffering → model busy-waits with ~500 no-op poll commands. 64077 https://github.com/anthropics/claude-code/issues/64077 — out-of-order results with stale batch data when a parallel batch partially fails. 63859 https://github.com/anthropics/claude-code/issues/63859 / 63538 https://github.com/anthropics/claude-code/issues/63538 / 63884 https://github.com/anthropics/claude-code/issues/63884 / 64065 https://github.com/anthropics/claude-code/issues/64065 — model-behavior facet: when a batch looks empty/cancelled, the model fabricates output once even a fake user instruction . 64076 https://github.com/anthropics/claude-code/issues/64076 Year-long lineage of the same delivery-not-execution class: 13984 2025-12 → canonical precedent, closed not planned → the current 2.1.154+ cluster. 36038 https://github.com/anthropics/claude-code/issues/36038 Clean window: Agent View requires ≥2.1.139; regression starts at 2.1.154. So 2.1.153 keeps Agent View and predates the bug. Disable the auto-updater FIRST else the supervisor binary-watch re-bumps you . Add to ~/.claude/settings.json :There is { "env": { "DISABLE AUTOUPDATER": "1" } } no autoUpdates settings key — DISABLE AUTOUPDATER is an env var inside settings.json . Do not use minimumVersion — it sets a floor that would block the downgrade. Downgrade with the built-in installer no npm : claude install 2.1.153 Restart. claude respawn --all for background sessions. Verify claude --version → 2.1.153. Rollback: claude install 2.1.158 + remove DISABLE AUTOUPDATER . This is non-trivial — Anthropic shipped Opus 4.8 in the same release that broke delivery. From the official CC changelog, here is everything 2.1.154–2.1.158 added that you lose by pinning to 2.1.153: Opus 4.8 — "Opus 4.8 is here Now defaults to high effort · Pinning means your top model is Opus 4.7. /effort xhigh for your hardest tasks." Dynamic workflows — /workflows orchestrates work across tens to hundreds of agents in the background.- Fast mode on Opus 4.8 at "a fraction of its previous cost: 2x the standard rate for 2.5x the speed." - Lean system prompt as default for all models except Haiku, Sonnet, and Opus 4.7 and earlier. /simplify runs a cleanup-only review reuse/simplification/efficiency instead of full /code-review --fix . claude agents : <command to run a shell command as a background session you can attach to / detach from.- Improved auto-mode classifier detection of bulk-repo data-exfil patterns. - Claude Opus 4.8 support added to the /claude-api skill. - Fix for Opus 4.8 thinking-block modification causing API errors. - Plugins in .claude/skills/ auto-load no marketplace required . claude plugin init <name to scaffold a new plugin in .claude/skills . /plugin autocomplete for subcommands, installed plugin names, and marketplaces. claude agents : agent field in settings.json honored for dispatched sessions, --agent <name to override. EnterWorktree can switch between Claude-managed worktrees mid-session.- Several IDE / VSCode / WSL fixes image paste on WSL, --resume background-subagent reporting, etc. . - Auto mode on Bedrock/Vertex/Foundry for Opus 4.7 + 4.8 via CLAUDE CODE ENABLE AUTO MODE=1 . Pin to 2.1.153 if reliable tool-result delivery matters more than Opus 4.8 + dynamic workflows for your work. Stay on 2.1.158 if Opus 4.8 or dynamic workflows are load-bearing for you, and ride the mitigations checklist below. The bug is real on that path; you'd be managing it, not avoiding it. Redirect to file, then Read. cmd /tmp/out 2 &1 then read the file. Most-cited workaround history of the same class: 27004, 63966 . Defeat block-buffering when piping. stdbuf -oL / -o0 , unbuffer , grep --line-buffered , PYTHONUNBUFFERED=1 , or a PTY. Keep stdout small. BASH MAX OUTPUT LENGTH . Don't poll a pending result. Polling can't pull a buffered result through; it only burns turns 64077 . Don't fabricate to fill a gap. Empty = "delivery lag, re-verify", never confabulate this is what produces the model-side fabrication cascade . Fewer, fatter tool calls. Burst delivery hurts wide parallel fan-outs more than serial ones. Operational palliatives: /compact , restart between tasks, move heavy work to a subagent fresh subprocess often works when the main loop is wedged . CLAUDE CODE DISABLE NONSTREAMING FALLBACK=1 — Anthropic's own env-var docs say the streaming→non-streaming fallback can "produce duplicate tool execution." Kills the duplicating fallback but not the underlying pipeline. CLAUDE CODE ENABLE FINE GRAINED TOOL STREAMING=0 — input-side only; doesn't address tool-result delivery. /clear — does not fix 63935 .- "Wait it out" — execution finished; the channel won't unstick on its own. - The "streaming-always-on @ 2.1.154 ↔ regression cluster" link is temporal/circumstantial , not maintainer-confirmed. No maintainer reply on the cluster issues as of writing. - Pure 64KB-pipe-buffer exhaustion was empirically contradicted as the cluster trigger fresh/short sessions repro per 63797 . Still explains older "worsens as session grows" reports like 36038. - Env vars seen only in third-party blog lists CLAUDE CODE EAGER FLUSH , CLAUDE STREAM IDLE TIMEOUT MS , etc. are not on the official env-vars page — don't rely on them unverified. - Latest CC at write-time = 2.1.158 ; its changelog has no entry addressing tool-result delivery/buffering.