cd /news/ai-tools/claude-code-2-1-154-2-1-158-streamin… · home topics ai-tools article
[ARTICLE · art-18973] src=gist.github.com pub= topic=ai-tools verified=true sentiment=↓ negative

Claude Code 2.1.154–2.1.158 streaming tool-result delivery regression — probe + side-by-side evidence + workaround (and what you give up by pinning to 2.1.153)

Claude Code versions 2.1.154 through 2.1.158 contain a regression that corrupts the streaming delivery of tool results back to the model, causing results to arrive empty-then-late in bursts, duplicate inline blocks up to 15 times, and fabricate phantom content in file reads. A developer built a probe that appends labels to a log file before echoing stdout, comparing ground-truth execution data against what the delivery channel returned, confirming every command executed exactly once while only the result channel was broken. Pinning to version 2.1.153 restores clean delivery, as demonstrated by side-by-side evidence showing monotonic execution counts and no phantom rows in a 500-line file read.

read7 min publishedMay 31, 2026

Date: 2026-05-30 • Verified clean build: 2.1.153 • Broken builds: 2.1.154–2.1.158 (clean 2.1.157-good / 2.1.158-bad bisect from another reporter; symptoms predate 2.1.158)

CC 2.1.154–2.1.158 corrupts the path that delivers tool results back to the model. Execution is clean — commands run exactly once and disk state is correct. The channel that returns results is what's broken: results arrive empty-then-late in bursts, the tail of a parallel batch renders ~15× duplicated, Read

sometimes returns phantom content (a 76-line file reported as 5333 lines, with a row that wasn't in the file), and results flush out of order.

Downstream: burns context + output tokens, invalidates prompt-cache hits, and provokes model-side amplification — busy-wait polling and fabricated results that drive retry cascades.

Suspect (temporal, not maintainer-confirmed): the 2.1.154 changelog entry "Streaming tool execution is now always enabled, including when telemetry is disabled or on Bedrock/Vertex/Foundry (previously behind a feature flag)" lines up exactly with when the cross-platform cluster ignites.

The trick: append a label to a log file before echoing. The log is ground truth of what executed (immune to the delivery channel); stdout is what the delivery channel returned. Comparing the two cleanly separates "did the command run?" from "did the result arrive intact?"

#!/usr/bin/env bash
log=/tmp/toolprobe/log.txt
echo "$(date +%s.%N) $1" >> "$log"
echo "PROBE_OK label=$1 count=$(wc -l < "$log")"

Run N labels as a single parallel Bash batch, then compare per-label execution count (lines in log.txt

) vs inline result blocks received.

The probe above was built because the 2.1.158 session corrupted tool-result delivery mid-task. So this side is qualitative — what the original broken session produced before the probe existed:

Tail of a parallel Bash batch rendered ~15× duplicated in the inline result blocks the model received.Empty-then-burst arrival lagged ~1 turn— multiple parallel calls returned blank, then their content flushed together later.including a fabricated duplicate row not in the source file.Read

of a 76-line file reported 5333 lineswc -l

on the same file from a sibling Bash call returned 76.- Cross-checking ground truth (a log file appended to from inside each command) confirmed every command executed exactly once. Only the channel handing results back to the model was corrupted.

The reason there isn't a structured 2.1.158 probe run below: surviving a corrupted-delivery session well enough to drive a structured probe is itself hard — the corruption shreds the harness state you'd use to script it. The clean 2.1.153 run below was the first opportunity to instrument cleanly.

claude --version: 2.1.153 (Claude Code)
date: Sat May 30 19:46:13 CDT 2026
PROBE_OK label=R1 count=1
Call A → PROBE_OK label=A count=2
Call B → PROBE_OK label=B count=3
Call C → PROBE_OK label=C count=4
Call D → PROBE_OK label=D count=5
Call E → PROBE_OK label=E count=6
Call F → PROBE_OK label=F count=7
Call G → PROBE_OK label=G count=8
Call H → PROBE_OK label=H count=9

(Every label's count=

value was strictly monotonic 1..25 across all three parallel batches — no two parallel calls reported the same count, no gaps, no dupes.)

seq 1 500 > /tmp/toolprobe/big.txt

Read tool on big.txt

: reported 501 numbered rows. Rows 1..500 contained the exact values "1".."500"; row 501 was empty (trailing-newline artifact — wc -l

reports 500, matching). No phantom rows, no fabrication.

$ wc -l /tmp/toolprobe/log.txt
25 /tmp/toolprobe/log.txt

$ awk '{print $2}' /tmp/toolprobe/log.txt | sort | uniq -c
      1 A   1 B   1 C   1 D   1 E   1 F   1 G   1 H
      1 I   1 J   1 K   1 L   1 M   1 N   1 O   1 P
      1 Q   1 R   1 S   1 T   1 U   1 V   1 W   1 X
      1 R1
Label exec count (log) inline result blocks received
R1, A–X (25 total) 1 each 1 each

Distinct labels: 25. Max count any label reached: 1. Dupes: 0. Empties: 0. Cancels: 0. Read fidelity exact.

n=1 — the bug is intermittent, so one clean pass on 2.1.153 isn't absolute proof of absence. But it converges with the changelog-timed hypothesis and with upstream's 2.1.157-good / 2.1.158-bad bisect.

— Linux/WSL primary.#63797Reproduced fresh/short session after reboot, which empirically rules out the naive 64KB-pipe-buffer theory and points at the streaming pipeline itself.— macOS, fullest forensics.#6396688/88 Proves delivery lag, not execution loss. Onetool_use

tool_result

paired in JSONL (0 orphans).Read

had ~15 intervening tool uses before its result.— bisect tripwire. Clean#639352.1.157-good / 2.1.158-bad;/clear

does not fix; downgrade to 2.1.157 resolves the worst form. We target 2.1.153 not 2.1.157 because streaming-always-on dates to 2.1.154.— multi-minute buffering → model busy-waits with ~500 no-op poll commands.#64077— out-of-order results with stale batch data when a parallel batch partially fails.#63859/#63538/#63884/#64065— model-behavior facet: when a batch looks empty/cancelled, the model fabricates output (once even a fake user instruction).#64076

Year-long lineage of the same delivery-not-execution class: ** #13984** (2025-12) →

(canonical precedent, closed not_planned) → the current 2.1.154+ cluster.

#36038Clean window: Agent View requires ≥2.1.139; regression starts at 2.1.154. So 2.1.153 keeps Agent View and predates the bug.

Disable the auto-updater FIRST(else the supervisor binary-watch re-bumps you). Add to~/.claude/settings.json

:There is

{ "env": { "DISABLE_AUTOUPDATER": "1" } }

noautoUpdates

settings key —DISABLE_AUTOUPDATER

is an env var insidesettings.json

. Donot useminimumVersion

— it sets a floor that would block the downgrade.Downgrade with the built-in installer (no npm):

claude install 2.1.153

Restart.claude respawn --all

for background sessions. Verifyclaude --version

→ 2.1.153.

Rollback: claude install 2.1.158

  • remove DISABLE_AUTOUPDATER

.

This is non-trivial — Anthropic shipped Opus 4.8 in the same release that broke delivery. From the official CC changelog, here is everything 2.1.154–2.1.158 added that you lose by pinning to 2.1.153:

Opus 4.8—*"Opus 4.8 is here! Now defaults to high effort ·*Pinning means your top model is Opus 4.7./effort xhigh

for your hardest tasks."Dynamic workflows/workflows

orchestrates work across tens to hundreds of agents in the background.- Fast mode on Opus 4.8 at "a fraction of its previous cost: 2x the standard rate for 2.5x the speed." - Lean system prompt as default for all models except Haiku, Sonnet, and Opus 4.7 and earlier. /simplify

runs a cleanup-only review (reuse/simplification/efficiency) instead of full/code-review --fix

.claude agents

:! <command>

to run a shell command as a background session you can attach to / detach from.- Improved auto-mode classifier detection of bulk-repo data-exfil patterns.

  • Claude Opus 4.8 support added to the /claude-api

skill.

  • Fix for Opus 4.8 thinking-block modification causing API errors.

  • Plugins in .claude/skills/

auto-load (no marketplace required). claude plugin init <name>

to scaffold a new plugin in.claude/skills

./plugin

autocomplete for subcommands, installed plugin names, and marketplaces.claude agents

:agent

field insettings.json

honored for dispatched sessions,--agent <name>

to override.EnterWorktree

can switch between Claude-managed worktrees mid-session.- Several IDE / VSCode / WSL fixes (image paste on WSL, --resume

background-subagent reporting, etc.).

  • Auto mode on Bedrock/Vertex/Foundry for Opus 4.7 + 4.8 via CLAUDE_CODE_ENABLE_AUTO_MODE=1

.

Pin to 2.1.153 if reliable tool-result delivery matters more than Opus 4.8 + dynamic workflows for your work.Stay on 2.1.158 if Opus 4.8 or dynamic workflows are load-bearing for you, and ride the mitigations checklist below. The bug is real on that path; you'd be managing it, not avoiding it.

Redirect to file, then Read.cmd > /tmp/out 2>&1

then read the file. Most-cited workaround (history of the same class: #27004, #63966).Defeat block-buffering when piping.stdbuf -oL

/-o0

,unbuffer

,grep --line-buffered

,PYTHONUNBUFFERED=1

, or a PTY.Keep stdout small.BASH_MAX_OUTPUT_LENGTH

.Don't poll a pending result. Polling can't pull a buffered result through; it only burns turns (#64077).Don't fabricate to fill a gap. Empty = "delivery lag, re-verify", never confabulate (this is what produces the model-side fabrication cascade).Fewer, fatter tool calls. Burst delivery hurts wide parallel fan-outs more than serial ones.Operational palliatives:/compact

, restart between tasks, move heavy work to a subagent (fresh subprocess often works when the main loop is wedged).

CLAUDE_CODE_DISABLE_NONSTREAMING_FALLBACK=1

— Anthropic's own env-var docs say the streaming→non-streaming fallback can "produce duplicate tool execution." Kills the duplicating fallback but not the underlying pipeline.CLAUDE_CODE_ENABLE_FINE_GRAINED_TOOL_STREAMING=0

— input-side only; doesn't address tool-result delivery./clear

— does not fix (#63935).- "Wait it out" — execution finished; the channel won't unstick on its own.

  • The "streaming-always-on @ 2.1.154 ↔ regression cluster" link is temporal/circumstantial, not maintainer-confirmed. No maintainer reply on the cluster issues as of writing. - Pure 64KB-pipe-buffer exhaustion was empirically contradicted as the cluster trigger (fresh/short sessions repro per #63797). Still explains older "worsens as session grows" reports like #36038. - Env vars seen only in third-party blog lists ( CLAUDE_CODE_EAGER_FLUSH

,CLAUDE_STREAM_IDLE_TIMEOUT_MS

, etc.) arenot on the official env-vars page — don't rely on them unverified. - Latest CC at write-time = 2.1.158; its changelog has** no**entry addressing tool-result delivery/buffering.

── more in #ai-tools 4 stories · sorted by recency
sponsored brought to you by zahid.host 4,200+ EU-deployed projects
reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain — perfect for shipping the agent you just read about.

$git push zahid main
Live at https://your-agent.zahid.host
Get free account → Pricing
from €0/mo · no card required
LIVE [news/claude-code-2-1-154-…] indexed:0 read:7min 2026-05-31 ·