# The token is valid — but your headless Claude Code agent just 401'd forever

> Source: <https://dev.to/drickon/the-token-is-valid-but-your-headless-claude-code-agent-just-401d-forever-48ip>
> Published: 2026-06-28 09:03:09+00:00

**TL;DR:** A static OAuth access token can return HTTP 200 on a raw `/v1/messages`

call at the exact instant a long-running Claude Code instance using that *same token* gets 401 "Invalid authentication credentials" — because the rejection is bound to the instance's own server-side session identity, not the token. Worse, once it 401s the instance hard-latches and never self-recovers until you restart the process, so any "is the token valid?" probe is structurally blind to the problem.

We run several headless Claude Code instances on Linux — long-running, unattended (systemd services in our case). Authentication is a single static `CLAUDE_CODE_OAUTH_TOKEN`

environment variable: an `sk-ant-oat01…`

OAuth access token from a Claude Max subscription, minted with `claude setup-token`

. It has **no refresh token**, and the instances never touch `~/.claude/.credentials.json`

(the rotating credential file). Auth is purely the static env token. We're on Claude Code v2.1.195, the latest stable as of this writing.

Recurrently, an instance's model API calls start returning HTTP 401 ("Invalid authentication credentials" / the CLI shows "Please run /login"). Across our fleet over 2026-06-13..06-28 we logged **212 distinct 401 windows / 245 request_ids** — roughly 8 per day fleet-wide. Windows last from seconds to ~125 minutes, rarely up to ~7 hours.

The obvious diagnosis is "the token expired / got revoked." We chased that and found it's wrong. Here's what's actually happening, finding by finding.

This is the non-obvious one, so lead with it.

During a *live* wedge — an instance actively returning 401 on its own turns — we fired raw `POST https://api.anthropic.com/v1/messages`

using the **same static oat01 token the wedged instance uses**. We tried it in many shapes: minimal; agent-shaped; large cache-creation; streaming; 12 tools; with metadata; resumed-style.

The token is valid. The account is fine. The request shape, size, model, and source IP are all fine — the raw probe shares all of them and succeeds. The only thing the probe does *not* share is the wedged instance's own long-lived server-side session/process identity.

Conclusion: **the rejection is bound to the instance's own server-side session identity** — not the token, not the request, not the account.

Across **412 sessions / 153 distinct 401 events**, the number that self-recovered without a process restart was **zero**. Even after the upstream rejection window closes — even after a raw probe on that token is happily returning 200 — the instance stays latched until you restart it.

Note what this rules out. We're on v2.1.195, which already ships Anthropic's v2.1.117 "reactive token refresh on 401" fix and the v2.1.178 "stale cached request configuration" fix. It still latches. That's consistent with Finding 1: re-minting or refreshing the token cannot help when the rejection is bound to session identity rather than to the token.

This follows directly. Any external "is the token valid?" probe shares the token but **not** the wedged session identity, so it returns 200 throughout the entire outage. "Token is valid" tells you nothing about whether the instance is latched.

This is the single most important operational lesson here: **never gate recovery on a token probe.** A green probe and a dead agent coexist happily. We verify recovery only by observing an actual non-401 turn from the instance itself.

Flag this as *distinct* from the 401 latch — it's a different failure that's easy to conflate.

In one ~7-hour outage, direct probes showed **Opus 4.8 and Sonnet 4.6 returning HTTP 429 rate_limit_error** (a generic "Error" body,

`x-should-retry: true`

, no `retry-after`

header) while The trap: a naive probe that hits Haiku reads 200 and reports "token fine," completely missing a big-model-tier throttle. If you're going to probe at all (and per Finding 3, be skeptical), probe the tier you actually run on.

One more pattern, marked unproven because the mechanism isn't established. Rebuilding 54 genuine 401 episodes from session logs, **idle-wake episodes (>1h idle) were 71% morning vs. mid-use episodes (≤1h idle) at 0% morning**. That's suggestive that the server-side session identity may go stale after a long idle period. It's real but a minority of episodes, and we have not proven the mechanism — treat it as a lead, not a conclusion.

This isn't just our fleet. GitHub `anthropics/claude-code`

#61912 captured the **same token returning 200 on /oauth/hello and 401 on /v1/messages in the same second**, token unexpired — the same session-bound, probe-blind phenomenon. (That report attributes it to credential-file corruption, which can't apply here: our token is static with no refresh and the instances never read the credential file.)

Our mitigation is a watchdog with two design choices worth stealing:

The third design choice that matters: a **quiet-window backoff**. The upstream rejection window can stay open for many minutes. If the watchdog restarts on a fixed short interval, it just restart-storms *into* a still-open window and churns. So it backs off, giving the upstream window time to close before the next recovery attempt, and confirms by outcome rather than by a clock.

We're characterizing the failure precisely, not claiming we know its upstream root cause. Two asks:

A few request_ids for tracing (all HTTP 401 `authentication_failed`

, token valid throughout):

`req_011CcVDWWs8GPfDyX8R9LEfW`

(2026-06-28 01:52 CDT)`req_011CcVDW3MetrtoQqLU2m8cn`

(2026-06-28 01:52 CDT)`req_011CcUaNDekFKPWogaeZ9adT`

(2026-06-27 17:45 CDT)`req_011Cc3X8oSfApMWRCs66taQw`

(2026-06-14 12:10 CDT)If you run headless Claude Code agents and have seen the silent-death-after-401 pattern, the takeaways are: restart clears it, the token was never the problem, and a token probe will read green the entire time. Build your watchdog to verify by outcome, and back off so you don't restart into an open window.