# The billing bet that killed my coding-agent harness

> Source: <https://dev.to/bfxavier/the-billing-bet-that-killed-my-coding-agent-harness-17a9>
> Published: 2026-06-13 10:49:59+00:00

Filomena is a local harness I built to run a few Claude Code agents from one terminal. Six months of work. Last week I stopped pretending I'd keep going with it. Claude Code now ships most of what it did, and the bet underneath the whole thing was wrong.

Filomena drove the `claude`

CLI through a pseudo-terminal. A virtual VT100 screen (pyte) read what the agent was doing, and a versioned classifier mapped the screen to states. Safety was a Claude Code PreToolUse hook on Bash/Write/Edit: the hook blocked before a tool ran, sent the action to the harness, and waited for allow or deny.

The point of it was a single approval gate with graduated autonomy. Low-risk actions ran automatically and got recorded. High-risk ones stopped for me. The gate was supposed to be a brake, not something I pressed all day.

Headless mode (`claude -p`

) bills differently. So I drove the interactive CLI instead, over a PTY, because interactive use runs on the flat subscription. The plan was to run two or three agents at once on my Max plan without watching a meter.

I didn't think about how long that would last.

Over the same months, Claude Code grew most of what I was hand-building. Agent View became the dashboard I'd written a cockpit for. Subagents and Agent Teams were my multi-agent loop. Workflows let it write an orchestration script and fan out a fleet, which happened to be the next thing on my roadmap. The shared-memory store I'd hacked together turned into CLAUDE.md and an MCP server. Watching it arrive feature by feature was grim. More than once I opened the release notes and found the feature I'd blocked out the weekend to build.

On June 15 2026 the subscription stopped subsidizing agents. The `claude -p`

command, the Agent SDK, and third-party apps built on it moved to a separate metered credit pool ($100/month on Max 5x, $200 on Max 20x, API rates after). Interactive Claude Code stayed on the subscription.

I read that as good news at first. SDK-based tools would meter, I'd stay on subscription, that was my edge. Then I looked at what native multi-agent actually is. Subagents, Agent View and Workflows are native Claude Code surfaces, not third-party SDK harnesses. I couldn't defend a product thesis built on my PTY harness having a cost edge over first-party Claude Code. The only tools I was clearly cheaper than were the SDK ones nobody runs on a laptop anyway.

The moment it landed was boring. I had the billing docs open next to my roadmap, and the next feature on the roadmap was basically rebuild Workflows, but through a terminal screen.

I'm reading a terminal screen with pyte and guessing state from text. The native agents have APIs into the same loop. When the `claude`

TUI changes, my classifier breaks and theirs doesn't. I was always going to be the less reliable option for the same job. That's structural, not a bug I could fix.

One release shipped with 2793 tests green and the entire secrets vault missing from the wheel. The tests imported the package from the repo checkout; the built wheel shipped a different path. Green locally, ImportError on install. My own dogfood log had flagged it the day before ("flag for v1.1.1") and it shipped broken anyway. v1.1.0 was dead on arrival, fixed the next day in v1.1.1.

One night the agents building it kept dying with 529 Overloaded. There was a blip on the status page, so an outage was the easy thing to blame. It wasn't one. I'd burned about 9.5 million tokens of subagent work and hit my own plan's usage limit without watching it. A dying agent finally printed "You've hit your session limit, resets 7:30pm." No outage, just me not reading my own meter.

Another time the suite just stopped for seventeen minutes. No error, no leaked processes. A self-test probe inside the approval gate had leaked on cancellation and sat holding pytest's captured stdout. Took me far too long to find. The fix was one commit (62b12c1), reap the probe in a finally block.

And the one that's on theme: the agents I used to verify the harness couldn't run the harness. No tty in their environment, so the certs ran through the Python API and pytest, never the real screen-scraping path. The least reliable part was the hardest to test, so it got tested the least.

It's small, and none of it is the machinery I was proud of.

The core is the risk classifier. Native background agents pre-grant a permission set and auto-deny anything outside it, which is blunt. Mine reads each action: read-only vs test vs network vs destructive, write paths against scope and deny globs, a file-count ceiling, whether a secret is involved, fail-closed when it can't tell. That difference is the only reason I didn't just delete the repo.

The rest is the stuff I'd have missed if I had. A scrubber that strips configured secret values out of tool output before Claude sees the result and before my receipts write it down. A composed environment, so a spawned agent gets what I hand it and not my whole shell. The panic stop, now a hook-level flag instead of the real process-group kill it used to be, since the hook doesn't own the child processes. A runaway check that denies a tool input once it has repeated too many times.

It still writes a receipt:

```
[approved] Edit calc.py           via policy:write_scope
[approved] Bash python -m pytest  via policy:check_command
[merged]   agent1 -> main         (check green)
```

I can see what ran and what decided it, and it doesn't stop me for the safe parts.

None of that needs a PTY or a cockpit. It rides on the native agents instead of replacing them.

The orchestration was never going to stay mine. Six months in, the cockpit, the multi-agent loop, the memory store and the worktree plumbing all exist first-party now, and they run better than mine ever did.

The policy is the part that didn't get absorbed. What an agent can do on my machine, and a record of what it did, I still had to write myself. Anthropic can have the orchestration. I wanted my own rules about what the agent is allowed to touch, and those I keep.

So I pulled the classifier, the scrubber, the env composer and the two emergency controls out of the harness and threw the rest away. That's Guard Dog: [guard-dog](https://github.com/bfxavier/guard-dog). The harness is archived.