Source: dev.to · Topic: developer-tools

My agent was 'succeeding' on Slack while silently doing nothing — here's the monitoring stack that caught it

A developer discovered that their Slack bot was reporting success while doing none of its actual work. The problem was caught by a monitoring stack that treats Slack, D1 logging, and cost tracking as one connected pipeline. A lean schema with a tokens_used column made per-agent costs visible and showed that an analysis tool's monthly Claude spend was $38, not the estimated $20. The developer also identified two hidden failure modes: MCP stdio transport timeouts, and Slack notifications being cut off when a Worker returns its response.

Published Jun 17, 2026 · 2 min read

My Slack bot was firing success messages while the D1 row count sat completely frozen. The agent wasn't crashing: it was completing its Slack notification step and skipping all the actual work. Alerts alone would never have caught this.
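A minimal sketch of a reconciliation check for this failure mode. It assumes you keep your own per-agent count of Slack success messages (a hypothetical counter, not something Slack provides) and compare it with the number of rows logged with status 'ok' in the agent_runs table over the same time window. The function name and agent names are made up for illustration.

```typescript
// Per-agent counts for one time window (e.g. the last hour).
interface WindowCounts {
  slackSuccesses: number; // success notifications sent in the window (hypothetical counter)
  okRows: number; // rows with status = 'ok' written to agent_runs in the same window
}

// Returns the agents whose Slack "success" messages outnumber the rows they
// actually logged: the "succeeding while doing nothing" pattern.
function findSilentSuccess(counts: Record<string, WindowCounts>): string[] {
  return Object.entries(counts)
    .filter(([, c]) => c.slackSuccesses > c.okRows)
    .map(([agent]) => agent)
    .sort();
}
```

For example, findSilentSuccess({ summarizer: { slackSuccesses: 12, okRows: 0 } }) returns ["summarizer"]: twelve success messages and zero rows is exactly the bug described above.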

The fix was treating Slack, D1 logging, and cost tracking as one connected pipeline instead of three separate tools. The schema I landed on is deliberately lean (7 columns), and tokens_used is the one that matters most:

```sql
CREATE TABLE agent_runs (
  id          TEXT PRIMARY KEY,
  agent_name  TEXT NOT NULL,
  run_at      INTEGER NOT NULL,
  status      TEXT NOT NULL,  -- 'ok' | 'error' | 'timeout'
  duration_ms INTEGER,
  tokens_used INTEGER,
  error_msg   TEXT
);
```
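One way to wire this table into the pipeline, sketched under assumptions: write the row first, and only send the Slack success message if D1 reports that a row was actually inserted. D1Like below is a minimal structural stand-in for the parts of Cloudflare's D1Database binding used here (prepare, bind, run, and the meta.changes field on the result); in a real Worker you would use the type from @cloudflare/workers-types. Storing run_at as epoch milliseconds is an assumption; the schema only says INTEGER.

```typescript
// Minimal stand-in for the subset of D1Database used in this sketch.
interface D1Like {
  prepare(query: string): {
    bind(...values: unknown[]): { run(): Promise<{ meta: { changes: number } }> };
  };
}

type RunStatus = "ok" | "error" | "timeout";

interface AgentRun {
  id: string;
  agentName: string;
  runAt: number; // epoch milliseconds (assumption)
  status: RunStatus;
  durationMs: number | null;
  tokensUsed: number | null;
  errorMsg: string | null; // D1 accepts null; undefined is not a valid bind value
}

const INSERT_RUN =
  "INSERT INTO agent_runs (id, agent_name, run_at, status, duration_ms, tokens_used, error_msg) " +
  "VALUES (?, ?, ?, ?, ?, ?, ?)";

// Column order must match the placeholders in INSERT_RUN.
function toBindValues(run: AgentRun): unknown[] {
  return [run.id, run.agentName, run.runAt, run.status, run.durationMs, run.tokensUsed, run.errorMsg];
}

// Writes the row and reports whether D1 actually changed one row.
// Callers send the Slack "success" message only when this resolves to true.
async function logRun(db: D1Like, run: AgentRun): Promise<boolean> {
  const result = await db.prepare(INSERT_RUN).bind(...toBindValues(run)).run();
  return result.meta.changes === 1;
}
```

Inside an agent this might read: if (await logRun(env.DB, run)) { notify Slack }, where env.DB is a hypothetical binding name. Gating the notification on the write is what ties Slack and D1 into one pipeline: a success message can no longer exist without a matching row.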

Without tokens_used in D1, your cost dashboard only tells you what you spent today, not which agent burned it. I was mentally estimating my analysis tool's monthly Claude spend at around $20. The actual number was $38. I had no idea until I ran the aggregation query against the logs.
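A sketch of what such an aggregation could look like, assuming a GROUP BY over agent_runs (shown in the comment) and a hypothetical blended price per million tokens that you supply yourself. Because tokens_used is a single column with no input/output split, the result is an approximation of the bill, not a reconciliation with it.

```typescript
// One row from a per-agent aggregation such as:
//   SELECT agent_name, SUM(tokens_used) AS tokens
//   FROM agent_runs WHERE run_at >= ? GROUP BY agent_name;
interface AgentTokenTotal {
  agent_name: string;
  tokens: number | null; // SUM() is NULL when every tokens_used in the group is NULL
}

// usdPerMillionTokens is a caller-supplied, hypothetical blended rate.
function costByAgent(rows: AgentTokenTotal[], usdPerMillionTokens: number): Map<string, number> {
  const costs = new Map<string, number>();
  for (const row of rows) {
    const tokens = row.tokens ?? 0;
    const usd = (tokens / 1_000_000) * usdPerMillionTokens;
    // Round to cents for display.
    costs.set(row.agent_name, Math.round(usd * 100) / 100);
  }
  return costs;
}
```

With a rate of 6, an agent that used 2,000,000 tokens in the window maps to 12; an agent whose rows never recorded tokens maps to 0, which is itself a signal that logging is incomplete.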

The other thing that bit me: MCP stdio transport timeouts. I spent days convinced it was an OAuth misconfiguration. It wasn't. One tool was calling an external API that occasionally took 30+ seconds. The stdio transport's read timeout was shorter than that, and Workers' 30-second wall-clock limit made it worse: the subprocess got force-killed and reported as a misleading exit code 1 error. The timeout status rows piling up in D1 are what actually surfaced the pattern. Without that log, it would have stayed a vague "sometimes slow" complaint.
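A minimal sketch of how a tool call can end up as a 'timeout' row rather than an opaque crash: race the call against your own deadline, set below whatever limit would otherwise kill it. The deadline value and the ToolOutcome shape are hypothetical. Note that this does not cancel the slow call; it only ensures the outcome is recorded with the right status.

```typescript
type RunStatus = "ok" | "error" | "timeout";

interface ToolOutcome<T> {
  status: RunStatus;
  value?: T;
  errorMsg?: string;
  durationMs: number;
}

// Races a tool call against a deadline so a slow external API is classified as
// 'timeout' instead of surfacing later as a transport or process error.
// The losing call keeps running in the background; only the recorded outcome changes.
async function runWithDeadline<T>(call: Promise<T>, deadlineMs: number): Promise<ToolOutcome<T>> {
  const started = Date.now();
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<"timeout">((resolve) => {
    timer = setTimeout(() => resolve("timeout"), deadlineMs);
  });
  try {
    const winner = await Promise.race([call.then((value) => ({ value })), timeout]);
    if (winner === "timeout") {
      return { status: "timeout", errorMsg: `no response within ${deadlineMs} ms`, durationMs: Date.now() - started };
    }
    return { status: "ok", value: winner.value, durationMs: Date.now() - started };
  } catch (err) {
    return { status: "error", errorMsg: String(err), durationMs: Date.now() - started };
  } finally {
    clearTimeout(timer); // avoid leaving the deadline timer pending after a fast call
  }
}
```

The returned status and durationMs map directly onto the status and duration_ms columns, so a cluster of 'timeout' rows for one tool points at the slow dependency rather than at auth.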

On the Workers side: if you fire a Slack notification as a fire-and-forget fetch instead of passing the promise to ctx.waitUntil, the fetch can be cancelled the moment your Worker returns a response. That's why webhooks seem to fail randomly: they're not failing, they're being terminated. Also worth knowing: Slack returns a 400 if your payload exceeds 4,000 characters, and raw error stack traces will blow past that limit silently.
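A sketch combining both fixes: truncate the message before sending, and hand the webhook promise to ctx.waitUntil so the runtime keeps it alive after the Response goes out. WaitUntilContext is a minimal stand-in for the Workers ExecutionContext type; the 4,000-character limit is taken from the paragraph above, so check Slack's documentation for the exact constraint on the field you post. The webhook URL is a hypothetical Slack incoming-webhook address.

```typescript
// Minimal stand-in for the Workers ExecutionContext (from @cloudflare/workers-types in a real Worker).
interface WaitUntilContext {
  waitUntil(promise: Promise<unknown>): void;
}

// Limit as stated in the article; verify against Slack's current docs.
const SLACK_TEXT_LIMIT = 4000;

// Cuts long text (e.g. raw stack traces) so the final string fits within the limit.
function truncateForSlack(text: string, limit = SLACK_TEXT_LIMIT): string {
  if (text.length <= limit) return text;
  const marker = "\n...[truncated]";
  return text.slice(0, limit - marker.length) + marker;
}

// Queues the webhook POST on ctx.waitUntil instead of leaving it as a floating promise.
function notifySlack(ctx: WaitUntilContext, webhookUrl: string, text: string): void {
  const send = fetch(webhookUrl, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text: truncateForSlack(text) }),
  })
    .then((res) => {
      // Surface 4xx/5xx responses (such as an oversized payload) in the Worker logs.
      if (!res.ok) console.error(`Slack webhook returned ${res.status}`);
    })
    .catch((err) => console.error("Slack webhook failed", err));
  ctx.waitUntil(send);
}
```

Inside a handler such as async fetch(request, env, ctx), call notifySlack(ctx, env.SLACK_WEBHOOK_URL, message) and return the Response immediately; SLACK_WEBHOOK_URL is a hypothetical secret name.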

I wrote up the full breakdown — including the KV write explosion that hit 12K writes in 2 minutes and how I traced it back to a specific agent using D1, plus the Durable Object refactor for slow tools — over on riversealab.com.
