We built a free status monitor for 77 AI APIs. Here's what 6 weeks of data taught us.

Prismix, a free status monitor for 77 AI APIs, has been running for six weeks and reveals that AI APIs fail in partial, non-traditional ways. OpenAI has the most incidents, typically resolving in 45–90 minutes, while Anthropic runs cleaner with rarer and shorter incidents. The tool also exposes a 'silent degradation' problem where services show operational on status pages but time out on probes.

Every AI developer has been here: your app is throwing 503s, users are pinging you, and you have 12 browser tabs open (OpenAI status page, Anthropic status page, the GitHub Copilot health page, three different Discord servers), trying to figure out: is this me or is it them?

That's the problem we set out to solve. Prismix (https://prismix.dev) aggregates status from 77 AI services in one place. Six weeks of running it in production taught us some things that might save you time.

AI APIs don't fail like traditional infrastructure. They fail in weird, partial ways. The official status pages are optimistic by design. They're customer-facing communications tools, not real-time engineering dashboards. There's nothing wrong with this, but it means you need a different mental model for "is this service down?"

When you watch 77 AI services simultaneously, patterns emerge fast.

OpenAI is the most-watched service (and has the most incidents to watch). The pattern is almost always the same: investigating → identified → monitoring → resolved, typically in 45–90 minutes. The investigating phase is where most developers panic: it looks bad but usually resolves without action on your end.

Anthropic runs noticeably clean given its API usage growth. Incidents are rarer and shorter. When they do happen, updates arrive faster than from most providers.

The long tail is interesting. Services like Replicate, Runway, ElevenLabs, and Suno have incident patterns that don't correlate with OpenAI at all. If you're routing across multiple providers for redundancy, these are genuinely independent failure domains, which is worth knowing.

The "silent degradation" problem is real. Multiple times we've seen a service show "operational" on its status page while our uptime probe was timing out. This is the main reason Prismix shows a latency sparkline per service: the status page is authoritative for announced incidents, but the probe catches the ones nobody announced.
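To make the "silent degradation" idea concrete, here is a minimal sketch of how a status-page reading and recent probe results could be combined into one verdict. This is not Prismix's actual code: the type names, the `classify` function, and the thresholds (5 seconds, half of the samples) are all assumptions chosen for illustration.

```typescript
// All names and thresholds below are hypothetical, for illustration only.
type PageStatus = "operational" | "degraded" | "outage";

interface ProbeSample {
  ok: boolean;       // did the probe get a successful response?
  latencyMs: number; // how long the probe took
}

type Verdict = "healthy" | "announced_incident" | "silent_degradation";

interface Thresholds {
  maxLatencyMs: number;    // slower than this counts as a failed sample
  minFailureRatio: number; // share of failed samples that triggers the flag
}

const DEFAULT_THRESHOLDS: Thresholds = { maxLatencyMs: 5000, minFailureRatio: 0.5 };

function classify(
  pageStatus: PageStatus,
  samples: ProbeSample[],
  thresholds: Thresholds = DEFAULT_THRESHOLDS
): Verdict {
  // The status page is authoritative for announced incidents.
  if (pageStatus !== "operational") return "announced_incident";
  // With no probe data there is nothing to contradict the status page.
  if (samples.length === 0) return "healthy";

  // A sample fails if it errored or was slower than the latency threshold.
  const failed = samples.filter(
    (s) => !s.ok || s.latencyMs > thresholds.maxLatencyMs
  ).length;

  return failed / samples.length >= thresholds.minFailureRatio
    ? "silent_degradation"
    : "healthy";
}

// Status page says "operational", but two of three recent probes timed out.
console.log(
  classify("operational", [
    { ok: true, latencyMs: 120 },
    { ok: false, latencyMs: 10000 },
    { ok: false, latencyMs: 10000 },
  ])
); // → silent_degradation
```

Using a ratio over several samples rather than a single probe result is a deliberate choice in this sketch: one slow request is noise, while a run of timeouts against an "operational" status page is the pattern described above.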
Prismix (https://prismix.dev) pulls from official status pages, aggregates them into a single dashboard, and adds a few things that the individual pages don't have:

Per-service latency probes: 24-hour sparklines showing actual response times, not just announced incidents. This catches the "silent degradation" cases.

Cross-service incident timeline: /incidents shows everything that happened across all 77 services in one scrollable feed. Useful for postmortems ("was anything else degraded when our error rate spiked at 3pm Tuesday?").

Embeddable status badges: put a live "OpenAI: operational" badge in your own app's status page with one line of HTML.

Public REST API: GET /api/v1/statuses returns current status for all 77 services as JSON. No auth, no rate limit for reasonable use, CORS open. Free forever.

RSS feed: /incidents.rss if you want AI incident updates in your feed reader.

It's free because it runs entirely on Cloudflare's free tier (Workers + KV). The Pro tier ($10/mo) adds email and webhook alerts for services you care about, but the core dashboard stays free.

The stack is Astro 5 SSR + Cloudflare Workers + KV. We wrote about the performance walls we hit in a previous post (https://dev.to/max 98b3db49c06de66802dcd/4-perf-walls-i-hit-shipping-an-ai-hub-on-cloudflare-workers-kv-246). The short version is that 77 parallel KV reads per request is a bad idea and a single pre-aggregated snapshot blob is much better.

One thing that surprised us: KV's free tier gives you 100,000 reads per day but only 1,000 writes. The cron job that refreshes status runs every 5 minutes, so every write is conditional: only write if the content actually changed. That dropped writes from ~8,400/day to ~600/day. Monitoring infrastructure has to be cheap to run, otherwise the incentive to keep it free disappears.

Six weeks in, Prismix tracks 77 services with a clean incident timeline and growing usage. What we don't have yet is signal on what matters to you.
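For readers curious about the conditional KV write mentioned above, here is a rough sketch of the idea, not the code Prismix runs. `KVLike`, `writeIfChanged`, and `MemoryKV` are hypothetical names; `KVLike` only mirrors the basic get/put shape of a Workers KV binding so the example can run without Cloudflare.

```typescript
// Hypothetical stand-in for the two KV operations used here. A real Workers
// KV binding offers get/put with more options than shown.
interface KVLike {
  get(key: string): Promise<string | null>;
  put(key: string, value: string): Promise<void>;
}

// Writes `next` only when it differs from the stored value.
// Returns true if a write actually happened.
async function writeIfChanged(kv: KVLike, key: string, next: string): Promise<boolean> {
  const previous = await kv.get(key); // spend one read from the larger quota
  if (previous === next) return false; // unchanged: skip the scarce write
  await kv.put(key, next);
  return true;
}

// In-memory KV that counts writes, for demonstration only.
class MemoryKV implements KVLike {
  writes = 0;
  private store = new Map<string, string>();

  async get(key: string): Promise<string | null> {
    const value = this.store.get(key);
    return value === undefined ? null : value;
  }

  async put(key: string, value: string): Promise<void> {
    this.store.set(key, value);
    this.writes += 1;
  }
}

// Simulate three cron runs: the second sees no change and skips its write.
async function demo(): Promise<number> {
  const kv = new MemoryKV();
  await writeIfChanged(kv, "snapshot", '{"openai":"operational"}');
  await writeIfChanged(kv, "snapshot", '{"openai":"operational"}');
  await writeIfChanged(kv, "snapshot", '{"openai":"degraded"}');
  return kv.writes;
}

demo().then((writes) => console.log(`writes: ${writes}`)); // → writes: 2
```

The trade here is one extra read per run in exchange for fewer writes, which fits a quota of 100,000 reads against 1,000 writes. One caveat under this approach: if the serialized content includes a field that changes on every run, such as a "last checked" timestamp, the comparison never matches and every run writes, so such fields would need to be kept out of the compared value.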
There are some things we're genuinely uncertain about. If any of that resonates, drop a comment. Honest feedback shapes what gets built next.

Live at prismix.dev (https://prismix.dev). Also at Prismix: an MCP server directory with 500+ servers and a curated AI news feed, but the status monitoring is the part we're most curious to hear about.