Let Aurora Sleep: Multi-Tenant SaaS Cost, Reconsidered with AI IaC

A developer proposes a cost-optimized architecture for multi-tenant SaaS on AWS that demotes Aurora PostgreSQL from the center of every request to a thin control plane, allowing it to sleep when idle. By shifting viewing, LLM hot paths, and ephemeral state to S3, CloudFront, DynamoDB, and non-VPC Lambda, the design reduces fixed costs for dev, internal tools, and bursty workloads while maintaining enterprise requirements like VPC isolation. The approach was developed by iterating with AI coding assistants Claude Code and Codex against a real dev deployment.

For most multi-tenant SaaS, the default still rhymes: PostgreSQL on Aurora, behind an API, in a private subnet. Stable, well-understood, the shape you'd draw on a whiteboard.

But I keep feeling a mismatch between what that shape costs and what it actually does while a product is small or bursty. A lot of the time, Vercel + Cloudflare + Supabase or Neon would give me similar real-world performance for less. And yet I get pulled back to AWS and Aurora, not for raw performance but for enterprise requirements: VPC isolation, audit posture, "it has to live in our AWS org."

What changed for me is the third option in that fork. It used to be "cheap edge stack or heavy AWS stack." Now there's a middle path: keep the AWS/Aurora skeleton the enterprise wants, but redesign it so it stops costing like it's always on, and reach that redesign by sparring with an AI against a real dev deployment, instead of needing to already be an infra specialist.

Here's the whole idea in one picture:

    BEFORE: everything flows through Aurora

    Client ── API ── VPC App ── Aurora            (always warm)
                        │
                        └── NAT ── OpenAI         (NAT always provisioned)

    AFTER: Aurora only for control/commit

    Client ── CloudFront / S3                     (viewer & artifacts, no DB)
       └─ Hot Lambda ── DynamoDB                  (hot path / cache / counters, no DB)
    Control actions ── VPC Lambda ── Aurora       (wakes on purpose)
    LLM calls ──────── non-VPC Lambda ── OpenAI   (no NAT, no DB)

One scoping note up front: this is primarily for dev, internal tools, bursty early-stage SaaS, and low-frequency enterprise environments. It is not a blanket "let your production database sleep" recommendation for steady, high-traffic workloads. It's an optimization for a specific shape of usage, not a rejection of the proven production default. Not a "Postgres is over" piece, and not a best-practice writeup either.
This is something I'm still validating: a design space I mostly reached by sparring with Claude Code and Codex while keeping one eye on cost-performance, then deploying to dev to see what actually held. These are notes on where my thinking has drifted, written down mostly so I can find out where it's wrong. If anything here is useful, take it as "you can stumble into shapes like this too," not "do it this way."

Where I've landed for now:

- The cost pain usually isn't Aurora. It's making Aurora the center of every request.
- Demote Aurora to a thin control plane (canonical state, commit, approval, audit) and let it sleep.
- Push viewing, LLM hot paths, and ephemeral state to S3 / CloudFront / DynamoDB / non-VPC Lambda.
- What makes this reachable for non-specialists is declarative schema + IaC + AI, validated agilely in dev.

The textbook shape (CloudFront → ALB/API Gateway → app → Aurora, pool model with tenant id everywhere and RLS for isolation) is fine. The problem is what accretes around it to make it reliable: always-on Aurora, RDS Proxy, NAT Gateway, readers, VPC endpoints, logs. For dev or a 1–30 person workload, the fixed cost of reliability is wildly out of proportion to the traffic. You're paying ledger-grade rent to store scratch work deleted in an hour.

The mismatch isn't "Aurora is expensive." It's "Aurora is expensive when it can never go idle."

Here's the cost shape the redesign is chasing. These are not exact dollars, but where the fixed costs go:

| BEFORE | AFTER |
|---|---|
| Aurora always warm | Aurora wakes only for control/commit |
| NAT always provisioned | No NAT for LLM egress |
| RDS Proxy holding connections | No Proxy in the sleep path |
| DB hit on health/viewer/LLM | Viewer/LLM/health paths stay DB-free |

Don't put Aurora in the path of every request. Make it a thin control plane: canonical state, approval, audit, commit. Move viewing, LLM execution, short-lived state, and delivery to S3 / CloudFront / DynamoDB / non-VPC Lambda.
There's a system of work (high-churn, disposable, fine to lose) and a system of record (money, contracts, audit, singular and strict). The mistake is paying record-grade prices for work-grade state. So split by job:

| Layer | Job |
|---|---|
| S3 | artifacts, reports, raw payloads (cheap bulk) |
| CloudFront | viewer delivery; keeps reads off Aurora |
| DynamoDB | projections, cache, locks, progress, counters (hot path) |
| Aurora | tenant, RBAC, manifest, lineage, approval, audit, rollups |
| non-VPC Lambda | outbound LLM calls; no NAT, never touches the DB |

Once viewing and the LLM hot path stop touching Aurora, it stops waking for trivia. That, more than any price knob, moves the bill.

One discipline keeps DynamoDB from quietly becoming a second source of truth: anything stored there should be either TTL-bound, recomputable from Aurora/S3, or an explicit live counter with a reconciliation path. If a value is none of those, it probably wants to be in Aurora.

Aurora Serverless v2 can scale to zero in supported configurations (min ACU = 0), and while paused, the compute charge goes to zero (storage still bills). The flag is the easy part; the discipline is not poking it awake.

    Bad:  set min=0, but login/health/LLM/viewer all read Aurora → never sleeps
    Good: only login, admin, manifest commit, rollups, audit wake it on purpose

Two silent traps: in this sleep-first setup, RDS Proxy works against you, because it keeps database connections around, which prevents pause. Any open user-initiated connection does the same. So sleep-first wants no Proxy, a small pool, and short idle timeouts. None of this is a knock on RDS Proxy in general; it's doing exactly its job, which happens to be the opposite of what you want here.

The goal isn't to make Aurora cheap by configuration; it's to make the application structurally capable of not needing Aurora most of the time. The min ACU = 0 flag only pays off once the app no longer reaches for the DB on every request.
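One way to make "only some routes wake Aurora on purpose" enforceable rather than aspirational is an explicit allowlist in the application. The sketch below is a minimal illustration, not the author's code: the route names, the `Store` enum, and both function names are hypothetical, and the injected `connect` callable stands in for whatever driver actually opens the connection.

```python
from enum import Enum


class Store(Enum):
    AURORA = "aurora"
    DYNAMODB = "dynamodb"
    S3 = "s3"
    NONE = "none"


# Hypothetical route names. Only the control actions named in the text
# (login, admin, manifest commit, rollups, audit) map to Aurora.
ROUTE_STORE = {
    "login": Store.AURORA,
    "admin": Store.AURORA,
    "manifest_commit": Store.AURORA,
    "rollups": Store.AURORA,
    "audit": Store.AURORA,
    "viewer": Store.S3,
    "hot_path": Store.DYNAMODB,
    "llm_call": Store.NONE,
    "health": Store.NONE,
}


def may_wake_aurora(route: str) -> bool:
    """Return True only for routes explicitly mapped to Aurora.

    Unknown routes default to False, so a new endpoint cannot start
    waking the database without someone editing the table above.
    """
    return ROUTE_STORE.get(route) is Store.AURORA


def open_db_connection(route: str, connect):
    """Gate every connection attempt behind the allowlist.

    `connect` is an assumed zero-argument callable that opens the real
    connection; it is injected so the sketch stays driver-agnostic.
    """
    if not may_wake_aurora(route):
        raise PermissionError(f"route {route!r} must stay DB-free")
    return connect()
```

With a gate like this, a health check or viewer handler that accidentally reaches for the database fails loudly in dev instead of silently keeping the cluster awake.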
Dropping NAT is usually pitched as savings (it bills per hour and per GB). The better reason is blast radius:

A Lambda that can touch the data cannot go out. A Lambda that can go out cannot touch the data.

One function with both powers, if compromised, can read and exfiltrate. Split it: a VPC control Lambda reaches RDS but can't egress; a non-VPC egress Lambda calls the external LLM but has no line to the DB. Cheaper and smaller blast radius at once. Caveat: an /ask flow hands tenant context to the egress function, so minimize the payload, forbid logging it, and keep a request-id audit trail.

The point isn't that non-VPC is magically safe. The egress Lambda still touches the OpenAI key and whatever prompt/context you pass it, so it still deserves a narrow IAM role, a single-purpose secret, no broad Secrets Manager or S3 read permissions, and strict logging rules. The win is narrower and more durable: it cannot both read the database and ship it somewhere.

And the boundary only holds if the network isn't the only thing keeping the egress Lambda away from the DB. Aurora has to be private, its security group must not allow the egress path, and the egress Lambda must have no Data API access or broad Secrets Manager permissions that would quietly recreate a database path through IAM. "Not in the VPC" is necessary, not sufficient.

This design means making the same placement call constantly: canonical Aurora? DynamoDB cache? S3 artifact? hot path? allowed to wake the DB? Migration-history schema fights you here: to know what is, you replay what happened (001 create… 006 revert…). Declarative schema flips it: you describe the desired current state and let the tool diff it. Both you and the AI read one artifact that says what should be true now.

It doesn't abolish prod migrations. The loop is declarative → plan → reviewed migration → apply → drift check. You think in declarative state; you apply via migration.
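To make the "describe the desired state and let the tool diff it" idea concrete, here is a toy sketch of the plan step. It is not how any particular schema tool works: the `Schema` shape (table name to column name to type) and the `plan` function are assumptions for illustration, and it ignores type changes, constraints, and indexes.

```python
# Toy schema model (an assumption for this sketch):
# table name -> {column name: column type}.
Schema = dict[str, dict[str, str]]


def plan(current: Schema, desired: Schema) -> list[str]:
    """Diff two schema descriptions into an ordered list of plan steps.

    Only handles added/dropped tables and columns; a column whose type
    changed is not detected.
    """
    steps: list[str] = []
    # Tables that exist only in the desired state get created.
    for table in sorted(desired.keys() - current.keys()):
        cols = ", ".join(f"{c} {t}" for c, t in desired[table].items())
        steps.append(f"CREATE TABLE {table} ({cols})")
    # Tables present on both sides are compared column by column.
    for table in sorted(desired.keys() & current.keys()):
        have, want = current[table], desired[table]
        for col in sorted(want.keys() - have.keys()):
            steps.append(f"ALTER TABLE {table} ADD COLUMN {col} {want[col]}")
        for col in sorted(have.keys() - want.keys()):
            steps.append(f"ALTER TABLE {table} DROP COLUMN {col}")
    # Tables that exist only in the current state get dropped.
    for table in sorted(current.keys() - desired.keys()):
        steps.append(f"DROP TABLE {table}")
    return steps


current = {"tenant": {"id": "uuid"}}
desired = {"tenant": {"id": "uuid", "plan": "text"}, "audit_log": {"id": "bigint"}}
for step in plan(current, desired):
    print(step)
```

The printed steps are what a human would review and turn into a migration. Running `plan(live, desired)` against the deployed state and getting an empty list is the drift check, and everything is read off one description of what should be true now.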
That's what fits AI-assisted work: the AI reasons over a stable description, not a changelog.

This is the claim I most want to make, and hold most loosely. A sleep-first, no-NAT, projection-driven design used to need someone who lived in AWS networking. The wall wasn't the idea; it was getting VPC routing, IAM boundaries, and serverless wiring all correct.

Most of this design, honestly, isn't something I knew up front. It's where I drifted by sparring with Claude Code and Codex against a real dev deployment: "does this Lambda actually have egress?", "what wakes Aurora here?", "where's the tenant boundary?" With IaC describing the whole thing as code, the feedback is a deployed stack you can poke, not a whiteboard argument, and that's what let me reach a shape like this at all. It costs more than a managed edge platform and burns dev cycles. But infra design became something you validate agilely rather than get right up front from expertise alone.

To be precise about the claim: AI doesn't remove the need for infrastructure expertise. It changes the iteration loop, from "know the answer upfront" to "generate, deploy, inspect, and correct faster." You still need the judgment to know what to inspect; you just acquire it by iterating instead of having to bring all of it to the whiteboard.

The honest caveat: the AI will confidently produce wiring that's "plausible but subtly wrong": a security group more open than you think, a function you believe is sandboxed but isn't. So the loop must include verification you can eyeball: probe the egress, confirm what wakes the DB, read the plan diff. And be clear-eyed that demoting Aurora is really distributed-systems-ification: you trade fixed cost for the work of managing consistency, projections, and sync. It's a tradeoff, not a free win.
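For the "probe the egress" step in that verification loop, a check can be as small as trying to open an outbound TCP connection from inside the function. This is a minimal sketch under assumptions: `can_egress`, the default target `example.com:443`, and the `handler` wrapper are all hypothetical names chosen for illustration, and the `connect` parameter exists only so the probe can be exercised without a network.

```python
import socket


def can_egress(host: str = "example.com", port: int = 443,
               timeout: float = 2.0,
               connect=socket.create_connection) -> bool:
    """Return True if an outbound TCP connection succeeds within `timeout`.

    DNS failures, refused connections, and timeouts all surface as
    OSError subclasses, so any of them counts as "no egress".
    """
    try:
        conn = connect((host, port), timeout=timeout)
    except OSError:
        return False
    conn.close()
    return True


def handler(event, context):
    """Hypothetical Lambda entry point that reports the probe result."""
    return {"egress": can_egress()}
```

Deployed to both functions, the expectation from the design is that the VPC control Lambda reports `False` and the non-VPC egress Lambda reports `True`. If the control Lambda ever reports `True`, the boundary described above is not actually in place.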
    Aurora awake
      < ~100h/mo:    min ACU = 0 + wake-ahead, keep going
      ~100–250h/mo:  compare against RDS db.t4g.micro/small
      > ~250h/mo:    always-on RDS, or Aurora min ACU = 0.5
      prod-like / heavy validation: Aurora min ACU = 0.5

Judge total cost (ACU + NAT + Proxy + endpoints + DynamoDB + S3 + Lambda + CloudWatch), not Aurora alone. For bursty dev, killing Aurora compute and NAT usually wins; for steady production, often it won't. These aren't universal thresholds. They're decision triggers for my environment, and they move with region, log volume, and DynamoDB/S3 usage.

Numbers beat vibes here. If I were deciding whether this is paying off, these are the signals I'd watch:

JWT + DynamoDB cache is fine for low-risk reads, but role/budget changes, manifest commits, approvals, and any tenant-data-injecting /ask must consult canonical Aurora. "Fast but stale permissions" is how you ship an authz incident.

/warmup doing select 1 leaks almost nothing, but it's a

If you want to see whether your workload fits, run down these before touching anything:

If most boxes are unchecked, the boring always-on RDS/Aurora is probably still the right call, and that's fine.

The shift isn't any one AWS feature. It's that the boundary between "only an infra specialist could safely build this" and "a product engineer can reach it by sparring with an AI against a dev deployment" has moved a lot.

Enterprise gravity toward AWS and Aurora is real and isn't leaving. But it no longer forces the always-on, NAT-heavy, everything-through-the-DB default. Keep Aurora as the vault (strict, singular), let it sleep, and run the workbench at the edge, with explicit events carrying workbench changes back into the control plane.

Whether this particular shape survives more validation, I don't know yet. What I'm fairly sure of is that the path to trying designs like it is now something you can iterate into, with AI and IaC, deploying and inspecting, instead of having to know it cold beforehand.
If you've run something like this in production — or watched it fall apart — I'd like to hear where it broke for you.