{"slug": "run-the-readiness-audit-before-you-flip-dns", "title": "Run the Readiness Audit Before You Flip DNS", "summary": "DiagnosticPro migrated its live product from Firebase, Firestore, GCP, and Vertex AI to a single self-hosted VPS. Before flipping DNS, an adversarial readiness audit revealed that the new database was missing nine columns on the submissions table and three on the analyses table, which would have caused payment failures after the switch. The team implemented an application-level schema migration that automatically adds missing columns on startup, preventing future drift.", "body_md": "The DiagnosticPro migration moved a live product off Firebase, Firestore, GCP, and Vertex AI onto a single self-hosted VPS. New database engine, new secrets model, new LLM client, new proxy, new deployment shape: the whole substrate replaced at once. The plan ended with the usual last step: flip `diagnosticpro.io` DNS from the old host to the new one and watch the traffic move over.\n\nThat last step is the one that cannot be undone cheaply. The moment DNS propagates, real customers hit the new stack with real credit cards. Everything before that moment is reversible: the old host is still authoritative, and rollback costs a config revert. Everything after it is a live incident.\n\nSo before the flip, the new stack was put through an *adversarial readiness audit*: not a smoke test against the happy path, but a deliberate, multi-lens attempt to find how the migration would fail in production, run against the deployed stack while the old host was still authoritative and rollback was still free. It found a failure, and it was the worst possible kind.\n\nThe live VPS database was missing **every column the payment and membership write-paths depended on**: nine columns on the `submissions` table, three on the `analyses` table.
The schema the code was written against and the schema actually deployed on the box had drifted apart.\n\nTrace what that means through a real Stripe checkout, up to the point where Stripe sends `checkout.session.completed` to the webhook.\n\nThe customer has already been charged. The system then errors out and cannot record the purchase, cannot queue the diagnostic, cannot deliver the report. This is the worst kind of failure: **the money moves, the system errors, and your side keeps no record.** Not a declined card, not a graceful \"try again\", but a completed charge followed by a server error, with nothing in your database to show the transaction happened.\n\nEvery single paid checkout after the DNS flip would have hit this. Not an edge case. The main revenue path, guaranteed to fail, on a stack that had already collected the customer's money.\n\nThe repair was not a one-time hand-run `ALTER TABLE` on the box; that fixes today's database and silently rots the next time a fresh environment comes up. The fix was to make the application **upgrade its own schema on startup**, so any database it boots against converges to the schema the code expects.\n\nThe migration reads the current shape of each table, compares it to what the code needs, and applies only the missing changes:\n\n```js\nfunction migrateSchema(db) {\n  // Columns currently present on the deployed table.\n  const cols = db.prepare(\"PRAGMA table_info(submissions)\").all();\n  const have = new Set(cols.map((c) => c.name));\n\n  // Columns the code expects, with their SQLite types.\n  const wanted = {\n    stripe_session_id: \"TEXT\",\n    payment_status:    \"TEXT\",\n    membership_tier:   \"TEXT\",\n    // ...the rest of the drifted columns\n  };\n\n  // Add only what is missing, so a second run is a no-op.\n  for (const [name, type] of Object.entries(wanted)) {\n    if (!have.has(name)) {\n      db.exec(`ALTER TABLE submissions ADD COLUMN ${name} ${type}`);\n    }\n  }\n}\n```\n\n`PRAGMA table_info` gives the current columns; the loop only issues `ALTER TABLE` for the ones that are absent.
That `have.has(name)` guard is load-bearing: SQLite's `ALTER TABLE ADD COLUMN` has no `IF NOT EXISTS` clause, so it throws if the column is already there. The idempotency lives in the JavaScript check, not the SQL. Run it twice and the second pass is a clean no-op. A brand-new database converges to the full schema; an old drifted one gets exactly the missing columns added in place. (The `${name} ${type}` interpolation is safe only because both come from a hardcoded object in the same file. Never substitute column names or types from untrusted input; SQLite won't bind them as parameters.)\n\nVerified against the live VPS database, it applied **12 migrations in place**: the exact drift the audit had predicted, closed before a single customer touched the new stack. A regression test locks the behavior in: create an old-shape database, boot the app, assert the columns exist, boot again, assert the second boot is a clean no-op.\n\nThe schema drift was the headline, but an irreversible cutover has more than one way to go wrong. The audit was multi-lens on purpose: many independent passes, several review angles, each finding adversarially re-verified rather than trusted on first sight. Several of the other findings were also invisible until someone actually paid.\n\n**A GCP secrets client on a non-GCP host.** The code carried a Google Secret Manager client inherited from its Firebase/GCP life. On a GCP host that client silently uses the platform's identity. On a self-hosted VPS there is no GCP metadata server, so the client hunts for one, and that hunt can hang process startup while it waits on a network endpoint that will never answer. You inherit this hazard for free when you self-host code that assumed it was running inside Google's platform.
The fix was an env-first secrets model: secrets materialized at deploy time from an encrypted store into the process environment, and the cloud SDKs removed entirely, with `@google-cloud/secret-manager` and `google-auth-library` deleted, pruning **63 npm packages** from the install.\n\n**A dead gateway URL on the success page.** The deployed frontend bundle was still calling a decommissioned GCP gateway host, a stale fallback URL baked into the post-payment success page. It would only fire *after* a customer paid, which is precisely why a normal click-through of the site never surfaced it. The path a paying customer takes is often the least-tested path on the whole site.\n\n**A login guaranteed to throw.** The Whop login flow referenced an undeclared `membership` variable, residue from the deleted Firestore code path. Every login attempt would hit a `ReferenceError` and break. Not intermittent, not conditional: a guaranteed crash on a core flow, left behind by the migration itself.\n\n**Webhook replay with no idempotency guard.** Stripe re-delivers webhooks; `checkout.session.completed` can arrive more than once for the same session. Without a guard, a duplicate delivery re-queues the diagnostic work and re-runs the LLM: double cost, double side effects. The fix keys on the session ID and treats any already-seen event as a no-op acknowledgement.\n\nThe same discipline (replicate the real environment instead of trusting the local one) surfaced two test bugs a warm laptop hides:\n\n**An exact-string environment gate.** The backend gates its mock-LLM path on the *exact* string `'true'`. The end-to-end test set `TEST_MOCK_LLM=1`. Environment variables are always strings in Node, so the gate evaluates `\"1\" !== 'true'` and the mock path stays off: the test would have driven a **real, keyless LLM call** and failed the full-flow run for a reason that had nothing to do with the code under test.
The lesson is unforgiving and portable: know exactly which value your gate compares against. `\"1\"`, `\"true\"`, `\"yes\"`, and `true` are four different things, and a strict-equality check honors exactly one of them.\n\n**\"Passes locally\" that CI never saw.** A `ts-jest` inline `tsconfig` omitted `lib`. TypeScript derives its default `lib` from `target`, and `target: es2022` defaults to `[\"ES2022\"]`, which has no `DOM`. Locally the project-level `tsconfig` still supplied `DOM`, so compilation passed; the inline `ts-jest` config didn't inherit it, so CI type-checked against the bare `es2022` defaults and the DOM globals (`window`, `IntersectionObserver`) failed to resolve. The green checkmark on the laptop came from config the CI transform never read. The fix pinned `lib: [\"ES2022\", \"DOM\", \"DOM.Iterable\"]` and turned on `isolatedModules`. A related trap in the same suite: the Playwright job ran `vite preview` without building `dist` first, so the preview server errored with \"directory dist does not exist.\" Both share a moral: **\"it works on my machine\" can quietly mean \"it works with config and build artifacts CI never sees.\"** CI is cold and empty by design, which is exactly why an adversarial audit replicates the deploy environment instead of reading the code and trusting it.\n\nThe audit ran **independent, adversarial review of the new stack before the irreversible step, while rollback was still free.** Many separate passes, several review lenses, every finding re-verified against the actual live artifacts rather than the code as written. The drift between \"what the code assumes\" and \"what is actually deployed\" is exactly the gap that a happy-path smoke test steps right over.\n\nFinding the bugs was half the job. The other half was making sure they stayed fixed.
Before the flip, a revenue-path test suite was installed and gated in CI, including a check on the `%PDF-` header.\n\nThe pre-remediation grade was recorded as **D+** in a `TEST_AUDIT.md`. That honest starting grade mattered: it named the gap in writing so the revenue-path P0s could be tracked and closed rather than waved through. By the flip, every revenue-path P0 was closed.\n\nFor context on the scope of the drift, here is the substrate the traffic landed on. Every layer is new relative to the Firebase/GCP original, which is why nothing about the old deployment could be assumed to still hold:\n\n- `better-sqlite3` in WAL mode\n- `gpt-4o` through an OpenAI-compatible client. `callVertexAI` became `callLLM`; the provider is now a pure environment swap, not a code change.\n- `/api` on the VPS\n\nThe result: `diagnosticpro.io` DNS was flipped off Firebase and went live on the VPS on 2026-07-01.\n\nTake this to your own cutover:\n\n- A hand-run `ALTER TABLE` fixes exactly one database until the next fresh deploy rots it.\n\nThe moment before an irreversible cutover is the **cheapest time you will ever have** to find catastrophic bugs. After the DNS flip, the schema drift is a live payment incident with charged customers and no records. Before it, it is a diff. The distance between those two costs is one adversarial audit of the new stack plus a revenue-path test suite, run while rollback is still nothing more than reverting a config.
Buy the safety while it is free.", "url": "https://wpnews.pro/news/run-the-readiness-audit-before-you-flip-dns", "canonical_source": "https://dev.to/jeremy_longshore/run-the-readiness-audit-before-you-flip-dns-2g07", "published_at": "2026-07-04 13:00:29+00:00", "updated_at": "2026-07-04 13:19:07.303124+00:00", "lang": "en", "topics": ["developer-tools", "ai-infrastructure", "mlops"], "entities": ["DiagnosticPro", "Firebase", "Firestore", "GCP", "Vertex AI", "Stripe", "SQLite"], "alternates": {"html": "https://wpnews.pro/news/run-the-readiness-audit-before-you-flip-dns", "markdown": "https://wpnews.pro/news/run-the-readiness-audit-before-you-flip-dns.md", "text": "https://wpnews.pro/news/run-the-readiness-audit-before-you-flip-dns.txt", "jsonld": "https://wpnews.pro/news/run-the-readiness-audit-before-you-flip-dns.jsonld"}}