{"slug": "we-deleted-10-real-users-with-a-test-cleanup-script-rca", "title": "We Deleted 10 Real Users with a Test-Cleanup Script — RCA", "summary": "On May 11, 2026, a test-cleanup script on HoneyChat, a Telegram-native AI companion with approximately 300 daily active users, accidentally deleted 10 real OAuth users from the production PostgreSQL 16 database. The script ran a range-based DELETE command targeting test user IDs between -91111200 and -91111100, but real users had been assigned IDs in that same narrow window due to an unintended interaction between the OAuth ID allocation system and hardcoded test IDs. Three independent design flaws (a sequence advancement mechanism, an upsert pattern that silently overwrote data, and the range-based cleanup) combined to make the deletion possible, and recovery from backup proved effectively impossible because the most recent pg_dump was about 22 hours old.", "body_md": "On **2026-05-11**, a test-cleanup script on [HoneyChat](https://honeychat.bot/) (Telegram-native AI companion, ~3 months in production, ~300 DAU, PostgreSQL 16 + Redis) ran:\n\n```sql\nDELETE FROM users WHERE id BETWEEN -91111200 AND -91111100;\n```\n\nAbout **ten real OAuth users** had IDs in that narrow window. They were now gone. Their `users` row, their `subscriptions` row, their `chat_sessions` / `web_messages`: all gone from Postgres, and recovery from backup was effectively impossible (more on that below).\n\nThis is the postmortem and the contract we now run instead. The honest version: **the destructive script went to prod on a schema I never verified end-to-end**. 
Three separate design mistakes lined up to make it possible, and *not one* of them was caught before the script ran on a Tuesday night.\n\nTwo signup paths feed the `users` table:\n\n| Population | ID source |\n|---|---|\n| Telegram users (most of base) | Positive integers: Telegram's own user IDs come in on the message envelope |\n| OAuth users (Google / Discord, web sign-in) | Negative integers from a Postgres sequence `web_user_id_seq` |\n\nOAuth IDs were negative on purpose, to keep them out of the positive Telegram-ID space and avoid collisions when a Telegram user later signed in via web. The minter in `api/web_auth.py` looked roughly like this:\n\n```python\nasync def _allocate_negative_user_id(db) -> int:\n    for _ in range(5):  # retry on rare UniqueViolation\n        new_id = -(await db.fetchval(\"SELECT nextval('web_user_id_seq')\"))\n        try:\n            await db.execute(\"INSERT INTO users (id, ...) VALUES (%s, ...)\", new_id)\n            return new_id\n        except UniqueViolation:\n            # someone else took it; bump the sequence past current MIN(id) and retry\n            await db.execute(\n                \"SELECT setval('web_user_id_seq', GREATEST(-MIN(id), currval('web_user_id_seq')))\"\n                \" FROM users\"\n            )\n    raise RuntimeError(\"could not allocate user id after 5 retries\")\n```\n\nThe `setval(GREATEST(-MIN(id), current))` step is the load-bearing piece you have to keep in mind. It says: *whatever the most-negative users.id is right now, my sequence should be at least that far advanced, so I never collide with it again*.\n\nFor QA I was creating test users by hand with **hardcoded negative IDs** like `-91111101`, `-91111102`, … via `INSERT ... ON CONFLICT (id) DO UPDATE`. 
Easy to remember, easy to clean up later by range.\n\nThat choice triggered three independent failure modes, each benign on its own, lethal in combination:\n\n1. **The test rows advanced `web_user_id_seq` to 91 111 101.** Through the `setval(GREATEST(...))` line above, the very next OAuth signup retry saw the new test row with `id = -91111101`, computed `-MIN(id) = 91111101`, and advanced its own sequence. From that moment on, all real OAuth signups were drawing IDs in the neighbourhood of `-91111111`, `-91111112`, …, right inside the window where my test users lived.\n2. **Test users were written with `INSERT ... ON CONFLICT (id) DO UPDATE`.** When a hardcoded test ID matched a real row, the upsert silently overwrote `plan`, `auth_source` and several other fields instead of erroring.\n3. **Cleanup was a range delete:** `DELETE … WHERE id BETWEEN -91111200 AND -91111100`, which removes every row in the window, test or real.\n\nNone of these three behaviors is exotic. The `setval(GREATEST(...))` retry pattern is a normal way to handle UniqueViolation on a seeded sequence. `ON CONFLICT DO UPDATE` is a normal Postgres upsert. Range-DELETE is a normal cleanup pattern. **Each was safe on its own; the interaction of all three was lethal, and I never set up a staging run that would have surfaced the interaction before it touched prod.**\n\nA 30-second sanity check on the second insert (\"did adding `id = -91111101` move `web_user_id_seq`? what does the next OAuth signup land on?\") would have shown the cascading effect immediately. Nobody, meaning me, ran it. The cleanup script ran nightly for weeks looking healthy because real OAuth signup volume hadn't yet pushed a real ID into the deletion window.\n\nRecovery from Postgres backup was effectively impossible. 
The chain:\n\n- The most recent `pg_dump` to Storj was about 22 hours old.\n- `ON CONFLICT DO UPDATE` had already mutated the affected users' `plan` and `auth_source` columns earlier the same day.\n- The `users` pages were gone too.\n\nWhat we *could* salvage came from side channels:\n\n- We `PERSIST`-ed what looked important and reconstructed recent conversations for affected users.\n- `chat_sessions` and `web_messages`\n\nNet: people kept their accounts and most of their *recent* conversations, but lost web-side scene context older than the Redis window. We comped the affected users. The cost of the incident wasn't the rows; it was the trust dent and the day-and-a-half of recovery work.\n\n- The full chain (`setval(GREATEST(...))` + `ON CONFLICT DO UPDATE` + range-DELETE) was never verified end-to-end before any of it touched production.\n- A single `INSERT` of `id = -91111101` in staging, followed by `SELECT id FROM users ORDER BY id LIMIT 5`, would have shown the sequence had jumped to the test neighbourhood. Nobody ran it.\n- An ID range is something a `BETWEEN` query can sweep. An attribute is something a `WHERE auth_source = 'test'` query cannot accidentally trip over.\n- Test users were created with `INSERT ON CONFLICT (id) DO UPDATE`.\n- A plain `INSERT` would have failed loudly and surfaced the collision.\n\nAny one of these five would have saved us; we had all five wrong.\n\n```sql\nALTER TABLE users ADD COLUMN auth_source text NOT NULL DEFAULT 'oauth';\n-- backfill: 'telegram' for positive Telegram IDs, 'oauth' for legacy negative,\n-- 'test' for known test rows that we then deleted via the new path.\nCREATE INDEX users_auth_source_idx ON users(auth_source);\n```\n\n```python\n# scripts/test_user_factory.py\nTEST_ID_RANGE = (1_000_000_001, 1_999_999_999)   # high *positive*, out of all real paths\n\ndef create_test_user() -> int:\n    user_id = _next_test_id()\n    db.execute(\n        \"INSERT INTO users (id, auth_source, ...) 
VALUES (%s, 'test', ...)\",\n        (user_id, ...),\n    )\n    return user_id\n```\n\n```python\n# scripts/test_user_cleanup.py\ndef cleanup_test_users(dry_run: bool = True) -> int:\n    rows = db.fetchall(\"SELECT id FROM users WHERE auth_source = 'test'\")\n    if dry_run:\n        print(f\"Would delete {len(rows)} test users\")\n        return len(rows)\n    db.execute(\"DELETE FROM users WHERE auth_source = 'test'\")\n    return len(rows)\n```\n\nThe script defaults to `dry_run=True`. The CLI flag to actually run it is explicit and shows the count first.\n\nWe've also banned, in our engineering doc and in code review: any `DELETE … WHERE id BETWEEN …` on the `users` table, for any reason; any `INSERT … ON CONFLICT (id) DO UPDATE` on `users.id`.\n\nWe rebuilt the backup story around explicit recovery point objectives. Off-site is **Storj** (~7 GB total, ~$0.03/month; cost is not the constraint).\n\n| Backup tier | Cadence | Destination | RPO |\n|---|---|---|---|\n| Postgres `pg_dump` (logical) | Hourly | Local disk | ≤ 1 h |\n| Postgres `pg_dump` (logical) | Daily | Storj S3 | ≤ 24 h |\n| Off-site cold copy | Weekly | Storj S3 | ≤ 7 d |\n| Redis snapshot (RDB) | Every 6 h | Local + Storj | ≤ 6 h |\n\nWAL archiving to S3-compatible storage is still pending; that's the next item. With it, RPO drops to seconds. Without it, hourly logical dumps are the floor.\n\nA backup you've never restored from is a hope, not a backup. We restore from yesterday's hourly dump into a scratch container monthly. The first time we tried, the restore script had bit-rotted and didn't compile.\n\n\"Run the `DELETE`, in staging, against real data, and read the results\" is thirty seconds of work. It is also the only thing that would have caught this.\n\nWe've run the new contract for two weeks now. No range-DELETE incidents. The new `auth_source = 'test'` filter is boring and explicit and impossible to fat-finger. 
Boring is the goal.\n\nThis postmortem is from production work at **HoneyChat**, a Telegram-native AI companion.\n\n— *HoneyChat Engineering*", "url": "https://wpnews.pro/news/we-deleted-10-real-users-with-a-test-cleanup-script-rca", "canonical_source": "https://dev.to/sm1ck/we-deleted-10-real-users-with-a-test-cleanup-script-rca-1lb1", "published_at": "2026-05-28 10:39:49+00:00", "updated_at": "2026-05-28 10:52:57.318162+00:00", "lang": "en", "topics": ["ai-products", "ai-infrastructure", "mlops"], "entities": ["HoneyChat", "PostgreSQL", "Redis", "Telegram", "Google", "Discord"], "alternates": {"html": "https://wpnews.pro/news/we-deleted-10-real-users-with-a-test-cleanup-script-rca", "markdown": "https://wpnews.pro/news/we-deleted-10-real-users-with-a-test-cleanup-script-rca.md", "text": "https://wpnews.pro/news/we-deleted-10-real-users-with-a-test-cleanup-script-rca.txt", "jsonld": "https://wpnews.pro/news/we-deleted-10-real-users-with-a-test-cleanup-script-rca.jsonld"}}