{"slug": "how-we-made-parallel-pytest-safe-for-multi-tenant-agent-swarms", "title": "How we made parallel pytest safe for multi-tenant agent swarms", "summary": "Honeycomb's engineering team redesigned their parallel test execution system to prevent database collisions when multiple pytest invocations run concurrently against the same PostgreSQL host. The fix introduces per-invocation namespace tokens that make schema names unique across independent test runs, solving lock contention and silent data corruption that emerged as AI agents and multiple developers triggered simultaneous test suites.", "body_md": "Parallel test runs feel solved right up until the day they stop being solved.\n\nFor a while our backend test environment looked healthy. We had `pytest-xdist`, modular fixtures, Factory Boy, a structured `conftest.py`, separate platform and tenant databases, and a `~/run_tests` entrypoint that auto-loaded the test environment. One engineer running a focused slice of the suite was fast and stable. A few workers in parallel was routine.\n\nThen how we worked changed.\n\nWe started running more tests concurrently across more contexts: multiple tmux panes, background validation passes, longer fixture-heavy suites, and eventually multiple AI agents firing test invocations seconds apart against the same database host. 
That is when a structural problem we had been getting away with surfaced: even with xdist in place, two independent pytest invocations could collide on the same PostgreSQL schemas and produce `LockNotAvailable`, statement timeouts, or worse, silent fall-through writes into shared `public`.\n\nThe fix was not “raise the timeout again.” It was treating the test environment as shared infrastructure: per-invocation namespacing, fail-closed cleanup, deterministic connection labeling, and a few subtle bug fixes that mattered more than the headline change.\n\nThis post is about what broke, why xdist alone does not solve it, and the specific mechanisms we ended up needing.\n\n## Why xdist isn’t enough\n\n`pytest-xdist` gives you worker parallelism inside one invocation. That is not the same as making multiple independent invocations safe on the same database host.\n\nOur original isolation was schema-per-worker:\n\n```\ntest_gw0\ntest_gw1\ntest_gw2\n```\n\nThat works fine until two entirely separate `~/run_tests` calls both spin up `gw0`. Both runs then try to drop, recreate, migrate, and seed the same physical schema. The result is structural lock contention that looks like flakiness until you see the pattern.\n\nA multi-tenant SaaS test setup typically touches at least two databases:\n\n- a **platform database** for users, orgs, roles, and global control-plane state\n- one or more **tenant databases** for the operational data each customer actually works with\n\nOnce tests add multi-tenant isolation cases or extra tenant DBs, “isolated by worker name” becomes “isolated only when other invocations stay out of the way.” For a small human team that is an occasional flake. For a swarm of AI agents kicking off concurrent runs and inheriting shell state from each other, it is the normal operating envelope.\n\nThe hard lesson was simple: xdist gives you worker parallelism inside one invocation. 
It does not automatically make multiple independent invocations safe on the same database host.\n\n## What we changed\n\nEvery pytest invocation now gets a namespace token, and every worker composes its schema from `namespace + worker_id`:\n\n```\ntest_{namespace}_{worker_id}\n```\n\nA namespace token from `~/run_tests` looks like this:\n\n```\np18234t1745178234r3af93d71\n```\n\nThat is `p<PID>t<EPOCH>r<HEX>` with 32 bits of `/dev/urandom` entropy in the suffix. Two concurrent invocations therefore look like:\n\n```\ntest_p18234t1745178234r3af93d71_gw0\ntest_p18234t1745178234r3af93d71_gw1\ntest_p20491t1745178241r9c12d4fe_gw0\ntest_p20491t1745178241r9c12d4fe_gw1\n```\n\nThe cross-invocation collision is now impossible by construction. The interesting work was in the parts of the system that had to learn this rule, and in two subtle bugs we hit along the way.\n\n### The runner generates and validates the namespace\n\n`~/run_tests` generates a fresh token per invocation by default, even if the calling shell already has `HONEY_TEST_SCHEMA_NAMESPACE` exported from a previous run. That stale-export defense matters: a debug session that left the variable set could otherwise cause the next “normal” run to silently reuse that namespace and collide again.\n\nA scoped debug override (`HONEY_TEST_SCHEMA_NAMESPACE_FORCE=mydebug`) is allowed but tightly restricted: lowercase alphanumeric, no underscores, no hyphens, max 32 characters. The runner validates the override in shell *before* the value reaches any report file path or environment export.\n\n### conftest.py covers direct pytest\n\n`bin/run_tests` is not the only way pytest gets invoked. 
To keep the design coherent under direct `pytest` calls, `tests/conftest.py`:\n\n- preserves any namespace already exported by the runner\n- generates one if none exists\n- propagates the controller’s namespace into every xdist worker subprocess via the `pytest_configure_node` hook (`node.workerinput[\"schema_namespace\"]`)\n\nWithout that propagation, each xdist worker would generate its own namespace and you would be back to per-worker collisions inside a single invocation.\n\n### Subtle bug 1: silent search_path fallthrough\n\nThis is the bug that mattered most and that no test would have caught.\n\nPostgreSQL’s `SET LOCAL search_path TO foo, public` does not error if `foo` does not exist. It silently falls through to `public`. So if you migrate engine setup to namespaced schemas but a session-scoped seeder still composes the legacy name:\n\n```\n# Engines now use test_<namespace>_gw0 (correct)\n\n# Seeder, still using the old pattern:\nworker_id = os.getenv(\"PYTEST_XDIST_WORKER\", \"gw0\")\nschema = f\"test_{worker_id}\"               # \"test_gw0\": no longer exists\nseed_sess.execute(text(f\"SET LOCAL search_path TO {schema}, public\"))\nseed_sess.add(OperationType(...))           # ends up in `public`\nseed_sess.commit()                          # cross-invocation contamination, no error\n```\n\nThe seeded rows land in shared `public`. Concurrent invocations now see each other’s data. The original `LockNotAvailable` symptom is gone, but a much subtler form of the same problem is silently active.\n\nEvery fixture, helper, and seeder that touches `search_path` now composes through one shared `worker_schema()` helper, so the engine’s schema and the seeder’s schema cannot diverge. 
We codified that as a structural invariant in the design doc.\n\n### Subtle bug 2: a regex that quietly halved DDL parallelism\n\nSchema setup uses a 2-slot DDL semaphore to bound how many concurrent `CREATE TABLE` runs hit Postgres at once. The slot was keyed off the worker number:\n\n```\n# Original\nslot_match = re.search(r\"(\\d+)\", schema)\nslot = int(slot_match.group(1)) % 2 if slot_match else 0\n```\n\nThat worked when the schema was `test_gw0` or `test_gw1`. After namespacing, the schema is `test_p12345t1745178234r3af93d71_gw0`, and `re.search(r\"(\\d+)\", ...)` matches the *first* digit run: the PID. Every worker of one invocation gets `12345 % 2 = 1`. The 2-lane semaphore degenerates to a single lane, with no error and no logged warning.\n\nFix: anchor on the worker suffix.\n\n```python\nimport re\n\ndef ddl_slot(schema: str, slot_count: int = 2) -> int:\n    # Anchor on the _gwN worker suffix, not the first digit run (the PID).\n    m = re.search(r\"_gw(\\d+)(?:_|$)\", schema)\n    return int(m.group(1)) % slot_count if m else 0\n```\n\nThe 2-slot count itself is empirical. Each `metadata.create_all()` for our 220+ tenant tables acquires roughly 4,000 catalog locks. PostgreSQL’s shared lock table holds about 19,200 slots at the default `max_locks_per_transaction = 64`; the table is sized as `max_locks_per_transaction × (max_connections + max_prepared_transactions)`, so that figure also depends on the connection limits. Two concurrent DDL transactions peak near 8,000 locks, with comfortable margin. Four pushed the peak past 16,000 and produced `out of shared memory` failures under `-n 22`.\n\n### Active-namespace detection unions across every test DB\n\nStale-schema cleanup is best-effort and runs before pytest. It needs to drop crashed-run schemas without ever dropping a schema a live invocation is still using.\n\nThe first instinct (check `pg_stat_activity` in the DB you are about to sweep) is wrong. A live invocation may currently hold connections in only one of the four configured test DBs. 
If you check per-DB, you can drop a live namespace’s schemas from the other three, recreating the original race in a narrower window.\n\nThe sweep collects the live set across every reachable DB *before* sweeping any DB:\n\n```\nactive_namespaces, errors = collect_active_namespaces(all_urls)\nactive_detection_complete = not errors\n\nfor url in all_urls:\n    sweep_one(\n        url,\n        active_namespaces=active_namespaces,\n        active_detection_complete=active_detection_complete,\n    )\n```\n\nIf any DB is unreachable during collection (auth failure, network timeout, host blackhole), `active_detection_complete` flips to `False` and the sweep drops nothing. Stale candidates are recorded with `reason: active_detection_incomplete` so the fail-closed decision is visible in the JSON output. We bound connection establishment with a 5-second `connect_timeout` so a blackholing host cannot stall the pre-pytest sweep before this safety logic runs.\n\n### One drop per transaction\n\nThe first version of the sweep dropped every stale schema inside one transaction. That worked on small backlogs. On a real backlog of 56 stale schemas (each containing 220+ tenant tables), the cumulative `DROP SCHEMA ... CASCADE` catalog locks overflowed PostgreSQL’s shared lock table:\n\n```\npsycopg2.errors.OutOfMemory: out of shared memory\n```\n\nWorse, the failure aborted the transaction, leaving the rest of the backlog stranded for the next sweep to hit again at higher cost.\n\nThe fix: discovery in one read-only `engine.connect()` block, then each drop in its own short `engine.begin()` transaction with `SET LOCAL lock_timeout = '5s'` and `SET LOCAL statement_timeout = '30s'`. Failed drops land in a per-DB `failed` list rather than aborting the rest of that database’s sweep. 
The cleanup pass cleared the 56-schema backlog cleanly: 56 dropped, 0 failed, no OOM.\n\n### Connection labeling makes lock incidents traceable\n\nEvery test connection is labeled:\n\n```\napplication_name = pytest_{worker_id}_{namespace}[_{suffix}]\n```\n\nExamples:\n\n```\npytest_gw0_p18234t1745178234r3af93d71\npytest_gw0_p18234t1745178234r3af93d71_concurrency\n```\n\nThe underscore ban for ad-hoc namespaces is structural here. `pytest_gw0_my_debug_concurrency` could parse as either `(my_debug, concurrency)` or `(my_debug_concurrency, no suffix)`, and the sweep’s active-namespace detector cannot pick. Banning underscores in ad-hoc tokens makes the suffix boundary unambiguous.\n\n`bin/diagnose_test_locks.py` queries `pg_blocking_pids()` across every configured test DB and prints both sides:\n\n```\nSELECT\n  blocked.pid              AS blocked_pid,\n  blocked.application_name AS blocked_app,\n  blocked.wait_event,\n  blocked.query            AS blocked_query,\n  blocker.pid              AS blocker_pid,\n  blocker.application_name AS blocker_app,\n  blocker.query            AS blocker_query\nFROM pg_stat_activity blocked\nJOIN LATERAL unnest(pg_blocking_pids(blocked.pid)) AS blockers(pid) ON true\nJOIN pg_stat_activity blocker ON blocker.pid = blockers.pid;\n```\n\nBecause every connection carries its worker and namespace in `application_name`, a `LockNotAvailable` is now traceable to a specific invocation, not inferred from a stack trace.\n\n### Test reports are namespaced\n\nA small but load-bearing detail. `~/run_tests` writes JSON reports to `/tmp/pytest_reports/<timestamp>_<namespace>.json` and collected-nodeid scratch files to `.collected_nodeids_<namespace>.txt`. Two concurrent invocations starting in the same second used to overwrite each other’s evidence. 
The reader (`~/test_report`) globs `*.json`, so namespacing the filenames does not break anything downstream.\n\n## Old design vs current design\n\n| Concern | Earlier shape | Current shape |\n|---|---|---|\n| Worker schema naming | `test_gw0` | `test_{namespace}_{worker}` |\n| Cross-invocation safety | best effort | structural isolation |\n| Report artifacts | shared timestamp paths | namespace-qualified paths |\n| Active test detection | per-DB view, risk of partial picture | union across all configured DBs |\n| Cleanup posture | could be destructive when uncertain | fail-closed when detection is incomplete |\n| Schema drop strategy | many drops in one transaction | one schema per transaction |\n| DDL slot derivation | first digit run in schema name | anchored on `_gwN` suffix |\n| `search_path` composition | hand-built per call site | single `worker_schema()` helper |\n| Lock triage | infer from failures | `pg_blocking_pids()` plus `application_name` labels |\n\n## What we deliberately did not do\n\nA few options we considered and rejected, in case they are useful for someone making the same call:\n\n- **Per-invocation database** (`CREATE DATABASE … DROP DATABASE`): needs `CREATEDB`, re-runs Alembic per invocation, and adds 4 to 8 seconds × N databases to every startup. Too heavy.\n- **Bigger `lock_timeout` everywhere**: more band-aid. The failure mode becomes 120-second waits instead of 15-second errors. Does not fix the race.\n- **Postgres advisory locks** instead of file-system fcntl: does not replace namespace isolation. Reasonable follow-up if multi-host CI runners ever show up; not worth doing inside a local-WSL workflow.\n- **Queueing invocations** as a discipline rule: fragile, and already being violated in practice when the failures started.\n\n## Why this matters more for AI-assisted teams\n\nA human team hits this class of collision occasionally. 
An agent swarm hits it routinely:\n\n- multiple agents may launch tests within seconds of each other\n- they often run similar target sets\n- they may inherit shell state you forgot about\n- they are more likely to stress stale cleanup and report collection\n\nWhat looked like a flaky edge case becomes the normal envelope of operation. The test environment has to behave like shared infrastructure, with namespacing, cleanup rules, diagnostics, and safe defaults that hold without operator coordination.\n\n## Closing\n\nThe same instincts behind this work show up in the product: isolate state explicitly, fail closed when certainty disappears, label work so you can trace it later, prefer structural guarantees over timeout band-aids. That is the posture behind the [EquatorOps platform](/platform) and the [engine architecture](/platform/engines).\n\nIf you are building operational software with real concurrency, the takeaway is not “use namespaces.” It is that worker-level parallelism stops being enough once independent runs can collide on shared state, and once AI agents are part of your engineering loop, those collisions stop being rare.\n\nIf you want to talk about the developer surface behind that architecture, [/developers](/developers).", "url": "https://wpnews.pro/news/how-we-made-parallel-pytest-safe-for-multi-tenant-agent-swarms", "canonical_source": "https://equatorops.com/resources/blog/parallel-pytest-agent-swarms", "published_at": "2026-06-27 17:47:41+00:00", "updated_at": "2026-06-27 18:04:36.147252+00:00", "lang": "en", "topics": ["developer-tools", "ai-agents", "ai-infrastructure"], "entities": ["Honeycomb", "pytest-xdist", "PostgreSQL", "Factory Boy", "pytest"], "alternates": {"html": "https://wpnews.pro/news/how-we-made-parallel-pytest-safe-for-multi-tenant-agent-swarms", "markdown": "https://wpnews.pro/news/how-we-made-parallel-pytest-safe-for-multi-tenant-agent-swarms.md", "text": 
"https://wpnews.pro/news/how-we-made-parallel-pytest-safe-for-multi-tenant-agent-swarms.txt", "jsonld": "https://wpnews.pro/news/how-we-made-parallel-pytest-safe-for-multi-tenant-agent-swarms.jsonld"}}