How we made parallel pytest safe for multi-tenant agent swarms

Honeycomb's engineering team redesigned their parallel test execution system to prevent database collisions when multiple pytest invocations run concurrently against the same PostgreSQL host. The fix introduces per-invocation namespace tokens that make schema names unique across independent test runs, solving lock contention and silent data corruption that emerged as AI agents and multiple developers triggered simultaneous test suites.

Parallel test runs feel solved right up until the day they stop being solved.

For a while our backend test environment looked healthy. We had `pytest-xdist`, modular fixtures, Factory Boy, a structured `conftest.py`, separate platform and tenant databases, and a `~/run_tests` entrypoint that auto-loaded the test environment. One engineer running a focused slice of the suite was fast and stable. A few workers in parallel was routine.

Then how we worked changed. We started running more tests concurrently across more contexts: multiple tmux panes, background validation passes, longer fixture-heavy suites, and eventually multiple AI agents firing test invocations seconds apart against the same database host.

That is when a structural problem we had been getting away with surfaced: even with xdist in place, two independent pytest invocations could collide on the same PostgreSQL schemas and produce `LockNotAvailable`, statement timeouts, or worse, silent fall-through writes into shared `public`.

The fix was not “raise the timeout again.” It was treating the test environment as shared infrastructure: per-invocation namespacing, fail-closed cleanup, deterministic connection labeling, and a few subtle bug fixes that mattered more than the headline change.

This post is about what broke, why xdist alone does not solve it, and the specific mechanisms we ended up needing.

## Why xdist isn’t enough

`pytest-xdist` gives you worker parallelism inside one invocation. That is not the same as making multiple independent invocations safe on the same database host.

Our original isolation was schema-per-worker:

```
test_gw0
test_gw1
test_gw2
```

That works fine until two entirely separate `~/run_tests` calls both spin up `gw0`. Both runs then try to drop, recreate, migrate, and seed the same physical schema. The result is structural lock contention that looks like flakiness until you see the pattern.

A multi-tenant SaaS test setup typically touches at least two databases:

- a platform database for users, orgs, roles, and global control-plane state
- one or more tenant databases for the operational data each customer actually works with

Once tests add multi-tenant isolation cases or extra tenant DBs, “isolated by worker name” becomes “isolated only when other invocations stay out of the way.” For a small human team that is an occasional flake. For a swarm of AI agents kicking off concurrent runs and inheriting shell state from each other, it is the normal operating envelope.

The hard lesson was simple: xdist gives you worker parallelism inside one invocation. It does not automatically make multiple independent invocations safe on the same database host.

## What we changed

Every pytest invocation now gets a namespace token, and every worker composes its schema from namespace + worker id:

```
test_{namespace}_{worker_id}
```

A namespace token from `~/run_tests` looks like this:

```
p18234t1745178234r3af93d71
```

That is `p<PID>t<EPOCH>r<HEX>`, with 32 bits of `/dev/urandom` entropy in the suffix. Two concurrent invocations therefore look like:

```
test_p18234t1745178234r3af93d71_gw0
test_p18234t1745178234r3af93d71_gw1
test_p20491t1745178241r9c12d4fe_gw0
test_p20491t1745178241r9c12d4fe_gw1
```

The cross-invocation collision is now impossible by construction. The interesting work was in the parts of the system that had to learn this rule, and in two subtle bugs we hit along the way.

## The runner generates and validates the namespace

`~/run_tests` generates a fresh token per invocation by default, even if the calling shell already has `HONEY_TEST_SCHEMA_NAMESPACE` exported from a previous run. That stale-export defense matters: a debug session that left the variable set could otherwise cause the next “normal” run to silently re-collide with itself.

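
As a rough illustration of the token shape and schema composition described above, here is a minimal Python sketch. The runner itself generates the token in shell, so this is not the actual implementation; `new_namespace`, `schema_for`, and `TOKEN_RE` are hypothetical names introduced only for this example.

```python
import os
import re
import time

# Shape described in the post: p<PID>t<EPOCH>r<HEX>, with 8 hex chars (32 bits).
TOKEN_RE = re.compile(r"^p\d+t\d+r[0-9a-f]{8}$")


def new_namespace() -> str:
    """Build a fresh token from the PID, the epoch second, and 32 random bits."""
    return f"p{os.getpid()}t{int(time.time())}r{os.urandom(4).hex()}"


def schema_for(namespace: str, worker_id: str) -> str:
    """Compose the per-worker schema name: test_{namespace}_{worker_id}."""
    return f"test_{namespace}_{worker_id}"


# Two tokens minted here share a PID (same process), so only the random
# suffix separates them; separate runner processes also differ in PID.
first, second = new_namespace(), new_namespace()
schemas_a = {schema_for(first, f"gw{i}") for i in range(2)}
schemas_b = {schema_for(second, f"gw{i}") for i in range(2)}
print(sorted(schemas_a))
print(sorted(schemas_b))
```

Both invocations still have a `gw0` and a `gw1`, but the schema names no longer overlap, which is the property the rest of the design relies on.
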
A scoped debug override, `HONEY_TEST_SCHEMA_NAMESPACE_FORCE=mydebug`, is allowed but tightly restricted: lowercase alphanumeric, no underscores, no hyphens, max 32 characters. The runner validates the override in shell before the value reaches any report file path or environment export.

## `conftest.py` covers direct pytest

`bin/run_tests` is not the only way pytest gets invoked. To keep the design coherent under direct `pytest` calls, `tests/conftest.py`:

- preserves any namespace already exported by the runner
- generates one if none exists
- propagates the controller’s namespace into every xdist worker subprocess via the `pytest_configure_node` hook (`node.workerinput["schema_namespace"]`)

Without that propagation, each xdist worker would generate its own namespace and you would be back to per-worker collisions inside a single invocation.

## Subtle bug 1: silent `search_path` fallthrough

This is the bug that mattered most and that no test would have caught.

PostgreSQL’s `SET LOCAL search_path TO foo, public` does not error if `foo` does not exist. It silently falls through to `public`. So if you migrate engine setup to namespaced schemas but a session-scoped seeder still composes the legacy name:

```python
# Engines now use test_<namespace>_gw0 (correct)

# Seeder, still using the old pattern:
worker_id = os.getenv("PYTEST_XDIST_WORKER", "gw0")
schema = f"test_{worker_id}"  # "test_gw0": does not exist any more
seed_sess.execute(text(f"SET LOCAL search_path TO {schema}, public"))
seed_sess.add(OperationType(...))  # ends up in public
seed_sess.commit()  # cross-invocation contamination, no error
```

The seeded rows land in shared `public`. Concurrent invocations now see each other’s data. The original `LockNotAvailable` symptom is gone, but a much subtler form of the same problem is silently active.

Every fixture, helper, and seeder that touches `search_path` now composes through one shared `worker_schema()` helper, so the engine’s schema and the seeder’s schema cannot diverge.

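
One way such a single composition point could look is sketched below. The post does not show the real helper, so the signature, the fail-loud behavior on a missing namespace, and the names `NAMESPACE_ENV`, `_SAFE_SCHEMA`, and `search_path_sql` are all assumptions made for illustration.

```python
import os
import re
from typing import Mapping, Optional

# Env var named in the post; how the real helper reads it is assumed.
NAMESPACE_ENV = "HONEY_TEST_SCHEMA_NAMESPACE"

# Assumed shape check: test_<lowercase alphanumeric namespace>_gw<N>.
_SAFE_SCHEMA = re.compile(r"^test_[a-z0-9]+_gw\d+$")


def worker_schema(environ: Optional[Mapping[str, str]] = None) -> str:
    """Return test_{namespace}_{worker_id}, refusing to fall back to a legacy name."""
    env = os.environ if environ is None else environ
    namespace = env.get(NAMESPACE_ENV)
    if not namespace:
        # Failing here is the point: a silent default would recreate the
        # legacy "test_gw0" name and fall through to public.
        raise RuntimeError(f"{NAMESPACE_ENV} is not set")
    worker_id = env.get("PYTEST_XDIST_WORKER", "gw0")
    schema = f"test_{namespace}_{worker_id}"
    if not _SAFE_SCHEMA.match(schema):
        raise ValueError(f"unexpected schema name: {schema!r}")
    return schema


def search_path_sql(schema: str) -> str:
    """Build the SET LOCAL statement from an already validated schema name."""
    return f'SET LOCAL search_path TO "{schema}", public'


env = {NAMESPACE_ENV: "p18234t1745178234r3af93d71", "PYTEST_XDIST_WORKER": "gw3"}
print(search_path_sql(worker_schema(env)))
```

Because both the engine setup and the seeder would call the same function, there is no second place where the legacy pattern can survive. A stricter variant could also check that the schema exists before setting `search_path`, since PostgreSQL will not complain if it does not.
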
We codified the single-helper rule as a structural invariant in the design doc.

## Subtle bug 2: a regex that quietly halved DDL parallelism

Schema setup uses a 2-slot DDL semaphore to bound how many concurrent `CREATE TABLE` runs hit Postgres at once. The slot was keyed off the worker number:

```python
# Original: takes the first run of digits anywhere in the schema name
slot_match = re.search(r"(\d+)", schema)
slot = int(slot_match.group(1)) % 2 if slot_match else 0
```

That worked when the schema was `test_gw0` or `test_gw1`. After namespacing, the schema is `test_p12345t1745178234r3af93d71_gw0`, and `re.search(r"(\d+)", ...)` matches the first digit run: the PID. Every worker of one invocation gets `12345 % 2 = 1`. The 2-lane semaphore degenerates to a single lane, with no error and no logged warning.

Fix: anchor on the worker suffix.

```python
import re


def ddl_slot(schema: str, slot_count: int = 2) -> int:
    # Match only the gwN worker suffix, not digits inside the namespace token.
    m = re.search(r"_gw(\d+)(?:_|$)", schema)
    return int(m.group(1)) % slot_count if m else 0
```

The 2-slot count itself is empirical. Each `metadata.create_all` for our 220+ tenant tables acquires roughly 4,000 catalog locks. PostgreSQL’s shared lock table holds about 19,200 slots at the default `max_locks_per_transaction = 64`. Two concurrent DDL transactions peak near 8,000 locks, with comfortable margin. Four pushed peak past 16,000 and produced `out of shared memory` failures under `-n 22`.

## Active-namespace detection unions across every test DB

Stale-schema cleanup is best-effort and runs before pytest. It needs to drop crashed-run schemas without ever dropping a schema a live invocation is still using.

The first instinct (check `pg_stat_activity` in the DB you are about to sweep) is wrong. A live invocation may currently hold connections in only one of the four configured test DBs. If you check per-DB, you can drop a live namespace’s schemas from the other three, recreating the original race in a narrower window.

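
The per-DB failure mode is easier to see on a toy model. The sketch below uses plain dictionaries in place of real databases and `pg_stat_activity` queries; the DB names, the namespaces, and the functions `namespace_of`, `stale_per_db`, and `stale_union` are made up for illustration.

```python
from typing import Dict, Set


def namespace_of(schema: str) -> str:
    """Extract the namespace from test_{namespace}_{worker_id}."""
    return schema.split("_")[1]


def stale_per_db(schemas: Dict[str, Set[str]], live: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Flawed: each DB is judged only by the connections visible in that DB."""
    return {
        db: {s for s in names if namespace_of(s) not in live.get(db, set())}
        for db, names in schemas.items()
    }


def stale_union(schemas: Dict[str, Set[str]], live: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Safer: a namespace seen in any DB is treated as live in every DB."""
    live_everywhere = set().union(*live.values())
    return {
        db: {s for s in names if namespace_of(s) not in live_everywhere}
        for db, names in schemas.items()
    }


# One live run (p1...) and one crashed run (p2...), both with schemas in two DBs.
schemas = {
    "platform": {"test_p1t1raaaa0001_gw0", "test_p2t2rbbbb0002_gw0"},
    "tenant_1": {"test_p1t1raaaa0001_gw0", "test_p2t2rbbbb0002_gw0"},
}
# The live run currently holds a connection only in the platform DB.
live = {"platform": {"p1t1raaaa0001"}, "tenant_1": set()}

print(stale_per_db(schemas, live)["tenant_1"])  # includes the live run's schema
print(stale_union(schemas, live)["tenant_1"])   # only the crashed run's schema
```

With the per-DB view, the live run's `tenant_1` schema looks abandoned simply because the run is not connected there at that moment.
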
The sweep collects the live set across every reachable DB before sweeping any DB:

```python
active_namespaces, errors = collect_active_namespaces(all_urls)
active_detection_complete = not errors

for url in all_urls:
    sweep_one(
        url,
        active_namespaces=active_namespaces,
        active_detection_complete=active_detection_complete,
    )
```

If any DB is unreachable during collection (auth failure, network timeout, host blackhole), `active_detection_complete` flips to `False` and the sweep drops nothing. Stale candidates are recorded with `reason: active_detection_incomplete` so the fail-closed decision is visible in the JSON output. We bound connection establishment with a 5-second `connect_timeout` so a blackholing host cannot stall the pre-pytest sweep before this safety logic runs.

## One drop per transaction

The first version of the sweep dropped every stale schema inside one transaction. That worked on small backlogs. On a real backlog of 56 stale schemas (each containing 220+ tenant tables), the cumulative `DROP SCHEMA ... CASCADE` catalog locks blew past PostgreSQL’s shared lock table:

```
psycopg2.errors.OutOfMemory: out of shared memory
```

Worse, the failure aborted the transaction, leaving the rest of the backlog stranded for the next sweep to hit again at higher cost.

The fix: discovery in one read-only `engine.connect()` block, then each drop in its own short `engine.begin()` transaction with `SET LOCAL lock_timeout = '5s'` and `SET LOCAL statement_timeout = '30s'`. Failed drops land in a per-DB `failed` list rather than aborting the rest of the database. The cleanup pass cleared the 56-schema backlog cleanly: 56 dropped, 0 failed, no OOM.

## Connection labeling makes lock incidents traceable

Every test connection is labeled, with an optional trailing suffix:

```
application_name = pytest_{worker_id}_{namespace}[_{suffix}]
```

Examples:

```
pytest_gw0_p18234t1745178234r3af93d71
pytest_gw0_p18234t1745178234r3af93d71_concurrency
```

The underscore ban for ad-hoc namespaces is structural here.
`pytest_gw0_my_debug_concurrency` could parse as either (`my_debug`, `concurrency`) or (`my_debug_concurrency`, no suffix), and the sweep’s active-namespace detector cannot pick. Banning underscores in ad-hoc tokens makes the suffix boundary unambiguous.

`bin/diagnose_test_locks.py` queries `pg_blocking_pids()` across every configured test DB and prints both sides:

```sql
SELECT blocked.pid AS blocked_pid,
       blocked.application_name AS blocked_app,
       blocked.wait_event,
       blocked.query AS blocked_query,
       blocker.pid AS blocker_pid,
       blocker.application_name AS blocker_app,
       blocker.query AS blocker_query
FROM pg_stat_activity blocked
JOIN LATERAL unnest(pg_blocking_pids(blocked.pid)) AS blockers(pid) ON true
JOIN pg_stat_activity blocker ON blocker.pid = blockers.pid;
```

Because every connection carries its worker and namespace in `application_name`, a `LockNotAvailable` is now traceable to a specific invocation, not inferred from a stack trace.

## Test reports are namespaced

A small but load-bearing detail. `~/run_tests` writes JSON reports to `/tmp/pytest_reports/<timestamp>_<namespace>.json` and collected-nodeid scratch files to `.collected_nodeids_<namespace>.txt`. Two concurrent invocations starting in the same second used to overwrite each other’s evidence. The reader `~/test_report` globs `*.json`, so namespacing the filenames does not break anything downstream.

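
The label format and the underscore ban from the connection-labeling section can be made concrete with a small parser. This is a sketch under assumptions: the real sweep's parsing code is not shown in the post, and `LABEL_RE`, `Label`, `parse_label`, and `is_valid_override` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional

# pytest_{worker_id}_{namespace}[_{suffix}], with an underscore-free namespace.
LABEL_RE = re.compile(r"^pytest_(gw\d+)_([a-z0-9]+)(?:_(.+))?$")


class Label(NamedTuple):
    worker_id: str
    namespace: str
    suffix: Optional[str]


def parse_label(application_name: str) -> Optional[Label]:
    """Split an application_name into worker, namespace, and optional suffix."""
    m = LABEL_RE.match(application_name)
    return Label(*m.groups()) if m else None


def is_valid_override(value: str) -> bool:
    """Override rule from the post: lowercase alphanumeric, max 32 characters."""
    return re.fullmatch(r"[a-z0-9]{1,32}", value) is not None


print(parse_label("pytest_gw0_p18234t1745178234r3af93d71_concurrency"))
# An underscore inside the namespace would be misread as the suffix boundary:
print(parse_label("pytest_gw0_my_debug_concurrency"))
print(is_valid_override("my_debug"), is_valid_override("mydebug"))
```

The second call shows why the ban exists: with an underscore in the token, the parser reads `my` as the namespace, so a live `my_debug` run would not be recognized as active.
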
## Old design vs current design

| Concern | Earlier shape | Current shape |
|---|---|---|
| Worker schema naming | `test_gw0` | `test_{namespace}_{worker}` |
| Cross-invocation safety | best effort | structural isolation |
| Report artifacts | shared timestamp paths | namespace-qualified paths |
| Active test detection | per-DB view, risk of partial picture | union across all configured DBs |
| Cleanup posture | could be destructive when uncertain | fail-closed when detection is incomplete |
| Schema drop strategy | many drops in one transaction | one schema per transaction |
| DDL slot derivation | first digit run in schema name | anchored on `gwN` suffix |
| `search_path` composition | hand-built per call site | single `worker_schema()` helper |
| Lock triage | infer from failures | `pg_blocking_pids()` plus `application_name` labels |

## What we deliberately did not do

A few options we considered and rejected, in case they are useful for someone making the same call:

- Per-invocation database (`CREATE DATABASE` … `DROP DATABASE`): needs `CREATEDB`, re-runs Alembic per invocation, and adds 4 to 8 seconds × N databases to every startup. Too heavy.
- Bigger `lock_timeout` everywhere: more band-aid. The failure mode becomes 120-second waits instead of 15-second errors. Does not fix the race.
- Postgres advisory locks instead of file-system fcntl: does not replace namespace isolation. Reasonable follow-up if multi-host CI runners ever show up; not worth doing inside a local-WSL workflow.
- Queueing invocations as a discipline rule: fragile, and was already getting violated in practice when the failures started.

## Why this matters more for AI-assisted teams

A human team often hits this class of collision occasionally.
An agent swarm hits it routinely:

- multiple agents may launch tests within seconds of each other
- they often run similar target sets
- they may inherit shell state you forgot about
- they are more likely to stress stale cleanup and report collection

What looked like a flaky edge case becomes the normal envelope of operation. The test environment has to behave like shared infrastructure, with namespacing, cleanup rules, diagnostics, and safe defaults that hold without operator coordination.

## Closing

The same instincts behind this work show up in the product: isolate state explicitly, fail closed when certainty disappears, label work so you can trace it later, prefer structural guarantees over timeout band-aids. That is the posture behind the [EquatorOps platform](/platform) and the [engine architecture](/platform/engines).

If you are building operational software with real concurrency, the takeaway is not “use namespaces.” It is that worker-level parallelism stops being enough once independent runs can collide on shared state, and once AI agents are part of your engineering loop, those collisions stop being rare.

If you want to talk about the developer surface behind that architecture, start at [/developers](/developers).