The Physical Laws of AI Migrations: Architecting an LLM Orchestrator that Survives Reality

A developer architecting an LLM orchestrator for large codebase migrations discovered that naive implementations fail under real-world load due to race conditions, Git lock collisions, and resource exhaustion. By applying distributed systems principles—such as using an append-only ledger for state, isolated Git worktrees for parallelism, and bounded retry loops—the orchestrator can survive reality and complete migrations without blocking concurrent development.

Large codebase migrations are not typing problems; they are distributed state machine problems. When you attempt to execute a multi-step, multi-PR refactor using an LLM (such as the workflows I proposed in this migration-orchestration-skill: https://github.com/mhosseinab/skills/blob/master/migration-orchestration/SKILL.md), you are not just writing code. You are instantiating a distributed system. Your orchestrator is the control plane, your LLM subagents are asynchronous worker nodes, and the local repository is your shared database.

Most AI workflow documentation assumes a "happy path" where agents neatly read instructions, edit files, and check off a Markdown list. This is whiteboard architecture at its most naive. In reality, agents hallucinate, Git locks up, test suites hang, and state files get corrupted. To build an orchestrator that actually finishes a migration, we must align our architecture with the mechanical realities of the environment.

The core business requirement is to safely sequence, verify, and commit a massive architectural change (e.g., migrating a monolith to partitioned micro-databases, or rewriting React classes to hooks) without blocking concurrent human feature development. The fundamental distributed systems headache? You are coordinating asynchronous, non-deterministic compute nodes (LLM agents) attempting concurrent mutations against a rigidly sequential, locally locked state store (the local filesystem and Git). If you don't engineer strict boundaries, the result is corrupted code, blown context windows, and thousands of dollars in wasted API tokens.

The standard MVP implementation breaks the migration into discrete steps (S1..Sn), defines a dependency graph, and instructs an orchestrator to spawn concurrent subagents for any steps that touch "disjoint files." The state of the migration is tracked by having agents read and update a human-readable <slug>-progress.md file.
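As a rough illustration of the MVP scheduling rule described above, the Python sketch below picks the next batch of steps whose dependencies are complete and whose file sets do not overlap. The names Step and next_batch, and the example file names, are hypothetical and not taken from the skill itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One migration step (hypothetical shape, for illustration only)."""
    name: str
    files: frozenset[str]
    depends_on: frozenset[str] = frozenset()


def next_batch(steps: list[Step], done: set[str]) -> list[Step]:
    """Return steps that are ready to run and touch mutually disjoint files."""
    batch: list[Step] = []
    claimed: set[str] = set()
    for step in steps:
        if step.name in done:
            continue  # already completed
        if not step.depends_on <= done:
            continue  # at least one dependency is still pending
        if step.files & claimed:
            continue  # overlaps with a step already in this batch
        batch.append(step)
        claimed |= step.files
    return batch


steps = [
    Step("S1", frozenset({"auth.ts"})),
    Step("S2", frozenset({"db.ts"})),
    Step("S3", frozenset({"auth.ts"})),
    Step("S4", frozenset({"routes.ts"}), depends_on=frozenset({"S1"})),
]
```

Note that the disjointness check compares file names only. It says nothing about the Git index, the package manager cache, or the CPU and memory the spawned agents will share.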
When an agent fails a test suite, it enters a worker → reviewer → fix loop until the tests pass. Under real-world load, this implodes almost immediately:

- Concurrent agents overwrite the progress.md file, wiping out each other's status updates.
- Multiple agents run npm install or cargo build in parallel. They exhaust I/O, thrash memory, and crash the local machine.
- Simultaneous Git operations hit .git/index.lock collisions, causing unhandled fatal errors that halt the orchestrator entirely.

To fix this, we must deconstruct the pipeline and apply mechanical sympathy to every failure domain.

Start with state. The naive design treats <slug>-progress.md as a concurrent data store, which guarantees race conditions, phantom reads, and lost updates when multiple agents edit the file. Guarding the file with a lock is no better: if the lock holder is killed or the user hits Ctrl+C, the lock is never released, and the migration is permanently deadlocked. The fix is an append-only ledger in which every state change is a record such as {"step": "S2", "status": "passed", "sha": "abc1234", "timestamp": 1719154980}. The Markdown file is strictly a materialized, read-only projection generated by the orchestrator for human consumption. If an agent's expected prior state doesn't match the latest ledger sequence, the append is rejected, and the agent must reconcile.

Next, parallelism. The naive design assumes that touching disjoint files (Agent A edits auth.ts, Agent B edits db.ts) prevents conflicts. It does not, because both agents still share one working directory and one Git index, so their git add and git commit calls collide. The fix is to run git worktree add ../S2-branch for every concurrent step. This gives each agent a completely isolated index, working directory, and execution environment. Parallelism is no longer an illusion; it is mechanically enforced by OS-level filesystem boundaries.

Then, retries. The worker → reviewer → fix loop assumes that every code problem is solvable if the LLM just tries hard enough. It acts as an unbounded while-true loop. The fix is a hard cap such as max_attempts=3. If the agent still fails, it marks the step as blocked.

Finally, abandoned work. If an agent dies mid-step, the step stays in_progress forever. Because downstream steps in the DAG depend on this node, the system indefinitely buffers out-of-order sequences waiting for the missing tail. This is a classic zombie state. The fix is a lease: each in-progress record carries a deadline such as expires_at: 1719155980. If the orchestrator detects an expired lease, the step is violently revoked, marked as a Poison Pill, and sent to the dead-letter queue (DLQ).
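To make the ledger and lease rules above concrete, here is a minimal Python sketch under stated assumptions. It assumes the ledger is a JSON Lines file and that only the orchestrator process appends to it (agents submit records to the orchestrator), so no file locking is shown. The names Ledger, StaleWriteError, and revoke_expired_leases, the seq field, and the in_progress and dead_letter status strings are illustrative choices, not part of the original workflow.

```python
import json
import os


class StaleWriteError(Exception):
    """The writer's view of the ledger is out of date."""


class Ledger:
    """Append-only JSON Lines ledger. Assumes a single writer (the orchestrator)."""

    def __init__(self, path: str) -> None:
        self.path = path

    def read(self) -> list[dict]:
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]

    def append(self, expected_seq: int, record: dict) -> int:
        """Append the record only if the ledger still ends at expected_seq."""
        current_seq = len(self.read())
        if expected_seq != current_seq:
            raise StaleWriteError(
                f"expected seq {expected_seq}, ledger is at {current_seq}"
            )
        entry = {"seq": current_seq + 1, **record}
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(entry) + "\n")
        return entry["seq"]

    def latest_by_step(self) -> dict[str, dict]:
        """Fold the event log into the most recent record per step."""
        return {event["step"]: event for event in self.read()}


def revoke_expired_leases(ledger: Ledger, now: float) -> list[str]:
    """Dead-letter every in-progress step whose lease deadline has passed."""
    revoked = []
    for step, event in ledger.latest_by_step().items():
        expires_at = event.get("expires_at", float("inf"))
        if event["status"] == "in_progress" and expires_at <= now:
            seq = len(ledger.read())
            ledger.append(
                seq, {"step": step, "status": "dead_letter", "timestamp": now}
            )
            revoked.append(step)
    return revoked
```

Because nothing is ever rewritten in place, the human-readable progress file can be regenerated at any time from latest_by_step(), and a revoked step leaves its full history behind for whoever picks it up.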
The orchestrator halts execution of dependent nodes and pages the human operator. A strict dependency chain cannot skip broken links; it must halt safely and predictably.

Drawing dependency boxes on a whiteboard is easy. Executing them reliably against physical constraints is Staff-level engineering. An LLM orchestration framework is not exempt from the physical laws of distributed systems. File lock contention, out-of-order execution, unparseable ASTs, and process crashes are guaranteed. If you rely on "happy path" prompts to sequence a codebase migration, your system will fail. But if you design your AI pipeline with immutable ledgers, isolated execution environments, and strict TTL bounds, you graduate from writing fragile prompt chains to operating a resilient, production-grade migration machine. Real engineering is found in how your system handles failure, not how it behaves when everything goes right.