# Orchestrating AI Code Review at scale

The article describes how Cloudflare built a scalable AI code review system to address bottlenecks in their engineering workflow. Instead of using a single monolithic AI model, they developed a CI-native orchestration system that deploys up to seven specialized AI agents to review merge requests for security, performance, code quality, and other criteria. The system, which has processed tens of thousands of merge requests internally, uses a coordinator agent to deduplicate findings and post structured reviews, effectively catching real bugs and blocking merges with genuine security vulnerabilities.

Code review is a fantastic mechanism for catching bugs and sharing knowledge, but it is also one of the most reliable ways to bottleneck an engineering team. A merge request sits in a queue, a reviewer eventually context-switches to read the diff, they leave a handful of nitpicks about variable naming, the author responds, and the cycle repeats. Across our internal projects, the median wait time for a first review was often measured in hours.

When we first started experimenting with AI code review, we took the path that most other people probably take: we tried out a few different AI code review tools. A lot of them worked pretty well, and a lot of them even offered a good amount of customisation and configurability. Unfortunately, the one recurring theme was that they just didn’t offer enough flexibility and customisation for an organisation the size of Cloudflare.

So, we jumped to the next most obvious path, which was to grab a git diff, shove it into a half-baked prompt, and ask a large language model to find bugs. The results were exactly as noisy as you might expect, with a flood of vague suggestions, hallucinated syntax errors, and helpful advice to "consider adding error handling" on functions that already had it.
We realised pretty quickly that a naive summarisation approach wasn't going to give us the results we wanted, especially on complex codebases. Instead of building a monolithic code review agent from scratch, we decided to build a CI-native orchestration system around OpenCode, an open-source coding agent.

Today, when an engineer at Cloudflare opens a merge request, it gets an initial pass from a coordinated smörgåsbord of AI agents. Rather than relying on one model with a massive, generic prompt, we launch up to seven specialised reviewers covering security, performance, code quality, documentation, release management, and compliance with our internal Engineering Codex. These specialists are managed by a coordinator agent that deduplicates their findings, judges the actual severity of the issues, and posts a single structured review comment.

We've been running this system internally across tens of thousands of merge requests. It approves clean code, flags real bugs with impressive accuracy, and actively blocks merges when it finds genuine, serious problems or security vulnerabilities. This is just one of the many ways we’re improving our engineering resiliency as part of Code Orange: Fail Small.

This post is a deep dive into how we built it, the architecture we landed on, and the specific engineering problems you run into when you try to put LLMs in the critical path of your CI/CD pipeline, and more critically, in the way of engineers trying to ship code.

## The architecture: plugins all the way to the moon

When you are building internal tooling that has to run across thousands of repositories, hardcoding your version control system or your AI provider is a great way to ensure you'll be rewriting the whole thing in six months. We needed to support GitLab today and who knows what tomorrow, alongside different AI providers and different internal standards requirements, without any component needing to know about the others.
We built the system on a composable plugin architecture where the entry point delegates all configuration to plugins that compose together to define how a review runs. Here is what the execution flow looks like when a merge request triggers a review:

Each plugin implements a `ReviewPlugin` interface with three lifecycle phases. Bootstrap hooks run concurrently and are non-fatal, meaning if a template fetch fails, the review just continues without it. Configure hooks run sequentially and are fatal, because if the VCS provider can't connect to GitLab, there is no point in continuing the job. Finally, `postConfigure` runs after the configuration is assembled to handle asynchronous work like fetching remote model overrides.

The `ConfigureContext` gives plugins a controlled surface to affect the review. They can register agents, add AI providers, set environment variables, inject prompt sections, and alter fine-grained agent permissions. No plugin has direct access to the final configuration object. They contribute through the context API, and the core assembler merges everything into the `opencode.json` file that OpenCode consumes.

Because of this isolation, the GitLab plugin doesn't read Cloudflare AI Gateway configurations, and the Cloudflare plugin doesn't know anything about GitLab API tokens. All VCS-specific coupling is isolated in a single `ci-config.ts` file.
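
To make the three lifecycle phases concrete, here is a minimal TypeScript sketch of how such a runner could look. Only the names `ReviewPlugin`, `ConfigureContext`, and the three phase names come from the description above; the method signatures, the `AssembledConfig` shape, the `runLifecycle` helper, and the two example plugins are all assumptions made for illustration, not the actual implementation.

```typescript
// Hypothetical shapes: the article names ReviewPlugin and ConfigureContext,
// but every field and method below is an assumption made for illustration.
interface ConfigureContext {
  registerAgent(name: string): void;
  setEnv(key: string, value: string): void;
}

interface ReviewPlugin {
  name: string;
  bootstrap?(): Promise<void>;
  configure?(ctx: ConfigureContext): Promise<void>;
  postConfigure?(ctx: ConfigureContext): Promise<void>;
}

interface AssembledConfig {
  agents: string[];
  env: Record<string, string>;
}

async function runLifecycle(
  plugins: ReviewPlugin[],
): Promise<{ config: AssembledConfig; warnings: string[] }> {
  const warnings: string[] = [];

  // Phase 1: bootstrap hooks run concurrently. A failure is recorded as a
  // warning and the review carries on without that plugin's bootstrap work.
  await Promise.all(
    plugins.map(async (plugin) => {
      try {
        await plugin.bootstrap?.();
      } catch (err) {
        warnings.push(`${plugin.name}: bootstrap failed (${String(err)})`);
      }
    }),
  );

  // Plugins never touch `config` directly; they only see the context API.
  const config: AssembledConfig = { agents: [], env: {} };
  const ctx: ConfigureContext = {
    registerAgent: (name) => { config.agents.push(name); },
    setEnv: (key, value) => { config.env[key] = value; },
  };

  // Phase 2: configure hooks run one at a time, in order. Any throw
  // propagates to the caller and aborts the whole run.
  for (const plugin of plugins) {
    await plugin.configure?.(ctx);
  }

  // Phase 3: postConfigure runs once configuration is assembled. Running it
  // sequentially and treating failures as fatal is an assumption here.
  for (const plugin of plugins) {
    await plugin.postConfigure?.(ctx);
  }

  return { config, warnings };
}

// Usage with two made-up plugins: one whose bootstrap fails,
// and one that registers an agent during configure.
const demo = runLifecycle([
  { name: "templates", bootstrap: async () => { throw new Error("fetch failed"); } },
  { name: "security", configure: async (ctx) => { ctx.registerAgent("security-reviewer"); } },
]);
```

In this sketch the failed template fetch ends up in `warnings` while the security agent is still registered; a throw from any `configure` hook would instead reject the promise returned by `runLifecycle`.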
Here is the plugin roster for a typical internal review:

| Plugin | Responsibility |
|---|---|
| `@opencode-reviewer/gitlab` | GitLab VCS provider, MR data, MCP comment server |
| `@opencode-reviewer/cloudflare` | AI Gateway configuration, model tiers, failback chains |
| `@opencode-reviewer/codex` | Internal compliance checking against engineering RFCs |
| `@opencode-reviewer/braintrust` | Distributed tracing and observability |
| `@opencode-reviewer/agents-md` | Verifies the repo's AGENTS.md is up to date |
| `@opencode-reviewer/reviewer-config` | Remote per-reviewer model overrides from a Cloudflare Worker |
| `@opencode-reviewer/telemetry` | Fire-and-forget review tracking |

## How we use OpenCode under the hood

We picked OpenCode as our coding agent of choice for a couple of reasons:

- We use it extensively internally, meaning we were already very familiar with how it worked.
- It’s open source, so we can contribute features and bug fixes upstream, as well as investigate issues really easily when we spot them (at the time of writing, Cloudflare engineers have landed over 45 pull requests upstream).
- It has a great open source SDK, allowing us to easily build plugins that work flawlessly.

But most importantly, it is structured as a server first, with its text-based user interface and desktop app acting as clients on top. This was a hard requirement for us because we needed to create sessions programmatically, send prompts via an SDK, and collect results from multiple concurrent sessions without hacking around a CLI.

The orchestration works in two distinct layers:

The Coordinator Process: We spawn OpenCode as a child process using `Bun.spawn`. We pass the coordinator prompt via stdin rather than as a command-line argument, because if you have ever tried to pass a massive merge request description full of logs as a command-line argument, you have probably met the Linux kernel's `ARG_MAX` limit.
We learned this pretty quickly when `E2BIG` errors started showing up on a small percentage of our CI jobs for incredibly large merge requests. The process runs with `--format json`, so all output arrives as JSONL events on `stdout`:

```js
const proc = Bun.spawn(
  [
    "bun", opencodeScript,
    "--print-logs", "--log-level", logLevel,
    "--format", "json",
    "--agent", "review_coordinator",
    "run",
  ],
  {
    // The prompt goes in via stdin, not argv, to stay clear of ARG_MAX
    stdin: Buffer.from(prompt),
    env: {
      ...sanitizeEnvForChildProcess(process.env),
      OPENCODE_CONFIG: process.env.OPENCODE_CONFIG_PATH ?? "",
      BUN_JSC_gcMaxHeapSize: "2684354560", // 2.5 GB heap cap
    },
    stdout: "pipe",
    stderr: "pipe",
  },
);
```

The Review Plugin: Inside the OpenCode process, a runtime plugin provides the `spawn_reviewers` tool. When the coordinator LLM decides it is time to review the code, it calls this tool, which launches the sub-reviewer sessions through OpenCode's SDK client:

```js
// Create a child session linked to the coordinator's session
const createResult = await this.client.session.create({
  body: { parentID: input.parentSessionID },
  query: { directory: dir },
});

// Send the prompt asynchronously (non-blocking)
this.client.session.promptAsync({
  path: { id: task.sessionID },
  body: {
    parts: [{ type: "text", text: promptText }],
    agent: input.agent,
    model: { providerID, modelID },
  },
});
```

Each sub-reviewer runs in its own OpenCode session with its own agent prompt. The coordinator doesn't see or control what tools the sub-reviewers use. They are free to read source files, run grep, or search the codebase as they see fit, and they simply return their findings as structured XML when they finish.

## What’s JSONL, and what do we use it for?

One of the big challenges that you typically face when working with systems like this is the need for structured logging, and while JSON is a fantastic structured format, it requires everything to be “closed out” to be a valid JSON blob.
This is especially problematic if your application exits early before it has a chance to close everything out and write a valid JSON blob to disk, and this is often when you need the debug logs most.

This is why we use JSONL (JSON Lines), which does exactly what it says on the tin: it’s a text format where every line is a valid, self-contained JSON object. Unlike a standard JSON array, you don't have to parse the whole document to read the first entry. You read a line, parse it, and move on. This means you don’t have to worry about buffering massive payloads into memory, or hoping for a closing `]` that may never arrive because the child process ran out of memory.

In practice, it looks like this:

Stripped: authorization, cf-access-token, host Added: cf-aig-authorization: Bearer
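
As a rough illustration of the line-by-line property of JSONL described in this section, here is a minimal TypeScript sketch of a buffer that accepts stdout text in arbitrary chunks and yields each event as soon as its line is complete. The `JsonlBuffer` class, the `JsonlEvent` type, and the `type` values in the sample input are hypothetical; the article does not show OpenCode's real event schema or how the coordinator's output is actually parsed.

```typescript
// Hypothetical event shape; the real OpenCode event schema is not shown here.
type JsonlEvent = Record<string, unknown>;

class JsonlBuffer {
  private pending = "";

  // Feed a chunk of stdout text; returns every complete event it contains.
  push(chunk: string): JsonlEvent[] {
    this.pending += chunk;
    const lines = this.pending.split("\n");
    // The last element is either "" or an unfinished line: keep it for later.
    this.pending = lines.pop() ?? "";

    const events: JsonlEvent[] = [];
    for (const line of lines) {
      const trimmed = line.trim();
      if (trimmed === "") continue;
      try {
        const parsed: unknown = JSON.parse(trimmed);
        if (typeof parsed === "object" && parsed !== null && !Array.isArray(parsed)) {
          events.push(parsed as JsonlEvent);
        }
      } catch {
        // A malformed line is skipped; events parsed earlier stay usable.
      }
    }
    return events;
  }
}

// Usage: the second event arrives split across two chunks, and the third
// line is cut off, as it would be if the child process died mid-write.
const buffer = new JsonlBuffer();
const first = buffer.push('{"type":"start"}\n{"type":"te');
const second = buffer.push('xt","text":"hi"}\n{"type":"trunc');
```

Each call returns only the events whose lines have been fully received, so a process that dies mid-write costs at most the final, unfinished line rather than the whole log.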