Different models have different blind spots

Codev's multi-model review system caught two distinct bugs that no single AI model would have identified alone: Codex detected a Unix socket permission flaw (missed by Claude and Gemini), while Claude spotted an OAuth nonce misplacement (missed by Codex and Gemini). This demonstrates that different AI models have unique blind spots, leading Codev 3.0 to implement a parallel multi-model consultation loop where models debate disagreements rather than relying on a single perspective.

One of the best arguments for Codev came from two specific "saves" earlier this year: bugs that no single model would have caught on its own. During a high-velocity sprint, @waleedkadous used Codev to ship a stack of features for the platform. The work looked ready to merge. Then the multi-model review at the end of one of the implementation phases ran.

Codex flagged a Unix socket created without restrictive permissions (0600). Any local user on the machine could have connected to it and driven the shell session, not just observed it. Claude and Gemini both missed it.

Claude flagged an OAuth nonce placed on the wrong URL. The nonce, a one-time secret that proves an OAuth callback came from the flow this user started, was attached to the outbound request instead of the callback URL the cloud echoes back. The net effect: the callback handler had nothing to verify against, opening the door to a CSRF attack where a forged callback could hijack the connection and make it look like you had authorized it when you hadn’t. Codex and Gemini both missed it.

The takeaway: different models have different blind spots. Codex obsesses over edge cases and security surface area; Claude pattern-matches against subtle protocol-level mistakes. Neither model alone would have caught both bugs.

This is why we built Codev 3.0 around a multi-model consultation loop. Rather than relying on a single model's perspective on the code, the 3.0 pipeline runs independent models in parallel, surfaces every disagreement, and lets the models debate it through a rebuttal round. You can see the full breakdown of how multi-agent reviews compare to single-model outputs here:
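To make the socket finding above concrete, here is a minimal Python sketch of creating a Unix domain socket that only its owner can connect to. It is not Codev's actual code: the function names `bind_private_socket` and `socket_mode` are hypothetical, and the sketch assumes a POSIX system where the socket file's mode is derived from the process umask at `bind()` time.

```python
import os
import socket
import stat


def bind_private_socket(path: str) -> socket.socket:
    """Bind a Unix domain socket that only the owning user can access (0600)."""
    # Remove a stale socket file left by a previous run, if any.
    if os.path.exists(path):
        os.unlink(path)

    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)

    # Tighten the umask *before* bind() so the socket file is never
    # created with group/other access, even for a brief window.
    old_umask = os.umask(0o177)
    try:
        server.bind(path)
    finally:
        os.umask(old_umask)  # always restore the previous umask

    # Explicit chmod as a second layer, in case the umask was not honored.
    os.chmod(path, 0o600)
    server.listen(1)
    return server


def socket_mode(path: str) -> int:
    """Return only the permission bits of the file at `path`."""
    return stat.S_IMODE(os.stat(path).st_mode)
```

Note that the umask is process-wide, so in a multi-threaded program it may be safer to place the socket inside a directory created with mode 0700 rather than toggling the umask around `bind()`.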
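The nonce finding can be sketched the same way. The snippet below is an illustrative assumption, not the platform's real handler: the query parameter name `nonce`, the in-memory `_pending` store, and both function names are hypothetical. It shows the general pattern described above, where the one-time secret travels on the callback URL so the handler has something to check a returning request against.

```python
import hmac
import secrets
from urllib.parse import parse_qs, urlencode, urlsplit

# Nonces issued by this process that have not been redeemed yet.
# (Hypothetical in-memory store; a real service would scope these per user.)
_pending: set[str] = set()


def build_callback_url(base: str) -> str:
    """Attach a fresh one-time nonce to the callback URL and remember it."""
    nonce = secrets.token_urlsafe(32)
    _pending.add(nonce)
    return f"{base}?{urlencode({'nonce': nonce})}"


def verify_callback(url: str) -> bool:
    """Accept a callback only if it carries a nonce we issued, exactly once."""
    values = parse_qs(urlsplit(url).query).get("nonce", [])
    if len(values) != 1:
        return False  # missing or duplicated nonce: reject
    received = values[0].encode()
    for expected in list(_pending):
        # Constant-time comparison avoids leaking the nonce via timing.
        if hmac.compare_digest(expected.encode(), received):
            _pending.discard(expected)  # one-time use: a replay fails
            return True
    return False
```

With this shape, a forged callback that lacks a valid nonce is rejected, and replaying a legitimate callback URL fails because the nonce is discarded on first use.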