Scaling security reviews at 1Password: Building an AI-powered pipeline

The developers and engineers here at 1Password are always working to improve our products. With all the active development to introduce features, fix bugs, and enhance the overall user experience, numerous code changes go into every release. We strive to ensure each iteration is better than the last and that new code doesn’t introduce vulnerabilities. A key part of this process is our Product Security (ProdSec) team’s review of all code changes that may have security implications.

In the past, security engineers gathered on calls several times per week to go through all the PRs in the queue that required ProdSec eyes. While incredibly important, this review process was arduous and consumed countless person-hours every month, especially when the engineers were flagging the same patterns over and over. And our team did this for years.

These manual security reviews worked when 1Password was a smaller company with one product. But as 1Password grew in size and expanded its product line, the number of PRs increased, and it began to grow by orders of magnitude as engineers adopted AI coding assistants.

One thing remained constant, though: There are still only 24 hours in a day. It was a process that just couldn’t scale.

Over the past year, it became clear we needed a solution. We tested a few popular third-party tools that use artificial intelligence (AI) to enhance the traditional static application security testing (SAST) process used throughout the industry. They performed adequately from a general security perspective, but we knew we could do better. We believed we could use AI models, along with a 1Password-specific knowledge base, to significantly reduce the time and effort our team dedicated to security reviews.

A few days later, we had an idea and a great moniker: SAGE (Security Analysis Guidance Engine). 🌿

SAGE was an ambitious hypothesis, but we had all the information we needed; we just had to compile it. It started with a script.

We gathered nearly 9,000 pull requests that spanned more than five years of ProdSec code reviews. The reviews covered our entire product line and the languages it is written in: Rust, Go, Kotlin, TypeScript, and Swift. We fed those reviews into an LLM to deduplicate comments and cull any superfluous notes (like thumbs-up emoji and “Looks good to me!”), and ended up with 8,343 reviews that each consisted only of the ProdSec engineer’s comment and the corresponding diff hunk. Finally, we gave those reviews to an LLM and instructed it to synthesize the information into deduplicated rules, grouped by vulnerability category.

We (real live humans) reviewed and edited the output, and went into v0 with 171 rules across 16 categories like authentication, cryptography, and logging.

SAGE v0 was deliberately modest. First, we ran it locally against PR diffs downloaded directly from GitHub. SAGE made a single call to the LLM with a relatively minimal prompt, instructions for structured output, the PR diff, and a copy of the ruleset.
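To make the single-call design concrete, here is a minimal sketch of how one prompt could bundle the instructions, the output format, the ruleset, and the PR diff. It is an illustration under assumptions, not SAGE's actual code: the function name `build_v0_prompt`, the prompt wording, the JSON shape, and the rule ID `CRYPTO-001` are all hypothetical.

```python
import json

# Hypothetical shape for the structured output; the article does not publish its schema.
OUTPUT_SHAPE = {
    "findings": [
        {"rule_id": "string", "file": "string", "line": 0, "explanation": "string"}
    ]
}


def build_v0_prompt(diff: str, rules: list[dict]) -> str:
    """Assemble one prompt: instructions, output format, ruleset, then the diff."""
    ruleset = "\n".join(
        f"- [{rule['id']}] ({rule['category']}) {rule['text']}" for rule in rules
    )
    sections = [
        # Assumed wording; the real prompt is not shown in the article.
        "Review the pull request diff against the security rules below.",
        "Respond with JSON only, matching this shape:\n"
        + json.dumps(OUTPUT_SHAPE, indent=2),
        "RULES:\n" + ruleset,
        "DIFF:\n" + diff,
    ]
    return "\n\n".join(sections)


# Illustrative rule and diff; neither comes from the real ruleset.
example_rules = [
    {"id": "CRYPTO-001", "category": "cryptography", "text": "Never hard-code key material."}
]
example_prompt = build_v0_prompt('+ const KEY = "abc123"', example_rules)
```

Everything the model needs travels in that one string, which keeps the setup simple but also means the same call has to both find and verify issues.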

Our local tests were pretty successful and, after posting SAGE’s comments on the live PRs, we received positive feedback from reviewers on the team. At that point, it was time to try SAGE in the production pipeline, so we deployed it as a GitHub Action using an existing internal GitHub Actions framework as a starting point.

As a GitHub Action, SAGE v0 posted its findings on the scanned PRs with comments keyed to the specific lines of code. It also maintained an activity log: A persistent PR comment updated on each push that tracked scan history (new findings, resolved findings, errors). Reviewing engineers could react to the comments with a thumbs up if they agreed with the finding and thumbs down if they deemed it a false positive.
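The activity log's "new findings, resolved findings" bookkeeping can be pictured as a set difference between two scans. The sketch below is only an illustration of that idea: the `Finding` fields, the function names, and the log format are assumptions, and the rule IDs and file names are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    # Assumed identity of a finding: which rule fired, and where.
    rule_id: str
    path: str
    line: int


def diff_scans(previous: set, current: set) -> dict:
    """Compare two scans and report which findings are new and which were resolved."""
    def sort_key(finding: Finding):
        return (finding.path, finding.line, finding.rule_id)

    return {
        "new": sorted(current - previous, key=sort_key),
        "resolved": sorted(previous - current, key=sort_key),
    }


def render_log_entry(push_sha: str, changes: dict) -> str:
    """Format one line of a persistent activity-log comment (hypothetical format)."""
    return f"{push_sha}: {len(changes['new'])} new, {len(changes['resolved'])} resolved"


# Invented example data for two consecutive pushes.
before = {Finding("LOG-002", "auth.go", 10), Finding("CRYPTO-001", "keys.go", 5)}
after = {Finding("CRYPTO-001", "keys.go", 5), Finding("AUTH-007", "auth.go", 42)}
changes = diff_scans(before, after)
entry = render_log_entry("abc1234", changes)
```

Matching on exact line numbers is a simplification here; a real tracker would need an identity that survives lines shifting between pushes.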

SAGE really began to show its worth when it scanned a PR that touched a lot of cryptographic code. It identified 6/6 true positives, including findings initially missed by our human reviewers. And it did all this for an average token cost of $0.47 USD per scan.

But we also noticed shortcomings with the SAGE v0 implementation. Primarily, we were asking one model to locate and verify findings in the same call. SAGE also lacked any false-positive filter, had little hardening against prompt injection, and supported only a single provider. V0 proved our concept worked, but a single call to a single model was doing everything; we needed separation and some form of objectivity.

So we got back to work.

We researched and brainstormed for days before we outlined a (pretty daunting) plan for SAGE v1.

We started with housekeeping tasks. With access to more advanced models at this point, we ran another rule extraction and reconciled the output with the v0 ruleset. V1 now has access to 343 rules that cover 16 vulnerability categories. From that ruleset, we created a compact index that consists only of the rule ID and a one-line summary of the rule.
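As a rough illustration of the compact index, the sketch below strips each rule down to its ID and one-line summary while keeping a lookup table for the full rule. The field names (`id`, `summary`, `body`) and the example rule are assumptions; the article does not describe the ruleset's storage format.

```python
def build_compact_index(rules: list[dict]) -> list[dict]:
    """Keep only the rule ID and the one-line summary of each rule."""
    return [{"id": rule["id"], "summary": rule["summary"]} for rule in rules]


def index_rules_by_id(rules: list[dict]) -> dict:
    """Map rule ID to the full rule, for stages that need the whole rule body."""
    return {rule["id"]: rule for rule in rules}


# Invented example rule; the real ruleset is not public.
full_rules = [
    {
        "id": "LOG-002",
        "category": "logging",
        "summary": "Do not log secrets.",
        "body": "Longer guidance with examples and exceptions would live here.",
    }
]
compact_index = build_compact_index(full_rules)
full_rule_lookup = index_rules_by_id(full_rules)
```

The point of the split is size: a stage that only needs to recognize which rule might apply can read the short index, and the long rule text is fetched by ID only when it is needed.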

That was the easy part.

Our v1 architecture introduced two critical aspects: model/vendor agnosticism and a call pipeline.

At a time when AI technology advances nearly every day, we wanted SAGE to be flexible. So we designed v1 to talk to LLMs through an llm.Client interface, which made the scanner entirely provider-agnostic.
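The article names the `llm.Client` interface but does not show its language or methods, so the following is only a sketch of the general pattern in Python. The `complete` method, its parameters, and the `CannedClient` stand-in are all assumptions made for illustration.

```python
from typing import Protocol


class Client(Protocol):
    """Minimal provider-neutral interface; the method name is an assumption."""

    def complete(self, system: str, user: str) -> str:
        ...


class CannedClient:
    """Stand-in provider that returns a fixed response and records its calls."""

    def __init__(self, response: str) -> None:
        self.response = response
        self.calls = []

    def complete(self, system: str, user: str) -> str:
        self.calls.append((system, user))
        return self.response


def run_stage(client: Client, system: str, user: str) -> str:
    """A pipeline stage depends only on the interface, never on a vendor SDK."""
    return client.complete(system, user)


fake = CannedClient('{"findings": []}')
result = run_stage(fake, "You are the Finder.", "DIFF: ...")
```

Because `run_stage` only knows about the interface, swapping providers means writing one new adapter class rather than touching the scanner.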

Following a consultation with industry experts, we also built a three-stage progressive-disclosure pipeline: A series of separate LLM calls that each receive specific information.

The Finder stage is intentionally noisy. It sees the compact rule index, a tailored prompt, and instructions for structured JSON output. This stage also detects and reports prompt-injection attempts as PROMPT-INJECTION findings. Its goal is high recall. Speculative findings are acceptable because the next stage in the pipeline handles quality control.
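One way to picture the Finder's structured output is a small parser that keeps well-formed findings and separates out PROMPT-INJECTION reports. This is a sketch under assumptions: the JSON schema, the required field names, and both function names are hypothetical, and only the `PROMPT-INJECTION` label comes from the article.

```python
import json

REQUIRED_FIELDS = {"rule_id", "file", "line", "explanation"}  # assumed schema


def parse_finder_output(raw: str) -> list[dict]:
    """Parse the Finder's JSON and keep only well-formed findings.

    Invalid JSON raises ValueError so a broken response is never mistaken
    for a clean scan.
    """
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object with a 'findings' list")
    findings = []
    for item in data.get("findings", []):
        if isinstance(item, dict) and REQUIRED_FIELDS.issubset(item):
            findings.append(item)
    return findings


def split_injection_reports(findings: list[dict]) -> tuple:
    """Separate PROMPT-INJECTION findings from ordinary rule findings."""
    injections = [f for f in findings if f["rule_id"] == "PROMPT-INJECTION"]
    others = [f for f in findings if f["rule_id"] != "PROMPT-INJECTION"]
    return injections, others


# Invented response: one rule finding, one injection report, one malformed item.
raw_response = json.dumps({
    "findings": [
        {"rule_id": "AUTH-007", "file": "auth.go", "line": 42, "explanation": "Missing check."},
        {"rule_id": "PROMPT-INJECTION", "file": "README.md", "line": 3,
         "explanation": "Comment instructs the reviewer to ignore rules."},
        {"rule_id": "LOG-002"},
    ]
})
parsed = parse_finder_output(raw_response)
injection_reports, rule_findings = split_injection_reports(parsed)
```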

The Critic stage is a completely separate API call. It has no access to the Finder’s chain of thought, reasoning, or raw response. It sees the structured finding JSON, relevant code hunks, and the full rule body for the cited rule ID (from the index). The Critic’s prompt is adversarial: “Consider findings exploitable unless you can prove otherwise; actively look for reasons each finding is WRONG.” It considers whether the code path is reachable, whether the finding is within test or mock code, whether there are mitigations elsewhere, and whether exploitation requires multiple unlikely conditions.
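The isolation described here can be sketched as an allow-list: only the structured finding fields, the code hunks, and the full rule are copied into the Critic's input, so anything else the Finder produced is left behind. The field names, the `build_critic_input` function, and the example data are assumptions for illustration only.

```python
import json

# Assumed structured fields of a finding; everything else is dropped.
FINDING_FIELDS = ("rule_id", "file", "line", "explanation")


def build_critic_input(finder_record: dict, hunks: list[str], rules_by_id: dict) -> str:
    """Build the Critic's input from an allow-list of fields plus the full rule."""
    finding = {key: finder_record[key] for key in FINDING_FIELDS if key in finder_record}
    payload = {
        "finding": finding,
        "rule": rules_by_id[finding["rule_id"]],  # full rule body, looked up by ID
        "code_hunks": hunks,
    }
    return json.dumps(payload, indent=2)


# Invented data: the Finder record carries extra keys the Critic must not see.
finder_record = {
    "rule_id": "AUTH-007",
    "file": "auth.go",
    "line": 42,
    "explanation": "Session token is not verified before use.",
    "reasoning": "step-by-step notes from the Finder",
    "raw_response": "{...}",
}
rules = {"AUTH-007": {"id": "AUTH-007", "body": "Verify tokens before trusting claims."}}
critic_input = build_critic_input(finder_record, ["@@ -40,3 +40,5 @@ ..."], rules)
```

Using an allow-list rather than deleting known-bad keys means a new field added to the Finder's record cannot leak into the Critic by accident.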

The Judge stage is another discrete call. It receives the original finding JSON, the Critic’s output, and the code hunks. Its prompt is explicitly neutral. We instruct it to weigh the finding and the Critic’s output independently, without bias toward either the finding or the critique. The Judge outputs a verdict (confirmed, false positive, or needs review), a list of preconditions, and an exploit scenario (for confirmed findings only). It drops false-positive findings entirely, logging them along with its rationale for human review.
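A small sketch of how the Judge's three verdicts might be routed: confirmed and needs-review findings continue on, while false positives are dropped from the results but kept in a log with the rationale. The verdict strings, field names, and `route_verdicts` function are hypothetical; the article does not show the Judge's output format.

```python
from enum import Enum


class Verdict(Enum):
    CONFIRMED = "confirmed"
    FALSE_POSITIVE = "false_positive"
    NEEDS_REVIEW = "needs_review"


def route_verdicts(judged: list[dict]) -> tuple:
    """Split Judge output into findings to surface and a false-positive audit log."""
    to_surface, audit_log = [], []
    for item in judged:
        verdict = Verdict(item["verdict"])  # raises ValueError on an unknown verdict
        if verdict is Verdict.FALSE_POSITIVE:
            # Dropped from the results, but kept with its rationale for human review.
            audit_log.append({"finding": item["finding"], "rationale": item["rationale"]})
            continue
        entry = {
            "finding": item["finding"],
            "verdict": verdict.value,
            "preconditions": item.get("preconditions", []),
        }
        if verdict is Verdict.CONFIRMED:
            # Exploit scenarios accompany confirmed findings only.
            entry["exploit_scenario"] = item.get("exploit_scenario")
        to_surface.append(entry)
    return to_surface, audit_log


# Invented Judge output for three findings.
judged = [
    {"finding": "F1", "verdict": "confirmed", "rationale": "r1",
     "preconditions": ["attacker is authenticated"], "exploit_scenario": "s1"},
    {"finding": "F2", "verdict": "false_positive", "rationale": "test-only code"},
    {"finding": "F3", "verdict": "needs_review", "rationale": "r3"},
]
surfaced, false_positive_log = route_verdicts(judged)
```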

With the pipeline structure built, we started testing to learn which models were best suited for each stage. We cloned an internal GitHub repository, created five model profiles, and ran each profile against the same 10 PRs. With the results of those 50 PR scans, we analyzed the number of findings, overall cost, catch rate, and token cost per finding. One profile was the clear winner, and it became our default SAGE v1 pipeline.
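The comparison metrics can be sketched as a simple aggregation over each profile's scans. Everything below is illustrative: the numbers are made up, the profile names are placeholders, and "catch rate" is assumed to mean the share of known issues a profile found, which the article does not define.

```python
def summarize_profile(scans: list[dict]) -> dict:
    """Aggregate one profile's scans into the four comparison metrics."""
    findings = sum(scan["findings"] for scan in scans)
    cost = sum(scan["cost_usd"] for scan in scans)
    caught = sum(scan["known_issues_caught"] for scan in scans)
    known = sum(scan["known_issues"] for scan in scans)
    return {
        "findings": findings,
        "cost_usd": cost,
        # Assumed definition: known issues caught divided by known issues present.
        "catch_rate": caught / known if known else None,
        "cost_per_finding": cost / findings if findings else None,
    }


# Made-up results for two placeholder profiles, two PRs each.
scans_by_profile = {
    "profile-a": [
        {"findings": 4, "cost_usd": 0.5, "known_issues_caught": 2, "known_issues": 2},
        {"findings": 4, "cost_usd": 1.5, "known_issues_caught": 1, "known_issues": 2},
    ],
    "profile-b": [
        {"findings": 2, "cost_usd": 0.25, "known_issues_caught": 1, "known_issues": 2},
        {"findings": 2, "cost_usd": 0.25, "known_issues_caught": 1, "known_issues": 2},
    ],
}
summary = {name: summarize_profile(scans) for name, scans in scans_by_profile.items()}
```

In this invented data, profile-a catches more but costs more per finding, which is the kind of trade-off a side-by-side table makes visible.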

Our Finder uses a fast, mid-tier, cost-efficient model chosen for breadth and speed since the Finder's job is wide, inclusive discovery rather than airtight proof. Our Critic relies on a frontier reasoning model from a different provider than the Finder. We chose this model for two reasons: it doesn’t share the Finder’s blind spots and can adversarially pressure-test each finding. Finally, our Judge calls a powerful high-end reasoning model that weighs the finding against the critique and renders the final verdict. With separation and objectivity in place, we wired in the default profile and went live.
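A profile like this can be written down as plain data, with the one constraint the text states (the Critic comes from a different provider than the Finder) checked in code. The class names, provider labels, and model labels below are placeholders, and the Judge's provider is not stated in the article, so the value used here is arbitrary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageModel:
    provider: str
    model: str


@dataclass(frozen=True)
class Profile:
    finder: StageModel
    critic: StageModel
    judge: StageModel

    def validate(self) -> None:
        """Enforce that the Critic does not share a provider with the Finder."""
        if self.critic.provider == self.finder.provider:
            raise ValueError("Critic must use a different provider than the Finder")


# Placeholder labels only; no real provider or model names are implied.
default_profile = Profile(
    finder=StageModel("provider-a", "fast-mid-tier-model"),
    critic=StageModel("provider-b", "frontier-reasoning-model"),
    judge=StageModel("provider-a", "high-end-reasoning-model"),  # provider is arbitrary here
)
default_profile.validate()
```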

SAGE v1 runs in our largest repositories today, and already saves our ProdSec engineers hours of review time every week.

The third-party products we evaluated know general security canon, but SAGE knows 1Password.

By building our own reviewer on top of our historical expertise, we got a tool that speaks our languages and applies the judgment our ProdSec engineers spent years developing.

SAGE v0 proved the hypothesis: An LLM, equipped with 8,343 past reviews distilled into a ruleset, can surface real vulnerabilities (sometimes those missed by our incredible human engineers!) for pennies per scan. SAGE v1 made it production-ready: A vendor-agnostic Finder / Critic / Judge pipeline that separates recall from precision, hardens against prompt injection, and adjudicates every finding with built-in objectivity. This is all wrapped in a deterministic harness that allows us to easily migrate to the latest frontier AI models as soon as they become available.

But every version of SAGE to this point shares one limitation: It only scans PR diffs. Our engineers still have to bring the context no isolated change can capture: which directories are sensitive, where the trust boundaries sit, what guards already exist elsewhere. Giving SAGE that same awareness is where SuperSAGE comes in.

SAGE already knows 1Password. Soon, it will understand it.

We'll save those details for Part 2.


Source: 1password.com (original article)
