Build your own vulnerability harness

Project Glasswing reveals that frontier security models must be interchangeable in a model-agnostic harness to effectively scan enterprise codebases for vulnerabilities. The system, built from a 450-line security-audit skill, orchestrates multi-phase audits with interchangeable models to cross-check findings and eliminate false positives at scale.

A few weeks ago, we published our initial findings from Project Glasswing, looking at what happens when you point frontier security models at an enterprise codebase. We also explored how our defensive structures adapt to protect our infrastructure and customers from threats posed by frontier AI. Since then, the AI ecosystem has continued to shift rapidly: developers who've built tightly around a single model have already experienced what happens when that model is no longer available or gets superseded by a more capable one.

These market shifts only reinforce our core thesis: no matter which underlying model is leading the pack on any given day, the future of agentic workflows will not be found in standalone models, prompts, or single-agent sessions. Moving from a localized security "skill" to a continuous, fleet-wide scanning pipeline requires an architecture where models are treated as interchangeable components.

Relying on a single model inherently limits defensive coverage, as the same system will tend to look at code paths through the exact same lens. To counter this, models should be frequently interchanged and cross-tested. By varying the models across the pipeline (for example, using one model for initial discovery and an entirely different one for validation), we can ensure that vulnerabilities are cross-checked by distinct sets of logic. Furthermore, a true enterprise-scale harness must look beyond isolated repositories to trace vulnerabilities across cross-repo dependencies, ultimately filtering thousands of raw candidates down to a trusted, triaged queue of actionable fixes.

This post serves as a practical look at how to build that model-agnostic layer, focusing on how we manage state controls, eliminate false positives, and coordinate end-to-end triage at scale.

The first post made the case for why generic coding agents can't do this job.
The main issue is that agents only hold one hypothesis at a time, fill their context window after covering a sliver of a real repo, and then lose information during context compaction. For more details, read that post.

Before we move forward, we would like to answer two likely questions.

"Why not use subagents instead of a harness?"

Subagents are useful, and they are a good starting point. But security analysis needs hundreds of separate investigations that survive across runs, don't share a context window, and can be re-scoped and cross-referenced later. It needs persistence, deduplication, resumability, and eventually fleet-wide dependency tracing. That's an orchestration problem, and a prompt can't get you there.

"Is this blog post just an ad for frontier models?"

No. Our approach centers on the harness, not the model. When it comes to vulnerability discovery, we run it with whatever frontier model is currently best at what we need. When we point different models at the same target, they each turn up a different share of the bugs. The harness is the bit that lasts. If you build your own system, design it to be model-agnostic from day one, so that you are free to swap in any model you choose.

It all starts with a skill

We started with a ~450-line security-audit skill that we ran on a single repository, and adjusted the prompts until we surfaced real bugs. Later, we added the orchestration that became the plumbing of the entire system. The real value lives in the prompts themselves, and our prompts continue to carry the initial skill's attacker scenarios, bug classes, and anti-pattern detections nearly unchanged.

The skill was written to run a 7-phase audit in one session. Three parallel research agents do recon and write an architecture.md. One Hunter agent runs per attack class, trying to break the code rather than review it. Adversarial validators try to disprove each finding.
The survivors are written up as a human-readable vulnerability report. They're also emitted as findings.json against a schema, and a mechanical check validates that file. Finally, a fresh agent independently re-verifies every finding against the source. The surviving, re-verified findings are submitted to the ingest API.

That first skill maps almost directly onto the later harness:

| Skill phase | Harness stage |
|---|---|
| Recon agents write architecture.md | Recon |
| Hunters run per attack class | Hunt |
| Validators disprove findings | Validate |
| Surviving findings become a report | Report |
| findings.json is checked mechanically for schema adherence, not correctness | Mechanical validation of line numbers and functions in findings |
| Fresh agent re-verifies findings | Independent validation |

The skill worked, but it quickly revealed its limits. Looking at the coverage metrics, a single run finds only about half the bugs you'd catch across multiple runs. In our experience, the ones it did find skewed toward the simpler and less subtle. Once your process is basically "run it ten times and diff by hand," you probably need to start looking at a real harness.

While running and fine-tuning the skill, we ran into three walls:

Context exhaustion: An hour in, the context window fills up and the model will cannibalize its own memory, instantly forgetting the bugs it spent all morning tracking down. We broke this bottleneck by externalizing the state entirely, treating the LLM as a stateless compute engine.

ADVICE: A real but minimal harness consists of just Recon, Hunt, and Validate stages kept in a database, alongside a separate Validator that can't file its own findings. You should skip cross-repo tracing entirely until you have more than one repository that matters. Skip a dedicated Deduplication agent until you are actively drowning in noise.
Start with a skill in your development environment, get your prompts working well, and only build the next architectural stage when not having it is the specific thing slowing you down.

Codifying the skill into a pipeline

Most AI security write-ups in this space are about a single repo or a curated benchmark; running a whole fleet this way, with cross-repo tracing, isn't something we've seen written up elsewhere. Our codebase spans a massive mix of languages (Rust, Go, C, Lua, TypeScript and Python), alongside various configuration management systems, static configs, and all sorts of additional context. So we had to come up with something new that worked for us.

Going from that first slash-command run to a fleet scanner that could cover 128 distinct repos, automatically finding and interrogating relevant dependencies, took about six weeks. Codification was mostly mechanical: we lifted each phase of the skill into its own agent, put a database behind it and an orchestrator in front. The mapping was almost one-to-one.

The entire fleet runs on one unified harness with no per-language tuning and traces the dependencies between repos. While offloading syntax to a model makes the system language-agnostic, the differentiator is its ability to trace dependencies between repos. The harness itself doesn’t care if it’s looking at C pointers or a TypeScript file; it focuses on the higher-level logic of security orchestration. This allows us to scale across hundreds of different codebases without having to write custom language parsing.

A two-stage vulnerability research workflow

Our entire vulnerability research workflow is built on a two-stage operational framework: the Vulnerability Discovery Harness (VDH) and the Vulnerability Validation System (VVS). The VDH functions as our discovery engine, proactively scanning codebases to surface potential security issues.
Once bugs enter the VVS, which allows multiple harnesses to feed into it, they go through stages of Deduplication, Judgment, and finally Fixing, as we’ll talk about later.

We use one model for VDH, but we use a completely different model for VVS, so the models are effectively double-checking each other. There is an obvious security benefit to this: by forcing Model B (VVS) to judge the output of Model A (VDH), you ensure that the finding is evaluated by an entirely different set of logical weights and training data, one that acts as an unbiased, adversarial third party whose sole job is to ruthlessly stress-test Model A's assumptions.

And operationally, we benefit from treating model providers like interchangeable commodities. Model providers can change temperature, caching, and inference effort budgets over time, even within one model version. Instead of building a system that depends on a model behaving predictably over time, our harness is built to absorb downstream volatility without breaking.

Stage 1: Vulnerability Discovery Harness (VDH)

The first post covered what each agent/stage is for, so we'll talk about the parts it didn't: the glue between stages, and the handful of details that decide whether any of it works.

| Agent/stage | Primary role | Sub-agents / tooling |
|---|---|---|
| Recon | Maps out the target architecture and potential threat vectors | 3 parallel Recon sub-agents write architecture.md |
| Hunt | Runs per-class attacks, compiles fragments, probes binaries | Spawns siblings (roughly 9% of fleet-wide tasks, ranging up to about 20% depending on the model). Reaches out to and writes to the Wishlist tool. |
| Validate | Mechanically checks the finding, then adversarially disproves it | Runs in two passes: plain code handles the initial schema/path checks, then a single isolated agent tries to disprove the finding before it can be filed. |
| Gapfill | Generates new hunt tasks for empty coverage cells | Enqueues fresh hunt tasks for any under-tested area × attack-class cells that still look thin |
| Dedup | Identifies and consolidates overlapping findings | Combines deterministic code and agents to cluster findings by root cause, folding them together in real time |
| Trace | Walks dependency graph; spawns consumer-repo tasks | Walks the graph to add hunt tasks inside every identified consumer repo to make sure cross-repo bugs are caught |
| Feedback | Learns from pre-existing reports and optimizes future runs | Takes validation failures, shallow runs, and repeated misses, and instantly rewrites queued prompts to make future tasks sharper. |
| Report | Renders human-readable report | Just a script, no model required |

Table 1: Vulnerability Discovery Harness (VDH)

Stages four through eight run as a continuous producer-consumer loop. As the initial hunt progresses, the Gapfill, Feedback and Trace agents generate new tasks; Dedup folds overlapping findings back together and the rest of the loop keeps consuming the queue. This ensures a vulnerability discovered late in the cycle is still validated, reported, and checked against other code to make sure the same bug doesn't appear elsewhere, all within the same run.

Splitting the pipeline this way guarantees strict context controls. If you fill the context window, the model starts hallucinating. We keep each agent’s job hyper-focused, keeping context usage below 25% of the total window. A naive “read all files” approach will blow past this limit every single time.

One thing that caught us out was that persistence needs to be factored in before parallelism. You do not want to throw away a five-hour run because of an unforeseen error. Every stage writes to one SQLite database keyed by (run id, repo, stage). Any stage can resume, retry, or get pulled into a later run without redoing work.
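As a rough illustration of that persistence layer, here is a minimal sketch using Python's standard sqlite3 module. The table name, column names, status values, and helper functions are all assumptions made for the example; the post only states that one SQLite database is keyed by run id, repo, and stage.

```python
import json
import sqlite3

# Hypothetical schema: only the (run id, repo, stage) key comes from the post.
SCHEMA = """
CREATE TABLE IF NOT EXISTS stage_state (
    run_id  TEXT NOT NULL,
    repo    TEXT NOT NULL,
    stage   TEXT NOT NULL,
    status  TEXT NOT NULL,
    payload TEXT NOT NULL,
    PRIMARY KEY (run_id, repo, stage)
)
"""


def open_store(path=":memory:"):
    """Open (or create) the state database."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def save_stage(conn, run_id, repo, stage, status, payload):
    """Write one stage record and commit right away, so a crash loses little."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO stage_state "
            "(run_id, repo, stage, status, payload) VALUES (?, ?, ?, ?, ?)",
            (run_id, repo, stage, status, json.dumps(payload)),
        )


def load_stage(conn, run_id, repo, stage):
    """Return (status, payload) for a stage, or None if it never ran."""
    row = conn.execute(
        "SELECT status, payload FROM stage_state "
        "WHERE run_id = ? AND repo = ? AND stage = ?",
        (run_id, repo, stage),
    ).fetchone()
    return None if row is None else (row[0], json.loads(row[1]))


def run_stage(conn, run_id, repo, stage, work):
    """Run work() unless this stage already finished; return the stored result."""
    saved = load_stage(conn, run_id, repo, stage)
    if saved is not None and saved[0] == "done":
        return saved[1]  # resume path: skip work that already completed
    save_stage(conn, run_id, repo, stage, "running", {})
    result = work()
    save_stage(conn, run_id, repo, stage, "done", result)
    return result
```

Calling `run_stage` twice with the same key runs the work once and returns the stored payload the second time, which is the resume behavior described above. A real harness would likely also record individual tasks and findings, not just one row per stage.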
Findings are streamed and saved as they happen, so a crash costs you the task in flight and nothing else.

ADVICE: Sometimes a transient API error comes back as text in the 200 OK response stream instead of throwing a code exception. To the orchestrator, this looks exactly like a task that finished cleanly. You must explicitly classify the response text, not just trust the exception type, or you end up logging empty runs as successes.

During the Recon stage, the agent writes the threat model instead of being handed one. Beyond about ten built-in attack classes (many forms of injection, memory corruption, protocol parsing, timing side channels, and others), the Recon agent can invent repo-specific classes on the spot, each with its own methodology. It writes a custom taxonomy tailored specifically to that codebase, which is used to more tightly scope the Hunter agents.

Reading source code isn’t enough to understand how it behaves under stress, especially for subtle undefined-behavior bugs in C and other lower-level languages. The Hunter agents move past code reading and into active execution. They compile fragments, build small versions, and attack them. The biggest jump in quality came from giving Hunters a sandbox built on unshare in which to crash binaries.

ADVICE: If the harness itself runs inside Docker, that sandbox needs seccomp=unconfined and apparmor=unconfined or it will silently fail to start. It’s a one-line fix that saves you a day of head-scratching if, like us, you aren't an expert in nested containerization.

Micro-forks and the wishlist

Beyond the core pipeline stages, we added two specialized mechanisms that grant the Hunters significant autonomy to adapt their focus and request external resources without derailing an ongoing analysis:

Sibling Forking: This helps ensure that if a Hunter agent trips over an interesting code path that is outside the current scope, it doesn’t wander off track.
It uses a tool call to fork a sibling agent with a precise structural seed. Fleet-wide, this accounts for roughly 9% of tasks, though the rate is highly model-dependent: from near-zero to about a fifth, depending on which model is hunting.

The Wishlist: When an agent needs a tool it doesn't have, often a Validator confirming a Proof of Concept (PoC) or a Hunter wanting to build something (like a specific build environment, a VM, or some prod config files), it writes to a central wishlist. It provides enough context for the system to automatically re-run that exact task once a human provides the dependency. Some of these can be partly self-healing: if the container needs to be rebuilt with some changes, this can happen autonomously after the run by having a generic coding harness monitor the logs. The wishlist has been written to 25,472 times across 128 repos since it was added, and it's the main way the agents talk back to us. One that landed while we were writing this: "I need a FreeBSD VM to confirm this PoC end-to-end."

Fleet-wide cross-repo tracing

After the initial cleanup, a Tracer agent checks how different software components are connected. It looks for a specific path: can a potential attacker send harmful input from the outside to a vulnerable part of the system? If the answer is yes, the Tracer agent automatically spawns fresh hunt tasks inside the consumer repository. To make this work, you need a unified, cross-repo symbol index and an accurate dependency graph. This allows you to uncover deep, systemic flaws that a standard single-repo scan would miss.

Running our harness across an entire fleet of repos revealed two lessons that only surfaced when this was done at scale.

First, deduplication is its own problem, big enough to need its own agents. When you are scanning a handful of repositories, you can manually eyeball overlapping bugs. At fleet scale you can't, and simple string matching or file-path checks won't save you.
Determining whether two complex logic flaws are actually the exact same root bug sounds trivial, but it isn't. It requires so much reasoning that we had to deploy dedicated Dedup agents just to clean up the noise, along with their own heuristics and ways of reducing the work.

Second, don't wire in static analysis early. We plumbed Semgrep all the way through, and the Hunters invoked it zero times in a month of runs. They would rather read and run the code. The wishlist, by contrast, was the single most-used tool in the system. It's worth paying attention to what the agents actually reach for, rather than what you think they'll want.

Making findings you can trust

The agent will edit the source code so its own exploit works, then triumphantly report the bug it just created. It will write a test that proves something entirely tautological, like “exec executes things, therefore critical vulnerability”. Or it builds an exploit that runs fine but proves nothing, because the threat model behind it is nonsense. If your harness doesn't actively fight this, all you've built is a faster way to produce junk.

A Hunter has to state the threat model before it's allowed to file anything. It has to define exactly who the attacker is, and what boundary the vulnerability crosses or what assumption it breaks. The ordering of the output schema enforces this. This requirement eliminates the vacuous findings, the "if a user has database write access, they can write to the database" kind.

Every confirmed finding ships with a PoC written as a test that runs against the original, untouched codebase. This prevents the agent from editing the source files to force an exploit to land. If there is no working PoC, we treat the finding as fake. In practice, that's a Hunter compiling a thirty-line parsing loop, running it with memory protection enabled, and demonstrating that the incorrect read stride originates from a stack address rather than the expected message body.
You can re-run it yourself. Furthermore, every confirmed finding must also ship a proposed patch. What actually reaches our review queue is a verified bug, a working test, and a functional git diff, not just a vague text description of a problem.

Before an exploit path survives, deterministic checks (plain code, not another model) mechanically verify that the cited files and paths actually exist, and confirm that both the patch and the test parse correctly. An isolated Validator agent then tries to disprove the finding. This Validator cannot log findings of its own; its sole job is to aggressively disprove the Hunter's theory. If a Hunter is allowed to grade its own homework, it will confidently validate everything it outputs.

We don't claim a false-negative rate for our system. There's no labeled set of every real bug in a codebase, so any claimed recall number is entirely speculative. What we can watch is whether re-runs keep turning up new bugs (they do) and whether coverage is still growing across runs. It’s all a proxy, as you don’t know for sure how many bugs exist in a single codebase, but it’s a good-enough way of measuring effectiveness.

Stage 2: Vulnerability Validation System (VVS)

A finding coming out of the harness is just the start of the triage process, with all discoveries landing in a single, shared VVS that currently holds 13,841 findings across 145 repos in total. Triaging that volume is its own massive engineering problem, and it matters just as much as the hunting. That triage engine runs on a different model from the harness, broken down into three distinct jobs.
| Agent/stage | Primary role | Spawns / sub-agents / tooling |
|---|---|---|
| Dedup | Identifies whether a vulnerability is already in the system or already raised as an internal Jira ticket | Deterministic: plain code builds inverted indexes over files, functions, trust boundaries, and rare tokens, then hands each finding a short candidate list. Probabilistic: a Dedup agent reasons over that short list. Stable cross-run keys reopen existing records. |
| Judgment | Production reachability and validation | Single agent. Builds context about the bug from MCP servers to get the shape of what the service looks like in production. Searches the wiki, Jira, git, config, and all other available sources to understand whether a bug is truly applicable to our production environment, then scores the vulnerability against this. It also validates the bug against source code to check whether it still exists on the latest main branch. |
| Fixing | Generates patches, runs regression tests | Runs the regression test before and after the patch (filtered to the affected test; full suite only when per-test filtering isn't available). It requires a clean fail→pass flip on the target test to clear the gate. If the post-patch test fails, or if a global run detects downstream regressions, the commit is automatically blocked and flagged for human intervention. |

Table 2: Vulnerability Validation System (VVS)

Comparing every single finding against every other finding using an LLM scales at O(N^2), which falls apart completely at scale. To keep the model off the critical path, deterministic code builds inverted indexes over the structured data (touched files/functions, trust boundary, rare tokens) to generate a short list of real candidates. Only then does an agent look at that short list to see if a single fix would close several of them. Stable cross-run keys ensure re-found bugs reopen existing records rather than spawning new ones.

Judgment is a second, independent pass over what survived.
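To make the deterministic half of Dedup concrete, the sketch below builds an inverted index over structured finding fields and returns a short candidate list for one finding, so the agent only reasons over a few likely matches instead of all pairs. The field names (`files`, `functions`, `trust_boundary`, `rare_tokens`), the overlap threshold, and the list size are assumptions for illustration, not the schema or tuning used in the VVS.

```python
from collections import defaultdict


def feature_keys(finding):
    """Flatten a finding's structured fields into hashable index keys."""
    keys = set()
    for path in finding.get("files", []):
        keys.add(("file", path))
    for name in finding.get("functions", []):
        keys.add(("func", name))
    if finding.get("trust_boundary"):
        keys.add(("boundary", finding["trust_boundary"]))
    for token in finding.get("rare_tokens", []):
        keys.add(("token", token))
    return keys


def build_index(findings):
    """Map each feature key to the ids of the findings that contain it."""
    index = defaultdict(set)
    for finding_id, finding in findings.items():
        for key in feature_keys(finding):
            index[key].add(finding_id)
    return index


def candidates(finding_id, findings, index, min_shared=2, limit=5):
    """Return up to `limit` other findings that share at least `min_shared`
    features with `finding_id`, ordered by overlap (largest first)."""
    overlap = defaultdict(int)
    for key in feature_keys(findings[finding_id]):
        for other in index.get(key, ()):
            if other != finding_id:
                overlap[other] += 1
    ranked = sorted(
        (other for other, shared in overlap.items() if shared >= min_shared),
        key=lambda other: (-overlap[other], other),
    )
    return ranked[:limit]
```

Each lookup touches only the findings that share a key with the query, which is what keeps the model away from the all-pairs comparison. Whether two candidates are really the same root bug is still left to the Dedup agent.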
The Judgment agent rechecks the latest information, pulling from deployment, environment, and config context to determine if the code path is reachable in prod, and identifies the repo owner. This process separates "exploitable now" from "real but latent" and from "real but filed against the wrong component." It turns a pile of chaotic findings into a risk-driven orchestration workflow.

The Fixer takes the proposed patch and unit tests, rewrites them to match the repo’s style, applies the diff, and runs targeted tests. A clean fail→pass flip is the ideal and the only case that clears the gate automatically; a failing post-patch test blocks the commit. The Fixer never merges code on its own; a human must review the branch. This gate is the non-negotiable, human-in-the-loop safeguard that enables a clean cryptographic trail for change management compliance. Left to patch freely, a model will happily fix a security bug while quietly breaking an unrelated feature or adding dozens of new bugs.

Across all three triage jobs, each agent is confined to one narrow task wrapped in deterministic bookkeeping code, and nothing writes to production without a human signing off on a dry run. While this pipeline moves the engineering bottleneck from finding bugs to reviewing and landing fixes, the Fixer remains the youngest and slowest part of the system.

Running hundreds of agents over a fleet of repos is not cheap, but at least the shape of the spend is predictable. Almost all of the compute budget goes directly into the hunt stage. This makes Gapfill our cost-to-coverage lever, as each additional pass costs roughly half as much as the initial hunt. Because the cost per repository varies wildly, we budget per repo rather than per run. We enforce a strict task cap per repository and spin up a worker pool of anywhere from 50 to 200 workers. That way you can spend money on the repos that are actually finding things, and not waste it on the ones that aren't.
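The per-repo budgeting described above could be sketched roughly as follows. The task shape, the function names, and the use of a thread pool are assumptions for the example; only the idea of a per-repository task cap and a pool of 50 to 200 workers comes from the post.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def apply_repo_caps(tasks, cap_per_repo):
    """Keep at most `cap_per_repo` tasks per repository, preserving queue order."""
    taken = Counter()
    kept = []
    for task in tasks:
        repo = task["repo"]
        if taken[repo] < cap_per_repo:
            taken[repo] += 1
            kept.append(task)
    return kept


def run_capped(tasks, run_task, cap_per_repo, workers=50):
    """Run the capped task list on a bounded worker pool.

    Results come back in the same order as the kept tasks.
    """
    kept = apply_repo_caps(tasks, cap_per_repo)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_task, kept))
```

A fixed cap is the simplest version. Raising the cap for repos that keep producing findings would be one way to steer spend toward them, though how the cap is adjusted in practice is not described in the post.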
Cost is also why, for us, the big scans are a periodic backlog sweep and not a per-PR check. A full scan of a complex repo can take hours; the worst run took just over 14 hours. Cheaper, smaller harnesses are the right tool for per-PR checks.

We measure our system’s effectiveness by tracking how efficiently our automated pipeline filters deliberate engineering noise into high-quality, actionable findings. Because we intentionally tune our Hunters to over-report subtle primitives that could be chained into larger attacks, our true indicator of success is how sharply we can refine that initial mountain of raw data before it ever reaches a human.

To gauge this, we track exactly how many raw findings survive each validation stage over time. Thanks to better context injection from our Recon phase, our initial validation rejection rate dropped from 40% down to 11%, while the share of high-integrity findings climbed from 35% to 58% (representing ~12,057 lifetime findings).

Here's the lifetime breakdown from raw candidates to actionable findings, at the point in time this blog post was written.

Vulnerability Discovery Harness (VDH)

- Raw candidates: Everything the discovery harness emitted before independent validation.
- Needs repro: Findings that appeared plausible but required manual reproduction before being trusted.
- Rejected at validation: The validator disproved the threat model, exploit path, affected code, or evidence.
- Duplicates: Candidates collapsed onto another finding from the same harness.
- Survived validation: Findings that passed the independent validation gate and moved into the VVS.
- Bugs that went elsewhere: Findings deliberately routed outside this flow.

Vulnerability Validation System (VVS)

- Another vulnerability harness: Other automated sources feeding the same validation system.
- Total bugs in system: The combined pool after ingest.
- Duplicates: Findings the dedup pass identified as already covered by another canonical finding or ticket.
- Wrong repo / other / not a risk: The noise bucket: misattributed findings, defense-in-depth, or latent risks.
- Bugs sent to teams: Finalized, clean findings ready for remediation.
- Judged Internet-exploitable: High-urgency findings a realistic attacker could trigger in production.
- Not judged Internet-exploitable: Lower-urgency, actionable bugs (production issues, dependency risks, or config errors).
- Final severity split: The categorization used to assign priority for the engineering teams.

The core metric of the harness isn’t a speculative recall score; it’s keeping the number of unconfirmed findings in front of real humans as close to zero as possible. The architecture needs to be a relentless filtering funnel. Out of 20,799 raw candidates generated by VDH, only about 12,057 survived validation. When these were pushed into the VVS, joining findings from another harness, the central pool was brought to 13,841. The Dedup agent folded away 5,442 findings as duplicates. 1,154 were routed to the queue as ‘wrong-repo’ or ‘low-risk’ and were recycled back into the system where appropriate. Ultimately this left 7,245 actionable findings for engineering teams to act on.

Traditional compliance rules dictate arbitrary remediation windows based entirely on a static CVSS score (e.g., "Fix all Highs in 30 days"). Our contextual judgment layer turns this compliance checkbox into actual risk management. The architecture is capable of tracking findings back to their origin, meaning that fixing a single root cause resolves an entire cluster of findings rather than just patching individual issues.

VDH system performance is also measured by dividing repos into area × attack-class cells and running the Gapfill agent iteratively until it stops producing findings. Whenever we update an underlying prompt, we test it against a held-out repository to see if that total coverage cell number actually moves.
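The area × attack-class grid can be pictured as a small coverage table. The sketch below is one possible way to find thin cells and turn them into new hunt tasks. The `min_tasks` threshold, the task dictionary shape, and the example area names are assumptions, since the post does not specify how "thin" is decided.

```python
from itertools import product


def thin_cells(areas, attack_classes, completed, min_tasks=1):
    """Return (area, attack_class) cells with fewer than `min_tasks` completed hunts.

    `completed` maps (area, attack_class) to a count of finished hunt tasks.
    """
    return [
        cell
        for cell in product(areas, attack_classes)
        if completed.get(cell, 0) < min_tasks
    ]


def coverage(areas, attack_classes, completed, min_tasks=1):
    """Fraction of cells that have reached `min_tasks` completed hunts."""
    total = len(areas) * len(attack_classes)
    if total == 0:
        return 0.0
    thin = len(thin_cells(areas, attack_classes, completed, min_tasks))
    return (total - thin) / total


def gapfill_tasks(repo, areas, attack_classes, completed, min_tasks=1):
    """Build one new hunt task per thin cell (hypothetical task shape)."""
    return [
        {"repo": repo, "stage": "hunt", "area": area, "attack_class": attack_class}
        for area, attack_class in thin_cells(areas, attack_classes, completed, min_tasks)
    ]
```

Tracking the `coverage` value on a held-out repository before and after a prompt change is one simple way to see whether the covered-cell number moves.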
The harness wires automated health signals to catch system failures early in the pipeline. If a hunt finishes suspiciously fast and fails to spawn sub-hunts or gap tasks, it usually indicates a crashed dependency rather than a clean codebase. To remedy this, the system flags any Hunter agent that finishes with zero findings as “shallow” and immediately requeues it for a new run.

Finally, our system’s robustness is reinforced by the independent triage pass described earlier. By re-judging all submissions with a different model and separate logical weights, we ensure an unbiased, adversarial verification that is decoupled from the specific model used for discovery, providing a trust layer that persists regardless of which model is in use.

None of this is finished. We change our system constantly, and it is nowhere near a perfect science. But raw candidate findings are cheap now, and the only work worth doing is turning them into sound, verifiable code fixes. Building your own harness means accepting that AI models are volatile, but your orchestration layer doesn't have to be. By decoupling your security logic from any single provider, forcing adversarial verification, and automating your triage pipeline, you can turn a mountain of LLM noise into a reliable, fleet-wide defense engine.

Our “North Star” metrics: measuring real-world velocity

Every codebase is a little different, so to show you how this actually works in the real world, we mapped out a realistic benchmark based on a standard repo run. Keep in mind that this represents a single pass on one repo; over time, as the continuous fleet-wide loop deduplicates, filters, and recycles findings, it reduces the volume of lifetime candidates by roughly 65%.
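As a sanity check, the lifetime funnel numbers quoted earlier in this post can be worked through directly. The arithmetic below uses only those published figures. Reading the "roughly 65%" reduction as actionable findings relative to raw VDH candidates is our interpretation, and it is approximate because the VVS pool also includes findings from another harness.

```python
# Lifetime figures quoted in the post.
raw_candidates = 20_799          # emitted by the VDH
survived_validation = 12_057     # passed the VDH validation gate
pool_after_ingest = 13_841       # VVS pool, including another harness
duplicates = 5_442               # folded away by Dedup
wrong_repo_or_low_risk = 1_154   # routed to the noise bucket

# Findings left for engineering teams after dedup and noise filtering.
actionable = pool_after_ingest - duplicates - wrong_repo_or_low_risk

# Share of raw candidates that survived validation.
survival_share = survived_validation / raw_candidates

# Findings contributed by the other harness feeding the VVS.
from_other_harness = pool_after_ingest - survived_validation

# Reduction from raw candidates to actionable findings (interpretation).
reduction = 1 - actionable / raw_candidates

print(f"actionable findings:  {actionable}")
print(f"survival share:       {survival_share:.1%}")
print(f"from other harness:   {from_other_harness}")
print(f"overall reduction:    {reduction:.1%}")
```

The subtraction reproduces the 7,245 actionable findings, the survival share rounds to the 58% quoted above, and the overall reduction comes out near 65%.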
Engineering hours saved via automated patching:

Rather than focusing on static baselines, we measure the health of our pipeline by its technical throughput, processing velocity, and its ability to eliminate the manual triage bottleneck:

- Initial Validation Cut: For a standard repository (~30k lines of code), a run yields roughly 100 initial findings, with a full run taking 3-4 hours, maintaining a hyperfocused context window throughout.
- Compression: The Deduplication and Contextual Judgment Layers process these candidates in parallel. Within 3 hours, the system compresses and refines the batch of findings from ~100 raw candidates to 80 distinct, high-fidelity bugs.
- Remediation: The automated Fixer processes these 80 distinct bugs at an average rate of 5 minutes per bug.

In total, the system can discover, validate, deduplicate, and open functional pull requests in approximately 14 hours.

Shrinking mean-time-to-resolve for critical flaws:

Of course, you can’t dump 80 patches into production all at once without breaking things. To keep deployments safe, our system uses a tiered rollout:

- Critical Exposure Containment: The system isolates the critical, high, and exploitable bugs (avg. 10 out of 80). We fast-track these for a human review and introduce them into release cycles, getting them fully patched in production in 5 days.
- Incremental Hardening: The remaining latent risks, minor config anomalies, and lower-urgency bugs are incrementally rolled into prod over a 15-20 day window to guarantee platform stability.

How we’re handling all of this patching

These findings are the result of an isolated, ring-fenced research experiment designed to stress-test our code. They do not represent active, unpatched vulnerabilities in our live production environment. Because the harness runs constantly in our test environments, these specific numbers are completely out of date by the time you're reading this.
Every single bug surfaced by the pipeline came attached to a working test case to demonstrate the bug and a draft patch. Our security teams are systematically processing the reports and applying the necessary fixes, meaning the Cloudflare products you use every day are already actively hardened against these vectors.

Along with this blog post, we’re releasing the initial skill we used to develop the harness. It’s been slightly cleaned up before release so it’s easier to understand and integrate, but the skill itself remains substantially the same. Hopefully the harness itself will follow shortly. This could be a starting point for your own vulnerability harness, your own skill, or whatever suits your needs best: github.com/cloudflare/security-audit-skill

If your team is working on the same problems and would like to compare notes, reach out to us at email protected.