{"slug": "xints-false-positive-rate-methodology-and-purpose", "title": "Xint’s False Positive Rate: Methodology and Purpose", "summary": "Xint, an LLM-based security tool developed by Theori, reports a false positive rate of less than 25% for its highest-severity vulnerability findings, based on manual validation through proof-of-concept testing. For example, scanning 600,000 lines of OpenSSL code yielded 17 critical flags, with 14 successfully validated (18% FP rate), while scans of Ghostscript and Postgres produced FP rates of 19% and 22% respectively. The company emphasizes that a 0% false positive rate is not the goal, as practical security requires focusing on severe, exploitable threats rather than flagging every possible bug.", "body_md": "Xint’s False Positive Rate: Methodology and Purpose\nHigh Profile Stories About LLMs Finding 0days Miss the 800lb Gorilla In the Room: False Positives\nLast year as the security researchers at Theori developed what would become Xint, they were able to win several high profile competitions like Zeroday.cloud, DEF CON CTF, and AIxCC using LLMs, often with little or no human intervention. But it’s been the launch of Mythos and now Daybreak this year that has really forced the potential of LLMs to secure and attack codebases to the forefront of CISO priorities.\nThese models have demonstrated impressive capabilities in finding bugs that would have been hard even for a human to find. But these news stories raise several critical questions:\nWhat was the total number of vulnerabilities flagged as severe by the model?\nHow much work went into validating how many of those flagged vulns were true versus false?\nAnd what share of these findings ended up being false positives?\nThis is not an academic exercise. 
One of the largest challenges for product security teams is sifting through false positives. In traditional SAST tools, false positives can account for over 80% of total findings and often cause burnout in the small teams tasked with discovery, triage, and remediation.\nIt seems obvious, but the goal of AppSec in the real world isn’t to flag as many potential bugs as possible; it’s to find the vulnerabilities that attackers would target and that could cost the business.\nXint Has a <25% False Positive Rate\nXint can also flag hundreds or even thousands of potential bugs, but given our team’s experience in practical security, our interface has been designed specifically to 1) automatically rank bugs by severity, 2) provide trigger conditions to expedite validation, and 3) include the potential impact of an exploit, so teams can focus limited bandwidth only on severe threats.\nWhen we say we have less than a 25% false positive rate, we arrive at this answer through the following methodology:\nOut of the hundreds of findings per scan, we select those flagged as moderate- to high-severity, similar to how product security teams would want to focus only on important findings.\nOur researchers then try to build a POC for each item, often just by supplying the trigger conditions included in every Xint finding. This usually takes 15-20 minutes of human researcher time per item tested, compared with the days or even weeks that per-bug validation used to take.\nWe then divide the number of items that had a successful POC by the total number of items tested to arrive at our true positive rate; the false positive rate is the remainder (1 minus the true positive rate).\nFor example, as part of DARPA’s post-AIxCC bounty program, we scanned 600k lines of OpenSSL. In less than 6 hours, Xint Code flagged 411 possible vulns, of which 17 were flagged as critical. 
Our researcher was able to validate (that is, generate a successful POC for) 14 of those 17, leaving 3 false positives, for an FP rate of 18%.\nSimilarly, after running a scan of Ghostscript as part of our paper recreating Mythos results, out of 21 of the highest-severity bugs flagged by Xint, 4 were FPs (19% FP rate). The Postgres bug that was part of our winning submission at ZDC came as a result of scanning nearly 1 million lines of code in under 12 hours, which resulted in 27 possible vulnerabilities flagged as severe, of which only 6 were FPs (22% FP rate).\nOf course, it’s worth pointing out that our methodology is based on manually triaging only the highest-severity bugs, so it is possible that FPs are not uniformly distributed across severities. We maintain, however, that focusing on the most severe vulnerabilities is consistent with real-world circumstances, where both defenders and attackers are just looking for the most severe exploits.\nWhy 0% FP Is Not the Goal\nObviously, an FP rate that is too high is harmful because it leads to the “boy who cried wolf” syndrome: security teams just stop paying attention to the alerts.\nBut at the other extreme, if you only report bugs that can be reproduced, you're going to miss security issues that ought to be fixed. You don’t want your FP rate to be 0% because that means you’re likely missing true positives, including significant bugs. Both near-misses and \"safe in configuration X but not configuration Y\" findings are things that one ought to be informed of, even if they are false positives in a strict sense.\nExploits or POCs can be useful for prioritization in lower-volume scenarios, but when bug volume is high (dozens or hundreds), using exploitability as a prioritizer effectively becomes a harsh filter. Defenders primarily need convincing evidence of impact plus remediation help, not necessarily a full exploit. 
As a result, at enterprise scale it is better to have solid confidence scores and to triage based on estimated severity/impact.\nThe Right Approach to False Positives in the Era of AI Vuln Discovery\nIn this new era of high-volume AI-driven vuln discovery, teams should focus more on reliable static analysis, impact estimation, and low-FP scaffolding instead of leaning heavily on \"can we exploit it?\" as the decider. Exploit skills remain valuable: Xint values and publishes exploits, sharing technical deep-dives on heap grooming, RCE techniques, and more. But those exploits serve to demonstrate impact, not to act as the gatekeeper for whether a vuln is \"real\" or worth fixing. Over-relying on them for sorting would slow things down and miss bugs.", "url": "https://wpnews.pro/news/xints-false-positive-rate-methodology-and-purpose", "canonical_source": "https://xint.io/blog/xints-false-positive-rate", "published_at": "2026-05-18 15:52:00+00:00", "updated_at": "2026-05-20 08:41:10.220713+00:00", "lang": "en", "topics": ["cybersecurity", "artificial-intelligence", "large-language-models", "research", "products"], "entities": ["Xint", "Theori", "Mythos", "Daybreak", "LLMs", "CISO", "Zeroday.cloud", "DEF CON CTF"], "alternates": {"html": "https://wpnews.pro/news/xints-false-positive-rate-methodology-and-purpose", "markdown": "https://wpnews.pro/news/xints-false-positive-rate-methodology-and-purpose.md", "text": "https://wpnews.pro/news/xints-false-positive-rate-methodology-and-purpose.txt", "jsonld": "https://wpnews.pro/news/xints-false-positive-rate-methodology-and-purpose.jsonld"}}