The standard way to score AI agent monitors is gameable a coin flip scores F1 0.88

wpnews.pro

cd /news/artificial-intelligence/the-standard-way-to-score-ai-agent-m… · home › topics › artificial-intelligence › article

[ARTICLE · art-42619] src=dev.to ↗ pub=2026-06-28T15:15Z topic=artificial-intelligence verified=true sentiment=· neutral

The standard way to score AI agent monitors is gameable a coin flip scores F1 0.88

A developer created a new benchmark for evaluating AI agent monitors, finding that standard scoring methods are gameable—a coin flip achieves an F1 score of 0.88. After correcting the metric to reward only detections on the exact drift step, the coin flip dropped to 0.19, and baseline scores revealed that production verifiers with tight scope perform best (F1 0.843) but still struggle with continuous drift detection. The developer's dataset includes 513 trajectories with five types of drift, and results show that structural features like unusual privilege accesses are more effective than continuous drift measures.

read5 min views1 publishedJun 28, 2026

Traditionally, evaluation of the agent monitoring mechanisms involves an attempt to game them, as it was my case when I attempted to test whether monitors would be able to identify the problem in the run and not in the beginning. The input prompt may look perfect until a certain issue pops up down the line, such as using the wrong file or changing the scope of the task execution. Single pass filter would not identify it since it does not consider the steps of the procedure in order.

There are available datasets for the agent-based tasks, yet they focus on detecting whether the agent completes the task or gets hacked rather than whether the agent monitor reacts timely and correctly to the situation. Thus, I created one that takes into account complete trajectories with labeled steps in it. It consists of five types of drift that remain hidden until they appear – tool-call misuse, goal shift, plan execution mismatch, agent to agent coercion and capability laundering. The measured dataset is the reviewed gold split: 513 trajectories, 453 adversarial and 60 benign controls.

The clear winner in that scoring system was whatever fired before the bad step was hit, as an early detection. This made random guessing seem quite powerful since early detections on normal steps were being rewarded based on this system a coin flip would get F1 of 0.88. Once I modified that and said only the very first detection on the drift step is a true positive and any other detection on normal step is a false alarm, those numbers took a dive: the coin flip gets 0.19 now, and all other numbers are now making sense. I personally prefer the scoring system which does not reward trigger happy behavior.

It seems like the monitors are still confusing regular steps with drifts even after the adjustment. It was harder to distinguish some of the drifts from others. Not sure how this affects the real-life deployment.

Here are the baseline scores on gold split using the correct metric:

Random (p=0.15): F1 0.189, precision 0.115, recall 0.530, FP-step 0.305

Cosine-to-goal: F1 0.000, precision , recall , FP-step 0.968

GPT-4o-mini judge: F1 0.672, precision 0.671, recall 0.674, FP-step 0.059

Production verifier (full manifest): F1 0.184, precision 0.730, recall 0.105, FP-step 0.011

Production verifier (tight scope): F1 0.843, precision 0.742, recall 0.977, FP-step 0.082

The random method at such probability achieves roughly 0.189 F1, whereas the cosine monitor reaches 0 across the board and triggers on virtually any step (roughly 97%). This value is notable since the simple angle check can not distinguish ordinary steps from the drifts, making the threshold meaningless. GPT-4o mini performs well on balanced scores (0.672), but the production verifier line seems more relevant. With the full manifest the false positive rate becomes very small (0.011, i.e. 59 of 60 benign controls pass) but the recall stays poor (0.105). It seems as if the entire tradeoff revolves around how strict you get with the permitted features.

The trend continued when the same verification tool was used for the full set of adversarial traces. Every harmful trace lay out of the detection bounds the aggregated metric would identify all 453 harmful traces complete below the band (0 of 453, with a Wilson 95% confidence interval of [0.000, 0.008]). The threshold on drift alone cannot detect most of the attacks, bypassing 87.2% of them at the one percent false positive rate. It is the unique structural features such as unusual privilege accesses or secret reads followed by external calls that really stop the attacks.

The continuous drift measure remains silent. The next step is testing the same verifier on the bigger MonitoringBench dataset in order to test whether my set of rules is driving the results. Over all 2,924 attack sequences (around 93,000 tool call steps), the aggregate score failed to reach the kill level for all of them, while the structural score detected 90.4% (Wilson [89.3%, 91.4%]) of attacks. However, the rest of them, around 9.6%, appear to be related to the contents which do not satisfy any of the pattern rules.

In general, it looks like structural detectors do the job currently and add almost no discrimination power per step for the rest of them. The only gap left is the lack of content signal which would spike on real malicious contents but not on changes in general. Of course, one may object that this problem shifts the problem to a perfect classifier per step, which is somewhat true.

Another limitation is that the gold labels are based on one pass and the second review is yet to be done, hence the lack of an agreement number so far. The data is also English first, and the machine flagged tier was not included in the computation. The verifier used itself is just one system, so it may produce different results for other systems .It might be good to run the provided code on various monitors to see where it fails.

The dataset, the evaluation harness, and the preprint are all publicly available:

Dataset (CC-BY 4.0): [https://huggingface.co/datasets/jash-ai/agentic-redteam-benchmark](https://huggingface.co/datasets/jash-ai/agentic-redteam-benchmark)

Code & eval harness (MIT): [https://github.com/Alkur123/agentic-redteam-benchmark](https://github.com/Alkur123/agentic-redteam-benchmark)

Preprint (Zenodo): [https://doi.org/10.5281/zenodo.20995496](https://doi.org/10.5281/zenodo.20995496)

Each row above reproduces with python eval.py -baseline <name>

(random, cosine, gpt4, ring12), which by default evaluates the gold split. If you were to implement your own agent monitor system, the most valuable would be to test it yourself and let me know what breaks, especially if there is a counterexample that is sub-band, lacks any sort of structure, yet harmful. This is the scenario I most want to break.

source & further reading

dev.to — original article Cutting Through the Noise: The 2026 AI Coding Subscription Guide Building Identity-Gated Refusal Tiers for AI Security Tools Evaluating Large Language Models: The Overfitting Problem

~/api · this article 200

$curl api.wpnews.pro/v1/news/the-standard-way-to-scor…

Read original on dev.to → dev.to/alkur_jaswanth_ce4f9fc791/the-standard-wa…

mentioned entities

GPT-4o-mini

MonitoringBench

Wilson

metadata

slugthe-standard-way-to-score-ai-agent-monitors-is-gameable-a-coin-flip-scores-f1-0

topic#artificial-intelligence

secondary4 topics

sentimentneutral

canonicaldev.to

navigation

← prevV.E.L.O.C.I.T.Y.-OS: The Synapti…

next →Multi-Agent Systems in Productio…

── more in #artificial-intelligence 4 stories · sorted by recency

lesswrong.com · 28 Jun · #artificial-intelligence

Evaluating Offline Monitoring of Internal AI Agents

thenewstack.io · 28 Jun · #artificial-intelligence

“Bring it to our shop”: Workday’s pitch for keeping AI agents close to your most valuable data

github.com · 28 Jun · #artificial-intelligence

Hestia – a local-first Home Assistant that trusts timers over the LLM

dev.to · 28 Jun · #artificial-intelligence

Multi-Agent Systems in Production: When One Agent Isn't Enough and How We Coordinate Them

── more on @gpt-4o-mini 3 stories trending now

wpnews · 25 May · #artificial-intelligence

Maia-3: free and open source

wpnews · 28 May · #ai-startups

[AINews] Cognition raises $1B in $26B Series D

wpnews · 5 Jun · #ai-agents

Miasma Worm Targets AI Coding Agents via GitHub Repos

sponsored brought to you by zahid.host 4,200+ EU-deployed projects

reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain — perfect for shipping the agent you just read about.

$git push zahid main

→ Live at https://your-agent.zahid.host ✓

Get free account → Pricing

from €0/mo · no card required