The standard way to score AI agent monitors is gameable a coin flip scores F1 0.88

A developer created a new benchmark for evaluating AI agent monitors, finding that standard scoring methods are gameable—a coin flip achieves an F1 score of 0.88. After correcting the metric to reward only detections on the exact drift step, the coin flip dropped to 0.19, and baseline scores revealed that production verifiers with tight scope perform best (F1 0.843) but still struggle with continuous drift detection. The developer's dataset includes 513 trajectories with five types of drift, and results show that structural features like unusual privilege accesses are more effective than continuous drift measures.

Traditionally, evaluation of the agent monitoring mechanisms involves an attempt to game them, as it was my case when I attempted to test whether monitors would be able to identify the problem in the run and not in the beginning. The input prompt may look perfect until a certain issue pops up down the line, such as using the wrong file or changing the scope of the task execution. Single pass filter would not identify it since it does not consider the steps of the procedure in order. There are available datasets for the agent-based tasks, yet they focus on detecting whether the agent completes the task or gets hacked rather than whether the agent monitor reacts timely and correctly to the situation. Thus, I created one that takes into account complete trajectories with labeled steps in it. It consists of five types of drift that remain hidden until they appear – tool-call misuse, goal shift, plan execution mismatch, agent to agent coercion and capability laundering. The measured dataset is the reviewed gold split: 513 trajectories, 453 adversarial and 60 benign controls. The clear winner in that scoring system was whatever fired before the bad step was hit, as an early detection. This made random guessing seem quite powerful since early detections on normal steps were being rewarded based on this system a coin flip would get F1 of 0.88. Once I modified that and said only the very first detection on the drift step is a true positive and any other detection on normal step is a false alarm, those numbers took a dive: the coin flip gets 0.19 now, and all other numbers are now making sense. I personally prefer the scoring system which does not reward trigger happy behavior. It seems like the monitors are still confusing regular steps with drifts even after the adjustment. It was harder to distinguish some of the drifts from others. Not sure how this affects the real-life deployment. Here are the baseline scores on gold split using the correct metric: Random p=0.15 : F1 0.189, precision 0.115, recall 0.530, FP-step 0.305 Cosine-to-goal: F1 0.000, precision , recall , FP-step 0.968 GPT-4o-mini judge: F1 0.672, precision 0.671, recall 0.674, FP-step 0.059 Production verifier full manifest : F1 0.184, precision 0.730, recall 0.105, FP-step 0.011 Production verifier tight scope : F1 0.843, precision 0.742, recall 0.977, FP-step 0.082 The random method at such probability achieves roughly 0.189 F1, whereas the cosine monitor reaches 0 across the board and triggers on virtually any step roughly 97% . This value is notable since the simple angle check can not distinguish ordinary steps from the drifts, making the threshold meaningless. GPT-4o mini performs well on balanced scores 0.672 , but the production verifier line seems more relevant. With the full manifest the false positive rate becomes very small 0.011, i.e. 59 of 60 benign controls pass but the recall stays poor 0.105 . It seems as if the entire tradeoff revolves around how strict you get with the permitted features. The trend continued when the same verification tool was used for the full set of adversarial traces. Every harmful trace lay out of the detection bounds the aggregated metric would identify all 453 harmful traces complete below the band 0 of 453, with a Wilson 95% confidence interval of 0.000, 0.008 . The threshold on drift alone cannot detect most of the attacks, bypassing 87.2% of them at the one percent false positive rate. It is the unique structural features such as unusual privilege accesses or secret reads followed by external calls that really stop the attacks. The continuous drift measure remains silent. The next step is testing the same verifier on the bigger MonitoringBench dataset in order to test whether my set of rules is driving the results. Over all 2,924 attack sequences around 93,000 tool call steps , the aggregate score failed to reach the kill level for all of them, while the structural score detected 90.4% Wilson 89.3%, 91.4% of attacks. However, the rest of them, around 9.6%, appear to be related to the contents which do not satisfy any of the pattern rules. In general, it looks like structural detectors do the job currently and add almost no discrimination power per step for the rest of them. The only gap left is the lack of content signal which would spike on real malicious contents but not on changes in general. Of course, one may object that this problem shifts the problem to a perfect classifier per step, which is somewhat true. Another limitation is that the gold labels are based on one pass and the second review is yet to be done, hence the lack of an agreement number so far. The data is also English first, and the machine flagged tier was not included in the computation. The verifier used itself is just one system, so it may produce different results for other systems .It might be good to run the provided code on various monitors to see where it fails. The dataset, the evaluation harness, and the preprint are all publicly available: Dataset CC-BY 4.0 : https://huggingface.co/datasets/jash-ai/agentic-redteam-benchmark https://huggingface.co/datasets/jash-ai/agentic-redteam-benchmark Code & eval harness MIT : https://github.com/Alkur123/agentic-redteam-benchmark https://github.com/Alkur123/agentic-redteam-benchmark Preprint Zenodo : https://doi.org/10.5281/zenodo.20995496 https://doi.org/10.5281/zenodo.20995496 Each row above reproduces with python eval.py -baseline <name random, cosine, gpt4, ring12 , which by default evaluates the gold split. If you were to implement your own agent monitor system, the most valuable would be to test it yourself and let me know what breaks, especially if there is a counterexample that is sub-band, lacks any sort of structure, yet harmful. This is the scenario I most want to break.