The Saturation Trap and the Subjectivity of Intervention Timing: Why Affect-Based Triggers and LLM Judges Fail to Time Interventions on Autonomous Agents

A new study using the HEART affective-dynamics engine to evaluate intervention triggers on autonomous agents found that state-based detectors fire on 39-83% of actions because of a "State Saturation Trap": agents show no recovery signal under sustained difficulty, so modeled frustration stays pinned at its maximum. LLM judges performed poorly: a small model (gpt-5.4-mini) never fired, and frontier and cross-vendor models, given full-trajectory context, reached F1 scores of only 0.17-0.40 at up to 90x the cost. Most critically, three trained human annotators agreed on where to intervene only slightly above chance (Krippendorff's alpha = +0.047), indicating that intervention timing is a low-reliability construct and that single-annotator F1 is an unsuitable optimization target.

arXiv:2606.04296v1
Abstract: As autonomous AI agents move from conversational systems to long-horizon software execution, runtime safety layers that decide when to interrupt an agent have become essential. We study this timing problem using a continuous 18-dimensional affective-dynamics engine (HEART) as a diagnostic probe, evaluating four intervention trigger families (absolute state thresholds, composite state-action patterns, regex reasoning-feature extraction, and zero-shot LLM-as-judge) against human-annotated intervention points on SWE-bench-Verified debugging traces. We report three findings. First, a State Saturation Trap: agents show no recovery signal under sustained difficulty, so modeled frustration quickly crosses the threshold and stays at its maximum, converting threshold-on-state triggers from moment detectors into near-constant indicators that fire on 39-83% of actions across five trajectories. Second, a capability-and-context floor for LLM judges: a small model (gpt-5.4-mini) never fires, while frontier and cross-vendor models escape the zero-firing floor only with full-trajectory context, and even then reach only F1 0.17-0.40 at up to 90x the cost. Third, and most importantly, the supervised target is not reproducible among humans: three trained annotators using one rubric on a 56-action trajectory agree on where to intervene only slightly above chance (location Krippendorff's alpha = +0.047; best pairwise Cohen's kappa = +0.349) and not at all on intervention type (pause degenerate; clarify below chance; reflect only alpha = +0.226). We conclude that intervention timing is a low-reliability construct, making single-annotator F1 an unsuitable optimization target. Our contribution is the joint mapping of this problem across human inter-rater reliability, four detector architectures, a cross-model LLM-judge sweep, and a reproduced saturation effect, rather than any single detector's accuracy.
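
The saturation effect described in the abstract can be illustrated with a toy model. The sketch below is not HEART: the abstract does not specify the engine's dynamics, so the one-dimensional clamped accumulator, the function names, and the parameters `gain`, `recovery`, and `threshold` are all assumptions made for illustration. It shows only the general mechanism, namely that a state with no recovery term crosses a threshold once and then keeps a threshold-on-state trigger firing on nearly every later action.

```python
def simulate_frustration(outcomes, gain=0.25, recovery=0.0, cap=1.0):
    """Hypothetical 1-D frustration state (an assumption, not HEART's model).

    Each failed action adds `gain`; each successful action subtracts
    `recovery`. The state is clamped to the interval [0, cap].
    """
    level = 0.0
    trace = []
    for failed in outcomes:
        level += gain if failed else -recovery
        level = min(cap, max(0.0, level))
        trace.append(level)
    return trace


def firing_rate(trace, threshold):
    """Fraction of actions on which a threshold-on-state trigger fires."""
    return sum(level >= threshold for level in trace) / len(trace)


def onset_count(trace, threshold):
    """Number of upward threshold crossings, i.e. distinct 'moments'."""
    count = 0
    previously_above = False
    for level in trace:
        above = level >= threshold
        if above and not previously_above:
            count += 1
        previously_above = above
    return count


if __name__ == "__main__":
    # Toy trajectory: 56 consecutive failed actions (sustained difficulty).
    outcomes = [True] * 56
    trace = simulate_frustration(outcomes)
    print("firing rate:", firing_rate(trace, threshold=0.7))
    print("threshold onsets:", onset_count(trace, threshold=0.7))
```

With these toy parameters the trigger fires on 54 of the 56 actions while the threshold is crossed upward only once, so the trigger carries almost no information about when to intervene. The figures are properties of this sketch only and are not meant to reproduce the 39-83% range reported in the paper.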