Dissecting model behavior through agent trajectories

Researchers introduce the Simple Strands Agent (SSA), a harness designed to minimize the 'intent-execution gap' between AI models and their harnesses, and use it to reproduce or improve on reported pass@1 results on agentic benchmarks. Analyzing 138k trajectories, they find that finer-grained metrics such as edit frequency reveal model-level differences in problem-solving behavior that pass@1 scores alone do not show.

arXiv:2606.17454v1 Announce Type: new Abstract: AI agent performance is not just a modeling problem, it is fundamentally a systems problem. The advanced capabilities of models are realized through agent harnesses. Therefore, a gap between model assumptions and harness behavior can easily prevent the model's full capabilities from translating into agent performance. We formalize this as the 'intent-execution' gap: the mismatch between what the model intends and what the harness executes, and vice versa. We argue that minimizing this intent-execution gap is as important as other aspects of harness design such as tools and execution loops. To illustrate the impact of this harness-model alignment, we develop a simple and customizable harness called 'Simple Strands Agent' (SSA). SSA aims to find the bulk of common patterns which generalize across different model families (such as Claude, Gemini, GPT, Grok, Qwen), as well as a small number of model-specific preferences. We make two contributions: (i) we reproduce or improve on the pass@1 performance reported by diverse model-provider families on popular agentic benchmarks (SWE-Pro, SWE-Verified and Terminal-Bench-2), and (ii) building on an analysis of 138k trajectories generated by SSA, we look beyond the pass@1 numbers which tend to be relatively even across frontier models. By representing agent trajectories in code state-spaces, we observe model-level differences in problem-solving behavior. Finer-grained metrics such as edit frequency, testing activity, and phase-transitions reveal how individual models allocate effort across different stages of autonomous problem solving.
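
The abstract names edit frequency, testing activity, and phase transitions without defining them. The following is a minimal sketch of one plausible reading, not the paper's method: it assumes a trajectory is a sequence of steps, each already labeled with a phase. The `Step` type, the labels "explore", "edit", and "test", and the function `trajectory_metrics` are all hypothetical names introduced here for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    # Hypothetical phase label for one agent action: "explore", "edit", or "test".
    kind: str


def trajectory_metrics(steps):
    """Summarize one trajectory under the assumed definitions.

    edit_frequency:    fraction of steps labeled "edit"
    testing_activity:  fraction of steps labeled "test"
    phase_transitions: number of adjacent step pairs whose labels differ
    """
    total = len(steps)
    if total == 0:
        # An empty trajectory has no activity and no transitions.
        return {"edit_frequency": 0.0, "testing_activity": 0.0, "phase_transitions": 0}

    counts = Counter(step.kind for step in steps)
    transitions = sum(1 for a, b in zip(steps, steps[1:]) if a.kind != b.kind)
    return {
        "edit_frequency": counts["edit"] / total,
        "testing_activity": counts["test"] / total,
        "phase_transitions": transitions,
    }


# Made-up trajectory: two exploration steps, then two edit/test cycles.
example = [Step(k) for k in ["explore", "explore", "edit", "test", "edit", "test"]]
metrics = trajectory_metrics(example)
print(metrics)
```

Under these assumed definitions, two models with the same pass@1 could still differ in how often they edit, how much they test, and how frequently they switch between phases, which is the kind of behavioral difference the abstract describes.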