Simplicity always wins:SOTA on swe-pro,tb2,-verif on 21 models with simple-agent

wpnews.pro

cd /news/artificial-intelligence/simplicity-always-wins-sota-on-swe-p… · home › topics › artificial-intelligence › article

[ARTICLE · art-34239] src=github.com ↗ pub=2026-06-19T17:58Z topic=artificial-intelligence verified=true sentiment=↑ positive

Simplicity always wins:SOTA on swe-pro,tb2,-verif on 21 models with simple-agent

Strands Labs released Simple Strands Agent (SSA), a minimal autonomous coding harness that achieves state-of-the-art results on SWE-Bench Verified, SWE-Bench Pro, and Terminal Bench 2 across 21 models. The open-source tool pairs frontier LLMs with bash and file-editing tools in isolated Docker environments, enabling rapid experimentation and model-agnostic benchmarking.

read3 min views1 publishedJun 19, 2026

Simplicity always wins:SOTA on swe-pro,tb2,-verif on 21 models with simple-agent — Image: source

A repository for Strands-based agents and harnesses for agentic benchmarks. It is a uv workspace: the repository root coordinates one or more member packages. Setup, configuration, and usage live in each agent's README.

A lean autonomous-coding agent achieving state-of-the-art performance across software engineering benchmarks.

For a summary of the work, see the post on Amazon-Science.

Simple Strands Agent (SSA) is a minimal, hackable harness for autonomous software engineering. It pairs frontier LLMs (Claude, GPT, Gemini, and open-weight models via Bedrock/LiteLLM/vLLM) with bash

and file-editing tools inside isolated Docker environments to analyze codebases, diagnose bugs, write patches, and verify solutions.

The harness is built for rapid experimentation — swap models, tune prompts, adjust tool behavior, and benchmark all in a single config change. Despite its simplicity, SSA delivers SOTA-level results on widely-used coding benchmarks including SWE-Bench Verified, SWE-Bench Pro, and Terminal Bench 2.

Model-agnostic— first-class adapters for Anthropic, OpenAI, Google, xAI, Bedrock, and any OpenAI-compatible endpoint (vLLM, LiteLLM, Together, Vertex, Z.AI).Composable tools—bash

,str_replace_editor

,think

, andsubmit

primitives with per-tool output clipping and timeout controls.Isolated environments— Docker-backed sandboxes with streaming exec, automatic image resolution, and ECR support.** Hydra-powered configs**— every knob is overridable from the command line; experiments are reproducible from a single YAML.** Built-in benchmarking**— turnkey scripts for SWE-Bench Verified, SWE-Bench Pro, and Terminal Bench 2, including S3 result upload.

Python 3.12+Docker for containerized task environmentsAWS credentials if using Amazon Bedrock for model access and/or Amazon ECR for docker imagespackage manager (optional, but recommended)uv

git clone https://github.com/strands-labs/benchmark-harnesses.git
cd benchmark-harnesses

uv sync
source .venv/bin/activate

pip install -e .
uv run python -m ssa.run \
    --config-name=default.yaml \
    dataset.name=sbv \
    dataset.identifier=django__django-15987 \
    env.env_type=docker \
    env.docker.workdir="/testbed"

We provide simple scripts for running instances from SWE-Bench Verified, SWE-Bench Pro, and Terminal-Bench-2 (and see this for running Terminal-Bench-2 with SSA's harbor plugin).

SSA uses Hydra for configuration. All configs live in src/ssa/configs/, and any parameter can be overridden from the command line. Start from

for the full schema, then mix in a model-specific config as needed.

src/ssa/configs/default.yaml

SSA's design and execution loop is summarized in the following figure:

SWE-Bench Verified:

SWE-Bench Pro:

Terminal Bench 2:

The code is structured as follows:

src/ssa/
├── agent.py / agent_runner.py     # Core agent loop
├── models/                        # Model adapters (Anthropic, OpenAI, Bedrock, ...)
├── tools/                         # bash, str_replace_editor, think, submit
├── environments/                  # Docker-backed sandbox
├── conversation_manager/          # Context management & truncation
├── hooks/ • callbacks/ • metrics/ # Observability and instrumentation
├── prompts/ • configs/            # System prompts and Hydra configs
└── run.py                         # Entry point
scripts/
├── swe_verified/ • swe_pro/ • tb2/

Please consider citing as follows, if you find SSA useful!

@misc{2026simplestrandsagent,
      title={Dissecting model behavior through agent trajectories},
      author={Gaurav Gupta and Vatshank Chaturvedi and Jun Huan and Anoop Deoras},
      year={2026},
      eprint={2606.17454},
      archivePrefix={arXiv},
      url={https://arxiv.org/abs/2606.17454},
}

Agents in this repository are given access to shell tools. In practice, this means the model can run commands in the environment where the agent is started.

This is useful for experiments and benchmarking, but it also means you should treat the agent like you would treat any program with shell access: it may read files, modify files, delete data, install packages, or accidentally expose information from the environment.

For normal use, we recommend running agents in an isolated environment rather than directly on your machine. Our experiments and benchmarks are run inside Docker containers. You should avoid running agents in an environment that contains secrets, credentials, personal files, production data, or anything you would not want the model to access.

A good default setup is:

run the agent inside Docker or another sandboxed environment
mount only the files/directories the agent actually needs
avoid exposing cloud credentials, SSH keys, API keys, or other secrets
do not run it with unnecessary privileges
inspect outputs before trusting or reusing them

Agents will usually behave as instructed, but shell access is powerful. Use the same caution you would use when running code from an automated system.

source & further reading

github.com — original article

~/api · this article 200

$curl api.wpnews.pro/v1/news/simplicity-always-wins-s…

Read original on github.com → github.com/strands-labs/benchmark-harnesses

mentioned entities

Strands Labs

Simple Strands Agent

SSA

SWE-Bench Verified

SWE-Bench Pro

Terminal Bench 2

Amazon Bedrock

Hydra

metadata

slugsimplicity-always-wins-sota-on-swe-pro-tb2-verif-on-21-models-with-simple-agent

topic#artificial-intelligence

secondary4 topics

sentimentpositive

canonicalgithub.com

navigation

← prevPoll: What's your primary AI cod…

next →Is AI ruining our skills? Early …

── more in #artificial-intelligence 4 stories · sorted by recency

news.ycombinator.com · 19 Jun · #artificial-intelligence

Poll: What's your primary AI coding agent/orchestrator?

sundial.md · 19 Jun · #artificial-intelligence

Sundial: Agent-Native Document Editor

askhuman.app · 19 Jun · #artificial-intelligence

Show HN: No-install, end-to-end encrypted HTML artifact sharing for agents

dev.to · 19 Jun · #artificial-intelligence

Breaking Build: Kiro and Claude delivered exactly what I asked, and it wasn't what I wanted

── more on @strands labs 3 stories trending now

wpnews · 18 Jun · #artificial-intelligence

KubeCon, OpenInfra and PyTorch Unite to Scale AI

wpnews · 18 Jun · #ai-agents

How to Automate Business Reports With an AI Agent Instead of Dashboards

wpnews · 18 Jun · #ai-chips

Apple and Intel join forces in Trump’s push to bring chipmaking home

sponsored brought to you by zahid.host 4,200+ EU-deployed projects

reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain — perfect for shipping the agent you just read about.

$git push zahid main

→ Live at https://your-agent.zahid.host ✓

Get free account → Pricing

from €0/mo · no card required