LLMs as 5x Faster Sandboxes

wpnews.pro

Docker as an LLM.Simulate an agent environment from traces instead of standing up a sandbox.

A frontier LLM acts as the environment your agent steps against, reconstructed from OpenTelemetry traces. The harness ingests recorded (state, action) -> observation

steps, builds a retrieval index, evolves the base environment prompt with GEPA, and serves the resulting world model locally.

Build from OTel traces: ingest, normalize, split train/held-out, index the replay buffer, and optimize the environment prompt.Serve or play the built model: agents callWorldModel.step(action)

in-process or through the local HTTP backend.Evaluate reconstruction fidelity withwmh eval

against trace files.

uv sync
wmh providers verify
wmh build --name airline --file examples/tau-bench/traces.otel.jsonl
wmh list
wmh eval examples/tau-bench/traces.otel.jsonl
wmh eval list
wmh eval run tau-bench
wmh eval results
wmh examples list
wmh examples run tau-bench -- --trace 0
wmh serve
wmh demo --name airline
wmh play --name airline

wmh build

with no flags launches a guided creation wizard on an interactive terminal. Pass --file

and related flags, or --no-interactive

, for scriptable runs.

World models are named and stored under .wmh/models/<name>/

. wmh list

, wmh serve

, wmh demo

, and wmh play

only use models built locally in that directory.

Command	What it does
`wmh build`
Builds a named world model from OTel traces or a vendor trace pull. It ingests traces, normalizes them, splits train/held-out data, builds the retrieval index, runs GEPA prompt optimization, and writes the artifact to `.wmh/models/<name>/` . With no required inputs on a TTY, it opens the guided wizard.
`wmh list`
Lists world models found under the selected root's `models/` directory, including provider, held-out score, rollout count, and frontier size when those metrics exist. By default, the selected root is `.wmh/` , so plain `wmh list` does not read committed example artifacts.
`wmh eval <trace files...>`
Scores reconstruction fidelity on one or more OTel trace files. It performs a deterministic train/held-out split, replays held-out steps through the base or supplied prompt, grades predicted observations against recorded observations, and prints per-file plus overall fidelity.
`wmh eval list`
Lists named eval suites from `examples/<task>/evals/*.toml` . Suites are example-local definitions for repeatable reconstruction-fidelity runs.
`wmh eval run <suite>`
Runs a named eval suite, using its configured trace files and split/scoring settings. Results are written as local JSON under `.wmh/evals/<task>/<suite>/` unless `--out` is supplied. The default suite for an example can be selected by task name, e.g. `wmh eval run tau-bench` .
`wmh eval results [suite]`
Summarizes locally saved named eval results from `.wmh/evals/` . These are generated artifacts and should not be committed.
`wmh serve`
Starts the local FastAPI backend on `127.0.0.1:8000` by default. It serves all locally built models, or only the repeated `--name` selections, through `/world_models/...` HTTP routes.
`wmh demo`
Runs a short demo against a built model. A throwaway LLM agent proposes an action from sampled trace examples, the world model predicts the environment observation, and the CLI prints the action, environment prompt, and observation.
`wmh play`
Opens an interactive REPL for a built model. You type tool calls or free-text actions, and the world model returns observations while maintaining session state and history.
`wmh providers verify`
Checks provider connectivity for locally built models. It verifies configured completion providers and any provider-backed embedder paths, skipping the offline hashing embedder.
`wmh examples list`
Lists self-contained task examples under `examples/<task>/` that include a `traces.otel.jsonl` corpus or `run.sh` launcher.
`wmh examples run <task> -- <args>`
Runs the selected example's local `run.sh` launcher and forwards all arguments after `--` . This is the standard entrypoint for dataset-specific example helpers.

Dataset-specific logic lives only under examples/

. Each task folder is self-contained:

examples/swe-bench/traces.otel.jsonl

examples/tau-bench/traces.otel.jsonl

examples/terminal-tasks/traces.otel.jsonl

Each example folder may include task-local capture or launch helpers. Launch them through wmh examples run <task> -- <args>

. Reusable harness behavior belongs in wmh/

and should be exposed through the wmh

CLI.

Repeatable eval suite definitions live under examples/<task>/evals/*.toml

. They point at example-local trace files and configure replay options such as train split, sampling, RAG, and judge. Generated eval results stay local under .wmh/evals/

.

Example-local prebuilt artifacts live under examples/<task>/models/<name>/

; pass --root examples/<task>

to wmh list

, wmh demo

, wmh play

, or wmh serve

to use one without copying it into .wmh/

.

from wmh import Action, ActionKind
from wmh.config.store import WorldModelStore
from wmh.engine. import load_world_model

model_dir = WorldModelStore(".wmh").resolve("airline")
wm, _provider = load_world_model(model_dir)

session = wm.new_session(task="check out the cart")
obs = wm.step(
    session.id,
    Action(kind=ActionKind.TOOL_CALL, name="add_to_cart", arguments={"sku": "A1"}),
)
print(obs.content)

Over HTTP, use GET /world_models

, then POST /world_models/{name}/sessions

and POST /world_models/{name}/sessions/{id}/step

.

Credentials are read from the environment.

Provider	Default model family	Env vars
Anthropic	Claude Opus	`ANTHROPIC_API_KEY`
AWS Bedrock	Claude Opus	`AWS_REGION` , `AWS_ACCESS_KEY_ID` , `AWS_SECRET_ACCESS_KEY`
Azure OpenAI	GPT	`AZURE_OPENAI_API_KEY` , `AZURE_OPENAI_ENDPOINT`
OpenAI	GPT	`OPENAI_API_KEY`

uv sync --extra dev
uv run ruff check .
uv run ruff format .
uv run ty check
uv run pytest -q

Conventions live in AGENTS.md

. Tests are inline next to the code they cover (foo.py

-> foo_test.py

) under wmh/

.

source & further reading

github.com — original article

LLMs as 5x Faster Sandboxes

Run your AI side-project on zahid.host