LLMs as 5x Faster Sandboxes

A new open-source tool called World Model Harness (wmh) uses frontier LLMs to simulate agent environments from OpenTelemetry traces, replacing traditional sandboxes. The tool ingests recorded state-action-observation steps, builds a retrieval index, and serves a world model that agents can interact with via a local HTTP backend. The approach is claimed to be up to 5x faster than standing up a full sandbox environment.
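To make the retrieval idea concrete, here is a minimal sketch of how recorded steps could be indexed and turned into few-shot context for the LLM. It is an illustration only, not wmh's implementation: the `Step` shape, the `ReplayIndex` class, the Jaccard token-overlap scoring, and the prompt wording are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One recorded environment step (hypothetical shape, not wmh's schema)."""

    action: str
    observation: str


def tokens(text: str) -> set[str]:
    # Crude tokenizer: lowercase and split on whitespace.
    return set(text.lower().split())


class ReplayIndex:
    """Toy retrieval index over recorded steps, scored by Jaccard overlap."""

    def __init__(self, steps: list[Step]) -> None:
        self._entries = [(tokens(s.action), s) for s in steps]

    def nearest(self, action: str, k: int = 2) -> list[Step]:
        query = tokens(action)

        def score(entry_tokens: set[str]) -> float:
            union = query | entry_tokens
            return len(query & entry_tokens) / len(union) if union else 0.0

        # Highest-overlap recorded actions first; ties keep recording order.
        ranked = sorted(self._entries, key=lambda e: score(e[0]), reverse=True)
        return [step for _, step in ranked[:k]]


def build_prompt(index: ReplayIndex, action: str) -> str:
    # Few-shot prompt: retrieved examples first, then the new action to answer.
    parts = ["You are the environment. Predict the observation."]
    for step in index.nearest(action):
        parts.append(f"Action: {step.action}\nObservation: {step.observation}")
    parts.append(f"Action: {action}\nObservation:")
    return "\n\n".join(parts)


steps = [
    Step("search_flights origin=SFO dest=JFK", "3 flights found"),
    Step("cancel_booking id=42", "booking 42 cancelled"),
    Step("search_flights origin=LAX dest=JFK", "5 flights found"),
]
index = ReplayIndex(steps)

# The SFO -> JFK search shares the most tokens with this new action.
top = index.nearest("search_flights origin=SFO dest=BOS", k=1)
```

A real harness would likely use embeddings rather than token overlap, but the shape is the same: look up similar recorded steps, then ask the model to continue the pattern for the new action.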

Docker as an LLM. Simulate an agent environment from traces instead of standing up a sandbox.

A frontier LLM acts as the environment your agent steps against, reconstructed from OpenTelemetry traces. The harness ingests recorded (state, action) -> observation steps, builds a retrieval index, evolves the base environment prompt with GEPA, and serves the resulting world model locally.

- Build from OTel traces: ingest, normalize, split train/held-out, index the replay buffer, and optimize the environment prompt.
- Serve or play the built model: agents call `WorldModel.step(action)` in-process or through the local HTTP backend.
- Evaluate reconstruction fidelity with `wmh eval` against trace files.

```bash
uv sync
wmh providers verify
wmh build --name airline --file examples/tau-bench/traces.otel.jsonl
wmh list
wmh eval examples/tau-bench/traces.otel.jsonl
wmh eval list
wmh eval run tau-bench
wmh eval results
wmh examples list
wmh examples run tau-bench -- --trace 0
wmh serve
wmh demo --name airline
wmh play --name airline
```

`wmh build` with no flags launches a guided creation wizard on an interactive terminal. Pass `--file` and related flags, or `--no-interactive`, for scriptable runs.

World models are named and stored under `.wmh/models/<name>/`. `wmh list`, `wmh serve`, `wmh demo`, and `wmh play` only use models built locally in that directory.

| Command | What it does |
|---|---|
| `wmh build` | Builds a named world model from OTel traces or a vendor trace pull. It ingests traces, normalizes them, splits train/held-out data, builds the retrieval index, runs GEPA prompt optimization, and writes the artifact to `.wmh/models/<name>/`. With no required inputs on a TTY, it opens the guided wizard. |
| `wmh list` | Lists world models found under the selected root's `models/` directory, including provider, held-out score, rollout count, and frontier size when those metrics exist. By default, the selected root is `.wmh/`, so plain `wmh list` does not read committed example artifacts. |
| `wmh eval <trace files...>` | Scores reconstruction fidelity on one or more OTel trace files. It performs a deterministic train/held-out split, replays held-out steps through the base or supplied prompt, grades predicted observations against recorded observations, and prints per-file plus overall fidelity. |
| `wmh eval list` | Lists named eval suites from `examples/<task>/evals/*.toml`. Suites are example-local definitions for repeatable reconstruction-fidelity runs. |
| `wmh eval run <suite>` | Runs a named eval suite, using its configured trace files and split/scoring settings. Results are written as local JSON under `.wmh/evals/<task>/<suite>/` unless `--out` is supplied. The default suite for an example can be selected by task name, e.g. `wmh eval run tau-bench`. |
| `wmh eval results [suite]` | Summarizes locally saved named eval results from `.wmh/evals/`. These are generated artifacts and should not be committed. |
| `wmh serve` | Starts the local FastAPI backend on `127.0.0.1:8000` by default. It serves all locally built models, or only the repeated `--name` selections, through `/world_models/...` HTTP routes. |
| `wmh demo` | Runs a short demo against a built model. A throwaway LLM agent proposes an action from sampled trace examples, the world model predicts the environment observation, and the CLI prints the action, environment prompt, and observation. |
| `wmh play` | Opens an interactive REPL for a built model. You type tool calls or free-text actions, and the world model returns observations while maintaining session state and history. |
| `wmh providers verify` | Checks provider connectivity for locally built models. It verifies configured completion providers and any provider-backed embedder paths, skipping the offline hashing embedder. |
| `wmh examples list` | Lists self-contained task examples under `examples/<task>/` that include a `traces.otel.jsonl` corpus or `run.sh` launcher. |
| `wmh examples run <task> -- <args>` | Runs the selected example's local `run.sh` launcher and forwards all arguments after `--`. This is the standard entrypoint for dataset-specific example helpers. |

Dataset-specific logic lives only under `examples/`. Each task folder is self-contained:

- `examples/swe-bench/traces.otel.jsonl`
- `examples/tau-bench/traces.otel.jsonl`
- `examples/terminal-tasks/traces.otel.jsonl`

Each example folder may include task-local capture or launch helpers. Launch them through `wmh examples run <task> -- <args>`. Reusable harness behavior belongs in `wmh/` and should be exposed through the `wmh` CLI.

Repeatable eval suite definitions live under `examples/<task>/evals/*.toml`. They point at example-local trace files and configure replay options such as train split, sampling, RAG, and judge. Generated eval results stay local under `.wmh/evals/`.

Example-local prebuilt artifacts live under `examples/<task>/models/<name>/`; pass `--root examples/<task>` to `wmh list`, `wmh demo`, `wmh play`, or `wmh serve` to use one without copying it into `.wmh/`.

```python
from wmh import Action, ActionKind
from wmh.config.store import WorldModelStore
from wmh.engine.loader import load_world_model

# Resolve the locally built "airline" model under .wmh/ and load it.
model_dir = WorldModelStore(".wmh").resolve("airline")
wm, provider = load_world_model(model_dir)

# Open a session, then step the world model with a tool-call action.
session = wm.new_session(task="check out the cart")
obs = wm.step(
    session.id,
    Action(kind=ActionKind.TOOL_CALL, name="add_to_cart", arguments={"sku": "A1"}),
)
print(obs.content)
```

Over HTTP, use `GET /world_models`, then `POST /world_models/{name}/sessions` and `POST /world_models/{name}/sessions/{id}/step`.

Credentials are read from the environment.

| Provider | Default model family | Env vars |
|---|---|---|
| Anthropic | Claude Opus | `ANTHROPIC_API_KEY` |
| AWS Bedrock | Claude Opus | `AWS_REGION`, `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY` |
| Azure OpenAI | GPT | `AZURE_OPENAI_API_KEY`, `AZURE_OPENAI_ENDPOINT` |
| OpenAI | GPT | `OPENAI_API_KEY` |

```bash
uv sync --extra dev
uv run ruff check .
uv run ruff format .
uv run ty check
uv run pytest -q
```

Conventions live in `AGENTS.md`.
Tests are inline next to the code they cover (`foo.py` -> `foo_test.py`) under `wmh/`.
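For the HTTP routes mentioned earlier, a small standard-library client might look like the sketch below. Several things here are assumptions rather than documented behavior: the route prefix is written as `/world_models` (reading the underscore back into the path), the JSON body fields (`task`, `kind`, `name`, `arguments`) and the `"tool_call"` value simply mirror the Python example, and the helper names are hypothetical. The base URL is the default bind address stated for `wmh serve`. Check the running server's generated FastAPI docs for the real request schemas.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

BASE_URL = "http://127.0.0.1:8000"  # default address given for `wmh serve`
JSON_HEADERS = {"Content-Type": "application/json"}


def list_models_request() -> Request:
    # GET /world_models
    return Request(f"{BASE_URL}/world_models", method="GET")


def new_session_request(model: str, task: str) -> Request:
    # POST /world_models/{name}/sessions; the body shape is an assumption.
    body = json.dumps({"task": task}).encode("utf-8")
    url = f"{BASE_URL}/world_models/{quote(model, safe='')}/sessions"
    return Request(url, data=body, headers=JSON_HEADERS, method="POST")


def step_request(model: str, session_id: str, name: str, arguments: dict) -> Request:
    # POST /world_models/{name}/sessions/{id}/step; the body shape is an assumption.
    payload = {"kind": "tool_call", "name": name, "arguments": arguments}
    body = json.dumps(payload).encode("utf-8")
    url = (
        f"{BASE_URL}/world_models/{quote(model, safe='')}"
        f"/sessions/{quote(session_id, safe='')}/step"
    )
    return Request(url, data=body, headers=JSON_HEADERS, method="POST")


def send(req: Request) -> dict:
    # Performs the call; requires a running `wmh serve` backend.
    with urlopen(req) as resp:
        return json.load(resp)


# Building a request does not touch the network, so it can be inspected offline.
req = step_request("airline", "s1", "add_to_cart", {"sku": "A1"})
```

With a server running, `send(new_session_request("airline", "check out the cart"))` would create a session, and the returned session id (whatever field the server uses for it) would feed into `step_request`.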