[ARTICLE · art-45160] src=github.com ↗ pub= topic=large-language-models verified=true sentiment=↑ positive

LLMs as 5x Faster Sandboxes

A new open-source tool called World Model Harness (wmh) uses frontier LLMs to simulate agent environments from OpenTelemetry traces, replacing traditional sandboxes. The tool ingests recorded state-action-observation steps, builds a retrieval index, and serves a world model that agents can interact with via a local HTTP backend. This approach claims to be up to 5x faster than standing up a full sandbox environment.

4 min read · 1 view · Published Jun 30, 2026

Docker as an LLM. Simulate an agent environment from traces instead of standing up a sandbox.

A frontier LLM acts as the environment your agent steps against, reconstructed from OpenTelemetry traces. The harness ingests recorded (state, action) -> observation steps, builds a retrieval index, evolves the base environment prompt with GEPA, and serves the resulting world model locally.
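As a rough illustration of the retrieval idea, the sketch below stores recorded steps and looks up the ones closest to a new (state, action) query. It is not the harness's implementation: the `Step` class, the `retrieve` function, and the token-overlap (Jaccard) scoring are all assumptions chosen to keep the example dependency-free, and the real index may use embeddings instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One recorded transition: (state, action) -> observation. Hypothetical type."""
    state: str
    action: str
    observation: str


def tokens(text: str) -> set[str]:
    """Lowercased whitespace tokens; a stand-in for a real embedder."""
    return set(text.lower().split())


def retrieve(buffer: list[Step], state: str, action: str, k: int = 2) -> list[Step]:
    """Return the k recorded steps with the highest Jaccard overlap to the query."""
    query = tokens(state + " " + action)

    def score(step: Step) -> float:
        candidate = tokens(step.state + " " + step.action)
        union = query | candidate
        return len(query & candidate) / len(union) if union else 0.0

    return sorted(buffer, key=score, reverse=True)[:k]


# Made-up replay buffer for illustration only.
buffer = [
    Step("cart empty", "add_to_cart sku=A1", "cart has 1 item"),
    Step("cart has 1 item", "checkout", "order placed"),
    Step("logged out", "login user=bob", "logged in"),
]
nearest = retrieve(buffer, "cart empty", "add_to_cart sku=B2", k=1)
print(nearest[0].observation)
```

The retrieved steps would then be shown to the LLM as examples of how the recorded environment responded in similar situations.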

Build from OTel traces: ingest, normalize, split train/held-out, index the replay buffer, and optimize the environment prompt.

Serve or play the built model: agents call WorldModel.step(action) in-process or through the local HTTP backend.

Evaluate reconstruction fidelity with wmh eval against trace files.
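One common way to get a reproducible train/held-out split is to hash a stable identifier and bucket on the result, so the same trace always lands on the same side. The sketch below shows that technique; whether wmh splits this way is an assumption, and `is_held_out`, `split`, and the 0.2 fraction are hypothetical.

```python
import hashlib


def is_held_out(trace_id: str, held_out_fraction: float = 0.2) -> bool:
    """Map a trace ID to a stable value in [0, 1] and compare to the fraction."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < held_out_fraction


def split(trace_ids: list[str], held_out_fraction: float = 0.2):
    """Partition trace IDs into (train, held_out) without any random state."""
    train, held_out = [], []
    for trace_id in trace_ids:
        target = held_out if is_held_out(trace_id, held_out_fraction) else train
        target.append(trace_id)
    return train, held_out


ids = [f"trace-{i}" for i in range(100)]  # made-up IDs
train, held_out = split(ids)
print(len(train), len(held_out))
```

Because the assignment depends only on the ID, re-running the split on the same corpus yields the same partition, which keeps fidelity scores comparable across runs.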

uv sync
wmh providers verify
wmh build --name airline --file examples/tau-bench/traces.otel.jsonl
wmh list
wmh eval examples/tau-bench/traces.otel.jsonl
wmh eval list
wmh eval run tau-bench
wmh eval results
wmh examples list
wmh examples run tau-bench -- --trace 0
wmh serve
wmh demo --name airline
wmh play --name airline

wmh build with no flags launches a guided creation wizard on an interactive terminal. Pass --file and related flags, or --no-interactive, for scriptable runs.

World models are named and stored under .wmh/models/<name>/. wmh list, wmh serve, wmh demo, and wmh play only use models built locally in that directory.

| Command | What it does |
| --- | --- |
| wmh build | Builds a named world model from OTel traces or a vendor trace pull. It ingests traces, normalizes them, splits train/held-out data, builds the retrieval index, runs GEPA prompt optimization, and writes the artifact to .wmh/models/<name>/. With no required inputs on a TTY, it opens the guided wizard. |
| wmh list | Lists world models found under the selected root's models/ directory, including provider, held-out score, rollout count, and frontier size when those metrics exist. By default, the selected root is .wmh/, so plain wmh list does not read committed example artifacts. |
| wmh eval <trace files...> | Scores reconstruction fidelity on one or more OTel trace files. It performs a deterministic train/held-out split, replays held-out steps through the base or supplied prompt, grades predicted observations against recorded observations, and prints per-file plus overall fidelity. |
| wmh eval list | Lists named eval suites from examples/<task>/evals/*.toml. Suites are example-local definitions for repeatable reconstruction-fidelity runs. |
| wmh eval run <suite> | Runs a named eval suite, using its configured trace files and split/scoring settings. Results are written as local JSON under .wmh/evals/<task>/<suite>/ unless --out is supplied. The default suite for an example can be selected by task name, e.g. wmh eval run tau-bench. |
| wmh eval results [suite] | Summarizes locally saved named eval results from .wmh/evals/. These are generated artifacts and should not be committed. |
| wmh serve | Starts the local FastAPI backend on 127.0.0.1:8000 by default. It serves all locally built models, or only the repeated --name selections, through /world_models/... HTTP routes. |
| wmh demo | Runs a short demo against a built model. A throwaway LLM agent proposes an action from sampled trace examples, the world model predicts the environment observation, and the CLI prints the action, environment prompt, and observation. |
| wmh play | Opens an interactive REPL for a built model. You type tool calls or free-text actions, and the world model returns observations while maintaining session state and history. |
| wmh providers verify | Checks provider connectivity for locally built models. It verifies configured completion providers and any provider-backed embedder paths, skipping the offline hashing embedder. |
| wmh examples list | Lists self-contained task examples under examples/<task>/ that include a traces.otel.jsonl corpus or run.sh launcher. |
| wmh examples run <task> -- <args> | Runs the selected example's local run.sh launcher and forwards all arguments after --. This is the standard entrypoint for dataset-specific example helpers. |
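To make the "per-file plus overall fidelity" output of wmh eval concrete, here is a minimal sketch of that aggregation. The grader is a plain string-similarity ratio from `difflib`, used purely as a stand-in; the tool's actual grading (for example an LLM judge) is not shown in the source, and `fidelity_report` and the sample data are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(predicted: str, recorded: str) -> float:
    """Stand-in grader: character-level similarity in [0, 1]."""
    return SequenceMatcher(None, predicted, recorded).ratio()


def fidelity_report(files: dict[str, list[tuple[str, str]]]) -> dict[str, float]:
    """Average (predicted, recorded) scores per file, plus an overall mean over all steps."""
    report: dict[str, float] = {}
    all_scores: list[float] = []
    for name, pairs in files.items():
        scores = [similarity(predicted, recorded) for predicted, recorded in pairs]
        report[name] = sum(scores) / len(scores) if scores else 0.0
        all_scores.extend(scores)
    report["overall"] = sum(all_scores) / len(all_scores) if all_scores else 0.0
    return report


# Made-up held-out predictions paired with recorded observations.
report = fidelity_report({
    "a.otel.jsonl": [("order placed", "order placed"), ("cart empty", "cart has 1 item")],
    "b.otel.jsonl": [("logged in", "logged in")],
})
for name, score in report.items():
    print(f"{name}: {score:.2f}")
```

Note that the overall score here is averaged over steps rather than over files, so larger trace files carry more weight; that weighting is a design choice of this sketch.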

Dataset-specific logic lives only under examples/. Each task folder is self-contained:

examples/swe-bench/traces.otel.jsonl

examples/tau-bench/traces.otel.jsonl

examples/terminal-tasks/traces.otel.jsonl

Each example folder may include task-local capture or launch helpers. Launch them through wmh examples run <task> -- <args>. Reusable harness behavior belongs in wmh/ and should be exposed through the wmh CLI.

Repeatable eval suite definitions live under examples/<task>/evals/*.toml. They point at example-local trace files and configure replay options such as train split, sampling, RAG, and judge. Generated eval results stay local under .wmh/evals/.

Example-local prebuilt artifacts live under examples/<task>/models/<name>/; pass --root examples/<task> to wmh list, wmh demo, wmh play, or wmh serve to use one without copying it into .wmh/.
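The directory layout implied here is `<root>/models/<name>/`, with `.wmh` as the default root. The sketch below mimics that layout in a temporary directory to show why a plain listing of `.wmh` does not see example artifacts. `model_dir`, `list_models`, and the `airline-prebuilt` name are hypothetical, and this is not how wmh itself resolves models.

```python
import tempfile
from pathlib import Path


def model_dir(root: Path, name: str) -> Path:
    """Models live under <root>/models/<name>/, per the layout described above."""
    return root / "models" / name


def list_models(root: Path) -> list[str]:
    """Names of model directories under one root; other roots are never consulted."""
    models = root / "models"
    if not models.is_dir():
        return []
    return sorted(p.name for p in models.iterdir() if p.is_dir())


with tempfile.TemporaryDirectory() as tmp:
    local_root = Path(tmp) / ".wmh"
    example_root = Path(tmp) / "examples" / "tau-bench"
    model_dir(local_root, "airline").mkdir(parents=True)
    model_dir(example_root, "airline-prebuilt").mkdir(parents=True)
    local_models = list_models(local_root)
    example_models = list_models(example_root)

print(local_models, example_models)
```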

from wmh import Action, ActionKind
from wmh.config.store import WorldModelStore
from wmh.engine. import load_world_model

model_dir = WorldModelStore(".wmh").resolve("airline")
wm, _provider = load_world_model(model_dir)

session = wm.new_session(task="check out the cart")
obs = wm.step(
    session.id,
    Action(kind=ActionKind.TOOL_CALL, name="add_to_cart", arguments={"sku": "A1"}),
)
print(obs.content)

Over HTTP, use GET /world_models, then POST /world_models/{name}/sessions and POST /world_models/{name}/sessions/{id}/step.
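A minimal standard-library client for these routes might look like the sketch below. The route paths and the default address 127.0.0.1:8000 come from the text above; the request and response JSON shapes are not documented here, so any payload passed to `post_json` is an assumption, and the helper names are hypothetical.

```python
import json
from urllib import request
from urllib.parse import quote

BASE_URL = "http://127.0.0.1:8000"  # default address of `wmh serve`


def sessions_url(name: str, base: str = BASE_URL) -> str:
    """URL for POST /world_models/{name}/sessions."""
    return f"{base}/world_models/{quote(name, safe='')}/sessions"


def step_url(name: str, session_id: str, base: str = BASE_URL) -> str:
    """URL for POST /world_models/{name}/sessions/{id}/step."""
    return f"{sessions_url(name, base)}/{quote(session_id, safe='')}/step"


def post_json(url: str, payload: dict) -> dict:
    """POST a JSON body and decode the JSON response. Requires a running server."""
    req = request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.load(resp)


print(sessions_url("airline"))
print(step_url("airline", "abc123"))
```

With a server running, a call such as `post_json(sessions_url("airline"), {"task": "check out the cart"})` would create a session, assuming the body mirrors the `task` argument of the Python API; check the server's actual schema before relying on that.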

Credentials are read from the environment.

| Provider | Default model family | Env vars |
| --- | --- | --- |
| Anthropic | Claude Opus | ANTHROPIC_API_KEY |
| AWS Bedrock | Claude Opus | AWS_REGION, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY |
| Azure OpenAI | GPT | AZURE_OPENAI_API_KEY, AZURE_OPENAI_ENDPOINT |
| OpenAI | GPT | OPENAI_API_KEY |

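Before running a build, it can help to check that the variables for the chosen provider are set. The sketch below encodes the table above as a dictionary; the provider keys (`anthropic`, `bedrock`, and so on) and the `missing_credentials` helper are hypothetical labels for this example, not identifiers from wmh. The built-in `wmh providers verify` command remains the real connectivity check.

```python
import os

# Env vars per provider, copied from the table above. Keys are made-up labels.
REQUIRED_ENV = {
    "anthropic": ["ANTHROPIC_API_KEY"],
    "bedrock": ["AWS_REGION", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"],
    "azure-openai": ["AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT"],
    "openai": ["OPENAI_API_KEY"],
}


def missing_credentials(provider: str, env=None) -> list[str]:
    """Return the required variables that are unset or empty for a provider."""
    env = os.environ if env is None else env
    return [var for var in REQUIRED_ENV[provider] if not env.get(var)]


# Example with an explicit mapping instead of the real environment.
print(missing_credentials("bedrock", {"AWS_REGION": "eu-west-1"}))
```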
uv sync --extra dev
uv run ruff check .
uv run ruff format .
uv run ty check
uv run pytest -q

Conventions live in AGENTS.md. Tests are inline next to the code they cover (foo.py -> foo_test.py) under wmh/.
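The naming convention can be expressed as a small path helper, which is handy when scripting a check for modules that lack a sibling test file. `inline_test_path` is a hypothetical name, and nothing here is part of the wmh codebase.

```python
from pathlib import Path


def inline_test_path(module: Path) -> Path:
    """Map wmh/foo.py to wmh/foo_test.py, keeping the same directory."""
    return module.with_name(f"{module.stem}_test.py")


print(inline_test_path(Path("wmh/foo.py")))
```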
