SkyPilot Sandboxes: Run Agent Code on Your Own Kubernetes, at Scale

SkyPilot released Sandboxes, a bring-your-own-cloud code execution layer that runs untrusted agent code on a user's existing Kubernetes infrastructure instead of a hosted third-party vendor. The tool allows teams to launch up to 50,000 isolated pods in parallel, keep code and data within their own cloud, and cut sandbox launch times by more than 50% using warm pools. The release includes a full reinforcement learning post-training example for code-generation models, positioning the tool as a cost and privacy alternative to hosted sandbox services like Modal.

Every agent, coding assistant, and RL pipeline eventually hits the same wall: the model wrote code, and now someone has to run it. Today, most teams hand that code to a hosted sandbox vendor, paying a multiple of raw compute to execute untrusted code on someone else’s machines while their prompts, test cases, and model outputs leave their cloud. Meanwhile, the Kubernetes cluster they already operate sits right there, capable of running 50,000 sandboxes at once.

This post is about closing that gap: SkyPilot Sandboxes, a BYOC code execution layer, with a full RL post-training example and head-to-head benchmarks against Modal.

## What is a sandbox, and why do you need one?

LLMs generate code. Whether it is an agent, a coding assistant, or an RL reward loop scoring the output of a half-trained model, at some point you have to run that code, and you cannot trust it. It can loop forever, exhaust memory, write files, spawn processes, or import something that tries to phone home. You need a disposable, isolated place to run it, and you usually need a lot of them at once.

Today that means reaching for a hosted sandbox vendor. It works, but the trade is real:

- Cost. You pay the vendor’s per-sandbox rate on top of the compute you already own.
- Privacy. Your code and data (the model’s output, your test cases, your prompts) leave your environment for a third party.
- Latency for non-US users. The vendor runs in their regions. Reach them from somewhere else and every call pays a network-distance tax.

## SkyPilot Sandboxes run on your own infra

A [SkyPilot Sandbox](https://docs.skypilot.co/en/latest/sandboxes.html) is a lightweight, isolated pod you create on demand, run commands in, and tear down, running on the Kubernetes you already have (BYOC: bring your own cloud).

- Per-pod isolation. Each sandbox is its own pod with a dedicated image, CPU, and memory. Code that misbehaves is contained to its pod, and the pod is destroyed when you are done.
- Massively parallel. Launch many sandboxes in a single call and fan commands out across them concurrently.
- Sub-second launches with warm pools. A pool keeps pre-provisioned pods idle and ready, so creating a sandbox claims a running pod instead of waiting on Kubernetes scheduling and an image pull. That cuts a single sandbox’s launch time by more than 50%.
- Your infra, your data. Code and data never leave your cloud. If grading needs credentials (a private package index, a database for integration tests), they are injected from the SkyPilot Secrets Manager at create time, never baked into an image.
- Modal-style API. `create`, `exec`, `terminate`, each with an async sibling on `.aio` for massive fan-out. If you have used a hosted sandbox SDK, you already know this one.

```python
import asyncio

import sky.sandbox

sb = sky.sandbox.create(image="python:3.12", cpus=1, memory_gb=2)
result = sb.exec("python", "-c", "print(2 + 2)")
print(result["stdout"])  # "4" (also: stderr, exit_code)
sb.terminate()

# One call returns a LIST of isolated sandboxes.
sandboxes = sky.sandbox.create(image="python:3.12", num_sandboxes=100)
for sb in sandboxes:
    sb.exec("pytest", "-q", "tests/")

# Every entrypoint has an async sibling on .aio.
# (These awaits belong inside an async function; `code` is the script to run.)
sandboxes = await sky.sandbox.create.aio(image="python:3.12", num_sandboxes=64)
results = await asyncio.gather(*(sb.exec.aio("python", "-c", code) for sb in sandboxes))
await asyncio.gather(*(sb.terminate.aio() for sb in sandboxes))
```

## Example: RL-training a code-generation model, with sandboxed reward

Untrusted code at volume shows up most sharply in reinforcement learning. This example post-trains a code-generation LLM, a policy model that, given a programming problem, writes a Python function to solve it. The training goal is simple to state: make the model’s generated functions pass the tests more often.
On every training step, for every rollout in the batch, we execute code that a half-trained model just wrote (buggy, occasionally infinite-looping, untrusted by definition), and that execution sits on the critical path of training. This is the same shape of problem HuggingFace’s Open R1 hit when they used hosted sandboxes for their RL reward; here, the execution runs on your own Kubernetes cluster via SkyPilot Sandboxes.

We use a standard distributed RL layout: five services in a SkyPilot job group, talking over HTTP.

- The Data Server serves prompts (MBPP-style problems with hidden tests) to the Rollout Server.
- The Rollout Server (SGLang) has the current policy generate candidate solutions and sends them to the reward server.
- The Sandbox Reward Server scores each candidate. This is where sandboxes come in. For every batch it receives, it claims a batch of sandboxes from a warm pool, runs each candidate against its hidden tests in its own sandbox, and returns 1.0 (all tests passed) or 0.0 (anything else).
- The PPO trainer writes the scored rollouts to the Replay Buffer.
- The PPO Trainer (GRPO) uses the rewards to update the policy, and the loop repeats.

Inside the reward server. The PPO trainer already POSTs a batch of `{prompt, response, tests}` to the `/batch_reward` endpoint on the reward server. The only change from a string-matching reward server is what happens inside it: we run code. We create one sandbox for each of the generated scripts and call the scoring function on each pair of created sandbox and script:

```python
import asyncio

import sky.sandbox


async def score_batch(items):
    # One call returns a LIST of sandboxes, claimed from the warm pool.
    sandboxes = await sky.sandbox.create.aio(
        name="reward", num_sandboxes=len(items), pool=POOL_NAME
    )
    try:
        # Score every rollout concurrently, one sandbox each.
        rewards = await asyncio.gather(
            *(score_one(sb, item) for sb, item in zip(sandboxes, items))
        )
    finally:
        # ALWAYS tear sandboxes down, even if an exec raised above.
        await asyncio.gather(
            *(sb.terminate.aio() for sb in sandboxes), return_exceptions=True
        )
    return list(rewards)
```

Scoring one rollout is where execution happens. We extract the code block, concatenate it with the setup code and hidden tests, and run that script in the sandbox. The reward is the cleanest possible signal: exit 0 means every test passed, and anything else (an assertion failure, a runtime error, an infinite loop that hits the timeout, a sandbox-level error) is reward 0.0. The rule is that a bad rollout must never raise out of the reward function; early in training, most rollouts are bad, and the loop has to keep going.

```python
async def score_one(sb, item):
    code = extract_code(item.response)
    if not code or not item.tests:
        return RewardResponse(reward=0.0, passed=False)
    script = build_test_script(code, item.setup_code, item.tests)
    try:
        result = await asyncio.wait_for(
            sb.exec.aio("python", "-c", script), timeout=EXEC_TIMEOUT_SECONDS
        )
    except (asyncio.TimeoutError, Exception):
        # A crash or a timeout is a 0.0 reward, never an exception that
        # escapes and kills the batch.
        return RewardResponse(reward=0.0, passed=False)
    passed = result["exit_code"] == 0  # stdout / stderr / exit_code
    return RewardResponse(reward=1.0 if passed else 0.0, passed=passed)
```

The warm pool is created once when the server starts, and the shared session is released once when it stops:

```python
sky.sandbox.create_pool(
    name=POOL_NAME, image="python:3.11-slim", cpus=1, memory_gb=2, replicas=8
)
# ... on shutdown:
await sky.sandbox.aclose()
```

Swap this reward server in for a string-matching one and the rest of a standard GRPO pipeline does not change.
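The scoring code calls `extract_code` and `build_test_script` without showing them. Below is a minimal sketch of what such helpers could look like, assuming the model answers with a Markdown-fenced code block and that `tests` is a list of assertion strings (as in MBPP). Both function bodies, the regex, and the concatenation order are assumptions made for illustration, not the example's actual implementation.

```python
import re

# Hypothetical helper bodies: the post names these functions but does not
# define them, so everything below is an illustrative assumption.
FENCE = "`" * 3  # built at runtime so this snippet nests cleanly in Markdown
_BLOCK = re.compile(FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(response: str) -> str:
    """Return the last fenced code block in a model response, or "" if none."""
    blocks = _BLOCK.findall(response)
    return blocks[-1].strip() if blocks else ""


def build_test_script(code: str, setup_code: str, tests: list[str]) -> str:
    """Join setup code, the candidate solution, and the hidden tests.

    A failing assert raises AssertionError, so the interpreter exits
    non-zero and the rollout scores 0.0.
    """
    parts = [setup_code.strip(), code.strip(), *(t.strip() for t in tests)]
    return "\n\n".join(p for p in parts if p) + "\n"


if __name__ == "__main__":
    response = (
        "Here is a solution:\n"
        + FENCE + "python\ndef add(a, b):\n    return a + b\n" + FENCE
    )
    print(build_test_script(extract_code(response), "", ["assert add(2, 2) == 4"]))
```

Taking the last fenced block is one possible policy when a response contains several; a production parser would likely need stricter handling of unlabeled or nested fences.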
## Competitive with Modal, on your own clusters

### Performance: faster to first command, scales with your clusters

We benchmarked BYOC sandboxes against [Modal Sandboxes](https://modal.com/docs/guide/sandboxes), a managed, multi-tenant service hosted in Modal’s US infrastructure (internal benchmarks, June 2026). Three takeaways.

Scale is determined by your cluster. A single Kubernetes cluster sustained ~50,000 healthy sandboxes across 220 nodes. Add clusters to go higher and SkyPilot will intelligently route requests to clusters with capacity.

Time to first command is ~20% faster, with a much tighter tail. The metric that matters is how long until a command you run in a fresh sandbox comes back: create, then immediately exec. At p50, a SkyPilot sandbox completes create + first exec in ~1.0s vs Modal’s ~1.2s, and the tails diverge further (p99 ~1.5s vs ~2.0s). Modal’s create returns quickly but hands back a not-yet-ready handle; readiness lands on the first exec, which is where its variance lives. SkyPilot front-loads readiness into `create`, so the first exec is quick and predictable.

| Create + first exec | p50 | p99 |
|---|---|---|
| SkyPilot (BYOC, warm pool) | ~1.0s | ~1.5s |
| Modal (US) | ~1.2s | ~2.0s |

Run the benchmark yourself:

```bash
curl -fsSLO https://gist.githubusercontent.com/lloyd-brown/58bdefdea5ff15f1563efa81fbed272a/raw/benchmark.py
python benchmark.py
```

The benchmark: 200 create, exec, terminate cycles per platform, wall time from create until an echo returns:

```python
import time

import modal

try:
    import sky.sandbox
except ImportError:
    sky = None  # no SkyPilot client? bench Modal only

print("Benchmarking SkyPilot + Modal" if sky
      else "No SkyPilot client found; benchmarking Modal only")

N = 200
app = modal.App.lookup("bench", create_if_missing=True)
image = modal.Image.debian_slim(python_version="3.12")

if sky:
    # Comparable slim Python 3.12 image; pre-provision warm capacity once.
    print("Creating warm pool (one-time, untimed)...")
    sky.sandbox.create_pool(name="bench", image="python:3.12-slim", cpus=1,
                            memory_gb=2, replicas=5, blocking=True)

# One untimed warmup cycle per platform, so one-time setup (Modal image
# resolution on first use, client session init) never lands in the numbers.
print("Warmup cycle per platform (untimed)...")
msb = modal.Sandbox.create("sleep", "infinity", app=app, image=image)
msb.exec("echo", "hi").wait()
msb.terminate()
if sky:
    sb = sky.sandbox.create(name="bench-warmup", pool="bench")
    sb.exec("echo", "hi")
    sb.terminate()


def pctl(xs, q):
    return sorted(xs)[round(q / 100 * (len(xs) - 1))]


print(f"Timing {N} create - exec - terminate cycles per platform...")
skypilot_s, modal_s = [], []
for i in range(N):
    if sky:
        t0 = time.perf_counter()
        sb = sky.sandbox.create(name=f"bench-{i}", pool="bench")  # exec-ready
        sb.exec("echo", "hi")
        skypilot_s.append(time.perf_counter() - t0)
        sb.terminate()
    t0 = time.perf_counter()
    msb = modal.Sandbox.create("sleep", "infinity", app=app, image=image)
    msb.exec("echo", "hi").wait()  # container readiness lands here
    modal_s.append(time.perf_counter() - t0)
    msb.terminate()
    if (i + 1) % 20 == 0:
        print(f"  {i + 1}/{N} cycles done", flush=True)

for name, xs in (("SkyPilot", skypilot_s), ("Modal", modal_s)):
    if xs:
        print(f"{name}: p50 {pctl(xs, 50):.2f}s  p99 {pctl(xs, 99):.2f}s")

if sky:
    print("Cleaning up the warm pool...")
    sky.sandbox.delete_pool("bench")
```

Latency stays local. Modal’s best-case US exec latency is genuinely low (~0.096s) when the client sits right next to its US region. Move that client to APAC and Modal jumps 3.9x to ~0.37s, essentially a fixed trans-Pacific round trip.
Because BYOC sandboxes run in your own region, next to your users, that distance tax never appears.

### Cost: up to 10x cheaper

On your own cluster you pay only for the machines. Here are two fully worked comparisons you can rerun with your own numbers, both against Modal’s per-core-second and per-GiB-second billing (published on-demand pricing, June 2026): a conservative one on general-purpose nodes, and a leaner one on burstable nodes that approaches 10x. The scenario for both: the 50,000 sandboxes a single cluster sustains, priced per hour for the whole fleet.

The conservative case, on general-purpose nodes: ~4x cheaper. Each sandbox gets 2 vCPUs and 4 GB of memory.

Hosted: 50,000 x (2 cores x $0.00003942/core-s + 4 GiB x $0.00000672/GiB-s) x 3,600 s/hr = 50,000 x $0.38 per sandbox-hour = $19,030 per hour

On your own cluster, we run one sandbox per n4-standard-2 node (2 vCPUs, 8 GB), which leaves memory headroom for the kubelet and system pods. At $67.01 per month, or $0.093 per hour per node (GKE on-demand pricing, June 2026):

BYOC: 50,000 nodes x $0.093/hr = $4,650 per hour

The lean case, on burstable nodes: ~10x cheaper. Most sandbox workloads are idle-then-burst (run a snippet, grade a test, exit), which is exactly the load burstable instances are priced for. This time we size the sandbox leaner too: 2 vCPUs and 2 GB of memory, one per AWS t4g.medium (2 vCPUs, 4 GB, $0.0336/hr, EC2 on-demand pricing, June 2026):

Hosted: 50,000 x (2 cores x $0.00003942/core-s + 2 GiB x $0.00000672/GiB-s) x 3,600 s/hr = 50,000 x $0.33 per sandbox-hour = $16,610 per hour

BYOC: 50,000 x $0.0336/hr (one t4g.medium per sandbox) = $1,680 per hour

That is a ~9.9x reduction, with the caveat burstable instances always carry: sustained all-core load eventually exhausts CPU credits, so for continuously hot sandboxes use the general-purpose math above.

Either way, hosted costs a multiple of the underlying compute. Four times on general-purpose nodes, nearly ten on burstable ones.
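To rerun the two comparisons with your own numbers, the arithmetic fits in a few lines. This is a minimal sketch that uses the rates quoted in this post as defaults; the function and parameter names are illustrative and not part of any SDK, and real bills may include items (storage, egress, control-plane fees) that this ignores.

```python
import math

SECONDS_PER_HOUR = 3600

# Rates quoted in this post (hosted per-second billing, June 2026).
CORE_SECOND_RATE = 0.00003942  # $ per core-second
GIB_SECOND_RATE = 0.00000672   # $ per GiB-second


def hosted_cost_per_hour(n_sandboxes, cores, mem_gib,
                         core_rate=CORE_SECOND_RATE, gib_rate=GIB_SECOND_RATE):
    """Fleet cost per hour under per-core-second plus per-GiB-second billing."""
    per_sandbox_second = cores * core_rate + mem_gib * gib_rate
    return n_sandboxes * per_sandbox_second * SECONDS_PER_HOUR


def byoc_cost_per_hour(n_sandboxes, node_hourly_rate, sandboxes_per_node=1):
    """Fleet cost per hour when you pay only for the nodes."""
    nodes = math.ceil(n_sandboxes / sandboxes_per_node)
    return nodes * node_hourly_rate


if __name__ == "__main__":
    # (memory per sandbox in GiB, node price in $/hr), both with 2 cores.
    cases = {
        "general-purpose": (4, 0.093),
        "burstable": (2, 0.0336),
    }
    for label, (mem_gib, node_rate) in cases.items():
        hosted = hosted_cost_per_hour(50_000, 2, mem_gib)
        byoc = byoc_cost_per_hour(50_000, node_rate)
        print(f"{label}: hosted ${hosted:,.0f}/hr, "
              f"BYOC ${byoc:,.0f}/hr, {hosted / byoc:.1f}x")
```

The `sandboxes_per_node` parameter is there because packing more than one sandbox onto a node changes the BYOC side of the ratio; both worked cases in this post use one sandbox per node.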
That multiple is the vendor’s margin, and on your own cluster you simply do not pay it.

The whole comparison in one table.

| | SkyPilot Sandboxes (BYOC) | Hosted service (e.g. Modal) |
|---|---|---|
| Where it runs | Your own K8s cluster | Vendor’s regions |
| Create + first exec (p50) | ~1.0s | ~1.2s |
| Cost/hr for 50k sandboxes | $1,680 (burstable) to $4,650 | $16,610 to $19,030 |
| Exec latency | Local to your users | Low in-region; ~3.9x tax cross-region |
| Code & data | Never leave your cloud | Sent to a third party |
| Secrets | Injected from the Secrets Manager | Configured in the vendor dashboard |

## Takeaways

Untrusted, LLM-generated code needs a real execution environment: isolated, massively concurrent, fast to start. SkyPilot Sandboxes give you that on the Kubernetes clusters you already own: 50,000 sandboxes on a single cluster, individual launches in under a second, with your code and data never leaving your cloud. The async SDK makes the fan-out a few lines, whether you’re scoring RL rollouts, running parallel evals, or giving coding agents disposable environments.

Try it:

- Get access: SkyPilot Sandboxes is in limited early access. [Sign up here](https://forms.gle/o4keAryXsVazNjyGA); it takes 20 seconds.
- Run the RL example: the full five-service pipeline lives in the [SkyPilot repo](https://github.com/skypilot-org/skypilot/tree/master/llm/rl-code-execution-sandbox), including a CPU-only connectivity test so you can verify the reward path before committing GPUs.
- Read the docs: the Sandboxes guide covers the SDK, warm pools, and secrets injection.

To receive the latest updates, please star and watch the project’s GitHub repo, follow @skypilot_org, or join the SkyPilot community Slack.