Building Agents that Don't Break Themselves

Fly.io developers are building AI agents that avoid self-destruction by executing risky commands in ephemeral sandboxes called Sprites, keeping the agent loop separate from code execution. This architecture, demonstrated by Henrique's SpriteDoc project, allows agents to safely run shell commands, tests, and file operations without compromising the host or persisting credentials.

Building agents is fun. Rebuilding agents that break themselves… less so. A lot of Fly people are building agents with less of a penchant for self-destruction by teaching their agents to do anything risky in a Sprite. You get an agent that stays alive long enough to actually use its snazzy self-improvement features, and you can allow your agent to try things that would otherwise be battleship-scale footguns. Here’s how to do it.

Your agent would be pretty useless without a shell, because this is where it does agent things. Run the test suite, apply the migration, install the dependency, delete the temp files. Unfortunately, your agent’s shell access is also what tends to ruin your afternoon, simply because “delete the temp files” and “delete the wrong files” are one fat-fingered glob apart, and as we’re frequently warned, AI can make mistakes.

This is why we have sandboxes. But a lot of people default to putting the whole agent, loop and all, inside the sandbox where it does its potentially scary work. This comes with a long list of tradeoffs that you really don’t have to make, because where your agent lives and where it runs code are two entirely separate considerations.

The agent process is a loop. It calls a model, reads the response, picks a tool, rinse and repeat. It’s a long-lived process that only becomes more competent and less stupid if memory, skills and history persist. A Fly Machine that sleeps when idle and wakes on a message, a small VPS, your laptop while you iterate: these are all fine homes for a loop calling an API, which doesn’t need a blast shield.

It’s when you want your agent to execute that things get hairy. bash -c plus whatever string the model just produced needs to be run in a padded room, somewhere the agent’s code can’t break itself or anything connected to it. And if your agent is doing more work than you are, you’re going to want a whole facility of padded rooms that can be thrown away and rebuilt on a whim.

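
To make that split concrete, here’s a minimal sketch in bash. It’s an illustration under loud assumptions, not anyone’s actual implementation: next_command is a hypothetical stand-in for the model call, and run_untrusted fakes the padded room with a temp directory and a scrubbed environment, which is not real isolation. In a real agent, run_untrusted would hand the string to your sandbox’s exec interface instead.

```bash
#!/usr/bin/env bash
# Sketch only: the "sandbox" is a local stand-in (temp dir + scrubbed env),
# NOT real isolation. Swap run_untrusted for your sandbox's remote exec.

SANDBOX_DIR="$(mktemp -d)"

# The single choke point: every model-produced string goes through here,
# and nothing else in the agent ever runs bash -c on model output.
run_untrusted() {
  local cmd="$1"
  ( cd "$SANDBOX_DIR" && env -i PATH="$PATH" HOME="$SANDBOX_DIR" bash -c "$cmd" ) 2>&1
}

# Hypothetical stand-in for the model call: returns the next command to run,
# or an empty string when the "model" is done.
next_command() {
  case "$1" in
    1) echo 'touch scratch.txt && ls' ;;
    2) echo 'echo "home is $HOME"' ;;
    *) echo '' ;;
  esac
}

# The agent loop: ask, dispatch, record, repeat. The loop and its history
# stay on the host; only execution is pushed out to the sandbox.
agent_loop() {
  local step=1 cmd
  HISTORY=()
  while cmd="$(next_command "$step")" && [ -n "$cmd" ]; do
    HISTORY+=("$cmd => $(run_untrusted "$cmd")")
    step=$((step + 1))
  done
}

agent_loop
printf '%s\n' "${HISTORY[@]}"
```

The shape is the point here: the loop keeps its memory at home, and run_untrusted is the only function that ever touches a string the model wrote.
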
Let’s take a look at two recent projects by Fly people that demonstrate this concept nicely.

First up is Henrique’s (https://github.com/vaurdan) internal Fly troubleshooting agent, SpriteDoc. SpriteDoc is multi-user and built on top of the Pi agent. Every session runs on one shared server, in one Node.js runtime. Running bash commands directly on that server is not really possible, and… dangerous. Every user’s shell would sit in the same process the agent itself runs in.

Instead, each session runs in its own Sprite. The first time a session needs the filesystem at all (a bash call, a file read, an edit), it spins up a fresh Sprite, uploads the project’s source trees, and installs whatever CLIs that session needs. Sprites spin up fast enough that it’s all but unnoticeable to the user. Every command after that runs in that same sandbox, isolated from the agent and from every other user.

This architecture leans on Sprites’ inherent disposability. A troubleshooting session shouldn’t leave anything behind, so when it’s done, the Sprite goes with it.

Sprites’ idle behavior makes this architecture cheap to run too. When a sandbox sits unused, its status drops to warm and then cold, so a session waiting between questions costs near enough to nothing. Let it go idle long enough, or archive the session, and the Sprite is torn down entirely. Revive that session later and the next command that needs a shell just brings up a new one. Nobody pays for a box sitting there doing nothing.

If you’re going to steal any part of Henrique’s design, it should be this one: SpriteDoc runs flyctl inside the sandbox authenticated as the actual user, but the user’s token is never written to the Sprite. It is injected into the environment for the duration of that one command, and it is gone when the command returns. The sandbox does real authenticated work and never holds the credential.
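
The per-command injection described above can be sketched like this. It’s a guess at the shape, not SpriteDoc’s code: sandbox_exec is a hypothetical local stand-in for a real sandbox exec call, the token value is made up, and FLY_API_TOKEN is assumed here as the variable flyctl picks up.

```bash
#!/usr/bin/env bash
# Sketch: the token exists only in the environment of one child process.
# sandbox_exec is a local stand-in; a real setup would pass the variable
# through the sandbox's exec call rather than a local subshell.

SANDBOX_DIR="$(mktemp -d)"

# Run a command in the stand-in sandbox with a clean environment.
# Leading NAME=VALUE arguments are handed to env as per-command variables.
sandbox_exec() {
  ( cd "$SANDBOX_DIR" && env -i PATH="$PATH" "$@" )
}

# Run one command with the user's token set for that command only.
# Nothing is exported and nothing is written to the sandbox's disk.
run_as_user() {
  local token="$1"
  shift
  sandbox_exec FLY_API_TOKEN="$token" bash -c "$1"
}

# The command sees the token while it runs...
run_as_user "demo-token-123" 'echo "token present: ${FLY_API_TOKEN:+yes}"'
# ...the next command, run without it, does not...
sandbox_exec bash -c 'echo "token afterwards: ${FLY_API_TOKEN:-none}"'
# ...and there is nothing at rest to find.
grep -rq "demo-token-123" "$SANDBOX_DIR" || echo "no token on disk"
```

The design choice worth copying is that the credential is an argument to a single exec call, not part of the sandbox’s state.
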
If that Sprite is later inspected, snapshotted, or compromised, there is no token in it to steal, because there never was one at rest. That’s hot for people building multi-user agents. Every user’s commands run as themselves, against their own resources, with their own permissions, and no long-lived secret ever lands on shared disk. The credential exists only in the moment it is used, and to the user it is invisible: they ask a question, the right command runs as them, and it just works.

Next up is Kyle’s (https://github.com/kylemclaren) terminal backend for Hermes Agent (https://github.com/NousResearch/hermes-agent), the open-source personal agent from Nous Research. Hermes ships with several execution backends, and you pick one with a single setting. Kyle’s backend sends every command the agent needs to run into a Sprite.

Where SpriteDoc spins up a throwaway sandbox per session, Hermes does the opposite with the same building block: it keeps one Sprite per task and resumes it next time, so everything it installed last session is still there. Same split, opposite lifecycle, one config decision apart. This means any time Hermes needs to run a shell command, it happens somewhere it can’t hurt anything, including itself.

And let’s not gloss over a point that will resonate with anyone who has worn a groove into their return key approving agent actions. When commands run in a real sandbox, Hermes skips the “are you sure?” approval prompts on dangerous commands, because the sandbox is the security boundary now. The approval dance exists to protect your host. Once the host is out of reach, you can let the agent rip.

Put your agent in a sandbox if you like. Then run its code in a different one. A Sprite is a perfectly good place to run an agent, but the agent living in a sandbox does not mean its commands should run in that same sandbox. Kyle tested exactly that: Hermes running inside a Sprite, dispatching its commands to another Sprite.
The agent’s own machine reported one identity, and the executed commands came back from a second, with a different id and a different boot. Being sandboxed did not make the agent run its untrusted commands in its own sandbox. It still pushed them out to a separate, throwaway one. That is the shape you want. The agent’s home can be durable and comfortable. The place it runs untrusted strings should still be somewhere you would be happy to set on fire.

Security always guides our architectural decisions, right? But few among us can claim to have never skirted around security best practice in the interest of saving time. That’s why it’s worth demonstrating how much time this pattern saves.

Two migration files, freshly written into a Sprite:

```bash
$ ls /root/app/migrations
001_init.sql
002_add_users.sql
```

We checkpoint that state. Then let the agent off the leash. Here’s the seemingly innocuous prompt that makes things go sideways: “Clean up the old migrations and stale binaries we don’t need anymore.”

The model decides that means:

```bash
$ rm -rf /root/app /usr/bin/python3 /usr/bin/git
$ ls /root/app/migrations
cannot access '/root/app/migrations': No such file or directory
$ git version
executable file git not found in $PATH
```

Welp. My work is gone and the agent deleted its own toolchain on the way out. If this happened on my agent’s host, this is where I have a little cry. On a Sprite, it’s a checkpoint restore with a smile:

```bash
$ ls /root/app/migrations
001_init.sql
002_add_users.sql
$ git version
git version 2.51.0
```

Both files back to the byte, git back on the path, in about nine seconds. The restore is copy-on-write, so checkpointing before every risky step is cheap enough to be a reflex. An agent that can roll back is an agent you can actually let run unattended, because the worst case is “restore and retry” instead of “restore from backup, if you have one.”

Telling your agent to be careful is silly. Just make it do things somewhere it doesn’t have to be.
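
If you want the checkpoint reflex from the restore demo as control flow, here’s a rough sketch. Big assumptions: checkpoint and restore are hypothetical stand-ins built on a plain directory copy. They are not the Sprites API, they cover one directory instead of a whole filesystem, and a full copy is not copy-on-write. The guarded helper and its health-check argument are also made up for illustration.

```bash
#!/usr/bin/env bash
# Sketch: checkpoint before a risky step, restore if the step breaks things.
# checkpoint/restore are local stand-ins (plain directory copies).

WORK="$(mktemp -d)"   # stands in for the sandbox filesystem
SNAP="$(mktemp -d)"   # stands in for checkpoint storage

checkpoint() { rm -rf "$SNAP/state" && cp -a "$WORK" "$SNAP/state"; }
restore()    { rm -rf "$WORK" && cp -a "$SNAP/state" "$WORK"; }

# Run a risky command, then a health check; roll back if the check fails.
# $1 = risky command, $2 = health check, both run from inside $WORK.
guarded() {
  checkpoint
  ( cd "$WORK" && bash -c "$1" ) || true
  if ( cd "$WORK" && bash -c "$2" ) >/dev/null 2>&1; then
    echo "kept"
  else
    restore
    echo "restored"
  fi
}

mkdir -p "$WORK/app/migrations"
touch "$WORK/app/migrations/001_init.sql" "$WORK/app/migrations/002_add_users.sql"

# An overeager cleanup, followed by a check that the migrations survived.
guarded 'rm -rf app' 'test -f app/migrations/001_init.sql'
ls "$WORK/app/migrations"
```

Because the checkpoint happens inside guarded, the agent never has to remember to be careful; the worst case for any wrapped step is a rollback.
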