Building Long-Running Claude Managed Agents: Why State Matters More Than Compute

wpnews.pro

At 9:03 am on a Tuesday, my research agent said hello and stared at an empty** /workspace/.**

Six hours of analysis from the night before. Gone.

The cloned repository. The installed packages. The notes it had spent hours writing. Gone.

I had assumed that if an agent stopped working for the night, it could simply continue the next morning.

That was wrong.

Over the next three weeks, I rebuilt the same workflow on Tensorlake, Cloudflare, and Daytona to figure out what had happened. The hardest part of running Claude Managed Agents isn’t the model. It’s everything underneath it.

This is the exact code I ran, the things that broke, and the mistake that cost me two weeks to understand.

If you want more such information about AI, consider subscribing to my newsletter, where you will get noise-free AI information every week

Link for the newsletter: Newsletter

If you’ve never built with Claude Managed Agents, the architecture needs a minute. Skip this if you already know it.

Anthropic runs the reasoning. You run the execution.

The agent loop, session state, work queue, and retry logic all live on Anthropic’s infrastructure. You configure a Self-hosted Environment in the Claude Console. When your application starts a session, Anthropic queues the work, your orchestrator picks it up, spins up a sandbox, and the model starts issuing tool calls into that sandbox.

Every** bash, read, write, grep,** and edit call executes inside an environment you own. Anthropic never touches it. You decide what that environment looks like, what it can access, and what happens between sessions.

Anthropic’s intelligence is fixed. Your engineering determines whether that intelligence has a stable, stateful environment to work in, or a clean slate that forgets everything the moment it goes idle.

I needed an agent that could do real deep-work research on a codebase: clone a repository, read through the module structure, build an understanding of how the pieces fit together, write notes, and propose refactoring strategies.

The kind of work that takes a senior engineer a full day and an AI agent about six hours.

The key constraint: the agent couldn’t do this all at once. Sometimes I’d kick off a session at 8pm, let it run until midnight, and pick it back up the next morning. The filesystem it had built during that first session — the analysis notes, the installed tools, the half-read source files — had to be there when the next session started. Rebuilding from scratch each time wasn’t viable.

That constraint is what drove every provider decision I made.

At the start, I thought I needed a Linux environment that could run Claude Managed Agents. By the end, I realized I actually needed three things. I found them all in one place, but not until I had looked in two others first.

I did not discover all three requirements on day one.I discovered them one mistake at a time.

You drive a session through the reference orchestrator using a simple command:

make session PROMPT="Clone the repository at github.com/tensorlakeai/tensorlake. \Read through the module structure. Write a summary to /workspace/analysis.md. \Note any components that look like they could be simplified."

The orchestrator sends this prompt to Anthropic as a new session.

Anthropic picks it up, starts the agent loop, and immediately begins issuing tool calls. Those tool calls arrive at your sandbox. The agent reads files, runs bash commands, writes notes. The session runs until the task is complete or you stop it.

The agent stream looks roughly like this as it runs:

[thinking] The repository appears to be a Python SDK for…[bash] git clone https://github.com/tensorlakeai/tensorlake[bash] ls -la /workspace/tensorlake/[read] /workspace/tensorlake/tensorlake/sandbox.py[write] /workspace/analysis.md[thinking] The Sandbox class handles…

Each bracketed event is a tool call going into your sandbox. The session accumulates state inside /workspace/ across all those calls. By the end of a six-hour session, that directory contains the cloned repo, installed packages, analysis files, and intermediate notes. That’s the state that needs to survive overnight.

My first assumption was that I needed a platform that could efficiently run Claude Managed Agents. Cloudflare is optimized for high-concurrency execution. My problem turned out to be different.

The agent I was building accumulated hours of filesystem state between bursts of work. Notes, cloned repositories, installed dependencies, and intermediate analysis all needed to survive overnight. Cloudflare’s execution model wasn’t designed around that requirement.That was the first time I realized I wasn’t looking for compute.

I was looking for persistent state.

The second build solved part of the problem.The agent could accumulate state throughout a session, which initially felt like progress.

Then I wanted to test three different refactoring strategies starting from the same six-hour analysis. Instead of branching from that state, I found myself repeating the setup work each time: rebuilding context, reinstalling dependencies, and re-running analysis before I could begin the actual experiment.

That was when I discovered my second requirement.Preserving state wasn’t enough.I also needed a way to branch from an existing state without repeating hours of work.

The first thing that caught my attention was not a feature.

It was an architectural decision.

Most platforms preserve state by keeping compute alive.

This one treated compute and state as separate problems.

The docs described a suspended sandbox that could preserve its state and resume in approximately 0.6 seconds. That was the first time I saw a design that directly addressed the problem I’d been running into.

I wanted to know whether it actually worked.

I started with the problem that had sent me down this path in the first place. Could an agent suspend overnight and resume with its state intact?

It could.

And once I tested checkpointing and branching, I finally had both things I’d been looking for. That’s when the architecture started to make sense.

The deployment model here is different from a traditional always-on server, and understanding it made everything else click.

The orchestrator itself runs inside a Tensorlake sandbox with a public HTTPS endpoint. Anthropic pushes incoming work to that endpoint via webhook. When there’s no traffic, the orchestrator sandbox suspends. When a new work item arrives, it wakes in under a second, processes the request, and creates a worker sandbox for that session.

Two independent lifecycles:

The orchestrator sandbox suspends when idle, preserving its memory state including the running uvicorn process. It doesn’t accumulate per-session filesystem state, so suspending it between work items costs only storage rates.

The worker sandboxes — one per session — accumulate filesystem state throughout a session and suspend when the session ends. Their state is preserved in storage, not held by running compute.

Neither one has to stay alive to preserve the other’s state. On every other platform I’d tried, “preserve state” meant “keep something running.” Here, it means “checkpoint and stop billing.”

When you call make session PROMPT=”…”, the orchestrator sends a new session to Anthropic. Anthropic validates it, adds it to the work queue, and pushes a work item payload to your orchestrator’s webhook endpoint.

That payload contains three things the orchestrator needs: the session ID, the work ID, and the environment ID. The orchestrator wakes, reads the payload, and creates a worker sandbox from your registered image:

from tensorlake.sandbox import Sandboxsandbox = Sandbox.create(    name=session_id,    image="agent-cli",    cpus=2.0,    memory_mb=4096,    timeout_secs=3600,)sandbox.start_process(    "bash",    ["-lc", "exec python3 /opt/sandbox_entrypoint.py > /tmp/runner.log 2>&1"],    env={        "ANTHROPIC_ENVIRONMENT_KEY": environment_key,        "ANTHROPIC_SESSION_ID": session_id,        "ANTHROPIC_WORK_ID": work_id,        "ANTHROPIC_ENVIRONMENT_ID": environment_id,    },)

Two things tripped me up here.

Credentials belong in start_process(env={…}), not Sandbox.create() — they’re session-specific, not image-level configuration. And sandbox names must be valid slugs. Neither was hard to fix once I knew what was happening, but both cost me time.

The worker sandbox runs sandbox_entrypoint.py, which attaches to the Anthropic session and begins consuming tool calls. From that point, the agent has a full Linux environment: bash, git, Python, and whatever else you built into the image.

Every tool the agent needs has to live in the image before the session starts:

from tensorlake import Imageimage = (    Image(name="agent-cli", base_image="tensorlake/ubuntu-minimal")    .run("apt-get update && apt-get install -y ca-certificates curl git gh python3 python3-pip")    .run("pip install --break-system-packages 'anthropic>=0.103' 'httpx>=0.27'")    .copy("sandbox_entrypoint.py", "/opt/sandbox_entrypoint.py")    .workdir("/workspace"))image.build(registered_name="agent-cli")

I added my analysis tools here: Python packages, jq, and ripgrep. If it needed to run inside the sandbox, it lived in the image.

When a session ends, the worker sandbox suspends. Not terminates. The process state and filesystem are checkpointed to storage. Compute billing stops. The sandbox sits there as a stored snapshot until the next session begins.

Setting RESUME_SUSPENDED_SESSIONS=true tells the orchestrator what to do when the next work item arrives for an existing session: restore the suspended sandbox instead of creating a fresh one from the base image.

I set the flag and expected the sandbox to wake up immediately.

Nothing happened.

For a few minutes, I thought the integration was broken. It wasn’t. The flag doesn’t trigger a resume directly. A new incoming webhook does. The flag just tells the orchestrator which action to take when that webhook arrives. Once I understood that, the behavior made complete sense. Why would a sandbox wake up before there’s work to do?

The practical difference between fresh and resumed is simple: a fresh sandbox starts with a clean /workspace/. A resumed sandbox starts with whatever the last session left there. For a research agent that had already spent hours building an understanding of a codebase, those are completely different starting points.

The session ran overnight, suspended, and resumed the next morning with its state intact. The filesystem was exactly where I had left it.

Resume solved the overnight persistence problem. Checkpointing solved the one I’d discovered the hard way during the Daytona build.

After the full analysis was complete, I created a checkpoint and launched three sandboxes from it:

snap = sandbox.checkpoint()children = [    Sandbox.create(        snapshot_id=snap.snapshot_id,        name=f"{session_id}-strategy-{i}",        cpus=2.0,        memory_mb=4096,    )    for i in range(3)]

Each child started from exactly the same state as the parent. The cloned repository was already there. The installed dependencies were already there. The analysis notes were already there. Nothing needed to be rebuilt.

Instead of repeating six hours of analysis three separate times, I ran three parallel experiments from the same verified baseline. By the next morning, I had three implementations to compare.

The fork I’d dismissed as an edge case turned out to be the feature that mattered most.

Without it, every experiment required repeating hours of work. With it, experimentation became nearly free. This generalizes beyond refactoring: any time you want to test N variations with different prompts, different dependency versions, different approaches — you run the expensive setup once, checkpoint, and branch into as many parallel experiments as you need.

The cost of exploration drops to the cost of the experiments themselves.

My agent needed two things: zero idle compute cost while preserving filesystem state overnight, and the ability to fork mid-session to explore refactoring strategies in parallel. Those two requirements narrowed the field to one option.

I didn’t choose Tensorlake because it was the fastest or the cheapest. I chose it because it directly addressed the two problems I was trying to solve.

The suspend/resume workflow preserved state without keeping compute running. Checkpointing made it possible to branch from an existing analysis instead of repeating hours of setup work. I tested both, and they worked exactly the way I needed them to.

The hard part wasn’t choosing a provider. It was understanding my requirements. Once I understood those, the decision became obvious.

Does your agent accumulate state that’s expensive to rebuild between bursts?

**Tensorlake. **This is the exact problem suspend/resume was designed to solve. If your agent starts fresh every session, continue to the next question.

Do you need to explore multiple approaches from the same mid-session state?

**Tensorlake. **Checkpointing and branching from an existing state eliminates hours of repeated setup work. If not, continue.

Do you need massive concurrency?

**Cloudflare isolates **are designed for very large-scale concurrent execution.

Do you need self-hosted infrastructure?

Daytona is the strongest fit if infrastructure ownership or data residency is a hard requirement.

How important is setup speed?

**Cloudflare **offers the fastest path to a prototype. Tensorlake requires more initial setup but provides capabilities that become valuable once agents accumulate long-lived state.

When I started, I thought I was choosing a sandbox provider. What I was actually discovering were my requirements.

First I learned I needed persistent state. Then I learned I needed a way to branch from that state without repeating hours of work.

Tensorlake was the first platform I tried that treated compute and state as separate problems.

Once I understood that distinction, the decision became obvious. The hardest part of running long-lived agents isn’t getting them to work. It’s making sure their work survives.

Building Long-Running Claude Managed Agents: Why State Matters More Than Compute was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

source & further reading

pub.towardsai.net — original article AsyncIO in Python: What It Actually Is and Why Your ‘Async’ Code Might Not Be Async The Building Blocks of LangGraph (Part 0) Five Ways Claude Code Runs Multi-Step Work. The Two Questions That Pick the Right One.

Building Long-Running Claude Managed Agents: Why State Matters More Than Compute

Run your AI side-project on zahid.host