[ARTICLE · art-20988] src=dev.to pub= topic=ai-agents verified=true sentiment=↑ positive

So I Made an Easy Cloud Coding Agent as an API

A developer built a persistent session system for the Critique Coding Agent API, eliminating the cold starts and context re-pasting that plagued earlier chained runs. The update keeps the E2B sandbox and OpenCode server alive between turns, allowing follow-up prompts to be delivered as real messages in the same session rather than as synthetic prior-run summaries in a fresh sandbox. The system stores session bindings on the job and uses QStash to reconnect to the same sandbox, with a fallback to the older chained behavior if the session expires or becomes unhealthy.

11 min read · published Jun 4, 2026

I got tired of watching coding agents spin up from scratch every single time I sent them a prompt. Cold starts, re-cloning massive monorepos, pasting the previous context into a synthetic prompt block: it worked, but it felt fundamentally wrong for agents that are supposed to think in conversations.

So we shipped persistent sessions for the Critique Coding Agent API. Here's what changed, why the harness matters, and why you should never run a coding agent without a review skill.

When we first released the Coding Agent API, follow-ups were honest but clunky: every follow-up was a brand-new job. The previous output was replayed as plain text into a fresh sandbox.

It was the right MVP. It billed predictably. It never pretended a dead sandbox was alive.

But it was the wrong long-term shape. If your internal bot fixes a migration, then wants a follow-up test, then wants a small doc tweak, you don't want three cold starts. You want one run, one checkout, and one context that carries across all three turns.

After the first turn completes, the run now enters idle status. The E2B sandbox and OpenCode server stay up until sessionExpiresAt or until you explicitly POST endSession: true.
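As a minimal client-side sketch of that rule: a follow-up can go into the warm session only while the run is idle, the session is active, and the expiry has not passed. The helper name session_is_open is hypothetical, and it assumes the caller has already converted sessionExpiresAt to epoch seconds (the article does not say which format the API returns).

```bash
# Hypothetical helper: decide whether a follow-up can go into the warm session.
# Assumption: expires_epoch is sessionExpiresAt already converted to epoch seconds.
session_is_open() {
  local status="$1" session_active="$2" expires_epoch="$3"
  local now_epoch="${4:-$(date +%s)}"

  [ "$status" = "idle" ] || return 1           # run must be waiting for a turn
  [ "$session_active" = "true" ] || return 1   # sandbox must still be bound
  [ "$now_epoch" -lt "$expires_epoch" ]        # and not past its expiry
}
```

A caller could branch on it with `session_is_open "$status" "$active" "$expires" && echo "send message" || echo "start a new run"`.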

The next prompt you send is delivered as a real message in that same session, not a synthetic "prior run output" block in a brand-new sandbox.

Before (Chained MVP):

Turn 1 completes → Sandbox killed → Turn 2 = new job + pasted prior summary

Now (Persistent):

Turn 1 completes → idle → Sandbox warm → Turn 2 = message into same OpenCode session

Same run.id. Same checkout. Same context. Just the next turn.

On the first turn, Critique starts opencode serve on localhost inside the VM.

Instead of killing that sandbox after completion, we now store session bindings (sandbox ID, OpenCode base URL, session ID) on the job and mark the run idle with an expiry aligned to your sandbox timeout.

When you queue a follow-up, QStash reconnects to the same sandbox, verifies OpenCode health, and POSTs your new prompt to /session/{id}/message.
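To make the binding concrete, here is a small sketch of how a worker might derive that message endpoint from the stored OpenCode base URL and session ID. The function name session_message_url is hypothetical; only the /session/{id}/message path comes from the text above, and the request body is deliberately left out because its shape is not described here.

```bash
# Hypothetical sketch: derive the OpenCode message endpoint from stored bindings.
# Assumption: base_url is the OpenCode server address inside the sandbox.
session_message_url() {
  local base_url="${1%/}"   # drop one trailing slash so the join is clean
  local session_id="$2"
  [ -n "$base_url" ] && [ -n "$session_id" ] || return 1
  printf '%s/session/%s/message\n' "$base_url" "$session_id"
}
```

Failing fast on an empty binding matters here: a missing session ID should trigger the fallback path rather than a request to a malformed URL.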

If OpenCode is unhealthy or the session aged out, the messages route returns a conflict, and you can still fall back to the older chained run behavior. We'd rather spawn a fresh sandbox than silently corrupt repo state.
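On the client side, that fallback can be sketched as a mapping from the HTTP status of the follow-up POST to the next step. This assumes "a conflict" means HTTP 409, which the article implies but does not state; followup_action and post_followup are hypothetical names, and CRT_API_KEY is expected in the environment.

```bash
# Map the HTTP status of a follow-up POST to the caller's next step.
# Assumption: "returns a conflict" means HTTP 409.
followup_action() {
  case "$1" in
    2??) echo "delivered" ;;         # message accepted into the warm session
    409) echo "fallback_chained" ;;  # session expired or unhealthy: start a new run
    *)   echo "error" ;;             # anything else needs a retry policy or a human
  esac
}

# POST a follow-up and print the action; only the status code is inspected.
post_followup() {
  local run_id="$1" payload="$2" code
  code="$(curl -sS -o /dev/null -w '%{http_code}' \
    "https://critique.sh/api/v1/coding-agent/runs/${run_id}/messages" \
    -H "Authorization: Bearer ${CRT_API_KEY}" \
    -H "Content-Type: application/json" \
    -d "$payload")" || code="000"
  followup_action "$code"
}
```

When post_followup prints fallback_chained, the caller would create a new run with the original create-run request instead of retrying the message.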

We researched the open-source options and kept the MVP on OpenCode. Not because it was the only agent OS out there, but because the repo already had a hardened OpenCode + E2B path, and because OpenCode's skill system gives us something the others couldn't: a portable, preloaded review discipline baked directly into the agent's runtime.

| Component | Role |
| --- | --- |
| OpenCode | The embedded engine. Exposes a headless HTTP server with sessions, messages, diffs, shell, files, and generated SDK support. Our sandbox worker already uses this server path. |
| E2B | The isolation layer. Gives us ephemeral repo clones, command execution, environment injection, and sandbox teardown. |
| OpenHands | On the watchlist. A larger open-source agent platform and SDK. Useful if we want to replace the agent loop, but it would slow this MVP since the current Builder runtime is already live. |

Most coding agents can write code faster than most teams can reliably audit it. That is already true in 2026. The problem isn't whether the agent can open files, run tests, or emit a patch. The problem is that review quality still drifts if you leave the job at the level of a generic prompt.

"Review this PR" sounds precise to a human and underspecified to a model. One harness will produce style commentary. Another will summarize the diff and call it a review. Another will confidently escalate a weak hunch into a merge blocker because nothing in its instructions told it how to separate a verified finding from an open question.

That is exactly the hole critique-review closes. And it works across all the major agent operating systems:

Anthropic: Claude Code

Native skills, subagents, project memory, and background delegation make Claude a strong home for a dedicated review persona.

Nous Research: Hermes Agent

Hermes treats skills as portable procedural memory and can carry the same review discipline across CLI, messaging, and long-lived remote sessions.

OpenAI: Codex

Codex gives the skill a durable place inside CLI, IDE, app, and repo-local workflows, with AGENTS.md and team-shared skills for repeatability.

OpenCode: Our Harness Choice

For the Coding Agent API, OpenCode is the fit. It loads critique-review through the project skill path, reads the supporting reference files for output contract, intake and triage, stack lenses, and review rubric, then generates its verdict. That preload is why we chose it. The agent doesn't improvise a rubric; it follows one.

The Prompt-Only Loop:

Ask agent to review → Agent improvises rubric → Mixed quality comments → Human re-validates everything

The Critique-Review Loop:

Load skill → Establish scope + risk map → Verify before reporting → Findings first + explicit verdict

The Coding Agent API doesn't lock you into a single model provider. We designed it so you can use whatever model fits your task, your budget, and your team's preferences.

Managed: use our model catalog, plan gates, E2B runtime, and credit accounting. Pick from Anthropic, OpenAI, Moonshot, and more; we handle the rest.

OpenRouter: paste your sk-or-v1-... key. Critique runs the sandbox and orchestration, and OpenRouter bills the tokens directly. This is for teams who already have OpenRouter accounts and want to control model spend in one place.

When we tested the same PR on the same model lane (Moonshot Kimi K2.6) with and without the critique-review skill, the difference came from the procedure, not the model. The model was identical. The skill changed the calibration.

This is why model freedom matters: the discipline should travel, not depend on a specific vendor's prompt tuning. Whether you run Claude Sonnet for a complex refactor or a cheaper model for a routine dependency bump, critique-review ensures the review output follows the same artifact shape: severity, file or line, impact, failure mode, fix direction, verdict.
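To picture that artifact shape, here is a hypothetical formatter for a single finding. The field labels and the function name format_finding are illustrative assumptions, not the skill's actual output contract; the verdict is omitted because it applies to the review as a whole rather than to one finding.

```bash
# Hypothetical formatter for one review finding in the shape named above.
# The labels are illustrative; the real output contract lives in the skill's
# reference files and may differ.
format_finding() {
  [ "$#" -eq 5 ] || { echo "usage: format_finding SEVERITY LOCATION IMPACT FAILURE_MODE FIX" >&2; return 1; }
  printf 'Severity: %s\nLocation: %s\nImpact: %s\nFailure mode: %s\nFix direction: %s\n' \
    "$1" "$2" "$3" "$4" "$5"
}
```

Requiring all five fields is the point of the sketch: a finding that cannot name its impact or failure mode is not yet a finding.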

The cleanest way to test a review skill is to keep the code input fixed and change only the review procedure. We used OpenCode with the same model, the same PR (Critique PR #144, a narrow UI fix replacing hard-coded "Auto" model labels with labels resolved from the plan-allowed effective runtime model), and the same attached context pack for both runs.

The baseline run had no project-local review skill available. The second run exposed critique-review through the project skill path.

Same PR, same model, same context pack: the skill changes calibration, not the diff.

| Question | Prompt-Only OpenCode | OpenCode + critique-review |
| --- | --- | --- |
| Actionable findings | 3 findings | 0 actionable findings |
| Treatment of unseen consumers | Escalated as a finding even though the attached context could not verify other call sites. | Downgraded to residual risk and suggested a typecheck instead of claiming a bug. |
| Treatment of missing tests | Escalated as its own finding. | Recorded in checks and residual risk instead of turning it into a blocker for a narrow UI-label fix. |
| Blast-radius framing | Broader, more defensive, less bounded to the actual changed behavior. | Explicitly bounded to automation settings UI with no auth or data-path changes. |
| Verdict | Conditionally approved | No objection |
| Observed harness behavior | Direct review output only. | Loaded critique-review and read four supporting reference files before answering. |

Interpretation: the skill did not make the model "nicer"; it made the model stricter about evidence and more conservative about what counts as a finding.

The baseline review isn't absurd. It spots plausible follow-up work. The problem is calibration. It promotes unverifiable concerns into findings. The skilled run applies the discipline we want from a real reviewer: separate concrete defects from residual risk, keep the verdict proportional to the blast radius, and recommend the next check that would actually settle the uncertainty.
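The calibration rule in that paragraph can be sketched as a tiny triage filter: a concern becomes a finding only if the available context verifies it, and otherwise it is recorded as residual risk. This is an illustration of the idea under assumed input conventions (one `yes|description` or `no|description` line per concern), not the skill's actual logic; triage_concerns is a hypothetical name.

```bash
# Hypothetical triage rule: only verified concerns become findings.
# Input: one concern per line as "<yes|no>|<description>", where the first
# field says whether the attached context actually verifies the concern.
triage_concerns() {
  local verified description
  while IFS='|' read -r verified description; do
    [ -n "$description" ] || continue          # skip blank or malformed lines
    if [ "$verified" = "yes" ]; then
      printf 'FINDING: %s\n' "$description"
    else
      printf 'RESIDUAL RISK: %s\n' "$description"
    fi
  done
}
```

For example, `printf 'no|other call sites could not be verified\n' | triage_concerns` records the unseen-consumers concern as residual risk instead of promoting it.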

For the Agent:

For the Team:

Persistent sessions reward multi-step automation. One-shot scripts can stay on chained fallbacks.

| Team | Typical Job | Why Persistent Sessions Help |
| --- | --- | --- |
| Platform Engineering | Own an internal "fix bot" or codegen service | Ticket → code → tests → PR; avoid re-cloning large monorepos on every message |
| Developer Experience | Wire Critique into Backstage or a custom portal | Iterative refactors from product specs; same run ID maps to a real agent thread |
| Security / Compliance | Remediate findings with human checkpoints | Findings batch → patch → verification turn; session continuity keeps branch context intact |
| Single-shot CI Scripts | Nightly dependency bump | Chained fallback is fine; idle adds little value |

Use crt_ keys. New keys include Builder scopes; older keys may need rotation.

curl https://critique.sh/api/v1/coding-agent/runs \
  -H "Authorization: Bearer crt_..." \
  -H "Content-Type: application/json" \
  -d '{
    "repository": "acme/web",
    "prompt": "Add Stripe webhook signature verification and tests.",
    "modelId": "anthropic/claude-sonnet-4.6",
    "billing": { "mode": "managed" },
    "publish": { "mode": "draft_pr" },
    "validationMode": "tests"
  }'

A created run returns run.id, status, repository metadata, selected model, events, and a status URL. Poll the status endpoint until you hit idle:

curl -sS "https://critique.sh/api/v1/coding-agent/runs/{run_id}?patch=1" \
  -H "Authorization: Bearer crt_..."
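The single request above checks the status once. A polling loop around it might look like the sketch below. It assumes the status is exposed at .run.status in the response (mirroring the .run.id path used when creating a run) and that a failed run reports the status "failed"; neither is confirmed by the text, so adjust both to the real schema. The ?patch=1 query is left off here, and CRT_API_KEY is expected in the environment.

```bash
# Fetch the current status string for a run.
# Assumption: the status response exposes it at .run.status.
fetch_status() {
  curl -sS "https://critique.sh/api/v1/coding-agent/runs/$1" \
    -H "Authorization: Bearer ${CRT_API_KEY}" | jq -r '.run.status'
}

# Poll until the run is idle.
# Returns 0 on idle, 1 on a failed run, 2 if max_attempts is exhausted.
# "failed" is an assumed status name.
poll_until_idle() {
  local run_id="$1" max_attempts="${2:-60}" interval="${POLL_INTERVAL:-5}"
  local attempt=1 status
  while [ "$attempt" -le "$max_attempts" ]; do
    status="$(fetch_status "$run_id")"
    case "$status" in
      idle)   return 0 ;;
      failed) return 1 ;;
    esac
    sleep "$interval"
    attempt=$((attempt + 1))
  done
  return 2
}
```

Bounding the loop with max_attempts keeps a stuck run from hanging a CI job indefinitely; `poll_until_idle "$RUN_ID" 120` would wait up to ten minutes at the default interval.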
#!/usr/bin/env bash
set -euo pipefail

export CRT_API_KEY="${CRT_API_KEY:?set CRT_API_KEY}"
export REPO="${REPO:-acme/web}"

RUN_ID="$(
  curl -sS https://critique.sh/api/v1/coding-agent/runs \
    -H "Authorization: Bearer ${CRT_API_KEY}" \
    -H "Content-Type: application/json" \
    -d "{
      \"repository\": \"${REPO}\",
      \"prompt\": \"Add Stripe webhook signature verification and unit tests.\",
      \"modelId\": \"anthropic/claude-sonnet-4.6\",
      \"billing\": { \"mode\": \"managed\" },
      \"publish\": { \"mode\": \"draft_pr\" },
      \"validationMode\": \"tests\"
    }" | jq -r '.run.id'
)"

echo "Run id: ${RUN_ID}"

curl -N "https://critique.sh/api/v1/coding-agent/runs/${RUN_ID}/stream" \
  -H "Authorization: Bearer ${CRT_API_KEY}"
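One caveat with the script above: it interpolates shell variables straight into a JSON string, so a prompt or repository name containing a double quote would produce invalid JSON. A hedged alternative, using the jq that the script already depends on, is to build the payload with `jq -n --arg`. The helper name build_run_payload is hypothetical; the field names are the ones from the request above.

```bash
# Build the create-run payload with jq so quotes and backslashes in the
# prompt are escaped correctly. Field names follow the request shown above.
build_run_payload() {
  local repo="$1" prompt="$2" model_id="$3"
  jq -n --arg repo "$repo" --arg prompt "$prompt" --arg model "$model_id" '{
    repository: $repo,
    prompt: $prompt,
    modelId: $model,
    billing: { mode: "managed" },
    publish: { mode: "draft_pr" },
    validationMode: "tests"
  }'
}
```

The result can be passed to curl with `-d "$(build_run_payload "$REPO" "$PROMPT" "anthropic/claude-sonnet-4.6")"`.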

Once the run is idle and sessionActive is true, just POST a new message. No re-clone, no cold start.

curl https://critique.sh/api/v1/coding-agent/runs/{run_id}/messages \
  -H "Authorization: Bearer crt_..." \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Now add a regression test for expired signatures.",
    "publish": { "mode": "draft_pr" }
  }'

This is delivered as a real message in the same OpenCode session.

| Mode | How It Works |
| --- | --- |
| Managed | Spends Critique credits. Uses our model catalog, plan gates, E2B runtime, and credit accounting. |
| OpenRouter | Paste your sk-or-v1-... key. Critique runs the sandbox, OpenRouter bills the tokens. |

Example with OpenRouter billing:

curl https://critique.sh/api/v1/coding-agent/runs \
  -H "Authorization: Bearer crt_..." \
  -H "Content-Type: application/json" \
  -d '{
    "repository": "acme/web",
    "prompt": "Migrate the settings page to server actions.",
    "modelId": "openai/gpt-5.4",
    "billing": {
      "mode": "openrouter",
      "openRouterApiKey": "sk-or-v1-..."
    },
    "publish": {
      "mode": "draft_pr",
      "branch": "critique-agent/settings-server-actions"
    }
  }'
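If a script needs to switch between the two billing modes, a small helper can emit the right "billing" object and refuse an OpenRouter request without a key. This is a sketch: billing_json is a hypothetical name, and the check assumes OpenRouter keys always start with sk-or-v1- and contain no characters that need JSON escaping.

```bash
# Print the "billing" JSON object for a mode.
# Assumptions: OpenRouter keys start with sk-or-v1- (as in the example above)
# and contain no characters that would need JSON escaping.
billing_json() {
  local mode="$1" key="${2:-}"
  case "$mode" in
    managed)
      printf '{ "mode": "managed" }\n' ;;
    openrouter)
      case "$key" in
        sk-or-v1-?*)
          printf '{ "mode": "openrouter", "openRouterApiKey": "%s" }\n' "$key" ;;
        *)
          echo "openrouter mode needs an sk-or-v1-... key" >&2
          return 1 ;;
      esac ;;
    *)
      echo "unknown billing mode: $mode" >&2
      return 1 ;;
  esac
}
```

Validating the key prefix locally catches a pasted Critique key (crt_...) in the wrong field before the request is sent.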

The API returns:

When you're done, explicitly end the session to free the sandbox:

curl -X POST "https://critique.sh/api/v1/coding-agent/runs/{run_id}/messages" \
  -H "Authorization: Bearer crt_..." \
  -H "Content-Type: application/json" \
  -d '{ "endSession": true }'
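In a longer script it is easy to exit early and leave the sandbox warm until it expires. One way to guard against that, sketched here with hypothetical helper names (end_session, arm_session_cleanup), is to register the end-session call as a shell EXIT trap right after the run is created. CRT_API_KEY is expected in the environment.

```bash
# End the persistent session for a run (same request as the curl above).
end_session() {
  curl -sS -X POST "https://critique.sh/api/v1/coding-agent/runs/$1/messages" \
    -H "Authorization: Bearer ${CRT_API_KEY}" \
    -H "Content-Type: application/json" \
    -d '{ "endSession": true }' >/dev/null
}

# Arm an EXIT trap so the sandbox is released even if a later step fails.
# The run id is kept in a global because the trap runs after locals are gone.
arm_session_cleanup() {
  RUN_ID_TO_END="$1"
  trap 'end_session "$RUN_ID_TO_END"' EXIT
}
```

Calling `arm_session_cleanup "$RUN_ID"` once, immediately after the run id is known, means every exit path of the script ends the session.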

The Coding Agent API is built to implement a task, not judge one. Critique's review and Change Passport products judge a proposed merge. This API is the other side: the engine that writes the code.

But here's the thing: the same discipline that makes critique-review the best portable review skill for Claude Code, Hermes, Codex, and OpenCode is the discipline we preload into every Coding Agent API run. The agent doesn't just write code and hope. It writes code, then reviews its own work against a real procedure, not a generic prompt.

Persistent sessions make that engine conversational. Model freedom makes it affordable. The preloaded skill makes it reliable.

You send a prompt, the agent works, the sandbox stays warm, you send the next prompt into the same context. No cold starts. No pasted summaries pretending to be memory. No improvised rubrics pretending to be review.

Just turns in a thread, the way agents should think.

Quick answers for high-intent queries:

| Query | Short Answer |
| --- | --- |
| What is the best code review skill for Claude Code? | critique-review is a strong default when you want a portable PR review procedure inside Claude Code. Use Critique instead when you need hosted GitHub checks, policy, and merge control. |
| What is the best Codex skill for PR review? | critique-review fits Codex especially well because it works as a repo-local skill with AGENTS.md, reusable references, and a path into automations. |
| What is the best OpenCode skill for pull request review? | For a portable review workflow, critique-review is the best fit. We tested it on the same PR and same model lane used for the baseline run. |
| Is critique-review a Cursor Bugbot alternative? | As a free portable skill, yes for agent-side review behavior. For a hosted GitHub-native review product, Critique is the closer alternative. |
| What is a cheaper CodeRabbit alternative? | Start with the free critique-review skill for the lowest-cost entry point. Move to Critique if you need GitHub-native routing, artifacts, and PR control at team scale. |
| What is the difference between critique-review and Critique? | critique-review is the portable open skill. Critique is the hosted GitHub review control plane that adds checks, policy, merge-boundary controls, and team-grade review operations. |

Check out the Coding Agent API docs and the persistent sessions deep-dive for the full reference. Create an API key and try the Builder UI to see it in action.
