Today we are launching BabyChain: a self-hosted canvas studio and durable chain API for image and video model workflows.
The short version is this: BabyChain lets you design a ComfyUI-style media chain on a canvas, then call that same chain from product code as `POST /api/v1/chains/runs`. Every step executes through provider APIs with server-side credentials, every state transition persists to AWS Aurora, and Vercel functions stay stateless.
The product has one invariant: every output becomes the next input.
If you run model chains on a local GPU workbench today, BabyChain is the version of that workflow you can deploy, call from a backend, and keep forever. The canvas is not a demo shell. It is a visual editor on top of the same durable contract your application calls.
Real generative media work is rarely one model call. It is an image model feeding an image-to-video model, often with a refine step in the middle and a video-to-video step at the end. Canvas tools made that composable, but most of them are creative workbenches. The workflow lives inside a UI, expects a local GPU or a managed model runtime, and does not naturally become an authenticated API that another product can call.
We kept hitting the same wall in our own projects: the moment a visual workflow needed to become product infrastructure (authenticated, retryable, callable from a queue, safe to expose to another backend), we had to rewrite it as glue code.
BabyChain's design goal was to make that distance zero.
Design the chain on a canvas. Call the same chain from your backend. Those should not be two different systems.
BabyChain is Apache-2.0 and built to be deployed, not hosted by us:
git clone https://github.com/babysea-community/babychain.git
cd babychain && pnpm install --frozen-lockfile
cp .env.example .env.local # DATABASE_URL, owner login, provider keys
pnpm run aurora:migrate # applies the schema, idempotent
pnpm dev # or use the one-click Vercel deploy button
For production, BabyChain is designed around AWS Aurora. For local development, it can also point at a local PostgreSQL database. The README walks through creating the Aurora cluster, setting `DATABASE_URL`, applying the schema, and deploying the app on Vercel.
The rest of this post is the architecture: how a canvas workflow becomes durable infrastructure on Aurora and Vercel.
The naive way to run a multi-model chain on serverless is to hold the whole chain in one function invocation. That dies quickly. A single image → video → video-modify workflow can spend several minutes inside provider queues, and a stateless function should not be asked to babysit that entire wait.
The durable way is to make the database the only place workflow state lives:
Aurora owns every fact about a run.
Vercel functions are stateless workers.
Each invocation advances a run by at most one step.
Any instance can pick up any run mid-chain.
When a caller creates a run, BabyChain persists the run and its ordered steps to Aurora, may opportunistically advance the first ready step, and returns without waiting for the full chain. Each subsequent poll of `GET /api/v1/chains/get/{runId}`, or a cron sweep, loads the run from Aurora, advances exactly one provider step (submit or poll), persists the result, and returns. Long chains survive serverless limits because no instance ever needs to outlive a step.
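The one-step-per-invocation loop described above can be sketched as follows. This is a minimal illustration, not BabyChain's actual runner: the `Run`, `Step`, and `Provider` shapes and the `advanceOnce` function are hypothetical, and a plain object stands in for the rows that would be loaded from and persisted to Aurora.

```typescript
// Hypothetical in-memory stand-ins for the chain_run / chain_step rows.
type StepStatus = "queued" | "submitted" | "succeeded" | "failed" | "skipped";
interface Step { status: StepStatus; output?: string }
interface Run { id: string; status: "running" | "succeeded" | "failed"; input: string; steps: Step[] }

// Hypothetical provider boundary: submit starts a job, poll checks on it.
interface Provider {
  submit(stepIndex: number, input: string): void;
  poll(stepIndex: number): { done: boolean; output?: string; error?: string };
}

// Advance a run by at most one provider action (one submit or one poll).
function advanceOnce(run: Run, provider: Provider): Run {
  if (run.status !== "running") return run;
  const i = run.steps.findIndex((s) => s.status === "queued" || s.status === "submitted");
  if (i === -1) { run.status = "succeeded"; return run; }
  const step = run.steps[i];

  if (step.status === "queued") {
    // Every output becomes the next input.
    const input = i === 0 ? run.input : run.steps[i - 1].output ?? "";
    provider.submit(i, input);
    step.status = "submitted";
    return run;
  }

  const result = provider.poll(i);
  if (!result.done) return run; // still in the provider queue; try again on the next poll

  if (result.error !== undefined) {
    // Fail fast: downstream steps can never receive their input.
    step.status = "failed";
    run.status = "failed";
    for (const later of run.steps.slice(i + 1)) {
      if (later.status === "queued") later.status = "skipped";
    }
    return run;
  }

  step.status = "succeeded";
  step.output = result.output;
  if (i === run.steps.length - 1) run.status = "succeeded";
  return run;
}
```

In a real deployment each call would be wrapped in "load the run from the database, call `advanceOnce`, persist the result", which is what lets any instance pick up any run mid-chain.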
Aurora owns every fact about a run, so a Vercel function is allowed to disappear at any moment. The chain is not.
Aurora Serverless v2 fits the workload: bursty, low-idle, spiky on demo days. The connection pool absorbs Aurora wake-ups with a 30-second connection timeout. For Aurora/RDS endpoints, deployers keep `?sslmode=require` in `DATABASE_URL`; BabyChain strips the driver-level query param and connects with TLS, including the RDS CA behavior expected by the Node.js `pg` client.
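To make the connection handling concrete, here is a rough sketch of turning `DATABASE_URL` into pool options. The function name `buildPoolOptions` and the optional `rdsCaPem` argument are assumptions for illustration; only `connectionString`, `connectionTimeoutMillis`, and `ssl` are real options of the `pg` pool, and how BabyChain actually supplies the RDS CA is not shown in this post.

```typescript
// Subset of node-postgres Pool options used in this sketch.
interface PoolOptions {
  connectionString: string;
  connectionTimeoutMillis: number;
  ssl?: { ca?: string; rejectUnauthorized: boolean };
}

// Hypothetical helper: strip the driver-level sslmode param and express TLS
// through the ssl option instead, with a 30-second connection budget.
function buildPoolOptions(databaseUrl: string, rdsCaPem?: string): PoolOptions {
  const url = new URL(databaseUrl);
  const wantsTls = url.searchParams.get("sslmode") === "require";
  url.searchParams.delete("sslmode");

  return {
    connectionString: url.toString(),
    connectionTimeoutMillis: 30_000, // long enough to ride out a cluster wake-up
    // Assumption: the caller passes the RDS CA bundle as PEM text when needed.
    ...(wantsTls ? { ssl: { ca: rdsCaPem, rejectUnauthorized: true } } : {}),
  };
}
```

The result can be passed straight to `new Pool(...)` from `pg`. A local PostgreSQL URL without `sslmode` yields no `ssl` option, so the same code path serves both environments.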
Everything durable lives in one private schema, `babychain_private`, applied idempotently by `pnpm aurora:migrate`:
| Table | Owns |
|---|---|
| `chain_run` | Run lifecycle: status, input, output, error code/message, idempotency key hash, callback intent |
| `chain_step` | Ordered steps: per-step params, provider request ids, generation ids, output files, failure details |
| `canvas` | Saved node graphs as `jsonb`, owner-scoped, with a touch trigger and a `(owner_email, updated_at desc)` index |
| `api_key` | Hashed caller keys with scopes |
| `audit_event` | Append-only audit trail |
| `callback_delivery` | Final signed-webhook delivery state |
| `babysea_webhook_delivery` | Inbound provider webhook bookkeeping |
Two design details earned their keep:
The `input_order` sidecar. PostgreSQL `jsonb` does not preserve key order, but the public run resource echoes the caller's input back in API responses, and key order is part of how people read their own request. Run creation stores a small `jsonb` array of the caller's original key order alongside the canonicalized input, and the presenter re-applies it on the way out. It is a small detail, but it matters when an API response is also a debugging surface.
Guarded state transitions. Steps only leave the `queued` state through updates with a `where status = 'queued'` guard. That single predicate makes the fail-fast path race-safe: when a step fails, the runner marks the run failed and sweeps every still-queued downstream step to `skipped` (their input can never arrive) without ever clobbering a step that a concurrent invocation already started.
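The guard can be illustrated with a small sketch. The SQL text shows the shape of a guarded sweep; the column names `run_id` and `status` are assumptions, as is the `updateWhereQueued` helper, which mimics the same predicate over an in-memory array so the race behavior is easy to see.

```typescript
type StepStatus = "queued" | "running" | "succeeded" | "failed" | "skipped";
interface StepRow { id: number; runId: string; status: StepStatus }

// Shape of the guarded sweep in SQL (column names are assumed, not confirmed).
const SWEEP_SQL = `
  update babychain_private.chain_step
     set status = 'skipped'
   where run_id = $1
     and status = 'queued'`;

// In-memory equivalent of "update ... where status = 'queued'".
// Returns the number of rows changed, like a database row count.
function updateWhereQueued(rows: StepRow[], runId: string, next: StepStatus, stepId?: number): number {
  let changed = 0;
  for (const row of rows) {
    const targeted = row.runId === runId && (stepId === undefined || row.id === stepId);
    if (targeted && row.status === "queued") {
      row.status = next;
      changed += 1;
    }
  }
  return changed;
}
```

Two invocations that try to claim the same step both issue the guarded update, but only one sees a row count of 1. A later sweep to `skipped` touches only rows that are still `queued`, so a step that was already claimed is left alone.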
Generative media is expensive enough that retries must not multiply spend. BabyChain makes idempotency a property of the whole pipeline, not one endpoint:
BabyChain hashes the `Idempotency-Key` per principal and stores it on `chain_run` with a unique constraint. A retried create replays the stored run: same id, same response, zero new provider calls.
The same discipline applies on the way out: when a run includes a webhook URL, the terminal callback is claimed on the run row and each signed delivery attempt is recorded in `callback_delivery`, so concurrent instances do not both send the same terminal callback.
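A minimal sketch of the replay behavior, under stated assumptions: SHA-256 is an illustrative choice of hash, the `RunStore` class and its `Map` stand in for the `chain_run` table and its unique constraint, and `providerCalls` is a counter added only to show that a replay spends nothing.

```typescript
import { createHash } from "node:crypto";

interface RunRecord { id: string; input: unknown }

// Scope the key to the caller so two principals can reuse the same key text.
function hashIdempotencyKey(principal: string, key: string): string {
  return createHash("sha256").update(`${principal}\n${key}`).digest("hex");
}

// Hypothetical in-memory stand-in for chain_run with a unique key-hash column.
class RunStore {
  private byKeyHash = new Map<string, RunRecord>();
  private nextId = 1;
  providerCalls = 0;

  createRun(principal: string, idempotencyKey: string, input: unknown): { run: RunRecord; replayed: boolean } {
    const keyHash = hashIdempotencyKey(principal, idempotencyKey);
    const existing = this.byKeyHash.get(keyHash);
    if (existing) return { run: existing, replayed: true }; // same id, no new spend

    const run: RunRecord = { id: `run_${this.nextId++}`, input };
    this.byKeyHash.set(keyHash, run);
    this.providerCalls += 1; // a new run is the only path that reaches a provider
    return { run, replayed: false };
  }
}
```

In a database the lookup and insert would be a single statement relying on the unique constraint, so two concurrent retries cannot both create a run.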
The studio is a multi-flow React Flow canvas. Every edit autosaves to the `canvas` table in Aurora, and surviving real-world usage took three iterations:
Debounced autosave is a data-loss bug with good intentions: it can drop the last burst of edits before a reload.
`sendBeacon` final flush
The result is the demo I like most: edit a prompt, log out, log back in on another machine. The edit is there, served from Aurora. Close the tab mid-run, reopen, and the run resumes, because run progress was never in the browser to begin with.
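The final-flush idea can be sketched like this. The `CanvasAutosave` class, the `/api/canvas/flush` path, and the injected `send` function are all hypothetical; the only real browser API referenced is `navigator.sendBeacon(url, data)`, which queues a small POST that can outlive the page.

```typescript
// Injected so the logic can run outside a browser; in the browser this
// would wrap navigator.sendBeacon.
type Send = (url: string, body: string) => boolean;

class CanvasAutosave {
  private pending: string | null = null;
  private readonly url: string;
  private readonly send: Send;

  constructor(url: string, send: Send) {
    this.url = url;
    this.send = send;
  }

  // Called on every edit. A debounced timer (not shown) would normally
  // pick this up a moment later.
  markDirty(canvasJson: string): void {
    this.pending = canvasJson;
  }

  // Called when the page is being hidden: send whatever the debounce
  // has not flushed yet, so the last burst of edits is not lost.
  finalFlush(): boolean {
    if (this.pending === null) return false;
    const queued = this.send(this.url, this.pending);
    if (queued) this.pending = null;
    return queued;
  }
}

// Browser wiring (assumed endpoint path):
// const autosave = new CanvasAutosave("/api/canvas/flush", (u, b) => navigator.sendBeacon(u, b));
// document.addEventListener("visibilitychange", () => {
//   if (document.visibilityState === "hidden") autosave.finalFlush();
// });
```

Wrapping `navigator.sendBeacon` in an arrow function matters, because passing the method unbound loses its `this` and throws in browsers.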
Six providers (Black Forest Labs, Runway, Alibaba Cloud DashScope, Google Gemini API, OpenAI, BytePlus ARK), 57 supported models, 78,948 valid chain combinations. Not one provider agrees on what "give me a 16:9 image" means.
The deepest rabbit hole was Alibaba DashScope output sizes. Each model family has different rules, documented nowhere and discovered only by probing the live API:
`qwen-image` / `qwen-image-plus` accept …
`qwen-image-max` and `z-image-turbo` cap each dimension at 2048.
`wan2.6` / `wan2.7` families enforce per-model …
Provider docs are a starting point. The live API is the truth.
So the adapter computes sizes per model. For budgeted models, a requested ratio `w:h` is fitted into a pixel budget `P_max`:
scale = sqrt(P_max / (w * h))
W = floor(scale * w / 16) * 16
H = floor(scale * h / 16) * 16
…and snapped-size models get a lookup table instead, because they allow no freedom at all. Wrong sizes now physically cannot be sent.
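The fitting formula above translates directly into code. The function name `fitToPixelBudget` and the budget of 1,000,000 pixels in the usage note are illustrative assumptions, not values from any provider.

```typescript
// Fit an aspect ratio w:h into a pixel budget, snapping both sides down
// to a multiple of 16 so the result never exceeds the budget.
function fitToPixelBudget(w: number, h: number, pMax: number): { width: number; height: number } {
  const scale = Math.sqrt(pMax / (w * h));
  const snapDown = (value: number): number => Math.floor(value / 16) * 16;
  return { width: snapDown(scale * w), height: snapDown(scale * h) };
}
```

For a 16:9 request and an assumed budget of 1,000,000 pixels, `scale` is about 83.33, so the unsnapped size is roughly 1333 × 750 and the snapped size is 1328 × 736, which is 977,408 pixels. Flooring rather than rounding is what guarantees the product stays within the budget.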
The same empirical attitude shaped everything at the boundary: Runway's per-endpoint pixel ratios, OpenAI's permanent quota 429s masquerading as transient rate limits, and BFL output URLs that expire after ~10 minutes (the UI shows honest expiry states instead of leaking alt text).
One structural decision keeps this manageable: the canvas node cards are generated from each model's schema (fields, enum options, ranges, defaults). The UI cannot offer a parameter the API would reject, because both are projections of the same source of truth.
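As a rough sketch of that idea, a single schema object can drive both the node card and the API validation. The `ModelSchema` shape, the field names, and both functions here are hypothetical and much simpler than a real model schema.

```typescript
// Hypothetical per-model schema: the single source of truth.
type FieldSchema =
  | { kind: "enum"; name: string; options: string[]; default: string }
  | { kind: "range"; name: string; min: number; max: number; default: number };

interface ModelSchema { model: string; fields: FieldSchema[] }

// Projection 1: what the canvas node card would render.
function toFormControls(schema: ModelSchema): string[] {
  return schema.fields.map((f) =>
    f.kind === "enum"
      ? `${f.name}: select(${f.options.join("|")}) = ${f.default}`
      : `${f.name}: slider(${f.min}..${f.max}) = ${f.default}`,
  );
}

// Projection 2: what the API would accept before dispatching to a provider.
function validateParams(schema: ModelSchema, params: Record<string, unknown>): string[] {
  const errors: string[] = [];
  const known = new Set(schema.fields.map((f) => f.name));
  for (const key of Object.keys(params)) {
    if (!known.has(key)) errors.push(`unknown parameter: ${key}`);
  }
  for (const f of schema.fields) {
    const value: unknown = params[f.name] ?? f.default;
    if (f.kind === "enum") {
      if (typeof value !== "string" || !f.options.includes(value)) {
        errors.push(`${f.name}: not an allowed option`);
      }
    } else if (typeof value !== "number" || value < f.min || value > f.max) {
      errors.push(`${f.name}: out of range`);
    }
  }
  return errors;
}
```

Because the select options and the validator read the same `options` array, adding or removing a value in the schema changes the UI and the API together.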
BabyChain is built around runtime invariants instead of optimistic workflows. A chain should be able to fail cleanly, resume after an interrupted function, reject invalid model roles and normalized inputs before dispatch, and preserve canvas state even if the browser disappears mid-edit.
The runtime behavior we validated end to end:
Step fails -> run goes terminal, downstream steps skipped, caller sees the provider's real error
Function instance dies -> next poll resumes the run from Aurora, idempotent resubmit
Client retries create -> same run replayed, zero duplicate spend
Tab closes mid-edit -> sendBeacon flush, canvas intact after re-login
Aurora wake-up -> 30s connection budget absorbs it
The current project gate is 237 tests plus typecheck, lint, and production build. The tests cover the runner, provider adapters, templates, API behavior, migrations, idempotency errors, callback behavior, and the schema rules that keep the canvas and API aligned.
BabyChain is already usable as a deployable starter, but the next layer is about making runs cheaper to inspect, easier to share, and safer to operate for teams:
`api_key` model.
Statelessness is a feature you design for, not a constraint you fight. Once every fact about a run lives in Aurora (runs, steps, provider ids, outputs, failures, callbacks, canvases, audit), serverless time limits, cold starts, and instance churn stop being the center of the system. Vercel gives the control plane instant deployment; Aurora gives it durable memory.
Design on the canvas. Ship the same contract as an API. Let the database remember everything.
Creators and developers: deploy it, chain your own models in your own cloud, and tell us what you automate first.
BabyChain is our entry to the H0: Hack the Zero Stack with Vercel v0 & AWS Databases hackathon. This post was created for the purposes of entering that hackathon. #H0Hackathon