{"slug": "launching-babychain-durable-image-and-video-model-chains-on-aws-aurora-and", "title": "Launching BabyChain: durable image and video model chains on AWS Aurora and Vercel", "summary": "BabyChain, a self-hosted canvas studio and durable chain API for image and video model workflows, has been launched. It allows users to design ComfyUI-style media chains on a canvas and call them from product code via a POST endpoint, with every state transition persisted to AWS Aurora and Vercel functions remaining stateless. The project is Apache-2.0 and designed for deployment, not hosting.", "body_md": "Today we are launching **BabyChain**: a self-hosted canvas studio and durable chain API for image and video model workflows.\n\nThe short version is this: BabyChain lets you design a ComfyUI-style media chain on a canvas, then call that same chain from product code as `POST /api/v1/chains/runs`. Every step executes through provider APIs with server-side credentials, every state transition persists to [AWS Aurora](https://aws.amazon.com/rds/aurora), and [Vercel](https://vercel.com) functions stay stateless.\n\nThe product has one invariant: **every output becomes the next input.**\n\nIf you run model chains on a local GPU workbench today, BabyChain is the version of that workflow you can deploy, call from a backend, and keep forever. The canvas is not a demo shell. It is a visual editor on top of the same durable contract your application calls.\n\nReal generative media work is rarely one model call. It is an image model feeding an image-to-video model, often with a refine step in the middle and a video-to-video step at the end. Canvas tools made that composable, but most of them are creative workbenches. 
The workflow lives inside a UI, expects a local GPU or a managed model runtime, and does not naturally become an authenticated API that another product can call.\n\nWe kept hitting the same wall in our own projects: the moment a visual workflow needed to become *product infrastructure* (authenticated, retryable, callable from a queue, safe to expose to another backend), we had to rewrite it as glue code.\n\nBabyChain's design goal was to make that distance zero.\n\nDesign the chain on a canvas. Call the same chain from your backend. Those should not be two different systems.\n\nBabyChain is Apache-2.0 and built to be deployed, not hosted by us:\n\n```\ngit clone https://github.com/babysea-community/babychain.git\ncd babychain && pnpm install --frozen-lockfile\ncp .env.example .env.local   # DATABASE_URL, owner login, provider keys\npnpm run aurora:migrate      # applies the schema, idempotent\npnpm dev                     # or use the one-click Vercel deploy button\n```\n\nFor production, BabyChain is designed around AWS Aurora. For local development, it can also point at a local PostgreSQL database. The README walks through creating the Aurora cluster, setting `DATABASE_URL`, applying the schema, and deploying the app on Vercel.\n\nThe rest of this post is the architecture: how a canvas workflow becomes durable infrastructure on Aurora and Vercel.\n\nThe naive way to run a multi-model chain on serverless is to hold the whole chain in one function invocation. That dies quickly. 
A single image → video → video-modify workflow can spend several minutes inside provider queues, and a stateless function should not be asked to babysit that entire wait.\n\nThe durable way is to make the database the only place workflow state lives:\n\n```\nAurora owns every fact about a run.\nVercel functions are stateless workers.\nEach invocation advances a run by at most one step.\nAny instance can pick up any run mid-chain.\n```\n\nWhen a caller creates a run, BabyChain persists the run and its ordered steps to Aurora, may opportunistically advance the first ready step, and returns without waiting for the full chain. Each subsequent poll of `GET /api/v1/chains/get/{runId}`, or a cron sweep, loads the run from Aurora, advances exactly one provider step (submit or poll), persists the result, and returns. Long chains survive serverless limits because **no instance ever needs to outlive a step**.\n\nAurora owns every fact about a run, so a Vercel function is allowed to disappear at any moment. The chain is not.\n\nAurora Serverless v2 fits the workload: bursty, low-idle, spiky on demo days. When a cluster is configured to pause, the connection pool's 30-second connection timeout absorbs Aurora wake-ups. 
For Aurora/RDS endpoints, deployers keep `?sslmode=require` in `DATABASE_URL`; BabyChain strips the driver-level query param and connects with TLS, including the RDS CA behavior expected by the Node.js `pg` client.\n\nEverything durable lives in one private schema, `babychain_private`, applied idempotently by `pnpm aurora:migrate`:\n\n| Table | Owns |\n|---|---|\n| `chain_run` | Run lifecycle: status, input, output, error code/message, idempotency key hash, callback intent |\n| `chain_step` | Ordered steps: per-step params, provider request ids, generation ids, output files, failure details |\n| `canvas` | Saved node graphs as `jsonb`, owner-scoped, with a touch trigger and a `(owner_email, updated_at desc)` index |\n| `api_key` | Hashed caller keys with scopes |\n| `audit_event` | Append-only audit trail |\n| `callback_delivery` | Final signed-webhook delivery state |\n| `babysea_webhook_delivery` | Inbound provider webhook bookkeeping |\n\nTwo design details earned their keep:\n\n**The input_order sidecar.** PostgreSQL `jsonb` does not preserve key order, but the public run resource echoes the caller's input back in API responses, and key order is part of how people read their own request. Run creation stores a small `jsonb` array of the caller's original key order alongside the canonicalized input, and the presenter re-applies it on the way out. It is a small detail, but it matters when an API response is also a debugging surface.\n\n**Guarded state transitions.** Steps only leave the `queued` state through updates with a `where status = 'queued'` guard. That single predicate makes the fail-fast path race-safe: when a step fails, the runner marks the run failed and sweeps every still-queued downstream step to `skipped` (their input can never arrive) without ever clobbering a step that a concurrent invocation already started.\n\nGenerative media is expensive enough that retries must not multiply spend. 
BabyChain makes idempotency a property of the whole pipeline, not one endpoint:\n\n`Idempotency-Key` per principal and stores it on `chain_run` with a unique constraint. A retried create replays the stored run: same id, same response, zero new provider calls.\n\nThe same discipline applies on the way out: when a run includes a webhook URL, the terminal callback is claimed on the run row and each signed delivery attempt is recorded in `callback_delivery`, so concurrent instances do not both send the same terminal callback.\n\nThe studio is a multi-flow [React Flow](https://reactflow.dev) canvas. Every edit autosaves to the `canvas` table in Aurora, and surviving real-world usage took three iterations:\n\nDebounced autosave is a data-loss bug with good intentions: it can drop the last burst of edits before a reload.\n\n`sendBeacon` final flush\n\nThe result is the demo I like most: edit a prompt, log out, log back in on another machine. The edit is there, served from Aurora. Close the tab mid-run, reopen, and the run resumes, because run progress was never in the browser to begin with.\n\nSix providers (Black Forest Labs, Runway, Alibaba Cloud DashScope, Google Gemini API, OpenAI, BytePlus ARK), 57 supported models, 78,948 valid chain combinations. No two providers agree on what \"give me a 16:9 image\" means.\n\nThe deepest rabbit hole was Alibaba DashScope output sizes. Each model family has different rules, documented nowhere and discovered only by probing the live API:\n\n`qwen-image` / `qwen-image-plus` accept\n\n`qwen-image-max` and `z-image-turbo` cap each dimension at 2048.\n\n`wan2.6` / `wan2.7` families enforce per-model\n\nProvider docs are a starting point. The live API is the truth.\n\nSo the adapter computes sizes per model. 
For budgeted models, a requested ratio `w:h` is fitted into a pixel budget `P_max`:\n\n```\nscale = sqrt(P_max / (w * h))\nW = floor(scale * w / 16) * 16\nH = floor(scale * h / 16) * 16\n```\n\n…and snapped-size models get a lookup table instead, because they allow no freedom at all. Wrong sizes now physically cannot be sent.\n\nThe same empirical attitude shaped everything at the boundary: Runway's per-endpoint pixel ratios, OpenAI's permanent quota 429s masquerading as transient rate limits, and BFL output URLs that expire after ~10 minutes (the UI shows honest loading and expiry states instead of leaking alt text).\n\nOne structural decision keeps this manageable: the canvas node cards are **generated from each model's schema** (fields, enum options, ranges, defaults). The UI cannot offer a parameter the API would reject, because both are projections of the same source of truth.\n\nBabyChain is built around runtime invariants instead of optimistic workflows. A chain should be able to fail cleanly, resume after an interrupted function, reject invalid model roles and normalized inputs before dispatch, and preserve canvas state even if the browser disappears mid-edit.\n\nThe runtime behavior we validated end to end:\n\n```\nStep fails             -> run goes terminal, downstream steps skipped, caller sees the provider's real error\nFunction instance dies -> next poll resumes the run from Aurora, idempotent resubmit\nClient retries create  -> same run replayed, zero duplicate spend\nTab closes mid-edit    -> sendBeacon flush, canvas intact after re-login\nAurora wake-up         -> 30s connection budget absorbs it when pause is enabled\n```\n\nThe current project gate is 237 tests plus typecheck, lint, and production build. 
The tests cover the runner, provider adapters, templates, API behavior, migrations, idempotency errors, callback behavior, and the schema rules that keep the canvas and API aligned.\n\nBabyChain is already usable as a deployable starter, but the next layer is about making runs cheaper to inspect, easier to share, and safer to operate for teams:\n\n`api_key` model.\n\nStatelessness is a feature you design for, not a constraint you fight. Once every fact about a run lives in Aurora (runs, steps, provider ids, outputs, failures, callbacks, canvases, audit), serverless time limits, cold starts, and instance churn stop being the center of the system. Vercel gives the control plane instant deployment; Aurora gives it durable memory.\n\nDesign on the canvas. Ship the same contract as an API. Let the database remember everything.\n\nCreators and developers: deploy it, chain your own models in your own cloud, and tell us what you automate first.\n\nBabyChain is our entry to the **H0: Hack the Zero Stack with Vercel v0 & AWS Databases** hackathon. This post was created for the purposes of entering that hackathon. 
#H0Hackathon", "url": "https://wpnews.pro/news/launching-babychain-durable-image-and-video-model-chains-on-aws-aurora-and", "canonical_source": "https://dev.to/akirayuusha/launching-babychain-durable-image-and-video-model-chains-on-aws-aurora-and-vercel-1p5h", "published_at": "2026-06-13 01:34:00+00:00", "updated_at": "2026-06-13 01:42:55.777955+00:00", "lang": "en", "topics": ["generative-ai", "ai-infrastructure", "ai-tools", "computer-vision", "ai-products"], "entities": ["BabyChain", "AWS Aurora", "Vercel", "ComfyUI", "Apache-2.0", "GitHub", "PostgreSQL", "RDS"], "alternates": {"html": "https://wpnews.pro/news/launching-babychain-durable-image-and-video-model-chains-on-aws-aurora-and", "markdown": "https://wpnews.pro/news/launching-babychain-durable-image-and-video-model-chains-on-aws-aurora-and.md", "text": "https://wpnews.pro/news/launching-babychain-durable-image-and-video-model-chains-on-aws-aurora-and.txt", "jsonld": "https://wpnews.pro/news/launching-babychain-durable-image-and-video-model-chains-on-aws-aurora-and.jsonld"}}