The Stateful Spot Instance



We ran Valkey on an EKS spot instance without persistence. The instance got reclaimed. The container moved to another host with all its in-memory data intact. [Try it on a pre-configured Kubernetes cluster in the iximiuz Labs playground](https://labs.iximiuz.com/tutorials/architect-valkey).

[The spot instance tradeoff](#the-spot-instance-tradeoff)

Spot instances are up to 90% cheaper than on-demand. We use them for web servers, CI runners, batch jobs, and anything that can restart from scratch. Valkey is not one of those things. Most Valkey deployments run without persistence because speed is the whole point, so teams design around the loss.

But Valkey grew beyond caching a long time ago. Teams use it for sessions, job queues, rate limiting, pub/sub, and increasingly as the memory layer for AI agents where it holds semantic facts, conversation context, and vector embeddings. None of that can disappear on a spot reclamation, so teams keep paying full price. Stateful means on-demand.

There is no technical reason a stateful workload cannot run on spot. The problem is that when the instance disappears, everything on it disappears too. So we fixed it.

What you lose

When the cloud provider reclaims your spot instance, Valkey loses everything:

A cache warms up again in a few minutes, but a session store logs out every user, a job queue loses tasks mid-flight, and every pub/sub subscriber disconnects and misses messages until it reconnects.

Then there are the connections: they all drop. Every client has to reconnect, re-authenticate, re-subscribe.

Most teams just avoid spot entirely and run Valkey on on-demand. Safe, but 2-3x the cost. Reserved instances claw most of that back, betting you'll still want that same instance one to three years out. The rest design around the loss: cache invalidation strategies, read-through patterns, warm-up scripts that pre-populate data after a restart. That works for simple caches, but breaks when Valkey holds sessions, job queues, or agent memory that cannot be reconstructed.

Attaching a persistent volume looks like the fix, but EBS volumes are locked to one availability zone and slow to re-attach to a new node. Durable storage surrenders the cross-zone diversity that makes spot cheap, and still costs you minutes of downtime. Either way, a spot reclamation hurts.

Spot reclamations are not sudden crashes, though. The cloud provider warns you first: 2 minutes on AWS, 30 seconds on GCP and Azure. That does not sound like much, but [Architect](/architect) needs about 10 seconds to migrate most workloads (the time mostly comes down to network throughput and the amount of data that needs to be transferred).

The source node needs to be alive for those seconds. A kernel panic or a sudden power loss takes the process with it, and no migration can help. But sudden death is the minority: spot warns you, drains have grace periods, autoscalers cordon before scaling down. Mature infrastructure usually says goodbye before it goes.
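To make the time budget concrete, here is a rough back-of-the-envelope sketch in Python. It only models the dominant term named above (checkpoint size divided by network throughput) and is not how Architect measures anything. The 8 GiB checkpoint, the 10 Gbit/s link, and the 3-second fixed overhead are illustrative assumptions. On AWS, the notice itself is published in instance metadata under `spot/instance-action` as a small JSON document with `action` and `time` fields, which the last helper parses.

```python
import json
from datetime import datetime, timezone

# Notice windows quoted in the text, in seconds.
NOTICE_SECONDS = {"aws": 120, "gcp": 30, "azure": 30}


def transfer_seconds(checkpoint_bytes: int, throughput_bytes_per_s: float) -> float:
    """Time to move a checkpoint of this size at a steady network throughput."""
    if throughput_bytes_per_s <= 0:
        raise ValueError("throughput must be positive")
    return checkpoint_bytes / throughput_bytes_per_s


def fits_in_notice(provider: str, checkpoint_bytes: int,
                   throughput_bytes_per_s: float, overhead_s: float = 0.0) -> bool:
    """True if transfer time plus a fixed overhead is shorter than the notice window."""
    needed = transfer_seconds(checkpoint_bytes, throughput_bytes_per_s) + overhead_s
    return needed < NOTICE_SECONDS[provider]


def seconds_until_reclaim(instance_action_json: str, now: datetime) -> float:
    """Seconds between `now` (timezone-aware) and the `time` field of an
    EC2 spot instance-action document."""
    doc = json.loads(instance_action_json)
    deadline = datetime.strptime(doc["time"], "%Y-%m-%dT%H:%M:%SZ")
    deadline = deadline.replace(tzinfo=timezone.utc)
    return (deadline - now).total_seconds()


if __name__ == "__main__":
    eight_gib = 8 * 2**30      # assumed checkpoint size
    ten_gbit = 10e9 / 8        # assumed link speed, in bytes per second
    print(f"{transfer_seconds(eight_gib, ten_gbit):.1f}s to move 8 GiB at 10 Gbit/s")
    print("fits in a 30-second notice:",
          fits_in_notice("gcp", eight_gib, ten_gbit, overhead_s=3.0))
```

Under these assumptions a few gigabytes of state fit comfortably inside even the 30-second window, while tens of gigabytes on the same link would not, which is why data volume and throughput set the limit.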

30 seconds is plenty

When a Kubernetes node gets a spot reclamation notice, Architect moves the container to another spot node with all its in-memory data. The container pauses on one host and picks up on another. Yes, a checkpoint travels between the nodes, but it carries the entire running process, not a data dump that a fresh Valkey has to load.

Clients reconnect, but to a store that already has everything. No cold cache, no thundering herd against an empty Valkey.

Compare that to a traditional recovery. We will even be generous and assume RDB persistence was on, so only the writes since the last snapshot are lost:

Try it yourself

Do not take our word for it. Run it on your own EKS cluster with [the quick start](/docs/quick-start), or [try it on a pre-configured cluster on the iximiuz Labs playground](https://labs.iximiuz.com/tutorials/architect-valkey).

Write your own data, trigger a migration, and see what survives. Try to break it. If you manage, we are most curious to [read all about it](mailto:contact@loopholelabs.io?subject=I%20broke%20Architect).

A fresh EKS cluster takes 20+ minutes and costs money. The iximiuz Labs playground is free for up to 1 hour, ready in 3-5 minutes, and just requires GitHub authentication. One caveat: it does not run real spot instances. You drain the node yourself, the same eviction a spot interruption triggers. The migration is real, only the eviction notice is simulated.

If you want to try this on your Kubernetes cluster, Architect works best on EKS with AL2023 nodes. On GKE, use the Ubuntu node image. Other Kubernetes distributions may work, but I cannot promise it: the installer integrates tightly with containerd, and distributions that relocate it, like k3s, need manual surgery first. You also need Kubernetes 1.33+ and at least 2 nodes, or there is nowhere to migrate to.

Add your cluster in [the Console](https://console-v2.architect.io/) and it hands you a pre-filled `helm install`. The manifest below gives you migration with the data intact, plus hibernate and wake.

Start a single Valkey instance with no persistence and no replicas. The annotations are the entire integration: Architect manages the container, hibernates it when idle, and network monitoring wakes it on traffic:

The 10-second scale-down is demo tuning. Real workloads set their own.

To watch the Valkey hibernation in real-time, run this command in a separate tab:

I find this Console view most helpful to understand how Architect manages workloads:

In the iximiuz Labs playground, the `valkey-cli` client and alias are already set up, so skip the next two steps.

On your own cluster, run the commands from a throwaway client pod started from the Valkey image, so nothing has to be installed locally. Give it anti-affinity to Valkey so it never lands on the node you drain later:
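A minimal sketch of such a client pod, built in Python and printed as JSON (which `kubectl apply -f -` accepts). The pod name `valkey-client`, the `app: valkey` label on the server pod, and the untagged `valkey/valkey` image are assumptions for illustration; adjust them to match your own deployment.

```python
import json
from typing import Dict, Optional


def client_pod_manifest(name: str = "valkey-client",
                        server_labels: Optional[Dict[str, str]] = None,
                        image: str = "valkey/valkey") -> dict:
    """Build a Pod that the scheduler will not place on a node already
    running a pod that matches `server_labels`."""
    # Assumed label of the Valkey server pod; not taken from the original manifest.
    server_labels = server_labels or {"app": "valkey"}
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "labels": {"app": name}},
        "spec": {
            "affinity": {
                "podAntiAffinity": {
                    # Hard rule: refuse to schedule next to the server pod.
                    "requiredDuringSchedulingIgnoredDuringExecution": [
                        {
                            "labelSelector": {"matchLabels": server_labels},
                            # One topology domain per node.
                            "topologyKey": "kubernetes.io/hostname",
                        }
                    ]
                }
            },
            "containers": [
                # Keep the container idle so commands can be exec'd into it.
                {"name": "cli", "image": image, "command": ["sleep", "infinity"]}
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(client_pod_manifest(), indent=2))
```

Saved as, say, `client_pod.py` (a hypothetical file name), the output can be piped straight into the cluster with `python client_pod.py | kubectl apply -f -`. The anti-affinity rule is only evaluated when the client pod is scheduled, so create it after Valkey is already running.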

Expose Valkey inside the cluster so the client can reach it by name, then alias `valkey-cli` to run inside that pod, so the command examples below work as-is:

Now you can write some data, trigger a pod migration, and check that in-memory state persists:

On a real EKS cluster, you would not drain by hand. Install the AWS Node Termination Handler so that it catches the interruption notice and drains the node for you (the same drain we did by hand above).

If you still have the watch Valkey command running, you will see it migrate to another node, and then hibernate. Alternatively, in the Console:

1. Click on the new pod
2. Uncheck `Only show completed hibernation events`
3. Click on the `Checkpoint downloaded` entry to see the following:

Now let's check that the in-memory data is still there:

The above proves that Valkey migrated across nodes, and it preserved all its in-memory state.
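If you would rather script the check than eyeball it, the write-then-verify idea can be sketched in Python as follows. This is an illustration, not part of the original walkthrough: the `migration-test:` key prefix and the helper names are made up, and `FakeClient` is an in-memory stand-in so the example runs without a server. Any client object with `set` and `get` methods should fit.

```python
from typing import Dict, List, Optional, Protocol, Union


class KeyValueClient(Protocol):
    """The two calls this check needs from a Valkey client."""
    def set(self, name: str, value: str) -> object: ...
    def get(self, name: str) -> Optional[Union[str, bytes]]: ...


def seed(client: KeyValueClient, count: int = 1000,
         prefix: str = "migration-test:") -> Dict[str, str]:
    """Write `count` known keys before the migration and return what was written."""
    expected = {f"{prefix}{i}": f"value-{i}" for i in range(count)}
    for key, value in expected.items():
        client.set(key, value)
    return expected


def missing_or_changed(client: KeyValueClient, expected: Dict[str, str]) -> List[str]:
    """After the migration, list every key whose value is gone or different."""
    bad = []
    for key, want in expected.items():
        got = client.get(key)
        if isinstance(got, bytes):  # many clients return bytes by default
            got = got.decode()
        if got != want:
            bad.append(key)
    return bad


class FakeClient:
    """In-memory stand-in for a real client, for demonstration only."""
    def __init__(self) -> None:
        self.data: Dict[str, str] = {}

    def set(self, name: str, value: str) -> bool:
        self.data[name] = value
        return True

    def get(self, name: str) -> Optional[str]:
        return self.data.get(name)


if __name__ == "__main__":
    client = FakeClient()
    expected = seed(client, count=10)
    # ... drain the node and wait for the pod to come back here ...
    print("keys lost or changed:", missing_or_changed(client, expected))
```

Against a real deployment, a redis-py client such as `redis.Redis(host="valkey", port=6379)` could replace `FakeClient`, since Valkey speaks the same protocol; the host name `valkey` assumes that is what you called the Service. An empty list after the drain means every key survived the move.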

Stateful on spot

Stateful means on-demand. That has been true for as long as losing an instance meant losing everything, and it does not have to be true anymore.

A single Valkey instance on spot, no persistence, no replicas. This is effectively spot pricing with on-demand guarantees. No code changes, one Helm chart, and three annotations that keep the data through the move.

Go look at your cloud bill. Sort by on-demand spend. The biggest line items are almost certainly stateful: session stores, message brokers, databases. All stuck on on-demand because the state cannot disappear. That constraint is now gone.

Source: loopholelabs.io (original article)