# The Stateful Spot Instance

> Source: <https://loopholelabs.io/blog/stateful-spot>
> Published: 2026-06-16 00:00:00+00:00

We ran Valkey on an EKS spot instance without persistence. The instance got
reclaimed. The container moved to another host with all its in-memory data
intact. [Try it on a pre-configured Kubernetes cluster in the iximiuz
Labs playground](https://labs.iximiuz.com/tutorials/architect-valkey).

## The spot instance tradeoff

Spot instances are up to 90% cheaper than on-demand. We use them for web servers, CI runners, batch jobs, and anything that can restart from scratch. Valkey is not one of those things. Most Valkey deployments run without persistence because speed is the whole point, so teams design around the loss.

But Valkey grew beyond caching a long time ago. Teams use it for sessions, job
queues, rate limiting, pub/sub, and increasingly as the [memory layer for AI
agents](https://venturebeat.com/data/context-architecture-is-replacing-rag-as-agentic-ai-pushes-enterprise-retrieval-to-its-limits/)
where it holds semantic facts, conversation context, and [vector
embeddings](https://developers.llamaindex.ai/python/framework-api-reference/storage/vector_store/redis/).
None of that can disappear on a spot reclamation, so teams keep paying full
price. Stateful means on-demand.

There is no technical reason a stateful workload cannot run on spot. The problem is that when the instance disappears, everything on it disappears too. So we fixed it.

## What you lose

When the cloud provider reclaims your spot instance, Valkey loses everything.

A cache warms up again in a few minutes, but a session store logs out every user, a job queue loses tasks mid-flight, and every pub/sub subscriber disconnects and misses messages until it reconnects.

Then there are the connections: they all drop. Every client has to reconnect, re-authenticate, re-subscribe.

Most teams just avoid spot entirely and run Valkey on on-demand. Safe, but 2-3x the cost. Reserved instances claw most of that back, betting you'll still want that same instance one to three years out. The rest design around the loss: cache invalidation strategies, read-through patterns, warm-up scripts that pre-populate data after a restart. That works for simple caches, but breaks when Valkey holds sessions, job queues, or agent memory that cannot be reconstructed.

Attaching a persistent volume looks like the fix, but EBS volumes are locked to one availability zone and slow to re-attach to a new node. Durable storage surrenders the cross-zone diversity that makes spot cheap, and still costs you minutes of downtime. Either way, a spot reclamation hurts.

Spot reclamations are not sudden crashes, though. The cloud provider warns you
first: 2 minutes on AWS, 30 seconds on GCP and Azure. That does not sound like
much, but [Architect](/architect) needs about 10 seconds to migrate most
workloads (the time mostly comes down to network throughput and the amount of
data that needs to be transferred).
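
To get a feel for that budget, here is a back-of-the-envelope sketch in Python. It only models the two factors named above (data size and network throughput) plus a hypothetical fixed overhead for pausing and resuming the process. The function name, the overhead value, and the example numbers are illustrative assumptions, not measured Architect figures.

```python
from dataclasses import dataclass

# Notice windows quoted in the text, in seconds.
NOTICE_SECONDS = {"aws": 120, "gcp": 30, "azure": 30}


@dataclass
class MigrationEstimate:
    transfer_seconds: float
    total_seconds: float
    fits: bool


def estimate_migration(provider: str, data_gib: float, throughput_gbit_s: float,
                       fixed_overhead_s: float = 2.0) -> MigrationEstimate:
    """Estimate whether a migration fits inside the provider's notice window.

    fixed_overhead_s is a hypothetical allowance for pausing and resuming the
    process. It is an assumption for illustration, not an Architect number.
    """
    if throughput_gbit_s <= 0:
        raise ValueError("throughput must be positive")
    # Convert GiB of process memory to gigabits on the wire.
    data_gbit = data_gib * 8 * (1024 ** 3) / 1e9
    transfer = data_gbit / throughput_gbit_s
    total = transfer + fixed_overhead_s
    return MigrationEstimate(transfer, total, total <= NOTICE_SECONDS[provider])


if __name__ == "__main__":
    # Example inputs are made up: 4 GiB of state over a 10 Gbit/s link.
    print(estimate_migration("gcp", data_gib=4, throughput_gbit_s=10))
```

The model is deliberately naive: it ignores compression, incremental transfer, and contention on the link, so treat the result as an order-of-magnitude check rather than a prediction.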

The source node needs to be alive for those seconds. A kernel panic or a sudden power loss takes the process with it, and no migration can help. But sudden death is the minority: spot warns you, drains have grace periods, autoscalers cordon before scaling down. Mature infrastructure usually says goodbye before it goes.

## 30 seconds is plenty

When a Kubernetes node gets a spot reclamation notice, Architect moves the container to another spot node with all its in-memory data. The container pauses on one host and picks up on another. Yes, a checkpoint travels between the nodes, but it carries the entire running process, not a data dump that a fresh Valkey has to load.

Clients reconnect, but to a store that already has everything. No cold cache, no thundering herd against an empty Valkey.

Compare that to a traditional recovery. We will even be generous and assume RDB persistence was on, so only the writes since the last snapshot are lost:

## Try it yourself

Do not take our word for it. Run it on your own EKS cluster with [the quick
start](/docs/quick-start), or [try it on a pre-configured cluster on the
iximiuz Labs playground](https://labs.iximiuz.com/tutorials/architect-valkey).
Write your own data, trigger a migration, and see what survives. Try to break
it. If you manage, we are most curious to [read all about
it](mailto:contact@loopholelabs.io?subject=I%20broke%20Architect).

A fresh EKS cluster takes 20+ minutes and costs money. The iximiuz Labs playground is free for up to 1 hour, ready in 3-5 minutes, and just requires GitHub authentication. One caveat: it does not run real spot instances. You drain the node yourself, the same eviction a spot interruption triggers. The migration is real, only the eviction notice is simulated.

If you want to try this on your Kubernetes cluster, Architect works best on EKS with AL2023 nodes. On GKE, use the Ubuntu node image. Other Kubernetes distributions may work, but I cannot promise it: the installer integrates tightly with containerd, and distributions that relocate it, like k3s, need manual surgery first. You also need Kubernetes 1.33+ and at least 2 nodes, or there is nowhere to migrate to.

Add your cluster in [the Console](https://console-v2.architect.io/) and it hands
you a pre-filled `helm install` command. The manifest below gives you migration
with the data intact, plus hibernate and wake.

Start a single Valkey instance with no persistence and no replicas. The annotations are the entire integration: Architect manages the container, hibernates it when idle, and network monitoring wakes it on traffic:

The 10-second scale-down is demo tuning. Real workloads set their own.

To watch the Valkey hibernation in real time, run this command in a separate tab:

I find this [Console](https://console-v2.architect.io/) view most
helpful for understanding how Architect manages workloads:

In the [iximiuz Labs
playground](https://labs.iximiuz.com/tutorials/architect-valkey), the
`valkey-cli` client and alias are already set up, so skip the next two steps.

On your own cluster, run the commands from a throwaway client pod started from the Valkey image, so nothing has to be installed locally. Give it anti-affinity to Valkey so it never lands on the node you drain later:
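
As a rough illustration of the anti-affinity part, here is a Python sketch that emits a Pod manifest as JSON, which `kubectl apply -f -` accepts. The pod name, the `app: valkey` label, and the `valkey/valkey:8` image tag are assumptions for this example. Match them to whatever labels and image your Valkey workload actually uses. The `podAntiAffinity` fields themselves are standard Kubernetes.

```python
import json


def client_pod_manifest(name: str = "valkey-client",
                        valkey_label: tuple[str, str] = ("app", "valkey"),
                        image: str = "valkey/valkey:8") -> dict:
    """Build a throwaway client Pod that refuses to schedule next to Valkey.

    All default values are hypothetical and should be adapted to your cluster.
    """
    key, value = valkey_label
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "affinity": {
                "podAntiAffinity": {
                    # Hard rule, evaluated at scheduling time only.
                    "requiredDuringSchedulingIgnoredDuringExecution": [{
                        "labelSelector": {"matchLabels": {key: value}},
                        # One topology domain per node.
                        "topologyKey": "kubernetes.io/hostname",
                    }]
                }
            },
            "containers": [{
                "name": "client",
                "image": image,
                # Keep the pod alive so commands can be exec'd into it.
                "command": ["sleep", "infinity"],
            }],
        },
    }


if __name__ == "__main__":
    print(json.dumps(client_pod_manifest(), indent=2))
```

Saved as, say, `client_pod.py`, it could be applied with `python client_pod.py | kubectl apply -f -`. Because the rule is only checked at scheduling time, it keeps the client off the node Valkey is on when the client starts, which is the node you drain later.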

Expose Valkey inside the cluster so the client can reach it by name, then alias
`valkey-cli` to run inside that pod, so the command examples below work as-is:

Now you can write some data, trigger a pod migration, and check that in-memory state persists:

On a real EKS cluster, you would not drain by hand. Install the [AWS Node
Termination Handler](https://github.com/aws/aws-node-termination-handler) so
that it catches the interruption notice and drains the node for you (the same
drain we did by hand above).
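
For context on what that notice looks like: on AWS, a spot interruption shows up in the instance metadata service at `/latest/meta-data/spot/instance-action` as a small JSON document with an `action` and a `time`. The sketch below parses such a payload and computes how much of the warning window is left. The function name and the sample payload are illustrative, and the sketch does not show how the Node Termination Handler is implemented.

```python
import json
from datetime import datetime, timezone

# Instance metadata path for spot interruption notices.
# It returns 404 until the instance has actually been marked for interruption.
INSTANCE_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"


def seconds_until_interruption(payload: str, now: datetime) -> float:
    """Return the seconds between `now` and the interruption time in the notice."""
    notice = json.loads(payload)
    deadline = datetime.strptime(
        notice["time"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return (deadline - now).total_seconds()


if __name__ == "__main__":
    # Made-up sample payload in the documented shape.
    sample = '{"action": "terminate", "time": "2026-06-16T08:22:00Z"}'
    now = datetime(2026, 6, 16, 8, 20, 0, tzinfo=timezone.utc)
    print(seconds_until_interruption(sample, now))
```

Fetching the payload is left out on purpose: with IMDSv2 you first need a session token, and in a cluster you would normally let a handler do the polling rather than roll your own.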

If the watch command from earlier is still running, you will see Valkey migrate
to another node, and then hibernate. Alternatively, in the
[Console](https://console-v2.architect.io):

- Click on the new pod
- Uncheck `Only show completed hibernation events`
- Click on the `Checkpoint downloaded` entry to see the following:

Now let's check that the in-memory data is still there:

This proves that Valkey migrated across nodes and preserved all its in-memory state.
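
If you want one more signal that the process itself moved, rather than a fresh Valkey starting up, one option is to compare the `run_id` and `uptime_in_seconds` fields from `INFO server` before and after the drain. A restarted server generates a new `run_id` and its uptime starts over. The sketch below assumes you saved the two `INFO server` outputs as text. The helper names are made up for this example.

```python
def parse_info(text: str) -> dict[str, str]:
    """Parse `INFO` output ("key:value" lines, "# Section" headers) into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value
    return fields


def same_process(before: str, after: str) -> bool:
    """True if both snapshots come from the same, never-restarted server process."""
    b, a = parse_info(before), parse_info(after)
    return (b["run_id"] == a["run_id"]
            and int(a["uptime_in_seconds"]) >= int(b["uptime_in_seconds"]))


if __name__ == "__main__":
    # Made-up snapshots in the shape INFO server returns.
    before = "# Server\r\nrun_id:abc123\r\nuptime_in_seconds:100\r\n"
    after = "# Server\r\nrun_id:abc123\r\nuptime_in_seconds:160\r\n"
    print(same_process(before, after))
```

The two inputs can be captured with `valkey-cli INFO server` before the drain and again after the pod is running on the new node.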

## Stateful on spot

Stateful means on-demand. That has been true for as long as losing an instance meant losing everything, and it does not have to be true anymore.

A single Valkey instance on spot, no persistence, no replicas. This is effectively spot pricing with on-demand guarantees. No code changes, one Helm chart, and three annotations that keep the data through the move.

Go look at your cloud bill. Sort by on-demand spend. The biggest line items are
almost certainly stateful: session stores, message brokers, databases. All
stuck on on-demand because the state cannot disappear. [That constraint is now
gone.](/architect)
