{"slug": "the-stateful-spot-instance", "title": "The Stateful Spot Instance", "summary": "A new tool called Architect migrates Valkey containers between spot instances in under 10 seconds, preserving all in-memory data when a spot reclamation notice is received. This allows stateful workloads like session stores and AI agent memory to run on spot instances, cutting costs by up to 90% without data loss.", "body_md": "We ran Valkey on an EKS spot instance without persistence. The instance got\nreclaimed. The container moved to another host with all its in-memory data\nintact. [Try it on a pre-configured Kubernetes cluster in the iximiuz\nLabs playground](https://labs.iximiuz.com/tutorials/architect-valkey).\n\n[The spot instance tradeoff](#the-spot-instance-tradeoff)\n\nSpot instances are up to 90% cheaper than on-demand. We use them for web servers, CI runners, batch jobs, and anything that can restart from scratch. Valkey is not one of those things. Most Valkey deployments run without persistence because speed is the whole point, so teams design around the loss.\n\nBut Valkey grew beyond caching a long time ago. Teams use it for sessions, job\nqueues, rate limiting, pub/sub, and increasingly as the [memory layer for AI\nagents](https://venturebeat.com/data/context-architecture-is-replacing-rag-as-agentic-ai-pushes-enterprise-retrieval-to-its-limits/)\nwhere it holds semantic facts, conversation context, and [vector\nembeddings](https://developers.llamaindex.ai/python/framework-api-reference/storage/vector_store/redis/).\nNone of that can disappear on a spot reclamation, so teams keep paying full\nprice. Stateful means on-demand.\n\nThere is no technical reason a stateful workload cannot run on spot. The problem is that when the instance disappears, everything on it disappears too. 
So we fixed it.\n\n[What you lose](#what-you-lose)\n\nWhen the cloud provider reclaims your spot instance, Valkey loses everything:\n\nA cache warms up again in a few minutes, but a session store logs out every user, a job queue loses tasks mid-flight, and every pub/sub subscriber disconnects and misses messages until it reconnects.\n\nThen there are the connections: they all drop. Every client has to reconnect, re-authenticate, re-subscribe.\n\nMost teams just avoid spot entirely and run Valkey on on-demand. Safe, but 2-3x the cost. Reserved instances claw most of that back, betting you'll still want that same instance one to three years out. The rest design around the loss: cache invalidation strategies, read-through patterns, warm-up scripts that pre-populate data after a restart. That works for simple caches, but breaks when Valkey holds sessions, job queues, or agent memory that cannot be reconstructed.\n\nAttaching a persistent volume looks like the fix, but EBS volumes are locked to one availability zone and slow to re-attach to a new node. Durable storage surrenders the cross-zone diversity that makes spot cheap, and still costs you minutes of downtime. Either way, a spot reclamation hurts.\n\nSpot reclamations are not sudden crashes, though. The cloud provider warns you\nfirst: 2 minutes on AWS, 30 seconds on GCP and Azure. That does not sound like\nmuch, but [Architect](/architect) needs about 10 seconds to migrate most\nworkloads (the time mostly depends on network throughput and the amount of data\nthat needs to be transferred).\n\nThe source node needs to be alive for those seconds. A kernel panic or a sudden power loss takes the process with it, and no migration can help. But sudden death is the minority: spot warns you, drains have grace periods, autoscalers cordon before scaling down. 
Mature infrastructure usually says goodbye before it goes.\n\n[30 seconds is plenty](#30-seconds-is-plenty)\n\nWhen a Kubernetes node gets a spot reclamation notice, Architect moves the container to another spot node with all its in-memory data. The container pauses on one host and picks up on another. Yes, a checkpoint travels between the nodes, but it carries the entire running process, not a data dump that a fresh Valkey has to load.\n\nClients reconnect, but to a store that already has everything. No cold cache, no thundering herd against an empty Valkey.\n\nCompare that to a traditional recovery. We will even be generous and assume RDB persistence was on, so only the writes since the last snapshot are lost:\n\n[Try it yourself](#try-it-yourself)\n\nDo not take our word for it. Run it on your own EKS cluster with [the quick\nstart](/docs/quick-start), or [try it on a pre-configured cluster on the\niximiuz Labs playground](https://labs.iximiuz.com/tutorials/architect-valkey).\nWrite your own data, trigger a migration, and see what survives. Try to break\nit. If you manage, we are most curious to [read all about\nit](mailto:contact@loopholelabs.io?subject=I%20broke%20Architect).\n\nA fresh EKS cluster takes 20+ minutes and costs money. The iximiuz Labs playground is free for up to 1 hour, ready in 3-5 minutes, and just requires GitHub authentication. One caveat: it does not run real spot instances. You drain the node yourself, the same eviction a spot interruption triggers. The migration is real, only the eviction notice is simulated.\n\nIf you want to try this on your Kubernetes cluster, Architect works best on EKS with AL2023 nodes. On GKE, use the Ubuntu node image. Other Kubernetes distributions may work, but I cannot promise it: the installer integrates tightly with containerd, and distributions that relocate it, like k3s, need manual surgery first. 
You also need Kubernetes 1.33+ and at least 2 nodes, or there is nowhere to migrate to.\n\nAdd your cluster in [the Console](https://console-v2.architect.io/) and it hands\nyou a pre-filled `helm install`. The manifest below gives you migration with\nthe data intact, plus hibernate and wake.\n\nStart a single Valkey instance with no persistence and no replicas. The annotations are the entire integration: Architect manages the container, hibernates it when idle, and network monitoring wakes it on traffic:\n\nThe 10-second scale-down is demo tuning. Real workloads set their own.\n\nTo watch the Valkey hibernation in real time, run this command in a separate tab:\n\nI find this [Console](https://console-v2.architect.io/) view most\nhelpful for understanding how Architect manages workloads:\n\nIn the [iximiuz Labs\nplayground](https://labs.iximiuz.com/tutorials/architect-valkey), the\n`valkey-cli` client and alias are already set up, so skip the next two steps.\n\nOn your own cluster, run the commands from a throwaway client pod started from the Valkey image, so nothing has to be installed locally. Give it anti-affinity to Valkey so it never lands on the node you drain later:\n\nExpose Valkey inside the cluster so the client can reach it by name, then alias\n`valkey-cli` to run inside that pod, so the command examples below work as-is:\n\nNow you can write some data, trigger a pod migration, and check that in-memory state persists:\n\nOn a real EKS cluster, you would not drain by hand. Install the [AWS Node\nTermination Handler](https://github.com/aws/aws-node-termination-handler) so\nthat it catches the interruption notice and drains the node for you (same as we did\nabove).\n\nIf you still have the watch Valkey command running, you will see it migrate to\nanother node, and then hibernate. 
Alternatively, in the\n[Console](https://console-v2.architect.io):\n\n- Click on the new pod\n- Uncheck `Only show completed hibernation events`\n- Click on the `Checkpoint downloaded` entry to see the following:\n\nNow let's check that the in-memory data is still there:\n\nThe above proves that Valkey migrated across nodes, and it preserved all its in-memory state.\n\n[Stateful on spot](#stateful-on-spot)\n\nStateful means on-demand. That has been true for as long as losing an instance meant losing everything, and it does not have to be true anymore.\n\nA single Valkey instance on spot, no persistence, no replicas. This is effectively spot pricing with on-demand guarantees. No code changes, one Helm chart, and three annotations that keep the data through the move.\n\nGo look at your cloud bill. Sort by on-demand spend. The biggest line items are\nalmost certainly stateful: session stores, message brokers, databases. All\nstuck on on-demand because the state cannot disappear. [That constraint is now\ngone.](/architect)", "url": "https://wpnews.pro/news/the-stateful-spot-instance", "canonical_source": "https://loopholelabs.io/blog/stateful-spot", "published_at": "2026-06-16 00:00:00+00:00", "updated_at": "2026-06-17 12:26:33.052420+00:00", "lang": "en", "topics": ["ai-infrastructure", "ai-agents", "ai-tools"], "entities": ["Valkey", "AWS", "EKS", "Architect", "GCP", "Azure", "EBS"], "alternates": {"html": "https://wpnews.pro/news/the-stateful-spot-instance", "markdown": "https://wpnews.pro/news/the-stateful-spot-instance.md", "text": "https://wpnews.pro/news/the-stateful-spot-instance.txt", "jsonld": "https://wpnews.pro/news/the-stateful-spot-instance.jsonld"}}