You did everything right. You containerized your massive model training job, deployed it to Google Kubernetes Engine (GKE), and cleverly routed it to a Spot VM node pool to save up to 90% on compute costs.
Everything is humming along perfectly for 38 hours. Then, a priority on-demand customer needs capacity, Google Cloud reclaims your underlying Spot VM, and your node vanishes.
Whether you are using preemptible Spot VMs to save money, or leveraging the
Here is a practical guide to building interruptible workloads on GKE.
When Google Cloud reclaims a Spot VM, it doesn't just pull the power cord immediately. It sends an ACPI signal to the underlying node to begin a power off cycle. Kubernetes intercepts this and translates it into a SIGTERM signal sent directly to your running containers.
You have a grace period (up to 15 seconds for non-system pods) between that SIGTERM and the fatal SIGKILL.
Your application must explicitly listen for this signal. When caught, your code should immediately stop accepting new batches, finish its current loop, flush any in-memory data to disk, and exit with a 0 (success) status.
Here is a simple example on how to catch this signal in Python:
import signal
import sys
import time
def handle_sigterm(signum, frame):
print("Received SIGTERM. Initiating graceful shutdown...")
print("State saved. Exiting cleanly.")
sys.exit(0)
signal.signal(signal.SIGTERM, handle_sigterm)
print("Starting training loop...")
while True:
time.sleep(1)
If your container dies, everything inside its local filesystem dies with it. To survive an interruption, you must periodically save your progress (model weights, optimizer states, epoch counters, etc.) to an external storage location.
Cloud Storage (GCS) is a common solution for this on Google Cloud.
"Idempotency" is a fancy way of saying that doing something twice yields the same result as doing it once.
Imagine a batch inference job that reads an image, processes it, and writes the result to a database. If your pod is preempted milliseconds after writing to the database but before it can mark the task as complete, the rescheduled pod will likely process that image again.
If your database blindly inserts new rows, you now have unintentional, duplicate data.
To build an idempotent pipeline:
If you are running a massive batch processing or inference job across thousands of files, do not write a monolithic Python script that iterates through a static CSV list. If the node dies at row 5,000, managing the state of where to restart is a nightmare.
Instead, decouple the workload:
If a Spot node is preempted mid-inference, the worker dies before sending the ACK. After a brief timeout, Pub/Sub will automatically make that specific message available again. Another surviving worker pod will pick it up seamlessly. No data lost, no manual intervention required.
Running on ephemeral compute like Spot VMs isn't just an infrastructure choice; it is a design choice. By handling termination signals, checkpointing aggressively to GCS, ensuring idempotent operations, and decoupling your queues, you can unlock massive cost savings and tap into scarce GPU pools without sacrificing reliability.