Spot VMs

How Spot VMs work on Google Compute Engine, when to use them, and how to design your workloads to handle preemption.

What Are Spot VMs?

Spot VMs let you use spare Compute Engine capacity at a steep discount (60-91% off on-demand pricing). The trade-off is that Google can reclaim the capacity at any time with a short termination notice. Spot VMs are the current recommended approach, replacing the older preemptible VMs.

Key facts:

Discount: 60% to 91% off on-demand, depending on machine type and region
No minimum or maximum runtime (old preemptible VMs had a 24-hour limit)
No live migration, no SLA
Free tier credits do not apply
Committed use discounts (CUDs) and sustained use discounts (SUDs) do not apply to Spot VMs
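To make the discount range concrete, here is a small sketch that turns an on-demand hourly price into the Spot price range implied by the 60% and 91% figures above. The $0.10 on-demand rate and the function name spot_price_range are made up for illustration; real Spot prices depend on machine type and region and change over time.

```shell
#!/bin/sh
# spot_price_range ON_DEMAND_HOURLY
# Prints the lowest and highest hourly Spot price implied by a 91% and a
# 60% discount, in that order.
spot_price_range() {
  awk -v p="$1" 'BEGIN {
    printf "%.4f %.4f\n", p * (1 - 0.91), p * (1 - 0.60)
  }'
}

# Hypothetical on-demand rate of $0.10 per hour (not a real price).
spot_price_range 0.10   # prints: 0.0090 0.0400
```

Under these assumed numbers, a VM that costs $0.10 per hour on demand would cost somewhere between roughly $0.009 and $0.04 per hour as a Spot VM.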

Spot VMs vs Preemptible VMs

Feature	Spot VMs	Preemptible VMs
Max runtime	No limit	24 hours
Discount	60-91%	60-91% (same pricing model as Spot VMs)
Preemption notice	0s or 120s (configurable, preview)	30s only
Termination action	STOP or DELETE	Always STOP
Status	Current, recommended	Legacy

Note: If you are still using preemptible VMs, migrate to Spot VMs. They offer more flexibility (no 24-hour limit, configurable termination action) and are the actively supported option.
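For scripted deployments, the migration is largely a flag change. The sketch below rewrites a gcloud command line that uses the legacy --preemptible flag into the Spot form (--provisioning-model=SPOT plus an explicit --instance-termination-action). The helper name to_spot_flags and the sample command are illustrative, and the sed pattern assumes the flag appears literally as --preemptible.

```shell
#!/bin/sh
# to_spot_flags [ACTION]
# Reads gcloud command lines on stdin and replaces the legacy --preemptible
# flag with the Spot provisioning model. ACTION is the termination action
# to request (STOP or DELETE) and defaults to STOP.
to_spot_flags() {
  action="${1:-STOP}"
  sed "s/--preemptible/--provisioning-model=SPOT --instance-termination-action=${action}/g"
}

# Example with a made-up instance name:
echo "gcloud compute instances create worker-1 --preemptible --zone=us-central1-a" \
  | to_spot_flags DELETE
# prints: gcloud compute instances create worker-1 --provisioning-model=SPOT --instance-termination-action=DELETE --zone=us-central1-a
```

Review the rewritten commands before running them, since instance templates and infrastructure-as-code definitions may express the same setting differently.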

How Preemption Works

When Google needs to reclaim Spot VM capacity:

Metadata update: The VM’s preempted metadata is set to TRUE
ACPI signal: An ACPI G2 Soft Off signal is sent to the VM
Shutdown window: Your shutdown script has up to 30 seconds to run (save state, drain traffic, upload checkpoints)
Forced termination: If the VM has not stopped after 30 seconds, an ACPI G3 Mechanical Off signal forces termination
Final state: The VM enters TERMINATED state (default) or is deleted, depending on your configured termination action

Preemption notice: By default, the notice is 0 seconds (you only get the 30-second shutdown window). You can configure a 120-second notice (in Preview) to get advance warning before the ACPI signal.

Detecting preemption in a script:

# Check if the VM is about to be preempted
if curl -s -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/preempted" | grep -q "TRUE"; then
  echo "VM is being preempted. Saving state..."
  # Save checkpoints, drain traffic, etc.
fi

Good and Bad Use Cases

Ideal for Spot VMs:

Workload	Why It Works
Batch processing	Jobs can be checkpointed and resumed
CI/CD builds	Failed builds can be retried
Distributed data processing (Spark, Hadoop)	Frameworks handle worker failure
Image/video rendering	Frame-by-frame processing, retry on failure
Stateless web serving (with MIG)	MIG auto-replaces preempted instances
Development and testing	Non-critical, interruption is acceptable

Bad fit for Spot VMs:

Workload	Why It Fails
Single-instance databases	Data loss risk, no failover
Long-running monolithic apps	Cannot checkpoint or resume
Real-time interactive workloads	Latency spikes on preemption
Workloads that cannot tolerate any interruption	Preemption can happen at any time, and no SLA applies

Designing for Failure

To use Spot VMs effectively, your workload must handle interruption gracefully.

Managed Instance Groups (MIGs): Use a MIG with Spot VMs. When instances are preempted, the MIG automatically recreates them when capacity is available. This gives you self-healing without manual intervention.

gcloud compute instance-groups managed create spot-mig \
  --template=spot-template \
  --size=10 \
  --zone=us-central1-a
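The command above assumes an instance template named spot-template already exists. A minimal sketch of creating one follows. The machine type and the choice of STOP as the termination action are assumptions, not values prescribed by this guide. The wrapper create_spot_template is a hypothetical helper that prints the command by default so it can be reviewed, and only executes it when DRY_RUN=0 is set.

```shell
#!/bin/sh
# create_spot_template NAME
# Builds a gcloud command that creates an instance template using the Spot
# provisioning model. Prints the command unless DRY_RUN=0 is set.
create_spot_template() {
  name="$1"
  set -- gcloud compute instance-templates create "$name" \
    --machine-type=e2-medium \
    --provisioning-model=SPOT \
    --instance-termination-action=STOP
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"          # really create the template
  else
    echo "$*"     # dry run: show what would be executed
  fi
}

create_spot_template spot-template
# prints: gcloud compute instance-templates create spot-template --machine-type=e2-medium --provisioning-model=SPOT --instance-termination-action=STOP
```

Running it as DRY_RUN=0 create_spot_template spot-template would create the template for real, after which the MIG command can reference it.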

Shutdown scripts: Write a shutdown script that saves state before the 30-second window expires. Upload partial results to Cloud Storage, drain from load balancers, or send a notification.

gcloud compute instances create spot-worker \
  --machine-type=e2-medium \
  --provisioning-model=SPOT \
  --metadata=shutdown-script='#!/bin/bash
# Upload checkpoint to Cloud Storage
gsutil cp /tmp/checkpoint.json gs://my-bucket/checkpoints/
# Notify job coordinator
curl -X POST https://coordinator.example.com/worker-down'
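Because the shutdown window is short, it can help to put a hard time limit on each step so that a slow upload cannot starve the steps after it. The following is a sketch of that idea using the coreutils timeout command. The function name run_step and the 20-second and 5-second budgets are illustrative choices; the bucket and coordinator URL are the placeholders from the example above.

```shell
#!/bin/sh
# run_step SECONDS COMMAND [ARGS...]
# Runs a command with a hard time limit and reports whether it finished.
# Always returns 0 so that later steps still get a chance to run.
run_step() {
  limit="$1"; shift
  if timeout "$limit" "$@"; then
    echo "ok: $1"
  else
    echo "failed or timed out after ${limit}s: $1"
  fi
  return 0
}

# Intended use inside a shutdown script (budgets sum to under 30 seconds):
#   run_step 20 gsutil cp /tmp/checkpoint.json gs://my-bucket/checkpoints/
#   run_step 5 curl -s -X POST https://coordinator.example.com/worker-down
```

Ordering the steps by importance means the most valuable state is saved first if the window runs out.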

Checkpoints for batch jobs: Design batch jobs to save progress periodically. If the VM is preempted, the job resumes from the last checkpoint on the next run.
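As a rough illustration of the checkpoint idea, the sketch below processes a file one line at a time and records how many lines are finished, so a rerun skips work that was already done. The function names and file layout are hypothetical, and handle_item is a stand-in for real work.

```shell
#!/bin/sh
# process_with_checkpoint INPUT_FILE CHECKPOINT_FILE
# Processes INPUT_FILE line by line. After each line, the number of completed
# lines is written to CHECKPOINT_FILE so that a rerun resumes after it.
process_with_checkpoint() {
  input="$1"; checkpoint="$2"
  done_count=0
  [ -f "$checkpoint" ] && done_count=$(cat "$checkpoint")
  line_no=0
  while IFS= read -r item; do
    line_no=$((line_no + 1))
    # Skip items finished before the previous interruption.
    [ "$line_no" -le "$done_count" ] && continue
    handle_item "$item"
    # Write to a temp file and rename so the checkpoint is never half-written.
    echo "$line_no" > "$checkpoint.tmp" && mv "$checkpoint.tmp" "$checkpoint"
  done < "$input"
}

# Placeholder for the real work; replace with your own processing.
handle_item() {
  echo "processed $1"
}
```

A call might look like process_with_checkpoint items.txt progress.ckpt. In practice the checkpoint should live somewhere that survives the VM, such as Cloud Storage or a persistent disk, particularly when the termination action is DELETE.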

Metadata polling: For the 120-second notice (Preview), poll the metadata server to detect preemption early and begin graceful shutdown before the 30-second window.
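A polling loop along these lines could run in the background of a worker. This is a sketch, not a reference implementation: the function names and the 5-second default interval are assumptions, and it only produces meaningful results when run on a Compute Engine VM.

```shell
#!/bin/sh
# read_preempted_flag: prints TRUE or FALSE from the metadata server.
# --max-time keeps a stuck request from blocking the loop.
read_preempted_flag() {
  curl -s --max-time 5 -H "Metadata-Flavor: Google" \
    "http://metadata.google.internal/computeMetadata/v1/instance/preempted"
}

# wait_for_preemption [INTERVAL_SECONDS]
# Polls until the flag reads TRUE, then returns so the caller can start
# its graceful shutdown.
wait_for_preemption() {
  interval="${1:-5}"
  until [ "$(read_preempted_flag)" = "TRUE" ]; do
    sleep "$interval"
  done
  echo "preemption detected"
}

# Intended use on a VM:
#   wait_for_preemption 5 && /opt/app/graceful-shutdown.sh   # hypothetical script path
```

The metadata server also accepts a wait_for_change=true query parameter, which holds the request open until the value changes and can replace the sleep-based loop.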

Retries and queues: Use Pub/Sub or Cloud Tasks to queue work items, and acknowledge each item only after the work is finished. If a Spot VM is preempted mid-task, the unacknowledged item is redelivered or retried, and another worker picks it up.
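The key design point with a queue is to acknowledge a message only after the work has succeeded. The sketch below shows that ordering with the gcloud Pub/Sub commands. The subscription name and function names are hypothetical, and the sketch assumes that the pull output exposes ackId and a base64-encoded message.data field through --format; verify this against your gcloud version before relying on it.

```shell
#!/bin/sh
SUBSCRIPTION="${SUBSCRIPTION:-work-items-sub}"   # hypothetical subscription name

# pull_one: prints "<ackId><TAB><base64 payload>" for one message, or nothing.
# Note that --auto-ack is deliberately not used.
pull_one() {
  gcloud pubsub subscriptions pull "$SUBSCRIPTION" --limit=1 \
    --format='value(ackId,message.data)'
}

# ack_message ACK_ID: tells Pub/Sub the message has been fully processed.
ack_message() {
  gcloud pubsub subscriptions ack "$SUBSCRIPTION" --ack-ids="$1"
}

# handle_payload: placeholder for the real work.
handle_payload() {
  echo "working on: $1"
}

# process_one: pull a message, do the work, and acknowledge only on success.
# If the VM is preempted before ack_message runs, the message is redelivered.
process_one() {
  line=$(pull_one)
  [ -n "$line" ] || return 0            # nothing to do
  ack_id=$(printf '%s' "$line" | cut -f1)
  payload=$(printf '%s' "$line" | cut -f2 | base64 -d)
  if handle_payload "$payload"; then
    ack_message "$ack_id"
  fi
}
```

Because a message can be delivered more than once under this scheme, the work done in handle_payload should be safe to repeat.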

Limitations

No live migration (VMs are terminated during host maintenance)
Not all machine types are supported (e.g., A4X and bare metal are excluded)
No automatic restart on host events
Not covered by any SLA
Cannot change an existing VM to Spot or vice versa (must recreate)
Console does not show preemption probability

TL;DR

Spot VMs offer 60-91% off on-demand pricing for using spare capacity.
No runtime limit (unlike old preemptible VMs with a 24-hour max). Preferred over preemptible.
Google can preempt at any time. You get a 30-second shutdown window to save state.
Good for: batch jobs, CI/CD, distributed processing, stateless web serving with MIGs.
Bad for: single-instance databases, long-running monolithic apps, real-time interactive workloads, anything that cannot tolerate interruption.
Design for failure: use MIGs for auto-recreation, shutdown scripts for state saving, checkpoints for batch jobs, and queues for retry logic.
Spot VMs do not receive CUDs or SUDs. They are a separate pricing category.

Resources

Spot VMs Documentation: Official documentation for Spot VM pricing, preemption, and best practices.

Preemptible VM Instances: Legacy preemptible VMs documentation (use Spot VMs for new workloads).

Committed-Use Discounts: For steady workloads where interruption is not acceptable.

Cost Optimization: Overview of all cost levers on Google Cloud.

Lalit's Cloud & DevOps notes