High Availability, Live Migration & Automatic Restart

How Compute Engine handles host maintenance and hardware failures using live migration and automatic restart, and how to configure availability policies for your VMs.

Host Events: What Happens to the Underlying Hardware

Compute Engine VMs run on physical hosts in Google’s data centers. Those hosts occasionally need attention. There are two types of events that can affect your VM:

Event Type	Trigger	Frequency	Impact
Maintenance event	Scheduled infrastructure upkeep: hardware upgrades, security patches, BIOS updates, kernel patches	Roughly every few weeks to months per host	GCE sends a notice, then acts based on your maintenance policy
Host error	Unexpected hardware or software failure on the host machine	Rare	VM stops immediately; automatic restart kicks in if enabled

Maintenance events are predictable. GCE sends an advance notice through the metadata server and then carries out the action you configured. Host errors are not predictable. Your VM just dies, and you rely on automatic restart to bring it back.

Key Insight: The onHostMaintenance policy determines what happens during a scheduled maintenance event. The automaticRestart policy determines what happens after an unscheduled host error (or a TERMINATE maintenance event). These are two separate controls for two separate scenarios.
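To make the split between the two controls concrete, here is a minimal sketch in shell. The function name `host_event_outcome` and its outcome strings are made up for illustration; the sketch also assumes a VM simply stays down when automaticRestart is false.

```shell
# Sketch: which control decides the outcome of a host event.
# Arguments: event (maintenance|host_error), onHostMaintenance, automaticRestart
host_event_outcome() {
  event=$1; on_host_maintenance=$2; automatic_restart=$3
  case "$event:$on_host_maintenance" in
    maintenance:MIGRATE)
      # Scheduled maintenance with MIGRATE: automaticRestart is never consulted.
      echo "live-migrated"
      return 0 ;;
    maintenance:TERMINATE|host_error:*)
      # Both paths stop the VM; automaticRestart decides what happens next.
      if [ "$automatic_restart" = "true" ]; then
        echo "stopped, then restarted"
      else
        echo "stopped, stays down"
      fi ;;
    *)
      echo "unknown event or policy" >&2
      return 1 ;;
  esac
}

host_event_outcome maintenance MIGRATE true     # live-migrated
host_event_outcome maintenance TERMINATE true   # stopped, then restarted
host_event_outcome host_error MIGRATE false     # stopped, stays down
```

Note that onHostMaintenance has no effect on the host_error path: a MIGRATE policy cannot save a VM from an unscheduled failure.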

Live Migration

Live migration moves a running VM from one physical host to another without rebooting it and without restarting the guest OS or applications. The VM’s IP addresses, persistent disks, metadata, and application state all stay the same. From the VM’s perspective, nothing happened except a sub-second pause and a short period of reduced performance.

This is a significant capability. On many other platforms, host maintenance means the VM is stopped or rebooted and automation brings it back. GCE can transparently relocate the VM instead.

The 3-Stage Process

flowchart LR
    A["<b>Stage 1: Source Brownout</b><br/>Memory pages copied to target<br/>VM still runs on source"] --> B["<b>Stage 2: Blackout</b><br/>Brief pause &lt; 1 sec<br/>Remaining state transferred"]
    B --> C["<b>Stage 3: Target Brownout</b><br/>VM runs on target<br/>Source forwards packets"]
    C --> D["<b>Complete</b><br/>VM fully on target<br/>No disruption"]

Stage 1 — Source brownout: GCE allocates a target host and begins copying memory pages from the source. The VM keeps running on the source host during this phase. Disk I/O and network performance may be slightly reduced.

Stage 2 — Blackout: The VM is paused on the source (typically less than one second). The remaining memory pages and device state are transferred to the target. The guest OS and applications freeze for this brief window.

Stage 3 — Target brownout: The VM resumes execution on the target host. The source host continues forwarding network packets to the target until the network fabric updates to route traffic directly. This stage is transparent to the VM.
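Why is the blackout so short if the VM keeps writing to memory while pages are being copied? A toy model helps. The sketch below assumes an iterative pre-copy scheme: each brownout round copies the pending pages while the running VM re-dirties a fraction of them, and the VM is paused only once the remainder is small. The page counts, the dirty percentage, and the threshold are all hypothetical, and this is not a description of Google's actual algorithm.

```shell
# Toy model of the brownout/blackout split (illustrative numbers only).
# Arguments: initial pages, % of copied pages re-dirtied per round, pause threshold
simulate_precopy() {
  pending=$1
  dirty_pct=$2
  threshold=$3
  round=0
  # Brownout: keep copying while too many pages are still dirty (cap at 30 rounds).
  while [ "$pending" -gt "$threshold" ] && [ "$round" -lt 30 ]; do
    round=$((round + 1))
    # All pending pages are copied; the running VM dirties dirty_pct% of them again.
    pending=$((pending * dirty_pct / 100))
    echo "brownout round $round: $pending pages dirtied again"
  done
  # Blackout: only the small remainder is copied while the VM is paused.
  echo "blackout: pause VM, copy final $pending pages, resume on target"
}

simulate_precopy 1000000 10 1000
```

With these numbers the model needs three brownout rounds before the pause. A workload that dirties memory faster (a higher percentage) leaves more pages for the blackout, which is one way to think about why memory-heavy applications notice migration more.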

What Triggers Live Migration

Scheduled infrastructure maintenance (hardware upgrades, BIOS updates)
Security patches to the host kernel or hypervisor
Impending hardware failure detected by Google’s monitoring systems

What Stays the Same

Internal and external IP addresses
Persistent Disk attachments and data
VM metadata and tags
Network configuration (firewall rules, routes)
Application state and in-memory data
VM name, zone, and machine type

Performance Impact

During the migration, expect a short period of decreased performance across CPU, memory, disk, and network. For most workloads this is negligible. Latency-sensitive applications (high-frequency trading, real-time gaming) may want to set onHostMaintenance to TERMINATE and use a managed instance group with autohealing instead.

VM Types That Cannot Live Migrate

Not all VMs support live migration. The following VM configurations force termination during maintenance:

VM Type	Maintenance Behavior	Notes
GPU-attached VMs (A100, H100, L4, T4, etc.)	Terminate	Most GPU VMs receive a 60-minute stopping notice. A4X, A4, and A3 Ultra receive a 10-minute stopping notice.
Bare metal instances (C3, C4 metal)	Terminate + restart in place	No hypervisor, so no migration possible.
Most Confidential VMs	Terminate	Exception: N2D with AMD SEV can live migrate.
Cloud TPUs	Terminate	TPUs are physically tied to the host.
H4D with Local SSD	Terminate + restart in place	High-performance configuration limits mobility.
Z3 with >18 TiB Titanium SSD	Terminate + restart in place	Large local storage prevents migration.
Spot / Preemptible VMs	Terminate (no restart)	Always terminated during maintenance. No automatic restart.

Note: E2 machine types can only use MIGRATE policy. You cannot set onHostMaintenance=TERMINATE on E2 VMs.

Note: A4X, A4, and A3 Ultra also have AI Hypercomputer maintenance planning features. Scheduled maintenance can surface much earlier, such as 90 days in advance, but the metadata stopping notice before the VM stops is still 10 minutes.

Host Maintenance Policy

You configure availability behavior through the VM’s scheduling policy. There are four relevant settings:

Policy	Default	Options	When to Change
`onHostMaintenance`	`MIGRATE`	`MIGRATE` or `TERMINATE`	Set to `TERMINATE` when the VM has attached GPUs, is a bare metal instance, or your application handles its own restart logic.
`automaticRestart`	`true`	`true` or `false`	Set to `false` for Spot/Preemptible VMs or when you want a custom restart mechanism.
`hostErrorTimeoutSeconds`	`330`	90–330 (30s increments)	Lower this if your workload needs faster failover. Higher values give GCE more time to recover the host.
`localSsdRecoveryTimeout`	`1 hour`	0–168 hours	Increase for workloads that need Local SSD data recovered before restart. Set to `0` to skip recovery.

onHostMaintenance controls what happens during scheduled maintenance. MIGRATE (default) live-migrates the VM. TERMINATE shuts it down, and GCE restarts it on a new host if automaticRestart is enabled.

automaticRestart controls what happens after an unscheduled failure or a TERMINATE maintenance event. When true, GCE attempts to restart the VM on the same or a different host in the same zone.

hostErrorTimeoutSeconds defines how long GCE waits before declaring a host error and triggering automatic restart. The default of 330 seconds (5.5 minutes) balances false positives against recovery time.

localSsdRecoveryTimeout defines how long GCE tries to recover a Local SSD attached to a failed VM before restarting without it. The instance enters a REPAIRING state during recovery and you are not charged for this time.
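The ranges in the table are easy to get wrong in automation, so a small pre-flight check can help. This is a minimal sketch; the function names are made up, and it only encodes the ranges stated above (90–330 seconds in 30-second increments, 0–168 whole hours).

```shell
# Succeeds (exit 0) when $1 is a valid hostErrorTimeoutSeconds value:
# an integer from 90 to 330 in 30-second increments.
valid_host_error_timeout() {
  case "$1" in
    ''|*[!0-9]*|0*) return 1 ;;   # reject empty, non-numeric, or zero-padded input
  esac
  [ "$1" -ge 90 ] && [ "$1" -le 330 ] && [ $(( $1 % 30 )) -eq 0 ]
}

# Succeeds when $1 is a valid localSsdRecoveryTimeout in whole hours (0-168).
valid_local_ssd_recovery_hours() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;      # reject empty or non-numeric input
  esac
  [ "$1" -ge 0 ] && [ "$1" -le 168 ]
}

for t in 90 100 120 330 360; do
  if valid_host_error_timeout "$t"; then echo "$t: ok"; else echo "$t: rejected"; fi
done
```

The loop accepts 90, 120, and 330 and rejects 100 (not a multiple of 30) and 360 (out of range).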

Automatic Restart

When a VM crashes due to a host error (hardware failure, hypervisor crash), GCE can automatically restart it. This is enabled by default for most VM types.

When Automatic Restart Kicks In

Host error (hardware or software failure on the physical machine)
Maintenance event where onHostMaintenance is set to TERMINATE
Host error on a VM with Local SSD: the VM first enters the REPAIRING state while GCE attempts Local SSD recovery (not charged during this time), and the restart follows

When It Does Not

You manually stopped the VM (gcloud compute instances stop)
Zonal outage (the entire zone is down)
Spot or Preemptible VM termination
The VM was terminated due to a billing issue

Restart Behavior

GCE restarts the VM on the same host if it has recovered, or on a different host in the same zone. The VM retains its name, persistent disks, metadata, and static IP addresses; ephemeral IP addresses may change.

Local SSD Handling

Local SSDs are physically attached to the host. If the host fails, the Local SSD data might be lost. GCE attempts to recover the Local SSD for up to the localSsdRecoveryTimeout period (default 1 hour). If recovery succeeds, the VM restarts with its Local SSD data intact. If the timeout expires, GCE restarts the VM without the Local SSD.
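The timeout is set in hours, while a recovery is more naturally measured in minutes, so a quick worked example may help. This sketch models only the rule stated above; the function name and the "minutes needed" input are hypothetical, since in practice you cannot know in advance how long a recovery will take.

```shell
# Arguments: minutes the recovery would need (hypothetical), timeout in hours.
local_ssd_restart_outcome() {
  recovery_minutes=$1
  timeout_hours=$2
  if [ "$timeout_hours" -eq 0 ]; then
    # A timeout of 0 skips recovery entirely.
    echo "restart immediately without Local SSD data (recovery skipped)"
  elif [ "$recovery_minutes" -le $((timeout_hours * 60)) ]; then
    echo "restart with Local SSD data intact"
  else
    echo "restart without Local SSD data (timeout expired)"
  fi
}

local_ssd_restart_outcome 45 1   # fits inside the default 1-hour window
local_ssd_restart_outcome 90 1   # 90 minutes exceeds the 60-minute window
```

The trade-off is availability against data: a longer timeout keeps the VM in REPAIRING for longer, but gives the data a better chance of surviving.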

The operation type logged in Cloud Logging is compute.instances.automaticRestart.

Practical Configuration

Create a VM with Live Migration (Default Behavior)

The default policy is MIGRATE with automaticRestart=true. You get this without specifying anything:

gcloud compute instances create my-vm \
  --machine-type=n2-standard-4 \
  --zone=us-central1-a

Create a VM with TERMINATE Policy (GPU Workloads)

gcloud compute instances create gpu-training \
  --machine-type=n1-standard-8 \
  --accelerator=count=1,type=nvidia-tesla-t4 \
  --maintenance-policy=TERMINATE \
  --restart-on-failure \
  --zone=us-central1-a

GPU VMs cannot live migrate, so you must use TERMINATE. The --restart-on-failure flag (already the default, shown here for clarity) tells GCE to restart the VM after the maintenance event.

Update an Existing VM’s Maintenance Policy

gcloud compute instances set-scheduling my-vm \
  --maintenance-policy=MIGRATE \
  --restart-on-failure \
  --host-error-timeout-seconds=120 \
  --zone=us-central1-a

Set Local SSD Recovery Timeout

gcloud compute instances set-scheduling db-vm \
  --maintenance-policy=MIGRATE \
  --restart-on-failure \
  --local-ssd-recovery-timeout=1 \
  --zone=us-central1-a

The timeout is in hours. 1 = 1 hour. Set to 0 to skip recovery entirely and restart immediately without the Local SSD.

View Current Scheduling Policy

gcloud compute instances describe my-vm \
  --format="yaml(scheduling)" \
  --zone=us-central1-a
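For scripts, it is often handier to pull out individual fields than to read YAML. The sketch below wraps `gcloud compute instances describe` with a `value(...)` format projection and flags VMs that deviate from the default policy. The function names are made up, and the sketch assumes gcloud prints the boolean as `True` or `true`, so it accepts both.

```shell
# Print "<onHostMaintenance> <automaticRestart>" for one VM.
# Arguments: instance name, zone.
get_scheduling_policy() {
  gcloud compute instances describe "$1" \
    --zone="$2" \
    --format="value(scheduling.onHostMaintenance,scheduling.automaticRestart)"
}

# Report whether a VM uses the default policy (MIGRATE + automatic restart).
check_default_policy() {
  policy=$(get_scheduling_policy "$1" "$2") || return 1
  # value() separates fields with a tab; set -- splits them into $1 and $2.
  set -- $policy
  case "$1:$2" in
    MIGRATE:[Tt]rue) echo "default policy" ;;
    *) echo "non-default policy: onHostMaintenance=$1 automaticRestart=$2" ;;
  esac
}
```

Usage: `check_default_policy my-vm us-central1-a`. Looping this over the output of `gcloud compute instances list` gives a quick audit of which VMs will terminate rather than migrate.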

Query the Metadata Server for Maintenance Notices

Applications running on the VM can check for upcoming maintenance events:

# Check if a maintenance event is scheduled
curl "http://metadata.google.internal/computeMetadata/v1/instance/maintenance-event" \
  -H "Metadata-Flavor: Google"

Returns NONE if no event is scheduled, MIGRATE_ON_HOST_MAINTENANCE if the VM will be live-migrated, or TERMINATE_ON_HOST_MAINTENANCE if the VM will be terminated. Use this to implement graceful shutdown or drain logic in your application.
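A minimal polling sketch built on that endpoint might look like the following. The action names (`continue`, `expect-brownout`, `drain-and-checkpoint`) and the 5-second default interval are illustrative choices, not anything GCE defines; replace the final `echo` with your own drain or checkpoint logic.

```shell
METADATA_URL="http://metadata.google.internal/computeMetadata/v1/instance/maintenance-event"

# Fetch the current maintenance-event value and print it on stdout.
fetch_maintenance_event() {
  curl -sf -H "Metadata-Flavor: Google" "$METADATA_URL"
}

# Map a maintenance-event value to an action name (names are illustrative).
action_for_event() {
  case "$1" in
    NONE)                          echo "continue" ;;
    MIGRATE_ON_HOST_MAINTENANCE)   echo "expect-brownout" ;;
    TERMINATE_ON_HOST_MAINTENANCE) echo "drain-and-checkpoint" ;;
    *)                             echo "unknown" ;;
  esac
}

# Poll until an event is announced, then print "<event> -> <action>" and return.
# Argument: seconds between polls (default 5, an arbitrary choice).
watch_maintenance() {
  interval=${1:-5}
  while :; do
    # Treat a failed request as "no information" and try again.
    event=$(fetch_maintenance_event) || event=""
    action=$(action_for_event "$event")
    if [ "$action" != "continue" ] && [ "$action" != "unknown" ]; then
      echo "$event -> $action"
      return 0
    fi
    sleep "$interval"
  done
}
```

Run `watch_maintenance` from a background process on the VM. Keeping the mapping in its own function makes it easy to check without a metadata server.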

Decision Guide

Workload	Maintenance Policy	Auto Restart	Rationale
Web server (stateless)	`MIGRATE`	`true`	No disruption. VM moves to a new host without downtime.
GPU ML training	`TERMINATE`	`true`	GPU VMs cannot live migrate. Set TERMINATE and let GCE restart. Use checkpointing in your training code.
Database with Local SSD	`MIGRATE`	`true` + recovery timeout	Live migration keeps the DB running. If the host fails, recovery timeout gives GCE time to recover Local SSD data.
Spot VM batch job	`TERMINATE`	`false`	Spot VMs are always terminated during maintenance. No point in auto restart since the spot allocation is gone.
HPC (H4D)	`TERMINATE`	`true`	Cannot migrate. Restart in place after maintenance.
Latency-sensitive app	`TERMINATE`	`true`	Live migration causes a brief performance dip. Your app may prefer a clean restart over degraded performance. Use a MIG with autohealing.
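If you provision VMs from scripts, the decision guide can be encoded as a lookup. This is a sketch under stated assumptions: the workload labels (`stateless-web`, `gpu-training`, and so on) are hypothetical shorthand for the table rows, and the flags are printed rather than executed so you can review them first.

```shell
# Print "<onHostMaintenance> <automaticRestart>" for a workload label.
# The labels are hypothetical shorthand for the rows of the decision guide.
recommend_policy() {
  case "$1" in
    stateless-web|db-local-ssd)             echo "MIGRATE true" ;;
    gpu-training|hpc-h4d|latency-sensitive) echo "TERMINATE true" ;;
    spot-batch)                             echo "TERMINATE false" ;;
    *) echo "unknown workload: $1" >&2; return 1 ;;
  esac
}

# Turn a recommendation into the matching gcloud flags (printed, not executed).
scheduling_flags() {
  rec=$(recommend_policy "$1") || return 1
  set -- $rec   # $1 = maintenance policy, $2 = automatic restart
  if [ "$2" = "true" ]; then
    restart_flag="--restart-on-failure"
  else
    restart_flag="--no-restart-on-failure"
  fi
  echo "--maintenance-policy=$1 $restart_flag"
}

scheduling_flags gpu-training   # --maintenance-policy=TERMINATE --restart-on-failure
```

The Local SSD recovery timeout from the database row is left out here; add it as a separate flag where it applies.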

Detecting and Monitoring

Cloud Logging

GCE logs specific operation types when maintenance events and restarts occur:

Operation Type	What It Means
`compute.instances.migrateOnHostMaintenance`	VM was live-migrated during a maintenance event
`compute.instances.terminateOnHostMaintenance`	VM was terminated due to maintenance policy set to TERMINATE
`compute.instances.automaticRestart`	VM was automatically restarted after a host error

Query in Cloud Logging (audit log entries record the operation type in protoPayload.methodName):

resource.type="gce_instance"
protoPayload.methodName=("compute.instances.migrateOnHostMaintenance"
  OR "compute.instances.terminateOnHostMaintenance"
  OR "compute.instances.automaticRestart")
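The same search can be run from a terminal with `gcloud logging read`. The sketch below assumes these events are written as audit log entries whose operation name sits in `protoPayload.methodName`; if your entries look different, adjust the field name in the filter. The function names, the 7-day default window, and the 20-entry limit are arbitrary choices.

```shell
# Build a Cloud Logging filter for the three host-event operations.
# Assumption: the operation name is stored in protoPayload.methodName.
host_event_filter() {
  printf '%s' 'resource.type="gce_instance" AND protoPayload.methodName=("compute.instances.migrateOnHostMaintenance" OR "compute.instances.terminateOnHostMaintenance" OR "compute.instances.automaticRestart")'
}

# Read matching entries from the last N days (default 7, an arbitrary choice).
recent_host_events() {
  gcloud logging read "$(host_event_filter)" \
    --freshness="${1:-7}d" \
    --limit=20 \
    --format="table(timestamp, protoPayload.methodName, protoPayload.resourceName)"
}
```

Usage: `recent_host_events 30` lists up to 20 host events from the last 30 days in the current project.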

gcloud

Check the current scheduling policy of a VM:

gcloud compute instances describe my-vm \
  --format="yaml(scheduling)" \
  --zone=us-central1-a

Metadata Server

From within the VM, poll the metadata server for real-time maintenance status:

curl "http://metadata.google.internal/computeMetadata/v1/instance/maintenance-event" \
  -H "Metadata-Flavor: Google"

Tip: Use a startup script or application-level health check that polls the metadata server. When the response changes from NONE to MIGRATE_ON_HOST_MAINTENANCE or TERMINATE_ON_HOST_MAINTENANCE, your application can start draining connections or saving state before the event occurs.
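Instead of polling on a timer, the metadata server also supports a hanging GET: adding `?wait_for_change=true` makes the request block until the value changes. The sketch below uses that to run a hook exactly once when an event is announced. The function names are made up, and the hook command is whatever drain or checkpoint script you supply.

```shell
METADATA_URL="http://metadata.google.internal/computeMetadata/v1/instance/maintenance-event"

# Block until the maintenance-event value changes, then print the new value.
wait_for_event_change() {
  curl -sf -H "Metadata-Flavor: Google" "$METADATA_URL?wait_for_change=true"
}

# Run a hook command once when a maintenance event is announced.
# Arguments: the hook command and its arguments (supplied by you);
# the event name is appended as the hook's last argument.
on_maintenance_event() {
  while :; do
    # On a failed request, back off briefly and retry.
    event=$(wait_for_event_change) || { sleep 1; continue; }
    case "$event" in
      MIGRATE_ON_HOST_MAINTENANCE|TERMINATE_ON_HOST_MAINTENANCE)
        "$@" "$event"
        return ;;
    esac
  done
}
```

Usage, with a hypothetical drain script: `on_maintenance_event /opt/app/drain.sh`. The script is called with the event name as its argument, so it can choose between a light response for a migration and a full checkpoint for a termination.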

TL;DR

Compute Engine VMs face two types of host events: scheduled maintenance (predictable, with notice) and host errors (unpredictable, immediate failure).
Live migration transparently moves a running VM to a new host during maintenance. No reboot, no IP change, no data loss. Most VMs support it by default.
VMs with GPUs, bare metal instances, most Confidential VMs (N2D with AMD SEV is the exception), TPUs, and Spot/Preemptible VMs cannot live migrate. They terminate during maintenance.
onHostMaintenance controls behavior during scheduled maintenance: MIGRATE (default) or TERMINATE.
automaticRestart controls behavior after host failures or TERMINATE events. Enabled by default.
localSsdRecoveryTimeout (0–168 hours, default 1 hour) controls how long GCE tries to recover Local SSD data before restarting without it.
hostErrorTimeoutSeconds (90–330 seconds, default 330) controls how long GCE waits before declaring a host error.
Use the metadata server to detect upcoming maintenance events from within your VM. Applications can drain connections or checkpoint state before the event.
Monitor maintenance events through Cloud Logging using operation types migrateOnHostMaintenance, terminateOnHostMaintenance, and automaticRestart.

Resources

Host Maintenance Overview: How GCE handles host maintenance events, host errors, and availability policies.

Live Migration Process: Technical details of the live migration mechanism and what stays the same during migration.

Setting VM Host Options: How to configure onHostMaintenance, automaticRestart, and related policies.

Instance Templates: How instance templates encode scheduling policies for managed instance groups.

Regions and Zones: How zones relate to host infrastructure and failure domains.

Lalit's Cloud & DevOps notes