How Compute Engine handles host maintenance and hardware failures using live migration and automatic restart, and how to configure availability policies for your VMs.
Host Events: What Happens to the Underlying Hardware
Compute Engine VMs run on physical hosts in Google’s data centers. Those hosts occasionally need attention. There are two types of events that can affect your VM:
| Event Type | Trigger | Frequency | Impact |
|---|---|---|---|
| Maintenance event | Scheduled infrastructure upkeep: hardware upgrades, security patches, BIOS updates, kernel patches | Roughly every few weeks to months per host | GCE sends a notice, then acts based on your maintenance policy |
| Host error | Unexpected hardware or software failure on the host machine | Rare | VM stops immediately; automatic restart kicks in if enabled |
Maintenance events are predictable. GCE sends an advance notice through the metadata server and then carries out the action you configured. Host errors are not predictable. Your VM just dies, and you rely on automatic restart to bring it back.
Key Insight: The onHostMaintenance policy determines what happens during a scheduled maintenance event. The automaticRestart policy determines what happens after an unscheduled host error (or a TERMINATE maintenance event). These are two separate controls for two separate scenarios.
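To make the split concrete, here is a minimal shell sketch of which policy is consulted for which event. The function name host_event_outcome and its output labels are hypothetical and exist only for illustration; the policy names and values are the ones described above.

```bash
# Hypothetical helper: models which policy governs which host event.
# usage: host_event_outcome <maintenance|host-error> <MIGRATE|TERMINATE> <true|false>
host_event_outcome() {
  event=$1; on_host_maintenance=$2; automatic_restart=$3

  # Scheduled maintenance with MIGRATE never stops the VM,
  # so automaticRestart is not consulted at all.
  if [ "$event" = "maintenance" ] && [ "$on_host_maintenance" = "MIGRATE" ]; then
    echo "live-migrate"
    return 0
  fi

  # A host error, or maintenance with TERMINATE, stops the VM;
  # automaticRestart then decides whether it comes back.
  case "$event" in
    maintenance|host-error)
      if [ "$automatic_restart" = "true" ]; then
        echo "stop-then-restart"
      else
        echo "stop-and-stay-stopped"
      fi
      ;;
    *)
      echo "unknown event: $event" >&2
      return 1
      ;;
  esac
}
```

For example, `host_event_outcome host-error MIGRATE true` prints `stop-then-restart`: the MIGRATE setting is irrelevant once the host has already failed.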
Live Migration
Live migration moves a running VM from one physical host to another without rebooting it and without interrupting the guest OS or applications. The VM’s IP addresses, persistent disks, metadata, and application state all stay the same. From the VM’s perspective, nothing happened except a brief period of reduced performance.
This is a significant capability. Most cloud providers terminate VMs during host maintenance and rely on automation to restart them. GCE can transparently relocate the VM instead.
The 3-Stage Process
flowchart LR
    A["<b>Stage 1: Source Brownout</b><br/>Memory pages copied to target<br/>VM still runs on source"] --> B["<b>Stage 2: Blackout</b><br/>Brief pause < 1 sec<br/>Remaining state transferred"]
    B --> C["<b>Stage 3: Target Brownout</b><br/>VM runs on target<br/>Source forwards packets"]
    C --> D["<b>Complete</b><br/>VM fully on target<br/>No disruption"]
Stage 1 — Source brownout: GCE allocates a target host and begins copying memory pages from the source. The VM keeps running on the source host during this phase. Disk I/O and network performance may be slightly reduced.
Stage 2 — Blackout: The VM is paused on the source (typically less than one second). The remaining memory pages and device state are transferred to the target. The guest OS and applications freeze for this brief window.
Stage 3 — Target brownout: The VM resumes execution on the target host. The source host continues forwarding network packets to the target until the network fabric updates to route traffic directly. This stage is transparent to the VM.
What Triggers Live Migration
- Scheduled infrastructure maintenance (hardware upgrades, BIOS updates)
- Security patches to the host kernel or hypervisor
- Impending hardware failure detected by Google’s monitoring systems
What Stays the Same
- Internal and external IP addresses
- Persistent Disk attachments and data
- VM metadata and tags
- Network configuration (firewall rules, routes)
- Application state and in-memory data
- VM name, zone, and machine type
Performance Impact
During the migration, expect a short period of decreased performance across CPU, memory, disk, and network. For most workloads this is negligible. Latency-sensitive applications (high-frequency trading, real-time gaming) may want to set onHostMaintenance to TERMINATE and use a managed instance group with autohealing instead.
VM Types That Cannot Live Migrate
Not all VMs support live migration. The following VM configurations force termination during maintenance:
| VM Type | Maintenance Behavior | Notes |
|---|---|---|
| GPU-attached VMs (A100, H100, L4, T4, etc.) | Terminate | Most GPU VMs receive a 60-minute stopping notice. A4X, A4, and A3 Ultra receive a 10-minute stopping notice. |
| Bare metal instances (C3, C4 metal) | Terminate + restart in place | No hypervisor, so no migration possible. |
| Most Confidential VMs | Terminate | Exception: N2D with AMD SEV can live migrate. |
| Cloud TPUs | Terminate | TPUs are physically tied to the host. |
| H4D with Local SSD | Terminate + restart in place | High-performance configuration limits mobility. |
| Z3 with >18 TiB Titanium SSD | Terminate + restart in place | Large local storage prevents migration. |
| Spot / Preemptible VMs | Terminate (no restart) | Always terminated during maintenance. No automatic restart. |
Note: E2 machine types can only use the MIGRATE policy. You cannot set onHostMaintenance=TERMINATE on E2 VMs.
Note: A4X, A4, and A3 Ultra also have AI Hypercomputer maintenance planning features. Scheduled maintenance can surface much earlier, such as 90 days in advance, but the metadata stopping notice before the VM stops is still 10 minutes.
Host Maintenance Policy
You configure availability behavior through the VM’s scheduling policy. There are four relevant settings:
| Policy | Default | Options | When to Change |
|---|---|---|---|
| onHostMaintenance | MIGRATE | MIGRATE or TERMINATE | Set to TERMINATE when the VM has attached GPUs, is a bare metal instance, or your application handles its own restart logic. |
| automaticRestart | true | true or false | Set to false for Spot/Preemptible VMs or when you want a custom restart mechanism. |
| hostErrorTimeoutSeconds | 330 | 90–330 (30s increments) | Lower this if your workload needs faster failover. Higher values give GCE more time to recover the host. |
| localSsdRecoveryTimeout | 1 hour | 0–168 hours | Increase for workloads that need Local SSD data recovered before restart. Set to 0 to skip recovery. |
onHostMaintenance controls what happens during scheduled maintenance. MIGRATE (default) live-migrates the VM. TERMINATE shuts it down, and GCE restarts it on a new host if automaticRestart is enabled.
automaticRestart controls what happens after an unscheduled failure or a TERMINATE maintenance event. When true, GCE attempts to restart the VM on the same or a different host in the same zone.
hostErrorTimeoutSeconds defines how long GCE waits before declaring a host error and triggering automatic restart. The default of 330 seconds (5.5 minutes) balances false positives against recovery time.
localSsdRecoveryTimeout defines how long GCE tries to recover a Local SSD attached to a failed VM before restarting without it. The instance enters a REPAIRING state during recovery and you are not charged for this time.
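The numeric ranges above are easy to get wrong in automation, so a pre-flight check can help. The following is a minimal sketch: both function names are hypothetical, and the limits are simply the ones listed in the table (90–330 seconds in 30-second steps, 0–168 whole hours).

```bash
# Hypothetical validators for the ranges listed in the settings table.

# hostErrorTimeoutSeconds: 90-330 seconds, in 30-second increments.
valid_host_error_timeout() {
  seconds=$1
  # Reject empty input, non-digits, and leading zeros (e.g. "090").
  case "$seconds" in ''|*[!0-9]*|0?*) return 1 ;; esac
  [ "$seconds" -ge 90 ] && [ "$seconds" -le 330 ] && [ $((seconds % 30)) -eq 0 ]
}

# localSsdRecoveryTimeout: 0-168 whole hours (0 skips recovery).
valid_local_ssd_recovery_timeout() {
  hours=$1
  case "$hours" in ''|*[!0-9]*|0?*) return 1 ;; esac
  [ "$hours" -ge 0 ] && [ "$hours" -le 168 ]
}
```

A deployment script could call `valid_host_error_timeout 120 || exit 1` before passing the value to gcloud, so that a typo such as 100 is caught locally instead of by the API.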
Automatic Restart
When a VM crashes due to a host error (hardware failure, hypervisor crash), GCE can automatically restart it. This is enabled by default for most VM types.
When Automatic Restart Kicks In
- Host error (hardware or software failure on the physical machine)
- Maintenance event where onHostMaintenance is set to TERMINATE
- The VM enters REPAIRING state during Local SSD recovery (not charged during this time)
When It Does Not
- You manually stopped the VM (gcloud compute instances stop)
- Zonal outage (the entire zone is down)
- Spot or Preemptible VM termination
- The VM was terminated due to a billing issue
Restart Behavior
GCE restarts the VM on the same host if it has recovered, or on a different host in the same zone. The VM retains its name, IP addresses (ephemeral IPs may change), persistent disks, and metadata.
Local SSD Handling
Local SSDs are physically attached to the host. If the host fails, the Local SSD data might be lost. GCE attempts to recover the Local SSD for up to the localSsdRecoveryTimeout period (default 1 hour). If recovery succeeds, the VM restarts with its Local SSD data intact. If the timeout expires, GCE restarts the VM without the Local SSD.
The operation type logged in Cloud Logging is compute.instances.automaticRestart.
Practical Configuration
Create a VM with Live Migration (Default Behavior)
The default policy is MIGRATE with automaticRestart=true. You get this without specifying anything:
gcloud compute instances create my-vm \
--machine-type=n2-standard-4 \
--zone=us-central1-a
Create a VM with TERMINATE Policy (GPU Workloads)
gcloud compute instances create gpu-training \
--machine-type=n1-standard-8 \
--accelerator=count=1,type=nvidia-tesla-t4 \
--maintenance-policy=TERMINATE \
--restart-on-failure \
--zone=us-central1-a
GPU VMs cannot live migrate, so you must use TERMINATE. Adding --restart-on-failure ensures the VM comes back after maintenance.
Update an Existing VM’s Maintenance Policy
gcloud compute instances set-scheduling my-vm \
--maintenance-policy=MIGRATE \
--restart-on-failure \
--host-error-timeout-seconds=120 \
--zone=us-central1-a
Set Local SSD Recovery Timeout
gcloud compute instances set-scheduling db-vm \
--maintenance-policy=MIGRATE \
--restart-on-failure \
--local-ssd-recovery-timeout=1 \
--zone=us-central1-a
The timeout is in hours: 1 means 1 hour. Set it to 0 to skip recovery entirely and restart immediately without the Local SSD.
View Current Scheduling Policy
gcloud compute instances describe my-vm \
--flatten=scheduling \
--zone=us-central1-a
Query the Metadata Server for Maintenance Notices
Applications running on the VM can check for upcoming maintenance events:
# Check if a maintenance event is scheduled
curl "http://metadata.google.internal/computeMetadata/v1/instance/maintenance-event" \
-H "Metadata-Flavor: Google"
Returns NONE if no event is scheduled, MIGRATE_ON_HOST_MAINTENANCE if the VM will be live-migrated, or TERMINATE_ON_HOST_MAINTENANCE if the VM will be terminated. Use this to implement graceful shutdown or drain logic in your application.
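One way to act on those three values is a small dispatcher. This is only a sketch: the function names and the action labels (continue, prepare-for-migration, checkpoint-and-drain) are hypothetical placeholders for whatever your application actually needs to do.

```bash
# Hypothetical dispatcher for the three documented maintenance-event values.
# The action labels are illustrative, not part of any GCE API.
maintenance_action() {
  case "$1" in
    NONE)                          echo "continue" ;;
    MIGRATE_ON_HOST_MAINTENANCE)   echo "prepare-for-migration" ;;
    TERMINATE_ON_HOST_MAINTENANCE) echo "checkpoint-and-drain" ;;
    *)                             echo "unexpected value: $1" >&2; return 1 ;;
  esac
}

# Fetches the current value from the metadata server.
# This only resolves from inside a Compute Engine VM.
fetch_maintenance_event() {
  curl -s "http://metadata.google.internal/computeMetadata/v1/instance/maintenance-event" \
    -H "Metadata-Flavor: Google"
}
```

On a VM, `maintenance_action "$(fetch_maintenance_event)"` would print the action for the current state. Keeping the fetch separate from the decision makes the decision logic testable off the VM.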
Decision Guide
| Workload | Maintenance Policy | Auto Restart | Rationale |
|---|---|---|---|
| Web server (stateless) | MIGRATE | true | No disruption. VM moves to a new host without downtime. |
| GPU ML training | TERMINATE | true | GPU VMs cannot live migrate. Set TERMINATE and let GCE restart. Use checkpointing in your training code. |
| Database with Local SSD | MIGRATE | true + recovery timeout | Live migration keeps the DB running. If the host fails, recovery timeout gives GCE time to recover Local SSD data. |
| Spot VM batch job | TERMINATE | false | Spot VMs are always terminated during maintenance. No point in auto restart since the spot allocation is gone. |
| HPC (H4D) | TERMINATE | true | Cannot migrate. Restart in place after maintenance. |
| Latency-sensitive app | TERMINATE | true | Live migration causes a brief performance dip. Your app may prefer a clean restart over degraded performance. Use a MIG with autohealing. |
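The table can also be encoded as a lookup that emits gcloud scheduling flags. This is a sketch under assumptions: the function name and the workload category names are hypothetical, and the flags are the --maintenance-policy and --restart-on-failure options used earlier, plus the --no-restart-on-failure negation.

```bash
# Hypothetical lookup: prints gcloud scheduling flags for a workload
# category from the decision guide. Category names are illustrative.
scheduling_flags() {
  case "$1" in
    stateless-web|local-ssd-db)
      echo "--maintenance-policy=MIGRATE --restart-on-failure" ;;
    gpu-training|hpc|latency-sensitive)
      echo "--maintenance-policy=TERMINATE --restart-on-failure" ;;
    spot-batch)
      echo "--maintenance-policy=TERMINATE --no-restart-on-failure" ;;
    *)
      echo "unknown workload: $1" >&2
      return 1 ;;
  esac
}
```

A script might then run `gcloud compute instances set-scheduling my-vm $(scheduling_flags gpu-training) --zone=us-central1-a`. The command substitution is left unquoted on purpose so the shell splits it into two separate flags.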
Detecting and Monitoring
Cloud Logging
GCE logs specific operation types when maintenance events and restarts occur:
| Operation Type | What It Means |
|---|---|
| compute.instances.migrateOnHostMaintenance | VM was live-migrated during a maintenance event |
| compute.instances.terminateOnHostMaintenance | VM was terminated due to maintenance policy set to TERMINATE |
| compute.instances.automaticRestart | VM was automatically restarted after a host error |
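These operation types can also be listed from the command line with gcloud compute operations list, which accepts a filter on operationType. The helper below is a hypothetical convenience for building that filter from any number of operation types; treat it as a sketch, not an official recipe.

```bash
# Hypothetical helper: joins operation types into a gcloud --filter
# expression of the form "operationType=a OR operationType=b ...".
operations_filter() {
  filter=""
  for op in "$@"; do
    # Add the OR separator between terms, but not before the first one.
    if [ -n "$filter" ]; then
      filter="$filter OR "
    fi
    filter="${filter}operationType=$op"
  done
  echo "$filter"
}
```

Typical usage would be `gcloud compute operations list --filter="$(operations_filter compute.instances.migrateOnHostMaintenance compute.instances.automaticRestart)"`.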
Query in Cloud Logging:
resource.type="gce_instance"
protoPayload.methodName=("compute.instances.migrateOnHostMaintenance"
OR "compute.instances.terminateOnHostMaintenance"
OR "compute.instances.automaticRestart")
gcloud
Check the current scheduling policy of a VM:
gcloud compute instances describe my-vm \
--format="yaml(scheduling)" \
--zone=us-central1-a
Metadata Server
From within the VM, poll the metadata server for real-time maintenance status:
curl "http://metadata.google.internal/computeMetadata/v1/instance/maintenance-event" \
-H "Metadata-Flavor: Google"
Tip: Use a startup script or application-level health check that polls the metadata server. When the response changes from NONE to MIGRATE_ON_HOST_MAINTENANCE or TERMINATE_ON_HOST_MAINTENANCE, your application can start draining connections or saving state before the event occurs.
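A polling loop along those lines might look like the sketch below. The function names, the argument order, and the choice to treat a failed fetch as "no event" are all assumptions made for illustration. The fetch command and the handler are passed in by name so the loop can be exercised without a real metadata server.

```bash
# Hypothetical watcher: polls a fetch command and runs a handler the first
# time the value leaves NONE.
# usage: watch_maintenance <fetch_cmd> <handler_cmd> <max_polls> [sleep_seconds]
watch_maintenance() {
  fetch_cmd=$1; handler_cmd=$2; max_polls=$3; sleep_seconds=${4:-5}
  polls=0
  while [ "$polls" -lt "$max_polls" ]; do
    # Assumption: a failed fetch is treated the same as "no event".
    event=$("$fetch_cmd") || event="NONE"
    if [ -n "$event" ] && [ "$event" != "NONE" ]; then
      "$handler_cmd" "$event"
      return 0
    fi
    polls=$((polls + 1))
    sleep "$sleep_seconds"
  done
  return 1  # no event seen within max_polls
}

# Real fetcher; only resolves from inside a Compute Engine VM.
fetch_maintenance_event() {
  curl -s "http://metadata.google.internal/computeMetadata/v1/instance/maintenance-event" \
    -H "Metadata-Flavor: Google"
}
```

On a VM this could be started as `watch_maintenance fetch_maintenance_event my_drain_handler 1000 5`, where my_drain_handler is your own function. The metadata server also supports appending ?wait_for_change=true to the URL, which blocks until the value changes and can replace the sleep-based loop.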
TL;DR
- Compute Engine VMs face two types of host events: scheduled maintenance (predictable, with notice) and host errors (unpredictable, immediate failure).
- Live migration transparently moves a running VM to a new host during maintenance. No reboot, no IP change, no data loss. Most VMs support it by default.
- GPU VMs, bare metal instances, most Confidential VMs, Cloud TPUs, and Spot/Preemptible VMs cannot live migrate. They terminate during maintenance.
- onHostMaintenance controls behavior during scheduled maintenance: MIGRATE (default) or TERMINATE.
- automaticRestart controls behavior after host failures or TERMINATE events. Enabled by default.
- localSsdRecoveryTimeout (0–168 hours, default 1 hour) controls how long GCE tries to recover Local SSD data before restarting without it.
- hostErrorTimeoutSeconds (90–330 seconds, default 330) controls how long GCE waits before declaring a host error.
- Use the metadata server to detect upcoming maintenance events from within your VM. Applications can drain connections or checkpoint state before the event.
- Monitor maintenance events through Cloud Logging using operation types migrateOnHostMaintenance, terminateOnHostMaintenance, and automaticRestart.