How to use managed and unmanaged instance groups on Google Compute Engine for scalable, self-healing VM deployments.


What Are Instance Groups?

An instance group is a collection of VMs that you manage as a single entity. Google Cloud offers two types:

  • Managed Instance Groups (MIGs) — Identical VMs created from an Instance Template, with built-in autoscaling, autohealing, and automated updates.
  • Unmanaged Instance Groups — A loose collection of heterogeneous VMs, used primarily as a load balancer backend when VMs have different configurations.

| Aspect | Managed Instance Group | Unmanaged Instance Group |
|---|---|---|
| VMs | Identical (from template) | Heterogeneous |
| Autoscaling | Yes | No |
| Autohealing | Yes | No |
| Rolling updates | Yes | No |
| Multi-zone | Yes (regional MIGs) | No |
| Use case | Scalable production workloads | Load balancing existing VMs |

In practice: MIGs are the standard way to run production VM fleets on GCE. You declare the desired state (template, target size), and the MIG keeps the actual state converged automatically.


Managed Instance Groups (MIGs)

A MIG creates and maintains a group of identical VMs from a single instance template. You set the target size, and the MIG handles creation, monitoring, healing, and scaling.

Core capabilities:

  • Autoscaling — Dynamically adds or removes VMs based on load (CPU, LB capacity, custom metrics, schedules, Pub/Sub queue depth)
  • Autohealing — Recreates VMs that fail health checks, including application-level checks (crashes, freezes, OOM)
  • Regional deployment — Distributes VMs across multiple zones to survive zonal failures
  • Automated updates — Rolling updates and canary deployments with controlled disruption
  • Stateful support — Optional per-instance state preservation (disks, IPs, metadata)
flowchart TD
    T["Instance Template"] --> MIG["Managed Instance Group<br/>(target size: N)"]
    MIG --> VM1["VM 1"]
    MIG --> VM2["VM 2"]
    MIG --> VMN["VM N"]
    LB["Load Balancer"] --> MIG
    HC["Health Check"] -->|signals unhealthy| MIG
    MIG -->|recreates| VMN
    AS["Autoscaler"] -->|scale out/in| MIG

Key Insight: MIGs are intent-based. You declare the desired state (which template, how many VMs), and the MIG continuously converges to that state. If a VM crashes, is preempted, or fails a health check, the MIG replaces it automatically.
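
As a rough mental model of this convergence (a toy sketch, not how the MIG is actually implemented), one reconciliation step compares the declared target size with the number of healthy VMs and decides what to do. The function name `reconcile` and its output strings are made up for illustration.

```shell
# Toy model of desired-state convergence; not the real MIG control loop.
# reconcile TARGET_SIZE HEALTHY_VMS -> prints the action needed to converge.
reconcile() {
  target=$1
  healthy=$2
  if [ "$healthy" -lt "$target" ]; then
    echo "create $((target - healthy))"
  elif [ "$healthy" -gt "$target" ]; then
    echo "delete $((healthy - target))"
  else
    echo "none"
  fi
}

reconcile 3 2   # one VM crashed or was preempted -> create 1
reconcile 3 3   # actual state matches declared state -> none
```

The point of the model is that you never issue "create" or "delete" yourself; you only change the target, and the group works out the difference.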


Unmanaged Instance Groups

An unmanaged instance group is a collection of VMs with different configurations that you can use as a load balancer backend. You add and remove VMs manually.

What they do: Let you load balance across a fleet of individually managed, non-identical VMs.

What they do not provide: Autoscaling, autohealing, rolling updates, multi-zone deployment, instance templates, or any automated management. Maximum 2,000 VMs per group.
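
For comparison with the MIG commands later in this guide, here is a sketch of the manual workflow for an unmanaged group. The group and VM names (`legacy-group`, `legacy-vm-1`, `legacy-vm-2`) are placeholders, and the VMs are assumed to already exist in the same zone as the group. The script prints the commands by default rather than running them.

```shell
# Dry run by default: each command is printed, not executed.
# Set GCLOUD=gcloud to run against a real project.
GCLOUD="${GCLOUD:-echo gcloud}"

group_legacy_vms() {
  # Create an empty unmanaged group (placeholder name).
  $GCLOUD compute instance-groups unmanaged create legacy-group \
    --zone=us-central1-a
  # Add existing VMs by hand; nothing adds or replaces them for you.
  $GCLOUD compute instance-groups unmanaged add-instances legacy-group \
    --instances=legacy-vm-1,legacy-vm-2 \
    --zone=us-central1-a
}

group_legacy_vms
```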

| Scenario | Why Unmanaged |
|---|---|
| Existing heterogeneous VMs | VMs with different configs that need a single LB backend |
| Migration phase | Temporarily grouping VMs while migrating to MIGs |
| One-off load balancing | Simple case where MIG overhead is unnecessary |

Warning: Do not use unmanaged instance groups for new production workloads. They lack autoscaling, autohealing, and automated updates. Use MIGs instead. Unmanaged groups exist primarily for load balancing legacy or heterogeneous VM fleets.


MIG vs Unmanaged Comparison

| Feature | Managed (MIG) | Unmanaged |
|---|---|---|
| VM homogeneity | Identical (template-based) | Heterogeneous |
| Autoscaling | Yes (CPU, LB, metrics, schedule, Pub/Sub) | No |
| Autohealing | Yes (health-check driven recreation) | No |
| Rolling updates | Yes (with canary support) | No |
| Regional (multi-zone) | Yes | No |
| Instance templates | Required | Not used |
| Stateful support | Yes | No |
| Max VMs | 1,000 zonal / 2,000 regional (expandable to 4,000) | 2,000 |
| Load balancing | Backend service or target pool | Backend service or target pool |
| Pricing | No separate instance group charge | No separate instance group charge |

Zonal vs Regional MIGs

| Property | Zonal MIG | Regional MIG |
|---|---|---|
| Zones | Single zone | Multiple zones (default 3) |
| Max VMs | 1,000 (expandable) | 2,000 (expandable) |
| Zonal failure tolerance | None | Yes (traffic shifts to remaining zones) |
| Creation | --zone=ZONE | --region=REGION |
| Default maxSurge | 1 | Number of zones (default 3) |
| Default maxUnavailable | 1 | Number of zones (default 3) |
| Pub/Sub autoscaling | Yes | Yes |

A zonal MIG is simpler but vulnerable to a single-zone outage. A regional MIG spreads instances across multiple zones within a region and can redistribute them after a zone recovers.
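
To make the zone spread concrete, the sketch below computes an even split of a target size across zones, assuming the group distributes VMs as evenly as possible (counts differ by at most one). Which zone receives the extra VM is not something this model predicts; `per_zone_counts` is an illustrative helper, not a gcloud feature.

```shell
# per_zone_counts TOTAL ZONES -> one VM count per zone, as even as possible.
per_zone_counts() {
  total=$1
  zones=$2
  base=$((total / zones))    # every zone gets at least this many
  extra=$((total % zones))   # this many zones get one more
  i=0
  out=""
  while [ "$i" -lt "$zones" ]; do
    if [ "$i" -lt "$extra" ]; then n=$((base + 1)); else n=$base; fi
    out="$out$n "
    i=$((i + 1))
  done
  echo "${out% }"
}

per_zone_counts 6 3   # 2 2 2
per_zone_counts 7 3   # 3 2 2
```

With 6 VMs over 3 zones, losing one zone still leaves 4 VMs serving, which is why the target size should include headroom for a zonal outage.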

Tip: Use regional MIGs for production workloads. There is no separate charge for choosing a regional MIG, but you still pay for the VMs, disks, load balancers, and other resources the group uses.


Autoscaling

MIG autoscaling adds or removes VMs based on load signals. You can combine multiple signals in a single policy; the autoscaler uses the largest recommended size across all of them.
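
A minimal sketch of how multiple signals combine, assuming a simplified proportional model for the CPU signal (the real autoscaler also applies stabilization and the initialization period). The helpers `ceil_div`, `cpu_recommendation`, and `largest` are illustrative names, and the schedule value of 8 is an invented example.

```shell
# ceil_div A B -> smallest integer >= A/B (non-negative integers only).
ceil_div() { echo $(( ($1 + $2 - 1) / $2 )); }

# cpu_recommendation CURRENT_SIZE CURRENT_CPU_PCT TARGET_CPU_PCT
# Simplified model: enough VMs to bring average CPU down to the target.
cpu_recommendation() { ceil_div $(( $1 * $2 )) "$3"; }

# largest N1 N2 ... -> the largest of the per-signal recommendations.
largest() {
  max=$1
  for n in "$@"; do
    if [ "$n" -gt "$max" ]; then max=$n; fi
  done
  echo "$max"
}

cpu=$(cpu_recommendation 4 90 70)   # 4 VMs at 90% CPU, target 70% -> 6
schedule=8                          # a schedule that asks for 8 VMs right now
largest "$cpu" "$schedule"          # 8
```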

| Policy | Signal | Best For |
|---|---|---|
| CPU utilization | Average CPU across group | General web serving, API backends |
| Load balancing capacity | HTTP load per instance | HTTP/S traffic behind a load balancer |
| Cloud Monitoring metric | Any custom or built-in metric | Application-specific signals (queue depth, latency) |
| Schedule-based | Time of day / day of week | Predictable traffic patterns |
| Predictive | ML-based forecast | Workloads with historical patterns and slow initialization |
| Pub/Sub queue | Unacknowledged messages in a subscription | Async processing, event-driven workloads |

CPU-Based Autoscaling

gcloud compute instance-groups managed set-autoscaling my-mig \
  --max-num-replicas=10 \
  --min-num-replicas=2 \
  --target-cpu-utilization=0.7 \
  --zone=us-central1-a

Pub/Sub-Based Autoscaling

gcloud compute instance-groups managed set-autoscaling pubsub-mig \
  --max-num-replicas=20 \
  --min-num-replicas=1 \
  --update-stackdriver-metric=pubsub.googleapis.com/subscription/num_undelivered_messages \
  --stackdriver-metric-filter='resource.type="pubsub_subscription" AND resource.labels.subscription_id="my-sub"' \
  --stackdriver-metric-single-instance-assignment=100 \
  --zone=us-central1-a

Use --region=REGION instead of --zone=ZONE when configuring autoscaling for a regional MIG.
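
The `--stackdriver-metric-single-instance-assignment=100` value in the Pub/Sub example can be read as "each VM is expected to handle about 100 undelivered messages". Under that reading, the target size is roughly the backlog divided by 100, rounded up and clamped to the min and max replica counts. The sketch below is a simplified approximation of that arithmetic, not the autoscaler's exact algorithm; `pubsub_target_size` is an illustrative name.

```shell
# pubsub_target_size BACKLOG PER_VM MIN MAX
pubsub_target_size() {
  size=$(( ($1 + $2 - 1) / $2 ))             # ceil(backlog / per-VM assignment)
  if [ "$size" -lt "$3" ]; then size=$3; fi  # never below --min-num-replicas
  if [ "$size" -gt "$4" ]; then size=$4; fi  # never above --max-num-replicas
  echo "$size"
}

pubsub_target_size 850 100 1 20    # 9
pubsub_target_size 5000 100 1 20   # 20 (capped at the maximum)
pubsub_target_size 0 100 1 20      # 1 (held at the minimum)
```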

Note: Autoscaling is configured on the MIG, not the instance template. You set it after creating the MIG.

Scale-in controls let you limit how fast the group can shrink (e.g., “remove at most 3 VMs per 300 seconds”). Use these for workloads with long initialization times to prevent sudden capacity drops.

Initialization period (formerly cool down) tells the autoscaler how long to ignore usage data from a newly created VM while it boots and initializes. Set this to match your application’s startup time.
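
Both settings can be passed when configuring autoscaling. The sketch below extends the CPU example with `--scale-in-control` (at most 3 VMs removed per 300-second window) and `--cool-down-period`, the gcloud flag that sets the initialization period. The 120-second value is an assumption; measure your own startup time. Commands are printed by default rather than executed.

```shell
# Dry run by default: the command is printed, not executed.
# Set GCLOUD=gcloud to run against a real project.
GCLOUD="${GCLOUD:-echo gcloud}"

configure_autoscaling() {
  $GCLOUD compute instance-groups managed set-autoscaling my-mig \
    --max-num-replicas=10 \
    --min-num-replicas=2 \
    --target-cpu-utilization=0.7 \
    --scale-in-control=max-scaled-in-replicas=3,time-window=300 \
    --cool-down-period=120 \
    --zone=us-central1-a
}

configure_autoscaling
```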

Tip: Set --min-num-replicas to at least 2 for production workloads. A single instance is a single point of failure.


Autohealing and Health Checks

Autohealing automatically recreates VMs that fail health checks. This catches application-level failures (crashes, freezes, out-of-memory) that VM-level automatic restart would miss, because the VM itself keeps running while the application is down.

LB Health Checks vs Autohealing Health Checks

| Aspect | LB Health Check | Autohealing Health Check |
|---|---|---|
| Purpose | Stop sending traffic to unhealthy instances | Delete and recreate unhealthy instances |
| Aggressiveness | Should be aggressive (quick detection) | Should be conservative (avoid unnecessary recreation) |
| Impact | Traffic shifts; instance keeps running | Instance is deleted and recreated |
| Recommended check interval | 5–10 seconds | 30–60 seconds |
| Recommended unhealthy threshold | 2–3 consecutive failures | 5–10 consecutive failures |

Key Insight: Use separate health checks for load balancing and autohealing. LB checks should be aggressive — catch a struggling instance quickly and stop sending traffic. Autohealing checks should be conservative — recreating a VM is disruptive, so you want to be sure it’s actually broken, not just temporarily slow.
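
The difference is easy to quantify. As a rough estimate, a check marks an instance unhealthy after about check interval × unhealthy threshold seconds of consecutive failures (ignoring probe timeouts and jitter). The helper name `detection_seconds` is illustrative.

```shell
# detection_seconds CHECK_INTERVAL UNHEALTHY_THRESHOLD
# Rough estimate: consecutive failed probes, spaced one interval apart.
detection_seconds() { echo $(( $1 * $2 )); }

echo "LB check:          ~$(detection_seconds 5 2)s to stop routing traffic"
echo "Autohealing check: ~$(detection_seconds 30 5)s before the VM is recreated"
```

With these example values the load balancer reacts in about 10 seconds, while autohealing waits about 150 seconds, giving a briefly overloaded VM time to recover before it is replaced.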

Configuring Autohealing

# Create a health check for autohealing (conservative settings)
gcloud compute health-checks create http autohealing-check \
  --port=80 \
  --check-interval=30 \
  --timeout=10 \
  --unhealthy-threshold=5 \
  --healthy-threshold=2
 
# Attach to the MIG
gcloud compute instance-groups managed update my-mig \
  --health-check=autohealing-check \
  --initial-delay=120 \
  --zone=us-central1-a

--initial-delay sets the grace period after a VM starts before health checking begins. Set this long enough for your Startup Scripts to finish and the application to initialize. If the health check fires too early, autohealing will recreate VMs that are still booting.

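For Spot VMs in a MIG, the group automatically recreates instances that get preempted once capacity is available again. This is part of maintaining the target size, so it works even without an autohealing health check. See Spot VMs for cost-effective compute with self-healing.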


Rolling Updates and Canary Deployments

The MIG Updater lets you deploy new configurations across your instances with controlled disruption.

During a rolling update, the MIG compares the current VM configuration with the target template, creates or recreates VMs in batches, waits for each new VM to become ready, and then continues until the group reaches the target version. The disruption is controlled by two budgets:

  • maxSurge controls how many extra VMs can be created above the target size.
  • maxUnavailable controls how many existing VMs can be offline at the same time.

For zero-downtime stateless updates, use maxUnavailable=0 and maxSurge>0 so replacement VMs become ready before old VMs are removed. This requires enough quota for the temporary extra VMs. If you must preserve instance names, use replacementMethod=RECREATE; that mode requires maxSurge=0, so it is slower and more disruptive.
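
The two budgets translate directly into capacity numbers. The sketch below, assuming absolute values rather than percentages, prints the peak VM count (what your quota must cover) and the minimum number of VMs still serving during the update. `update_envelope` is an illustrative helper.

```shell
# update_envelope TARGET_SIZE MAX_SURGE MAX_UNAVAILABLE
# Prints the peak VM count and the minimum serving VMs during the update.
update_envelope() {
  peak=$(( $1 + $2 ))
  floor=$(( $1 - $3 ))
  echo "peak=$peak min_available=$floor"
}

update_envelope 6 3 0   # zero-downtime: peak=9 min_available=6
update_envelope 6 0 1   # RECREATE-style: peak=6 min_available=5
```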

flowchart TD
    START["Start Rolling Update<br/>Target: replace all VMs"] --> CHECK{Enough quota<br/>for maxSurge?}
    CHECK -->|Yes| SURGE["Create replacement VMs<br/>(up to maxSurge)"]
    CHECK -->|No| FAIL["Update waits<br/>for quota"]
    SURGE --> READY["Wait for healthy<br/>(minReadySec + health check)"]
    READY --> DELETE["Delete old VMs<br/>(maxUnavailable budget)"]
    DELETE --> REMAIN{More VMs<br/>to update?}
    REMAIN -->|Yes| SURGE
    REMAIN -->|No| DONE["Update complete<br/>version target reached<br/>status.isStable=true"]

Update Parameters

| Parameter | Default (Zonal) | Default (Regional) | Purpose |
|---|---|---|---|
| maxSurge | 1 | Number of zones (3) | Extra VMs created during update |
| maxUnavailable | 1 | Number of zones (3) | VMs allowed offline at any time |
| minReadySec | 0 | 0 | Wait time before considering a VM ready |
| replacementMethod | SUBSTITUTE | SUBSTITUTE | SUBSTITUTE creates replacement VMs with new names; RECREATE preserves names but requires maxSurge=0 |

Update types:

  • Proactive — The MIG automatically rolls out the update to all instances
  • Opportunistic — Updates applied only when instances are recreated for other reasons (resize, repair)
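
Proactive is the default for `start-update`; an opportunistic rollout is requested with `--type=opportunistic`. A sketch, reusing the example MIG and template names from this guide and printing the command by default:

```shell
# Dry run by default: the command is printed, not executed.
# Set GCLOUD=gcloud to run against a real project.
GCLOUD="${GCLOUD:-echo gcloud}"

stage_opportunistic_update() {
  # New template is recorded now; VMs pick it up only when they are recreated.
  $GCLOUD compute instance-groups managed rolling-action start-update my-mig \
    --version=template=web-server-v2 \
    --type=opportunistic \
    --zone=us-central1-a
}

stage_opportunistic_update
```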

To confirm completion, check both status.versionTarget.isReached and status.isStable. The version target can be reached while the group is still finishing repairs, verifications, or other actions.
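
One way to script this check is to read both fields with `describe` and require both to be true. The helper `update_finished` below is illustrative; it compares case-insensitively because the exact casing of booleans printed by `--format='value(...)'` is an assumption here. The gcloud calls are shown as comments so the snippet runs without a project.

```shell
# update_finished IS_REACHED IS_STABLE -> exit status 0 only when both are true.
update_finished() {
  reached=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  stable=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
  [ "$reached" = "true" ] && [ "$stable" = "true" ]
}

# Real usage (requires gcloud and an existing MIG):
#   reached=$(gcloud compute instance-groups managed describe my-mig \
#     --zone=us-central1-a --format='value(status.versionTarget.isReached)')
#   stable=$(gcloud compute instance-groups managed describe my-mig \
#     --zone=us-central1-a --format='value(status.isStable)')
#   update_finished "$reached" "$stable" && echo "rollout complete"

# Target version reached, but the group is still finishing other actions:
if update_finished True False; then echo "rollout complete"; else echo "still converging"; fi
```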

Rolling Update

# --min-ready is only available in the gcloud beta track at the time of writing
gcloud beta compute instance-groups managed rolling-action start-update my-mig \
  --version=template=web-server-v2 \
  --max-surge=3 \
  --max-unavailable=1 \
  --min-ready=2m \
  --zone=us-central1-a

Canary Update (10% of VMs)

gcloud compute instance-groups managed rolling-action start-update my-mig \
  --version=template=web-server-v1 \
  --canary-version=template=web-server-v2,target-size=10% \
  --zone=us-central1-a

A MIG supports up to two instance template versions simultaneously. After verifying the canary, roll forward:

gcloud compute instance-groups managed rolling-action start-update my-mig \
  --version=template=web-server-v2 \
  --zone=us-central1-a

Rollback

A rollback is an ordinary update that targets the previous template. Setting --max-unavailable=100% replaces every VM at once, which is the fastest way back but takes the whole group offline while the VMs are recreated; use a smaller value if the group must keep serving traffic.

gcloud compute instance-groups managed rolling-action start-update my-mig \
  --version=template=web-server-v1 \
  --max-unavailable=100% \
  --zone=us-central1-a

See Instance Templates for the full template creation and update workflow.


Stateful vs Stateless MIGs

| Aspect | Stateless MIG | Stateful MIG |
|---|---|---|
| VM identity | Disposable; names can change | Preserved across recreation |
| Persistent disks | Ephemeral or recreated from template | Attached to specific instance, preserved |
| Metadata | Same across all VMs | Per-instance metadata preserved |
| Autoscaling | Supported | Not supported |
| Autohealing | Supported | Supported |
| Update method | SUBSTITUTE (default) | RECREATE (required) |
| Use case | Web servers, API backends, workers | Databases, legacy apps, stateful processing |

Stateful MIGs preserve instance names, persistent disks, internal IPs, and per-instance metadata across VM recreation. This makes them suitable for workloads like Cassandra, Elasticsearch, Kafka, ZooKeeper, or legacy monoliths that depend on stable instance identity.

Stateless MIGs treat all VMs as interchangeable. When a VM is recreated, it gets a fresh disk and no preserved state. This is the right choice for web frontends, REST APIs, and any horizontally scalable workload.

Note: Stateful MIGs are a specialized feature. Most workloads should use stateless MIGs. Only use stateful MIGs when your application requires stable instance identity or per-instance disk state. Consider managed services (Cloud SQL, Dataproc, Memorystore) before committing to stateful MIGs for databases or data processing.
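
If you do need a stateful MIG, the sketch below shows the general shape: mark a disk as stateful, then update with the RECREATE method so instance names survive. The names `my-stateful-mig`, `data-disk`, and `db-template-v2` are placeholders, and the template is assumed to define a disk whose device name is `data-disk`. Commands are printed by default rather than executed.

```shell
# Dry run by default: each command is printed, not executed.
# Set GCLOUD=gcloud to run against a real project.
GCLOUD="${GCLOUD:-echo gcloud}"

make_stateful() {
  # Keep the disk with device name "data-disk" across VM recreation.
  $GCLOUD compute instance-groups managed update my-stateful-mig \
    --stateful-disk=device-name=data-disk,auto-delete=never \
    --zone=us-central1-a
  # Roll out a new template in place; RECREATE requires max-surge=0.
  $GCLOUD compute instance-groups managed rolling-action start-update my-stateful-mig \
    --version=template=db-template-v2 \
    --replacement-method=recreate \
    --max-surge=0 \
    --zone=us-central1-a
}

make_stateful
```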


Creating MIGs

MIGs require an Instance Template. Create one first, then create the MIG.

Zonal MIG

gcloud compute instance-groups managed create my-mig \
  --template=web-server-template \
  --size=3 \
  --zone=us-central1-a

Regional MIG

gcloud compute instance-groups managed create my-regional-mig \
  --template=web-server-template \
  --size=6 \
  --region=us-central1

MIG with Autoscaling

gcloud compute instance-groups managed create autoscaled-mig \
  --template=web-server-template \
  --size=3 \
  --zone=us-central1-a
 
gcloud compute instance-groups managed set-autoscaling autoscaled-mig \
  --max-num-replicas=10 \
  --min-num-replicas=3 \
  --target-cpu-utilization=0.7 \
  --zone=us-central1-a

MIG with Spot VMs

The template specifies the Spot provisioning model, then the MIG auto-recreates preempted instances:

gcloud compute instance-templates create spot-template \
  --machine-type=e2-medium \
  --provisioning-model=SPOT \
  --image-family=debian-12 \
  --image-project=debian-cloud
 
gcloud compute instance-groups managed create spot-mig \
  --template=spot-template \
  --size=5 \
  --zone=us-central1-a

See Spot VMs for details on Spot provisioning behavior and cost savings.

Tip: Use Custom Images in your template instead of long startup scripts. Pre-baked images boot faster, which means faster scale-out and shorter initialization periods for autoscaling.

Useful Commands

| Task | Command |
|---|---|
| List instances in a MIG | gcloud compute instance-groups managed list-instances MIG_NAME --zone=ZONE |
| Describe a MIG | gcloud compute instance-groups managed describe MIG_NAME --zone=ZONE |
| Resize a MIG | gcloud compute instance-groups managed resize MIG_NAME --size=N --zone=ZONE |
| Delete a MIG | gcloud compute instance-groups managed delete MIG_NAME --zone=ZONE |
| Wait for update to finish | gcloud compute instance-groups managed wait-until MIG_NAME --version-target-reached --zone=ZONE |

Best Practices

| Practice | Why |
|---|---|
| Use regional MIGs for production | Multi-zone distribution protects against zonal failure |
| Use separate health checks for LB and autohealing | LB checks should be aggressive; autohealing checks should be conservative |
| Set initialDelaySec correctly | Prevent autohealing from recreating VMs that are still booting |
| Use custom images instead of long startup scripts | Faster scale-out, no dependency on package repos at boot time |
| Set maxSurge > 0 and maxUnavailable = 0 | Zero-downtime updates: new VMs are ready before old ones are removed |
| Use canary updates for risky deployments | Test on a subset before full rollout |
| Keep templates immutable | Create new templates for updates; never try to edit existing ones |
| Monitor status.versionTarget.isReached and status.isStable | Confirm the target template is reached and the MIG has no pending actions |
| Use RECREATE replacement for stateful MIGs | Required to preserve instance names and disk state |
| Set --min-num-replicas >= 2 | Avoid a single point of failure in production |

TL;DR

  • Instance groups come in two types: managed (MIGs) for identical, auto-managed VMs, and unmanaged for load balancing heterogeneous VMs.
  • MIGs provide autoscaling, autohealing, rolling updates, and regional deployment. Unmanaged groups provide none of these.
  • Regional MIGs spread VMs across multiple zones for zonal failure protection. Default zone count is 3.
  • Autoscaling supports CPU, load balancing, Cloud Monitoring metrics, schedules, predictive, and Pub/Sub signals.
  • Use separate health checks for load balancing (aggressive) and autohealing (conservative). They serve different purposes.
  • Rolling updates use maxSurge and maxUnavailable to control disruption. Canary updates test a new template on a subset of VMs.
  • MIGs require an instance template. Create a new template for each update — templates are immutable.
  • Use custom images in templates for fast scale-out. Avoid long startup scripts that slow boot time.
  • There is no separate charge for instance groups. You pay for the VMs, disks, load balancers, health checks, logging, and other resources the group creates or uses.

Resources

  • Instance Groups Documentation: Official overview of managed and unmanaged instance groups.
  • Create Managed Instance Groups: Step-by-step guides for zonal and regional MIG creation.
  • Autoscaling Groups of Instances: Autoscaling policies, configuration, and behavior.
  • Rolling Updates in MIGs: Automated updates, canary deployments, and update policy options.
  • Set Up Autohealing: Health check configuration for autohealing policies.
  • Stateful MIGs: Preserving per-instance state across recreation and updates.
  • Instance Templates: How to create the templates that MIGs require.
  • Custom Images: Build fast-booting images for MIG scale-out.
  • High Availability, Live Migration, and Automatic Restart: How host maintenance events interact with MIG autohealing.
  • VM Startup Scripts: Automate VM configuration; health checks detect failed startup scripts.
  • Spot VMs: Use Spot VMs in MIGs for cost-effective, self-healing compute.
  • Google Compute Engine: Overview of GCE features and architecture.