Instance Groups

How to use managed and unmanaged instance groups on Google Compute Engine for scalable, self-healing VM deployments.

What Are Instance Groups?

An instance group is a collection of VMs that you manage as a single entity. Google Cloud offers two types:

Managed Instance Groups (MIGs) — Identical VMs created from an Instance Template, with built-in autoscaling, autohealing, and automated updates.
Unmanaged Instance Groups — A loose collection of heterogeneous VMs, used primarily as a load balancer backend when VMs have different configurations.

Aspect	Managed Instance Group	Unmanaged Instance Group
VMs	Identical (from template)	Heterogeneous
Autoscaling	Yes	No
Autohealing	Yes	No
Rolling updates	Yes	No
Multi-zone	Yes (regional MIGs)	No
Use case	Scalable production workloads	Load balancing existing VMs

In practice: MIGs are the standard way to run production VM fleets on GCE. You declare the desired state (template, target size), and the MIG continuously reconciles the actual state to match it.

Managed Instance Groups (MIGs)

A MIG creates and maintains a group of identical VMs from a single instance template. You set the target size, and the MIG handles creation, monitoring, healing, and scaling.

Core capabilities:

Autoscaling — Dynamically adds or removes VMs based on load (CPU, LB capacity, custom metrics, schedules, Pub/Sub queue depth)
Autohealing — Recreates VMs that fail health checks, including application-level checks (crashes, freezes, OOM)
Regional deployment — Distributes VMs across multiple zones to survive zonal failures
Automated updates — Rolling updates and canary deployments with controlled disruption
Stateful support — Optional per-instance state preservation (disks, IPs, metadata)

flowchart TD
    T["Instance Template"] --> MIG["Managed Instance Group<br/>(target size: N)"]
    MIG --> VM1["VM 1"]
    MIG --> VM2["VM 2"]
    MIG --> VMN["VM N"]
    LB["Load Balancer"] --> MIG
    HC["Health Check"] -->|signals unhealthy| MIG
    MIG -->|recreates| VMN
    AS["Autoscaler"] -->|scale out/in| MIG

Key Insight: MIGs are intent-based. You declare the desired state (which template, how many VMs), and the MIG continuously converges to that state. If a VM crashes, is preempted, or fails a health check, the MIG replaces it automatically.
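The convergence idea can be pictured as a reconcile loop. The sketch below is a toy model only, not how Compute Engine is implemented: `reconcile`, `make_vm`, and the VM dictionaries are hypothetical names invented for illustration.

```python
from itertools import count

_ids = count(1)

def make_vm(template):
    """Hypothetical helper: build a fresh, healthy VM record from a template name."""
    return {"name": f"vm-{next(_ids)}", "template": template, "healthy": True}

def reconcile(vms, template, target_size):
    """Return a VM list that matches the declared state.

    Unhealthy VMs are dropped, then VMs are added or removed until the
    group holds exactly target_size VMs.
    """
    kept = [vm for vm in vms if vm["healthy"]]
    while len(kept) < target_size:      # replace failed VMs or scale out
        kept.append(make_vm(template))
    return kept[:target_size]           # scale in if there are too many

group = [make_vm("web-v1") for _ in range(3)]
group[1]["healthy"] = False             # simulate a crash or a failed health check
group = reconcile(group, "web-v1", target_size=3)
print([vm["name"] for vm in group])     # ['vm-1', 'vm-3', 'vm-4']
```

The caller never says "create vm-4"; it only states the target size, and the loop works out what is missing.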

Unmanaged Instance Groups

An unmanaged instance group is a collection of VMs with different configurations that you can use as a load balancer backend. You add and remove VMs manually.

What they do: Let you load balance across a fleet of individually managed, non-identical VMs.

What they do not provide: Autoscaling, autohealing, rolling updates, multi-zone deployment, instance templates, or any automated management. Each unmanaged group is also limited to a maximum of 2,000 VMs.

Scenario	Why Unmanaged
Existing heterogeneous VMs	VMs with different configs that need a single LB backend
Migration phase	Temporarily grouping VMs while migrating to MIGs
One-off load balancing	Simple case where MIG overhead is unnecessary

Warning: Do not use unmanaged instance groups for new production workloads. They lack autoscaling, autohealing, and automated updates. Use MIGs instead. Unmanaged groups exist primarily for load balancing legacy or heterogeneous VM fleets.

MIG vs Unmanaged Comparison

Feature	Managed (MIG)	Unmanaged
VM homogeneity	Identical (template-based)	Heterogeneous
Autoscaling	Yes (CPU, LB, metrics, schedule, Pub/Sub)	No
Autohealing	Yes (health-check driven recreation)	No
Rolling updates	Yes (with canary support)	No
Regional (multi-zone)	Yes	No
Instance templates	Required	Not used
Stateful support	Yes	No
Max VMs	1,000 zonal (expandable) / 2,000 regional (expandable to 4,000)	2,000
Load balancing	Backend service or target pool	Backend service or target pool
Pricing	No separate instance group charge	No separate instance group charge

Zonal vs Regional MIGs

Property	Zonal MIG	Regional MIG
Zones	Single zone	Multiple zones (default 3)
Max VMs	1,000 (expandable)	2,000 (expandable)
Zonal failure tolerance	None	Yes (traffic shifts to remaining zones)
Creation	`--zone=ZONE`	`--region=REGION`
Default maxSurge	1	Number of zones (default 3)
Default maxUnavailable	1	Number of zones (default 3)
Pub/Sub autoscaling	Yes	Yes

A zonal MIG is simpler but vulnerable to a single-zone outage. A regional MIG spreads instances across multiple zones within a region and can redistribute them after a zone recovers.
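To see what spreading across zones means for capacity, here is a minimal sketch that assumes an even split where per-zone counts differ by at most one. The function name, the zone names other than `us-central1-a`, and the choice of which zone receives the extra VM are assumptions for illustration.

```python
def spread_across_zones(target_size, zones):
    """Split target_size VMs across zones so per-zone counts differ by at most 1."""
    base, extra = divmod(target_size, len(zones))
    return {zone: base + (1 if i < extra else 0) for i, zone in enumerate(zones)}

zones = ["us-central1-a", "us-central1-b", "us-central1-c"]

layout = spread_across_zones(6, zones)
print(layout)                      # 2 VMs in each zone
print(spread_across_zones(7, zones))  # one zone gets 3, the others 2

# Capacity left if one zone fails: plan the target size with this in mind.
surviving = sum(n for zone, n in layout.items() if zone != "us-central1-a")
print(surviving)                   # 4
```

With 6 VMs across 3 zones, a single-zone outage leaves 4 VMs serving, so the group should be sized so that two thirds of it can carry the load.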

Tip: Use regional MIGs for production workloads. There is no separate charge for choosing a regional MIG, but you still pay for the VMs, disks, load balancers, and other resources the group uses.

Autoscaling

MIG autoscaling adds or removes VMs based on load signals. You can combine multiple signals in a single policy; the autoscaler uses the largest recommended size across all of them.
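The "largest recommendation wins" rule can be sketched as follows. The signal names and numbers are made up, and the function is an illustration of the rule, not the autoscaler's actual code.

```python
def combine_recommendations(recommendations, min_replicas, max_replicas):
    """Pick the largest per-signal recommendation, then clamp it to the policy bounds."""
    largest = max(recommendations.values())
    return max(min_replicas, min(max_replicas, largest))

# Each signal independently suggests a group size (hypothetical values).
signals = {"cpu": 4, "load_balancing": 6, "schedule": 3}
print(combine_recommendations(signals, min_replicas=2, max_replicas=10))       # 6
print(combine_recommendations({"cpu": 14}, min_replicas=2, max_replicas=10))   # 10
```

Taking the maximum means no signal is ever under-provisioned: the group is always large enough for the most demanding one.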

Policy	Signal	Best For
CPU utilization	Average CPU across group	General web serving, API backends
Load balancing capacity	HTTP load per instance	HTTP/S traffic behind a load balancer
Cloud Monitoring metric	Any custom or built-in metric	Application-specific signals (queue depth, latency)
Schedule-based	Time of day / day of week	Predictable traffic patterns
Predictive	ML-based forecast	Workloads with historical patterns and slow initialization
Pub/Sub queue	Unacknowledged messages in a subscription	Async processing, event-driven workloads

CPU-Based Autoscaling

gcloud compute instance-groups managed set-autoscaling my-mig \
  --max-num-replicas=10 \
  --min-num-replicas=2 \
  --target-cpu-utilization=0.7 \
  --zone=us-central1-a
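As a rough mental model of what a 0.7 CPU target does, the sketch below sizes the group so that the same total load would average out at the target. It is a simplification under stated assumptions: it ignores stabilization, initialization periods, and other signals, and the function name is hypothetical.

```python
import math

def cpu_recommended_size(cpu_per_vm, target, min_replicas, max_replicas):
    """Size the group so the current total CPU load averages out at `target` per VM."""
    needed = math.ceil(sum(cpu_per_vm) / target)
    return max(min_replicas, min(max_replicas, needed))

# Four VMs running hot at 90% each: the group should grow.
print(cpu_recommended_size([0.9, 0.9, 0.9, 0.9], 0.7, 2, 10))  # 6
# Three mostly idle VMs: the group shrinks, but not below the minimum.
print(cpu_recommended_size([0.2, 0.2, 0.2], 0.7, 2, 10))       # 2
```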

Pub/Sub-Based Autoscaling

gcloud compute instance-groups managed set-autoscaling pubsub-mig \
  --max-num-replicas=20 \
  --min-num-replicas=1 \
  --update-stackdriver-metric=pubsub.googleapis.com/subscription/num_undelivered_messages \
  --stackdriver-metric-filter='resource.type="pubsub_subscription" AND resource.labels.subscription_id="my-sub"' \
  --stackdriver-metric-single-instance-assignment=100 \
  --zone=us-central1-a

Use --region=REGION instead of --zone=ZONE when configuring autoscaling for a regional MIG.
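The single-instance assignment of 100 in the Pub/Sub example reads as "each VM is expected to handle about 100 undelivered messages". A minimal sketch of the resulting sizing, assuming a simple ceiling division clamped to the 1 to 20 replica range from that command (the function name is hypothetical):

```python
def pubsub_recommended_size(backlog, per_instance=100, min_replicas=1, max_replicas=20):
    """Number of VMs needed if each one is assigned `per_instance` undelivered messages."""
    needed = -(-backlog // per_instance)   # ceiling division on integers
    return max(min_replicas, min(max_replicas, needed))

print(pubsub_recommended_size(750))    # 8
print(pubsub_recommended_size(0))      # 1  (never below the minimum)
print(pubsub_recommended_size(5000))   # 20 (capped at the maximum)
```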

Note: Autoscaling is configured on the MIG, not the instance template. You set it after creating the MIG.

Scale-in controls let you limit how fast the group can shrink (e.g., “remove at most 3 VMs per 300 seconds”). Use these for workloads with long initialization times to prevent sudden capacity drops.
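One way to picture a scale-in control: within the time window, the group may not drop more than the allowed number of VMs below its recent peak. The sketch below models that rule with hypothetical names; it is a simplified illustration, not the autoscaler's implementation.

```python
def apply_scale_in_control(recommended, peak_in_window, max_scaled_in):
    """Limit how far below the window's peak size the group may shrink."""
    floor = peak_in_window - max_scaled_in
    return max(recommended, floor)

# Load fell sharply: signals suggest 10 VMs, but the peak in the last
# 300 seconds was 20 and at most 3 VMs may be removed per window.
print(apply_scale_in_control(recommended=10, peak_in_window=20, max_scaled_in=3))  # 17
# A small reduction is within budget and passes through unchanged.
print(apply_scale_in_control(recommended=19, peak_in_window=20, max_scaled_in=3))  # 19
```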

Initialization period (formerly cool down) tells the autoscaler how long to ignore usage data from a newly created VM while it boots and initializes. Set this to match your application’s startup time.
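The effect of the initialization period can be sketched as filtering out samples from VMs that are still starting up. This is a simplified model with made-up field names (`age_sec`, `cpu_percent`), not the autoscaler's real data format.

```python
def usable_cpu_samples(vms, initialization_period_sec):
    """Keep CPU samples only from VMs that are past their initialization period."""
    return [vm["cpu_percent"] for vm in vms if vm["age_sec"] >= initialization_period_sec]

vms = [
    {"age_sec": 600, "cpu_percent": 60},
    {"age_sec": 600, "cpu_percent": 80},
    {"age_sec": 20,  "cpu_percent": 100},  # still booting; its CPU spike is startup noise
]

samples = usable_cpu_samples(vms, initialization_period_sec=120)
print(samples)                        # [60, 80]
print(sum(samples) // len(samples))   # 70
```

Without the filter, the booting VM would push the average to 80% and could trigger another, unnecessary scale-out.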

Tip: Set --min-num-replicas to at least 2 for production workloads. A single instance is a single point of failure.

Autohealing and Health Checks

Autohealing automatically recreates VMs that fail health checks. Because the check probes the application itself, it catches application-level failures (crashes, freezes, out-of-memory) on VMs that are still running and would otherwise look fine.

LB Health Checks vs Autohealing Health Checks

Aspect	LB Health Check	Autohealing Health Check
Purpose	Stop sending traffic to unhealthy instances	Delete and recreate unhealthy instances
Aggressiveness	Should be aggressive (quick detection)	Should be conservative (avoid unnecessary recreation)
Impact	Traffic shifts; instance keeps running	Instance is deleted and recreated
Recommended check interval	5–10 seconds	30–60 seconds
Recommended unhealthy threshold	2–3 consecutive failures	5–10 consecutive failures

Key Insight: Use separate health checks for load balancing and autohealing. LB checks should be aggressive — catch a struggling instance quickly and stop sending traffic. Autohealing checks should be conservative — recreating a VM is disruptive, so you want to be sure it’s actually broken, not just temporarily slow.
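A quick way to compare "aggressive" and "conservative" is to estimate how long an instance must keep failing before it is marked unhealthy. The sketch below uses the approximation interval × unhealthy threshold and ignores probe timeouts and timing jitter; the function name is hypothetical and the values come from the low end of the recommended ranges in the table.

```python
def seconds_to_unhealthy(check_interval_sec, unhealthy_threshold):
    """Approximate time of continuous failure before an instance is marked unhealthy."""
    return check_interval_sec * unhealthy_threshold

lb_check = seconds_to_unhealthy(check_interval_sec=5, unhealthy_threshold=2)
autohealing_check = seconds_to_unhealthy(check_interval_sec=30, unhealthy_threshold=5)

print(lb_check)            # 10  -> traffic is pulled after about 10 seconds
print(autohealing_check)   # 150 -> the VM is recreated only after about 2.5 minutes
```

The gap between the two numbers is the window in which a struggling VM stops receiving traffic but still has a chance to recover on its own.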

Configuring Autohealing

# Create a health check for autohealing (conservative settings)
gcloud compute health-checks create http autohealing-check \
  --port=80 \
  --check-interval=30 \
  --timeout=10 \
  --unhealthy-threshold=5 \
  --healthy-threshold=2
 
# Attach to the MIG
gcloud compute instance-groups managed update my-mig \
  --health-check=autohealing-check \
  --initial-delay=120 \
  --zone=us-central1-a

--initial-delay sets the grace period after a VM starts before health checking begins. Set this long enough for your Startup Scripts to finish and the application to initialize. If the health check fires too early, autohealing will recreate VMs that are still booting.

For Spot VMs in a MIG, the group automatically recreates instances that get preempted once Spot capacity is available again. See Spot VMs for details on using them as cost-effective, self-healing compute.

Rolling Updates and Canary Deployments

The MIG Updater lets you deploy new configurations across your instances with controlled disruption.

During a rolling update, the MIG compares the current VM configuration with the target template, creates or recreates VMs in batches, waits for each new VM to become ready, and then continues until the group reaches the target version. The disruption is controlled by two budgets:

maxSurge controls how many extra VMs can be created above the target size.
maxUnavailable controls how many existing VMs can be offline at the same time.

For zero-downtime stateless updates, use maxUnavailable=0 and maxSurge>0 so replacement VMs become ready before old VMs are removed. This requires enough quota for the temporary extra VMs. If you must preserve instance names, use replacementMethod=RECREATE; that mode requires maxSurge=0, so it is slower and more disruptive.
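The two budgets translate directly into capacity bounds during an update. The sketch below is an illustration with hypothetical function names: it checks the two constraints (RECREATE needs maxSurge=0, and the two budgets cannot both be 0) and computes the peak VM count you need quota for and the minimum number of VMs left available.

```python
def check_update_policy(max_surge, max_unavailable, replacement_method="SUBSTITUTE"):
    """Return a list of problems with the policy (an empty list means it looks valid)."""
    problems = []
    if max_surge == 0 and max_unavailable == 0:
        problems.append("maxSurge and maxUnavailable cannot both be 0")
    if replacement_method == "RECREATE" and max_surge != 0:
        problems.append("RECREATE requires maxSurge=0")
    return problems

def update_bounds(target_size, max_surge, max_unavailable):
    """Peak VM count (quota needed) and minimum available VMs during the update."""
    return {"peak_vms": target_size + max_surge,
            "min_available": target_size - max_unavailable}

# Zero-downtime settings for a 6-VM group: capacity never drops, quota for 9 VMs is needed.
print(check_update_policy(max_surge=3, max_unavailable=0))              # []
print(update_bounds(target_size=6, max_surge=3, max_unavailable=0))     # peak 9, min 6
# Name-preserving update: surge is not allowed, so some VMs must go offline.
print(check_update_policy(max_surge=1, max_unavailable=0, replacement_method="RECREATE"))
```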

flowchart TD
    START["Start Rolling Update<br/>Target: replace all VMs"] --> CHECK{Enough quota<br/>for maxSurge?}
    CHECK -->|Yes| SURGE["Create replacement VMs<br/>(up to maxSurge)"]
    CHECK -->|No| FAIL["Update waits<br/>for quota"]
    SURGE --> READY["Wait for healthy<br/>(minReadySec + health check)"]
    READY --> DELETE["Delete old VMs<br/>(maxUnavailable budget)"]
    DELETE --> REMAIN{More VMs<br/>to update?}
    REMAIN -->|Yes| SURGE
    REMAIN -->|No| DONE["Update complete<br/>version target reached<br/>status.isStable=true"]

Update Parameters

Parameter	Default (Zonal)	Default (Regional)	Purpose
`maxSurge`	1	Number of zones (3)	Extra VMs created during update
`maxUnavailable`	1	Number of zones (3)	VMs allowed offline at any time
`minReadySec`	0	0	Wait time before considering a VM ready
`replacementMethod`	`SUBSTITUTE`	`SUBSTITUTE`	`SUBSTITUTE` creates replacement VMs with new names; `RECREATE` preserves names but requires `maxSurge=0`

Update types:

Proactive — The MIG automatically rolls out the update to all instances
Opportunistic — Updates applied only when instances are recreated for other reasons (resize, repair)

To confirm completion, check both status.versionTarget.isReached and status.isStable. The version target can be reached while the group is still finishing repairs, verifications, or other actions.
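A minimal sketch of that two-field check, assuming the MIG description has been exported as JSON (for example with `--format=json` on the describe command). The sample document is hand-written and trimmed to the two fields named above; the function name is hypothetical.

```python
import json

def update_finished(mig):
    """True only when the version target is reached AND the group is stable."""
    status = mig.get("status", {})
    reached = status.get("versionTarget", {}).get("isReached", False)
    stable = status.get("isStable", False)
    return reached and stable

# All VMs are on the new template, but the group is still finishing other actions.
in_progress = json.loads(
    '{"status": {"isStable": false, "versionTarget": {"isReached": true}}}'
)
done = json.loads(
    '{"status": {"isStable": true, "versionTarget": {"isReached": true}}}'
)

print(update_finished(in_progress))  # False
print(update_finished(done))         # True
```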

Rolling Update

gcloud compute instance-groups managed rolling-action start-update my-mig \
  --version=template=web-server-v2 \
  --max-surge=3 \
  --max-unavailable=1 \
  --min-ready=2m \
  --zone=us-central1-a

Canary Update (10% of VMs)

gcloud compute instance-groups managed rolling-action start-update my-mig \
  --version=template=web-server-v1 \
  --canary-version=template=web-server-v2,target-size=10% \
  --zone=us-central1-a

A MIG supports up to two instance template versions simultaneously. After verifying the canary, roll forward:

gcloud compute instance-groups managed rolling-action start-update my-mig \
  --version=template=web-server-v2 \
  --zone=us-central1-a

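Rollback

To roll back, start a new update that targets the previous template. Setting --max-unavailable=100% lets the MIG replace all VMs at once, trading availability for speed.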

gcloud compute instance-groups managed rolling-action start-update my-mig \
  --version=template=web-server-v1 \
  --max-unavailable=100% \
  --zone=us-central1-a

See Instance Templates for the full template creation and update workflow.

Stateful vs Stateless MIGs

Aspect	Stateless MIG	Stateful MIG
VM identity	Disposable; names can change	Preserved across recreation
Persistent disks	Ephemeral or recreated from template	Attached to specific instance, preserved
Metadata	Same across all VMs	Per-instance metadata preserved
Autoscaling	Supported	Not supported
Autohealing	Supported	Supported
Update method	`SUBSTITUTE` (default)	`RECREATE` (required)
Use case	Web servers, API backends, workers	Databases, legacy apps, stateful processing

Stateful MIGs preserve instance names, persistent disks, internal IPs, and per-instance metadata across VM recreation. This makes them suitable for workloads like Cassandra, Elasticsearch, Kafka, ZooKeeper, or legacy monoliths that depend on stable instance identity.

Stateless MIGs treat all VMs as interchangeable. When a VM is recreated, it gets a fresh disk and no preserved state. This is the right choice for web frontends, REST APIs, and any horizontally scalable workload.

Note: Stateful MIGs are a specialized feature. Most workloads should use stateless MIGs. Only use stateful MIGs when your application requires stable instance identity or per-instance disk state. Consider managed services (Cloud SQL, Dataproc, Memorystore) before committing to stateful MIGs for databases or data processing.

Creating MIGs

MIGs require an Instance Template. Create one first, then create the MIG.

Zonal MIG

gcloud compute instance-groups managed create my-mig \
  --template=web-server-template \
  --size=3 \
  --zone=us-central1-a

Regional MIG

gcloud compute instance-groups managed create my-regional-mig \
  --template=web-server-template \
  --size=6 \
  --region=us-central1

MIG with Autoscaling

gcloud compute instance-groups managed create autoscaled-mig \
  --template=web-server-template \
  --size=3 \
  --zone=us-central1-a
 
gcloud compute instance-groups managed set-autoscaling autoscaled-mig \
  --max-num-replicas=10 \
  --min-num-replicas=3 \
  --target-cpu-utilization=0.7 \
  --zone=us-central1-a

MIG with Spot VMs

The template specifies the Spot provisioning model; the MIG then recreates preempted instances when capacity is available:

gcloud compute instance-templates create spot-template \
  --machine-type=e2-medium \
  --provisioning-model=SPOT \
  --image-family=debian-12 \
  --image-project=debian-cloud
 
gcloud compute instance-groups managed create spot-mig \
  --template=spot-template \
  --size=5 \
  --zone=us-central1-a

See Spot VMs for details on Spot provisioning behavior and cost savings.

Tip: Use Custom Images in your template instead of long startup scripts. Pre-baked images boot faster, which means faster scale-out and shorter initialization periods for autoscaling.

Useful Commands

Task	Command
List instances in a MIG	`gcloud compute instance-groups managed list-instances MIG_NAME --zone=ZONE`
Describe a MIG	`gcloud compute instance-groups managed describe MIG_NAME --zone=ZONE`
Resize a MIG	`gcloud compute instance-groups managed resize MIG_NAME --size=N --zone=ZONE`
Delete a MIG	`gcloud compute instance-groups managed delete MIG_NAME --zone=ZONE`
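Wait for the version target to be reached	`gcloud compute instance-groups managed wait-until MIG_NAME --version-target-reached --zone=ZONE`
Wait until the MIG is stable	`gcloud compute instance-groups managed wait-until MIG_NAME --stable --zone=ZONE`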

Best Practices

Practice	Why
Use regional MIGs for production	Multi-zone distribution protects against zonal failure
Use separate health checks for LB and autohealing	LB checks should be aggressive; autohealing checks should be conservative
Set `initialDelaySec` (`--initial-delay`) to cover boot and application startup	Prevent autohealing from recreating VMs that are still booting
Use custom images instead of long startup scripts	Faster scale-out, no dependency on package repos at boot time
Set `maxSurge > 0` and `maxUnavailable = 0`	Zero-downtime updates: new VMs are ready before old ones are removed
Use canary updates for risky deployments	Test on a subset before full rollout
Treat templates as versioned artifacts	Instance templates cannot be edited after creation; create a new template for each change
Monitor `status.versionTarget.isReached` and `status.isStable`	Confirm the target template is reached and the MIG has no pending actions
Use `RECREATE` replacement for stateful MIGs	Required to preserve instance names and disk state
Set `--min-num-replicas >= 2`	Avoid a single point of failure in production

TL;DR

Instance groups come in two types: managed (MIGs) for identical, auto-managed VMs, and unmanaged for load balancing heterogeneous VMs.
MIGs provide autoscaling, autohealing, rolling updates, and regional deployment. Unmanaged groups provide none of these.
Regional MIGs spread VMs across multiple zones for zonal failure protection. Default zone count is 3.
Autoscaling supports CPU, load balancing, Cloud Monitoring metrics, schedules, predictive, and Pub/Sub signals.
Use separate health checks for load balancing (aggressive) and autohealing (conservative). They serve different purposes.
Rolling updates use maxSurge and maxUnavailable to control disruption. Canary updates test a new template on a subset of VMs.
MIGs require an instance template. Create a new template for each update — templates are immutable.
Use custom images in templates for fast scale-out. Avoid long startup scripts that slow boot time.
There is no separate charge for instance groups. You pay for the VMs, disks, load balancers, health checks, logging, and other resources the group creates or uses.

Resources

Instance Groups Documentation: Official overview of managed and unmanaged instance groups.

Create Managed Instance Groups: Step-by-step guides for zonal and regional MIG creation.

Autoscaling Groups of Instances: Autoscaling policies, configuration, and behavior.

Rolling Updates in MIGs: Automated updates, canary deployments, and update policy options.

Set Up Autohealing: Health check configuration for autohealing policies.

Stateful MIGs: Preserving per-instance state across recreation and updates.

Instance Templates: How to create the templates that MIGs require.

Custom Images: Build fast-booting images for MIG scale-out.

High Availability, Live Migration, and Automatic Restart: How host maintenance events interact with MIG autohealing.

VM Startup Scripts: Automate VM configuration; health checks detect failed startup scripts.

Spot VMs: Use Spot VMs in MIGs for cost-effective, self-healing compute.

Google Compute Engine: Overview of GCE features and architecture.

Lalit's Cloud & DevOps notes