GKE supports scaling at multiple levels: individual pods, node pools, and the cluster itself. Understanding when and how to scale each layer is key to balancing performance and cost.

Scaling Levels

flowchart TB
    subgraph "Layer 1: Pod Scaling"
        HPA["Horizontal Pod Autoscaler\n(more/fewer pods)"]
        VPA["Vertical Pod Autoscaler\n(bigger/smaller pods)"]
    end

    subgraph "Layer 2: Node Pool Scaling"
        CA["Cluster Autoscaler\n(more/fewer nodes in a pool)"]
    end

    subgraph "Layer 3: Manual"
        M1["kubectl scale"]
        M2["gcloud resize"]
    end

    HPA -->|"not enough resources"| CA
    VPA -->|"pod needs more CPU/memory"| M1

| Scaling Type | Scope | Trigger | Automation |
|---|---|---|---|
| Manual | Pod replicas or node count | Human decision | No |
| HPA | Pod count (horizontal) | CPU, memory, custom metrics | Yes |
| VPA | Pod resource requests (vertical) | Historical usage patterns | Yes |
| Cluster Autoscaler | Node count in a pool | Pending pods / idle nodes | Yes |

Manual Scaling

Scaling Pods

# Scale a deployment to 5 replicas
kubectl scale deployment my-app --replicas=5
 
# Check current scale
kubectl get deployment my-app

Scaling Node Pools

# Resize a node pool
gcloud container clusters resize my-cluster \
  --node-pool=default-pool \
  --zone=us-central1-a \
  --num-nodes=5
 
# Autopilot clusters cannot be resized manually; GKE scales their nodes automatically

Note: Manual scaling is fine for predictable load patterns. For variable workloads, use autoscaling.

Horizontal Pod Autoscaler (HPA)

HPA automatically scales the number of pod replicas based on observed metrics.

How HPA Works

flowchart LR
    Metrics["Metrics Server\n(CPU, Memory, Custom)"] --> HPA["HPA Controller"]
    HPA -->|"current > target"| ScaleUp["Scale Up\n(+ replicas)"]
    HPA -->|"current < target"| ScaleDown["Scale Down\n(- replicas)"]
    HPA -->|"current ≈ target"| NoChange["No Change"]
    ScaleUp --> Deploy["Deployment"]
    ScaleDown --> Deploy

HPA YAML

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70    # Target: average CPU usage at 70% of requests
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80    # Target: average memory usage at 80% of requests
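
When an HPA lists several metrics, as above, the controller computes a desired replica count for each metric independently and acts on the largest one. A minimal Python sketch of that selection follows; the observed values are hypothetical, the function names are illustrative rather than part of any Kubernetes API, and the tolerance band and min/max clamping are left out.

```python
import math


def desired_for_metric(current_replicas: int, current: float, target: float) -> int:
    """Replica count one metric asks for: ceil(replicas * current / target)."""
    return math.ceil(current_replicas * current / target)


def desired_replicas(current_replicas: int, metrics: dict[str, tuple[float, float]]) -> int:
    """Take the largest per-metric proposal, as HPA does with multiple metrics."""
    return max(
        desired_for_metric(current_replicas, current, target)
        for current, target in metrics.values()
    )


# Hypothetical observation: 4 replicas, CPU at 50% (target 70%), memory at 95% (target 80%).
observed = {"cpu": (50, 70), "memory": (95, 80)}
print(desired_replicas(4, observed))  # memory dominates: ceil(4 * 95 / 80) = 5
```

Here CPU alone would allow scaling in to 3 replicas, but memory asks for 5, so the HPA scales out.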

HPA with Custom Metrics

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa-custom
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"       # 100 requests/sec per pod
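
For a Pods metric with an AverageValue target, the scaling ratio reduces to dividing the metric's total across pods by the per-pod target. A rough sketch, assuming every pod reports the metric and ignoring the HPA tolerance band; the traffic figures are made up for illustration.

```python
import math


def replicas_for_average_value(total: float, target_per_pod: float,
                               min_replicas: int, max_replicas: int) -> int:
    """Pods needed so that total / pods <= target_per_pod, clamped to the HPA bounds."""
    wanted = math.ceil(total / target_per_pod)
    return max(min_replicas, min(max_replicas, wanted))


# Target and bounds come from the manifest above; total requests/sec are hypothetical.
for total_rps in (120, 1450, 3000):
    print(total_rps, replicas_for_average_value(total_rps, 100, 2, 20))
```

At 3000 requests/sec the calculation asks for 30 pods, so maxReplicas: 20 becomes the binding limit and each pod runs above its target.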

HPA Commands

# Create HPA imperatively (CPU-based)
kubectl autoscale deployment my-app --cpu-percent=70 --min=2 --max=10
 
# Apply HPA from YAML
kubectl apply -f hpa.yaml
 
# Check HPA status
kubectl get hpa
 
# Detailed HPA info
kubectl describe hpa my-app-hpa

HPA Scaling Formula

desiredReplicas = ceil[currentReplicas × (currentMetricValue / desiredMetricValue)]

Example: Current = 3 replicas, CPU utilization = 90%, target = 70%

desiredReplicas = ceil[3 × (90 / 70)] = ceil[3.86] = 4 replicas
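
The controller also applies two adjustments that the bare formula omits: it skips scaling when the ratio is within a tolerance of 1.0 (0.1 by default in upstream Kubernetes), and it clamps the result to minReplicas and maxReplicas. A small Python sketch of the calculation under those assumptions; the function is illustrative, not the controller's actual code.

```python
import math


def hpa_desired_replicas(current_replicas: int, current_metric: float, target_metric: float,
                         min_replicas: int, max_replicas: int, tolerance: float = 0.1) -> int:
    """ceil(currentReplicas * current / target), with tolerance and min/max clamping."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: leave the replica count alone
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


print(hpa_desired_replicas(3, 90, 70, 2, 10))   # the worked example above: 4
print(hpa_desired_replicas(3, 73, 70, 2, 10))   # within tolerance: stays at 3
print(hpa_desired_replicas(8, 150, 70, 2, 10))  # 18 wanted, capped at maxReplicas: 10
```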

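Key Insight: HPA resource metrics (CPU, memory) require the Metrics Server to be running. GKE clusters have it enabled by default. Verify with kubectl top pods. Custom metrics such as http_requests_per_second additionally require a custom metrics adapter that exposes them to the HPA.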

HPA Behavior Settings

Control how fast HPA scales up and down:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  minReplicas: 2
  maxReplicas: 10
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60    # Act on the lowest recommendation from the last 60s
      policies:
        - type: Percent
          value: 100                     # Double replicas at most
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300   # Act on the highest recommendation from the last 5 min
      policies:
        - type: Pods
          value: 1                       # Remove 1 pod at a time
          periodSeconds: 120
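
To see how these settings constrain a scaling decision, the sketch below applies the stabilization window and the two policies from the manifest to hypothetical replica counts. It is a simplification: it treats the current replica count as the count at the start of the policy period, and the function names are invented for illustration.

```python
def stabilize(recommendations: list[int], direction: str) -> int:
    """Pick from the recommendations seen inside the stabilization window.

    Scale-down uses the highest recent recommendation and scale-up the lowest,
    so a brief dip or spike does not trigger an immediate change.
    """
    return max(recommendations) if direction == "down" else min(recommendations)


def limit_scale_up(current: int, desired: int, percent: int) -> int:
    """Percent policy: add at most `percent` of the current count per period (rounded up)."""
    max_added = (current * percent + 99) // 100  # integer ceiling of current * percent / 100
    return min(desired, current + max_added)


def limit_scale_down(current: int, desired: int, pods: int) -> int:
    """Pods policy: remove at most `pods` replicas per period."""
    return max(desired, current - pods)


print(limit_scale_up(current=3, desired=10, percent=100))  # 6: at most double in 60s
print(limit_scale_down(current=8, desired=2, pods=1))      # 7: one pod per 120s
print(stabilize([6, 4, 5], "down"))                        # 6: highest recent recommendation
```

With these values, reaching 10 replicas from 3 takes at least two scale-up periods, while dropping from 8 to 2 takes at least six scale-down periods.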

Vertical Pod Autoscaler (VPA)

VPA adjusts pod CPU and memory requests based on historical and current usage. Unlike HPA (which adds/removes pods), VPA makes pods bigger or smaller.

Warning: VPA in auto mode evicts and recreates pods to apply new resource settings. This causes temporary disruption. Use updateMode: "Off" to get recommendations without enforcement.

VPA YAML

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Auto"     # Options: Off, Initial, Recreate, Auto
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          cpu: "100m"
          memory: "128Mi"
        maxAllowed:
          cpu: "2"
          memory: "4Gi"

VPA Update Modes

| Mode | Behavior | Use Case |
|---|---|---|
| Off | Only provides recommendations, no changes | Planning and analysis |
| Initial | Sets resources on pod creation only | Gradual adoption |
| Recreate | Evicts and recreates pods with new settings | Stateless workloads |
| Auto | Same as Recreate (currently) | Most workloads |

Viewing VPA Recommendations

# Enable VPA in GKE
gcloud container clusters update my-cluster \
  --zone=us-central1-a \
  --enable-vertical-pod-autoscaling
 
# Check VPA recommendations
kubectl describe vpa my-app-vpa

Output includes:

  Recommendation:
    Container Recommendations:
      Container Name:  app
      Lower Bound:
        Cpu:     100m
        Memory:  128Mi
      Target:               # Recommended values
        Cpu:     250m
        Memory:  256Mi
      Uncapped Target:
        Cpu:     230m
        Memory:  245Mi
      Upper Bound:
        Cpu:     1
        Memory:  1Gi
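
In this output, Uncapped Target is the recommendation before the minAllowed and maxAllowed bounds from resourcePolicy are applied, and Target is the value after. The relationship can be sketched as a clamp over CPU quantities. The helper names are illustrative, and the 250m lower bound in the second call is hypothetical; it shows one way a 230m uncapped value could be raised to 250m.

```python
def cpu_to_millicores(quantity: str) -> int:
    """Convert a Kubernetes CPU quantity such as "250m" or "2" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(round(float(quantity) * 1000))


def capped_target(uncapped: str, min_allowed: str, max_allowed: str) -> int:
    """Clamp an uncapped CPU recommendation to the allowed range, in millicores."""
    low = cpu_to_millicores(min_allowed)
    high = cpu_to_millicores(max_allowed)
    return max(low, min(high, cpu_to_millicores(uncapped)))


print(capped_target("230m", "100m", "2"))  # inside the range from the manifest: 230
print(capped_target("230m", "250m", "2"))  # hypothetical higher floor: raised to 250
print(capped_target("3", "100m", "2"))     # above maxAllowed: lowered to 2000
```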

Cluster Autoscaler

Cluster Autoscaler adjusts the number of nodes in a node pool based on pod scheduling needs:

  • Scale up: When pods are pending due to insufficient resources
  • Scale down: When nodes are underutilized for a period

sequenceDiagram
    participant Deploy as Deployment
    participant Sched as Scheduler
    participant CA as Cluster Autoscaler
    participant NP as Node Pool

    Deploy->>Sched: Create 5 new pods
    Sched->>Sched: Not enough node capacity
    Note over Sched: Pods in Pending state
    Sched->>CA: Report pending pods
    CA->>NP: Add 2 nodes
    NP-->>Sched: New nodes ready
    Sched->>Deploy: Schedule pending pods
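
How many nodes the autoscaler adds depends on how the pending pods pack onto new nodes. The real Cluster Autoscaler simulates the scheduler against the node pool's template; the sketch below is a much cruder stand-in that packs CPU requests only, using first-fit decreasing, with hypothetical pod and node sizes.

```python
def nodes_needed(pod_cpu_requests_m: list[int], node_allocatable_m: int) -> int:
    """Estimate new nodes needed to fit pending pods (CPU only, first-fit decreasing)."""
    free: list[int] = []  # remaining CPU on each new node, in millicores
    for request in sorted(pod_cpu_requests_m, reverse=True):
        if request > node_allocatable_m:
            raise ValueError(f"pod requesting {request}m fits on no node in this pool")
        for i, remaining in enumerate(free):
            if request <= remaining:
                free[i] -= request
                break
        else:
            free.append(node_allocatable_m - request)  # open a new node
    return len(free)


def nodes_to_add(needed: int, current_nodes: int, max_nodes: int) -> int:
    """Respect the pool's max-nodes ceiling; pods beyond it stay Pending."""
    return max(0, min(needed, max_nodes - current_nodes))


# Hypothetical sizes: five pending pods of 1000m each, nodes with 3000m allocatable.
needed = nodes_needed([1000] * 5, node_allocatable_m=3000)
print(needed, nodes_to_add(needed, current_nodes=3, max_nodes=10))
```

With these numbers three pods fit per node, so five pending pods need two new nodes, matching the diagram.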

Enabling Cluster Autoscaler

# Enable on a Standard cluster (per node pool)
gcloud container clusters update my-cluster \
  --zone=us-central1-a \
  --enable-autoscaling \
  --node-pool=default-pool \
  --min-nodes=1 \
  --max-nodes=10
 
# Or during cluster creation
gcloud container clusters create my-cluster \
  --zone=us-central1-a \
  --enable-autoscaling \
  --min-nodes=1 \
  --max-nodes=10 \
  --num-nodes=3

Note: Autopilot clusters have built-in autoscaling. You don’t need to configure Cluster Autoscaler for Autopilot.

Cluster Autoscaler Configuration

| Parameter | Purpose | Recommended |
|---|---|---|
| --min-nodes | Minimum nodes per zone | At least 1 for production |
| --max-nodes | Maximum nodes per zone | Set a budget-appropriate ceiling |
| --total-nodes (Autopilot) | Total node limit across all zones | Default: 70 (soft), can request increase |
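
Because --min-nodes and --max-nodes apply per zone, a node pool that spans several zones can grow to a multiple of the number passed on the command line. A quick illustration in Python; the three-zone layout is a hypothetical example.

```python
def total_node_bounds(min_per_zone: int, max_per_zone: int, zones: int) -> tuple[int, int]:
    """Cluster-wide node range for a pool whose limits are set per zone."""
    return min_per_zone * zones, max_per_zone * zones


print(total_node_bounds(1, 10, zones=1))  # zonal pool: 1 to 10 nodes
print(total_node_bounds(1, 10, zones=3))  # pool across three zones: 3 to 30 nodes
```

When budgeting a ceiling, size --max-nodes against the multiplied figure rather than the per-zone value.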

Scale-Down Behavior

Cluster Autoscaler won’t scale down a node if:

  • A pod on the node has a PodDisruptionBudget that would be violated
  • A pod is not managed by a controller (standalone pod)
  • A pod has a local EmptyDir volume
  • The node has the annotation "cluster-autoscaler.kubernetes.io/scale-down-disabled": "true"
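
The four conditions above can be read as a checklist that runs per node. The sketch below encodes them over a deliberately simplified, hypothetical data model (the Pod and Node classes are not Kubernetes client types) to show how a single pod can pin a node in place.

```python
from dataclasses import dataclass, field
from typing import Optional

SCALE_DOWN_DISABLED = "cluster-autoscaler.kubernetes.io/scale-down-disabled"


@dataclass
class Pod:
    name: str
    has_controller: bool = True
    uses_empty_dir: bool = False
    pdb_disruptions_allowed: Optional[int] = None  # None: no PDB selects this pod


@dataclass
class Node:
    name: str
    annotations: dict[str, str] = field(default_factory=dict)
    pods: list[Pod] = field(default_factory=list)


def scale_down_blockers(node: Node) -> list[str]:
    """Return the reasons, if any, that keep this node from being removed."""
    reasons = []
    if node.annotations.get(SCALE_DOWN_DISABLED) == "true":
        reasons.append("node is annotated scale-down-disabled")
    for pod in node.pods:
        if pod.pdb_disruptions_allowed == 0:
            reasons.append(f"{pod.name}: PodDisruptionBudget allows no disruptions")
        if not pod.has_controller:
            reasons.append(f"{pod.name}: not managed by a controller")
        if pod.uses_empty_dir:
            reasons.append(f"{pod.name}: uses a local emptyDir volume")
    return reasons


node = Node("node-a", pods=[Pod("web-1"), Pod("debug", has_controller=False)])
print(scale_down_blockers(node))
```

An empty list means none of the listed conditions applies; the autoscaler still has to find room for the node's pods elsewhere before removing it.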

HPA vs VPA vs Cluster Autoscaler

| Aspect | HPA | VPA | Cluster Autoscaler |
|---|---|---|---|
| What scales | Pod count | Pod size (CPU/memory) | Node count |
| Direction | Horizontal (more/fewer pods) | Vertical (bigger/smaller pods) | Infrastructure |
| Trigger | Current metrics vs target | Historical usage analysis | Pending pods / idle nodes |
| Disruption | None (add/remove pods) | Pod eviction (in Auto mode) | Pod eviction (node removal) |
| Best combined with | Cluster Autoscaler | HPA (not on same metric) | HPA |

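Warning: Do not use HPA and VPA on the same metric (e.g., both on CPU). They can conflict: HPA measures utilization relative to resource requests, so when VPA changes those requests the HPA's reading shifts and it adds or removes replicas in response, causing instability.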
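
The conflict is easiest to see with numbers. HPA utilization is usage divided by the pod's request, so when VPA raises the request, utilization drops even though the workload has not changed. A short sketch with hypothetical figures (4 replicas, 180m of steady CPU usage per pod, a 70% target):

```python
import math


def utilization_percent(usage_m: int, request_m: int) -> float:
    """CPU utilization as HPA sees it: usage relative to the pod's request."""
    return 100 * usage_m / request_m


def hpa_desired(current_replicas: int, utilization: float, target: float) -> int:
    return math.ceil(current_replicas * utilization / target)


usage_m = 180  # hypothetical steady CPU usage per pod, in millicores
before = utilization_percent(usage_m, request_m=200)  # 90% with the original request
after = utilization_percent(usage_m, request_m=400)   # 45% once VPA doubles the request
print(hpa_desired(4, before, 70), hpa_desired(4, after, 70))
```

Before the VPA change the HPA wants 6 replicas; after it, the same load yields 3. Fewer replicas then push per-pod usage up, which can prompt VPA to raise requests again.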

| Workload Type | Scaling Strategy |
|---|---|
| Web applications | HPA (CPU) + Cluster Autoscaler |
| Batch processing | HPA (queue depth) + Spot node pools |
| Databases (StatefulSets) | VPA (right-sizing) + manual node scaling |
| Memory-heavy apps | HPA (memory metric) + Cluster Autoscaler |
| Unpredictable traffic | HPA (CPU) + VPA (Off mode for recommendations) |

Useful Commands

| Command | Purpose |
|---|---|
| kubectl top pods | View pod CPU/memory usage |
| kubectl top nodes | View node CPU/memory usage |
| kubectl get hpa | List HPA resources |
| kubectl describe hpa NAME | HPA details and scaling events |
| kubectl get vpa | List VPA resources |
| kubectl describe vpa NAME | VPA recommendations |
| kubectl get events --field-selector reason=FailedScheduling | Check pending pods |
| gcloud container clusters describe NAME --zone ZONE | Check autoscaler config |

Common Pitfalls

| Pitfall | Consequence | Fix |
|---|---|---|
| HPA without resource requests | HPA cannot calculate utilization | Always set resources.requests in pod specs |
| Missing Metrics Server | HPA shows <unknown> metrics | GKE includes it by default; verify with kubectl top pods |
| HPA + VPA on same metric | Conflicting scaling decisions | Use HPA for CPU, VPA for memory (or use VPA in Off mode) |
| No PodDisruptionBudgets | Cluster Autoscaler evicts too many pods | Define PDBs for critical workloads |
| Tight max-nodes limit | Pods stay Pending when limit is hit | Set max-nodes based on budget and peak demand |
| VPA auto mode on stateful apps | Database pods evicted mid-operation | Use updateMode: "Off" for stateful workloads |
| Scale-down too aggressive | Nodes removed during temporary dips | Increase the upstream Cluster Autoscaler setting --scale-down-unneeded-time (default 10 min); on GKE, prefer the balanced autoscaling profile over optimize-utilization |

TL;DR

  • Manual scaling: kubectl scale for pods, gcloud container clusters resize for nodes (good for predictable workloads)
  • HPA: Automatic pod count based on CPU/memory/custom metrics (most common autoscaler)
  • VPA: Adjusts pod resource requests based on usage (use Off mode for recommendations first)
  • Cluster Autoscaler: Adds/removes nodes based on pod demand (built into Autopilot)
  • Always set resource requests: HPA and VPA depend on them
  • Don’t run HPA and VPA on the same metric
  • Use PodDisruptionBudgets to protect workloads during scale-down

Resources