GKE Scaling

GKE supports scaling at multiple levels: individual pods, node pools, and the cluster itself. Understanding when and how to scale each layer is key to balancing performance and cost.

Scaling Levels

flowchart TB
    subgraph "Layer 1: Pod Scaling"
        HPA["Horizontal Pod Autoscaler\n(more/fewer pods)"]
        VPA["Vertical Pod Autoscaler\n(bigger/smaller pods)"]
    end

    subgraph "Layer 2: Node Pool Scaling"
        CA["Cluster Autoscaler\n(more/fewer nodes in a pool)"]
    end

    subgraph "Layer 3: Manual"
        M1["kubectl scale"]
        M2["gcloud resize"]
    end

    HPA -->|"not enough resources"| CA
    VPA -->|"larger pods may not fit on current nodes"| CA

Scaling Type	Scope	Trigger	Automation
Manual	Pod replicas or node count	Human decision	No
HPA	Pod count (horizontal)	CPU, memory, custom metrics	Yes
VPA	Pod resource requests (vertical)	Historical usage patterns	Yes
Cluster Autoscaler	Node count in a pool	Pending pods / idle nodes	Yes

Manual Scaling

Scaling Pods

# Scale a deployment to 5 replicas
kubectl scale deployment my-app --replicas=5
 
# Check current scale
kubectl get deployment my-app

Scaling Node Pools

# Resize a node pool
gcloud container clusters resize my-cluster \
  --node-pool=default-pool \
  --zone=us-central1-a \
  --num-nodes=5
 
# Autopilot clusters: not applicable, because GKE provisions and scales nodes automatically

Note: Manual scaling is fine for predictable load patterns. For variable workloads, use autoscaling.

Horizontal Pod Autoscaler (HPA)

HPA automatically scales the number of pod replicas based on observed metrics.

How HPA Works

flowchart LR
    Metrics["Metrics Server\n(CPU, Memory, Custom)"] --> HPA["HPA Controller"]
    HPA -->|"current > target"| ScaleUp["Scale Up\n(+ replicas)"]
    HPA -->|"current < target"| ScaleDown["Scale Down\n(- replicas)"]
    HPA -->|"current ≈ target"| NoChange["No Change"]
    ScaleUp --> Deploy["Deployment"]
    ScaleDown --> Deploy

HPA YAML

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70    # Target: average CPU usage at 70% of requested CPU
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80    # Target: average memory usage at 80% of requested memory
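
The Utilization targets above are percentages of the pods' resource requests, averaged across pods. The sketch below illustrates that arithmetic in plain shell, assuming every pod has the same CPU request. The helper name avg_cpu_utilization and the sample numbers are made up for illustration and are not part of kubectl or GKE.

```shell
# Hypothetical helper: average CPU utilization (%) across pods.
# Usage: avg_cpu_utilization REQUEST_MILLICORES USAGE_MILLICORES...
avg_cpu_utilization() {
  local request=$1; shift
  # Without a request (or without any pods) there is nothing to divide by.
  if (( request <= 0 || $# == 0 )); then
    echo "unknown"
    return
  fi
  local total=0 usage
  for usage in "$@"; do
    total=$(( total + usage ))
  done
  # Total usage divided by total requested CPU, as an integer percentage.
  echo $(( total * 100 / ($# * request) ))
}

avg_cpu_utilization 200 180 190 170   # three pods on a 200m request: prints 90
avg_cpu_utilization 0 180 190 170     # no request set: prints unknown
```

The second call shows why a missing resources.requests entry leaves HPA with nothing to scale on.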

HPA with Custom Metrics

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa-custom
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"       # 100 requests/sec per pod

HPA Commands

# Create HPA imperatively (CPU-based)
kubectl autoscale deployment my-app --cpu-percent=70 --min=2 --max=10
 
# Apply HPA from YAML
kubectl apply -f hpa.yaml
 
# Check HPA status
kubectl get hpa
 
# Detailed HPA info
kubectl describe hpa my-app-hpa

HPA Scaling Formula

desiredReplicas = ceil[currentReplicas × (currentMetricValue / desiredMetricValue)]

Example: Current = 3 replicas, CPU utilization = 90%, target = 70%

desiredReplicas = ceil[3 × (90 / 70)] = ceil[3.86] = 4 replicas
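
The same calculation can be sketched in shell with integer arithmetic. The function name hpa_desired_replicas and the optional min/max arguments are assumptions made for this example. It only mirrors the formula and the minReplicas/maxReplicas clamp, not the full controller logic.

```shell
# Hypothetical helper mirroring the HPA formula with integer math.
# Usage: hpa_desired_replicas CURRENT_REPLICAS CURRENT_METRIC TARGET_METRIC [MIN] [MAX]
hpa_desired_replicas() {
  local current=$1 metric=$2 target=$3 min=${4:-1} max=${5:-1000000}
  # ceil(current * metric / target) without floating point
  local desired=$(( (current * metric + target - 1) / target ))
  # Clamp to the minReplicas/maxReplicas bounds
  if (( desired < min )); then desired=$min; fi
  if (( desired > max )); then desired=$max; fi
  echo "$desired"
}

hpa_desired_replicas 3 90 70         # the example above: prints 4
hpa_desired_replicas 8 140 70 2 10   # 16 wanted, capped at maxReplicas: prints 10
hpa_desired_replicas 4 10 70 2 10    # 1 wanted, raised to minReplicas: prints 2
```

The real controller also skips a change when the ratio of current to target is close to 1 (within 10% by default), which this sketch leaves out.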

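Key Insight: HPA reads CPU and memory usage from the Metrics Server, which GKE clusters run by default. Verify with kubectl top pods. Custom metrics such as http_requests_per_second additionally need a metrics adapter that exposes them through the custom metrics API.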

HPA Behavior Settings

Control how fast HPA scales up and down:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  minReplicas: 2
  maxReplicas: 10
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60    # Act on the lowest recommendation from the last 60s
      policies:
        - type: Percent
          value: 100                     # Add at most 100% of current replicas (double)
          periodSeconds: 60              # ...per 60s period
    scaleDown:
      stabilizationWindowSeconds: 300   # Act on the highest recommendation from the last 5 min
      policies:
        - type: Pods
          value: 1                       # Remove at most 1 pod
          periodSeconds: 120             # ...per 120s period
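
To get a feel for what these policies mean in practice, the following sketch estimates how many policy periods a full scale-up or scale-down would take. Both helper names are hypothetical, and the model is simplified: it assumes the metric keeps demanding the change and ignores the stabilization windows.

```shell
# Hypothetical helper: periods needed to grow from START to GOAL replicas
# when each period may add at most PERCENT% of the current replica count.
# Assumes START >= 1 and PERCENT >= 1.
scale_up_periods() {
  local replicas=$1 goal=$2 percent=$3 periods=0 step
  while (( replicas < goal )); do
    # Round the allowed increase up so small replica counts can still grow.
    step=$(( (replicas * percent + 99) / 100 ))
    replicas=$(( replicas + step ))
    periods=$(( periods + 1 ))
  done
  echo "$periods"
}

# Hypothetical helper: periods needed to shrink from START to GOAL replicas
# when each period removes at most PODS pods.
scale_down_periods() {
  local start=$1 goal=$2 pods=$3
  echo $(( (start - goal + pods - 1) / pods ))
}

scale_up_periods 2 10 100    # 2 -> 4 -> 8 -> 10 (capped): prints 3
scale_down_periods 10 2 1    # one pod per period: prints 8
```

With the settings above, going from 2 to 10 replicas takes roughly three 60-second periods, while returning from 10 to 2 takes eight 120-second periods (about 16 minutes) once the scale-down window allows it. That asymmetry is deliberate: scale up quickly, scale down cautiously.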

Vertical Pod Autoscaler (VPA)

VPA adjusts pod CPU and memory requests based on historical and current usage. Unlike HPA (which adds/removes pods), VPA makes pods bigger or smaller.

Warning: VPA in auto mode evicts and recreates pods to apply new resource settings. This causes temporary disruption. Use updateMode: "Off" to get recommendations without enforcement.

VPA YAML

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Auto"     # Options: Off, Initial, Recreate, Auto
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          cpu: "100m"
          memory: "128Mi"
        maxAllowed:
          cpu: "2"
          memory: "4Gi"

VPA Update Modes

Mode	Behavior	Use Case
`Off`	Only provides recommendations, no changes	Planning and analysis
`Initial`	Sets resources on pod creation only	Gradual adoption
`Recreate`	Evicts and recreates pods with new settings	Stateless workloads
`Auto`	Same as Recreate (currently)	Most workloads

Viewing VPA Recommendations

# Enable VPA in GKE
gcloud container clusters update my-cluster \
  --zone=us-central1-a \
  --enable-vertical-pod-autoscaling
 
# Check VPA recommendations
kubectl describe vpa my-app-vpa

Output includes:

  Recommendation:
    Container Recommendations:
      Container Name:  app
      Lower Bound:
        Cpu:     100m
        Memory:  128Mi
      Target:               # Recommended values
        Cpu:     250m
        Memory:  256Mi
      Uncapped Target:
        Cpu:     230m
        Memory:  245Mi
      Upper Bound:
        Cpu:     1
        Memory:  1Gi
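
To collect recommendations like these without any evictions, one option is a VPA in Off mode. The sketch below writes such a manifest to a file. The file name my-app-vpa-off.yaml is an arbitrary choice for this example, and the kubectl steps are shown as comments because they need a cluster with VPA enabled.

```shell
# Write a recommendation-only VPA manifest (file name is an arbitrary choice).
cat > my-app-vpa-off.yaml <<'EOF'
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Off"   # recommend only, never evict pods
EOF

# Against a cluster with VPA enabled, apply it and print only the targets:
#   kubectl apply -f my-app-vpa-off.yaml
#   kubectl get vpa my-app-vpa \
#     -o jsonpath='{.status.recommendation.containerRecommendations[*].target}'
```

Recommendations take some time to appear, since VPA needs usage history before it can suggest values.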

Cluster Autoscaler

Cluster Autoscaler adjusts the number of nodes in a node pool based on pod scheduling needs:

Scale up: When pods are pending due to insufficient resources
Scale down: When nodes are underutilized for a period

sequenceDiagram
    participant Deploy as Deployment
    participant Sched as Scheduler
    participant CA as Cluster Autoscaler
    participant NP as Node Pool

    Deploy->>Sched: Create 5 new pods
    Sched->>Sched: Not enough node capacity
    Note over Sched: Pods in Pending state
    Sched->>CA: Report pending pods
    CA->>NP: Add 2 nodes
    NP-->>Sched: New nodes ready
    Sched->>Deploy: Schedule pending pods

Enabling Cluster Autoscaler

# Enable on a Standard cluster (per node pool)
gcloud container clusters update my-cluster \
  --zone=us-central1-a \
  --enable-autoscaling \
  --node-pool=default-pool \
  --min-nodes=1 \
  --max-nodes=10
 
# Or during cluster creation
gcloud container clusters create my-cluster \
  --zone=us-central1-a \
  --enable-autoscaling \
  --min-nodes=1 \
  --max-nodes=10 \
  --num-nodes=3

Note: Autopilot clusters have built-in autoscaling. You don’t need to configure Cluster Autoscaler for Autopilot.

Cluster Autoscaler Configuration

Parameter	Purpose	Recommended
`--min-nodes`	Minimum nodes per zone	At least 1 for production
`--max-nodes`	Maximum nodes per zone	Set a budget-appropriate ceiling
`--total-min-nodes` / `--total-max-nodes`	Total node limits across all zones in the pool	Alternative to the per-zone flags; do not combine the two styles

Scale-Down Behavior

Cluster Autoscaler won’t scale down a node if:

A pod on the node has a PodDisruptionBudget that would be violated
A pod is not managed by a controller (standalone pod)
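A pod uses local storage such as an emptyDir volume (only on control planes older than GKE 1.22; newer versions no longer block scale-down for this)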
The node has the annotation "cluster-autoscaler.kubernetes.io/scale-down-disabled": "true"
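
As an illustration of the first and last conditions, here is a minimal sketch of a PodDisruptionBudget for the my-app Deployment, plus the node annotation as a commented command. The label app: my-app, the file name, and NODE_NAME are assumptions for this example, so adjust them to your workload.

```shell
# Write a PodDisruptionBudget manifest (file name is an arbitrary choice).
cat > my-app-pdb.yaml <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 1        # keep at least one pod running during evictions
  selector:
    matchLabels:
      app: my-app        # assumed label; must match the Deployment's pod labels
EOF

# Apply it against your cluster:
#   kubectl apply -f my-app-pdb.yaml
# Exclude one node from scale-down (NODE_NAME is a placeholder):
#   kubectl annotate node NODE_NAME cluster-autoscaler.kubernetes.io/scale-down-disabled=true
```

Keep minAvailable below the replica count. If it equals the number of running pods, no pod can ever be evicted and the node can never be drained.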

HPA vs VPA vs Cluster Autoscaler

Aspect	HPA	VPA	Cluster Autoscaler
What scales	Pod count	Pod size (CPU/memory)	Node count
Direction	Horizontal (more/fewer pods)	Vertical (bigger/smaller pods)	Infrastructure
Trigger	Current metrics vs target	Historical usage analysis	Pending pods / idle nodes
Disruption	None (add/remove pods)	Pod eviction (in Auto mode)	Pod eviction (node removal)
Best combined with	Cluster Autoscaler	HPA (not on same metric)	HPA

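Warning: Do not use HPA and VPA on the same metric (e.g., both on CPU). They conflict: when VPA raises a pod's CPU request, measured utilization drops and HPA removes replicas, which pushes per-pod usage back up and triggers VPA again, so the two controllers keep reacting to each other.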
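
One way to keep the two apart is to let HPA own CPU and restrict VPA to memory with controlledResources. The sketch below writes such a VPA manifest. The resource name my-app-vpa-memory and the file name are made up for this example, and it assumes an HPA that targets CPU only.

```shell
# Write a VPA manifest that manages memory requests only
# (file name is an arbitrary choice).
cat > my-app-vpa-memory.yaml <<'EOF'
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa-memory
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        controlledResources: ["memory"]   # CPU requests stay untouched for the HPA
EOF

# Apply it against a cluster with VPA enabled:
#   kubectl apply -f my-app-vpa-memory.yaml
```

Because CPU requests stay fixed, the CPU utilization that HPA measures is no longer shifted by VPA updates.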

Recommended Combinations

Workload Type	Scaling Strategy
Web applications	HPA (CPU) + Cluster Autoscaler
Batch processing	HPA (queue depth) + Spot node pools
Databases (StatefulSets)	VPA (right-sizing) + manual node scaling
Memory-heavy apps	HPA (memory metric) + Cluster Autoscaler
Unpredictable traffic	HPA (CPU) + VPA (Off mode for recommendations)

Useful Commands

Command	Purpose
`kubectl top pods`	View pod CPU/memory usage
`kubectl top nodes`	View node CPU/memory usage
`kubectl get hpa`	List HPA resources
`kubectl describe hpa NAME`	HPA details and scaling events
`kubectl get vpa`	List VPA resources
`kubectl describe vpa NAME`	VPA recommendations
`kubectl get events --field-selector reason=FailedScheduling`	Check pending pods
`gcloud container clusters describe NAME --zone ZONE`	Check autoscaler config

Common Pitfalls

Pitfall	Consequence	Fix
HPA without resource requests	HPA cannot calculate utilization	Always set `resources.requests` in pod specs
Missing Metrics Server	HPA shows `<unknown>` metrics	GKE includes it by default; verify with `kubectl top pods`
HPA + VPA on same metric	Conflicting scaling decisions	Use HPA for CPU, VPA for memory (or use VPA in Off mode)
No PodDisruptionBudgets	Cluster Autoscaler evicts too many pods	Define PDBs for critical workloads
Tight max-nodes limit	Pods stay Pending when limit is hit	Set `max-nodes` based on budget and peak demand
VPA auto mode on stateful apps	Database pods evicted mid-operation	Use `updateMode: "Off"` for stateful workloads
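Scale-down too aggressive	Nodes removed during temporary dips	Keep the default `balanced` autoscaling profile rather than `optimize-utilization`; the upstream `--scale-down-unneeded-time` flag (default 10 min) is not configurable on GKE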

TL;DR

Manual scaling — kubectl scale for pods, gcloud resize for nodes (good for predictable workloads)
HPA — Automatic pod count based on CPU/memory/custom metrics (most common autoscaler)
VPA — Adjusts pod resource requests based on usage (use Off mode for recommendations first)
Cluster Autoscaler — Adds/removes nodes based on pod demand (built into Autopilot)
Always set resource requests, because HPA utilization targets and pod scheduling both depend on them
Don’t run HPA and VPA on the same metric
Use PodDisruptionBudgets to protect workloads during scale-down

Lalit's Cloud & DevOps notes