Nodes are the Compute Engine VMs that run your Kubernetes workloads. In GKE, nodes are organized into node pools — groups of nodes with identical configuration. Understanding node pools is essential for optimizing cost, performance, and reliability.

Node Architecture

flowchart TB
    subgraph Node["GKE Node (Compute Engine VM)"]
        KUBELET["kubelet"]
        KUBEPROXY["kube-proxy"]
        CRI["Container Runtime (containerd)"]
        OS["Node OS (COS / Ubuntu)"]

        subgraph Pods["Running Pods"]
            P1["Pod 1"]
            P2["Pod 2"]
            P3["Pod 3"]
        end
    end

    CP["Control Plane"] --> KUBELET
    KUBELET --> CRI
    CRI --> Pods
    KUBEPROXY --> Pods

Each node runs several critical components:

| Component | Role |
|---|---|
| kubelet | Agent that manages pod lifecycle on the node and reports status to the control plane |
| kube-proxy | Maintains network rules for Service routing on the node |
| Container runtime | Runs containers (containerd in GKE) |
| Node OS | Container-Optimized OS (COS) or Ubuntu, kept up to date by GKE node auto-upgrade |

Node Pools

A node pool is a group of nodes within a cluster that share the same configuration:

  • Machine type (CPU, memory)
  • OS image
  • Labels and taints
  • Disk type and size
  • GPU attachments
  • Network tags

flowchart TB
    subgraph Cluster["GKE Cluster"]
        subgraph NP1["Default Pool (e2-medium)"]
            N1["Node 1"]
            N2["Node 2"]
            N3["Node 3"]
        end

        subgraph NP2["GPU Pool (n1-standard-4 + T4)"]
            N4["Node 4"]
            N5["Node 5"]
        end

        subgraph NP3["Spot Pool (e2-medium, Spot)"]
            N6["Node 6"]
            N7["Node 7"]
        end
    end

Key Insight: Standard clusters create a first node pool, commonly called the default node pool, unless you explicitly remove or skip it. You can add more pools for different workload types. In Autopilot, Google manages nodes for you — you define workload resource requests instead of managing node pools.
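As a rough illustration of what defining workload resource requests looks like, the Deployment below sets CPU and memory requests on its container. The name web, the image, and the request sizes are placeholders chosen for this sketch, not values from a real cluster.

```yaml
# Hypothetical Deployment: name, image, and sizes are illustrative assumptions
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: nginx:1.27          # placeholder image
          resources:
            requests:
              cpu: 500m              # half a vCPU per replica
              memory: 512Mi
```

In Autopilot, requests like these determine how much capacity is provisioned for the pods. In Standard, the scheduler uses the same requests to place pods on the nodes in your existing pools.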

Default Node Pool vs Additional Pools

| Aspect | Default Pool | Additional Pools |
|---|---|---|
| Created with cluster | Yes | No, added separately |
| Can be deleted | Yes (but cluster needs at least one pool) | Yes |
| Configuration | Set during cluster creation | Independent configuration |
| Workload targeting | Generic workloads | Use taints/tolerations or node selectors |

Managing Node Pools

Creating Node Pools

# Add a node pool to an existing Standard cluster
gcloud container node-pools create gpu-pool \
  --cluster=my-cluster \
  --zone=us-central1-a \
  --machine-type=n1-standard-4 \
  --accelerator=type=nvidia-tesla-t4,count=1 \
  --num-nodes=2 \
  --spot \
  --enable-autoupgrade \
  --enable-autorepair \
  --node-labels=workload=gpu,gpu-type=t4 \
  --node-taints=nvidia.com/gpu=present:NoSchedule

Key Node Pool Flags

| Flag | Purpose | Example |
|---|---|---|
| --machine-type | VM type for nodes | e2-medium, n1-standard-4 |
| --num-nodes | Initial node count per zone | 3 |
| --disk-type | Boot disk type | pd-ssd, pd-balanced, pd-standard |
| --disk-size | Boot disk size | 100GB |
| --image-type | Node OS image | COS_CONTAINERD, UBUNTU_CONTAINERD |
| --spot | Use Spot VMs (cheaper, can be reclaimed at any time) | - |
| --preemptible | Use Preemptible VMs (legacy, 24-hour maximum lifetime) | - |
| --accelerator | Attach GPUs | type=nvidia-tesla-t4,count=1 |
| --node-labels | Labels for scheduling | workload=gpu |
| --node-taints | Taints to repel non-matching pods | nvidia.com/gpu=present:NoSchedule |
| --enable-autoupgrade | Auto-upgrade node OS and Kubernetes version | - |
| --enable-autorepair | Auto-replace unhealthy nodes | - |
| --max-pods-per-node | Limit pods per node | 110 (default) |
| --tags | Network tags for firewall rules | backend,ssh-allowed |

Listing and Inspecting Node Pools

# List all node pools in a cluster
gcloud container node-pools list --cluster=my-cluster --zone=us-central1-a
 
# Describe a specific node pool
gcloud container node-pools describe gpu-pool --cluster=my-cluster --zone=us-central1-a
 
# View nodes with their pool membership
kubectl get nodes -o wide
 
# View nodes in a specific pool
kubectl get nodes -l cloud.google.com/gke-nodepool=gpu-pool

Resizing and Deleting Node Pools

# Resize a node pool
gcloud container clusters resize my-cluster \
  --node-pool=gpu-pool \
  --zone=us-central1-a \
  --num-nodes=5
 
# Delete a node pool (pods will be evicted)
gcloud container node-pools delete gpu-pool \
  --cluster=my-cluster \
  --zone=us-central1-a

Warning: Deleting a node pool evicts all pods running on those nodes. Ensure pods can be rescheduled elsewhere before deleting a pool.

Machine Type Selection

Common machine types for GKE nodes:

| Machine Type | vCPUs | Memory | Use Case |
|---|---|---|---|
| e2-medium | 2 | 4 GB | Development, small workloads |
| e2-standard-4 | 4 | 16 GB | General-purpose production |
| e2-standard-8 | 8 | 32 GB | Medium production workloads |
| e2-highmem-4 | 4 | 32 GB | Memory-intensive applications |
| e2-highcpu-8 | 8 | 8 GB | CPU-intensive batch processing |
| n1-standard-4 | 4 | 15 GB | GPU-attached workloads |
| n2d-standard-16 | 16 | 64 GB | AMD-based compute workloads |
| t2a-standard-4 | 4 | 16 GB | Arm-based workloads (cost-effective) |

Tip: Use E2 machine types for most general-purpose workloads. For accelerator workloads, choose a machine series and zone that support the GPU or TPU you need. Use T2A (Arm) for further cost savings if your containers support Arm.

Scheduling Workloads to Specific Node Pools

Node Selectors

The simplest way to target pods to specific nodes:

spec:
  nodeSelector:
    workload: gpu        # matches --node-labels=workload=gpu

Taints and Tolerations

Taints repel pods that don’t tolerate them. This is how you reserve a GPU pool for GPU workloads:

# Node pool created with: --node-taints=nvidia.com/gpu=present:NoSchedule
 
# Pod that tolerates the taint (can be scheduled on GPU nodes)
spec:
  tolerations:
    - key: nvidia.com/gpu
      operator: Equal
      value: "present"
      effect: NoSchedule

Taint Effects

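| Effect | Behavior |
|---|---|
| NoSchedule | New pods are not scheduled on the node unless they have a matching toleration; pods already running stay |
| PreferNoSchedule | Scheduler tries to avoid the node, but will use it if needed |
| NoExecute | New pods without a matching toleration are not scheduled, and running pods without one are evicted |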
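To make the NoExecute behavior concrete, here is a minimal sketch of a pod that tolerates a NoExecute taint for a limited time. The taint key maintenance, its value, the pod name, and the image are hypothetical and chosen only for illustration; tolerationSeconds is the standard Kubernetes field that bounds how long the pod may stay once the taint is applied.

```yaml
# Hypothetical pod: the "maintenance" taint key and all names are assumptions
apiVersion: v1
kind: Pod
metadata:
  name: drains-slowly
spec:
  tolerations:
    - key: maintenance
      operator: Equal
      value: "true"
      effect: NoExecute
      tolerationSeconds: 300     # evicted 5 minutes after the taint appears
  containers:
    - name: app
      image: nginx:1.27          # placeholder image
```

Without tolerationSeconds, a pod with a matching NoExecute toleration stays on the node indefinitely.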

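Key Insight: Taints repel pods; node selectors and node affinity attract them. A taint with a matching toleration keeps other workloads off an exclusive pool (GPU, Spot), but a toleration alone does not force a pod onto that pool. A nodeSelector is a hard requirement, not a preference. Use both together for strict targeting.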
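A minimal sketch of strict targeting, combining the label and taint used for gpu-pool earlier (workload=gpu and nvidia.com/gpu=present:NoSchedule). The pod name and image path are hypothetical placeholders; the nvidia.com/gpu resource limit assumes the GPU drivers and device plugin are available on the nodes.

```yaml
# Hypothetical GPU pod: name and image are assumptions for illustration
apiVersion: v1
kind: Pod
metadata:
  name: gpu-job
spec:
  nodeSelector:
    workload: gpu                # attract: only nodes labeled workload=gpu
  tolerations:
    - key: nvidia.com/gpu        # permit: tolerate the GPU pool's taint
      operator: Equal
      value: "present"
      effect: NoSchedule
  containers:
    - name: trainer
      image: us-docker.pkg.dev/my-project/my-repo/trainer:latest   # placeholder
      resources:
        limits:
          nvidia.com/gpu: 1      # request one GPU from the node
```

The nodeSelector keeps this pod off non-GPU nodes, and the taint keeps pods without the toleration off the GPU nodes, so the pool is reserved in both directions.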

Node Affinity (Advanced)

For more complex scheduling rules:

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: cloud.google.com/gke-nodepool
                operator: In
                values:
                  - gpu-pool
                  - high-mem-pool
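Node affinity can also express a soft preference, which a nodeSelector cannot. The sketch below uses preferredDuringSchedulingIgnoredDuringExecution so the scheduler favors high-mem-pool but falls back to other nodes if that pool is full. The pod name, image, and weight are illustrative assumptions.

```yaml
# Hypothetical pod: name, image, and weight are assumptions for illustration
apiVersion: v1
kind: Pod
metadata:
  name: prefers-high-mem
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 80             # 1-100; higher weights count more in scoring
          preference:
            matchExpressions:
              - key: cloud.google.com/gke-nodepool
                operator: In
                values:
                  - high-mem-pool
  containers:
    - name: app
      image: nginx:1.27          # placeholder image
```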

Spot and Preemptible VMs

Reduce costs by using short-lived VMs for fault-tolerant workloads:

| Feature | Spot VMs | Preemptible VMs |
|---|---|---|
| Max lifetime | No fixed limit (until reclaimed) | 24 hours max |
| Reclaim notice | 30-second warning via metadata | 30-second warning via metadata |
| Availability | Subject to capacity | Subject to capacity |
| Pricing discount | Up to 91% off on-demand | Up to 91% off on-demand |
| Recommended | Yes (newer, more flexible) | No (legacy, use Spot instead) |

# Create a Spot node pool
# GKE labels Spot nodes with cloud.google.com/gke-spot=true automatically,
# so no --node-labels flag is needed for that label
gcloud container node-pools create spot-pool \
  --cluster=my-cluster \
  --zone=us-central1-a \
  --machine-type=e2-medium \
  --num-nodes=3 \
  --spot

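Warning: Spot nodes can be reclaimed at any time. Only run fault-tolerant, interruptible workloads on them (batch jobs, CI/CD, stateless workers). A PodDisruptionBudget limits voluntary disruptions such as node upgrades and drains, but it does not prevent Spot reclamation, so design pods to checkpoint, retry, and shut down quickly.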
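One way to place an interruptible workload on Spot nodes is sketched below, assuming the nodes carry the cloud.google.com/gke-spot=true label. The Deployment name, image path, replica count, and grace period are hypothetical; the grace period is simply chosen to fit inside the 30-second reclaim notice and should be checked against your cluster's behavior.

```yaml
# Hypothetical Spot worker: names, image, and timings are assumptions
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-worker
spec:
  replicas: 4
  selector:
    matchLabels:
      app: batch-worker
  template:
    metadata:
      labels:
        app: batch-worker
    spec:
      nodeSelector:
        cloud.google.com/gke-spot: "true"    # run only on Spot nodes
      terminationGracePeriodSeconds: 15      # finish shutdown within the notice
      containers:
        - name: worker
          image: us-docker.pkg.dev/my-project/my-repo/worker:latest   # placeholder
          resources:
            requests:
              cpu: 250m
              memory: 256Mi
```

Because replicas can disappear at any moment, the worker itself should pick up unfinished items on restart rather than rely on a clean exit.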

Node Best Practices

| Practice | Why | How |
|---|---|---|
| Use E2 machine types | Best price-performance for most workloads | --machine-type=e2-standard-4 |
| Separate workloads by pool | Different hardware needs, cost optimization | Create dedicated pools with taints |
| Use Spot pools for batch workloads | Up to 91% cost savings | --spot flag + fault-tolerant pods |
| Enable auto-upgrade | Security patches without manual intervention | --enable-autoupgrade |
| Enable auto-repair | Unhealthy nodes replaced automatically | --enable-autorepair |
| Set resource requests/limits | Ensures fair scheduling and prevents noisy neighbors | Define in pod specs |
| Use PodDisruptionBudgets | Protect critical workloads during node drains | Define minAvailable or maxUnavailable |
| Monitor node resource usage | Detect under/over-provisioning | kubectl top nodes + Cloud Monitoring |
| Use Shielded GKE nodes | Secure boot and integrity monitoring | --enable-shielded-nodes (default) |
| Use containerd OS images | Required for modern GKE features | COS_CONTAINERD or UBUNTU_CONTAINERD |
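For the PodDisruptionBudget practice in the table, a minimal sketch looks like the following. The name web-pdb, the app: web label, and the minAvailable value are hypothetical; the selector has to match the labels on your own pods.

```yaml
# Hypothetical PDB: name, label, and threshold are assumptions for illustration
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2                # keep at least 2 matching pods running
  selector:
    matchLabels:
      app: web
```

With this in place, a node drain during an upgrade waits rather than evicting a pod that would drop the workload below two available replicas. Set either minAvailable or maxUnavailable, not both.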

Common Pitfalls

| Pitfall | Consequence | Fix |
|---|---|---|
| All workloads on default pool | Cannot scale or cost-optimize independently | Create separate pools per workload type |
| No taints on GPU pool | Non-GPU pods scheduled on expensive GPU nodes | Use taints + tolerations |
| Over-provisioned nodes | Paying for idle resources | Right-size based on kubectl top nodes |
| No Spot fallback | Batch jobs fail when Spot VMs are reclaimed | Design for interruption (retries, checkpoints) and keep an on-demand pool as fallback |
| Ignoring auto-repair | Unhealthy nodes stay in cluster indefinitely | Enable auto-repair on all pools |
| Using Preemptible instead of Spot | Fixed 24-hour termination, less flexible | Use Spot VMs (--spot) for new pools |

TL;DR

  • Nodes are Compute Engine VMs; node pools group nodes with identical configuration
  • Use multiple node pools to separate workloads by hardware need (GPU, Spot, high-memory)
  • Target workloads to pools with taints + tolerations (keep other pods out) and node selectors or node affinity (place pods in)
  • Use Spot VMs for fault-tolerant workloads to save up to 91%
  • Always enable auto-upgrade and auto-repair on node pools
  • In Autopilot, Google manages nodes — you define pod resource requests instead of node pools

Resources