GKE Support Overview

CloudPilot AI includes a GKE integration path that covers the lifecycle from Phase 1 registration to Phase 2 optimization workflows. The current implementation spans the CloudPilot API server, the cluster agent, the console flows, and the GKE-specific onboarding scripts.

What is currently covered

The current GKE implementation includes the following building blocks:

  • Phase 1 cluster registration through the standard Add Cluster flow in the CloudPilot AI console.
  • Phase 2 optimization installation through the GKE-specific Start Saving script.
  • Restore / upgrade / uninstall lifecycle scripts for existing GKE clusters.
  • NodePool schedule handling and rebalance configuration upload through the GCP runtime paths in cloudpilot-agent.
  • Console support for GCP NodePool / NodeClass management, cluster removal, and lifecycle entrypoints.
  • Workload Autoscaler support on GKE, with an important caveat for GKE-managed kube-state-metrics; see Workload Autoscaler Installation Troubleshooting.

Key GKE concepts

| Variable | Meaning | Example |
| --- | --- | --- |
| GCP_PROJECT_ID | The Google Cloud project that owns the cluster and CloudPilot-managed GCP resources. | my-prod-project |
| CLUSTER_NAME | The GKE cluster name. | prod-gke |
| CLUSTER_REGION | The broad region used for pricing, discovery, and script wiring. | us-central1 |
| CLUSTER_LOCATION | The exact GKE location. Use a zone for zonal clusters and a region for regional clusters. | us-central1-a or us-central1 |
| CLUSTER_ID | The CloudPilot AI cluster ID assigned after Phase 1 registration. | gcp-xxxx |

For GKE, CLUSTER_LOCATION is not interchangeable with CLUSTER_REGION. Use us-central1-a for a zonal cluster and us-central1 for a regional cluster. When the cluster name is ambiguous across multiple locations, export CLUSTER_LOCATION explicitly before running lifecycle scripts.
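
The relationship between the two variables can be sketched in a few lines of Python. This is an illustration only, not part of the CloudPilot scripts: the helper names (region_of, is_zonal, resolve_location) are hypothetical, and the cluster records are assumed to carry "name" and "location" keys, as in the JSON printed by `gcloud container clusters list --format=json`.

```python
import re

# GCP zone names are a region name plus a one-letter suffix (us-central1-a).
_REGION_RE = re.compile(r"[a-z]+-[a-z]+\d+")
_ZONE_RE = re.compile(r"([a-z]+-[a-z]+\d+)-[a-z]")


def is_zonal(location):
    """Return True when the location names a zone rather than a region."""
    return _ZONE_RE.fullmatch(location) is not None


def region_of(location):
    """Derive the CLUSTER_REGION-style value from an exact location."""
    zone = _ZONE_RE.fullmatch(location)
    if zone:
        return zone.group(1)
    if _REGION_RE.fullmatch(location):
        return location
    raise ValueError(f"not a GCP region or zone: {location!r}")


def resolve_location(cluster_name, clusters, explicit_location=None):
    """Pick the exact location for a cluster, refusing to guess when ambiguous.

    `clusters` is a list of dicts with "name" and "location" keys
    (an assumed shape, see the note above).
    """
    if explicit_location:
        return explicit_location
    matches = sorted({c["location"] for c in clusters if c["name"] == cluster_name})
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise LookupError(f"no cluster named {cluster_name!r}")
    raise LookupError(
        f"{cluster_name!r} exists in {matches}; set CLUSTER_LOCATION explicitly"
    )
```

The derivation only works in one direction: a region can be computed from a zone, but a zone cannot be recovered from a region, which is why the exact location has to be supplied when the name alone is ambiguous.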

How the GKE lifecycle fits together

Phase 1: Add Cluster

The console-generated GKE Add Cluster script installs the standard Phase 1 agent manifest and sets CLOUD_PROVIDER=gcp. The GKE registration path relies on cluster metadata to discover the cluster name and related context instead of requiring you to inject CLUSTER_NAME manually into the Phase 1 script.
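
The page does not say which metadata source the agent reads. As one possible illustration, GKE nodes expose cluster-name and cluster-location as instance attributes on the Google Cloud metadata server, and a process running on a node could query them as sketched below. The function names are hypothetical, and this is an assumption about a plausible mechanism, not a description of the cloudpilot-agent code.

```python
import urllib.request

# Base URL of the Google Cloud metadata server, reachable only from inside GCP.
METADATA_BASE = "http://metadata.google.internal/computeMetadata/v1"


def metadata_request(attribute):
    """Build the request for one instance attribute, e.g. "cluster-name"."""
    return urllib.request.Request(
        f"{METADATA_BASE}/instance/attributes/{attribute}",
        # The metadata server rejects requests that lack this header.
        headers={"Metadata-Flavor": "Google"},
    )


def read_attribute(attribute, timeout=2.0):
    """Fetch one attribute value; only works when run on a GCP instance."""
    with urllib.request.urlopen(metadata_request(attribute), timeout=timeout) as resp:
        return resp.read().decode("utf-8").strip()
```

On a GKE node, read_attribute("cluster-name") and read_attribute("cluster-location") would return the values that otherwise have to be passed in by hand.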

Phase 2: Start Saving

The GKE Start Saving flow runs the GKE-specific Phase 2 install script. This script validates:

  • the required CLI tools: kubectl, helm, jq, curl, and gcloud
  • the current Kubernetes context
  • the exact cluster location
  • GKE Workload Identity
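
A preflight check of this kind can be sketched in Python as follows. This is not the actual install script: the function names are hypothetical, and the Workload Identity check assumes the cluster description is the parsed JSON from `gcloud container clusters describe --format=json`, where an enabled cluster reports a workloadIdentityConfig.workloadPool of the form PROJECT_ID.svc.id.goog.

```python
import shutil

# Tools named in the validation list above.
REQUIRED_TOOLS = ("kubectl", "helm", "jq", "curl", "gcloud")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH.

    `which` is injectable so the check can be exercised without the tools.
    """
    return [tool for tool in tools if which(tool) is None]


def workload_identity_enabled(cluster_description, project_id):
    """Check that the cluster uses the project's Workload Identity pool."""
    config = cluster_description.get("workloadIdentityConfig") or {}
    return config.get("workloadPool") == f"{project_id}.svc.id.goog"
```

Running both checks before any install step means a missing tool or a cluster without Workload Identity is reported up front, rather than partway through the installation.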

It then installs the CloudPilot base components and the GCP optimizer stack, while:

  • creating or reusing the controller Google service account (GSA)
  • binding the cloudpilot/cloudpilot-admin Kubernetes service account to that GSA through Workload Identity
  • reconciling the least-privilege custom IAM role used by the controller
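
For orientation, the Workload Identity binding in the list above generally comes down to three strings: the GSA email, the IAM member that represents the Kubernetes service account, and an annotation on that service account. The sketch below builds them. It assumes that cloudpilot/cloudpilot-admin means namespace cloudpilot and service account cloudpilot-admin, and the GSA name cloudpilot-controller is a hypothetical placeholder. The real script may choose different names.

```python
def gsa_email(project_id, gsa_name):
    """Email address of a Google service account in the given project."""
    return f"{gsa_name}@{project_id}.iam.gserviceaccount.com"


def workload_identity_member(project_id, namespace, ksa_name):
    """IAM member string that identifies a Kubernetes service account."""
    return f"serviceAccount:{project_id}.svc.id.goog[{namespace}/{ksa_name}]"


def binding_command(project_id, gsa_name, namespace, ksa_name):
    """gcloud arguments that let the KSA impersonate the GSA."""
    return [
        "gcloud", "iam", "service-accounts", "add-iam-policy-binding",
        gsa_email(project_id, gsa_name),
        "--project", project_id,
        "--role", "roles/iam.workloadIdentityUser",
        "--member", workload_identity_member(project_id, namespace, ksa_name),
    ]


def ksa_annotation(project_id, gsa_name):
    """Annotation that points the KSA at its bound GSA."""
    return {"iam.gke.io/gcp-service-account": gsa_email(project_id, gsa_name)}


# Hypothetical values, for illustration only.
cmd = binding_command("my-prod-project", "cloudpilot-controller",
                      "cloudpilot", "cloudpilot-admin")
```

Because the binding is expressed as an IAM policy on the GSA, no JSON key is created or stored in the cluster.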

Day-2 operations

Once the cluster is managed by CloudPilot AI, the console uses GKE-specific scripts for:

  • restoring original capacity
  • upgrading CloudPilot AI components
  • uninstalling CloudPilot AI safely

See GKE Getting Started and GKE Day-2 Operations for the operational steps.

GKE-specific operational notes

  • Workload Identity is required for Phase 2. The default flow uses the cloudpilot-admin Kubernetes service account plus a bound Google service account instead of a downloaded JSON key.
  • The exact location matters everywhere. Restore, upgrade, and uninstall all rely on the exact CLUSTER_LOCATION value.
  • Private nodes still need outbound access. If the source node pool resolves to private nodes, keep Cloud NAT or equivalent outbound image-pull access available.
  • CloudPilot uninstall is scoped. The GKE uninstall flow targets CloudPilot-managed NodeClaims, NodePools, and GCENodeClasses instead of deleting all Karpenter resources in the cluster.
  • GKE-managed kube-state-metrics can block Workload Autoscaler recommendations. If Workload Autoscaler stays in Optimization: Recommending, see Workload Autoscaler Installation Troubleshooting.
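
The scoped uninstall noted above amounts to selecting a subset of resources before deleting anything. The sketch below shows one way such a selection could work on resource objects parsed from `kubectl get ... -o json`. How CloudPilot actually marks its own resources is not stated here, so the label key and value are caller-supplied, and the one used in the example (example.com/managed-by) is purely hypothetical.

```python
# Resource kinds the uninstall flow is described as targeting.
MANAGED_KINDS = {"NodeClaim", "NodePool", "GCENodeClass"}


def select_managed(resources, label_key, label_value):
    """Return only resources of a targeted kind that carry the given label."""
    selected = []
    for resource in resources:
        labels = resource.get("metadata", {}).get("labels") or {}
        if resource.get("kind") in MANAGED_KINDS and labels.get(label_key) == label_value:
            selected.append(resource)
    return selected


# Example with a hypothetical ownership label.
resources = [
    {"kind": "NodePool", "metadata": {"name": "cp-pool",
                                      "labels": {"example.com/managed-by": "cloudpilot"}}},
    {"kind": "NodePool", "metadata": {"name": "team-pool", "labels": {}}},
    {"kind": "ConfigMap", "metadata": {"name": "settings",
                                       "labels": {"example.com/managed-by": "cloudpilot"}}},
]
to_delete = select_managed(resources, "example.com/managed-by", "cloudpilot")
```

Here only cp-pool is selected: team-pool lacks the label, and the ConfigMap is not one of the targeted kinds, so Karpenter resources created by other teams stay in place.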