GKE Support Overview

CloudPilot AI includes a GKE integration path that covers the lifecycle from Phase 1 registration to Phase 2 optimization workflows. The current implementation spans the CloudPilot API server, the cluster agent, the console flows, and the GKE-specific onboarding scripts.

What is currently covered

The current GKE implementation includes the following building blocks:

Phase 1 cluster registration through the standard Add Cluster flow in the CloudPilot AI console.
Phase 2 optimization installation through the GKE-specific Start Saving script.
Restore / upgrade / uninstall lifecycle scripts for existing GKE clusters.
NodePool schedule handling and rebalance configuration upload through the GCP runtime paths in cloudpilot-agent.
Console support for GCP NodePool / NodeClass management, cluster removal, and lifecycle entrypoints.
Workload Autoscaler support on GKE, with an important caveat for GKE-managed kube-state-metrics; see Workload Autoscaler Installation Troubleshooting.

Key GKE concepts

Variable	Meaning	Example
`GCP_PROJECT_ID`	The Google Cloud project that owns the cluster and CloudPilot-managed GCP resources.	`my-prod-project`
`CLUSTER_NAME`	The GKE cluster name.	`prod-gke`
`CLUSTER_REGION`	The GCP region that contains the cluster, used for pricing, discovery, and script wiring.	`us-central1`
`CLUSTER_LOCATION`	The exact GKE location. Use a zone for zonal clusters and a region for regional clusters.	`us-central1-a` or `us-central1`
`CLUSTER_ID`	The CloudPilot AI cluster ID assigned after Phase 1 registration.	`gcp-xxxx`

For GKE, CLUSTER_LOCATION is not interchangeable with CLUSTER_REGION. A zonal cluster takes a zone, such as us-central1-a, and a regional cluster takes a region, such as us-central1. If the same cluster name exists in more than one location, export CLUSTER_LOCATION explicitly before running lifecycle scripts.
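
The distinction between the two variables can be illustrated with a minimal Python sketch. It is not part of the CloudPilot AI scripts. The helper names `classify_location` and `resolve_location` are hypothetical, and the patterns assume the usual GCP naming shape, where a zone is a region name followed by a single-letter suffix.

```python
import re
from typing import Iterable, Optional, Tuple

# Assumed naming shape: a region looks like "us-central1",
# and a zone is a region plus "-<letter>", e.g. "us-central1-a".
_REGION_RE = re.compile(r"^[a-z]+-[a-z]+\d+$")
_ZONE_RE = re.compile(r"^(?P<region>[a-z]+-[a-z]+\d+)-[a-z]$")


def classify_location(location: str) -> Tuple[str, str]:
    """Return (kind, region), where kind is "zonal" or "regional"."""
    zone_match = _ZONE_RE.match(location)
    if zone_match:
        return "zonal", zone_match.group("region")
    if _REGION_RE.match(location):
        return "regional", location
    raise ValueError(f"not a recognizable GCP zone or region: {location!r}")


def resolve_location(
    cluster_name: str,
    clusters: Iterable[Tuple[str, str]],
    explicit_location: Optional[str] = None,
) -> str:
    """Pick the location for cluster_name from (name, location) pairs.

    An explicit location always wins. Without one, the name must match
    exactly one location, otherwise the lookup is ambiguous.
    """
    matches = sorted({loc for name, loc in clusters if name == cluster_name})
    if explicit_location is not None:
        if explicit_location not in matches:
            raise LookupError(
                f"{cluster_name!r} not found in {explicit_location!r}"
            )
        return explicit_location
    if not matches:
        raise LookupError(f"no cluster named {cluster_name!r}")
    if len(matches) > 1:
        raise LookupError(
            f"{cluster_name!r} exists in {matches}; set CLUSTER_LOCATION"
        )
    return matches[0]
```

With a zonal cluster, `classify_location("us-central1-a")` yields the region `us-central1`, which is the value CLUSTER_REGION would carry, while CLUSTER_LOCATION keeps the full zone.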

How the GKE lifecycle fits together

Phase 1: Add Cluster

The console-generated GKE Add Cluster script installs the standard Phase 1 agent manifest and sets CLOUD_PROVIDER=gcp. The GKE registration path relies on cluster metadata to discover the cluster name and related context instead of requiring you to inject CLUSTER_NAME manually into the Phase 1 script.

Phase 2: Start Saving

The GKE Start Saving flow runs the GKE-specific Phase 2 install script. This script validates:

the required CLI tools: kubectl, helm, jq, curl, and gcloud
the current Kubernetes context
the exact cluster location
GKE Workload Identity
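
As a rough illustration of the first three checks, the following Python sketch looks for the required tools on PATH and compares the current context name with the one expected for the cluster. It is not the actual install script, and the function names are hypothetical. The context check assumes the `gke_<project>_<location>_<cluster>` naming that `gcloud container clusters get-credentials` produces; a renamed context would fail this check even if it points at the right cluster.

```python
import shutil
from typing import Callable, List, Optional, Sequence

REQUIRED_TOOLS = ("kubectl", "helm", "jq", "curl", "gcloud")


def missing_tools(
    tools: Sequence[str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the tools that cannot be found on PATH, in the order given."""
    return [tool for tool in tools if which(tool) is None]


def expected_gke_context(project_id: str, location: str, cluster_name: str) -> str:
    """Context name written by `gcloud container clusters get-credentials`."""
    return f"gke_{project_id}_{location}_{cluster_name}"


def context_problems(
    current_context: str, project_id: str, location: str, cluster_name: str
) -> List[str]:
    """Return a list of human-readable problems; empty means the context matches."""
    expected = expected_gke_context(project_id, location, cluster_name)
    if current_context == expected:
        return []
    return [f"current context is {current_context!r}, expected {expected!r}"]
```

The current context can be read with `kubectl config current-context` and passed in as `current_context`. Because the location is part of the context name, a zonal cluster checked against its region fails here, which mirrors why the exact location matters.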

It then installs the CloudPilot base components and the GCP optimizer stack, while:

creating or reusing the controller Google service account (GSA)
binding the cloudpilot/cloudpilot-admin Kubernetes service account to that GSA through Workload Identity
reconciling the least-privilege custom IAM role used by the controller
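
The Workload Identity binding in the second step can be sketched as follows. This is an illustration, not the commands the install script actually runs. It assumes that cloudpilot/cloudpilot-admin means namespace `cloudpilot` and service account `cloudpilot-admin`, and the GSA name passed in is a placeholder. The member format, the `roles/iam.workloadIdentityUser` role, and the `iam.gke.io/gcp-service-account` annotation are the standard GKE Workload Identity pieces.

```python
from typing import List

NAMESPACE = "cloudpilot"        # assumed from "cloudpilot/cloudpilot-admin"
KSA_NAME = "cloudpilot-admin"   # Kubernetes service account


def gsa_email(project_id: str, gsa_name: str) -> str:
    """Email address of a Google service account in the given project."""
    return f"{gsa_name}@{project_id}.iam.gserviceaccount.com"


def workload_identity_member(
    project_id: str, namespace: str = NAMESPACE, ksa_name: str = KSA_NAME
) -> str:
    """IAM member string that identifies a Kubernetes service account."""
    return f"serviceAccount:{project_id}.svc.id.goog[{namespace}/{ksa_name}]"


def binding_commands(project_id: str, gsa_name: str) -> List[List[str]]:
    """Build (but do not run) the two commands that link the KSA and the GSA."""
    email = gsa_email(project_id, gsa_name)
    return [
        # Allow the KSA to impersonate the GSA.
        [
            "gcloud", "iam", "service-accounts", "add-iam-policy-binding", email,
            "--project", project_id,
            "--role", "roles/iam.workloadIdentityUser",
            "--member", workload_identity_member(project_id),
        ],
        # Point the KSA at the GSA it should act as.
        [
            "kubectl", "annotate", "serviceaccount", KSA_NAME,
            "--namespace", NAMESPACE,
            f"iam.gke.io/gcp-service-account={email}",
            "--overwrite",
        ],
    ]
```

Calling `binding_commands("my-prod-project", "example-controller")` returns the argument lists without executing anything, so they can be inspected before being run by hand.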

Day-2 operations

Once the cluster is managed by CloudPilot AI, the console uses GKE-specific scripts for:

restoring original capacity
upgrading CloudPilot AI components
uninstalling CloudPilot AI safely

See GKE Getting Started and GKE Day-2 Operations for the operational steps.

GKE-specific operational notes

Workload Identity is required for Phase 2. The default flow uses the cloudpilot-admin Kubernetes service account plus a bound Google service account instead of a downloaded JSON key.
The exact location matters everywhere. Restore, upgrade, and uninstall all rely on the exact CLUSTER_LOCATION value.
Private nodes still need outbound access. If the source node pool resolves to private nodes, keep Cloud NAT or an equivalent outbound path available so that nodes can still pull container images.
CloudPilot uninstall is scoped. The GKE uninstall flow targets CloudPilot-managed NodeClaims, NodePools, and GCENodeClasses instead of deleting all Karpenter resources in the cluster.
GKE-managed kube-state-metrics can block Workload Autoscaler recommendations. If Workload Autoscaler stays in Optimization: Recommending, see Workload Autoscaler Installation Troubleshooting.
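
The scoped-uninstall note above can be pictured as a filter over cluster resources. The sketch below is purely illustrative: this page does not say how the uninstall script recognizes CloudPilot-managed resources, so the label key and value used here are hypothetical stand-ins for whatever marker the script relies on.

```python
from typing import Any, Dict, List

MANAGED_KINDS = {"NodeClaim", "NodePool", "GCENodeClass"}

# Hypothetical marker; the real script's selection criteria are not documented here.
MANAGED_LABEL_KEY = "app.kubernetes.io/managed-by"
MANAGED_LABEL_VALUE = "cloudpilot"


def select_for_uninstall(resources: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only resources of a managed kind that carry the managed label."""
    selected = []
    for resource in resources:
        labels = (resource.get("metadata") or {}).get("labels") or {}
        if (
            resource.get("kind") in MANAGED_KINDS
            and labels.get(MANAGED_LABEL_KEY) == MANAGED_LABEL_VALUE
        ):
            selected.append(resource)
    return selected
```

A NodePool without the marker, for example one created by a separate Karpenter installation, is left out of the result, which is the behavior the scoped uninstall describes.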