GKE Support Overview
CloudPilot AI includes a GKE integration path that covers the lifecycle from Phase 1 registration to Phase 2 optimization workflows. The current implementation spans the CloudPilot API server, the cluster agent, the console flows, and the GKE-specific onboarding scripts.
What is currently covered
The current GKE implementation includes the following building blocks:
- Phase 1 cluster registration through the standard Add Cluster flow in the CloudPilot AI console.
- Phase 2 optimization installation through the GKE-specific Start Saving script.
- Restore / upgrade / uninstall lifecycle scripts for existing GKE clusters.
- NodePool schedule handling and rebalance configuration upload through the GCP runtime paths in cloudpilot-agent.
- Console support for GCP NodePool / NodeClass management, cluster removal, and lifecycle entrypoints.
- Workload Autoscaler support on GKE, with an important caveat for GKE-managed kube-state-metrics; see Workload Autoscaler Installation Troubleshooting.
Key GKE concepts
| Variable | Meaning | Example |
|---|---|---|
| GCP_PROJECT_ID | The Google Cloud project that owns the cluster and CloudPilot-managed GCP resources. | my-prod-project |
| CLUSTER_NAME | The GKE cluster name. | prod-gke |
| CLUSTER_REGION | The broad region used for pricing, discovery, and script wiring. | us-central1 |
| CLUSTER_LOCATION | The exact GKE location. Use a zone for zonal clusters and a region for regional clusters. | us-central1-a or us-central1 |
| CLUSTER_ID | The CloudPilot AI cluster ID assigned after Phase 1 registration. | gcp-xxxx |
For GKE, CLUSTER_LOCATION is not interchangeable with CLUSTER_REGION.
Use us-central1-a for a zonal cluster and us-central1 for a regional cluster.
When the cluster name is ambiguous across multiple locations, export CLUSTER_LOCATION explicitly before running lifecycle scripts.
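The distinction between the two variables can be sketched in a few lines of POSIX shell. This is an illustrative sketch, not part of the CloudPilot scripts: the helper names `location_kind`, `region_of`, and `list_cluster_locations` are hypothetical, and the pattern match assumes the usual GCP naming scheme in which a zone is a region name followed by a hyphen and a single letter.

```shell
# Classify a GKE location string as "zone" or "region".
# Assumption: zones look like <region>-<letter> (us-central1-a),
# while regions end in a digit (us-central1).
location_kind() {
  case "$1" in
    *[0-9]-[a-z]) echo "zone" ;;
    *[0-9])       echo "region" ;;
    *)            echo "unknown"; return 1 ;;
  esac
}

# Derive the broad region from an exact location by dropping a zone suffix.
region_of() {
  case "$1" in
    *[0-9]-[a-z]) echo "${1%-*}" ;;
    *)            echo "$1" ;;
  esac
}

# Hypothetical helper (defined but not called here): list every location
# in which a cluster with the given name exists, to spot ambiguity.
list_cluster_locations() {
  gcloud container clusters list \
    --project "$GCP_PROJECT_ID" \
    --filter "name=$1" \
    --format "value(location)"
}

export CLUSTER_LOCATION="us-central1-a"
export CLUSTER_REGION="$(region_of "$CLUSTER_LOCATION")"
echo "$(location_kind "$CLUSTER_LOCATION") cluster in region $CLUSTER_REGION"
```

If `list_cluster_locations prod-gke` prints more than one line, the name is ambiguous and CLUSTER_LOCATION should be exported explicitly, as described above.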
How the GKE lifecycle fits together
Phase 1: Add Cluster
The console-generated GKE Add Cluster script installs the standard Phase 1 agent manifest and sets CLOUD_PROVIDER=gcp.
The GKE registration path relies on cluster metadata to discover the cluster name and related context instead of requiring you to inject CLUSTER_NAME manually into the Phase 1 script.
Phase 2: Start Saving
The GKE Start Saving flow runs the GKE-specific Phase 2 install script. This script validates:
- kubectl, helm, jq, curl, and gcloud
- the current Kubernetes context
- the exact cluster location
- GKE Workload Identity
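A comparable preflight can be run by hand before starting Phase 2. The sketch below is an assumption about what such checks might look like, not the contents of the actual install script: `missing_tools` and `workload_pool` are hypothetical helper names, and the Workload Identity check relies on the `workloadIdentityConfig.workloadPool` field that `gcloud container clusters describe` reports for clusters with Workload Identity enabled.

```shell
# Print each required tool that is not on PATH, one per line.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Hypothetical helper (defined but not called here): print the cluster's
# Workload Identity pool; empty output suggests Workload Identity is off.
workload_pool() {
  gcloud container clusters describe "$CLUSTER_NAME" \
    --project "$GCP_PROJECT_ID" \
    --location "$CLUSTER_LOCATION" \
    --format "value(workloadIdentityConfig.workloadPool)"
}

# Tool list taken from the validation step above.
missing="$(missing_tools kubectl helm jq curl gcloud)"
if [ -n "$missing" ]; then
  echo "missing tools:" >&2
  echo "$missing" >&2
else
  echo "all required tools found"
fi
```

The current Kubernetes context can be confirmed separately with `kubectl config current-context`.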
After validation, the script installs the CloudPilot base components and the GCP optimizer stack, while:
- creating or reusing the controller GSA
- binding cloudpilot/cloudpilot-admin to that GSA through Workload Identity
- reconciling the least-privilege custom IAM role used by the controller
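For readers unfamiliar with Workload Identity, the binding step can be sketched with the standard GKE commands. This is a minimal sketch, not the install script itself: `wi_member` and `bind_controller_gsa` are hypothetical helpers, and the GSA email passed to `bind_controller_gsa` is an assumption because the controller GSA's name is not given here.

```shell
# Build the IAM member string that identifies a Kubernetes service account
# under Workload Identity.
# $1 = project ID, $2 = namespace, $3 = Kubernetes service account
wi_member() {
  printf 'serviceAccount:%s.svc.id.goog[%s/%s]\n' "$1" "$2" "$3"
}

# Hypothetical helper (defined but not called here).
# $1 = GSA email (assumed), $2 = project ID
bind_controller_gsa() {
  # Allow the Kubernetes service account to impersonate the GSA.
  gcloud iam service-accounts add-iam-policy-binding "$1" \
    --project "$2" \
    --role roles/iam.workloadIdentityUser \
    --member "$(wi_member "$2" cloudpilot cloudpilot-admin)"
  # Point the Kubernetes service account at the GSA.
  kubectl annotate serviceaccount cloudpilot-admin \
    --namespace cloudpilot --overwrite \
    "iam.gke.io/gcp-service-account=$1"
}

wi_member my-prod-project cloudpilot cloudpilot-admin
```

With the example project from the variable table, the last line prints `serviceAccount:my-prod-project.svc.id.goog[cloudpilot/cloudpilot-admin]`.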
Day-2 operations
Once the cluster is managed by CloudPilot AI, the console uses GKE-specific scripts for:
- restoring original capacity
- upgrading CloudPilot AI components
- uninstalling CloudPilot AI safely
See GKE Getting Started and GKE Day-2 Operations for the operational steps.
GKE-specific operational notes
- Workload Identity is required for Phase 2. The default flow uses the cloudpilot-admin Kubernetes service account plus a bound Google service account instead of a downloaded JSON key.
- The exact location matters everywhere. Restore, upgrade, and uninstall all rely on the exact CLUSTER_LOCATION value.
- Private nodes still need outbound access. If the source node pool resolves to private nodes, keep Cloud NAT or equivalent outbound image-pull access available.
- CloudPilot uninstall is scoped. The GKE uninstall flow targets CloudPilot-managed NodeClaims, NodePools, and GCENodeClasses instead of deleting all Karpenter resources in the cluster.
- GKE-managed kube-state-metrics can block Workload Autoscaler recommendations. If WA stays in Optimization: Recommending, review the GKE troubleshooting guidance linked above.
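Because restore, upgrade, and uninstall all depend on the exact CLUSTER_LOCATION, it can help to fail fast when a variable is missing before invoking a lifecycle script. The guard below is a sketch under assumptions: `require_env` is a hypothetical helper, and which variables each lifecycle script actually reads beyond CLUSTER_LOCATION is assumed from the variable table rather than confirmed.

```shell
# Return non-zero and name the first variable that is unset or empty.
require_env() {
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required variable: $name" >&2
      return 1
    fi
  done
}

# Example values from the variable table; CLUSTER_ID is a placeholder.
export GCP_PROJECT_ID="my-prod-project"
export CLUSTER_NAME="prod-gke"
export CLUSTER_LOCATION="us-central1-a"
export CLUSTER_ID="gcp-xxxx"

if require_env GCP_PROJECT_ID CLUSTER_NAME CLUSTER_LOCATION CLUSTER_ID; then
  echo "environment ready for $CLUSTER_NAME in $CLUSTER_LOCATION"
fi
```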