GKE Day-2 Operations
This guide summarizes the operational scripts used after a GKE cluster has been connected to CloudPilot AI.
Shared environment variables
The GKE lifecycle scripts all rely on the same basic context:
| Variable | Required by | Notes |
|---|---|---|
| GCP_PROJECT_ID | restore, upgrade, uninstall | The Google Cloud project that owns the cluster |
| CLUSTER_NAME | restore, upgrade, uninstall | The GKE cluster name |
| CLUSTER_REGION | restore, upgrade, uninstall | The broad region context |
| CLUSTER_LOCATION | restore, upgrade, uninstall | The exact zonal or regional location |
| CLOUDPILOT_API_KEY | upgrade, uninstall | Required when the script talks to CloudPilot AI |
| CLUSTER_ID | upgrade, uninstall | The CloudPilot AI cluster ID |
If CLUSTER_LOCATION is not set and the same cluster name can be matched in more than one GKE location, the lifecycle scripts stop and ask you to export the exact location explicitly.
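Before running any lifecycle script, it can help to confirm that the shared variables are set. The following is a minimal sketch, not part of the CloudPilot AI scripts: `missing_vars` is a hypothetical helper, and every exported value is a placeholder to replace with your own.

```shell
# Hypothetical helper (not shipped by CloudPilot AI): print the name of each
# listed environment variable that is unset or empty, and return non-zero if
# at least one is missing.
missing_vars() {
  status=0
  for name in "$@"; do
    # Look up the value of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "$name"
      status=1
    fi
  done
  return "$status"
}

# Placeholder values only; substitute your own project, cluster, and location.
export GCP_PROJECT_ID="my-project"
export CLUSTER_NAME="my-cluster"
export CLUSTER_REGION="us-central1"
export CLUSTER_LOCATION="us-central1-a"

# The restore flow needs the four cluster variables from the table above.
if missing_vars GCP_PROJECT_ID CLUSTER_NAME CLUSTER_REGION CLUSTER_LOCATION >/dev/null; then
  echo "restore context is complete"
fi
```

For the upgrade and uninstall flows, add `CLOUDPILOT_API_KEY` and `CLUSTER_ID` to the argument list; the helper prints whichever names are still missing.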
Restore cluster capacity
Use the GKE restore flow when you want to move workloads back to the original GKE node pools before uninstalling CloudPilot AI or before rolling back a test.
The current restore entrypoint is:
curl --silent "https://onboard.cloudpilot.ai/common/gke/restore.sh" | bash

The restore logic currently:
- verifies the current Kubernetes context and the exact GKE cluster location
- filters out CloudPilot-managed node pools and restores only the regular GKE node pools
- removes the CloudPilot rebalance taint from the selected node pools and their live nodes
- resizes each selected node pool and waits for the expected ready-node count to return
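To check the ready-node count yourself after a restore, one option is to count `Ready` nodes per node pool from `kubectl` output. This is a sketch of a manual check, not the restore script's own logic; `count_ready_nodes` and `POOL_NAME` are hypothetical names, and the sketch assumes the standard `cloud.google.com/gke-nodepool` node label.

```shell
# Hypothetical helper: read `kubectl get nodes --no-headers` output on stdin
# and print how many nodes have a STATUS column of exactly "Ready".
# Cordoned nodes ("Ready,SchedulingDisabled") and "NotReady" nodes are not counted.
count_ready_nodes() {
  awk '$2 == "Ready" { n++ } END { print n + 0 }'
}

# Usage against a live cluster (POOL_NAME is a placeholder for one of the
# regular GKE node pools):
#   kubectl get nodes -l "cloud.google.com/gke-nodepool=${POOL_NAME}" --no-headers \
#     | count_ready_nodes
```

Compare the printed number with the node pool size you expect after the resize.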
Upgrade CloudPilot AI on GKE
Use the GKE upgrade flow whenever the console prompts you to upgrade the cluster components.
The current GKE upgrade entrypoint is:
curl --silent "https://onboard.cloudpilot.ai/common/gke/upgrade.sh" | bash

The upgrade path currently:
- requires gcloud, jq, and the exact CLUSTER_LOCATION
- verifies the current deployed agent version
- upgrades Workload Autoscaler first when it is installed
- reruns the GKE Phase 2 installation path when Phase 2 is already present
- reapplies the latest Phase 1 manifest and waits for the CloudPilot namespace Pods to become ready
CloudPilot AI also uses an onboard versions index to control supported upgrade targets. If the console reports a manual-upgrade checkpoint, complete that checkpoint before retrying the automated path.
Uninstall CloudPilot AI from GKE
Always run the GKE uninstall flow in the same order that the CloudPilot AI console uses:
- disable rebalance
- restore original node pool capacity
- drain and remove CloudPilot-managed nodes
- run the final uninstall script
The current uninstall entrypoint is generated from the versioned GKE uninstall path:
curl --silent "https://onboard.cloudpilot.ai/manifest/<version>/gke/uninstall.sh" | bash

The current GKE uninstall script is intentionally scoped. It attempts to:
- delete CloudPilot-managed NodeClaims, NodePools, and GCENodeClasses instead of deleting all Karpenter resources cluster-wide
- remove the Workload Identity binding for cloudpilot/cloudpilot-admin
- remove the CloudPilot-managed controller GSA only when CloudPilot created it
- leave a custom controller GSA in place when you provided one explicitly
See the general Uninstall CloudPilot AI guide for the full step-by-step UI flow.
Common GKE operational issues
failed to auto-detect exact GKE cluster location
Export CLUSTER_LOCATION explicitly and rerun the script.
Use a zone like us-central1-a for zonal clusters and a region like us-central1 for regional clusters.
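As an illustration, the exact location can be looked up with `gcloud` and then exported. The sketch below assumes that `CLUSTER_REGION` is simply the region containing `CLUSTER_LOCATION`, which this guide does not state explicitly; `region_from_location` is a hypothetical helper.

```shell
# List every location where a cluster with this name exists (requires gcloud):
#   gcloud container clusters list \
#     --project "$GCP_PROJECT_ID" \
#     --filter="name=${CLUSTER_NAME}" \
#     --format="value(location)"

# Hypothetical helper: derive the region from a GKE location.
# A zone ends in "-<letter>" (us-central1-a); a region does not (us-central1).
region_from_location() {
  case "$1" in
    *-[a-z]) echo "${1%-*}" ;;  # zonal location: strip the zone suffix
    *)       echo "$1" ;;       # regional location: already a region
  esac
}

# Placeholder value; use the location reported for your cluster.
export CLUSTER_LOCATION="us-central1-a"
# Assumption: the region variable is the region that contains the location.
CLUSTER_REGION="$(region_from_location "$CLUSTER_LOCATION")"
export CLUSTER_REGION
```

If the `gcloud` listing prints more than one location, pick the one that belongs to the cluster connected to CloudPilot AI.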
Workload Identity is not enabled on cluster
The current Phase 2 and upgrade flows require GKE Workload Identity. Enable Workload Identity on the cluster, refresh your cluster access if needed, and rerun the script.
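One way to check and enable Workload Identity is with `gcloud`, sketched below under the assumption that the shared variables from this guide are exported. The function names are hypothetical, and the commented commands should be reviewed against your own cluster before you run them.

```shell
# The workload pool for a project has the fixed form PROJECT_ID.svc.id.goog.
workload_pool_for_project() {
  echo "${1}.svc.id.goog"
}

# Print the cluster's configured workload pool (requires gcloud).
# Empty output means Workload Identity is not enabled on the cluster.
current_workload_pool() {
  gcloud container clusters describe "$CLUSTER_NAME" \
    --project "$GCP_PROJECT_ID" \
    --location "$CLUSTER_LOCATION" \
    --format="value(workloadIdentityConfig.workloadPool)"
}

# To enable Workload Identity and refresh cluster access, run manually:
#   gcloud container clusters update "$CLUSTER_NAME" \
#     --project "$GCP_PROJECT_ID" \
#     --location "$CLUSTER_LOCATION" \
#     --workload-pool="$(workload_pool_for_project "$GCP_PROJECT_ID")"
#   gcloud container clusters get-credentials "$CLUSTER_NAME" \
#     --project "$GCP_PROJECT_ID" \
#     --location "$CLUSTER_LOCATION"
```

Existing node pools may also need the `GKE_METADATA` workload metadata mode before Pods on them can use Workload Identity.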
Private-node image pulls fail during GKE Phase 2
If the source node pool uses private nodes, make sure outbound image-pull access is available through Cloud NAT or an equivalent egress path before rerunning the install or upgrade flow.
Workload Autoscaler stays in Optimization: Recommending
If you are using GKE-managed kube-state-metrics, review the GKE-specific guidance in Workload Autoscaler Installation Troubleshooting.