GKE Day-2 Operations

This guide summarizes the operational scripts used after a GKE cluster has been connected to CloudPilot AI.

Shared environment variables

The GKE lifecycle scripts all rely on the same basic context:

Variable            Required by                  Notes
GCP_PROJECT_ID      restore, upgrade, uninstall  The Google Cloud project that owns the cluster
CLUSTER_NAME        restore, upgrade, uninstall  The GKE cluster name
CLUSTER_REGION      restore, upgrade, uninstall  The broad region context
CLUSTER_LOCATION    restore, upgrade, uninstall  The exact zonal or regional location
CLOUDPILOT_API_KEY  upgrade, uninstall           Required when the script talks to CloudPilot AI
CLUSTER_ID          upgrade, uninstall           The CloudPilot AI cluster ID

If CLUSTER_LOCATION is not set and the same cluster name can be matched in more than one GKE location, the lifecycle scripts stop and ask you to export the exact location explicitly.
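
Before running any lifecycle script, it can help to confirm that the variables from the table are exported in the current shell. The following is a minimal sketch, not part of the CloudPilot AI scripts: the function names required_vars_for and missing_vars are hypothetical, and the per-flow lists simply mirror the "Required by" column above. The scripts may still auto-detect CLUSTER_LOCATION when it is unset, so treat a missing CLUSTER_LOCATION as a warning rather than a hard failure.

```shell
# Hypothetical helper: list the shared variables each flow expects,
# copied from the "Required by" column of the table above.
required_vars_for() {
  case "$1" in
    restore)
      echo "GCP_PROJECT_ID CLUSTER_NAME CLUSTER_REGION CLUSTER_LOCATION" ;;
    upgrade|uninstall)
      echo "GCP_PROJECT_ID CLUSTER_NAME CLUSTER_REGION CLUSTER_LOCATION CLOUDPILOT_API_KEY CLUSTER_ID" ;;
    *)
      return 1 ;;
  esac
}

# Hypothetical helper: print the names of unset or empty variables for a flow.
# Returns 0 when nothing is missing, 1 when something is, 2 for an unknown flow.
missing_vars() {
  local flow name vars value missing
  flow="$1"
  missing=""
  vars="$(required_vars_for "$flow")" || { echo "unknown flow: $flow" >&2; return 2; }
  for name in $vars; do
    # Look up the variable whose name is stored in $name.
    # The names come from the fixed lists above, so eval is safe here.
    eval "value=\${$name:-}"
    [ -n "$value" ] || missing="${missing:+$missing }$name"
  done
  echo "$missing"
  [ -z "$missing" ]
}
```

For example, running missing_vars upgrade after exporting only the four cluster variables prints CLOUDPILOT_API_KEY CLUSTER_ID and returns a non-zero status.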

Restore cluster capacity

Use the GKE restore flow when you want to move workloads back to the original GKE node pools before uninstalling CloudPilot AI or before rolling back a test.

The current restore entrypoint is:

curl --silent "https://onboard.cloudpilot.ai/common/gke/restore.sh" | bash

The restore logic currently:

  • verifies the current Kubernetes context and the exact GKE cluster location
  • filters out CloudPilot-managed node pools and restores only the regular GKE node pools
  • removes the CloudPilot rebalance taint from the selected node pools and their live nodes
  • resizes each selected node pool and waits for the expected ready-node count to return
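
The context check in the first bullet can be approximated by hand before you run the script. The sketch below assumes your credentials were fetched with gcloud container clusters get-credentials, which names kubectl contexts gke_<project>_<location>_<cluster>. The function names are hypothetical, and the restore script's own verification may work differently.

```shell
# Build the context name that gcloud get-credentials would create
# for a given project, location, and cluster name.
expected_gke_context() {
  printf 'gke_%s_%s_%s\n' "$1" "$2" "$3"
}

# Compare a context name with the one expected from the shared variables.
# With no argument, the active kubectl context is used.
context_matches() {
  local current expected
  expected="$(expected_gke_context "$GCP_PROJECT_ID" "$CLUSTER_LOCATION" "$CLUSTER_NAME")"
  current="${1:-$(kubectl config current-context)}"
  [ "$current" = "$expected" ]
}
```

Running context_matches || echo "kubectl points at a different cluster" before the restore gives an early warning if the active context and the exported variables disagree.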

Upgrade CloudPilot AI on GKE

Use the GKE upgrade flow whenever the console prompts you to upgrade the cluster components.

The current GKE upgrade entrypoint is:

curl --silent "https://onboard.cloudpilot.ai/common/gke/upgrade.sh" | bash

The upgrade path currently:

  • requires gcloud, jq, and the exact CLUSTER_LOCATION
  • verifies the current deployed agent version
  • upgrades Workload Autoscaler first when it is installed
  • reruns the GKE Phase 2 installation path when Phase 2 is already present
  • reapplies the latest Phase 1 manifest and waits for the CloudPilot namespace Pods to become ready
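
Because the upgrade path depends on gcloud and jq, a quick tool check can save a failed run. This is a minimal sketch with a hypothetical helper name (missing_tools); it only tests whether each command is on PATH and says nothing about versions or authentication.

```shell
# Hypothetical helper: print the names of commands that are not on PATH.
# Returns 0 when every tool is found, 1 otherwise.
missing_tools() {
  local tool missing
  missing=""
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || missing="${missing:+$missing }$tool"
  done
  echo "$missing"
  [ -z "$missing" ]
}
```

A typical call is missing_tools gcloud jq, which prints nothing and returns 0 when both tools are installed.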

CloudPilot AI also uses an onboard versions index to control supported upgrade targets. If the console reports a manual-upgrade checkpoint, complete that checkpoint before retrying the automated path.

Uninstall CloudPilot AI from GKE

The GKE uninstall flow should always be executed in the same order used by the CloudPilot AI console:

  1. disable rebalance
  2. restore original node pool capacity
  3. drain and remove CloudPilot-managed nodes
  4. run the final uninstall script

The current uninstall entrypoint is generated from the versioned GKE uninstall path:

curl --silent "https://onboard.cloudpilot.ai/manifest/<version>/gke/uninstall.sh" | bash

The current GKE uninstall script is intentionally scoped. It attempts to:

  • delete CloudPilot-managed NodeClaims, NodePools, and GCENodeClasses instead of deleting all Karpenter resources cluster-wide
  • remove the Workload Identity binding for cloudpilot/cloudpilot-admin
  • remove the CloudPilot-managed controller GSA only when CloudPilot created it
  • leave a custom controller GSA in place when you provided one explicitly

See the general Uninstall CloudPilot AI guide for the full step-by-step UI flow.

Common GKE operational issues

failed to auto-detect exact GKE cluster location

Export CLUSTER_LOCATION explicitly and rerun the script. Use a zone like us-central1-a for zonal clusters and a region like us-central1 for regional clusters.
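
To find the value to export, you can list every location where a cluster with that name exists and then check whether the chosen value looks like a zone or a region. The sketch below is an assumption-laden helper rather than what the lifecycle scripts do: both function names are hypothetical, and the name patterns are a heuristic based on the usual shape of Google Cloud region and zone names.

```shell
# List the location of every cluster in the project whose name matches
# CLUSTER_NAME. More than one line of output means the name is ambiguous.
list_cluster_locations() {
  gcloud container clusters list \
    --project "$GCP_PROJECT_ID" \
    --filter "name=$CLUSTER_NAME" \
    --format "value(location)"
}

# Heuristic: classify a location string as "zone" (us-central1-a),
# "region" (us-central1), or "unknown".
location_kind() {
  if printf '%s\n' "$1" | grep -Eq '^[a-z]+-[a-z]+[0-9]+-[a-z]$'; then
    echo zone
  elif printf '%s\n' "$1" | grep -Eq '^[a-z]+-[a-z]+[0-9]+$'; then
    echo region
  else
    echo unknown
    return 1
  fi
}
```

After picking one line from list_cluster_locations, export it as CLUSTER_LOCATION and rerun the script.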

Workload Identity is not enabled on cluster

The current Phase 2 and upgrade flows require GKE Workload Identity. Enable Workload Identity on the cluster, refresh your cluster access if needed, and rerun the script.
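
One way to check and enable Workload Identity with gcloud is sketched below. It assumes the shared variables are exported and that the cluster should use the project's default workload pool, which has the form <project>.svc.id.goog. The function names are hypothetical, and your organization may require a different rollout procedure.

```shell
# Default workload pool name for a project.
workload_pool_for() {
  printf '%s.svc.id.goog\n' "$1"
}

# Print the cluster's workload pool; empty output means
# Workload Identity is not enabled.
current_workload_pool() {
  gcloud container clusters describe "$CLUSTER_NAME" \
    --project "$GCP_PROJECT_ID" \
    --location "$CLUSTER_LOCATION" \
    --format "value(workloadIdentityConfig.workloadPool)"
}

# Enable Workload Identity on the cluster, then refresh local credentials.
enable_workload_identity() {
  gcloud container clusters update "$CLUSTER_NAME" \
    --project "$GCP_PROJECT_ID" \
    --location "$CLUSTER_LOCATION" \
    --workload-pool "$(workload_pool_for "$GCP_PROJECT_ID")"
  gcloud container clusters get-credentials "$CLUSTER_NAME" \
    --project "$GCP_PROJECT_ID" \
    --location "$CLUSTER_LOCATION"
}
```

Call current_workload_pool first; run enable_workload_identity only if it prints nothing, then rerun the CloudPilot AI script.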

Private-node image pulls fail during GKE Phase 2

If the source node pool uses private nodes, make sure outbound image-pull access is available through Cloud NAT or an equivalent egress path before rerunning the install or upgrade flow.

Workload Autoscaler stays in Optimization: Recommending

If you are using GKE-managed kube-state-metrics, review the GKE-specific guidance in Workload Autoscaler Installation Troubleshooting.
