CloudPilot AI POC Process
This guide walks through the standard Proof of Concept (POC) process for evaluating CloudPilot AI’s cluster optimization capabilities. We recommend maintaining a dedicated tracking document for each customer engagement to capture decisions, metrics, and outcomes at every stage.
Process Overview
1. Classify Workloads and Define Optimization Scope
Start by surveying the cluster to determine which workloads are eligible for optimization and which should be excluded.
Workloads that are generally not safe to optimize include:
- Single-replica StatefulSets: Kubernetes StatefulSets use a delete-before-create update strategy (OnDelete/RollingUpdate). With only one replica, the old Pod must fully terminate before a replacement is scheduled. This gap — between termination and the new Pod reaching Ready — leaves zero replicas available to handle traffic, creating an unavoidable service disruption.
- Long-lived connection workloads: Services that rely on persistent connections — such as WebSocket or gRPC streams — are particularly sensitive to Pod evictions. When a Pod is terminated, every active connection is dropped at once. Clients then attempt to reconnect simultaneously, triggering a reconnection storm. For real-time workloads like online gaming, live streaming, or collaborative editing, even brief disruptions can cause noticeable lag spikes, dropped sessions, or data loss.
Use Namespace or Node Group boundaries to organize workloads, depending on the customer’s cluster topology.
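To build the initial inventory, a quick kubectl query can surface single-replica StatefulSets across the cluster. This is an illustrative sketch — adjust the namespace scope to match the customer's topology:

```shell
# List every StatefulSet with its replica count; rows showing
# REPLICAS = 1 are candidates for exclusion from optimization.
kubectl get statefulsets --all-namespaces \
  -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,REPLICAS:.spec.replicas
```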
2. Isolate Non-optimizable Workloads
Once classified, use kubectl cordon and drain to migrate non-optimizable workloads onto a dedicated set of nodes. This keeps them safely out of scope when optimization begins on the remaining nodes.
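A typical isolation sequence looks like the sketch below. The node names and the `dedicated=excluded` label/taint key are placeholders for illustration, not a CloudPilot AI convention:

```shell
# Label and taint the dedicated nodes so only excluded workloads land there.
kubectl label node excluded-node-1 dedicated=excluded
kubectl taint node excluded-node-1 dedicated=excluded:NoSchedule

# After adding a matching nodeSelector and toleration to the
# non-optimizable workloads, cordon and drain their current nodes
# so they reschedule onto the dedicated set.
kubectl cordon worker-node-7
kubectl drain worker-node-7 --ignore-daemonsets --delete-emptydir-data
```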
3. Enable Workload Autoscaler
Enable the Workload Autoscaler (WA) for the target node group, covering all optimizable workloads that are currently in a Ready state. After activation, watch for any Pod failures or scheduling issues.
For Java workloads, check with the customer beforehand to confirm whether JVM parameter tuning is permitted.
Be sure to capture CPU and memory utilization metrics before and after enabling WA — this data is essential for demonstrating the value of right-sizing.
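With metrics-server installed in the cluster, a simple before/after snapshot can be captured with kubectl. The namespace and file names below are placeholders:

```shell
# Snapshot per-Pod CPU/memory usage before enabling WA...
kubectl top pods -n target-namespace > before-wa.txt
# ...and again after WA has applied new resource requests.
kubectl top pods -n target-namespace > after-wa.txt

# Also record the requests WA actually set, for the tracking document.
kubectl get pods -n target-namespace \
  -o custom-columns=NAME:.metadata.name,CPU_REQ:.spec.containers[*].resources.requests.cpu,MEM_REQ:.spec.containers[*].resources.requests.memory
```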
4. Configure NodeClass and NodePool
Set up the NodeClass and NodePool resources. Double-check that security groups and subnets are configured correctly — misconfigurations here are a common source of launch failures.
Also confirm whether burstable instance types (e.g., T-series) are acceptable. If the cluster doesn’t already use them, it’s safest to exclude them.
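As a sketch of what excluding burstable instances might look like, the fragment below assumes a Karpenter-style NodePool schema; the API group, field names, and label key are assumptions — consult the actual CloudPilot AI CRD reference before applying anything:

```yaml
apiVersion: karpenter.sh/v1   # assumed; use the CloudPilot AI CRD group if it differs
kind: NodePool
metadata:
  name: poc-nodepool
spec:
  template:
    spec:
      requirements:
        # Exclude burstable T-series instance types (t2/t3/t3a/t4g on AWS).
        - key: karpenter.k8s.aws/instance-category
          operator: NotIn
          values: ["t"]
```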
5. Trigger Node Optimization
After verifying that the NodeClass shows a Ready status, trigger optimization for the target node group. There must be at least one CloudPilot AI-managed node in the group for optimization to proceed.
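Both preconditions can be checked from the CLI before triggering optimization. This is a sketch: the resource name and the managed-node label key are assumptions, not confirmed CloudPilot AI identifiers:

```shell
# Confirm the NodeClass reports a Ready status.
kubectl describe nodeclass poc-nodeclass | grep -i ready   # resource/name assumed

# Verify at least one CloudPilot AI-managed node exists in the group.
kubectl get nodes -l cloudpilot.ai/managed=true            # label key assumed
```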
Record the before-and-after cost comparison for the node group in the customer’s tracking document.
6. Monitor Stability
Allow the optimized configuration to run for at least 48 hours. If everything looks healthy — no Pod disruptions, no performance regressions — begin expanding optimization to additional nodes in the group.
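During the 48-hour soak, a few standard checks catch most regressions early. Node names below are placeholders:

```shell
# Check for restarts or evictions among Pods on an optimized node.
kubectl get pods --all-namespaces -o wide \
  --field-selector spec.nodeName=optimized-node-1

# Surface recent Warning events (evictions, failed scheduling, OOM kills).
kubectl get events --all-namespaces \
  --field-selector type=Warning --sort-by=.lastTimestamp
```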
7. Iterate
Repeat steps 3–6 for each remaining node group until the entire cluster has been optimized.
Support
If you run into any issues during the POC, reach out to the CloudPilot AI support team — we’re happy to help.