OOM Auto-Remediation
Overview
CloudPilot AI Workload Autoscaler automatically detects, classifies, and remediates Out-of-Memory (OOM) events. When a container is OOM-killed, the system raises the memory floor (and JVM heap for Java workloads) so that subsequent Pods receive enough resources to avoid repeated OOM crashes.
The OOM handler does not directly restart or evict Pods — instead, it records boosted resource values into the AutoscalingPolicyConfiguration (APC) status. The existing updater (drift detection) and webhook (Pod admission) apply the boosted values through their normal flows.
How OOM Events Are Detected
The system uses two detection rules based on Kubernetes Pod Status:
1. Cgroup OOM Kill
When a container’s total memory usage exceeds its cgroup limit, the Linux kernel’s OOM killer terminates the main process. Kubernetes reports this as:
```
lastState.terminated.reason: OOMKilled
lastState.terminated.exitCode: 137
```

This covers all container types (Java, Go, Python, Node.js, etc.) and is the most common OOM scenario in Kubernetes.
2. JVM Exit on OutOfMemoryError
When a Java container is configured with -XX:+ExitOnOutOfMemoryError and the JVM’s internal heap is exhausted, the JVM exits with code 3:
```
lastState.terminated.reason: Error
lastState.terminated.exitCode: 3
```

This catches Java heap OOM even when the container’s cgroup limit is not exceeded (i.e., the JVM heap is the bottleneck, not the container’s total memory).
Note: For the exit-code-3 detection to work, the JVM must have `-XX:+ExitOnOutOfMemoryError` configured. The Workload Autoscaler automatically injects this flag for Java containers it directly manages. For containers using the env-var integration path, you should add this flag to your startup scripts. See Java Workload Optimization for details.
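The two detection rules can be expressed as a single predicate over a container's `lastState.terminated` fields. The sketch below is illustrative (plain dicts standing in for the Kubernetes Pod status; the function name is not part of the product):

```python
def is_oom_event(terminated: dict) -> bool:
    """Return True if a terminated-container record matches either OOM rule."""
    reason = terminated.get("reason")
    exit_code = terminated.get("exitCode")

    # Rule 1: cgroup OOM kill -- the kernel's OOM killer terminated
    # the main process. Covers all container types.
    if reason == "OOMKilled":
        return True

    # Rule 2: JVM exited with code 3 via -XX:+ExitOnOutOfMemoryError.
    # Kubernetes reports this as a generic "Error", so the exit code
    # is what disambiguates a heap OOM from an ordinary crash.
    if reason == "Error" and exit_code == 3:
        return True

    return False
```

Note that this predicate already excludes the non-trigger cases listed later in this page: an ordinary crash (`Error`/exit 1) and a liveness-probe kill (`Error`/exit 137) both fail both rules.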
How OOM Events Are Classified (Java)
For Java containers, the system queries the cloudpilot-node-agent metrics in Prometheus to determine the specific OOM sub-type:
| Classification | Meaning | Remediation |
|---|---|---|
| heap / heap_inferred | JVM Heap space exhausted | Boost memory and JVM heap (-Xmx/-Xms) |
| metaspace | Class metadata area exhausted | Boost memory only (heap is not the issue) |
| direct_buffer | NIO direct buffer exhausted | Boost memory only |
| native_thread | Thread limit reached | No auto-fix — requires manual investigation |
| cgroup | Non-Java container OOM | Boost memory only |
When classification data is unavailable (e.g., node-agent not deployed, or data not yet in Prometheus), the system defaults to a conservative path: treat as potential heap OOM and boost both memory and heap.
What Happens After Detection
1. Boost Computation
The system computes a new memory floor:
- Memory: the maximum of `current × 1.5`, `current + 200Mi`, and the current recommendation
- Heap (Java heap OOM only): escalates from the existing boost or recommendation by 1.5×
Each subsequent OOM further escalates the boost, preventing repeated crashes at the same resource level.
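In pseudocode, the boost computation looks roughly like this (a sketch working in MiB; function names are illustrative, the 1.5× / +200Mi rule is from above):

```python
def boosted_memory_mib(current_mib: float, recommendation_mib: float) -> float:
    """New memory floor after an OOM: the max of three candidates."""
    return max(current_mib * 1.5, current_mib + 200, recommendation_mib)

def boosted_heap_mib(existing_boost_or_rec_mib: float) -> float:
    """Java heap OOM only: escalate the existing boost/recommendation by 1.5x."""
    return existing_boost_or_rec_mib * 1.5
```

Because the computation runs against the already-boosted value on each subsequent OOM, the floor keeps rising: a 512Mi container boosts to 768Mi on the first OOM, and a second OOM at 768Mi boosts to 1152Mi.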
2. APC Status Update
The OOM event and computed boost are recorded in the APC status:
- OOMRecords: history of OOM events (capped at 10, FIFO)
- ActiveOOMBoosts: per-container memory/heap floor with a 36-hour expiry
- OOMRemediation condition: signals that an OOM boost is active
3. Resource Application
The boosted values are applied through existing mechanisms — no special OOM-specific Pod operations:
- Updater: drift detection sees the Pod’s resources are below the boosted floor → triggers InPlace resize or recreate
- Webhook: when a new Pod is admitted, applies `max(recommendation, boost)` as the resource request
- OOM Recovery (safety net): if the updater cannot act (e.g., `OnCreate` mode, recommendation not ready), the recovery controller evicts the Pod so a new one is created with boosted resources
4. Boost Expiry
After 36 hours, the boost expires and normal recommendations resume. By this time, the recommender has typically adjusted its recommendations based on the post-boost usage patterns.
When OOM Remediation Does NOT Trigger
- Normal application crashes (exit code 1 without OOM): not treated as OOM
- Kubernetes eviction (`reason=Evicted`): node memory pressure, not container OOM
- Liveness probe failure (exit code 137 but `reason` ≠ `OOMKilled`): not OOM
- `UpdateMode=Off`: user explicitly disabled updates