# OOM Auto-Remediation

## Overview
CloudPilot AI Workload Autoscaler automatically detects, classifies, and remediates Out-of-Memory (OOM) events. When a container is OOM-killed, the system raises the memory floor (and JVM heap for Java workloads) so that subsequent Pods receive enough resources to avoid repeated OOM crashes.
The OOM handler does not directly restart, evict, or resize Pods. Instead, it records boosted resource values into the AutoscalingPolicyConfiguration (APC) status. The existing updater, Pod admission webhook, and OOM recovery controller then apply the boosted values through their normal flows.
## Controller Responsibilities
OOM remediation is split across several components:
| Component | Responsibility | Pod operation? |
|---|---|---|
| OOM Handler | Detects OOM signals, classifies Java OOMs, computes boosted Memory/Heap floors, and writes oomRecords, activeOOMBoosts, and the OOMRemediation condition to APC status. | No. It only updates APC status. |
| Updater | Treats the active OOM boost as an effective recommendation floor and applies it through the configured update path (InPlace or ReCreate). | Yes, through the normal update pipeline. |
| Pod admission webhook | Applies `max(recommendation, active OOM boost)` when a replacement Pod is created. This works even when fresh recommendations are not yet ready. | Yes, before the new Pod is created. |
| OOM Recovery controller | Safety net that evicts or rolls out Pods that still have an active OOM signal and remain below the boost floor. | Yes, only when recovery gates pass. |
## How OOM Events Are Detected
The system uses two detection rules based on Kubernetes Pod Status:
### 1. Cgroup OOM Kill

When a container’s total memory usage exceeds its cgroup limit, the Linux kernel’s OOM killer terminates the main process. Kubernetes reports this as:

```
lastState.terminated.reason: OOMKilled
lastState.terminated.exitCode: 137
```

This covers all container types (Java, Go, Python, Node.js, etc.) and is the most common OOM scenario in Kubernetes.
### 2. JVM Exit on OutOfMemoryError

When a Java container is configured with `-XX:+ExitOnOutOfMemoryError` and the JVM’s internal heap is exhausted, the JVM exits with code 3:

```
lastState.terminated.reason: Error
lastState.terminated.exitCode: 3
```

This catches Java heap OOM even when the container’s cgroup limit is not exceeded (i.e., the JVM heap is the bottleneck, not the container total memory).

Note: For the exit-code-3 detection to work, the JVM must have `-XX:+ExitOnOutOfMemoryError` configured. The Workload Autoscaler automatically injects this flag for Java containers it directly manages. For containers using the env-var integration path, you should add this flag to your startup scripts. See Java Workload Optimization for details.
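Taken together, the two rules can be sketched as a single predicate. The struct and function names below are illustrative, not the autoscaler’s actual API:

```go
package main

import "fmt"

// Terminated mirrors the relevant fields of a Kubernetes
// containerStatus.lastState.terminated entry (illustrative struct).
type Terminated struct {
	Reason   string
	ExitCode int
}

// isOOMSignal reports whether a terminated container state matches
// either detection rule: a cgroup OOM kill, or a JVM exit on
// OutOfMemoryError (exit code 3, meaningful only for Java containers
// running with -XX:+ExitOnOutOfMemoryError).
func isOOMSignal(t Terminated, isJava bool) bool {
	if t.Reason == "OOMKilled" && t.ExitCode == 137 {
		return true // rule 1: kernel OOM killer, any language
	}
	if isJava && t.Reason == "Error" && t.ExitCode == 3 {
		return true // rule 2: JVM heap exhausted below the cgroup limit
	}
	return false
}

func main() {
	fmt.Println(isOOMSignal(Terminated{"OOMKilled", 137}, false)) // true
	fmt.Println(isOOMSignal(Terminated{"Error", 3}, true))        // true
	fmt.Println(isOOMSignal(Terminated{"Error", 1}, true))        // false: ordinary crash
}
```

Note that exit code 137 alone is not enough: the reason must be `OOMKilled`, which is why liveness-probe kills (also 137) do not trigger remediation.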
## How OOM Events Are Classified (Java)
For Java containers, the system queries the cloudpilot-node-agent metrics in Prometheus to determine the specific OOM sub-type:
| Classification | Meaning | Remediation |
|---|---|---|
| heap / heap_inferred | JVM Heap space exhausted | Boost memory and JVM heap (`-Xmx`/`-Xms`) |
| metaspace | Class metadata area exhausted | Boost memory only (heap is not the issue) |
| direct_buffer | NIO direct buffer exhausted | Boost memory only |
| native_thread | Thread limit reached | No auto-fix — requires manual investigation |
| cgroup | Non-Java container OOM | Boost memory only |
When classification data is unavailable (e.g., node-agent not deployed, or data not yet in Prometheus), the system defaults to a conservative path: treat as potential heap OOM and boost both memory and heap.
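The mapping in the table, including the conservative fallback, can be sketched as follows (illustrative types and names, not the controller’s real API):

```go
package main

import "fmt"

// Action describes what the handler will boost for a given OOM
// classification (illustrative struct).
type Action struct {
	BoostMemory bool
	BoostHeap   bool
}

// remediationFor maps an OOM classification to its remediation.
// Unknown or missing classifications fall back to the conservative
// path: treat as a potential heap OOM and boost both memory and heap.
func remediationFor(class string) Action {
	switch class {
	case "heap", "heap_inferred":
		return Action{BoostMemory: true, BoostHeap: true}
	case "metaspace", "direct_buffer", "cgroup":
		return Action{BoostMemory: true} // heap is not the bottleneck
	case "native_thread":
		return Action{} // no auto-fix: more memory does not add threads
	default:
		return Action{BoostMemory: true, BoostHeap: true} // conservative fallback
	}
}

func main() {
	fmt.Println(remediationFor("metaspace"))     // {true false}
	fmt.Println(remediationFor(""))              // {true true}: classification unavailable
	fmt.Println(remediationFor("native_thread")) // {false false}
}
```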
## What Happens After Detection

### 1. Boost Computation

The system computes a new memory floor:
- Memory: the maximum of `current × 1.5`, `current + 200Mi`, and the current recommendation
- Heap (Java heap OOM only): escalates from the existing boost or recommendation by 1.5×
Each subsequent OOM further escalates the boost, preventing repeated crashes at the same resource level.
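As a concrete sketch of the computation (quantities in bytes; function and parameter names are illustrative):

```go
package main

import "fmt"

const Mi = 1024 * 1024

// memoryFloor computes the boosted memory floor after an OOM:
// the maximum of current×1.5, current+200Mi, and the current
// recommendation (all in bytes).
func memoryFloor(current, recommendation int64) int64 {
	floor := current * 3 / 2 // current × 1.5
	if v := current + 200*Mi; v > floor {
		floor = v
	}
	if recommendation > floor {
		floor = recommendation
	}
	return floor
}

// heapFloor escalates the JVM heap floor by 1.5× from whichever is
// higher: the existing boost or the heap recommendation.
func heapFloor(existingBoost, recommendation int64) int64 {
	base := existingBoost
	if recommendation > base {
		base = recommendation
	}
	return base * 3 / 2
}

func main() {
	// 512Mi container, 400Mi recommendation: 512×1.5 = 768 wins over 712 and 400.
	fmt.Println(memoryFloor(512*Mi, 400*Mi) / Mi) // 768
	// 100Mi container: current+200Mi (300) beats 1.5× (150) and the recommendation.
	fmt.Println(memoryFloor(100*Mi, 120*Mi) / Mi) // 300
}
```

Because the floor always takes the maximum of the escalation terms, a second OOM at 768Mi would raise the floor again (to 1152Mi), which is how repeated crashes at the same level are avoided.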
### 2. APC Status Update
The OOM event and computed boost are recorded in the APC status:
- `oomRecords`: history of OOM events (capped at 10, FIFO)
- `activeOOMBoosts`: per-container memory/heap floor with a 36-hour expiry
- `OOMRemediation` condition: signals that an OOM boost is active
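Conceptually, the recorded status is shaped like the following. These are illustrative Go structs that follow the field names above, not the real CRD types:

```go
package main

import (
	"fmt"
	"time"
)

// OOMRecord is one entry in the FIFO, max-10 OOM history.
type OOMRecord struct {
	Container      string
	Classification string // e.g. "heap", "metaspace", "cgroup"
	Timestamp      time.Time
}

// ActiveOOMBoost is the per-container floor the updater and webhook enforce.
type ActiveOOMBoost struct {
	Container   string
	MemoryBytes int64
	HeapBytes   int64 // zero when only memory is boosted
	ExpiresAt   time.Time // 36 hours after the boost was recorded
}

// appendRecord keeps the history capped at 10 entries, dropping the oldest.
func appendRecord(records []OOMRecord, r OOMRecord) []OOMRecord {
	records = append(records, r)
	if len(records) > 10 {
		records = records[len(records)-10:]
	}
	return records
}

func main() {
	var history []OOMRecord
	for i := 0; i < 12; i++ {
		history = appendRecord(history, OOMRecord{Container: "app", Timestamp: time.Now()})
	}
	fmt.Println(len(history)) // 10: oldest two entries were dropped
}
```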
### 3. Resource Application
The boosted values are applied through existing mechanisms — the OOM handler itself does not mutate Pods:
- Updater: drift detection sees the Pod’s resources are below the boosted floor → triggers InPlace resize or recreate
- Webhook: when a new Pod is admitted, applies `max(recommendation, boost)` as the resource request
- OOM Recovery (safety net): if the updater cannot act (e.g., `OnCreate` mode, recommendation not ready), the recovery controller evicts the Pod so a new one is created with boosted resources
OOM Recovery only acts when all of the following gates pass:

- the APC still has an active OOM boost
- the update mode is `OnCreate`, `ReCreate`, or `InPlace`
- proactive updates are not disabled
- the target Pod is not preempted, deleting, gone for scheduling, or still inside the Startup Boost window

For single-replica Deployments without PVCs, recovery can trigger a workload rollout instead of directly evicting the Pod to reduce disruption.
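The recovery gates combine as a single predicate, sketched below with illustrative field names (the real controller reads these conditions from the APC and Pod objects):

```go
package main

import "fmt"

// Gates captures the recovery preconditions (illustrative struct).
type Gates struct {
	ActiveBoost      bool   // APC still has an active OOM boost
	UpdateMode       string // "OnCreate", "ReCreate", "InPlace", or "Off"
	ProactiveUpdates bool   // true when proactive updates are not disabled
	PodPreempted     bool
	PodDeleting      bool
	PodGone          bool // gone for scheduling
	InStartupBoost   bool // still inside the Startup Boost window
}

// canRecover reports whether the OOM Recovery controller may evict
// (or roll out) the Pod.
func canRecover(g Gates) bool {
	validMode := g.UpdateMode == "OnCreate" || g.UpdateMode == "ReCreate" || g.UpdateMode == "InPlace"
	podEligible := !g.PodPreempted && !g.PodDeleting && !g.PodGone && !g.InStartupBoost
	return g.ActiveBoost && validMode && g.ProactiveUpdates && podEligible
}

func main() {
	g := Gates{ActiveBoost: true, UpdateMode: "OnCreate", ProactiveUpdates: true}
	fmt.Println(canRecover(g)) // true: all gates pass
	g.UpdateMode = "Off"
	fmt.Println(canRecover(g)) // false: automatic actions are skipped
}
```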
### 4. Boost Expiry
After 36 hours, the boost expires and normal recommendations resume. By this time, the recommender has typically adjusted its recommendations based on the post-boost usage patterns.
## When OOM Remediation Does NOT Trigger
- Normal application crashes (exit code 1 without OOM): not treated as OOM
- Kubernetes eviction (`reason=Evicted`): node memory pressure, not container OOM
- Liveness probe failure (exit code 137 but `reason ≠ OOMKilled`): not OOM
- Java native-thread OOM (`native_thread`): recorded for visibility, but no automatic resource boost is applied because adding memory or heap does not fix thread exhaustion
- `UpdateMode=Off`: automatic update and recovery actions are skipped; OOM status may still be recorded for visibility
## Status fields to check
Use the APC status to verify what the system decided:
| Field / condition | Meaning |
|---|---|
| `status.oomRecords` | Recent OOM events, capped at 10 records. |
| `status.activeOOMBoosts` | Active per-container Memory/Heap boost floors and their expiry time. |
| `status.conditions[type=OOMRemediation]` | Whether an OOM boost is active, skipped, or cleared. |
| `status.recommendations[].adjustedRecommendation` | The normal recommendation that the boost is compared against. |
If activeOOMBoosts exists but a Pod is not updated immediately, check the Pod’s update mode, Startup Boost annotation/window, proactive-update disable annotation, and whether the Pod is already deleting or preempted.