
OOM Auto-Remediation

Overview

CloudPilot AI Workload Autoscaler automatically detects, classifies, and remediates Out-of-Memory (OOM) events. When a container is OOM-killed, the system raises the memory floor (and JVM heap for Java workloads) so that subsequent Pods receive enough resources to avoid repeated OOM crashes.

The OOM handler does not directly restart or evict Pods — instead, it records boosted resource values into the AutoscalingPolicyConfiguration (APC) status. The existing updater (drift detection) and webhook (Pod admission) apply the boosted values through their normal flows.

OOM Detection & Remediation Flow

How OOM Events Are Detected

The system uses two detection rules based on Kubernetes Pod Status:

1. Cgroup OOM Kill

When a container’s total memory usage exceeds its cgroup limit, the Linux kernel’s OOM killer terminates the main process. Kubernetes reports this as:

```yaml
lastState:
  terminated:
    reason: OOMKilled
    exitCode: 137
```

This covers all container types (Java, Go, Python, Node.js, etc.) and is the most common OOM scenario in Kubernetes.

2. JVM Exit on OutOfMemoryError

When a Java container is configured with -XX:+ExitOnOutOfMemoryError and the JVM’s internal heap is exhausted, the JVM exits with code 3:

```yaml
lastState:
  terminated:
    reason: Error
    exitCode: 3
```

This catches Java heap OOM even when the container’s cgroup limit is not exceeded (i.e., the JVM heap is the bottleneck, not the container total memory).

Note: For the exit-code-3 detection to work, the JVM must have -XX:+ExitOnOutOfMemoryError configured. The Workload Autoscaler automatically injects this flag for Java containers it directly manages. For containers using the env-var integration path, you should add this flag to your startup scripts. See Java Workload Optimization for details.
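The two detection rules above can be sketched as a single check against a container's `lastState.terminated` fields. This is an illustrative sketch, not the actual controller code; the field names mirror the Pod status snippets above, and the function name is hypothetical.

```python
OOM_KILLED_EXIT = 137  # cgroup OOM kill (kernel OOM killer sends SIGKILL)
JVM_OOM_EXIT = 3       # JVM exit when -XX:+ExitOnOutOfMemoryError fires

def is_oom_event(last_state: dict) -> bool:
    """Return True if a terminated container state matches either detection rule."""
    reason = last_state.get("reason")
    exit_code = last_state.get("exitCode")
    # Rule 1: cgroup OOM kill -- covers all container types.
    if reason == "OOMKilled" and exit_code == OOM_KILLED_EXIT:
        return True
    # Rule 2: JVM heap exhaustion with -XX:+ExitOnOutOfMemoryError.
    if reason == "Error" and exit_code == JVM_OOM_EXIT:
        return True
    return False
```

Note that an exit code of 137 alone is not enough: the reason must be `OOMKilled`, since a liveness-probe kill also produces 137.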

How OOM Events Are Classified (Java)

For Java containers, the system queries the cloudpilot-node-agent metrics in Prometheus to determine the specific OOM sub-type:

| Classification | Meaning | Remediation |
| --- | --- | --- |
| `heap` / `heap_inferred` | JVM heap space exhausted | Boost memory and JVM heap (`-Xmx`/`-Xms`) |
| `metaspace` | Class metadata area exhausted | Boost memory only (heap is not the issue) |
| `direct_buffer` | NIO direct buffer exhausted | Boost memory only |
| `native_thread` | Thread limit reached | No auto-fix; requires manual investigation |
| `cgroup` | Non-Java container OOM | Boost memory only |

When classification data is unavailable (e.g., node-agent not deployed, or data not yet in Prometheus), the system defaults to a conservative path: treat as potential heap OOM and boost both memory and heap.
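The classification-to-remediation mapping, including the conservative fallback, can be summarized as follows. The classification labels match the table above; the lookup itself is a hypothetical sketch, not the real controller code.

```python
# Which resources each OOM sub-type boosts (per the classification table).
REMEDIATION = {
    "heap":          {"boost_memory": True,  "boost_heap": True},
    "heap_inferred": {"boost_memory": True,  "boost_heap": True},
    "metaspace":     {"boost_memory": True,  "boost_heap": False},
    "direct_buffer": {"boost_memory": True,  "boost_heap": False},
    "native_thread": {"boost_memory": False, "boost_heap": False},  # manual investigation
    "cgroup":        {"boost_memory": True,  "boost_heap": False},
}

def remediation_for(classification):
    """Return the remediation plan for an OOM sub-type.

    Conservative fallback: when classification data is unavailable
    (node-agent missing, Prometheus lag), treat it as a potential heap
    OOM and boost both memory and heap.
    """
    if classification not in REMEDIATION:
        return {"boost_memory": True, "boost_heap": True}
    return REMEDIATION[classification]
```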

OOM Scenarios & Actions

What Happens After Detection

1. Boost Computation

The system computes a new memory floor:

  • Memory: the maximum of current × 1.5, current + 200Mi, and the current recommendation
  • Heap (Java heap OOM only): escalates from the existing boost or recommendation by 1.5×

Each subsequent OOM further escalates the boost, preventing repeated crashes at the same resource level.
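The boost rules above can be sketched in a few lines (values in bytes). The 1.5× factor and the 200Mi minimum increment come from the rules listed; the function names are illustrative.

```python
MI = 1024 * 1024          # one mebibyte, in bytes
BOOST_FACTOR = 1.5
MIN_INCREMENT = 200 * MI  # minimum absolute bump per OOM

def compute_memory_floor(current: int, recommendation: int) -> int:
    """New memory floor: max(current * 1.5, current + 200Mi, recommendation)."""
    return int(max(current * BOOST_FACTOR,
                   current + MIN_INCREMENT,
                   recommendation))

def escalate_heap(existing_boost_or_recommendation: int) -> int:
    """Java heap OOM only: escalate the prior boost/recommendation by 1.5x."""
    return int(existing_boost_or_recommendation * BOOST_FACTOR)
```

For example, a container OOM-killed at 512Mi with a 600Mi recommendation gets a floor of max(768Mi, 712Mi, 600Mi) = 768Mi; at 300Mi the 200Mi increment dominates and the floor becomes 500Mi.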

2. APC Status Update

The OOM event and computed boost are recorded in the APC status:

  • OOMRecords: history of OOM events (capped at 10, FIFO)
  • ActiveOOMBoosts: per-container memory/heap floor with a 36-hour expiry
  • OOMRemediation condition: signals that an OOM boost is active
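The status bookkeeping above (a FIFO history capped at 10 records, plus a per-container boost with a 36-hour expiry) can be sketched as follows. The dictionary layout and function name are assumptions for illustration, not the real APC CRD schema.

```python
from datetime import datetime, timedelta, timezone

MAX_OOM_RECORDS = 10
BOOST_TTL = timedelta(hours=36)

def record_oom(status: dict, container: str, memory_floor: int) -> None:
    """Append an OOM record (FIFO, capped at 10) and set the active boost."""
    now = datetime.now(timezone.utc)
    records = status.setdefault("OOMRecords", [])
    records.append({"container": container, "time": now})
    del records[:-MAX_OOM_RECORDS]  # FIFO: drop the oldest beyond 10
    status.setdefault("ActiveOOMBoosts", {})[container] = {
        "memoryFloor": memory_floor,
        "expiresAt": now + BOOST_TTL,  # boost expires after 36 hours
    }
```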

3. Resource Application

The boosted values are applied through existing mechanisms — no special OOM-specific Pod operations:

  • Updater: drift detection sees the Pod’s resources are below the boosted floor → triggers InPlace resize or recreate
  • Webhook: when a new Pod is admitted, applies max(recommendation, boost) as the resource request
  • OOM Recovery (safety net): if the updater cannot act (e.g., OnCreate mode, recommendation not ready), the recovery controller evicts the Pod so a new one is created with boosted resources
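The updater and webhook decisions above reduce to two simple comparisons. This is a minimal sketch under the assumption that both values are plain byte counts; function names are hypothetical.

```python
def needs_resize(pod_memory: int, boosted_floor: int) -> bool:
    """Updater drift check: a Pod below the boosted floor triggers
    an in-place resize or a recreate."""
    return pod_memory < boosted_floor

def effective_memory_request(recommendation: int, active_boost=None) -> int:
    """Webhook admission: apply max(recommendation, boost) when a
    boost is active, otherwise the plain recommendation."""
    if active_boost is None:
        return recommendation
    return max(recommendation, active_boost)
```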

4. Boost Expiry

After 36 hours, the boost expires and normal recommendations resume. By this time, the recommender has typically adjusted its recommendations based on the post-boost usage patterns.

When OOM Remediation Does NOT Trigger

  • Normal application crashes (exit code 1 without OOM): not treated as OOM
  • Kubernetes eviction (reason=Evicted): node memory pressure, not container OOM
  • Liveness probe failure (exit code 137 but reason≠OOMKilled): not OOM
  • UpdateMode=Off: user explicitly disabled updates
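Combining the detection rules with the exclusions above, the overall trigger decision can be sketched like this. `terminated` mirrors the `lastState.terminated` fields; the `update_mode` parameter and function name are illustrative assumptions.

```python
def should_remediate(terminated: dict, update_mode: str) -> bool:
    """Decide whether an OOM boost should be recorded for a termination."""
    if update_mode == "Off":
        return False  # user explicitly disabled updates
    reason = terminated.get("reason")
    exit_code = terminated.get("exitCode")
    if reason == "OOMKilled" and exit_code == 137:
        return True   # cgroup OOM kill
    if reason == "Error" and exit_code == 3:
        return True   # JVM -XX:+ExitOnOutOfMemoryError
    # Everything else is not an OOM: normal crashes (exit 1),
    # Kubernetes evictions (reason=Evicted), and probe kills
    # (exit 137 with reason != OOMKilled).
    return False
```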