OOM Auto-Remediation
Overview
CloudPilot AI Workload Autoscaler automatically detects, classifies, and remediates Out-of-Memory (OOM) events. When a container is OOM-killed, the system raises the memory floor (and JVM heap for Java workloads) so that subsequent Pods receive enough resources to avoid repeated OOM crashes.
The OOM handler does not directly restart or evict Pods — instead, it records boosted resource values into the AutoscalingPolicyConfiguration (APC) status. The existing updater (drift detection) and webhook (Pod admission) apply the boosted values through their normal flows.
How OOM Events Are Detected
The system uses two detection rules based on Kubernetes Pod Status:
1. Cgroup OOM Kill
When a container’s total memory usage exceeds its cgroup limit, the Linux kernel’s OOM killer terminates the main process. Kubernetes reports this as:
```
lastState.terminated.reason: OOMKilled
lastState.terminated.exitCode: 137
```

This covers all container types (Java, Go, Python, Node.js, etc.) and is the most common OOM scenario in Kubernetes.
2. JVM Exit on OutOfMemoryError
When a Java container is configured with -XX:+ExitOnOutOfMemoryError and the JVM’s internal heap is exhausted, the JVM exits with code 3:
```
lastState.terminated.reason: Error
lastState.terminated.exitCode: 3
```

This catches Java heap OOM even when the container’s cgroup limit is not exceeded (i.e., the JVM heap is the bottleneck, not the container’s total memory).
Note: For the exit-code-3 detection to work, the JVM must have `-XX:+ExitOnOutOfMemoryError` configured. The Workload Autoscaler automatically injects this flag for Java containers it directly manages. For containers using the env-var integration path, you should add this flag to your startup scripts. See Java Workload Optimization for details.
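The two detection rules can be expressed as a single predicate over a container's `lastState.terminated` fields. The sketch below is illustrative (plain dicts standing in for the Kubernetes Pod status; the function name is not part of the product):

```python
def is_oom_event(terminated: dict) -> bool:
    """Return True if a terminated-container record matches either OOM rule."""
    reason = terminated.get("reason")
    exit_code = terminated.get("exitCode")

    # Rule 1: cgroup OOM kill -- the kernel's OOM killer terminated
    # the main process. Covers all container types.
    if reason == "OOMKilled":
        return True

    # Rule 2: JVM exited with code 3 via -XX:+ExitOnOutOfMemoryError.
    # Kubernetes reports this as a generic "Error", so the exit code
    # is what disambiguates a heap OOM from an ordinary crash.
    if reason == "Error" and exit_code == 3:
        return True

    return False
```

Note that this predicate already excludes the non-trigger cases listed later in this page: an ordinary crash (`Error`/exit 1) and a liveness-probe kill (`Error`/exit 137) both fail both rules.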
How OOM Events Are Classified (Java)
For Java containers, the system queries the cloudpilot-node-agent metrics in Prometheus to determine the specific OOM sub-type:
| Classification | Meaning | Remediation |
|---|---|---|
| heap / heap_inferred | JVM Heap space exhausted | Boost memory and JVM heap (-Xmx/-Xms) |
| metaspace | Class metadata area exhausted | Boost memory only (heap is not the issue) |
| direct_buffer | NIO direct buffer exhausted | Boost memory only |
| native_thread | Thread limit reached | No auto-fix — requires manual investigation |
| cgroup | Non-Java container OOM | Boost memory only |
When classification data is unavailable (e.g., node-agent not deployed, or data not yet in Prometheus), the system defaults to a conservative path: treat as potential heap OOM and boost both memory and heap.
What Happens After Detection
1. Boost Computation
The system computes a new memory floor:
- Memory: the maximum of `current × 1.5`, `current + 200Mi`, and the current recommendation
- Heap (Java heap OOM only): escalates from the existing boost or recommendation by 1.5×
Each subsequent OOM further escalates the boost, preventing repeated crashes at the same resource level.
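In pseudocode, the boost computation looks roughly like this (a sketch working in MiB; function names are illustrative, the 1.5× / +200Mi rule is from above):

```python
def boosted_memory_mib(current_mib: float, recommendation_mib: float) -> float:
    """New memory floor after an OOM: the max of three candidates."""
    return max(current_mib * 1.5, current_mib + 200, recommendation_mib)

def boosted_heap_mib(existing_boost_or_rec_mib: float) -> float:
    """Java heap OOM only: escalate the existing boost/recommendation by 1.5x."""
    return existing_boost_or_rec_mib * 1.5
```

Because the computation runs against the already-boosted value on each subsequent OOM, the floor keeps rising: a 512Mi container boosts to 768Mi on the first OOM, and a second OOM at 768Mi boosts to 1152Mi.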
2. APC Status Update
The OOM event and computed boost are recorded in the APC status:
- OOMRecords: history of OOM events (capped at 10, FIFO)
- ActiveOOMBoosts: per-container memory/heap floor with a 36-hour expiry
- OOMRemediation condition: signals that an OOM boost is active
3. Resource Application
The boosted values are applied through existing mechanisms — no special OOM-specific Pod operations:
- Updater: drift detection sees the Pod’s resources are below the boosted floor → triggers InPlace resize or recreate
- Webhook: when a new Pod is admitted, applies `max(recommendation, boost)` as the resource request
- OOM Recovery (safety net): if the updater cannot act (e.g., `OnCreate` mode, recommendation not ready), the recovery controller evicts the Pod so a new one is created with boosted resources
4. Boost Expiry
After 36 hours, the boost expires and normal recommendations resume. By this time, the recommender has typically adjusted its recommendations based on the post-boost usage patterns.
When OOM Remediation Does NOT Trigger
- Normal application crashes (exit code 1 without OOM): not treated as OOM
- Kubernetes eviction (`reason=Evicted`): node memory pressure, not container OOM
- Liveness probe failure (exit code 137 but `reason` ≠ `OOMKilled`): not OOM
- `UpdateMode=Off`: user explicitly disabled updates