Best Practices for Ensuring Service Availability with AWS ALB

In an AWS EKS environment, requests routed through an ALB can fail briefly whenever a pod is terminated, for example during a rolling update or a scale-down. The issue typically arises as follows:

[Figure: timeline of pod termination and ALB target removal, showing Time Points 1 to 3 and Time Period 4]

Time Point 1: When Kubernetes decides to terminate a pod, it marks the pod's status as Terminating and sends a TERM signal to its containers. Depending on the application, the pod may stop immediately or only after completing all in-flight requests.

Time Point 2: The ALB (Application Load Balancer) controller detects that the pod's status is Terminating and begins the work of removing the pod's IP from the ALB.

Time Point 3: The ALB controller calls the Elastic Load Balancing API (DeregisterTargets) to remove the pod's IP from the ALB's target group.

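Time Period 4 is the window between the TERM signal at Time Point 1 and the removal of the pod's IP at Time Point 3. During this window the ALB still routes new requests to a pod that is shutting down or has already stopped, so those requests fail. This is a common issue, as highlighted in a related GitHub issue.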
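
The scenario above applies when pod IPs are registered directly as ALB targets (the controller's IP target mode). A minimal sketch of such an Ingress is shown here for orientation. The names ingress-2048 and service-2048 and the 30-second value are assumptions for illustration, not settings taken from the original setup.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  namespace: game-2048
  name: ingress-2048            # hypothetical name
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    # Register pod IPs (not node ports) as ALB targets.
    alb.ingress.kubernetes.io/target-type: ip
    # Optional: how long the ALB lets in-flight requests drain on a
    # deregistering target. The ALB default is 300 seconds; 30 is an
    # assumed value for illustration.
    alb.ingress.kubernetes.io/target-group-attributes: deregistration_delay.timeout_seconds=30
spec:
  ingressClassName: alb
  rules:
  - http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: service-2048  # hypothetical Service in front of the pods
            port:
              number: 80
```

Note that the deregistration delay only governs draining after Time Point 3. It does not by itself prevent failures during Time Period 4.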

Key Insight: The TERM signal should only be sent after the pod's IP has been removed from the ALB.

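Proposed Solution: This solution is not perfect, but it is a recommended practice. Add a preStop lifecycle hook to the containers in the deployment. Kubernetes runs the hook before it sends the TERM signal, so a hook that sleeps for 5 seconds delays the signal by 5 seconds and gives the operations at Time Points 2 and 3 time to complete. The delay is a fixed estimate rather than a guarantee: if removing the pod's IP takes longer than 5 seconds, some requests can still fail.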

apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: game-2048
  name: deployment-2048
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: app-2048
  replicas: 1
  template:
    metadata:
      labels:
        app.kubernetes.io/name: app-2048
    spec:
      containers:
      - image: public.ecr.aws/l6m2t8p7/docker-2048:latest
        imagePullPolicy: Always
        name: app-2048
        ports:
        - containerPort: 80
        lifecycle:
          preStop:
            exec:
              # Runs before the TERM signal is sent; requires /bin/sh in the image.
              command: ["/bin/sh","-c","sleep 5"]
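
One detail worth checking when choosing the sleep duration: the time spent in the preStop hook counts against the pod's termination grace period (terminationGracePeriodSeconds, 30 seconds by default). The fragment below is a sketch of how the two settings relate. The 15-second sleep and the 45-second grace period are assumed values for illustration, not recommendations from the original setup.

```yaml
# Fragment of the Deployment's pod template; merge into the manifest above.
spec:
  template:
    spec:
      # Total budget for the preStop hook plus the application's own
      # shutdown after TERM. When it runs out, the container is killed.
      # Assumed value: 15s sleep + up to 30s for the app to exit.
      terminationGracePeriodSeconds: 45
      containers:
      - name: app-2048
        lifecycle:
          preStop:
            exec:
              # Assumed longer delay; must stay well below the grace period.
              command: ["/bin/sh", "-c", "sleep 15"]
```

If the sleep is raised without also raising the grace period, the application gets less time to finish in-flight requests after it receives the TERM signal.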