Smooth Rolling Update of Ingress Controller Using AWS NLB - No Service Interruption

In the cloud, to allow external traffic to access your Kubernetes cluster, the typical approach is to use a LoadBalancer-type service, which directs external traffic through a load balancer into the Ingress Controller and then distributes it to various pods.

Ingress controllers regularly need version upgrades or configuration changes, which usually require a restart. So what happens when an Ingress Controller goes through a rolling update? Is Ingress traffic interrupted?

In the following scenario, we self-host a Kubernetes cluster on AWS, use Traefik 2 as the Ingress Controller, and deploy AWS LoadBalancer Controller 2.2.0. Traefik 2 is installed from the official Helm repository, chart version 9.19.1. The NLB spans 3 availability zones and uses IP target mode. Instead of a LoadBalancer-type service, we use a headless ClusterIP-type service, which the AWS LoadBalancer Controller supports. The network plugin is the Amazon VPC CNI plugin.
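
For reference, a minimal sketch of the installation (the repository URL and the release/namespace names are assumptions consistent with the configuration shown later in this post: release traefik2 in kube-system):

text

helm repo add traefik https://helm.traefik.io/traefik
helm repo update
# Install chart version 9.19.1 as release "traefik2" into kube-system,
# using the values.yaml shown at the end of this post
helm install traefik2 traefik/traefik --version 9.19.1 \
  --namespace kube-system -f values.yaml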

NLB operates at Layer 4, directly forwarding traffic without needing to parse the HTTP protocol, resulting in better performance and lower latency.

If the target type is “instance,” then the Kubernetes service type must be either NodePort or LoadBalancer.

If the target type is “ip,” the NLB forwards traffic directly to pod IPs without NAT, which requires the Amazon VPC CNI network plugin.

A LoadBalancer-type service in Kubernetes allocates a node port by default. With target type “instance,” NLB traffic is routed through that node port to the pods, adding response latency. With target type “ip,” a node port is still allocated by default, but the NLB forwards traffic directly to the pods without going through it.

AWS LoadBalancer Controller supports both the “ip” and “instance” target types, whereas the in-tree cloud provider embedded in kube-controller-manager only supports the “instance” type.

Using a headless service avoids allocating node ports and a cluster IP, saving IP addresses and node port resources. Starting from Kubernetes 1.20, LoadBalancer-type services can also be configured not to allocate node ports, allowing the load balancer to talk to pods directly: enable the ServiceLBNodePortControl feature gate and set spec.allocateLoadBalancerNodePorts to false.
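
For comparison, a LoadBalancer-type service without node port allocation would look roughly like this (a sketch, assuming Kubernetes 1.20+ with the ServiceLBNodePortControl feature gate enabled; this post uses the headless ClusterIP approach instead):

text

apiVersion: v1
kind: Service
metadata:
  name: traefik2
  namespace: kube-system
spec:
  type: LoadBalancer
  # Requires the ServiceLBNodePortControl feature gate (alpha in Kubernetes 1.20)
  allocateLoadBalancerNodePorts: false
  selector:
    app.kubernetes.io/instance: traefik2
    app.kubernetes.io/name: traefik
  ports:
  - name: web
    port: 80
    targetPort: web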

service.yaml

text

apiVersion: v1
kind: Service
metadata:
  annotations:
    meta.helm.sh/release-name: traefik2
    meta.helm.sh/release-namespace: kube-system
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-path: /ping
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-port: "8082"
    service.beta.kubernetes.io/aws-load-balancer-name: k8s-traefik2-ingress
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
    service.beta.kubernetes.io/aws-load-balancer-scheme: internal
    service.beta.kubernetes.io/aws-load-balancer-target-group-attributes: deregistration_delay.timeout_seconds=120,
      preserve_client_ip.enabled=true
    service.beta.kubernetes.io/aws-load-balancer-type: external
  creationTimestamp: "2021-06-04T14:37:20Z"
  finalizers:
  - service.k8s.aws/resources
  labels:
    app.kubernetes.io/instance: traefik2
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: traefik
    helm.sh/chart: traefik-9.19.1
  name: traefik2
  namespace: kube-system
  resourceVersion: "734023"
  uid: 65312c00-99dd-4713-869d-8b3cba4032bb
spec:
  clusterIP: None
  clusterIPs:
  - None
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: web
    port: 80
    protocol: TCP
    targetPort: web #This corresponds to 3080 in the pod
  - name: websecure
    port: 443
    protocol: TCP
    targetPort: websecure #This corresponds to 3443 in the pod
  selector:
    app.kubernetes.io/instance: traefik2
    app.kubernetes.io/name: traefik
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer:
    ingress:
    - hostname: k8s-traefik2-ingress-5d6f374f2146c85a.elb.us-east-1.amazonaws.com

When the target type is “ip,” the NLB does not preserve the client IP by default; you have to enable it explicitly via the preserve_client_ip.enabled target group attribute (see the annotation above). The prerequisite for client IP preservation is that all targets are in the same VPC as the NLB, i.e., the Traefik pods run in that VPC.

Alternatively, Traefik 2 supports Proxy Protocol v1 and v2 (Traefik v1 only supports v1) for obtaining the client’s IP address, so you can enable Proxy Protocol on the NLB instead.
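
If you go the Proxy Protocol route, both the NLB and the Traefik entry points need to be configured; a sketch based on the options left commented out in the values.yaml at the end of this post (the trustedIPs CIDR is an illustrative placeholder):

text

# Service annotation: enable Proxy Protocol v2 on the NLB target groups
service.beta.kubernetes.io/aws-load-balancer-proxy-protocol: "*"

# Traefik static configuration: trust Proxy Protocol headers on the web entry point
--entryPoints.web.proxyProtocol.insecure
# or, better, trust only a specific range instead of everything:
# --entryPoints.web.proxyProtocol.trustedIPs=10.52.0.0/16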

traefik-rolling-out-with-nlb

As the diagram shows, the deployment’s rolling update and the NLB’s target registration/deregistration are not synchronized. If an availability zone has no healthy targets within the health check cycle (i.e., all targets in that zone are still going through health checks), that zone’s IP is removed from the NLB’s DNS resolution. If no targets are healthy in any availability zone, the entire NLB becomes unavailable.

AWS LoadBalancer Controller creates target groups with HealthCheckTimeoutSeconds=6, HealthCheckIntervalSeconds=10, HealthyThresholdCount=3, and UnhealthyThresholdCount=3. In practice it takes roughly 3 minutes from pod creation until the target passes health checks and can receive traffic. So if the rolling update replaces all pods within those 3 minutes, the NLB is left without healthy targets and becomes unavailable; if all pods in a single availability zone are still in health checking within that window, that zone’s NLB IP is removed from DNS resolution.
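
If you want targets to become healthy sooner, these defaults can be overridden with the controller’s health check annotations (names as documented for the AWS LoadBalancer Controller, see the references; the values below are illustrative and must respect NLB constraints such as an interval of 10 or 30 seconds):

text

service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval: "10"
service.beta.kubernetes.io/aws-load-balancer-healthcheck-healthy-threshold: "2"
service.beta.kubernetes.io/aws-load-balancer-healthcheck-unhealthy-threshold: "2"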

AWS LoadBalancer Controller can also register the NLB target health as a pod readiness gate. With this configuration, a newly created pod must pass both its readiness probe and the readiness gate (i.e., its NLB target is healthy) before the rolling update proceeds to terminate an old pod. This guarantees there are always available targets to receive traffic during the rolling update. Testing shows it takes approximately 5 minutes from pod creation to the readiness gate passing.

Rolling update process with the readiness gate:

traefik-rolling-out-with-nlb-readinessgate

To configure the readiness gate:

text

# Edit the mutating webhook configuration
kubectl edit mutatingwebhookconfigurations aws-load-balancer-controller-webhook

webhooks:
- name: mpod.elbv2.k8s.aws
  namespaceSelector:
    matchExpressions:
    - key: elbv2.k8s.aws/pod-readiness-gate-inject
      operator: In
      values:
      - enabled
  # Add the following objectSelector so that only labeled pods get the readiness gate
  objectSelector:
    matchLabels:
      elbv2.k8s.aws/pod-readiness-gate-inject: enabled

# Label the kube-system namespace
kubectl label namespace kube-system elbv2.k8s.aws/pod-readiness-gate-inject=enabled

#Set the Traefik pod label in your values.yaml:
deployment:
  podLabels:
    elbv2.k8s.aws/pod-readiness-gate-inject: enabled
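
Once the namespace is labeled and the pod label is in place, newly created Traefik pods carry a target-health.elbv2.k8s.aws readiness gate, which you can verify with kubectl (a quick check, selectors taken from the service above):

text

kubectl get pods -n kube-system \
  -l app.kubernetes.io/name=traefik,app.kubernetes.io/instance=traefik2 -o wide
# The READINESS GATES column stays at 0/1 until the NLB target becomes healthy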

Pod deletion is driven by the ReplicaSet controller, which sends the delete request to the API server; the AWS LoadBalancer Controller only deregisters the target from the NLB after it receives the deletion event. Consequently, pod deletion and NLB deregistration are not synchronized: a pod may already be deleted while it still appears as an available target in the target group, so traffic can still be directed to the deleted pod’s IP.

How can this issue be resolved?

One solution is to add a preStop hook to the pod lifecycle that sleeps for 10 seconds, as sketched below. During the sleep the pod keeps serving traffic while the AWS LoadBalancer Controller, having received the pod deletion event, deregisters the target from the target group (marking it as draining or unused). A 10-second delay should, in theory, be enough for the NLB to stop forwarding new connections to the pod’s IP before Traefik actually shuts down.
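
A minimal sketch of that first attempt (the 10-second value turns out to be too short, as explained below):

text

# In the Traefik deployment's container spec
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 10"]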

As of Traefik Helm Chart version 9.19.1, there is no way to set the pod’s lifecycle hooks through the chart, so you would have to patch the deployment manually or use Traefik’s own parameters, such as transport.lifeCycle.requestAcceptGraceTimeout=10, which keeps accepting connections for the given grace period before shutting down gracefully. Note that the Traefik Helm Chart hardcodes terminationGracePeriodSeconds to 60 seconds, so make sure the sum of requestAcceptGraceTimeout and graceTimeOut stays below terminationGracePeriodSeconds. Traefik’s default transport.lifeCycle.graceTimeOut (the time it waits for active connections to close) is 10 seconds.

However, after configuring this you may notice that the AWS LoadBalancer Controller logs the targets as deregistered almost instantly, yet the deleted pod still receives new connections. This is because the NLB takes an additional 30 to 90 seconds to actually deregister a target; during this time it keeps forwarding new connections to the target and only stops once the deregistration takes effect. Therefore, set requestAcceptGraceTimeout or the sleep duration to at least 90 seconds and raise terminationGracePeriodSeconds manually. Unfortunately, the Traefik Helm Chart does not currently support setting terminationGracePeriodSeconds either; that requires PR411 to be merged.

text

{"level":"info","ts":1623221227.4900389,"msg":"deRegistering targets","arn":"arn:aws:elasticloadbalancing:us-east-1:474981795240:targetgroup/k8s-kubesyst-traefik2-3e41e3e1b5/74840e00b5fce25f","targets
":[{"AvailabilityZone":"us-east-1f","Id":"10.52.79.105","Port":3080}]}
{"level":"info","ts":1623221227.4903376,"msg":"deRegistering targets","arn":"arn:aws:elasticloadbalancing:us-east-1:474981795240:targetgroup/k8s-kubesyst-traefik2-5ce7e766f7/6e7b481251bf928e","targets
":[{"AvailabilityZone":"us-east-1f","Id":"10.52.79.105","Port":3443}]}
{"level":"info","ts":1623221227.524495,"msg":"deRegistered targets","arn":"arn:aws:elasticloadbalancing:us-east-1:474981795240:targetgroup/k8s-kubesyst-traefik2-3e41e3e1b5/74840e00b5fce25f"}
{"level":"info","ts":1623221227.5618162,"msg":"deRegistered targets","arn":"arn:aws:elasticloadbalancing:us-east-1:474981795240:targetgroup/k8s-kubesyst-traefik2-5ce7e766f7/6e7b481251bf928e"}

The corresponding configurations:

text

# Configuration in the pod spec:
lifecycle:
  preStop:
    exec:
      command:
      - /bin/bash
      - -c
      - sleep 90

# Or in Traefik's configuration:
--entryPoints.web.transport.lifeCycle.requestAcceptGraceTimeout=90

# Modify terminationGracePeriodSeconds in the deployment:
terminationGracePeriodSeconds: 120

Process After Setting preStop:

traefik-rolling-out-with-nlb-readinessgate-prestop

Additionally, configure the NLB target group’s deregistration_delay.timeout_seconds so that the NLB waits for still-open connections to close before removing the target from the target group. This value must be greater than or equal to the time the pod needs to exit; AWS recommends 120 seconds. Since Traefik closes all of its connections when it exits, matching the pod’s 120-second termination grace period is enough. To avoid established connections hanging on to a pod that has already terminated, you can also enable deregistration_delay.connection_termination.enabled so the NLB forcibly closes established connections at the end of the deregistration delay.
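
A sketch of the corresponding target-group-attributes annotation (the connection-termination variant is also shown commented out in the values.yaml at the end of this post):

text

service.beta.kubernetes.io/aws-load-balancer-target-group-attributes: "deregistration_delay.timeout_seconds=120, preserve_client_ip.enabled=true, deregistration_delay.connection_termination.enabled=true"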

During graceful shutdown (both the requestAcceptGraceTimeout phase, in which Traefik still accepts new connections, and the graceTimeOut phase, in which it drains established ones), Traefik 2 keeps its entry point ports open, so a TCP port-based NLB health check would still report it as healthy. TCP health checks are therefore inaccurate here.

To accurately reflect Traefik’s state, use Traefik’s /ping health check endpoint instead: during graceful shutdown it returns a 503 status code.
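
You can observe this behavior against a terminating pod (a quick check, assuming the chart’s traefik entry point on port 8082 serves /ping as configured below):

text

# Returns 200 while Traefik is running, 503 once graceful shutdown has started
curl -s -o /dev/null -w "%{http_code}\n" http://<traefik-pod-ip>:8082/ping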

To configure NLB health checks with the /ping endpoint, set the health check type to HTTP and configure the health check URL as /ping. The health check port should match the Traefik entry point port.

In your values.yaml chart configuration, you can set it like this:

text

service:
  annotations:    
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-port: "8082"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-path: "/ping"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-protocol: "HTTP"

To avoid situations where there are no healthy targets in an availability zone during a rolling update, you can follow these steps:

  1. Set maxUnavailable to a value smaller than the smallest number of Traefik pods in any availability zone, and set maxSurge accordingly. With 6 replicas in this setup, maxUnavailable can be 0 and maxSurge 1. Also keep at least 2 pods per zone, because ReplicaSet scale-down does not take pod topology spread into account and could remove all pods in one availability zone; this is tracked in Kubernetes issue #96748.
  2. Use topologySpreadConstraints to balance the number of Traefik pods across availability zones. Starting from Kubernetes 1.18 the EvenPodsSpread feature gate is enabled by default, so no extra configuration is needed, and the scheduler’s built-in defaultConstraints already provide loose spreading:

text

defaultConstraints:
  - maxSkew: 3
    topologyKey: "kubernetes.io/hostname"
    whenUnsatisfiable: ScheduleAnyway
  - maxSkew: 5
    topologyKey: "topology.kubernetes.io/zone"
    whenUnsatisfiable: ScheduleAnyway

These default constraints only guarantee loose balance: the pod count may differ by up to 3 between hosts and by up to 5 between zones.

If you need stricter balancing, configure topologySpreadConstraints manually in spec.template.spec. As of Traefik Helm Chart version 9.19.1 the topologySpreadConstraints PR has not been merged yet, so the chart cannot do this for you.

text

topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: "kubernetes.io/hostname"
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app.kubernetes.io/instance: traefik2
        app.kubernetes.io/name: traefik
  - maxSkew: 1
    topologyKey: "topology.kubernetes.io/zone"
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app.kubernetes.io/instance: traefik2
        app.kubernetes.io/name: traefik

You can also use AWS Auto Scaling groups or cluster autoscaling to ensure enough node capacity is available.

In summary, readiness gates, preStop hooks (or Traefik’s lifecycle settings), topologySpreadConstraints, and maxUnavailable/maxSurge together address target availability and node balance during rolling updates, and enabling the NLB’s client IP preservation (or Proxy Protocol) preserves client IPs.

The full Helm chart values.yaml:

text

deployment:
  replicas: 6
  podLabels:
    elbv2.k8s.aws/pod-readiness-gate-inject: enabled

additionalArguments:
  - "--providers.kubernetesingress.ingressclass=traefik2"
  - "--providers.kubernetescrd.ingressclass=traefik2"
  - "--entryPoints.web.forwardedHeaders.insecure"
  #- "--entryPoints.web.proxyProtocol.insecure"
  - "--api.insecure=true"
  - "--metrics.prometheus=true"
  - "--entryPoints.web.transport.lifeCycle.requestAcceptGraceTimeout=90"

globalArguments:
  - "--global.sendanonymoususage=false"

ports:
  traefik:
    port: 8082
  web:
    port: 3080
  websecure:
    port: 3443

service:
  type: ClusterIP
  spec:
    clusterIP: None
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "external"
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: "ip"
    service.beta.kubernetes.io/aws-load-balancer-scheme: "internal"
    service.beta.kubernetes.io/aws-load-balancer-name: "k8s-traefik2-ingress"
    #service.beta.kubernetes.io/aws-load-balancer-proxy-protocol: "*"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-port: "8082"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-path: "/ping"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-protocol: "HTTP"
    service.beta.kubernetes.io/aws-load-balancer-target-group-attributes: "deregistration_delay.timeout_seconds=120, preserve_client_ip.enabled=true"
    #service.beta.kubernetes.io/aws-load-balancer-target-group-attributes: "deregistration_delay.timeout_seconds=120, preserve_client_ip.enabled=true, deregistration_delay.connection_termination.enabled=true"

podDisruptionBudget:
  enabled: true
  minAvailable: 1

rollingUpdate:
  maxUnavailable: 0
  maxSurge: 1

Reference

https://docs.aws.amazon.com/elasticloadbalancing/latest/network/load-balancer-target-groups.html#target-group-attributes

https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/

https://kubernetes-sigs.github.io/aws-load-balancer-controller/v2.2/guide/service/annotations

https://docs.aws.amazon.com/elasticloadbalancing/latest/network/target-group-health-checks.html

https://docs.aws.amazon.com/elasticloadbalancing/latest/network/network-load-balancers.html

https://stackoverflow.com/a/51471388/6059840

https://github.com/kubernetes/kubernetes/issues/45509

https://github.com/kubernetes/kubernetes/pull/99212

https://github.com/kubernetes/kubernetes/pull/101080

https://github.com/kubernetes/enhancements/issues/2255

https://stackoverflow.com/questions/7665358/how-to-gracefully-shut-down-or-remove-aws-instances-from-an-elb-group

https://stackoverflow.com/a/67203212/6059840

https://serverfault.com/questions/919335/alb-connection-draining-is-always-reaching-the-deregistration-delay

https://aws.amazon.com/premiumsupport/knowledge-center/elb-fix-failing-health-checks-alb/

https://serverfault.com/questions/884501/aws-alb-connection-draining-not-closing-connections-after-deregistration-delay


https://stackoverflow.com/questions/47256085/aws-network-elb-take-4-minutes-to-recognise-target-as-healthy

https://stackoverflow.com/questions/33617090/kubernetes-scale-down-specific-pods
