Why Is HPA Scaling Slow

Recently, during a surge in business activity, we ran into a problem: a sudden increase in resource usage driven by incoming traffic reduced the availability of our services. There were several contributing factors, but one direct cause was that resource usage spiked when traffic surged and the HPA did not scale up in time.

This article aims to investigate this issue and primarily addresses it from three aspects:

  1. How slow is the scaling process?
  2. Why is the scaling process slow?
  3. What are the solutions to this problem?

To illustrate how slow scaling is, we ran HPA scaling tests, recording the replica count and pod resource usage at each moment, and approximated the scaling delay as the gap between the time traffic increased and the time the replica count first increased.

HPA supports several data source types: “Object,” “Pods,” “Resource,” “ContainerResource,” and “External.” For simplicity, we tested with the “Resource” type on Kubernetes 1.23.

Testing method: We prepared an Nginx deployment and service, then performed load testing on this service, recording the number of replicas at each moment.

The Nginx deployment has 2 replicas with a CPU request of 20m, and the HPA target is 20% average utilization of the request:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: nginx
    spec:
      containers:
      - image: nginx:1.18
        imagePullPolicy: IfNotPresent
        name: nginx
        ports:
        - containerPort: 80
          protocol: TCP
        resources:
          requests:
            cpu: 20m
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-deployment
  namespace: default
spec:
  maxReplicas: 10
  metrics:
  - resource:
      name: cpu
      target:
        averageUtilization: 20
        type: Utilization
    type: Resource
  minReplicas: 2
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-deployment
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: nginx
  name: ngx-service
  namespace: default
spec:
  clusterIP: 10.252.211.253
  clusterIPs:
  - 10.252.211.253
  externalTrafficPolicy: Cluster
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - port: 80
    protocol: TCP
    targetPort: 80
  selector:
    app: nginx
  sessionAffinity: None
  type: ClusterIP

We used the ab command to perform load testing on the cluster IP:

# date;ab -n 100000 -c 20 10.252.211.253/;date
Thu Nov  2 13:10:00 CST 2023
.....
Thu Nov  2 13:10:11 CST 2023

In another terminal, we recorded pod metrics:

while :; do date; kubectl get pods.metrics.k8s.io  -l app=nginx; echo; sleep 1; done

Thu Nov  2 13:10:23 CST 2023
NAME                                CPU   MEMORY   WINDOW
nginx-deployment-596d9ffddd-6lrhv   0     9604Ki   17.068s
nginx-deployment-596d9ffddd-w6cm2   0     2060Ki   17.634s

Thu Nov  2 13:10:25 CST 2023
NAME                                CPU          MEMORY   WINDOW
nginx-deployment-596d9ffddd-6lrhv   505634152n   9548Ki   13.763s
nginx-deployment-596d9ffddd-w6cm2   523202787n   2060Ki   13.914s

Thu Nov  2 13:10:27 CST 2023
NAME                                CPU          MEMORY   WINDOW
nginx-deployment-596d9ffddd-6lrhv   505634152n   9548Ki   13.763s
nginx-deployment-596d9ffddd-w6cm2   523202787n   2060Ki   13.914s

In another terminal, we recorded the number of replicas:

# while :; do date; kubectl get deployments.apps  nginx-deployment; sleep 1; echo; done

In another terminal, we recorded HPA resource changes:

kubectl get hpa nginx-deployment -o yaml -w

The complete test records are available at https://gist.github.com/wu0407/ebea8c0ee9ecbc15e94b3122f1a193dc.

  1. Load testing started at 13:10:00 and ended at 13:10:11.
  2. At 13:10:26 the deployment scaled up by 2 replicas (2 → 4); at 13:10:42 by another 4 replicas (4 → 8); and at 13:10:57 by another 2 replicas (8 → 10).
  3. At 13:10:25, pod metrics showed an increase in resource usage.

replicas-result

Since the scale-up threshold is an average CPU usage of 4m per pod (20% of the 20m request), and the --horizontal-pod-autoscaler-tolerance of 0.1 must also be exceeded, it is roughly safe to assume that as soon as requests arrive the average CPU usage of the pods exceeds 4.4m. The scaling delay in this experiment is therefore approximately 26 seconds (13:10:00 to 13:10:26).
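
As a quick check on that trigger point, here is a small Go sketch (using the values from this test; illustrative, not part of any controller code) that derives the effective scale-up threshold from the CPU request, the target utilization, and the tolerance:

package main

import "fmt"

func main() {
    requestMilli := 20.0      // CPU request per pod: 20m
    targetUtilization := 20.0 // HPA target: 20% of the request
    tolerance := 0.1          // --horizontal-pod-autoscaler-tolerance default

    // Average CPU per pod at which the usage ratio is exactly 1.0.
    targetMilli := requestMilli * targetUtilization / 100 // 4m

    // Scaling only starts once the ratio leaves [1-tolerance, 1+tolerance],
    // i.e. average usage must exceed target * (1 + tolerance).
    scaleUpThreshold := targetMilli * (1 + tolerance) // 4.4m

    fmt.Printf("target: %.1fm, scale-up threshold: %.1fm (%.0f%% of request)\n",
        targetMilli, scaleUpThreshold, scaleUpThreshold/requestMilli*100)
}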

Scaling happened in three stages rather than jumping straight to 10 replicas.

Scaling up continued even after the load-test traffic stopped: currentMetrics in the HPA object reported resource usage of 0, yet desiredReplicas and currentReplicas were not equal:

currentMetrics:
  - resource:
      current:
        averageUtilization: 0
        averageValue: "0"
      name: cpu
    type: Resource
  currentReplicas: 8
  desiredReplicas: 10
  lastScaleTime: "2023-11-02T05:10:57Z"

Why does the scaling behavior described above occur, and what causes the scaling delay?

To answer these questions, we first need to understand the scaling mechanism and scaling algorithm of HPA (Horizontal Pod Autoscaler).

The horizontal pod autoscaler controller is part of kube-controller-manager. It obtains resource monitoring data of various types through the API server: the metrics.k8s.io group is served by metrics-server, an Aggregated API Server, while the custom.metrics.k8s.io and external.metrics.k8s.io groups are served by prometheus-adapter, also as Aggregated API Servers.

Here is an architecture diagram of HPA:

hpa-architecture

Image source: Weave Works Blog

By default, the horizontal pod autoscaler controller reconciles each HPA object every 15 seconds: it calculates the desired number of replicas from the monitoring data, and if the desired number differs from the current number, it initiates scaling (up or down).

The process involves the following steps:

  1. Access the API server to obtain monitoring data.
  2. Calculate the desired number of replicas.
  3. Control the scaling behavior.

Different data source types retrieve monitoring data from different API endpoints; the specifics can be found in the comments of the HPA controller code.

The following variables are used in the replicas calculation:

  • ratio: The current metric value relative to the target value.
  • tolerance: The --horizontal-pod-autoscaler-tolerance parameter, which specifies the range for tolerance during scaling and has a default value of 0.1.
  • Replicas: The number of replicas specified in the workload’s scale resource’s spec.replicas.
  • Current Replicas: The current number of replicas in the scale resource’s status.replicas.
  • Desired Replicas: The desired number of replicas.

For “Object”-type data sources:

  1. If the target type is “Value,” then ratio = MetricValue / spec.metrics[*].object.target.value.
    • If spec.replicas is 0, desiredReplicas is rounded up to ratio.
    • If spec.replicas is greater than 0, and ratio is within the range [1-tolerance, 1+tolerance], no scaling is performed, and desiredReplicas remains at spec.replicas. Otherwise, desiredReplicas is rounded up to ratio * readyPodCount.
  2. If the target type is “AverageValue,” then ratio = MetricValue / (spec.metrics[*].object.target.averageValue * status.replicas).
    • If ratio is within the range [1-tolerance, 1+tolerance], no scaling is performed, and desiredReplicas remains at spec.replicas.
    • Otherwise, desiredReplicas is calculated as MetricValue / spec.metrics[*].object.target.averageValue, rounded up.

hpa-object

Here, readyPodCount refers to the number of pods in a ready state.
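
The Object-type calculation above can be condensed into the following Go sketch; the function names and example values are illustrative, not the controller's actual code:

package main

import (
    "fmt"
    "math"
)

// desiredForObjectValue sketches the "Value" target: the single object metric,
// divided by the target, is spread over the ready pods.
func desiredForObjectValue(metricValue, targetValue float64, specReplicas, readyPodCount int32, tolerance float64) int32 {
    ratio := metricValue / targetValue
    if specReplicas == 0 {
        return int32(math.Ceil(ratio))
    }
    if math.Abs(1.0-ratio) <= tolerance {
        return specReplicas // within tolerance: keep the current replica count
    }
    return int32(math.Ceil(ratio * float64(readyPodCount)))
}

// desiredForObjectAverageValue sketches the "AverageValue" target: the object
// metric divided by the replica count should approach the target average.
func desiredForObjectAverageValue(metricValue, targetAverage float64, statusReplicas int32, tolerance float64) int32 {
    ratio := metricValue / (targetAverage * float64(statusReplicas))
    if math.Abs(1.0-ratio) <= tolerance {
        return statusReplicas
    }
    return int32(math.Ceil(metricValue / targetAverage))
}

func main() {
    // A queue-length metric of 90 with 3 replicas and a target averageValue of 10
    // suggests 9 replicas; the same metric against target.value = 30 also yields 9.
    fmt.Println(desiredForObjectAverageValue(90, 10, 3, 0.1)) // 9
    fmt.Println(desiredForObjectValue(90, 30, 3, 3, 0.1))     // 9
}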

For “External”-type data sources, where totalValue is the sum of the values returned for the external metric:

  1. If the target type is “Value,” then ratio = totalValue / spec.metrics[*].external.target.value.
    • If spec.replicas is 0, desiredReplicas is rounded up to ratio.
    • If spec.replicas is greater than 0, and ratio is within the range [1-tolerance, 1+tolerance], no scaling is performed, and desiredReplicas remains at spec.replicas. Otherwise, desiredReplicas is rounded up to ratio * readyPodCount.
  2. If the target type is “AverageValue,” then ratio = totalValue / (spec.metrics[*].external.target.averageValue * status.replicas).
    • If ratio is within the range [1-tolerance, 1+tolerance], no scaling is performed, and desiredReplicas remains at status.replicas.
    • Otherwise, desiredReplicas is calculated as totalValue / spec.metrics[*].external.target.averageValue, rounded up.

hpa-external

Pod categorization is necessary because the status of pods and the presence of monitoring data affect the calculation of replica numbers. Pods are classified into different categories based on their status and monitoring data.

cpuInitializationPeriod: The value of --horizontal-pod-autoscaler-cpu-initialization-period, which is set to 5 minutes by default.

delayOfInitialReadinessStatus: The value of --horizontal-pod-autoscaler-initial-readiness-delay, set to 30 seconds by default.

Pods are categorized based on their status and monitoring data into four groups: “Ready with Monitoring Data,” “Unready Pods,” “Ignored Pods,” and “Missing Pods.”

Unready Pods:

  • Pods with a Phase of “Pending.”
  • For data sources of type “Resource” or “ContainerResource” that monitor CPU:
    • Pods with no “Ready” condition in their status, or with a nil pod.Status.StartTime (the pod hasn’t been taken over by the kubelet yet).
    • Pods still within pod.Status.StartTime plus cpuInitializationPeriod whose ready condition is false.
    • Pods still within pod.Status.StartTime plus cpuInitializationPeriod that are ready, but whose metric timestamp is earlier than readyCondition.LastTransitionTime plus metric.Window (a full metric window hasn’t elapsed since the last readiness transition).
    • Pods past pod.Status.StartTime plus cpuInitializationPeriod whose ready condition is false and whose readyCondition.LastTransitionTime falls within pod.Status.StartTime plus delayOfInitialReadinessStatus.

Missing Pods: Pods that lack monitoring data.

Ignored Pods: Pods that have been deleted or have a Phase of “Failed.”
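
The grouping rules can be summarized with the Go sketch below; the structs and the function are illustrative stand-ins for the controller's internal types, assuming a CPU-based Resource or ContainerResource data source:

package main

import (
    "fmt"
    "time"
)

// podInfo and podMetric carry only the fields the grouping rules need; they are
// not the controller's real types.
type podInfo struct {
    Phase             string     // "Pending", "Running", "Failed", ...
    Deleted           bool       // DeletionTimestamp is set
    StartTime         *time.Time // pod.Status.StartTime
    HasReadyCondition bool
    Ready             bool      // Ready condition status is "True"
    ReadyLastChange   time.Time // Ready condition LastTransitionTime
}

type podMetric struct {
    Timestamp time.Time
    Window    time.Duration
}

// groupPod classifies a pod as "ignored", "unready", "missing" or "ready",
// following the rules described above.
func groupPod(p podInfo, m *podMetric, cpuInitializationPeriod, delayOfInitialReadinessStatus time.Duration, now time.Time) string {
    if p.Deleted || p.Phase == "Failed" {
        return "ignored"
    }
    if p.Phase == "Pending" {
        return "unready"
    }
    if m == nil {
        return "missing" // no monitoring data for this pod
    }
    if !p.HasReadyCondition || p.StartTime == nil {
        return "unready"
    }
    if p.StartTime.Add(cpuInitializationPeriod).After(now) {
        // Still inside the CPU initialization period: unready if not Ready yet, or if
        // a full metric window has not elapsed since the last readiness transition.
        if !p.Ready || m.Timestamp.Before(p.ReadyLastChange.Add(m.Window)) {
            return "unready"
        }
    } else if !p.Ready && p.ReadyLastChange.Before(p.StartTime.Add(delayOfInitialReadinessStatus)) {
        // Past the initialization period, not ready, and it never really became ready.
        return "unready"
    }
    return "ready"
}

func main() {
    now := time.Now()
    start := now.Add(-1 * time.Minute)
    p := podInfo{Phase: "Running", HasReadyCondition: true, Ready: false, StartTime: &start, ReadyLastChange: start}
    m := &podMetric{Timestamp: now, Window: 15 * time.Second}
    fmt.Println(groupPod(p, m, 5*time.Minute, 30*time.Second, now)) // "unready"
}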

The first two data source types above (Object and External) provide aggregate data: multiple pods correspond to a single metric value. The remaining three (Pods, Resource, and ContainerResource) are per-pod: each pod has its own monitoring data, so pod anomalies and missing data can distort the replica calculation. To avoid excessive scaling up and down, the data is corrected as follows:

  1. Calculate the desired number of replicas based on the number of ready pods and existing monitoring data. Determine whether scaling up or down is needed without considering tolerance.
    • For scaling up, correct the monitoring data for pods without monitoring data to 0.
    • For scaling down, correct the monitoring data for pods without monitoring data to the target value set in the HPA object.
  2. If scaling up is required and there are unready pods, correct the monitoring data for unready pods to 0. This is done to prevent new pods with high CPU usage during startup from triggering continuous scaling.
  3. The following table summarizes actions to take based on the presence of unready pods and missing pods:
Direction    unreadyPods > 0   missingPods > 0   Action
Scale up     true              true              Fix unreadyPods and missingPods metric values to 0
Scale up     true              false             Fix unreadyPods metric values to 0
Scale up     false             true              Fix missingPods metric values to 0
Scale up     false             false             No action
Scale down   true              true              Fix missingPods metric values to the target value
Scale down   true              false             No action
Scale down   false             true              Fix missingPods metric values to the target value
Scale down   false             false             No action
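
A minimal Go sketch of these correction rules (illustrative names, not the controller's actual code):

package main

import "fmt"

// fixMetrics sketches the correction rules from the table above. metrics maps pod
// name to metric value for pods that already have data; values are filled in for
// missing and unready pods depending on the scaling direction.
func fixMetrics(metrics map[string]float64, missingPods, unreadyPods []string, scalingUp bool, targetValue float64) {
    for _, pod := range missingPods {
        if scalingUp {
            metrics[pod] = 0 // a pod without data must not push the scale-up further
        } else {
            metrics[pod] = targetValue // treated as exactly on target, so it does not push the scale-down further
        }
    }
    if scalingUp {
        for _, pod := range unreadyPods {
            metrics[pod] = 0 // a starting pod's CPU spike must not cause continuous scale-up
        }
    }
}

func main() {
    metrics := map[string]float64{"pod-a": 8, "pod-b": 9}
    fixMetrics(metrics, []string{"pod-c"}, []string{"pod-d"}, true, 4)
    fmt.Println(metrics) // during a scale-up, pod-c and pod-d are counted as 0
}
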
For “Pods”-type data sources:

  • readyMissingPodMetricsCount is the count of pod metrics after removing ignoredPods and unreadyPods.
  • afterFixMetricsCount is the count of pod metrics after data correction.
  1. Calculate ratio = metricsTotal / (readyMissingPodMetricsCount * spec.metrics[*].pods.target.averageValue).
  2. If there are missingPods and ratio is less than 1 (scaling down), correct the monitoring data for missingPods to spec.metrics[*].pods.target.averageValue, and include missingPods in afterFixMetricsCount.
  3. If there are missingPods and ratio is greater than or equal to 1 (scaling up or no change), correct the monitoring data for missingPods to 0 and include missingPods in afterFixMetricsCount. If there are unreadyPods, correct their monitoring data to 0 and include unreadyPods in afterFixMetricsCount.
  4. If there are no missingPods, the ratio is greater than 1+tolerance (scaling up), and unreadyPods exist, correct the monitoring data for unreadyPods to 0, and include unreadyPods in afterFixMetricsCount.
  5. If there are no missingPods and ratio is greater than 1+tolerance (scaling up) and unreadyPods do not exist, the desiredReplicas is calculated as ratio * readyPodCount, rounded up.
  6. If there are no missingPods and ratio is within the range [1-tolerance, 1+tolerance], no scaling is performed, and the desiredReplicas remains at spec.replicas.
  7. If there are no missingPods and ratio is less than 1-tolerance (scaling down), the desiredReplicas is calculated as ratio * readyPodCount, rounded up.
  8. Recalculate the new ratio: newRatio = afterFixMetricsTotal / (afterFixMetricsCount * spec.metrics[*].pods.target.averageValue).
  9. If the new ratio is within the range [1-tolerance, 1+tolerance], no scaling is performed, and the desiredReplicas remains at spec.replicas.
  10. If the new ratio is greater than 1+tolerance and the previous ratio was less than 1-tolerance (scaling down followed by scaling up), no scaling is performed, and the desiredReplicas remains at spec.replicas.
  11. If the new ratio is less than 1-tolerance and the previous ratio was greater than 1+tolerance (scaling up followed by scaling down), no scaling is performed, and the desiredReplicas remains at spec.replicas.
  12. Calculate the new number of replicas as ceil(afterFixMetricsTotal / spec.metrics[*].pods.target.averageValue).
  13. If the new ratio is greater than 1+tolerance and the previous ratio was also greater than 1+tolerance (scaling up followed by scaling up), and the new number of replicas is less than spec.replicas, no scaling is performed, and the desiredReplicas remains at spec.replicas.
  14. If the new ratio is less than 1-tolerance and the previous ratio was also less than 1-tolerance (scaling down followed by scaling down), and the new number of replicas is greater than spec.replicas, no scaling is performed, and the desiredReplicas remains at spec.replicas.
  15. In all other cases, the desiredReplicas is set to ceil(afterFixMetricsTotal / spec.metrics[*].pods.target.averageValue).

hpa-pods
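
The steps above can be condensed into the following simplified Go sketch; the names are illustrative and several corner cases are omitted, so treat it as an outline of the logic rather than the controller's real implementation:

package main

import (
    "fmt"
    "math"
)

// calcPlainMetricReplicas sketches the per-pod ("Pods"-type) calculation described
// above. metrics holds values only for ready pods that have data; missing and
// unready pods are passed separately.
func calcPlainMetricReplicas(metrics map[string]float64, missingPods, unreadyPods []string,
    specReplicas int32, readyPodCount int, targetAverage, tolerance float64) int32 {

    total := 0.0
    for _, v := range metrics {
        total += v
    }
    ratio := total / (float64(len(metrics)) * targetAverage)

    rebalanceUnready := len(unreadyPods) > 0 && ratio > 1.0
    if !rebalanceUnready && len(missingPods) == 0 {
        if math.Abs(1.0-ratio) <= tolerance {
            return specReplicas // within tolerance: no scaling
        }
        return int32(math.Ceil(ratio * float64(readyPodCount)))
    }

    // Data correction: missing pods count as the target on a scale-down and as 0 on
    // a scale-up; unready pods count as 0 on a scale-up.
    for _, pod := range missingPods {
        if ratio < 1.0 {
            metrics[pod] = targetAverage
        } else {
            metrics[pod] = 0
        }
    }
    if rebalanceUnready {
        for _, pod := range unreadyPods {
            metrics[pod] = 0
        }
    }

    newTotal := 0.0
    for _, v := range metrics {
        newTotal += v
    }
    newRatio := newTotal / (float64(len(metrics)) * targetAverage)

    // Give up if the corrected data is within tolerance or flips the scaling direction.
    if math.Abs(1.0-newRatio) <= tolerance || (ratio < 1.0 && newRatio > 1.0) || (ratio > 1.0 && newRatio < 1.0) {
        return specReplicas
    }

    newReplicas := int32(math.Ceil(newTotal / targetAverage))
    if (ratio < 1.0 && newReplicas > specReplicas) || (ratio > 1.0 && newReplicas < specReplicas) {
        return specReplicas // still a direction flip after rounding: keep the current count
    }
    return newReplicas
}

func main() {
    // Two ready pods well above a target average of 4, plus one unready pod.
    m := map[string]float64{"pod-a": 10, "pod-b": 12}
    fmt.Println(calcPlainMetricReplicas(m, nil, []string{"pod-c"}, 3, 2, 4, 0.1)) // 6
}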

For “Resource”-type data sources with an “AverageValue” target, the process is the same as for the “Pods” type, except that the metric data is obtained differently and the target value is spec.metrics[*].resource.target.averageValue.

  • readyMissingPodMetricsCount is the count of pod metrics after removing ignoredPods and unreadyPods.
  • afterFixMetricsCount is the count of pod metrics after data correction.
  1. Calculate ratio = metricsTotal / (readyMissingPodMetricsCount * spec.metrics[*].resource.target.averageValue).
  2. If there are missingPods and ratio is less than 1 (scaling down), correct the monitoring data for missingPods to spec.metrics[*].resource.target.averageValue, and include missingPods in afterFixMetricsCount.
  3. If there are missingPods and ratio is greater than or equal to 1 (scaling up or no change), correct the monitoring data for missingPods to 0, and include missingPods in afterFixMetricsCount. If there are unreadyPods, correct their monitoring data to 0, and include unreadyPods in afterFixMetricsCount.
  4. If there are no missingPods, the ratio is greater than 1+tolerance (scaling up), and unreadyPods exist, correct the monitoring data for unreadyPods to 0, and include unreadyPods in afterFixMetricsCount.
  5. If there are no missingPods and ratio is greater than 1+tolerance (scaling up) and unreadyPods do not exist, the desiredReplicas is calculated as ratio * readyPodCount, rounded up.
  6. If there are no missingPods and ratio is within the range [1-tolerance, 1+tolerance], no scaling is performed, and the desiredReplicas remains at spec.replicas.
  7. If there are no missingPods and ratio is less than 1-tolerance (scaling down), the desiredReplicas is calculated as ratio * readyPodCount, rounded up.
  8. Recalculate the new ratio: newRatio = afterFixMetricsTotal / (afterFixMetricsCount * spec.metrics[*].resource.target.averageValue).
  9. If the new ratio is within the range [1-tolerance, 1+tolerance], no scaling is performed, and the desiredReplicas remains at spec.replicas.
  10. If the new ratio is greater than 1+tolerance and the previous ratio was less than 1-tolerance (scaling down followed by scaling up), no scaling is performed, and the desiredReplicas remains at spec.replicas.
  11. If the new ratio is less than 1-tolerance and the previous ratio was greater than 1+tolerance (scaling up followed by scaling down), no scaling is performed, and the desiredReplicas remains at spec.replicas.
  12. Calculate the new number of replicas as ceil(afterFixMetricsTotal / spec.metrics[*].resource.target.averageValue).
  13. If the new ratio is greater than 1+tolerance and the previous ratio was also greater than 1+tolerance (scaling up followed by scaling up), and the new number of replicas is less than spec.replicas, no scaling is performed, and the desiredReplicas remains at spec.replicas.
  14. If the new ratio is less than 1-tolerance and the previous ratio was also less than 1-tolerance (scaling down followed by scaling down), and the new number of replicas is greater than spec.replicas, no scaling is performed, and the desiredReplicas remains at spec.replicas.
  15. In all other cases, the desiredReplicas is set to ceil(afterFixMetricsTotal / spec.metrics[*].resource.target.averageValue).

hpa-resource-AverageValue

For “Resource”-type data sources with an “AverageUtilization” target, the ratio calculation becomes: ratio = metricsTotal * 100 / (requestTotal * spec.metrics[*].resource.target.averageUtilization), where requestTotal is the sum of the resource requests of all containers in the pods.

readyMissingPodMetricsCount represents the number of metrics after removing ignoredPods and unreadyPods from all pod metrics.

afterFixMetricsCount represents the number of metrics for pods after data correction.

  1. Calculate ratio = metricsTotal * 100 / (requestTotal * spec.metrics[*].resource.target.averageUtilization).
  2. If there are missingPods and the ratio is less than 1 (scaling down), correct the monitoring data for missingPods to spec.metrics[*].resource.target.averageUtilization, and include missingPods in the afterFixMetricsCount.
  3. If there are missingPods and the ratio is greater than or equal to 1 (scaling up or no change), correct the monitoring data for missingPods to 0, and include missingPods in the afterFixMetricsCount. If there are unreadyPods, correct their monitoring data to 0, and include unreadyPods in the afterFixMetricsCount.
  4. If there are no missingPods, the ratio is greater than 1 + tolerance (scaling up), and there are unreadyPods, correct the monitoring data for unreadyPods to 0, and include unreadyPods in the afterFixMetricsCount.
  5. If there are no missingPods, and the ratio is within the range [1 - tolerance, 1 + tolerance], no scaling is performed, and desiredReplicas is set to spec.replicas.
  6. If there are no missingPods and the ratio is outside the tolerance range (greater than 1 + tolerance with no unreadyPods, or less than 1 - tolerance), desiredReplicas is calculated as ratio * readyPodCount rounded up.
  7. Recalculate the new ratio: newRatio = afterFixMetricsTotal * 100 / (requestTotal * spec.metrics[*].resource.target.averageUtilization).
  8. If the new ratio is within the range [1 - tolerance, 1 + tolerance], no scaling is performed, and desiredReplicas is set to spec.replicas.
  9. If the new ratio is greater than 1 + tolerance and the initial ratio was less than 1 - tolerance (scaling down to scaling up), no scaling is performed, and desiredReplicas is set to spec.replicas.
  10. If the new ratio is less than 1 - tolerance and the initial ratio was greater than 1 + tolerance (scaling up to scaling down), no scaling is performed, and desiredReplicas is set to spec.replicas.
  11. Calculate the new number of replicas: ceil(afterFixMetricsCount * newRatio).
  12. If the new ratio is greater than 1 + tolerance, and the initial ratio was also greater than 1 + tolerance, and the new number of replicas is less than spec.replicas, no scaling is performed, and desiredReplicas is set to spec.replicas.
  13. If the new ratio is less than 1 - tolerance, and the initial ratio was also less than 1 - tolerance, and the new number of replicas is greater than spec.replicas, no scaling is performed, and desiredReplicas is set to spec.replicas.
  14. In all other cases, desiredReplicas is set to ceil(afterFixMetricsCount * newRatio).

hpa-resource-AverageUtilization
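
A small Go sketch of the utilization form of the ratio, using the numbers from the test above (illustrative only; the rest of the flow is the same as the sketch shown for the Pods type):

package main

import (
    "fmt"
    "math"
)

// utilizationRatio sketches the "AverageUtilization" form of the ratio: per-pod
// usage and per-pod requests are summed, and the resulting average utilization
// (as a percentage of the request) is compared against the target percentage.
func utilizationRatio(usage, requests map[string]float64, targetUtilization float64) float64 {
    usageTotal, requestTotal := 0.0, 0.0
    for pod, u := range usage {
        usageTotal += u
        requestTotal += requests[pod]
    }
    return usageTotal * 100 / (requestTotal * targetUtilization)
}

func main() {
    // Numbers from the test above: two pods with a 20m request each, roughly 505m
    // and 523m of CPU usage, and a target utilization of 20%.
    usage := map[string]float64{"pod-a": 505.6, "pod-b": 523.2}
    requests := map[string]float64{"pod-a": 20, "pod-b": 20}
    ratio := utilizationRatio(usage, requests, 20)
    fmt.Printf("ratio=%.2f desired=%d\n", ratio, int(math.Ceil(ratio*2))) // ratio≈128.6, desired=258
}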

The calculation for “ContainerResource”-type data sources with an “AverageValue” target is similar to the “Resource” type with “AverageValue,” except that metricsTotal sums, for each pod, only the metric of the container named in spec.metrics[*].containerResource.container.

readyMissingPodMetricsCount represents the number of metrics after removing ignoredPods and unreadyPods from all pod metrics.

afterFixMetricsCount represents the number of metrics for pods after data correction.

  1. Calculate ratio = metricsTotal / (readyMissingPodMetricsCount * spec.metrics[*].containerResource.target.averageValue).
  2. If there are missingPods and the ratio is less than 1 (scaling down), correct the monitoring data for missingPods to spec.metrics[*].containerResource.target.averageValue, and include missingPods in the afterFixMetricsCount.
  3. If there are missingPods and the ratio is greater than or equal to 1 (scaling up or no change), correct the monitoring data for missingPods to 0, and include missingPods in the afterFixMetricsCount. If there are unreadyPods, correct their monitoring data to 0, and include unreadyPods in the afterFixMetricsCount.
  4. If there are no missingPods, the ratio is greater than 1 + tolerance (scaling up), and there are unreadyPods, correct the monitoring data for unreadyPods to 0, and include unreadyPods in the afterFixMetricsCount.
  5. If there are no missingPods, and the ratio is within the range [1 - tolerance, 1 + tolerance], no scaling is performed, and desiredReplicas is set to spec.replicas.
  6. If there are no missingPods and the ratio is outside the tolerance range (greater than 1 + tolerance with no unreadyPods, or less than 1 - tolerance), desiredReplicas is calculated as ratio * readyPodCount rounded up.
  7. Recalculate the new ratio: newRatio = afterFixMetricsTotal / (afterFixMetricsCount * spec.metrics[*].containerResource.target.averageValue).
  8. If the new ratio is within the range [1 - tolerance, 1 + tolerance], no scaling is performed, and desiredReplicas is set to spec.replicas.
  9. If the new ratio is greater than 1 + tolerance and the initial ratio was less than 1 - tolerance (scaling down to scaling up), no scaling is performed, and desiredReplicas is set to spec.replicas.
  10. If the new ratio is less than 1 - tolerance and the initial ratio was greater than 1 + tolerance (scaling up to scaling down), no scaling is performed, and desiredReplicas is set to spec.replicas.
  11. Calculate the new number of replicas: ceil(afterFixMetricsTotal / spec.metrics[*].containerResource.target.averageValue).
  12. If the new ratio is greater than 1 + tolerance, and the initial ratio was also greater than 1 + tolerance, and the new number of replicas is less than spec.replicas, no scaling is performed, and desiredReplicas is set to spec.replicas.
  13. If the new ratio is less than 1 - tolerance, and the initial ratio was also less than 1 - tolerance, and the new number of replicas is greater than spec.replicas, no scaling is performed, and desiredReplicas is set to spec.replicas.
  14. In all other cases, desiredReplicas is set to ceil(afterFixMetricsCount * newRatio).

hpa-ContainerResource-AverageValue

The calculation process for ContainerResource type data sources with the type set to AverageUtilization is similar to the Resource type with AverageUtilization.

Here, requestTotal represents the resource requests of the container named in spec.metrics[*].containerResource.container, summed across the pods.

readyMissingPodMetricsCount represents the number of metrics after removing ignoredPods and unreadyPods from all pod metrics.

afterFixMetricsCount represents the number of metrics for pods after data correction.

  1. Calculate ratio = metricsTotal * 100 / (requestTotal * spec.metrics[*].containerResource.target.averageUtilization).
  2. If there are missingPods and the ratio is less than 1 (scaling down), correct the monitoring data for missingPods to spec.metrics[*].containerResource.target.averageUtilization, and include missingPods in the afterFixMetricsCount.
  3. If there are missingPods and the ratio is greater than or equal to 1 (scaling up or no change), correct the monitoring data for missingPods to 0, and include missingPods in the afterFixMetricsCount. If there are unreadyPods, correct their monitoring data to 0, and include unreadyPods in the afterFixMetricsCount.
  4. If there are no missingPods, the ratio is greater than 1 + tolerance (scaling up), and there are unreadyPods, correct the monitoring data for unreadyPods to 0, and include unreadyPods in the afterFixMetricsCount.
  5. If there are no missingPods, and the ratio is within the range [1 - tolerance, 1 + tolerance], no scaling is performed, and desiredReplicas is set to spec.replicas.
  6. If there are no missingPods and the ratio is outside the tolerance range (greater than 1 + tolerance with no unreadyPods, or less than 1 - tolerance), desiredReplicas is calculated as ratio * readyPodCount rounded up.
  7. Recalculate the new ratio: newRatio = afterFixMetricsTotal * 100 / (requestTotal * spec.metrics[*].containerResource.target.averageUtilization).
  8. If the new ratio is within the range [1 - tolerance, 1 + tolerance], no scaling is performed, and desiredReplicas is set to spec.replicas.
  9. If the new ratio is greater than 1 + tolerance and the initial ratio was less than 1 - tolerance (scaling down to scaling up), no scaling is performed, and desiredReplicas is set to spec.replicas.
  10. If the new ratio is less than 1 - tolerance and the initial ratio was greater than 1 + tolerance (scaling up to scaling down), no scaling is performed, and desiredReplicas is set to spec.replicas.
  11. Calculate the new number of replicas: ceil(afterFixMetricsCount * newRatio).
  12. If the new ratio is greater than 1 + tolerance, and the initial ratio was also greater than 1 + tolerance, and the new number of replicas is less than spec.replicas, no scaling is performed, and desiredReplicas is set to spec.replicas.
  13. If the new ratio is less than 1 - tolerance, and the initial ratio was also less than 1 - tolerance, and the new number of replicas is greater than spec.replicas, no scaling is performed, and desiredReplicas is set to spec.replicas.
  14. In all other cases, desiredReplicas is set to ceil(afterFixMetricsCount * newRatio).

hpa-ContainerResource-AverageUtilization

Once the previous process is completed, you obtain an expected number of replicas. However, this number is not the final number calculated by the HPA controller. It needs to go through scaling behavior control policies to determine the ultimate number of replicas.

Scaling behavior policies control the speed of scaling to prevent rapid and unstable scaling.

Scaling behavior control is divided into two types: when spec.behavior is not set in the HPA object (default scaling behavior) and when spec.behavior is set.

downscaleStabilisationWindow: Value is set to --horizontal-pod-autoscaler-downscale-stabilization, which is 5 minutes by default.

  1. Record the number of replicas calculated by the previous process and the time it was executed in memory.
  2. Find the maximum number of replicas in the downscaleStabilisationWindow window from memory, denoted as stabilizedRecommendation.
  3. The scale-up limit for this round, scaleUpLimit, is calculated as max(2*spec.replicas, 4).
  4. Clamp the stabilizedRecommendation (cap it at the upper limit if greater, raise it to the lower limit if less) to obtain the final number of replicas. The upper limit is min(scaleUpLimit, hpa.spec.maxReplicas), and the lower limit is minReplicas (hpa.spec.minReplicas, defaulting to 1 if not set), so the final replica count falls within [minReplicas, min(max(2*spec.replicas, 4), hpa.spec.maxReplicas)] (see the sketch below).
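
The default behavior control can be sketched as follows; recommendationsInWindow stands for the recommendations recorded in memory within the downscale stabilization window (including the one just computed), and the example values reproduce the 2 → 4 → 8 → 10 progression observed in the test:

package main

import "fmt"

// normalizeDefault sketches the default behavior control (no spec.behavior set).
// The result is capped by min(max(2*specReplicas, 4), maxReplicas) and floored at
// minReplicas.
func normalizeDefault(recommendationsInWindow []int32, specReplicas, minReplicas, maxReplicas int32) int32 {
    stabilized := recommendationsInWindow[0]
    for _, r := range recommendationsInWindow[1:] {
        if r > stabilized {
            stabilized = r
        }
    }

    scaleUpLimit := 2 * specReplicas
    if scaleUpLimit < 4 {
        scaleUpLimit = 4
    }
    upper := maxReplicas
    if scaleUpLimit < upper {
        upper = scaleUpLimit
    }

    if stabilized > upper {
        stabilized = upper
    }
    if stabilized < minReplicas {
        stabilized = minReplicas
    }
    return stabilized
}

func main() {
    // First round of the test: the metric-based recommendation is 258, but with
    // spec.replicas = 2 the per-round limit is max(2*2, 4) = 4, so the HPA goes 2 -> 4.
    fmt.Println(normalizeDefault([]int32{258}, 2, 2, 10)) // 4
    // Following rounds: 4 -> 8 (limit max(2*4, 4) = 8), then 8 -> 10 (capped by maxReplicas).
    fmt.Println(normalizeDefault([]int32{258}, 4, 2, 10)) // 8
    fmt.Println(normalizeDefault([]int32{258}, 8, 2, 10)) // 10
}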

When spec.behavior is set, the HPA controller also records every scaling event (the change in replica count and the time it occurred) for the corresponding workload in memory, and the process is as follows:

  1. Record the desiredReplicas and execution time from the previous process in memory.

  2. Find the minimum number of replicas, upRecommendation, within the hpa.spec.behavior.scaleUp.stabilizationWindowSeconds window (including desiredReplicas).

  3. Find the maximum number of replicas, downRecommendation, within the hpa.spec.behavior.scaleDown.stabilizationWindowSeconds window (including desiredReplicas).

  4. Normalize the spec.replicas to obtain the stabilized window’s number of replicas, stabilizedRecommendation.

    • If spec.replicas is greater than downRecommendation, then stabilizedRecommendation is set to downRecommendation.

    • If spec.replicas is less than upRecommendation, then stabilizedRecommendation is set to upRecommendation.

    • In summary, stabilizedRecommendation falls within the range [upRecommendation, downRecommendation]: scale-up is only possible if spec.replicas is less than the minimum recommendation in the hpa.spec.behavior.scaleUp.stabilizationWindowSeconds window, and scale-down is only possible if spec.replicas is greater than the maximum recommendation in the hpa.spec.behavior.scaleDown.stabilizationWindowSeconds window.

  5. In the case of scaling up (stabilizedRecommendation is greater than spec.replicas):

    • If hpa.spec.behavior.scaleUp.selectPolicy is set to Disabled, no scaling is performed, and the final number of replicas is set to spec.replicas.

    • If hpa.spec.behavior.scaleUp.selectPolicy is set to Max, the following steps are performed for each policy in hpa.spec.behavior.scaleUp.policies:

      • Find the cumulative change in the number of replicas, replicasAddedInCurrentPeriod, within the policy.periodSeconds policy window. The number of replicas at the start of the window is periodStartReplicas = spec.replicas - replicasAddedInCurrentPeriod.

      • If the policy type policy.Type is “Pods,” the upper limit for the policy window is policyLimit = periodStartReplicas + policy.Value.

      • If the policy type policy.Type is “Percent,” the upper limit for the policy window is policyLimit = Ceil(periodStartReplicas * (1 + policy.Value/100)), rounded up.

      • The maximum scaling limit within this window, scaleUpLimit, is the maximum of all the policy window upper limits: scaleUpLimit = max(policyLimit1, policyLimit2, ...).

    • If hpa.spec.behavior.scaleUp.selectPolicy is set to Min, the following steps are performed for each policy in hpa.spec.behavior.scaleUp.policies:

      • Find the cumulative change in the number of replicas, replicasAddedInCurrentPeriod, within the policy.periodSeconds policy window. The number of replicas at the start of the window is periodStartReplicas = spec.replicas - replicasAddedInCurrentPeriod.

      • If the policy type policy.Type is “Pods,” the upper limit for the policy window is policyLimit = periodStartReplicas + policy.Value.

      • If the policy type policy.Type is “Percent,” the upper limit for the policy window is policyLimit = Ceil(periodStartReplicas * (1 + policy.Value/100)), rounded up.

      • The maximum scaling limit within this window, scaleUpLimit, is the minimum of all the policy window upper limits: scaleUpLimit = min(policyLimit1, policyLimit2, ...).

    • The final number of replicas is calculated as min(stabilizedRecommendation, min(scaleUpLimit, hpa.Spec.maxReplicas)).

  6. In the case of scaling down (stabilizedRecommendation is less than spec.replicas):

    • If hpa.spec.behavior.scaleDown.selectPolicy is set to Disabled, no scaling is performed, and the final number of replicas is set to spec.replicas.

    • If hpa.spec.behavior.scaleDown.selectPolicy is set to Max, the following steps are performed for each policy in hpa.spec.behavior.scaleDown.policies:

      • Find the cumulative number of replicas removed, replicasRemovedInCurrentPeriod, within the policy.periodSeconds policy window. The number of replicas at the start of the window is periodStartReplicas = spec.replicas + replicasRemovedInCurrentPeriod.

      • If the policy type policy.Type is “Pods,” the lower limit for the policy window is policyLimit = periodStartReplicas - policy.Value.

      • If the policy type policy.Type is “Percent,” the lower limit for the policy window is policyLimit = Ceil(periodStartReplicas * (1 - policy.Value/100)), rounded up.

      • The scale-down limit for this round, scaleDownLimit, is the minimum of all the policy window lower limits (Max selects the policy allowing the largest change): scaleDownLimit = min(policyLimit1, policyLimit2, ...).

    • If hpa.spec.behavior.scaleDown.selectPolicy is set to Min, the following steps are performed for each policy in hpa.spec.behavior.scaleDown.policies:

      • Find the cumulative number of replicas removed, replicasRemovedInCurrentPeriod, within the policy.periodSeconds policy window. The number of replicas at the start of the window is periodStartReplicas = spec.replicas + replicasRemovedInCurrentPeriod.

      • If the policy type policy.Type is “Pods,” the lower limit for the policy window is policyLimit = periodStartReplicas - policy.Value.

      • If the policy type policy.Type is “Percent,” the lower limit for the policy window is policyLimit = Ceil(periodStartReplicas * (1 - policy.Value/100)), rounded up.

      • The scale-down limit for this round, scaleDownLimit, is the maximum of all the policy window lower limits (Min selects the policy allowing the smallest change): scaleDownLimit = max(policyLimit1, policyLimit2, ...).

    • The final number of replicas is calculated as max(stabilizedRecommendation, max(scaleDownLimit, hpa.spec.minReplicas)).
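
The per-policy scale-up limit can be sketched as follows; replicasAdded is a stand-in for looking up the controller's in-memory scale events, and the policy values are examples:

package main

import (
    "fmt"
    "math"
)

// scalingPolicy mirrors one entry of hpa.spec.behavior.scaleUp.policies.
type scalingPolicy struct {
    Type          string // "Pods" or "Percent"
    Value         int32
    PeriodSeconds int32
}

// scaleUpLimitWithPolicies sketches the per-round scale-up limit: for each policy,
// replicasAdded(periodSeconds) returns the number of replicas already added inside
// that policy's window, and the per-policy limits are combined with max
// (selectPolicy: Max) or min (selectPolicy: Min).
func scaleUpLimitWithPolicies(specReplicas int32, policies []scalingPolicy,
    replicasAdded func(periodSeconds int32) int32, selectPolicy string) int32 {

    var limit int32
    if selectPolicy == "Min" {
        limit = math.MaxInt32
    } else {
        limit = math.MinInt32
    }
    for _, p := range policies {
        periodStart := specReplicas - replicasAdded(p.PeriodSeconds)
        var proposed int32
        switch p.Type {
        case "Pods":
            proposed = periodStart + p.Value
        case "Percent":
            proposed = int32(math.Ceil(float64(periodStart) * (1 + float64(p.Value)/100)))
        }
        if selectPolicy == "Min" {
            if proposed < limit {
                limit = proposed
            }
        } else if proposed > limit {
            limit = proposed
        }
    }
    return limit
}

func main() {
    policies := []scalingPolicy{
        {Type: "Pods", Value: 4, PeriodSeconds: 60},
        {Type: "Percent", Value: 100, PeriodSeconds: 60},
    }
    noHistory := func(int32) int32 { return 0 } // nothing scaled yet in the window
    // With 10 current replicas: the Pods policy allows 14, the Percent policy allows 20; Max picks 20.
    fmt.Println(scaleUpLimitWithPolicies(10, policies, noHistory, "Max")) // 20
}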

Slow scaling involves three aspects: the response time, the number of replicas added during each scaling event (the scaling speed), and the sensitivity of the scaling process.

The metrics-server scrapes the kubelet periodically (every 15 seconds by default), cadvisor inside the kubelet collects container stats on its own cycle (30 seconds), and the HPA controller recalculates workload replicas every 15 seconds.

For “Resource” and “ContainerResource” data source types, the scaling response delay therefore ranges from 0 to roughly 60 seconds (15s HPA sync + 15s metrics-server scrape + 30s cadvisor collection) in the worst case.

For the other data source types, the delay additionally depends on the monitoring pipeline (e.g., Prometheus, VictoriaMetrics) and ranges from 0 to 15 seconds plus the monitoring system's collection interval.

With the HPA controller's replica calculation understood, we know the final replica count is determined by the monitoring data together with scaling behavior control. Returning to the initial test, let's examine the status field of the HPA resource the first time it changes after the load test begins: the monitored averageUtilization is 2575 and averageValue is 515m.

Based on this data, the expected number of replicas is 258 = ceil(spec.replicas * averageValue * 100 / (request * target.averageUtilization)) = ceil(2 * 515 * 100 / (20 * 20)) = ceil(257.5).

This recommendation is recorded in memory; the final count is bounded above by hpa.spec.maxReplicas (10). Since hpa.spec.behavior is not configured, the scale-up limit for this round is 4 = max(2 * spec.replicas, 4) = max(4, 4), so the replica count after the first scaling operation is 4:

status:
  conditions:
  - lastTransitionTime: "2023-11-02T03:27:06Z"
    message: the HPA controller was able to update the target scale to 4
    reason: SucceededRescale
    status: "True"
    type: AbleToScale
  - lastTransitionTime: "2023-11-02T03:37:07Z"
    message: the HPA was able to successfully calculate a replica count from cpu resource
      utilization (percentage of request)
    reason: ValidMetricFound
    status: "True"
    type: ScalingActive
  - lastTransitionTime: "2023-11-02T05:01:38Z"
    message: the desired replica count is increasing faster than the maximum scale
      rate
    reason: ScaleUpLimit
    status: "True"
    type: ScalingLimited
  currentMetrics:
  - resource:
      current:
        averageUtilization: 2575
        averageValue: 515m
      name: cpu
    type: Resource
  currentReplicas: 2
  desiredReplicas: 4
  lastScaleTime: "2023-11-02T05:10:26Z"

The second time the HPA resource changes, averageUtilization and averageValue in the status are both 0, so the replica count computed from the metrics is low (bounded below by minReplicas) and is recorded in memory. However, the maximum recommendation within the stabilization window is still 10, so scale-up continues; the scale-up limit for this round is 8 = max(2 * spec.replicas, 4) = max(8, 4), which becomes the final replica count. Subsequent rounds behave the same way, so we won't analyze them here.

status:
  conditions:
  - lastTransitionTime: "2023-11-02T03:27:06Z"
    message: the HPA controller was able to update the target scale to 8
    reason: SucceededRescale
    status: "True"
    type: AbleToScale
  - lastTransitionTime: "2023-11-02T03:37:07Z"
    message: the HPA was able to successfully calculate a replica count from cpu resource
      utilization (percentage of request)
    reason: ValidMetricFound
    status: "True"
    type: ScalingActive
  - lastTransitionTime: "2023-11-02T05:01:38Z"
    message: the desired replica count is increasing faster than the maximum scale
      rate
    reason: ScaleUpLimit
    status: "True"
    type: ScalingLimited
  currentMetrics:
  - resource:
      current:
        averageUtilization: 0
        averageValue: "0"
      name: cpu
    type: Resource
  currentReplicas: 4
  desiredReplicas: 8
  lastScaleTime: "2023-11-02T05:10:41Z"

The --horizontal-pod-autoscaler-tolerance parameter controls the acceptable fluctuation range during scaling. It exists to prevent unexpected scaling caused by jitter in the monitoring data, but it also reduces the sensitivity of scaling. The default value is 0.1, meaning a 10% deviation from the target is tolerated.

For example, in the scenario above, scaling up only occurs once the average CPU utilization of the pods exceeds 22% of the request.

In version 1.23, the HPA controller runs only one goroutine to handle all HPA resources in the cluster, creating a performance bottleneck in clusters with many HPA objects. Version 1.26 therefore introduced the --concurrent-horizontal-pod-autoscaler-syncs command-line option to configure the number of goroutines (PR #108501).

Because the monitoring data of unready pods is corrected to 0 during a scale-up, the scaling speed can also be slowed while new pods are still starting.

The shorter the time needed for a pod to become ready, the faster the scaling. The time from pod startup to readiness depends on pod scheduling, the kubelet reacting to the scheduling decision, image pulling, container creation, application startup, and application readiness.

Constituents-of-lag-in-autoscaling

Image Source: medium.com/expedia-group-tech

  1. Shorten the Monitoring Chain Length

    • Shortening the monitoring chain reduces the scaling response time.
    • Projects such as Knative and KEDA also handle horizontal scaling. Knative shortens the monitoring chain to address scaling response time and supports QPS- and TPS-based scaling for near-instant elasticity.
    • KEDA replaces prometheus-adapter to provide external and custom metrics, but keeps the HPA mechanism and focuses on the sensitivity of event-driven scaling (which Knative also supports).
  2. Shorten Pod Ready Time

  3. HPA Controller Performance Improvement

    • Enhance HPA controller performance by using Kubernetes 1.26 or above, which supports multiple worker goroutines, and give kube-controller-manager sufficient resources.
  4. Predictive Scaling or Scheduled Scaling

    • Take a different approach: scale proactively, on a schedule or predictively.
    • Predictive scaling based on historical traffic can be implemented with tools such as Crane's EHPA, Kapacity by Ant Financial, and AHPA by Alibaba Cloud.
  5. Configuring Sensible Scaling Behavior Policies

    If the behavior field is not set, the scaling quantity during each scaling operation is constrained by the maximum replicas within the downscaleStabilisationWindow window and the current replica count spec.replicas (the maximum scaling limit scaleUpLimit is max(2*spec.replicas, 4)).

    Therefore, without setting the behavior field, it is essentially impossible to increase the scaling speed: downscaleStabilisationWindow (the window over which the maximum recommendation is taken) exists to prevent replica counts from flapping when a spike is followed by a drop, and spec.replicas is fixed within a single reconcile cycle of the HPA object.

    Since the configuration of the behavior field affects scaling, a reasonable configuration of the behavior field can improve scaling speed.

    1. Increase scale-up speed: decrease hpa.spec.behavior.scaleUp.stabilizationWindowSeconds (0 by default if not set), increase hpa.spec.behavior.scaleUp.policies[*].value, and decrease hpa.spec.behavior.scaleUp.policies[*].periodSeconds (see the sketch after this list).
    2. Increase scale-down speed: decrease hpa.spec.behavior.scaleDown.stabilizationWindowSeconds (defaults to the value of --horizontal-pod-autoscaler-downscale-stabilization if not set), increase hpa.spec.behavior.scaleDown.policies[*].value, and decrease hpa.spec.behavior.scaleDown.policies[*].periodSeconds.
  6. Configuring Sensible Tolerance

    Set a reasonable value for --horizontal-pod-autoscaler-tolerance. It is a double-edged sword; if not adjusted properly, it can lead to frequent scaling behaviors.
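
As a concrete illustration of point 5, the sketch below builds a scaleUp behavior that reacts faster than the default (no scale-up stabilization window, and up to 8 pods or 100% more every 15 seconds). It assumes the k8s.io/api/autoscaling/v2 types are available in your module; the numbers are examples to adapt, not universal recommendations:

package main

import (
    "fmt"

    autoscalingv2 "k8s.io/api/autoscaling/v2"
)

func main() {
    window := int32(0)                                  // no scale-up stabilization window
    selectPolicy := autoscalingv2.MaxChangePolicySelect // take the most permissive policy

    behavior := autoscalingv2.HorizontalPodAutoscalerBehavior{
        ScaleUp: &autoscalingv2.HPAScalingRules{
            StabilizationWindowSeconds: &window,
            SelectPolicy:               &selectPolicy,
            Policies: []autoscalingv2.HPAScalingPolicy{
                // Allow adding up to 8 pods, or doubling, every 15 seconds.
                {Type: autoscalingv2.PodsScalingPolicy, Value: 8, PeriodSeconds: 15},
                {Type: autoscalingv2.PercentScalingPolicy, Value: 100, PeriodSeconds: 15},
            },
        },
    }
    fmt.Printf("%+v\n", behavior)
}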

Scaling delay comes from the long monitoring data acquisition chain, HPA controller performance, the time it takes pods to become ready, and the hpa.spec.behavior configuration; it can be addressed by shortening the monitoring chain, optimizing pod ready time, tuning the behavior policies, or adopting proactive scaling strategies.

To achieve faster pod ready times, optimization should target every stage from pod creation to readiness. Projects such as Knative, Crane's EHPA, and Kapacity offer different solutions to these challenges.

How We Build Production-Grade HPA: From Effective Algorithm to Risk-Free Autoscaling - Ziqiu Zhu & Yiru Guo, Ant Group

Autoscaling in Kubernetes: Why doesn’t the Horizontal Pod Autoscaler work for me?

Kubernetes 1.27: updates on speeding up Pod startup
