The Horizontal Pod Autoscaler (HPA) scales the pods within a given Deployment or ReplicaSet. It increases or decreases the number of pod replicas based on traffic and load conditions; typical metrics include CPU utilization or custom metrics. The HPA is implemented as an API resource and a controller. Replica adjustment is the job of the controller, which bases its actions on the metric being used to autoscale, e.g. CPU, memory, or number of connections.
HPA is the most widely used form of application workload scaling in Kubernetes and K8s-based platforms. Support for it comes from two components: the HorizontalPodAutoscaler resource and a controller embedded in the kube-controller-manager. The most common metrics used to autoscale workloads are CPU and memory. The Kubernetes Metrics Server provides the pod metrics the HPA consumes: it collects CPU and memory usage from the kubelets and exposes them through an API.
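As a concrete illustration, below is a minimal HorizontalPodAutoscaler manifest that scales on CPU utilization. The Deployment name (web) and the replica bounds are assumptions chosen for the example.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web              # assumed Deployment name for illustration
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # scale when mean CPU exceeds 70% of requests
```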
The HPA Scaling Algorithm
The autoscaler algorithm works as described below.
The HPA is internally implemented as a control loop whose refresh period is set by the controller manager's --horizontal-pod-autoscaler-sync-period flag. The default value is 15 seconds.
Every 15 seconds, the controller manager queries the metrics specified in each HorizontalPodAutoscaler definition. It obtains them from either the resource metrics API or, for application-specific metrics, the custom metrics API. If a target utilization value is set in the definition, the controller calculates the utilization as a percentage of the equivalent resource request on the containers running in the pod. It then takes the mean utilization across all targeted pods and uses it to arrive at a ratio, which it uses to scale the number of target replicas. The HPA controller thus maintains a constant watch on the HorizontalPodAutoscaler resource type. The application is usually defined by a Deployment, and when the algorithm determines that the replica count needs to be adjusted, it updates the relevant Deployment object via the API server. The Deployment controller in turn updates the ReplicaSet, leading to a change in the number of pods.
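The ratio calculation at the heart of the loop is the one documented for the HPA:

```
desiredReplicas = ceil[ currentReplicas * ( currentMetricValue / desiredMetricValue ) ]
```

For example, with 2 replicas, a measured mean CPU utilization of 90% and a target of 45%, the controller computes ceil(2 × 90/45) = 4 replicas. If the measured value were instead 20%, it would compute ceil(2 × 20/45) = 1 replica, subject to the configured minReplicas bound.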
With dynamic workloads, scaling thresholds can be exceeded and then met again in quick succession, causing the number of replicas to fluctuate rapidly. This is called thrashing. Beginning with Kubernetes 1.12, the HPA controller mitigates this problem with a downscale stabilization window, set by the --horizontal-pod-autoscaler-downscale-stabilization flag. The default is 5 minutes.
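In the autoscaling/v2 API, the stabilization window can also be tuned per HPA rather than cluster-wide. The fragment below is a sketch that would sit under an HPA's spec and lengthen the scale-down window to 10 minutes; the value is an assumption for illustration.

```yaml
spec:
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 600  # wait 10 minutes of stable metrics before scaling down
```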
Key Considerations When Using HPA
- While the HPA is the most commonly used autoscaling method, its effectiveness depends on the application being scaled. Stateful workloads and application runtimes that rely on leader election typically won't work with HPA.
- HPA typically needs to be paired with cluster autoscaling to work well; newly created pods can only be scheduled if the cluster has node capacity for them.
- Use custom metrics such as I/O requests per second or HTTP requests per second, which are often more relevant to an application than CPU or memory.
- Configure cooldowns appropriately so that pods don't keep coming online and going offline due to thrashing.
- HPA needs the Metrics Server installed for resource metrics; custom metrics require an adapter that implements the custom metrics API, such as the Prometheus Adapter on clusters that already run Prometheus (see the example after this list).
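As referenced above, a custom-metrics HPA might look like the sketch below. It assumes the Prometheus Adapter is installed and exposes a per-pod metric named http_requests_per_second; the Deployment name, metric name, and target value are assumptions for illustration.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa-custom
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second   # assumed metric exposed by the Prometheus Adapter
      target:
        type: AverageValue
        averageValue: "100"              # scale so each pod handles ~100 requests per second
```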
Conclusion
While the concept is straightforward, horizontal pod autoscaling requires time and effort to implement well. More often than not, applications need to undergo load testing to determine the best scaling thresholds and capacity management settings for the HPA configuration. The next blog post will discuss Vertical Pod Autoscaling.