Halving Kubernetes Compute Costs With Vertical Pod Autoscaler

Garrett Sweeney · Published in Compass True North · Aug 31, 2022

A Focus on Features

Compass has grown rapidly over the last couple of years. After becoming the largest independent brokerage by 2021 closed sales volume (as of March 25, 2022), our tech team has focused heavily on shipping products that improve agent productivity. From a compute platform perspective, the Cloud Engineering team has been working to organize and scale our operations effectively as the number of clusters and the number of applications per cluster increase. To put our growth in perspective, the number of Deployments in our production Kubernetes clusters nearly doubled in the past year alone.

Scaling from 500 to 1000 Kubernetes Deployments

This growth put a focus on better governance: following a consistent production deployment schedule, building visibility into our daily cluster updates, safely upgrading core cluster components, and widening the scope of our cluster policies. While our clusters use Cluster Autoscaler to scale dynamically as applications request more or fewer resources, one area our platform had not addressed was rightsizing the application requests themselves. Below is our experience identifying the opportunity to rightsize, building safeguards so we could implement a solution safely, rolling out a vertical scaling solution, and finally, our results!

Identifying an Opportunity to Improve

As our feature set grew, less attention went to appropriately sizing the CPU and memory requested by workloads deployed to our Kubernetes clusters. Some product teams would visit weekly Office Hours to discuss resource requirements, and while our infrastructure teams had general guidelines based on the app, there were still many variables. The metric below shows the scale of the problem in one of our non-production clusters, where applications were using less than one-tenth of the CPU they requested.

Resource Requests & Usage from a non-production cluster

Before panicking, it may make sense to add some context around why this came to be. As I mentioned, Compass has been in a high-growth environment for the past couple of years, and the number of applications has scaled substantially. The focus has been on shipping more and better products for our agents.

One cause of the gap between requests and usage is the outdated default CPU and memory requests applied to any new application created through our templating service. These preset quantities may have been based on heavily used applications, and a freshly spun-up application rarely needs that much. It has become clear that some of these templated values need to be revisited.

Another cause is that application teams sometimes try to keep parity between non-production and production. Non-production is likely to need fewer resources than production, but it isn’t always that simple: teams often run load tests in lower environments, and in those cases it makes sense to size beta applications like production so the results are realistic.

A final cause is a blurred line of responsibility between product teams and infrastructure teams. Product teams are expected to know less about the infrastructure, since one goal of an Infrastructure organization is to abstract it away. At the same time, we give product teams full flexibility in their Kubernetes workload manifests, which leaves ownership of application resource requests ambiguous.

Regardless of how we got here, the Cloud Engineering team took action to pull requests closer to actual usage and reduce the cost of our Kubernetes clusters.

Exploring Implementation Options

After some initial research, we narrowed our approach to four different options:

  1. Horizontal Pod Autoscaler
  2. Vertical Pod Autoscaler
  3. Multi-Dimensional Scaling
  4. Rightsizing Manually

Horizontal Pod Autoscaler had some solid pros: it is stable in recent Kubernetes versions, and we had already experimented with it at Compass a year earlier. Yet when we assessed our applications, many were already running only two pods, the minimum number of replicas we would scale down to for redundancy. After some back-of-the-envelope math, we concluded that HPA would be most effective for us in production, where workloads run more replicas, but would deliver very limited results in non-production.

Vertical Pod Autoscaler looked like a better fit for our non-production use case. It lets us scale requests based on past usage and adjust them dynamically as an application’s needs change over time. The downside is that it is still in beta, so our team would have to make tradeoffs that introduce new risks, which I elaborate on in the “Accepting Risks” section below.

Multi-dimensional scaling looked promising in our initial reading, but we wanted to scale in one dimension before adding more variables. We see our team incorporating it in the future.

Finally, rightsizing manually is not a good option because we have thousands of deployments across tens of clusters at Compass. Unless we relied on product teams to make manual changes, this would be difficult to do as a lean infrastructure team. Additionally, when applications change, these values would need to change as well, creating a never-ending rightsizing task.

Why Vertical Pod Autoscaler

We ultimately chose Vertical Pod Autoscaler because we believed it would make the most impact on our situation. Given that many thousands of applications were using a fraction of the resources they requested, and many apps already ran a reasonable number of replicas, vertical scaling offered the most immediate bang for our buck while still allowing us to take future steps like multi-dimensional scaling.

One key feature that gave us confidence was the ability to scale only requests. Any existing limits applications set remain intact, so if an app spikes above its new requests, the pod keeps running unless it hits its resource limits.
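For illustration, this requests-only behavior is expressed through the VPA object’s resourcePolicy. The manifest below is a minimal sketch with hypothetical names, not one of our actual objects:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: example-app-vpa        # hypothetical name
  namespace: example-namespace # hypothetical namespace
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-app          # hypothetical Deployment
  updatePolicy:
    updateMode: "Auto"         # "Off" emits recommendations without resizing
  resourcePolicy:
    containerPolicies:
      - containerName: "*"               # apply to all containers in the pod
        controlledValues: RequestsOnly   # adjust requests, leave limits untouched
```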

Accepting Risks

The primary concern with only scaling requests is that more pods will fit on fewer nodes. This can result in an overcommitted state, where the total amount of resource limits scheduled on a single node may be greater than the node’s resources.

If the containers on a given node collectively use more resources than they requested, some containers’ CPU usage can be throttled. Additionally, pods whose usage exceeds their requests can be evicted to prevent resource starvation on the node. Kubernetes describes these scenarios in the Node-pressure Eviction documentation.

Ultimately, we were willing to accept this risk across our non-production clusters. To mitigate it, we built monitoring into our dashboards to highlight resource throttling, which is expanded on in the next section, Safeguards.

Prerequisites: Safeguards

When building our implementation plan, we assessed several risks. Some of these risks are stated in the VPA documentation, and other risks are Compass-specific.

For example, the documentation warns not to use Vertical Pod Autoscaler with Horizontal Pod Autoscaler as both utilities take scaling action with the same metrics: CPU and Memory. As CPU and Memory utilization change, we don’t want two different scaling utilities acting on the same metrics, possibly over-scaling in either direction.

A second concern was that VPA can produce CPU and memory recommendations greater than the cluster’s node capacity. If that happened, pods requesting too many resources would be stuck in a Pending state and never get scheduled. To address this, our team implemented LimitRanges, which prevent a pod from being created if it requests more than the maximum set in the LimitRange. Even though we decided to scale only requests, this was important in case a deployment did not specify a limit.
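As a rough sketch, a per-namespace LimitRange along these lines caps what any single container may request; the values shown are illustrative, not our actual settings:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: container-resource-bounds # hypothetical name
  namespace: example-namespace    # hypothetical namespace
spec:
  limits:
    - type: Container
      max:                 # reject containers requesting more than this
        cpu: "4"
        memory: 8Gi
      defaultRequest:      # applied when a container omits requests
        cpu: 250m
        memory: 256Mi
      default:             # applied when a container omits limits
        cpu: "1"
        memory: 1Gi
```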

From a business perspective, we also wanted to be sure that Tier 1 applications were not affected. Thus, we worked to identify the critical apps and exclude them from scaling.

Finally, we recognized the need to monitor both the VPA components and resource throttling. Since VPA is installed in the kube-system namespace, much of this was covered by our existing monitoring. Given the accepted risk of scaling requests while leaving limits untouched, we built widgets into our Datadog dashboards to track throttling by container name with metrics like kubernetes.cpu.cfs.throttled.seconds, as well as how many apps were using more resources than they requested.

Implementation

The first step was to install all of the components required to run Vertical Pod Autoscaler. Conveniently, the kubernetes/autoscaler GitHub repository provides a setup script, hack/vpa-up.sh, which runs kubectl commands to install and configure the VPA CRDs, admission controller, recommender, and updater. It also generates a CA certificate used to secure communication between the API server and the admission controller.
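For reference, running the upstream setup script looks roughly like this, assuming kubectl is already pointed at the target cluster:

```sh
# Clone the upstream autoscaler repository and run the VPA setup script
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh

# The admission controller, recommender, and updater land in kube-system
kubectl get pods -n kube-system | grep vpa
```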

While the setup script was a great start for early testing, we still needed to integrate this setup script into our internal cluster management tooling. Our tooling focuses on consistent cluster state management and idempotent scripting for installing cluster-wide add-ons like VPA. Thus, we adapted the generated manifests and certificate generation into an add-on that we can install cluster by cluster. Below is an architecture diagram of the components installed, and how they work together.

Source: https://github.com/kubernetes/design-proposals-archive/blob/main/autoscaling/vertical-pod-autoscaler.md#architecture-overview

The second step was to find a way to generate a VPA object for each of our deployments. Traditionally, this would be an extra spec file in every team’s application folder, alongside the deployment and service specs, created during deployments through our CI/CD platform.

However, since the rightsizing effort was prioritized by Cloud Engineering, we chose to use a Kyverno generate policy to create a VPA object for each Deployment in the cluster. As a safeguard, we excluded system namespaces and the specific Tier 1 applications we did not want to put at risk.

The Kyverno policy looked similar to the one below:
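A simplified sketch of such a policy follows; the names are hypothetical and the exclusion list is abbreviated:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: generate-vpa-per-deployment   # hypothetical policy name
spec:
  background: false
  rules:
    - name: create-vpa
      match:
        any:
          - resources:
              kinds:
                - Deployment
      exclude:
        any:
          - resources:
              namespaces:
                - kube-system
                # ...additional system namespaces and Tier 1 workloads excluded here
      generate:
        apiVersion: autoscaling.k8s.io/v1
        kind: VerticalPodAutoscaler
        name: "{{request.object.metadata.name}}-vpa"
        namespace: "{{request.object.metadata.namespace}}"
        synchronize: true
        data:
          spec:
            targetRef:
              apiVersion: apps/v1
              kind: Deployment
              name: "{{request.object.metadata.name}}"
            updatePolicy:
              updateMode: "Auto"
            resourcePolicy:
              containerPolicies:
                - containerName: "*"
                  controlledValues: RequestsOnly   # scale requests, keep limits intact
```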

Our Results

In the four clusters where we have enabled Vertical Pod Autoscaler so far, we have significantly reduced both the total CPU and memory requested in each cluster. That let us shed more than 50 percent of the nodes in each cluster, halving our compute costs.

Below are metrics for one of our clusters, spanning the period from running VPA in recommendation-only mode (updateMode: Off) to actively resizing based on its recommendations (updateMode: Auto). This cluster dropped from around 44 nodes to 22, all while running the same applications!

In the first time-series graph, light blue represents the total CPU requested by applications in the cluster, and purple represents the total CPU usage. The second time-series graph represents the total number of nodes in the cluster, which scales down as the total requests from the first diagram scale down.

CPU Requests & Usage
Node Count

We’re thrilled with the success we’ve seen using Vertical Pod Autoscaler so far. Over the coming weeks, we will be expanding VPA to more of our clusters and evaluating its use in Production clusters.

– The Cloud Engineering Team @ Compass

Follow True North for future posts from the Compass Cloud Engineering team: Bobby Singh, Rajat Aggarwal, Pratik Somanagoudar, Garrett Sweeney, and Shuo Yang
