Governing Multi-Tenant Kubernetes Clusters with Kyverno

Garrett Sweeney
Published in Compass True North

Jul 13, 2022

“Can you do it with Kyverno?”

At Compass, our DevOps teams do their best to start our developers off on the right foot. We do this by codifying development lifecycle best practices into a Golden Path.

But what good is a Golden Path without guardrails that prevent our developers from getting mired in non-standard approaches and future tech debt? And what good are guardrails that can’t change as the Golden Path changes?

While there are many layers to a Golden Path and many guardrailing techniques, we’re going to take a look at how Compass created flexible guardrails in our compute platform: Kubernetes. Let’s dive into how we use ClusterPolicies to validate, mutate, and generate application resources running on our Kubernetes clusters, in line with our codified best practices.

Problems with the Golden Path

At Compass, our Golden Path for developing application resources starts with a code generator that scaffolds a Hello World program, encapsulated as a Docker container, with corresponding Kubernetes manifests. While code generation with best practices baked in is an excellent place to start, it is not enough to ensure those best practices are maintained when the scaffolded code inevitably changes. Specifically, our DevOps team has run into scenarios where we need to:

  1. Validate resources before creating them
  2. Mutate resources with the latest best practices
  3. Generate complementary resources when a new resource is created

These scenarios are problematic because we, as DevOps teams, don’t own the code after it has been scaffolded. While there were a few different approaches we could take, we wanted one that respected that service owners own their code and DevOps owns the infrastructure. From this, we asked ourselves how we would implement a governance solution in our compute platform.

And, like my manager likes to ask with every governance problem we face: “Can you do it with Kyverno?”

Solution 1: Validate Resources

Validate resources before creating them

A scenario to validate resources arose when we found a gap in our Kubernetes Ingress validation on our frontend Kubernetes clusters.

In upgrading our cluster ingress, the Nginx Ingress Controller, we experienced an issue where multiple Ingress rules with the same hosts and HTTP paths routed to different applications; this caused the ingress controller to fail on startup with an error: nginx: [emerg] ... duplicate location. This scenario happens when a new application is deployed onto the cluster with the same host and path matching rules as an existing app on the cluster.

To fix this, we prevent the creation of invalid Ingresses that don’t have a unique HTTP host + path combination. Luckily for us, Kyverno offers the ability to query other relevant resources in the cluster and determine if the new resource will cause conflicts.

In this case, we use a modified version of the Unique Ingress Host and Path policy to validate that all new Ingresses have unique HTTP host + path combinations. The policy first checks the requested HTTP paths for uniqueness across all Ingresses, regardless of host. If any path is not unique, it then validates that the path is unique for each Ingress rule’s specific host. Any Ingress rule that does not have a unique HTTP path for its host is blocked with an error message.

Our policy looks similar to the one below:
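What follows is a simplified sketch adapted from the Unique Ingress Host and Path policy in the Kyverno policy library, not our exact production policy; the policy name, context name, and JMESPath expressions are illustrative, and for brevity it shows only the first, cluster-wide path check described above.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: unique-ingress-host-and-path
spec:
  validationFailureAction: enforce   # block violating Ingresses rather than just auditing them
  background: false                  # evaluate only at admission time, not against existing resources
  rules:
    - name: deny-duplicate-ingress-paths
      match:
        any:
          - resources:
              kinds:
                - Ingress
      preconditions:
        all:
          - key: "{{ request.operation }}"
            operator: Equals
            value: CREATE
      context:
        # Query the cluster for every HTTP path already claimed by an existing Ingress.
        - name: existingPaths
          apiCall:
            urlPath: "/apis/networking.k8s.io/v1/ingresses"
            jmesPath: "items[].spec.rules[].http.paths[].path"
      validate:
        message: >-
          One or more HTTP paths in this Ingress conflict with an Ingress that already
          exists on the cluster. Each host + path combination must be unique.
        deny:
          conditions:
            all:
              # Deny the request if any requested path is already present on the cluster.
              - key: "{{ request.object.spec.rules[].http.paths[].path }}"
                operator: AnyIn
                value: "{{ existingPaths }}"
```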

With this policy in place, when a user tries to create an Ingress that would introduce a duplicate HTTP path, the request is rejected and the policy’s validation message bubbles up as an error in our CI/CD platform.

Solution 2: Mutate Resources

Mutate resources with the latest best practices

In upgrading our Kubernetes clusters past 1.19, we realized readiness and liveness probes were failing for some applications on the newer clusters. This was an interesting issue because not every app was failing its probes, and the ones that were failing had been running successfully on clusters before 1.20. It was particularly baffling because the failing applications’ Kubernetes manifests were identical on the old and new clusters.

After some investigation, we found the following “note” in the Kubernetes docs:

Before Kubernetes 1.20, the timeoutSeconds field was not respected for exec probes: probes continued running indefinitely, even past their configured deadline, until a result was returned.

Configure Liveness, Readiness and Startup Probes

Indeed, this was the very problem many of our applications were facing: they did not define timeoutSeconds. Since the applications' scaffolded Kubernetes manifests did not include this field for their probes, the applications were not responding to the probes within the default timeout of one second.
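For reference, the fix teams would eventually make is a one-line change per probe. In the illustrative container spec below (the image, command, and values are hypothetical), the timeout is raised from the implicit one-second default to five seconds:

```yaml
# Excerpt from a Deployment's pod template; names and values are illustrative
containers:
  - name: my-app
    image: my-app:1.0.0
    readinessProbe:
      exec:
        command: ["/bin/sh", "-c", "/healthcheck.sh"]
      periodSeconds: 10
      timeoutSeconds: 5   # without this line, Kubernetes defaults to 1 second
```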

While there were multiple solutions to fix this long-term, all of them required involvement from the product teams that owned the incompatible applications. Because each application is unique, teams would have needed to identify an appropriate timeout for their specific workload, which would have been time-consuming. Fortunately, Kyverno offered a solution: the ability to mutate resources and set a larger default probe timeout wherever teams had not already set one explicitly.

In this case, our cluster policy looked like:
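The version below is a minimal sketch rather than our exact policy; the policy name and the five-second default are illustrative, and it assumes every matched container already defines a readinessProbe, as our scaffolded manifests do. The add anchor on timeoutSeconds only sets the field when it is missing, so any timeout a team has set explicitly is left untouched.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-default-readiness-probe-timeout
spec:
  background: false
  rules:
    - name: set-readiness-timeout
      match:
        any:
          - resources:
              kinds:
                - Deployment
      mutate:
        patchStrategicMerge:
          spec:
            template:
              spec:
                containers:
                  # (name): "*" applies the patch to every container in the pod template.
                  - (name): "*"
                    readinessProbe:
                      # +() adds timeoutSeconds only when the container has not already set it.
                      +(timeoutSeconds): 5
```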

Note this example policy is specific to deployments’ readiness probes and does not cover liveness probes.

As a result of this policy, we were able to mutate incompatible applications with low overhead and quickly unblock our Kubernetes cluster upgrades past 1.19.

Solution 3: Generate Resources

Generate complementary resources when a new resource is created

Tuning application resources running on Kubernetes clusters is crucial to achieving the economies of scale that Kubernetes affords. Traditionally, we used a custom Datadog dashboard to pinpoint an application’s minimum and maximum CPU and memory usage over time and made tuning recommendations. However, this required product teams to consider and enact these recommendations, which would change over time. In some cases, teams lacked confidence in changing their application resources and often came back to us for guidance. This process did not scale well for a lean DevOps team: we needed a way to dynamically tune applications on the cluster without asking product teams.

Enter Kyverno + VerticalPodAutoscaler: VPA enabled us to dynamically tune CPU and memory configuration for applications on our clusters. However, VPA required creating custom Kubernetes objects for each application needing tuning — a new burden for product teams. To avoid this, we used Kyverno: it enabled us to centrally and dynamically create the custom objects on our clusters for each application. As a result, we were able to leverage VPA without burdening product teams.

Our policy looked like:
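Here is a simplified sketch of that generate policy; the policy name is illustrative, and it assumes the VPA custom resource definitions (autoscaling.k8s.io/v1) are installed and that Kyverno has RBAC permission to create VerticalPodAutoscaler objects.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: generate-vpa-per-deployment
spec:
  background: false
  rules:
    - name: create-vpa
      match:
        any:
          - resources:
              kinds:
                - Deployment
      generate:
        apiVersion: autoscaling.k8s.io/v1
        kind: VerticalPodAutoscaler
        # One VPA per Deployment, named after and co-located with it.
        name: "{{ request.object.metadata.name }}"
        namespace: "{{ request.object.metadata.namespace }}"
        synchronize: true
        data:
          spec:
            targetRef:
              apiVersion: apps/v1
              kind: Deployment
              name: "{{ request.object.metadata.name }}"
            updatePolicy:
              # "Off" only records resource recommendations; switch to "Auto" to let the
              # VPA apply them by evicting and recreating pods.
              updateMode: "Off"
```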

Note the comment on updateMode for toggling between recommendation + auto modes.

Because of this policy, we were able to quickly adopt VPA in some of our non-production clusters and pave the way for future use of this functionality more broadly.

What Could We Have Done Instead of Kyverno?

All three of these problems could have been solved without using Kyverno’s cluster policies.

For Problem 1, we could have included additional logic to address the duplicate Ingress path issue in our CI/CD Golden Path platform, before the apps get deployed onto our Kubernetes clusters.

But, if anything was deployed onto our Kubernetes clusters outside of the CI/CD Golden Path, it could create an Ingress with a duplicate HTTP path, the cluster would accept the resource, and the applications with conflicting Ingress paths would steal each other’s traffic — probably causing a major incident. Instead, we govern at the cluster level with Kyverno: it gives us a stronger guarantee that no matter how resources end up on our clusters, they conform to the rules.

For Problem 2, we could have required each product team to set the timeoutSeconds field in their Kubernetes specs.

But, not every team had the bandwidth to drop roadmap work, understand our recommendations, and make the appropriate changes. Since speed was critical to meet our Kubernetes cluster upgrade timelines, we leveraged mutating admission webhooks via Kyverno to patch tenant applications at the cluster level. This ensured that every container deployed on the clusters would have a beefed up default timeoutSeconds and gave product teams the time they needed to understand and apply the appropriate recommendations.

For Problem 3, we could have templated and created VPA resources through our CI/CD platform with each deployment of an application’s Kubernetes resources.

But this would have added overhead: we would have had to build support for VPAs into our CI/CD platform, slowing our development speed and adding code complexity. Considering our original goal was to first evaluate VPAs in our non-production environments and determine whether they would work for us, we didn’t want to do needless work for something we weren’t sure would last. Instead, we used Kyverno’s cluster policies to centrally create VPA objects for each application. Because we only needed a single YAML file to define these policies, the development effort was very low.

In Summary: Why Kyverno?

Kyverno is great because it allows us to govern our clusters consistently at the cluster level with no real code. Invalid resources can be blocked with helpful errors bubbling up to users, misconfigured resources can be corrected on the fly, and new resources can be dynamically generated to augment existing workloads. We’ve had a great experience with Kyverno for cluster governance thus far, and we’re just getting started.

– The Cloud Engineering Team @ Compass

Follow True North for future posts from the Compass Cloud Engineering team: Bobby Singh, Rajat Aggarwal, Pratik Somanagoudar, Garrett Sweeney, and Shuo Yang
