Managing Multi-Cloud Production Kubernetes Clusters at Brandwatch
Source: infoq.com
Brandwatch’s engineering team wrote about their experience managing multi-cluster Kubernetes across EKS, GKE, and self-managed clusters. Brandwatch runs around 150 production services across six independent Kubernetes clusters hosted on AWS EKS, GKE, and self-managed data centers in London. The team started running on GKE in early 2016 and found it a good fit for many of their services. They have automated deployments using ConcourseCI and some custom tooling, and use the nginx ingress controller for finer control over HTTP requests and responses.
InfoQ reached out to Andy Hume, Director of Engineering, Application Infrastructure at Brandwatch, to learn more.
Although GKE and EKS remove the headache of running the Kubernetes control plane, Brandwatch still has to manage and upgrade the Kubernetes clusters in its own data centers. Hume describes the tools they use for this:
We use Rancher to manage Kubernetes clusters in our own DCs. We had experience running Rancher 1 prior to Kubernetes in 2015/2016, and so when Rancher 2 migrated to Kubernetes this seemed like a sensible starting point for our initial implementation. However, while we plan out how to scale Kubernetes across more of our existing bare metal infrastructure we’ll likely look for solutions with less complexity and that give us more flexibility for integrating with our existing network design.
Brandwatch teams have adopted Kubernetes widely. Hume explains that they have not enforced any specific tools across teams:
Our development teams are quite diverse in their approaches to developer workflow, and the application platform doesn’t impose any particular tools or workflows for local development. Teams have used a range of tools to streamline development, including Skaffold, Telepresence, Docker Compose.
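As an illustration of what a lightweight local workflow with one of these tools might look like, below is a minimal Skaffold configuration sketch. The service name, registry, and manifest paths are hypothetical and are not taken from Brandwatch’s setup.

```yaml
# skaffold.yaml -- minimal sketch; all names and paths are assumptions
apiVersion: skaffold/v2beta29
kind: Config
metadata:
  name: example-service              # hypothetical service name
build:
  artifacts:
    - image: registry.example.com/example-service   # assumed registry
      docker:
        dockerfile: Dockerfile
deploy:
  kubectl:
    manifests:
      - k8s/*.yaml                   # assumed location of Kubernetes manifests
```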
However, their CI/CD workflows are standardized, and “all teams publish container images and metadata to a central system which is used to manage deployments and change control”, says Hume.
CI/CD is managed on ConcourseCI, which itself runs on Kubernetes, with some custom tools that automate deployment. The team wrote a declarative wrapper around kubectl, editable as YAML, which captures metadata about each service. Helm is used for templating (but not for production rollouts), and Kustomize manages all cluster-level services. The team keeps all Kubernetes manifests in a repository separate from the application source code, which “simplifies the mechanics of how changes are rolled out”. To roll back to an older version of an app, the team just needs to deploy an older manifest. Hume elaborates:
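For context on how Kustomize can manage a cluster-level service across clusters, here is a minimal kustomization.yaml sketch. The layout, namespace, and image are assumptions for illustration, not Brandwatch’s actual repository structure.

```yaml
# kustomization.yaml -- hypothetical layout for a cluster-level service
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: ingress-nginx                 # assumed namespace
resources:
  - base/deployment.yaml
  - base/service.yaml
patches:
  - path: overlays/gke/replicas.yaml     # per-cluster tweak, e.g. replica count
images:
  - name: registry.k8s.io/ingress-nginx/controller
    newTag: v1.9.4                       # pin the controller version per cluster
```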
When teams deploy a new version (image) of the source code, they do that by updating the image reference in the Kubernetes manifest in Git. This is automated to allow deployments to be as simple as possible for developers, but it means that there isn’t in fact a separate mapping between the application source code and the Kubernetes manifest. The manifests in Git are the single source of truth, and so reverting to an old version will also roll back the image tag/version.
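A minimal sketch of what such a manifest might look like, with the image tag as the one field that changes on a routine deploy; the service name and registry are hypothetical.

```yaml
# deployment.yaml -- illustrative sketch, not Brandwatch's actual manifest
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-service
  labels:
    app: example-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-service
  template:
    metadata:
      labels:
        app: example-service
    spec:
      containers:
        - name: example-service
          # The image tag is the field updated on each deploy; reverting this
          # commit in Git rolls the application back to the previous version.
          image: registry.example.com/example-service:1.42.0
          ports:
            - containerPort: 8080
```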
A Kubernetes cluster needs an Ingress to accept traffic from the internet. Cloud vendors with managed Kubernetes offerings integrate their load balancers into their Ingress implementations, but it is also possible to run your own load balancer, which is necessary if you manage the cluster yourself. Brandwatch’s engineering team runs the nginx-ingress-controller on all their clusters, including those on EKS and GKE, bypassing the managed load balancers. This lets them manipulate HTTP response headers to add security-specific headers and remove internal ones. In addition, Hume says that:
It’s nice not to have to rely on the applications themselves adding these headers, and to do it in one central place that our security team can audit or adjust if they need to. We also found some other constraints with the cloud HTTP LBs. For example the GCP LB limits the size of HTTP headers on incoming requests, which caused problems with some of our older applications that we wanted to move into Kubernetes.
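As an illustration, the ingress-nginx controller can apply this kind of header handling centrally via its configuration-snippet annotation. The host, backend service, and specific headers below are assumptions for the sketch, not Brandwatch’s actual configuration.

```yaml
# ingress.yaml -- illustrative sketch of central header handling with ingress-nginx
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-service
  annotations:
    nginx.ingress.kubernetes.io/configuration-snippet: |
      # add security headers in one place instead of in each application
      more_set_headers "X-Content-Type-Options: nosniff";
      more_set_headers "X-Frame-Options: DENY";
      # strip an internal header before the response leaves the cluster
      more_clear_headers "X-Internal-Trace-Id";
spec:
  ingressClassName: nginx
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: example-service
                port:
                  number: 80
```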
Brandwatch runs Prometheus on each cluster using the Prometheus operator for monitoring and alerting. Their alerting is primarily “concerned with the availability and correctness of our customer facing applications, and individual development teams are typically responsible for creating and responding to these using the prometheus/alertmanager stack in each of our clusters”, according to Hume. Beyond this, the general health of Kubernetes workloads is also monitored as an indicator of overall cluster health. Hume adds:
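With the Prometheus operator, team-owned alerts of this kind are typically declared as PrometheusRule resources. The sketch below is illustrative, with assumed metric names, labels, and thresholds rather than Brandwatch’s actual rules.

```yaml
# prometheusrule.yaml -- minimal sketch of a team-owned availability alert
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-service-alerts
  labels:
    release: prometheus            # assumed label matched by the operator's rule selector
spec:
  groups:
    - name: example-service.availability
      rules:
        - alert: ExampleServiceHighErrorRate
          # fire when more than 5% of requests fail over 10 minutes
          expr: |
            sum(rate(http_requests_total{job="example-service",code=~"5.."}[10m]))
              / sum(rate(http_requests_total{job="example-service"}[10m])) > 0.05
          for: 10m
          labels:
            severity: page
          annotations:
            summary: "example-service error rate above 5% for 10 minutes"
```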
In general we have found the managed nodes of GKE and EKS to be resilient and straight-forward to manage in case of failures. If workloads are failing on a particular node the immediate action is to terminate it and let the cluster autoscaling start a new one. We do have monitoring at the node level, but we don’t proactively trigger alerts for any node-level metrics.
Although Brandwatch uses the Cluster Autoscaler to spin up nodes when required, node provisioning is often too slow for sudden load. They deal with this through application logic that queues or retries work while new nodes (and pods) come up, and by pre-scaling for expected, known workload spikes.
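One common way to pre-scale for a known spike is to raise the minimum replica count on a HorizontalPodAutoscaler ahead of time, so the Cluster Autoscaler adds nodes before demand arrives rather than during it. The sketch below illustrates that pattern with assumed names and numbers; it is not necessarily Brandwatch’s exact mechanism.

```yaml
# hpa.yaml -- illustrative pre-scaling sketch, not Brandwatch's actual configuration
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-service
  # temporarily raise minReplicas before an expected traffic spike so extra
  # nodes are provisioned ahead of demand
  minReplicas: 10
  maxReplicas: 40
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```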