Observability: Solving The Hidden Cost of Kubernetes Applications
Source: containerjournal.com
Kubernetes, in many ways, has allowed software organizations to realize the benefits of microservices by providing a convenient and powerful abstraction for deploying, scaling and running distributed software systems. Those benefits, however, have come at a cost for traditional software operations. Indeed, as microservices have grown in complexity and scale, teams have often struggled to adapt to the hidden costs that these powerful new technologies have uncovered: spiraling monitoring costs, poorly understood service dependencies, the creation of single human points of failure (aka the bus factor) and more.
If you haven't yet encountered some of this pain, it's very likely that it will affect you in some way in the future. But this doesn't have to be a cause for alarm or panic: the practice of observability can prevent these problems from materializing.
What Is Observability?
Observability is a set of practices and processes that allow you to understand and control the internal state of the systems you build and maintain. This control can be organizational or technical: using error budgets to prioritize certain development work, or understanding the performance profile of your system under load. The concept of observability comes from the discipline of control theory, which seeks to quantify how you can observe and control a system using only its outputs. Observability isn't simply a set of dashboards displaying time-series metrics or a searchable index of your application logs; it's a holistic approach to understanding your entire system and how it operates.
What It Really Costs to Monitor a System
Observability provides a solution to the "hidden costs" mentioned earlier. First, observability provides a comprehensive approach to monitoring your distributed system. There are two primary costs you can associate with traditional host-based monitoring systems:
The capital costs required to collect and store reams of events and other telemetry data.
The operational costs of engineering time to comprehend and investigate that data in the event of a service disruption, or when profiling the performance of a service or services in your application.
The capital cost can be quite high and shouldn't be dismissed out of hand.
Consider the amount of storage, compute and network traffic required to aggregate and index log statements from a single service. Then multiply that by the number of instances of that service. Then multiply that by your desired retention period. Then multiply that by the number of services involved in a single request through your application. That adds up to some serious money. The operational costs are potentially even higher, however: storage, compute and network bandwidth are cheap compared to hours or days of engineering time spent on unproductive "guess-and-check" correlation of spikes in time-series metric events with unstructured logs, especially when compared to the value of implementing and shipping new features.
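To make that multiplication concrete, here is a rough back-of-the-envelope sketch in Python. Every number in it is a hypothetical placeholder rather than a figure from any real deployment; substitute your own measurements.

```python
# Back-of-the-envelope estimate of raw log volume.
# All values are hypothetical placeholders; plug in your own measurements.
log_bytes_per_instance_per_day = 500 * 1024 * 1024  # ~500 MB of logs per instance per day (assumed)
instances_per_service = 20                           # replicas of a single service (assumed)
retention_days = 30                                  # desired retention window (assumed)
services_per_request_path = 15                       # services touched by a typical request (assumed)

total_bytes = (log_bytes_per_instance_per_day
               * instances_per_service
               * retention_days
               * services_per_request_path)

print(f"Raw log volume to retain: {total_bytes / 1024**4:.1f} TiB")
# With these placeholders, that is roughly 4.3 TiB of raw logs, before
# indexing overhead, replication or the network cost of shipping them.
```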
Holistic Observability Addresses Hidden Costs
The practice of observability seeks to resolve these dilemmas by focusing on the production of high-quality, ubiquitous telemetry data in the form of metric, log and distributed trace events. An observability platform centers itself on these distributed trace events, using them to reduce the search space for anomalies, outliers and other errors that are either contributing to an overall degradation of system performance or causing a general outage.
Holistic observability also addresses the issue of service dependency management and the siloing of information across teams. Traditional monitoring approaches often lead to circumstances in which any given team has a mismatch between what it can explicitly control and what it's implicitly responsible for. This mismatch can breed frustration, burnout and a poor end user experience; your users aren't aware of your architecture, after all, and aren't likely to be more understanding of poor performance on the front end just because the root cause of that performance degradation was due to bugs in some lower-level service. The traditional response to ensuring high availability for microservices is to alert on everything, but this approach is so harmful to developers and operators that it has a name: alarm fatigue.
Observability seeks to combat both alarm fatigue and this responsibility-control mismatch by focusing on key indicators of the health of individual services, popularly known as service level indicators (SLIs) and their friend the service level objective (SLO), as well as by providing telemetry data that can be used to generate contextual "service maps" of dependencies between individual services. The goal here is to focus on a few key indicators such as latency, error rate and throughput rather than on a scattershot of metrics with little contextual information about why they're important.
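As a rough illustration of how an SLI and an SLO fit together, here is a minimal Python sketch; the sample requests and the 95%/300 ms targets are invented for the example, not drawn from the article.

```python
# Minimal sketch: measuring a latency/availability SLI against an SLO.
# The request sample and both targets below are illustrative assumptions.
requests = [
    (120, True), (95, True), (480, True), (210, True),
    (150, False), (290, True), (90, True), (75, True),
]  # (latency_ms, succeeded)

SLO_TARGET = 0.95            # assumed: 95% of requests should be "good"
LATENCY_THRESHOLD_MS = 300   # assumed: a "good" request completes within 300 ms

good = sum(1 for latency_ms, ok in requests if ok and latency_ms <= LATENCY_THRESHOLD_MS)
sli = good / len(requests)

print(f"SLI: {sli:.1%} against an SLO of {SLO_TARGET:.0%}")
print("Within error budget" if sli >= SLO_TARGET else "Error budget exhausted")
```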
Getting Started With Observability
How should you implement observability in your organization? The first and most critical component of observability is telemetry data. OpenTelemetry provides an open source, vendor-neutral telemetry API and SDK designed specifically to generate high-quality telemetry data. You'll be able to use it not only to easily generate this data from existing services through automatic instrumentation but also to build custom metrics or traces from your application. Since it provides backward compatibility with OpenTracing and OpenCensus, you can benefit from the extensive collection of existing integrations with frameworks and libraries while looking forward to future integration into more parts of the cloud-native ecosystem, such as Kubernetes.
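Here's a minimal sketch of what manual instrumentation can look like with the OpenTelemetry Python API and SDK, assuming the opentelemetry-api and opentelemetry-sdk packages are installed; the service and span names are invented for the example.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Wire the SDK to print finished spans to stdout; in production you would
# export them to a collector or tracing backend instead.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")  # hypothetical service name

with tracer.start_as_current_span("process-order") as span:
    span.set_attribute("order.item_count", 3)      # custom, domain-specific attribute
    with tracer.start_as_current_span("charge-card"):
        pass  # the downstream call you want to time would go here
```

Automatic instrumentation for common frameworks can produce similar spans without code changes; the manual API exists for the domain-specific detail those integrations can't know about.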
Second, you need to create an "observability platform" rather than a disjointed collection of dashboards and alarms. Distributed traces should be the heart of your platform. Their ability to impart context to requests as they flow through your application dramatically reduces the overhead required to identify performance regressions and makes it trivial to understand service dependencies. Use common SLIs such as latency, throughput and error rate across your entire system to provide clear apples-to-apples comparisons of performance across releases and provide clear context for teams as they communicate.
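As one sketch of what that looks like in practice, the snippet below derives those three SLIs from finished spans; the span records are a made-up in-memory list standing in for whatever your tracing backend would return.

```python
from statistics import quantiles

# Hypothetical finished spans: (service, duration_ms, is_error).
spans = [
    ("frontend", 210, False), ("frontend", 480, True), ("frontend", 160, False),
    ("checkout", 95, False), ("checkout", 120, False), ("checkout", 310, True),
]
WINDOW_SECONDS = 60  # assumed measurement window

for service in sorted({s for s, _, _ in spans}):
    durations = [d for s, d, _ in spans if s == service]
    errors = [e for s, _, e in spans if s == service]
    p95_latency = quantiles(durations, n=20)[-1]   # 95th-percentile latency
    error_rate = sum(errors) / len(errors)
    throughput = len(durations) / WINDOW_SECONDS   # requests per second
    print(f"{service}: p95={p95_latency:.0f} ms, errors={error_rate:.0%}, rps={throughput:.2f}")
```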
Finally, ensure that people are at the center of your observability strategy. Behind the numbers, think about what your goals are. Are they to ensure that people can enjoy their weekends, even when they're on call, because they know getting paged won't derail the rest of their day? Are they to give you the confidence to release a critical fix on a Friday afternoon, because you know you'll be able to rapidly identify any regressions and roll back if required? Are they to make sure that you can quickly and accurately identify specific end users who are having issues with your software, so you can fix them and make those users happy? Your goals should encompass all of these reasons, and countless more, when you decide to adopt observability.