To properly manage and monitor an application, you need a reference point that defines where you are and how you are doing, so you can adjust and improve over time. This reference point is known as a service level objective (SLO). Taking the time to define clear SLOs makes life easier for service owners as well as for the internal and external users who depend on your services.

However, before you can define an SLO, you need objective, quantitative metrics you can look at to determine the performance or reliability of your application. These metrics are known as service level indicators (SLIs).

Service level indicator (SLI)

A good way to determine which metrics to use for your SLIs is to think about what directly impacts your users' happiness with your application's performance. This could include things such as latency, availability, and accuracy. On the other hand, CPU utilization would be a bad SLI: your users don't really care how your server's CPU is doing, as long as it isn't affecting their experience with your app.

Additionally, the SLIs you choose will depend on the type of application you are running. For a typical request/response application, you will probably focus on availability, request latency, and throughput (successful requests per second). For data storage, you might look at availability and the consistency of the data being served. For a data pipeline, your SLIs might be whether the expected data is returned and how long it takes to process, which matters especially under an eventual-consistency model.
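To make the request/response case concrete, here is a minimal sketch in Python of how you might compute two such SLIs over a window of traffic. The Request record and the sample values are hypothetical; the pattern is simply good events divided by total events:

```python
from dataclasses import dataclass

@dataclass
class Request:
    status_code: int      # HTTP status returned to the user
    duration_ms: float    # end-to-end latency observed by the user

def availability_sli(requests: list[Request]) -> float:
    """Fraction of requests that succeeded (non-5xx responses)."""
    good = sum(1 for r in requests if r.status_code < 500)
    return good / len(requests)

def latency_sli(requests: list[Request], threshold_ms: float) -> float:
    """Fraction of requests served faster than a latency threshold."""
    fast = sum(1 for r in requests if r.duration_ms < threshold_ms)
    return fast / len(requests)

# Hypothetical window: 3 of 4 requests succeed, 3 of 4 finish within 300 ms.
window = [
    Request(200, 120.0),
    Request(200, 250.0),
    Request(503, 90.0),
    Request(200, 480.0),
]
print(f"availability SLI: {availability_sli(window):.2%}")       # 75.00%
print(f"latency SLI (<300 ms): {latency_sli(window, 300):.2%}")  # 75.00%
```

Framing every SLI as a ratio of good events to total events keeps measurements comparable from one service to the next.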

Service level objective (SLO)

An SLO is a performance threshold measured for an SLI over a period of time. This is the bar against which the SLI is measured to determine if performance is meeting expectations. A good SLO will define the level of performance your application needs, but not any higher than necessary. This is a crucial point and will require some testing over time. If your users are fine with 99% availability, there’s no reason to make the massive investment that would be required to hit 99.999% availability.
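One way to see why each additional nine is so expensive is to translate an availability SLO into its error budget: the amount of failure the objective tolerates over the measurement window. A quick back-of-the-envelope sketch, assuming a 30-day window, shows the budget shrinking from hours to seconds:

```python
SECONDS_PER_30_DAYS = 30 * 24 * 60 * 60  # 2,592,000 seconds

for target in (0.99, 0.999, 0.9999, 0.99999):
    # The error budget is whatever fraction of the window the SLO allows to fail.
    budget_seconds = (1 - target) * SECONDS_PER_30_DAYS
    print(f"{target:.3%} availability -> {budget_seconds:,.0f} s "
          f"({budget_seconds / 3600:.2f} h) of downtime per 30 days")
```

Going from 99% to 99.999% shrinks the allowance from about 7.2 hours to roughly 26 seconds a month, which is why you should only buy the nines your users actually need.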

An example SLO for latency could target the 95th percentile latency: the value that 95% of requests complete under, with only the slowest 5% exceeding it. This is far better than a simple latency average, which can easily be skewed by outliers.
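As a rough illustration, take a hypothetical sample of request latencies containing one pathological outlier. The mean shifts noticeably while the 95th percentile barely moves (Python standard library only):

```python
import statistics

# Hypothetical sample: 99 ordinary requests between 100 and 198 ms,
# plus a single pathological 9,000 ms outlier.
latencies = [100 + i for i in range(99)] + [9000]

mean = statistics.mean(latencies)
# quantiles(n=100) returns 99 percentile cut points; index 94 is the 95th.
p95 = statistics.quantiles(latencies, n=100)[94]

print(f"mean latency: {mean:.0f} ms")  # ~238 ms, dragged up by one outlier
print(f"p95 latency:  {p95:.0f} ms")   # ~195 ms, still describes typical users
```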

At InfluxData, we have large quantities of data covering myriad aspects of our systems. While there's operational value in highly granular metrics, those metrics did not speak well to the customer experience and certainly left service owners wanting more. So we took the approach of examining each microservice and its consumers, establishing reasonable success criteria and achievable goals.

The resulting outputs are consistent measurements we can apply across our entire fleet, providing insight into availability and error rate that serves as a proxy for customer experience. Not only does this help service owners achieve operational excellence and inform error budgets, it also gives all levels of the business insight into our engineering organization.


An important thing to note is that SLOs don't have to be perfect on the first implementation. An SLO is always a work in progress that you can iterate on as you gather more data and learn more about user needs and expectations. Remember, the most valuable thing about implementing SLOs is the mindset shift they bring to monitoring your applications.

Tim Yocum is director of operations at InfluxData, where he is responsible for site reliability engineering and operations for the company's multi-cloud infrastructure. He has held leadership roles at startups and enterprises over the past 20 years, emphasizing the human factor in SRE team excellence.


