Imagine you are driving a car on a freeway. Your speedometer is telling you you’re going 62 mph. But you “gotta go fast”. Faster than the 65 mph speed limit. So you go for it: first 68 mph, then 75 mph, then 80 mph.
Then you pass a police officer hiding in a speed trap. To your dismay, they pull you over and give you a ticket.
All is not lost: there is a silver lining here. It’s the perfect analogy for understanding how indicators, objectives, and agreements all work with each other. With Site Reliability Engineering practices gaining more traction, we’re starting to see service level acronyms appear more often. These are:
SLI - Service Level Indicator
SLO - Service Level Objective
SLA - Service Level Agreement
In the example, our speedometer is an indicator. It tells us how fast we are actually going. The speed limit is our objective (to stay under), and our speeding ticket is the consequence of the agreement we have with the state.
Indicators, Objectives, and Agreements rely on each other in this order. Indicators rely on metrics (discussed below) and are relevant to different levels of your organization: a high-level indicator is whether or not something is performing as required, while a low-level indicator might be high CPU usage on a particular host. Without indicators we can’t define objectives, which describe what’s normal or abnormal. And without objectives we can’t create agreements, which spell out what we do when we miss our objectives.
In this post we’re going to learn how to use these concepts to create a simple accountability system for reimbursing restaurants when our online ordering website malfunctions.
Create your objectives first
You should always start with an objective you are trying to meet as an organization. A great objective is crystal clear and not up for interpretation. When you define a clear objective, you are able to create the proper indicators and measure against them. Objectives should almost always revolve around customers and their needs. Instead of creating an objective for failed HTTP requests, create an objective called “Failed Restaurant Orders”. Your indicators should measure the “discomfort” of your users (in this case, a failed order) while your objective should be the tolerable amount of discomfort you are willing to let them experience.
In many cases, an objective is defined with a time frame in mind. For example, an SLO might be considered violated if 2% of orders fail within a 5-minute time frame. So we could write this as:
This objective is met if the failure rate of restaurant orders processed is less than 2%, measured every 5 minutes.
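As a rough sketch of how an objective like this could be checked, the snippet below evaluates a series of 5-minute windows against a maximum failure rate. The `Window` class, the `slo_met` function, and the sample counts are all hypothetical names and numbers chosen for illustration; the threshold is a parameter so you can plug in whatever rate your own objective states.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Order counts for one 5-minute measurement window (hypothetical shape)."""
    total_orders: int
    failed_orders: int


def slo_met(window: Window, max_failure_rate: float) -> bool:
    """Return True if the window's failure rate is below the allowed maximum."""
    if window.total_orders == 0:
        # Assumption: a window with no orders has nothing to fail, so it counts as met.
        return True
    return window.failed_orders / window.total_orders < max_failure_rate


# Three consecutive 5-minute windows with made-up counts.
windows = [Window(500, 3), Window(480, 12), Window(510, 0)]
results = [slo_met(w, max_failure_rate=0.02) for w in windows]
print(results)  # [True, False, True]
```

The second window fails 12 of 480 orders (2.5%), so it is the only one that violates a 2% objective.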
Define your Service Level Agreement
It’s important to define the consequences of breaching your SLO. When you’re defining an SLA, step away from the engineering and product teams and into the world of sales, executives, and customers. An SLA should have the customer as its primary focal point to ensure you create something that makes sense for both you and them.
For our use case, our leadership team decides that we’ll credit restaurants their average revenue per minute multiplied by the total number of minutes we are in breach of our SLO. We can state this as:
In the event of our online order system not meeting our SLO, customers are entitled to a reimbursement of their average revenue per minute times the number of minutes of downtime.
As an example, let’s imagine we break our failure rate SLO for 9 minutes. The restaurant has a monthly average revenue of $10,000 through our platform. If we use 43,800 as the number of minutes in a month, this means the restaurant has an average revenue of about $0.228 a minute. So for a nine minute outage, we would owe our customer 9 * $0.228, or about $2.05.
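The arithmetic can be sketched in a few lines. This is only an illustration under the assumptions stated in the example (43,800 minutes in a month, credit rounded to cents at the very end); the function name `sla_credit` is made up for this post.

```python
MINUTES_PER_MONTH = 43_800  # 365 days * 24 hours * 60 minutes / 12 months


def sla_credit(monthly_revenue: float, breach_minutes: int) -> float:
    """Average revenue per minute times minutes in breach, rounded to cents."""
    revenue_per_minute = monthly_revenue / MINUTES_PER_MONTH
    return round(revenue_per_minute * breach_minutes, 2)


credit = sla_credit(10_000, 9)
print(credit)  # 2.05
```

Note that the result depends on when you round: truncating the per-minute rate to $0.22 before multiplying gives $1.98, while rounding only the final figure gives $2.05. Whichever you pick, it is worth stating in the agreement. A real billing system would likely use a decimal type rather than floats.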
At this point, you should also make sure that the agreement and objective make sense for each other. Experiment with different numbers and see if you’re going to break your company’s bank in the event of activating an SLA. Objectives and Agreements should be refined and approved before moving on to creating the final piece of our equation.
Create your indicators
Once you have a well-defined objective and agreement, and the whole team (product, engineers, etc.) has agreed to them, you are ready to create Service Level Indicators that tell you whether you are meeting that objective. An indicator requires one thing: metrics. Metrics are not always the same as indicators, although sometimes they might be. Metrics should be queryable in a system like Grafana or SignalFx and be collected for every part of your application’s stack. If nothing else, you should be collecting the “RED” metrics: Rate, Errors, and Duration. These are enough to create and measure a few different service level objectives.
For example, if you have a web application where users can place orders online for a restaurant near them, your RED metrics would be:
Number of orders
Number of order errors (a 5xx HTTP response code)
How long each order request is taking to complete
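To make the three metrics above concrete, here is a minimal sketch that derives them from a batch of order requests. The `OrderRequest` record, the `red_metrics` function, and the sample data are hypothetical; in practice these numbers would come from your metrics system rather than from in-memory lists.

```python
from dataclasses import dataclass


@dataclass
class OrderRequest:
    """One order request as seen by the web application (hypothetical shape)."""
    status_code: int
    duration_ms: float


def red_metrics(requests: list[OrderRequest], window_seconds: float) -> dict:
    """Compute Rate, Errors, and Duration for the requests in one time window."""
    total = len(requests)
    # Errors: any 5xx HTTP response code counts as a failed order.
    errors = sum(1 for r in requests if 500 <= r.status_code <= 599)
    # Duration: a simple average; real systems often track percentiles instead.
    avg_duration = sum(r.duration_ms for r in requests) / total if total else 0.0
    return {
        "rate_per_second": total / window_seconds,
        "errors": errors,
        "avg_duration_ms": avg_duration,
    }


sample = [
    OrderRequest(200, 120.0),
    OrderRequest(201, 80.0),
    OrderRequest(503, 400.0),
    OrderRequest(200, 100.0),
]
metrics = red_metrics(sample, window_seconds=60)
print(metrics["errors"], metrics["avg_duration_ms"])  # 1 175.0
```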
Since we have agreed on an objective of “Failed Restaurant Orders”, we can figure out which metrics to measure to accurately capture that objective. In this case, we can create a Service Level Indicator called “Order Failure Rate”.
Failure rate is simply the total number of errors divided by the total number of orders. For example, if our restaurant application receives 500 orders and 10 of them fail, we have an order failure rate of 2%. This is the output of our indicator (our speedometer in the initial example).
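The indicator itself is a one-line calculation. The sketch below uses a hypothetical function name, `order_failure_rate`, and assumes that a period with no orders reports a rate of zero.

```python
def order_failure_rate(failed_orders: int, total_orders: int) -> float:
    """Order Failure Rate: failed orders divided by total orders."""
    if total_orders == 0:
        # Assumption: with no orders there is nothing to fail, so report 0.
        return 0.0
    return failed_orders / total_orders


rate = order_failure_rate(10, 500)
print(f"{rate:.1%}")  # 2.0%
```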
Our restaurant ordering system now has an easy-to-understand accountability system using service level indicators, objectives, and agreements. Each layer of the organization has a clearly defined responsibility. These objectives are also a great way to define alerts and pages for your teams, because they measure user impact in a clear way that responders can act on quickly.
Create a few SLx’s in your organization today and create an accountability system for yourself.