Objectives in defining SLOs
• Mark Eschbach
Moving forward as a Site Reliability Engineer (SRE) is a bit interesting. Our small group is just getting its bearings, but a lack of comfort does not stop the ball from needing to roll forward. As part of our goals we need to help define the Service Level Objectives (SLOs) for each service in the company. Surprisingly, there is little prior art on best practices for various types of services. Obviously there will be some variance based on the business context; however, some guidelines would be helpful. For example, there is a general rule of thumb that services should respond within 200ms in most cases, so that could be pinned to the 90th percentile.
The idea behind Service Level Objectives is to communicate the product organization’s tolerance for unreliability and poor performance. The basic premise is that 100% uptime is very costly to build and maintain across the whole software lifecycle, while many services suffer little to no impact from small periods of unavailability. Certain classes of software definitely need to be always available, such as aircraft safety and nuclear reactor safety systems. The Engineering organization I am a part of is a byproduct of a lack of tools being generally available. As such, we may have a number of systems deemed critical too.
Part of the lack of prior art appears to be an artifact of SLOs being bound to a business context. Expectations of users vary wildly on how often the system needs to be available and responsive. As a general rule of thumb, systems appear unresponsive if they do not react within a single second. Sure, progress dialogs help by extending the window, but in spirit, if an action isn’t fully completed with control returned to the user within a second, you defeat the person. In order to abstract out the usage patterns and general variability in responsiveness, the SRE book recommends defining these around percentiles. For example, user requests must complete within the following times at each percentile: p50 in 200ms, p90 in 1 second, p95 in 2 seconds.
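To make the percentile idea concrete, here is a minimal sketch of checking a batch of request latencies against the example thresholds above. It assumes the nearest-rank definition of a percentile (monitoring systems often interpolate or estimate from histogram buckets instead), and the sample latencies and the names `percentile`, `check_latency_slo`, and `TARGETS_MS` are made up for illustration.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Thresholds from the example above, in milliseconds.
TARGETS_MS = {50: 200, 90: 1000, 95: 2000}


def check_latency_slo(latencies_ms, targets=TARGETS_MS):
    """Return {percentile: (observed_ms, target_ms, met)} for each target."""
    report = {}
    for p, target in targets.items():
        observed = percentile(latencies_ms, p)
        report[p] = (observed, target, observed <= target)
    return report


# Hypothetical request durations in milliseconds (not real measurements).
sample = [80, 95, 120, 150, 180, 210, 400, 650, 900, 2500]

for p, (observed, target, met) in check_latency_slo(sample).items():
    status = "ok" if met else "MISSED"
    print(f"p{p}: {observed}ms (target {target}ms) {status}")
```

With this sample the p50 and p90 targets are met, while a single 2.5 second request is enough to miss the p95 target. That is the kind of tail behavior an average would hide.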
Given my inexperience with SLOs, I am currently recommending the following:
Recommended SLOs for Web Applications
| Description | Target |
| --- | ---: |
| Service should be available | at least 99% |
| Service should be responsive | within 1s |
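It can help to translate an availability percentage into the amount of downtime it actually permits. The sketch below does that arithmetic for the 99% target. The 30-day window and the function name `downtime_budget_minutes` are my assumptions; the table does not say over what period availability is measured.

```python
def downtime_budget_minutes(availability_pct, window_days=30):
    """Minutes of allowed unavailability in the window for an availability target."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - availability_pct) / 100


# Assumed 30-day measurement window.
budget = downtime_budget_minutes(99)
print(f"99% over 30 days allows {budget:.0f} minutes ({budget / 60:.1f} hours) of downtime")
```

Under that assumption, 99% leaves roughly 7.2 hours of unavailability per 30 days, which is a useful number to put in front of a product owner when asking whether the target is strict enough.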
Recommended SLIs for a Web Application
| Description | p-Low | p-Medium | p-High |
| --- | ---: | ---: | ---: |
| Status codes: 500s | 1% | - | - |
| Request-Response time | p50: 100ms | p90: 1s | p99: 2s |
| Request/sec/vCPU | p50: 40 | p90: 200 | p99: 400 |
| Memory/sec | p50: 320M | p90: 512M | p99: 768M |
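The status code row is a ratio rather than a percentile, so it is computed differently from the others. Here is a minimal sketch of measuring it, assuming the 1% figure means "at most 1% of responses in a window may be 5xx". The response counts and the name `server_error_ratio` are hypothetical.

```python
def server_error_ratio(status_codes):
    """Fraction of responses whose HTTP status code is in the 5xx range."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return errors / len(status_codes)


# Hypothetical window of 1,000 responses: 992 OK, 5 not found, 3 server errors.
codes = [200] * 992 + [404] * 5 + [500, 502, 503]

ratio = server_error_ratio(codes)
print(f"5xx ratio: {ratio:.2%}, within 1% threshold: {ratio <= 0.01}")
```

Note that 4xx responses count toward the total but not toward the errors here, since they usually reflect client mistakes rather than service failures.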
I am hoping these serve us well. I am really struggling with status codes, as I keep trying to shove them into the low-medium-high envelope. Part of me wants to say teams need some errors within their service; otherwise they are concentrating on making it too reliable.