Site reliability
29 September, 2021
Error budget is the amount of downtime an app is allowed before it breaches its SLO.
If the SLO is 99.95% availability, the error budget is about 22 minutes of downtime per month. A 4-minute outage spends the budget down to about 18 minutes.
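The arithmetic can be sketched in a few lines. This is a minimal illustration, assuming a 30-day measurement window (the window length is not stated above); the function names are made up for this example.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO.

    Assumption: the SLO is measured over a rolling window of `window_days`.
    """
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_percent / 100)


def remaining_budget_minutes(
    slo_percent: float, downtime_minutes: float, window_days: int = 30
) -> float:
    """Error budget left after `downtime_minutes` of downtime in the window."""
    return error_budget_minutes(slo_percent, window_days) - downtime_minutes


budget = error_budget_minutes(99.95)
left = remaining_budget_minutes(99.95, downtime_minutes=4)
print(f"budget: {budget:.1f} min, remaining: {left:.1f} min")
```

With a 30-day window the exact budget is 21.6 minutes, which rounds to the 22 minutes quoted above.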
Burn rate is the rate at which the error budget is being consumed. Teams should set up alerts on burn rate so they are warned before the budget is exhausted.
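One common way to express burn rate is as a ratio: the fraction of the budget spent divided by the fraction of the window elapsed. A value of 1 means the budget runs out exactly at the end of the window, and anything higher means it runs out early. The sketch below assumes that definition; the function names and the alert threshold are hypothetical, not taken from any particular monitoring tool.

```python
def burn_rate(
    budget_spent_minutes: float,
    budget_total_minutes: float,
    elapsed_hours: float,
    window_hours: float,
) -> float:
    """Fraction of budget spent divided by fraction of the window elapsed."""
    spent_fraction = budget_spent_minutes / budget_total_minutes
    elapsed_fraction = elapsed_hours / window_hours
    return spent_fraction / elapsed_fraction


def should_alert(rate: float, threshold: float) -> bool:
    """Alert when the budget is burning faster than the chosen threshold."""
    return rate >= threshold


# 4 minutes of a 21.6-minute budget spent in the first day of a 30-day window.
rate = burn_rate(
    budget_spent_minutes=4,
    budget_total_minutes=21.6,
    elapsed_hours=24,
    window_hours=30 * 24,
)
# The threshold of 2.0 is an illustrative value only.
print(f"burn rate: {rate:.2f}, alert: {should_alert(rate, threshold=2.0)}")
```

In practice the threshold and the length of the lookback period are tuned together, so that short spikes and slow leaks are both caught without paging too often.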
White-box monitoring observes the internal metrics of a system (CPU usage, HTTP requests, logs). It helps diagnose the cause of a problem, but requires technical expertise in the system.
Black-box monitoring observes the externally visible behaviour that impacts users. It shows the symptoms of a problem.
Four Golden Signals
- Latency. The time taken for a request to be completed.
- Traffic. Demand placed on the network or app.
- Errors. The rate of requests that fail, for example those returning error status codes.
- Saturation. How much of the server's or network's capacity is in use.
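As a rough illustration, the four signals can be derived from a batch of request records plus one resource reading. This is a sketch under stated assumptions: the `Request` record, the choice of median latency, counting only 5xx status codes as errors, and using CPU as the saturated resource are all choices made for this example.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Request:
    duration_ms: float  # time taken to complete the request
    status: int         # HTTP status code returned


def golden_signals(requests, window_seconds, cpu_used, cpu_capacity):
    """Summarise the four golden signals for a non-empty batch of requests."""
    return {
        "latency_ms_median": median(r.duration_ms for r in requests),
        "traffic_rps": len(requests) / window_seconds,
        # Assumption: only 5xx responses count as errors.
        "error_rate": sum(r.status >= 500 for r in requests) / len(requests),
        "saturation": cpu_used / cpu_capacity,
    }


sample = [
    Request(120, 200),
    Request(80, 200),
    Request(300, 500),
    Request(100, 200),
]
signals = golden_signals(sample, window_seconds=2, cpu_used=3, cpu_capacity=4)
print(signals)
```

Real systems usually track a high percentile of latency rather than the median, since slow outliers are what users notice.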