SLOs, SLIs & Error Budgets
Define measurable, user-centric reliability targets and tie them to deployment velocity and prioritization decisions.
Discovery Questions
- Do all critical services have defined SLIs and SLOs?
- Who owns defining and reviewing SLIs and SLOs: product, engineering, or operations?
- How are SLOs measured, tracked, and reported?
- Are SLOs visible to teams in real time?
- How often are SLOs revisited or recalibrated?
- Are SLO violations linked to error budgets that inform roadmap decisions?
- How are trade-offs between velocity and reliability made?
Evidence to Collect
- SLO dashboards and reports
- SLI query definitions
- Reliability review notes
SLI/SLO Framework
Design SLIs around user journeys and automate SLO compliance reporting.
Implementation Steps
- Instrument availability and latency SLIs with Prometheus recording rules.
- Use Sloth or Pyrra to codify SLOs and generate burn-rate alerting policies.
- Publish shared dashboards showing real-time error budget status.
- Automate compliance reports for stakeholders and product teams.
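The arithmetic behind these steps can be sketched in a few lines of Python. This is an illustrative sketch only, not the output of Sloth or Pyrra: the `Slo` class, the function names, and the event counts are hypothetical, and the counts are assumed to come from whatever good/total queries back your SLIs. The 14.4 threshold is a commonly cited fast-burn value for a 30-day window (it corresponds to spending 2% of the budget in one hour); treat it as an assumption to tune.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """Hypothetical SLO definition: a name plus a target fraction of good events."""
    name: str
    objective: float  # e.g. 0.999 for "99.9% of requests succeed"

    @property
    def error_budget(self) -> float:
        """Allowed fraction of bad events over the SLO window."""
        return 1.0 - self.objective


def availability_sli(good: int, total: int) -> float:
    """Fraction of good events; a window with no traffic counts as fully available."""
    return 1.0 if total == 0 else good / total


def burn_rate(slo: Slo, good: int, total: int) -> float:
    """Observed error ratio divided by the budgeted error ratio.

    A value of 1.0 means the budget is being spent at exactly the pace
    that would use it up by the end of the SLO window.
    """
    error_ratio = 0.0 if total == 0 else (total - good) / total
    return error_ratio / slo.error_budget


def budget_remaining(slo: Slo, good: int, total: int) -> float:
    """Fraction of the error budget left, given counts over the whole SLO window.

    The result goes negative once the budget is overspent.
    """
    return 1.0 - burn_rate(slo, good, total)


def should_page(slo: Slo, long_window: tuple, short_window: tuple,
                threshold: float = 14.4) -> bool:
    """Multi-window burn-rate check: page only if both windows exceed the threshold.

    Each window is a (good, total) pair. Requiring the short window as well
    stops the alert from firing long after the errors have ended.
    """
    return (burn_rate(slo, *long_window) > threshold
            and burn_rate(slo, *short_window) > threshold)
```

For example, with `Slo("checkout", 0.999)` and 9,995 good requests out of 10,000 over the SLO window, `budget_remaining` returns roughly 0.5, meaning half the budget is left. A dashboard or compliance report could publish that number directly.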
Error Budget Policy
Align release velocity with error budget consumption through explicit policy gates.
Implementation Steps
- Define budget states (healthy, watch, exhausted) with clear actions for each.
- Freeze feature work and trigger a reliability swarm when a budget is exhausted.
- Integrate budget checks into deployment pipelines and change approvals.
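One way to make such a policy executable is a small gate that a pipeline step calls before deploying. The sketch below is a minimal illustration under stated assumptions: the 25% "watch" threshold, the change-type labels (`"feature"`, `"reliability_fix"`), and the function names are all hypothetical, and the remaining-budget fraction is assumed to be supplied by your SLO tooling.

```python
from enum import Enum


class BudgetState(Enum):
    HEALTHY = "healthy"
    WATCH = "watch"
    EXHAUSTED = "exhausted"


# Assumed threshold: below 25% of the budget remaining, the service enters "watch".
WATCH_BELOW = 0.25

# Example actions per state; the wording here is illustrative, not prescribed.
ACTIONS = {
    BudgetState.HEALTHY: "Ship normally.",
    BudgetState.WATCH: "Ship with extra review; prioritize reliability work.",
    BudgetState.EXHAUSTED: "Freeze feature work; start a reliability swarm.",
}


def classify(remaining: float, watch_below: float = WATCH_BELOW) -> BudgetState:
    """Map the remaining budget fraction (1.0 = untouched, <= 0 = spent) to a state."""
    if remaining <= 0.0:
        return BudgetState.EXHAUSTED
    if remaining < watch_below:
        return BudgetState.WATCH
    return BudgetState.HEALTHY


def deploy_allowed(state: BudgetState, change_type: str) -> bool:
    """Policy gate: an exhausted budget admits only reliability fixes."""
    if state is BudgetState.EXHAUSTED:
        return change_type == "reliability_fix"
    return True


if __name__ == "__main__":
    # Hypothetical pipeline step: a real one would fetch `remaining` from
    # the SLO system and exit non-zero to block the deployment.
    state = classify(remaining=0.10)
    print(state.value, "->", ACTIONS[state])
    print("feature deploy allowed:", deploy_allowed(state, "feature"))
```

Keeping the thresholds and actions in plain data makes the policy easy to review alongside the SLO definitions, and the same `classify` call can drive both dashboards and change-approval checks.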
