Building High-Performing SRE Teams
A B2B SaaS Journey: From Chaos to Predictable Reliability
Published: Sep 18, 2022
The Starting Point: Where We Were
This is year one of building a world-class SRE team at a mid-market B2B SaaS platform. The platform was experiencing explosive growth, 30% YoY user growth and expansion into new regions, but the infrastructure team was struggling. We had smart engineers, but they were drowning in reactive work.
Three Core Principles That Changed Everything
Product Thinking
We reframed reliability as a product feature with measurable business outcomes. Every reliability initiative had to answer: "What business outcome does this drive?"
- Tied SLO improvements to revenue retention
- Measured reliability ROI like a product feature
- Aligned with GTM for regional expansion
Clear Ownership
We eliminated handoffs and assigned clear domain owners for each critical system. This reduced cognitive load and made accountability explicit.
- One engineer per domain (auth, data pipeline, APIs)
- Ownership rotation every 18 months
- Cross-training built into runbooks
Autonomy with Guardrails
Engineers had autonomy within clear guardrails. We built paved paths for common tasks and sane defaults for everything.
- Self-service deployment within SLO limits
- Pre-approved Terraform modules
- Automatic rollback when guardrails were violated (see the sketch after this list)
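To make the guardrail idea concrete, here is a minimal sketch of a deploy watcher that rolls back automatically when a service burns past its error budget rate. The function names (`error_rate`, `rollback`), the service name, and the thresholds are illustrative stand-ins for our metrics store and deploy tooling, not the actual implementation.

```python
import time

# Hypothetical hooks: in practice these would call the metrics store and the
# deploy tooling. Stubbed here so the sketch runs standalone.
def error_rate(service: str) -> float:
    """Fraction of failed requests over the last five minutes."""
    return 0.002  # placeholder value

def rollback(service: str, reason: str) -> None:
    print(f"rolling back {service}: {reason}")

SLO_TARGET = 0.995          # illustrative 99.5% availability target
GUARDRAIL = 1 - SLO_TARGET  # a deploy may not push errors past the budget rate

def watch_deploy(service: str, checks: int = 6, interval_s: int = 60) -> bool:
    """Watch a fresh deploy and roll back automatically if the guardrail is violated."""
    for _ in range(checks):
        rate = error_rate(service)
        if rate > GUARDRAIL:
            rollback(service, f"error rate {rate:.3%} exceeds guardrail {GUARDRAIL:.3%}")
            return False
        time.sleep(interval_s)
    return True  # deploy stayed within SLO limits

if __name__ == "__main__":
    watch_deploy("checkout-api", checks=2, interval_s=1)
```

The point of deriving the guardrail from the SLO itself is that nobody has to negotiate a threshold per deploy; the paved path carries the reliability contract with it.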
One-Year Evolution: The Transformation
Months 1-3: Foundation
Establish visibility and baseline
Defined SLIs/SLOs for 5 critical services. Implemented error budget policies. Built golden signals dashboard. Created incident response playbooks.
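An SLO definition at this stage does not need to be fancy. Something along these lines is enough to get started; the service names, SLIs, and targets below are illustrative, not our real ones.

```python
from dataclasses import dataclass

@dataclass
class SLO:
    service: str
    sli: str          # how "good" is measured
    target: float     # required fraction of good events over the window
    window_days: int

# Illustrative entries only -- not the real service list or numbers.
SLOS = [
    SLO("auth",          "successful logins / login attempts",            0.999, 28),
    SLO("public-api",    "non-5xx responses under 300 ms / all requests", 0.995, 28),
    SLO("data-pipeline", "batches landed on time / batches scheduled",    0.99,  28),
]

def attainment(good: int, total: int) -> float:
    """Current SLI value from raw event counts."""
    return good / total if total else 1.0

for slo in SLOS:
    print(f"{slo.service}: {slo.target:.3%} over {slo.window_days}d  ({slo.sli})")

print(f"auth attainment this window: {attainment(998_800, 1_000_000):.3%}")
```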
Months 4-6: Culture Shift
Build blameless, learning-focused culture
Implemented blameless incident reviews. Established on-call rotation rules. Created postmortem action tracking. Started chaos engineering experiments.
Months 7-9: Scale & Automation
Automate toil, focus on high-impact work
Built self-healing automation. Deployed AIOps for anomaly detection. Implemented capacity planning automation. Created observability platform.
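Most of the self-healing automation was deliberately simple: detect a well-understood failure mode, apply the known fix, and escalate to a human when the fix stops working. A rough sketch, with the monitoring and orchestrator calls stubbed out as hypothetical functions:

```python
import time

# Stubbed integration points: the real versions would query the monitoring
# system and call the orchestrator's API. Names here are hypothetical.
def unhealthy_instances(service: str) -> list[str]:
    return []  # e.g. instances failing health checks for more than three minutes

def restart(instance: str) -> None:
    print(f"restarting {instance}")

MAX_RESTARTS = 3  # guardrail: past this, stop healing and page a human

def self_heal(service: str, interval_s: int = 60) -> None:
    restarts: dict[str, int] = {}
    while True:
        for inst in unhealthy_instances(service):
            if restarts.get(inst, 0) >= MAX_RESTARTS:
                print(f"{inst} keeps failing -- escalating to on-call")
                continue
            restart(inst)
            restarts[inst] = restarts.get(inst, 0) + 1
        time.sleep(interval_s)
```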
Months 10-12: Leverage & Impact
Multiply impact through enablement
Extended SRE practices to platform teams. Built an internal SRE academy. Achieved quiet on-call as the norm. Delivered multi-quarter roadmaps.
Practices That Moved the Needle
Error Budget Policies: Turning Chaos Into Calm
Instead of reactive firefighting, teams predictably deprioritized feature work whenever a service burned through its error budget and focused on reliability until the budget recovered. This single rule changed how engineering leadership thought about trade-offs. Sales stopped being blindsided by outages. Product understood the cost of reliability.
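The policy itself fits in a few lines. This sketch (with illustrative numbers, not our real traffic) captures the rule: as long as error budget remains, ship features; once it is gone, reliability work comes first.

```python
def error_budget_remaining(slo_target: float, good: int, total: int) -> float:
    """Fraction of the window's error budget still unspent (0.0 when exhausted)."""
    allowed_bad = (1 - slo_target) * total
    actual_bad = total - good
    return max(0.0, 1 - actual_bad / allowed_bad) if allowed_bad else 0.0

def feature_work_allowed(slo_target: float, good: int, total: int) -> bool:
    """The policy in one line: budget left means ship features; budget gone means reliability first."""
    return error_budget_remaining(slo_target, good, total) > 0

# Illustrative numbers: 99.5% target, 10M requests, 60k failures -> budget exhausted.
print(feature_work_allowed(0.995, good=9_940_000, total=10_000_000))  # False
```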
Blameless Incident Reviews: From Blame to Action
We tracked 100+ action items with a 92% completion rate. This became the heartbeat of continuous improvement. We were not hunting culprits. We were hunting systemic weaknesses.
Golden Signals and High-Signal Alerts
For each service, we tracked the four golden signals: latency, traffic, errors, and saturation. We built composite alerts that fired only when multiple signals indicated a real problem. Result: on-call went from 200+ alerts per week to 15. That's not a typo.
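A composite alert can be as simple as requiring agreement between signals. The thresholds and the two-of-four rule below are illustrative; the real ones were tuned per service.

```python
from dataclasses import dataclass

@dataclass
class Signals:
    p99_latency_ms: float
    error_rate: float     # fraction of requests failing
    traffic_rps: float
    saturation: float     # e.g. CPU or queue utilisation, 0..1

def breaches(s: Signals) -> int:
    """Count how many golden signals look unhealthy (illustrative thresholds)."""
    checks = [
        s.p99_latency_ms > 500,
        s.error_rate > 0.01,
        s.traffic_rps < 1,    # traffic falling off a cliff is also a symptom
        s.saturation > 0.9,
    ]
    return sum(checks)

def should_page(s: Signals) -> bool:
    """Composite alert: page only when at least two golden signals agree."""
    return breaches(s) >= 2

print(should_page(Signals(p99_latency_ms=800, error_rate=0.03, traffic_rps=120, saturation=0.4)))   # True
print(should_page(Signals(p99_latency_ms=800, error_rate=0.001, traffic_rps=120, saturation=0.4)))  # False
```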
Quiet On-Call as a Feature
We gamified this. Quiet on-call weeks were celebrated. We tracked them. By year end, on-call was on alert only 8% of the time. When it did fire, it mattered. Engineers stopped dreading the rotation. This was transformational for retention.
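Tracking quiet weeks was deliberately low-tech; something like this is enough (the week labels and page counts are made up for illustration):

```python
# Each entry: (on-call week, number of pages received that week). Illustrative data.
pages_per_week = [("W36", 0), ("W37", 2), ("W38", 0), ("W39", 0)]

quiet = sum(1 for _, pages in pages_per_week if pages == 0)
print(f"quiet on-call weeks: {quiet}/{len(pages_per_week)} ({quiet / len(pages_per_week):.0%})")
```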
Metrics That Actually Matter
SLO Attainment and Burn Rate
This is your north star. If your target is 99.5% and you're running at 99.1%, you know exactly how much budget you've overspent; if you're above target, you know exactly how much room is left to burn before violating your commitment. Our baseline: 95% SLO targets across all critical services. We achieved 96-97% through the year, which gave us buffer for improvements.
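Burn rate is the other half of the picture: how fast you are spending the budget relative to the rate the SLO allows. A one-function sketch with illustrative numbers:

```python
def burn_rate(slo_target: float, observed_error_rate: float) -> float:
    """How fast the error budget is being spent: 1.0 spends exactly the budget
    over the SLO window, anything above 1.0 exhausts it early."""
    budget_rate = 1 - slo_target
    return observed_error_rate / budget_rate if budget_rate else float("inf")

# Illustrative: a 99.5% target leaves a 0.5% budget; 0.8% observed errors burns it 1.6x too fast.
print(burn_rate(0.995, observed_error_rate=0.008))  # ~1.6
```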
Ticket Age and Flow Efficiency
Work that sits is work nobody owns. The median age of open work tells you whether toil is piling up. At year start, median ticket age was 34 days. By year end, it was 4 days. This told us the automation was working.
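Median age is trivial to compute once you export ticket creation dates; a sketch with made-up dates:

```python
from datetime import date
from statistics import median

# Illustrative open tickets and their creation dates.
open_tickets = [date(2022, 8, 1), date(2022, 9, 10), date(2022, 9, 14)]
today = date(2022, 9, 18)

ages_days = [(today - opened).days for opened in open_tickets]
print(f"median open-ticket age: {median(ages_days)} days")  # 8 days
```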
MTTR vs MTBI
Don't just measure how fast you recover (MTTR). Measure how long systems run between incidents (MTBI, mean time between incidents). Year start: 47 P1 incidents per month with a 156-minute MTTR. Year end: 3 P1 incidents per month with a 28-minute MTTR. We had fewer crises, and we recovered from them faster.
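Both metrics fall out of the same incident log. A sketch with illustrative numbers chosen to land near the year-end figures:

```python
# Illustrative month of P1 incidents: (start_minute, duration_minutes) in a 30-day window.
WINDOW_MINUTES = 30 * 24 * 60
incidents = [(1_000, 20), (12_500, 35), (30_000, 29)]

mttr = sum(duration for _, duration in incidents) / len(incidents)  # mean time to recovery
mtbi = WINDOW_MINUTES / len(incidents)                              # mean time between incidents

print(f"MTTR: {mttr:.0f} min, MTBI: {mtbi / 60:.0f} h")  # MTTR: 28 min, MTBI: 240 h
```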
The One-Year Trajectory
Hiring and Growing the Team
Systems Thinking
We hired for the ability to connect dots across systems. "Walk me through a recent production issue you debugged" revealed whether candidates thought systemically or just at the code level.
Empathy
We looked for engineers who asked "Why?" instead of "Who broke it?" Empathy for users, for operators, for the business separated great SREs from good ones.
Curiosity
We valued the engineer who stayed after hours reading Brendan Gregg's blog posts. Intrinsic motivation to understand system internals was non-negotiable.
Career Ladders
We built explicit ladders valuing enablement and multiplier impact over individual heroics. Senior SREs were measured by how many engineers they leveled up.
Year-End Lessons Learned
The Principle That Unlocked Everything: Product Thinking
In Month 1, we measured reliability for its own sake. In Month 12, we measured reliability because it drove customer retention. That shift changed everything. Executives who didn't care about MTTR suddenly cared about SLO attainment because we tied it to revenue.
Culture Beats Tools Every Time
We spent less on tools than you'd think. Our competitive advantage was the blameless culture. Engineers who trusted that postmortems were learning events, not witch hunts. That culture made the difference.
Quiet On-Call Changes Everything
Month 1, on-call was hell. We fixed the technical problems, sure. But the bigger win was making on-call a non-event. When engineers are not woken up 3x per night, they stay. Retention went from 71% to 94% in on-call rotations.
Ownership Is Underrated
The moment we assigned clear domain ownership, accountability emerged naturally. No more "that's not my system." Engineers owned their domains. They learned them deeply. They took pride. It was a shift from operators to owners.
