Reliability
Intro:
In some ways, this is the topic I’m least excited to write about. For some reason it’s become an extremely opinionated, derisive area of discussion, and not to its benefit: there’s a lot to learn here, and much of the disagreement is a distraction from making clear improvements.
Topics:
- On-call rotation
- Observability
- How reliability process drives learning
- Incident Response – coordinate response
  - communication, coordination, mitigation
  - when should you have multiple roles?
  - after you’ve mitigated, move to remediation
- Incident Reports – drive learning, support aggregation
- Incident Report Clustering: understand where the recurring areas to invest are
- Remediations
  - I find that many incident processes point teams in bad directions despite being run by well-meaning, smart people
  - motion isn’t the same as progress
  - the incident response process isn’t the outcome; the outcome is the outcome
  - if you’re spending more energy feeding the process than fixing systemic weaknesses, your process is eating you
  - which technical and system changes would prevent incidents or significantly reduce their impact?
  - process changes are easy to think about and leaned on too heavily. Only make process changes when they genuinely make sense and are genuinely better than the alternative. So many incident processes iterate meaninglessly on small process improvements instead of focusing energy on the highest-impact outcomes
- Some things to avoid
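
For the incident report clustering topic above, here is a minimal sketch of what "understand where the recurring areas to invest are" could look like in practice. It assumes each incident report has already been tagged by hand with contributing areas and has a rough measure of impact. The `IncidentReport` type, its fields, and `cluster_by_tag` are hypothetical names made up for illustration, not part of any real tool.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class IncidentReport:
    # Hypothetical shape of a report; real reports will carry far more detail.
    title: str
    tags: tuple             # contributing areas, e.g. ("deploy", "config")
    minutes_of_impact: int  # rough impact measure, assumed to be recorded per incident


def cluster_by_tag(reports):
    """Group reports by tag, then rank tags by how often they recur.

    Returns a list of (tag, {"count": ..., "minutes": ...}) pairs, sorted by
    incident count, then total minutes of impact, then tag name.
    """
    totals = defaultdict(lambda: {"count": 0, "minutes": 0})
    for report in reports:
        # set() so a tag repeated within one report is only counted once.
        for tag in set(report.tags):
            totals[tag]["count"] += 1
            totals[tag]["minutes"] += report.minutes_of_impact
    return sorted(
        totals.items(),
        key=lambda item: (-item[1]["count"], -item[1]["minutes"], item[0]),
    )
```

The ranking is only a starting point for discussion: a tag at the top of the list suggests where a systemic fix might pay off, but the tags are only as good as the reports behind them.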
Resources:
- ..