I’ve never heard of a company that has a business, that doesn’t also occasionally have things go wrong. Something going wrong might turn into a support ticket, an angry email, or an alert popping up on an on-call engineer’s phone. If there is user or business impact, and an engineer might need to respond, then it becomes an incident.
After the incident, the folks involved in mitigation fill in an Incident Review Template, and that document is discussed in this meeting, the Incident Review.
Other Approaches to Incident Review
Incident Reviews are a cultural carrier meeting for most engineering organizations. They are a rare meeting where you will see a wide mix of teams and seniority levels arguing about something that the business cares about deeply: customer and employee impact. A well-run Incident Review helps new employees quickly understand how your culture works when things really matter.
An effective Incident Review facilitates these goals:
- Foster and socialize learning about what caused an incident: incidents have a certain inherent rhythm, and the only way to change it is to ensure others are aware. The most valuable thing this meeting does is create awareness of what has actually happened in a given incident, which is the precursor to preventing a repeat
- Surface missing context across teams and functions: customer success might mention an impact to users, an infrastructure engineering team might mention that the incident had a wider impact than initially recognized, a product engineering team might explain the business cost of delayed message processing
- Inform investments in work that will best contribute to increased reliability: broaden an ongoing investment project to support a new edge case, cancel a previous mitigation effort based on improved understanding of the underlying issue, recognize that similar issues are repeating without being successfully addressed
Because of this cultural significance, Incident Reviews also have a predictable tendency to become ideological arenas, and to attract participants with ideological goals about the right way to foster reliability, run reviews, etc. Your goal as the senior leader who owns this meeting is to prevent it from becoming an open ideological discussion forum, and to instead focus it on the specific agenda at hand.
Several patterns to be wary of:
- Ensuring adherence to documented process: some review meetings become focused on driving adherence to the specified incident response or review process. That is valuable work, but ineffective to conduct in a large, learning-oriented forum. Instead, drive adherence before the meeting
- Pedantic or status-oriented: a surprising number of incident discussions end up orienting around policing correct nomenclature rather than encouraging learning and growth. Effective reviews are progress-oriented, with practitioners who explain important context when additive, but don’t orient around policing correctness
- Public performance of a one-person play: effective learning meetings don’t spend much time reading materials or reports out loud. The entire time should be devoted to discussion, perhaps with a short initial window for attendees to read the report. Learning is a group activity, whereas readouts are a solitary performance
- Public performance of two-person play: some meetings adopt a consistent chorus across sessions. A certain set of questions, e.g. “How did you first become aware of this issue?”, will be asked and answered at each session, consuming much of the time. That feels useful, but it implicitly silences the wider group, who are not able to contribute their context and encourage group learning
Finally, as in any important, large meeting, there may sometimes be individuals who are more focused on their personal ideological goals than on the meeting’s goals, and it’s your responsibility to either anchor them on the meeting’s goals or get them out of the meeting so work can be done.
Agenda, Scheduling, and Scaling
The agenda for every incident review is discussion of one to two individual incidents or a cluster of related incidents. The agenda should be decided one to two days ahead of the review, and shared out with attendees to allow them to prepare. Because most learning occurs in discussion, I recommend against trying to include more than two incidents (or one batch of related incidents) in a given session.
Run these on a weekly cadence, canceling ahead of time when there are no incidents to review.
If you start to have a backlog of incidents to review, then you have three options:
- Batch related incidents if you have a cluster of incidents with shared contributing causes. For example, you might have a streak of incidents related to database instability caused by unindexed queries, which would benefit from one curated, joint discussion rather than treating each as an independent incident
- Extend review time for one week to have more incident review bandwidth. This works best when you have a short-term spike in incidents. Generally speaking, it is an organizational smell to permanently extend incident review beyond an hour a week for a large audience, as it’s an expensive investment of time
- Stop discussing lower severity incidents in the review. For example, only discuss incidents with “significant” customer or internal impact, coupled with a simple definition of what incidents would fall beneath the line
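To make the agenda rules concrete, here is a minimal sketch of how a Facilitator might pick the next session's agenda from a backlog. It is illustrative only: the `Incident` fields, the numeric severity scale, the `cause` tag, and the choice to let a cluster take priority over individual incidents are all assumptions, not part of any described process. It covers batching and the severity cutoff; extending the meeting is a scheduling decision and is not modeled.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

# From the text: at most two individual incidents per session.
MAX_INCIDENTS_PER_SESSION = 2


@dataclass
class Incident:
    incident_id: str
    severity: int                # hypothetical scale: 1 is the most severe
    cause: Optional[str] = None  # hypothetical tag for a shared contributing cause


def plan_agenda(backlog, max_severity=3):
    """Return the incidents to discuss in the next review session."""
    # Severity cutoff: skip incidents that fall beneath the line.
    eligible = [i for i in backlog if i.severity <= max_severity]

    # Batching: group incidents that share a contributing cause.
    by_cause = defaultdict(list)
    for incident in eligible:
        if incident.cause is not None:
            by_cause[incident.cause].append(incident)
    clusters = [group for group in by_cause.values() if len(group) > 1]
    if clusters:
        # Assumption: one batch of related incidents fills the whole session.
        return max(clusters, key=len)

    # Otherwise discuss the most severe incidents, capped at two.
    return sorted(eligible, key=lambda i: i.severity)[:MAX_INCIDENTS_PER_SESSION]
```

In practice the Facilitator's judgment matters more than any rule like this; the sketch only shows that the constraints (a cap of two incidents, or one cluster, above a severity line) are simple enough to apply consistently.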
Roles & Attendance
There are five key roles in an Incident Review:
- Facilitator who coordinates the agenda and the conversation
- Presenter who filled in the Incident Review Template for a given incident
- Notetaker who ensures notes from the discussion are captured
- Attendees who share context, ask questions, and learn from the discussion
- Sponsor who provides organizational weight to the meeting through their participation; this is generally either the head of engineering or the head of infrastructure. It is reasonable for the Sponsor to occasionally miss, but I believe it’s essential for them to attend the majority of incident reviews
The Incident Review’s goals, particularly around learning and surfacing missing context, encourage a wide audience of attendees. I recommend allowing anyone to participate so long as they read, and abide by, the meeting’s goals and anti-goals. Ensuring folks act in accordance with the meeting’s goals is a joint responsibility of the Facilitator and the Sponsor.
Is it working?
Some questions to ask yourself if you’re unsure if your meeting is useful:
- Are they getting scheduled? If not, and that’s because you’re truly not having incidents, great! Conversely, if it’s because folks are not filling in the template, then dig into why not. Often these templates get overloaded with many questions to please many stakeholders, and consequently become difficult to use
- Are key personnel attending? Particularly the sorts of folks who have important context to bring into the discussion. If the meeting is working, it should be an exceptionally high-leverage opportunity to grow the organization
- Are the discussions resulting in a modified reliability strategy or roadmap? If these discussions are driving learning, then they should alter the shape of your roadmap
- Do you enjoy attending?