An Incident Management Tool
Service A is unexpectedly flooded with requests, bottlenecking data throughput. It’s 5 pm and the IT folks have already left for the day. The stream grows rapidly until a built-in queue limit is hit and data is dropped. The data feeds a time-sensitive project that depends on complete series to display statistics correctly. When the data team arrives, the statistics are badly skewed and the batch is useless. Everything worked the day before, and suddenly the service couldn’t handle incoming requests. Clearly, someone must have done something. Who pushed the change? And why hasn’t anyone implemented any alert triggers?
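An alert trigger here could be as simple as a periodic queue-depth check that pages someone while there is still headroom. A minimal sketch of the idea, with the metric source, threshold, and notification hook all hypothetical placeholders:

```python
import time

QUEUE_WARN_THRESHOLD = 8_000   # hypothetical: warn well before the built-in drop limit
CHECK_INTERVAL_SECONDS = 30

def get_queue_depth() -> int:
    """Placeholder: read the current queue depth from your metrics system."""
    raise NotImplementedError

def notify_on_call(message: str) -> None:
    """Placeholder: page the on-call engineer via your alerting tool."""
    raise NotImplementedError

def watch_queue() -> None:
    # Page someone while the queue still has headroom, instead of
    # discovering dropped data the next morning.
    while True:
        depth = get_queue_depth()
        if depth >= QUEUE_WARN_THRESHOLD:
            notify_on_call(f"Service A queue depth at {depth}, approaching drop limit")
        time.sleep(CHECK_INTERVAL_SECONDS)
```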
Incidents take an infinite number of shapes. Not every scenario can be traced back to a specific person’s action, but even when it can’t, it’s still easy to put the blame on the people, team, or managers who let things happen, or fail to happen. If only the monitoring goals had been met, if the program had been updated properly, if the change hadn’t been pushed at 5 pm.
Every incident, no matter how minor, should be followed up with a Post Mortem. The Post Mortem covers the actions taken and the timeline of the incident, an analysis of what went well and what could have been done differently, and a list of action points to prevent the incident from recurring. It can take the form of a meeting, a presentation, or a mail conversation, and it results in a written Post Mortem report. It is considered good practice to define a format for these reports. Well performed, the Post Mortem becomes a valuable learning opportunity for anyone with an interest in the system and processes involved - not just those who worked on the incident directly.
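What exactly goes into that format varies between teams. As one hypothetical sketch, the essentials could be captured in a small structure like this:

```python
from dataclasses import dataclass, field

@dataclass
class PostMortemReport:
    # Hypothetical minimal format; adjust the fields to your own team's needs.
    title: str
    date: str
    summary: str                                        # what happened, in a few sentences
    timeline: list = field(default_factory=list)        # timestamped events, detection to resolution
    what_went_well: list = field(default_factory=list)
    what_could_be_improved: list = field(default_factory=list)
    action_points: list = field(default_factory=list)   # concrete follow-ups, each with an owner
```

Whatever the exact fields, the point is that every report answers the same questions in the same order.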
Ok, so the Post Mortem routine is defined, you have created a new repository for Post Mortem reports, and you are a little eager to try out the procedure at the next incident. But you hesitate. The idea is fine; we should learn from mistakes. But putting people’s failures into print feels invasive. Who would ever want to stand up and have their name on a report of failure?
Enter Blameless
The thing is, no one will have to. There is a twist still unaccounted for, as originally defined in the LEAN framework:
People don’t fail. Processes do.
People define processes, and people also redefine, adapt, and improve them. To identify why a process failed, given the information, environment, and tools at hand at that specific point in time, another component of the LEAN framework is useful:
The 5 Whys:
Investigation A (blaming):
Who navigated the boat into the iceberg? (Awkward silence, not a very productive conversation.)
Investigation B (1 WHY):
Why did the ship sink? - It hit an iceberg and broke. (Ok, check, don’t hit icebergs and don’t break. But maybe there’s more to investigate?)
Investigation C (5 WHY):
Why did the ship sink? - Water filled the hull.
Why? - The hull was ruptured.
Why? - Titanic hit an iceberg.
Why? - The ship could not turn and maneuver away in time.
Why? - It was traveling at too high a speed.
You can dig deeper with more whys or branch out into a different line of questioning (Titanic hit an iceberg > the lookout reported the iceberg too late > the lookout couldn’t see the iceberg because it was dark; or Titanic sank > water filled the hull > there were openings in the hull > steel plates pulled apart on the hull > the overlapping joints and rivets were too weak). All of these approaches lead to conclusions that are accurate in their different ways, and each can be used to define action points that prevent a specific event from recurring.
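If you want to keep the why-chains in the report itself, a simple nested structure is enough. A sketch, using one possible layout of the Titanic branches above:

```python
# Each cause maps to the deeper causes uncovered by the next "why".
# Empty dicts are leaves, where the investigation stopped.
cause_tree = {
    "Titanic sank": {
        "Water filled the hull": {
            "The hull was ruptured": {
                "Titanic hit an iceberg": {
                    "The ship could not turn away in time": {
                        "It was traveling at too high a speed": {},
                    },
                    "The lookout reported the iceberg too late": {
                        "The lookout couldn't see the iceberg in the dark": {},
                    },
                },
                "Steel plates pulled apart on the hull": {
                    "Overlapping joints and rivets were too weak": {},
                },
            },
        },
    },
}

def print_chains(tree, path=()):
    """Print every root-to-leaf causal chain as a candidate source of action points."""
    for cause, deeper in tree.items():
        if deeper:
            print_chains(deeper, path + (cause,))
        else:
            print(" > ".join(path + (cause,)))

print_chains(cause_tree)
```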
Benefits if you do use it
A Blameless Post Mortem shows which processes are working well, which could be improved, and, most importantly, which to avoid so that the same situation doesn’t play out the next time the problem occurs.
Risks if you don’t use it
Consider the effect of your team, or an individual, being paralyzed by fear and keeping the lessons from an incident to themselves. Every process failure that goes unexamined increases the risk of the incident recurring.
Tools to use
There are on-call tools that come with a built-in Post Mortem feature, conveniently integrated with alerting and ticket creation. Two of the market leaders, PagerDuty and Atlassian’s OpsGenie, both offer one.
On the Open Source side, there is etsy/morgue, a Post Mortem organizer that integrates with IRC and Jira.
Google provides a Post Mortem template derived from their internal blameless Post Mortem process: https://sre.google/sre-book/example-postmortem/
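On the alerting side these tools integrate with, triggering an incident is usually a single API call. As a sketch against PagerDuty’s Events API v2, with the routing key a placeholder you would take from your own service integration:

```python
import requests

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"
ROUTING_KEY = "YOUR_INTEGRATION_ROUTING_KEY"  # placeholder: from your PagerDuty service integration

def trigger_incident(summary: str, source: str, severity: str = "critical") -> None:
    """Send a trigger event so the on-call engineer is paged."""
    payload = {
        "routing_key": ROUTING_KEY,
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            "severity": severity,
        },
    }
    response = requests.post(PAGERDUTY_EVENTS_URL, json=payload, timeout=10)
    response.raise_for_status()

# Example: the alert Service A never had.
# trigger_incident("Service A queue depth approaching drop limit", "service-a-queue-monitor")
```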
Then what happened?
As for service A, the engineer who investigated the issue remembered to write down the steps taken to narrow down and mitigate the problem. He noted any out-of-the-ordinary observations along the way, as well as areas of improvement that might not have been the direct cause of the issue but would be good to tighten up. He took some quick notes on which alert points to implement to be able to react in time the next time a data batch was at stake. Three days later, a Post Mortem meeting was held, and the lessons were shared both in an internal memo and as a light-hearted presentation at the internal Friday share meeting.
By putting some distance between the engineer and the incident, and presenting the incident as a “show and tell”, a great deal of the stigma of failure is removed. When there’s less space for fear, there’s more space to focus on improving the infrastructure and the incident response process.