These key questions are always answered in a document that we call the post-mortem report. Any time there is a critical incident, a post-mortem report is required. Of course, the report covers more than these questions; for example, it also includes Lessons Learned (in an upcoming article I will take a deeper look at this document). But today, let’s focus on the three key questions:
1. What has happened? (or better yet, Why has it failed?)
In fact, the answer to this question is not difficult. One can say that the server started to behave anomalously (when in fact, one probably has no idea what the hell happened). But a trickier question can arise: Why has this happened? The answer to this question is not easy. Has someone pushed the wrong button? When the customer asks you why, you need to think very carefully about your answer and find a smart response, lest you get into more trouble.
2. What have you done? (to solve it)
The answer to this question is also very easy. You only need to enumerate what your service team has done.
These are all past actions. The only important point is to give the customer accurate, detailed information, so that he or she understands that you have done everything possible to solve it quickly and effectively.
Include the onset time of the outage, the detection time, and each of the actions taken, along with the group or role that performed it. Lastly, include the result of each of these actions.
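To make this concrete, here is a minimal sketch of how such a timeline could be recorded and turned into report lines. The class names (`Action`, `OutageTimeline`), the field names, and all the times and teams in the example are assumptions invented for illustration; they do not come from any particular ITSM tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List


@dataclass
class Action:
    """One action taken during the outage (hypothetical structure)."""
    time: datetime
    performed_by: str  # the group or role that performed the action
    description: str
    result: str


@dataclass
class OutageTimeline:
    """Onset, detection, and the actions taken (hypothetical structure)."""
    onset: datetime
    detected: datetime
    actions: List[Action] = field(default_factory=list)

    def time_to_detect(self) -> timedelta:
        # How long the outage ran before anyone noticed it.
        return self.detected - self.onset

    def report_lines(self) -> List[str]:
        # Onset and detection first, then the actions in chronological order.
        lines = [
            f"{self.onset:%H:%M} Outage begins",
            f"{self.detected:%H:%M} Outage detected "
            f"({self.time_to_detect()} after onset)",
        ]
        for a in sorted(self.actions, key=lambda a: a.time):
            lines.append(
                f"{a.time:%H:%M} [{a.performed_by}] {a.description} -> {a.result}"
            )
        return lines


# Invented example values, for illustration only.
day = datetime(2024, 1, 15)
timeline = OutageTimeline(
    onset=day.replace(hour=9, minute=0),
    detected=day.replace(hour=9, minute=12),
    actions=[
        Action(day.replace(hour=9, minute=40), "Database team",
               "Restarted the database service", "Service restored"),
        Action(day.replace(hour=9, minute=20), "Service desk",
               "Escalated to the database team", "Incident assigned"),
    ],
)

for line in timeline.report_lines():
    print(line)
```

Keeping the result next to each action makes it easy for the customer to see not only what was done, but also what each step achieved.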
3. What are you going to do? (to keep it from happening again)
This is one of the most difficult questions to answer if you don’t know what has happened.
In this case, we are talking about the future, that is, new actions to be performed for the sake of prevention. But many times one is tempted to think that the strange thing will not happen again, since there was no explanation for why it happened in the first place.
Another difficulty is that many times there is no way to reproduce the problem, because when it happened, all the focus was on solving it (for example, the system was reset).
But as a Service Manager, one should not dwell on all these things. One only needs to draw up a good roadmap of all the actions required to manage the aftermath of an IT outage in a professional way. This is like a post-outage project, in which a list of configuration elements needs to be verified. That’s all. If this verification can be done together with the supplier (IBM, HP, Oracle, etc.), so much the better, because the customer will have more confidence in the post-outage review.
What is your best advice to manage the aftermath of an IT Outage?