Many operations managers in large organizations confuse incident management with problem management. In most cases they treat the two as a single process, which is not correct. To make things clear, let us differentiate between the two concepts and define each process.
An incident is an unplanned interruption of any service that impacts its quality of service (QoS). The team needs to investigate what caused the incident so that it can take the proper action to restore the service to its normal condition. Based on this, we can say that the incident process is the way in which the operations team works together to collect as much information as it can, isolate the effect of the incident (not always the root cause of the problem), and take the proper action to resolve the issue for the service users.
Critical Success Factors in Incident Management
- Incident Description: Here you need to understand the customer's pain. Customers usually describe the issue in their own high-level language rather than in technical terms. Asking the customer plenty of questions at this stage helps you understand exactly what the problem is on their side.
- Time: An incident is very time-sensitive, since you need to resolve it as quickly as possible.
- Service (Impact & Urgency): To give the incident the right level of attention, you need to know how important the impacted service is and how large the impact is.
Sample Priority Coding System
- Root cause: It may be identified during or after the incident. In some cases, however, the root cause cannot be identified.
- Corrective Action: Based on the root cause of the incident, you need to take the proper action. However, a proper corrective action is not necessary for every incident. Some incidents are resolved without any action, or with an unrelated one. For instance, a second reboot may resolve many incidents in IT systems even though it is not the best corrective action.
*** Related topics in incident management: KPI, SLA, OLA
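To illustrate how impact and urgency can be combined into a priority code, here is a minimal Python sketch. The three-level scale, the `Level` and `priority` names, and the 1-to-5 codes are assumptions chosen for illustration (a common impact/urgency convention), not the values of the sample priority coding system mentioned above.

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical three-level scale used for both impact and urgency."""
    HIGH = 1
    MEDIUM = 2
    LOW = 3


def priority(impact: Level, urgency: Level) -> int:
    """Return a priority code from 1 (most critical) to 5 (least critical).

    Assumed rule: add the two levels and subtract one, so that
    HIGH/HIGH maps to 1 and LOW/LOW maps to 5.
    """
    return int(impact) + int(urgency) - 1
```

With this assumed rule, `priority(Level.HIGH, Level.MEDIUM)` yields 2, and an incident with low impact and low urgency yields 5. A real organization would agree its own matrix with the business and tie each code to response times in its SLA.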
A problem is the result of one or more incidents. The cause is usually not known at the time of the incident, so a problem record is created. Based on this, we can say that problem management is the process in which the operations team investigates the incidents further in order to identify the root cause.
The key objectives of Problem Management are to eliminate recurring incidents and to minimize the impact of incidents that cannot be prevented. A problem management process should be in place, and effective, for any service in a complicated environment. A complicated environment means a complicated setup that results from many interactions between many systems. So when a service is impacted, the root cause may lie in the service itself or in the systems behind it. The problem management process is a reaction to the incident management process, and it should be given enough importance to be supported by the organization's top management.
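The step from recurring incidents to a problem record can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Incident` fields, the `problem_candidates` function, and the threshold of two occurrences are hypothetical and do not come from any particular ticketing tool.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident log entry."""
    incident_id: str
    service: str
    symptom: str
    root_cause_known: bool


def problem_candidates(incidents, min_occurrences=2):
    """Group incidents with an unknown root cause by (service, symptom).

    Any group that occurs at least `min_occurrences` times is returned
    as a candidate for a problem record, together with the ids of the
    incidents that triggered it.
    """
    groups = defaultdict(list)
    for inc in incidents:
        if not inc.root_cause_known:
            groups[(inc.service, inc.symptom)].append(inc.incident_id)
    return {key: ids for key, ids in groups.items() if len(ids) >= min_occurrences}


# Example log (made-up data for illustration only)
log = [
    Incident("INC-1", "email", "login timeout", False),
    Incident("INC-2", "email", "login timeout", False),
    Incident("INC-3", "billing", "report missing", True),
]
print(problem_candidates(log))
# prints {('email', 'login timeout'): ['INC-1', 'INC-2']}
```

Grouping by service and symptom is a deliberately simple choice. In practice the technical committee would decide which incidents belong to the same problem.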
In problem management, there should be interaction between the senior professionals of the operations team, whom we usually call the “technical committee”, and the problem management committee. It is recommended that the technical committee include people from the operations team who run the service and people from the engineering team who design it. The technical committee should be managed by a problem manager, who owns the problem ticket and communicates with the problem management committee. The problem management committee, on the other hand, should consist of top management from both the technical and business sides. After the investigation, the technical committee, represented by the problem owner, should give the management committee its recommendations. The management committee should then make the final decision.
Critical Success Factors in Problem Management
- Service setup: the complexity of the setup may produce a complex problem.
- Incident logs: the technical committee should be provided with all incidents that might have triggered the problem.
- Skills and experience: Technical expertise plays a valuable role in discovering the root cause and recommending the best action.
- Power of problem management committee: the committee should have the power to apply the recommendations provided by the technical committee.
- Budget availability: Solving problems sometimes requires a larger budget than resolving incidents.
Results from Problem Management Committee
- Changes: Make some changes to the setup.
- Projects: A new deployment of the service, which usually takes a long time and comes at a high cost “everything should be defined in detail”.
- Release the service: When the setup is very old and complicated, a new service may be better than keeping the current one running.
- Survive with the incidents: There is no solution or no budget. This is the worst possible outcome.
Hint: Incident management is highly sensitive to time, while problem management is highly sensitive to the root cause.