Below are my personal thoughts on the key questions an IT incident manager needs to answer, and the information to collect and assess, when investigating a critical system outage.
Description & Detail:
- Provide a high-level description that states the issue.
- Provide a detailed description of what happened, what went wrong or what failed.
- Who first identified the incident and how was it reported?
Incident Time Table:
- When did the incident begin?
- When was the incident reported?
- When was the incident resolved (service restored)?
Impact Analysis:
- Which systems were impacted?
- Which business areas were impacted by the failure?
- What were the business areas unable to access, what did they not receive, or which business functions could not be performed because of the failure?
- How many external customers were impacted by the failure?
- What were the external customers unable to access, what did they not receive, or which services were unavailable because of the failure?
- What is the current and potential financial risk or impact to the organization?
Ownership & Resources:
- Which individual owned the incident through to service restoration?
- Which groups were involved in resolving the incident and restoring service?
Incident Resolution:
- What steps were performed to restore functionality, provide deliverables, or restore access?
Proactive Problem Management:
- What additional follow-up items have been identified?
- What groups need to be involved to accomplish the follow-up items?
- What is being done to make sure this incident does not occur again?
- What would be done differently or what lessons were learned during incident resolution?
- What else should be included as a focus area for ensuring high availability for the service?
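One way to keep track of which of these questions are still open is to hold the answers in a small record and list the empty fields. The sketch below is only an illustration in Python. The class name `IncidentReview`, its field names, and the sample values are hypothetical, and it covers only a subset of the questions above.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class IncidentReview:
    """Hypothetical record holding answers to some of the review questions."""
    summary: Optional[str] = None
    detailed_description: Optional[str] = None
    identified_by: Optional[str] = None
    systems_impacted: List[str] = field(default_factory=list)
    business_areas_impacted: List[str] = field(default_factory=list)
    external_customers_impacted: Optional[int] = None
    incident_owner: Optional[str] = None
    groups_involved: List[str] = field(default_factory=list)
    restoration_steps: List[str] = field(default_factory=list)
    follow_up_items: List[str] = field(default_factory=list)

    def unanswered(self) -> List[str]:
        """Return the names of fields that are still None, blank, or empty."""
        # A count of 0 customers is treated as an answer, not as missing.
        return [
            f.name
            for f in fields(self)
            if getattr(self, f.name) in (None, "", [])
        ]


# Hypothetical, partially completed review.
review = IncidentReview(
    summary="Checkout service unavailable",
    systems_impacted=["checkout-api"],
    external_customers_impacted=0,
)
print(review.unanswered())
```

Printing the unanswered fields before a review meeting gives a quick view of what still has to be collected.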
