
Take Your Incident Investigation Into the Details for Root-cause Analysis

A major system has broken down again. Customers' orders are stuck on the processing line, and IT operations has no idea how long it will take to bring the system back online.

Been there before?

In general, system failure translates into projects going over budget, dissatisfied customers, frustrated employees, reputation damage, financial losses, legal liabilities, and possibly a full reorganization. According to Information Management, "Average downtime costs vary considerably across industries, from approximately $90,000 per hour in the media sector to about $6.48 million per hour for large online brokerages. Downtime costs also vary significantly within industries."

Problems Keep Coming Back

Recurring outages and major incidents can be a nightmare for IT organizations. One of the most important steps toward preventing them is root-cause analysis of the failure, started early in the troubleshooting phase, so that you not only resolve the incident but also head off a recurrence. It is just as critical after the fact, when you need to understand what went wrong, and why, so you can take action to correct it for the future.

Having a well-defined postmortem process for cause analysis in place is one aid to IT action plans, but you should also proactively seek to reduce the potential for additional incidents. Together, these approaches can serve as powerful building blocks for the IT department to increase customer satisfaction and reduce its support costs at the same time.

While ITIL addresses this issue by describing the Problem Management and Incident Management processes and the important roles they play in reducing user downtime, ITIL does not actually provide the system to make this happen. Traditionally, incident investigation consists of a manual process, leading to a long mean time to recovery (MTTR). With the stakes high and pressure on every second, you need to be able not only to resolve incidents quickly but also to make sure the root cause is addressed so that the issue doesn't come up again.

Looking into All Causes of Failure

A good incident management effort doesn't mean just a postmortem review session; it actively looks for answers to resolve the failure, restore operations, and ensure the incident doesn't repeat. Writing off a system breakdown or outage to a single high-level cause without an in-depth cause analysis is like a coroner pronouncing the cause of death before conducting an autopsy. Just like an autopsy, a thorough cause analysis review looks into all the possible causes of the failure, including detailed configuration parameters that may have contributed to it.

Finding the True Root-Cause

IT organizations find themselves challenged when assessing a system failure and tracking down its root cause, such as a patch that wasn't deployed or a server that failed. Even when they manage to suppress a failure and operations return to "normal," the true root cause may remain unresolved, leaving the organization exposed to further havoc. Probing into the deepest levels for the real causes of the incident can uncover valuable information.

A well-designed root-cause analysis should delve deep and identify the factors that brought about the incident, leading to downtime and business impact. The process takes into account the chronology of all the events contributing to the incident, including factors like undocumented changes, human error, and non-validated deployments.
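To make the idea of an event chronology concrete, here is a minimal Python sketch. It assumes events have already been collected from several sources; the `Event` fields, the source labels, the 24-hour lookback window, and the sample data are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    timestamp: datetime
    source: str              # hypothetical labels: "deploy", "config", "monitoring", ...
    description: str
    documented: bool = True  # False when no change record matches the event


def build_timeline(events, incident_start, lookback=timedelta(hours=24)):
    """Return the events inside the lookback window before the incident, oldest first."""
    window_start = incident_start - lookback
    relevant = [e for e in events if window_start <= e.timestamp <= incident_start]
    return sorted(relevant, key=lambda e: e.timestamp)


def undocumented(timeline):
    """Pick out the events that have no matching change record."""
    return [e for e in timeline if not e.documented]


# Illustrative data only; none of this comes from a real incident.
incident_start = datetime(2024, 1, 10, 14, 0)
events = [
    Event(datetime(2024, 1, 10, 13, 40), "deploy", "release pushed without validation"),
    Event(datetime(2024, 1, 8, 9, 0), "change-log", "routine certificate renewal"),
    Event(datetime(2024, 1, 10, 11, 15), "config", "connection pool size edited by hand",
          documented=False),
    Event(datetime(2024, 1, 10, 13, 55), "monitoring", "order queue depth alarm"),
]

timeline = build_timeline(events, incident_start)
for e in timeline:
    flag = "" if e.documented else " [undocumented]"
    print(f"{e.timestamp:%Y-%m-%d %H:%M} {e.source}: {e.description}{flag}")
```

Sorting everything onto one timeline is the simplest possible approach. It will not prove causation, but it puts undocumented changes and unvalidated deployments side by side with the first symptoms, which is usually where the investigation should start.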

The Devil is In the Details

In Incident Management, quickly finding the root cause of environment incidents (including changes, differences, environment configuration, and bill of material) can cut incident investigation time and restore normal service operations sooner, minimizing the impact on business operations. Yet more often than not, the devil is in the details: the cause of a devastating incident has to be identified among abundant configuration information at a deep, granular level.

A configuration management solution that can stay on top of the constantly changing, overwhelming amount of configuration data in IT operations, and quickly deliver actionable information, accelerates the incident management process. 1st Level Support could then use visualized information to identify change areas that could potentially trigger an incident, without diving into the change details. If no ad hoc solution can be found, they can hand off the appropriate detailed information to the relevant 2nd or 3rd Level Support. In the case of Major Incident analysis, an automated configuration management solution would provide a single picture of the environment's bill of material and configuration, along with its drift and consistency. The Major Incident Team could use this information to quickly map the cause of the incident.
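As a rough illustration of what "drift" and "consistency" checks involve, the following Python sketch compares flat key/value configuration snapshots. It is not how any particular product works; the function names, configuration keys, and server names are hypothetical, and real environments would need nested and typed configuration handling.

```python
def diff_config(baseline, current):
    """Drift: compare a current snapshot against an approved baseline."""
    added = {k: current[k] for k in current.keys() - baseline.keys()}
    removed = {k: baseline[k] for k in baseline.keys() - current.keys()}
    changed = {
        k: (baseline[k], current[k])
        for k in baseline.keys() & current.keys()
        if baseline[k] != current[k]
    }
    return {"added": added, "removed": removed, "changed": changed}


def inconsistent_keys(servers):
    """Consistency: find keys whose values differ across servers meant to be identical."""
    all_keys = set().union(*(cfg.keys() for cfg in servers.values()))
    result = {}
    for key in sorted(all_keys):
        values = {name: cfg.get(key) for name, cfg in servers.items()}
        if len(set(values.values())) > 1:
            result[key] = values
    return result


# Illustrative snapshots only (hypothetical keys and values).
baseline = {"jvm.heap": "4g", "db.pool.size": 50, "tls.version": "1.2"}
current = {"jvm.heap": "4g", "db.pool.size": 10, "patch.level": "2024-01"}

drift = diff_config(baseline, current)
print(drift["changed"])

mismatches = inconsistent_keys({"app01": baseline, "app02": current})
print(list(mismatches))
```

A summary like this is the kind of view 1st Level Support could scan for suspicious change areas, while the full before and after values are what would be handed to 2nd or 3rd Level Support.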

Having a comprehensive, automated change management solution for root-cause analysis helps identify the cause of a failure through the complex relationships of processes, systems, and environments that characterize IT environments today.

About the Author
Martin Perlin