I Want Advance Notice of Any Unplanned Outages!
5 Insights into Reducing Unplanned Outages from Dilbert
Dilbert's Pointy-Haired Boss may seem to be demanding the impossible, but there are ways to take on the 'unknown' and manage unplanned IT outages effectively.
Unplanned IT outages are, first and foremost, business issues. In 2008, the average revenue cost of an unplanned application outage was estimated at nearly US$2.8 million per hour, according to a report by IBM Global Services. IT disruptions can affect the entire delivery chain of the products and services a business provides.
According to the IT Process Institute, average downtime can reach 75 minutes per month; for a medium outage, the reported figure is 200 minutes per outage. Gartner reports that average downtime comes to 87 hours per year.
Twenty-four percent of organizations surveyed by KPMG state that an unplanned outage lasting longer than two hours is unacceptable. Another 48 percent state that they cannot cope when unplanned outages exceed 24 hours, according to an article on managing business continuity in Data Based Advisor.
Unplanned IT outages always cost money. According to the aforementioned report, a gap exists, and is widening, between the cost of unplanned downtime and an organization's ability to respond effectively with traditional activities. Traditional responses, such as rebooting servers and repeatedly retrying restoration steps that have already failed, combined with incident management's limited understanding of the technical environment, among other factors, do indeed slow restoration and cost the business money.
Being able to quickly identify the critical configuration differences that could trigger an incident lets incident teams focus their efforts on a small group of changes, rather than branching out in many directions and wasting valuable time on trial-and-error approaches.
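To make this concrete, here is a minimal Python sketch of a configuration comparison. It assumes each environment's configuration can be exported as a nested dictionary (for example, parsed from YAML or JSON); the functions `flatten` and `diff_configs` and the sample settings are hypothetical illustrations, not part of any Evolven API. Flattening every setting into a dotted key path means a single changed parameter surfaces as one line item, which is what lets a team narrow thousands of parameters down to a handful of differences.

```python
from typing import Any, Dict


def flatten(config: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten a nested configuration dict into {"dotted.key.path": value}."""
    flat: Dict[str, Any] = {}
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, path))  # recurse into nested sections
        else:
            flat[path] = value
    return flat


def diff_configs(before: Dict[str, Any], after: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Return the settings added, removed, and changed between two configurations."""
    old, new = flatten(before), flatten(after)
    return {
        "added": {k: new[k] for k in new.keys() - old.keys()},
        "removed": {k: old[k] for k in old.keys() - new.keys()},
        "changed": {k: (old[k], new[k]) for k in old.keys() & new.keys() if old[k] != new[k]},
    }


# Hypothetical example: a snapshot taken while the system was healthy vs. today.
healthy = {"db": {"pool_size": 50, "timeout_s": 30}, "jvm": {"heap": "4g"}}
current = {"db": {"pool_size": 5, "timeout_s": 30}, "jvm": {"heap": "4g", "gc": "G1"}}

for kind, items in diff_configs(healthy, current).items():
    for key, value in sorted(items.items()):
        print(f"{kind:8} {key}: {value}")
```

Only two of the five settings are reported, so an investigator starts from `db.pool_size` and `jvm.gc` rather than from the whole configuration.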
- As the number of underlying changes grows and the IT landscape evolves, keeping track of everything happening in IT environments becomes harder. A typical environment includes thousands of configuration parameters, and a single misconfigured or overlooked setting can cause an incident that leads to an unplanned outage. To confront this, IT teams can compare:
  - the environment's configuration and content at the most granular level, in order to capture the root cause of issues and incidents, no matter how minute
  - the entire environment, covering a wide variety of applications and their underlying infrastructure
  They can then have the resulting information analyzed for the criticality and impact of each change and difference.
- To investigate environment incidents and quickly identify the configuration changes and differences at the root of an incident, incident teams should compare the problematic environment with:
  - a previous, historical snapshot of the environment under investigation (taken when it was working well), to identify the granular changes that might have triggered the incident
  - the last good baseline, to identify configuration and bill-of-materials drift that could have led to the incident
  - a working, comparable environment, to identify the environment content or configuration causing the incident
- The Incident Management process involves numerous steps and stakeholders (best defined by ITIL). When the IT team has change and difference information, it can analyze incidents efficiently at each step of this process and for each role. For example, Tier I personnel can use change and difference information to identify the areas of change that could trigger an incident, without diving into the change details. They can then hand off the appropriate detailed information to the relevant Tier II and Tier III specialists, who can use it to decide whether the environment triggered the incident.
- In Major Incident analysis, when the IT team has a single picture of an environment's bill of materials and configuration, including its drift and consistency, the "war room" team can use this information to quickly map the cause of the incident.
- To prevent environment incidents and unplanned outages, IT teams should run a "Drift Comparison" to proactively identify undesired changes and differences before they turn into environment incidents. This could be carried out as a daily scheduled comparison during the maintenance window.
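The three comparisons recommended above for investigating an incident (against a historical snapshot, the last good baseline, and a working comparable environment) can be sketched in a few lines of Python. This is a minimal illustration, assuming each environment's configuration has already been collected into a flat mapping of setting names to values; the reference names `snapshot`, `baseline`, and `peer` and all sample values are hypothetical. The `suspects` helper applies one possible heuristic that the article itself does not prescribe: a setting that differs from every reference is a strong root-cause candidate.

```python
from typing import Any, Dict, List, Tuple

Config = Dict[str, Any]
Difference = Tuple[str, Any, Any]  # (setting, reference value, problem value)


def differences(reference: Config, problem: Config) -> List[Difference]:
    """List settings whose values differ; a missing setting shows up as None."""
    keys = sorted(reference.keys() | problem.keys())
    return [(k, reference.get(k), problem.get(k)) for k in keys
            if reference.get(k) != problem.get(k)]


def investigate(problem: Config, references: Dict[str, Config]) -> Dict[str, List[Difference]]:
    """Compare the problematic environment with each reference environment."""
    return {name: differences(ref, problem) for name, ref in references.items()}


def suspects(report: Dict[str, List[Difference]]) -> List[str]:
    """Heuristic (an assumption): settings that differ from every reference."""
    per_ref = [{setting for setting, _, _ in diffs} for diffs in report.values()]
    return sorted(set.intersection(*per_ref)) if per_ref else []


# Hypothetical data for the environment under investigation and its references.
problem = {"app.version": "2.4.1", "db.pool_size": 5, "tls.protocol": "TLSv1.2", "cache.ttl_s": 600}
snapshot = {"app.version": "2.4.0", "db.pool_size": 50, "tls.protocol": "TLSv1.2", "cache.ttl_s": 600}
baseline = {"app.version": "2.4.1", "db.pool_size": 50, "tls.protocol": "TLSv1.2", "cache.ttl_s": 300}
peer = {"app.version": "2.4.1", "db.pool_size": 50, "tls.protocol": "TLSv1.2", "cache.ttl_s": 600}

report = investigate(problem, {"snapshot": snapshot, "baseline": baseline, "peer": peer})
for name, diffs in report.items():
    for setting, ref_value, value in diffs:
        print(f"{name}: {setting} {ref_value!r} -> {value!r}")
print("suspects:", suspects(report))
```

Each reference answers a different question: what changed since the environment last worked, what drifted from the approved baseline, and what differs from an environment that works today. Here only `db.pool_size` appears in all three comparisons.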
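Analyzing differences for criticality, and sharing them at the right level of detail for each support tier, might look like the following Python sketch. It assumes a hand-maintained table of criticality rules keyed by setting prefix; the rule table, the scores, and the split between a Tier I summary and a Tier II/III ranked list are hypothetical illustrations, not ITIL definitions or any product's scoring model.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical criticality rules: the longest matching prefix wins; higher = more critical.
CRITICALITY_RULES: Dict[str, int] = {
    "db.": 3,
    "security.": 3,
    "jvm.": 2,
    "logging.": 1,
}
DEFAULT_CRITICALITY = 1


def criticality(setting: str) -> int:
    """Score a setting using the most specific matching rule, else the default."""
    matches = [prefix for prefix in CRITICALITY_RULES if setting.startswith(prefix)]
    return CRITICALITY_RULES[max(matches, key=len)] if matches else DEFAULT_CRITICALITY


def rank(changed: List[str]) -> List[Tuple[int, str]]:
    """Tier II/III detail: every changed setting, most critical first, then by name."""
    return sorted(((criticality(s), s) for s in changed), key=lambda t: (-t[0], t[1]))


def tier1_summary(changed: List[str]) -> Dict[str, int]:
    """Tier I view: how many changes fall into each area, without the details."""
    return dict(Counter(s.split(".", 1)[0] for s in changed))


changed = ["logging.level", "db.pool_size", "jvm.heap", "db.timeout_s"]
print(tier1_summary(changed))
for score, setting in rank(changed):
    print(score, setting)
```

The summary tells a Tier I analyst that the database area saw the most change, while the ranked list gives specialists the individual settings to examine first.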
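A proactive drift comparison can be approximated by comparing today's configuration with the approved baseline and filtering out changes that were authorized. The Python sketch below assumes flat configuration mappings with hashable (scalar) values and a hypothetical set of approved changes, for example exported from a change-management system; the names `detect_drift` and `DriftItem` and the sample data are illustrative only.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Set, Tuple


@dataclass(frozen=True)
class DriftItem:
    setting: str
    expected: Any  # value in the approved baseline (None if absent)
    actual: Any    # value found today (None if absent)


def detect_drift(baseline: Dict[str, Any], current: Dict[str, Any],
                 approved: Set[Tuple[str, Any]]) -> List[DriftItem]:
    """Return differences from the baseline that no approved change explains.

    `approved` holds hypothetical (setting, new_value) pairs that were authorized.
    """
    drift = []
    for setting in sorted(baseline.keys() | current.keys()):
        expected, actual = baseline.get(setting), current.get(setting)
        if expected != actual and (setting, actual) not in approved:
            drift.append(DriftItem(setting, expected, actual))
    return drift


# Hypothetical data: one authorized change (heap size) and two undesired ones.
baseline = {"db.pool_size": 50, "jvm.heap": "4g", "ntp.server": "time.internal"}
current = {"db.pool_size": 50, "jvm.heap": "8g", "ntp.server": "pool.ntp.org", "debug.enabled": True}
approved = {("jvm.heap", "8g")}

if __name__ == "__main__":
    for item in detect_drift(baseline, current, approved):
        print(f"DRIFT {item.setting}: expected {item.expected!r}, found {item.actual!r}")
```

To run it daily during the maintenance window, the script could be scheduled with cron, for instance with an entry such as `0 2 * * * python3 drift_check.py` (the file name and the 02:00 time are assumptions). Only the undesired changes, here the NTP server and the newly enabled debug flag, would be reported.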
Want to learn more about how to slash your incident investigation time and effort in order to minimize system downtime and prevent unplanned outages from impacting operations?
See the Evolven solution for accelerating Incident Investigation and find out how to keep your IT environment available and performing as planned.