
A 5 Point Game-Plan For Reducing IT Incidents



Every time there's an IT incident, there's an impact on IT's resources and on IT's reputation. 2nd level support gets involved, managers get involved (when there's a problem or there's an escalation) and, no matter how you look at it, the customer (whether external or internal) feels the impact.

Every incident carries the risk of IT's reputation being eroded and confidence in IT being lost.
The best way to improve the support you provide to your customer is…to reduce the need to provide support in the first place. By proactively reducing the number of incidents experienced each month, the IT department can increase customer satisfaction and reduce its support costs at the same time.

So, what can you do to reduce the volume of IT incidents in your IT department?

Here's a 5 point game-plan for doing just that. How many of these are you doing already?

#1 Enhance ITIL Change Management with Granular Configuration Automation

According to the itSMF (IT Service Management Forum), 80% of incidents are caused by changes made to the IT environment. ITIL Change Management is a quality control that helps to stop people doing whatever they like in production.

When ITIL Configuration Management is done well (the configuration items and their relationships are accurate and meaningful, and the CMDB, the tool where the configuration information is stored and managed, is actually used by people), it can help reduce incidents. However, a CMDB gives only the big picture; it lacks visibility into and control over the complex, interdependent mesh of datacenter applications and infrastructure components.

A CMDB complemented with a granular configuration automation solution gives you the insight needed to act on even the most minute changes in the environment.

By applying an automated analysis of granular configuration activity, you can focus on a different challenge: comparing environments at the most granular level to intelligently identify the smallest changes and differences that put environment stability at risk.
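As a rough illustration of what a granular comparison means, here is a minimal Python sketch. It assumes each environment's configuration has already been collected into a nested dictionary; the function names and sample values (`staging`, `production`, `jvm.heap_mb` and so on) are hypothetical and do not come from any particular tool.

```python
from typing import Any


def flatten(config: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten nested settings into dotted parameter paths (e.g. 'jvm.heap_mb')."""
    flat: dict[str, Any] = {}
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def compare_environments(left: dict[str, Any], right: dict[str, Any]) -> dict[str, Any]:
    """Report parameters that exist on only one side or whose values differ."""
    a, b = flatten(left), flatten(right)
    return {
        "only_in_left": sorted(a.keys() - b.keys()),
        "only_in_right": sorted(b.keys() - a.keys()),
        "changed": {k: (a[k], b[k]) for k in sorted(a.keys() & b.keys()) if a[k] != b[k]},
    }


# Hypothetical sample data for two environments.
staging = {"jvm": {"heap_mb": 2048, "gc": "G1"}, "db": {"pool_size": 20}}
production = {"jvm": {"heap_mb": 4096, "gc": "G1"}, "db": {"pool_size": 20, "timeout_s": 30}}

diff = compare_environments(staging, production)
print(diff)
```

Flattening to dotted paths is one simple way to make every individual parameter comparable; a real solution would also need to collect the configuration from each system in the first place.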

#2 Validate Releases

Multiple transitions between environments exist along the application lifecycle, including the most challenging one: release into production. Those releases are highly visible, typically entail stabilization periods, and can even cause downtime with a direct impact on revenue and profitability.

Solid management of releases (both their preparation and the way they are introduced into production) will prevent incidents before they happen. Release Managers should validate that the integrity of the live environment is protected and that execution adheres to the release plan.

By ensuring detailed visibility of what is being released down to the most granular level of the configuration parameter, you can:

  • Compare production and pre-production environments prior to release in order to ensure that pre-production sufficiently emulates production 
  • Detect changes introduced by release in pre-production to ensure that testing and release teams are aware of the exact release content
  • Compare production and pre-production after release to verify the accurate transition of release content and configuration

This can bring certainty to releases and reduce the high-impact risks associated with them.
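The checks listed above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: each environment snapshot is a flat dictionary of parameter names to values, and every name and value shown (`changes_between`, `verify_release`, `cache.ttl_s`, and so on) is hypothetical.

```python
ABSENT = "<absent>"  # placeholder for a parameter missing from one snapshot


def changes_between(before: dict, after: dict) -> dict:
    """Return {parameter: (old, new)} for every added, removed, or changed parameter."""
    return {
        key: (before.get(key, ABSENT), after.get(key, ABSENT))
        for key in sorted(before.keys() | after.keys())
        if before.get(key, ABSENT) != after.get(key, ABSENT)
    }


def verify_release(release_content: dict, prod_before: dict, prod_after: dict) -> dict:
    """Compare what changed in production against the release content seen in pre-production."""
    observed = changes_between(prod_before, prod_after)
    expected_new = {key: new for key, (_, new) in release_content.items()}
    return {
        # Tested changes that did not arrive in production with the tested value.
        "missing": sorted(k for k, v in expected_new.items() if prod_after.get(k, ABSENT) != v),
        # Production changes that were never part of the tested release.
        "unexpected": sorted(k for k in observed if k not in expected_new),
    }


# Hypothetical snapshots taken before and after the release in each environment.
preprod_before = {"app.version": "1.4", "cache.ttl_s": 60, "log.level": "INFO"}
preprod_after = {"app.version": "1.5", "cache.ttl_s": 120, "log.level": "INFO"}
prod_before = {"app.version": "1.4", "cache.ttl_s": 60, "log.level": "INFO"}
prod_after = {"app.version": "1.5", "cache.ttl_s": 60, "log.level": "DEBUG"}

release_content = changes_between(preprod_before, preprod_after)
report = verify_release(release_content, prod_before, prod_after)
print(report)
```

In this made-up example the cache setting never reached production and the log level changed without being part of the release, which is exactly the kind of discrepancy a post-release comparison is meant to surface.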

#3 Proactively Watch for Drift

By generating alerts that warn IT operations of looming problems, for example when a messaging platform is turned off or a particular value is changed on an application server, you can take steps to prevent incidents. By running a drift comparison, which compares the environment to an earlier point on the timeline, you can proactively identify undesired changes and differences before they turn into environment incidents.

Event Management technology can be used to detect critical and non-critical changes in an environment on an ongoing basis. By scheduling a daily comparison during the maintenance window, you can review and validate critical changes to the environment before problems occur. Domain experts can be notified before service is impacted, and they can take steps to return things to normal before there is an incident.
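To make the idea of a drift comparison concrete, here is a small Python sketch. It assumes a stored baseline snapshot and a current snapshot, both as flat dictionaries, plus a hand-maintained set of parameters considered critical. All names and values (`DriftAlert`, `detect_drift`, `mq.enabled`, and so on) are hypothetical; scheduling and notification are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DriftAlert:
    parameter: str
    baseline: object
    current: object
    critical: bool


def detect_drift(baseline: dict, current: dict, critical_params: set) -> list:
    """Return one alert per parameter that differs from the baseline, critical ones first."""
    alerts = []
    for param in sorted(baseline.keys() | current.keys()):
        # Note: .get() reports a missing parameter as None in this simplified sketch.
        old, new = baseline.get(param), current.get(param)
        if old != new:
            alerts.append(DriftAlert(param, old, new, param in critical_params))
    # sorted() is stable, so alphabetical order is kept within each group.
    return sorted(alerts, key=lambda alert: not alert.critical)


# Hypothetical snapshots: yesterday's baseline versus today's state.
baseline = {"mq.enabled": True, "appserver.max_threads": 200, "ui.theme": "light"}
current = {"mq.enabled": False, "appserver.max_threads": 200, "ui.theme": "dark"}
critical = {"mq.enabled", "appserver.max_threads"}

alerts = detect_drift(baseline, current, critical)
for alert in alerts:
    label = "CRITICAL" if alert.critical else "info"
    print(f"[{label}] {alert.parameter}: {alert.baseline!r} -> {alert.current!r}")
```

A job like this could run once a day in the maintenance window, with the critical alerts routed to the relevant domain experts.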

#4 Analyze Root Cause

After a major incident, root cause analysis should be conducted to understand what caused the incident and how it can be prevented from happening again. This would usually be done as part of the Major Incident Review.

Focus on the most granular level of configuration parameters. You need to drill deep to uncover the most minute misconfigurations, which are often the root causes of high-impact environment incidents.
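One simple starting point for this kind of drill-down is to list the configuration changes recorded shortly before the incident. The Python sketch below assumes a change log already exists as a list of records with a timestamp; the record layout, the 24-hour window, and the sample entries are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def candidate_causes(change_log: list, incident_time: datetime,
                     window: timedelta = timedelta(hours=24)) -> list:
    """Return changes recorded within `window` before the incident, most recent first."""
    start = incident_time - window
    recent = [change for change in change_log if start <= change["at"] <= incident_time]
    return sorted(recent, key=lambda change: change["at"], reverse=True)


# Hypothetical change records; a real log would come from your own tooling.
change_log = [
    {"at": datetime(2024, 3, 1, 9, 0), "parameter": "db.pool_size", "old": 20, "new": 5},
    {"at": datetime(2024, 3, 4, 22, 30), "parameter": "jvm.heap_mb", "old": 4096, "new": 1024},
    {"at": datetime(2024, 3, 5, 1, 15), "parameter": "mq.enabled", "old": True, "new": False},
]
incident_time = datetime(2024, 3, 5, 2, 0)

suspects = candidate_causes(change_log, incident_time)
for change in suspects:
    print(change["at"], change["parameter"], change["old"], "->", change["new"])
```

Recency alone does not prove causation, so a list like this is only a shortlist for the Major Incident Review to work through.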

#5 Grow and Share Knowledge

Environments include thousands of configuration parameters that impact environment stability. Unless it is made actionable, this abundance of data is just noise for IT teams.

So how do you turn configuration parameters into actionable information?

You need to enrich your configuration knowledge base with your in-house expertise: define the criticality and significance of the relevant configuration parameters and of identified changes, and record the potential impact of those changes on your applications. With a central, customizable knowledge base you can determine the impact and severity of each difference identified.

This knowledge base should be easily customizable by your stakeholders and domain specialists. That lets the organization capture and share the knowledge spread among them, tap into expertise it already has, and maintain stability more efficiently.
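As a minimal sketch of what such a knowledge base might look like, the Python below stores rules as simple pattern, severity, and impact entries that stakeholders could edit. The rule format, the parameter names, and the severity labels are hypothetical; a real knowledge base would likely live in a shared store rather than in code.

```python
from fnmatch import fnmatchcase

# Hypothetical rules contributed by domain specialists: (pattern, severity, impact note).
KNOWLEDGE_BASE = [
    ("db.pool_*", "high", "Connection pool changes can exhaust database connections."),
    ("jvm.*", "medium", "JVM tuning changes may affect memory use and latency."),
    ("ui.*", "low", "Cosmetic setting; no stability impact expected."),
]

DEFAULT = ("unknown", "No rule defined yet; ask a domain expert and record the answer.")


def assess(parameter: str, rules: list = KNOWLEDGE_BASE) -> tuple:
    """Return (severity, impact note) from the first rule whose pattern matches."""
    for pattern, severity, impact in rules:
        if fnmatchcase(parameter, pattern):
            return severity, impact
    return DEFAULT


# Example: classify a handful of detected differences.
for parameter in ["db.pool_size", "jvm.heap_mb", "net.mtu"]:
    severity, impact = assess(parameter)
    print(f"{parameter}: {severity} ({impact})")
```

Returning an explicit "unknown" for unmatched parameters is a deliberate choice here: it turns every gap in the knowledge base into a prompt to capture someone's expertise.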


About the Author
Martin Perlin