Staying Ahead of the Next IT Nightmare
Recently we saw in InfoWorld a list of five rules for better troubleshooting of downed mission-critical systems. On the front lines of IT, chances are you've seen a mission-critical production system fall flat on its face while you had no idea why or how to begin fixing it, with end users demanding answers and pressure mounting to get everything up and running again.
Staying cool under pressure isn't easy, no matter how many times you've been tossed into the IT operations fire. IT professionals often face seemingly incomprehensible problems, and the only thing worse than not knowing how to fix a problem is fixing it without knowing how or why. For many, the first instinct is to dive in and start making changes, hoping for a quick fix. These random shots may even solve the problem. But those very changes may also threaten your chances of recovering data and eliminate any possibility of determining the root cause, which can extend the outage.
In addition to the tips Matt Prigge put together in his article on troubleshooting downed mission-critical systems, we've put together a few more for staying ahead of the next IT performance incident.
Watch for Drift Proactively
When a messaging platform is turned off or a particular value is changed on an application server, you need to know about it as early as possible. Automatically generated alerts that warn IT operations of such problems let you take steps to prevent incidents. By proactively identifying undesired changes and differences, you can confront situations before they turn into environment incidents.
When you apply event management technology to your IT operations, you can proactively detect critical and non-critical changes in an environment on an ongoing basis. This allows you to review critical changes identified in the environment and validate them before problems occur. Domain experts can be notified before service is impacted and can take steps to return things to normal before an incident even threatens operations. A good incident management effort isn't just a postmortem review session; it actively looks for answers to resolve the failure, restore operations, and ensure the incident doesn't repeat.
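The idea of proactive drift detection can be illustrated with a minimal sketch: compare a server's current configuration against a known-good baseline and raise an alert for every difference. The parameter names and values below are purely hypothetical examples, not taken from any particular platform.

```python
def detect_drift(baseline: dict, current: dict) -> list:
    """Return human-readable alerts for parameters that drifted from baseline."""
    alerts = []
    # Parameters whose value no longer matches the baseline.
    for key, expected in baseline.items():
        actual = current.get(key)
        if actual != expected:
            alerts.append(f"DRIFT: {key} changed from {expected!r} to {actual!r}")
    # Parameters that appeared outside the baseline are also suspicious.
    for key in current.keys() - baseline.keys():
        alerts.append(f"DRIFT: unexpected parameter {key} = {current[key]!r}")
    return alerts


# Hypothetical baseline vs. what monitoring just collected.
baseline = {"messaging.enabled": True, "pool.max_connections": 200}
current = {"messaging.enabled": False, "pool.max_connections": 200,
           "debug.verbose": True}

for alert in detect_drift(baseline, current):
    print(alert)
```

In a real environment the alerts would be routed to the relevant domain expert (e.g. via a ticketing or paging system) rather than printed, but the core step is the same: a continuous comparison against a validated state.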
Incident Management and Resolution
A manual incident investigation process is inefficient, leading to long mean time to resolution (MTTR). Today's IT support staff have enough on their plates chasing down known problems, and really don't have time to carefully evaluate every single change that occurs on every single supported system.
Whenever an incident occurs, there is a possibility that an unauthorized change took place. It worked before; now it doesn't. So what happened? What change took place in the last hour?
With incident management tools that analyze configuration parameters at a detailed level, you can correlate the incident with changes or reconfigurations that occurred in the environment and cut MTTR.
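The "what changed in the last hour?" question above can be sketched as a simple time-window query over a change log. This is an illustrative assumption of how such a correlation step might look, with made-up change records, not the API of any real incident management tool.

```python
from datetime import datetime, timedelta


def recent_changes(change_log, incident_time, window=timedelta(hours=1)):
    """Return changes recorded within `window` before the incident started."""
    return [c for c in change_log
            if incident_time - window <= c["time"] <= incident_time]


# Hypothetical change log entries collected by configuration monitoring.
change_log = [
    {"time": datetime(2024, 5, 1, 9, 15), "param": "jvm.heap_size", "new": "2g"},
    {"time": datetime(2024, 5, 1, 13, 40), "param": "db.timeout", "new": "5s"},
]
incident_start = datetime(2024, 5, 1, 14, 5)

for change in recent_changes(change_log, incident_start):
    print(f"candidate cause: {change['param']} changed at {change['time']}")
```

Changes that landed just before the incident are only candidate causes, not proof; the point is to shrink the search space from "every parameter" to "what moved recently."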
Analyze Root Cause
Recurring outages and major incidents are horror stories for IT organizations. One of the most important steps toward preventing them is root-cause analysis of the failure, started early in the troubleshooting phase, so that you not only resolve the incident but also head off a recurrence.
After a major incident, as part of the major incident review, root cause analysis needs to be carried out to properly understand what caused the incident and, more importantly, how to prevent it from happening again. To get to the bottom of the matter, you need to focus on the most granular level of configuration parameters, drilling deep to uncover the minute misconfigurations that are often the root cause of high-impact environment incidents.
When you have a working server and a non-functioning server, you can analyze both to isolate the change, which may have been an undocumented change introduced to the server environment. By comparing working and non-working environments, you can identify whether a change wasn't captured, accelerating the incident resolution process.
By leveraging analytics, you can accelerate incident resolution: compare the current situation against a validated configuration that always worked, see the differences between the problematic environment and a "golden baseline" of the working configuration, and identify discrepancies that could be the root cause of the incident.
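The golden-baseline comparison described above boils down to a diff between two configuration snapshots. A minimal sketch, assuming both environments can be represented as flat parameter dictionaries (real tools handle nested and typed configuration, but the principle is the same):

```python
def diff_configs(golden: dict, problem: dict) -> dict:
    """Map each differing parameter to its (golden, problem) value pair.

    A value of None means the parameter is absent in that environment.
    """
    all_keys = golden.keys() | problem.keys()
    return {k: (golden.get(k), problem.get(k))
            for k in sorted(all_keys)
            if golden.get(k) != problem.get(k)}


# Hypothetical snapshots: the working baseline vs. the failing environment.
golden = {"db.timeout": "30s", "tls.enabled": True, "threads": 8}
problem = {"db.timeout": "5s", "tls.enabled": True, "threads": 8,
           "debug.verbose": True}

for param, (good, bad) in diff_configs(golden, problem).items():
    print(f"{param}: baseline={good!r}, problem={bad!r}")
```

Every entry in the resulting diff is a discrepancy worth validating; parameters identical in both environments can be ruled out immediately.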
Auditing the Problem
Ensuring compliance with corporate governance is critical for the smooth functioning of the organization. Audits provide key feedback for process improvement, since you can monitor history and see what worked and what didn't. This can keep you from going in circles, trying the same things over and over; without it, you'll often end up repeating troubleshooting steps, costing you more time in the end. By analyzing configuration changes, you can carry out audits that meet compliance requirements.
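As a rough sketch of what such an audit over configuration changes might look like, the snippet below summarizes a hypothetical change history and flags entries that lack an approval ticket. The record fields ("param", "by", "ticket") are illustrative assumptions, not a standard schema.

```python
def audit_report(history: list) -> dict:
    """Summarize a change history: total changes and those lacking approval."""
    unauthorized = [h for h in history if not h.get("ticket")]
    return {"total": len(history), "unauthorized": unauthorized}


# Hypothetical change records gathered for an audit period.
history = [
    {"param": "tls.cert_path", "by": "alice", "ticket": "CHG-1042"},
    {"param": "cache.size", "by": "bob", "ticket": None},
]

report = audit_report(history)
print(f"{report['total']} changes, "
      f"{len(report['unauthorized'])} without an approval ticket")
```

Keeping this history queryable is what lets you avoid re-running the same troubleshooting steps: the record shows which changes were already tried and vetted.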
Grow and Share Knowledge
Environments include thousands of configuration parameters that impact environment stability. If not made actionable, this abundance of data is just noise for IT operations teams.
So how do you turn piles of configuration parameters into actionable information?
You need to enrich your configuration knowledge base with your in-house expertise by defining the criticality and significance of relevant configuration parameters and identified changes, along with the potential impact of those changes on your applications.
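One way to picture that enrichment: annotate parameters with an in-house criticality rating, then use it to separate changes that deserve an alert from those that are just noise. The criticality map and parameter names below are hypothetical stand-ins for a team's own knowledge base.

```python
# Hypothetical in-house knowledge base: how much each parameter matters.
CRITICALITY = {
    "db.connection_string": "critical",
    "jvm.heap_size": "high",
    "ui.theme": "low",
}


def triage(changed_params: list) -> tuple:
    """Split changed parameters into (alerts, noise) by rated criticality."""
    alerts, noise = [], []
    for param in changed_params:
        level = CRITICALITY.get(param, "unknown")
        if level in ("critical", "high"):
            alerts.append((param, level))
        else:
            noise.append((param, level))
    return alerts, noise


alerts, noise = triage(["ui.theme", "db.connection_string"])
print("alert on:", alerts)
print("log only:", noise)
```

The value here is the mapping itself: once domain experts encode which parameters matter, thousands of raw change events collapse into a short list worth a human's attention.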
Some very complex applications require IT operations to set tens of thousands of parameters, and the frequency of change is so great that it can result in chaotic modifications. As we outlined here, training and change management solutions must operate at a fine level of granularity, providing both control and validation of configuration parameters and detecting unauthorized changes before they impact production.