Stuck In Reactive Firefighting Mode? (Part 2)
This article is part of a 5-part series covering some of the ways to deal with the top challenges for IT Operations and how machine learning techniques can be applied to address them.
- Having Trouble Finding The Root Cause
- Stuck In Reactive Firefighting Mode?
- Not Sure What is Important
- Can not See the Forest for the Trees
- Overwhelmed with false alarms?
Part 2 of a 5 part series
- - - -
How can one actually prevent incidents from happening? Machine learning can help in this area as well. In a typical incident timeline, something is changed in the system, and after a while an incident occurs. Then a monitoring tool, such as APM or log-error monitoring, sends alerts that something is wrong. Only then is an incident-resolution team organized into a war room to address the incident and introduce a fix. A lot of time passes between the moment a change is introduced and the moment the incident is resolved, and the risk to critical business applications increases over that time.
Frequent Pattern Mining
How can the time to resolution be cut down? Frequent pattern mining can be applied. The goal of this approach is to identify events that frequently appear together. For instance, there may be an incident every time a new version of the application is deployed, with a small fraction of users affected. The system can automatically pick up such patterns and help avoid the issues: it can notify the operators even before they start a deployment, with an alert like “Be careful when the deployment starts since a segment of users will be affected”. Another instance is a firewall change that causes some applications to fail. Before the firewall settings are changed, an alert is triggered saying “This change will affect the connectivity of the following web applications”.
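To make the idea concrete, here is a minimal sketch of pair-wise frequent pattern mining over an event log. The event names, the windowing into sets, and the `min_support` value are all hypothetical, chosen only for illustration; a production system would likely use a full algorithm such as Apriori or FP-Growth over much larger data.

```python
from collections import Counter
from itertools import combinations

# Hypothetical event log: each set holds the events seen in one time window.
windows = [
    {"deploy_app", "user_errors"},
    {"deploy_app", "user_errors"},
    {"firewall_change", "web_app_down"},
    {"deploy_app"},
    {"firewall_change", "web_app_down"},
    {"db_backup"},
]


def frequent_pairs(windows, min_support=0.3):
    """Return {pair: support} for event pairs that co-occur in at
    least `min_support` of all windows."""
    pair_counts = Counter()
    for events in windows:
        # Sort so that each pair is always counted under the same key.
        for pair in combinations(sorted(events), 2):
            pair_counts[pair] += 1
    total = len(windows)
    return {
        pair: count / total
        for pair, count in pair_counts.items()
        if count / total >= min_support
    }


def confidence(windows, antecedent, consequent):
    """Fraction of windows containing `antecedent` that also contain
    `consequent` (how often the change is followed by the incident)."""
    with_antecedent = [w for w in windows if antecedent in w]
    if not with_antecedent:
        return 0.0
    return sum(consequent in w for w in with_antecedent) / len(with_antecedent)


pairs = frequent_pairs(windows)
for (first, second), support in sorted(pairs.items()):
    print(f"{first} + {second}: support={support:.2f}")

# A high confidence for "firewall_change -> web_app_down" is the kind of
# pattern that could trigger a warning before the next firewall change.
print(round(confidence(windows, "firewall_change", "web_app_down"), 2))
```

Support measures how often the two events appear together at all, while confidence measures how reliably the second follows the first. An alert shown before a deployment would typically be gated on both.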
Another useful technique is classification. The system can learn to identify which components will be affected by specific changes in the system. For instance, in a Windows update deployment, the system can learn which components usually impact performance. Are there specific DLLs that some applications depend on? Are there other components or issues?
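As a rough illustration of the classification idea, the sketch below learns from past changes which components tend to be affected by each type of change. It is a deliberately simple frequency-based classifier, not a specific method from this series; the `ImpactClassifier` class, the history records, and the 0.5 threshold are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical history of past changes: (change_type, component, was_affected)
history = [
    ("windows_update", "reporting_service", True),
    ("windows_update", "reporting_service", True),
    ("windows_update", "reporting_service", False),
    ("windows_update", "mail_gateway", False),
    ("windows_update", "mail_gateway", False),
    ("firewall_change", "web_portal", True),
    ("firewall_change", "web_portal", True),
    ("firewall_change", "reporting_service", False),
]


class ImpactClassifier:
    """Predicts 'affected' when the observed rate of past impact for a
    (change type, component) pair reaches the threshold."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold
        # (change_type, component) -> [times affected, times observed]
        self.counts = defaultdict(lambda: [0, 0])

    def fit(self, records):
        for change, component, affected in records:
            entry = self.counts[(change, component)]
            entry[0] += int(affected)
            entry[1] += 1
        return self

    def probability(self, change, component):
        # .get avoids creating entries for pairs never seen in training.
        affected, total = self.counts.get((change, component), (0, 0))
        return affected / total if total else 0.0

    def predict_affected(self, change):
        """Return the components predicted to be affected by `change`."""
        return sorted(
            component
            for (seen_change, component) in self.counts
            if seen_change == change
            and self.probability(seen_change, component) >= self.threshold
        )


model = ImpactClassifier().fit(history)
print(model.predict_affected("windows_update"))
print(model.predict_affected("firewall_change"))
```

In practice the features would be richer than the change type alone (for example, which DLLs an application depends on), and a general-purpose classifier such as a decision tree could take the place of the frequency table.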
Forecasting methods can be applied to estimate whether a performance issue will occur, when an incident might happen, and at what magnitude. For a given change, this means estimating whether:
- there will be no increase in data processing time
- there will be a significant increase in data processing time
- the effect will appear within minutes or within weeks
A typical example is a change to the application tool size, whose impact typically appears within weeks. By contrast, the effect of a firewall change is seen within minutes.
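One simple way to estimate both the magnitude and the timing of an effect is to fit a trend to a metric observed after the change and extrapolate it to an alert threshold. The sketch below uses an ordinary least-squares line; the measurements, the 30-second threshold, and the function names are hypothetical, and real workloads often need models that handle seasonality and noise.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def time_to_threshold(xs, ys, threshold):
    """Estimate how long after the last observation the fitted trend
    reaches `threshold`. Returns None if the metric is not increasing."""
    slope, intercept = fit_line(xs, ys)
    if slope <= 0:
        return None  # no increase, so no incident is forecast
    crossing = (threshold - intercept) / slope
    return max(crossing - xs[-1], 0.0)


# Hypothetical measurements: days since the change, and the data
# processing time in seconds observed on each day.
days = [0, 1, 2, 3, 4]
processing_seconds = [10, 12, 14, 16, 18]

slope, _ = fit_line(days, processing_seconds)
remaining = time_to_threshold(days, processing_seconds, threshold=30)
print(f"growth: {slope:.1f} s/day, threshold reached in {remaining:.1f} days")
```

The slope gives the magnitude of the degradation, and the time to threshold indicates whether operators have minutes or weeks to react. A flat series returns `None`, which corresponds to the case where there is no increase in data processing time.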
It is important to understand which components are affected, when they will be affected and to what magnitude.
Such machine learning techniques can significantly cut down the mean time to resolution. Instead of waiting for an incident to happen and alerts to come in, machine learning can be applied immediately after a change is detected to estimate the effect on performance, so that a fix can be introduced before the incident appears.