
Stuck In Reactive Firefighting Mode? (Part 2)



 

This article is part of a five-part series covering some of the ways to deal with the top challenges for IT Operations and how machine learning techniques can be applied to address them.

 

  • Having Trouble Finding The Root Cause
  • Stuck In Reactive Firefighting Mode? 
  • Not Sure What is Important
  • Can not See the Forest for the Trees
  • Overwhelmed with false alarms?

Part 2 of a 5 part series

-  -  -  -

How can incidents actually be prevented? Machine learning can help here as well. In a typical incident timeline, something changes in the system and, after a while, an incident occurs. Monitoring tools, such as APM or log-error monitoring, then send alerts that something is wrong. Only then is an incident-resolution team assembled in a war room to address the incident and introduce a fix. A lot of time passes between the moment a change is introduced and the moment the incident is resolved, and the risk to critical business applications grows throughout that time.

Frequent Pattern Mining

How can the time to resolution be cut down? Frequent pattern mining can be applied. The goal of this approach is to identify events that frequently appear together. For instance, there may be an incident every time a new version of the application is deployed, with a small fraction of users affected. The system can automatically pick up such patterns and help operators avoid the issues. It can notify operators even before they start a deployment, with an alert like “Be careful when the deployment starts, since a segment of users will be affected”. Another instance is when a firewall change causes some applications to fail: before the firewall settings are changed, an alert is triggered saying “This change will affect the connectivity of the following web applications”.
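As a rough illustration of the idea, the sketch below counts how often pairs of events fall into the same time window and how often a given change is followed by a given incident. The event names, the `windows` data, and the helper functions `frequent_pairs` and `confidence` are all hypothetical; a production system would more likely rely on an established algorithm such as Apriori or FP-Growth over real change and incident logs.

```python
from collections import Counter
from itertools import combinations

# Hypothetical event log: each set holds the events observed in one time window.
windows = [
    {"deploy_app", "user_errors"},
    {"deploy_app", "user_errors"},
    {"firewall_change", "web_app_down"},
    {"deploy_app"},
    {"firewall_change", "web_app_down"},
    {"config_edit"},
]


def frequent_pairs(windows, min_support=2):
    """Return event pairs that share a window at least `min_support` times."""
    counts = Counter()
    for events in windows:
        # Sort so that each pair is always counted under the same key.
        for pair in combinations(sorted(events), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


def confidence(windows, cause, effect):
    """Fraction of windows containing `cause` that also contain `effect`."""
    with_cause = [w for w in windows if cause in w]
    if not with_cause:
        return 0.0
    return sum(effect in w for w in with_cause) / len(with_cause)


pairs = frequent_pairs(windows)
deploy_conf = confidence(windows, "deploy_app", "user_errors")
firewall_conf = confidence(windows, "firewall_change", "web_app_down")
```

A pair with high support and high confidence, such as the firewall change and the web application outage in this toy data, is the kind of pattern that could drive a warning before the change is applied.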

Classification

Another useful technique is classification. The system can learn to identify which components will be affected by specific changes in the system. For instance, in a Windows update deployment, the system can learn which components usually impact performance. Are there specific DLLs that some applications depend on? Are there other components or issues?
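To make this concrete, here is a minimal sketch of one possible classifier: a small categorical Naive Bayes model with Laplace smoothing that learns from past changes whether a component was affected. The feature names (`change`, `touches_dll`), the labels, and the `history` data are invented for illustration, and Naive Bayes is only one of many classifiers that could be used here.

```python
import math
from collections import Counter, defaultdict

# Hypothetical history: (features of a past change, outcome for a component).
history = [
    ({"change": "windows_update", "touches_dll": "yes"}, "affected"),
    ({"change": "windows_update", "touches_dll": "yes"}, "affected"),
    ({"change": "windows_update", "touches_dll": "no"}, "unaffected"),
    ({"change": "config_edit", "touches_dll": "no"}, "unaffected"),
    ({"change": "config_edit", "touches_dll": "yes"}, "affected"),
    ({"change": "config_edit", "touches_dll": "no"}, "unaffected"),
]


def train(history):
    """Collect the counts needed by a categorical Naive Bayes classifier."""
    label_counts = Counter(label for _, label in history)
    value_counts = defaultdict(Counter)  # (label, feature) -> value counts
    values_seen = defaultdict(set)       # feature -> distinct values observed
    for features, label in history:
        for name, value in features.items():
            value_counts[(label, name)][value] += 1
            values_seen[name].add(value)
    return label_counts, value_counts, values_seen


def predict(model, features):
    """Return the label with the highest smoothed log-probability."""
    label_counts, value_counts, values_seen = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)  # log prior
        for name, value in features.items():
            count = value_counts[(label, name)][value]
            # Laplace smoothing keeps unseen values from zeroing the score.
            score += math.log((count + 1) / (n + len(values_seen[name])))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(history)
risky = predict(model, {"change": "windows_update", "touches_dll": "yes"})
safe = predict(model, {"change": "config_edit", "touches_dll": "no"})
```

With this toy history, a Windows update that touches a shared DLL is classified as likely to affect the component, while a configuration edit that touches no DLL is not.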

Forecasting

Forecasting methods can be applied to estimate whether a change will cause performance issues, when an incident might happen, and at what magnitude. For example, a forecast can indicate whether:

  • there will be no increase in data processing time
  • there will be a significant increase in data processing time
  • the effect will appear within minutes or within weeks

A typical example is a change to the application tool size, which typically has an impact within weeks, whereas the effect of a firewall change is seen within minutes.

It is important to understand which components are affected, when they will be affected and to what magnitude.
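One simple way to estimate both the timing and the magnitude is to fit a trend to a metric measured after the change and extrapolate it to a threshold. The sketch below uses an ordinary least-squares line for this; the `processing_time` samples, the 15-second threshold, and the helper names `fit_line` and `steps_until` are assumptions made for illustration, and real workloads would usually call for models that handle seasonality and noise.

```python
def fit_line(values):
    """Least-squares fit of y = slope * t + intercept for t = 0, 1, 2, ...

    Requires at least two samples.
    """
    n = len(values)
    mean_t = (n - 1) / 2
    mean_y = sum(values) / n
    cov = sum((t - mean_t) * (y - mean_y) for t, y in enumerate(values))
    var = sum((t - mean_t) ** 2 for t in range(n))
    slope = cov / var
    return slope, mean_y - slope * mean_t


def steps_until(values, threshold):
    """Estimate how many more steps until the trend reaches `threshold`.

    Returns None when the fitted trend is flat or decreasing.
    """
    slope, intercept = fit_line(values)
    if slope <= 0:
        return None
    t_cross = (threshold - intercept) / slope
    # Measure from the most recent sample; never report a negative wait.
    return max(0.0, t_cross - (len(values) - 1))


# Hypothetical daily data processing times (seconds) measured after a change.
processing_time = [10.0, 10.5, 11.0, 11.5, 12.0]

slope, intercept = fit_line(processing_time)
days_left = steps_until(processing_time, threshold=15.0)
no_trend = steps_until([5.0, 5.0, 5.0], threshold=15.0)
```

Here the slope gives the magnitude of the drift per day and `days_left` gives a rough estimate of when the threshold would be crossed if the trend continues; a flat series yields `None`, matching the "no increase" case.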

Such machine learning techniques can significantly cut down the mean time to resolution. Instead of waiting for an incident to happen and alerts to come in, machine learning can be applied immediately after a change is detected, estimating the effect on performance so that a fix can be introduced before the incident appears.


About the Author
Bostjan Kaluza, PhD

Boštjan Kaluža is the Chief Data Scientist at Evolven. He has done extensive research in artificial intelligence and intelligent systems, machine learning, predictive analytics and anomaly detection. Prior to Evolven, Boštjan served as a senior researcher in the Department of Intelligent Systems at the Jožef Stefan Institute, the leading Slovenian scientific research institution, where he led research projects involving pattern and anomaly detection, machine learning and predictive analytics.

 

Focusing on the detection of suspicious behavior and data analysis, Boštjan has published numerous articles in professional journals and delivered conference papers. In 2013, Boštjan published his first book on data science, Instant Weka How-to, exploring how to leverage machine learning using Weka. Boštjan is now working on his second book, Practical Machine Learning in Java, scheduled to be published later this year. Boštjan is also the author of and contributor to a number of patents in the areas of anomaly detection and pattern recognition.

 

Boštjan earned his PhD at Jožef Stefan International Postgraduate School in Ljubljana, Slovenia, rigorously defending a doctoral dissertation entitled Detection of Anomalous and Suspicious Behavior Patterns.