
Having Trouble Finding the Root Cause (Part 1)



 

This article is part of a five-part series covering some of the ways to deal with the top challenges for IT Operations and how machine learning techniques can be applied to address them.

 

  • Having Trouble Finding The Root Cause 
  • Stuck In Reactive Firefighting Mode?
  • Not Sure What is Important
  • Can not See the Forest for the Trees
  • Overwhelmed with false alarms?

Part 1 of a 5 part series


 

This challenge addresses how to speed up root cause analysis. When root cause analysis starts, operators usually ignore the change, since the change is not actually the initial cause; usually something else triggered the change.

Take IT data, which carries a great deal of context. First, changes are planned and then executed, either automatically through the application infrastructure or manually by someone introducing changes into the system. Once changes have been deployed, their effect is monitored by watching log errors, network activity or APM alerts.

Only by looking at all of the data sources together can one get an idea of what is really going on in the system. The changes are the critical ingredient linking together IT context and the symptoms that indicate that something is wrong with the system.

Effective Root Cause Analysis

In this case, to apply effective root cause analysis, one needs to follow these steps.

  • Step One: Detect all actual changes in the system.
    These are changes in configuration, capacity, code, data and workload. It is also really helpful to estimate the risk associated with each change.
  • Step Two: Correlate the data sources.
    Because this data includes the changes, the context that caused each change is known. For instance, when an alert appears, it is known where the alert came from (a database or an application error) and what caused it.
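
The two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of any particular product: the `Change` and `Symptom` records, the `risk` score, the one-hour look-back window and the rule "same component, change precedes symptom" are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Change:
    """A detected change (hypothetical record layout)."""
    time: datetime
    component: str
    kind: str      # e.g. "configuration", "capacity", "code", "data", "workload"
    risk: float    # assumed risk estimate between 0 and 1


@dataclass(frozen=True)
class Symptom:
    """An observed symptom such as a log error or an APM alert."""
    time: datetime
    component: str
    source: str    # e.g. "log", "network", "apm"


def correlate(changes, symptoms, window=timedelta(hours=1)):
    """For each symptom, list the changes on the same component that
    happened at most `window` before it, highest risk first."""
    result = {}
    for symptom in symptoms:
        candidates = [
            change for change in changes
            if change.component == symptom.component
            and timedelta(0) <= symptom.time - change.time <= window
        ]
        result[symptom] = sorted(candidates, key=lambda c: c.risk, reverse=True)
    return result
```

Calling `correlate(changes, symptoms)` returns a dictionary keyed by symptom, so an operator looking at an alert sees the recent changes on that component ordered by estimated risk. A real system would also need to follow dependencies between components, which this sketch leaves out.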

Probabilistic Reasoning

Once this correlation is established, Probabilistic Reasoning can be applied to build an environment dependency diagram as a belief network.

This is a very simplified example of what Probabilistic Reasoning looks like. The figure shows that the incident depends on (is caused by) an automatic deployment that may go wrong and the execution of a change request. The incident leads to a log error indicating where something went wrong or to an APM alert that notifies the operator about the issue.
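
The structure described here can also be written as a factorization of the joint probability. The symbols are shorthand introduced for this note: D for automatic deployment, C for change request, I for incident, L for log error and A for APM alert. Treating D and C as independent of each other a priori is an assumption, since the text does not say how the two relate.

```latex
P(D, C, I, L, A) = P(D)\, P(C)\, P(I \mid D, C)\, P(L \mid I)\, P(A \mid I)
```

Each factor corresponds to one node of the belief network conditioned on its parents: the incident depends on the two kinds of change, and each symptom depends only on the incident.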

However, log lines are added all the time for various reasons, so a new log line cannot by itself be relied upon as a sign of an incident. Similarly, APM alerts are not perfect: there are many APM alerts, but only a few truly point to an incident.

To apply Probabilistic Reasoning for resolution, one needs to know the probability that a log error or an APM alert indicates an incident. For instance, when log errors and APM alerts are triggered but no deployment or change request is observed, one can consider the likelihood of an incident quite low, even though there was a log error or an APM alert. But if a log error appears after a change request was executed, the likelihood of an incident increases.
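
A minimal sketch of this reasoning, using exact inference by enumeration over a five-node belief network, might look as follows. Every probability in the tables is made up for illustration, and the variable names are hypothetical; in practice these numbers would have to be estimated from historical data.

```python
from itertools import product

VARS = ["deploy", "change", "incident", "log_error", "apm_alert"]

# All numbers below are illustrative assumptions, not measured values.
P_DEPLOY = 0.2            # prior probability of an automatic deployment
P_CHANGE = 0.1            # prior probability of an executed change request
P_INCIDENT = {            # P(incident | deploy, change)
    (False, False): 0.01,
    (True, False): 0.10,
    (False, True): 0.15,
    (True, True): 0.30,
}
P_LOG = {True: 0.8, False: 0.3}   # P(log_error | incident)
P_APM = {True: 0.7, False: 0.2}   # P(apm_alert | incident)


def bern(p, value):
    """Probability that a binary variable with P(True) = p takes `value`."""
    return p if value else 1.0 - p


def joint(a):
    """Joint probability of one full assignment, following the network structure."""
    return (
        bern(P_DEPLOY, a["deploy"])
        * bern(P_CHANGE, a["change"])
        * bern(P_INCIDENT[(a["deploy"], a["change"])], a["incident"])
        * bern(P_LOG[a["incident"]], a["log_error"])
        * bern(P_APM[a["incident"]], a["apm_alert"])
    )


def posterior_incident(evidence):
    """P(incident = True | evidence), summing the joint over all assignments
    that agree with the evidence."""
    numerator = denominator = 0.0
    for values in product([False, True], repeat=len(VARS)):
        assignment = dict(zip(VARS, values))
        if any(assignment[name] != value for name, value in evidence.items()):
            continue
        p = joint(assignment)
        denominator += p
        if assignment["incident"]:
            numerator += p
    return numerator / denominator


# Log error and APM alert, but no deployment and no change request.
p_symptoms_only = posterior_incident(
    {"log_error": True, "apm_alert": True, "deploy": False, "change": False}
)
# Log error after a change request was executed (no deployment).
p_after_change = posterior_incident(
    {"log_error": True, "deploy": False, "change": True}
)
```

With these assumed tables, `p_symptoms_only` comes out below 0.1 while `p_after_change` is roughly 0.3, which matches the qualitative argument above: the same symptom is far more suspicious when it follows a change. Enumeration is fine for five binary variables but grows exponentially, so larger networks would call for a dedicated inference library.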

Dependency and belief networks for much larger structures can be created by taking into account not just automated deployments or change requests but also environment type, dependency between environments and other relationships.


About the Author
Bostjan Kaluza, PhD

Boštjan Kaluža is the Chief Data Scientist at Evolven. He is also a dedicated researcher with extensive work in artificial intelligence and intelligent systems, machine learning, predictive analytics and anomaly detection. Prior to Evolven, Boštjan served as a senior researcher in the Department of Intelligent Systems at the Jožef Stefan Institute, the leading Slovenian scientific research institution, and led research projects involving pattern and anomaly detection, machine learning and predictive analytics.

 

Focusing on the detection of suspicious behavior and data analysis, Boštjan has published numerous articles in professional journals and delivered conference papers. In 2013, Boštjan published his first book on data science, Instant Weka How-to, exploring how to leverage machine learning using Weka. Boštjan is now working on his second book, Practical Machine Learning in Java, scheduled to be published later this year. Boštjan is also an author of and contributor to a number of patents in the areas of anomaly detection and pattern recognition.

 

Boštjan earned his PhD at Jožef Stefan International Postgraduate School in Ljubljana, Slovenia, rigorously defending a doctoral dissertation entitled Detection of Anomalous and Suspicious Behavior Patterns.