Having Trouble Finding the Root Cause (Part 1)
This article is part of a five-part series covering some of the ways to deal with the top challenges in IT Operations and how machine learning techniques can be applied to address them.
- Having Trouble Finding The Root Cause
- Stuck In Reactive Firefighting Mode?
- Not Sure What is Important
- Can not See the Forest for the Trees
- Overwhelmed with false alarms?
Part 1 of a 5 part series
This challenge addresses how to speed up root cause analysis. When root cause analysis starts, operators usually ignore the change, reasoning that the change is not the initial cause and that something else must have triggered it.
Take IT data, which carries a great deal of context. Changes are first planned and then executed, either automatically through the application infrastructure or manually by someone introducing them into the system. Once changes have been deployed, their effect is monitored by watching log errors, network activity, or APM alerts.
Only by looking at all of the data sources together can one get an idea of what is really going on in the system. The changes are the critical ingredient linking together IT context and the symptoms that indicate that something is wrong with the system.
Effective Root Cause Analysis
In this case, to apply effective root cause analysis, one needs to follow these steps.
- Step One: Detect all actual changes in the system.
These are changes in configuration, capacity, code, data and workload. It is also really helpful to estimate the risk associated with each change.
- Step Two: Correlate the data sources.
This data contains the changes, so the context around each change is known. For instance, when an alert appears, it becomes possible to tell where it came from (a database or an application error) and what caused it.
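The two steps above can be illustrated with a minimal sketch. It assumes changes and alerts have already been collected into simple records; the `Change` and `Alert` types, their fields, the `correlate` function, and the 30-minute window are all hypothetical choices made for illustration, not part of any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Change:
    when: datetime
    kind: str     # "configuration", "capacity", "code", "data" or "workload"
    target: str   # component the change touched (hypothetical naming)
    risk: float   # estimated risk in [0, 1]; how it is estimated is out of scope


@dataclass(frozen=True)
class Alert:
    when: datetime
    source: str   # e.g. "log" or "apm"
    target: str   # component the alert was raised for


def correlate(alerts, changes, window=timedelta(minutes=30)):
    """Pair each alert with the changes made to the same target shortly before it.

    Candidate changes are returned riskiest first. The time window is an
    assumed heuristic, not a recommended value.
    """
    result = {}
    for alert in alerts:
        candidates = [
            change for change in changes
            if change.target == alert.target
            and timedelta(0) <= alert.when - change.when <= window
        ]
        result[alert] = sorted(candidates, key=lambda c: c.risk, reverse=True)
    return result
```

An alert that maps to an empty list had no recent change on its target, which is itself useful context for the reasoning step that follows.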
Once this correlation is established, Probabilistic Reasoning can be applied to build an environment dependency diagram as a belief network.
This is a deliberately simple example of what Probabilistic Reasoning looks like. The figure shows that the incident depends on (is caused by) an automatic deployment that may go wrong and on the execution of a change request. The incident in turn leads to a log error indicating where something went wrong, or to an APM alert that notifies the operator about the issue.
However, log lines are added all the time for various reasons, so a new log line on its own is not a reliable sign of an incident. Similarly, APM alerts are not perfect: there are many of them, but only a few truly point to an incident.
To apply Probabilistic Reasoning for resolution, one needs to know the probability that a log error or an APM alert indicates an incident. For instance, when log errors and APM alerts are triggered but no deployment or change request is observed, one can consider the likelihood of an incident quite low, even though there was a log error or an APM alert. But if a log error appears after a change request was executed, the likelihood of an incident increases.
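The reasoning above can be made concrete with a small belief network evaluated by brute-force enumeration. This is only a sketch: the structure follows the example described in the text (deployment and change request cause the incident; the incident causes log errors and APM alerts), but every probability below is an invented placeholder, and names such as `posterior_incident` are hypothetical.

```python
from itertools import product

# All numbers are illustrative assumptions, not measured values.
P_DEPLOY = 0.1    # P(automatic deployment happened)
P_CHANGE = 0.1    # P(change request was executed)
P_INCIDENT = {    # P(incident | deployment, change request)
    (False, False): 0.01,
    (True, False): 0.2,
    (False, True): 0.2,
    (True, True): 0.4,
}
P_LOG = {True: 0.9, False: 0.3}   # P(log error | incident)
P_APM = {True: 0.8, False: 0.2}   # P(APM alert | incident)

VARS = ("deploy", "change", "incident", "log", "apm")


def bern(p, value):
    """Probability that a Boolean variable with P(True) = p takes `value`."""
    return p if value else 1.0 - p


def joint(deploy, change, incident, log, apm):
    """Joint probability of one full assignment, factored along the network."""
    return (
        bern(P_DEPLOY, deploy)
        * bern(P_CHANGE, change)
        * bern(P_INCIDENT[(deploy, change)], incident)
        * bern(P_LOG[incident], log)
        * bern(P_APM[incident], apm)
    )


def posterior_incident(**evidence):
    """P(incident | evidence), summing the joint over all consistent worlds."""
    numerator = denominator = 0.0
    for values in product([False, True], repeat=len(VARS)):
        world = dict(zip(VARS, values))
        if any(world[name] != seen for name, seen in evidence.items()):
            continue
        p = joint(*values)
        denominator += p
        if world["incident"]:
            numerator += p
    return numerator / denominator
```

With these placeholder numbers, `posterior_incident(log=True, apm=True, deploy=False, change=False)` comes out at roughly 0.11, while `posterior_incident(log=True, apm=True, deploy=False, change=True)` comes out at roughly 0.75: the same symptoms become far more alarming once a change is known to have been executed. Enumeration is fine for five variables; larger networks would need a dedicated inference library.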
Dependency and belief networks for much larger structures can be created by taking into account not just automated deployments or change requests but also environment type, dependency between environments and other relationships.