Not Sure What is Important (Part 3)
This article is part of a five-part series covering some of the ways to deal with the top challenges for IT Operations and how machine learning techniques can be applied to address them.
- Having Trouble Finding The Root Cause
- Stuck In Reactive Firefighting Mode?
- Not Sure What is Important
- Can not See the Forest for the Trees
- Overwhelmed with false alarms?
- - - -
This use case covers the scenario where it is simply unclear which data is important.
This example looks at a one-minute average of CPU consumption, as shown in the next image.
A typical algorithm based on dynamic thresholds looks at the variation of the signal and sets a threshold that covers most of the variance. An example of such a transformation is shown in the next figure.
This figure shows a spike in CPU consumption every 12 hours. While this may appear to be a cause for concern, the large spike is actually caused by an automated backup script. The problem is that when a real issue does occur, its spike looks exactly like all the backup spikes in this view, even though it is in fact an incident.
Typical reporting sets a dynamic threshold and raises an alert whenever a value rises above it. Raising such an alert every 12 hours produces lots of alerts that no one actually cares about: the backup script was expected to run and to cause the CPU to spike. In fact, it is more interesting to know if the backup did not happen. It is also important to know whether a spike in CPU consumption was actually caused by a critical incident.
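The dynamic-threshold behavior described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the series is synthetic (hourly values rather than one-minute averages, to keep it short), the threshold rule of mean plus two standard deviations is one common choice rather than the method of any particular product, and the names `dynamic_threshold` and `build_cpu_series` are made up for this example.

```python
import statistics

def dynamic_threshold(samples, k=2.0):
    """Threshold that covers most of the variance: mean + k * std deviation."""
    return statistics.fmean(samples) + k * statistics.pstdev(samples)

def build_cpu_series(hours=48, backup_every=12, idle=20.0, backup=90.0):
    """Synthetic CPU % per hour, with a backup spike every `backup_every` hours."""
    return [backup if h % backup_every == 0 else idle for h in range(hours)]

series = build_cpu_series()
threshold = dynamic_threshold(series)

# Every sample above the threshold raises an alert, expected or not.
alerts = [hour for hour, value in enumerate(series) if value > threshold]
print(f"threshold={threshold:.1f}%, alerts at hours {alerts}")
```

On this synthetic data, every scheduled backup crosses the threshold, so the operator receives one alert per backup and has no way to tell a routine spike from a real incident.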
Self-Adaptive Learning
Instead of relying upon dynamic thresholds, Machine Learning techniques like Self-Adaptive Learning can be applied to focus on the anomalies that could indicate an incident. This approach starts by using a learning period to establish a baseline for how the system functions, then classifies new behaviors. If a new behavior is similar to what was already seen, then everything is ok, and any spikes that appear are considered expected and part of the normal routine.
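As a rough sketch of the learn-then-classify idea, the Python below learns what CPU level is typical at each position in a repeating cycle and then labels new samples. The 12-hour cycle, the 15-point tolerance, and the function names `learn_baseline` and `classify` are assumptions made for illustration; a real system would use a richer model of behavior.

```python
from collections import defaultdict

CYCLE = 12  # assumed length of the repeating pattern, in hours

def learn_baseline(history, cycle=CYCLE):
    """Learning period: average CPU level seen at each position in the cycle."""
    buckets = defaultdict(list)
    for hour, value in enumerate(history):
        buckets[hour % cycle].append(value)
    return {phase: sum(vals) / len(vals) for phase, vals in buckets.items()}

def classify(baseline, hour, value, tolerance=15.0, cycle=CYCLE):
    """Label a sample 'expected' if it resembles what was seen at this phase."""
    expected = baseline[hour % cycle]
    return "expected" if abs(value - expected) <= tolerance else "anomaly"

# Four cycles of synthetic history: a backup spike at the start of each cycle.
history = [90.0 if h % CYCLE == 0 else 20.0 for h in range(4 * CYCLE)]
baseline = learn_baseline(history)

on_schedule = classify(baseline, 48, 88.0)    # spike at the usual backup time
off_schedule = classify(baseline, 53, 85.0)   # spike outside the backup window
missed_backup = classify(baseline, 60, 21.0)  # backup time, but CPU stayed low
print(on_schedule, off_schedule, missed_backup)
```

Because the baseline is learned per position in the cycle, the routine backup spike is treated as expected, while a spike at an unusual time and a backup that never ran are both flagged.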
However, when something different happens, for example, average CPU usage goes from 20% to 80% or there is a spike over a longer period of time, then a different behavior is noted as compared to what was seen before. If this change persists for a certain period of time, it is considered permanent: the system can ‘learn’ it, understand that this is a new pattern in the expected behavior, and stop delivering alerts for it.
What deserves an operator’s attention are the unusual changes in how the resource is consumed.
By learning what is typical system behavior for specific KPIs, Self-Adaptive Learning helps avoid false alarms and lets operators focus on what is truly important.
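The adaptation step can be illustrated with a minimal Python sketch. It assumes a single learned level, a fixed tolerance, and a `patience` parameter for how many consecutive deviating samples count as a permanent change. The `AdaptiveBaseline` class and all of its parameters are hypothetical and only stand in for whatever model an actual product uses.

```python
class AdaptiveBaseline:
    """Alerts on deviations, then adopts a change once it has persisted."""

    def __init__(self, level, tolerance=15.0, patience=5):
        self.level = level          # currently learned normal CPU level (%)
        self.tolerance = tolerance  # allowed distance from the learned level
        self.patience = patience    # deviating samples needed to relearn
        self.deviating = []         # consecutive samples outside the tolerance

    def observe(self, value):
        """Return True if this sample should raise an alert."""
        if abs(value - self.level) <= self.tolerance:
            self.deviating.clear()
            return False
        self.deviating.append(value)
        if len(self.deviating) >= self.patience:
            # The change has persisted: treat it as the new normal.
            self.level = sum(self.deviating) / len(self.deviating)
            self.deviating.clear()
        return True

monitor = AdaptiveBaseline(level=20.0)
samples = [20, 22, 80, 81, 79, 80, 82, 80, 81]
alerts = [monitor.observe(v) for v in samples]
print(alerts, monitor.level)
```

In this run the jump from about 20% to about 80% raises alerts for the first five deviating samples. After that the new level is adopted and the remaining samples no longer alert.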