Can Not See the Forest for the Trees (Part 4)
This article is part of a 5-part series covering some of the ways to deal with the top challenges for IT Operations and how machine learning techniques can be applied to address them.
- Having Trouble Finding The Root Cause
- Stuck In Reactive Firefighting Mode?
- Not Sure What is Important
- Can not See the Forest for the Trees
- Overwhelmed with false alarms?
- - - -
How does one distinguish and recognize high-level situations and understand what’s really going on? Consider an organization that sees 600,000 events per hour across 40,000 servers. These generate in the range of 47,000 help desk tickets, which result in approximately 2,000 Level-2 escalations per month. In other words, that is about 66 Level-2 escalations per day, a significant number of escalations to deal with.
A typical Level-1 enterprise monitoring experience consists of the following:
There are different applications, different tools and many alerts, each indicating where an issue is and what to focus on. This provides a lot of data, but it is still not clear what is going on or what the individual dots refer to. How are the alerts correlated? Are they just repeats of one another?
Machine learning can help in this situation through clustering. There are two fundamentally different approaches to clustering: Bottom-up Clustering and Top-down Clustering.
Bottom-up Clustering
The first approach is Bottom-up Clustering. In this approach, the algorithm examines all the data and tries to group it into reasonable chunks. Similar events are grouped together, and the procedure is repeated until the remaining groups are too different from each other to be merged. Finally, each chunk is assigned a common description that explains its significance.
One of the advantages of the Bottom-up Clustering approach is that it is completely unsupervised. It doesn’t need any human intervention: the algorithm runs on the data and automatically extracts interesting groups.
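The bottom-up procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method any particular product uses: alerts are plain text messages, similarity is the Jaccard overlap of their words, clusters are compared by average linkage, and the 0.5 threshold and sample alerts are made up for the example.

```python
from itertools import combinations


def tokens(message):
    """Split an alert message into a set of lowercase words."""
    return set(message.lower().split())


def jaccard(a, b):
    """Word overlap between two token sets, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_similarity(c1, c2):
    """Average pairwise similarity between two clusters (average linkage)."""
    pairs = [(x, y) for x in c1 for y in c2]
    return sum(jaccard(tokens(x), tokens(y)) for x, y in pairs) / len(pairs)


def bottom_up_cluster(messages, threshold=0.5):
    """Start with one cluster per alert and repeatedly merge the two most
    similar clusters until the best remaining pair falls below threshold."""
    clusters = [[m] for m in messages]
    while len(clusters) > 1:
        (i, j), best = max(
            (((i, j), cluster_similarity(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break  # remaining clusters are too different to merge
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def describe(cluster):
    """Common description: the words shared by every alert in the cluster."""
    return " ".join(sorted(set.intersection(*(tokens(m) for m in cluster))))


# Hypothetical sample alerts, invented for this example.
alerts = [
    "disk usage high on db01",
    "disk usage high on db02",
    "login failed for user alice",
    "login failed for user bob",
]
clusters = bottom_up_cluster(alerts)
descriptions = [describe(c) for c in clusters]
```

With these sample alerts, the two disk alerts end up in one group and the two login alerts in another, and no rule had to be written in advance. A production system would likely use richer features than word overlap, such as host, time and topology.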
Top-down Clustering
The other approach is Top-down Clustering. This approach is based on the notion that the operators already know what might happen in the system and can try to match events to those expectations. For example, during a manual deployment, the operators expect some changes to take place in the system and some alerts to appear.
This requires some human intervention: rules have to be specified or templates applied.
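A minimal sketch of the top-down idea follows, assuming a template is simply a time window, a set of affected hosts and a few keywords. The names `Event`, `ExpectedActivity` and `top_down_assign`, as well as the hosts, dates and keywords, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Event:
    timestamp: datetime
    host: str
    message: str


@dataclass
class ExpectedActivity:
    """Operator-supplied template for an activity known in advance."""
    name: str
    start: datetime
    duration: timedelta
    hosts: frozenset
    keywords: tuple

    def matches(self, event):
        in_window = self.start <= event.timestamp <= self.start + self.duration
        on_host = event.host in self.hosts
        has_keyword = any(k in event.message.lower() for k in self.keywords)
        return in_window and on_host and has_keyword


def top_down_assign(events, activities):
    """Attach each event to the first template it matches, or mark it as
    unexplained when no template applies."""
    assigned = {a.name: [] for a in activities}
    assigned["unexplained"] = []
    for event in events:
        for activity in activities:
            if activity.matches(event):
                assigned[activity.name].append(event)
                break
        else:
            assigned["unexplained"].append(event)
    return assigned


# Hypothetical data: a migration expected at 8 am on host app01.
day = datetime(2024, 1, 1)
migration = ExpectedActivity(
    name="server migration",
    start=day.replace(hour=8),
    duration=timedelta(hours=1),
    hosts=frozenset({"app01"}),
    keywords=("unreachable", "connection"),
)
events = [
    Event(day.replace(hour=8, minute=5), "app01", "Host unreachable"),
    Event(day.replace(hour=8, minute=10), "db07", "Disk full"),
    Event(day.replace(hour=11), "app01", "Host unreachable"),
]
result = top_down_assign(events, [migration])
```

Only the first event falls inside the migration template. The other two stay in the "unexplained" bucket, which is where an operator's attention should go first.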
For instance, at 8 am a manual server migration began that caused a couple of alerts. By using clustering, this group of alerts can be identified. First, the alerts are aggregated by application layer using Bottom-up Clustering. Next, these clusters are correlated using a Top-down approach. Combining both clustering approaches makes it evident that these alerts came from the same action in the IT system: the server migration.
A few hours later, after the server has been migrated, someone implements a manual change request, which again generates alerts. Finally, when a new version of the application is deployed, many new alerts appear. Looking only at a few specific dots in one corner of the diagram, one would have no idea what is going on.
Instead, one needs to group these events into meaningful chunks to get a good idea of what happened.
That would result in something looking like this:
The server was migrated, some changes were implemented and the new version was deployed. Using Machine Learning techniques, one gets much better insight into what happened to the system.
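The combination of both approaches in this scenario can be sketched as follows. Everything here is an assumption made for illustration: a simple time-gap grouping per application layer stands in for the bottom-up step, a list of known changes with a one-hour window stands in for the top-down templates, and the layers, timestamps (other than the 8 am migration) and messages are invented.

```python
from datetime import datetime, timedelta
from itertools import groupby


def group_by_layer_and_gap(alerts, max_gap=timedelta(minutes=15)):
    """Bottom-up step: within each application layer, put alerts that
    follow each other within max_gap into the same cluster."""
    clusters = []
    ordered = sorted(alerts, key=lambda a: (a[1], a[0]))
    for _layer, items in groupby(ordered, key=lambda a: a[1]):
        current = []
        for alert in items:
            if current and alert[0] - current[-1][0] > max_gap:
                clusters.append(current)
                current = []
            current.append(alert)
        clusters.append(current)
    return clusters


def label_clusters(clusters, changes, window=timedelta(hours=1)):
    """Top-down step: attribute each cluster to the known change whose
    window contains the cluster's first alert."""
    situations = {}
    for cluster in clusters:
        first = min(a[0] for a in cluster)
        label = "unexplained"
        for start, name in changes:
            if start <= first <= start + window:
                label = name
                break
        situations.setdefault(label, []).extend(cluster)
    return situations


day = datetime(2024, 1, 1)  # hypothetical date
# Known changes as (start time, name); 11:00 and 15:00 are assumed times.
changes = [
    (day.replace(hour=8), "server migration"),
    (day.replace(hour=11), "manual change request"),
    (day.replace(hour=15), "application deployment"),
]
# Alerts as (timestamp, application layer, message), all invented.
alerts = [
    (day.replace(hour=8, minute=5), "network", "link down"),
    (day.replace(hour=8, minute=7), "database", "replication lag"),
    (day.replace(hour=8, minute=12), "network", "route flap"),
    (day.replace(hour=11, minute=20), "application", "config reload"),
    (day.replace(hour=15, minute=10), "application", "error rate up"),
    (day.replace(hour=15, minute=12), "application", "latency high"),
    (day.replace(hour=15, minute=30), "database", "slow queries"),
]
clusters = group_by_layer_and_gap(alerts)
situations = label_clusters(clusters, changes)
```

Seven separate alerts collapse into three situations, one per change, which is the shift from individual dots to a high-level picture.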
Instead of just looking at the trees, you can see the forest.