Data and Problem Definition
Data mining is a creative process, best applied by progressing through a workflow consisting of a number of different stages. This article is part of a series that started with the overview article Applied Machine Learning Workflow. This post aims to help you define your problem by looking at how to represent the target variables, how to collect and annotate the data, and how to define measurement scales, such as risk levels of low, medium, or high.
The first stage is to understand and focus on the problem we want to solve. Competing objectives and other constraints can influence this effort, yet neglecting this stage can mean that a great deal of energy is spent producing the right answers to the wrong questions.
To frame the problem, we should try describing it and defining its tasks from an operational perspective. For example, the primary goal might be to maintain current performance by predicting when issues might impact availability. Related questions might be: "How do we identify anomalous events?", "How can we detect failures earlier?", or "How can we make the volume of alerts and distributed notifications more useful by reducing false alarms?"
Looking deeper, we want to ask ourselves why this problem is important. The answer could be specific and measurable, for example, identifying which component or issue in the system is unusual and whether it is an outlier; or it might be general and subjective, such as "how to get more useful insights out of the collected data."
Defining the Data
Data is simply a collection of facts. These facts can be numbers, words, measurements, observations, descriptions of things, images, and so on.
The most common way to represent data is as attribute-value pairs, where each pair associates a value with a named attribute, for example, height = 180.
A set of data can simply be presented as a table, where columns correspond to attributes, or features, and rows to particular data entries, or instances. In supervised machine learning, the attribute we want to predict is called the class or target variable.
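As a minimal sketch, such a table can be modeled in Java as a list of attribute-value maps. The attribute names and values here (height, eyeColor, hobbies, and a risk target) are illustrative assumptions, not a dataset from the article:

```java
import java.util.List;
import java.util.Map;

public class Dataset {
    // Each row (instance) is a set of attribute-value pairs;
    // the "risk" attribute plays the role of the class/target variable.
    static final List<Map<String, Object>> INSTANCES = List.of(
        Map.<String, Object>of("height", 180, "eyeColor", "brown",
                               "hobbies", "reading", "risk", "low"),
        Map.<String, Object>of("height", 165, "eyeColor", "blue",
                               "hobbies", "cycling", "risk", "high")
    );

    // Extract the target (class) value for a given instance.
    static String target(Map<String, Object> instance) {
        return (String) instance.get("risk");
    }
}
```

In practice a machine learning library would provide its own dataset abstraction; this sketch only illustrates the attribute-value view of a table.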
Setting Measurement Scales
The first thing we notice is how varied the attribute values are: height is a number, eye color is text, hobbies are a list, and so on. To gain a better understanding of the value types, let's take a closer look at the different types of data, that is, measurement scales. Stevens (1946) defined the following four scales with increasingly expressive properties:
- Nominal data correspond to categories with no particular order or direction. Examples include eye color, marital status, and type of car owned.
- Ordinal data correspond to categories where order or rank is important, for example, a student's letter grade, day of the week, or service quality rating.
- Interval data describe the differences between measurements, but there is no true zero point, for instance, standardized exam score, temperature in Fahrenheit, or IMDB movie score.
- Ratio data also describe the differences between measurements, but here the true zero value exists. This could describe height, age, stock price, weekly food spending and so on.
Why should we care about measurement scales? Machine learning heavily depends on the statistical properties of the data; hence, we should be aware of the limitations of each data type. The following table summarizes the main operations and statistical properties for each measurement scale:

| Scale    | Valid operations           | Example statistics                       |
|----------|----------------------------|------------------------------------------|
| Nominal  | =, ≠                       | mode, frequency                          |
| Ordinal  | =, ≠, <, >                 | median, percentiles                      |
| Interval | =, ≠, <, >, +, −           | mean, standard deviation                 |
| Ratio    | =, ≠, <, >, +, −, ×, ÷     | geometric mean, coefficient of variation |
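To make the limitations concrete, here is a small sketch (with illustrative helper names) of the "average" each scale supports: the mode is the only meaningful central tendency for nominal data, the median becomes valid once values can be ordered, and the arithmetic mean requires at least an interval scale:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ScaleStats {
    // Mode: the only meaningful "average" for nominal data.
    static String mode(List<String> values) {
        return values.stream()
            .collect(Collectors.groupingBy(v -> v, Collectors.counting()))
            .entrySet().stream()
            .max(Map.Entry.comparingByValue())
            .get().getKey();
    }

    // Median: valid once values can be ordered (ordinal data and above).
    static double median(List<Double> values) {
        List<Double> sorted = values.stream().sorted().collect(Collectors.toList());
        int n = sorted.size();
        return n % 2 == 1 ? sorted.get(n / 2)
                          : (sorted.get(n / 2 - 1) + sorted.get(n / 2)) / 2.0;
    }

    // Arithmetic mean: requires meaningful differences (interval data and above).
    static double mean(List<Double> values) {
        return values.stream().mapToDouble(Double::doubleValue).average().orElse(0);
    }
}
```

For example, taking the mean of nominal eye colors makes no sense, while the mode does; conversely, the mean of ratio-scaled heights is perfectly meaningful.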
Large-scale software systems and IT environments suffer when a system fails to function properly. It is often difficult to determine which part of the system triggered the problem, because the symptoms of a failure appear end to end, in the operation of the system as a whole, rather than as failures of the system's individual pieces. So, merely realizing that something is wrong is not enough to know where to focus support efforts.
To best leverage machine learning's ability to mine information and relationships from data collections, IT operations should start by focusing on specific problems to fix for early and more efficient resolution, such as analyzing for root cause or detecting operational anomalies in individual components.
This article is excerpted from my upcoming book: Practical Machine Learning in Java, scheduled to be published later this year.