Data and Problem Definition

Data mining is a creative process best applied by progressing through a workflow with several distinct stages. This article is part of a series that started with the overview article Applied Machine Learning Workflow. This post aims to help you define your problem by looking at how to represent the target variable, how to collect and annotate data, and how to choose the scales of measurement, for example expressing risk as low, medium, or high.

The first stage is to understand and sharpen the problem we want to solve. Competing objectives and other constraints can influence this effort. Yet neglecting this stage can mean putting a great deal of energy into producing the right answers to the wrong questions.

To frame the problem, we should describe it and define the tasks from an operational perspective. For example, the primary goal might be to maintain current performance by predicting when issues might impact availability. Related questions might be "How do we identify anomalous events?", "How can we detect failures earlier?" or "How can we make the volume of alerts and distributed notifications more useful by reducing false alarms?"

Looking deeper, we want to ask ourselves why this problem is important. The answer could be specific and measurable, for example finding the particular component or issue that is unusual in the system and deciding whether it is an outlier, or it might be general and subjective, such as "how to get more useful insights out of the collected data."

Defining the Data

Data is simply a collection of facts. These facts can be numbers, words, measurements, observations, descriptions of things, images and so on.

Representing Data

The most common way to represent data is as a set of attribute-value pairs, for example a person described by attributes such as height, eye color, and hobbies.

A data set can be presented simply as a table, where columns correspond to attributes or features and rows correspond to particular data entries or instances. In supervised machine learning, the attribute we want to predict is called the class or target variable.
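The table view described above can be sketched in a few lines of Python. This is only an illustration: the attribute names, the values, and the choice of `risk` as the target variable are all made up for the example.

```python
# Hypothetical instances: each row is a set of attribute-value pairs.
# All names and values here are illustrative, not taken from a real data set.
instances = [
    {"height_cm": 182, "eye_color": "blue", "hobbies": ["chess", "running"], "risk": "low"},
    {"height_cm": 165, "eye_color": "brown", "hobbies": ["reading"], "risk": "high"},
    {"height_cm": 174, "eye_color": "green", "hobbies": [], "risk": "medium"},
]

TARGET = "risk"  # the class (target variable) we want to predict

# Columns of the table are the attributes; rows are the instances.
attributes = [name for name in instances[0] if name != TARGET]

# Split every row into its feature values (X) and its target value (y).
X = [[row[name] for name in attributes] for row in instances]
y = [row[TARGET] for row in instances]

print(attributes)
for features, label in zip(X, y):
    print(features, "->", label)
```

Keeping the target in its own list makes it explicit which attribute the model is supposed to predict and which attributes it may use as input.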

Setting Measurement Scales

The first thing we notice is how varied the attribute values are. For instance, height is a number, eye color is text, hobbies are a list and so on. To gain a better understanding of the value types, let's take a closer look at the different types of data, or measurement scales. Stevens (1946) defined the following four scales with increasingly expressive properties:

  • Nominal data correspond to categories with no particular order or direction. Examples include eye color, marital status, type of car owned etc.
  • Ordinal data correspond to categories where order or rank is important, for example, student letter grade, day of the week, service quality rating etc.
  • Interval data describe the differences between measurements, but there is no true zero, for instance, standardized exam score, temperature in Fahrenheit, IMDB movie score.
  • Ratio data also describe the differences between measurements, but here the true zero value exists. This could describe height, age, stock price, weekly food spending and so on.
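The difference between interval and ratio data can be checked numerically. The sketch below uses made-up temperatures and heights: on an interval scale (Fahrenheit versus Celsius) the ratio of two values changes with the unit while ratios of differences do not, whereas on a ratio scale (height) the ratio of two values survives a change of unit.

```python
import math


def fahrenheit_to_celsius(f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (f - 32.0) * 5.0 / 9.0


def inches_to_cm(inches):
    """Convert inches to centimeters."""
    return inches * 2.54


# Interval scale: temperature (illustrative values).
temps_f = [40.0, 60.0, 100.0]
temps_c = [fahrenheit_to_celsius(t) for t in temps_f]

# The ratio of two temperatures depends on the unit, so it carries no meaning.
ratio_f = temps_f[2] / temps_f[0]
ratio_c = temps_c[2] / temps_c[0]

# Ratios of differences are the same in both units.
diff_ratio_f = (temps_f[2] - temps_f[1]) / (temps_f[1] - temps_f[0])
diff_ratio_c = (temps_c[2] - temps_c[1]) / (temps_c[1] - temps_c[0])

# Ratio scale: height has a true zero, so value ratios are unit-independent.
heights_in = [60.0, 72.0]
heights_cm = [inches_to_cm(h) for h in heights_in]
height_ratio_in = heights_in[1] / heights_in[0]
height_ratio_cm = heights_cm[1] / heights_cm[0]

print("temperature ratios agree:", math.isclose(ratio_f, ratio_c))
print("difference ratios agree:", math.isclose(diff_ratio_f, diff_ratio_c))
print("height ratios agree:", math.isclose(height_ratio_in, height_ratio_cm))
```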

Why should we care about measurement scales? Well, machine learning heavily depends on the statistical properties of the data; hence, we should be aware of the limitations each data type possesses. The next table summarizes the main operations and statistical properties for each of the measurement types.
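To make these limitations concrete, here is a minimal sketch that computes one statistic commonly considered meaningful for each scale, using Python's standard `statistics` module. The sample values and the grade ordering are invented for the example.

```python
import statistics

# Illustrative samples, one per measurement scale.
eye_colors = ["brown", "blue", "brown", "green", "brown"]  # nominal
grades = ["B", "A", "C", "B", "D"]                         # ordinal
exam_scores = [510, 620, 580, 700, 590]                    # interval
heights_cm = [160.0, 170.0, 180.0]                         # ratio

# Nominal: values can only be counted, so the mode is meaningful.
most_common_color = statistics.mode(eye_colors)

# Ordinal: values can be ranked, so the median is meaningful.
# The ranking (worst to best) is an assumption made explicit here.
GRADE_ORDER = ["D", "C", "B", "A"]
grade_ranks = [GRADE_ORDER.index(g) for g in grades]
median_grade = GRADE_ORDER[statistics.median_low(grade_ranks)]

# Interval: differences are meaningful, so the mean is as well.
mean_score = statistics.mean(exam_scores)

# Ratio: a true zero exists, so statements like "x times as tall" make sense.
height_ratio = max(heights_cm) / min(heights_cm)

print(most_common_color, median_grade, mean_score, height_ratio)
```

Taking the mean of the grade ranks would also run without error, which is exactly the trap: the code cannot tell that the distance between "A" and "B" is undefined, so the scale has to be tracked by the analyst.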

Furthermore, nominal and ordinal data correspond to discrete values, while interval and ratio data typically correspond to continuous values. In supervised learning, the measurement type of the value we want to predict dictates what kind of machine learning algorithm can be used. For instance, discrete values from a limited list can be predicted with classification models such as decision trees, while continuous values are predicted with regression models.
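The mapping from measurement scale to learning task can be written down as a small rule of thumb. The function name `suggest_task` and the string labels are hypothetical, and the rule simply mirrors the paragraph above; real projects sometimes deviate from it, for example by treating an ordinal rating as a regression target.

```python
DISCRETE_SCALES = {"nominal", "ordinal"}
CONTINUOUS_SCALES = {"interval", "ratio"}


def suggest_task(target_scale):
    """Suggest a supervised learning task from the target's measurement scale."""
    scale = target_scale.strip().lower()
    if scale in DISCRETE_SCALES:
        return "classification"  # e.g. a decision tree
    if scale in CONTINUOUS_SCALES:
        return "regression"      # e.g. a regression model
    raise ValueError(f"unknown measurement scale: {target_scale!r}")


# Hypothetical targets and the scale each one is measured on.
targets = {
    "risk level (low/medium/high)": "ordinal",
    "eye color": "nominal",
    "stock price": "ratio",
}

for name, scale in targets.items():
    print(f"{name}: {suggest_task(scale)}")
```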

Summary

Large-scale software systems and IT environments suffer when the system fails to function properly. It is often difficult to determine which part of the system set off the problem, because the symptoms appear as end-to-end failures in the operation of the system as a whole and do not manifest as failures in the system's individual pieces. So simply realizing that something is wrong is not enough to know where to focus support efforts.

To best leverage the ability of machine learning to mine information and relationships from data collections, IT operations should start by focusing on the specific problems to fix for earlier and more efficient resolution, such as analyzing root causes or detecting operational anomalies in the components.

Credit
This article is excerpted from my upcoming book: Practical Machine Learning in Java, scheduled to be published later this year.

Your Turn
What problems are you using machine learning for?

About the Author
Bostjan Kaluza, PhD

Boštjan Kaluža is the Chief Data Scientist at Evolven. He is also a dedicated researcher in artificial intelligence and intelligent systems, machine learning, predictive analytics and anomaly detection. Prior to Evolven, Boštjan served as a senior researcher in the Department of Intelligent Systems at the Jozef Stefan Institute, the leading Slovenian scientific research institution, where he led research projects involving pattern and anomaly detection, machine learning and predictive analytics.

 

Focusing on the detection of suspicious behavior and data analysis, Boštjan has published numerous articles in professional journals and delivered conference papers. In 2013, Boštjan published his first book on data science, Instant Weka How-to, exploring how to leverage machine learning using Weka. Boštjan is now working on his second book, Practical Machine Learning in Java, scheduled to be published later this year. Boštjan is also an author of and contributor to a number of patents in the areas of anomaly detection and pattern recognition.

 

Boštjan earned his PhD at Jožef Stefan International Postgraduate School in Ljubljana, Slovenia, rigorously defending a doctoral dissertation entitled Detection of Anomalous and Suspicious Behavior Patterns.