
Applied Machine Learning Workflow

Machine learning projects typically follow a knowledge discovery process. This post describes a recommended workflow for applied machine learning that you can use to tackle problems such as identifying critical anomalies that pose a risk to your IT environment. Not every step in this process is strictly mandatory, but the approach will help you consider problems in a more scientific way.

Although inspired by the industry-established Cross Industry Standard Process for Data Mining (CRISP-DM) methodology, the steps outlined here differ slightly and focus more on the technical steps in a typical workflow.

The process takes all the data into account and is designed to guide you toward a high-quality solution to the problem you are dealing with. Note that it does not guarantee a solution. Even so, the lessons gained along the way can trigger new, often more focused questions to apply in subsequent data mining iterations.

Typical Workflow

A typical workflow in applied machine learning applications consists of answering a series of questions, summarized in five steps:

Step 1. Data and problem definition

The first step is to ask interesting questions. What is the problem you are trying to solve? Why is it important? Which format of result answers your question? Is it a simple yes/no answer? Do you need to pick one of several available options?

Step 2. Data collection

Once you have a problem to tackle, you will need the data. Ask yourself what kind of data will help you answer the question. Can you get the data from available sources? Will you have to combine multiple sources? Do you have to generate the data? Are there any sampling biases?

Step 3. Data preprocessing

The first data preprocessing task is data cleaning, for example, filling in missing values, smoothing noisy data, removing outliers, and resolving inconsistencies.
This is usually followed by integrating multiple data sources and by data transformation: scaling values to a specific range (normalization), mapping them to value bins (discretized intervals), and reducing the number of dimensions.
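The cleaning and transformation tasks above can be sketched in a few lines. This is a minimal illustration in plain Python, not a production pipeline: the function names, the choice of mean imputation, and the sample readings are all assumptions made for the example.

```python
from statistics import mean


def fill_missing(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; map them to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def discretize(values, n_bins):
    """Assign each value in [0, 1] to one of n_bins equal-width bins."""
    return [min(int(v * n_bins), n_bins - 1) for v in values]


# Hypothetical CPU-load readings with one missing value.
cpu_load = [10.0, None, 30.0, 50.0]

filled = fill_missing(cpu_load)      # [10.0, 30.0, 30.0, 50.0]
scaled = min_max_normalize(filled)   # [0.0, 0.5, 0.5, 1.0]
bins = discretize(scaled, n_bins=4)  # [0, 2, 2, 3]
```

Mean imputation and equal-width bins are only one option each. Depending on the data, the median, a model-based fill, or equal-frequency bins may be a better fit.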

Step 4. Data analysis and modeling with unsupervised and supervised learning

Data analysis and modeling includes unsupervised and supervised machine learning, statistical inference, and prediction. A wide variety of machine learning algorithms are available, including k-nearest neighbors, naïve Bayes, decision trees, support vector machines, logistic regression, and k-means. The choice of method depends on the problem definition discussed in the first step and on the type of data collected. The final product of this step is a model inferred from the data.
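To make one of these algorithms concrete, here is a minimal sketch of a k-nearest neighbors classifier in plain Python. The feature names, the labels, and the training points are hypothetical, and the features are assumed to be already normalized to [0, 1].

```python
from collections import Counter
from math import dist


def knn_predict(train, query, k=3):
    """Predict a label for query by majority vote among the k nearest training points."""
    # Sort training examples by Euclidean distance to the query and keep the closest k.
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical, already normalized features: (cpu_load, error_rate).
train = [
    ((0.10, 0.05), "normal"),
    ((0.20, 0.10), "normal"),
    ((0.15, 0.00), "normal"),
    ((0.90, 0.80), "anomaly"),
    ((0.85, 0.95), "anomaly"),
    ((0.95, 0.90), "anomaly"),
]

print(knn_predict(train, (0.88, 0.85)))  # anomaly
print(knn_predict(train, (0.12, 0.07)))  # normal
```

Here the "model inferred from the data" is simply the stored training set plus the voting rule. Other algorithms in the list above instead produce an explicit structure such as a tree or a set of weights.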

Step 5. Generalization and evaluation

The last step is devoted to model assessment. The main issue that models built with machine learning face is how well they generalize beyond the data they were trained on. If a model is too specific, that is, it follows the training data too closely (overfitting), it will quite possibly perform poorly on new data. On the other hand, a model can be too generic: when asked what the weather is in California, it always answers sunny, which is indeed correct most of the time. However, such a model is not really useful for making valid predictions. The goal of this step is to evaluate the model correctly and make sure it will work on new data as well. Evaluation methods include separate training and test sets, cross-validation, and leave-one-out validation.
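The always-sunny example can be put into numbers with a small cross-validation sketch. The weather log below is made up, the fold-splitting helper is one simple way to do it, and the "model" deliberately ignores all features so that it behaves like the overly generic predictor described above.

```python
from collections import Counter


def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def train_majority(labels):
    """'Train' a model that always predicts the most frequent training label."""
    majority = Counter(labels).most_common(1)[0][0]
    return lambda _features: majority


def cross_validate(labels, k):
    """Return the per-fold accuracy of the majority model."""
    scores = []
    for test_idx in k_fold_indices(len(labels), k):
        held_out = set(test_idx)
        # Train only on examples outside the held-out fold.
        train_labels = [y for i, y in enumerate(labels) if i not in held_out]
        model = train_majority(train_labels)
        correct = sum(model(None) == labels[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return scores


# Hypothetical weather log: 10 days, 8 sunny and 2 rainy.
weather = ["sunny", "sunny", "sunny", "sunny", "rainy",
           "sunny", "sunny", "sunny", "sunny", "rainy"]

scores = cross_validate(weather, k=5)
accuracy = sum(scores) / len(scores)
print(scores)    # [1.0, 1.0, 0.5, 1.0, 0.5]
print(accuracy)  # 0.8
```

The model reaches 80% accuracy without ever predicting a rainy day, which is why accuracy alone can be misleading on imbalanced data. Calling `cross_validate(weather, k=len(weather))` holds out one example at a time, which corresponds to leave-one-out validation.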

Summary

The true value of data science lies in the ability to extract useful insights and find interesting trends and correlations to support decision making. The approach outlined here describes, at a high level, how to extract knowledge from raw data. Its most beneficial aspect is that it forces you to think in a more focused way when tackling a data-driven problem, pointing you in the right direction.

You have learned a simple process for applying the machine learning workflow. In upcoming posts, we will go into more detail about each of the steps mentioned here, looking at the questions that must be answered and the issues to consider in order to arrive at the best outcome for your data mining project.

Credit

This article is excerpted from my upcoming book: Practical Machine Learning in Java, scheduled to be published later this year.

Your Turn
How are you applying machine learning to home in on your IT operations problems?

About the Author
Bostjan Kaluza, PhD

Boštjan Kaluža is the Chief Data Scientist at Evolven. He is also a dedicated researcher with extensive work in artificial intelligence and intelligent systems, machine learning, predictive analytics, and anomaly detection. Prior to Evolven, Boštjan served as a senior researcher in the Department of Intelligent Systems at the Jožef Stefan Institute, the leading Slovenian scientific research institution, and led research projects involving pattern and anomaly detection, machine learning, and predictive analytics.

 

Focusing on the detection of suspicious behavior and data analysis, Boštjan has published numerous articles in professional journals and delivered conference papers. In 2013, Boštjan published his first book on data science, Instant Weka How-to, exploring how to leverage machine learning using Weka. Boštjan is now working on his second book, Practical Machine Learning in Java, scheduled to be published later this year. Boštjan is also an author of and contributor to a number of patents in the areas of anomaly detection and pattern recognition.

 

Boštjan earned his PhD at Jožef Stefan International Postgraduate School in Ljubljana, Slovenia, rigorously defending a doctoral dissertation entitled Detection of Anomalous and Suspicious Behavior Patterns.