Applied Machine Learning Workflow
Machine learning projects are implemented by following a knowledge discovery process. This post describes a recommended approach for an applied machine learning workflow that you can use to tackle problems such as identifying critical anomalies that pose a risk to your IT environment. While not every step in this process is strictly mandatory, this approach will help you consider problems in a more scientific way.
Inspired by the industry-established Cross Industry Standard Process for Data Mining (CRISP-DM) methodology, the steps outlined here are slightly different, focusing more on the technical steps in a typical workflow.
This process has been designed to guide you toward a high-quality solution for the problem you are dealing with. Note that the approach does not guarantee a solution to your problem, but the lessons gained along the way can trigger new, often more focused questions to be applied in subsequent data mining iterations.
Typical Workflow
A typical workflow in applied machine learning consists of answering a series of questions, summarized in five steps:
Step 1. Data and problem definition
The first step is to ask interesting questions. What is the problem you are trying to solve? Why is it important? What form of result would answer your question? Is it a simple yes/no answer? Do you need to pick one of several available options?
Step 2. Data collection
Once you have a problem to tackle, you will need data. Ask yourself what kind of data will help you answer the question. Can you get the data from available sources? Will you have to combine multiple sources? Do you have to generate the data? Are there any sampling biases?
Step 3. Data preprocessing
The first data preprocessing task is data cleaning: for example, filling in missing values, smoothing noisy data, removing outliers, and resolving inconsistencies.
This is usually followed by integrating multiple data sources and transforming the data: rescaling values to a specific range (normalization), grouping them into value bins (discretization), and reducing the number of dimensions.
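As a rough illustration of the cleaning and transformation tasks above, here is a minimal Java sketch using only the standard library. The class name `PreprocessingSketch`, its method names, the choice of encoding missing values as `NaN`, and mean imputation as the filling strategy are all assumptions made for this example. They are one possible approach, not the only one.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; names and strategies are assumptions.
final class PreprocessingSketch {

    // Data cleaning: replace missing values (encoded here as NaN)
    // with the mean of the observed values.
    static double[] fillMissingWithMean(double[] values) {
        double sum = 0.0;
        int count = 0;
        for (double v : values) {
            if (!Double.isNaN(v)) {
                sum += v;
                count++;
            }
        }
        double mean = count == 0 ? 0.0 : sum / count;
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = Double.isNaN(values[i]) ? mean : values[i];
        }
        return out;
    }

    // Normalization: rescale values linearly to the range [0, 1].
    // Expects cleaned input without NaN values.
    static double[] minMaxNormalize(double[] values) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double range = max - min;
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            // A constant column has zero range; map it to 0 to avoid dividing by zero.
            out[i] = range == 0.0 ? 0.0 : (values[i] - min) / range;
        }
        return out;
    }

    // Discretization: assign each value in [0, 1] to one of `bins` equal-width intervals.
    static int[] discretize(double[] normalized, int bins) {
        int[] out = new int[normalized.length];
        for (int i = 0; i < normalized.length; i++) {
            // The value 1.0 would land in bin `bins`, so clamp it into the last bin.
            out[i] = Math.min((int) (normalized[i] * bins), bins - 1);
        }
        return out;
    }

    public static void main(String[] args) {
        double[] raw = {10.0, Double.NaN, 20.0, 30.0};
        double[] cleaned = fillMissingWithMean(raw);
        double[] scaled = minMaxNormalize(cleaned);
        int[] binned = discretize(scaled, 2);
        System.out.println(Arrays.toString(cleaned));
        System.out.println(Arrays.toString(scaled));
        System.out.println(Arrays.toString(binned));
    }
}
```

The order matters in this sketch: cleaning runs first so that the minimum and maximum used for normalization are not distorted by missing values.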
Step 4. Data analysis and modeling with unsupervised and supervised learning
Data analysis and modeling includes unsupervised and supervised machine learning, statistical inference, and prediction. A wide variety of machine learning algorithms are available, including k-nearest neighbors, naïve Bayes, decision trees, support vector machines, logistic regression, and k-means. The choice of method depends on the problem definition discussed in the first step and on the type of data collected. The final product of this step is a model inferred from the data.
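To make "a model inferred from the data" a little more concrete, the following is a minimal sketch of one of the algorithms listed above, k-nearest neighbors, written in plain Java. The class name `KnnSketch`, the Euclidean distance measure, the tie-breaking rule, and the tiny two-feature dataset are all assumptions chosen for illustration. A real project would more likely rely on an established library.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

// Hypothetical k-nearest neighbors classifier for illustration only.
final class KnnSketch {
    private final double[][] features;
    private final String[] labels;
    private final int k;

    KnnSketch(double[][] features, String[] labels, int k) {
        if (features.length != labels.length || k < 1 || k > features.length) {
            throw new IllegalArgumentException("invalid training data or k");
        }
        this.features = features;
        this.labels = labels;
        this.k = k;
    }

    // Euclidean distance between two feature vectors of equal length.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Predict the label that wins a majority vote among the k closest training examples.
    String predict(double[] query) {
        Integer[] order = new Integer[features.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Sort training indices by their distance to the query point.
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> distance(features[i], query)));

        Map<String, Integer> votes = new HashMap<>();
        String best = null;
        int bestVotes = 0;
        for (int n = 0; n < k; n++) {
            String label = labels[order[n]];
            int v = votes.merge(label, 1, Integer::sum);
            // On a tie, the label that reached the top count first (nearest first) is kept.
            if (v > bestVotes) {
                bestVotes = v;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up, already normalized measurements; not real data.
        double[][] x = {
            {0.10, 0.20}, {0.20, 0.10}, {0.15, 0.15},
            {0.90, 0.80}, {0.80, 0.90}, {0.85, 0.95}
        };
        String[] y = {"normal", "normal", "normal", "anomaly", "anomaly", "anomaly"};
        KnnSketch model = new KnnSketch(x, y, 3);
        System.out.println(model.predict(new double[]{0.82, 0.88}));
        System.out.println(model.predict(new double[]{0.12, 0.18}));
    }
}
```

Here the "model" is simply the stored training data plus the value of k. Other algorithms in the list, such as decision trees or logistic regression, instead compress the data into a tree or a set of weights.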
Step 5. Generalization and evaluation
The last step is devoted to model assessment. The main issue that models built with machine learning face is how well they capture the underlying data. If a model is too specific, that is, it follows the training data too closely (overfitting), it will quite possibly perform poorly on new data. On the other hand, a model can be too generic: for instance, when asked what the weather is in California, it always answers sunny, which is indeed correct most of the time. However, such a model is not really useful for making valid predictions. The goal of this step is to evaluate the model correctly and make sure it will work on new data as well. Evaluation methods include separate training and test sets, cross-validation, and leave-one-out validation.
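The "always sunny" model and cross-validation can be sketched together in a few lines of Java. In the example below, the class name `EvaluationSketch`, the method names, and the ten-day weather list are hypothetical. Examples are assigned to folds by index rather than shuffled, which keeps the sketch deterministic. In practice you would normally shuffle the data first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical evaluation helper for illustration only.
final class EvaluationSketch {

    // The "always sunny" model: ignores inputs and predicts the most frequent training label.
    static String majorityLabel(List<String> trainLabels) {
        Map<String, Integer> counts = new HashMap<>();
        String best = null;
        int bestCount = 0;
        for (String label : trainLabels) {
            int c = counts.merge(label, 1, Integer::sum);
            if (c > bestCount) {
                bestCount = c;
                best = label;
            }
        }
        return best;
    }

    // k-fold cross-validation: every example is held out for testing exactly once.
    // Returns the accuracy of the majority-label model on each fold.
    // With folds == labels.size() this becomes leave-one-out validation.
    static double[] crossValidate(List<String> labels, int folds) {
        if (folds < 2 || folds > labels.size()) {
            throw new IllegalArgumentException("folds must be between 2 and the number of examples");
        }
        double[] accuracy = new double[folds];
        for (int f = 0; f < folds; f++) {
            List<String> train = new ArrayList<>();
            List<String> test = new ArrayList<>();
            for (int i = 0; i < labels.size(); i++) {
                // Example i belongs to the test set of fold (i % folds).
                (i % folds == f ? test : train).add(labels.get(i));
            }
            String prediction = majorityLabel(train);
            int correct = 0;
            for (String actual : test) {
                if (actual.equals(prediction)) {
                    correct++;
                }
            }
            accuracy[f] = (double) correct / test.size();
        }
        return accuracy;
    }

    static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    public static void main(String[] args) {
        // Made-up observations: eight sunny days and two rainy days.
        List<String> weather = Arrays.asList(
            "sunny", "sunny", "sunny", "sunny", "rainy",
            "sunny", "sunny", "sunny", "sunny", "rainy");
        double[] perFold = crossValidate(weather, 5);
        System.out.println(Arrays.toString(perFold));
        System.out.println(mean(perFold));
    }
}
```

On this made-up data the mean accuracy is high, yet the fold that holds out the two rainy days scores zero. Looking at per-fold or per-class results, not just overall accuracy, is what exposes a model that never predicts anything but the majority class.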
The true value of data science lies in the ability to extract useful insights and find interesting trends and correlations to support decision making. This overall approach describes, at a high level, how to extract knowledge from raw data. Its greatest benefit is that it forces you to think about a data-driven problem in a more focused way, pointing you in the right direction.
You have learned a simple process that you can follow to apply the machine learning workflow. In upcoming posts, we will go into more detail about each of the steps mentioned here, looking at the types of questions that must be answered and the issues to consider in order to arrive at the best outcome for your data mining project.
This article is excerpted from my upcoming book: Practical Machine Learning in Java, scheduled to be published later this year.