Workflow - Data Collection Stage
Data can be found and presented in many forms, and it can be used to tell a story. Yet, what is the point of collecting so much data if you can't get useful and timely information out of it?
The process of data mining is best applied by progressing through a designated workflow consisting of different stages. This article is part of a series that started with the overview article Applied Machine Learning Workflow. This post aims to help you understand the nature of data collection and how it fits into this workflow.
Data Collection

So, where does the data come from? We have two choices: observe the data from existing sources, or generate the data via surveys, simulations, and experiments. Let's take a closer look at both approaches.
Found or Observed Data
What happens in an Internet minute? Intel (2013) presented an infographic showing the massive amounts of data collected by different services. In 2013, digital devices created four zettabytes of data. In 2017, the number of connected devices is expected to reach three times the number of people on Earth.
To get the data from the Internet, there are multiple options:
- Bulk downloads from websites such as Wikipedia, IMDb, and the Million Song Dataset
- Accessing the data through an API (NY Times, Twitter, Facebook, Foursquare)
- Web scraping – it is OK to scrape public, non-sensitive, and anonymized data.
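For illustration, here is a minimal sketch of the scraping option using only Python's standard library. The `LinkExtractor` class and the inline HTML snippet are made up for this example; in practice you would first download a page (for example with `urllib.request`) and, above all, respect the site's terms of service and robots.txt.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag it encounters."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# In a real scraper you would fetch the page first, e.g. with
# urllib.request.urlopen(...); here we parse an inline snippet instead.
parser = LinkExtractor()
parser.feed('<p>See <a href="/a">one</a> and <a href="/b">two</a>.</p>')
print(parser.links)  # ['/a', '/b']
```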
IT operations face a sheer volume of data in which they are all but drowning. In monitoring complex, rapidly growing, and changing IT infrastructures and applications, every second of every day, IT generates enormous amounts of data around operational activity – system behavior, application performance, user actions, security activity, and more. Furthermore, IT is inundated with 'noise' from false positives, dealing with hundreds upon hundreds of alarms a day.
The magnitude of these data sets is known as Big Data, "big" due to the large magnitude of three independent characteristics of the data ("the 3 V's") – the volume of the data, the velocity with which the data is generated, and the variety of forms of the data. For instance, to grasp the enormous magnitude of Big Data, consider that an enterprise as large as HP is estimated to generate 1 trillion events per day, or roughly 12 million events per second. Moreover, the volume, velocity, and variety of the data businesses generate are only expected to increase.
With data at this scale, only machines can extract effective insights. Listening to huge streams of data from many sources at multiple infrastructure layers, a learning engine can discover trends and patterns and identify abnormalities.
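As a toy illustration of the kind of abnormality detection such an engine performs, the sketch below flags points in a metric stream that deviate strongly from a trailing window. The window size, threshold, and synthetic metric are arbitrary choices for the example, not a prescribed method.

```python
from collections import deque
import math

def detect_anomalies(stream, window=20, threshold=3.0):
    """Flag indices whose value deviates more than `threshold` standard
    deviations from the mean of the trailing window of points."""
    recent = deque(maxlen=window)
    anomalies = []
    for i, x in enumerate(stream):
        if len(recent) == recent.maxlen:
            mean = sum(recent) / len(recent)
            var = sum((v - mean) ** 2 for v in recent) / len(recent)
            std = math.sqrt(var)
            if std > 0 and abs(x - mean) / std > threshold:
                anomalies.append(i)
        recent.append(x)
    return anomalies

# A gently oscillating synthetic metric with one spike at index 30.
metric = [10 + (i % 5) * 0.1 for i in range(50)]
metric[30] = 100.0
print(detect_anomalies(metric))  # [30]
```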
Generated Data

An alternative approach is to generate the data yourself, for example, with a survey. In survey design, we have to pay attention to data sampling, that is, to who is completing the survey. We only get data from the users who are accessible and willing to respond. Also, respondents can provide answers that are in line with their self-image and the researcher's expectations.
Next, data can be collected with simulations, where a domain expert specifies a behavior model of users at the micro level. For instance, crowd simulation requires specifying how different types of users will behave in a crowd, for example, following the crowd, looking for an escape, and so on. The simulation can then be run under different conditions to see what happens (Tsai et al., 2011). Simulations are appropriate for studying macro phenomena and emergent behavior; however, they are typically hard to validate empirically. Furthermore, you can design experiments to thoroughly cover all possible outcomes, where you keep all the variables constant and manipulate only one variable at a time.
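To make the simulation idea concrete, here is a deliberately tiny one-dimensional sketch (not the model of Tsai et al.): "followers" step toward the crowd's mean position while "escapers" step away from it, and we can rerun it under different mixes of agent types to see what emerges.

```python
import random

def simulate_crowd(n_followers, n_escapers, steps=100, seed=42):
    """Toy 1-D crowd model: followers move 10% of the way toward the
    crowd's mean position each step, escapers move away from it.
    Returns the final list of (kind, position) pairs."""
    rng = random.Random(seed)
    agents = [("follower", rng.uniform(-1, 1)) for _ in range(n_followers)]
    agents += [("escaper", rng.uniform(-1, 1)) for _ in range(n_escapers)]
    for _ in range(steps):
        center = sum(pos for _, pos in agents) / len(agents)
        agents = [
            (kind,
             pos + 0.1 * (center - pos) if kind == "follower"
             else pos - 0.1 * (center - pos))
            for kind, pos in agents
        ]
    return agents

# Rerun under two conditions: followers alone bunch together, while
# adding escapers makes the crowd disperse.
final = simulate_crowd(3, 2, steps=100)
print(min(p for _, p in final), max(p for _, p in final))
```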
In IT operations, the process of generating data is often already established: usually there is a monitoring tool plugged in, collecting the key performance indicators we want to track. If we want to generate additional data, we need to invoke or write additional scripts that measure, query, and log them.
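A minimal sketch of such a script, using only Python's standard library, might look as follows. It assumes a Unix-like system (for `os.getloadavg`), and the choice of metrics and field names is purely illustrative.

```python
import json
import os
import shutil
import time

def sample_kpis(path="/"):
    """Take one sample of a few host-level key performance indicators.
    Metric choices here are illustrative, not a standard set."""
    disk = shutil.disk_usage(path)
    load1, load5, load15 = os.getloadavg()  # Unix-like systems only
    return {
        "timestamp": time.time(),
        "load_1min": load1,
        "disk_used_pct": 100.0 * disk.used / disk.total,
    }

def log_kpis(samples=3, interval=1.0, logfile="kpis.jsonl"):
    """Append one JSON line per sample -- a minimal monitoring loop."""
    with open(logfile, "a") as f:
        for _ in range(samples):
            f.write(json.dumps(sample_kpis()) + "\n")
            time.sleep(interval)
```

In a real deployment this loop would run as a scheduled job or a daemon, feeding the same pipeline as the monitoring tool's own metrics.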
Data collection may involve many traps. To demonstrate one, let me share a story. There is supposed to be a global unwritten rule that regular mail between students is delivered for free: if you write 'student to student' in the place where the stamp should be, the mail is delivered to the recipient at no charge. Now suppose Jacob sends a set of postcards to Emma, and given that Emma indeed receives some of the postcards, she concludes that all the postcards were delivered and that the rule indeed holds true. However, she has no information about the postcards Jacob sent that were never delivered, so she cannot account for them in her inference. What Emma experienced is survivorship bias: she drew her conclusion from the surviving data only. For your information, postcards sent with a 'student to student' stamp get a circled black letter T stamped on them, which means postage is due and the receiver should pay it, including a small fine. However, mail services often incur higher costs in collecting such a fee than the fee is worth, and hence do not do it (Magalhães, 2010).
Another example is a study that found the profession with the lowest average age of death to be 'student'. Being a student does not cause you to die at an early age; being a student means you are young, and that is what makes the average age of those who die so low (Gelman and Nolan, 2002).
Furthermore, a study found that only 1.5% of drivers in accidents reported they were using a cell phone, whereas 10.9% reported that another occupant in the car had distracted them. Can we conclude that using a cell phone is safer than speaking with another occupant (Utts, 2003)? To answer this question, we need to know the prevalence of cell phone use: it is likely that many more people talk to another occupant while driving than talk on a cell phone.
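A few lines of arithmetic make the role of prevalence visible. All of the prevalence and population numbers below are hypothetical, invented for the illustration; only the 1.5% and 10.9% accident shares come from the study cited above.

```python
def risk_per_exposed_driver(accident_share, driver_share,
                            total_accidents=1000, total_drivers=100_000):
    """Accidents attributed to a factor, divided by the number of
    drivers exposed to that factor (all totals are hypothetical)."""
    exposed = total_drivers * driver_share
    return total_accidents * accident_share / exposed

# Suppose only 3% of drivers are on the phone at any moment, while 30%
# are talking to a passenger (made-up prevalences for illustration).
phone = risk_per_exposed_driver(0.015, 0.03)       # 15 accidents / 3,000 drivers
passenger = risk_per_exposed_driver(0.109, 0.30)   # 109 accidents / 30,000 drivers
print(phone > passenger)  # True: the per-driver risk ordering flips
```

Under these made-up prevalences, the rarer behavior (phone use) carries the higher per-driver risk even though it appears in fewer accident reports.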
For IT operations, finding the data is not the problem; the real challenge is deciding which data to collect and how not to be overwhelmed by useless noise. Collecting detailed metrics from the systems, on both the application and infrastructure side, IT operations specialists need to track data from many sources, including:
- Deployment Automation
- Service Desk
It is important to understand the capabilities and limitations of each data source so that you can determine its applicability for your organization.
Gathering large volumes of IT operations data, IT Operations Management needs to leverage this data to build an adaptive system that is more proactive and less reactive. The more the system can learn from the data, the better it can identify variances and problem areas in a timely manner, helping IT fix issues before they negatively impact the business through poor performance or downtime. So the main question for IT operations to consider is: "How do you get more useful insights out of the collected data?"
IT Big Data analytics (or IT Operations Analytics) facilitates extracting insights that only manifest themselves through the parallel analysis of large independent data sets, helping to monitor IT operations and detect security issues. With analytics, important variables can be uncovered and used to predict an outcome. The more data collected at a detailed level, the more accurate these predictions become, providing insights and inferences for a variety of use cases.
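As a small sketch of uncovering important variables, the snippet below ranks two synthetic metrics by how strongly they correlate with an incident count. The metric names and all data values are invented for the example; real analytics would of course go far beyond a single correlation.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient, computed from scratch."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Synthetic example: which metric tracks the incident count?
incidents = [0, 1, 0, 2, 3, 1, 4, 5]
metrics = {
    "cpu_load":  [10, 20, 12, 35, 50, 22, 60, 70],  # rises with incidents
    "disk_free": [90, 88, 91, 87, 89, 90, 88, 91],  # essentially unrelated
}
ranked = sorted(metrics, key=lambda m: -abs(pearson(metrics[m], incidents)))
print(ranked[0])  # cpu_load
```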