Algorithm of the Week: k-Nearest Neighbors (k-NN)
k-Nearest Neighbors (or k-NN for short) is a simple supervised learning algorithm that stores all available examples and classifies new examples based on a similarity measure (i.e., a distance function). k-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification. It is one of the simplest classification techniques and is particularly useful when there is little or no prior knowledge about the distribution of the data. It can be used for both classification and regression problems.
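To make the "store everything, compute at classification time" idea concrete, here is a minimal sketch of a k-NN classifier in plain Python (the toy data and the `euclidean` and `knn_classify` helpers are illustrative, not from any particular library):

```python
from collections import Counter
import math

def euclidean(x, y):
    # straight-line distance between two feature vectors
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training examples.

    `train` is a list of (feature_vector, label) pairs. Note that no model is
    built up front: all work happens here, at classification time ("lazy").
    """
    neighbors = sorted(train, key=lambda ex: euclidean(ex[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# toy data: two well-separated clusters
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B")]
print(knn_classify(train, (2, 2), k=3))  # "A" - closest to the first cluster
```

Because every query scans all stored examples, the naive version is O(n) per classification; real systems speed this up with spatial indexes such as k-d trees.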
This classification method is highly dependent on defining an appropriate similarity or distance measure.
The most common example of a distance measure is the Euclidean distance:

d(x, y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2}

where n is the number of dimensions (attributes) and x_k and y_k are the k-th attributes (components) of data objects x and y, respectively.
A generalization of the Euclidean distance is the p-norm (Minkowski) distance:

d(x, y) = \left( \sum_{k=1}^{n} |x_k - y_k|^p \right)^{1/p}

where p is the degree of the distance; for example, p = 1 is the Manhattan distance, p = 2 is the Euclidean distance, and so on.
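A quick sketch of the p-norm in code (the `minkowski` function name is my own):

```python
def minkowski(x, y, p=2):
    # p-norm distance; p=1 gives Manhattan, p=2 gives Euclidean
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

print(minkowski((0, 0), (3, 4), p=1))  # 7.0 (Manhattan: |3| + |4|)
print(minkowski((0, 0), (3, 4), p=2))  # 5.0 (Euclidean: sqrt(9 + 16))
```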
The Mahalanobis distance, for example, measures the distance between a point and a distribution:

D(x) = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}

where \mu is the mean and \Sigma is the covariance matrix of the data.
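As an illustration, here is a hand-rolled sketch for the two-dimensional case (the `mahalanobis_2d` helper and its toy inputs are hypothetical; in practice a linear-algebra library would invert the covariance matrix):

```python
import math

def mahalanobis_2d(x, mean, cov):
    """Mahalanobis distance of a 2-D point from a distribution with the given
    mean and 2x2 covariance matrix (inverted by hand for simplicity)."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx = (x[0] - mean[0], x[1] - mean[1])
    # compute (x - mu)^T Sigma^-1 (x - mu)
    left = (dx[0] * inv[0][0] + dx[1] * inv[1][0],
            dx[0] * inv[0][1] + dx[1] * inv[1][1])
    return math.sqrt(left[0] * dx[0] + left[1] * dx[1])

# with the identity covariance, Mahalanobis reduces to Euclidean distance
print(mahalanobis_2d((3, 4), (0, 0), ((1, 0), (0, 1))))  # 5.0
```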
There are many more measures focusing on different properties. For example, the edit distance measures how many operations it takes to transform one string into another, the Hamming distance compares two vectors and counts the number of positions in which they differ, the cosine distance measures the angle between two examples, and so on. An impressive overview of similarity measures is given in the second chapter of the book Image Registration: Principles, Tools and Methods (Advances in Computer Vision and Pattern Recognition) by A. A. Goshtasby (2012).
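The Hamming and cosine distances are short enough to sketch directly (function names are illustrative):

```python
import math

def hamming(x, y):
    # number of positions at which the two equal-length sequences differ
    return sum(a != b for a, b in zip(x, y))

def cosine_distance(x, y):
    # 1 minus the cosine of the angle between the two vectors
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1 - dot / norm

print(hamming("karolin", "kathrin"))        # 3 (they differ in 3 positions)
print(cosine_distance((1, 0), (0, 1)))      # 1.0 - orthogonal vectors
```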
As a data mining technique, k-NN has a wide variety of applications in classification as well as regression, including:
- Brand management
- Plagiarism detection
- Advertising campaigns
- Handwriting detection
- Video detection
- Simulating daily precipitations and other weather variables
- Evaluation of forest inventories and for estimating forest variables
- Climate forecasting and estimating soil water parameters
- Predict whether a patient will have a second heart attack
- Estimate the amount of glucose in the blood of a diabetic person
- Identify the risk factors for prostate cancer
- Diagnosis of breast cancer
- Stock market forecasting: uncovering market trends, planning investment strategies, predicting the price of a stock on the basis of company performance measures and economic data, and identifying the best time and stocks to purchase
- Currency exchange rate
- Bank bankruptcies
- Understanding and managing financial risk
- Trading futures
- Credit rating
- Loan management
- Bank customer profiling
- Money laundering analyses
- Prediction of solvent accessibility in protein molecules
- Detection of intrusions in computer systems
- Management of databases of moving objects, such as computers with wireless connections
Applying k-NN to IT Operations Analytics (ITOA)
The k-NN method works well for dynamic environments that require frequent updates, making it attractive for anomaly detection and outlier detection, identifying suspicious activity and false alerts.

Anomaly Detection
By applying k-NN for complex pattern analysis, an IT Operations Analytics solution can analyze connected data and identify abnormal conditions or gradual trends using density-based distance measures. Combined with cross-correlation and statistical anomaly detection, an IT operations analytics engine can then send real-time alerts when deviations occur. These intelligent anomaly detection systems can pinpoint strange behavior in environments before a full-fledged incident that impacts security or performance occurs.
k-NN can also be applied to outlier detection, helping to monitor data streams and detect changes in them. Outlier detection is an essential step preceding almost any data analysis routine, and is used either to suppress or to amplify outliers. The initial usage (also known as the learning period) improves the robustness of the data for analysis. The algorithm then searches for rare patterns in areas such as fraud analysis, intrusion detection, and web purchase analysis, among others. k-NN also allows flexibility in determining how confidently a point is labeled an outlier, by using more neighbors in the decision.
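One common way to turn k-NN into an outlier score is to use a point's distance to its k-th nearest neighbor: points in sparse regions score high. A minimal sketch (the `knn_outlier_score` function and the toy data are hypothetical):

```python
import math

def knn_outlier_score(points, x, k=3):
    """Outlier score of `x`: its distance to the k-th nearest stored point.
    Larger scores mean the point lies in a sparser region of the data."""
    dists = sorted(math.dist(x, p) for p in points)
    return dists[k - 1]

data = [(1, 1), (1, 2), (2, 1), (2, 2), (10, 10)]
print(knn_outlier_score(data, (1.5, 1.5), k=3))  # small - inside the cluster
print(knn_outlier_score(data, (10, 10), k=3))    # large - isolated point
```

A threshold on this score (learned during an initial period) then separates "normal" from "anomalous" observations.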
By automatically learning changing patterns in individual servers and detecting anomalies when servers behave differently from the norm, IT professionals can then use these early insights to prevent potential problems and avoid false positives and annoying alerts.