Use the K Nearest Neighbor Outliers method in the Explore Outliers platform to identify outliers based on the distance from each point to its nearest neighbors. For each value of k, the K Nearest Neighbor Outliers method displays a plot of the Euclidean distance from each point to its kth nearest neighbor. You specify the largest value of k, denoted as K. Plots are provided for k = 1, 2, 3, 5, ..., K. The values of k follow the Fibonacci sequence to avoid displaying too many plots.
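The following Python sketch illustrates the computation described above. It is not the platform's implementation. The function names fibonacci_ks and kth_neighbor_distances are hypothetical, appending K when it is not itself a Fibonacci number is an assumption, and the code applies no centering or scaling to the columns.

```python
import math

def fibonacci_ks(K):
    """Return k = 1, 2, 3, 5, 8, ... up to K.

    Assumption: if K is not a Fibonacci number, it is appended as the last value.
    """
    ks, a, b = [], 1, 2
    while a <= K:
        ks.append(a)
        a, b = b, a + b
    if ks and ks[-1] != K:
        ks.append(K)
    return ks

def kth_neighbor_distances(points, k):
    """Euclidean distance from each point to its kth nearest neighbor (brute force)."""
    result = []
    for i, p in enumerate(points):
        # Distances to every other point, smallest first.
        others = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(others[k - 1])
    return result

# Usage: one list of distances per plotted value of k.
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
distances_by_k = {k: kth_neighbor_distances(points, k) for k in fibonacci_ks(2)}
```

The brute-force search compares every pair of points, which is adequate for a small illustration but slow for large tables.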
Before the nearest neighbors are calculated, the columns are centered and scaled. The scaling factor is as follows:
max [Q(0.75) - Q(0.50), Q(0.50) - Q(0.25)] / [normalQuantile(0.75)]
where
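Q(p) is the pth quantile of the column, and normalQuantile(0.75) is the 0.75 quantile of the standard normal distribution (approximately 0.6745)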
Note: If Q(0.75) or Q(0.25) is equal to the median, then more extreme quantiles are used until the range is nonzero.
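The scaling factor can be sketched in Python as follows. This is an illustration under stated assumptions, not the platform's code: the helper names quantile, robust_scale, and standardize are hypothetical, the linearly interpolated quantile might differ from the quantile definition that the platform uses, centering at the median is an assumption, and the fallback to more extreme quantiles is not reproduced (the sketch raises an error instead).

```python
from statistics import NormalDist, median

def quantile(sorted_vals, p):
    """Linearly interpolated quantile of pre-sorted data (an assumed definition)."""
    pos = p * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])

def robust_scale(values):
    """max[Q(0.75) - Q(0.50), Q(0.50) - Q(0.25)] / normalQuantile(0.75)."""
    s = sorted(values)
    q25, q50, q75 = (quantile(s, p) for p in (0.25, 0.50, 0.75))
    spread = max(q75 - q50, q50 - q25)
    if spread == 0:
        # The fallback to more extreme quantiles is not implemented in this sketch.
        raise ValueError("quartiles equal the median; fallback not implemented")
    return spread / NormalDist().inv_cdf(0.75)

def standardize(values):
    """Center and scale one column. Assumption: the center is the median."""
    center = median(values)
    scale = robust_scale(values)
    return [(v - center) / scale for v in values]

# Usage: standardize a single made-up column.
column = [1.0, 2.0, 3.0, 4.0, 5.0]
scaled = standardize(column)
```

Dividing by normalQuantile(0.75) makes the scale comparable to a standard deviation when the data are approximately normal, because the upper quartile of a normal distribution lies about 0.6745 standard deviations above its median.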
This approach is sensitive to the specified value of k. A small value of k can fail to identify true outliers, and a large value of k can falsely classify points as outliers:
• Suppose that the specified K is small, so that you are studying only a few neighbors. If there is a cluster of more than K points that is far from the rest of the points, then the points within the cluster have small distances to their nearest neighbors. You might be unable to detect the cluster of outliers.
• Suppose that the specified K is large, so that you are studying a large number of neighbors. If there are clusters with fewer than K data points, then the points within these clusters can appear to be outliers. You might overlook the fact that the points form a cluster, interpreting the individual cluster members as outliers instead.
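The two situations above can be reproduced with a small Python sketch. The data are made up for illustration, the function name kth_neighbor_distances is hypothetical, and no centering or scaling is applied. A distant cluster of three points sits far from a grid of 20 points.

```python
import math

def kth_neighbor_distances(points, k):
    """Euclidean distance from each point to its kth nearest neighbor (brute force)."""
    result = []
    for i, p in enumerate(points):
        others = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(others[k - 1])
    return result

# Made-up data: a 5 x 4 unit grid of 20 points plus a distant cluster of 3 points.
main = [(float(x), float(y)) for x in range(5) for y in range(4)]
far = [(20.0, 20.0), (20.0, 21.0), (21.0, 20.0)]
points = main + far

for k in (2, 3):
    d = kth_neighbor_distances(points, k)
    largest_main = max(d[:len(main)])   # largest distance among grid points
    smallest_far = min(d[len(main):])   # smallest distance among far-cluster points
    print(f"k={k}: largest grid distance {largest_main:.2f}, "
          f"smallest far-cluster distance {smallest_far:.2f}")
```

For k = 2, each member of the distant cluster finds both neighbors inside the cluster, so its distance is about as small as the distances within the grid and the cluster goes undetected. For k = 3, each member must reach into the grid for its third neighbor, so all three points show large distances and can be read as individual outliers rather than as a cluster.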