To launch Explore Outliers, select Analyze > Screening > Explore Outliers. The launch window appears.
In the launch window, select the analysis columns as Y, Columns. You can also specify a By variable. After you click OK, the Explore Outliers report appears. You are presented with the following four outlier analysis commands:
The Quantile Range Outliers method of outlier detection uses the quantile distribution of the values in a column to locate the extreme values. Quantiles are useful for detecting outliers because there is no distributional assumption associated with them. Data are simply sorted from smallest to largest. For example, the 20th quantile is the value at which 20% of values are smaller. Extreme values are found using a multiplier of the interquantile range, the distance between two specified quantiles. For more details about how quantiles are computed, see Quantiles in the Basic Analysis book.
The Quantile Range Outliers panel enables you to specify how outliers are to be calculated and how you want to manage them. Quantile Range Outliers Window shows the default Quantile Range Outliers window.
A value is considered an outlier if it lies more than Q times the interquantile range below the lower quantile or above the upper quantile. You can adjust the value of Q and the size of the interquantile range.
The multiplier used to flag values as outliers. A value is considered an outlier if it lies more than Q times the interquantile range beyond the lower or upper tail quantile. Large values of Q provide a more conservative set of outliers than small values. The default is 3.
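The quantile range rule can be sketched in a few lines of Python. This is only an illustration of the rule as described here, not the platform's implementation. The function names, the tail quantile of 0.1, and the linear-interpolation quantile definition are all assumptions made for the example; only the default multiplier Q = 3 comes from the text above.

```python
def quantile(sorted_vals, p):
    """Quantile of pre-sorted data by linear interpolation (one common definition)."""
    n = len(sorted_vals)
    pos = p * (n - 1)
    lo = int(pos)
    hi = min(lo + 1, n - 1)
    frac = pos - lo
    return sorted_vals[lo] + frac * (sorted_vals[hi] - sorted_vals[lo])


def quantile_range_outliers(values, tail_quantile=0.1, q=3.0):
    """Return values lying more than q interquantile ranges past either tail quantile.

    tail_quantile=0.1 is an assumed setting for this sketch.
    """
    data = sorted(values)
    lower = quantile(data, tail_quantile)
    upper = quantile(data, 1.0 - tail_quantile)
    spread = upper - lower  # the interquantile range
    low_cut = lower - q * spread
    high_cut = upper + q * spread
    return [v for v in values if v < low_cut or v > high_cut]


# Nineteen ordinary values plus one suspicious entry.
sample = list(range(1, 20)) + [500]
print(quantile_range_outliers(sample))
```

Raising `q` widens both cutoffs, so fewer values are flagged, which matches the statement that large values of Q give a more conservative set of outliers.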
Turns on the exclude row state for the selected rows. Click Rescan to update the Quantile Range Outliers report.
Adds the selected outliers to the missing value codes column property. Use this option to identify known missing value or error codes within the data. Missing value and error codes are often integers and are sometimes either a positive or negative series of nines. Click Rescan to update the Quantile Range Outliers report.
Changes the outlier value to a missing value in the data table. Use caution when changing values to missing. Change values to missing only if the data are known to be invalid or inaccurate. Click Rescan to update the Quantile Range Outliers report.
Adds the selected outlier values to the missing value codes column property. You must click Rescan to update the Quantile Range Outliers report.
Note: The first time you choose an action (such as Change to Missing or Exclude Rows) that changes your data, an alert window warns you to use the Save As command to save your data table as a new file, preserving a copy of your original data. When this window appears, click OK. If you decide to save your new data file, select File > Save As and save the file with a new name.
Robust estimates of parameters are less sensitive to outliers than non-robust estimates. Robust Fit Outliers provides several types of robust estimates of the center and spread of your data to determine those values that can be considered extreme. Robust Fit Outliers Window shows the default Robust Fit Outliers window.
Given a robust estimate of the center and spread, outliers are defined as those values that are more than K times the robust spread from the robust center. The Robust Fit Outliers window provides several options for calculating the robust estimates and the multiplier K, as well as tools to manage the outliers found.
The multiplier that determines outliers as K times the spread away from the center. Large values of K provide a more conservative set of outliers than small values. The default is 4.
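As a rough illustration of the K-times-spread rule, the sketch below uses the median as the robust center and a scaled median absolute deviation (MAD) as the robust spread. These two estimators are assumptions chosen for the example and are not necessarily among the estimates the platform offers. The function name is hypothetical, and only the default K = 4 comes from the text above.

```python
from statistics import median


def robust_outliers(values, k=4.0):
    """Flag values more than k robust spreads away from the robust center.

    Assumed estimators: median for the center, 1.4826 * MAD for the spread
    (the factor makes MAD comparable to a standard deviation for normal data).
    """
    center = median(values)
    mad = median([abs(v - center) for v in values])
    spread = 1.4826 * mad
    if spread == 0:
        return []  # no usable spread estimate, so nothing can be flagged
    return [v for v in values if abs(v - center) > k * spread]


readings = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 25.0]
print(robust_outliers(readings))
```

Because the median and MAD barely move when a single extreme value is present, the value 25.0 does not inflate the spread estimate that is used to judge it.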
Sets the Exclude Row state for outliers in the selected columns in the data table. Click Rescan to update the Robust Estimates and Outliers report.
Adds the selected outliers to the missing value codes column property for the selected columns. Use this option to identify known missing value or error codes within the data. Click Rescan to update the Robust Estimates and Outliers report.
Changes the outlier value to a missing value in the data table. Click Rescan to update the Robust Estimates and Outliers report.
The Outlier Analysis calculates the Mahalanobis distances from each point to the center of the multivariate normal distribution. This measure relates to contours of the multivariate normal density with respect to the correlation structure. The greater the distance from the center, the higher the probability that it is an outlier. For more information about the Mahalanobis distance and other distance measures, see Multivariate Platform Options in the Multivariate Methods book.
You can save the distances to the data table by selecting the Save option from the Mahalanobis Distances red triangle menu.
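To make the distance concrete, here is a minimal sketch of the Mahalanobis distance for two columns. It uses the ordinary sample mean and covariance rather than robust estimates, which is a simplification assumed for the example. The function name and data are hypothetical.

```python
from math import sqrt


def mahalanobis_2d(points):
    """Mahalanobis distance of each (x, y) point from the sample mean.

    Uses the classical mean and covariance (divisor n - 1), not robust estimates.
    """
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # (dx, dy) times the inverse of the 2x2 covariance matrix times (dx, dy)
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(sqrt(d2))
    return distances


# Six points following a strong positive trend, plus one that breaks the pattern.
pts = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.9), (5, 5.1), (6, 6.0), (3, 6.0)]
d = mahalanobis_2d(pts)
print(d.index(max(d)))  # index of the point farthest from the center
```

The last point is unremarkable in either column alone. It stands out only because it departs from the correlation structure, which is what this distance measures.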
Multivariate Robust Outliers Mahalanobis Distance Plot shows the Mahalanobis distances of 16 different columns. The plot contains an upper control limit (UCL) of 4.82. This UCL is meant to be a helpful guide to show where potential outliers might be. However, you should use your own discretion to determine which values are outliers. For more details about this upper control limit (UCL), see Mason and Young (2002).
The red triangle menu for Multivariate with Robust Estimates contains numerous options to analyze your multivariate data. For a list and description of these options, see Multivariate Platform Options in the Multivariate Methods book.
The basic approach of outlier detection is to consider points distant from other points as outliers. One way of determining the distance of a point to other clusters of points is to explore the distance to its nearest neighbors. For each value of K, the Multivariate k-Nearest Neighbor Outliers utility displays a plot of the Euclidean distance from each point to its Kth nearest neighbor. You specify the largest value of K, denoted as k. Plots are provided for selected values of K up to k, skipping values according to the Fibonacci sequence to avoid displaying too many plots.
This approach is sensitive to the specified value of k. A small value of k can miss identifying points as outliers and a large value of k can falsely classify points as outliers:
• Suppose that the specified k is small, so that you are only studying a few neighbors. If there is a cluster of more than k points that is far from the rest of the points, then the points within the cluster will have small distances to their nearest neighbors. You may be unable to detect the cluster of outliers.
• Suppose that the specified k is large, so that you are studying a large number of neighbors. If there are clusters with fewer than k data points, then the points within these clusters may appear to be outliers. You may overlook the fact that the points form a cluster, interpreting the individual cluster members as outliers instead.
When you select Multivariate k-Nearest Neighbor Outliers from the list of commands, you are asked to specify the value of k to use as an upper bound for the furthest neighbor to be considered. Notice that the default value is set to 8.
The report shows plots for select values of K up to the value k. The value of K for each plot is displayed in its vertical axis label, which is of the form Distance to Neighbor K = <a>, where a is an integer denoting the ath closest neighbor. Each plot shows the distance from the point in the ith row to its ath nearest neighbor. The points that have large distances from their neighbors, across multiple values of K, are likely to be outliers.
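The quantity plotted for each value of K can be sketched as follows. This is a brute-force illustration with a hypothetical function name and made-up data, not the utility's implementation. It assumes K is smaller than the number of points.

```python
from math import dist


def kth_neighbor_distances(points, K):
    """Euclidean distance from each point to its Kth nearest neighbor.

    Brute force: compares every pair of points. Requires K < len(points).
    """
    result = []
    for i, p in enumerate(points):
        others = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(others[K - 1])
    return result


# A tight group of four points and a distant pair.
data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11)]
for K in (1, 2):
    print(K, kth_neighbor_distances(data, K))
```

With K = 1, every point has a neighbor at distance 1, so the distant pair goes unnoticed. With K = 2, the two points in the pair show a much larger distance than the rest. This is the sensitivity to the number of neighbors described earlier in this section.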