The Robust Fit Outliers report in the Explore Outliers platform includes a set of controls and results organized on multiple tabs.
The Robust Fit Outliers controls specify the method used for calculating the robust estimates and the multiplier K. Given a robust estimate of the center and spread, outliers are defined as those values that are more than K times the robust spread away from the robust center.
Figure 21.7 Robust Fit Outliers Controls
Huber
Uses Huber M-Estimation to estimate center and spread. This option is the default. See Huber and Ronchetti (2009).
Cauchy
Assumes a Cauchy distribution to calculate estimates for the center and spread. Cauchy estimates have a high breakdown point and are typically more robust than Huber estimates. However, if your data are separated into clusters, the Cauchy distribution tends to consider only the half of the data that is clustered more closely, ignoring the rest.
Quartile
Uses the median as the measure of center and the interquartile range (IQR) divided by 1.34898 as the measure of spread. Dividing the IQR by the 1.34898 factor results in the spread corresponding to one standard deviation if the data are normally distributed.
K Sigma
The multiplier that determines outliers as K times the spread away from the center. Large values of K provide a more conservative set of outliers than small values. The default is 4.
Rescan
Rescans the data after outlier actions have been taken.
Tip: Press Ctrl and click Rescan to rescan across all open outlier methods.
Close
Closes the Robust Fit Outliers panel.
Tip: Press Ctrl and click Close to close all outlier reports.
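As a rough illustration of the Quartile method and the K Sigma multiplier described above, the following Python sketch uses the median as the center and the IQR divided by 1.34898 as the spread, then flags values more than K spreads from the center. The function name quartile_outliers and the sample data are hypothetical. The quantile interpolation in Python's statistics.quantiles is an assumption and might not match the one used by the platform, so results can differ slightly from the report.

```python
import statistics

def quartile_outliers(values, k=4.0):
    """Return (center, spread, outliers) for one numeric column.

    center: the median
    spread: IQR / 1.34898, which corresponds to one standard deviation
            when the data are normally distributed
    outliers: values more than k spreads away from the center
    """
    center = statistics.median(values)
    # Assumption: "inclusive" quantile interpolation; the platform may differ.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = (q3 - q1) / 1.34898
    outliers = [v for v in values if abs(v - center) > k * spread]
    return center, spread, outliers

# Hypothetical data: eight values near 10 and one extreme value.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 25.0]
center, spread, outliers = quartile_outliers(data, k=4.0)
print(center, round(spread, 4), outliers)
```

Raising k in this sketch flags fewer values, which mirrors the more conservative outlier set that a larger K Sigma produces.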
The Outliers by Column tab in the Robust Fit Outliers report contains a table with a row for each column selected in the launch window. The columns of the table depend on the technique that is used to estimate the center and spread of the data: Huber, Cauchy, or Quartile. For each technique, there is a column of the estimated center, the estimated spread, and the number of outliers based on the center and spread.
The Outliers by Column tab contains the following options that can be applied when one or more rows are selected in the outliers table:
Show only columns with outliers
Removes columns without outliers from the table in the Outliers by Column tab.
Identify Outliers in Table
Applies actions to the original data table for selected rows in the outlier summary table.
Select Rows
Selects the rows containing outliers.
Exclude Rows
Applies the exclude row state. Click Rescan to update the Robust Fit Outliers report.
Color Cells
Colors the cells containing outliers. Low-valued outliers are colored blue and high-valued outliers are colored red.
Color Rows
Colors the rows containing outliers.
Clear Outliers in Table
Applies actions to the original data table for selected rows in the outlier summary table.
Add to Missing Value Codes
Adds the selected outliers to the missing value codes column property. Use this option to identify known missing value or error codes within the data. Click Rescan to update the Robust Fit Outliers report.
Note: Add to Missing Value Codes is not available with Robust Fit Outliers if a By variable is specified in the launch window.
Change to Missing
Changes the outlier value to a missing value. Click Rescan to update the Robust Fit Outliers report.
Formula Columns
Creates a new formula column for each column to set outliers to missing. The new columns are prefixed or suffixed by a user-specified name to distinguish them from the original columns. By default, the suffix is set to “Culled”.
Formula Script
Creates a script that is added to the data table. When the script is run, it creates a new formula column for each column to set outliers to missing. The new columns are prefixed or suffixed by a user-specified name to distinguish them from the original columns. By default, the suffix is set to “Culled”.
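The effect of the Formula Columns option can be approximated outside the platform. The following Python sketch leaves the original columns untouched and adds a new column for each one in which outliers are replaced by a missing value (None). It is only an illustration: the helper names, the sample table, the center and spread estimates, and the exact form of the new column names (a space followed by "Culled") are assumptions, not the formulas that the platform generates.

```python
def culled_column(values, center, spread, k=4.0):
    """Copy a column, replacing outliers with None (missing)."""
    return [
        None if v is not None and abs(v - center) > k * spread else v
        for v in values
    ]

def add_culled_columns(table, estimates, k=4.0, suffix=" Culled"):
    """Return a new table with one extra culled column per original column.

    table:     dict mapping column name -> list of values
    estimates: dict mapping column name -> (center, spread)
    """
    result = dict(table)  # original columns are kept unchanged
    for name, values in table.items():
        center, spread = estimates[name]
        result[name + suffix] = culled_column(values, center, spread, k)
    return result

# Hypothetical data table and robust estimates.
table = {
    "Weight": [10.0, 10.2, 9.9, 30.0],
    "Height": [5.0, 5.1, 4.9, 5.0],
}
estimates = {"Weight": (10.0, 0.2), "Height": (5.0, 0.1)}
culled = add_culled_columns(table, estimates)
print(culled["Weight Culled"])
```

Keeping the culled values in separate columns preserves the original data, which is the main difference between this option and Change to Missing.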
The Outliers by Cell tab in the Robust Fit Outliers report contains a table of the individual outliers found using the settings specified in the controls. The table shows the column name, row number, outlier distance, and actual value of each outlier. The outlier distance is a measure of how extreme an outlier is and is calculated using the following equation:
Outlier Distance = |x - c| / s
where
x = the actual value of the outlier
c = the center of the column that contains the outlier, measured by the specified outlier method (Huber, Cauchy, or Quartile)
s = the spread of the column that contains the outlier, measured by the specified outlier method (Huber, Cauchy, or Quartile)
A larger outlier distance indicates a more extreme outlier.
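The following Python sketch shows how a table like the one on the Outliers by Cell tab could be assembled. It assumes that the outlier distance is the absolute deviation of the value from the center divided by the spread, |x - c| / s, and that a cell is reported when its distance exceeds K. The function names, sample columns, and estimates are hypothetical.

```python
def outlier_distance(x, center, spread):
    """Distance of x from the center, in units of the robust spread."""
    return abs(x - center) / spread

def outliers_by_cell(columns, estimates, k=4.0):
    """List (column name, row number, outlier distance, value) per outlier.

    columns:   dict mapping column name -> list of values
    estimates: dict mapping column name -> (center, spread)
    Row numbers are 1-based, as in a data table.
    """
    rows = []
    for name, values in columns.items():
        center, spread = estimates[name]
        for row_number, x in enumerate(values, start=1):
            distance = outlier_distance(x, center, spread)
            if distance > k:
                rows.append((name, row_number, distance, x))
    return rows

# Hypothetical columns with one high and one low outlier.
columns = {
    "A": [1.0, 2.0, 3.0, 50.0],
    "B": [10.0, -90.0, 11.0],
}
estimates = {"A": (2.0, 1.0), "B": (10.0, 5.0)}
cells = outliers_by_cell(columns, estimates, k=4.0)
for cell in cells:
    print(cell)
```

Because the distance is expressed in units of spread, values from columns with very different scales can be compared directly.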
The Outliers by Cell tab contains the following options that can be applied when one or more rows are selected in the outliers table:
Identify Outliers in Table
Applies actions to the original data table for selected rows in the outlier summary table.
Select Row and Column
Selects the rows and columns that correspond to the selected outliers.
Color Cells
Colors the cells containing outliers. Low-valued outliers are colored blue and high-valued outliers are colored red.
Clear Outliers in Table
Applies actions to the original data table for selected rows in the outlier summary table.
Add to Missing Value Codes
Adds the selected outliers to the missing value codes column property. Use this option to identify known missing value or error codes within the data. Missing value and error codes are often integers and are sometimes a series of nines. Click Rescan to update the Robust Fit Outliers report.
Note: Add to Missing Value Codes is not available with Robust Fit Outliers if a By variable is specified in the launch window.
Change to Missing
Changes the outlier value to a missing value in the data table. Use caution when changing values to missing. Change values to missing only if the data are known to be invalid or inaccurate. Click Rescan to update the Robust Fit Outliers report.
Note: If the selected outlier has been added to the missing value codes, the outlier is not changed to a missing value.