Data Processing Options

The Data Processing red triangle menu in the Functional Data Explorer platform contains the following options:

Cleanup

A submenu of the following data cleanup options:

Remove Zeros

Removes observations with zero values. If there are no zeros in the data, an alert appears, indicating that no zero values were found.

Remove Value

Displays a specifications window that enables you to specify a value to remove from the data.

Remove Selected

Removes observations that correspond to rows that are selected in the data table.

Remove Unselected

Removes observations that correspond to rows that are not selected in the data table.

Caution: Remove Selected and Remove Unselected identify the observations to remove by row number. When Auto Recalc is enabled, you must add or delete rows before using these options.

Filter X

Removes X values that fall outside of a specified interval. When you select the Filter X option, you must specify Below and Above values. The X values that fall outside of the specified interval are not used for the analysis.

Filter Y

Removes Y values that fall outside of a specified interval. When you select the Filter Y option, you must specify Below and Above values. The Y values that fall outside of the specified interval are not used for the analysis.

Reduce

Reduces the data over the X values using one of the following techniques:

• Use the Grid tab to interpolate observations to a common grid of values. You can specify the grid size. By default, the grid size is the number of values in the longest function. This is also the maximum allowable grid size.

• Use the Bin tab to create a specified number of bins that are evenly spaced over the unique X values. For each function (or level of the ID, Function variable), the observations within a bin are averaged to produce a Y value for the corresponding bin level.

• Use the Thin tab to remove every Nth observation over the X values, where N is determined by the specified thinning rate. This is done for each function (or level of the ID, Function variable). By default, the thinning rate is 2, which removes half of the observations in each function.
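The Grid and Bin techniques above can be sketched in a few lines of Python. This is only an illustration, not the platform's implementation: the function names `interpolate_to_grid` and `bin_average` are hypothetical, linear interpolation is an assumption (the source does not say which interpolation is used), and each bin is reported at its midpoint by assumption.

```python
def interpolate_to_grid(points, grid):
    """Linearly interpolate one function's (x, y) pairs onto the grid x values.

    Assumes distinct x values; grid values outside the observed range are skipped.
    """
    pts = sorted(points)
    out = []
    for g in grid:
        # Find the first pair of neighbouring observations that brackets g.
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if x0 <= g <= x1:
                t = (g - x0) / (x1 - x0)
                out.append((g, y0 + t * (y1 - y0)))
                break
    return out


def bin_average(points, n_bins, x_min, x_max):
    """Average Y within n_bins equal-width bins spanning [x_min, x_max]."""
    width = (x_max - x_min) / n_bins
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for x, y in points:
        # Clamp the index so that x == x_max falls in the last bin.
        k = min(int((x - x_min) / width), n_bins - 1)
        sums[k] += y
        counts[k] += 1
    # Report each non-empty bin by its midpoint (an assumption) and its mean Y.
    return [
        (x_min + (k + 0.5) * width, sums[k] / counts[k])
        for k in range(n_bins)
        if counts[k] > 0
    ]


# One function observed on an uneven set of X values.
gridded = interpolate_to_grid([(0, 0.0), (2, 4.0), (4, 0.0)], [0, 1, 2, 3, 4])

# One function reduced to three bins.
curve = [(0, 1.0), (1, 3.0), (2, 5.0), (3, 7.0), (4, 9.0), (5, 11.0)]
binned = bin_average(curve, n_bins=3, x_min=0, x_max=5)
```

In practice each function (level of the ID variable) would be passed through the same routine with a shared grid or shared bin boundaries.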

Note: The Remove options exclude the specified observations from the analysis and modeling reports, but the observations remain unchanged in the data table.

Transform

A submenu of the following options to transform the data:

Center

Centers the output.

Standardize

Standardizes the output by centering and scaling the data to have mean 0 and variance 1.

Range 0 to 1

Scales the output to lie between 0 and 1.

Square Root

Transforms the data by computing the square root of the output. The output values must be nonnegative.

Square

Transforms the data by computing the square of the output.

Log

Transforms the data by computing the natural logarithm of the output.

Exp

Transforms the data by computing the exponential function of the output.

Negation

Transforms the data by negating the output.

Logit

Transforms the data by computing the logit function of the output. The output values must be between 0 and 1.
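As a rough illustration of two of the output transforms above, the following Python sketch shows a min-max rescaling for Range 0 to 1 and the usual logit, log(p / (1 - p)). The function names are hypothetical, and scaling by the overall minimum and maximum of the output is an assumption; the source does not state how the range is computed.

```python
import math


def range_0_to_1(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def logit(p):
    """Logit of p; defined only for values strictly between 0 and 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("logit requires a value strictly between 0 and 1")
    return math.log(p / (1.0 - p))


scaled = range_0_to_1([2.0, 4.0, 6.0])
centered_logit = logit(0.5)
```

Note that values of exactly 0 or 1, such as the endpoints produced by a 0-to-1 rescaling, have no finite logit.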

Log X

Transforms the data by computing the natural logarithm of the input.

Align

A submenu of the following options to align the input data:

Row Alignment

Replaces the input values with the row number.

Align Maximum

Aligns the functions using the observed maximum output value for each ID level. The input value associated with the observed maximum output value is set to zero for each ID level and the other input values are shifted up or down based on the difference between the observed maximum and zero.

Align Minimum

Aligns the functions using the observed minimum output value for each ID level. The input value associated with the observed minimum output value is set to zero for each ID level and the other input values are shifted up or down based on the difference between the observed minimum and zero.
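The shift described for Align Maximum and Align Minimum can be sketched as follows. This is a minimal illustration under stated assumptions: `align_extremum` is a hypothetical name, and when the extreme output value occurs more than once the first occurrence is used, which the source does not specify.

```python
def align_extremum(points, use_max=True):
    """Shift the inputs of one function so its extreme output sits at x = 0."""
    pick = max if use_max else min
    # Input value at which the observed maximum (or minimum) output occurs.
    x_ref, _ = pick(points, key=lambda p: p[1])
    return [(x - x_ref, y) for x, y in points]


curves = {
    "A": [(0, 1.0), (1, 4.0), (2, 2.0)],
    "B": [(0, 0.5), (1, 1.0), (2, 3.0), (3, 5.0)],
}
# Each ID level is shifted independently of the others.
aligned_max = {key: align_extremum(pts) for key, pts in curves.items()}
aligned_min = {key: align_extremum(pts, use_max=False) for key, pts in curves.items()}
```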

Align 0 to 1

Aligns the output functions such that the range of the input values is 0 to 1.

Tip: Align 0 to 1 is particularly useful when you fit a P-Spline model.

Align by Function

Aligns the output functions such that each function starts at the overall minimum of the input values and ends at the overall maximum of the input values.
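One way to picture Align by Function is a linear rescaling of each function's inputs onto the overall input range. The sketch below assumes that the rescaling is linear, which the source does not state, and `align_by_function` is a hypothetical name.

```python
def align_by_function(curves):
    """Map each function's input range onto [overall min, overall max]."""
    all_x = [x for pts in curves.values() for x, _ in pts]
    lo, hi = min(all_x), max(all_x)
    out = {}
    for key, pts in curves.items():
        xs = [x for x, _ in pts]
        f_lo, f_hi = min(xs), max(xs)
        # Stretch or compress this function's inputs to span the overall range.
        scale = (hi - lo) / (f_hi - f_lo)
        out[key] = [(lo + (x - f_lo) * scale, y) for x, y in pts]
    return out


curves = {
    "A": [(0, 1.0), (1, 2.0), (2, 1.5)],
    "B": [(1, 0.8), (3, 2.2), (5, 1.1)],
}
aligned = align_by_function(curves)
```

After the call, both functions start at 0 and end at 5, the overall minimum and maximum of the original inputs.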

Dynamic Time Warping

(Available only when there is more than one function.) Aligns the output functions using dynamic time warping (DTW). DTW is a function alignment technique that finds an optimal warping to align two or more functions. When you select the DTW option, a Select Reference Function window appears. Use this window to select the reference function, which is the function that the remaining functions are aligned to.

Once you select a reference function and click OK, a warping function plot is shown along with a list for the remaining query functions. On the warping function plot, the reference function is on the y-axis and the selected query function is on the x-axis. Deviations from the red diagonal line (y = x) indicate that the inputs of the query function have been warped for better alignment.
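To make the idea of a warping function concrete, here is a textbook dynamic-programming version of DTW for one reference function and one query function. It is a sketch only: the platform's algorithm, cost measure, and constraints are not described in the source, and the absolute-difference cost and the name `dtw` are assumptions.

```python
def dtw(reference, query):
    """Return (total cost, warping path) for aligning query to reference.

    The path is a list of (reference index, query index) pairs.
    """
    n, m = len(reference), len(query)
    inf = float("inf")
    # cost[i][j] = minimal cumulative cost of aligning reference[:i] with query[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(reference[i - 1] - query[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])

    # Backtrack from the end, always stepping to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps)
    path.reverse()
    return cost[n][m], path


reference = [0.0, 1.0, 2.0, 1.0, 0.0]
query = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0]  # same shape, delayed by one step
total_cost, path = dtw(reference, query)
```

Plotting the reference index against the query index for each pair in `path` gives a curve analogous to the warping function plot: pairs that leave the diagonal mark places where the query inputs were stretched or compressed.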

Spectral

A submenu of the following options that are useful for spectral data:

SNV

Applies the Standard Normal Variate method to the data. This method standardizes the output by centering and scaling each individual function (level of the ID variable) to have a mean of 0 and a standard deviation of 1.
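A minimal sketch of SNV in Python follows. The name `snv` is hypothetical, and the use of the sample standard deviation (divisor n - 1) is an assumption, since the source does not say which divisor is used.

```python
import statistics


def snv(values):
    """Center and scale one function to mean 0 and standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (assumed)
    return [(v - mean) / sd for v in values]


spectra = {
    "s1": [1.0, 2.0, 3.0],
    "s2": [10.0, 20.0, 60.0],
}
# Each function is standardized with its own mean and standard deviation.
corrected = {key: snv(values) for key, values in spectra.items()}
```

Because the statistics are computed per function, two spectra that differ only by an offset and a positive scale factor become identical after SNV.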

MSC

Applies the Multiplicative Scatter Correction to the data. A simple linear regression is fit for each individual function (level of the ID variable), where the response is the output values for the function and the regressor is the output values for the mean function. The original output values, $y_{it}$, are then replaced by new values, $y^{*}_{it}$, using the following equation:

$$ y^{*}_{it} = \frac{y_{it} - b_{0i}}{b_{1i}} $$

where $b_{0i}$ and $b_{1i}$ are the intercept and slope obtained from the simple linear regression for function $i$. For more information, see Geladi et al. (1985).
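The regression-and-correct procedure can be sketched in Python as follows. The sketch assumes that all functions are observed on a shared grid and uses the standard form of the correction, (y - intercept) / slope; the name `msc` and the dictionary layout are hypothetical.

```python
def msc(spectra):
    """Multiplicative scatter correction for functions on a shared grid.

    spectra maps an ID level to its list of output values.
    """
    n = len(spectra)
    length = len(next(iter(spectra.values())))
    # Mean function: pointwise average over all ID levels.
    mean_fn = [sum(s[t] for s in spectra.values()) / n for t in range(length)]
    m_bar = sum(mean_fn) / length
    sxx = sum((m - m_bar) ** 2 for m in mean_fn)

    corrected = {}
    for key, y in spectra.items():
        y_bar = sum(y) / length
        sxy = sum((m - m_bar) * (v - y_bar) for m, v in zip(mean_fn, y))
        slope = sxy / sxx                  # least squares slope for this function
        intercept = y_bar - slope * m_bar  # least squares intercept
        corrected[key] = [(v - intercept) / slope for v in y]
    return corrected


base = [1.0, 2.0, 4.0, 3.0]
spectra = {
    "s1": base,
    "s2": [2.0 * v + 1.0 for v in base],  # same shape with offset and scaling
}
corrected = msc(spectra)
```

In this constructed example the two functions differ only by an additive offset and a multiplicative factor, so both are mapped onto the mean function.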

Savitzky-Golay

Provides options to use the Savitzky-Golay method. See Savitzky and Golay (1964).

Note: All options involving the Savitzky-Golay method require that the input data be on an evenly spaced grid and that at least one function contains 7 or more data points. If the data is not on an evenly spaced grid, it is automatically placed on an evenly spaced grid when you select a Savitzky-Golay option.

Filter

Applies a Savitzky-Golay filter to the data. This method fits local polynomials to several collections of points across the domain. The polynomials are fit using least squares and the number of points in each fit is determined by the bandwidth. When you select this option, several fits are made for polynomials of order 0, 1, and 2 and bandwidths up to 10. The best fitting models for each function are selected based on the AIC. The order of the polynomial and the bandwidth can be different for each function.

First Derivative

Applies a Savitzky-Golay filter to the data using only polynomials of order 2 or 3 and then takes the first derivative. Since the filter fits polynomials, the derivatives are computed analytically.

Second Derivative

Applies a Savitzky-Golay filter to the data using only polynomials of order 3 and then takes the second derivative. Since the filter fits polynomials, the derivatives are computed analytically.
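The following Python sketch illustrates the Savitzky-Golay idea with one fixed setting: a 5-point window and a local quadratic fit, for which the least squares solution reduces to the well-known convolution weights shown. It does not reproduce the search over polynomial orders and bandwidths or the AIC-based selection described above, and the names are hypothetical.

```python
# Standard 5-point Savitzky-Golay weights for a local quadratic fit.
SMOOTH = [-3 / 35, 12 / 35, 17 / 35, 12 / 35, -3 / 35]
FIRST_DERIV = [-2 / 10, -1 / 10, 0.0, 1 / 10, 2 / 10]


def savgol_5pt(y, weights, spacing=1.0, deriv_order=0):
    """Apply 5-point weights to the interior points of an evenly spaced series.

    The first and last two points are dropped because a full window does not fit.
    """
    half = 2
    scale = spacing ** deriv_order  # derivatives must be divided by h**order
    out = []
    for i in range(half, len(y) - half):
        window = y[i - half : i + half + 1]
        out.append(sum(w * v for w, v in zip(weights, window)) / scale)
    return out


x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [v ** 2 for v in x]  # an exact quadratic, evenly spaced with h = 1
smoothed = savgol_5pt(y, SMOOTH)
slope = savgol_5pt(y, FIRST_DERIV, spacing=1.0, deriv_order=1)
```

Because the test series is itself a quadratic, the filter returns the original interior values and the derivative weights return 2x at the interior points, which is the analytic derivative of the fitted polynomial.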

Baseline Correction

Subtracts a baseline function from each individual function. A baseline correction is used when there is a known trend, or baseline, that you want to remove. For example, the trend could be an artifact of how the data are measured. Usually, the information is in the peaks of the data, so these regions are not included in the baseline model.

When you select this option, a baseline correction window is shown. This window contains a selection plot that displays the data and a set of options to specify the baseline model. The baseline correction window contains the following options:

Baseline Model

Specifies the type of model for the baseline function. You can specify a linear, quadratic, cubic, two parameter exponential, or three parameter exponential model.

Correction Region

Specifies the region that the baseline function is subtracted from. You can subtract the baseline from the entire function region or from only the regions that were used to construct the baseline model.

Baseline Regions

Adds or removes a pair of blue vertical lines to the selection plot. The lines are initially on top of one another. Move the lines to specify regions of the data that you do not want to include in the baseline model. The region of the data that falls in between a pair of blue lines is not included in the baseline model.

Anchor Points

Adds or removes a red vertical line to the selection plot. This line specifies data points that are forced into the baseline model.
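As an illustration of the options above, the sketch below fits a linear baseline model to one function while leaving out a peak region, then subtracts the baseline from the entire function. It is a simplified sketch: only the linear model is shown, anchor points are not handled, and the name `linear_baseline_correct` and the interval representation of excluded regions are hypothetical.

```python
def linear_baseline_correct(points, excluded):
    """Fit a line to points outside the excluded x-intervals and subtract it.

    excluded is a list of (low, high) intervals left out of the baseline fit.
    """
    def in_excluded(x):
        return any(low <= x <= high for low, high in excluded)

    fit = [(x, y) for x, y in points if not in_excluded(x)]
    n = len(fit)
    x_bar = sum(x for x, _ in fit) / n
    y_bar = sum(y for _, y in fit) / n
    sxx = sum((x - x_bar) ** 2 for x, _ in fit)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in fit)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Subtract the fitted baseline over the entire function region.
    return [(x, y - (intercept + slope * x)) for x, y in points]


# A drifting baseline 1 + 0.5x with a single peak of height 10 at x = 3.
raw = [(float(x), 1.0 + 0.5 * x + (10.0 if x == 3 else 0.0)) for x in range(7)]
corrected = linear_baseline_correct(raw, excluded=[(2.5, 3.5)])
```

Excluding the interval around the peak keeps the peak from pulling the fitted line upward, so the corrected function is flat at zero away from the peak and retains the full peak height.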

Target Functions

(Available only when there is more than one function.) Enables you to load a target function.

Load Targets

Shows a window that enables you to specify a target function. A target function is used for curve matching, where it is desirable for all of the functions to look like the target function, also known as a reference function or a golden curve.

If you specify a target function, the data from that function are not used in model fitting. When you specify a target function, additional options are added to the FPC Profiler. See “FPC Profiler”.

Note: A target function must be loaded before any other preprocessing steps are performed.

Dynamic Time Warping Options

Plot Warping Functions

Shows or hides the warping function plot. On by default.

Save Distance Matrix

Saves the distance matrix to a separate data table. The distance matrix can be useful for clustering the functions. The distance matrix data table contains a hierarchical clustering script.

Save Warping Functions

Saves the warping functions to a separate data table. Each row of the data table contains the DTW adjusted input variable, the original input variable, and the ID variable.
