Test Set Model Comparison

How do you determine whether your predictive model will generalize to future observations? One way to begin is simply to fit different models to data sets of different sizes, one at a time, and compare the results. However, very wide data sets are easy to overfit, so some type of validation is recommended. In addition, given the huge number of possible modeling variations, comparing more than a few models can quickly become overwhelming.

The Test Set Model Comparison process enables you to compare the relative abilities of different predictive models to make consistent, valid predictions. It does this by computing performance metrics on one or more test sets for each of the selected models and then displaying the results side by side in a pair of graphs.
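The idea of scoring several models on the same test sets can be sketched in a few lines of Python. This is only an illustration of the concept, not the SAS code that the process runs: the function names, model names, and labels are all hypothetical, and plain classification accuracy stands in for whatever performance metrics the process actually reports.

```python
# Hypothetical illustration: none of these names come from JMP Genomics.

def accuracy(actual, predicted):
    """Fraction of test-set observations that were predicted correctly."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("test set is empty")
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


def compare_models(test_sets, predictions):
    """Return {model: {test_set: accuracy}} for side-by-side comparison.

    test_sets:   {test_set_name: list of true labels}
    predictions: {model_name: {test_set_name: list of predicted labels}}
    """
    return {
        model: {
            name: accuracy(actual, per_set[name])
            for name, actual in test_sets.items()
        }
        for model, per_set in predictions.items()
    }


# Made-up labels, for illustration only.
test_sets = {"test_1": [1, 0, 1, 1, 0, 0, 1, 0]}
predictions = {
    "model_a": {"test_1": [1, 0, 1, 0, 0, 0, 1, 1]},
    "model_b": {"test_1": [1, 1, 1, 1, 0, 0, 1, 0]},
}

results = compare_models(test_sets, predictions)
for model, scores in sorted(results.items()):
    for name, score in sorted(scores.items()):
        print(f"{model:8s} {name:8s} accuracy={score:.3f}")
```

Because every model is scored on the same held-out observations, the resulting numbers can be compared directly, which is what the side-by-side graphs convey.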

What do I need?

To run the Test Set Model Comparison, your Input Data Set must be in the wide format. The appropriate data import engine as well as each of the predictive modeling processes to be used in the comparison must be configured and the settings saved in one settings folder. Finally, an output folder must be created, into which all of the resulting data sets, analyses, graphics, and other output are placed.

It is assumed that you are familiar with the Predictive Modeling processes, have settled upon one or more of them to compare, and have saved specific settings (see Saving and Loading Settings) for each of the models to be compared.

Important: The Mode parameter (found on the Analysis tab) for each process setting must be set to Automated to allow processing with SAS code rather than using the interactive JMP mode.

A saved setting can be edited either in the dialogs for that process or in the Test Set Model Comparison process itself. If you are not familiar with the individual processes that you want to use, consult the specific chapters for those processes for more information.

At least two SAS data sets are needed to run the Test Set Model Comparison. The first is the training data set. This is the primary data set that you are modeling, and it is specified as the Input Data Set for each of the models to be compared.

In addition to your primary data set, you must specify one or more test data sets. These are the data sets you are using to evaluate the effectiveness of each of the predictive models for making predictions on your data. Test data sets must be saved in one folder and are specified on the Test Sets tab of this process.
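Before you point the process at a test set folder, it can help to confirm what the folder contains. The following Python sketch lists the SAS data sets found directly inside a folder. It is an optional convenience check outside JMP Genomics, not part of the process; the function name is hypothetical, and the demonstration uses a temporary folder with empty placeholder files rather than real data.

```python
import tempfile
from pathlib import Path


def find_test_sets(folder):
    """Return the .sas7bdat files directly inside `folder`, sorted by name."""
    folder = Path(folder)
    if not folder.is_dir():
        raise NotADirectoryError(f"not a folder: {folder}")
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() == ".sas7bdat"
    )


# Demonstration with empty placeholder files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b_test.sas7bdat", "a_test.sas7bdat", "notes.txt"):
        (Path(tmp) / name).touch()
    found_names = [p.name for p in find_test_sets(tmp)]
    print(found_names)
```

Only files with the .sas7bdat extension are reported, so stray notes or output files in the same folder are ignored.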

Settings for running the Nicardipine data set described in Nicardipine through each of the predictive processes (Discriminant Analysis, Distance Scoring, General Linear Model Selection, K Nearest Neighbors, Logistic Regression, Partial Least Squares, Partition Trees, and Radial Basis Machine) are included with JMP Clinical. These settings are in the default Settings folder within the JMP Clinical directory (typically C:\Program Files\SASHome\JMPGenomics\15\Genomics\Settings). Each of these predictive models and its settings was described previously in this manual. The default settings for each predictive model were modified, as described below, for use in this example.

The samplegmdata_numgeno.sas7bdat data set described in Data Sets Used in JMP Genomics Processes was computer generated and consists of 611 rows of individuals with 60 columns corresponding to genotype data on these individuals. Marker data is presented as numeric variables in the one-column genotypic format.

• Generating the Training and Test data sets

To generate the training and test data sets used in this example, the samplegmdata_numgeno.sas7bdat data set was divided into two subsets. The first subset, which contained the records for individuals 1 through 400, was saved as the samplegmdata_numgeno_train.sas7bdat training data set. Data for individuals 401 through 611 were saved in a new samplegmdata_numgeno_test.sas7bdat test data set.
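The split described above amounts to partitioning the rows by individual number. The Python sketch below shows the logic on a stand-in table. It assumes one row per individual, numbered 1 through 611, and the column name `individual` is hypothetical; the actual data sets were created as SAS data sets, and this code does not read or write .sas7bdat files.

```python
def split_by_individual(rows, last_training_id=400):
    """Split rows into (train, test) on a hypothetical 'individual' column.

    Rows whose individual number is at most `last_training_id` go to the
    training subset; all remaining rows go to the test subset.
    """
    train = [r for r in rows if r["individual"] <= last_training_id]
    test = [r for r in rows if r["individual"] > last_training_id]
    return train, test


# Stand-in for the 611-row data set: one row per individual, no marker columns.
rows = [{"individual": i} for i in range(1, 612)]

train, test = split_by_individual(rows)
print(len(train), len(test))  # 400 211
```

Every row lands in exactly one subset, so no individual used for training is also used to evaluate the models.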

• Generating New Settings for each of the Predictive Models

Each of the settings used in this example was generated by modifying the default GeneticMarkerExample setting included with each predictive model so that it uses the samplegmdata_numgeno_train.sas7bdat data set as the input data set.

• The Test Set

The samplegmdata_numgeno_test.sas7bdat data set was specified as the test set.

Important: Both the model comparison and respective main method setting files for any sample settings that you run must be placed in your user WorkflowResults folder¹ before you run them. If you ever clear this folder, you should replenish it with the setting files from the Settings folder².
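If you prefer to script the replenishing step, the Python sketch below copies files from one folder into another without overwriting anything already present. It is a generic file-copy illustration, not a JMP Genomics utility: the function name and the file names in the demonstration are hypothetical, and because the extension of the setting files is not stated here, the sketch copies every file in the source folder. In practice you would pass your own Settings and WorkflowResults folder paths.

```python
import shutil
import tempfile
from pathlib import Path


def replenish_settings(settings_dir, workflow_dir):
    """Copy files missing from `workflow_dir` out of `settings_dir`.

    Files that already exist in `workflow_dir` are left untouched, so any
    edited settings are preserved. Returns the names of the copied files.
    """
    settings_dir = Path(settings_dir)
    workflow_dir = Path(workflow_dir)
    workflow_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(settings_dir.iterdir()):
        dest = workflow_dir / src.name
        if src.is_file() and not dest.exists():
            shutil.copy2(src, dest)
            copied.append(src.name)
    return copied


# Demonstration with placeholder files in temporary folders.
with tempfile.TemporaryDirectory() as tmp:
    settings = Path(tmp) / "Settings"
    workflow = Path(tmp) / "WorkflowResults"
    settings.mkdir()
    workflow.mkdir()
    (settings / "setting_a.txt").write_text("default a")
    (settings / "setting_b.txt").write_text("default b")
    (workflow / "setting_a.txt").write_text("edited a")

    copied_names = replenish_settings(settings, workflow)
    kept_text = (workflow / "setting_a.txt").read_text()
    print(copied_names, kept_text)
```

Skipping files that already exist is a deliberate choice in this sketch: it restores missing defaults while leaving settings you have already edited in place.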

For detailed information about the files and data sets used or created by JMP Life Sciences software, see Files and Data Sets.

Output/Results

The output generated by this process is summarized in a tabbed report. Refer to the Test Set Model Comparison output documentation for detailed descriptions and guides to interpreting your results.

¹ In Windows 10, this is typically C:\Users\username\AppData\Roaming\SAS\JMPGenomics\15\JMPG\WorkflowResults.

² In Windows 10, this is typically C:\Program Files\SASHome\JMPGenomics\15\Genomics\Settings.