Test Set Model Comparison

How do you determine whether your predictive model will generalize to future observations? One way to begin addressing this problem is simply to fit different models to data sets of different sizes, one at a time, and compare the results. However, very wide data sets are easy to overfit, so some type of validation is recommended. In addition, given the huge number of possible modeling variations, comparing more than a few models can quickly become overwhelming.

The Test Set Model Comparison process enables you to compare the relative abilities of different predictive models to make consistent, valid predictions. It does this by computing performance metrics on one or more test sets for each of the selected models and then displaying the results, side by side, in a pair of graphs.
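The idea of scoring several models on the same test sets can be sketched in a few lines of Python. This is an illustration only, not the code that the process runs: the function names (accuracy, compare_models), the toy models, and the choice of classification accuracy as the performance metric are all assumptions made for this example.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical types: a "model" is any function that maps a feature row to a label.
Row = Sequence[float]
Model = Callable[[Row], str]
TestSet = List[Tuple[Row, str]]  # (features, true label) pairs


def accuracy(model: Model, test_set: TestSet) -> float:
    """Fraction of test observations that the model labels correctly."""
    if not test_set:
        raise ValueError("test set is empty")
    correct = sum(1 for features, label in test_set if model(features) == label)
    return correct / len(test_set)


def compare_models(
    models: Dict[str, Model], test_sets: Dict[str, TestSet]
) -> Dict[str, Dict[str, float]]:
    """Return {model name: {test set name: accuracy}} for side-by-side comparison."""
    return {
        model_name: {ts_name: accuracy(model, ts) for ts_name, ts in test_sets.items()}
        for model_name, model in models.items()
    }


if __name__ == "__main__":
    # Toy models and data, invented purely for illustration.
    models = {
        "threshold_at_0.5": lambda row: "A" if row[0] > 0.5 else "B",
        "always_A": lambda row: "A",
    }
    test_sets = {
        "test_1": [((0.9,), "A"), ((0.1,), "B"), ((0.7,), "A"), ((0.2,), "B")],
    }
    for name, scores in compare_models(models, test_sets).items():
        print(name, scores)
```

Because every model is scored on exactly the same held-out observations, differences in the resulting table reflect the models rather than the data they happened to be evaluated on.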

What do I need?

To run the Test Set Model Comparison, your Input Data Set must be in the wide format. The appropriate data import engine as well as each of the predictive modeling processes to be used in the comparison must be configured and the settings saved in one settings folder. Finally, an output folder must be created, into which all of the resulting data sets, analyses, graphics, and other output are placed.

It is assumed that you are familiar with the Predictive Modeling processes, have settled upon one or more of them to compare, and have saved specific settings (see Saving and Loading Settings) for each of the models to be compared.

Important: The Mode parameter (found on the Analysis tab) for each process setting must be set to Automated to allow processing with SAS code rather than using the interactive JMP mode.

A saved setting can be edited either in the dialogs for that process or in the Test Set Model Comparison process itself. If you are not familiar with the individual processes that you want to use, consult the specific chapters for those processes for more information.

At least two SAS data sets are needed to run the Test Set Model Comparison. The first is the training data set. This is the primary data set that you are modeling, and it is specified as the Input Data Set for each of the models to be compared.

In addition to your primary data set, you must specify one or more test data sets. These are the data sets you are using to evaluate the effectiveness of each of the predictive models for making predictions on your data. Test data sets must be saved in one folder and are specified on the Test Sets tab of this process.
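Before running the process, it can help to confirm that the test sets folder actually contains the data sets you expect. The following is a minimal sketch, assuming the test sets are stored as .sas7bdat files directly in one folder; the function name find_test_sets is hypothetical and is not part of JMP Clinical.

```python
from pathlib import Path
from typing import List


def find_test_sets(folder: str) -> List[Path]:
    """Return the SAS data sets (*.sas7bdat) found directly in the folder, sorted by name."""
    directory = Path(folder)
    if not directory.is_dir():
        raise NotADirectoryError(f"{folder} is not a folder")
    test_sets = sorted(
        p for p in directory.iterdir()
        if p.is_file() and p.suffix.lower() == ".sas7bdat"
    )
    if not test_sets:
        raise FileNotFoundError(f"no .sas7bdat files found in {folder}")
    return test_sets
```

The check only lists files by extension; it does not open the data sets or verify that their columns match the training data set.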

Settings for running the Nicardipine data set described in Nicardipine through each of the predictive processes (Discriminant Analysis, Distance Scoring, General Linear Model Selection, K Nearest Neighbors, Logistic Regression, Partial Least Squares, Partition Trees, and Radial Basis Machine) are included with JMP Clinical. These settings are stored in the default Settings folder within the JMP Clinical directory (typically C:\Program Files\SASHome\JMPGenomics\13\Genomics\Settings). Each of these predictive models and its settings was described previously in this manual. The default settings for each predictive model were modified, as described below, for use in this example.

• Generating the Training and Test data sets

To generate the training and test data sets used in this example, the adsl_dii.sas7bdat data set, which contains observations on 906 patients and is included with JMP Clinical, was divided into two equal-sized subsets, each containing the data on 453 patients. The first subset, which contained the records for patients 1 through 453, was saved as the adsl_dii_training_set.sas7bdat data set. Data for patients 454 through 906 were saved in a new adsl_dii_test_set.sas7bdat data set. Both data sets were saved in a new TSMC folder placed in the Sample Data\Nicardipine folder.
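The split described above is a simple sequential one: the first 453 records go to the training set and the remaining 453 go to the test set. A minimal sketch of that logic follows. It works on an in-memory sequence rather than on the SAS data set itself, and the function name split_in_order and the stand-in patient numbers are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_in_order(records: Sequence[T], n_train: int) -> Tuple[List[T], List[T]]:
    """Put the first n_train records in the training set and the rest in the test set."""
    if not 0 < n_train < len(records):
        raise ValueError("n_train must leave at least one record in each subset")
    return list(records[:n_train]), list(records[n_train:])


if __name__ == "__main__":
    # Stand-in for the 906 patient records; real rows would come from the data set.
    patients = list(range(1, 907))
    training, test = split_in_order(patients, 453)
    print(len(training), training[0], training[-1])
    print(len(test), test[0], test[-1])
```

A sequential split like this is reasonable only when record order is unrelated to the outcome being modeled; a random split is a common alternative when that cannot be assumed.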

• Generating New Settings for each of the Predictive Models

Each setting used in this example was generated by modifying the default Nicardipine_ARM setting included with each predictive model so that it uses the adsl_dii_training_set.sas7bdat data set as the input data set. The modified settings were saved in a new folder.

• The Test Set

The adsl_dii_test_set.sas7bdat data set was specified as the test set.

Important: Both the model comparison and respective main method setting files for any sample settings that you run must be placed in your user WorkflowResults folder¹ before you run them. If you ever clear this folder, you should replenish it with the setting files from the Settings folder².

For detailed information about the files and data sets used or created by JMP Life Sciences software, see Files and Data Sets.

Output/Results

The output generated by this process is summarized in a tabbed report. Refer to the Test Set Model Comparison output documentation for detailed descriptions and guides to interpreting your results.

¹ In Windows 7, this is typically C:\Users\username\AppData\Roaming\SAS\JMPGenomics\13\JMPLS\WorkflowResults.

² In Windows 7, this is typically C:\Program Files\SASHome\JMPGenomics\13\Genomics\Settings.