Introduction to Predictive Modeling

The primary focus of JMP Genomics is scientific discovery and understanding through statistics and graphics. However, the software does offer some basic capabilities for creating predictive models. You can construct predictors of either continuous or categorical outcomes using data from genetic markers, microarrays, or proteomics as predictor variables. These processes, which include Discriminant Analysis, Distance Scoring, General Linear Model Selection, K Nearest Neighbors, Logistic Regression, Partial Least Squares, Partition Trees, Quantile Regression Selection, Radial Basis Machine, Ridge Regression, Life Regression, Proportional Hazards Regression, and Genomic BLUP, are grouped under the Predictive Modeling submenu. The models can be run one at a time or in different combinations (using the Predictive Modeling Review process). Additional processes (Cross Validation Model Comparison, Learning Curve Model Comparison, and Test Set Model Comparison) help you select the most appropriate model for your data.

Predictive modeling is also known as exploratory modeling or data mining. This volume discusses the JMP Genomics functions that target exploratory and basic data mining for genomics data. For advanced, enterprise-scale data mining, SAS Enterprise Miner software offers a full spectrum of methods and a convenient, workflow-style interface. After the data have been appropriately preprocessed and stored as a wide SAS data set, you can run one or more of the processes described in this chapter to perform exploratory data mining. The same data set can also be used with Enterprise Miner to obtain more rigorous results and scoring rules.

Data Sets

All of the processes described in this chapter require that data be in wide format, with individual samples as rows and experimental design variables (any combination of the following: phenotypes, genetic markers, transcripts, or peptides) as columns. Genetic marker data are likely already in this form, but any data in tall form must be converted to wide format. Use the Transpose Tall to Wide process to convert the tall data set and its accompanying Experimental Design Data Set (EDDS) to wide form.
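The tall-to-wide conversion can be pictured with a minimal Python sketch outside of JMP Genomics. It assumes a tall layout with one row per feature and one column per sample, plus a design table with one row per sample. The function name, the column names (feature, sample, phenotype), and the data values are hypothetical and serve only as an illustration; this is not the implementation of the Transpose Tall to Wide process.

```python
def tall_to_wide(tall, edds, feature_key="feature", sample_key="sample"):
    """Return one row per sample: design variables plus one column per feature."""
    wide = []
    for design_row in edds:
        sample = design_row[sample_key]
        row = dict(design_row)  # start from the experimental design variables
        for feature_row in tall:
            # Each tall row holds one feature; its value for this sample
            # becomes a column named after the feature.
            row[feature_row[feature_key]] = feature_row[sample]
        wide.append(row)
    return wide


# Hypothetical tall data: rows are transcripts, columns are samples.
tall = [
    {"feature": "gene_A", "S1": 7.2, "S2": 6.8},
    {"feature": "gene_B", "S1": 5.1, "S2": 4.9},
]
# Hypothetical design table: one row per sample.
edds = [
    {"sample": "S1", "phenotype": "case"},
    {"sample": "S2", "phenotype": "control"},
]

wide = tall_to_wide(tall, edds)
for row in wide:
    print(row)
```

In the result, each sample is a single row that carries both its phenotype and its measurements, which is the shape the predictive modeling processes expect.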

If you have multiple tables containing different forms of data on a set of samples (for example, both genetic marker and microarray data), merge them into a single wide data set using the Genomics > SAS Data Set Utilities > Tables > Merge process, as described in Merge. These data can then be used together to build jointly predictive models. We recommend that you preprocess and analyze the different data types separately and then combine them just prior to predictive modeling.
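As an illustration of what such a merge produces, the following Python sketch joins two wide tables on a shared sample identifier. The function name, the key column (sample), and the data values are hypothetical, and the inner-join behavior is an assumption made for the example; the options of the Merge process itself are described in its own chapter.

```python
def merge_wide(left, right, key="sample"):
    """Inner-join two wide tables (lists of dicts) on a shared sample identifier."""
    right_by_key = {row[key]: row for row in right}
    merged = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:
            # Columns with the same name in both tables take the value
            # from the right-hand table.
            merged.append({**row, **match})
    return merged


# Hypothetical genetic marker table (three samples).
markers = [
    {"sample": "S1", "snp_1": 0},
    {"sample": "S2", "snp_1": 2},
    {"sample": "S3", "snp_1": 1},
]
# Hypothetical expression table (two of the same samples).
expression = [
    {"sample": "S1", "gene_A": 7.2},
    {"sample": "S2", "gene_A": 6.8},
]

combined = merge_wide(markers, expression)
for row in combined:
    print(row)
```

Only samples present in both tables survive the join in this sketch, so it is worth checking the row count after merging to confirm that no samples were dropped unexpectedly.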

For large data sets with tens or hundreds of thousands of predictors, computing time for some of the JMP Clinical or JMP Genomics predictive modeling processes can become prohibitively long. In this situation, perform a preliminary reduction of the predictor set by using the Clinical > Pattern Discovery > K-Means Clustering or Genomics > Pattern Discovery > K-Means Clustering process to select a thousand or so representative predictors. (The data must be in tall form to execute this process. Use the Transpose Tall to Wide and Transpose Wide to Tall processes to go back and forth between tall and wide forms.)
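The idea of reducing a large predictor set to a smaller set of representatives can be sketched in plain Python. The example below runs a basic k-means on predictor profiles (one vector of values across samples per predictor) and keeps the predictor closest to each cluster center. The function names, the choice of the nearest-to-centroid predictor as the representative, and the data are assumptions made for illustration; the K-Means Clustering process may select representatives differently.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def representative_predictors(profiles, k, n_iter=20, seed=0):
    """Cluster predictor profiles with k-means; return one predictor name per cluster."""
    names = sorted(profiles)
    rng = random.Random(seed)
    # Initialize centroids from k distinct, randomly chosen predictors.
    centroids = [list(profiles[name]) for name in rng.sample(names, k)]
    clusters = []
    for _ in range(n_iter):
        # Assignment step: attach each predictor to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for name in names:
            nearest = min(range(k), key=lambda j: sq_dist(profiles[name], centroids[j]))
            clusters[nearest].append(name)
        # Update step: move each centroid to the mean of its members.
        for j, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                columns = zip(*(profiles[m] for m in members))
                centroids[j] = [sum(col) / len(members) for col in columns]
    # Keep the member closest to each centroid as that cluster's representative.
    representatives = []
    for j, members in enumerate(clusters):
        if members:
            closest = min(members, key=lambda m: sq_dist(profiles[m], centroids[j]))
            representatives.append(closest)
    return sorted(representatives)


# Hypothetical predictor profiles across four samples: two obvious groups.
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [1.1, 2.1, 3.1, 4.1],
    "g3": [0.9, 1.9, 2.9, 3.9],
    "g4": [10.0, 9.0, 8.0, 7.0],
    "g5": [10.1, 9.1, 8.1, 7.1],
    "g6": [9.9, 8.9, 7.9, 6.9],
}
selected = representative_predictors(profiles, k=2)
print(selected)
```

With real data, the profiles would usually be standardized first so that predictors measured on different scales contribute comparably to the distances.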

When performing variable selection (or reduction) with an entire data set, it is important to realize that an optimistic bias can be introduced in subsequent analyses. To compensate for this, hold out a fraction of the data from the beginning and use it for subsequent prediction. Many of the processes have built-in cross validation capabilities to help prevent selection bias. Alternatively, you can perform cross validation manually by creating one or more new columns that are copies of the variables being predicted and then setting subsets of them to missing values. Although the ultimate test of the generalizability of any predictive model is with new data from an independent laboratory, computer-based cross validation is invaluable in assessing the initial performance of the models.
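The manual approach can be sketched in Python as follows. The example adds k copies of an outcome column and sets a different subset (fold) of each copy to missing, so that a model fit to one copy never sees the outcomes of its held-out samples. The function name, the column naming scheme (outcome_cv1, outcome_cv2, and so on), and the data are hypothetical; this is an illustration of the idea, not the built-in cross validation used by the processes.

```python
import random


def add_cv_columns(rows, outcome, k=5, seed=0):
    """Add k copies of `outcome`, each with a different fold set to None (missing)."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    # Assign each row to exactly one fold after shuffling.
    fold_of = {idx: position % k for position, idx in enumerate(indices)}
    result = []
    for idx, row in enumerate(rows):
        new_row = dict(row)
        for fold in range(k):
            held_out = fold_of[idx] == fold
            # Held-out rows get a missing outcome in this copy only.
            new_row[f"{outcome}_cv{fold + 1}"] = None if held_out else row[outcome]
        result.append(new_row)
    return result


# Hypothetical wide data: ten samples with a continuous outcome.
rows = [{"sample": f"S{i}", "outcome": float(i)} for i in range(1, 11)]
with_folds = add_cv_columns(rows, "outcome", k=5)
for row in with_folds[:3]:
    print(row)
```

Each copy is then used as the response in a separate run, and predictions for the rows with missing values are compared with the original outcome column. To avoid the optimistic bias described above, any variable selection should be repeated within each run using only the rows whose outcome is not missing.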

Predictive Modeling Processes

Please refer to the following chapters for specific descriptions of each of the predictive modeling processes.