Predictive modeling is also known as exploratory modeling or data mining. This volume discusses the JMP Genomics functions that target exploratory and basic data mining for genomics data. For advanced, enterprise-scale data mining, SAS Enterprise Miner software offers a full spectrum of methods and a convenient, workflow-style interface. After the data have been appropriately preprocessed and stored as a wide SAS data set, you can run one or more of the processes described in this chapter to perform exploratory data mining. The same data set can also be used with Enterprise Miner to obtain more rigorous results and scoring rules.
All of the processes described in this chapter require that data be in wide format, with individual samples as rows and experimental design variables (any combination of the following: phenotypes, genetic markers, transcripts, or peptides) as columns. Genetic marker data are likely already in this form, but any data in tall form must be converted to wide format. Use the Transpose Tall to Wide command to convert the tall data set and its accompanying Experimental Design Data Set (EDDS) to wide form.
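The reshaping that a tall-to-wide conversion performs can be sketched outside JMP Genomics in a few lines of Python. This is only an illustration, not the Transpose Tall to Wide process itself. It assumes that the tall table holds one row per feature and one column per sample, and that the design table holds one row per sample. The function name and the "feature" and "sample" keys are hypothetical.

```python
def transpose_tall_to_wide(tall_rows, design_rows,
                           feature_key="feature", sample_key="sample"):
    """Build one wide row per sample from a tall table and a design table.

    tall_rows:   list of dicts, one per feature, with one entry per sample
                 (assumed layout, for illustration only).
    design_rows: list of dicts, one per sample, holding design variables.
    """
    wide_rows = []
    for design in design_rows:
        sample = design[sample_key]
        row = dict(design)  # start from the sample's design variables
        for tall in tall_rows:
            # Each feature becomes a column; None marks a missing value.
            row[tall[feature_key]] = tall.get(sample)
        wide_rows.append(row)
    return wide_rows


tall = [
    {"feature": "gene_a", "S1": 2.1, "S2": 1.7},
    {"feature": "gene_b", "S1": 0.4, "S2": 0.9},
]
design = [
    {"sample": "S1", "phenotype": "case"},
    {"sample": "S2", "phenotype": "control"},
]
wide = transpose_tall_to_wide(tall, design)
```

In the result, each sample is a row, and its design variables sit alongside its measured features as columns.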
If you have multiple tables containing different forms of data on the same set of samples (for example, both genetic marker and microarray data), merge them into a single wide data set using the Genomics > SAS Data Set Utilities > Tables > Merge process, as described in Merge. These data can then be used together to build jointly predictive models. We recommend that you preprocess and analyze the different data types separately and then combine them just prior to predictive modeling.
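As a rough illustration of what combining two wide tables involves, the following Python sketch joins them on a shared sample identifier. It is not the Merge process. The function name, the "sample" key, and the choice to keep only samples present in both tables are assumptions made for this example.

```python
def merge_wide_tables(left_rows, right_rows, key="sample"):
    """Join two wide tables (lists of dicts) on a shared sample identifier.

    Only samples found in both tables are kept (an assumption for this
    sketch; a real merge may offer other options).
    """
    right_by_key = {row[key]: row for row in right_rows}
    merged = []
    for left in left_rows:
        right = right_by_key.get(left[key])
        if right is None:
            continue  # sample is absent from the second table
        combined = dict(left)
        for name, value in right.items():
            if name == key:
                continue
            if name in combined:
                raise ValueError(f"column {name!r} appears in both tables")
            combined[name] = value
        merged.append(combined)
    return merged


markers = [
    {"sample": "S1", "snp_1": 0},
    {"sample": "S2", "snp_1": 2},
    {"sample": "S3", "snp_1": 1},
]
expression = [
    {"sample": "S1", "gene_a": 2.1},
    {"sample": "S2", "gene_a": 1.7},
]
merged = merge_wide_tables(markers, expression)
```

Sample S3 is dropped here because it has no expression data. Raising an error on a repeated column name guards against silently overwriting one data type with another.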
When you perform variable selection (or reduction) on an entire data set, be aware that an optimistic bias can be introduced into subsequent analyses, because the same samples that guide the selection are later used to assess the model. To compensate for this, hold out a fraction of the data from the beginning and use it only for subsequent prediction. Many of the processes have built-in cross-validation capabilities to help prevent selection bias. Alternatively, you can perform cross-validation manually by creating one or more new columns that are copies of the variables being predicted and then setting subsets of their values to missing. Although the ultimate test of the generalizability of any predictive model is new data from an independent laboratory, computer-based cross-validation is invaluable in assessing the initial performance of the models.
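The manual approach described above can be sketched as follows. This Python example copies the predicted variable into one new column per fold and sets a different random subset of each copy to missing. It illustrates the idea only, not a JMP Genomics feature. The function name, the column naming scheme, and the use of None for a missing value are assumptions.

```python
import random


def add_holdout_columns(rows, response, n_folds=5, seed=0):
    """Copy the response into n_folds new columns, each with one fold masked.

    Every sample is held out (set to None) in exactly one of the new
    columns, so a model fit on column k can be checked against the
    original response for the samples masked in column k.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    order = list(range(len(rows)))
    rng.shuffle(order)

    out = [dict(row) for row in rows]  # leave the input rows untouched
    for position, index in enumerate(order):
        held_out_fold = position % n_folds
        for fold in range(n_folds):
            column = f"{response}_fold{fold + 1}"
            if fold == held_out_fold:
                out[index][column] = None  # missing: excluded from fitting
            else:
                out[index][column] = rows[index][response]
    return out


rows = [{"sample": f"S{i}", "phenotype": i % 2} for i in range(10)]
masked = add_holdout_columns(rows, "phenotype", n_folds=5)
```

For the comparison to be free of selection bias, any variable selection would need to be repeated within each fold, using only the samples whose response is not missing in that fold's column.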