Predictive modeling is also known as exploratory modeling or data mining. These documentation pages discuss the JMP Genomics functions that target exploratory and basic data mining for genomics data. For advanced, enterprise-scale data mining, SAS Enterprise Miner software offers a full spectrum of methods and a convenient, workflow-style interface. After the genomics data have been appropriately preprocessed and stored as a wide SAS data set, one or more of the processes can be run to perform exploratory data mining. The same data set can also be used with Enterprise Miner to obtain more rigorous results and scoring rules.
All of the processes described in these pages require that the data be in wide format, with individual samples as rows and experimental design variables (any combination of phenotypes, genetic markers, transcripts, or peptides) as columns. Genetic marker data are likely already in this form, but any data that are in tall form must be converted to the wide format. Use the Transpose Tall to Wide command to convert the tall data set and its accompanying Experimental Design Data Set (EDDS) to wide form.
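The reshaping step can be illustrated outside of JMP Genomics with a minimal Python sketch. It is not the Transpose Tall to Wide implementation. It assumes that a tall data set holds one row per feature and one column per sample (the transpose of the wide layout described above), and that the EDDS can be represented as one record of design variables per sample. The function name, the `sample` key, and the example values are all hypothetical.

```python
def transpose_tall_to_wide(tall, design):
    """Build one wide row per sample.

    tall:   dict mapping feature name -> {sample ID -> measured value}
            (assumed tall layout: features as rows, samples as columns).
    design: dict mapping sample ID -> {design variable -> value}
            (hypothetical stand-in for an EDDS).
    """
    wide = []
    for sample in sorted(design):
        row = {"sample": sample}
        row.update(design[sample])  # carry the design variables along
        for feature in sorted(tall):
            # None marks a value that was not measured for this sample
            row[feature] = tall[feature].get(sample)
        wide.append(row)
    return wide


# Illustrative values only
tall = {
    "geneA": {"S1": 5.1, "S2": 4.8},
    "geneB": {"S1": 7.3},  # S2 has no measurement for geneB
}
design = {"S1": {"treatment": "control"}, "S2": {"treatment": "drug"}}
wide = transpose_tall_to_wide(tall, design)
```

Each element of `wide` is then a single sample with its design variables and measurements side by side, which is the shape the predictive modeling processes expect.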
When multiple tables contain different forms of data on the same set of samples (for example, both genetic marker and microarray data), merge them into a single wide data set using the Genomics > SAS Data Set Utilities > Tables > Merge process, as described in Merge. These data can then be used together to build jointly predictive models. We recommend that you preprocess and analyze the different data types separately and then combine them just prior to predictive modeling.
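Conceptually, combining two wide tables amounts to joining them on a shared sample identifier. The following Python sketch shows that idea only. It does not reproduce the Merge process or its options. The `sample` key, the choice to keep only samples present in both tables, and the example values are assumptions made for illustration.

```python
def merge_wide(left, right, key="sample"):
    """Join two wide tables (lists of dicts) on a shared sample ID.

    Only samples present in both tables are kept (an inner join),
    which is an assumption of this sketch.
    """
    right_by_key = {row[key]: row for row in right}
    merged = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is None:
            continue  # sample missing from the second table
        combined = dict(row)
        combined.update({k: v for k, v in match.items() if k != key})
        merged.append(combined)
    return merged


# Illustrative values only: marker genotypes and expression measurements
markers = [
    {"sample": "S1", "snp1": 0},
    {"sample": "S2", "snp1": 2},
    {"sample": "S3", "snp1": 1},
]
expression = [
    {"sample": "S1", "geneA": 5.1},
    {"sample": "S2", "geneA": 4.8},
]
merged = merge_wide(markers, expression)
```

Here `S3` is dropped because it has no expression record. Whether unmatched samples should be dropped or kept with missing values depends on the analysis.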
When performing variable selection (or reduction) with an entire data set, it is important to realize that an optimistic bias can be introduced in subsequent analyses, because the same samples are used both to choose the variables and to evaluate the resulting model. To compensate for this, hold out a fraction of the data from the beginning and use it only for subsequent prediction. Many of the processes have built-in cross validation capabilities to help prevent selection bias. Alternatively, cross validation can be done manually by creating one or more new columns that are copies of the variables being predicted and then setting subsets of them to missing values. Although the ultimate test of the generalizability of any predictive model is with new data from an independent laboratory, computer-based cross validation is invaluable in assessing the initial performance of the models.
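The manual approach described above, copying the predicted variable and blanking out subsets of it, can be sketched in Python as follows. This is an illustration under stated assumptions, not a JMP Genomics feature: the function name, the `_cv` column suffix, the use of `None` for missing values, and the random assignment of samples to folds are all hypothetical choices.

```python
import random


def add_holdout_columns(rows, target, k=5, seed=0):
    """Add k copies of the target column, each with one fold set to missing.

    rows:   list of dicts, one per sample (wide format).
    target: name of the variable being predicted.
    Returns new rows; the input rows are not modified.
    """
    rng = random.Random(seed)  # fixed seed so the folds are reproducible
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    folds = [set(indices[i::k]) for i in range(k)]  # disjoint, near-equal folds

    out = [dict(row) for row in rows]
    for fold_number, held_out in enumerate(folds, start=1):
        column = f"{target}_cv{fold_number}"
        for i, row in enumerate(out):
            # None stands in for a missing value in the held-out samples
            row[column] = None if i in held_out else row[target]
    return out


# Illustrative values only
rows = [{"sample": f"S{i}", "response": float(i)} for i in range(10)]
masked = add_holdout_columns(rows, "response", k=5, seed=0)
```

A model fitted with `response_cv1` as the target sees only the samples where that column is present, and its predictions for the blanked samples can then be compared with the original `response` values. For the comparison to be free of selection bias, any variable selection must also be repeated using only the non-missing samples of each copy.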
See the JMP Genomics Starter main page for other process categories.