Predictive Modeling
The primary focus of JMP Genomics is scientific discovery and understanding through statistics and graphics. However, the software does offer some basic capabilities for creating predictive models.
Subcategory | Contains processes for... |
| Constructing predictors of either continuous or categorical outcomes using data from genetic markers, microarrays, or proteomics as predictor variables |
| Selecting the most appropriate model for your data |
| Manipulating data for use in specific predictive modeling processes |
Predictive modeling is also known as exploratory modeling or data mining. These documentation pages discuss the JMP Genomics functions that target exploratory and basic data mining for genomics data. For advanced, enterprise-scale data mining, SAS Enterprise Miner software offers a full spectrum of methods and a convenient, workflow-style interface. After the genomics data have been appropriately preprocessed and stored as a wide SAS data set, you can run one or more of these processes to perform exploratory data mining. The same data set can also be used with Enterprise Miner to obtain more rigorous results and scoring rules.
Data Sets
All of the processes described in these pages require that data be in wide format, with individual samples as rows and variables (any combination of the following: phenotypes, genetic markers, transcripts, or peptides) as columns. Genetic marker data are likely already in this form, but any data that are in tall form must be converted to the wide format. Use the Transpose Tall to Wide command to convert the tall data set and its accompanying Experimental Design Data Set (EDDS) to wide form.
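The tall-to-wide conversion can be illustrated outside of JMP Genomics with a minimal Python sketch. It assumes a tall layout with one row per feature and one column per sample, plus a design table keyed by sample. The function name `tall_to_wide` and the example identifiers are hypothetical. This is not the Transpose Tall to Wide process itself.

```python
def tall_to_wide(tall, design):
    """Pivot a tall table into one row per sample.

    tall:   dict mapping feature_id -> {sample_id: value}
            (assumed tall layout: features as rows, samples as columns)
    design: dict mapping sample_id -> dict of design variables
            (a stand-in for the EDDS; hypothetical structure)
    """
    wide = []
    for sample_id in sorted(design):
        row = {"sample": sample_id}
        row.update(design[sample_id])
        for feature_id in sorted(tall):
            # A feature not measured for this sample stays missing (None).
            row[feature_id] = tall[feature_id].get(sample_id)
        wide.append(row)
    return wide


tall = {
    "gene_a": {"s1": 7.1, "s2": 6.4},
    "gene_b": {"s1": 3.3, "s2": 3.9},
}
design = {"s1": {"phenotype": "case"}, "s2": {"phenotype": "control"}}
wide = tall_to_wide(tall, design)
```

Each element of `wide` is one sample, with its design variables followed by one column per feature. This is the layout that the predictive modeling processes expect.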
If you have multiple tables containing different forms of data on the same set of samples (for example, both genetic marker and microarray data), merge them into a single wide data set using the Genomics > SAS Data Set Utilities > Tables > Merge process, as described in Merge. These data can then be used together to build jointly predictive models. We recommend that you preprocess and analyze the different data types separately and then combine them just prior to predictive modeling.
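As an illustration of what combining two wide tables involves, the following Python sketch joins them on a shared sample identifier. The function `merge_wide`, the key name `sample`, and the choice to keep only samples present in both tables are assumptions made for this example. They do not describe the options or behavior of the Merge process.

```python
def merge_wide(left, right, key="sample"):
    """Join two wide tables (lists of dicts) on a shared sample key.

    Assumption for this sketch: only samples present in both tables are kept.
    """
    right_by_key = {row[key]: row for row in right}
    merged = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is None:
            continue  # sample missing from the second table
        combined = dict(row)
        for name, value in match.items():
            if name == key:
                continue
            if name in combined:
                raise ValueError(f"column {name!r} appears in both tables")
            combined[name] = value
        merged.append(combined)
    return merged


markers = [
    {"sample": "s1", "snp_1": 0},
    {"sample": "s2", "snp_1": 2},
    {"sample": "s3", "snp_1": 1},
]
expression = [
    {"sample": "s2", "gene_a": 6.4},
    {"sample": "s1", "gene_a": 7.1},
]
combined = merge_wide(markers, expression)
```

Raising an error on duplicate column names is a deliberate choice in this sketch: silently overwriting a predictor from one data type with a same-named column from another would be hard to detect later.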
For large data sets with tens or hundreds of thousands of predictors, computing time for some of the JMP Genomics predictive modeling processes can become prohibitively long. In this situation, perform a preliminary reduction of the predictor set by using the Genomics > Pattern Discovery > K-Means Clustering process to select a thousand or so representative predictors. (The data must be in tall form to execute this process. Use the Transpose Wide to Tall and Transpose Tall to Wide processes to go back and forth between tall and wide forms.)
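The idea of reducing a large predictor set to representatives can be sketched in plain Python. The example clusters predictor profiles with a basic k-means loop and keeps, for each cluster, the predictor closest to the cluster centroid. The function `representative_predictors`, the nearest-to-centroid rule, and the toy data are assumptions for illustration. They do not describe how the K-Means Clustering process is implemented or how it selects representatives.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def representative_predictors(profiles, k, n_iter=20, seed=0):
    """Pick up to k representative predictors with a basic k-means loop.

    profiles: dict mapping predictor name -> list of values across samples
              (one row per predictor, that is, tall form).
    """
    names = sorted(profiles)
    rng = random.Random(seed)
    centroids = [list(profiles[n]) for n in rng.sample(names, k)]
    clusters = [[] for _ in range(k)]
    for _ in range(n_iter):
        # Assignment step: attach each predictor to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for name in names:
            nearest = min(
                range(k), key=lambda c: sq_dist(profiles[name], centroids[c])
            )
            clusters[nearest].append(name)
        # Update step: move each centroid to the mean of its members.
        for c, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [
                    sum(profiles[m][d] for m in members) / len(members)
                    for d in range(len(centroids[c]))
                ]
    # Keep the member closest to each centroid as the representative.
    reps = [
        min(members, key=lambda m: sq_dist(profiles[m], centroids[c]))
        for c, members in enumerate(clusters)
        if members
    ]
    return sorted(reps)


profiles = {
    "g1": [1.0, 2.0, 3.0], "g2": [1.1, 2.1, 3.1], "g3": [0.9, 1.9, 2.9],
    "g4": [9.0, 1.0, 5.0], "g5": [9.2, 1.2, 5.2], "g6": [8.8, 0.8, 4.8],
}
reps = representative_predictors(profiles, k=2)
```

With real data, `k` would be on the order of a thousand, and the retained predictor names would be used to subset the wide data set before modeling.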
When performing variable selection (or reduction) with an entire data set, it is important to realize that an optimistic bias can be introduced in subsequent analyses. To compensate for this, hold out a fraction of the data from the beginning and use it for subsequent prediction. Many of the processes have built-in cross-validation capabilities to help prevent selection bias. Alternatively, cross-validation can be done manually by creating one or more new columns that are copies of the variables being predicted and then setting subsets of them to missing values. Although the ultimate test of generalizability of any predictive model is with new data from an independent laboratory, computer-based cross-validation is invaluable in assessing initial performance of the models.
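The manual approach described above (copying the response and setting subsets of each copy to missing) can be sketched in Python as follows. The function `add_holdout_columns`, the `_cv1`, `_cv2`, ... column naming, and the random fold assignment are hypothetical choices for this example, not a JMP Genomics convention.

```python
import random


def add_holdout_columns(rows, response, n_folds=5, seed=0):
    """Add one masked copy of the response column per fold.

    In copy number f, the samples assigned to fold f are set to missing
    (None), so a model fit to that copy never sees their outcomes.
    """
    order = list(range(len(rows)))
    random.Random(seed).shuffle(order)
    # Deal the shuffled samples into folds of (nearly) equal size.
    fold_of = {idx: pos % n_folds for pos, idx in enumerate(order)}
    out = []
    for idx, row in enumerate(rows):
        new_row = dict(row)
        for fold in range(n_folds):
            held_out = fold_of[idx] == fold
            new_row[f"{response}_cv{fold + 1}"] = (
                None if held_out else row[response]
            )
        out.append(new_row)
    return out


samples = [{"sample": f"s{i}", "outcome": float(i)} for i in range(1, 11)]
masked = add_holdout_columns(samples, "outcome", n_folds=5)
```

Fitting a model to `outcome_cv1` and then comparing its predictions with the original `outcome` for the rows that are missing in `outcome_cv1` gives an assessment on data the model did not use. To avoid the selection bias noted above, any predictor reduction should also be carried out using only the rows that are not held out.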
Please consult the subcategory documentation pages, as well as the documentation on individual processes, for additional information.
See the JMP Genomics Starter main page for other process categories.