Data Sets Used in JMP Genomics Processes

JMP Genomics processes use several types of SAS data sets. This section reviews the characteristics of the more common types.

Input Data Set

The input data set contains the primary data used in the analysis. Input data sets can be very large, frequently consisting of thousands, or even millions, of columns, rows, or both. An input data set might be in either the tall or the wide form, depending on the process to be run.

Experimental Design Data Set (EDDS)

An EDDS is a SAS data set that provides information about the columns of a tall data set. It contains relevant experimental variables, such as treatment conditions and covariates, along with a variable named ColumnName. Entries in the ColumnName column must exactly match the column names in the input tall data set. An EDDS must satisfy certain constraints for the processes to run successfully.
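The ColumnName requirement can be illustrated with a small check. The following is a minimal sketch in Python, not SAS or JMP Genomics code: the function name check_edds_column_names, the sample column names, and the Treatment variable are all hypothetical, and the case-sensitive string comparison is only this sketch's reading of "exactly match".

```python
def check_edds_column_names(tall_columns, edds_rows):
    """Return the EDDS ColumnName entries that match no column of the tall data set."""
    tall = set(tall_columns)
    listed = [row["ColumnName"] for row in edds_rows]
    # Assumption: "exactly match" is treated as exact, case-sensitive string equality.
    return [name for name in listed if name not in tall]


# Hypothetical tall data set columns: one identifier column plus three samples.
tall_columns = ["ProbeID", "Array1", "Array2", "Array3"]

# Hypothetical EDDS rows: one row per sample column of the tall data set.
edds_rows = [
    {"ColumnName": "Array1", "Treatment": "Control"},
    {"ColumnName": "Array2", "Treatment": "Drug"},
    {"ColumnName": "array3", "Treatment": "Drug"},  # differs from "Array3" in case
]

unmatched = check_edds_column_names(tall_columns, edds_rows)
print(unmatched)
```

In this example the third entry is reported as unmatched, because "array3" is not identical to the column name "Array3".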

An EDDS is frequently constructed using information from a corresponding Experimental Design File (EDF). An EDDS contains many of the same columns as an EDF, but, unlike the EDF, must be saved as a SAS data set (.sas7bdat). Many of the input engines that generate a tall data set from raw data files also automatically generate the needed EDDS.

An EDDS is required by most processes using a tall input data set. Processes using wide SAS data sets (most of the Genetics processes, for example) do not require an EDDS.

Annotation Data Set

In addition to an experimental design data set, many JMP Genomics processes also optionally accept an annotation data set. This is a SAS data set containing arbitrary biological or chemical properties corresponding to the molecular entities in the experiment.

Annotation data sets can correspond to either tall or wide data sets. For tall data sets, the annotation data set must share one or more merge key variables with the tall data set so that the two data sets can be joined at run time. For wide data sets, processes usually rely instead on an assumed ordering of the variables.
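The idea of joining a tall data set to its annotation through a shared merge key can be sketched as follows. This is an illustration in Python under stated assumptions, not the join that JMP Genomics performs: the function merge_annotation, the key name ProbeID, and all values are hypothetical, and the choice to keep unannotated rows (a left join) is an assumption of the sketch.

```python
def merge_annotation(tall_rows, annotation_rows, key):
    """Attach annotation columns to each tall row that shares the merge key value."""
    # Index the annotation rows by their merge key for fast lookup.
    lookup = {row[key]: row for row in annotation_rows}
    merged = []
    for row in tall_rows:
        # Assumption: rows without a matching annotation are kept unchanged.
        extra = lookup.get(row[key], {})
        # On a name clash, the value from the tall data set wins.
        merged.append({**extra, **row})
    return merged


# Hypothetical tall data set: one row per molecular entity, one column per sample.
tall_rows = [
    {"ProbeID": "P1", "Array1": 7.2, "Array2": 6.9},
    {"ProbeID": "P2", "Array1": 5.1, "Array2": 5.4},
]

# Hypothetical annotation data set sharing the merge key ProbeID.
annotation_rows = [
    {"ProbeID": "P1", "GeneSymbol": "GeneA", "Chromosome": "1"},
]

merged = merge_annotation(tall_rows, annotation_rows, key="ProbeID")
for row in merged:
    print(row)
```

Here the row for P1 gains the GeneSymbol and Chromosome columns, while the row for P2, which has no annotation entry, is passed through without them.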

Annotation data sets are typically created by opening an appropriate text or Excel table in JMP, removing any undesired columns, and then saving it as a SAS data set (with extension .sas7bdat) using the JMP File > Save As... command.

Other Types of Data Sets

In addition to the common data sets listed above, specific processes might require supplementary SAS data sets. Examples include Coordinate data sets, which list the x- and y-coordinates of spots on microarrays, and Haplotype Frequency data sets, which are used in haplotype analysis. These supplementary data sets are used only in specific cases, are frequently optional, and are described, as needed, in the chapters detailing the processes that use them.

Tall and Wide Data Sets

Most of the processes in JMP Genomics assume that the input SAS data set has a particular data structure. JMP Genomics distinguishes between tall and wide SAS data sets. A tall SAS data set has samples as columns and molecular entities (for example, markers, genes, clones, proteins, or metabolites) as rows, whereas a wide SAS data set is the transpose of a tall data set, having samples as rows and molecular entities as columns.

When specifying the input SAS data set for a process, it is important to know the required form. Most of the Genetics processes require a wide structure, whereas most of the other processes use a tall structure. The Transpose Tall to Wide and Transpose Wide to Tall processes under the Genomics > SAS Data Set Utilities > Tables menu enable you to transform your SAS data sets between tall and wide forms.
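The relationship between the two forms can be shown on a tiny example. The sketch below is plain Python meant only to illustrate the data structures; it is not how the Transpose Tall to Wide process is implemented. The function transpose_tall_to_wide, the column names Gene and Sample, and the values are hypothetical.

```python
def transpose_tall_to_wide(tall_rows, id_column):
    """Turn rows of molecular entities (tall) into rows of samples (wide)."""
    # Every column other than the identifier column is treated as a sample.
    sample_names = [name for name in tall_rows[0] if name != id_column]
    wide_rows = []
    for sample in sample_names:
        wide = {"Sample": sample}
        # Each molecular entity becomes a column in the wide form.
        for row in tall_rows:
            wide[row[id_column]] = row[sample]
        wide_rows.append(wide)
    return wide_rows


# Hypothetical tall data set: genes as rows, samples S1 and S2 as columns.
tall_rows = [
    {"Gene": "G1", "S1": 1.0, "S2": 2.0},
    {"Gene": "G2", "S1": 3.0, "S2": 4.0},
]

wide_rows = transpose_tall_to_wide(tall_rows, id_column="Gene")
for row in wide_rows:
    print(row)
```

The result has one row per sample (S1 and S2) and one column per gene (G1 and G2), which is the wide form of the same data.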