Sample Case Studies
The data sets described below are included with the software and let you work through most of the analytical processes. In addition to the data sets, each case study includes experimental design files and any other files that are needed. These case studies are referred to throughout this manual.
Drosophila Aging Experimental Data
This data set represents a small subset of the Drosophila aging experiment data from Jin, Riley et al. (2001). The experiment consisted of 24 two-color cDNA microarrays, 6 for each experimental combination of 2 lines (Oregon and Samarkand), 2 sexes (Female and Male), and 2 ages (1 week and 6 weeks). The Cy3 and Cy5 dyes were flipped for 2 of the 6 replicates for each genotype and sex combination. The design is a split-plot, with Age and Dye as subplot factors, and Line and Sex as whole-plot factors. A total of 4256 clones were spotted on the arrays, but this example uses a subset containing 100 randomly selected genes from the original data set.
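As a rough consistency check on the design described above, the following Python sketch counts arrays and samples. It assumes, because Age is a subplot factor, that each two-color array carries one sample of each age for a single line and sex combination. That reading, the choice of which replicates are dye-flipped, and all variable names are assumptions made for illustration only.

```python
from itertools import product

lines = ["Oregon", "Samarkand"]
sexes = ["Female", "Male"]
ages = ["1 week", "6 weeks"]
replicates = 6    # arrays per line-by-sex combination
dye_flipped = 2   # replicates per combination with Cy3 and Cy5 swapped

# Assumption: one array per (line, sex, replicate); each array holds both ages.
# Marking the first two replicates as flipped is arbitrary.
arrays = [
    {"line": line, "sex": sex, "replicate": rep, "dye_flip": rep <= dye_flipped}
    for line, sex in product(lines, sexes)
    for rep in range(1, replicates + 1)
]

n_arrays = len(arrays)
n_samples = n_arrays * len(ages)  # two channels per array
samples_per_combination = n_samples // (len(lines) * len(sexes) * len(ages))
n_flipped = sum(a["dye_flip"] for a in arrays)

print(n_arrays, n_samples, samples_per_combination, n_flipped)
```

Under these assumptions, the counts agree with the 24 arrays and 6 replicates per line, sex, and age combination given in the description.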
Affymetrix Latin Square Data
The spike-in data set used in this example was originally generated by Affymetrix Corporation to develop and validate their U95A GeneChip and Microarray Suite (MAS) 5.0 algorithm over a range of known concentrations (Affymetrix, 2001). The experiment consists of 59 arrays. There are 14 experimental groups, designated with the letters a, b, c, d, e, f, g, h, i, j, k, l, m, and q. (Group m and group q each have 4 within-chip replicates. The group m replicates were originally designated n, o, and p, and the group q replicates were originally designated r, s, and t. The extra letters are not needed because they are replicates of m and q, respectively.)
Each experiment was repeated in triplicate using Affymetrix chips cut from different wafers. The last four digits of the wafer numbers are 1521, 1532, and 2353. Chip c from wafer 2353 was defective and is not included in the data set. As a result, 20 .cel files were generated for each of wafers 1521 and 1532, and 19 .cel files were generated for wafer 2353. Each group contains a pool of non-specific RNA as well as a set of 14 distinct human transcripts spiked in at known concentrations of 0, 0.25, 0.5, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, and 1024 pM.
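The array counts and the concentration series above can be reproduced with a short Python sketch. It assumes that groups a through l contribute one chip per wafer and that groups m and q contribute 4 chips each, which is an inference from the 20 .cel files per complete wafer. The variable names are illustrative and are not part of the data set.

```python
# Groups a-l: one chip per wafer (assumed); groups m and q: 4 replicates each.
single_groups = list("abcdefghijkl")
replicated_groups = {"m": 4, "q": 4}

n_groups = len(single_groups) + len(replicated_groups)
chips_per_full_wafer = len(single_groups) + sum(replicated_groups.values())

# Wafer 2353 is missing chip c, so it contributes one fewer .cel file.
wafers = ["1521", "1532", "2353"]
missing_chips = {"2353": 1}
cel_files = {w: chips_per_full_wafer - missing_chips.get(w, 0) for w in wafers}
total_arrays = sum(cel_files.values())

# Spike-in concentrations in pM: 0, then a doubling series from 0.25 to 1024.
concentrations_pm = [0.0] + [0.25 * 2**k for k in range(13)]

print(n_groups, cel_files, total_arrays)
print(concentrations_pm)
```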
Sample Genetic Marker Data
These data are computer-simulated and are in wide form. The 1000 rows correspond to individuals, and the 130 columns contain family, genotype, and phenotype information on these individuals. The disease column contains the binary trait of primary interest, where 1 indicates individuals affected with the disease and 0 indicates unaffected individuals. There are also four quantitative traits and sixty markers, each with two possible alleles (designated 1 and 2), for each individual. The marker data occur in pairs, so that columns ma1 and ma2 contain the genotype at the first marker, columns ma3 and ma4 contain the genotype at the second marker, and so on. The analyses performed on this data set aim to locate the gene or genes that affect susceptibility to this disease.
Accompanying this data set is a map data set that provides information about the 60 markers, which are spread across two hypothetical candidate gene regions. The variable indicating the candidate gene on which each marker resides can be used to group analyses, and the Location variable is useful for accurately displaying distances in base pairs between markers along the x-axis of plots of association p-values.
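The paired marker columns described above can be mapped to genotypes as in the following Python sketch. It assumes that the naming pattern continues through ma119 and ma120 for the sixtieth marker; only ma1 through ma4 are named in the text. The helper functions and the sample row are hypothetical and are not part of the software.

```python
N_MARKERS = 60  # number of markers stated in the data set description


def marker_columns(marker_index):
    """Return the two allele column names for a 1-based marker index.

    Assumes the pattern ma1/ma2, ma3/ma4, ... continues for all markers.
    """
    if not 1 <= marker_index <= N_MARKERS:
        raise ValueError("marker_index must be between 1 and 60")
    return f"ma{2 * marker_index - 1}", f"ma{2 * marker_index}"


def genotype(row, marker_index):
    """Return the unordered genotype at a marker as a sorted allele pair."""
    first, second = marker_columns(marker_index)
    # Sorting treats allele pairs 1/2 and 2/1 as the same genotype.
    return tuple(sorted((row[first], row[second])))


# Hypothetical individual with only the first two markers filled in.
row = {"disease": 1, "ma1": 2, "ma2": 1, "ma3": 2, "ma4": 2}

print(marker_columns(60))
print(genotype(row, 1), genotype(row, 2))
```

Treating the allele pair as unordered is a design choice for this sketch; keep the original column order if the phase of the alleles matters for an analysis.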
Affected Sib-Pair (ASP) Data
In simulated data provided by Gonçalo Abecasis at the University of Michigan Center for Statistical Genetics, two hundred families, each containing an affected sib-pair and the siblings' parents, were genotyped at 20 markers from a single chromosome. MERLIN was used to estimate identical-by-descent (IBD) allele-sharing probabilities at these markers for all pairs of related individuals. The 400 offspring were also measured for a quantitative trait of interest.
Nicardipine
These data came from a study of the effects of nicardipine on patients suffering from recent aneurysmal subarachnoid hemorrhages (Haley et al. 1993a, 1993b). A total of 906 patients were included in this randomized, double-blind, placebo-controlled study; 449 patients received nicardipine and 457 received the placebo. Patients in each group were balanced with regard to prognostic factors for overall outcome. Nicardipine and the placebo were delivered continuously at 0.15 mg/kg/hr for up to 14 days, and patients were followed for up to 120 days following administration of the drugs. Results are formatted according to the CDISC Study Data Tabulation Model (SDTM) and Analysis Data Model (ADaM).
Prostate Cancer Biomarkers
This data set was obtained by surface-enhanced laser desorption/ionization (SELDI). This method allows an investigator to detect and resolve multiple proteins bound to protein chip arrays (Merchant and Weinberger 2000). This approach was used by Qu et al. (2002) to discriminate prostate cancer from non-prostate cancer patients. The promise of this approach is that a panel of multiple biomarkers can be used to distinguish important phenotypes such as cancer status. However, great care must be taken to pre-process and analyze the data appropriately to ensure generalizability of results.
The example data set consists of serum samples collected from 165 men, 84 of whom had prostate cancer. The remaining 81 men are considered to be controls. The primary goal is to determine differences in protein expression between these groups.
Additional Data Sets
Some of the examples discussed in this manual use data sets that are neither included with JMP Genomics software nor described here. Where applicable, these additional data sets are described in the relevant chapters. We attempted to use publicly available data sets wherever possible, and have included instructions on where and how you can obtain these data as they are encountered in this manual.