Recode Genotypes

Genotypic information for sets of allele or markers is frequently listed using combinations of letters and numbers. Alternatively, data sets can use a two column allelic format in which each of the two alleles that make up a genotype is listed in a separate column. Such formats are not amenable to many forms of statistical analysis, which typically can use only single-column, numerical data. Before any analyses can be performed. Therefore, data sets containing genetic marker data in a non-numerical or allelic format must be recoded. Conversely, genotypes might have been originally coded numerically as the values 0, 1, and 2 or the genotype might be coded as a single letter -- "A" for genotype A/A, "B" for genotype B/B, and "H" for the heterozygous genotype A/B. Since the processes for analyzing genotype data require a representation that discerns the two alleles that the genotype comprises, converting the single character or numeric genotypes to this recognized form is necessary.

When numeric recoding of genotypes is performed, Recode Genotypes tabulates the SNP genotypes for all of the individuals in a data set, ranking the prevalence of each allele. Depending on the recoding desired, the process assigns numerical identities to the genotypes based on the prevalence of the alleles, whether they are homozygous or heterozygous, and their dominant-recessive relationship. Conversely, Recode Genotypes can also convert single character numeric (0, 1, and 2) or non-numeric (A, B, and H) back into a genotypic string (A/A, B/B, and A/B).

What do I need?

Two data sets are needed for this process.

The first, the Input Data Set, contains all of the marker data. This data set must be a wide data set; markers are listed in columns, whereas individuals are listed in rows. Marker data can be presented in either the one- or the two-column format.

The second data set is the Annotation Data Set. This data set contains information, such as gene identity or chromosomal location, for each of the markers. This data set is a tall data set; each row corresponds to a different marker.

For detailed information about the files and data sets used or created by JMP Life Sciences software, see Files and Data Sets.

Output/Results

This process generates a new output data set containing the recoded genotypes. Refer to the Recode Genotypes output documentation for detailed descriptions of the output of this process.