Glossary

Definitions/Explanations of Terms Used in JMP Life Sciences Documentation

Term

Definition/Explanation

Accession Number

A numerical identifier for a Gene, marker, or gene product in a public database.

ADaM

CDISC Analysis Data Model.

ADSL

CDISC ADaM Subject-Level Analysis Data set.

Adverse Event (AE)

Any adverse change in health or side effect that occurs in a person who participates in a Clinical Research trial while the patient is receiving treatment or within a previously specified length of time after treatment completion.

Alanine Transaminase (ALT)

A transaminase enzyme, also known as serum glutamic pyruvic transaminase (SGPT) or alanine aminotransferase (ALAT). Catalyzing two parts of the alanine cycle, it is found in body tissues (most commonly associated with liver tissues) and serum. It is a clinical diagnostic indicator of hepatocellular injury.

Allele

Any variant form of a Nucleic Acid sequence or marker at a particular Locus .

Alpha

Significance level. Although alpha can be any value between 0 and 1, it is typically set at either 0.01, 0.05 or 0.10.

Alternative Hypothesis

A position that a researcher evaluates in an experiment. The alternative hypothesis, $H_1$ (or $H_a$), is the hypothesis that sample observations are influenced by a specific non-random cause. It is the rival of the Null Hypothesis, $H_0$.

Amino Acid

The building block of Proteins. An enormous variety of proteins can be constructed by linking together Amino Acids (there are more than twenty types) in a linear chain, which in turn folds into more complex configurations.

Analytical Procedure (AP)

A computational task that uses input data and parameters in a statistical method or other algorithm to produce output in the form of files, data sets, tables, graphics, logs, and so on. Also known as an Analytical Process . Abbreviated as AP .

ANCOVA

Analysis of covariance; a general linear model with a continuous outcome Variable and multiple predictor variables, with at least one nominal and one continuous predictor variable. Considered a hybrid of regression for continuous variables and ANOVA, ANCOVA can determine whether specific factors have an impact on the outcome variable after removing variance resulting from Covariates (the quantitative predictors).

Annotation Data Set

Data set containing a variety of information about the identity, biological function, pathway association, and so on, for the Genes, gene fragments, gene products, Probe Sets (Probesets), genetic markers, and so on, under investigation.

ANOVA

Statistical models and procedures that partition observed Variance in a Variable into components attributable to different variation sources. By analyzing comparisons of variance estimates, ANOVA can determine whether the Means of several groups are statistically equal.

AP

See Analytical Procedure (AP) .

Arm

In a Clinical Research trial, the group of patients receiving a certain type of therapy. For example, one arm of a clinical trial might consist of patients receiving a new medication, another arm might consist of patients receiving a standard-of-care medication, and another of patients receiving a placebo pill.

Aspartate Transaminase (AST)

A pyridoxal phosphate-dependent transaminase enzyme, also known as aspartate aminotransferase (AspAT, ASAT, AAT) or serum glutamic oxaloacetic transaminase (SGOT). Catalyzing a reversible transfer between aspartate and glutamate, it is found in the brain, heart, muscles, kidneys, red blood cells, and liver. It is a clinical diagnostic indicator of liver health.

Association

The statistically significant co-occurrence of two or more phenomena.

In the context of a Genome-Wide Association Study (GWAS) , mapping of a Gene for a particular Trait or disease is performed by detecting significant associations between the trait and marker Genotype .

AUC

Area under the Receiver Operating Characteristic (ROC) curve. A Statistic that is used to assess the ability of a model to predict future responses. For most models, the greater the AUC, the better the model is at making accurate predictions.
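
Example (an illustrative Python sketch, not JMP code): the AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The function name auc and the sample labels and scores below are hypothetical.

```python
def auc(labels, scores):
    """Area under the ROC curve, computed by comparing every
    positive case with every negative case (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical data: 1 = event observed, 0 = no event.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
example_auc = auc(labels, scores)  # 3 of the 4 positive/negative pairs are ordered correctly
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two groups.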

Autosome

Any Chromosome that is not a Sex Chromosome .

Bar Chart

A graphical representation of discrete or non-continuous data. It consists of rectangular bars whose lengths are proportional to the magnitude of the values that they represent. See Bar Chart .

Base

Also known as a nitrogenous base or nucleobase , it is one component of a Nucleotide . Every nucleotide has exactly one base.

BCPNN

Bayesian Confidence Propagation Neural Network ¹

Beta

Denotes the Type II Error rate, and is related to the Power of a test (power = 1 - beta).

Bin

A group of related or functionally similar Observations that are considered as a unit for statistical analysis.

Binary Trait

A characteristic with a dichotomous expression. Unlike a Quantitative Trait (which can be found in degrees, or along a continuous scale), a binary trait is either found, or not found, in an organism. Synonymous with Dichotomous Trait.

Binary Trait Locus (BTL)

A region of DNA that is associated with either the presence or absence of a particular Phenotype .

Binary Variable

A Variable that contains two discrete values (0 and 1, for example).

Binomial Regression

A regression method where the Dependent Variable contains binomial values (for example, 0 and 1, often corresponding to ‘no’ and ‘yes’, or ‘failure’ and ‘success’, respectively).

Bioinformatics

A scientific field of study involving the integration and application of computer science, information technology, mathematics, and statistics to the fields of biology, genetics, Genomics , and medicine.

Bivariate

Involving two Variables.

Body System

A group of Organ s that work together to perform a task. Examples in humans include the digestive system, the nervous system, and the endocrine system.

Bootstrap

The practice of using with-replacement empirical distributions of Observation s to estimate the statistical properties of the population from which the observations were made.
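
Example (an illustrative Python sketch, not JMP code): resampling the observed values with replacement many times gives an empirical distribution of a statistic, here the mean. The sample values, the number of resamples, the seed, and the percentile interval are assumptions chosen for the illustration.

```python
import random
import statistics


def bootstrap_means(sample, n_resamples=1000, seed=0):
    """Return the mean of each with-replacement resample of `sample`."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(sample)
    return [statistics.mean(rng.choices(sample, k=n)) for _ in range(n_resamples)]


# Hypothetical observations.
sample = [4.1, 5.0, 5.6, 6.2, 7.3, 8.0]
means = sorted(bootstrap_means(sample))

# Simple percentile interval: the 2.5th and 97.5th percentiles of the resampled means.
lower = means[int(0.025 * len(means))]
upper = means[int(0.975 * len(means)) - 1]
```

The pair (lower, upper) is a rough 95% interval for the population mean under the assumption that the sample is representative.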

Box Plot

Used to display the response distribution at different combinations of factor levels. Box plots can reveal differences in the response Mean at different levels, suggesting Main Effects. Box plots can also reveal whether the response variation is homogeneous across factor levels, an assumption made in ANOVA. See Box Plot.

Bubble Plot

A two-dimensional Scatterplot showing the relationship between two Variable s over time. Each circle, or bubble , represents a single instance of an ID variable. See Bubble Plot .

BY Group

All of the Observations with the same values for all BY Variables.

BY Variable

An optional (in most APs) Variable specification whose values define groups of Observations, such as hour, month, or year. Specifying a BY variable enables you to animate an image so that you can see how response values change according to some grouping, such as over time. Alternatively, BY variables can enable analyses to be performed separately on different groups as defined by a variable such as gender.

CDISC

Clinical Data Interchange Standards Consortium, a nonprofit organization that has “established standards to support the acquisition, exchange, submission and archive of Clinical Research data and Metadata ” whose mission is “to develop and support global, platform-independent data standards that enable information system interoperability to improve medical research and related areas of health-care”. See the CDISC website for more information.

Cell

The fundamental building block of a living Organism . Cells consist of Organelle s. A cell is the smallest unit that can be said to be living.

Cell Plot

A Heat Map or color map display, mapping colors to values of Variable s and displaying them in a rectangular grid. See Cell Plot .

Censor Variable(s)

These columns specify those Observation s for which data have been censored or truncated. For example, investigations of the effects of certain Gene s on life span might be terminated before all of the individuals have expired. The ultimate life spans for these individuals are unknown. All that can be said is that they exceed the period of the study. These data are considered censored.

CentiMorgan (cM)

Also referred to as a map unit. A measure of Genetic Distance between loci; one centimorgan is equivalent to the physical separation between loci needed for a recombination frequency of 1%. Recombination between loci is affected by a variety of molecular and biochemical factors, in addition to Physical Distance , and the centimorgan should not be considered as representing a linear measurement.

Character Variable

A Variable whose values can consist of alphabetic and special characters as well as numeric characters.

Chart

A graphical representation of data. Charts can take many forms. See Chart .

Check Box

An item in a dialog or window that you can select without affecting any other items. You can deactivate a check box by selecting it again.

Chi-square Test

A statistical test used to test the existence of a relationship between two nominal Variables where the sampling distribution of the Test Statistic is a chi-squared distribution when the Null Hypothesis is true (or where it is asymptotically true).

Cholesky Decomposition

A mathematical method for taking the square root of a symmetric, positive-definite matrix. The Cholesky root of a matrix A is L, where A = LL` (L` denotes the transpose of L) and L is a lower-triangular matrix. It is useful in Quantitative Trait Locus (QTL) analysis because it allows modeling of a Pedigree-induced Covariance structure via a Mixed Model. It solves systems involving a kinship matrix (see K Matrix) more efficiently than classical LU decomposition, and thus improves execution time. However, it should not be applied to ill-conditioned matrices containing sparse or low-quality data.
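
Example (an illustrative Python sketch, not the routine used by JMP or SAS): the lower-triangular root L can be built one row at a time. The function name cholesky and the 3 x 3 matrix are hypothetical; the sketch assumes the input is symmetric and positive definite and raises an error otherwise.

```python
import math


def cholesky(a):
    """Return lower-triangular L such that A = L * transpose(L).
    Assumes `a` is a symmetric, positive-definite matrix (list of rows)."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            # Sum of products of the entries already computed in rows i and j.
            s = sum(l[i][k] * l[j][k] for k in range(j))
            if i == j:
                d = a[i][i] - s
                if d <= 0:
                    raise ValueError("matrix is not positive definite")
                l[i][j] = math.sqrt(d)
            else:
                l[i][j] = (a[i][j] - s) / l[j][j]
    return l


# Hypothetical symmetric, positive-definite matrix.
A = [[4.0, 12.0, -16.0],
     [12.0, 37.0, -43.0],
     [-16.0, -43.0, 98.0]]
L = cholesky(A)
```

Multiplying L by its transpose reproduces A, which is a convenient check on the result.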

Chromosome

Bundled strands of DNA and Protein located in the Cell Nucleus . Chromosomes are inherited from parents. Chromosome count per Organism cell varies by Species . Human cells contain 23 pairs of nuclear chromosomes.

Class Variable

The Variable whose values define the groups for analysis. Class variables can have continuous values, but they typically have a few discrete values that define the classifications of the variable. Values can either be character or numeric.

Clinical Research

A medical science branch focused on determining both the safety and effectiveness of diagnostic products, medications, medical devices, and treatment regimens for human health.

Clustering

The process of dividing a data set into mutually exclusive groups such that the Observation s for each group are as close as possible to one another, and different groups are as far as possible from one another.

Cochran-Mantel-Haenszel Test

A statistical test used for repeated tests of nominal variable independence.

Color Variable

A Variable whose values are used to specify how the graphical output of an analysis is to be colored.

Composite Interval Mapping (CIM)

Method for mapping of a target Quantitative Trait Locus (QTL) for a Trait . It uses markers, located elsewhere in the Genome , that have previously been shown to be associated with additional QTLs for that trait. Multiple analysis points across each inter- Locus interval for the target QTL are assessed.

Conditional Probability

The probability of an event (for example, X) given that another specific event (for example, Y) occurs. Conditional probability is often expressed as $P(X \mid Y)$ or $P_Y(X)$.
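
Example (an illustrative Python sketch): a conditional probability can be computed as the probability that both events occur divided by the probability of the conditioning event. The two-dice setting and the variable names are assumptions chosen for the illustration.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

# Event Y: the first die shows an even number.
y = [o for o in outcomes if o[0] % 2 == 0]

# Events X and Y together: the total is 8 and the first die is even.
x_and_y = [o for o in y if sum(o) == 8]

p_y = Fraction(len(y), len(outcomes))
p_x_and_y = Fraction(len(x_and_y), len(outcomes))

# P(X | Y) = P(X and Y) / P(Y)
p_x_given_y = p_x_and_y / p_y
```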

Contingency Plot

See Mosaic Plot .

Contingency Table

A table used to record and analyze the relationship between two or more categorical variables. See Contingency Table .

Continuous Trait

A trait based on a characteristic measured on an ordered scale, lacking discrete divisions or gaps. For example, height and skin color are continuous traits. Continuous traits are in contrast to Dichotomous Trait s , where categorizations are discrete.

Copy Number Variation (CNV)

Any Deletion , Insertion , duplication, or other variant in the DNA sequence of a Genome that results in that sequence being present in greater or lesser numbers relative to those seen in a normal, reference genome. A CNV can result from relatively simple duplications, either tandem or inverted, or deletion of small or large blocks of sequence. They might also be more complex, involving gains or losses of homologous sequences at multiple sites in the genome. Disruption of contiguous Gene sequences and altered gene dosages by CNVs have been shown to influence gene Expression , increase phenotypic variation and cause disease.

Correlation

A relationship between Variables in terms of dependence.

Correlation Coefficient

Also known as the Pearson product-moment correlation coefficient, it is equal to the Covariance of two Variables divided by the product of their Standard Deviations.

Covariance

A measure of the relationship between two Variables. It equals the Correlation Coefficient between the two variables multiplied by the product of the square roots of their Variances (that is, their Standard Deviations).
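
Example (an illustrative Python sketch, not JMP code): the sample covariance and the Pearson correlation coefficient are related exactly as described in the Correlation Coefficient and Covariance entries. The function names and data values are hypothetical; the sketch uses the n - 1 (sample) divisor.

```python
import math


def covariance(x, y):
    """Sample covariance of two equal-length sequences (divisor n - 1)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)


def correlation(x, y):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    sd_x = math.sqrt(covariance(x, x))  # the covariance of x with itself is its variance
    sd_y = math.sqrt(covariance(y, y))
    return covariance(x, y) / (sd_x * sd_y)


# Hypothetical data in which y is exactly twice x, so the correlation is 1.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
r = correlation(x, y)
```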

Covariate

An Independent Variable , not manipulated by the experimenter, that can influence the outcome of the experiment.

Cox Proportional Hazards Model

A classical semiparametric (sometimes considered nonparametric) method that relates the time of an event (for example, failure or death) to explanatory variables (Covariates). This model assumes that hazard rate, rather than survival time, is a function of the explanatory variables. There are no assumptions made on the shape or nature of the hazard function.

Cross Validation

A statistical method for evaluating how well a model predicts the outcome of additional experiments.

CSV

Comma-separated value format. This text format stores tabular data, with line breaks and commas used to delimit table rows and columns, respectively.
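
Example (an illustrative Python sketch): the standard csv module reads each line as a row and each comma-separated field as a column. The column names and expression values below are hypothetical.

```python
import csv
import io

# Hypothetical CSV text: the first line holds the column names.
text = "gene,chromosome,expression\nBRCA1,17,2.5\nTP53,17,3.1\n"

# DictReader maps each data row to a dictionary keyed by column name.
rows = list(csv.DictReader(io.StringIO(text)))

for row in rows:
    print(row["gene"], row["chromosome"], row["expression"])
```

All values are read as text; convert them to numbers explicitly where needed.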

Deletion

A mutation that results in a missing DNA sequence or chromosomal region.

Delimiter

One or more characters that separate the designations for the different Allele s in a Genotype . JMP frequently uses a forward-slash ( / ) as a delimiter.

Dendrogram

A tree-like diagram used to summarize a Clustering process. A dendrogram shows where each cluster divides in a hierarchical fashion. See Dendrogram .

Dependent Variable

A Variable whose value is determined by the value of another variable or by the values of a set of variables. This variable lists the responses you measure. In a two-dimensional plot, the dependent variable is usually plotted on the y (vertical) axis.

Deviance Residual

A Residual that measures the disagreement between the maxima of the fitted and observed log likelihood functions.

Dialog

An interactive window that enables you to set parameters for and run an analytical process.

Dichotomous Trait

A trait that completely and discretely separates a population of organisms belonging to the same species. For example, blood type is a dichotomous trait. Dichotomous traits are in contrast to Continuous Trait s (such as weight ), where categorizations are not discrete.

Distance Matrix

A matrix of distances.

Distribution

Graphics showing the number or proportion of events falling within a particular interval. JMP Life Sciences software presents these distributions as histograms or Parallel Plot s. See Distribution .

DNA

Deoxyribonucleic acid, a Nucleic Acid containing genetic instructions for the development and functioning of living Organism s (except for RNA viruses). DNA segments encoding genetic information are known as Gene s. Non-coding DNA can have structural or regulatory purposes.

Dot Product

An algebraic operation that takes two equal-length number sequences (usually coordinate vectors ) and returns a single number obtained by multiplying corresponding entries and summing those products.
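
Example (an illustrative Python sketch): multiply corresponding entries and sum the products. The function name dot and the vectors are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length number sequences."""
    if len(u) != len(v):
        raise ValueError("sequences must have the same length")
    return sum(a * b for a, b in zip(u, v))


result = dot([1, 2, 3], [4, 5, 6])  # 1*4 + 2*5 + 3*6
```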

Double False Discovery Rate (FDR) Adjustment

The Double FDR method of Mehrotra and Heyse (2004) ² is used to compare the incidence of adverse events among treatments, leveraging the grouping of related adverse events (typically defined by the MedDRA system organ class). The method considers whether related terms within a group show differences between the treatments and upweights or downweights the significance of an individual term within the group accordingly. In the 2004 paper, the FDR adjustment is performed twice, and simulations are used to control the false discovery rate. Mehrotra and Adewale (2011) refine the Double FDR method to avoid the need for simulations by applying FDR adjustment thrice.

Drill Down

To start at one level of a dimension hierarchy and to click through one or more lower levels until you reach the data that you are interested in.

Ecosystem

The sum of all Organism s living in a given area along with the relevant Environment al components of that area.

EDDS

See Experimental Design Data Set (EDDS) .

EDF

See Experimental Design File (EDF) .

Eigenvalue

A scalar value that determines by how much a corresponding eigenvector is scaled by the square matrix for which it is defined. In Principal Components analysis, the eigenvalues of the Covariance or Correlation matrix represent the Variance of the components.

Eigenvector

For a given square matrix, a nonzero vector that changes length, but not direction, when multiplied by the matrix. The computation of Principal Components for a set of Variable s uses the eigenvectors of the variables' Covariance or Correlation matrix.

Electrocardiogram (EG)

A record or display of a person’s heartbeat produced by electrocardiography, the transthoracic interpretation of electrical activity of the heart over a period of time.

Environment

The sum of all living and non-living things (including physical conditions and factors) in a given physical space. This term is often used in contrast with Genotype to explain effects not attributable to genetic makeup alone.

ESTIMATE Statement

A programming statement for certain SAS procedures (for example, PROC MIXED) used to specify parameters used in both ANOVA and Mixed Model analyses to test an arbitrary set of linear hypotheses regarding the relative importance of different combinations of Fixed Effects parameters.

Euclidean Distance

The distance between two points that would be measured with a ruler. It is derived from the Pythagorean equation: $a^2 + b^2 = c^2$.
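
Example (an illustrative Python sketch): the Pythagorean relationship extends to any number of coordinates by summing the squared differences and taking the square root. The function name and the points are hypothetical.

```python
import math


def euclidean_distance(p, q):
    """Straight-line distance between two points with the same number of coordinates."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


d = euclidean_distance((0, 0), (3, 4))  # the 3-4-5 right triangle
```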

Exon

Any portion of the transcribed region of a Gene that makes up the mature mRNA. Exons are separated in the genomic DNA by intervening sequences (Introns) that are transcribed with the exons. Introns are excised and the exons are spliced together during post-transcriptional processing and mRNA maturation.

Experimental Design Data Set (EDDS)

An EDDS is a SAS data set that provides information about the columns of a tall data set. It describes relevant experimental Variable s such as treatment conditions and Covariate s as well as a variable named ColumnName. Entries in the ColumnName column must exactly match the column names in the input tall data set. EDDSs have certain constraints that must be followed for the processes to run successfully.

An EDDS is required by most processes using a tall input data set. Many of the input engines that generate a tall data set from raw data files also automatically generate the needed EDDS.

Experimental Design File (EDF)

A file that provides JMP Genomics with important information about how an experiment was carried out. It defines experimental Variable s such as treatment conditions and Covariate s and provides the basis for organizing and analyzing your data.

An EDF is required by many of the input engines for the construction of a SAS data set from the raw data files. An EDF also serves as a precursor to the Experimental Design Data Set (EDDS) .

Expression

The process by which the information contained in a Gene is used to make a functional RNA or Protein .

Extensible Markup Language (XML)

A markup language that structures information by tagging it for content, meaning, or use. Structured information contains both content (for example, words or numbers) and an indication of what role the content plays.

Factor

Also referred to as an Independent Variable or Predictor variable, a factor is a Variable included in a model to account for variation in a response. Factors are the variables whose values (levels) you set to study their relationship to a response. You often experiment with many potentially influential factors at the same time.

False Discovery Rate (FDR)

The expected proportion of a set of predictions that are false discoveries. For example, if an analysis that predicts the association of 10 genes with a particular Trait has a false discovery rate of 0.1, you can expect 9 of the predictions to be correct.
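
Example (an illustrative Python sketch): one widely used way to control the false discovery rate is the Benjamini-Hochberg adjustment of p-values. This glossary does not state which procedure a given JMP process applies, so treat the choice of method, the function name, and the p-values below as assumptions.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values never decrease with rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values from four tests.
p = [0.01, 0.04, 0.03, 0.20]
adj = benjamini_hochberg(p)

# Tests whose adjusted p-value is at or below 0.1 are declared discoveries at FDR 0.1.
discoveries = [i for i, value in enumerate(adj) if value <= 0.1]
```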

Familywise Error Rate (FWER)

The probability of making one or more false discoveries (Type I Errors) among all hypotheses while performing multiple pairwise tests.

FASTA Format

A text-based format for representing DNA, RNA, or Protein sequences. Single-letter codes corresponding to Nucleotides (composing DNA and RNA) or Amino Acids (composing proteins) are used for sequence representation. No standard file extension exists for FASTA files, but often the extensions “.fasta”, “.fas”, “.fa”, or “.txt” are used.
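
Example (an illustrative Python sketch, not a JMP import engine): in a FASTA file each record starts with a header line beginning with ">" and is followed by one or more lines of single-letter codes. The function name parse_fasta and the sequences are hypothetical, and the sketch takes the first word of the header as the record name.

```python
def parse_fasta(text):
    """Return a dictionary mapping each record name to its full sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)  # sequences may span several lines
    return {key: "".join(parts) for key, parts in records.items()}


# Hypothetical two-record file: one nucleotide sequence and one protein sequence.
example = ">seq1 hypothetical example\nACGTAC\nGTTA\n>seq2\nMKV\n"
seqs = parse_fasta(example)
```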

FASTQ Format

A text-based format for representing DNA or RNA sequences with associated quality scores. Nucleotides (composing DNA and RNA) and quality scores are each represented with a single character. No standard file extension exists for FASTQ files, but often the extensions “.fastq”, “.fq”, or “.txt” are used.

Field

A window area in which you can view, enter, or modify a value.

Fisher’s Exact Test

A statistical significance test used in the analysis of contingency tables where sample sizes are small. It is useful when you want to conduct a Chi-square Test, but one of your cells has an expected frequency of five or less. Its name is derived from its inventor, R.A. Fisher, and reflects that the significance of the deviation from a Null Hypothesis can be calculated exactly (as opposed to relying on an approximation whose exactness is realized only as sample size approaches infinity).

Fixed Effects

The effects that drive the variation that you are interested in assessing that have a fixed number of well-defined levels. They can also include nuisance variables that you need to consider in your model. Fixed effects include factors such as experimental treatment, disease status, age or developmental status of the test organisms, and gender. Variation due to fixed effects is the variation that you are interested in estimating and must be kept in the analysis.

Forest Plot

A graphical display designed to illustrate the relative strength of treatment effects (or relative degree of gene enrichment), in multiple quantitative scientific studies (or databases) addressing the same question. Forest plots generally display results for each study (or other data source) as horizontal lines representing the 95% confidence interval of the effect observed in that trial. See Forest Plot .

Gaussian Graphical Models

Multivariate probability distributions encoding a dependency network among variables.

Gene

Molecular unit of heredity in a living Organism , comprising a contiguous sequence of DNA or RNA , coding for a Protein or RNA chain, which in turn has a function in the organism.

Genetic Distance

Genetic divergence between Species (with or without using a genetic map) or between populations within a species. Smaller genetic distances indicate closer genetic relationships. A variety of genetic distance measures exist, including the CentiMorgan (cM) , Trait frequency differences, the fixation index, Nei’s standard genetic distance, and so on.

Genetic Pathway

A set of interactions occurring between a group of Gene s. Many of the genes in the interaction network rely on the functions of other individual genes within the network in order to yield a net worthwhile product or function to the Cell .

Genome

The entire set of hereditary information in an Organism , encoded by DNA or RNA , comprising Gene s and non-coding sequences.

Genome-Wide Association Study (GWAS)

An examination of the entire Genome of multiple individuals for an association between a genetic variant and a Trait .

Genomics

The scientific field of study concerned with Genome s.

Genotype

The genetic complement an individual has at a particular Locus .

Grid Computing

The uniting of computer resources from multiple computers or domains in a distributed system to perform faster and more powerful computations. Some problems involving enormous data sets and extremely complicated algorithms are feasible to attempt only with grid computing.

Group Variable

A Variable that is used for grouping results.

Haplotype

The combination of Alleles that is transmitted from one generation to the next as a single unit on a Chromosome.

Hardy-Weinberg Equilibrium

The state of a population in which Gene frequencies and genotypic ratios remain constant from generation to generation.
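
Example (an illustrative Python sketch): for a locus with two alleles at frequencies p and q = 1 - p, the expected genotype proportions at Hardy-Weinberg equilibrium are p^2, 2pq, and q^2. The allele labels A and a, the function name, and the frequency 0.7 are hypothetical.

```python
def hardy_weinberg(p):
    """Expected genotype proportions for a two-allele locus with allele A at frequency p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency must be between 0 and 1")
    q = 1.0 - p
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}


freqs = hardy_weinberg(0.7)  # roughly 0.49 AA, 0.42 Aa, 0.09 aa
```

Observed genotype counts that depart strongly from these proportions suggest that the population is not in equilibrium at that locus.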

Heat Map

See Cell Plot .

Hemizygous

The state of a Cell containing only one copy of an Allele for a particular Gene.

Heterozygous

The state of a Cell containing two different Alleles for a particular Gene on homologous Chromosomes.

Hepatotoxicity

Chemical-driven damage (toxicity) to the liver.

Hierarchical Clustering

A method of cluster analysis that constructs a hierarchy of clustering. Strategies include the agglomerative approach, where each cluster initially contains only one observation, and the divisive approach, where all observations are initially contained in one cluster. Hierarchical clustering results are commonly presented in Dendrogram form.

High Level Group Term (HLGT)

The second-highest level of the Medical Dictionary for Regulatory Activities (MedDRA) Hierarchy, below System Organ Class (SOC) and above High Level Term (HLT) . An example of an HLGT is “Respiratory tract infections”.

High Level Term (HLT)

The third-highest level of the Medical Dictionary for Regulatory Activities (MedDRA) Hierarchy, below High Level Group Term (HLGT) and above Preferred Term (PT) . An example of an HLT is “Viral upper respiratory tract infections”.

Holdout Data

A portion of the data that is set aside during model development. Holdout data can be used as test data to benchmark the fit and accuracy of the emerging predictive model .

Hoeffding Correlation (D)

A nonparametric measure of association that detects general departures from independence. This Statistic approximates a weighted sum over observations of chi-square statistics for two-by-two classification tables.

Homozygous

The state of a Cell containing two identical Alleles for a particular Gene on homologous Chromosomes.

Hotelling T-squared Test

A test of the Null Hypothesis that “the Population Mean vector is equal to the given mean vector”. It is the multidimensional equivalent of the one-sample t-test .

htSNP

Haplotype tag Single Nucleotide Polymorphism (SNP) . htSNPs are a subset of genetic markers that can be used to explain much of the haplotype diversity. See Tag SNP .

HyperText Markup Language (HTML)

The most popular markup language used in Web pages.

Hypothesis Test

A decision-making rule based on data from an experiment or observational study. A hypothesis test is used to conclude significance of a result based on the sufficiently low likelihood (set by the predefined significance level ) that it occurred because of random chance alone.

Hy’s Law

An ominous prognostic indicator (in Clinical Research ) that a pure drug-induced liver injury (DILI) leading to jaundice, without a hepatic transplant, has a case fatality rate of 10-50%.

Identical by Descent (IBD)

Two or more Alleles are identical by descent (IBD) if they are identical copies of the same ancestral allele.

Identical by State (IBS)

Two or more Alleles are identical by state (IBS) either because they are derived from a common ancestor (Identical by Descent (IBD)) or because of chance.

Identical by Type (IBT)

Two or more Alleles are identical by type (IBT) if they have the same phenotypic effect or, when the term is applied to a variation in the composition of DNA such as a Single Nucleotide Polymorphism (SNP), if they have the same DNA sequence.

Imputation

The computation of replacement values for missing input values.

Inbreeding Coefficient

The probability that an individual's two Alleles at any locus are Identical by Descent (IBD).

Independent Variable

This Variable does not depend on the value of another variable; it represents the condition or parameter that is manipulated by the investigator. In a two-dimensional plot, the independent variable is usually plotted on the x (horizontal) axis.

Index Variable

One or more columns specifying how the Observation s are to be classified.

Insertion

A mutation that results in an extra DNA sequence or chromosomal region.

Interval Mapping (IM)

Method for mapping of a target Quantitative Trait Locus (QTL) for a Trait between two flanking markers.

Intron

A Nucleotide sequence within a Gene that is removed through RNA splicing to yield the mature RNA product of a gene. Intron can refer to both DNA gene sequence and the RNA Transcript sequence corresponding to that sequence.

Jitter

A random shifting of points by a slight amount along an axis so that more of those points can be effectively visualized in a graphical display.

JMP Scripting Language (JSL)

A scripting language used in JMP applications.

Journal

In JMP, a journal is a file (.jrn) and associated window that contains results of user-specified Processes.

JSL

See JMP Scripting Language (JSL) .

K Matrix

The relative kinship matrix containing pairwise kinship (or coancestry) coefficients.

K_Rho

The information for the Linkage Disequilibrium (LD) measure Rho .

Kaplan-Meier Survival Curve

A curve based on the survival function estimator from life-time or clinical outcome data. For example, it can be used to measure the proportion of patients living for a given amount of time after treatment, or to measure the time until a tumor disappears.
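
Example (an illustrative Python sketch, not JMP code): the Kaplan-Meier (product-limit) estimate multiplies, at each event time, the fraction of at-risk subjects who survive that time. Censored subjects leave the risk set without causing a step. The function name and the follow-up data are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(time, survival)] at each distinct event time.
    events[i] is 1 if the event was observed at times[i] and 0 if censored."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all subjects that share the same time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed  # both events and censored subjects leave the risk set
    return curve


# Hypothetical follow-up times (months) and event indicators.
times = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)  # steps at months 2, 3, and 5
```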

Kendall Correlation

A metric used to measure the degree of correspondence between two sets of rankings where the metrics used to assess each set of rankings are not equivalent.

Kinship (Coancestry) Coefficient

The probability that two Alleles from a single Locus in a randomly selected pair of individuals are Identical by Descent (IBD).

K-Means Clustering

A statistical method that creates optimally separated groups of Observations in data using one of several methods. A set of points called cluster seeds is selected as a first guess of the means of the clusters. One cluster seed is selected for each of k clusters. Each observation is assigned to the nearest seed to form temporary clusters. The seeds are then replaced by the Means of the temporary clusters, and the process is repeated until no further changes occur in the clusters.
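
Example (an illustrative Python sketch, not the JMP implementation): the seed, assign, and update steps described above can be written as a short loop. The sketch assumes Euclidean distance and caller-supplied seeds; the function name, the points, and the seed choice are hypothetical.

```python
import math


def k_means(points, seeds, max_iter=100):
    """Assign each point to its nearest center and move each center to the mean
    of its members, repeating until the assignments stop changing."""
    centers = [tuple(s) for s in seeds]
    assignment = None
    for _ in range(max_iter):
        # Assign every observation to the nearest current center.
        new_assignment = [
            min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no further changes in the clusters
        assignment = new_assignment
        # Replace each center by the mean of its temporary cluster.
        for c in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centers, assignment


# Hypothetical two-dimensional observations forming two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centers, assignment = k_means(points, seeds=[points[0], points[3]])
```

Different seeds can lead to different final clusters, which is one reason several seeding methods exist.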

Label Variable

A column containing descriptive labels that can be printed in the output by certain procedures instead of, or in addition to, the Variable name (which is also known as the SAS Variable Name ). Synonymous with SAS Variable Label .

Leaf

In a Tree Map , a leaf is any segment that is not further segmented. The final leaves in a tree are called terminal nodes.

Learning Curve

A plot of model performance (for example, accuracy, AUC, or RMSE) according to training set sample size.

Level

A successive hierarchical partition of data in a Tree Map . The first level represents the entire unpartitioned data set. The second level represents the first partition of the data into segments, and so on.

Linkage

The property by which Alleles located on the same Chromosome tend to be inherited together. The closer two loci are on the chromosome, the more tightly they tend to be linked.

Linkage Disequilibrium (LD)

A measure of the association between two Allele s at separate loci.

Note : A high LD does not imply that loci are physically linked.

An association (either positive or negative) between alleles can occur even if the loci are not located on the same Chromosome , provided other factors affecting the Population (directional selection, for example) are in effect.

Locus

The location of a specific genetic sequence on a Chromosome .

Lod Score

A logarithmic statistical estimate of the probability that two loci are located proximal to each other on a Chromosome . A Lod score of 3, which indicates that two loci are 1000 times more likely than not to lie close to each other, is generally considered minimal for significance.
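
As a rough illustration of the logarithmic scale, the sketch below assumes the conventional definition of a Lod score as the base-10 logarithm of a likelihood ratio (linked versus unlinked). The function names and the example likelihoods are hypothetical.

```python
import math

def lod_score(likelihood_linked, likelihood_unlinked):
    """Base-10 log of the ratio of the two likelihoods (conventional definition)."""
    return math.log10(likelihood_linked / likelihood_unlinked)

def odds_from_lod(lod):
    """Convert a Lod score back to odds in favor of linkage."""
    return 10 ** lod

# Hypothetical likelihoods whose ratio is 1000:1.
example_lod = lod_score(1000.0, 1.0)
example_odds = odds_from_lod(3)
```

Under this definition, each additional unit of Lod score corresponds to a tenfold increase in the odds.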

Loess Normalization

Method for eliminating non-biological bias and variation from Microarray data by fitting a local regression curve to Expression data. Assumes that the majority of the Gene s in the study are not differentially affected by the experimental conditions.

Log-rank Test

A nonparametric Hypothesis Test to compare the survival distributions of two samples. It is appropriate when data are right-skewed and non-informatively censored. Also known as a Mantel-Cox test. This test can be considered a time-stratified Cochran-Mantel-Haenszel Test .

Logistic Function

A common sigmoid curve that can model the S-shaped curve of Population growth. Initial growth approximates an exponential curve, followed by slowing growth as saturation begins, followed by no growth at maturity.

Logistic Regression

A generalized linear model used for prediction of the probability of event occurrence ( Binomial Regression ) by fitting data to a logistic curve (see Logit Function ).

Logit Function

The inverse of the sigmoidal Logistic Function . It is synonymous with Log-odds .
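
The inverse relationship can be checked numerically. The sketch below uses the standard logistic function 1 / (1 + exp(-x)) and the log-odds log(p / (1 - p)); the function names are chosen for illustration.

```python
import math

def logistic(x):
    """Map any real number to a probability strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    """Log-odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))

# Applying one function after the other recovers the starting value.
round_trip = logit(logistic(0.7))
```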

Log-odds

See Logit Function .

Loss of Heterozygosity (LOH)

Results from a Deletion (or other Mutation ) of the normal Allele at a Locus that is Heterozygous for the normal allele and a deleterious mutant allele. Produces a locus that is either Homozygous or Hemizygous for the deleterious allele.

Lowest Level Term (LLT)

The lowest level of the Medical Dictionary for Regulatory Activities (MedDRA) , below Preferred Term (PT) . This level is reserved for non-current, vague, ambiguous, truncated, or misspelled terms, or for terms taken from other terminologies that do not conform to MedDRA rules.

LSMeans

Least squares Mean s, which are estimates of means of classification effects that would be observed, assuming that the experimental design is balanced.

MA Plot

A plot of the distribution of red-to-green intensity ratio ( M ; y -axis) by the average intensity ( A ; x -axis). It can be used to visualize the intensity-dependent ratio of raw Microarray data in order to determine whether Normalization is needed. See MA Plot for more information.

Macro

A single statement, instruction, or catalog entry that automatically expands into a set of statements, instructions, or text.

Mahalanobis Distance

A distance measure based on Correlation s between Variable s. In contrast to Euclidean Distance , it is better adapted to non-spherically symmetric distributions, and is scale-invariant. See Mahalanobis Distances .

Main Effect

An effect measures the extent to which the response depends on the factors involved in the effect. A main effect is the change in the response due to a single factor. For two-level factors, the main effect is the difference between the mean response at the high level of a factor and the Mean response at its low level.

Major Allele

From a set of Allele s for a given gene or locus, the allele that occurs most often in a Population .

MANCOVA

Multivariate analysis of covariance; an extension of the analysis of covariance ( ANCOVA ) for multiple Dependent Variable s or where it is not feasible to combine dependent variables. MANCOVA is similar to MANOVA , but enables control for additional continuous Independent Variable s ( Covariate s).

Manhattan Plot

A type of scatter plot commonly used to display dense data, or data of highly diverse orders of magnitude. One typical use is in genomics applications, such as Genome-Wide Association Study (GWAS) . See Manhattan Plot .

MANOVA

Multivariate analysis of variance, a generalized form of univariate analysis of variance ( ANOVA ), used when there are two or more Dependent Variable s. This analysis is useful in determining whether changes in the Independent Variable s have significant effects on the dependent variables, as well as the associated interactions among dependent and independent variables.

Marker Variable

The column listing each individual's Allele or Genotype for the genetic markers used in an analysis. If marker variables are listed as alleles, there is a pair of marker variables for each marker.

Matched Pairs Analysis

A Clinical Research comparison of average score during baseline and a summary score during the trial for each finding. See Matched Pairs Analysis .

Mean

Mathematical average for a collection of n Observation s. It is calculated by dividing the sum of the observations by n .

Medical Dictionary for Regulatory Activities (MedDRA)

A clinically validated international medical terminology used by regulatory authorities and the biopharmaceutical industry.

Median

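In any set of n Observation s arranged in order of magnitude, the median is the middle value: the observation at position ( n +1)/2 when n is odd, or the average of the observations at positions n /2 and n /2 + 1 when n is even.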
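
A minimal sketch of the usual convention for locating the median: the middle observation when n is odd, and the mean of the two middle observations when n is even. The function name `median` is chosen for illustration.

```python
def median(observations):
    """Return the median of a non-empty collection of numbers."""
    ordered = sorted(observations)  # arrange in order of magnitude
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty collection is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]  # odd n: the single middle observation
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: mean of the two middle ones

odd_case = median([5, 1, 3])
even_case = median([4, 1, 3, 2])
```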

Menu Bar

The primary list of items at the top of a window, which represent the actions or classes of actions that can be executed. Selecting an item executes an action, opens a Pull-down Menu , or opens a Dialog box that requests additional information.

Metadata

Descriptive data on the content of primary data.

MGPS

Multi-Item Gamma Poisson Shrinker ³

Microarray

A compact two-dimensional array of biological material on a solid substrate (chip). Microarrays can be used for measuring Gene Expression , detecting SNPs, and evaluating protein-protein interactions, among many other things.

Minor Allele

From a set of Allele s for a given gene or locus, the allele that occurs least often in a Population .

Missing Value

A value in the SAS System indicating that no data is stored for the Variable in the current Observation . It is indicated by a single dot (.) for a numeric variable or a blank for a character variable.

Mixed Model

A statistical model containing both Fixed Effects and Random Effects .

Mode

The value that occurs most often in a probability Distribution or data set.

Model

A formula or algorithm that computes output values from input values.

Modus Tollens

An argument of proof by contradiction ; often known as denying the consequent . It has the general argument form of:

1. If P , then Q .

2. Not Q .

3. Therefore, not P .

Monophyletic

Describes any group descended from a common ancestor that includes the ancestor and all of its descendants.

Mosaic Plot

A graphical representation of a two-way frequency table or Contingency Table . A mosaic plot is divided into colored rectangles, so that the area of each rectangle is proportional to the proportions of the Y Variable in each level of the X Variable. See Mosaic Plot .

Mutation

A change ( Insertion , Deletion , substitution, duplication, or inversion) in a genomic sequence.

Nominal Variable

A Variable that contains discrete values that do not have a logical order. Includes names and other verbal descriptions.

Normalization

Multiple meanings are possible:

1. The division of more than one data set by a shared Variable to remove the effects of that variable from the data. By bringing the data to a common scale, data originating from different scales can be properly compared.

2. The isolation of statistical error in repeated measures data.

3. The adjustment of experimental data to remove variation from background noise and account for differences from technical artifacts (for example, assay or Microarray chip-specific differences).

Nucleic Acid

A DNA or RNA molecule. Nucleic acids store genetic information in all living Organism s.

Nucleotide

Building block of DNA and RNA . DNA nucleotides include Adenine (A), Cytosine (C), Guanine (G), and Thymine (T). RNA nucleotides are the same as those in DNA, with the exception of Uracil (U) replacing Thymine (T).

Nucleus

The membrane-covered Organelle containing most of the genetic material in a eukaryotic Cell .

Null Hypothesis

A general or default position that a researcher tests (and attempts to reject) in an experiment. The null hypothesis, H 0 , is an essential part of a research design, and usually proposes that sample Observation s result purely from chance. The null hypothesis can never be proven; data can reject it or fail to reject it only. If a null hypothesis is rejected, an Alternative Hypothesis , H 1 (or H a ) is accepted.

Numeric Variable

A Variable that contains only numeric values and related symbols, such as decimal points, plus signs, and minus signs.

Observation

A row (horizontal component) in a SAS data set. Each observation contains one data value for each Variable in the data set.

One-way ANOVA

Analysis of variance with one between-groups factor. This is useful when you have a nominal Independent Variable and a normally distributed interval Dependent Variable , and you want to compare differences in the means of the dependent variable according to levels of the independent variable.

One-way Plot

A plot showing the response points along the y -axis for each X factor value. Using the plot, you can compare the distribution of the response across the levels of the X factor. The distinct values of X are sometimes called levels. See One-way Plot .

One-way Repeated Measures ANOVA

Analysis of variance used on one nominal Independent Variable and a normally distributed interval Dependent Variable that is repeated at least twice for each subject. This method is equivalent to the paired samples t -test, but allows for two or more nominal variable levels.

Operand

An object in an expression to be operated on. An operand can be a variable, a function, or a constant.

Operator

A symbol in an expression that requests a comparison, a logical operation, or arithmetic computation.

Optimistic Bias

The established systematic tendency for humans to be overly optimistic about the outcome of planned actions. The likelihoods of positive and negative events are over- and under-estimated, respectively. This tendency varies based on person and type of action.

Ordinal Variable

A Variable that contains discrete values that have a logical order. For example, a variable called Rank could have values such as 1, 2, 3, 4, and 5.

Organ

A collection of Tissue s combined in one structure to serve a particular function.

Organelle

A component (usually encapsulated in a lipid bilayer) within a cell. An organelle elicits a specific function. Examples of organelles include ribosomes, mitochondria, and chloroplasts.

Organism

A living system or body capable of development, growth, homeostasis, and reproduction. Advanced multicellular organisms comprise multiple Body System s.

Overfit

To train a model to the random variation in the sample data. Overfit models contain too many parameters (weights), and they do not generalize well.

Overlay Plot

A plot showing several lines or markers on the y -axis overlaid to a common variable on the x -axis. See Overlay Plot .

Parallel Plot

A plot consisting of connected line segments across all responses for each row in a data table. See Parallel Plot .

Partial Least Squares (PLS)

A statistical technique that simultaneously partitions variability of both X and Y Variable s (or matrices), somewhat similar to Principal Components . A PLS model attempts to identify the multidimensional direction in the X space that explains maximum multidimensional Variance direction in the Y space.

PCTL

See Percentile .

Pearson Correlation

A parametric measure of association for two variables. It measures both the strength and the direction of a linear relationship. If one variable X is an exact linear function of another variable Y , a positive relationship exists if the correlation is 1 and a negative relationship exists if the correlation is -1. If there is no linear predictability between the two variables, the correlation is 0. If the two variables are jointly normal with a correlation of 0, the two variables are independent. However, correlation does not imply causality because, in some cases, an underlying causal relationship might not exist.
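
The boundary cases of 1 and -1 can be illustrated with a small sketch of the standard sample correlation formula (the sum of cross-products of deviations divided by the square root of the product of the sums of squared deviations). The function name and the data are hypothetical.

```python
import math

def pearson_correlation(xs, ys):
    """Sample Pearson correlation of two equal-length sequences of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / math.sqrt(ss_x * ss_y)

# Exact linear functions of x: one increasing, one decreasing.
increasing = pearson_correlation([1, 2, 3, 4], [2, 4, 6, 8])
decreasing = pearson_correlation([1, 2, 3, 4], [8, 6, 4, 2])
```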

Pedigree

The ancestral lineage of an individual or group of closely related individuals.

Penalized Logistic Regression (PLR)

A discriminative classifier known for simultaneous Variable selection and classification. Its performance declines as the number of variables increases, and is often compared with that of Support Vector Machine (SVM) .

Percentile

The value of a Variable below which a certain percent of Observation s fall. For example, the 60 th percentile is the value below which 60% of the observations can be found. Note the following percentile landmarks.

- 25 th percentile = first quartile = Q 1

- 50 th percentile = second quartile = median = Q 2

- 75 th percentile = third quartile = Q 3

Phenotype

The physical manifestation of a Genotype .

Physical Distance

Distance between Gene s or sequences measured in Base s , kilobases, megabases, or centiRays.

Plain Text Format

The format of a text (.txt) file that is readable with little-to-no-processing. Files in this format cannot be embellished with multiple font styles, underlining, italicization, emboldening, and so on.

Pleiotropy

The phenomenon where a single Gene influences multiple phenotypic Trait s. Mutation s in one pleiotropic gene can affect some or all associated traits simultaneously.

Population

The collection of all Organism s from the same Species living in a given geographical area.

Population Stratification

The phenomenon in which differences in Allele frequencies between cases and controls in studies of genetic diseases that might have been ascribed to association of specific Gene s with disease are instead found to result from systematic differences in ancestry of the experimental groups.

Portable Document Format (.pdf)

An open standard for document exchange, created by Adobe Systems, that is used for representation of documents in a software, hardware, and operating system-independent manner.

Posterior Probability

The Conditional Probability of a random event that is assigned after relevant evidence is considered. Contrast with Prior Probability .

Power

The probability of a statistical significance test allowing you to reject the Null Hypothesis when the Alternative Hypothesis is true. Power equals one minus Beta (the rate of Type II Error ).

Preamble

An introductory section of code that can be used to list authors, revision dates, special instructions or assumptions, or any other information.

Predictor

A function or variable used to estimate a response.

Predictor Class Variable

An Independent Variable used to predict a Dependent Variable . Its distinct levels correspond to different predictions, and are often modeled by constructing a set of 0-1 Binary Variable s corresponding to each distinct level.

Predictor Continuous Variable

A numeric Independent Variable that predicts a Dependent Variable . Its predictions are computed directly as a function of the numeric variable, as in a linear regression.

Predictive Model

A statistical tool, made up of informative Variable s, that is used to forecast future behaviors or responses.

Preferred Term (PT)

The fourth-highest level of the Medical Dictionary for Regulatory Activities (MedDRA) Hierarchy, below High Level Term (HLT) and above Lowest Level Term (LLT) . An example of a PT is “Influenza”.

Principal Components

Linear combinations of all of the original Variable s that maximally explain variability. Useful when the analysis considers effects of many variables at once. By combining the variables into groups, you can reduce the total number in any one analysis.

Principal Components Analysis Plot

A graphical display of a Principal Components analysis. The results of a principal components analysis (PCA) are plotted either on a Scatterplot Matrix or a Three-Dimensional Scatterplot . See Principal Components Analysis Plot .

Prior Probability

The probability of an event computed before collection of new data (often based on an experienced expert opinion or rules-of-thumb). An experimenter begins with a prior probability of an event and then revises it in light of new data. Contrast with posterior probability .

Probe

An individual spot or sample (a DNA or Protein sequence, for example) attached to a Microarray chip that hybridizes to a specific Target . Typical microarrays contain thousands of probes.

Probe Intensity

Strength of the signal generated by hybridization of the Target sequence to the specific Probe Set (Probeset) on the Microarray .

Probe Set (Probeset)

All of the different Probe s specific for a Transcript . As an internal control, Microarray s typically include multiple Target sequences taken from different regions of a transcript. Comparisons of intensities resulting from the hybridization of transcripts to each of these sequences allow for more effective evaluation of transcript levels. The collection of targets specific for a transcript is referred to as a probeset .

PROC

A SAS procedure; a group of SAS statements that call and execute a procedure, usually with a SAS data set as input.

Process

A computational task or procedure. In JMP Life Sciences, process typically refers to an Analytical Procedure (AP) .

Protein

A biochemical compound consisting of polypeptides, ordered as folded chains, sheets, or more complex three-dimensional structures. Proteins can have structural or functional roles in Organism s .

Proxy Server

A server that acts as an intermediary for requests from clients seeking resources from other servers.

PRR

Proportional Reporting Ratio ⁴

Pull-down Menu

The list of menu items or choices that appears when you choose an item from a Menu Bar or from another menu.

p-Value

The statistical probability that a Statistic is as extreme as or more extreme than the observed value, assuming that the Null Hypothesis is true. A smaller p -value enables you to more rigorously reject the null hypothesis .

Q Matrix

The n x p Population structure incidence matrix where n is the number of individuals assayed and p is the number of populations defined.

Quantile

Portions taken at regular intervals along a distribution that divide a data set into discrete subsets.

Quantitative Trait

Characteristics that are found in different degrees (along a continuous scale) across Organism s, unlike Binary Trait s, which are either present or absent. Examples of quantitative traits are height and hair color .

Quantitative Trait Locus (QTL)

A region of DNA that is associated with the strength of a particular Phenotype . By themselves, QTLs do not determine whether a Gene is expressed. Instead, each QTL interacts with other QTLs, located throughout the Genome , to influence the relative level of Expression .

Radial Basis Function (RBF)

A real-valued function whose output value is determined only by the distance from the origin or center.

Radial Basis Machine (RBM)

A predictive modeling method used to interpolate multidimensional space, using a hidden layer (often Gaussian) and an output layer, associating each input data point with a Radial Basis Function (RBF) . RBM is resistant to local minima problems, but the input space must be covered well by RBFs.

Random Effects

The effects that cause extraneous variation in your results and have little to do with the questions being addressed. Random effects include factors such as physical differences between the arrays, or batch effects resulting from performing different parts of the experiments at different times, on different days, using different lots of reagents, and so on. Variation resulting from random effects can confound your results and should be eliminated from your analysis.

Random Number Seed

The starting point for a random number generator. Unless a number is specified, an arbitrary value, such as the date or time of an event, is used.

Receiver Operating Characteristics (ROC) Curves

A plot of curves summarizing the trade-off between Sensitivity and Specificity in predictive modeling. See Receiver Operating Characteristics (ROC) Curves .

Regression Analysis

Techniques for modeling and analyzing several Variable s, with the focus on the relationship between dependent and independent variables. Regression analysis is useful in uncovering how values of a Dependent Variable change when a single Independent Variable is varied.

Reliability Diagram

A graph where the conditional distribution of the observations , given the forecast probability, is plotted against the forecast probability. The distributions for perfectly reliable forecasts are plotted along the 45-degree diagonal. See Reliability Diagram .

Residual

Value equal to the response value minus the predicted value.

Rho

See Spearman Correlation . Also, a measure of Linkage Disequilibrium (LD) .

Rich Text Format (.rtf)

A proprietary document file format created by Microsoft, used in many word processing programs.

RNA

Ribonucleic acid, a nucleic acid transcribed from DNA . RNA is used as a messenger (mRNA) to conduct Protein synthesis. Some viruses use RNA to encode Gene s.

Root Mean Square Error (RMSE)

A measure of the differences between the values predicted by a model or an estimator and the values actually observed. It is calculated by taking the square root of the Mean square error value.
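
The calculation can be sketched as follows: square each difference between observed and predicted values, average the squares, and take the square root. The function name and the example values are hypothetical.

```python
import math

def rmse(observed, predicted):
    """Root mean square error between two equal-length sequences of numbers."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    mean_square_error = sum(squared_errors) / len(squared_errors)
    return math.sqrt(mean_square_error)

# Hypothetical values: errors of 3 and 4 give sqrt((9 + 16) / 2).
example_rmse = rmse([0.0, 0.0], [3.0, 4.0])
```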

ROR

Reporting Odds Ratio ⁵

Sample Size

The number of Observation s that constitute a statistical sample. For example, the sample size in a study might consist of the number of subjects. Greater sample sizes lead to greater precision and Power for a study design to detect an effect of a given size.

SAS Data Set

A file whose contents are in one of the native SAS file formats. SAS data sets contain data values in addition to descriptor information that is associated with the data.

SAS Log

A file that contains all the SAS statements that you have submitted, messages about the execution of your analytical process and the SAS program running in the background, and in some cases, output from certain procedures. This file is generated and placed in the designated output folder.

SAS Transport File

A file with a compressed format used in SAS. Transport files can be used to move SAS libraries, SAS catalogs, and SAS data sets across different operating systems. Files of this format have the extension .cpt .

SAS Variable Label

Variable s (columns) in a SAS data set can have a SAS Variable Label . This label has much less restrictive creation rules than the corresponding SAS Variable Name . Blank spaces, special characters, and longer lengths are permitted. See SAS Variable Names and Labels .

SAS Variable Name

Every Variable (column) in a SAS data set must have a unique SAS Variable Name . This name must conform to a number of conventions, with notable restrictions on the first character, blank spaces, special characters, and length. See SAS Variable Names and Labels .

Scatterplot

A graph showing the relationship between two Variable s . Multiple scatterplot formats exist, including scatterplot matrices, three-dimensional scatterplots, and Bubble Plot s. See Scatterplot .

Scree Plot

A graphical method for determining the number of factors. The Eigenvalue s are plotted in the sequence of the principal factors. The number of factors is chosen where the plot levels off to a linear decreasing pattern. See Scree Plot .

Screen Failure

A subject in a Clinical Research study that skips treatments or otherwise does not meet treatment criteria. In clinical data sets, a value of “Screen Failure” is given in the treatment column for this subject.

SDTM

CDISC Study Data Tabulation Model. See SDTM for more details.

Segmentation Summary Plot

A plot of sample versus physical position by Chromosome . See Segmentation Summary Plot .

Sensitivity

The proportion of true positives that are correctly identified. Specifically, sensitivity equals the number of true positives divided by the sum of the number of true positives and the number of false negatives.

Settings File

A SAS file that contains saved process settings, including paths to input and output data sets, and specific options. Settings files are generated by use of the Save button after completing a process input dialog. Rather than manually respecifying parameters for every run of a process, a settings file can quickly autopopulate this information via the Load button. See Saving and Loading Settings .

Sex Chromosome

A Chromosome that influences the sex of an Organism . For any given sexually reproducing Species , there are usually two types of sex chromosomes. All other chromosomes are called Autosome s.

Shift Plot

A graphical display enabling you to compare how an experimental Population responds to an experimental treatment. See Shift Plot .

Sib-pair Analysis

A specific type of Linkage analysis where markers are tested for linkage to a phenotypic Trait or disease by measuring the degree to which affected sib pairs share marker Haplotype s.

Single Nucleotide Polymorphism (SNP)

DNA variant in which a sequence differs from the wild-type or other reference sequence by a single Nucleotide .

Singular Value Decomposition

The factorization of a real or complex matrix, allowing the matrix to be expressed as a product of three matrices: two orthogonal (or unitary) matrices and a diagonal matrix of singular values. Every m x n matrix has a singular value decomposition.

Smoothing Bandwidth

A number determining the degree of smoothing for certain algorithms.

SNP

See Single Nucleotide Polymorphism (SNP) .

Spearman Correlation

Nonparametric method for examining whether two quantitative Variable s co-vary. Each pair of variables is converted to ranks and is linked with an “unseen nominal variable”.

Species

Multiple definitions exist, which convey essentially the same meaning :

- A group of Organism s capable of interbreeding, resulting in fertile offspring.

- A group of Organism s belonging to the same taxonomic rank by means of an arbitrarily sufficient similarity in morphology, ecological niche, or genomic content.

Specificity

The proportion of true negatives that are correctly identified. Specifically, specificity equals the number of true negatives divided by the sum of the number of true negatives and the number of false positives.
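
Specificity and its counterpart, Sensitivity, can both be computed directly from the counts in a two-by-two classification table. The sketch below uses hypothetical counts; the function names are chosen for illustration.

```python
def sensitivity(true_positives, false_negatives):
    """Proportion of actual positives that are correctly identified."""
    return true_positives / (true_positives + false_negatives)

def specificity(true_negatives, false_positives):
    """Proportion of actual negatives that are correctly identified."""
    return true_negatives / (true_negatives + false_positives)

# Hypothetical counts, for illustration only.
tp, fn, tn, fp = 45, 5, 80, 20
sens = sensitivity(tp, fn)  # 45 / (45 + 5)
spec = specificity(tn, fp)  # 80 / (80 + 20)
```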

Square Data Set

A SAS data set that contains a square matrix . Identical entities are found in rows and columns. This data set often contains similarity, dissimilarity, identity, or distance values, and diagonal cells containing the values of 1 or 0. Note that although square data sets contain square matrices, they sometimes contain slightly more columns than rows (so that rows can be labeled to match column labels and to associate additional information with these labels).

Stacked Data Set

A SAS data set that has all Observation s of interest stacked into a single Variable , or column. Rows are organized into groups of similar observations in which each row in the group differs in only one experimental parameter. Each group differs from other groups in additional parameters. Whereas a Tall Data Set typically has more rows than columns, a Stacked Data Set typically has many, many more rows than columns.

Standard Deviation

A statistical measure of how “spread out” the data are. It is calculated by summing the squared deviations of each Observation from the sample Mean , dividing that sum by (n-1), and taking the positive square root of the result.

Standard Error

The standard deviation of the sample mean. It is calculated by dividing the Standard Deviation by the square root of the Sample Size .
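
A minimal sketch of both calculations, assuming the (n-1) divisor used for the sample Standard Deviation. The function names and the data values are hypothetical.

```python
import math

def sample_standard_deviation(values):
    """Square root of the sum of squared deviations from the mean, divided by (n - 1)."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two observations are required")
    mean = sum(values) / n
    sum_squared_deviations = sum((v - mean) ** 2 for v in values)
    return math.sqrt(sum_squared_deviations / (n - 1))

def standard_error(values):
    """Sample standard deviation divided by the square root of the sample size."""
    return sample_standard_deviation(values) / math.sqrt(len(values))

# Hypothetical sample: mean 5, sum of squared deviations 32, n = 8.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
sd = sample_standard_deviation(sample)
se = standard_error(sample)
```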

Standard MedDRA Query (SMQ)

A group of Preferred Term (PT) s and Lowest Level Term (LLT) s relating to a particular medical condition or concept. It could also include High Level Term (HLT) s and High Level Group Term (HLGT) s, as well as hierarchies. Such a grouping is helpful in formulating a “case definition” and in data exploration, search, and retrieval. Each medical condition or group of related conditions has one or more individual SMQs. Terms listed in the SMQ define signs, symptoms, events, laboratory data, physical and physiological findings, and so on.

Standardization

Multiple meanings are possible:

- Transformation of a data set to have zero Mean and unit Variance .

- Making all regression coefficients have the same scale.

- Normalization .

Standardized Residual Plots

Residual plots , widely used in regression analyses, are useful in determining whether there are additional Variable s that should be included in the regression model. Residual plots also assist in outlier detection. More commonly, residual plots are used as diagnostic tools in deciding whether a distribution or model fits the data well. In linear regression, residuals are assumed to be normally distributed. Therefore, for convenience, they are transformed to the standardized form in standardized residual plots . See Standardized Residual Plots .

Statistic

A single measure of a sample attribute. Statistics are derived from the application of a function to sample data. An example is the sample median .

Strata Variable

A Variable that partitions the data into blocks with similar characteristics.

Support Vector Machine (SVM)

A supervised learning method that generates one or more hyperplanes in high-dimensional space to predict into which of two classes input data points are categorized.

Survival Curves

Plots of survival functions estimated for each subject. See Survival Curves .

Survival Plot

A plot summarizing the survival of patients in each experimental ARM over the course of a Clinical Research trial. See Survival Plot .

System Organ Class (SOC)

The highest level of the Medical Dictionary for Regulatory Activities (MedDRA) Hierarchy, above High Level Group Term (HLGT) . An example of an SOC is “Respiratory, thoracic, and mediastinal disorders”.

Tag SNP

A Single Nucleotide Polymorphism (SNP) in a high linkage disequilibrium region of a genome, representative of other specific SNPs. The use of Tag SNPs in a Genome-Wide Association Study (GWAS) can greatly reduce the computational burden of the underlying analyses. Also known as a haplotype-tag SNP ( htSNP ).

Tall Data Set

A SAS data set that has samples as columns and molecular entity (for example, marker, gene, clone, protein, or metabolite) as rows . Tall data sets are the transpose of Wide Data Set s. See Tall and Wide Data Sets for more information.

Target

In Microarray s, a target is a specific experimental or clinical sample of c DNA , c RNA , or Protein that is washed over a microarray chip (containing Probe s). Sequence presence or abundance is quantified by probe-target hybridization on the chip.

Tau Value

A nonparametric measure of association based on the number of concordances and discordances in paired Observation s. Concordance occurs when paired observations vary together, and discordance occurs when paired observations vary differently. Also, used for the truncated product p-Value adjustment method to indicate that there is at least one false Null Hypothesis among those with p-values less than tau when the null hypothesis is rejected.
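
Counting concordances and discordances can be sketched as follows. This example computes the simplest variant (often called tau-a), which divides the difference between the two counts by the total number of pairs and makes no adjustment for ties; which variant a given procedure reports is not specified here, so treat the choice as an assumption.

```python
from itertools import combinations

def kendall_tau_a(xs, ys):
    """Tau-a: (concordant - discordant) / total number of observation pairs."""
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(list(zip(xs, ys)), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1  # the pair varies together
        elif product < 0:
            discordant += 1  # the pair varies in opposite directions
        # product == 0 is a tie and is counted in neither group
    n = len(xs)
    total_pairs = n * (n - 1) // 2
    return (concordant - discordant) / total_pairs

# Hypothetical paired observations: two concordant pairs and one discordant pair.
tau = kendall_tau_a([1, 2, 3], [1, 3, 2])
```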

Test Data Set

See Holdout Data .

Test Statistic

A function of the data sample that reduces and summarizes the data to either one or a few values that can be used to conduct a Hypothesis Test .

Tissue

A collection of Cell s from the same origin that achieves a specific function. The cells need not be identical in morphology or genomic content, but must have identical function. Examples of animal and plant tissues include muscle and meristematic tissues, respectively.

Transcript

The sequence generated through transcription by using a DNA sequence as a template to create a complementary RNA sequence.

Transmission Disequilibrium Test (TDT)

A family-based association test used to map Binary Trait s .

Training Data Set

The portion of the initial data set that contains input values and target values that are used to develop a predictive model .

Trait

A characteristic possessed by an Organism . Traits can be determined by Genotype , Environment , or both.

Transcript Cluster

A consensus sequence of Nucleotide Base s made up of the cluster of Exon s transcribed from a particular strand of a defined region of the Genome .

Transformation

The process of applying a function to a Variable in order to adjust the variable's range, variability, or both.

Tree

The complete set of rules that are used to split data into a hierarchy of successive segments. A tree consists of branches and leaves, in which each set of leaves represents an optimal segmentation of the branches above them according to a statistical measure.

Tree Map

A graphic of hierarchical data represented by nested, tiled, colored rectangles. Every branch of the hierarchical tree is assigned a rectangle, which in turn is tiled with successively smaller rectangles corresponding to sub-branches. See Tree Map .

Truncated Product Method (TPM)

A method that smooths p-Value s over windows of markers for n Hypothesis Test s by taking the product of those p -values less than a specified cutoff value and evaluating the probability of this product under the overall hypothesis that all n hypotheses are true.
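
A minimal sketch of the core calculation follows. It assumes independent tests, so that p-values are uniform on (0, 1) when all hypotheses are true, and it estimates the probability by Monte Carlo simulation rather than by a closed-form expression. The smoothing over windows of markers is not shown, and the function names and default values are hypothetical.

```python
import math
import random

def truncated_product(p_values, tau):
    """Product of the p-values below the cutoff tau (1 if none are)."""
    return math.prod(p for p in p_values if p < tau)

def tpm_p_value(p_values, tau=0.05, n_sim=10000, seed=0):
    """Monte Carlo estimate of P(W <= observed W) when all n hypotheses are true.

    Assumes independent tests, so each null p-value is drawn as uniform(0, 1).
    """
    rng = random.Random(seed)
    observed = truncated_product(p_values, tau)
    n = len(p_values)
    hits = 0
    for _ in range(n_sim):
        simulated = [rng.random() for _ in range(n)]
        if truncated_product(simulated, tau) <= observed:
            hits += 1
    return hits / n_sim

p_values = [0.001, 0.02, 0.3, 0.6, 0.04]
w = truncated_product(p_values, tau=0.05)   # 0.001 * 0.02 * 0.04
adjusted = tpm_p_value(p_values, tau=0.05, n_sim=2000)
```

A small estimated probability suggests that the observed product is smaller than expected if all n hypotheses were true.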

t-statistic

A measure of how extreme a statistical estimate is. It is calculated by subtracting a reference or hypothesized value from your estimate and then dividing the difference by the Standard Error of the estimate.
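
Written as a formula, with symbols chosen here for illustration (the estimate is \hat{\theta}, the reference value is \theta_0, and SE is the standard error of the estimate):

```latex
t = \frac{\hat{\theta} - \theta_0}{\mathrm{SE}(\hat{\theta})}
```

For example, an estimate of 5.2 compared against a reference value of 5.0 with a standard error of 0.1 gives t = (5.2 - 5.0) / 0.1 = 2.0.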

t-test

A test that assesses the statistical difference between the Mean s of two different experimental groups. The Test Statistic follows a Student’s t distribution if the Null Hypothesis is true.

If only one Variable is chosen (one-sample t-test ), the null hypothesis is that “the population mean is equal to the given mean”.
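
The two-group case can be sketched as follows. This minimal example assumes equal variances in the two groups (the pooled-variance form of the test) and returns only the test statistic and its degrees of freedom, not a p-value. The function name and data are hypothetical.

```python
import math
import statistics

def pooled_t_statistic(group_a, group_b):
    """Two-sample t statistic with pooled variance; returns (t, degrees of freedom)."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    df = n_a + n_b - 2
    # Weighted average of the two sample variances
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / df
    # Standard error of the difference between the two means
    se = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / se, df

# Means 3 and 4, pooled variance 2.5, standard error 1, so t is about -1 with 8 df
t, df = pooled_t_statistic([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
```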

Type I Error

An incorrect decision made when a test rejects a true Null Hypothesis ( H 0 ). This is comparable to a false positive error. Type I error rate is denoted by Alpha , and is referred to as the size of the test .

Type II Error

An incorrect decision made when a test fails to reject a false Null Hypothesis ( H 0 ). This is comparable to a false negative error. Type II error rate is denoted by Beta , and is related to the Power of a test ( power = 1- beta ).

Variable

A column (vertical component) in a SAS data set. The data values for each variable describe a single characteristic for all Observation s.

Variance

A measure of deviation of a group of samples from the mean. It is calculated by squaring the Standard Deviation .
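
The relationship to the standard deviation can be checked with a short sketch. It assumes the sample (n - 1 denominator) versions of both quantities, and the data values are hypothetical.

```python
import math
import statistics

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

sd = statistics.stdev(values)        # sample standard deviation
var = statistics.variance(values)    # sample variance

# The variance equals the squared standard deviation, up to floating-point rounding
same = math.isclose(sd ** 2, var)
```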

Venn Diagram

A graphical representation composed of two or more overlapping circles that shows all of the hypothetical relationships between two or more data sets.

Vital Signs (VS)

Measures of physiological statistics used to assess basic body functions. The most common vital signs include body temperature, heart rate (or pulse), blood pressure, and respiratory rate.

Volcano Plot

A Scatterplot of the negative log 10 -transformed p-Value s derived from Gene -specific t-test against the log 2 -fold change in Expression . Genes whose expression is decreased lie to the left of the Mean ; genes whose expression is increased lie to the right of the mean. Genes with statistically significant differential expression lie above a horizontal threshold. This plot provides an effective means for visualizing the direction, magnitude, and significance of changes in gene expression. See Volcano Plot .
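
The plotted coordinates can be computed as in the following minimal sketch. It assumes that each gene is summarized by a fold change and a p-value, takes a log 2 fold change of zero as the dividing line between decreased and increased expression, and uses a hypothetical p-value cutoff of 0.01 for the horizontal threshold. The gene names and values are made up for illustration.

```python
import math

def volcano_points(genes, p_cutoff=0.01):
    """Return (name, x, y, direction, significant) for each gene.

    x is the log2 fold change; y is the negative log10 p-value.
    """
    threshold = -math.log10(p_cutoff)   # height of the horizontal threshold
    points = []
    for name, (fold_change, p_value) in genes.items():
        x = math.log2(fold_change)
        y = -math.log10(p_value)
        if x > 0:
            direction = "increased"
        elif x < 0:
            direction = "decreased"
        else:
            direction = "unchanged"
        points.append((name, x, y, direction, y > threshold))
    return points

# Hypothetical (fold change, p-value) pairs
genes = {"geneA": (4.0, 0.001), "geneB": (0.5, 0.2), "geneC": (0.25, 0.0001)}
points = volcano_points(genes)
```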

WHERE Clause

A SAS statement that enables you to filter a set of Observation s so that only the subset of data meeting the specific filtering criteria are considered in the analysis.

Wide Data Set

A SAS data set that has samples as rows and molecular entity (for example, marker, gene, clone, protein, or metabolite) as columns . Wide data sets are the transpose of Tall Data Set s. See Tall and Wide Data Sets for more information.

Wilcoxon Signed-rank Test

A nonparametric statistical hypothesis test used when comparing two related samples or repeated measurements on a single sample to assess whether their population Mean ranks differ. It is appropriate as an alternative to the paired Student’s t-test when the population is not normally distributed or the data is ordinal.
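
The rank sums on which the test is based can be computed as in this minimal sketch. It drops zero differences and assigns average ranks to tied absolute differences, which is one common convention and an assumption here. It does not compute a p-value, and the function name and data are hypothetical.

```python
def signed_rank_sums(before, after):
    """Return (W_plus, W_minus): rank sums of positive and negative differences."""
    # Paired differences, dropping pairs with no change
    diffs = [a - b for b, a in zip(before, after) if a != b]
    ordered = sorted(diffs, key=abs)
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        # Find the run of tied absolute differences starting at position i
        j = i
        while j + 1 < len(ordered) and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        average_rank = (i + j) / 2 + 1   # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    w_plus = sum(r for d, r in zip(ordered, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(ordered, ranks) if d < 0)
    return w_plus, w_minus

# Differences are 2, -1, 4, 0, -3; the zero is dropped, leaving ranks 1 to 4
w_plus, w_minus = signed_rank_sums([10, 12, 9, 14, 11], [12, 11, 13, 14, 8])
print(w_plus, w_minus)  # 6.0 4.0
```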

Wizard

An interactive utility program that consists of a series of dialog boxes, windows, or pages. You supply information in each dialog box, window, or page, and the wizard uses that information to perform a task.

Workflow

A series of Process es run in a specified order, whose output is collected in a Journal . Given a constant basic experimental design and analysis objectives, a workflow can be used repeatedly with different data sets.

XML

See Extensible Markup Language (XML) .

1

Bate, A, et al. (1998) A Bayesian neural network method for adverse drug reaction signal generation. Eur J Clin Pharmacol 54:315-321; Gould, AL. (2003) Practical pharmacovigilance analysis strategies. Pharmacoepidemiology and Drug Safety 12: 559–574.

2

Mehrotra DV, Heyse JH. (2004) Use of the false discovery rate for evaluating clinical safety data. Statistical Methods in Medical Research 13:227-238.

3

DuMouchel W. (1999) Bayesian data mining in large frequency tables with an application to the FDA spontaneous reporting system. The American Statistician 53:177-190.

4

Evans SJW, Waller PC, Davis S. (2001) Use of proportional reporting ratios (PRRs) for signal generation from spontaneous adverse drug reaction reports. Pharmacoepidemiology and Drug Safety 10:483-486.

5

Meyboom RHB, Egberts ACG, Edwards IR, Hekster YA, de Koning FHP, Gribnau FWJ. (1997) Principles of signal detection in pharmacovigilance. Drug Safety 16:355-365.