The Cluster Subjects Across Study Sites report is used to identify similar subjects. It first constructs a cross-domain data set using as much data as possible (subject to user options). It then calculates Euclidean distances to build a distance matrix and performs hierarchical clustering of subjects across all of the study centers. If there are multiple measurements for a visit or time point, findings values are averaged by USUBJID, test code, visit number, and time point (if available). The goal is to identify pairs of subjects separated by a very small distance, which could indicate that these subjects are in fact the same individual enrolled at multiple sites.
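The averaging step described above can be sketched in a few lines. This is a minimal illustration, not the report's implementation: the record layout, the sample values, and the function names `average_findings` and `to_subject_vectors` are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical long-format findings records:
# (USUBJID, test code, visit number, time point or None, value)
records = [
    ("S01", "SYSBP", 1, None, 120.0),
    ("S01", "SYSBP", 1, None, 124.0),  # repeated measurement at the same visit
    ("S01", "HR", 1, None, 70.0),
    ("S02", "SYSBP", 1, None, 118.0),
]

def average_findings(rows):
    """Average values that share subject, test code, visit, and time point."""
    groups = defaultdict(list)
    for usubjid, testcd, visitnum, tpt, value in rows:
        groups[(usubjid, testcd, visitnum, tpt)].append(value)
    return {key: mean(values) for key, values in groups.items()}

def to_subject_vectors(averaged):
    """Pivot to one dict per subject, keyed by (test code, visit, time point)."""
    subjects = defaultdict(dict)
    for (usubjid, testcd, visitnum, tpt), value in averaged.items():
        subjects[usubjid][(testcd, visitnum, tpt)] = value
    return dict(subjects)

averaged = average_findings(records)
vectors = to_subject_vectors(averaged)
```

Each subject ends up with one value per test, visit, and time point, which is the form needed before any pairwise distance can be computed.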
Running this report using the Nicardipine sample setting and default options generates the output shown below. This report uses pre-dosing information with the goal of identifying subjects that have enrolled at two or more clinical sites.
The Cluster Subjects Across Study Sites report shows the results of clustering the subjects on the basis of different combinations of covariates (demographic groups in this example). The results for each grouping are presented in a separate section. This report initially shows two sections: Between-Subject Distance Summary and Subgroup Clustering. Use the available options in each section to drill down into the data.
• Box plots are presented for all pairwise distances between subjects in the selected population. Pairs are limited based on selections from the Cluster subjects matching these criteria panel of the dialog.
• One Box Plot of Minimum Between-Subject Distances for Each Site. The minimum distance from each covariate subgroup is presented in the box plot to the right.
• A Local Data Filter to subset histograms to data of interest is available.
The data filters for this report include an option to specify the Number of Overlapping Variables. One of the challenges in this report is that distances can be very small, even zero. This can happen when a patient has missing values for a number of variables, because missing values do not contribute to the distance calculation. By default, SAS calculates the distance using only the variables that are non-missing for both subjects in a pair. The Number of Overlapping Variables filter makes it easier to subset to pairs that share a high number of non-missing variables.
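To see why the overlap count matters, consider the sketch below. It computes a Euclidean distance over only the variables that are non-missing for both subjects and returns the number of variables used. The function name `overlap_distance`, the variable names, and the values are hypothetical, and the sketch applies no standardization or rescaling for missingness, which the actual software may do.

```python
import math

def overlap_distance(a, b):
    """Return (distance, n_overlap) using variables non-missing in both subjects."""
    shared = [k for k in a if k in b and a[k] is not None and b[k] is not None]
    if not shared:
        return None, 0  # no common variables, so no distance can be computed
    dist = math.sqrt(sum((a[k] - b[k]) ** 2 for k in shared))
    return dist, len(shared)

# Hypothetical subjects; None marks a missing value.
s1 = {"AGE": 50, "HEIGHT": 170.0, "WEIGHT": 80.0, "SYSBP": None}
s2 = {"AGE": 50, "HEIGHT": None, "WEIGHT": None, "SYSBP": 120.0}
s3 = {"AGE": 53, "HEIGHT": 174.0, "WEIGHT": 80.0, "SYSBP": 118.0}

d12, n12 = overlap_distance(s1, s2)  # only AGE overlaps
d13, n13 = overlap_distance(s1, s3)  # AGE, HEIGHT, and WEIGHT overlap

# Keep only pairs supported by enough shared variables (assumed cutoff).
min_overlap = 3
pairs = {("s1", "s2"): (d12, n12), ("s1", "s3"): (d13, n13)}
well_supported = {p: v for p, v in pairs.items() if v[1] >= min_overlap}
```

Here the first pair has a distance of zero that rests on a single shared variable, so filtering on the overlap count removes it and keeps the better-supported pair.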
One or more Subgroup Clustering sections: Only one section is opened initially. The name of this section depends on the covariate values used (as specified in the Cluster subjects matching these criteria panel) and on the subgroup identified as having the minimum pairwise distance. Other subgroup results can be opened from the Results Sections menu.
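The idea of finding the subgroup with the minimum pairwise distance can be sketched as follows. The subject tuples, the covariate values, and the function name `min_distance_by_subgroup` are assumptions for illustration only; the sketch requires Python 3.8 or later for `math.dist`.

```python
import math
from itertools import combinations

# Hypothetical subjects: (USUBJID, sex, race, pre-dose values)
subjects = [
    ("S01", "F", "WHITE", (50.0, 170.0)),
    ("S02", "F", "WHITE", (50.0, 171.0)),
    ("S03", "F", "WHITE", (62.0, 160.0)),
    ("S04", "M", "WHITE", (45.0, 180.0)),
    ("S05", "M", "WHITE", (48.0, 184.0)),
]

def min_distance_by_subgroup(rows):
    """Smallest pairwise Euclidean distance within each (sex, race) subgroup."""
    best = {}
    for a, b in combinations(rows, 2):
        key = (a[1], a[2])
        if key != (b[1], b[2]):
            continue  # compare subjects only within the same covariate subgroup
        dist = math.dist(a[3], b[3])
        if key not in best or dist < best[key][0]:
            best[key] = (dist, a[0], b[0])
    return best

minima = min_distance_by_subgroup(subjects)
# The subgroup whose closest pair is closest overall.
closest_group = min(minima, key=lambda k: minima[k][0])
```

In this toy data the closest pair is found among the female subjects, so that subgroup would be the natural one to inspect first.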
The Sex = F, Race = BLACK OR AFRICAN AMERICAN subgroup clustering section is shown below:
• A Box Plot showing all pairwise distances between the subjects in this subgroup across all sites. Smaller distances indicate individuals who are more similar based on the pre-dose information selected for use from the dialog. After using the data filter to subset to pairs with small age, height, and weight differences, you can highlight those pairs in the hierarchical clustering profile or examine them in the data table to assess similarity.
• A Dendrogram showing the Hierarchical Clustering performed to identify subsets of subjects that might be very similar, for example, a subject that has attended at least three sites. Points indicating highly similar pairs of subjects can be selected from the box plot, and these rows can be highlighted in the clustering heat map.
• Show Subjects: Select subjects and click to open the ADSL data (or DM data if ADSL is unavailable) for the selected subjects.
• Subset Clustering: On a subgroup clustering page, click to restrict the clustering to the subjects in the pairs selected from the corresponding box plot.
• Revert Clustering: Click to return a subset clustering to the original state where all subjects are clustered.
Output includes one summary data set (named csass_sum_XXX, by default) containing one record per subject with pre-dosing data; one data set of all pairwise distances within the covariate subgroups (named csass_alldist_XXX, by default); one data set containing minimum pairwise distances for each covariate subgroup (named csass_mindist_XXX, by default); one data set per covariate subgroup containing pairwise distances (named csass_p_Y_XXX, by default, where Y is indexed from 1 to the number of covariate subgroups); and one data set per covariate subgroup containing the distance matrix of subjects within the covariate subgroup (named csass_Y_XXX, by default, where Y is indexed from 1 to the number of covariate subgroups).
• Click the Options arrow to reopen the completed report dialog used to generate this output.
By default, the Include Age option is selected. You can opt to Include Sex as well, or to ignore either of these if you choose.
You can opt to Include findings domains data. While all tests from all findings domains are included in the analysis by default, you can restrict the analysis to specified Findings Tests only. You can also select analysis units, set cutoff values for including events or interventions and for summarizing subgroups, and specify whether variables with missing values are allowed. Use the Analyze findings using: option to choose either standard units or the original units.
Unscheduled visits can occur for a variety of reasons and can complicate analyses. By default, these are excluded from this analysis. However, by unchecking the Remove unscheduled visits box, you have the option of including them.
You can opt to Include intervention domains data in your analysis. By default, all intervention domains are included. However, you can use the Subset of Domains to Analyze for Interventions option to restrict the analysis to specific domains.
You can opt to Include event domains data in your analysis. By default, all event domains are included. However, you can use the Subset of Domains to Analyze for Events option to restrict the analysis to specific domains. Use the Include events or interventions experienced by at least this percent of patients: option to specify a minimum threshold for including an event or intervention in the analysis.
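The effect of a percent-of-patients threshold can be sketched as below. The function name `frequent_terms`, the event terms, and the study size are hypothetical, and the sketch assumes the percentage is taken over all subjects in the analysis.

```python
from collections import defaultdict

def frequent_terms(occurrences, n_subjects, min_percent):
    """Keep terms experienced by at least min_percent of subjects."""
    subjects_by_term = defaultdict(set)
    for usubjid, term in occurrences:
        subjects_by_term[term].add(usubjid)  # count each subject once per term
    return sorted(
        term for term, ids in subjects_by_term.items()
        if 100.0 * len(ids) / n_subjects >= min_percent
    )

# Hypothetical event records for a study of 10 subjects.
events = [
    ("S01", "HEADACHE"), ("S01", "HEADACHE"), ("S02", "HEADACHE"),
    ("S03", "HEADACHE"), ("S04", "NAUSEA"),
]
kept = frequent_terms(events, n_subjects=10, min_percent=20)
```

With a 20 percent threshold, a term reported by three of ten subjects is retained, while a term reported by one subject is dropped. Repeated reports from the same subject count only once.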
The Summarize subgroups with at least this many subjects option enables you to generate summaries of significant subgroups of patients.
By default, specified subjects are clustered by Sex and Race using Ward’s Hierarchical Clustering Method. Available options enable you to change both the criteria for clustering (adding Country, for example) and the clustering method used.
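For readers unfamiliar with Ward's method, the following is a deliberately naive sketch of the criterion: at each step it merges the two clusters whose union gives the smallest increase in within-cluster sum of squares. It assumes complete data with no missing values, uses made-up subject vectors, and recomputes centroids on every pass. It is not the implementation used by the report, and the merge costs it prints may be scaled differently from the heights shown by the software.

```python
def centroid(points):
    """Component-wise mean of a list of equal-length points."""
    return [sum(col) / len(points) for col in zip(*points)]

def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    ca, cb = centroid(a), centroid(b)
    sq = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * sq

def ward_merges(data):
    """Return the merge order as (members_a, members_b, cost) tuples."""
    clusters = {(name,): [vec] for name, vec in data.items()}
    merges = []
    while len(clusters) > 1:
        keys = sorted(clusters)
        pairs = [(ward_cost(clusters[p], clusters[q]), p, q)
                 for i, p in enumerate(keys) for q in keys[i + 1:]]
        cost, p, q = min(pairs)  # cheapest merge under Ward's criterion
        clusters[p + q] = clusters.pop(p) + clusters.pop(q)
        merges.append((p, q, cost))
    return merges

# Hypothetical pre-dose vectors (for example, age and height) per subject.
data = {"S01": [50.0, 170.0], "S02": [50.0, 171.0],
        "S03": [62.0, 160.0], "S04": [61.0, 162.0]}
merges = ward_merges(data)
```

The earliest merges join the most similar subjects, which is why pairs that merge at a very low cost are the candidates worth reviewing as possible duplicate enrollments.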