Clustering Method

Use the drop-down menu to select the method used for clustering.

Available options are described in the table below:

Clustering Method

Description

Average

Choose this method to set the distance between clusters to the average distance between pairs of observations.

This method tends to join clusters with small variances and is biased toward producing clusters with the same variance.¹
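As a rough illustration of the average-linkage definition above, the sketch below computes the mean distance over all cross-cluster pairs. It assumes plain Euclidean distance as the dissimilarity; the function name `average_linkage` and the sample points are hypothetical and are not part of PROC CLUSTER.

```python
from itertools import product
from math import dist

def average_linkage(cluster_a, cluster_b):
    """Mean Euclidean distance over every pair with one point in each cluster."""
    pairs = list(product(cluster_a, cluster_b))
    return sum(dist(p, q) for p, q in pairs) / len(pairs)

# Hypothetical sample data: two small 2-D clusters.
a = [(0.0, 0.0), (0.0, 2.0)]
b = [(3.0, 0.0), (3.0, 2.0)]
d_average = average_linkage(a, b)
print(round(d_average, 4))
```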

Centroid

Choose this method to set the distance between clusters to the squared Euclidean distance between the means of each cluster.²

This method is more robust to outliers than most other hierarchical clustering methods.
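The centroid distance described above can be sketched in a few lines. This is a minimal illustration, assuming coordinate data; the helper names `centroid` and `centroid_distance` and the sample points are hypothetical.

```python
def centroid(cluster):
    """Coordinate-wise mean of the observations in a cluster."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def centroid_distance(cluster_a, cluster_b):
    """Squared Euclidean distance between the two cluster means."""
    mean_a, mean_b = centroid(cluster_a), centroid(cluster_b)
    return sum((x - y) ** 2 for x, y in zip(mean_a, mean_b))

# Hypothetical sample data.
a = [(0.0, 0.0), (0.0, 2.0)]   # mean is (0, 1)
b = [(3.0, 0.0), (3.0, 4.0)]   # mean is (3, 2)
d_centroid = centroid_distance(a, b)   # (0 - 3)^2 + (1 - 2)^2
print(d_centroid)
```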

Complete

Choose this method to set the distance between clusters to the maximum distance between an observation in one cluster and an observation in the other cluster.²

This method is biased toward producing clusters of equivalent diameters and can be distorted by even moderate outliers.

The TRIM=n option is recommended (n = the percentage of points with the lowest estimated probability densities to omit from the analysis).
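To see why outliers distort complete linkage, the following sketch computes the maximum cross-cluster distance with and without a single stray point. It assumes Euclidean distance; `complete_linkage` and the sample points are hypothetical names and data chosen for illustration.

```python
from itertools import product
from math import dist

def complete_linkage(cluster_a, cluster_b):
    """Largest Euclidean distance between a point in one cluster and a point in the other."""
    return max(dist(p, q) for p, q in product(cluster_a, cluster_b))

# Hypothetical sample data: two compact clusters on a line.
a = [(0.0, 0.0), (1.0, 0.0)]
b = [(4.0, 0.0), (5.0, 0.0)]
b_with_outlier = b + [(4.0, 12.0)]   # one stray observation added to b

d_clean = complete_linkage(a, b)
d_outlier = complete_linkage(a, b_with_outlier)
print(d_clean, round(d_outlier, 3))
```

A single outlying observation sets the whole between-cluster distance, which is the motivation for trimming low-density points before clustering.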

Density

Choose this method to use nonparametric probability density estimates to generate the clusters.

Density linkages are calculated in two steps:

1. A new dissimilarity measure, d*, is computed based on density estimates and adjacencies.

2. A single linkage cluster analysis is performed using d*.

Specify the type of density linkage to perform in the Additional PROC CLUSTER Options field. The available types are kth-nearest neighbor, uniform kernel, and hybrid.

You must specify one of the following options:

• K=n (where n = the number of neighbors for kth-nearest neighbor density estimation),

• R=n (where n = the radius of the sphere of support for uniform-kernel density estimation), or

• HYBRID to specify the Wong hybrid clustering method.
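The first of the two density-linkage steps can be sketched for the kth-nearest neighbor case. This is an illustrative approximation, not the procedure's actual code. It assumes the following: Euclidean distance, distinct points, k smaller than the number of points, a density estimate equal to the proportion of observations inside the closed sphere (counting the point itself) divided by the sphere's volume, and d* defined as the mean of the two inverse densities when the points are adjacent and infinite otherwise. The function name `knn_density_dissimilarity` is hypothetical.

```python
from math import dist, gamma, inf, pi

def knn_density_dissimilarity(points, k):
    """Return a d* matrix for k-th-nearest-neighbor density linkage (sketch)."""
    n, dim = len(points), len(points[0])
    d = [[dist(p, q) for q in points] for p in points]
    # r[i]: distance from point i to its k-th nearest other observation
    # (index 0 of each sorted row is the point itself at distance 0).
    r = [sorted(row)[k] for row in d]
    # Volume of a unit ball in `dim` dimensions.
    unit_ball = pi ** (dim / 2) / gamma(dim / 2 + 1)
    density = []
    for i in range(n):
        inside = sum(1 for x in d[i] if x <= r[i])   # closed sphere, includes point i
        density.append((inside / n) / (unit_ball * r[i] ** dim))
    d_star = [[0.0 if i == j else inf for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(n):
            # Points are adjacent if either lies within the other's k-NN sphere.
            if i != j and d[i][j] <= max(r[i], r[j]):
                d_star[i][j] = 0.5 * (1 / density[i] + 1 / density[j])
    return d_star

# Hypothetical 1-D sample data: three close points and one distant point.
points = [(0.0,), (1.0,), (2.0,), (10.0,)]
d_star = knn_density_dissimilarity(points, k=1)
for row in d_star:
    print([round(x, 3) for x in row])
```

Pairs that are not adjacent receive an infinite d*, so the single linkage analysis in the second step can only join clusters through adjacent points.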

Flexible

Choose this option to use the flexible-beta method developed by Lance and Williams (1967)³.

The BETA=n option is recommended (n = the beta parameter, usually between 0 and -1; the default is -0.25).
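The effect of the beta parameter can be illustrated with the Lance-Williams style update for flexible-beta linkage, in which the distance from a cluster J to the merger of clusters K and L is (D_JK + D_JL)(1 - beta)/2 + beta * D_KL. The sketch below assumes that form of the update; `flexible_update` and the input distances are hypothetical.

```python
def flexible_update(d_jk, d_jl, d_kl, beta=-0.25):
    """Distance from cluster J to the merged cluster (K, L) under flexible-beta linkage."""
    return (d_jk + d_jl) * (1 - beta) / 2 + beta * d_kl

# Hypothetical distances between three clusters J, K, and L.
d_default = flexible_update(d_jk=4.0, d_jl=6.0, d_kl=2.0)            # beta = -0.25
d_zero = flexible_update(d_jk=4.0, d_jl=6.0, d_kl=2.0, beta=0.0)     # plain mean of the two
print(d_default, d_zero)
```

More negative values of beta inflate the distances to newly merged clusters, which tends to spread the merges out across the hierarchy.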

McQuitty

Choose this option to use the combinatorial method developed independently by Sokal and Michener (1958)¹ and McQuitty (1966)⁴.
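A minimal sketch of the difference between McQuitty's method and average linkage follows, assuming the usual combinatorial updates: McQuitty's method takes the simple mean of the two previous distances, whereas average linkage weights them by cluster size. The function names and inputs are hypothetical.

```python
def mcquitty_update(d_jk, d_jl):
    """Distance from J to merged (K, L): simple mean, ignoring cluster sizes."""
    return (d_jk + d_jl) / 2

def average_update(d_jk, d_jl, n_k, n_l):
    """Distance from J to merged (K, L) under average linkage: size-weighted mean."""
    return (n_k * d_jk + n_l * d_jl) / (n_k + n_l)

# Hypothetical case: cluster K has 1 observation, cluster L has 3.
d_mcquitty = mcquitty_update(2.0, 6.0)
d_average = average_update(2.0, 6.0, n_k=1, n_l=3)
print(d_mcquitty, d_average)
```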

Median

Choose this option to use the median method developed by Gower (1967)⁵.
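One way to read the median method is geometrically: the merged cluster is represented by the midpoint of the two merged clusters' representative points, regardless of their sizes. The sketch below assumes squared Euclidean distances and the combinatorial update D_JM = (D_JK + D_JL)/2 - D_KL/4, and checks it against the direct squared distance to the midpoint. All names and points are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(p, q))

def median_update(d_jk, d_jl, d_kl):
    """Squared distance from J to the merged cluster (K, L) under the median method."""
    return (d_jk + d_jl) / 2 - d_kl / 4

# Hypothetical representative points of clusters J, K, and L.
pt_j, pt_k, pt_l = (2.0, 3.0), (0.0, 0.0), (4.0, 0.0)
midpoint = tuple((a + b) / 2 for a, b in zip(pt_k, pt_l))

updated = median_update(sq_dist(pt_j, pt_k), sq_dist(pt_j, pt_l), sq_dist(pt_k, pt_l))
direct = sq_dist(pt_j, midpoint)
print(updated, direct)
```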

Single

Choose this method to set the distance between two clusters to the minimum distance between an observation in one cluster and an observation in the other cluster.

Because there are no constraints on the shape of clusters, single linkage sacrifices performance in the recovery of compact clusters in return for the ability to detect elongated and irregular clusters. Single linkage tends to chop off the tails of distributions before separating the main clusters.
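The ability of single linkage to follow elongated clusters can be shown with a toy agglomeration that merges the closest pair of clusters until the minimum distance exceeds a cutoff. This is only a sketch under assumptions: Euclidean distance, a brute-force search, and a hypothetical `cutoff` stopping rule that is not a PROC CLUSTER option.

```python
from itertools import combinations, product
from math import dist

def single_linkage(cluster_a, cluster_b):
    """Smallest Euclidean distance between a point in one cluster and a point in the other."""
    return min(dist(p, q) for p, q in product(cluster_a, cluster_b))

def cluster_until(points, cutoff):
    """Merge the closest pair of clusters until their distance exceeds `cutoff`."""
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]))
        if single_linkage(clusters[i], clusters[j]) > cutoff:
            break
        clusters[i] = clusters[i] + clusters[j]   # i < j, so deleting j keeps i valid
        del clusters[j]
    return clusters

# Hypothetical data: an elongated chain of six points and a separate pair.
chain = [(float(x), 0.0) for x in range(6)]
pair = [(0.0, 5.0), (1.0, 5.0)]
clusters = cluster_until(chain + pair, cutoff=1.5)
print([len(c) for c in clusters])
```

The chain stays together as one cluster even though its end points are five units apart, because each merge only needs one close pair of observations.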

Twostage

This method is a modification of density linkage that ensures that all points are assigned to modal clusters before the modal clusters are permitted to join. The CLUSTER procedure supports the same three varieties of two-stage density linkage as it does for ordinary density linkage: kth-nearest neighbor, uniform kernel, and hybrid.

You must specify one of the following options:

• K=n (where n = the number of neighbors for kth-nearest neighbor density estimation),

• R=n (where n = the radius of the sphere of support for uniform-kernel density estimation), or

• HYBRID to specify the Wong hybrid clustering method.

Ward

Choose this method to set the distance between clusters to the ANOVA sum of squares between the two clusters, summed over all the variables. At each generation, the within-cluster sum of squares is minimized over all partitions that can be obtained by merging two clusters from the previous generation. The sums of squares are easier to interpret when they are divided by the total sum of squares to give the proportions of variance (squared semipartial correlations).

This method joins clusters to maximize the likelihood at each level of the hierarchy under the assumptions of multivariate normal mixtures, spherical covariance matrices, and equal sampling probabilities.

This method tends to join clusters with a small number of observations and is biased toward producing clusters with approximately the same number of observations. It is also very sensitive to outliers.²

The TRIM=n option is recommended (n = the percentage of points with the lowest estimated probability densities to omit from the analysis).
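The Ward distance can be checked numerically. The sketch below assumes coordinate data and the closed form ||mean_K - mean_L||² / (1/N_K + 1/N_L), and compares it with the increase in within-cluster sum of squares caused by the merge. The function names and sample points are hypothetical.

```python
def mean(cluster):
    """Coordinate-wise mean of a cluster."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def sum_of_squares(cluster):
    """Within-cluster sum of squared deviations from the cluster mean."""
    m = mean(cluster)
    return sum((x - c) ** 2 for point in cluster for x, c in zip(point, m))

def ward_distance(cluster_a, cluster_b):
    """Between-cluster ANOVA sum of squares for merging the two clusters."""
    gap = sum((x - y) ** 2 for x, y in zip(mean(cluster_a), mean(cluster_b)))
    return gap / (1 / len(cluster_a) + 1 / len(cluster_b))

# Hypothetical sample data.
a = [(0.0, 0.0), (0.0, 2.0)]
b = [(3.0, 0.0), (3.0, 2.0)]

d_ward = ward_distance(a, b)
increase = sum_of_squares(a + b) - sum_of_squares(a) - sum_of_squares(b)
# Here a + b is the whole data set, so its sum of squares is the total sum of squares.
semipartial_r2 = d_ward / sum_of_squares(a + b)
print(d_ward, increase, round(semipartial_r2, 4))
```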

1. Sokal, R.R. and Michener, C.D. (1958) A statistical method for evaluating systematic relationships. University of Kansas Science Bulletin 38: 1409-1438.

2. Milligan, G.W. (1980) An examination of the effect of six types of error perturbation on fifteen clustering algorithms. Psychometrika 45: 325-342.

3. Lance, G.N. and Williams, W.T. (1967) A general theory of classificatory sorting strategies. I. Hierarchical systems. Computer Journal 9: 373-380.

4. McQuitty, L.L. (1966) Similarity analysis by reciprocal pairs for discrete and continuous data. Educational and Psychological Measurement 26: 825-831.

5. Gower, J.C. (1967) A comparison of some methods of cluster analysis. Biometrics 23: 623-637.

To Specify a Clustering Method:

Check the Perform SAS-based clustering on the Distance Matrix box.

Select the desired clustering method using the drop-down menu.

For Additional Information

Refer to the SAS PROC CLUSTER documentation for more information.