

Publication date: 07/24/2024

Overview of Platforms for Clustering Observations

Clustering is a multivariate technique that groups together observations that share similar values across a number of variables. Typically, observations are not scattered evenly through p-dimensional space, where p is the number of variables. Instead, the observations form clumps, or clusters. Identifying these clusters provides you with a deeper understanding of your data.

Note: JMP also provides a platform that enables you to cluster variables. See “Cluster Variables”.

JMP provides four platforms that you can use to cluster observations:

Hierarchical Cluster is useful for both small and large data tables and allows character data. Hierarchical clustering combines rows in a hierarchical sequence that is portrayed as a tree. You can choose the number of clusters that is most appropriate for your data after the tree is built. See “Hierarchical Cluster”.
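Outside of JMP, the same idea can be sketched with SciPy’s hierarchical clustering routines: build the full tree first, then cut it into however many clusters you want. The simulated data, the Ward method, and the choice of three clusters below are illustrative assumptions, not the JMP implementation.

```python
# Minimal sketch of hierarchical (Ward) clustering with SciPy.
# The toy data and the choice of three clusters are assumptions for illustration.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal((0, 0), 1, (50, 2)),   # three loose clumps in two dimensions
               rng.normal((5, 5), 1, (50, 2)),
               rng.normal((0, 5), 1, (50, 2))])

Z = linkage(X, method="ward")                    # build the full tree
labels = fcluster(Z, t=3, criterion="maxclust")  # choose the number of clusters afterward
print(np.bincount(labels)[1:])                   # rows assigned to each cluster
```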

K Means Cluster is appropriate for larger tables, up to millions of rows, and allows only numeric data. You need to specify the number of clusters, k, in advance. The algorithm starts with initial guesses for the cluster seed points and then iterates, alternately assigning points to clusters and recalculating the cluster centers. See “K Means Cluster”.
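The alternating steps described above can be written out in a few lines of NumPy, which may help make the algorithm concrete. The toy data, k = 3, and the random choice of starting seeds are assumptions for the example; this is a sketch of the general method, not JMP’s implementation.

```python
# Minimal sketch of the k-means loop: assign rows to the nearest center,
# then recompute each center as the mean of its rows, until nothing moves.
import numpy as np

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.5, (100, 2)) for m in (0, 4, 8)])
k = 3
centers = X[rng.choice(len(X), size=k, replace=False)]  # initial seed points

for _ in range(100):
    # assignment step: distance from every row to every center
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    # update step: recalculate each center (keep the old one if a cluster is empty)
    new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                            for j in range(k)])
    if np.allclose(new_centers, centers):
        break
    centers = new_centers

print(np.round(centers, 2))
```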

Normal Mixtures is appropriate when your data come from a mixture of multivariate normal distributions that might overlap; it allows only numeric data. You need to specify the number of clusters in advance. Maximum likelihood, computed with the EM algorithm, is used to estimate the mixture proportions and the means, standard deviations, and correlations jointly, and each point is assigned a probability of belonging to each group. For data that contain multivariate outliers, you can include an outlier cluster with an assumed uniform distribution. See “Normal Mixtures”.
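A mixture of this kind can be sketched with scikit-learn’s GaussianMixture, which is also fit by EM and reports mixture proportions, component means, and per-row membership probabilities. The simulated two-component data and the settings below are assumptions for the example, not the JMP Normal Mixtures platform.

```python
# Minimal sketch of a two-component normal mixture fit by EM (scikit-learn).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
X = np.vstack([rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 1.0]], 200),
               rng.multivariate_normal([3, 3], [[1.0, -0.3], [-0.3, 1.0]], 100)])

gm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)
print(gm.weights_)               # estimated mixture proportions
print(gm.means_)                 # estimated component means
print(gm.predict_proba(X[:3]))   # probability of each group for the first few rows
```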

Latent Class Analysis is appropriate when most of your variables are categorical. You need to specify the number of clusters in advance. The algorithm fits a model that assumes a multinomial mixture distribution. A maximum likelihood estimate of cluster membership is calculated for each observation. An observation is classified into the cluster for which its probability of membership is the largest. See “Latent Class Analysis”.
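For intuition about the multinomial mixture idea, a two-class latent class model for yes/no items can be fit with a short EM loop in NumPy, as sketched below. The simulated survey-style responses, the number of classes, and the fixed number of EM iterations are assumptions for the example; this is not the JMP implementation.

```python
# Minimal sketch of latent class analysis for binary items, fit by EM.
import numpy as np

rng = np.random.default_rng(3)
# Simulate 300 respondents answering 5 yes/no items from two latent classes.
true_p = np.array([[0.9, 0.8, 0.7, 0.2, 0.1],
                   [0.1, 0.2, 0.3, 0.8, 0.9]])
z = rng.integers(0, 2, 300)
X = (rng.random((300, 5)) < true_p[z]).astype(float)

k = 2
pi = np.full(k, 1 / k)                  # class proportions
p = rng.uniform(0.3, 0.7, (k, 5))       # item probabilities within each class

for _ in range(200):
    # E-step: probability that each respondent belongs to each class
    like = np.prod(p[None] ** X[:, None] * (1 - p[None]) ** (1 - X[:, None]), axis=2)
    r = like * pi
    r /= r.sum(axis=1, keepdims=True)
    # M-step: update the class proportions and the item probabilities
    pi = r.mean(axis=0)
    p = (r.T @ X) / r.sum(axis=0)[:, None]

labels = r.argmax(axis=1)               # classify each row into its most probable class
print(np.round(pi, 2))
print(np.round(p, 2))
```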

Table 14.1 Summary of Clustering Methods

| Method | Data Type or Modeling Type | Data Table Size | Specify Number of Clusters |
| --- | --- | --- | --- |
| Hierarchical Cluster | Any | Hybrid Ward: up to hundreds of thousands of rows. Fast Ward: up to 200,000 rows. Other methods: up to 5,000 rows. | No |
| K Means Cluster | Numeric | Up to millions of rows | Yes |
| Normal Mixtures | Numeric | Any size | Yes |
| Latent Class Analysis | Nominal or Ordinal | Any size | Yes |

Some of the clustering platforms have options for handling outliers in the data. However, if your data contain outliers, it is best to explore them before you run a clustering analysis. You can do this with the Explore Outliers utility. For more information, see “Explore Outliers” in Predictive and Specialized Modeling.
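As a rough stand-in for that kind of screening (and not the Explore Outliers utility itself), the sketch below flags rows whose squared Mahalanobis distance from the mean exceeds a chi-square cutoff. The simulated data and the 0.999 quantile are assumptions for the example; in practice a robust center and covariance are preferable because outliers inflate the classical estimates.

```python
# Minimal sketch of a multivariate outlier screen before clustering.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(4)
X = rng.normal(0, 1, (200, 3))
X[:3] += 8                               # plant a few gross outliers

center = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - center
d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)   # squared Mahalanobis distances
outliers = d2 > chi2.ppf(0.999, df=X.shape[1])
print(np.flatnonzero(outliers))          # rows worth inspecting before clustering
```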
