HCPC: Hierarchical clustering on principal components - Hybrid approach (2/2) - Unsupervised Machine Learning


There are three standard methods for exploring multidimensional data:

  1. Principal component methods, used to summarize and visualize the information contained in a multivariate data table. Individuals and variables with similar profiles are grouped together in the plot. Principal component methods include principal component analysis (PCA), correspondence analysis (CA) and multiple correspondence analysis (MCA).
  2. Hierarchical clustering, used for identifying groups of similar observations in a data set.
  3. Partitioning clustering, such as k-means, used for splitting a data set into several groups.

In my previous article, Hybrid hierarchical k-means clustering, I described how and why we should combine hierarchical clustering and k-means clustering.

In the present article, I will show how to combine the three methods: principal component methods, hierarchical clustering and partitioning methods such as k-means, to better describe and visualize the similarities between observations. The approach described here has been implemented in the R package FactoMineR (Husson et al., 2010). It’s named HCPC, for Hierarchical Clustering on Principal Components.

1 Why combine principal component and clustering methods?

1.1 Case of continuous variables: Use PCA as denoising step

In the case of a multidimensional data set containing continuous variables, principal component analysis (PCA) can be used to reduce the dimensionality of the data into a few continuous variables (i.e., the principal components) containing the most important information in the data.
The PCA step can be considered as a denoising step, which can lead to a more stable clustering. This is very useful if you have a large data set with many variables, such as gene expression data.
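To make this concrete, here is a minimal sketch (not part of the original article) that reduces a simulated noisy data set to a few principal components with prcomp() before running k-means; the data, the number of components and the number of clusters are arbitrary choices for illustration.

# Illustrative sketch: PCA as a denoising step before k-means
set.seed(123)
x <- matrix(rnorm(50 * 100), nrow = 50)      # 50 observations, 100 noisy variables
pc <- prcomp(x, scale. = TRUE)$x[, 1:5]      # keep the first 5 principal components
km <- kmeans(pc, centers = 3, nstart = 25)   # cluster on the reduced scores
table(km$cluster)                            # cluster sizes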

1.2 Case of categorical variables: Use CA or MCA before clustering

CA (for analyzing the contingency table formed by two categorical variables) and MCA (for analyzing multidimensional categorical variables) can be used to transform categorical variables into a small set of continuous variables (the principal components) and to remove the noise in the data.

CA and MCA can therefore be considered as pre-processing steps that make it possible to compute clustering on categorical data, as sketched below.
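As a small illustration (not from the original article), the following uses the poison data set shipped with FactoMineR; the column indices of the supplementary variables follow the standard FactoMineR example and may need adjusting for other data.

# Illustrative sketch: MCA turns categorical variables into continuous coordinates
library(FactoMineR)
data(poison)  # example categorical data set shipped with FactoMineR
res.mca <- MCA(poison, quanti.sup = 1:2, quali.sup = 3:4, graph = FALSE)
head(res.mca$ind$coord)  # continuous coordinates of individuals, usable for clustering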

2 Algorithm of hierarchical clustering on principal components (HCPC)


  1. Compute principal component methods: PCA, CA or MCA depending on the types of variables in the data set.
  2. Compute hierarchical clustering: hierarchical clustering is performed using Ward’s criterion on the selected principal components. Ward’s criterion has to be used because, like principal component analysis, it is based on the multidimensional variance (i.e. inertia).
  3. Choose the number of clusters based on the hierarchical tree: an initial partition is obtained by cutting the hierarchical tree.
  4. Perform k-means clustering to improve the initial partition obtained from hierarchical clustering. The final partition, obtained after consolidation with k-means, can be (slightly) different from the one obtained with hierarchical clustering. The importance of combining hierarchical clustering and k-means clustering is described in my previous post: Hybrid hierarchical k-means clustering. A minimal sketch of these four steps is given below.
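The following sketch (not part of the original article) reproduces the four steps with base R functions on the USArrests data used later in this tutorial; the number of components (3) and of clusters (4) are chosen only for illustration.

# 1. Principal component method: keep the first three components
df <- scale(USArrests)
pc <- prcomp(df)$x[, 1:3]
# 2. Hierarchical clustering with Ward's criterion on the components
tree <- hclust(dist(pc), method = "ward.D2")
# 3. Initial partition by cutting the tree (4 clusters, for illustration)
grp <- cutree(tree, k = 4)
# 4. Consolidation: k-means started from the centers of the initial partition
centers <- as.matrix(aggregate(pc, by = list(grp), FUN = mean)[, -1])
final <- kmeans(pc, centers = centers, iter.max = 10)
table(grp, final$cluster)  # compare initial and consolidated partitions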


3 Computing HCPC in R

3.1 Required R packages

We’ll use FactoMineR for computing HCPC() and factoextra for data visualizations.

Install the factoextra package as follows:

if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/factoextra")

FactoMineR can be installed as follows:

install.packages("FactoMineR")

Load the packages:

library(factoextra)
library(FactoMineR)

3.2 R function for HCPC

The function HCPC() [in the FactoMineR package] can be used to compute hierarchical clustering on principal components.

A simplified format is:

HCPC(res, nb.clust = 0, iter.max = 10, min = 3, max = NULL, graph = TRUE)

  • res: a PCA result or a data frame
  • nb.clust: an integer specifying the number of clusters. Possible values are:
    • 0: the tree is cut at the level the user clicks on
    • -1: the tree is automatically cut at the suggested level
    • Any positive integer: the tree is cut to give nb.clust clusters
  • iter.max: the maximum number of iterations for k-means
  • min, max: the minimum and the maximum number of clusters to be generated, respectively
  • graph: if TRUE, graphics are displayed
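For example, assuming a principal component result res.pca (computed in the following sections), the tree can be cut automatically or into a fixed number of clusters; this is an illustrative call, not taken from the original article:

# Let HCPC() choose the cut level automatically
res.hcpc <- HCPC(res.pca, nb.clust = -1, graph = FALSE)
# Or request exactly 4 clusters
res.hcpc <- HCPC(res.pca, nb.clust = 4, graph = FALSE)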


3.3 Case of continuous variables

We start by performing a principal component analysis (PCA) on the data set. The argument ncp = 3 is used in the function PCA() to keep only the first three principal components. Next, HCPC is applied to the result of the PCA.

3.3.1 Data preparation

We’ll use the USArrests data set, and we start by scaling the data:

# Load the data
data(USArrests)
# Scale the data
df <- scale(USArrests)
head(df)
##                Murder   Assault   UrbanPop         Rape
## Alabama    1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska     0.50786248 1.1068225 -1.2117642  2.484202941
## Arizona    0.07163341 1.4788032  0.9989801  1.042878388
## Arkansas   0.23234938 0.2308680 -1.0735927 -0.184916602
## California 0.27826823 1.2628144  1.7589234  2.067820292
## Colorado   0.02571456 0.3988593  0.8608085  1.864967207

If you want to understand why the data are scaled before the analysis, then you should read this section: Distances and scaling.

3.3.2 Compute principal component analysis

We’ll use the package FactoMineR for computing HCPC and factoextra for the visualization of the output.

# Compute principal component analysis
library(FactoMineR)
res.pca <- PCA(USArrests, ncp = 5, graph=FALSE)
# Percentage of information retained by each
# dimensions
library(factoextra)
fviz_eig(res.pca)


# Visualize variables
fviz_pca_var(res.pca)


# Visualize individuals
fviz_pca_ind(res.pca)


The first three dimensions of the PCA retain about 96% of the total variance (i.e., information) contained in the data:

get_eig(res.pca)
##       eigenvalue variance.percent cumulative.variance.percent
## Dim.1  2.4802416        62.006039                    62.00604
## Dim.2  0.9897652        24.744129                    86.75017
## Dim.3  0.3565632         8.914080                    95.66425
## Dim.4  0.1734301         4.335752                   100.00000

Read more about PCA: Principal Component Analysis (PCA)

3.3.3 Compute hierarchical clustering on the PCA results

The function HCPC() is used:

# Compute PCA with ncp = 3
res.pca <- PCA(USArrests, ncp = 3, graph = FALSE)
# Compute HCPC
res.hcpc <- HCPC(res.pca, graph = FALSE)

The function HCPC() returns a list containing:

  • data.clust: The original data with a supplementary column called clust containing the partition.
  • desc.var: The variables describing clusters
  • call$t$res: The outputs of the principal component analysis
  • call$t$tree: The outputs of agnes() function [in cluster package]
  • call$t$nb.clust: The optimal number of clusters estimated
# Data with cluster assignments
head(res.hcpc$data.clust, 10)
##             Murder Assault UrbanPop Rape clust
## Alabama       13.2     236       58 21.2     3
## Alaska        10.0     263       48 44.5     4
## Arizona        8.1     294       80 31.0     4
## Arkansas       8.8     190       50 19.5     3
## California     9.0     276       91 40.6     4
## Colorado       7.9     204       78 38.7     4
## Connecticut    3.3     110       77 11.1     2
## Delaware       5.9     238       72 15.8     2
## Florida       15.4     335       80 31.9     4
## Georgia       17.4     211       60 25.8     3
# Variable describing clusters
res.hcpc$desc.var
## $quanti.var
##               Eta2      P-value
## Assault  0.7841402 2.376392e-15
## Murder   0.7771455 4.927378e-15
## Rape     0.7029807 3.480110e-12
## UrbanPop 0.5846485 7.138448e-09
## 
## $quanti
## $quanti$`1`
##             v.test Mean in category Overall mean sd in category Overall sd
## UrbanPop -3.898420         52.07692       65.540       9.691087  14.329285
## Murder   -4.030171          3.60000        7.788       2.269870   4.311735
## Rape     -4.052061         12.17692       21.232       3.130779   9.272248
## Assault  -4.638172         78.53846      170.760      24.700095  82.500075
##               p.value
## UrbanPop 9.682222e-05
## Murder   5.573624e-05
## Rape     5.076842e-05
## Assault  3.515038e-06
## 
## $quanti$`2`
##             v.test Mean in category Overall mean sd in category Overall sd
## UrbanPop  2.793185         73.87500       65.540       8.652131  14.329285
## Murder   -2.374121          5.65625        7.788       1.594902   4.311735
##              p.value
## UrbanPop 0.005219187
## Murder   0.017590794
## 
## $quanti$`3`
##             v.test Mean in category Overall mean sd in category Overall sd
## Murder    4.357187          13.9375        7.788       2.433587   4.311735
## Assault   2.698255         243.6250      170.760      46.540137  82.500075
## UrbanPop -2.513667          53.7500       65.540       7.529110  14.329285
##               p.value
## Murder   1.317449e-05
## Assault  6.970399e-03
## UrbanPop 1.194833e-02
## 
## $quanti$`4`
##            v.test Mean in category Overall mean sd in category Overall sd
## Rape     5.352124         33.19231       21.232       6.996643   9.272248
## Assault  4.356682        257.38462      170.760      41.850537  82.500075
## UrbanPop 3.028838         76.00000       65.540      10.347798  14.329285
## Murder   2.913295         10.81538        7.788       2.001863   4.311735
##               p.value
## Rape     8.692769e-08
## Assault  1.320491e-05
## UrbanPop 2.454964e-03
## Murder   3.576369e-03
## 
## 
## attr(,"class")
## [1] "catdes" "list "
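The other elements listed above can be inspected in the same way. For instance, a quick illustrative check (not part of the original article) of the number of clusters retained and the cluster sizes:

# Number of clusters retained by HCPC()
res.hcpc$call$t$nb.clust
# Size of each cluster
table(res.hcpc$data.clust$clust)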

3.3.4 Visualize the results of HCPC using base plot

The function plot.HCPC() [in FactoMineR] is used:

plot(x, axes = c(1,2), choice = "3D.map", 
     draw.tree = TRUE, ind.names = TRUE, title = NULL,
     tree.barplot = TRUE, centers.plot = FALSE)

  • x: an object of class HCPC
  • axes: the principal components to be plotted
  • choice: a string. Possible values are:
    • “tree”: plots the tree (dendrogram)
    • “bar”: plots bars of inertia gains
    • “map”: plots a factor map. Individuals are colored by cluster
    • “3D.map”: plots the factor map. The tree is added on the plot
  • draw.tree: a logical value. If TRUE, the tree is plotted on the factor map if choice = “map”
  • ind.names: a logical value. If TRUE, individual names are shown
  • title: the title of the graph
  • tree.barplot: a logical value. If TRUE, the barplot of intra inertia losses is added on the tree graph.
  • centers.plot: a logical value. If TRUE, the centers of clusters are drawn on the factor maps


# Principal components + tree
plot(res.hcpc, choice = "3D.map")


# Plot the dendrogram only
plot(res.hcpc, choice ="tree", cex = 0.6)


# Draw only the factor map
plot(res.hcpc, choice ="map", draw.tree = FALSE)


# Remove labels and add cluster centers
plot(res.hcpc, choice ="map", draw.tree = FALSE,
     ind.names = FALSE, centers.plot = TRUE)


3.3.5 Visualize the results of HCPC using factoextra

The function fviz_cluster() can be used:

fviz_cluster(res.hcpc)
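The plot can be customized with the usual fviz_cluster() arguments; the options below are illustrative and their availability may depend on your factoextra version:

# Use convex hulls instead of the default ellipses and show only points
fviz_cluster(res.hcpc, ellipse.type = "convex", geom = "point")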


3.4 Case of categorical variables

Compute CA or MCA and then apply the function HCPC() to the results, as described above and sketched below. If you want to learn more about CA and MCA, see the corresponding articles in this Easy Guides series.
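As an illustrative sketch (not from the original article), building on the MCA example of section 1.2 with the poison data set shipped with FactoMineR:

library(FactoMineR)
data(poison)
# MCA as a pre-processing step; the supplementary column indices follow the
# standard FactoMineR example and may need adjusting for other data
res.mca <- MCA(poison, quanti.sup = 1:2, quali.sup = 3:4, graph = FALSE)
# Hierarchical clustering on the MCA result
res.hcpc <- HCPC(res.mca, graph = FALSE)
head(res.hcpc$data.clust)   # original data plus the cluster assignment
res.hcpc$desc.var           # categories/variables describing each cluster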

4 Infos

This analysis has been performed using R software (ver. 3.2.1)

  • Husson, F., Josse, J. and Pagès, J. (2010). Principal component methods - hierarchical clustering - partitional clustering: why would we need to choose for visualizing data? Technical report.
