
Determining the optimal number of clusters: 3 must-know methods - Unsupervised Machine Learning


The first step in clustering analysis is to assess whether the dataset is clusterable. This has been described in a chapter entitled: Assessing Clustering Tendency.

Partitioning methods, such as k-means clustering, also require the user to specify the number of clusters to be generated.

One fundamental question is: if the data is clusterable, then how do we choose the right number of expected clusters (k)?

Unfortunately, there is no definitive answer to this question. The optimal clustering is somewhat subjective and depends on the method used for measuring similarities and on the parameters used for partitioning.

A simple and popular solution consists of inspecting the dendrogram produced by hierarchical clustering to see whether it suggests a particular number of clusters. Unfortunately, this approach is, again, subjective.

In this article, we’ll describe different methods for determining the optimal number of clusters for k-means, PAM and hierarchical clustering. These methods include direct methods and statistical testing methods.


  • Direct methods consist of optimizing a criterion, such as the within-cluster sum of squares or the average silhouette. The corresponding methods are named the elbow and silhouette methods, respectively. A minimal sketch of the elbow method is given after this list.
  • Statistical testing methods consist of comparing evidence against a null hypothesis. An example is the gap statistic.

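To make the elbow idea concrete, below is a minimal sketch (it assumes the scaled iris data, iris.scaled, prepared in the data preparation section): compute the total within-cluster sum of squares (wss) for k = 1 to 10 and look for a bend in the curve.

# Elbow method sketch (assumes iris.scaled, created in section 2)
set.seed(123)
wss <- sapply(1:10, function(k){
  kmeans(iris.scaled, centers = k, nstart = 25)$tot.withinss
})
# The "elbow" is the k after which the decrease in wss levels off
plot(1:10, wss, type = "b", pch = 19,
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")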

In addition to the elbow, silhouette and gap statistic methods, more than thirty other indices have been published for identifying the optimal number of clusters. We’ll provide R code for computing all these indices in order to decide the best number of clusters using the “majority rule”.

For each of these methods:

  • We’ll describe the basic idea, the algorithm and the key mathematical concepts
  • We’ll provide easy-to-use R code, with many examples, for determining the optimal number of clusters and visualizing the output

1 Required packages

The following packages will be used:

  • cluster for computing pam() and for analyzing cluster silhouettes
  • factoextra for visualizing clusters using the ggplot2 plotting system
  • NbClust for finding the optimal number of clusters

Install the factoextra package as follows:

if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/factoextra")

The remaining packages can be installed using the code below:

pkgs <- c("cluster",  "NbClust")
install.packages(pkgs)

Load packages:

library(factoextra)
library(cluster)
library(NbClust)

2 Data preparation

The iris data set is used. We start by excluding the Species column and scaling the data using the function scale():

# Load the data
data(iris)
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
# Remove species column (5) and scale the data
iris.scaled <- scale(iris[, -5])
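As a quick sanity check, each column of iris.scaled should now have mean 0 and standard deviation 1, since scale() centers and standardizes the variables by default:

# Verify that the columns are centered and standardized
round(colMeans(iris.scaled), 2)
apply(iris.scaled, 2, sd)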

This iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.

3 Example of partitioning method results

The functions kmeans() [in the stats package] and pam() [in the cluster package] are illustrated in this section. We’ll split the data into 3 clusters as follows:

# K-means clustering
set.seed(123)
km.res <- kmeans(iris.scaled, 3, nstart = 25)
# k-means group number of each observation
km.res$cluster
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 3 3 3 2 3 3 3 3 3 3 3 3 2 3 3 3 3
##  [71] 2 3 3 3 3 2 2 2 3 3 3 3 3 3 3 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 2 2 2
## [106] 2 3 2 2 2 2 2 2 3 3 2 2 2 2 3 2 3 2 3 2 2 3 2 2 2 2 2 2 3 3 2 2 2 3 2
## [141] 2 2 3 2 2 2 3 2 2 3
# Visualize k-means clusters
fviz_cluster(km.res, data = iris.scaled, geom = "point",
             stand = FALSE, frame.type = "norm")

(Figure: k-means clusters of the scaled iris data, as drawn by fviz_cluster())
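Because the true species labels are available for the iris data, an optional check (not part of the clustering workflow itself) is to cross-tabulate them against the k-means assignments:

# Cross-tabulate species against k-means cluster assignments
table(iris$Species, km.res$cluster)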

# PAM clustering
library("cluster")
pam.res <- pam(iris.scaled, 3)
pam.res$cluster
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 3 3 3 2 3 3 3 3 3 3 3 3 2 3 3 3 3
##  [71] 3 3 3 3 3 2 2 2 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 2 2 2
## [106] 2 3 2 2 2 2 2 2 3 2 2 2 2 2 3 2 3 2 3 2 2 3 3 2 2 2 2 2 3 3 2 2 2 3 2
## [141] 2 2 3 2 2 2 3 2 2 3
# Visualize pam clusters
fviz_cluster(pam.res, stand = FALSE, geom = "point",
             frame.type = "norm")

(Figure: PAM clusters of the scaled iris data, as drawn by fviz_cluster())
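The two partitions can also be compared directly; a contingency table of the k-means and PAM cluster vectors shows how closely the two methods agree:

# Agreement between the k-means and PAM partitions
table(kmeans = km.res$cluster, pam = pam.res$cluster)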

Read more about partitioning methods: Partitioning clustering

4 Example of hierarchical clustering results

The built-in R function hclust() is used:

# Compute the pairwise distance matrix
dist.res <- dist(iris.scaled, method = "euclidean")
# Hierarchical clustering results
hc <- hclust(dist.res, method = "complete")
# Visualization of hclust
plot(hc, labels = FALSE, hang = -1)
# Add rectangle around 3 groups
rect.hclust(hc, k = 3, border = 2:4) 

(Figure: cluster dendrogram with rectangles drawn around the 3 groups)

# Cut into 3 groups
hc.cut <- cutree(hc, k = 3)
head(hc.cut, 20)
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
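As with the partitioning results, the 3-group cut can be checked against the known species labels:

# Compare the hierarchical 3-group cut with the species labels
table(iris$Species, hc.cut)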

Read more about hierarchical clustering: Hierarchical clustering

5 NbClust: A package providing 30 indices for determining the best number of clusters

5.1 Overview of the NbClust package

As mentioned in the introduction of this article, many indices have been proposed in the literature for determining the optimal number of clusters in a partitioning of a data set during the clustering process.

The NbClust package, published by Charrad et al. (2014), provides 30 indices for determining the relevant number of clusters, and proposes to users the best clustering scheme from the different results obtained by varying all combinations of number of clusters, distance measures and clustering methods.

An important advantage of NbClust is that the user can simultaneously compute multiple indices and determine the number of clusters in a single function call.

The indices provided in the NbClust package include the gap statistic, the silhouette method and 28 other indices, described comprehensively in the original paper of Charrad et al. (2014).

5.2 NbClust R function

The simplified format of the function NbClust() is:

NbClust(data = NULL, diss = NULL, distance = "euclidean",
        min.nc = 2, max.nc = 15, method = NULL, index = "all")

  • data: matrix
  • diss: dissimilarity matrix to be used. By default, diss=NULL, but if it is replaced by a dissimilarity matrix, distance should be “NULL”
  • distance: the distance measure to be used to compute the dissimilarity matrix. Possible values include “euclidean”, “manhattan” or “NULL”.
  • min.nc, max.nc: minimal and maximal number of clusters, respectively
  • method: The cluster analysis method to be used including “ward.D”, “ward.D2”, “single”, “complete”, “average” and more
  • index: the index to be calculated including “silhouette”, “gap” and more.


The value returned by the NbClust() function includes the following elements:

  • All.index: Values of indices for each partition of the dataset obtained with a number of clusters between min.nc and max.nc
  • All.CriticalValues: Critical values of some indices for each partition obtained with a number of clusters between min.nc and max.nc
  • Best.nc: Best number of clusters proposed by each index and the corresponding index value
  • Best.partition: Partition that corresponds to the best number of clusters

5.3 Examples of usage

Note that the user can request indices one by one, by setting the argument index to the name of the index of interest, for example index = “gap”.

In this case, the NbClust() function displays:

  • the gap statistic values of the partitions obtained with number of clusters varying from min.nc to max.nc ($All.index)
  • the optimal number of clusters ($Best.nc)
  • and the partition corresponding to the best number of clusters ($Best.partition)

5.3.1 Compute only an index of interest

The following example determines the number of clusters using the gap statistic:

library("NbClust")
set.seed(123)
res.nb <- NbClust(iris.scaled, distance = "euclidean",
                  min.nc = 2, max.nc = 10, 
                  method = "complete", index ="gap") 
res.nb # print the results
## $All.index
##       2       3       4       5       6       7       8       9      10 
## -0.2899 -0.2303 -0.6915 -0.8606 -1.0506 -1.3223 -1.3303 -1.4759 -1.5551 
## 
## $All.CriticalValues
##       2       3       4       5       6       7       8       9      10 
## -0.0539  0.4694  0.1787  0.2009  0.2848  0.0230  0.1631  0.0988  0.1708 
## 
## $Best.nc
## Number_clusters     Value_Index 
##          3.0000         -0.2303 
## 
## $Best.partition
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [36] 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 3 3 3 2 3 2 3 2 3 2 2 3 2 3 3 3 3 2 2 2
##  [71] 3 3 3 3 3 3 3 3 3 2 2 2 2 3 3 3 3 2 3 2 2 3 2 2 2 3 3 3 2 2 3 3 3 3 3
## [106] 3 2 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [141] 3 3 3 3 3 3 3 3 3 3
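Note that, for the gap index, NbClust uses the selection rule proposed by Tibshirani et al. (2001): choose the smallest k such that gap(k) >= gap(k+1) - s(k+1). This is consistent with the output above, where k = 3 is the smallest number of clusters with a positive critical value.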

The elements returned by the function NbClust() are accessible using the R code below:

# All gap statistic values
res.nb$All.index

# Best number of clusters
res.nb$Best.nc

# Best partition
res.nb$Best.partition

5.3.2 Compute all the 30 indices

The following example computes all 30 indices, in a single function call, to determine the number of clusters and suggest the best clustering scheme to the user. Descriptions of the indices are available in the NbClust documentation (see ?NbClust).

To compute multiple indices simultaneously, the possible values for the argument index are i) “alllong” or ii) “all”. The option “alllong” requires more time, as some of the indices, such as Gamma, Tau, Gap and Gplus, are computationally very expensive. The user can avoid computing these four indices by setting the argument index to “all”, in which case only 26 indices are calculated. (A sketch of the “alllong” call is shown after the list below.)

With the “alllong” option, the output of the NbClust function contains:


  • all validation indices
  • critical values for Duda, Gap, PseudoT2 and Beale indices
  • the number of clusters corresponding to the optimal score for each index
  • the best number of clusters proposed by NbClust according to the majority rule
  • the best partition

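As a sketch, the “alllong” variant is called in exactly the same way as the “all” example below, only with a different index value (expect a substantially longer run time):

nb.long <- NbClust(iris.scaled, distance = "euclidean", min.nc = 2,
                   max.nc = 10, method = "complete", index = "alllong")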

The R code below computes NbClust() with index = “all”:

nb <- NbClust(iris.scaled, distance = "euclidean", min.nc = 2,
        max.nc = 10, method = "complete", index = "all")
# Print the result
nb

It’s possible to visualize the result using the function fviz_nbclust() [in factoextra], as follows:

fviz_nbclust(nb) + theme_minimal()
## Among all indices: 
## ===================
## * 2 proposed  0 as the best number of clusters
## * 1 proposed  1 as the best number of clusters
## * 2 proposed  2 as the best number of clusters
## * 18 proposed  3 as the best number of clusters
## * 3 proposed  10 as the best number of clusters
## 
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is  3 .

(Figure: number of indices voting for each candidate number of clusters, as drawn by fviz_nbclust())




6 Infos

This analysis has been performed using R software (ver. 3.2.1)

  • Charrad M., Ghazzali N., Boiteau V., Niknafs A. (2014). NbClust: An R Package for Determining the Relevant Number of Clusters in a Data Set. Journal of Statistical Software, 61(6), 1-36.
  • Kaufman, L. and Rousseeuw, P.J. (1990). Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York.
  • Tibshirani, R., Walther, G. and Hastie, T. (2001). Estimating the number of data clusters via the Gap statistic. Journal of the Royal Statistical Society B, 63, 411–423.
