Clustering is an unsupervised machine learning method for partitioning dataset into a set of groups or clusters. A big issue is that clustering methods will return clusters even if the data does not contain any clusters. Therefore, its necessary i) to assess clustering tendency before the analysis and ii) to validate the quality of the result after clustering.
A variety of measures has been proposed in the literature for evaluating clustering results. The term clustering validation is used to design the procedure of evaluating the results of a clustering algorithm.
Generally, clustering validation statistics can be categorized into 4 classes (Theodoridis and Koutroubas, 2008; G. Brock et al., 2008, Charrad et al., 2014):
Relative clustering validation, which evaluates the clustering structure by varying different parameter values for the same algorithm (e.g.,: varying the number of clusters k). Its generally used for determining the optimal number of clusters.
External clustering validation, which consists in comparing the results of a cluster analysis to an externally known result, such as externally provided class labels. Since we know the true cluster number in advance, this approach is mainly used for selecting the right clustering algorithm for a specific dataset.
Internal clustering validation, which use the internal information of the clustering process to evaluate the goodness of a clustering structure without reference to external information. It can be also used for estimating the number of clusters and the appropriate clustering algorithm without any external data.
- Clustering stability validation, which is a special version of internal validation. It evaluates the consistency of a clustering result by comparing it with the clusters obtained after each column is removed, one at a time. Clustering stability measures will be described in a future chapter.
The aim of this article is to:
- describe the different methods for clustering validation
- compare the quality of clustering results obtained with different clustering algorithms
- provide R lab section for validating clustering results
In all the examples presented here, well apply k-means, PAM and hierarchical clustering. Note that, the functions used in this article can be applied to evaluate the validity of any other clustering methods.
1 Required packages
The following packages will be used:
- cluster for computing PAM clustering and for analyzing cluster silhouettes
- factoextra for simplifying clustering workflows and for visualizing clusters using ggplot2 plotting system
- NbClust for determining the optimal number of clusters in the data
- fpc for computing clustering validation statistics
Install factoextra package as follow:
if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/factoextra")
The remaining packages can be installed using the code below:
pkgs <- c("cluster", "fpc", "NbClust")
install.packages(pkgs)
Load packages:
library(factoextra)
library(cluster)
library(fpc)
library(NbClust)
2 Data preparation
The data set iris is used. We start by excluding the column Species and scaling the data using the function scale():
# Load the data
data(iris)
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
# Remove species column (5) and scale the data
iris.scaled <- scale(iris[, -5])
Iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.
3 Relative measures: Determine the optimal number of clusters
Many indices (more than 30) has been published in the literature for finding the right number of clusters in a dataset. The process has been covered in my previous article: Determining the optimal number of clusters.
In this section well use the package NbClust which will compute, with a single function call, 30 indices for deciding the right number of clusters in the dataset:
# Compute the number of clusters
library(NbClust)
nb <- NbClust(iris.scaled, distance = "euclidean", min.nc = 2,
max.nc = 10, method = "complete", index ="all")
# Visualize the result
library(factoextra)
fviz_nbclust(nb) + theme_minimal()
## Among all indices:
## ===================
## * 2 proposed 0 as the best number of clusters
## * 1 proposed 1 as the best number of clusters
## * 2 proposed 2 as the best number of clusters
## * 18 proposed 3 as the best number of clusters
## * 3 proposed 10 as the best number of clusters
##
## Conclusion
## =========================
## * Accoridng to the majority rule, the best number of clusters is 3 .
4 Clustering analysis
Well use the function eclust() [in factoextra] which provides several advantages as described in the previous chapter: Visual Enhancement of Clustering Analysis.
eclust() stands for enhanced clustering. It simplifies the workflow of clustering analysis and, it can be used to compute hierarchical clustering and partititioning clustering in a single line function call.
4.1 Example of partitioning method results
K-means and PAM clustering are described in this section. Well split the data into 3 clusters as follow:
# K-means clustering
km.res <- eclust(iris.scaled, "kmeans", k = 3,
nstart = 25, graph = FALSE)
# k-means group number of each observation
km.res$cluster
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 3 3 3 2 3 3 3 3 3 3 3 3 2 3 3 3 3
## [71] 2 3 3 3 3 2 2 2 3 3 3 3 3 3 3 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 2 2 2
## [106] 2 3 2 2 2 2 2 2 3 3 2 2 2 2 3 2 3 2 3 2 2 3 2 2 2 2 2 2 3 3 2 2 2 3 2
## [141] 2 2 3 2 2 2 3 2 2 3
# Visualize k-means clusters
fviz_cluster(km.res, geom = "point", frame.type = "norm")
# PAM clustering
pam.res <- eclust(iris.scaled, "pam", k = 3, graph = FALSE)
pam.res$cluster
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 3 3 3 2 3 3 3 3 3 3 3 3 2 3 3 3 3
## [71] 3 3 3 3 3 2 2 2 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 2 2 2
## [106] 2 3 2 2 2 2 2 2 3 2 2 2 2 2 3 2 3 2 3 2 2 3 3 2 2 2 2 2 3 3 2 2 2 3 2
## [141] 2 2 3 2 2 2 3 2 2 3
# Visualize pam clusters
fviz_cluster(pam.res, geom = "point", frame.type = "norm")
Read more about partitioning methods: Partitioning clustering
4.2 Example of hierarchical clustering results
# Enhanced hierarchical clustering
res.hc <- eclust(iris.scaled, "hclust", k = 3,
method = "complete", graph = FALSE)
head(res.hc$cluster, 15)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
# Dendrogram
fviz_dend(res.hc, rect = TRUE, show_labels = FALSE)
Read more about hierarchical clustering: Hierarchical clustering
5 Internal clustering validation measures
In this section, we describe the most widely used clustering validation indices. Recall that the goal of clustering algorithms is to split the dataset into clusters of objects, such that:
- the objects in the same cluster are similar as much as possible,
- and the objects in different clusters are highly distinct
That is, we want the average distance within cluster to be as small as possible; and the average distance between clusters to be as large as possible.
Internal validation measures reflect often the compactness, the connectedness and separation of the cluster partitions.
Compactness measures evaluate how close are the objects within the same cluster. A lower within-cluster variation is an indicator of a good compactness (i.e., a good clustering). The different indices for evaluating the compactness of clusters are base on distance measures such as the cluster-wise within average/median distances between observations.
- Separation measures determine how well-separated a cluster is from other clusters. The indices used as separation measures include:
- distances between cluster centers
- the pairwise minimum distances between objects in different clusters
- Connectivity corresponds to what extent items are placed in the same cluster as their nearest neighbors in the data space. The connectivity has a value between 0 and infinity and should be minimized.
Generally most of the indices used for internal clustering validation combine compactness and separation measures as follow:
\[ Index = \frac{(\alpha \times Separation)}{(\beta \times Compactness)} \]
Where \(\alpha\) and \(\beta\) are weights.
In this section, well describe the two commonly used indices for assessing the goodness of clustering: silhouette width and Dunn index.
Recall that, more than 30 indices has been published in literature. They can be easily computed using the function NbClust which has been described in my previous article: Determining the optimal number of clusters.
5.1 Silhouette analysis
5.1.1 Concept and algorithm
Silhouette analysis measures how well an observation is clustered and it estimates the average distance between clusters. The silhouette plot displays a measure of how close each point in one cluster is to points in the neighboring clusters.
For each observation \(i\), the silhouette width \(s_i\) is calculated as follows:
- For each observation \(i\), calculate the average dissimilarity \(a_i\) between \(i\) and all other points of the cluster to which i belongs.
For all other clusters \(C\), to which i does not belong, calculate the average dissimilarity \(d(i, C)\) of \(i\) to all observations of C. The smallest of these \(d(i,C)\) is defined as \(b_i= \min_C d(i,C)\). The value of \(b_i\) can be seen as the dissimilarity between \(i\) and its neighbor cluster, i.e., the nearest one to which it does not belong.
- Finally the silhouette width of the observation \(i\) is defined by the formula: \(S_i = (b_i - a_i)/max(a_i, b_i)\).
5.1.2 Interpretation of silhouette width
Silhouette width can be interpreted as follow:
Observations with a large \(S_i\) (almost 1) are very well clustered
A small \(S_i\) (around 0) means that the observation lies between two clusters
- Observations with a negative \(S_i\) are probably placed in the wrong cluster.
5.1.3 R functions for silhouette analysis
The silhouette coefficient of observations can be computed using the function silhouette() [in cluster package]:
silhouette(x, dist, ...)
- x: an integer vector containing the cluster assignment of observations
- dist: a dissimilarity object created by the function dist()
The function silhouette() returns an object, of class silhouette containing:
- The cluster number of each observation i
- The neighbor cluster of i (the cluster, not containing i, for which the average dissimilarity between its observations and i is minimal)
- The silhouette width\(s_i\) of each observation
The R code below computes silhouette analysis and draw the result using R base plot:
# Silhouette coefficient of observations
library("cluster")
sil <- silhouette(km.res$cluster, dist(iris.scaled))
head(sil[, 1:3], 10)
## cluster neighbor sil_width
## [1,] 1 3 0.7341949
## [2,] 1 3 0.5682739
## [3,] 1 3 0.6775472
## [4,] 1 3 0.6205016
## [5,] 1 3 0.7284741
## [6,] 1 3 0.6098848
## [7,] 1 3 0.6983835
## [8,] 1 3 0.7308169
## [9,] 1 3 0.4882100
## [10,] 1 3 0.6315409
# Silhouette plot
plot(sil, main ="Silhouette plot - K-means")
Use factoextra for elegant data visualization:
library(factoextra)
fviz_silhouette(sil)
The summary of the silhouette analysis can be computed using the function summary.silhouette() as follow:
# Summary of silhouette analysis
si.sum <- summary(sil)
# Average silhouette width of each cluster
si.sum$clus.avg.widths
## 1 2 3
## 0.6363162 0.3473922 0.3933772
# The total average (mean of all individual silhouette widths)
si.sum$avg.width
## [1] 0.4599482
# The size of each clusters
si.sum$clus.sizes
## cl
## 1 2 3
## 50 47 53
Note that, if the clustering analysis is done using the function eclust(), cluster silhouettes are computed automatically and stored in the object silinfo. The results can be easily visualized as shown in the next sections.
5.1.4 Silhouette plot for k-means clustering
Its possible to draw silhouette plot using the function fviz_silhouette() [in factoextra package], which will also print a summary of the silhouette analysis output. To avoid this, you can use the option print.summary = FALSE.
# Default plot
fviz_silhouette(km.res)
## cluster size ave.sil.width
## 1 1 50 0.64
## 2 2 47 0.35
## 3 3 53 0.39
# Change the theme and color
fviz_silhouette(km.res, print.summary = FALSE) +
scale_fill_brewer(palette = "Dark2") +
scale_color_brewer(palette = "Dark2") +
theme_minimal()+
theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())
Silhouette information can be extracted as follow:
# Silhouette information
silinfo <- km.res$silinfo
names(silinfo)
## [1] "widths" "clus.avg.widths" "avg.width"
# Silhouette widths of each observation
head(silinfo$widths[, 1:3], 10)
## cluster neighbor sil_width
## 1 1 3 0.7341949
## 41 1 3 0.7333345
## 8 1 3 0.7308169
## 18 1 3 0.7287522
## 5 1 3 0.7284741
## 40 1 3 0.7247047
## 38 1 3 0.7244191
## 12 1 3 0.7217939
## 28 1 3 0.7215103
## 29 1 3 0.7145192
# Average silhouette width of each cluster
silinfo$clus.avg.widths
## [1] 0.6363162 0.3473922 0.3933772
# The total average (mean of all individual silhouette widths)
silinfo$avg.width
## [1] 0.4599482
# The size of each clusters
km.res$size
## [1] 50 47 53
5.1.5 Silhouette plot for PAM clustering
fviz_silhouette(pam.res)
## cluster size ave.sil.width
## 1 1 50 0.63
## 2 2 45 0.35
## 3 3 55 0.38
5.1.6 Silhouette plot for hierarchical clustering
fviz_silhouette(res.hc)
## cluster size ave.sil.width
## 1 1 49 0.75
## 2 2 75 0.37
## 3 3 26 0.51
5.1.7 Samples with a negative silhouette coefficient
It can be seen that several samples have a negative silhouette coefficient in the hierarchical clustering. This means that they are not in the right cluster.
We can find the name of these samples and determine the clusters they are closer (neighbor cluster), as follow:
# Silhouette width of observation
sil <- res.hc$silinfo$widths[, 1:3]
# Objects with negative silhouette
neg_sil_index <- which(sil[, 'sil_width'] < 0)
sil[neg_sil_index, , drop = FALSE]
## cluster neighbor sil_width
## 51 2 3 -0.02848264
## 148 2 3 -0.03799687
## 129 2 3 -0.09622863
## 111 2 3 -0.14461589
## 109 2 3 -0.14991556
## 133 2 3 -0.18730218
## 42 2 1 -0.39515010
5.2 Dunn index
5.2.1 Concept and algorithm
Dunn index is another internal clustering validation measure which can be computed as follow:
- For each cluster, compute the distance between each of the objects in the cluster and the objects in the other clusters
Use the minimum of this pairwise distance as the inter-cluster separation (min.separation)
- For each cluster, compute the distance between the objects in the same cluster.
Use the maximal intra-cluster distance (i.e maximum diameter) as the intra-cluster compactness
Calculate Dunn index (D) as follow:
\[ D = \frac{min.separation}{max.diameter} \]
If the data set contains compact and well-separated clusters, the diameter of the clusters is expected to be small and the distance between the clusters is expected to be large. Thus, Dunn index should be maximized.
5.2.2 R function for computing Dunn index
The function cluster.stats() [in fpc package] and the function NbClust() [in NbClust package] can be used to compute Dunn index and many other indices.
The function cluster.stats() is described in the next section.
5.3 Clustering validation statistics
In this section, well describe the R function cluster.stats() [in fpc package] for computing a number of distance based statistics which can be used either for cluster validation, comparison between clustering and decision about the number of clusters.
The simplified format is:
cluster.stats(d = NULL, clustering, al.clustering = NULL)
- d: a distance object between cases as generated by the dist() function
- clustering: vector containing the cluster number of each observation
- alt.clustering: vector such as for clustering, indicating an alternative clustering
The function cluster.stats() returns a list containing many components useful for analyzing the intrinsic characteristics of a clustering:
- cluster.number: number of clusters
- cluster.size: vector containing the number of points in each cluster
- average.distance, median.distance: vector containing the cluster-wise within average/median distances
- average.between: average distance between clusters. We want it to be as large as possible
- average.within: average distance within clusters. We want it to be as small as possible
- clus.avg.silwidths: vector of cluster average silhouette widths. Recall that, the silhouette width is also an estimate of the average distance between clusters. Its value is comprised between 1 and -1 with a value of 1 indicating a very good cluster.
- within.cluster.ss: a generalization of the within clusters sum of squares (k-means objective function), which is obtained if d is a Euclidean distance matrix.
- dunn, dunn2: Dunn index
- corrected.rand, vi: Two indexes to assess the similarity of two clustering: the corrected Rand index and Meilas VI
All the above elements can be used to evaluate the internal quality of clustering.
In the following sections, well compute the clustering quality statistics for k-means, pam and hierarchical clustering. Look at the within.cluster.ss (within clusters sum of squares), the average.within (average distance within clusters) and clus.avg.silwidths (vector of cluster average silhouette widths).
5.3.0.1 Cluster statistics for k-means clustering
library(fpc)
# Compute pairwise-distance matrices
dd <- dist(iris.scaled, method ="euclidean")
# Statistics for k-means clustering
km_stats <- cluster.stats(dd, km.res$cluster)
# (k-means) within clusters sum of squares
km_stats$within.cluster.ss
## [1] 138.8884
# (k-means) cluster average silhouette widths
km_stats$clus.avg.silwidths
## 1 2 3
## 0.6363162 0.3473922 0.3933772
# Display all statistics
km_stats
## $n
## [1] 150
##
## $cluster.number
## [1] 3
##
## $cluster.size
## [1] 50 47 53
##
## $min.cluster.size
## [1] 47
##
## $noisen
## [1] 0
##
## $diameter
## [1] 5.034198 3.343671 2.922371
##
## $average.distance
## [1] 1.175155 1.307716 1.197061
##
## $median.distance
## [1] 0.9884177 1.2383531 1.1559887
##
## $separation
## [1] 1.5533592 0.1333894 0.1333894
##
## $average.toother
## [1] 3.647912 3.081212 2.674298
##
## $separation.matrix
## [,1] [,2] [,3]
## [1,] 0.000000 2.4150235 1.5533592
## [2,] 2.415024 0.0000000 0.1333894
## [3,] 1.553359 0.1333894 0.0000000
##
## $ave.between.matrix
## [,1] [,2] [,3]
## [1,] 0.000000 4.129179 3.221129
## [2,] 4.129179 0.000000 2.092563
## [3,] 3.221129 2.092563 0.000000
##
## $average.between
## [1] 3.130708
##
## $average.within
## [1] 1.222246
##
## $n.between
## [1] 7491
##
## $n.within
## [1] 3684
##
## $max.diameter
## [1] 5.034198
##
## $min.separation
## [1] 0.1333894
##
## $within.cluster.ss
## [1] 138.8884
##
## $clus.avg.silwidths
## 1 2 3
## 0.6363162 0.3473922 0.3933772
##
## $avg.silwidth
## [1] 0.4599482
##
## $g2
## NULL
##
## $g3
## NULL
##
## $pearsongamma
## [1] 0.679696
##
## $dunn
## [1] 0.02649665
##
## $dunn2
## [1] 1.600166
##
## $entropy
## [1] 1.097412
##
## $wb.ratio
## [1] 0.3904057
##
## $ch
## [1] 241.9044
##
## $cwidegap
## [1] 1.3892251 0.9432249 0.7824508
##
## $widestgap
## [1] 1.389225
##
## $sindex
## [1] 0.3524812
##
## $corrected.rand
## NULL
##
## $vi
## NULL
Read the documentation of cluster.stats() for details about all the available indices.
The same statistics can be computed for pam clustering and hierarchical clustering.
5.3.0.2 Cluster statistics for PAM clustering
# Statistics for pam clustering
pam_stats <- cluster.stats(dd, pam.res$cluster)
# (pam) within clusters sum of squares
pam_stats$within.cluster.ss
## [1] 140.2856
# (pam) cluster average silhouette widths
pam_stats$clus.avg.silwidths
## 1 2 3
## 0.6346397 0.3496332 0.3823817
5.3.0.3 Cluster statistics for hierarchical clustering
# Statistics for hierarchical clustering
hc_stats <- cluster.stats(dd, res.hc$cluster)
# (HCLUST) within clusters sum of squares
hc_stats$within.cluster.ss
## [1] 152.7107
# (HCLUST) cluster average silhouette widths
hc_stats$clus.avg.silwidths
## 1 2 3
## 0.6688130 0.3154184 0.4488197
6 External clustering validation
The aim is to compare the identified clusters (by k-means, pam or hierarchical clustering) to a reference.
To compare two cluster solutions, use the cluster.stats() function as follow:
res.stat <- cluster.stats(d, solution1$cluster, solution2$cluster)
Among the values returned by the function cluster.stats(), there are two indexes to assess the similarity of two clustering, namely the corrected Rand index and Meilas VI.
We know that the iris data contains exactly 3 groups of species.
Does the K-means clustering matches with the true structure of the data?
We can use the function cluster.stats() to answer to this question.
A cross-tabulation can be computed as follow:
table(iris$Species, km.res$cluster)
##
## 1 2 3
## setosa 50 0 0
## versicolor 0 11 39
## virginica 0 36 14
It can be seen that:
- All setosa species (n = 50) has been classified in cluster 1
- A large number of versicor species (n = 39 ) has been classified in cluster 3. Some of them ( n = 11) have been classified in cluster 2.
- A large number of virginica species (n = 36 ) has been classified in cluster 2. Some of them (n = 14) have been classified in cluster 3.
Its possible to quantify the agreement between Species and k-means clusters using either the corrected Rand index and Meilas VI provided as follow:
library("fpc")
# Compute cluster stats
species <- as.numeric(iris$Species)
clust_stats <- cluster.stats(d = dist(iris.scaled),
species, km.res$cluster)
# Corrected Rand index
clust_stats$corrected.rand
## [1] 0.6201352
# VI
clust_stats$vi
## [1] 0.7477749
The corrected Rand index provides a measure for assessing the similarity between two partitions, adjusted for chance. Its range is -1 (no agreement) to 1 (perfect agreement). Agreement between the specie types and the cluster solution is 0.62 using Rand index and 0.748 using Meilas VI
The same analysis can be computed for both pam and hierarchical clustering:
# Agreement between species and pam clusters
table(iris$Species, pam.res$cluster)
##
## 1 2 3
## setosa 50 0 0
## versicolor 0 9 41
## virginica 0 36 14
cluster.stats(d = dist(iris.scaled),
species, pam.res$cluster)$vi
## [1] 0.7129034
# Agreement between species and HC clusters
table(iris$Species, res.hc$cluster)
##
## 1 2 3
## setosa 49 1 0
## versicolor 0 50 0
## virginica 0 24 26
cluster.stats(d = dist(iris.scaled),
species, res.hc$cluster)$vi
## [1] 0.6097098
External clustering validation, can be used to select suitable clustering algorithm for a given dataset.
7 Infos
This analysis has been performed using R software (ver. 3.2.1)
- Malika Charrad, Nadia Ghazzali, Veronique Boiteau, Azam Niknafs (2014). NbClust: An R Package for Determining the Relevant Number of Clusters in a Data Set. Journal of Statistical Software, 61(6), 1-36. URL http://www.jstatsoft.org/v61/i06/.
- Theodoridis S, Koutroubas K (2008). Pattern Recognition. 4th edition. Academic Press.