
ggplot2 - Easy way to mix multiple graphs on the same page - R software and data visualization



To arrange multiple ggplot2 graphs on the same page, the standard R functions - par() and layout() - cannot be used.

This R tutorial will show you, step by step, how to put several ggplots on a single page.

The functions grid.arrange() [in the gridExtra package] and plot_grid() [in the cowplot package] will be used.

Install and load required packages

Install and load the package gridExtra

install.packages("gridExtra")
library("gridExtra")

Install and load the package cowplot

cowplot can be installed as follows:

install.packages("cowplot")

OR

as follows using the devtools package (devtools should be installed before running the code below):

devtools::install_github("wilkelab/cowplot")

Load cowplot:

library("cowplot")

Prepare some data

The ToothGrowth data set is used:

df <- ToothGrowth
# Convert the variable dose from a numeric to a factor variable
df$dose <- as.factor(df$dose)
head(df)
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5

Cowplot: Publication-ready plots

The cowplot package is an extension of ggplot2 that can be used to produce publication-ready plots.

Basic plots

library(cowplot)
# Default plot
bp <- ggplot(df, aes(x=dose, y=len, color=dose)) +
  geom_boxplot() + 
  theme(legend.position = "none")
bp

# Add gridlines
bp + background_grid(major = "xy", minor = "none")


Recall that the function ggsave() [in the ggplot2 package] can be used to save ggplots. However, when working with cowplot, the function save_plot() [in the cowplot package] is preferred. It is an alternative to ggsave() with better support for multi-figure plots.

save_plot("mpg.pdf", plot.mpg,
          base_aspect_ratio = 1.3 # make room for figure legend
          )

Arranging multiple graphs using cowplot

# Scatter plot
sp <- ggplot(mpg, aes(x = cty, y = hwy, colour = factor(cyl)))+ 
  geom_point(size=2.5)
sp

# Bar plot
bp <- ggplot(diamonds, aes(clarity, fill = cut)) +
  geom_bar() +
  theme(axis.text.x = element_text(angle=70, vjust=0.5))
bp


Combine the two plots (the scatter plot and the bar plot):

plot_grid(sp, bp, labels=c("A", "B"), ncol = 2, nrow = 1)


The function draw_plot() can be used to place graphs at particular locations and with particular sizes. The format of the function is:

draw_plot(plot, x = 0, y = 0, width = 1, height = 1)
  • plot: the plot to place (ggplot2 or a gtable)
  • x: The x location of the lower left corner of the plot.
  • y: The y location of the lower left corner of the plot.
  • width, height: the width and the height of the plot

The function ggdraw() is used to initialize an empty drawing canvas.

plot.iris <- ggplot(iris, aes(Sepal.Length, Sepal.Width)) + 
  geom_point() + facet_grid(. ~ Species) + stat_smooth(method = "lm") +
  background_grid(major = 'y', minor = "none") + # add thin horizontal lines 
  panel_border() # and a border around each panel
# sp and bp were defined earlier
ggdraw() +
  draw_plot(plot.iris, 0, .5, 1, .5) +
  draw_plot(sp, 0, 0, .5, .5) +
  draw_plot(bp, .5, 0, .5, .5) +
  draw_plot_label(c("A", "B", "C"), c(0, 0, 0.5), c(1, 0.5, 0.5), size = 15)


grid.arrange: Create and arrange multiple plots

The R code below creates a box plot, a dot plot, a violin plot and a strip chart (jitter plot):

library(ggplot2)
# Create a box plot
bp <- ggplot(df, aes(x=dose, y=len, color=dose)) +
  geom_boxplot() + 
  theme(legend.position = "none")

# Create a dot plot
# Add the mean point and the standard deviation
dp <- ggplot(df, aes(x=dose, y=len, fill=dose)) +
  geom_dotplot(binaxis='y', stackdir='center')+
  stat_summary(fun.data=mean_sdl, fun.args = list(mult=1), 
                 geom="pointrange", color="red")+
   theme(legend.position = "none")

# Create a violin plot
vp <- ggplot(df, aes(x=dose, y=len)) +
  geom_violin()+
  geom_boxplot(width=0.1)

# Create a stripchart
sc <- ggplot(df, aes(x=dose, y=len, color=dose, shape=dose)) +
  geom_jitter(position=position_jitter(0.2))+
  theme_gray() + # apply the complete theme first, then remove the legend
  theme(legend.position = "none")

Combine the plots using the function grid.arrange() [in gridExtra]:

library(gridExtra)
grid.arrange(bp, dp, vp, sc, ncol=2, nrow =2)


grid.arrange() and arrangeGrob(): Change column/row span of a plot

Using the R code below:

  • The box plot will live in the first column
  • The dot plot and the strip chart will live in the second column
grid.arrange(bp, arrangeGrob(dp, sc), ncol = 2)


It’s also possible to use the argument layout_matrix in grid.arrange(). In the R code below, layout_matrix is a 3X2 matrix (three rows and two columns). The first column is all 1s: that’s where the first plot lives, spanning the three rows. The second column contains plots 2, 3 and 4, each occupying one row.

grid.arrange(bp, dp, sc, vp, ncol = 2, 
             layout_matrix = cbind(c(1,1,1), c(2,3,4)))


Add a common legend for multiple ggplot2 graphs

This can be done in four simple steps:

  1. Create the plots : p1, p2, ….
  2. Save the legend of the plot p1 as an external graphical element (called a “grob” in Grid terminology)
  3. Remove the legends from all plots
  4. Draw all the plots with only one legend in the right panel

To save the legend of a ggplot, the helper function below can be used :

library(gridExtra)
get_legend <- function(myggplot){
  tmp <- ggplot_gtable(ggplot_build(myggplot))
  leg <- which(sapply(tmp$grobs, function(x) x$name) == "guide-box")
  legend <- tmp$grobs[[leg]]
  return(legend)
}

(The function above is adapted from an online forum discussion.)

# 1. Create the plots
#++++++++++++++++++++++++++++++++++
# Create a box plot
bp <- ggplot(df, aes(x=dose, y=len, color=dose)) +
  geom_boxplot()

# Create a violin plot
vp <- ggplot(df, aes(x=dose, y=len, color=dose)) +
  geom_violin()+
  geom_boxplot(width=0.1)+
  theme(legend.position="none")

# 2. Save the legend
#+++++++++++++++++++++++
legend <- get_legend(bp)

# 3. Remove the legend from the box plot
#+++++++++++++++++++++++
bp <- bp + theme(legend.position="none")

# 4. Arrange ggplot2 graphs with a specific width
grid.arrange(bp, vp, legend, ncol=3, widths=c(2.3, 2.3, 0.8))


Change legend position

# 1. Create the plots
#++++++++++++++++++++++++++++++++++
# Create a box plot with a top legend position
bp <- ggplot(df, aes(x=dose, y=len, color=dose)) +
  geom_boxplot()+theme(legend.position = "top")

# Create a violin plot
vp <- ggplot(df, aes(x=dose, y=len, color=dose)) +
  geom_violin()+
  geom_boxplot(width=0.1)+
  theme(legend.position="none")

# 2. Save the legend
#+++++++++++++++++++++++
legend <- get_legend(bp)

# 3. Remove the legend from the box plot
#+++++++++++++++++++++++
bp <- bp + theme(legend.position="none")

# 4. Create a blank plot
blankPlot <- ggplot()+geom_blank(aes(1,1)) + 
  cowplot::theme_nothing()

The legend position can be changed by changing the order of the plots, as shown in the R code below. A grid with four cells (2X2) is created. The height of the legend zone is set to 0.2.

Top-left legend:

Top-left legend | Blank plot
Box plot        | Violin plot
# Top-left legend
grid.arrange(legend, blankPlot,  bp, vp,
             ncol=2, nrow = 2, 
             widths = c(2.7, 2.7), heights = c(0.2, 2.5))


Top-right legend:

Blank plot | Top-right legend
Box plot   | Violin plot
# Top-right
grid.arrange(blankPlot, legend,  bp, vp,
             ncol=2, nrow = 2, 
             widths = c(2.7, 2.7), heights = c(0.2, 2.5))


Bottom-left and bottom-right legends can be drawn as follows:

# Bottom-left legend
grid.arrange(bp, vp, legend, blankPlot,
             ncol=2, nrow = 2, 
             widths = c(2.7, 2.7), heights = c(2.5, 0.2))
# Bottom-right
grid.arrange( bp, vp, blankPlot, legend, 
             ncol=2, nrow = 2, 
             widths = c(2.7, 2.7), heights = c( 2.5, 0.2))

It’s also possible to use the argument layout_matrix to customize legend position. In the R code below, layout_matrix is a 2X2 matrix:

  • The first row (height = 2.5) is where the first plot (bp) and the second plot (vp) live
  • The second row (height = 0.2) is where the legend lives spanning 2 columns

Bottom-center legend:

grid.arrange(bp, vp, legend, ncol=2, nrow = 2, 
             layout_matrix = rbind(c(1,2), c(3,3)),
             widths = c(2.7, 2.7), heights = c(2.5, 0.2))


Top-center legend:

  • The legend (plot 1) lives in the first row (height = 0.2) spanning two columns
  • bp (plot 2) and vp (plot 3) live in the second row (height = 2.5)
grid.arrange(legend, bp, vp,  ncol=2, nrow = 2, 
             layout_matrix = rbind(c(1,1), c(2,3)),
             widths = c(2.7, 2.7), heights = c(0.2, 2.5))


Scatter plot with marginal density plots

Step 1/3. Create some data:

set.seed(1234)
x <- c(rnorm(500, mean = -1), rnorm(500, mean = 1.5))
y <- c(rnorm(500, mean = 1), rnorm(500, mean = 1.7))
group <- as.factor(rep(c(1,2), each=500))
df2 <- data.frame(x, y, group)
head(df2)
##             x          y group
## 1 -2.20706575 -0.2053334     1
## 2 -0.72257076  1.3014667     1
## 3  0.08444118 -0.5391452     1
## 4 -3.34569770  1.6353707     1
## 5 -0.57087531  1.7029518     1
## 6 -0.49394411 -0.9058829     1

Step 2/3. Create the plots:

# Scatter plot of x and y variables and color by groups
scatterPlot <- ggplot(df2,aes(x, y, color=group)) + 
  geom_point() + 
  scale_color_manual(values = c('#999999','#E69F00')) + 
  theme(legend.position=c(0,1), legend.justification=c(0,1))


# Marginal density plot of x (top panel)
xdensity <- ggplot(df2, aes(x, fill=group)) + 
  geom_density(alpha=.5) + 
  scale_fill_manual(values = c('#999999','#E69F00')) + 
  theme(legend.position = "none")

# Marginal density plot of y (right panel)
ydensity <- ggplot(df2, aes(y, fill=group)) + 
  geom_density(alpha=.5) + 
  scale_fill_manual(values = c('#999999','#E69F00')) + 
  theme(legend.position = "none")

Create a blank placeholder plot:

blankPlot <- ggplot()+geom_blank(aes(1,1))+
  theme(
    plot.background = element_blank(), 
   panel.grid.major = element_blank(),
   panel.grid.minor = element_blank(), 
   panel.border = element_blank(),
   panel.background = element_blank(),
   axis.title.x = element_blank(),
   axis.title.y = element_blank(),
   axis.text.x = element_blank(), 
   axis.text.y = element_blank(),
   axis.ticks = element_blank(),
   axis.line = element_blank()
     )

Step 3/3. Put the plots together:

Arrange the ggplots with adapted heights and widths for each row and column:

library("gridExtra")
grid.arrange(xdensity, blankPlot, scatterPlot, ydensity, 
        ncol=2, nrow=2, widths=c(4, 1.4), heights=c(1.4, 4))


Create a complex layout using the function viewport()

The different steps are :

  1. Create plots : p1, p2, p3, ….
  2. Move to a new page on a grid device using the function grid.newpage()
  3. Create a layout 2X2 - number of columns = 2; number of rows = 2
  4. Define a grid viewport: a rectangular region on a graphics device
  5. Print a plot into the viewport
# Move to a new page
grid.newpage()

# Create layout : nrow = 2, ncol = 2
pushViewport(viewport(layout = grid.layout(2, 2)))

# A helper function to define a region on the layout
define_region <- function(row, col){
  viewport(layout.pos.row = row, layout.pos.col = col)
} 

# Arrange the plots
print(scatterPlot, vp=define_region(1, 1:2))
print(xdensity, vp = define_region(2, 1))
print(ydensity, vp = define_region(2, 2))


ggExtra: Add marginal distributions plots to ggplot2 scatter plots

The ggExtra package, developed by Dean Attali, is an easy-to-use package for adding marginal histograms, boxplots or density plots to ggplot2 scatter plots.

The package can be installed and used as follows:

# Install
install.packages("ggExtra")
# Load
library("ggExtra")

# Create some data
set.seed(1234)
x <- c(rnorm(500, mean = -1), rnorm(500, mean = 1.5))
y <- c(rnorm(500, mean = 1), rnorm(500, mean = 1.7))
df3 <- data.frame(x, y)

# Scatter plot of x and y variables and color by groups
sp2 <- ggplot(df3,aes(x, y)) + geom_point()

# Marginal density plot
ggMarginal(sp2 + theme_gray())


# Marginal histogram plot
ggMarginal(sp2 + theme_gray(), type = "histogram",
           fill = "steelblue", col = "darkblue")


Insert an external graphical element inside a ggplot

The function annotation_custom() [in ggplot2] can be used for adding tables, plots or other grid-based elements. The simplified format is:

annotation_custom(grob, xmin, xmax, ymin, ymax)

  • grob: the external graphical element to display
  • xmin, xmax : x location in data coordinates (horizontal location)
  • ymin, ymax : y location in data coordinates (vertical location)


The different steps are :

  1. Create a scatter plot of y = f(x)
  2. Add, for example, the box plot of the variables x and y inside the scatter plot using the function annotation_custom()

As the inset box plot overlaps with some points, a transparent background is used for the box plots.

# Create a transparent theme object
transparent_theme <- theme(
 axis.title.x = element_blank(),
 axis.title.y = element_blank(),
 axis.text.x = element_blank(), 
 axis.text.y = element_blank(),
 axis.ticks = element_blank(),
 panel.grid = element_blank(),
 axis.line = element_blank(),
 panel.background = element_rect(fill = "transparent",colour = NA),
 plot.background = element_rect(fill = "transparent",colour = NA))

Create the graphs:

p1 <- scatterPlot # see previous sections for the scatterPlot

# Box plot of the x variable
p2 <- ggplot(df2, aes(factor(1), x))+
  geom_boxplot(width=0.3)+coord_flip()+
  transparent_theme

# Box plot of the y variable
p3 <- ggplot(df2, aes(factor(1), y))+
  geom_boxplot(width=0.3)+
  transparent_theme

# Create the external graphical elements
# called a "grop" in Grid terminology
p2_grob = ggplotGrob(p2)
p3_grob = ggplotGrob(p3)

# Insert p2_grob inside the scatter plot
xmin <- min(x); xmax <- max(x)
ymin <- min(y); ymax <- max(y)
p1 + annotation_custom(grob = p2_grob, xmin = xmin, xmax = xmax, 
                       ymin = ymin-1.5, ymax = ymin+1.5)


# Insert p3_grob inside the scatter plot
p1 + annotation_custom(grob = p3_grob,
                       xmin = xmin-1.5, xmax = xmin+1.5, 
                       ymin = ymin, ymax = ymax)


If you have a solution for inserting both p2_grob and p3_grob inside the scatter plot at the same time, please leave me a comment. I got some errors trying to do this…
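One possibility (a minimal sketch, not tested against the errors mentioned above) is simply to chain the two annotation_custom() calls on the same plot, reusing the placement coordinates from the two previous examples:

# Possible approach: add both grobs as two annotation_custom() layers
p1 +
  annotation_custom(grob = p2_grob, xmin = xmin, xmax = xmax,
                    ymin = ymin-1.5, ymax = ymin+1.5) +
  annotation_custom(grob = p3_grob, xmin = xmin-1.5, xmax = xmin+1.5,
                    ymin = ymin, ymax = ymax)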

Mix table, text and ggplot2 graphs

The functions below are required:

  • tableGrob() [in the gridExtra package]: for adding a data table to a graphic device
  • splitTextGrob() [in the RGraphics package]: for adding text to a graph

Make sure that the package RGraphics is installed.

library(RGraphics)
library(gridExtra)

# Table
p1 <- tableGrob(head(ToothGrowth))

# Text
text <- "ToothGrowth data describes the effect of Vitamin C on tooth growth in Guinea pigs.  Three dose levels of Vitamin C (0.5, 1, and 2 mg) with each of two delivery methods [orange juice (OJ) or ascorbic acid (VC)] are used."
p2 <- splitTextGrob(text)

# Box plot
p3 <- ggplot(df, aes(x=dose, y=len)) + geom_boxplot()

# Arrange the plots on the same page
grid.arrange(p1, p2, p3, ncol=1)


Infos

This analysis has been performed using R software (ver. 3.2.1) and ggplot2 (ver. 1.0.1)


How to choose the appropriate clustering algorithms for your data? - Unsupervised Machine Learning



Many clustering algorithms have been published in the literature.

For a given dataset, choosing the appropriate clustering method and the optimal number of clusters can be a hard task for the analyst.

As described in two of my previous articles (determining the optimal number of clusters and cluster validation statistics), there are more than 30 indices for assessing the goodness of clustering results and for identifying the best performing clustering algorithm for a particular dataset.


This article describes the R package clValid (G. Brock et al., 2008), which can be used to compare multiple clustering algorithms simultaneously, in a single function call, in order to identify the best clustering approach and the optimal number of clusters.


The package clValid contains 3 different types of clustering validation measures:

  • Clustering internal validation, which uses intrinsic information in the data to assess the quality of the clustering.
  • Clustering stability validation, which is a special version of internal validation. It evaluates the consistency of a clustering result by comparing it with the clusters obtained after each column is removed, one at a time.
  • Clustering biological validation, which evaluates the ability of a clustering algorithm to produce biologically meaningful clusters.

We’ll start by describing the different clustering validation measures in the package. Next, we’ll present the function clValid() and finally we’ll provide an R lab section for validating clustering results and comparing clustering algorithms.

1 Clustering validation measures in clValid package

1.1 Internal validation measures

The internal measures included in clValid package are:

  1. Connectivity
  2. Average Silhouette width
  3. Dunn index

These measures have already been described in my previous article: clustering validation statistics.

Briefly, connectivity indicates the degree of connectedness of the clusters, as determined by the k-nearest neighbors. Connectedness reflects the extent to which items are placed in the same cluster as their nearest neighbors in the data space. The connectivity has a value between 0 and infinity and should be minimized.

Silhouette width and Dunn index combine measures of compactness and separation of the clusters. Recall that the values of silhouette width range from -1 (poorly clustered observations) to 1 (well clustered observations). The Dunn index is the ratio between the smallest distance between observations not in the same cluster to the largest intra-cluster distance. It has a value between 0 and infinity and should be maximized.
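For illustration only, these internal measures can also be computed directly for a given partition. The sketch below is a minimal example on the built-in USArrests data (not the gene expression data analyzed later in this article):

# Minimal sketch: internal measures for a k-means partition of scaled USArrests
library(clValid)    # connectivity() and dunn()
library(cluster)    # silhouette()
df0 <- scale(USArrests)
km <- kmeans(df0, centers = 2, nstart = 25)
d <- dist(df0)
connectivity(distance = d, clusters = km$cluster)    # to be minimized
dunn(distance = d, clusters = km$cluster)            # to be maximized
summary(silhouette(km$cluster, d))$avg.width         # to be maximized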

1.2 Stability validation measures

The cluster stability measures include:

  • The average proportion of non-overlap (APN)
  • The average distance (AD)
  • The average distance between means (ADM)
  • The figure of merit (FOM)

The APN, AD, and ADM are all based on the cross-classification table of the original clustering with the clustering based on the removal of one column.

  • The APN measures the average proportion of observations not placed in the same cluster by clustering based on the full data and clustering based on the data with a single column removed.

  • The AD measures the average distance between observations placed in the same cluster under both cases (full dataset and removal of one column).

  • The ADM measures the average distance between cluster centers for observations placed in the same cluster under both cases.

  • The FOM measures the average intra-cluster variance of the deleted column, where the clustering is based on the remaining (undeleted) columns. It also has a value between zero and 1, and again smaller values are preferred.

The values of APN, ADM and FOM range from 0 to 1, with smaller values corresponding to highly consistent clustering results. AD has a value between 0 and infinity, and smaller values are also preferred.

1.3 Biological validation measures

Biological validation evaluates the ability of a clustering algorithm to produce biologically meaningful clusters. A typical application is microarray or RNA-seq data, where observations correspond to genes.

There are two biological measures:

  • The biological homogeneity index (BHI)
  • The biological stability index (BSI)

The BHI measures the average proportion of gene pairs that are clustered together and have matching biological functional classes.

The BSI is similar to the other stability measures, but inspects the consistency of clustering for genes with similar biological functionality. Each sample is removed one at a time, and the cluster membership for genes with similar functional annotation is compared with the cluster membership using all available samples.

2 R function clValid()

2.1 Format

The main function in clValid package is clValid():

clValid(obj, nClust, clMethods = "hierarchical",
        validation = "stability", maxitems = 600,
        metric = "euclidean", method = "average")

  • obj: A numeric matrix or data frame. Rows are the items to be clustered and columns are samples.
  • nClust: A numeric vector specifying the numbers of clusters to be evaluated. e.g., 2:10
  • clMethods: The clustering method to be used. Available options are “hierarchical”, “kmeans”, “diana”, “fanny”, “som”, “model”, “sota”, “pam”, “clara”, and “agnes”, with multiple choices allowed.
  • validation: The type of validation measures to be used. Allowed values are “internal”, “stability”, and “biological”, with multiple choices allowed.
  • maxitems: The maximum number of items (rows in matrix) which can be clustered.
  • metric: The metric used to determine the distance matrix. Possible choices are “euclidean”, “correlation”, and “manhattan”.
  • method: For hierarchical clustering (hclust and agnes), the agglomeration method to be used. Available choices are “ward”, “single”, “complete” and “average”.


2.2 Examples of usage

2.2.1 Data

We’ll use the mouse data [in the clValid package], which is an Affymetrix gene expression data set of mesenchymal cells from two distinct lineages (M and N). It contains 147 genes and 6 samples (3 samples for each lineage).

library(clValid)
# Load the data
data(mouse)
head(mouse)
##             ID       M1       M2       M3      NC1      NC2      NC3
## 1   1448995_at 4.706812 4.528291 4.325836 5.568435 6.915079 7.353144
## 2 1436392_s_at 3.867962 4.052354 3.474651 4.995836 5.056199 5.183585
## 3 1437434_a_at 2.875112 3.379619 3.239800 3.877053 4.459629 4.850978
## 4   1428922_at 5.326943 5.498930 5.629814 6.795194 6.535522 6.622577
## 5 1452671_s_at 5.370125 4.546810 5.704810 6.407555 6.310487 6.195847
## 6   1448147_at 3.471347 4.129992 3.964431 4.474737 5.185631 5.177967
##                       FC
## 1 Growth/Differentiation
## 2   Transcription factor
## 3          Miscellaneous
## 4          Miscellaneous
## 5          ECM/Receptors
## 6 Growth/Differentiation
# Extract gene expression data
exprs <- mouse[1:25,c("M1","M2","M3","NC1","NC2","NC3")]
rownames(exprs) <- mouse$ID[1:25]
head(exprs)
##                    M1       M2       M3      NC1      NC2      NC3
## 1448995_at   4.706812 4.528291 4.325836 5.568435 6.915079 7.353144
## 1436392_s_at 3.867962 4.052354 3.474651 4.995836 5.056199 5.183585
## 1437434_a_at 2.875112 3.379619 3.239800 3.877053 4.459629 4.850978
## 1428922_at   5.326943 5.498930 5.629814 6.795194 6.535522 6.622577
## 1452671_s_at 5.370125 4.546810 5.704810 6.407555 6.310487 6.195847
## 1448147_at   3.471347 4.129992 3.964431 4.474737 5.185631 5.177967

2.2.2 Compute clValid()

We start with internal cluster validation, which measures the connectivity, silhouette width and Dunn index. These internal measures can be computed simultaneously for multiple clustering algorithms, in combination with a range of cluster numbers. The R code below can be used:

# Compute clValid
clmethods <- c("hierarchical","kmeans","pam")
intern <- clValid(exprs, nClust = 2:6,
              clMethods = clmethods, validation = "internal")
# Summary
summary(intern)
##
## Clustering Methods:
##  hierarchical kmeans pam
##
## Cluster sizes:
##  2 3 4 5 6
##
## Validation Measures:
##                                  2       3       4       5       6
##
## hierarchical Connectivity   4.6159 11.5865 19.5075 22.2075 24.5044
##              Dunn           0.4217  0.2315  0.3068  0.3456  0.3456
##              Silhouette     0.5997  0.4529  0.4324  0.4007  0.3891
## kmeans       Connectivity   4.6159  9.5607 20.4774 23.1774 26.2242
##              Dunn           0.4217  0.3924  0.1360  0.1556  0.1778
##              Silhouette     0.5997  0.5495  0.4235  0.3871  0.3618
## pam          Connectivity   4.6159  9.5607 18.5925 25.0631 31.8381
##              Dunn           0.4217  0.3924  0.3068  0.3068  0.2511
##              Silhouette     0.5997  0.5495  0.4401  0.4297  0.3506
##
## Optimal Scores:
##
##              Score  Method       Clusters
## Connectivity 4.6159 hierarchical 2
## Dunn         0.4217 hierarchical 2
## Silhouette   0.5997 hierarchical 2

It can be seen that hierarchical clustering with two clusters performs the best in each case (i.e., for connectivity, Dunn and Silhouette measures).

The plots of the connectivity, Dunn index and silhouette width can be generated as follows:

plot(intern)


Recall that the connectivity should be minimized, while both the Dunn index and the silhouette width should be maximized.

Thus, it appears that hierarchical clustering outperforms the other clustering algorithms under each validation measure, for nearly every number of clusters evaluated.

Regardless of the clustering algorithm, the optimal number of clusters seems to be two using the three measures.

Stability measures can be computed as follows:

# Stability measures
clmethods <- c("hierarchical","kmeans","pam")
stab <- clValid(exprs, nClust = 2:6, clMethods = clmethods,
                validation = "stability")
# Display only optimal Scores
optimalScores(stab)
##         Score       Method Clusters
## APN 0.0000000 hierarchical        2
## AD  0.9642344          pam        6
## ADM 0.0000000 hierarchical        2
## FOM 0.3925939          pam        6

It’s also possible to display a complete summary:

summary(stab)

plot(stab)

For the APN and ADM measures, hierarchical clustering with two clusters again gives the best score. For the other measures, PAM with six clusters has the best score.

For biological cluster validation, read the documentation of clValid() (?clValid).

3 Infos

This analysis has been performed using R software (ver. 3.2.1)

  • Brock, G., Pihur, V., Datta, S. and Datta, S. (2008). clValid: An R Package for Cluster Validation. Journal of Statistical Software, 25(4). http://www.jstatsoft.org/v25/i04

Hybrid hierarchical k-means clustering for optimizing clustering outputs - Unsupervised Machine Learning



Clustering algorithms are used to split a dataset into several groups (i.e., clusters), so that the objects in the same group are as similar as possible and the objects in different groups are as dissimilar as possible.

The most popular clustering algorithms are k-means clustering and (agglomerative) hierarchical clustering.

However, each of these two standard clustering methods has its limitations. K-means clustering requires the user to specify the number of clusters in advance and selects initial centroids randomly. Agglomerative hierarchical clustering is good at identifying small clusters but not large ones.

In this article, we document hybrid approaches for easily mixing the best of k-means clustering and hierarchical clustering.

1 How this article is organized

We’ll start by demonstrating why we should combine k-means and hierarchical clustering. An application is provided using R software.

Finally, we’ll provide an easy-to-use R function (in the factoextra package) for computing hybrid hierarchical k-means clustering.

2 Required R packages

We’ll use the R package factoextra, which is very helpful for simplifying clustering workflows and for visualizing clusters using the ggplot2 plotting system.

Install the factoextra package as follows:

if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/factoextra")

Load the package:

library(factoextra)

3 Data preparation

We’ll use the USArrests dataset and we start by scaling the data:

# Load the data
data(USArrests)
# Scale the data
df <- scale(USArrests)
head(df)
##                Murder   Assault   UrbanPop         Rape
## Alabama    1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska     0.50786248 1.1068225 -1.2117642  2.484202941
## Arizona    0.07163341 1.4788032  0.9989801  1.042878388
## Arkansas   0.23234938 0.2308680 -1.0735927 -0.184916602
## California 0.27826823 1.2628144  1.7589234  2.067820292
## Colorado   0.02571456 0.3988593  0.8608085  1.864967207

If you want to understand why the data are scaled before the analysis, then you should read this section: Distances and scaling.

4 R function for clustering analyses

We’ll use the function eclust() [in factoextra] which provides several advantages as described in the previous chapter: Visual Enhancement of Clustering Analysis.

eclust() stands for enhanced clustering. It simplifies the workflow of clustering analysis and can be used for computing both hierarchical clustering and partitioning clustering in a single function call.

4.1 Example of k-means clustering

We’ll split the data into 4 clusters using k-means clustering as follows:

library("factoextra")
# K-means clustering
km.res <- eclust(df, "kmeans", k = 4,
                 nstart = 25, graph = FALSE)
# k-means group number of each observation
head(km.res$cluster, 15)
##     Alabama      Alaska     Arizona    Arkansas  California    Colorado 
##           4           3           3           4           3           3 
## Connecticut    Delaware     Florida     Georgia      Hawaii       Idaho 
##           2           2           3           4           2           1 
##    Illinois     Indiana        Iowa 
##           3           2           1
# Visualize k-means clusters
fviz_cluster(km.res,  frame.type = "norm", frame.level = 0.68)


# Visualize the silhouette of clusters
fviz_silhouette(km.res)
##   cluster size ave.sil.width
## 1       1   13          0.37
## 2       2   16          0.34
## 3       3   13          0.27
## 4       4    8          0.39


Note that the silhouette coefficient measures how well an observation is clustered, by comparing its average distance to the other members of its own cluster with its average distance to the nearest neighboring cluster. Observations with a negative silhouette coefficient are probably placed in the wrong cluster. Read more here: cluster validation statistics

Samples with negative silhouette coefficient:

# Silhouette width of observation
sil <- km.res$silinfo$widths[, 1:3]
# Objects with negative silhouette
neg_sil_index <- which(sil[, 'sil_width'] < 0)
sil[neg_sil_index, , drop = FALSE]
##          cluster neighbor   sil_width
## Missouri       3        2 -0.07318144

Read more about k-means clustering: K-means clustering

4.2 Example of hierarchical clustering

# Enhanced hierarchical clustering
res.hc <- eclust(df, "hclust", k = 4,
                method = "ward.D2", graph = FALSE) 
head(res.hc$cluster, 15)
##     Alabama      Alaska     Arizona    Arkansas  California    Colorado 
##           1           2           2           3           2           2 
## Connecticut    Delaware     Florida     Georgia      Hawaii       Idaho 
##           4           3           2           1           3           4 
##    Illinois     Indiana        Iowa 
##           2           3           4
# Dendrogram
fviz_dend(res.hc, rect = TRUE, show_labels = TRUE, cex = 0.5) 


# Visualize the silhouette of clusters
fviz_silhouette(res.hc)
##   cluster size ave.sil.width
## 1       1    7          0.40
## 2       2   12          0.26
## 3       3   18          0.38
## 4       4   13          0.35


It can be seen that three samples have a negative silhouette coefficient, indicating that they may not be in the right cluster. These samples are:

# Silhouette width of observation
sil <- res.hc$silinfo$widths[, 1:3]
# Objects with negative silhouette
neg_sil_index <- which(sil[, 'sil_width'] < 0)
sil[neg_sil_index, , drop = FALSE]
##             cluster neighbor    sil_width
## Alaska            2        1 -0.005212336
## Nebraska          4        3 -0.044172624
## Connecticut       4        3 -0.078016589

Read more about hierarchical clustering: Hierarchical clustering

5 Combining hierarchical clustering and k-means

5.1 Why?

Recall that, in the k-means algorithm, a random set of observations is chosen as the initial centers.

The final k-means clustering solution is very sensitive to this initial random selection of cluster centers. The result might be (slightly) different each time you compute k-means.

To avoid this, a solution is to use a hybrid approach that combines hierarchical clustering and k-means. This process is named hybrid hierarchical k-means clustering (hkmeans).
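As a quick illustration (a minimal sketch, assuming the scaled USArrests data stored in df above), two single-start k-means runs with different random seeds can be cross-tabulated; a scrambled table indicates that the two runs converged to different partitions:

# Compare two single-start k-means runs with different random seeds
set.seed(1); km_a <- kmeans(df, centers = 4)
set.seed(7); km_b <- kmeans(df, centers = 4)
# If the two solutions agree (up to relabeling), each row and column of the
# table contains a single non-zero cell; otherwise the partitions differ
table(km_a$cluster, km_b$cluster)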

5.2 How?

The procedure is as follows:

  1. Compute hierarchical clustering and cut the tree into k clusters
  2. Compute the center (i.e., the mean) of each cluster
  3. Compute k-means using the set of cluster centers (defined in step 2) as the initial cluster centers

Note that the k-means algorithm will improve the initial partitioning generated in step 1. Hence, the initial partitioning can be slightly different from the final partitioning obtained in step 3.
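In base R terms, the three steps correspond roughly to the following minimal sketch (assuming df is the scaled USArrests data prepared above); the sections below perform the same workflow with eclust() and hkmeans():

# Minimal base-R sketch of the three-step hybrid procedure
hc <- hclust(dist(df), method = "ward.D2")       # step 1: hierarchical clustering
grp <- cutree(hc, k = 4)                         #         cut the tree into 4 clusters
centers <- aggregate(df, list(grp), mean)[, -1]  # step 2: cluster centers (means)
km <- kmeans(df, centers = centers)              # step 3: k-means started from those centers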

5.3 R codes

5.3.1 Compute hierarchical clustering and cut the tree into k-clusters:

res.hc <- eclust(df, "hclust", k = 4,
                method = "ward.D2", graph = FALSE) 
grp <- res.hc$cluster

5.3.2 Compute the centers of clusters defined by hierarchical clustering:

Cluster centers are defined as the means of variables in clusters. The function aggregate() can be used to compute the mean per group in a data frame.

# Compute cluster centers
clus.centers <- aggregate(df, list(grp), mean)
clus.centers
##   Group.1     Murder    Assault   UrbanPop        Rape
## 1       1  1.5803956  0.9662584 -0.7775109  0.04844071
## 2       2  0.7298036  1.1188219  0.7571799  1.32135653
## 3       3 -0.3250544 -0.3231032  0.3733701 -0.17068130
## 4       4 -1.0745717 -1.1056780 -0.7972496 -1.00946922
# Remove the first column
clus.centers <- clus.centers[, -1]
clus.centers
##       Murder    Assault   UrbanPop        Rape
## 1  1.5803956  0.9662584 -0.7775109  0.04844071
## 2  0.7298036  1.1188219  0.7571799  1.32135653
## 3 -0.3250544 -0.3231032  0.3733701 -0.17068130
## 4 -1.0745717 -1.1056780 -0.7972496 -1.00946922

5.3.3 K-means clustering using hierarchical clustering defined cluster-centers

km.res2 <- eclust(df, "kmeans", k = clus.centers, graph = FALSE)
fviz_silhouette(km.res2)
##   cluster size ave.sil.width
## 1       1    8          0.39
## 2       2   13          0.27
## 3       3   16          0.34
## 4       4   13          0.37


5.3.4 Compare the results of hierarchical clustering and hybrid approach

The R code below compares the initial clusters defined using only hierarchical clustering and the final ones defined using hierarchical clustering + k-means:

# res.hc$cluster: Initial clusters defined using hierarchical clustering
# km.res2$cluster: Final clusters defined using k-means
table(km.res2$cluster, res.hc$cluster)
##    
##      1  2  3  4
##   1  7  0  1  0
##   2  0 12  1  0
##   3  0  0 15  1
##   4  0  0  1 12

It can be seen that 3 of the observations assigned to cluster 3 by hierarchical clustering have been reclassified to clusters 1, 2 and 4 in the final solution defined by k-means clustering.

The difference can be easily visualized using the function fviz_dend() [in factoextra]. The labels are colored using k-means clusters:

fviz_dend(res.hc, k = 4, 
          k_colors = c("blue", "green3", "red", "black"),
          label_cols =  km.res$cluster[res.hc$order], cex = 0.6)


It can be seen that the hierarchical clustering result has been improved by the k-means algorithm.

5.3.5 Compare the results of standard k-means clustering and hybrid approach

# Final clusters defined using hierarchical k-means clustering
km.clust <- km.res$cluster

# Standard k-means clustering
set.seed(123)
res.km <- kmeans(df, centers = 4, iter.max = 100)


# comparison
table(km.clust, res.km$cluster)
##         
## km.clust  1  2  3  4
##        1 13  0  0  0
##        2  0 16  0  0
##        3  0  0 13  0
##        4  0  0  0  8

In our current example, there was no further improvement of the k-means clustering result by the hybrid approach. An improvement might be observed using another dataset.

5.4 hkmeans(): Easy-to-use function for hybrid hierarchical k-means clustering

The function hkmeans() [in factoextra] can be used to easily compute hybrid hierarchical k-means clustering. The format of the result is similar to the one provided by the standard kmeans() function.

# Compute hierarchical k-means clustering
res.hk <- hkmeans(df, 4)
# Elements returned by hkmeans()
names(res.hk)
##  [1] "cluster"      "centers"      "totss"        "withinss"    
##  [5] "tot.withinss" "betweenss"    "size"         "iter"        
##  [9] "ifault"       "data"         "hclust"
# Print the results
res.hk
## Hierarchical K-means clustering with 4 clusters of sizes 8, 13, 16, 13
## 
## Cluster means:
##       Murder    Assault   UrbanPop        Rape
## 1  1.4118898  0.8743346 -0.8145211  0.01927104
## 2  0.6950701  1.0394414  0.7226370  1.27693964
## 3 -0.4894375 -0.3826001  0.5758298 -0.26165379
## 4 -0.9615407 -1.1066010 -0.9301069 -0.96676331
## 
## Clustering vector:
##        Alabama         Alaska        Arizona       Arkansas     California 
##              1              2              2              1              2 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              2              3              3              2              1 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              3              4              2              3              4 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              3              4              1              4              2 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              3              2              4              1              2 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              4              4              2              4              3 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              2              2              1              4              3 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              3              3              3              3              1 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              4              1              2              3              4 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              3              3              4              4              3 
## 
## Within cluster sum of squares by cluster:
## [1]  8.316061 19.922437 16.212213 11.952463
##  (between_SS / total_SS =  71.2 %)
## 
## Available components:
## 
##  [1] "cluster"      "centers"      "totss"        "withinss"    
##  [5] "tot.withinss" "betweenss"    "size"         "iter"        
##  [9] "ifault"       "data"         "hclust"
# Visualize the tree
fviz_dend(res.hk, cex = 0.6, rect = TRUE)


# Visualize the hkmeans final clusters
fviz_cluster(res.hk, frame.type = "norm", frame.level = 0.68)


6 Infos

This analysis has been performed using R software (ver. 3.2.1)

HCPC: Hierarchical clustering on principal components - Hybrid approach (2/2) - Unsupervised Machine Learning



There are three standard methods for exploring multidimensional data:

  1. Principal component methods, used to summarize and to visualize the information contained in a multivariate data table. Individuals and variables with similar profiles are grouped together in the plot. Principal component methods include PCA (for continuous variables) and CA/MCA (for categorical variables).
  2. Hierarchical Clustering, used for identifying groups of similar observations in a data set.
  3. Partitioning clustering such as k-means, used for splitting a data set into several groups.

In my previous article, Hybrid hierarchical k-means clustering, I described HOW and WHY we should combine hierarchical clustering and k-means clustering.

In the present article, I will show how to combine the three methods: principal component methods, hierarchical clustering and partitioning methods such as k-means to better describe and visualize the similarity between observations. The approach described here has been implemented in the R package FactoMineR (F. Husson et al., 2010). It’s named HCPC for Hierarchical Clustering on Principal Components.

1 Why combine principal component and clustering methods?

1.1 Case of continuous variables: Use PCA as denoising step

In the case of a multidimensional data set containing continuous variables, principal component analysis (PCA) can be used to reduce the dimensionality of the data to a few continuous variables (i.e., principal components) containing the most important information in the data.
The PCA step can be considered a denoising step, which can lead to more stable clustering. This is very useful if you have a large data set with many variables, such as gene expression data.

1.2 Case of categorical variables: Use CA or MCA before clustering

CA (for analyzing a contingency table formed by two categorical variables) and MCA (for analyzing multiple categorical variables) can be used to transform categorical variables into a small set of continuous variables (the principal components) and to remove the noise in the data.

CA and MCA can therefore be considered as pre-processing steps which make it possible to compute clustering on categorical data.

2 Algorithm of hierarchical clustering on principal components (HCPC)


  1. Compute principal component methods
  2. Compute hierarchical clustering: hierarchical clustering is performed using Ward’s criterion on the selected principal components. Ward’s criterion has to be used because, like principal component analysis, it is based on the multidimensional variance (i.e., inertia).

  3. Choose the number of clusters based on the hierarchical tree: An initial partitioning is performed by cutting the hierarchical tree.

  4. K-means clustering is performed to improve the initial partition obtained from hierarchical clustering. The final partitioning solution, obtained after consolidation with k-means, can be (slightly) different from the one obtained with the hierarchical clustering. The importance of combining hierarchical clustering and k-means clustering has been described in my previous post: Hybrid hierarchical k-means clustering


3 Computing HCPC in R

3.1 Required R packages

We’ll use FactoMineR for computing HCPC() and factoextra for data visualizations.

Install the factoextra package as follows:

if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/factoextra")

FactoMineR can be installed as follows:

install.packages("FactoMineR")

Load the packages:

library(factoextra)
library(FactoMineR)

3.2 R function for HCPC

The function HCPC() [in the FactoMineR package] can be used to compute hierarchical clustering on principal components.

A simplified format is:

HCPC(res, nb.clust = 0, iter.max = 10, min = 3, max = NULL, graph = TRUE)

  • res: a PCA result or a data frame
  • nb.clust: an integer specifying the number of clusters. Possible values are:
    • 0: the tree is cut at the level the user clicks on
    • -1: the tree is automatically cut at the suggested level
    • Any positive integer: the tree is cut with nb.clusters clusters
  • iter.max: the maximum number of iterations for k-means
  • min, max: the minimum and the maximum number of clusters to be generated, respectively
  • graph: if TRUE, graphics are displayed


3.3 Case of continuous variables

We start by performing a principal component analysis (PCA) on the data set. The argument ncp = 3 is used in the function PCA() to keep only the first three principal components. Next, HCPC is applied to the result of the PCA.

3.3.1 Data preparation

We’ll use the USArrests data set and we start by scaling the data:

# Load the data
data(USArrests)
# Scale the data
df <- scale(USArrests)
head(df)
##                Murder   Assault   UrbanPop         Rape
## Alabama    1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska     0.50786248 1.1068225 -1.2117642  2.484202941
## Arizona    0.07163341 1.4788032  0.9989801  1.042878388
## Arkansas   0.23234938 0.2308680 -1.0735927 -0.184916602
## California 0.27826823 1.2628144  1.7589234  2.067820292
## Colorado   0.02571456 0.3988593  0.8608085  1.864967207

If you want to understand why the data are scaled before the analysis, then you should read this section: Distances and scaling.

3.3.2 Compute principal component analysis

We’ll use the package FactoMineR for computing HCPC and factoextra for the visualization of the output.

# Compute principal component analysis
library(FactoMineR)
res.pca <- PCA(USArrests, ncp = 5, graph=FALSE)
# Percentage of information retained by each
# dimensions
library(factoextra)
fviz_eig(res.pca)


# Visualize variables
fviz_pca_var(res.pca)


# Visualize individuals
fviz_pca_ind(res.pca)


The first three dimensions of the PCA retain about 96% of the total variance (i.e., information) contained in the data:

get_eig(res.pca)
##       eigenvalue variance.percent cumulative.variance.percent
## Dim.1  2.4802416        62.006039                    62.00604
## Dim.2  0.9897652        24.744129                    86.75017
## Dim.3  0.3565632         8.914080                    95.66425
## Dim.4  0.1734301         4.335752                   100.00000

Read more about PCA: Principal Component Analysis (PCA)

3.3.3 Compute hierarchical clustering on the PCA results

The function HCPC() is used:

# Compute PCA with ncp = 3
res.pca <- PCA(USArrests, ncp = 3, graph = FALSE)
# Compute HCPC
res.hcpc <- HCPC(res.pca, graph = FALSE)

The function HCPC() returns a list containing:

  • data.clust: The original data with a supplementary column called clust containing the partition.
  • desc.var: The variables describing clusters
  • call$t$res: The outputs of the principal component analysis
  • call$t$tree: The outputs of agnes() function [in cluster package]
  • call$t$nb.clust: The number of optimal clusters estimated
# Data with cluster assignments
head(res.hcpc$data.clust, 10)
##             Murder Assault UrbanPop Rape clust
## Alabama       13.2     236       58 21.2     3
## Alaska        10.0     263       48 44.5     4
## Arizona        8.1     294       80 31.0     4
## Arkansas       8.8     190       50 19.5     3
## California     9.0     276       91 40.6     4
## Colorado       7.9     204       78 38.7     4
## Connecticut    3.3     110       77 11.1     2
## Delaware       5.9     238       72 15.8     2
## Florida       15.4     335       80 31.9     4
## Georgia       17.4     211       60 25.8     3
# Variable describing clusters
res.hcpc$desc.var
## $quanti.var
##               Eta2      P-value
## Assault  0.7841402 2.376392e-15
## Murder   0.7771455 4.927378e-15
## Rape     0.7029807 3.480110e-12
## UrbanPop 0.5846485 7.138448e-09
## 
## $quanti
## $quanti$`1`
##             v.test Mean in category Overall mean sd in category Overall sd
## UrbanPop -3.898420         52.07692       65.540       9.691087  14.329285
## Murder   -4.030171          3.60000        7.788       2.269870   4.311735
## Rape     -4.052061         12.17692       21.232       3.130779   9.272248
## Assault  -4.638172         78.53846      170.760      24.700095  82.500075
##               p.value
## UrbanPop 9.682222e-05
## Murder   5.573624e-05
## Rape     5.076842e-05
## Assault  3.515038e-06
## 
## $quanti$`2`
##             v.test Mean in category Overall mean sd in category Overall sd
## UrbanPop  2.793185         73.87500       65.540       8.652131  14.329285
## Murder   -2.374121          5.65625        7.788       1.594902   4.311735
##              p.value
## UrbanPop 0.005219187
## Murder   0.017590794
## 
## $quanti$`3`
##             v.test Mean in category Overall mean sd in category Overall sd
## Murder    4.357187          13.9375        7.788       2.433587   4.311735
## Assault   2.698255         243.6250      170.760      46.540137  82.500075
## UrbanPop -2.513667          53.7500       65.540       7.529110  14.329285
##               p.value
## Murder   1.317449e-05
## Assault  6.970399e-03
## UrbanPop 1.194833e-02
## 
## $quanti$`4`
##            v.test Mean in category Overall mean sd in category Overall sd
## Rape     5.352124         33.19231       21.232       6.996643   9.272248
## Assault  4.356682        257.38462      170.760      41.850537  82.500075
## UrbanPop 3.028838         76.00000       65.540      10.347798  14.329285
## Murder   2.913295         10.81538        7.788       2.001863   4.311735
##               p.value
## Rape     8.692769e-08
## Assault  1.320491e-05
## UrbanPop 2.454964e-03
## Murder   3.576369e-03
## 
## 
## attr(,"class")
## [1] "catdes" "list "

3.3.4 Visualize the results of HCPC using base plot

The function plot.HCPC() [in FactoMineR] is used:

plot(x, axes = c(1,2), choice = "3D.map", 
     draw.tree = TRUE, ind.names = TRUE, title = NULL,
     tree.barplot = TRUE, centers.plot = FALSE)

  • x: an object of class HCPC
  • axes: the principal components to be plotted
  • choice: a string. Possible values are:
    • “tree”: plots the tree (dendrogram)
    • “bar”: plots bars of inertia gains
    • “map”: plots a factor map. Individuals are colored by cluster
    • “3D.map”: plots the factor map. The tree is added on the plot
  • draw.tree: a logical value. If TRUE, the tree is plotted on the factor map if choice = “map”
  • ind.names: a logical value. If TRUE, individual names are shown
  • title: the title of the graph
  • tree.barplot: a logical value. If TRUE, the barplot of intra inertia losses is added on the tree graph.
  • centers.plot: a logical value. If TRUE, the centers of clusters are drawn on the factor maps


# Principal components + tree
plot(res.hcpc, choice = "3D.map")


# Plot the dendrogram only
plot(res.hcpc, choice ="tree", cex = 0.6)


# Draw only the factor map
plot(res.hcpc, choice ="map", draw.tree = FALSE)


# Remove labels and add cluster centers
plot(res.hcpc, choice ="map", draw.tree = FALSE,
     ind.names = FALSE, centers.plot = TRUE)


3.3.5 Visualize the results of HCPC using factoextra

The function fviz_cluster() can be used:

fviz_cluster(res.hcpc)


3.4 Case of categorical variables

Compute CA or MCA and then apply the function HCPC() to the results, as described above. If you want to learn more about CA and MCA, read the corresponding Easy Guides articles.
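As a minimal sketch (assuming the tea survey data shipped with FactoMineR and an arbitrary subset of its categorical variables), the workflow might look like this:

# MCA as a pre-processing step, then HCPC on the MCA results
library(FactoMineR)
data(tea)
# Keep a few active categorical variables for illustration
tea.active <- tea[, c("Tea", "How", "how", "sugar", "where", "lunch")]
res.mca <- MCA(tea.active, ncp = 20, graph = FALSE)
res.hcpc.mca <- HCPC(res.mca, graph = FALSE)
# Original data with the cluster assignment in the 'clust' column
head(res.hcpc.mca$data.clust)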

4 Infos

This analysis has been performed using R software (ver. 3.2.1)

  • Husson, F., Josse, J. and Pagès, J. (2010). Principal component methods - hierarchical clustering - partitional clustering: why would we need to choose for visualizing data? Technical report.

facto_summarize - Subset and summarize the output of factor analyses - R software and data mining



Description

Subset and summarize the results of Principal Component Analysis (PCA), Correspondence Analysis (CA) and Multiple Correspondence Analysis (MCA) functions from several packages.

The function facto_summarize() [in factoextra package] is used.

Install and load factoextra

The devtools package is required for the installation, as factoextra is hosted on GitHub.

# install.packages("devtools")
devtools::install_github("kassambara/factoextra")

Load factoextra :

library("factoextra")

Usage

facto_summarize(X, element, result = c("coord", "cos2", "contrib"),
                axes = 1:2, select = NULL)

Arguments

  • X: an object of class PCA, CA or MCA [FactoMineR]; prcomp or princomp [stats]; dudi, pca, coa or acm [ade4]; ca [ca package].
  • element: allowed values are “row” and “col” for CA; “var” and “ind” for PCA or MCA.
  • result: the result to be extracted for the element. Possible values are a combination of c(“coord”, “cos2”, “contrib”).
  • axes: a numeric vector specifying the axes of interest. Default is 1:2 for axes 1 and 2.
  • select: a selection of variables. Allowed values are NULL or a list containing the arguments name, cos2 or contrib. Default is list(name = NULL, cos2 = NULL, contrib = NULL):
    • name: a character vector containing the variable names to be selected
    • cos2: if cos2 is in [0, 1], e.g. 0.6, then variables with a cos2 > 0.6 are selected; if cos2 > 1, e.g. 5, then the top 5 variables with the highest cos2 are selected
    • contrib: if contrib > 1, e.g. 5, then the top 5 variables with the highest contrib are selected

Details

If length(axes) > 1, then the columns contrib and cos2 correspond to the total contributions and total cos2 over the selected axes. In this case, the column coord is calculated as x^2 + y^2 + …, where x, y, … are the coordinates of the points on the specified axes.

Value

A data frame containing the (total) coord, cos2 and the contribution for the axes.

Examples

Principal component analysis

A principal component analysis (PCA) is performed using the built-in R function prcomp() and the decathlon2 [in factoextra] data set.

data(decathlon2)
decathlon2.active <- decathlon2[1:23, 1:10]
res.pca <- prcomp(decathlon2.active,  scale = TRUE)
# Summarize variables on axes 1:2
facto_summarize(res.pca, "var", axes = 1:2)[,-1]
                    Dim.1       Dim.2     coord      cos2  contrib
X100m        -0.850625692  0.17939806 0.7557477 0.7557477 75.57477
Long.jump     0.794180641 -0.28085695 0.7096035 0.7096035 70.96035
Shot.put      0.733912733 -0.08540412 0.5459218 0.5459218 54.59218
High.jump     0.610083985  0.46521415 0.5886267 0.5886267 58.86267
X400m        -0.701603377 -0.29017826 0.5764507 0.5764507 57.64507
X110m.hurdle -0.764125197  0.02474081 0.5844994 0.5844994 58.44994
Discus        0.743209016 -0.04966086 0.5548258 0.5548258 55.48258
Pole.vault   -0.217268042 -0.80745110 0.6991827 0.6991827 69.91827
Javeline      0.428226639 -0.38610928 0.3324584 0.3324584 33.24584
X1500m        0.004278487 -0.78448019 0.6154275 0.6154275 61.54275
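# As a check of the "Details" section above, the coord column is simply the
# sum of the squared coordinates on the selected axes; for example, for X100m:
(-0.850625692)^2 + (0.17939806)^2  # = 0.7557477, matching the coord column above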
# Select the top 5 contributing variables
facto_summarize(res.pca, "var", axes = 1:2,
           select = list(contrib = 5))[,-1]
                  Dim.1      Dim.2     coord      cos2  contrib
X100m      -0.850625692  0.1793981 0.7557477 0.7557477 75.57477
Long.jump   0.794180641 -0.2808570 0.7096035 0.7096035 70.96035
Pole.vault -0.217268042 -0.8074511 0.6991827 0.6991827 69.91827
X1500m      0.004278487 -0.7844802 0.6154275 0.6154275 61.54275
High.jump   0.610083985  0.4652142 0.5886267 0.5886267 58.86267
# Select variables with cos2 >= 0.6
facto_summarize(res.pca, "var", axes = 1:2,
           select = list(cos2 = 0.6))[,-1]
                  Dim.1      Dim.2     coord      cos2  contrib
X100m      -0.850625692  0.1793981 0.7557477 0.7557477 75.57477
Long.jump   0.794180641 -0.2808570 0.7096035 0.7096035 70.96035
Pole.vault -0.217268042 -0.8074511 0.6991827 0.6991827 69.91827
X1500m      0.004278487 -0.7844802 0.6154275 0.6154275 61.54275
# Select by names
facto_summarize(res.pca, "var", axes = 1:2,
     select = list(name = c("X100m", "Discus", "Javeline")))[,-1]
              Dim.1       Dim.2     coord      cos2  contrib
X100m    -0.8506257  0.17939806 0.7557477 0.7557477 75.57477
Discus    0.7432090 -0.04966086 0.5548258 0.5548258 55.48258
Javeline  0.4282266 -0.38610928 0.3324584 0.3324584 33.24584
# Summarize individuals on axes 1:2
facto_summarize(res.pca, "ind", axes = 1:2)[,-1]
                 Dim.1      Dim.2      coord      cos2   contrib
SEBRLE       0.1912074 -1.5541282  2.4518746 0.5050034 10.660324
CLAY         0.7901217 -2.4204156  6.4827039 0.5057178 28.185669
BERNARD     -1.3292592 -1.6118687  4.3650507 0.4871654 18.978481
YURKOV      -0.8694134  0.4328779  0.9432630 0.1199355  4.101143
ZSIVOCZKY   -0.1057450  2.0233632  4.1051806 0.5779938 17.848611
McMULLEN     0.1185550  0.9916237  0.9973729 0.1543704  4.336404
MARTINEAU   -2.3923532  1.2849234  7.3743818 0.5205607 32.062530
HERNU       -1.8910497 -1.1784614  4.9648401 0.5543447 21.586261
BARRAS      -1.7744575  0.4125321  3.3188820 0.6495490 14.429922
NOOL        -2.7770058  1.5726757 10.1850700 0.6469840 44.282913
BOURGUIGNON -4.4137335 -1.2635770 21.0776704 0.9301572 91.642045
Sebrle       3.4514485 -1.2169193 13.3933893 0.7593400 58.232127
Clay         3.3162243 -1.6232908 13.6324164 0.8523470 59.271375
Karpov       4.0703560  0.7983510 17.2051623 0.8138146 74.805053
Macey        1.8484623  2.0638828  7.6764252 0.8165181 33.375762
Warners      1.3873514 -0.2819083  2.0042163 0.2662078  8.713984
Zsivoczky    0.4715533  0.9267436  1.0812163 0.2190667  4.700940
Hernu        0.2763118  1.1657260  1.4352654 0.4666709  6.240284
Bernard      1.3672590  1.4780354  4.0539857 0.6274807 17.626025
Schwarzl    -0.7102777 -0.6584251  0.9380181 0.2170229  4.078340
Pogorelov   -0.2143524 -0.8610557  0.7873639 0.1337231  3.423321
Schoenbeck  -0.4953166 -1.3000530  1.9354762 0.5291161  8.415114
Barras      -0.3158867  0.8193681  0.7711485 0.1466237  3.352820

Correspondence Analysis

The function CA() in FactoMineR package is used:

# Install and load FactoMineR to compute CA
# install.packages("FactoMineR")
library("FactoMineR")
data("housetasks")
res.ca <- CA(housetasks, graph = FALSE)
# Summarize row variables on axes 1:2
facto_summarize(res.ca, "row", axes = 1:2)[,-1]
                Dim.1      Dim.2     coord      cos2   contrib
Laundry    -0.9918368  0.4953220 1.2290841 0.9245395 12.403601
Main_meal  -0.8755855  0.4901092 1.0068569 0.9739621  8.833091
Dinner     -0.6925740  0.3081043 0.5745869 0.9303433  3.558222
Breakfeast -0.5086002  0.4528038 0.4637054 0.9051733  3.722406
Tidying    -0.3938084 -0.4343444 0.3437401 0.9748275  2.404604
Dishes     -0.1889641 -0.4419662 0.2310416 0.7642703  1.497001
Shopping   -0.1176813 -0.4033171 0.1765136 0.8113088  1.214543
Official    0.2266324  0.2536132 0.1156819 0.1194711  0.636781
Driving     0.7417696  0.6534143 0.9771724 0.7672477  7.788243
Finances    0.2707669 -0.6178684 0.4550760 0.9973464  2.948600
Insurance   0.6470759 -0.4737832 0.6431778 0.8848140  5.126245
Repairs     1.5287787  0.8642647 3.0841176 0.9326072 29.178865
Holidays    0.2524863 -1.4350066 2.1229933 0.9921522 19.477003
# Summarize column variables on axes 1:2
facto_summarize(res.ca, "col", axes = 1:2)[,-1]
                  Dim.1      Dim.2      coord      cos2  contrib
Wife        -0.83762154  0.3652207 0.83499601 0.9543242 28.72693
Alternating -0.06218462  0.2915938 0.08889388 0.1098815  1.29467
Husband      1.16091847  0.6019199 1.71003929 0.9795683 37.35808
Jointly      0.14942609 -1.0265791 1.07619274 0.9979998 31.40952

Multiple Correspondence Analysis

The function MCA() in FactoMineR package is used:

library(FactoMineR)
data(poison)
res.mca <- MCA(poison, quanti.sup = 1:2,
              quali.sup = 3:4, graph=FALSE)
# Summarize variables on axes 1:2
res <- facto_summarize(res.mca, "var", axes = 1:2)
head(res)
             name      Dim.1       Dim.2      coord      cos2   contrib
Nausea_n Nausea_n  0.2673909  0.12139029 0.08623348 0.3090033 0.6128991
Nausea_y Nausea_y -0.9581506 -0.43498187 1.10726185 0.3090033 2.1962218
Vomit_n   Vomit_n  0.4790279 -0.40919465 0.39690803 0.5953620 2.1649529
Vomit_y   Vomit_y -0.7185419  0.61379197 0.89304306 0.5953620 3.2474293
Abdo_n     Abdo_n  1.3180221 -0.03574501 1.73845988 0.8457372 5.1722773
Abdo_y     Abdo_y -0.6411999  0.01738946 0.41143974 0.8457372 2.5162430
# Summarize individuals on axes 1:2
res <- facto_summarize(res.mca, "ind", axes = 1:2)
head(res)
  name      Dim.1       Dim.2     coord       cos2   contrib
1    1 -0.4525811 -0.26415072 0.2746052 0.46457063 0.4992822
2    2  0.8361700 -0.03193457 0.7002000 0.55670644 1.2730909
3    3 -0.4481892  0.13538726 0.2192032 0.59815656 0.3985513
4    4  0.8803694 -0.08536230 0.7823370 0.75476958 1.4224310
5    5 -0.4481892  0.13538726 0.2192032 0.59815656 0.3985513
6    6 -0.3594324 -0.43604390 0.3193260 0.06143111 0.5805927

Infos

This analysis has been performed using R software (ver. 3.1.2) and factoextra (ver. 1.0.2)

fviz_ca: Quick Correspondence Analysis data visualization using factoextra - R software and data mining



Description

Graph of column/row variables from the output of Correspondence Analysis (CA).

The following functions, from the factoextra package, are used:

  • fviz_ca_row(): Graph of row variables
  • fviz_ca_col(): Graph of column variables
  • fviz_ca_biplot(): Biplot of row and column variables
  • fviz_ca(): An alias of fviz_ca_biplot()

These functions are included in factoextra package.

Install and load factoextra

The package devtools is required for the installation as factoextra is hosted on github.

# install.packages("devtools")
library("devtools")
install_github("kassambara/factoextra")

Load factoextra:

library("factoextra")

Usage

# Graph of row variables
fviz_ca_row(X, axes = c(1, 2), shape.row = 19,
  geom = c("point", "text"), label = "all", 
  invisible = "none", labelsize = 4, pointsize = 2,
  col.row = "blue", col.row.sup = "darkblue", alpha.row = 1,
  select.row = list(name = NULL, cos2 = NULL, contrib = NULL),
  map = "symmetric",
  jitter = list(what = "label", width = NULL, height = NULL), ...)

# Graph of column variables
fviz_ca_col(X, axes = c(1, 2), shape.col = 17,
  geom = c("point", "text"), label = "all",
  invisible = "none", labelsize = 4, pointsize = 2,
  col.col = "red", col.col.sup = "darkred", alpha.col = 1,
  select.col = list(name = NULL, cos2 = NULL, contrib = NULL),
  map = "symmetric",
 jitter = list(what = "label", width = NULL, height = NULL), ...)

# Biplot of row and column  variables
fviz_ca_biplot(X, axes = c(1, 2), shape.row = 19, shape.col = 17,
  geom = c("point", "text"), label = "all", invisible = "none",
  labelsize = 4, pointsize = 2, col.col = "red",
  col.col.sup = "darkred", alpha.col = 1, col.row = "blue",
  col.row.sup = "darkblue", alpha.row = 1,
  select.col = list(name = NULL, cos2 = NULL, contrib = NULL),
  select.row = list(name = NULL, cos2 = NULL, contrib = NULL),
  map = "symmetric", arrows = c(FALSE, FALSE),
  jitter = list(what = "label", width = NULL, height = NULL), ...)


# An alias of fviz_ca_biplot()
fviz_ca(X, ...)

Arguments

Argument Description
jitter a parameter used to jitter the points in order to reduce overplotting. It’s a list containing the objects what, width and height (Ex.: jitter = list(what, width, height)). what: the element to be jittered. Possible values are “point” or “p”; “label” or “l”; “both” or “b”. width: degree of jitter in x direction (ex: 0.2). height: degree of jitter in y direction (ex: 0.2).
alpha.col,alpha.row controls the transparency of colors. The value can variate from 0 (total transparency) to 1 (no transparency). Default value is 1. Allowed values include also : “cos2”, “contrib”, “coord”, “x” or “y”, as for the arguments col.col and col.row.
X an object of class CA [FactoMineR], ca [ca], coa [ade4]; correspondence [MASS].
axes a numeric vector of length 2 specifying the dimensions to be plotted.
shape.row,shape.col the point shapes to be used for row/column variables. Default values are 19 for rows and 17 for columns.
geom a text specifying the geometry to be used for the graph. Allowed values are the combination of c(“point”, “arrow”, “text”). Use “point” (to show only points); “text” to show only labels; c(“point”, “text”) or c(“arrow”, “text”) to show both types.
label a character vector specifying the elements to be labelled. Default value is “all”. Allowed values are “none” or the combination of c(“row”, “row.sup”, “col”, “col.sup”). Use “col” to label only active column variables; “col.sup” to label only supplementary columns; etc
invisible a character value specifying the elements to be hidden on the plot. Default value is “none”. Allowed values are the combination of c(“row”, “row.sup”, “col”, “col.sup”).
labelsize font size for the labels.
pointsize the size of points.
map character string specifying the map type. Allowed options include: “symmetric”, “rowprincipal”, “colprincipal”, “symbiplot”, “rowgab”, “colgab”, “rowgreen” and “colgreen”. See details
col.col,col.row color for column/row points. The default values are “red” and “blue”, respectively. Allowed values include also : “cos2”, “contrib”, “coord”, “x” or “y”. In this case, the colors for row/column variables are automatically controlled by their qualities (“cos2”), contributions (“contrib”), coordinates (x^2 + y^2, “coord”), x values(“x”) or y values(“y”)
col.col.sup,col.row.sup colors for the supplementary column and row points, respectively.
select.col,select.row

a selection of columns/rows to be drawn. Allowed values are NULL or a list containing the arguments name, cos2 or contrib:

  • name: is a character vector containing columns/rows to be drawn
  • cos2: if cos2 is in [0, 1], ex: 0.6, then columns/rows with a cos2 > 0.6 are drawn. if cos2 > 1, ex: 5, then the top 5 columns/rows with the highest cos2 are drawn.
  • contrib: if contrib > 1, ex: 5, then the top 5 columns/rows with the highest contrib are drawn
arrows Vector of two logicals specifying if the plot should contain points (FALSE, default) or arrows (TRUE). First value sets the rows and the second value sets the columns.
… Optional arguments.

Details

The default plot of CA is a “symmetric” plot in which both rows and columns are in principal coordinates. In this situation, it’s not possible to interpret the distance between row points and column points. To overcome this problem, the simplest way is to make an asymmetric plot. This means that, the column profiles must be presented in row space or vice-versa. The allowed options for the argument map are:

  • “rowprincipal” or “colprincipal”: asymmetric plots with either rows in principal coordinates and columns in standard coordinates, or vice versa. These plots preserve row metric or column metric respectively.

  • “symbiplot”: Both rows and columns are scaled to have variances equal to the singular values (square roots of eigenvalues), which gives a symmetric biplot but does not preserve row or column metrics.

  • “rowgab” or “colgab”: Asymmetric maps, proposed by Gabriel & Odoroff (1990), with rows (respectively, columns) in principal coordinates and columns (respectively, rows) in standard coordinates multiplied by the mass of the corresponding point.

  • “rowgreen” or “colgreen”: The so-called contribution biplots showing visually the most contributing points (Greenacre 2006b). These are similar to “rowgab” and “colgab” except that the points in standard coordinates are multiplied by the square root of the corresponding masses, giving reconstructions of the standardized residuals.
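
As an illustration (not part of the original article), a contribution biplot can be requested through the map argument. This minimal sketch assumes the housetasks data and the CA() call used in the Examples section below:

library("FactoMineR")
library("factoextra")
data("housetasks")
res.ca <- CA(housetasks, graph = FALSE)
# Contribution biplot: rows in principal coordinates, columns rescaled
# by the square roots of their masses; arrows are drawn for the columns
fviz_ca_biplot(res.ca, map = "rowgreen", arrows = c(FALSE, TRUE))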

Value

A ggplot2 plot

Examples

Correspondence Analysis

Correspondence Analysis (CA) is performed using the function CA() [in FactoMineR] and housetasks data [in factoextra]:

# Install and load FactoMineR to compute CA
# install.packages("FactoMineR")
library("FactoMineR")
data(housetasks)
head(housetasks)
           Wife Alternating Husband Jointly
Laundry     156          14       2       4
Main_meal   124          20       5       4
Dinner       77          11       7      13
Breakfeast   82          36      15       7
Tidying      53          11       1      57
Dishes       32          24       4      53
res.ca <- CA(housetasks, graph=FALSE)

fviz_ca_row(): Graph of row variables

# Default plot
fviz_ca_row(res.ca)

Quick Correspondence Analysis data visualization using factoextra - R software and data mining

# Change title and axis labels
fviz_ca_row(res.ca) +
 labs(title = "CA", x = "Dim.1", y ="Dim.2" )

Quick Correspondence Analysis data visualization using factoextra - R software and data mining

# Change axis limits by specifying the min and max
fviz_ca_row(res.ca) +
   xlim(-1.3, 1.7) + ylim (-1.5, 1)

Quick Correspondence Analysis data visualization using factoextra - R software and data mining

# Use text only
fviz_ca_row(res.ca, geom = "text")

Quick Correspondence Analysis data visualization using factoextra - R software and data mining

# Use points only
fviz_ca_row(res.ca, geom="point")

Quick Correspondence Analysis data visualization using factoextra - R software and data mining

# Change the size of points
fviz_ca_row(res.ca, geom="point", pointsize = 4)

Quick Correspondence Analysis data visualization using factoextra - R software and data mining

# Change point color and theme
fviz_ca_row(res.ca, col.row = "violet")+
   theme_minimal()

Quick Correspondence Analysis data visualization using factoextra - R software and data mining

# Control automatically the color of row points
# using the cos2 or the contributions
# cos2 = the quality of the rows on the factor map
fviz_ca_row(res.ca, col.row="cos2")

Quick Correspondence Analysis data visualization using factoextra - R software and data mining

# Gradient color
fviz_ca_row(res.ca, col.row="cos2") +
      scale_color_gradient2(low="white", mid="blue",
      high="red", midpoint=0.5)

Quick Correspondence Analysis data visualization using factoextra - R software and data mining

# Change the theme and use only points
fviz_ca_row(res.ca, col.row="cos2", geom = "point") +
      scale_color_gradient2(low="white", mid="blue",
      high="red", midpoint=0.4)+ theme_minimal()

Quick Correspondence Analysis data visualization using factoextra - R software and data mining

# Color by the contributions
fviz_ca_row(res.ca, col.row="contrib") +
      scale_color_gradient2(low="white", mid="blue",
      high="red", midpoint=10)

Quick Correspondence Analysis data visualization using factoextra - R software and data mining

# Control the transparency of the color by the
# contributions
fviz_ca_row(res.ca, alpha.row="contrib") +
     theme_minimal()

Quick Correspondence Analysis data visualization using factoextra - R software and data mining

# Select and visualize rows with cos2 > 0.5
fviz_ca_row(res.ca, select.row = list(cos2 = 0.5))

Quick Correspondence Analysis data visualization using factoextra - R software and data mining

# Select the top 7 according to the cos2
fviz_ca_row(res.ca, select.row = list(cos2 = 7))

Quick Correspondence Analysis data visualization using factoextra - R software and data mining

# Select the top 7 contributing rows
fviz_ca_row(res.ca, select.row = list(contrib = 7))

Quick Correspondence Analysis data visualization using factoextra - R software and data mining

# Select by names
fviz_ca_row(res.ca,
select.row = list(name = c("Breakfeast", "Repairs", "Holidays")))

Quick Correspondence Analysis data visualization using factoextra - R software and data mining

fviz_ca_col(): Graph of column categories

# Default plot
fviz_ca_col(res.ca)

Quick Correspondence Analysis data visualization using factoextra - R software and data mining

# Change color and theme
fviz_ca_col(res.ca, col.col="steelblue")+
 theme_minimal()

Quick Correspondence Analysis data visualization using factoextra - R software and data mining

# Control colors using their contributions
fviz_ca_col(res.ca, col.col = "contrib")+
 scale_color_gradient2(low = "white", mid = "blue",
           high = "red", midpoint = 25) +
 theme_minimal()

Quick Correspondence Analysis data visualization using factoextra - R software and data mining

# Control the transparency of variables using their contributions
fviz_ca_col(res.ca, alpha.col = "contrib") +
   theme_minimal()

Quick Correspondence Analysis data visualization using factoextra - R software and data mining

# Select and visualize columns with cos2 >= 0.4
fviz_ca_col(res.ca, select.col = list(cos2 = 0.4))

Quick Correspondence Analysis data visualization using factoextra - R software and data mining

# Select the top 3 contributing columns
fviz_ca_col(res.ca, select.col = list(contrib = 3))

Quick Correspondence Analysis data visualization using factoextra - R software and data mining

# Select by names
fviz_ca_col(res.ca,
 select.col= list(name = c("Wife", "Husband", "Jointly")))

Quick Correspondence Analysis data visualization using factoextra - R software and data mining

fviz_ca_biplot(): Biplot of rows and columns

# Symmetric biplot of rows and columns
fviz_ca_biplot(res.ca)

Quick Correspondence Analysis data visualization using factoextra - R software and data mining

# Asymmetric biplot, use arrows for columns
fviz_ca_biplot(res.ca, map = "rowprincipal",
 arrows = c(FALSE, TRUE))

Quick Correspondence Analysis data visualization using factoextra - R software and data mining

# Keep only the labels for row points
fviz_ca_biplot(res.ca, label ="row")

Quick Correspondence Analysis data visualization using factoextra - R software and data mining

# Keep only labels for column points
fviz_ca_biplot(res.ca, label ="col")

Quick Correspondence Analysis data visualization using factoextra - R software and data mining

# Hide row points
fviz_ca_biplot(res.ca, invisible ="row")

Quick Correspondence Analysis data visualization using factoextra - R software and data mining

# Hide column points
fviz_ca_biplot(res.ca, invisible ="col")

Quick Correspondence Analysis data visualization using factoextra - R software and data mining

# Control automatically the color of rows using the cos2
fviz_ca_biplot(res.ca, col.row="cos2") +
       theme_minimal()

Quick Correspondence Analysis data visualization using factoextra - R software and data mining

# Select the top 7 contributing rows
# And the top 3 columns
fviz_ca_biplot(res.ca,
               select.row = list(contrib = 7),
               select.col = list(contrib = 3))

Quick Correspondence Analysis data visualization using factoextra - R software and data mining

Infos

This analysis has been performed using R software (ver. 3.2.1) and factoextra (ver. 1.0.3)

DBSCAN: density-based clustering for discovering clusters in large datasets with noise - Unsupervised Machine Learning



1 Concepts of density-based clustering

Partitioning methods (K-means, PAM clustering) and hierarchical clustering are suitable for finding spherical-shaped or convex clusters. In other words, they work well for compact and well-separated clusters. Moreover, they are severely affected by the presence of noise and outliers in the data.

Unfortunately, real-life data can contain: i) clusters of arbitrary shape such as those shown in the figure below (oval, linear and “S”-shaped clusters); ii) many outliers and noise.

The figure below shows a dataset containing nonconvex clusters and outliers/noise. The simulated dataset multishapes [in factoextra package] is used.

DBSCAN: density-based clustering

The plot above contains 5 clusters and outliers, including:

  • 2 oval clusters
  • 2 linear clusters
  • 1 compact cluster

Given such data, the k-means algorithm has difficulty identifying these clusters of arbitrary shape. To illustrate this situation, the following R code computes the K-means algorithm on the dataset multishapes [in factoextra package]. The function fviz_cluster() [in factoextra] is used to visualize the clusters.

The latest version of factoextra can be installed using the following R code:

if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/factoextra")

Compute and visualize k-means clustering using the dataset multishapes:

library(factoextra)

data("multishapes")
df <- multishapes[, 1:2]
set.seed(123)
km.res <- kmeans(df, 5, nstart = 25)
fviz_cluster(km.res, df, frame = FALSE, geom = "point")

DBSCAN: density-based clustering

We know there are five clusters in the data, but it can be seen that the k-means method identifies them inaccurately.

This chapter describes DBSCAN, a density-based clustering algorithm introduced in Ester et al. 1996, which can be used to identify clusters of any shape in a data set containing noise and outliers. DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise.

The advantages of DBSCAN are:


  1. Unlike K-means, DBSCAN does not require the user to specify the number of clusters to be generated
  2. DBSCAN can find clusters of any shape; a cluster does not have to be circular.
  3. DBSCAN can identify outliers


The basic idea behind the density-based clustering approach is derived from the intuitive way humans group points visually. For instance, by looking at the figure below, one can easily identify four clusters along with several points of noise, because of the differences in the density of points.

Density based clustering basic idea
(From Ester et al. 1996)


As illustrated in the figure above, clusters are dense regions in the data space, separated by regions of lower density of points. In other words, the density of points in a cluster is considerably higher than the density of points outside the cluster (“areas of noise”).

DBSCAN is based on this intuitive notion of “clusters” and “noise”. The key idea is that for each point of a cluster, the neighborhood of a given radius has to contain at least a minimum number of points.


2 Algorithm of DBSCAN

The goal is to identify dense regions, which can be measured by the number of objects close to a given point.

Two important parameters are required for DBSCAN: epsilon (“eps”) and minimum points (“MinPts”). The parameter eps defines the radius of the neighborhood around a point x. It is called the \(\epsilon\)-neighborhood of x. The parameter MinPts is the minimum number of neighbors within the “eps” radius.

Any point x in the dataset, with a neighbor count greater than or equal to MinPts, is marked as a core point. We say that x is a border point if the number of its neighbors is less than MinPts, but it belongs to the \(\epsilon\)-neighborhood of some core point z. Finally, if a point is neither a core nor a border point, then it is called a noise point or an outlier.

The figure below shows the different types of points (core, border and outlier points) using MinPts = 6. Here x is a core point because \(neighbours_\epsilon(x) = 6\), y is a border point because \(neighbours_\epsilon(y) < MinPts\), but it belongs to the \(\epsilon\)-neighborhood of the core point x. Finally, z is a noise point.

Density based clustering basic idea - minimal point and epsilon
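
A minimal sketch of this classification (not from the original article; the helper classify_points is a made-up name, and eps = 0.15, MinPts = 5 are the values used later in this article) counts, for every point, its neighbors within eps:

# Classify each point of a 2-column data set as "core", "border" or "noise"
classify_points <- function(x, eps, MinPts) {
  d <- as.matrix(dist(x))            # pairwise Euclidean distances
  n_neighbors <- rowSums(d <= eps)   # each point counts itself here
  core <- n_neighbors >= MinPts
  # a border point is within eps of at least one core point, but is not core
  near_core <- rowSums(d[, core, drop = FALSE] <= eps) > 0
  ifelse(core, "core", ifelse(near_core, "border", "noise"))
}
data("multishapes", package = "factoextra")
table(classify_points(multishapes[, 1:2], eps = 0.15, MinPts = 5))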

We define 3 terms, required for understanding the DBSCAN algorithm:

  • Direct density reachable: A point “A” is directly density reachable from another point “B” if: i) “A” is in the \(\epsilon\)-neighborhood of “B” and ii) “B” is a core point.
  • Density reachable: A point “A” is density reachable from “B” if there is a chain of core points leading from “B” to “A”.
  • Density connected: Two points “A” and “B” are density connected if there is a core point “C” such that both “A” and “B” are density reachable from “C”.

A density-based cluster is defined as a group of density-connected points. The DBSCAN algorithm works as follows (a minimal R sketch of these steps is given after the list):


  1. For each point \(x_i\), compute the distance between \(x_i\) and the other points. Find all neighbor points within distance eps of the starting point (\(x_i\)). Each point, with a neighbor count greater than or equal to MinPts, is marked as a core point or visited.
  2. For each core point, if it’s not already assigned to a cluster, create a new cluster. Find recursively all its density connected points and assign them to the same cluster as the core point.
  3. Iterate through the remaining unvisited points in the dataset.
Those points that do not belong to any cluster are treated as outliers or noise.
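
The following compact base-R implementation of these steps is only an illustrative sketch (naive_dbscan is a made-up name, and the code is quadratic in time and memory); in practice, use the dedicated packages described in the next section:

# Naive DBSCAN following the three steps above.
# Returns an integer vector of cluster ids; 0 means noise/outlier.
naive_dbscan <- function(x, eps, MinPts = 5) {
  x <- as.matrix(x)
  n <- nrow(x)
  d <- as.matrix(dist(x))                      # pairwise Euclidean distances
  region <- function(i) which(d[i, ] <= eps)   # eps-neighborhood (includes i)
  cluster <- integer(n)                        # 0 = noise / unassigned
  visited <- logical(n)
  k <- 0                                       # current cluster id
  for (i in seq_len(n)) {
    if (visited[i]) next
    visited[i] <- TRUE
    N <- region(i)
    if (length(N) < MinPts) next               # not a core point: noise (for now)
    k <- k + 1                                 # start a new cluster at core point i
    cluster[i] <- k
    seeds <- setdiff(N, i)
    while (length(seeds) > 0) {
      q <- seeds[1]; seeds <- seeds[-1]
      if (!visited[q]) {
        visited[q] <- TRUE
        Nq <- region(q)
        if (length(Nq) >= MinPts)              # q is also a core point: expand
          seeds <- union(seeds, Nq)
      }
      if (cluster[q] == 0) cluster[q] <- k     # border/core point joins cluster k
    }
  }
  cluster
}

# Quick check on the multishapes data used later in this article
data("multishapes", package = "factoextra")
table(naive_dbscan(multishapes[, 1:2], eps = 0.15, MinPts = 5))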


3 R packages for computing DBSCAN

Three R packages are used in this article:

  1. fpc and dbscan for computing density-based clustering
  2. factoextra for visualizing clusters

The R packages fpc and dbscan can be installed as follow:

install.packages("fpc")
install.packages("dbscan")

4 R functions for DBSCAN

The function dbscan() [in fpc package] or dbscan() [in dbscan package] can be used.

As the DBSCAN function has the same name in the two packages, we’ll refer to them explicitly as fpc::dbscan() and dbscan::dbscan().

In the following examples, we’ll use fpc package. A simplified format of the function is:

dbscan(data, eps, MinPts = 5, scale = FALSE, 
       method = c("hybrid", "raw", "dist"))

  • data: data matrix, data frame or dissimilarity matrix (dist-object). Specify method = “dist” if the data should be interpreted as dissimilarity matrix or object. Otherwise Euclidean distances will be used.
  • eps: Reachability maximum distance
  • MinPts: Reachability minimum number of points
  • scale: If TRUE, the data will be scaled
  • method: Possible values are:
    • dist: Treats the data as distance matrix
    • raw: Treats the data as raw data
    • hybrid: Expects raw data as well, but calculates partial distance matrices



  • Recall that, DBSCAN clusters require a minimum number of points (MinPts) within a maximum distance (eps) around one of its members (the seed).

  • Any point within eps around any point which satisfies the seed condition is a cluster member (recursively).

  • Some points may not belong to any clusters (noise).


In the following examples, we’ll use the simulated multishapes data [in factoextra package]:

# Load the data 
# Make sure that the package factoextra is installed
data("multishapes", package = "factoextra")
df <- multishapes[, 1:2]

The function dbscan() can be used as follow:

library("fpc")
# Compute DBSCAN using fpc package
set.seed(123)
db <- fpc::dbscan(df, eps = 0.15, MinPts = 5)
# Plot DBSCAN results
plot(db, df, main = "DBSCAN", frame = FALSE)

DBSCAN: density-based clustering

Note that the function plot.dbscan() uses different point symbols for core points (i.e., seed points) and border points. Black points correspond to outliers. You can play with eps and MinPts to change the cluster configuration, as in the sketch below.
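
As a quick illustration (not part of the original article), re-running fpc::dbscan() with a larger eps typically merges nearby clusters and labels fewer points as noise:

# Same data, larger neighborhood radius (illustrative values)
db2 <- fpc::dbscan(df, eps = 0.3, MinPts = 5)
plot(db2, df, main = "DBSCAN with eps = 0.3", frame = FALSE)
print(db2)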

It can be seen that, compared to the k-means algorithm, DBSCAN performs better on this data set and identifies the correct set of clusters.

It’s also possible to draw the DBSCAN results (object db) using the function fviz_cluster() [in factoextra package]:

library("factoextra")
fviz_cluster(db, df, stand = FALSE, frame = FALSE, geom = "point")

DBSCAN: density-based clustering

The result of fpc::dbscan() function can be displayed as follow:

# Print DBSCAN
print(db)
## dbscan Pts=1100 MinPts=5 eps=0.15
##         0   1   2   3  4  5
## border 31  24   1   5  7  1
## seed    0 386 404  99 92 50
## total  31 410 405 104 99 51

In the table above, the column names are the cluster numbers. Cluster 0 corresponds to outliers (black points in the DBSCAN plot).

# Cluster membership. Noise/outlier observations are coded as 0
# A random subset is shown
db$cluster[sample(1:1089, 50)]
##  [1] 1 3 2 4 3 1 2 4 2 2 2 2 2 2 1 4 1 1 1 0 4 2 2 5 2 2 2 2 1 1 0 4 2 3 1
## [36] 2 2 1 1 1 1 2 2 1 1 1 3 2 1 3

The function print.dbscan() shows, for each cluster, the number of seed points and border points.

DBSCAN algorithm requires users to specify the optimal eps values and the parameter MinPts. In the R code above, we used eps = 0.15 and MinPts = 5. One limitation of DBSCAN is that it is sensitive to the choice of \(\epsilon\), in particular if clusters have different densities. If \(\epsilon\) is too small, sparser clusters will be defined as noise. If \(\epsilon\) is too large, denser clusters may be merged together. This implies that, if there are clusters with different local densities, then a single \(\epsilon\) value may not suffice.

A natural question is:

How to define the optimal value of eps?

5 Method for determining the optimal eps value

The method proposed here consists of computing the k-nearest neighbor distances in a matrix of points.

The idea is to calculate the average of the distances of every point to its k nearest neighbors. The value of k will be specified by the user and corresponds to MinPts.

Next, these k-distances are plotted in an ascending order. The aim is to determine the “knee”, which corresponds to the optimal eps parameter.

A knee corresponds to a threshold where a sharp change occurs along the k-distance curve.

The function kNNdistplot() [in dbscan package] can be used to draw the k-distance plot:

dbscan::kNNdistplot(df, k =  5)
abline(h = 0.15, lty = 2)

DBSCAN: density-based clustering

It can be seen that the optimal eps value is around a distance of 0.15.

6 Cluster predictions with DBSCAN algorithm

The function predict.dbscan(object, data, newdata) [in fpc package] can be used to predict the clusters for the points in newdata. For more details, read the documentation (?predict.dbscan).
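
A minimal sketch (not from the original article; the two new points are made-up coordinates), assuming the objects db and df computed above:

# Assign new points to the clusters found on the multishapes data
newpoints <- data.frame(x = c(0, 2), y = c(1, 2))
predict(db, data = df, newdata = newpoints)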

7 Application of DBSCAN on real data

The iris dataset is used:

# Load the data
data("iris")
iris <- as.matrix(iris[, 1:4])

The optimal value of “eps” parameter can be determined as follow:

dbscan::kNNdistplot(iris, k =  4)
abline(h = 0.4, lty = 2)

Compute DBSCAN using fpc::dbscan() and dbscan::dbscan(). Make sure that the 2 packages are installed:

set.seed(123)
# fpc package
res.fpc <- fpc::dbscan(iris, eps = 0.4, MinPts = 4)
# dbscan package
res.db <- dbscan::dbscan(iris, 0.4, 4)
  • The result of the function fpc::dbscan() provides an object of class ‘dbscan’ containing the following components:
    • cluster: integer vector coding cluster membership with noise observations (singletons) coded as 0
    • isseed: logical vector indicating whether a point is a seed (not border, not noise)
    • eps: parameter eps
    • MinPts: parameter MinPts
  • The result of the function dbscan::dbscan() is an integer vector with cluster assignments. Zero indicates noise points.

Note that the function dbscan::dbscan() is a fast re-implementation of the DBSCAN algorithm. The implementation is significantly faster and can work with larger data sets than fpc::dbscan().
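
A rough way to see this difference (not from the original article; timings are machine-dependent) is to compare the two functions on a larger simulated data set:

# Compare run times on 10,000 simulated points (illustration only)
set.seed(123)
big <- matrix(rnorm(20000), ncol = 2)
system.time(fpc::dbscan(big, eps = 0.2, MinPts = 5))
system.time(dbscan::dbscan(big, eps = 0.2, minPts = 5))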

Make sure that both versions produce the same results:

all(res.fpc$cluster == res.db)
## [1] TRUE

The result can be visualized as follow:

fviz_cluster(res.fpc, iris, geom = "point")

DBSCAN: density-based clustering

Black points are outliers.

8 Infos

This analysis has been performed using R software (ver. 3.2.1)

  • Martin Ester, Hans-Peter Kriegel, Joerg Sander, Xiaowei Xu (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Institute for Computer Science, University of Munich. Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96)

fviz_mca: Quick Multiple Correspondence Analysis data visualization - R software and data mining



Description

Draw the graph of individuals/variables from the output of Multiple Correspondence Analysis (MCA).

The following functions, from the factoextra package, are used:

  • fviz_mca_ind(): Graph of individuals
  • fviz_mca_var(): Graph of variable categories
  • fviz_mca_biplot() (or fviz_mca()): Biplot of individuals and variable categories

Install and load factoextra

The package devtools is required for the installation as factoextra is hosted on github.

# install.packages("devtools")
library("devtools")
install_github("kassambara/factoextra")

Load factoextra:

library("factoextra")

Usage

# Graph of individuals
fviz_mca_ind(X, axes = c(1, 2), 
  geom = c("point", "text"), label = "all", invisible = "none",
  labelsize = 4, pointsize = 2, habillage = "none",
  addEllipses = FALSE, ellipse.level = 0.95, col.ind = "blue",
  col.ind.sup = "darkblue", alpha.ind = 1, shape.ind = 19,
  select.ind = list(name = NULL, cos2 = NULL, contrib = NULL),
  map = "symmetric", 
  jitter = list(what = "label", width = NULL, height = NULL), ...)

# Graph of variables
fviz_mca_var(X, axes = c(1, 2),
  geom = c("point", "text"), label = "all",
  invisible = "none", labelsize = 4, pointsize = 2, col.var = "red",
  alpha.var = 1, shape.var = 17, col.quanti.sup = "blue",
  col.quali.sup = "darkgreen", col.circle = "grey70",
  select.var = list(name = NULL, cos2 = NULL, contrib = NULL),
  map = "symmetric", 
  jitter = list(what = "label", width = NULL, height = NULL))

# Biplot of individuals and variables
fviz_mca_biplot(X, axes = c(1, 2), geom = c("point", "text"),
  label = "all", invisible = "none", labelsize = 4, pointsize = 2,
  habillage = "none", addEllipses = FALSE, ellipse.level = 0.95,
  col.ind = "blue", col.ind.sup = "darkblue", alpha.ind = 1,
  col.var = "red", alpha.var = 1, col.quanti.sup = "blue",
  col.quali.sup = "darkgreen", shape.ind = 19, shape.var = 17,
  select.var = list(name = NULL, cos2 = NULL, contrib = NULL),
  select.ind = list(name = NULL, cos2 = NULL, contrib = NULL),
  map = "symmetric", arrows = c(FALSE, FALSE), 
  jitter = list(what = "label", width = NULL, height = NULL), ...)

# An alias of fviz_mca_biplot()
fviz_mca(X, ...)

Arguments

Argument Description
X an object of class MCA [FactoMineR]; mca [ade4].
axes a numeric vector of length 2 specifying the dimensions to be plotted.
geom a text specifying the geometry to be used for the graph. Allowed values are the combination of c(“point”, “arrow”, “text”). Use “point” (to show only points); “text” to show only labels; c(“point”, “text”) or c(“arrow”, “text”) to show both types.
label a text specifying the elements to be labelled. Default value is “all”. Allowed values are “none” or the combination of c(“ind”, “ind.sup”,“var”, “quali.sup”, “quanti.sup”). “ind” can be used to label only active individuals. “ind.sup” is for supplementary individuals. “var” is for active variable categories. “quali.sup” is for supplementary qualitative variable categories. “quanti.sup” is for quantitative supplementary variables.
invisible a text specifying the elements to be hidden on the plot. Default value is “none”. Allowed values are the combination of c(“ind”, “ind.sup”,“var”, “quali.sup”, “quanti.sup”).
labelsize font size for the labels.
pointsize the size of points.
habillage an optional factor variable for coloring the observations by groups. Default value is “none”. If X is a MCA object from FactoMineR package, habillage can also specify the supplementary qualitative variable (by its index or name) to be used for coloring individuals by groups (see ?MCA in FactoMineR).
addEllipses logical value. If TRUE, draws ellipses around the individuals when habillage != “none”.
ellipse.level the size of the concentration ellipse in normal probability.
col.ind,col.var colors for individuals and variables, respectively. Possible values include also : “cos2”, “contrib”, “coord”, “x” or “y”. In this case, the colors for individuals/variables are automatically controlled by their qualities of representation (“cos2”), contributions (“contrib”), coordinates (x^2 + y^2 , “coord”), x values (“x”) or y values (“y”). To use automatic coloring (by cos2, contrib, ….), make sure that habillage =“none”.
col.ind.sup color for supplementary individuals.
alpha.ind,alpha.var controls the transparency of individual and variable colors, respectively. The value can variate from 0 (total transparency) to 1 (no transparency). Default value is 1. Possible values include also : “cos2”, “contrib”, “coord”, “x” or “y”. In this case, the transparency for the individual/variable colors are automatically controlled by their qualities (“cos2”), contributions (“contrib”), coordinates (x^2 + y^2, “coord”), x values(“x”) or y values(“y”). To use this, make sure that habillage =“none”.
select.ind,select.var

a selection of individuals/variables to be drawn. Allowed values are NULL or a list containing the arguments name, cos2 or contrib:

  • name: is a character vector containing individuals/variables to be drawn
  • cos2: if cos2 is in [0, 1], ex: 0.6, then individuals/variables with a cos2 > 0.6 are drawn. if cos2 > 1, ex: 5, then the top 5 individuals/variables with the highest cos2 are drawn.
  • contrib: if contrib > 1, ex: 5, then the top 5 individuals/variables with the highest contrib are drawn
map character string specifying the map type. Allowed options include: “symmetric”, “rowprincipal”, “colprincipal”, “symbiplot”, “rowgab”, “colgab”, “rowgreen” and “colgreen”. See details
jitter a parameter used to jitter the points in order to reduce overplotting. It’s a list containing the objects what, width and height (Ex.; jitter = list(what, width, height)). what: the element to be jittered. Possible values are “point” or “p”; “label” or “l”; “both” or “b”. width: degree of jitter in x direction (ex: 0.2). height: degree of jitter in y direction (ex: 0.2).
col.quanti.sup, col.quali.sup a color for the quantitative/qualitative supplementary variables.
arrows Vector of two logicals specifying if the plot should contain points (FALSE, default) or arrows (TRUE). First value sets the rows and the second value sets the columns.
… Arguments to be passed to the function fviz_mca_biplot().

Details

The default plot of MCA is a “symmetric” plot in which both rows and columns are in principal coordinates. In this situation, it’s not possible to interpret the distance between row points and column points. To overcome this problem, the simplest way is to make an asymmetric plot. This means that, the column profiles must be presented in row space or vice-versa. The allowed options for the argument map are:

  • “rowprincipal” or “colprincipal”: asymmetric plots with either rows in principal coordinates and columns in standard coordinates, or vice versa. These plots preserve row metric or column metric respectively.

  • “symbiplot”: Both rows and columns are scaled to have variances equal to the singular values (square roots of eigenvalues), which gives a symmetric biplot but does not preserve row or column metrics.

  • “rowgab” or “colgab”: Asymmetric maps, proposed by Gabriel & Odoroff (1990), with rows (respectively, columns) in principal coordinates and columns (respectively, rows) in standard coordinates multiplied by the mass of the corresponding point.

  • “rowgreen” or “colgreen”: The so-called contribution biplots showing visually the most contributing points (Greenacre 2006b). These are similar to “rowgab” and “colgab” except that the points in standard coordinates are multiplied by the square root of the corresponding masses, giving reconstructions of the standardized residuals.
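
As an illustration (not part of the original article), an asymmetric MCA map can be requested in the same way. This minimal sketch assumes the poison.active data and the MCA() call used in the Examples section below:

library("FactoMineR")
library("factoextra")
data(poison)
poison.active <- poison[1:55, 5:15]
res.mca <- MCA(poison.active, graph = FALSE)
# Individuals in principal coordinates, variable categories in standard coordinates
fviz_mca_biplot(res.mca, map = "rowprincipal", label = "var")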

Value

A ggplot2 plot

Examples

Multiple Correspondence Analysis

A Multiple Correspondence Analysis (MCA) is performed using the function MCA() [in FactoMineR] and poison data [in FactoMineR]:

# Install and load FactoMineR to compute MCA
# install.packages("FactoMineR")
library("FactoMineR")
data(poison)
poison.active <- poison[1:55, 5:15]
head(poison.active[, 1:6])
    Nausea Vomiting Abdominals   Fever   Diarrhae   Potato
1 Nausea_y  Vomit_n     Abdo_y Fever_y Diarrhea_y Potato_y
2 Nausea_n  Vomit_n     Abdo_n Fever_n Diarrhea_n Potato_y
3 Nausea_n  Vomit_y     Abdo_y Fever_y Diarrhea_y Potato_y
4 Nausea_n  Vomit_n     Abdo_n Fever_n Diarrhea_n Potato_y
5 Nausea_n  Vomit_y     Abdo_y Fever_y Diarrhea_y Potato_y
6 Nausea_n  Vomit_n     Abdo_y Fever_y Diarrhea_y Potato_y
res.mca <- MCA(poison.active, graph=FALSE)

fviz_mca_ind(): Graph of individuals

# Default plot
fviz_mca_ind(res.mca)

fviz_mca: Quick Multiple Correspondence Analysis data visualization - R software and data mining

# Change title and axis labels
fviz_mca_ind(res.mca) +
 labs(title = "MCA", x = "Dim.1", y ="Dim.2" )

fviz_mca: Quick Multiple Correspondence Analysis data visualization - R software and data mining

# Change axis limits by specifying the min and max
fviz_mca_ind(res.mca) +
   xlim(-0.8, 1.5) + ylim (-1.5, 1.5)

fviz_mca: Quick Multiple Correspondence Analysis data visualization - R software and data mining

# Use text only
fviz_mca_ind(res.mca, geom = "text")

fviz_mca: Quick Multiple Correspondence Analysis data visualization - R software and data mining

# Use points only
fviz_mca_ind(res.mca, geom="point")

fviz_mca: Quick Multiple Correspondence Analysis data visualization - R software and data mining

# Change the size of points
fviz_mca_ind(res.mca, geom="point", pointsize = 4)

fviz_mca: Quick Multiple Correspondence Analysis data visualization - R software and data mining

# Change point color and theme
fviz_mca_ind(res.mca, col.ind = "blue")+
   theme_minimal()

fviz_mca: Quick Multiple Correspondence Analysis data visualization - R software and data mining

# Reduce overplotting
fviz_mca_ind(res.mca, 
             jitter = list(width = 0.2, height = 0.2))

fviz_mca: Quick Multiple Correspondence Analysis data visualization - R software and data mining

# Control automatically the color of individuals
# using the cos2 or the contributions
# cos2 = the quality of the individuals on the factor map
fviz_mca_ind(res.mca, col.ind="cos2")

fviz_mca: Quick Multiple Correspondence Analysis data visualization - R software and data mining

# Gradient color
fviz_mca_ind(res.mca, col.ind="cos2") +
      scale_color_gradient2(low="white", mid="blue",
      high="red", midpoint=0.4)

fviz_mca: Quick Multiple Correspondence Analysis data visualization - R software and data mining

# Change the theme and use only points
fviz_mca_ind(res.mca, col.ind="cos2", geom = "point") +
      scale_color_gradient2(low="white", mid="blue",
      high="red", midpoint=0.4)+ theme_minimal()

fviz_mca: Quick Multiple Correspondence Analysis data visualization - R software and data mining

# Color by the contributions
fviz_mca_ind(res.mca, col.ind="contrib") +
      scale_color_gradient2(low="white", mid="blue",
      high="red", midpoint=1.5)

fviz_mca: Quick Multiple Correspondence Analysis data visualization - R software and data mining

# Control the transparency of the color by the
# contributions
fviz_mca_ind(res.mca, alpha.ind="contrib") +
     theme_minimal()

fviz_mca: Quick Multiple Correspondence Analysis data visualization - R software and data mining

# Color individuals by groups
grp <- as.factor(poison.active[, "Vomiting"])
fviz_mca_ind(res.mca, label="none", habillage=grp)

fviz_mca: Quick Multiple Correspondence Analysis data visualization - R software and data mining

# Add ellipses
p <- fviz_mca_ind(res.mca, label="none", habillage=grp,
             addEllipses=TRUE, ellipse.level=0.95)
print(p)

fviz_mca: Quick Multiple Correspondence Analysis data visualization - R software and data mining

# Change group colors using RColorBrewer color palettes
p + scale_color_brewer(palette="Dark2") +
   theme_minimal()

fviz_mca: Quick Multiple Correspondence Analysis data visualization - R software and data mining

p + scale_color_brewer(palette="Paired") +
     theme_minimal()

fviz_mca: Quick Multiple Correspondence Analysis data visualization - R software and data mining

p + scale_color_brewer(palette="Set1") +
     theme_minimal()

fviz_mca: Quick Multiple Correspondence Analysis data visualization - R software and data mining

# Change color manually
p + scale_color_manual(values=c("#999999", "#E69F00", "#56B4E9"))

fviz_mca: Quick Multiple Correspondence Analysis data visualization - R software and data mining

# Select and visualize individuals with cos2 >= 0.4
fviz_mca_ind(res.mca, select.ind = list(cos2 = 0.4))

fviz_mca: Quick Multiple Correspondence Analysis data visualization - R software and data mining

# Select the top 20 according to the cos2
fviz_mca_ind(res.mca, select.ind = list(cos2 = 20))

fviz_mca: Quick Multiple Correspondence Analysis data visualization - R software and data mining

# Select the top 20 contributing individuals
fviz_mca_ind(res.mca, select.ind = list(contrib = 20))

fviz_mca: Quick Multiple Correspondence Analysis data visualization - R software and data mining

# Select by names
fviz_mca_ind(res.mca,
select.ind = list(name = c("44", "38", "53",  "39")))

fviz_mca: Quick Multiple Correspondence Analysis data visualization - R software and data mining

fviz_mca_var(): Graph of variable categories

# Default plot
fviz_mca_var(res.mca)

fviz_mca: Quick Multiple Correspondence Analysis data visualization - R software and data mining

# Change color and theme
fviz_mca_var(res.mca, col.var="steelblue")+
 theme_minimal()

fviz_mca: Quick Multiple Correspondence Analysis data visualization - R software and data mining

# Control variable colors using their contributions
fviz_mca_var(res.mca, col.var = "contrib")+
 scale_color_gradient2(low = "white", mid = "blue",
           high = "red", midpoint = 2) +
 theme_minimal()

fviz_mca: Quick Multiple Correspondence Analysis data visualization - R software and data mining

# Control the transparency of variables using their contributions
fviz_mca_var(res.mca, alpha.var = "contrib") +
   theme_minimal()

fviz_mca: Quick Multiple Correspondence Analysis data visualization - R software and data mining

# Select and visualize categories with cos2 >= 0.4
fviz_mca_var(res.mca, select.var = list(cos2 = 0.4))

fviz_mca: Quick Multiple Correspondence Analysis data visualization - R software and data mining

# Select the top 10 contributing variable categories
fviz_mca_var(res.mca, select.var = list(contrib = 10))

fviz_mca: Quick Multiple Correspondence Analysis data visualization - R software and data mining

# Select by names
fviz_mca_var(res.mca,
 select.var= list(name = c("Courg_n", "Fever_y", "Fever_n")))

fviz_mca: Quick Multiple Correspondence Analysis data visualization - R software and data mining

fviz_mca_biplot(): Biplot of individuals and variable categories

fviz_mca_biplot(res.mca)

fviz_mca: Quick Multiple Correspondence Analysis data visualization - R software and data mining

# Keep only the labels for variable categories
fviz_mca_biplot(res.mca, label ="var")

fviz_mca: Quick Multiple Correspondence Analysis data visualization - R software and data mining

# Keep only labels for individuals
fviz_mca_biplot(res.mca, label ="ind")

fviz_mca: Quick Multiple Correspondence Analysis data visualization - R software and data mining

# Hide variable categories
fviz_mca_biplot(res.mca, invisible ="var")

fviz_mca: Quick Multiple Correspondence Analysis data visualization - R software and data mining

# Hide individuals
fviz_mca_biplot(res.mca, invisible ="ind")

fviz_mca: Quick Multiple Correspondence Analysis data visualization - R software and data mining

# Control automatically the color of individuals using the cos2
fviz_mca_biplot(res.mca, label ="var", col.ind="cos2") +
       theme_minimal()

fviz_mca: Quick Multiple Correspondence Analysis data visualization - R software and data mining

# Change the color by groups, add ellipses
fviz_mca_biplot(res.mca, label="var", col.var ="blue",
   habillage=grp, addEllipses=TRUE, ellipse.level=0.95) +
   theme_minimal()

fviz_mca: Quick Multiple Correspondence Analysis data visualization - R software and data mining

# Select the top 30 contributing individuals
# And the top 10 variables
fviz_mca_biplot(res.mca,
               select.ind = list(contrib = 30),
               select.var = list(contrib = 10))

fviz_mca: Quick Multiple Correspondence Analysis data visualization - R software and data mining

Infos

This analysis has been performed using R software (ver. 3.2.1) and factoextra (ver. 1.0.3)


fviz_pca: Quick Principal Component Analysis data visualization - R software and data mining



Description

Draw the graph of individuals/variables from the output of Principal Component Analysis (PCA).

The following functions, from the factoextra package, are used:

  • fviz_pca_ind(): Graph of individuals
  • fviz_pca_var(): Graph of variables
  • fviz_pca_biplot() (or fviz_pca()): Biplot of individuals and variables

Install and load factoextra

The package devtools is required for the installation as factoextra is hosted on github.

# install.packages("devtools")
library("devtools")
install_github("kassambara/factoextra")

Load factoextra:

library("factoextra")

Usage

# Graph of individuals
fviz_pca_ind(X, axes = c(1, 2), geom = c("point", "text"),
       label = "all", invisible = "none", labelsize = 4,
       pointsize = 2, habillage = "none",
       addEllipses = FALSE, ellipse.level = 0.95, 
       col.ind = "black", col.ind.sup = "blue", alpha.ind = 1,
       select.ind = list(name = NULL, cos2 = NULL, contrib = NULL),
       jitter = list(what = "label", width = NULL, height = NULL),  ...)

# Graph of variables
fviz_pca_var(X, axes = c(1, 2), geom = c("arrow", "text"),
       label = "all", invisible = "none", labelsize = 4,
       col.var = "black", alpha.var = 1, col.quanti.sup = "blue",
       col.circle = "grey70",
       select.var = list(name =NULL, cos2 = NULL, contrib = NULL),
       jitter = list(what = "label", width = NULL, height = NULL))

# Biplot of individuals and variables
fviz_pca_biplot(X, axes = c(1, 2), geom = c("point", "text"),
   label = "all", invisible = "none", labelsize = 4, pointsize = 2,
    habillage = "none", addEllipses = FALSE, ellipse.level = 0.95,
    col.ind = "black", col.ind.sup = "blue", alpha.ind = 1,
    col.var = "steelblue", alpha.var = 1, col.quanti.sup = "blue",
    col.circle = "grey70", 
    select.var = list(name = NULL, cos2 = NULL, contrib= NULL), 
    select.ind = list(name = NULL, cos2 = NULL, contrib = NULL),
    jitter = list(what = "label", width = NULL, height = NULL), ...)

# An alias of fviz_pca_biplot()
fviz_pca(X, ...)

Arguments

Argument Description
X an object of class PCA [FactoMineR]; prcomp and princomp [stats]; dudi and pca [ade4].
axes a numeric vector of length 2 specifying the dimensions to be plotted.
geom a text specifying the geometry to be used for the graph. Allowed values are the combination of c(“point”, “arrow”, “text”). Use “point” (to show only points); “text” to show only labels; c(“point”, “text”) or c(“arrow”, “text”) to show both types.
label a text specifying the elements to be labelled. Default value is “all”. Allowed values are “none” or the combination of c(“ind”, “ind.sup”, “quali”, “var”, “quanti.sup”). “ind” can be used to label only active individuals. “ind.sup” is for supplementary individuals. “quali” is for supplementary qualitative variables. “var” is for active variables. “quanti.sup” is for quantitative supplementary variables.
invisible a text specifying the elements to be hidden on the plot. Default value is “none”. Allowed values are the combination of c(“ind”, “ind.sup”, “quali”, “var”, “quanti.sup”).
labelsize font size for the labels.
pointsize the size of points.
habillage an optional factor variable for coloring the observations by groups. Default value is “none”. If X is a PCA object from FactoMineR package, habillage can also specify the supplementary qualitative variable (by its index or name) to be used for coloring individuals by groups (see ?PCA in FactoMineR).
addEllipses logical value. If TRUE, draws ellipses around the individuals when habillage != “none”.
ellipse.level the size of the concentration ellipse in normal probability.
col.ind,col.var colors for individuals and variables, respectively. Possible values include also : “cos2”, “contrib”, “coord”, “x” or “y”. In this case, the colors for individuals/variables are automatically controlled by their qualities of representation (“cos2”), contributions (“contrib”), coordinates (x^2 + y^2, “coord”), x values (“x”) or y values (“y”). To use automatic coloring (by cos2, contrib, ….), make sure that habillage =“none”.
col.ind.sup color for supplementary individuals.
alpha.ind,alpha.var controls the transparency of individual and variable colors, respectively. The value can variate from 0 (total transparency) to 1 (no transparency). Default value is 1. Possible values include also : “cos2”, “contrib”, “coord”, “x” or “y”. In this case, the transparency for the individual/variable colors are automatically controlled by their qualities (“cos2”), contributions (“contrib”), coordinates (x^2 + y^2 , “coord”), x values(“x”) or y values(“y”). To use this, make sure that habillage =“none”.
select.ind,select.var

a selection of individuals/variables to be drawn. Allowed values are NULL or a list containing the arguments name, cos2 or contrib:

  • name: is a character vector containing individuals/variables to be drawn
  • cos2: if cos2 is in [0, 1], ex: 0.6, then individuals/variables with a cos2 > 0.6 are drawn. if cos2 > 1, ex: 5, then the top 5 individuals/variables with the highest cos2 are drawn.
  • contrib: if contrib > 1, ex: 5, then the top 5 individuals/variables with the highest contrib are drawn
jitter a parameter used to jitter the points in order to reduce overplotting. It’s a list containing the objects what, width and height (Ex.; jitter = list(what, width, height)). what: the element to be jittered. Possible values are “point” or “p”; “label” or “l”; “both” or “b”. width: degree of jitter in x direction (ex: 0.2). height: degree of jitter in y direction (ex: 0.2).
col.quanti.sup a color for the quantitative supplementary variables.
col.circle a color for the correlation circle.
… Arguments to be passed to the function fviz_pca_biplot().

Value

A ggplot2 plot

Examples

Principal component analysis

A principal component analysis (PCA) is performed using the built-in R function prcomp() and iris data:

data(iris)
head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
# The variable Species (index = 5) is removed
# before the PCA analysis
res.pca <- prcomp(iris[, -5],  scale = TRUE)

fviz_pca_ind(): Graph of individuals

# Default plot
fviz_pca_ind(res.pca)

fviz_pca: Quick Principal Component Analysis data visualization - R software and data mining

# Change title and axis labels
fviz_pca_ind(res.pca) +
  labs(title ="PCA", x = "PC1", y = "PC2")

fviz_pca: Quick Principal Component Analysis data visualization - R software and data mining

# Change axis limits by specifying the min and max
fviz_pca_ind(res.pca) +
   xlim(-4, 4) + ylim (-4, 4)

fviz_pca: Quick Principal Component Analysis data visualization - R software and data mining

# Use text only
fviz_pca_ind(res.pca, geom="text")

fviz_pca: Quick Principal Component Analysis data visualization - R software and data mining

# Use points only
fviz_pca_ind(res.pca, geom="point")

fviz_pca: Quick Principal Component Analysis data visualization - R software and data mining

# Change the size of points
fviz_pca_ind(res.pca, geom="point", pointsize = 4)

fviz_pca: Quick Principal Component Analysis data visualization - R software and data mining

# Change point color and theme
fviz_pca_ind(res.pca, col.ind = "blue")+
   theme_minimal()

fviz_pca: Quick Principal Component Analysis data visualization - R software and data mining

# Control automatically the color of individuals
# using the cos2 or the contributions
# cos2 = the quality of the individuals on the factor map
fviz_pca_ind(res.pca, col.ind="cos2")

fviz_pca: Quick Principal Component Analysis data visualization - R software and data mining

# Gradient color
fviz_pca_ind(res.pca, col.ind="cos2") +
      scale_color_gradient2(low="white", mid="blue",
      high="red", midpoint=0.6)

fviz_pca: Quick Principal Component Analysis data visualization - R software and data mining

# Change the theme and use only points
fviz_pca_ind(res.pca, col.ind="cos2", geom = "point") +
      scale_color_gradient2(low="white", mid="blue",
      high="red", midpoint=0.6)+ theme_minimal()

fviz_pca: Quick Principal Component Analysis data visualization - R software and data mining

# Color by the contributions
fviz_pca_ind(res.pca, col.ind="contrib") +
      scale_color_gradient2(low="white", mid="blue",
      high="red", midpoint=4)

fviz_pca: Quick Principal Component Analysis data visualization - R software and data mining

# Control the transparency of the color by the
# contributions
fviz_pca_ind(res.pca, alpha.ind="contrib") +
     theme_minimal()

fviz_pca: Quick Principal Component Analysis data visualization - R software and data mining

# Color individuals by groups
fviz_pca_ind(res.pca, label="none", habillage=iris$Species)

fviz_pca: Quick Principal Component Analysis data visualization - R software and data mining

# Add ellipses
p <- fviz_pca_ind(res.pca, label="none", habillage=iris$Species,
             addEllipses=TRUE, ellipse.level=0.95)
print(p)

fviz_pca: Quick Principal Component Analysis data visualization - R software and data mining

# Change group colors using RColorBrewer color palettes
p + scale_color_brewer(palette="Dark2") +
     theme_minimal()

fviz_pca: Quick Principal Component Analysis data visualization - R software and data mining

p + scale_color_brewer(palette="Paired") +
     theme_minimal()

fviz_pca: Quick Principal Component Analysis data visualization - R software and data mining

p + scale_color_brewer(palette="Set1") +
     theme_minimal()

fviz_pca: Quick Principal Component Analysis data visualization - R software and data mining

# Change color manually
p + scale_color_manual(values=c("#999999", "#E69F00", "#56B4E9"))

fviz_pca: Quick Principal Component Analysis data visualization - R software and data mining

# Select and visualize individuals with cos2 > 0.96
fviz_pca_ind(res.pca, select.ind = list(cos2 = 0.96))

fviz_pca: Quick Principal Component Analysis data visualization - R software and data mining

# Select the top 20 according to the cos2
fviz_pca_ind(res.pca, select.ind = list(cos2 = 20))

fviz_pca: Quick Principal Component Analysis data visualization - R software and data mining

# Select the top 20 contributing individuals
fviz_pca_ind(res.pca, select.ind = list(contrib = 20))

fviz_pca: Quick Principal Component Analysis data visualization - R software and data mining

# Select by names
fviz_pca_ind(res.pca,
select.ind = list(name = c("23", "42", "119")))

fviz_pca: Quick Principal Component Analysis data visualization - R software and data mining
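The jitter argument described in the Arguments section above is not illustrated in the original examples; the call below is a minimal sketch of how it might be used to reduce label overplotting (the list values are illustrative, not prescribed):

# Jitter labels to reduce overplotting (illustrative values)
fviz_pca_ind(res.pca, geom = c("point", "text"),
             jitter = list(what = "label", width = 0.2, height = 0.2))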

fviz_pca_var(): Graph of variables

# Default plot
fviz_pca_var(res.pca)

fviz_pca: Quick Principal Component Analysis data visualization - R software and data mining

# Use points and text
fviz_pca_var(res.pca, geom = c("point", "text"))

fviz_pca: Quick Principal Component Analysis data visualization - R software and data mining

# Change color and theme
fviz_pca_var(res.pca, col.var="steelblue")+
 theme_minimal()

fviz_pca: Quick Principal Component Analysis data visualization - R software and data mining

# Control variable colors using their contributions
fviz_pca_var(res.pca, col.var="contrib")+
 scale_color_gradient2(low="white", mid="blue",
           high="red", midpoint=96) +
 theme_minimal()

fviz_pca: Quick Principal Component Analysis data visualization - R software and data mining

# Control the transparency of variables using their contributions
fviz_pca_var(res.pca, alpha.var="contrib") +
   theme_minimal()

fviz_pca: Quick Principal Component Analysis data visualization - R software and data mining

# Select and visualize variables with cos2 >= 0.96
fviz_pca_var(res.pca, select.var = list(cos2 = 0.96))

fviz_pca: Quick Principal Component Analysis data visualization - R software and data mining

# Select the top 3 contributing variables
fviz_pca_var(res.pca, select.var = list(contrib = 3))

fviz_pca: Quick Principal Component Analysis data visualization - R software and data mining

# Select by names
fviz_pca_var(res.pca,
   select.var= list(name = c("Sepal.Width", "Petal.Length")))

fviz_pca: Quick Principal Component Analysis data visualization - R software and data mining

fviz_pca_biplot(): Biplot of individuals and variables

fviz_pca_biplot(res.pca)

fviz_pca: Quick Principal Component Analysis data visualization - R software and data mining

# Keep only the labels for variables
fviz_pca_biplot(res.pca, label ="var")

fviz_pca: Quick Principal Component Analysis data visualization - R software and data mining

# Keep only labels for individuals
fviz_pca_biplot(res.pca, label ="ind")

fviz_pca: Quick Principal Component Analysis data visualization - R software and data mining

# Hide variables
fviz_pca_biplot(res.pca, invisible ="var")

fviz_pca: Quick Principal Component Analysis data visualization - R software and data mining

# Hide individuals
fviz_pca_biplot(res.pca, invisible ="ind")

fviz_pca: Quick Principal Component Analysis data visualization - R software and data mining

# Control automatically the color of individuals using the cos2
fviz_pca_biplot(res.pca, label ="var", col.ind="cos2") +
       theme_minimal()

fviz_pca: Quick Principal Component Analysis data visualization - R software and data mining

# Change the color by groups, add ellipses
fviz_pca_biplot(res.pca, label="var", habillage=iris$Species,
               addEllipses=TRUE, ellipse.level=0.95)

fviz_pca: Quick Principal Component Analysis data visualization - R software and data mining

# Select the top 30 contributing individuals
fviz_pca_biplot(res.pca, label="var",
               select.ind = list(contrib = 30))

fviz_pca: Quick Principal Component Analysis data visualization - R software and data mining

Infos

This analysis has been performed using R software (ver. 3.2.1) and factoextra (ver. 1.0.3)

qplot: Quick plot with ggplot2 - R software and data visualization



The function qplot() [in ggplot2] is very similar to the basic plot() function from the R base package. It can be used to easily create and combine different types of plots. However, it remains less flexible than the function ggplot().

This chapter provides a brief introduction to qplot(), which stands for quick plot. For the function ggplot(), many articles on creating and customizing different plots are available at the end of this web page.
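As an illustration (not part of the original examples), the same basic scatter plot can be produced with the three approaches mentioned above:

library(ggplot2)
data(mtcars)
# Base R plot
plot(mtcars$wt, mtcars$mpg)
# Quick plot
qplot(wt, mpg, data = mtcars)
# Full ggplot() syntax
ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()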

Data format

The data must be a data.frame (columns are variables and rows are observations).

The data set mtcars is used in the examples below:

data(mtcars)
df <- mtcars[, c("mpg", "cyl", "wt")]
head(df)
##                    mpg cyl    wt
## Mazda RX4         21.0   6 2.620
## Mazda RX4 Wag     21.0   6 2.875
## Datsun 710        22.8   4 2.320
## Hornet 4 Drive    21.4   6 3.215
## Hornet Sportabout 18.7   8 3.440
## Valiant           18.1   6 3.460


mtcars : Motor Trend Car Road Tests.

Description: The data comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973 - 74 models).

Format of the subset used here: a data frame with 32 observations on 3 variables.

  • [, 1] mpg Miles/(US) gallon
  • [, 2] cyl Number of cylinders
  • [, 3] wt Weight (lb/1000)


Usage of qplot() function

A simplified format of qplot() is :

qplot(x, y=NULL, data, geom="auto", 
      xlim = c(NA, NA), ylim =c(NA, NA))

  • x : x values
  • y : y values (optional)
  • data : data frame to use (optional).
  • geom : Character vector specifying geom to use. Defaults to “point” if x and y are specified, and “histogram” if only x is specified.
  • xlim, ylim: x and y axis limits


Other arguments, including main, xlab, ylab and log, can also be used (a short example follows the list):

  • main: Plot title
  • xlab, ylab: x and y axis labels
  • log: which variables to log transform. Allowed values are “x”, “y” or “xy”
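A quick illustration of these arguments (not in the original examples):

# Log-transform both axes and add a title and axis labels
qplot(mpg, wt, data = mtcars, log = "xy",
      main = "Scatter plot (log-log scale)",
      xlab = "Miles/(US) gallon", ylab = "Weight (lb/1000)")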

Scatter plots

Basic scatter plots

The plot can be created using data from either numeric vectors or a data frame:

# Use data from numeric vectors
x <- 1:10; y <- x*x
# Basic plot
qplot(x,y)

# Add line
qplot(x, y, geom=c("point", "line"))

# Use data from a data frame
qplot(mpg, wt, data=mtcars)

qplot: Quick plot with ggplot2 - R software and data visualizationqplot: Quick plot with ggplot2 - R software and data visualizationqplot: Quick plot with ggplot2 - R software and data visualization

Scatter plots with linear fits

The option smooth is used to add a smoothed line with its standard error:

# Smoothing
qplot(mpg, wt, data = mtcars, geom = c("point", "smooth"))

# Regression line
qplot(mpg, wt, data = mtcars, geom = c("point", "smooth"),
      method="lm")

qplot: Quick plot with ggplot2 - R software and data visualizationqplot: Quick plot with ggplot2 - R software and data visualization

To draw a regression line, the argument method = “lm” is used in combination with geom = “smooth”.

The allowed values for the argument method includes:

  • method = “loess”: This is the default value for a small number of observations. It computes a smooth local regression. You can read more about loess using the R code ?loess.
  • method = “lm”: It fits a linear model. Note that, it’s also possible to indicate the formula as formula = y ~ poly(x, 3) to specify a degree 3 polynomial (see the example below).
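For instance, based on the formula argument described above, a cubic polynomial fit could be requested as follows (a minimal sketch, not part of the original code):

# Degree 3 polynomial fit (illustrative)
qplot(mpg, wt, data = mtcars, geom = c("point", "smooth"),
      method = "lm", formula = y ~ poly(x, 3))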

Linear fits by groups

The argument color is used to tell R that we want to color the points by groups:

# Linear fits by group
qplot(mpg, wt, data = mtcars, color = factor(cyl),
      geom=c("point", "smooth"),
      method="lm")

qplot: Quick plot with ggplot2 - R software and data visualization

Change scatter plot colors

Points can be colored according to the values of a continuous or a discrete variable. The argument colour is used.

# Change the color by a continuous numeric variable
qplot(mpg, wt, data = mtcars, colour = cyl)

# Change the color by groups (factor)
df <- mtcars
df[,'cyl'] <- as.factor(df[,'cyl'])
qplot(mpg, wt, data = df, colour = cyl)

# Add lines
qplot(mpg, wt, data = df, colour = cyl,
      geom=c("point", "line"))

qplot: Quick plot with ggplot2 - R software and data visualizationqplot: Quick plot with ggplot2 - R software and data visualizationqplot: Quick plot with ggplot2 - R software and data visualization


Note that you can also use the following R code to generate the second plot :

qplot(mpg, wt, data=df, colour= factor(cyl))


Change the shape and the size of points

Like color, the shape and the size of points can be controlled by a continuous or discrete variable.

# Change the size of points according to 
  # the values of a continuous variable
qplot(mpg, wt, data = mtcars, size = mpg)

# Change point shapes by groups
qplot(mpg, wt, data = mtcars, shape = factor(cyl))

qplot: Quick plot with ggplot2 - R software and data visualizationqplot: Quick plot with ggplot2 - R software and data visualization

Scatter plot with texts

The argument label is used to specify the text to be displayed for each point:

qplot(mpg, wt, data = mtcars, label = rownames(mtcars), 
      geom=c("point", "text"),
      hjust=0, vjust=0)

qplot: Quick plot with ggplot2 - R software and data visualization

Bar plot

It’s possible to draw a bar plot using the argument geom = “bar”.

If you want y to represent counts of cases, use stat = “bin” and don’t map a variable to y. If you want y to represent values in the data, use stat = “identity”.

# y represents the count of cases
qplot(mpg, data = mtcars, geom = "bar")

# y represents values in the data
index <- 1:nrow(mtcars)
qplot(index, mpg, data = mtcars, 
      geom = "bar", stat = "identity")

qplot: Quick plot with ggplot2 - R software and data visualizationqplot: Quick plot with ggplot2 - R software and data visualization

Change bar plot fill color

# Order the data by cyl and then by mpg values
df <- mtcars[order(mtcars[, "cyl"], mtcars[, "mpg"]),]
df[,'cyl'] <- as.factor(df[,'cyl'])
index <- 1:nrow(df)

# Change fill color by group (cyl)
qplot(index, mpg, data = df, 
      geom = "bar", stat = "identity", fill = cyl)

qplot: Quick plot with ggplot2 - R software and data visualization

Box plot, dot plot and violin plot

PlantGrowth data set is used in the following example :

head(PlantGrowth)
##   weight group
## 1   4.17  ctrl
## 2   5.58  ctrl
## 3   5.18  ctrl
## 4   6.11  ctrl
## 5   4.50  ctrl
## 6   4.61  ctrl
  • geom = “boxplot”: draws a box plot
  • geom = “dotplot”: draws a dot plot. The supplementary arguments stackdir = “center” and binaxis = “y” are required.
  • geom = “violin”: draws a violin plot. The argument trim is set to FALSE
# Basic box plot from a numeric vector
x <- "1"
y <- rnorm(100)
qplot(x, y, geom="boxplot")

# Basic box plot from data frame
qplot(group, weight, data = PlantGrowth, 
      geom=c("boxplot"))

# Dot plot
qplot(group, weight, data = PlantGrowth, 
      geom=c("dotplot"), 
      stackdir = "center", binaxis = "y")

# Violin plot
qplot(group, weight, data = PlantGrowth, 
      geom=c("violin"), trim = FALSE)

qplot: Quick plot with ggplot2 - R software and data visualizationqplot: Quick plot with ggplot2 - R software and data visualizationqplot: Quick plot with ggplot2 - R software and data visualizationqplot: Quick plot with ggplot2 - R software and data visualization

Change the color by groups:

# Box plot from a data frame
# Add jitter and change fill color by group
qplot(group, weight, data = PlantGrowth, 
      geom=c("boxplot", "jitter"), fill = group)

# Dot plot
qplot(group, weight, data = PlantGrowth, 
      geom = "dotplot", stackdir = "center", binaxis = "y",
      color = group, fill = group)

qplot: Quick plot with ggplot2 - R software and data visualizationqplot: Quick plot with ggplot2 - R software and data visualization

Histogram and density plots

The histogram and density plots are used to display the distribution of data.

Generate some data

The R code below generates some data containing the weights by sex (M for male; F for female):

set.seed(1234)
mydata = data.frame(
        sex = factor(rep(c("F", "M"), each=200)),
        weight = c(rnorm(200, 55), rnorm(200, 58)))
head(mydata)
##   sex   weight
## 1   F 53.79293
## 2   F 55.27743
## 3   F 56.08444
## 4   F 52.65430
## 5   F 55.42912
## 6   F 55.50606

Histogram plot

# Basic histogram
qplot(weight, data = mydata, geom = "histogram")

# Change histogram fill color by group (sex)
qplot(weight, data = mydata, geom = "histogram",
    fill = sex, position = "dodge")

qplot: Quick plot with ggplot2 - R software and data visualizationqplot: Quick plot with ggplot2 - R software and data visualization

Density plot

# Basic density plot
qplot(weight, data = mydata, geom = "density")

# Change density plot line color by group (sex)
# change line type
qplot(weight, data = mydata, geom = "density",
    color = sex, linetype = sex)

qplot: Quick plot with ggplot2 - R software and data visualizationqplot: Quick plot with ggplot2 - R software and data visualization

Main titles and axis labels

Titles can be added to the plot as follow:

qplot(weight, data = mydata, geom = "density",
      xlab = "Weight (kg)", ylab = "Density", 
      main = "Density plot of Weight")

qplot: Quick plot with ggplot2 - R software and data visualization

Infos

This analysis was performed using R (ver. 3.2.1) and ggplot2 (ver 1.0.1).

ggplot2 area plot : Quick start guide - R software and data visualization



This R tutorial describes how to create an area plot using R software and the ggplot2 package. We’ll also see how to color the area under a density curve using geom_area.

The function geom_area() is used. You can also add a line for the mean using the function geom_vline.

ggplot2 geom_area - R software and data visualization

Prepare the data

This data will be used for the examples below :

set.seed(1234)

df <- data.frame(
  sex=factor(rep(c("F", "M"), each=200)),
  weight=round(c(rnorm(200, mean=55, sd=5),
                 rnorm(200, mean=65, sd=5)))
  )

head(df)
##   sex weight
## 1   F     49
## 2   F     56
## 3   F     60
## 4   F     43
## 5   F     57
## 6   F     58

Basic area plots

library(ggplot2)
p <- ggplot(df, aes(x=weight))

# Basic area plot
p + geom_area(stat = "bin")

# y axis as density value
p + geom_area(aes(y = ..density..), stat = "bin")

# Add mean line
p + geom_area(stat = "bin", fill = "lightblue")+
  geom_vline(aes(xintercept=mean(weight)),
            color="blue", linetype="dashed", size=1)

ggplot2 geom_area - R software and data visualizationggplot2 geom_area - R software and data visualizationggplot2 geom_area - R software and data visualization

Change line types and colors

# Change line color and fill color
p + geom_area(stat ="bin", color="darkblue",
              fill="lightblue")

# Change line type
p + geom_area(stat = "bin", color= "black",
              fill="lightgrey", linetype="dashed")

ggplot2 geom_area - R software and data visualizationggplot2 geom_area - R software and data visualization

Read more on ggplot2 line types : ggplot2 line types

Change colors by groups

Calculate the mean of each group :

library(plyr)
mu <- ddply(df, "sex", summarise, grp.mean=mean(weight))
head(mu)
##   sex grp.mean
## 1   F    54.70
## 2   M    65.36

Change fill colors

Area plot fill colors can be automatically controlled by the levels of sex :

# Change area plot fill colors by groups
ggplot(df, aes(x=weight, fill=sex)) +
  geom_area(stat ="bin")

# Use semi-transparent fill
p<-ggplot(df, aes(x=weight, fill=sex)) +
  geom_area(stat ="bin", alpha=0.6) +
  theme_classic()
p

# Add mean lines
p+geom_vline(data=mu, aes(xintercept=grp.mean, color=sex),
             linetype="dashed")

ggplot2 geom_area - R software and data visualizationggplot2 geom_area - R software and data visualizationggplot2 geom_area - R software and data visualization

It is also possible to change manually the area plot fill colors using the functions :

  • scale_fill_manual() : to use custom colors
  • scale_fill_brewer() : to use color palettes from RColorBrewer package
  • scale_fill_grey() : to use grey color palettes
# Use custom color palettes
p+scale_fill_manual(values=c("#999999", "#E69F00")) 

# use brewer color palettes
p+scale_fill_brewer(palette="Dark2") 

# Use grey scale
p + scale_fill_grey()

ggplot2 geom_area - R software and data visualizationggplot2 geom_area - R software and data visualizationggplot2 geom_area - R software and data visualization

Read more on ggplot2 colors here : ggplot2 colors

Change the legend position

p + theme(legend.position="top")

p + theme(legend.position="bottom")

p + theme(legend.position="none") # Remove legend

ggplot2 geom_area - R software and data visualizationggplot2 geom_area - R software and data visualizationggplot2 geom_area - R software and data visualization

The allowed values for the argument legend.position are: “left”, “top”, “right”, “bottom”.

Read more on ggplot legends : ggplot2 legends

Use facets

Split the plot in multiple panels :

p<-ggplot(df, aes(x=weight))+
  geom_area(stat ="bin")+facet_grid(sex ~ .)
p

# Add mean lines
p+geom_vline(data=mu, aes(xintercept=grp.mean, color="red"),
             linetype="dashed")

ggplot2 geom_area - R software and data visualizationggplot2 geom_area - R software and data visualization

Read more on facets : ggplot2 facets

Contrasting bar plot and area plot

An area plot is the continuous analog of a stacked bar chart. In the following example, we’ll use the diamonds data set [in ggplot2 package]:

# Load the data
data("diamonds")
p <- ggplot(diamonds, aes(x = price, fill = cut))
head(diamonds)
##   carat       cut color clarity depth table price    x    y    z
## 1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
## 2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
## 3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
## 4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
## 5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
## 6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48
# Bar plot
p + geom_bar()

ggplot2 geom_area - R software and data visualization

# Area plot
p + geom_area(stat = "bin") +
  scale_fill_brewer(palette="Dark2") 

ggplot2 geom_area - R software and data visualization

Coloring under density curve using geom_area

dat <- with(density(df$weight), data.frame(x, y))
ggplot(data = dat, mapping = aes(x = x, y = y)) +
    geom_line()+
    geom_area(mapping = aes(x = ifelse(x>65 & x< 70 , x, 0)), fill = "red") +
    xlim(30, 80)

ggplot2 geom_area - R software and data visualization

Infos

This analysis has been performed using R software (ver. 3.2.1) and ggplot2 (ver. 1.0.1)

ggplot2 line plot : Quick start guide - R software and data visualization



This R tutorial describes how to create line plots using R software and ggplot2 package.

In a line graph, observations are ordered by x value and connected.

The functions geom_line(), geom_step(), or geom_path() can be used.

x value (for x axis) can be :

  • date : for a time series data
  • texts
  • discrete numeric values
  • continuous numeric values

ggplot2 line plot - R software and data visualization

Basic line plots

Data

Data derived from the ToothGrowth data set are used. ToothGrowth describes the effect of Vitamin C on tooth growth in Guinea pigs.

df <- data.frame(dose=c("D0.5", "D1", "D2"),
                len=c(4.2, 10, 29.5))

head(df)
##   dose  len
## 1 D0.5  4.2
## 2   D1 10.0
## 3   D2 29.5
  • len : Tooth length
  • dose : Dose in milligrams (0.5, 1, 2)

Create line plots with points

library(ggplot2)
# Basic line plot with points
ggplot(data=df, aes(x=dose, y=len, group=1)) +
  geom_line()+
  geom_point()

# Change the line type
ggplot(data=df, aes(x=dose, y=len, group=1)) +
  geom_line(linetype = "dashed")+
  geom_point()

# Change the color
ggplot(data=df, aes(x=dose, y=len, group=1)) +
  geom_line(color="red")+
  geom_point()

ggplot2 line plot - R software and data visualizationggplot2 line plot - R software and data visualizationggplot2 line plot - R software and data visualization

Read more on line types : ggplot2 line types

You can add an arrow to the line using the grid package :

library(grid)
# Add an arrow
ggplot(data=df, aes(x=dose, y=len, group=1)) +
  geom_line(arrow = arrow())+
  geom_point()

# Add a closed arrow to the end of the line
myarrow=arrow(angle = 15, ends = "both", type = "closed")
ggplot(data=df, aes(x=dose, y=len, group=1)) +
  geom_line(arrow=myarrow)+
  geom_point()

ggplot2 line plot - R software and data visualizationggplot2 line plot - R software and data visualization

Observations can be also connected using the functions geom_step() or geom_path() :

ggplot(data=df, aes(x=dose, y=len, group=1)) +
  geom_step()+
  geom_point()


ggplot(data=df, aes(x=dose, y=len, group=1)) +
  geom_path()+
  geom_point()

ggplot2 line plot - R software and data visualizationggplot2 line plot - R software and data visualization


  • geom_line : Connecting observations, ordered by x value
  • geom_path() : Observations are connected in original order
  • geom_step : Connecting observations by stairs


Line plot with multiple groups

Data

Data derived from the ToothGrowth data set are used. ToothGrowth describes the effect of Vitamin C on tooth growth in Guinea pigs. Three dose levels of Vitamin C (0.5, 1, and 2 mg) with each of two delivery methods [orange juice (OJ) or ascorbic acid (VC)] are used :

df2 <- data.frame(supp=rep(c("VC", "OJ"), each=3),
                dose=rep(c("D0.5", "D1", "D2"),2),
                len=c(6.8, 15, 33, 4.2, 10, 29.5))

head(df2)
##   supp dose  len
## 1   VC D0.5  6.8
## 2   VC   D1 15.0
## 3   VC   D2 33.0
## 4   OJ D0.5  4.2
## 5   OJ   D1 10.0
## 6   OJ   D2 29.5
  • len : Tooth length
  • dose : Dose in milligrams (0.5, 1, 2)
  • supp : Supplement type (VC or OJ)

Create line plots

In the graphs below, line types, colors and sizes are the same for the two groups :

# Line plot with multiple groups
ggplot(data=df2, aes(x=dose, y=len, group=supp)) +
  geom_line()+
  geom_point()

# Change line types
ggplot(data=df2, aes(x=dose, y=len, group=supp)) +
  geom_line(linetype="dashed", color="blue", size=1.2)+
  geom_point(color="red", size=3)

ggplot2 line plot - R software and data visualizationggplot2 line plot - R software and data visualization

Change line types by groups

In the graphs below, line types and point shapes are controlled automatically by the levels of the variable supp :

# Change line types by groups (supp)
ggplot(df2, aes(x=dose, y=len, group=supp)) +
  geom_line(aes(linetype=supp))+
  geom_point()

# Change line types and point shapes
ggplot(df2, aes(x=dose, y=len, group=supp)) +
  geom_line(aes(linetype=supp))+
  geom_point(aes(shape=supp))

ggplot2 line plot - R software and data visualizationggplot2 line plot - R software and data visualization

It is also possible to change manually the line types using the function scale_linetype_manual().

# Set line types manually
ggplot(df2, aes(x=dose, y=len, group=supp)) +
  geom_line(aes(linetype=supp))+
  geom_point()+
  scale_linetype_manual(values=c("twodash", "dotted"))

ggplot2 line plot - R software and data visualization

You can read more on line types here : ggplot2 line types

If you want to change also point shapes, read this article : ggplot2 point shapes

Change line colors by groups

Line colors are controlled automatically by the levels of the variable supp :

p<-ggplot(df2, aes(x=dose, y=len, group=supp)) +
  geom_line(aes(color=supp))+
  geom_point(aes(color=supp))
p

ggplot2 line plot - R software and data visualization

It is also possible to change manually line colors using the functions :

  • scale_color_manual() : to use custom colors
  • scale_color_brewer() : to use color palettes from RColorBrewer package
  • scale_color_grey() : to use grey color palettes
# Use custom color palettes
p+scale_color_manual(values=c("#999999", "#E69F00", "#56B4E9"))

# Use brewer color palettes
p+scale_color_brewer(palette="Dark2")

# Use grey scale
p + scale_color_grey() + theme_classic()

ggplot2 line plot - R software and data visualizationggplot2 line plot - R software and data visualizationggplot2 line plot - R software and data visualization

Read more on ggplot2 colors here : ggplot2 colors

Change the legend position

p <- p + scale_color_brewer(palette="Paired")+
  theme_minimal()

p + theme(legend.position="top")

p + theme(legend.position="bottom")

# Remove legend
p + theme(legend.position="none")

ggplot2 line plot - R software and data visualizationggplot2 line plot - R software and data visualizationggplot2 line plot - R software and data visualization

The allowed values for the argument legend.position are: “left”, “top”, “right”, “bottom”.

Read more on ggplot legend : ggplot2 legend

Line plot with a numeric x-axis

If the variable on x-axis is numeric, it can be useful to treat it as a continuous or a factor variable depending on what you want to do :

# Create some data
df2 <- data.frame(supp=rep(c("VC", "OJ"), each=3),
                dose=rep(c("0.5", "1", "2"),2),
                len=c(6.8, 15, 33, 4.2, 10, 29.5))
head(df2)
##   supp dose  len
## 1   VC  0.5  6.8
## 2   VC    1 15.0
## 3   VC    2 33.0
## 4   OJ  0.5  4.2
## 5   OJ    1 10.0
## 6   OJ    2 29.5
# x axis treated as continuous variable
df2$dose <- as.numeric(as.vector(df2$dose))
ggplot(data=df2, aes(x=dose, y=len, group=supp, color=supp)) +
  geom_line() + geom_point()+
  scale_color_brewer(palette="Paired")+
  theme_minimal()

# Axis treated as discrete variable
df2$dose<-as.factor(df2$dose)
ggplot(data=df2, aes(x=dose, y=len, group=supp, color=supp)) +
  geom_line() + geom_point()+
  scale_color_brewer(palette="Paired")+
  theme_minimal()

ggplot2 line plot - R software and data visualizationggplot2 line plot - R software and data visualization

Line plot with dates on x-axis

The economics time series data set is used :

head(economics)
##         date   pce    pop psavert uempmed unemploy
## 1 1967-06-30 507.8 198712     9.8     4.5     2944
## 2 1967-07-31 510.9 198911     9.8     4.7     2945
## 3 1967-08-31 516.7 199113     9.0     4.6     2958
## 4 1967-09-30 513.3 199311     9.8     4.9     3143
## 5 1967-10-31 518.5 199498     9.7     4.7     3066
## 6 1967-11-30 526.2 199657     9.4     4.8     3018

Plots :

# Basic line plot
ggplot(data=economics, aes(x=date, y=pop))+
  geom_line()

# Plot a subset of the data
ggplot(data=subset(economics, date > as.Date("2006-1-1")), 
       aes(x=date, y=pop))+geom_line()

ggplot2 line plot - R software and data visualizationggplot2 line plot - R software and data visualization

Change line size :

# Change line size
ggplot(data=economics, aes(x=date, y=pop, size=unemploy/pop))+
  geom_line()

ggplot2 line plot - R software and data visualization

Line graph with error bars

The function below will be used to calculate the mean and the standard deviation, for the variable of interest, in each group :

#+++++++++++++++++++++++++
# Function to calculate the mean and the standard deviation
  # for each group
#+++++++++++++++++++++++++
# data : a data frame
# varname : the name of a column containing the variable
  # to be summarized
# groupnames : vector of column names to be used as
  # grouping variables
data_summary <- function(data, varname, groupnames){
  require(plyr)
  summary_func <- function(x, col){
    c(mean = mean(x[[col]], na.rm=TRUE),
      sd = sd(x[[col]], na.rm=TRUE))
  }
  data_sum<-ddply(data, groupnames, .fun=summary_func,
                  varname)
  data_sum <- rename(data_sum, c("mean" = varname))
 return(data_sum)
}

Summarize the data :

df3 <- data_summary(ToothGrowth, varname="len", 
                    groupnames=c("supp", "dose"))
head(df3)
##   supp dose   len       sd
## 1   OJ  0.5 13.23 4.459709
## 2   OJ  1.0 22.70 3.910953
## 3   OJ  2.0 26.06 2.655058
## 4   VC  0.5  7.98 2.746634
## 5   VC  1.0 16.77 2.515309
## 6   VC  2.0 26.14 4.797731

The function geom_errorbar() can be used to produce a line graph with error bars :

# Standard deviation of the mean
ggplot(df3, aes(x=dose, y=len, group=supp, color=supp)) + 
    geom_errorbar(aes(ymin=len-sd, ymax=len+sd), width=.1) +
    geom_line() + geom_point()+
   scale_color_brewer(palette="Paired")+theme_minimal()

# Use position_dodge to move overlapped errorbars horizontally
ggplot(df3, aes(x=dose, y=len, group=supp, color=supp)) + 
    geom_errorbar(aes(ymin=len-sd, ymax=len+sd), width=.1, 
    position=position_dodge(0.05)) +
    geom_line() + geom_point()+
   scale_color_brewer(palette="Paired")+theme_minimal()

ggplot2 line plot - R software and data visualizationggplot2 line plot - R software and data visualization

Customized line graphs

# Simple line plot
# Change point shapes and line types by groups
ggplot(df3, aes(x=dose, y=len, group = supp, shape=supp, linetype=supp))+ 
    geom_errorbar(aes(ymin=len-sd, ymax=len+sd), width=.1, 
    position=position_dodge(0.05)) +
    geom_line() +
    geom_point()+
    labs(title="Plot of lengthby dose",x="Dose (mg)", y = "Length")+
    theme_classic()


# Change color by groups
# Add error bars
p <- ggplot(df3, aes(x=dose, y=len, group = supp, color=supp))+ 
    geom_errorbar(aes(ymin=len-sd, ymax=len+sd), width=.1, 
    position=position_dodge(0.05)) +
    geom_line(aes(linetype=supp)) + 
    geom_point(aes(shape=supp))+
    labs(title="Plot of lengthby dose",x="Dose (mg)", y = "Length")+
    theme_classic()

p + theme_classic() + scale_color_manual(values=c('#999999','#E69F00'))

ggplot2 line plot - R software and data visualizationggplot2 line plot - R software and data visualization

Change colors using RColorBrewer palettes :

p + scale_color_brewer(palette="Paired") + theme_minimal()

# Greens
p + scale_color_brewer(palette="Greens") + theme_minimal()

# Reds
p + scale_color_brewer(palette="Reds") + theme_minimal()

ggplot2 line plot - R software and data visualizationggplot2 line plot - R software and data visualizationggplot2 line plot - R software and data visualization

Infos

This analysis has been performed using R software (ver. 3.1.2) and ggplot2 (ver. 1.0.0)

ggplot2 : Quick correlation matrix heatmap - R software and data visualization



This R tutorial describes how to compute and visualize a correlation matrix using R software and ggplot2 package.

Prepare the data

mtcars data are used :

mydata <- mtcars[, c(1,3,4,5,6,7)]
head(mydata)
##                    mpg disp  hp drat    wt  qsec
## Mazda RX4         21.0  160 110 3.90 2.620 16.46
## Mazda RX4 Wag     21.0  160 110 3.90 2.875 17.02
## Datsun 710        22.8  108  93 3.85 2.320 18.61
## Hornet 4 Drive    21.4  258 110 3.08 3.215 19.44
## Hornet Sportabout 18.7  360 175 3.15 3.440 17.02
## Valiant           18.1  225 105 2.76 3.460 20.22

Compute the correlation matrix

Correlation matrix can be created using the R function cor() :

cormat <- round(cor(mydata),2)
head(cormat)
##        mpg  disp    hp  drat    wt  qsec
## mpg   1.00 -0.85 -0.78  0.68 -0.87  0.42
## disp -0.85  1.00  0.79 -0.71  0.89 -0.43
## hp   -0.78  0.79  1.00 -0.45  0.66 -0.71
## drat  0.68 -0.71 -0.45  1.00 -0.71  0.09
## wt   -0.87  0.89  0.66 -0.71  1.00 -0.17
## qsec  0.42 -0.43 -0.71  0.09 -0.17  1.00

Read more about correlation matrix data visualization : correlation data visualization in R

Create the correlation heatmap with ggplot2

The package reshape2 is required to melt the correlation matrix :

library(reshape2)
melted_cormat <- melt(cormat)
head(melted_cormat)
##   Var1 Var2 value
## 1  mpg  mpg  1.00
## 2 disp  mpg -0.85
## 3   hp  mpg -0.78
## 4 drat  mpg  0.68
## 5   wt  mpg -0.87
## 6 qsec  mpg  0.42

The function geom_tile()[ggplot2 package] is used to visualize the correlation matrix :

library(ggplot2)
ggplot(data = melted_cormat, aes(x=Var1, y=Var2, fill=value)) + 
  geom_tile()

ggplot2 correlation heatmap - R software and data visualization

The default plot is very ugly. We’ll see in the next sections, how to change the appearance of the heatmap.

Note that, if you have a lot of data, it’s preferable to use the function geom_raster(), which can be much faster (a minimal example is shown below).
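For example, the same heatmap could be drawn with geom_raster() (a minimal sketch, not part of the original code):

# Same heatmap using geom_raster() instead of geom_tile()
ggplot(data = melted_cormat, aes(x=Var1, y=Var2, fill=value)) +
  geom_raster()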

Get the lower and upper triangles of the correlation matrix

Note that, a correlation matrix has redundant information. We’ll use the functions below to set half of it to NA.

Helper functions :

# Get lower triangle of the correlation matrix
  get_lower_tri<-function(cormat){
    cormat[upper.tri(cormat)] <- NA
    return(cormat)
  }

  # Get upper triangle of the correlation matrix
  get_upper_tri <- function(cormat){
    cormat[lower.tri(cormat)]<- NA
    return(cormat)
  }

Usage :

upper_tri <- get_upper_tri(cormat)
upper_tri
##      mpg  disp    hp  drat    wt  qsec
## mpg    1 -0.85 -0.78  0.68 -0.87  0.42
## disp  NA  1.00  0.79 -0.71  0.89 -0.43
## hp    NA    NA  1.00 -0.45  0.66 -0.71
## drat  NA    NA    NA  1.00 -0.71  0.09
## wt    NA    NA    NA    NA  1.00 -0.17
## qsec  NA    NA    NA    NA    NA  1.00

Finished correlation matrix heatmap

Melt the correlation data and drop the rows with NA values :

# Melt the correlation matrix
library(reshape2)
melted_cormat <- melt(upper_tri, na.rm = TRUE)

# Heatmap
library(ggplot2)
ggplot(data = melted_cormat, aes(Var2, Var1, fill = value))+
 geom_tile(color = "white")+
 scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
   midpoint = 0, limit = c(-1,1), space = "Lab", 
   name="Pearson\nCorrelation") +
  theme_minimal()+ 
 theme(axis.text.x = element_text(angle = 45, vjust = 1, 
    size = 12, hjust = 1))+
 coord_fixed()

ggplot2 correlation heatmap - R software and data visualization

In the figure above :

  • negative correlations are in blue color and positive correlations in red. The function scale_fill_gradient2 is used with the argument limit = c(-1,1) as correlation coefficients range from -1 to 1.
  • coord_fixed() : this function ensures that one unit on the x-axis is the same length as one unit on the y-axis.

Reorder the correlation matrix

This section describes how to reorder the correlation matrix according to the correlation coefficient. This is useful for identifying hidden patterns in the matrix. Hierarchical clustering (hclust) is used to determine the order in the example below.

Helper function to reorder the correlation matrix :

reorder_cormat <- function(cormat){
# Use correlation between variables as distance
dd <- as.dist((1-cormat)/2)
hc <- hclust(dd)
cormat <-cormat[hc$order, hc$order]
}

Reordered correlation data visualization :

# Reorder the correlation matrix
cormat <- reorder_cormat(cormat)
upper_tri <- get_upper_tri(cormat)

# Melt the correlation matrix
melted_cormat <- melt(upper_tri, na.rm = TRUE)

# Create a ggheatmap
ggheatmap <- ggplot(melted_cormat, aes(Var2, Var1, fill = value))+
 geom_tile(color = "white")+
 scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
   midpoint = 0, limit = c(-1,1), space = "Lab", 
    name="Pearson\nCorrelation") +
  theme_minimal()+ # minimal theme
 theme(axis.text.x = element_text(angle = 45, vjust = 1, 
    size = 12, hjust = 1))+
 coord_fixed()

# Print the heatmap
print(ggheatmap)

ggplot2 correlation heatmap - R software and data visualization

Add correlation coefficients on the heatmap

  1. Use geom_text() to add the correlation coefficients on the graph
  2. Use a blank theme (remove axis labels, panel grids and background, and axis ticks)
  3. Use guides() to change the position of the legend title
ggheatmap + 
geom_text(aes(Var2, Var1, label = value), color = "black", size = 4) +
theme(
  axis.title.x = element_blank(),
  axis.title.y = element_blank(),
  panel.grid.major = element_blank(),
  panel.border = element_blank(),
  panel.background = element_blank(),
  axis.ticks = element_blank(),
  legend.justification = c(1, 0),
  legend.position = c(0.6, 0.7),
  legend.direction = "horizontal")+
  guides(fill = guide_colorbar(barwidth = 7, barheight = 1,
                title.position = "top", title.hjust = 0.5))

ggplot2 correlation heatmap - R software and data visualization

Read more about correlation matrix data visualization : correlation data visualization in R

Infos

This analysis has been performed using R software (ver. 3.2.1) and ggplot2 (ver. 1.0.1)

Clustering Validation Statistics: 4 Vital Things Everyone Should Know - Unsupervised Machine Learning



Clustering is an unsupervised machine learning method for partitioning dataset into a set of groups or clusters. A big issue is that clustering methods will return clusters even if the data does not contain any clusters. Therefore, it’s necessary i) to assess clustering tendency before the analysis and ii) to validate the quality of the result after clustering.

A variety of measures have been proposed in the literature for evaluating clustering results. The term clustering validation is used to designate the procedure of evaluating the results of a clustering algorithm.

Generally, clustering validation statistics can be categorized into 4 classes (Theodoridis and Koutroubas, 2008; G. Brock et al., 2008, Charrad et al., 2014):


  1. Relative clustering validation, which evaluates the clustering structure by varying different parameter values for the same algorithm (e.g., varying the number of clusters k). It’s generally used for determining the optimal number of clusters.

  2. External clustering validation, which consists in comparing the results of a cluster analysis to an externally known result, such as externally provided class labels. Since we know the “true” cluster number in advance, this approach is mainly used for selecting the right clustering algorithm for a specific dataset.

  3. Internal clustering validation, which uses the internal information of the clustering process to evaluate the goodness of a clustering structure without reference to external information. It can also be used for estimating the number of clusters and the appropriate clustering algorithm without any external data.

  4. Clustering stability validation, which is a special version of internal validation. It evaluates the consistency of a clustering result by comparing it with the clusters obtained after each column is removed, one at a time. Clustering stability measures will be described in a future chapter.


The aim of this article is to:

  • describe the different methods for clustering validation
  • compare the quality of clustering results obtained with different clustering algorithms
  • provide R lab section for validating clustering results

In all the examples presented here, we’ll apply k-means, PAM and hierarchical clustering. Note that, the functions used in this article can be applied to evaluate the validity of any other clustering methods.

1 Required packages

The following packages will be used:

  • cluster for computing PAM clustering and for analyzing cluster silhouettes
  • factoextra for simplifying clustering workflows and for visualizing clusters using ggplot2 plotting system
  • NbClust for determining the optimal number of clusters in the data
  • fpc for computing clustering validation statistics

Install factoextra package as follow:

if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/factoextra")

The remaining packages can be installed using the code below:

pkgs <- c("cluster", "fpc", "NbClust")
install.packages(pkgs)

Load packages:

library(factoextra)
library(cluster)
library(fpc)
library(NbClust)

2 Data preparation

The data set iris is used. We start by excluding the column “Species” and scaling the data using the function scale():

# Load the data
data(iris)
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
# Remove species column (5) and scale the data
iris.scaled <- scale(iris[, -5])

Iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.

3 Relative measures: Determine the optimal number of clusters

Many indices (more than 30) have been published in the literature for finding the right number of clusters in a dataset. The process has been covered in my previous article: Determining the optimal number of clusters.

In this section we’ll use the package NbClust which will compute, with a single function call, 30 indices for deciding the right number of clusters in the dataset:

# Compute the number of clusters
library(NbClust)
nb <- NbClust(iris.scaled, distance = "euclidean", min.nc = 2,
        max.nc = 10, method = "complete", index ="all")
# Visualize the result
library(factoextra)
fviz_nbclust(nb) + theme_minimal()
## Among all indices: 
## ===================
## * 2 proposed  0 as the best number of clusters
## * 1 proposed  1 as the best number of clusters
## * 2 proposed  2 as the best number of clusters
## * 18 proposed  3 as the best number of clusters
## * 3 proposed  10 as the best number of clusters
## 
## Conclusion
## =========================
## * Accoridng to the majority rule, the best number of clusters is  3 .

Clustering validation statistics - Unsupervised Machine Learning

4 Clustering analysis

We’ll use the function eclust() [in factoextra] which provides several advantages as described in the previous chapter: Visual Enhancement of Clustering Analysis.

eclust() stands for enhanced clustering. It simplifies the workflow of clustering analysis and can be used to compute hierarchical clustering and partitioning clustering in a single function call.

4.1 Example of partitioning method results

K-means and PAM clustering are described in this section. We’ll split the data into 3 clusters as follow:

# K-means clustering
km.res <- eclust(iris.scaled, "kmeans", k = 3,
                 nstart = 25, graph = FALSE)
# k-means group number of each observation
km.res$cluster
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 3 3 3 2 3 3 3 3 3 3 3 3 2 3 3 3 3
##  [71] 2 3 3 3 3 2 2 2 3 3 3 3 3 3 3 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 2 2 2
## [106] 2 3 2 2 2 2 2 2 3 3 2 2 2 2 3 2 3 2 3 2 2 3 2 2 2 2 2 2 3 3 2 2 2 3 2
## [141] 2 2 3 2 2 2 3 2 2 3
# Visualize k-means clusters
fviz_cluster(km.res, geom = "point", frame.type = "norm")

Clustering validation statistics - Unsupervised Machine Learning

# PAM clustering
pam.res <- eclust(iris.scaled, "pam", k = 3, graph = FALSE)
pam.res$cluster
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 3 3 3 2 3 3 3 3 3 3 3 3 2 3 3 3 3
##  [71] 3 3 3 3 3 2 2 2 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 2 2 2
## [106] 2 3 2 2 2 2 2 2 3 2 2 2 2 2 3 2 3 2 3 2 2 3 3 2 2 2 2 2 3 3 2 2 2 3 2
## [141] 2 2 3 2 2 2 3 2 2 3
# Visualize pam clusters
fviz_cluster(pam.res, geom = "point", frame.type = "norm")

Clustering validation statistics - Unsupervised Machine Learning

Read more about partitioning methods: Partitioning clustering

4.2 Example of hierarchical clustering results

# Enhanced hierarchical clustering
res.hc <- eclust(iris.scaled, "hclust", k = 3,
                method = "complete", graph = FALSE) 
head(res.hc$cluster, 15)
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 
##  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
# Dendrogram
fviz_dend(res.hc, rect = TRUE, show_labels = FALSE) 

Clustering validation statistics - Unsupervised Machine Learning

Read more about hierarchical clustering: Hierarchical clustering

5 Internal clustering validation measures

In this section, we describe the most widely used clustering validation indices. Recall that the goal of clustering algorithms is to split the dataset into clusters of objects, such that:

  • the objects in the same cluster are as similar as possible,
  • and the objects in different clusters are highly distinct

That is, we want the average distance within cluster to be as small as possible; and the average distance between clusters to be as large as possible.

Internal validation measures often reflect the compactness, connectedness and separation of the cluster partitions.


  1. Compactness measures evaluate how close the objects within the same cluster are. A lower within-cluster variation is an indicator of good compactness (i.e., a good clustering). The different indices for evaluating the compactness of clusters are based on distance measures, such as the cluster-wise within average/median distances between observations.

  2. Separation measures determine how well-separated a cluster is from other clusters. The indices used as separation measures include:
    • distances between cluster centers
    • the pairwise minimum distances between objects in different clusters
  3. Connectivity measures the extent to which items are placed in the same cluster as their nearest neighbors in the data space. The connectivity has a value between 0 and infinity and should be minimized.


Generally most of the indices used for internal clustering validation combine compactness and separation measures as follow:

\[ Index = \frac{(\alpha \times Separation)}{(\beta \times Compactness)} \]

Where \(\alpha\) and \(\beta\) are weights.

In this section, we’ll describe the two commonly used indices for assessing the goodness of clustering: silhouette width and Dunn index.

Recall that, more than 30 indices have been published in the literature. They can be easily computed using the function NbClust which has been described in my previous article: Determining the optimal number of clusters.

5.1 Silhouette analysis

5.1.1 Concept and algorithm

Silhouette analysis measures how well an observation is clustered and it estimates the average distance between clusters. The silhouette plot displays a measure of how close each point in one cluster is to points in the neighboring clusters.

For each observation \(i\), the silhouette width \(s_i\) is calculated as follows:


  1. For each observation \(i\), calculate the average dissimilarity \(a_i\) between \(i\) and all other points of the cluster to which i belongs.
  2. For all other clusters \(C\), to which i does not belong, calculate the average dissimilarity \(d(i, C)\) of \(i\) to all observations of C. The smallest of these \(d(i,C)\) is defined as \(b_i= \min_C d(i,C)\). The value of \(b_i\) can be seen as the dissimilarity between \(i\) and its “neighbor” cluster, i.e., the nearest one to which it does not belong.

  3. Finally the silhouette width of the observation \(i\) is defined by the formula: \(S_i = (b_i - a_i)/max(a_i, b_i)\).
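As an illustration of this definition (not part of the original tutorial), the silhouette width of a single observation can be computed by hand from a distance matrix and a vector of cluster labels. The helper below is a minimal sketch and its name is hypothetical:

# Minimal sketch: silhouette width of observation i, from the definition above
sil_width_i <- function(i, d, cl){
  d <- as.matrix(d)
  own <- which(cl == cl[i])
  a_i <- mean(d[i, setdiff(own, i)])   # average dissimilarity within own cluster
  # average dissimilarity to each other cluster; the smallest one is b_i
  b_i <- min(sapply(setdiff(unique(cl), cl[i]),
                    function(C) mean(d[i, cl == C])))
  (b_i - a_i) / max(a_i, b_i)
}
# e.g., sil_width_i(1, dist(iris.scaled), km.res$cluster)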


5.1.2 Interpretation of silhouette width

Silhouette width can be interpreted as follow:


  • Observations with a large \(S_i\) (almost 1) are very well clustered

  • A small \(S_i\) (around 0) means that the observation lies between two clusters

  • Observations with a negative \(S_i\) are probably placed in the wrong cluster.


5.1.3 R functions for silhouette analysis

The silhouette coefficient of observations can be computed using the function silhouette() [in cluster package]:

silhouette(x, dist, ...)
  • x: an integer vector containing the cluster assignment of observations
  • dist: a dissimilarity object created by the function dist()

The function silhouette() returns an object of class silhouette, containing:

  • The cluster number of each observation i
  • The neighbor cluster of i (the cluster, not containing i, for which the average dissimilarity between its observations and i is minimal)
  • The silhouette width \(s_i\) of each observation

The R code below computes the silhouette analysis and draws the result using the R base plot:

# Silhouette coefficient of observations
library("cluster")
sil <- silhouette(km.res$cluster, dist(iris.scaled))
head(sil[, 1:3], 10)
##       cluster neighbor sil_width
##  [1,]       1        3 0.7341949
##  [2,]       1        3 0.5682739
##  [3,]       1        3 0.6775472
##  [4,]       1        3 0.6205016
##  [5,]       1        3 0.7284741
##  [6,]       1        3 0.6098848
##  [7,]       1        3 0.6983835
##  [8,]       1        3 0.7308169
##  [9,]       1        3 0.4882100
## [10,]       1        3 0.6315409
# Silhouette plot
plot(sil, main ="Silhouette plot - K-means")

Clustering validation statistics - Unsupervised Machine Learning

Use factoextra for elegant data visualization:

library(factoextra)
fviz_silhouette(sil)

The summary of the silhouette analysis can be computed using the function summary.silhouette() as follow:

# Summary of silhouette analysis
si.sum <- summary(sil)
# Average silhouette width of each cluster
si.sum$clus.avg.widths
##         1         2         3 
## 0.6363162 0.3473922 0.3933772
# The total average (mean of all individual silhouette widths)
si.sum$avg.width
## [1] 0.4599482
# The size of each clusters
si.sum$clus.sizes
## cl
##  1  2  3 
## 50 47 53

Note that, if the clustering analysis is done using the function eclust(), cluster silhouettes are computed automatically and stored in the object silinfo. The results can be easily visualized as shown in the next sections.

5.1.4 Silhouette plot for k-means clustering

It’s possible to draw silhouette plot using the function fviz_silhouette() [in factoextra package], which will also print a summary of the silhouette analysis output. To avoid this, you can use the option print.summary = FALSE.

# Default plot
fviz_silhouette(km.res)
##   cluster size ave.sil.width
## 1       1   50          0.64
## 2       2   47          0.35
## 3       3   53          0.39

Clustering validation statistics - Unsupervised Machine Learning

# Change the theme and color
fviz_silhouette(km.res, print.summary = FALSE) +
  scale_fill_brewer(palette = "Dark2") +
  scale_color_brewer(palette = "Dark2") +
  theme_minimal()+
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())

Clustering validation statistics - Unsupervised Machine Learning

Silhouette information can be extracted as follow:

# Silhouette information
silinfo <- km.res$silinfo
names(silinfo)
## [1] "widths"          "clus.avg.widths" "avg.width"
# Silhouette widths of each observation
head(silinfo$widths[, 1:3], 10)
##    cluster neighbor sil_width
## 1        1        3 0.7341949
## 41       1        3 0.7333345
## 8        1        3 0.7308169
## 18       1        3 0.7287522
## 5        1        3 0.7284741
## 40       1        3 0.7247047
## 38       1        3 0.7244191
## 12       1        3 0.7217939
## 28       1        3 0.7215103
## 29       1        3 0.7145192
# Average silhouette width of each cluster
silinfo$clus.avg.widths
## [1] 0.6363162 0.3473922 0.3933772
# The total average (mean of all individual silhouette widths)
silinfo$avg.width
## [1] 0.4599482
# The size of each clusters
km.res$size
## [1] 50 47 53

5.1.5 Silhouette plot for PAM clustering

fviz_silhouette(pam.res)
##   cluster size ave.sil.width
## 1       1   50          0.63
## 2       2   45          0.35
## 3       3   55          0.38

Clustering validation statistics - Unsupervised Machine Learning

5.1.6 Silhouette plot for hierarchical clustering

fviz_silhouette(res.hc)
##   cluster size ave.sil.width
## 1       1   49          0.75
## 2       2   75          0.37
## 3       3   26          0.51

Clustering validation statistics - Unsupervised Machine Learning

5.1.7 Samples with a negative silhouette coefficient

It can be seen that several samples have a negative silhouette coefficient in the hierarchical clustering. This means that they are not in the right cluster.

We can find the names of these samples and determine the clusters to which they are closer (neighbor clusters), as follow:

# Silhouette width of observation
sil <- res.hc$silinfo$widths[, 1:3]
# Objects with negative silhouette
neg_sil_index <- which(sil[, 'sil_width'] < 0)
sil[neg_sil_index, , drop = FALSE]
##     cluster neighbor   sil_width
## 51        2        3 -0.02848264
## 148       2        3 -0.03799687
## 129       2        3 -0.09622863
## 111       2        3 -0.14461589
## 109       2        3 -0.14991556
## 133       2        3 -0.18730218
## 42        2        1 -0.39515010

5.2 Dunn index

5.2.1 Concept and algorithm

Dunn index is another internal clustering validation measure which can be computed as follow:


  1. For each cluster, compute the distance between each of the objects in the cluster and the objects in the other clusters
  2. Use the minimum of this pairwise distance as the inter-cluster separation (min.separation)

  3. For each cluster, compute the distance between the objects in the same cluster.
  4. Use the maximal intra-cluster distance (i.e., maximum diameter) as the intra-cluster compactness

  5. Calculate Dunn index (D) as follow:

\[ D = \frac{min.separation}{max.diameter} \]


If the data set contains compact and well-separated clusters, the diameter of the clusters is expected to be small and the distance between the clusters is expected to be large. Thus, Dunn index should be maximized.
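To make the definition concrete, the Dunn index can also be computed directly from a distance matrix and a vector of cluster labels. The function below is a minimal sketch (its name is hypothetical; it is not part of the fpc or NbClust packages described next):

# Minimal sketch: Dunn index = min.separation / max.diameter
dunn_index <- function(d, cl){
  d <- as.matrix(d)
  labs <- unique(cl)
  # maximum intra-cluster distance (max diameter)
  max.diameter <- max(sapply(labs, function(k){
    idx <- which(cl == k)
    if(length(idx) < 2) return(0)
    max(d[idx, idx])
  }))
  # minimum inter-cluster distance (min separation)
  min.separation <- min(apply(combn(labs, 2), 2, function(p){
    min(d[cl == p[1], cl == p[2]])
  }))
  min.separation / max.diameter
}
# e.g., dunn_index(dist(iris.scaled), km.res$cluster)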

5.2.2 R function for computing Dunn index

The function cluster.stats() [in fpc package] and the function NbClust() [in NbClust package] can be used to compute Dunn index and many other indices.

The function cluster.stats() is described in the next section.

5.3 Clustering validation statistics

In this section, we’ll describe the R function cluster.stats() [in fpc package] for computing a number of distance-based statistics, which can be used for cluster validation, comparison between clusterings and deciding the number of clusters.

The simplified format is:

cluster.stats(d = NULL, clustering, alt.clustering = NULL)

  • d: a distance object between cases as generated by the dist() function
  • clustering: vector containing the cluster number of each observation
  • alt.clustering: a vector of the same form as clustering, indicating an alternative clustering


The function cluster.stats() returns a list containing many components useful for analyzing the intrinsic characteristics of a clustering:

  • cluster.number: number of clusters
  • cluster.size: vector containing the number of points in each cluster
  • average.distance, median.distance: vector containing the cluster-wise within average/median distances
  • average.between: average distance between clusters. We want it to be as large as possible
  • average.within: average distance within clusters. We want it to be as small as possible
  • clus.avg.silwidths: vector of cluster average silhouette widths. Recall that the silhouette width is also an estimate of the average distance between clusters. Its value ranges from -1 to 1, with a value of 1 indicating a very good cluster.
  • within.cluster.ss: a generalization of the within clusters sum of squares (k-means objective function), which is obtained if d is a Euclidean distance matrix.
  • dunn, dunn2: Dunn index
  • corrected.rand, vi: Two indices to assess the similarity of two clusterings: the corrected Rand index and Meila’s VI

All the above elements can be used to evaluate the internal quality of clustering.

In the following sections, we’ll compute the clustering quality statistics for k-means, pam and hierarchical clustering. Look at the within.cluster.ss (within clusters sum of squares), the average.within (average distance within clusters) and clus.avg.silwidths (vector of cluster average silhouette widths).

5.3.0.1 Cluster statistics for k-means clustering

library(fpc)
# Compute pairwise-distance matrices
dd <- dist(iris.scaled, method ="euclidean")
# Statistics for k-means clustering
km_stats <- cluster.stats(dd,  km.res$cluster)
# (k-means) within clusters sum of squares
km_stats$within.cluster.ss
## [1] 138.8884
# (k-means) cluster average silhouette widths
km_stats$clus.avg.silwidths
##         1         2         3 
## 0.6363162 0.3473922 0.3933772
# Display all statistics
km_stats
## $n
## [1] 150
## 
## $cluster.number
## [1] 3
## 
## $cluster.size
## [1] 50 47 53
## 
## $min.cluster.size
## [1] 47
## 
## $noisen
## [1] 0
## 
## $diameter
## [1] 5.034198 3.343671 2.922371
## 
## $average.distance
## [1] 1.175155 1.307716 1.197061
## 
## $median.distance
## [1] 0.9884177 1.2383531 1.1559887
## 
## $separation
## [1] 1.5533592 0.1333894 0.1333894
## 
## $average.toother
## [1] 3.647912 3.081212 2.674298
## 
## $separation.matrix
##          [,1]      [,2]      [,3]
## [1,] 0.000000 2.4150235 1.5533592
## [2,] 2.415024 0.0000000 0.1333894
## [3,] 1.553359 0.1333894 0.0000000
## 
## $ave.between.matrix
##          [,1]     [,2]     [,3]
## [1,] 0.000000 4.129179 3.221129
## [2,] 4.129179 0.000000 2.092563
## [3,] 3.221129 2.092563 0.000000
## 
## $average.between
## [1] 3.130708
## 
## $average.within
## [1] 1.222246
## 
## $n.between
## [1] 7491
## 
## $n.within
## [1] 3684
## 
## $max.diameter
## [1] 5.034198
## 
## $min.separation
## [1] 0.1333894
## 
## $within.cluster.ss
## [1] 138.8884
## 
## $clus.avg.silwidths
##         1         2         3 
## 0.6363162 0.3473922 0.3933772 
## 
## $avg.silwidth
## [1] 0.4599482
## 
## $g2
## NULL
## 
## $g3
## NULL
## 
## $pearsongamma
## [1] 0.679696
## 
## $dunn
## [1] 0.02649665
## 
## $dunn2
## [1] 1.600166
## 
## $entropy
## [1] 1.097412
## 
## $wb.ratio
## [1] 0.3904057
## 
## $ch
## [1] 241.9044
## 
## $cwidegap
## [1] 1.3892251 0.9432249 0.7824508
## 
## $widestgap
## [1] 1.389225
## 
## $sindex
## [1] 0.3524812
## 
## $corrected.rand
## NULL
## 
## $vi
## NULL

Read the documentation of cluster.stats() for details about all the available indices.

The same statistics can be computed for pam clustering and hierarchical clustering.

5.3.0.2 Cluster statistics for PAM clustering

# Statistics for pam clustering
pam_stats <- cluster.stats(dd,  pam.res$cluster)
# (pam) within clusters sum of squares
pam_stats$within.cluster.ss
## [1] 140.2856
# (pam) cluster average silhouette widths
pam_stats$clus.avg.silwidths
##         1         2         3 
## 0.6346397 0.3496332 0.3823817

5.3.0.3 Cluster statistics for hierarchical clustering

# Statistics for hierarchical clustering
hc_stats <- cluster.stats(dd,  res.hc$cluster)
# (HCLUST) within clusters sum of squares
hc_stats$within.cluster.ss
## [1] 152.7107
# (HCLUST) cluster average silhouette widths
hc_stats$clus.avg.silwidths
##         1         2         3 
## 0.6688130 0.3154184 0.4488197

6 External clustering validation

The aim is to compare the identified clusters (by k-means, pam or hierarchical clustering) to a reference.

To compare two cluster solutions, use the cluster.stats() function as follow:

res.stat <- cluster.stats(d, solution1$cluster, solution2$cluster)

Among the values returned by the function cluster.stats(), there are two indexes to assess the similarity of two clustering, namely the corrected Rand index and Meila’s VI.

We know that the iris data contains exactly 3 groups of species.

Does the k-means clustering match the true structure of the data?

We can use the function cluster.stats() to answer this question.

A cross-tabulation can be computed as follow:

table(iris$Species, km.res$cluster)
##             
##               1  2  3
##   setosa     50  0  0
##   versicolor  0 11 39
##   virginica   0 36 14

It can be seen that:

  • All setosa samples (n = 50) have been assigned to cluster 1
  • Most versicolor samples (n = 39) have been assigned to cluster 3; the remaining ones (n = 11) fall in cluster 2
  • Most virginica samples (n = 36) have been assigned to cluster 2; the remaining ones (n = 14) fall in cluster 3

It’s possible to quantify the agreement between Species and the k-means clusters using either the corrected Rand index or Meila’s VI, as follow:

library("fpc")
# Compute cluster stats
species <- as.numeric(iris$Species)
clust_stats <- cluster.stats(d = dist(iris.scaled), 
                             species, km.res$cluster)
# Corrected Rand index
clust_stats$corrected.rand
## [1] 0.6201352
# VI
clust_stats$vi
## [1] 0.7477749

The corrected Rand index provides a measure for assessing the similarity between two partitions, adjusted for chance. Its range is -1 (no agreement) to 1 (perfect agreement). The agreement between the species and the k-means solution is 0.62 as measured by the corrected Rand index; Meila’s VI is 0.748 (for VI, lower values indicate closer agreement).

The same analysis can be computed for both pam and hierarchical clustering:

# Agreement between species and pam clusters
table(iris$Species, pam.res$cluster)
##             
##               1  2  3
##   setosa     50  0  0
##   versicolor  0  9 41
##   virginica   0 36 14
cluster.stats(d = dist(iris.scaled), 
              species, pam.res$cluster)$vi
## [1] 0.7129034
# Agreement between species and HC clusters
table(iris$Species, res.hc$cluster)
##             
##               1  2  3
##   setosa     49  1  0
##   versicolor  0 50  0
##   virginica   0 24 26
cluster.stats(d = dist(iris.scaled), 
              species, res.hc$cluster)$vi
## [1] 0.6097098

External clustering validation can be used to select a suitable clustering algorithm for a given data set.

7 Infos

This analysis has been performed using R software (ver. 3.2.1)

  • Malika Charrad, Nadia Ghazzali, Veronique Boiteau, Azam Niknafs (2014). NbClust: An R Package for Determining the Relevant Number of Clusters in a Data Set. Journal of Statistical Software, 61(6), 1-36. URL http://www.jstatsoft.org/v61/i06/.
  • Theodoridis S, Koutroubas K (2008). Pattern Recognition. 4th edition. Academic Press.

Determining the optimal number of clusters: 3 must-know methods - Unsupervised Machine Learning



The first step in clustering analysis is to assess whether the dataset is clusterable. This has been described in a chapter entitled: Assessing Clustering Tendency.

Partitioning methods, such as k-means clustering, also require the user to specify the number of clusters to be generated.

One fundamental question is: If the data is clusterable, then how to choose the right number of expected clusters (k)?

Unfortunately, there is no definitive answer to this question. The optimal clustering is somewhat subjective and depends on the method used for measuring similarities and on the parameters used for partitioning.

A simple and popular solution consists of inspecting the dendrogram produced using hierarchical clustering to see if it suggests a particular number of clusters. Unfortunately this approach is, again, subjective.

In this article, we’ll describe different methods for determining the optimal number of clusters for k-means, PAM and hierarchical clustering. These methods include direct methods and statistical testing methods.


  • Direct methods consist of optimizing a criterion, such as the within-cluster sums of squares or the average silhouette. The corresponding methods are named the elbow and silhouette methods, respectively (a minimal elbow-method sketch is shown just after this list).
  • Testing methods consist of comparing evidence against a null hypothesis. An example is the gap statistic.
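
The elbow idea can be illustrated in a few lines of base R: compute the total within-cluster sum of squares for a range of values of k and look for a bend (the “elbow”) in the curve. This is only a minimal sketch; it uses the same scaled iris data that is prepared in the Data preparation section below.

# Elbow method sketch: total within-cluster sum of squares for k = 1 to 10
iris.scaled <- scale(iris[, -5]) # same preprocessing as in the Data preparation section
set.seed(123)
wss <- sapply(1:10, function(k) kmeans(iris.scaled, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b", pch = 19,
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")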


In addition to the elbow, silhouette and gap statistic methods, more than thirty other indices have been published for identifying the optimal number of clusters. We’ll provide R code for computing these indices, so that the best number of clusters can be chosen using the “majority rule”.

For each of these methods:

  • We’ll describe the basic idea, the algorithm and the key mathematical concept
  • We’ll provide easy-to-use R code, with many examples, for determining the optimal number of clusters and visualizing the output

1 Required packages

The following packages will be used:

  • cluster for computing pam and for analyzing cluster silhouettes
  • factoextra for visualizing clusters using ggplot2 plotting system
  • NbClust for finding the optimal number of clusters

Install factoextra package as follow:

if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/factoextra")

The remaining packages can be installed using the code below:

pkgs <- c("cluster",  "NbClust")
install.packages(pkgs)

Load packages:

library(factoextra)
library(cluster)
library(NbClust)

2 Data preparation

The data set iris is used. We start by excluding the species column and scaling the data using the function scale():

# Load the data
data(iris)
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
# Remove species column (5) and scale the data
iris.scaled <- scale(iris[, -5])

This iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.

3 Example of partitioning method results

The functions kmeans() [in stats package] and pam() [in cluster package] are described in this section. We’ll split the data into 3 clusters as follow:

# K-means clustering
set.seed(123)
km.res <- kmeans(iris.scaled, 3, nstart = 25)
# k-means group number of each observation
km.res$cluster
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 3 3 3 2 3 3 3 3 3 3 3 3 2 3 3 3 3
##  [71] 2 3 3 3 3 2 2 2 3 3 3 3 3 3 3 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 2 2 2
## [106] 2 3 2 2 2 2 2 2 3 3 2 2 2 2 3 2 3 2 3 2 2 3 2 2 2 2 2 2 3 3 2 2 2 3 2
## [141] 2 2 3 2 2 2 3 2 2 3
# Visualize k-means clusters
fviz_cluster(km.res, data = iris.scaled, geom = "point",
             stand = FALSE, frame.type = "norm")

Optimal number of clusters - R data visualization

# PAM clustering
library("cluster")
pam.res <- pam(iris.scaled, 3)
pam.res$cluster
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 3 3 3 2 3 3 3 3 3 3 3 3 2 3 3 3 3
##  [71] 3 3 3 3 3 2 2 2 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 2 2 2
## [106] 2 3 2 2 2 2 2 2 3 2 2 2 2 2 3 2 3 2 3 2 2 3 3 2 2 2 2 2 3 3 2 2 2 3 2
## [141] 2 2 3 2 2 2 3 2 2 3
# Visualize pam clusters
fviz_cluster(pam.res, stand = FALSE, geom = "point",
             frame.type = "norm")

Optimal number of clusters - R data visualization

Read more about partitioning methods: Partitioning clustering

4 Example of hierarchical clustering results

The built-in R function hclust() is used:

# Compute pairwise distance matrices
dist.res <- dist(iris.scaled, method = "euclidean")
# Hierarchical clustering results
hc <- hclust(dist.res, method = "complete")
# Visualization of hclust
plot(hc, labels = FALSE, hang = -1)
# Add rectangle around 3 groups
rect.hclust(hc, k = 3, border = 2:4) 

Optimal number of clusters - R data visualization

# Cut into 3 groups
hc.cut <- cutree(hc, k = 3)
head(hc.cut, 20)
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Read more about hierarchical clustering: Hierarchical clustering

6 NbClust: A Package providing 30 indices for determining the best number of clusters

6.1 Overview of NbClust package

As mentioned in the introduction of this article, many indices have been proposed in the literature for determining the optimal number of clusters in a partitioning of a data set during the clustering process.

NbClust package, published by Charrad et al., 2014, provides 30 indices for determining the relevant number of clusters and proposes to users the best clustering scheme from the different results obtained by varying all combinations of number of clusters, distance measures, and clustering methods.

An important advantage of NbClust is that the user can simultaneously compute multiple indices and determine the number of clusters in a single function call.

The indices provided in the NbClust package include the gap statistic, the silhouette method and 28 other indices described comprehensively in the original paper of Charrad et al., 2014.

6.2 NbClust R function

The simplified format of the function NbClust() is:

NbClust(data = NULL, diss = NULL, distance = "euclidean",
        min.nc = 2, max.nc = 15, method = NULL, index = "all")

  • data: matrix
  • diss: dissimilarity matrix to be used. By default, diss=NULL, but if it is replaced by a dissimilarity matrix, distance should be “NULL”
  • distance: the distance measure to be used to compute the dissimilarity matrix. Possible values include “euclidean”, “manhattan” or “NULL”.
  • min.nc, max.nc: minimal and maximal number of clusters, respectively
  • method: The cluster analysis method to be used including “ward.D”, “ward.D2”, “single”, “complete”, “average” and more
  • index: the index to be calculated including “silhouette”, “gap” and more.


The value of NbClust() function includes the following elements:

  • All.index: Values of indices for each partition of the dataset obtained with a number of clusters between min.nc and max.nc
  • All.CriticalValues: Critical values of some indices for each partition obtained with a number of clusters between min.nc and max.nc
  • Best.nc: Best number of clusters proposed by each index and the corresponding index value
  • Best.partition: Partition that corresponds to the best number of clusters

6.3 Examples of usage

Note that the user can request indices one by one by setting the argument index to the name of the index of interest, for example index = “gap”.

In this case, NbClust function displays:

  • the gap statistic values of the partitions obtained with number of clusters varying from min.nc to max.nc ($All.index)
  • the optimal number of clusters ($Best.nc)
  • and the partition corresponding to the best number of clusters ($Best.partition)

6.3.1 Compute only an index of interest

The following example determines the number of clusters using the gap statistic:

library("NbClust")
set.seed(123)
res.nb <- NbClust(iris.scaled, distance = "euclidean",
                  min.nc = 2, max.nc = 10, 
                  method = "complete", index ="gap") 
res.nb # print the results
## $All.index
##       2       3       4       5       6       7       8       9      10 
## -0.2899 -0.2303 -0.6915 -0.8606 -1.0506 -1.3223 -1.3303 -1.4759 -1.5551 
## 
## $All.CriticalValues
##       2       3       4       5       6       7       8       9      10 
## -0.0539  0.4694  0.1787  0.2009  0.2848  0.0230  0.1631  0.0988  0.1708 
## 
## $Best.nc
## Number_clusters     Value_Index 
##          3.0000         -0.2303 
## 
## $Best.partition
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [36] 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 3 3 3 2 3 2 3 2 3 2 2 3 2 3 3 3 3 2 2 2
##  [71] 3 3 3 3 3 3 3 3 3 2 2 2 2 3 3 3 3 2 3 2 2 3 2 2 2 3 3 3 2 2 3 3 3 3 3
## [106] 3 2 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [141] 3 3 3 3 3 3 3 3 3 3

The elements returned by the function NbClust() are accessible using the R code below:

# All gap statistic values
res.nb$All.index

# Best number of clusters
res.nb$Best.nc

# Best partition
res.nb$Best.partition

6.3.2 Compute all the 30 indices

The following example computes all 30 indices, in a single function call, for determining the number of clusters and suggests the best clustering scheme to the user. The description of the indices is available in the NbClust documentation (see ?NbClust).

To compute multiple indices simultaneously, the possible values for the argument index can be i) “alllong” or ii) “all”. The option “alllong” requires more time, as the run of some indices, such as Gamma, Tau, Gap and Gplus, is computationally very expensive. The user can avoid computing these four indices by setting the argument index to “all”. In this case, only 26 indices are calculated.

With the “alllong” option, the output of the NbClust function contains:


  • all validation indices
  • critical values for Duda, Gap, PseudoT2 and Beale indices
  • the number of clusters corresponding to the optimal score for each index
  • the best number of clusters proposed by NbClust according to the majority rule
  • the best partition


The R code below computes NbClust() with index = “all”:

nb <- NbClust(iris.scaled, distance = "euclidean", min.nc = 2,
        max.nc = 10, method = "complete", index ="all")
# Print the result
nb

It’s possible to visualize the result using the function fviz_nbclust() [in factoextra], as follow:

fviz_nbclust(nb) + theme_minimal()
## Among all indices: 
## ===================
## * 2 proposed  0 as the best number of clusters
## * 1 proposed  1 as the best number of clusters
## * 2 proposed  2 as the best number of clusters
## * 18 proposed  3 as the best number of clusters
## * 3 proposed  10 as the best number of clusters
## 
## Conclusion
## =========================
## * Accoridng to the majority rule, the best number of clusters is  3 .

Optimal number of clusters - R data visualization


  • …
  • 2 indices proposed 2 as the best number of clusters
  • 18 indices proposed 3 as the best number of clusters
  • 3 indices proposed 10 as the best number of clusters

According to the majority rule, the best number of clusters is 3.


7 Infos

This analysis has been performed using R software (ver. 3.2.1)

  • Charrad M., Ghazzali N., Boiteau V., Niknafs A. (2014). NbClust: An R Package for Determining the Relevant Number of Clusters in a Data Set. Journal of Statistical Software, 61(6), 1-36.
  • Kaufman, L. and Rousseeuw, P.J. (1990). Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York.
  • Tibshirani, R., Walther, G. and Hastie, T. (2001). Estimating the number of data clusters via the Gap statistic. Journal of the Royal Statistical Society B, 63, 411–423.

Hybrid hierarchical k-means clustering for optimizing clustering outputs - Unsupervised Machine Learning



Clustering algorithms are used to split a dataset into several groups (i.e clusters), so that the objects in the same group are as similar as possible and the objects in different groups are as dissimilar as possible.

The most popular clustering algorithms are:

  • k-means clustering
  • agglomerative hierarchical clustering

However, each of these two standard clustering methods has its limitations. K-means clustering requires the user to specify the number of clusters in advance and selects initial centroids randomly. Agglomerative hierarchical clustering is good at identifying small clusters but not large ones.

In this article, we document hybrid approaches for easily mixing the best of k-means clustering and hierarchical clustering.

1 How this article is organized

We’ll start by demonstrating why we should combine k-means and hierarchical clustering. An application is provided using R software.

Finally, we’ll provide an easy-to-use R function (in the factoextra package) for computing hybrid hierarchical k-means clustering.

2 Required R packages

We’ll use the R package factoextra, which is very helpful for simplifying clustering workflows and for visualizing clusters using the ggplot2 plotting system.

Install factoextra package as follow:

if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/factoextra")

Load the package:

library(factoextra)

3 Data preparation

We’ll use the USArrests data set and start by scaling the data:

# Load the data
data(USArrests)
# Scale the data
df <- scale(USArrests)
head(df)
##                Murder   Assault   UrbanPop         Rape
## Alabama    1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska     0.50786248 1.1068225 -1.2117642  2.484202941
## Arizona    0.07163341 1.4788032  0.9989801  1.042878388
## Arkansas   0.23234938 0.2308680 -1.0735927 -0.184916602
## California 0.27826823 1.2628144  1.7589234  2.067820292
## Colorado   0.02571456 0.3988593  0.8608085  1.864967207

If you want to understand why the data are scaled before the analysis, then you should read this section: Distances and scaling.

4 R function for clustering analyses

We’ll use the function eclust() [in factoextra] which provides several advantages as described in the previous chapter: Visual Enhancement of Clustering Analysis.

eclust() stands for enhanced clustering. It simplifies the workflow of clustering analysis, and it can be used for computing both hierarchical clustering and partitioning clustering in a single function call.

4.1 Example of k-means clustering

We’ll split the data into 4 clusters using k-means clustering as follow:

library("factoextra")
# K-means clustering
km.res <- eclust(df, "kmeans", k = 4,
                 nstart = 25, graph = FALSE)
# k-means group number of each observation
head(km.res$cluster, 15)
##     Alabama      Alaska     Arizona    Arkansas  California    Colorado 
##           4           3           3           4           3           3 
## Connecticut    Delaware     Florida     Georgia      Hawaii       Idaho 
##           2           2           3           4           2           1 
##    Illinois     Indiana        Iowa 
##           3           2           1
# Visualize k-means clusters
fviz_cluster(km.res,  frame.type = "norm", frame.level = 0.68)

Clustering on principal component - Unsupervised Machine Learning

# Visualize the silhouette of clusters
fviz_silhouette(km.res)
##   cluster size ave.sil.width
## 1       1   13          0.37
## 2       2   16          0.34
## 3       3   13          0.27
## 4       4    8          0.39

Clustering on principal component - Unsupervised Machine Learning

Note that the silhouette coefficient measures how well an observation is clustered and estimates the average distance between clusters (i.e., the average silhouette width). Observations with a negative silhouette are probably placed in the wrong cluster. Read more here: cluster validation statistics

Samples with negative silhouette coefficient:

# Silhouette width of observation
sil <- km.res$silinfo$widths[, 1:3]
# Objects with negative silhouette
neg_sil_index <- which(sil[, 'sil_width'] < 0)
sil[neg_sil_index, , drop = FALSE]
##          cluster neighbor   sil_width
## Missouri       3        2 -0.07318144

Read more about k-means clustering: K-means clustering

4.2 Example of hierarchical clustering

# Enhanced hierarchical clustering
res.hc <- eclust(df, "hclust", k = 4,
                method = "ward.D2", graph = FALSE) 
head(res.hc$cluster, 15)
##     Alabama      Alaska     Arizona    Arkansas  California    Colorado 
##           1           2           2           3           2           2 
## Connecticut    Delaware     Florida     Georgia      Hawaii       Idaho 
##           4           3           2           1           3           4 
##    Illinois     Indiana        Iowa 
##           2           3           4
# Dendrogram
fviz_dend(res.hc, rect = TRUE, show_labels = TRUE, cex = 0.5) 

Clustering on principal component - Unsupervised Machine Learning

# Visualize the silhouette of clusters
fviz_silhouette(res.hc)
##   cluster size ave.sil.width
## 1       1    7          0.40
## 2       2   12          0.26
## 3       3   18          0.38
## 4       4   13          0.35

Clustering on principal component - Unsupervised Machine Learning

It can be seen that three samples have a negative silhouette coefficient, indicating that they are not in the right cluster. These samples are:

# Silhouette width of observation
sil <- res.hc$silinfo$widths[, 1:3]
# Objects with negative silhouette
neg_sil_index <- which(sil[, 'sil_width'] < 0)
sil[neg_sil_index, , drop = FALSE]
##             cluster neighbor    sil_width
## Alaska            2        1 -0.005212336
## Nebraska          4        3 -0.044172624
## Connecticut       4        3 -0.078016589

Read more about hierarchical clustering: Hierarchical clustering

5 Combining hierarchical clustering and k-means

5.1 Why?

Recall that, in the k-means algorithm, a random set of observations is chosen as the initial cluster centers.

The final k-means clustering solution is very sensitive to this initial random selection of cluster centers. The result might be (slightly) different each time you compute k-means.

To avoid this, a solution is to use a hybrid approach combining hierarchical clustering and k-means. This process is named hybrid hierarchical k-means clustering (hkmeans).
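
This sensitivity can be illustrated by running kmeans() twice with a single random start (nstart = 1) and different seeds, then comparing the two solutions; the within-cluster sums of squares and the partitions may differ between runs. A minimal sketch using the scaled USArrests data (df) prepared above:

# Two k-means runs differing only in the random initialization
set.seed(1)
km1 <- kmeans(df, centers = 4, nstart = 1)
set.seed(2)
km2 <- kmeans(df, centers = 4, nstart = 1)
# Compare the quality of the two solutions
km1$tot.withinss
km2$tot.withinss
# Cross-tabulate the two partitions; cluster labels are arbitrary, so rows or
# columns split across several cells indicate genuinely different partitions
table(km1$cluster, km2$cluster)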

5.2 How ?

The procedure is as follow:

  1. Compute hierarchical clustering and cut the tree into k clusters
  2. Compute the center (i.e., the mean) of each cluster
  3. Compute k-means using the set of cluster centers (defined in step 2) as the initial cluster centers

Note that the k-means algorithm will refine the initial partitioning obtained from the hierarchical clustering (steps 1 and 2). Hence, the initial partitioning can be slightly different from the final partitioning obtained with k-means (step 3).

5.3 R codes

5.3.1 Compute hierarchical clustering and cut the tree into k-clusters:

res.hc <- eclust(df, "hclust", k = 4,
                method = "ward.D2", graph = FALSE) 
grp <- res.hc$cluster

5.3.2 Compute the centers of clusters defined by hierarchical clustering:

Cluster centers are defined as the means of variables in clusters. The function aggregate() can be used to compute the mean per group in a data frame.

# Compute cluster centers
clus.centers <- aggregate(df, list(grp), mean)
clus.centers
##   Group.1     Murder    Assault   UrbanPop        Rape
## 1       1  1.5803956  0.9662584 -0.7775109  0.04844071
## 2       2  0.7298036  1.1188219  0.7571799  1.32135653
## 3       3 -0.3250544 -0.3231032  0.3733701 -0.17068130
## 4       4 -1.0745717 -1.1056780 -0.7972496 -1.00946922
# Remove the first column
clus.centers <- clus.centers[, -1]
clus.centers
##       Murder    Assault   UrbanPop        Rape
## 1  1.5803956  0.9662584 -0.7775109  0.04844071
## 2  0.7298036  1.1188219  0.7571799  1.32135653
## 3 -0.3250544 -0.3231032  0.3733701 -0.17068130
## 4 -1.0745717 -1.1056780 -0.7972496 -1.00946922

5.3.3 K-means clustering using hierarchical clustering defined cluster-centers

km.res2 <- eclust(df, "kmeans", k = clus.centers, graph = FALSE)
fviz_silhouette(km.res2)
##   cluster size ave.sil.width
## 1       1    8          0.39
## 2       2   13          0.27
## 3       3   16          0.34
## 4       4   13          0.37

Clustering on principal component - Unsupervised Machine Learning

5.3.4 Compare the results of hierarchical clustering and hybrid approach

The R code below compares the initial clusters defined using only hierarchical clustering and the final ones defined using hierarchical clustering + k-means:

# res.hc$cluster: Initial clusters defined using hierarchical clustering
# km.res2$cluster: Final clusters defined using k-means
table(km.res2$cluster, res.hc$cluster)
##    
##      1  2  3  4
##   1  7  0  1  0
##   2  0 12  1  0
##   3  0  0 15  1
##   4  0  0  1 12

It can be seen that 3 of the observations assigned to cluster 3 by hierarchical clustering have been reclassified to clusters 1, 2 and 4 in the final solution defined by k-means clustering.

The difference can be easily visualized using the function fviz_dend() [in factoextra]. The labels are colored using k-means clusters:

fviz_dend(res.hc, k = 4, 
          k_colors = c("blue", "green3", "red", "black"),
          label_cols =  km.res$cluster[res.hc$order], cex = 0.6)

Clustering on principal component - Unsupervised Machine Learning

It can be seen that the hierarchical clustering result has been improved by the k-means algorithm.

5.3.5 Compare the results of standard k-means clustering and hybrid approach

# Final clusters defined using hierarchical k-means clustering
km.clust <- km.res$cluster

# Standard k-means clustering
set.seed(123)
res.km <- kmeans(df, centers = 4, iter.max = 100)


# comparison
table(km.clust, res.km$cluster)
##         
## km.clust  1  2  3  4
##        1 13  0  0  0
##        2  0 16  0  0
##        3  0  0 13  0
##        4  0  0  0  8

In our current example, there was no further improvement of the k-means clustering result by the hybrid approach. An improvement might be observed using another dataset.

5.4 hkmeans(): Easy-to-use function for hybrid hierarchical k-means clustering

The function hkmeans() [in factoextra] can be used to compute easily the hybrid approach of k-means on hierarchical clustering. The format of the result is similar to the one provided by the standard kmeans() function.

# Compute hierarchical k-means clustering
res.hk <-hkmeans(df, 4)
# Elements returned by hkmeans()
names(res.hk)
##  [1] "cluster"      "centers"      "totss"        "withinss"    
##  [5] "tot.withinss" "betweenss"    "size"         "iter"        
##  [9] "ifault"       "data"         "hclust"
# Print the results
res.hk
## Hierarchical K-means clustering with 4 clusters of sizes 8, 13, 16, 13
## 
## Cluster means:
##       Murder    Assault   UrbanPop        Rape
## 1  1.4118898  0.8743346 -0.8145211  0.01927104
## 2  0.6950701  1.0394414  0.7226370  1.27693964
## 3 -0.4894375 -0.3826001  0.5758298 -0.26165379
## 4 -0.9615407 -1.1066010 -0.9301069 -0.96676331
## 
## Clustering vector:
##        Alabama         Alaska        Arizona       Arkansas     California 
##              1              2              2              1              2 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              2              3              3              2              1 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              3              4              2              3              4 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              3              4              1              4              2 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              3              2              4              1              2 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              4              4              2              4              3 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              2              2              1              4              3 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              3              3              3              3              1 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              4              1              2              3              4 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              3              3              4              4              3 
## 
## Within cluster sum of squares by cluster:
## [1]  8.316061 19.922437 16.212213 11.952463
##  (between_SS / total_SS =  71.2 %)
## 
## Available components:
## 
##  [1] "cluster"      "centers"      "totss"        "withinss"    
##  [5] "tot.withinss" "betweenss"    "size"         "iter"        
##  [9] "ifault"       "data"         "hclust"
# Visualize the tree
fviz_dend(res.hk, cex = 0.6, rect = TRUE)

Clustering on principal component - Unsupervised Machine Learning

# Visualize the hkmeans final clusters
fviz_cluster(res.hk, frame.type = "norm", frame.level = 0.68)

Clustering on principal component - Unsupervised Machine Learning

6 Infos

This analysis has been performed using R software (ver. 3.2.1)

ggplot2 axis scales and transformations



This R tutorial describes how to modify x and y axis limits (minimum and maximum values) using ggplot2 package. Axis transformations (log scale, sqrt, …) and date axis are also covered in this article.

Prepare the data

ToothGrowth data is used in the following examples :

# Convert the dose column from a numeric to a factor variable
ToothGrowth$dose <- as.factor(ToothGrowth$dose)
head(ToothGrowth)
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5

Make sure that dose column is converted as a factor using the above R script.

Example of plots

library(ggplot2)
# Box plot 
bp <- ggplot(ToothGrowth, aes(x=dose, y=len)) + geom_boxplot()
bp

# scatter plot
sp<-ggplot(cars, aes(x = speed, y = dist)) + geom_point()
sp

ggplot2 axis scale, R programming

Change x and y axis limits

There are different functions to set axis limits :

  • xlim() and ylim()
  • expand_limits()
  • scale_x_continuous() and scale_y_continuous()

Use xlim() and ylim() functions

To change the range of a continuous axis, the functions xlim() and ylim() can be used as follow :

# x axis limits
sp + xlim(min, max)

# y axis limits
sp + ylim(min, max)

min and max are the minimum and the maximum values of each axis.

# Box plot : change y axis range
bp + ylim(0,50)

# scatter plots : change x and y limits
sp + xlim(5, 40)+ylim(0, 150)

ggplot2 axis scale, R programming

Use the expand_limits() function

Note that, the function expand_limits() can be used to :

  • quickly set the intercept of x and y axes at (0,0)
  • change the limits of x and y axes
# set the intercept of x and y axis at (0,0)
sp + expand_limits(x=0, y=0)

# change the axis limits
sp + expand_limits(x=c(0,30), y=c(0, 150))

ggplot2 axis scale, R programming

Use scale_xx() functions

It is also possible to use the functions scale_x_continuous() and scale_y_continuous() to change x and y axis limits, respectively.

The simplified formats of the functions are :

scale_x_continuous(name, breaks, labels, limits, trans)

scale_y_continuous(name, breaks, labels, limits, trans)

  • name : x or y axis labels
  • breaks : to control the breaks in the guide (axis ticks, grid lines, …). Among the possible values, there are :
    • NULL : hide all breaks
    • waiver() : the default break computation
    • a character or numeric vector specifying the breaks to display
  • labels : labels of axis tick marks. Allowed values are :
    • NULL for no labels
    • waiver() for the default labels
    • character vector to be used for break labels
  • limits : a numeric vector specifying x or y axis limits (min, max)
  • trans for axis transformations. Possible values are “log2”, “log10”, …


The functions scale_x_continuous() and scale_y_continuous() can be used as follow :

# Change x and y axis labels, and limits
sp + scale_x_continuous(name="Speed of cars", limits=c(0, 30)) +
  scale_y_continuous(name="Stopping distance", limits=c(0, 150))

ggplot2 axis scale, R programming
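
The breaks and labels arguments described above can be combined in the same call. For example (a minimal sketch; the tick mark labels are purely illustrative):

# Custom axis breaks and tick mark labels
sp + scale_x_continuous(breaks = seq(5, 25, by = 5),
                        labels = paste(seq(5, 25, by = 5), "mph"))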

Axis transformations

Log and sqrt transformations

Built in functions for axis transformations are :

  • scale_x_log10(), scale_y_log10() : for log10 transformation
  • scale_x_sqrt(), scale_y_sqrt() : for sqrt transformation
  • scale_x_reverse(), scale_y_reverse() : to reverse coordinates
  • coord_trans(x =“log10”, y=“log10”) : possible values for x and y are “log2”, “log10”, “sqrt”, …
  • scale_x_continuous(trans=‘log2’), scale_y_continuous(trans=‘log2’) : another allowed value for the argument trans is ‘log10’

These functions can be used as follow :

# Default scatter plot
sp <- ggplot(cars, aes(x = speed, y = dist)) + geom_point()
sp

# Log transformation using scale_xx()
# possible values for trans : 'log2', 'log10','sqrt'
sp + scale_x_continuous(trans='log2') +
  scale_y_continuous(trans='log2')

# Sqrt transformation
sp + scale_y_sqrt()

# Reverse coordinates
sp + scale_y_reverse() 

ggplot2 axis scale, R programming

The function coord_trans() can also be used for axis transformations.

# Possible values for x and y : "log2", "log10", "sqrt", ...
sp + coord_trans(x="log2", y="log2")

ggplot2 axis scale, R programming

Format axis tick mark labels

Axis tick marks can be set to show exponents. The scales package is required to access break formatting functions.

# Log2 scaling of the y axis (with visually-equal spacing)
library(scales)
sp + scale_y_continuous(trans = log2_trans())

# show exponents
sp + scale_y_continuous(trans = log2_trans(),
    breaks = trans_breaks("log2", function(x) 2^x),
    labels = trans_format("log2", math_format(2^.x)))

ggplot2 axis scale, R programming

Note that many transformation functions are available using the scales package : log10_trans(), sqrt_trans(), etc. Use help(trans_new) for a full list.

Format axis tick mark labels :

library(scales)
# Percent
sp + scale_y_continuous(labels = percent)

# dollar
sp + scale_y_continuous(labels = dollar)

# scientific
sp + scale_y_continuous(labels = scientific)

ggplot2 axis scale, R programming

Display log tick marks

It is possible to add log tick marks using the function annotation_logticks().

Note that, these tick marks make sense only for base 10

The Animals data set, from the package MASS, is used :

library(MASS)
head(Animals)
##                     body brain
## Mountain beaver     1.35   8.1
## Cow               465.00 423.0
## Grey wolf          36.33 119.5
## Goat               27.66 115.0
## Guinea pig          1.04   5.5
## Dipliodocus     11700.00  50.0

The function annotation_logticks() can be used as follow :

library(MASS) # to access Animals data sets
library(scales) # to access break formatting functions
# x and y axis are transformed and formatted
p2 <- ggplot(Animals, aes(x = body, y = brain)) + geom_point() +
     scale_x_log10(breaks = trans_breaks("log10", function(x) 10^x),
              labels = trans_format("log10", math_format(10^.x))) +
     scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
              labels = trans_format("log10", math_format(10^.x))) +
     theme_bw()
# log-log plot without log tick marks
p2

# Show log tick marks
p2 + annotation_logticks()  

ggplot2 axis scale, R programming

Note that, default log ticks are on bottom and left.

To specify the sides of the log ticks :

# Log ticks on left and right
p2 + annotation_logticks(sides="lr")

# All sides
p2+annotation_logticks(sides="trbl")

Allowed values for the argument sides are :

  • t : for top
  • r : for right
  • b : for bottom
  • l : for left
  • the combination of t, r, b and l

Format date axes

The functions scale_x_date() and scale_y_date() are used.

Example of data

Create some time series data

df <- data.frame(
  date = seq(Sys.Date(), len=100, by="1 day")[sample(100, 50)],
  price = runif(50)
)
df <- df[order(df$date), ]
head(df)
##          date      price
## 15 2015-01-31 0.34336462
## 42 2015-02-01 0.13820774
## 7  2015-02-02 0.01554777
## 44 2015-02-03 0.27000225
## 10 2015-02-04 0.29162466
## 26 2015-02-06 0.58560998

Plot with dates

# Plot with date
dp <- ggplot(data=df, aes(x=date, y=price)) + geom_line()
dp

ggplot2 axis scale, R programming

Format axis tick mark labels

Load the package scales to access break formatting functions.

library(scales)
# Format : month/day
dp + scale_x_date(labels = date_format("%m/%d")) +
  theme(axis.text.x = element_text(angle=45))

# Format : Week
dp + scale_x_date(labels = date_format("%W"))

# Months only
dp + scale_x_date(breaks = date_breaks("months"),
  labels = date_format("%b"))

ggplot2 axis scale, R programming

Date axis limits

US economic time series data sets (from ggplot2 package) are used :

head(economics)
##         date   pce    pop psavert uempmed unemploy
## 1 1967-06-30 507.8 198712     9.8     4.5     2944
## 2 1967-07-31 510.9 198911     9.8     4.7     2945
## 3 1967-08-31 516.7 199113     9.0     4.6     2958
## 4 1967-09-30 513.3 199311     9.8     4.9     3143
## 5 1967-10-31 518.5 199498     9.7     4.7     3066
## 6 1967-11-30 526.2 199657     9.4     4.8     3018

Create the plot of psavert by date :

  • date : Month of data collection
  • psavert : personal savings rate
# Plot with dates
dp <- ggplot(data=economics, aes(x=date, y=psavert)) + geom_line()
dp

# Axis limits c(min, max)
min <- as.Date("2002-1-1")
max <- max(economics$date)
dp+ scale_x_date(limits = c(min, max))

ggplot2 axis scale, R programming

Go further

See also the functions scale_x_datetime() and scale_y_datetime() to plot data containing dates and times.
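
For example, a date-time axis can be formatted in a similar way (a minimal, self-contained sketch with simulated data; the break interval and label format are illustrative):

library(ggplot2)
library(scales)
# Simulated date-time data: one value per hour over two days
dtf <- data.frame(
  time  = seq(as.POSIXct("2015-01-01 00:00"), by = "hour", length.out = 48),
  value = rnorm(48)
)
ggplot(dtf, aes(x = time, y = value)) + geom_line() +
  scale_x_datetime(breaks = date_breaks("6 hours"),
                   labels = date_format("%H:%M"))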

Infos

This analysis has been performed using R software (ver. 3.1.2) and ggplot2 (ver. )

ggplot2 colors : How to change colors automatically and manually?



The goal of this article is to describe how to change the color of a graph generated using R software and ggplot2 package. A color can be specified either by name (e.g.: “red”) or by hexadecimal code (e.g. : “#FF1234”). The different color systems available in R are described at this link : colors in R.

In this R tutorial, you will learn how to :

  • change colors by groups (automatically and manually)
  • use RColorBrewer and Wes Anderson color palettes
  • use gradient colors

ggplot2 color, graph, R software

Prepare the data

ToothGrowth and mtcars data sets are used in the examples below.

# Convert dose and cyl columns from numeric to factor variables
ToothGrowth$dose <- as.factor(ToothGrowth$dose)
mtcars$cyl <- as.factor(mtcars$cyl)
head(ToothGrowth)
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Make sure that the columns dose and cyl are converted as factor variables using the R script above.

Simple plots

library(ggplot2)
# Box plot
ggplot(ToothGrowth, aes(x=dose, y=len)) +geom_boxplot()

# scatter plot
ggplot(mtcars, aes(x=wt, y=mpg)) + geom_point()

ggplot2 color, graph, R software

Use a single color

# box plot
ggplot(ToothGrowth, aes(x=dose, y=len)) +
  geom_boxplot(fill='#A4A4A4', color="darkred")

# scatter plot
ggplot(mtcars, aes(x=wt, y=mpg)) + 
  geom_point(color='darkblue')

ggplot2 color, graph, R software

Change colors by groups

Default colors

The following R code changes the color of the graph by the levels of dose :

# Box plot
bp<-ggplot(ToothGrowth, aes(x=dose, y=len, fill=dose)) +
  geom_boxplot()
bp

# Scatter plot
sp<-ggplot(mtcars, aes(x=wt, y=mpg, color=cyl)) + geom_point()
sp

ggplot2 color, graph, R software

The lightness (l) and the chroma (c, intensity of color) of the default (hue) colors can be modified using the functions scale_hue as follow :

# Box plot
bp + scale_fill_hue(l=40, c=35)

# Scatter plot
sp + scale_color_hue(l=40, c=35)

ggplot2 color, graph, R software

Note that, the default values for l and c are : l = 65, c = 100.

Change colors manually

Custom color palettes can be specified using the functions :

  • scale_fill_manual() for box plot, bar plot, violin plot, etc
  • scale_color_manual() for lines and points
# Box plot
bp + scale_fill_manual(values=c("#999999", "#E69F00", "#56B4E9"))

# Scatter plot
sp + scale_color_manual(values=c("#999999", "#E69F00", "#56B4E9"))

ggplot2 color, graph, R software

Note that, the argument breaks can be used to control the appearance of the legend. This holds true also for the other scale_xx() functions.

# Box plot
bp + scale_fill_manual(breaks = c("2", "1", "0.5"), 
                       values=c("red", "blue", "green"))

# Scatter plot
sp + scale_color_manual(breaks = c("8", "6", "4"),
                        values=c("red", "blue", "green"))

ggplot2 color, graph, R software

The built-in color names and a color code chart are described here : color in R.

Use RColorBrewer palettes

The color palettes available in the RColorBrewer package are described here : color in R.

# Box plot
bp + scale_fill_brewer(palette="Dark2")

# Scatter plot
sp + scale_color_brewer(palette="Dark2")

ggplot2 color, graph, R software

The available color palettes in the RColorBrewer package are :

RColorBrewer palettes

Use Wes Anderson color palettes

Install and load the color palettes as follow :

# Install
install.packages("wesanderson")
# Load
library(wesanderson)

The available color palettes are :

wesanderson-color palettes

library(wesanderson)
# Box plot
bp+scale_fill_manual(values=wes_palette(n=3, name="GrandBudapest"))

# Scatter plot
sp+scale_color_manual(values=wes_palette(n=3, name="GrandBudapest"))

ggplot2 color, graph, R software

Use gray colors

The functions to use are :

  • scale_colour_grey() for points, lines, etc
  • scale_fill_grey() for box plot, bar plot, violin plot, etc
# Box plot
bp + scale_fill_grey() + theme_classic()

# Scatter plot
sp + scale_color_grey() + theme_classic()

ggplot2 color, graph, R software

Change the gray value at the low and the high ends of the palette :

# Box plot
bp + scale_fill_grey(start=0.8, end=0.2) + theme_classic()

# Scatter plot
sp + scale_color_grey(start=0.8, end=0.2) + theme_classic()

ggplot2 color, graph, R software

Note that, the default value for the arguments start and end are : start = 0.2, end = 0.8

Continuous colors

The graph can be colored according to the values of a continuous variable using the functions :

  • scale_color_gradient(), scale_fill_gradient() for sequential gradients between two colors
  • scale_color_gradient2(), scale_fill_gradient2() for diverging gradients
  • scale_color_gradientn(), scale_fill_gradientn() for gradient between n colors

Gradient colors for scatter plots

The graphs are colored using the qsec continuous variable :

# Color by qsec values
sp2<-ggplot(mtcars, aes(x=wt, y=mpg, color=qsec)) + geom_point()
sp2

# Change the low and high colors
# Sequential color scheme
sp2+scale_color_gradient(low="blue", high="red")

# Diverging color scheme
mid<-mean(mtcars$qsec)
sp2+scale_color_gradient2(midpoint=mid, low="blue", mid="white",
                     high="red", space ="Lab" )

ggplot2 color, graph, R software

Gradient colors for histogram plots

set.seed(1234)
x <- rnorm(200)
# Histogram
hp<-qplot(x =x, fill=..count.., geom="histogram") 
hp

# Sequential color scheme
hp+scale_fill_gradient(low="blue", high="red")

ggplot2 color, graph, R software

Note that, the functions scale_color_continuous() and scale_fill_continuous() can be used also to set gradient colors.

Gradient between n colors

# Scatter plot
# Color points by the mpg variable
sp3<-ggplot(mtcars, aes(x=wt, y=mpg, color=mpg)) + geom_point()
sp3

# Gradient between n colors
sp3+scale_color_gradientn(colours = rainbow(5))

ggplot2 color, graph, R software

Infos

This analysis has been performed using R software (ver. 3.1.2) and ggplot2 (ver. 1.0.0)

ggplot2 texts : Add text annotations to a graph in R software



To add a text to a plot generated using ggplot2, the functions below can be used :

  • geom_text()
  • annotate()
  • annotation_custom()

Create some data

df <- data.frame(x=1:3, y=1:3, 
               name=c("Text1", "Text with \n 2 lines", "Text3"))
head(df)
##   x y                 name
## 1 1 1                Text1
## 2 2 2 Text with \n 2 lines
## 3 3 3                Text3

Text annotations using the function geom_text

library(ggplot2)

# Simple scatter plot
sp <- ggplot(data = df, aes(x, y, label=name)) +
  geom_point()+xlim(0,3.5)+ylim(0,3.5)

# Add texts
sp + geom_text()

# Change the size of the texts
sp + geom_text(size=6)

# Change vertical and horizontal adjustement
sp +  geom_text(hjust=0, vjust=0)

# Change fontface. Allowed values : 1(normal),
# 2(bold), 3(italic), 4(bold.italic)
sp + geom_text(aes(fontface=2))

ggplot2 add texts to a graph in R

Change the text color and size by groups

It’s possible to change the appearance of the texts using aesthetics (color, size,…) :

sp2 <- ggplot(mtcars, aes(x=wt, y=mpg, label=rownames(mtcars)))+
  geom_point()

# Color by groups
sp2 + geom_text(aes(color=factor(cyl)))

ggplot2 add texts to a graph in R

# Set the size of the text using a continuous variable
sp2 + geom_text(aes(size=wt))

ggplot2 add texts to a graph in R

sp2 + geom_text(aes(size=wt)) + scale_size(range=c(3,6))

ggplot2 add texts to a graph in R

Add a text annotation at a particular coordinate

The functions geom_text() and annotate() can be used :

# Solution 1
sp2 + geom_text(x=3, y=30, label="Scatter plot")

# Solution 2
sp2 + annotate(geom="text", x=3, y=30, label="Scatter plot",
              color="red")

ggplot2 add texts to a graph in R

annotation_custom : Add a static text annotation in the top-right, top-left, …

The functions annotation_custom() and textGrob() are used to add static annotations which are the same in every panel. The grid package is required :

library(grid)
# Create a text
grob <- grobTree(textGrob("Scatter plot", x=0.1,  y=0.95, hjust=0,
  gp=gpar(col="red", fontsize=13, fontface="italic")))
# Plot
sp2 + annotation_custom(grob)

ggplot2 add texts to a graph in R

Facet : In the plot below, the annotation is at the same place (in each facet) even if the axis scales vary.

sp2 + annotation_custom(grob)+facet_wrap(~cyl, scales="free")

ggplot2 add texts to a graph in R

Infos

This analysis has been performed using R software (ver. 3.1.2) and ggplot2 (ver. )

GGally R package: Extension to ggplot2 for correlation matrix and survival plots - R software and data visualization



GGally extends ggplot2 by providing several functions including:

  • ggcorr(): for pairwise correlation matrix plot
  • ggpairs(): for scatterplot plot matrix
  • ggsurv(): for survival plot

Installation

GGally can be installed from GitHub or CRAN:

# Github
if(!require(devtools)) install.packages("devtools")
devtools::install_github("ggobi/ggally")
# CRAN
install.packages("GGally")

Loading GGally package

library("GGally")

ggcorr(): Plot a correlation matrix

The function ggcorr() draws a correlation matrix plot using ggplot2.

The simplified format is:

ggcorr(data, palette = "RdYlGn", name = "rho", 
       label = FALSE, label_color = "black",  ...)

  • data: a numerical (continuous) data matrix
  • palette: a ColorBrewer palette to be used for correlation coefficients. Default value is “RdYlGn”.
  • name: a character string used for legend title.
  • label: logical value. If TRUE, the correlation coefficients are displayed on the plot.
  • label_color: color to be used for the correlation coefficient


The function ggcorr() can be used as follow:

# Prepare some data
df <- mtcars[, c(1,3,4,5,6,7)]

# Correlation plot
ggcorr(df, palette = "RdBu", label = TRUE)

ggplot2 and ggally - R software and data visualization

Read also: ggplot2 correlation matrix heatmap

ggpairs(): ggplot2 matrix of plots

The function ggpairs() produces a matrix of scatter plots for visualizing the correlation between variables.

The simplified format is:

ggpairs(data, columns = 1:ncol(data), title = "",  
  axisLabels = "show", columnLabels = colnames(data[, columns]))

  • data: data set. Can have both numerical and categorical data.
  • columns: columns to be used for the plots. Default is all columns.
  • title: title for the graph
  • axisLabels: Allowed values are either “show” to display axisLabels, “internal” for labels in the diagonal plots, or “none” for no axis labels
  • columnLabels: label names to be displayed. Defaults to names of columns being used.


ggpairs(df)

ggplot2 and ggally - R software and data visualization
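
The columns, columnLabels and title arguments described above can be used to restrict the plot to a few variables and to rename the panels; a minimal sketch (the panel labels are illustrative):

# Scatter plot matrix of mpg, hp and wt only, with custom panel labels
ggpairs(mtcars, columns = c(1, 4, 6),
        columnLabels = c("Miles/gallon", "Horsepower", "Weight"),
        title = "Selected mtcars variables")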

ggsurv(): Plot survival curve using ggplot2

The function ggsurv() can be used to produce Kaplan-Meier plots using ggplot2.

The simplified format is:

ggsurv(s, surv.col = "gg.def", plot.cens = TRUE, cens.col = "red",
       xlab = "Time", ylab = "Survival", main = "")

  • s: an object of class survfit
  • surv.col: color of the survival estimate. The default value is black for one stratum; default ggplot2 colors for multiple strata. It can be also a vector containing the color names for each stratum.
  • plot.cens: logical value. If TRUE, marks the censored observations.
  • cens.col: color of the points that mark censored observations.
  • xlab, ylab: label of x-axis and y-axis, respectively
  • main: the plot main title


Data

We’ll use lung data from the package survival:

require(survival)
data(lung, package = "survival")
head(lung[, 1:5])
##   inst time status age sex
## 1    3  306      2  74   1
## 2    3  455      2  68   1
## 3    3 1010      1  56   1
## 4    5  210      2  57   1
## 5    1  883      2  60   1
## 6   12 1022      1  74   1

The data above includes:

  • time: Survival time in days
  • status: censoring status 1 = censored, 2 = dead
  • sex: Male = 1; Female = 2

In the next section, we’ll plot the survival curves of males and females.

Survival curves

require("survival")
# Fit survival functions
surv <- survfit(Surv(time, status) ~ sex, data = lung)

# Plot survival curves
surv.p <- ggsurv(surv)
surv.p

ggplot2 and ggally - R software and data visualization
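
The colors of the strata and of the censoring marks can also be set directly through the surv.col and cens.col arguments described above (a minimal sketch; the color choices are illustrative):

# One color per stratum (sex = 1, sex = 2), black marks for censored observations
ggsurv(surv, surv.col = c("blue", "red"), cens.col = "black")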

It’s possible to change the legend of the plot as follow:

require(ggplot2)
surv.p + guides(linetype = FALSE) +
scale_colour_discrete(name   = 'Sex', breaks = c(1,2), 
                      labels = c('Male', 'Female'))

ggplot2 and ggally - R software and data visualization

Infos

This analysis has been performed using R software (ver. 3.2.1) and ggplot2 (ver. 1.0.1)
