Plot titles can be specified either directly to the plotting functions during the plot creation or by using the title() function (to add titles on an existing plot).
Legends can be added with the legend() function. A simplified format is:
legend(x, y=NULL, legend, col)
x and y: the coordinates to be used to position the legend. Keywords can also be used for x: bottomright, bottom, bottomleft, left, topleft, top, topright, right and center.
legend: the text of the legend
col: colors of lines and points beside the legend text
# Generate some data
x <- 1:10; y1 <- x*x; y2 <- 2*y1
# First line plot
plot(x, y1, type="b", pch=19, col="red", xlab="x", ylab="y")
# Add a second line
lines(x, y2, pch=18, col="blue", type="b", lty=2)
# Add legends
legend("topleft", legend=c("Line 1", "Line 2"),
col=c("red", "blue"), lty=1:2, cex=0.8)
To add text to a plot in R, the text() function [to draw text inside the plotting area] and the mtext() function [to put text in one of the four margins of the plot] can be used.
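A minimal sketch of both functions (the data here is assumed for illustration):
x <- 1:10; y <- x^2
plot(x, y, type = "b")
# Draw text inside the plotting area, centered at (4, 60)
text(x = 4, y = 60, labels = "y = x^2", col = "red")
# Put text in the bottom margin (side = 1)
mtext("x values", side = 1, line = 4)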
Line types can be changed using the graphical parameter lty.
x <- 1:10; y <- x*x
plot(x, y, type="l") # Solid line (by default)
plot(x, y, type="l", lty="dashed") # Use dashed line type
plot(x, y, type="l", lty="dashed", lwd=3) # Change line width
Colors can be specified either by name (e.g., col = "red") or as a hexadecimal code (such as col = "#FFCC00"). You can also use other color systems such as those from the RColorBrewer package.
The goal of this article is to show you how to set x and y axis limits by specifying the minimum and the maximum values of each axis. We'll also see how to set the log scale.
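For base R graphics, a short sketch (assumed data):
x <- 1:10; y <- x^3
# Set x and y axis limits with xlim and ylim
plot(x, y, xlim = c(0, 12), ylim = c(0, 1200))
# Use a log scale on the y axis
plot(x, y, log = "y")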
Infos
This analysis has been performed using R statistical software (ver. 3.2.4).
ggplot2 is a powerful and a flexible R package, implemented by Hadley Wickham, for producing elegant graphics.
The concept behind ggplot2 divides a plot into three fundamental parts: Plot = data + Aesthetics + Geometry.
The principal components of every plot can be defined as follows:
data is a data frame
Aesthetics is used to indicate the x and y variables. It can also be used to control the color, the size or the shape of points, the height of bars, etc.
Geometry defines the type of graphic (histogram, box plot, line plot, density plot, dot plot, etc.)
There are two major functions in ggplot2 package: qplot() and ggplot() functions.
qplot() stands for quick plot, and can be used to easily produce simple plots.
The ggplot() function is more flexible and robust than qplot() for building a plot piece by piece.
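As a quick illustration of the difference, here is a sketch using the built-in mtcars data (an assumed example):
library(ggplot2)
# Quick plot: a single function call
qplot(x = wt, y = mpg, data = mtcars, geom = "point")
# The same plot, built piece by piece with ggplot()
ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()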
This document provides R course material for producing different types of plots using ggplot2.
factoextra - Extract and Visualize the outputs of a multivariate analysis: PCA (Principal Component Analysis), CA (Correspondence Analysis), MCA (Multiple Correspondence Analysis) and clustering analyses.
easyggplot2: Perform and customize easily a plot with ggplot2: box plot, dot plot, strip chart, violin plot, histogram, density plot, scatter plot, bar plot, line plot, etc.
ggfortify: Allows ggplot2 to handle some popular R packages. These include plotting 1) matrices; 2) linear models and generalized linear models; 3) time series; 4) PCA/clustering; 5) survival curves; 6) probability distributions
GGally: GGally extends ggplot2 for visualizing correlation matrices, scatter plot matrices, survival plots and more.
ggRandomForests: Graphical analysis of random forests with the randomForestSRC and ggplot2 packages.
ggdendro: Create dendrograms and tree diagrams using ggplot2
ggmcmc: Tools for Analyzing MCMC Simulations from Bayesian Inference
The cowplot package is an extension to ggplot2 that can be used to produce publication-ready plots.
Basic plots
library(ggplot2)
library(cowplot)
# df is assumed to be the ToothGrowth data set, with dose as a factor
df <- ToothGrowth
df$dose <- as.factor(df$dose)
# Default plot
bp <- ggplot(df, aes(x=dose, y=len, color=dose)) +
geom_boxplot() +
theme(legend.position = "none")
bp
# Add gridlines
bp + background_grid(major = "xy", minor = "none")
Recall that the function ggsave() [in the ggplot2 package] can be used to save ggplots. However, when working with cowplot, the function save_plot() [in the cowplot package] is preferred. It's an alternative to ggsave() with better support for multi-figure plots.
save_plot("mpg.pdf", bp,
base_aspect_ratio = 1.3 # make room for figure legend
)
Arranging multiple graphs using cowplot
# Scatter plot
sp <- ggplot(mpg, aes(x = cty, y = hwy, colour = factor(cyl)))+
geom_point(size=2.5)
sp
# Bar plot
bp <- ggplot(diamonds, aes(clarity, fill = cut)) +
geom_bar() +
theme(axis.text.x = element_text(angle=70, vjust=0.5))
bp
Combine the two plots (the scatter plot and the bar plot):
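With cowplot this can be done with plot_grid(); a minimal sketch:
plot_grid(sp, bp, labels = c("A", "B"), ncol = 2)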
grid.arrange() and arrangeGrob(): Change column/row span of a plot
Using the R code below (the plots dp and sc are sketched right after this list):
The box plot will live in the first column
The dot plot and the strip chart will live in the second column
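The call below assumes a dot plot dp and a strip chart sc built from the same df; a possible sketch (these definitions are assumptions, not shown in the original):
library(gridExtra)
dp <- ggplot(df, aes(x=dose, y=len, fill=dose)) +
geom_dotplot(binaxis = "y", stackdir = "center") +
theme(legend.position = "none")
sc <- ggplot(df, aes(x=dose, y=len, color=dose)) +
geom_jitter(width = 0.1) +
theme(legend.position = "none")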
grid.arrange(bp, arrangeGrob(dp, sc), ncol = 2)
It's also possible to use the argument layout_matrix in grid.arrange(). In the R code below, layout_matrix is a 3X2 matrix (three rows and two columns). The first column is all 1s, that's where the first plot lives, spanning the three rows; the second column contains plots 2, 3 and 4, each occupying one row.
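A hedged sketch of such a call, with p1 to p4 standing for any four ggplots (hypothetical names):
# Column 1 is all 1s: plot 1 spans the three rows;
# column 2 holds plots 2, 3 and 4, one per row
grid.arrange(p1, p2, p3, p4,
layout_matrix = rbind(c(1, 2), c(1, 3), c(1, 4)))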
# 1. Create the plots
#++++++++++++++++++++++++++++++++++
# Create a box plot
bp <- ggplot(df, aes(x=dose, y=len, color=dose)) +
geom_boxplot()
# Create a violin plot
vp <- ggplot(df, aes(x=dose, y=len, color=dose)) +
geom_violin()+
geom_boxplot(width=0.1)+
theme(legend.position="none")
# 2. Save the legend
#+++++++++++++++++++++++
legend <- get_legend(bp)
# 3. Remove the legend from the box plot
#+++++++++++++++++++++++
bp <- bp + theme(legend.position="none")
# 4. Arrange ggplot2 graphs with a specific width
grid.arrange(bp, vp, legend, ncol=3, widths=c(2.3, 2.3, 0.8))
Change legend position
# 1. Create the plots
#++++++++++++++++++++++++++++++++++
# Create a box plot with a top legend position
bp <- ggplot(df, aes(x=dose, y=len, color=dose)) +
geom_boxplot()+theme(legend.position = "top")
# Create a violin plot
vp <- ggplot(df, aes(x=dose, y=len, color=dose)) +
geom_violin()+
geom_boxplot(width=0.1)+
theme(legend.position="none")
# 2. Save the legend
#+++++++++++++++++++++++
legend <- get_legend(bp)
# 3. Remove the legend from the box plot
#+++++++++++++++++++++++
bp <- bp + theme(legend.position="none")
# 4. Create a blank plot
blankPlot <- ggplot()+geom_blank(aes(1,1)) +
cowplot::theme_nothing()
Change legend position by changing the order of plots using the following R code. Grids with four cells are created (2X2). The height of the legend zone is set to 0.2.
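A sketch consistent with that description (the widths here are assumed):
# Top row: blank cell + legend (height 0.2); bottom row: the two plots
grid.arrange(blankPlot, legend, bp, vp,
ncol = 2, nrow = 2,
widths = c(2.7, 2.7), heights = c(0.2, 2.5))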
set.seed(1234)
x <- c(rnorm(500, mean = -1), rnorm(500, mean = 1.5))
y <- c(rnorm(500, mean = 1), rnorm(500, mean = 1.7))
group <- as.factor(rep(c(1,2), each=500))
df2 <- data.frame(x, y, group)
head(df2)
# Scatter plot of x and y variables and color by groups
scatterPlot <- ggplot(df2,aes(x, y, color=group)) +
geom_point() +
scale_color_manual(values = c('#999999','#E69F00')) +
theme(legend.position=c(0,1), legend.justification=c(0,1))
# Marginal density plot of x (top panel)
xdensity <- ggplot(df2, aes(x, fill=group)) +
geom_density(alpha=.5) +
scale_fill_manual(values = c('#999999','#E69F00')) +
theme(legend.position = "none")
# Marginal density plot of y (right panel)
ydensity <- ggplot(df2, aes(y, fill=group)) +
geom_density(alpha=.5) +
scale_fill_manual(values = c('#999999','#E69F00')) +
theme(legend.position = "none")
Create a complex layout using the function viewport()
The different steps are:
Create plots: p1, p2, p3, …
Move to a new page on a grid device using the function grid.newpage()
Create a layout 2X2 - number of columns = 2; number of rows = 2
Define a grid viewport : a rectangular region on a graphics device
Print a plot into the viewport
require(grid)
# Move to a new page
grid.newpage()
# Create layout : nrow = 2, ncol = 2
pushViewport(viewport(layout = grid.layout(2, 2)))
# A helper function to define a region on the layout
define_region <- function(row, col){
viewport(layout.pos.row = row, layout.pos.col = col)
}
# Arrange the plots
print(scatterPlot, vp=define_region(1, 1:2))
print(xdensity, vp = define_region(2, 1))
print(ydensity, vp = define_region(2, 2))
ggExtra: Add marginal distributions plots to ggplot2 scatter plots
The package ggExtra is an easy-to-use package developed by Dean Attali for adding marginal histograms, box plots or density plots to ggplot2 scatter plots.
The package can be installed and used as follows:
# Install
install.packages("ggExtra")
# Load
library("ggExtra")
# Create some data
set.seed(1234)
x <- c(rnorm(500, mean = -1), rnorm(500, mean = 1.5))
y <- c(rnorm(500, mean = 1), rnorm(500, mean = 1.7))
df3 <- data.frame(x, y)
# Scatter plot of x and y variables and color by groups
sp2 <- ggplot(df3,aes(x, y)) + geom_point()
# Marginal density plot
ggMarginal(sp2 + theme_gray())
# Marginal histogram plot
ggMarginal(sp2 + theme_gray(), type = "histogram",
fill = "steelblue", col = "darkblue")
Insert an external graphical element inside a ggplot
The function annotation_custom() [in ggplot2] can be used for adding tables, plots or other grid-based elements. The simplified format is:
annotation_custom(grob, xmin, xmax, ymin, ymax)
grob: the external graphical element to display
xmin, xmax : x location in data coordinates (horizontal location)
ymin, ymax : y location in data coordinates (vertical location)
The different steps are:
Create a scatter plot of y = f(x)
Add, for example, the box plot of the variables x and y inside the scatter plot using the function annotation_custom()
As the inset box plot overlaps with some points, a transparent background is used for the box plots.
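A minimal sketch of steps 1 and 2, using mtcars and a hypothetical inset grob p2_grob (the coordinates are illustrative):
library(ggplot2)
# 1. Scatter plot of y = f(x)
sp <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()
# Inset: box plot of x with a transparent background
p2 <- ggplot(mtcars, aes(x = "", y = wt)) +
geom_boxplot(alpha = 0.5) + theme_void()
p2_grob <- ggplotGrob(p2)
# 2. Insert the grob at data coordinates
sp + annotation_custom(grob = p2_grob,
xmin = 4, xmax = 5.5, ymin = 25, ymax = 34)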
If you have a solution for inserting both p2_grob and p3_grob inside the scatter plot at the same time, please leave a comment. I got some errors trying to do this.
Mix table, text and ggplot2 graphs
The functions below are required:
tableGrob() [in the gridExtra package]: for adding a data table to a graphic device
splitTextGrob() [in the RGraphics package]: for adding text to a graph
Make sure that the package RGraphics is installed.
library(RGraphics)
library(gridExtra)
# Table
p1 <- tableGrob(head(ToothGrowth))
# Text
text <- "ToothGrowth data describes the effect of Vitamin C on tooth growth in Guinea pigs. Three dose levels of Vitamin C (0.5, 1, and 2 mg) with each of two delivery methods [orange juice (OJ) or ascorbic acid (VC)] are used."
p2 <- splitTextGrob(text)
# Box plot
p3 <- ggplot(df, aes(x=dose, y=len)) + geom_boxplot()
# Arrange the plots on the same page
grid.arrange(p1, p2, p3, ncol=1)
Infos
This analysis has been performed using R software (ver. 3.2.4) and ggplot2 (ver. 2.1.0)
This chapter describes how to create static and interactive three-dimensional (3D) graphs. We also provide an R package named graph3d to easily build and customize, step by step, 3D graphs in R software.
ggplot2 by Hadley Wickham is an excellent and flexible package for elegant data visualization in R. However, the default plots require some formatting before we can send them for publication. Furthermore, the syntax for customizing a ggplot is opaque, which raises the level of difficulty for researchers with no advanced R programming skills.
The ‘ggpubr’ package provides some easy-to-use functions for creating and customizing ‘ggplot2’- based publication ready plots.
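The examples below use a wdata data frame of weights by sex. A sketch of how such data could be generated (an assumed setup):
set.seed(1234)
wdata <- data.frame(
sex = factor(rep(c("F", "M"), each = 200)),
weight = c(rnorm(200, 55), rnorm(200, 58)))
head(wdata, 4)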
## sex weight
## 1 F 53.79293
## 2 F 55.27743
## 3 F 56.08444
## 4 F 52.65430
Density plot with mean lines and marginal rug
# Change outline and fill colors by groups ("sex")
# Use custom palette
ggdensity(wdata, x = "weight",
add = "mean", rug = TRUE,
color = "sex", fill = "sex",
palette = c("#00AFBB", "#E7B800"))
Note that:
the argument palette is used for coloring or filling by groups. Allowed values include:
brewer palettes (e.g., “Dark2”) and the grey palette;
custom color palettes, e.g., c(“blue”, “red”) or c(“#00AFBB”, “#E7B800”);
and scientific journal palettes from the ggsci R package, e.g.: “npg”, “aaas”, “lancet”, “jco”, “ucscgb”, “uchicago”, “simpsons” and “rickandmorty”.
the argument add can be used to add mean or median lines to density and histogram plots. Allowed values are: “mean” and “median”.
Histogram plot with mean lines and marginal rug
# Change outline and fill colors by groups ("sex")
# Use custom color palette
gghistogram(wdata, x = "weight",
add = "mean", rug = TRUE,
color = "sex", fill = "sex",
palette = c("#00AFBB", "#E7B800"))
If you want to create the above histogram with the standard ggplot2 functions, the syntax is complex for beginners. The ggpubr package is a wrapper around ggplot2 functions that makes it easy to quickly produce a publication-ready plot.
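Box plot with jittered points
The box plot example below uses the ToothGrowth data set; an assumed setup:
df <- ToothGrowth
df$dose <- as.factor(df$dose)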
# Change outline colors by groups: dose
# Use custom color palette
# Add jitter points and change the shape by groups
ggboxplot(df, x = "dose", y = "len",
color = "dose", palette =c("#00AFBB", "#E7B800", "#FC4E07"),
add = "jitter", shape = "dose")
Note that, when using ggpubr functions for drawing box plots, violin plots, dot plots, strip charts, bar plots, line plots or error plots, the argument add can be used for adding another plot element (e.g.: dot plot or error bars).
In this case, allowed values for the argument add are one or the combination of: “none”, “dotplot”, “jitter”, “boxplot”, “mean”, “mean_se”, “mean_sd”, “mean_ci”, “mean_range”, “median”, “median_iqr”, “median_mad”, “median_range”; see ?desc_statby for more details.
Violin plots with box plots inside
# Change fill color by groups: dose
# add boxplot with white fill color
ggviolin(df, x = "dose", y = "len", fill = "dose",
palette = c("#00AFBB", "#E7B800", "#FC4E07"),
add = "boxplot", add.params = list(fill = "white"))
Dot plots with summary statistics
# Change outline and fill colors by groups: dose
# Add mean + sd
ggdotplot(df, x = "dose", y = "len", color = "dose", fill = "dose",
palette = c("#00AFBB", "#E7B800", "#FC4E07"),
add = "mean_sd", add.params = list(color = "gray"))
Recall that, possible summary statistics include “boxplot”, “mean”, “mean_se”, “mean_sd”, “mean_ci”, “mean_range”, “median”, “median_iqr”, “median_mad”, “median_range”; see ?desc_statby for more details.
Strip chart with summary statistics
# Change points size
# Change point colors and shapes by groups: dose
# Use custom color palette
ggstripchart(df, "dose", "len", size = 2, shape = "dose",
color = "dose", palette = c("#00AFBB", "#E7B800", "#FC4E07"),
add = "mean_sd")
# Change outline and fill colors by groups: dose
# Use custom color palette
# Add labels
ggbarplot(df2, x = "dose", y = "len",
fill = "dose", color = "dose",
palette = c("#00AFBB", "#E7B800", "#FC4E07"),
label = TRUE)
Use lab.pos = “in” to put labels inside bars; use lab.col to change label colors.
Bar plot with multiple groups
# Create some data
df3 <- data.frame(supp=rep(c("VC", "OJ"), each=3),
dose=rep(c("D0.5", "D1", "D2"),2),
len=c(6.8, 15, 33, 4.2, 10, 29.5))
print(df3)
# Plot "len" by "dose" and change color by a second group: "supp"
# Add labels inside bars
ggbarplot(df3, x = "dose", y = "len",
fill = "supp", color = "supp", palette = c("#00AFBB", "#E7B800"),
label = TRUE, lab.col = "white", lab.pos = "in")
Bar plot visualizing the mean of each group with error bars
# Data: the ToothGrowth data set will be used.
df <- ToothGrowth
head(df, 10)
# Visualize the mean of each group
# Change point and outline colors by groups: dose
# Add jitter points and errors (mean_se)
ggbarplot(df, x = "dose", y = "len", color = "dose",
palette = c("#00AFBB", "#E7B800", "#FC4E07"),
add = c("mean_se", "jitter"))
Line plots
Line plots with multiple groups
# Plot "len" by "dose" and
# Change line types and point shapes by a second group: "supp"
# Change color by groups "supp"
ggline(df3, x = "dose", y = "len",
linetype = "supp", shape = "supp",
color = "supp", palette = c("#00AFBB", "#E7B800"))
Line plot visualizing the mean of each group with error bars
# Visualize the mean of each group: dose
# Change colors by a second group: supp
# Add jitter points and errors (mean_se)
ggline(df, x = "dose", y = "len",
color = "supp",
palette = c("#00AFBB", "#E7B800", "#FC4E07"),
add = c("mean_se", "jitter"))
Pie chart
Create some data
df4 <- data.frame(
group = c("Male", "Female", "Child"),
value = c(25, 25, 50))
head(df4)
## group value
## 1 Male 25
## 2 Female 25
## 3 Child 50
Pie chart
# Change fill color by group
# set outline line color to white
# Use custom color palette
# Show group names and value as labels
labs <- paste0(df4$group, " (", df4$value, "%)")
ggpie(df4, x = "value", fill = "group", color = "white",
palette = c("#00AFBB", "#E7B800", "#FC4E07"),
label = labs, lab.pos = "in", lab.font = "white")
Scatter plots with regression line and confidence interval
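The scatter plot examples use a df5 data frame; a plausible setup based on mtcars (assumed):
df5 <- mtcars
df5$cyl <- as.factor(df5$cyl)
df5$name <- rownames(df5)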
ggscatter(df5, x = "wt", y = "mpg",
color = "black", shape = 21, size = 4, # Points color, shape and size
add = "reg.line", # Add regressin line
add.params = list(color = "blue", fill = "lightgray"), # Customize reg. line
conf.int = TRUE, # Add confidence interval
cor.coef = TRUE # Add correlation coefficient
)
Note that, when using ggpubr functions for drawing scatter plots, allowed values for the argument add are one of “none”, “reg.line” (for adding linear regression line) or “loess” (for adding local regression fitting).
Scatter plot with concentration ellipses and labels
# Change point colors and shapes by groups: cyl
# Use custom palette
# Add concentration ellipses with mean points (barycenters)
# Add marginal rug
# Add label and use repel = TRUE to avoid label overplotting
ggscatter(df5, x = "wt", y = "mpg",
color = "cyl", shape = "cyl",
palette = c("#00AFBB", "#E7B800", "#FC4E07"),
ellipse = TRUE, mean.point = TRUE,
rug = TRUE, label = "name", font.label = 10, repel = TRUE)
Note that, it’s possible to change the ellipse type by using the argument ellipse.type. Possible values are ‘convex’, ‘confidence’ or types supported by ggplot2::stat_ellipse() including one of c(“t”, “norm”, “euclid”).
Cleveland’s dot plots
# Change colors by group cyl
ggdotchart(df5, x = "mpg", label = "name",
group = "cyl", color = "cyl",
palette = c("#00AFBB", "#E7B800", "#FC4E07") )
ggpar(): customize ggplot easily
The function ggpar() [in ggpubr] can be used to simply and easily customize any ggplot2-based graphs. The graphical parameters that can be changed using ggpar() include:
Main titles, axis labels and legend titles
Legend position and appearance
Colors
Axis limits
Axis transformations: log and sqrt
Axis ticks
Themes
Rotate a plot
Note that all the arguments accepted by the function ggpar() can be also directly passed to the plotting functions in ggpubr package.
We start by creating a basic box plot colored by groups as follow:
df <- ToothGrowth
p <- ggboxplot(df, x = "dose", y = "len",
color = "dose")
print(p)
Main titles, axis labels and legend titles
# Change title texts and fonts
ggpar(p, main = "Plot of length \n by dose",
xlab ="Dose (mg)", ylab = "Teeth length",
legend.title = "Dose (mg)",
font.main = c(14,"bold.italic", "red"),
font.x = c(14, "bold", "#2E9FDF"),
font.y = c(14, "bold", "#E7B800"))
# Hide titles
ggpar(p, xlab = FALSE, ylab = FALSE)
Note that:
font.main, font.x and font.y are vectors of length 3 indicating the size (e.g.: 14), the style (e.g.: “plain”, “bold”, “italic”, “bold.italic”) and the color (e.g.: “red”) of the main title, xlab and ylab, respectively. For example, font.x = c(14, “bold”, “red”). Use font.x = 14 to change only the font size, or font.x = “bold” to change only the font face.
You can use \n to split a long title into multiple lines.
Note that the legend argument is a character vector specifying the legend position. Allowed values are one of c(“top”, “bottom”, “left”, “right”, “none”); the default is “bottom”. To remove the legend, use legend = “none”. The legend position can also be specified using a numeric vector c(x, y), with values between 0 and 1: c(0,0) corresponds to the “bottom left” and c(1,1) to the “top right” position.
Color palettes
As mentioned above, the argument palette is used to change group color palettes. Allowed values include:
Brewer palettes (e.g., “Dark2”) and the grey palette;
custom color palettes, e.g., c(“blue”, “red”) or c(“#00AFBB”, “#E7B800”);
and scientific journal palettes from the ggsci R package, e.g.: “npg”, “aaas”, “lancet”, “jco”, “ucscgb”, “uchicago”, “simpsons” and “rickandmorty”.
# Use custom color palette
ggpar(p, palette = c("#00AFBB", "#E7B800", "#FC4E07"))
# Use brewer palette
ggpar(p, palette = "Dark2" )
# Use grey palette
ggpar(p, palette = "grey")
# Use scientific journal palette from ggsci package
# Allowed values: "npg", "aaas", "lancet", "jco",
# "ucscgb", "uchicago", "simpsons" and "rickandmorty".
ggpar(p, palette = "npg") # nature
Axis limits and scales
The following arguments can be used:
xlim, ylim: a numeric vector of length 2, specifying x and y axis limits (minimum and maximum values), respectively. e.g.: ylim = c(0, 50).
xscale, yscale: x and y axis scale, respectively. Allowed values are one of c(“none”, “log2”, “log10”, “sqrt”); e.g.: yscale=“log2”.
format.scale: logical value. If TRUE, axis tick mark labels will be formatted when xscale or yscale = “log2” or “log10”.
# Change y axis limits
ggpar(p, ylim = c(0, 50))
# Change y axis scale to log2
ggpar(p, yscale = "log2")
# Format axis scale
ggpar(p, yscale = "log2", format.scale = TRUE)
Axis ticks: customize tick marks and labels
The following arguments can be used:
ticks: logical value. Default is TRUE. If FALSE, hide axis tick marks.
tickslab: logical value. Default is TRUE. If FALSE, hide axis tick labels.
font.tickslab: Font style (size, face, color) for tick labels, e.g.: c(14, “bold”, “red”).
xtickslab.rt, ytickslab.rt: Rotation angle of x and y axis tick labels, respectively. Default value is 0.
xticks.by, yticks.by: numeric value controlling x and y axis breaks, respectively. For example, if yticks.by = 5, a tick mark is shown on every 5. Default value is NULL.
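For example, a sketch using the box plot p created above:
# Rotate x tick labels, draw a y tick every 10 units,
# and style the tick labels
ggpar(p, xtickslab.rt = 45, yticks.by = 10,
font.tickslab = c(12, "bold", "#993333"))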
Themes
The R package ggpubr contains two main functions for changing the default ggplot theme to a publication ready theme:
theme_pubr(): change the theme to a publication ready theme
labs_pubr(): Format only plot labels to a publication ready style
theme_pubr() will produce plots with bold axis labels, bold tick mark labels and legend at the bottom leaving extra space for the plotting area.
The argument ggtheme can be used in any ggpubr plotting function to change the plot theme. The default value is theme_pubr(), a publication-ready theme. Allowed values include the official ggplot2 themes: theme_gray(), theme_bw(), theme_minimal(), theme_classic(), theme_void(), etc. It’s also possible to add a theme with the “+” operator.
# Gray theme
p + theme_gray()
# Minimal theme
p + theme_minimal()
# Format only plot labels to a publication ready style
# by using the function labs_pubr()
p + theme_minimal() + labs_pubr(base_size = 16)
In this section, you’ll find R packages developed by STHDA for easy data analyses.
factoextra
factoextra lets you extract and create elegant ggplot2-based visualizations of multivariate data analysis results, including PCA, CA, MCA, MFA, HMFA and clustering methods.
The default plots generated by ggplot2 require some formatting before we can send them for publication. The syntax for customizing a ggplot is opaque, which raises the level of difficulty for researchers with no advanced R programming skills. ggpubr provides some easy-to-use functions for creating and customizing ggplot2-based publication-ready plots.
Clustering algorithms are used to split a dataset into several groups (i.e clusters), so that the objects in the same group are as similar as possible and the objects in different groups are as dissimilar as possible.
The most popular clustering algorithms are:
[url=/wiki/partitioning-cluster-analysis-quick-start-guide-unsupervised-machine-learning]k-means clustering[/url], a partitioning method used for splitting a dataset into a set of k clusters.
[url=/wiki/hierarchical-clustering-essentials-unsupervised-machine-learning]hierarchical clustering[/url], an alternative approach to k-means clustering for identifying clustering in the dataset by using [url=/wiki/clarifying-distance-measures-unsupervised-machine-learning]pairwise distance matrix[/url] between observations as clustering criteria.
However, each of these two standard clustering methods has its limitations. K-means clustering requires the user to specify the number of clusters in advance and selects initial centroids randomly. Agglomerative hierarchical clustering is good at identifying small clusters but not large ones.
In this article, we document hybrid approaches for easily mixing the best of k-means clustering and hierarchical clustering.
1 How this article is organized
We’ll start by demonstrating why we should combine k-means and hierarchical clustering. An application is provided using R software.
Finally, we’ll provide an easy-to-use R function (in the factoextra package) for computing hybrid hierarchical k-means clustering.
2 Required R packages
We’ll use the R package factoextra, which is very helpful for simplifying clustering workflows and for visualizing clusters with the ggplot2 plotting system.
If you want to understand why the data are scaled before the analysis, then you should read this section: [url=/wiki/clarifying-distance-measures-unsupervised-machine-learning#distances-and-scaling]Distances and scaling[/url].
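The examples below use a scaled data set df; a typical setup (assumed here, with USArrests as an illustrative data set):
library("factoextra")
# Standardize the data before clustering
df <- scale(USArrests)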
4 R function for clustering analyses
We’ll use the function eclust() [in factoextra] which provides several advantages as described in the previous chapter: [url=/wiki/visual-enhancement-of-clustering-analysis-unsupervised-machine-learning]Visual Enhancement of Clustering Analysis[/url].
eclust() stands for enhanced clustering. It simplifies the workflow of clustering analysis and can be used for computing [url=/wiki/hierarchical-clustering-essentials-unsupervised-machine-learning]hierarchical clustering[/url] and [url=/wiki/partitioning-cluster-analysis-quick-start-guide-unsupervised-machine-learning]partitioning clustering[/url] in a single function call.
4.1 Example of k-means clustering
We’ll split the data into 4 clusters using k-means clustering as follows:
library("factoextra")
# K-means clustering
km.res <- eclust(df, "kmeans", k = 4,
nstart = 25, graph = FALSE)
# k-means group number of each observation
head(km.res$cluster, 15)
Note that the silhouette coefficient measures how well an observation is clustered, and it estimates the average distance between clusters (i.e., the average silhouette width). Observations with a negative silhouette are probably placed in the wrong cluster. Read more here: [url=/wiki/clustering-validation-statistics-4-vital-things-everyone-should-know-unsupervised-machine-learning]cluster validation statistics[/url]
Samples with negative silhouette coefficient:
# Silhouette width of observation
sil <- km.res$silinfo$widths[, 1:3]
# Objects with negative silhouette
neg_sil_index <- which(sil[, 'sil_width'] < 0)
sil[neg_sil_index, , drop = FALSE]
Read more about hierarchical clustering: [url=/wiki/hierarchical-clustering-essentials-unsupervised-machine-learning]Hierarchical clustering[/url]
5 Combining hierarchical clustering and k-means
5.1 Why?
Recall that, in the k-means algorithm, a random set of observations is chosen as the initial centers.
The final k-means clustering solution is very sensitive to this initial random selection of cluster centers. The result might be (slightly) different each time you compute k-means.
To avoid this, a solution is to use an hybrid approach by combining the hierarchical clustering and the k-means methods. This process is named hybrid hierarchical k-means clustering (hkmeans).
5.2 How?
The procedure is as follows:
Compute hierarchical clustering and cut the tree into k clusters
Compute the center (i.e., the mean) of each cluster
Compute k-means using the set of cluster centers (defined in step 2) as the initial cluster centers
Note that the k-means algorithm will improve the initial partitioning generated in step 1. Hence, the initial partitioning can be slightly different from the final partitioning obtained in step 3.
5.3 R codes
5.3.1 Compute hierarchical clustering and cut the tree into k-clusters:
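A sketch of these steps (df as prepared above; the object names match the comparison code below):
# 1. Hierarchical clustering, cut into 4 clusters
res.hc <- eclust(df, "hclust", k = 4, graph = FALSE)
# 2. Cluster centers: the mean of each variable within each cluster
clus.centers <- aggregate(df, by = list(cluster = res.hc$cluster), mean)[, -1]
# 3. k-means using these centers as the initial cluster centers
km.res2 <- kmeans(df, centers = clus.centers, iter.max = 10)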
5.3.4 Compare the results of hierarchical clustering and hybrid approach
The R code below compares the initial clusters defined using only hierarchical clustering and the final ones defined using hierarchical clustering + k-means:
# res.hc$cluster: Initial clusters defined using hierarchical clustering
# km.res2$cluster: Final clusters defined using k-means
table(km.res2$cluster, res.hc$cluster)
It can be seen that 3 of the observations assigned to cluster 3 by hierarchical clustering have been reclassified to clusters 1, 2 and 4 in the final solution defined by k-means clustering.
The difference can be easily visualized using the function fviz_dend() [in factoextra]. The labels are colored using k-means clusters:
In our current example, there was no further improvement of the k-means clustering result by the hybrid approach. An improvement might be observed using another dataset.
5.4 hkmeans(): Easy-to-use function for hybrid hierarchical k-means clustering
The function hkmeans() [in factoextra] can be used to easily compute hybrid hierarchical k-means clustering. The format of the result is similar to the one provided by the standard kmeans() function.
# Compute hierarchical k-means clustering
res.hk <- hkmeans(df, 4)
# Elements returned by hkmeans()
names(res.hk)
The ExpressionSet is generally used for array-based experiments, where the rows are features, and the SummarizedExperiment is generally used for sequencing-based experiments, where the rows are GenomicRanges. ExpressionSet is in the Biobase library.
There’s a library GEOquery which lets you pull down ExpressionSets by identifier.
library(Biobase)
# Download data from GEO
library(GEOquery)
geoq <- getGEO("GSE9514")
# The list has a single element
# Save it to the letter e for simplicity
names(geoq)
## [1] "GSE9514_series_matrix.txt.gz"
e <- geoq[[1]]
ExpressionSet
ExpressionSets are basically matrices with a lot of metadata around them. Here, we have a matrix which is 9,000 by 8. It has phenotypic data, feature and annotation information. You can use the functions dim, ncol and nrow to get the dimensions, the number of columns and the number of rows, respectively.
The matrix of expression data is stored in the exprs slot, and you can access it with the exprs function. The phenotypic data can be accessed using pData, which gives us a data frame with information about the samples, including the accession number, the submission date, etc. The feature data is accessible with the fData function; this is information about genes or probe sets. The names of the data in the feature data are, for example, the gene title, the gene symbol, the ENTREZ_Gene_ID or Gene Ontology information, which might be useful for downstream analysis.
# Phenotypic data: information about the samples
pData(e)[1:3, 1:6]
## title
## GSM241146 hem1 strain grown in YPD with 250 uM ALA (08-15-06_Philpott_YG_S98_1)
## GSM241147 WT strain grown in YPD under Hypoxia (08-15-06_Philpott_YG_S98_10)
## GSM241148 WT strain grown in YPD under Hypoxia (08-15-06_Philpott_YG_S98_11)
## geo_accession status submission_date
## GSM241146 GSM241146 Public on Nov 06 2007 Nov 02 2007
## GSM241147 GSM241147 Public on Nov 06 2007 Nov 02 2007
## GSM241148 GSM241148 Public on Nov 06 2007 Nov 02 2007
## last_update_date type
## GSM241146 Aug 14 2011 RNA
## GSM241147 Aug 14 2011 RNA
## GSM241148 Aug 14 2011 RNA
dim(pData(e))
## [1] 8 31
# Column names of the phenotypic data
names(pData(e))
I’m going to load a Bioconductor annotation package, the parathyroid SummarizedExperiment library. The loaded data is a SummarizedExperiment, which summarizes counts of RNA sequencing reads in genes for an experiment on human cell culture. The SummarizedExperiment object has 63,000 rows, which are genes, and 27 columns, which are samples; the matrix in this case is called counts. We also have the row names, which are Ensembl genes, and metadata about the row data and about the column data.
library(parathyroidSE)
# RNA sequencing reads
data(parathyroidGenesSE)
se <- parathyroidGenesSE
se
The assay function can be used to access the counts of RNA sequencing reads. The colData function returns the column data, which is equivalent to pData on the ExpressionSet. Each row in this data frame corresponds to a column in the SummarizedExperiment. We can see that there are indeed 27 rows here, which give information about the columns. Each sample in this case receives one of two treatments or control, and we can count the number of replicates for each using the as.numeric function.
# Dimension of the SummarizedExperiment
dim(se)
## [1] 63193 27
# Get access to the counts of RNA sequencing reads, using the assay function
assay(se)[1:3, 1:3]
# Get access to the treatment column of the sample characteristics
colData(se)$treatment
## [1] Control Control DPN DPN OHT OHT Control Control
## [9] DPN DPN DPN OHT OHT OHT Control Control
## [17] DPN DPN OHT OHT Control DPN DPN DPN
## [25] OHT OHT OHT
## Levels: Control DPN OHT
The rows in this case correspond to genes. Genes are collections of exons. The rows of the SummarizedExperiment form a GRangesList, where each row corresponds to a GRanges containing the exons that were used to count the RNA sequencing reads. Some metadata are included in the row data and are accessible with the metadata function. This information tells us how the GRangesList was constructed: from the GenomicFeatures package using a transcript database, with Homo sapiens as the organism and Ensembl Genes number 72 as the database, etc. In addition, there’s some more information under experiment data, using exptData and then specifying MIAME, which is minimal information about a microarray experiment. Although we’re not using microarrays, the same slots are used to describe extra information about this object.
# Extract a single GRanges object: 17 ranges and 2 metadata columns,
# including the Ensembl id for each exon
rowData(se)[1]
## Experiment data
## Experimenter name: Felix Haglund
## Laboratory: Science for Life Laboratory Stockholm
## Contact information: Mikael Huss
## Title: DPN and Tamoxifen treatments of parathyroid adenoma cells
## URL: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE37211
## PMIDs: 23024189
##
## Abstract: A 251 word abstract is available. Use 'abstract' method.
abstract(exptData(se)$MIAME)
## [1] "Primary hyperparathyroidism (PHPT) is most frequently present in postmenopausal women. Although the involvement of estrogen has been suggested, current literature indicates that parathyroid tumors are estrogen receptor (ER) alpha negative. Objective: The aim of the study was to evaluate the expression of ERs and their putative function in parathyroid tumors. Design: A panel of 37 parathyroid tumors was analyzed for expression and promoter methylation of the ESR1 and ESR2 genes as well as expression of the ERalpha and ERbeta1/ERbeta2 proteins. Transcriptome changes in primary cultures of parathyroid adenoma cells after treatment with the selective ERbeta1 agonist diarylpropionitrile (DPN) and 4-hydroxytamoxifen were identified using next-generation RNA sequencing. Results: Immunohistochemistry revealed very low expression of ERalpha, whereas all informative tumors expressed ERbeta1 (n = 35) and ERbeta2 (n = 34). Decreased nuclear staining intensity and mosaic pattern of positive and negative nuclei of ERbeta1 were significantly associated with larger tumor size. Tumor ESR2 levels were significantly higher in female vs. male cases. In cultured cells, significantly increased numbers of genes with modified expression were detected after 48 h, compared to 24-h treatments with DPN or 4-hydroxytamoxifen, including the parathyroid-related genes CASR, VDR, JUN, CALR, and ORAI2. Bioinformatic analysis of transcriptome changes after DPN treatment revealed significant enrichment in gene sets coupled to ER activation, and a highly significant similarity to tumor cells undergoing apoptosis. Conclusions: Parathyroid tumors express ERbeta1 and ERbeta2. Transcriptional changes after ERbeta1 activation and correlation to clinical features point to a role of estrogen signaling in parathyroid function and disease."
This R tutorial describes how to create an ECDF plot (empirical cumulative distribution function) using R software and the ggplot2 package. The ECDF reports, for any given number, the percentage of individuals below that threshold.
Descriptive statistics consist of describing simply the data using some summary statistics and graphics. Here, we’ll describe how to compute summary statistics using R software.
Some R functions for computing descriptive statistics:
Description                              R function
Mean                                     mean()
Standard deviation                       sd()
Variance                                 var()
Minimum                                  min()
Maximum                                  max()
Median                                   median()
Range of values (minimum and maximum)    range()
Sample quantiles                         quantile()
Generic function                         summary()
Interquartile range                      IQR()
The function mfv(), for most frequent value, [in modeest package] can be used to find the statistical mode of a numeric vector.
Descriptive statistics for a single group
Measure of central tendency: mean, median, mode
Roughly speaking, the central tendency measures the “average” or the “middle” of your data. The most commonly used measures include:
the mean: the average value. It’s sensitive to outliers.
the median: the middle value. It’s a robust alternative to mean.
and the mode: the most frequent value
In R,
The function mean() and median() can be used to compute the mean and the median, respectively;
The function mfv() [in the modeest R package] can be used to compute the mode of a variable.
The R code below computes the mean, median and the mode of the variable Sepal.Length [in my_data data set]:
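Here, my_data is assumed to be the built-in iris data set:
my_data <- iris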
# Compute the mean value
mean(my_data$Sepal.Length)
[1] 5.843333
# Compute the median value
median(my_data$Sepal.Length)
[1] 5.8
# Compute the mode
# install.packages("modeest")
require(modeest)
mfv(my_data$Sepal.Length)
[1] 5
Measure of variability
Measures of variability describe how “spread out” the data are.
Range: minimum & maximum
The range corresponds to the difference between the biggest and the smallest values. It gives you the full spread of the data.
# Compute the minimum value
min(my_data$Sepal.Length)
[1] 4.3
# Compute the maximum value
max(my_data$Sepal.Length)
[1] 7.9
# Range
range(my_data$Sepal.Length)
[1] 4.3 7.9
Interquartile range
Recall that, quartiles divide the data into 4 parts. Note that, the interquartile range (IQR) - corresponding to the difference between the first and third quartiles - is sometimes used as a robust alternative to the standard deviation.
R function:
quantile(x, probs = seq(0, 1, 0.25))
x: numeric vector whose sample quantiles are wanted.
probs: numeric vector of probabilities with values in [0,1].
Example:
quantile(my_data$Sepal.Length)
0% 25% 50% 75% 100%
4.3 5.1 5.8 6.4 7.9
By default, the function returns the minimum, the maximum and three quartiles (the 0.25, 0.50 and 0.75 quartiles).
To compute deciles (0.1, 0.2, 0.3, …., 0.9), use this:
quantile(my_data$Sepal.Length, seq(0, 1, 0.1))
To compute the interquartile range, type this:
IQR(my_data$Sepal.Length)
[1] 1.3
Variance and standard deviation
The variance represents the average squared deviation from the mean. The standard deviation is the square root of the variance. It measures the average deviation of the values, in the data, from the mean value.
# Compute the variance
var(my_data$Sepal.Length)
# Compute the standard deviation =
# square root of the variance
sd(my_data$Sepal.Length)
Median absolute deviation
The median absolute deviation (MAD) measures the deviation of the values, in the data, from the median value.
# Compute the median
median(my_data$Sepal.Length)
# Compute the median absolute deviation
mad(my_data$Sepal.Length)
Which measure to use?
Range. It’s not often used because it’s very sensitive to outliers.
Interquartile range. It’s pretty robust to outliers. It’s used a lot in combination with the median.
Variance. It’s hard to interpret because it isn’t expressed in the same units as the data. It’s almost never used except as a mathematical tool.
Standard deviation. This is the square root of the variance. It’s expressed in the same units as the data. The standard deviation is often used in the situation where the mean is the measure of central tendency.
Median absolute deviation. It’s a robust way to estimate the standard deviation, for data with outliers. It’s not used very often.
In summary, the IQR and the standard deviation are the two most common measures used to report the variability of the data.
Computing an overall summary of a variable and an entire data frame
summary() function
The function summary() can be used to display several statistic summaries of either one variable or an entire data frame.
Summary of a single variable. Six values are returned in a single line call: the minimum, first quartile, median, mean, third quartile and maximum:
summary(my_data$Sepal.Length)
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.300 5.100 5.800 5.843 6.400 7.900
Summary of a data frame. In this case, the function summary() is automatically applied to each column. The format of the result depends on the type of the data contained in the column. For example:
If the column is a numeric variable, mean, median, min, max and quartiles are returned.
If the column is a factor variable, the number of observations in each group is returned.
summary(my_data, digits = 1)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4 Min. :2 Min. :1 Min. :0.1 setosa :50
1st Qu.:5 1st Qu.:3 1st Qu.:2 1st Qu.:0.3 versicolor:50
Median :6 Median :3 Median :4 Median :1.3 virginica :50
Mean :6 Mean :3 Mean :4 Mean :1.2
3rd Qu.:6 3rd Qu.:3 3rd Qu.:5 3rd Qu.:1.8
Max. :8 Max. :4 Max. :7 Max. :2.5
sapply() function
It’s also possible to use the function sapply() to apply a particular function over a list or vector. For instance, we can use it, to compute for each column in a data frame, the mean, sd, var, min, quantile, …
# Compute the mean of each column
sapply(my_data[, -5], mean)
Note that, when the data contains missing values, some R functions will return errors or NA even if just a single value is missing.
For example, the mean() function will return NA if even only one value is missing in a vector. This can be avoided using the argument na.rm = TRUE, which tells to the function to remove any NAs before calculations. An example using the mean function is as follow:
mean(my_data$Sepal.Length, na.rm = TRUE)
Graphical display of distributions
The R package ggpubr will be used to create graphs.
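The grouped summary printed below can be produced with dplyr (an assumed sketch; the code is not shown in the original):
library(dplyr)
group_by(my_data, Species) %>%
summarise(
count = n(),
mean = mean(Sepal.Length, na.rm = TRUE),
sd = sd(Sepal.Length, na.rm = TRUE)
)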
Source: local data frame [3 x 4]
Species count mean sd
(fctr) (int) (dbl) (dbl)
1 setosa 50 5.006 0.3524897
2 versicolor 50 5.936 0.5161711
3 virginica 50 6.588 0.6358796
Graphics for grouped data:
library("ggpubr")
# Box plot colored by groups: Species
ggboxplot(my_data, x = "Species", y = "Sepal.Length",
color = "Species",
palette = c("#00AFBB", "#E7B800", "#FC4E07"))
# Stripchart colored by groups: Species
ggstripchart(my_data, x = "Species", y = "Sepal.Length",
color = "Species",
palette = c("#00AFBB", "#E7B800", "#FC4E07"),
add = "mean_sd")
Note that, when the number of observations per group is small, a strip chart is recommended rather than a box plot.
Frequency tables
A frequency table (or contingency table) is used to describe categorical variables. It contains the counts at each combination of factor levels.
R function to generate tables: table()
Create some data
Distribution of hair and eye color by sex of 592 students:
# Hair/eye color data
df <- as.data.frame(HairEyeColor)
hair_eye_col <- df[rep(row.names(df), df$Freq), 1:3]
rownames(hair_eye_col) <- 1:nrow(hair_eye_col)
head(hair_eye_col)
Hair Eye Sex
1 Black Brown Male
2 Black Brown Male
3 Black Brown Male
4 Black Brown Male
5 Black Brown Male
6 Black Brown Male
Simple frequency distribution: one categorical variable
Table of counts
# Frequency distribution of hair color
with(hair_eye_col, table(Hair))
Hair
Black Brown Red Blond
108 286 71 127
# Frequency distribution of eye color
with(hair_eye_col, table(Eye))
Eye
Brown Blue Hazel Green
220 215 93 64
Graphics: to create the graphics, we start by converting the table into a data frame.
# Compute the table and convert it to a data frame
df <- with(hair_eye_col, as.data.frame(table(Hair)))
df
Hair Freq
1 Black 108
2 Brown 286
3 Red 71
4 Blond 127
# Visualize using bar plot
library(ggpubr)
ggbarplot(df, x = "Hair", y = "Freq")
Two-way contingency table: Two categorical variables
tbl2 <- with(hair_eye_col, table(Hair, Eye))
tbl2
Eye
Hair Brown Blue Hazel Green
Black 68 20 15 5
Brown 119 84 54 29
Red 26 17 14 14
Blond 7 94 10 16
It’s also possible to use the function xtabs(), which creates cross tabulations of data frames with a formula interface.
xtabs(~ Hair + Eye, data = hair_eye_col)
Graphics: to create the graphics, we start by converting the table into a data frame.
df <- as.data.frame(tbl2)
head(df)
Hair Eye Freq
1 Black Brown 68
2 Brown Brown 119
3 Red Brown 26
4 Blond Brown 7
5 Black Blue 20
6 Brown Blue 84
# Visualize using bar plot
library(ggpubr)
ggbarplot(df, x = "Hair", y = "Freq",
color = "Eye",
palette = c("brown", "blue", "gold", "green"))
# position dodge
ggbarplot(df, x = "Hair", y = "Freq",
color = "Eye", position = position_dodge(),
palette = c("brown", "blue", "gold", "green"))
Multiway tables: More than two categorical variables
Hair and Eye color distributions by sex using xtabs():
xtabs(~Hair + Eye + Sex, data = hair_eye_col)
, , Sex = Male
Eye
Hair Brown Blue Hazel Green
Black 32 11 10 3
Brown 53 50 25 15
Red 10 10 7 7
Blond 3 30 5 8
, , Sex = Female
Eye
Hair Brown Blue Hazel Green
Black 36 9 5 2
Brown 66 34 29 14
Red 16 7 7 7
Blond 4 64 5 8
You can also use the function ftable() [for flat contingency tables]. It returns a nice output compared to xtabs() when you have more than two variables:
ftable(Sex + Hair ~ Eye, data = hair_eye_col)
Sex Male Female
Hair Black Brown Red Blond Black Brown Red Blond
Eye
Brown 32 53 10 3 36 66 16 4
Blue 11 50 10 30 9 34 7 64
Hazel 10 25 7 5 5 29 7 5
Green 3 15 7 8 2 14 7 8
Compute table margins and relative frequency
Table margins correspond to the sums of counts along the rows or columns of the table. Relative frequencies express table entries as proportions of the table margins (i.e., row or column totals).
The functions margin.table() and prop.table() can be used to compute table margins and relative frequencies, respectively.
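he.tbl below is assumed to be the Hair/Eye contingency table computed earlier:
he.tbl <- with(hair_eye_col, table(Hair, Eye))
he.tbl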
Eye
Hair Brown Blue Hazel Green
Black 68 20 15 5
Brown 119 84 54 29
Red 26 17 14 14
Blond 7 94 10 16
# Margin of rows
margin.table(he.tbl, 1)
Hair
Black Brown Red Blond
108 286 71 127
# Margin of columns
margin.table(he.tbl, 2)
Eye
Brown Blue Hazel Green
220 215 93 64
Compute relative frequencies:
# Frequencies relative to row total
prop.table(he.tbl, 1)
Eye
Hair Brown Blue Hazel Green
Black 0.62962963 0.18518519 0.13888889 0.04629630
Brown 0.41608392 0.29370629 0.18881119 0.10139860
Red 0.36619718 0.23943662 0.19718310 0.19718310
Blond 0.05511811 0.74015748 0.07874016 0.12598425
# Table of percentages
round(prop.table(he.tbl, 1), 2)*100
Eye
Hair Brown Blue Hazel Green
Black 63 19 14 5
Brown 42 29 19 10
Red 37 24 20 20
Blond 6 74 8 13
To express the frequencies relative to the grand total, use this:
he.tbl/sum(he.tbl)
Infos
This analysis has been performed using R software (ver. 3.2.4).
Here, we’ll describe how to create quantile-quantile plots in R. QQ plot (or quantile-quantile plot) draws the correlation between a given sample and the normal distribution. A 45-degree reference line is also plotted. QQ plots are used to visually check the normality of the data.
Many statistical tests, including correlation, regression, t-test, and analysis of variance (ANOVA), assume certain characteristics about the data. They require the data to follow a normal distribution (Gaussian distribution). These tests are called parametric tests, because their validity depends on the distribution of the data.
Normality and the other assumptions made by these tests should be taken seriously in order to draw reliable interpretations and conclusions from the research.
Before using a parametric test, we should perform some preliminary tests to make sure that the test assumptions are met. In situations where the assumptions are violated, non-parametric tests are recommended.
Here, we’ll describe how to check the normality of the data by visual inspection and by significance tests.
Install required R packages
dplyr for data manipulation
install.packages("dplyr")
ggpubr for an easy ggplot2-based data visualization
install.packages("ggpubr")
We want to test if the variable len (tooth length) is normally distributed.
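Here, my_data is assumed to be the ToothGrowth data set:
my_data <- ToothGrowth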
Case of large sample sizes
If the sample size is large enough (n > 30), we can ignore the distribution of the data and use parametric tests.
The central limit theorem tells us that no matter what distribution things have, the sampling distribution tends to be normal if the sample is large enough (n > 30).
However, to be consistent, normality can be checked by visual inspection [normal plots (histogram), Q-Q plot (quantile-quantile plot)] or by significance tests.
Visual methods
Density plot and Q-Q plot can be used to check normality visually.
Density plot: the density plot provides a visual judgment about whether the distribution is bell shaped.
library("ggpubr")
ggdensity(my_data$len,
main = "Density plot of tooth length",
xlab = "Tooth length")
Q-Q plot: Q-Q plot (or quantile-quantile plot) draws the correlation between a given sample and the normal distribution. A 45-degree reference line is also plotted.
library(ggpubr)
ggqqplot(my_data$len)
It’s also possible to use the function qqPlot() [in car package]:
library("car")
qqPlot(my_data$len)
As all the points fall approximately along this reference line, we can assume normality.
Normality test
Visual inspection, described in the previous section, is usually unreliable. It’s possible to use a significance test comparing the sample distribution to a normal one in order to ascertain whether the data show a serious deviation from normality.
There are several methods for normality testing, such as the Kolmogorov-Smirnov (K-S) normality test and Shapiro-Wilk’s test.
The null hypothesis of these tests is that “the sample distribution is normal”. If the test is significant, the distribution is non-normal.
Shapiro-Wilk’s method is widely recommended for normality testing, and it provides better power than K-S. It is based on the correlation between the data and the corresponding normal scores.
Note that normality tests are sensitive to sample size. Small samples most often pass normality tests. Therefore, it’s important to combine visual inspection and significance testing in order to make the right decision.
The R function shapiro.test() can be used to perform the Shapiro-Wilk test of normality for one variable (univariate):
shapiro.test(my_data$len)
Shapiro-Wilk normality test
data: my_data$len
W = 0.96743, p-value = 0.1091
From the output, the p-value > 0.05 implies that the distribution of the data is not significantly different from a normal distribution. In other words, we can assume normality.
Infos
This analysis has been performed using R software (ver. 3.2.4).
Here we’ll describe research questions and the corresponding statistical tests, as well as, the test assumptions.
Statistical tests and assumptions
Research questions and corresponding statistical tests
The most popular research questions include:
whether two variables (n = 2) are correlated (i.e., associated)
whether multiple variables (n > 2) are correlated
whether two groups (n = 2) of samples differ from each other
whether multiple groups (n > 2) of samples differ from each other
whether the variability of two samples differ
Each of these questions can be answered using the following statistical tests:
Correlation test between two variables
Correlation matrix between multiple variables
Comparing the means of two groups:
Student’s t-test (parametric)
Wilcoxon rank test (non-parametric)
Comparing the means of more than two groups
ANOVA test (analysis of variance, parametric): extension of t-test to compare more than two groups.
Kruskal-Wallis rank sum test (non-parametric): extension of Wilcoxon rank test to compare more than two groups
Comparing the variances:
Comparing the variances of two groups: F-test (parametric)
Comparison of the variances of more than two groups: Bartlett’s test (parametric), Levene’s test (parametric) and Fligner-Killeen test (non-parametric)
Statistical test requirements (assumptions)
Many statistical procedures, including correlation, regression, t-test, and analysis of variance, assume certain characteristics about the data. Generally they assume that:
the data are normally distributed
and the variances of the groups to be compared are homogeneous (equal).
These assumptions should be taken seriously in order to draw reliable interpretations and conclusions from the research.
These tests - correlation, t-test and ANOVA - are called parametric tests, because their validity depends on the distribution of the data.
Before using a parametric test, we should perform some preliminary tests to make sure that the test assumptions are met. In situations where the assumptions are violated, non-parametric tests are recommended.
How to assess the normality of the data?
With large enough sample sizes (n > 30) the violation of the normality assumption should not cause major problems (central limit theorem). This implies that we can ignore the distribution of the data and use parametric tests.
However, to be consistent, we can use Shapiro-Wilk’s significance test, comparing the sample distribution to a normal one, in order to ascertain whether the data show a serious deviation from normality.
How to assess the equality of variances?
The standard Student’s t-test (comparing two independent samples) and the ANOVA test (comparing multiple samples) assume also that the samples to be compared have equal variances.
If the samples being compared follow a normal distribution, then it’s possible to use:
F-test to compare the variances of two samples
Bartlett’s Test or Levene’s Test to compare the variances of multiple samples.
Infos
This analysis has been performed using R software (ver. 3.2.4).
Correlation test is used to evaluate the association between two or more variables.
For instance, if we are interested to know whether there is a relationship between the heights of fathers and sons, a correlation coefficient can be calculated to answer this question.
If there is no relationship between the two variables (father and son heights), the average height of the sons should be the same regardless of the height of the fathers, and vice versa.
Here, we’ll describe the different correlation methods and provide practical examples using R software.
Install and load required R packages
We’ll use the [url=/wiki/ggpubr-r-package-ggplot2-based-publication-ready-plots]ggpubr R package[/url] for easy ggplot2-based data visualization.
Install the latest version from GitHub as follows (recommended):
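A minimal installation sketch; the devtools package and the kassambara/ggpubr GitHub repository are assumptions here (ggpubr can also be installed from CRAN):
# Install devtools if needed, then ggpubr from GitHub
if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/ggpubr")
# Alternatively, install the CRAN release
# install.packages("ggpubr")
# Load the package
library("ggpubr")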
There are different methods to perform correlation analysis:
Pearson correlation (r), which measures a linear dependence between two variables (x and y). It’s also known as a parametric correlation test because it depends on the distribution of the data. It can be used only when x and y come from normal distributions. The plot of y = f(x) is called the linear regression line.
Kendall tau and Spearman rho, which are rank-based correlation coefficients (non-parametric)
The most commonly used method is the Pearson correlation method.
Correlation formula
In the formula below,
x and y are two vectors of length n
\(m_x\) and \(m_y\) are the means of x and y, respectively.
Pearson correlation formula
\[
r = \frac{\sum{(x-m_x)(y-m_y)}}{\sqrt{\sum{(x-m_x)^2}\sum{(y-m_y)^2}}}
\]
The p-value (significance level) of the correlation can be determined :
by using the correlation coefficient table for the degrees of freedom \(df = n-2\), where \(n\) is the number of observations in x and y;
or by calculating the t value as follows:
\[
t = \frac{r}{\sqrt{1-r^2}}\sqrt{n-2}
\]
In case 2), the corresponding p-value is determined using the [url=/wiki/t-distribution-table]t distribution table[/url] for \(df = n-2\).
If the p-value is < 5%, then the correlation between x and y is significant.
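As an illustration, the t value and the p-value can be computed by hand in R. This is a minimal sketch using the mtcars variables analysed later in this article:
# Pearson correlation significance computed manually
x <- mtcars$wt; y <- mtcars$mpg
n <- length(x)                        # number of observations
r <- cor(x, y)                        # correlation coefficient
t <- r / sqrt(1 - r^2) * sqrt(n - 2)  # t value
# Two-sided p-value from the t distribution with n - 2 degrees of freedom
2 * pt(-abs(t), df = n - 2)           # matches cor.test(x, y)$p.value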
Spearman correlation formula
The Spearman correlation method computes the correlation between the rank of x and the rank of y; in other words, it is the Pearson correlation computed on rank(x) and rank(y).
The Kendall correlation method measures the correspondence between the ranking of x and y variables. The total number of possible pairings of x with y observations is \(n(n-1)/2\), where n is the size of x and y.
The procedure is as follows (a worked R sketch is given after the formula):
Begin by ordering the pairs by the x values. If x and y are correlated, then they would have the same relative rank orders.
Now, for each \(y_i\), count the number of \(y_j > y_i\) with \(j > i\) (concordant pairs, noted c) and the number of \(y_j < y_i\) with \(j > i\) (discordant pairs, noted d).
Kendall correlation distance is defined as follow:
\[
\tau = \frac{n_c - n_d}{\frac{1}{2}n(n-1)}
\]
Where,
\(n_c\): total number of concordant pairs
\(n_d\): total number of discordant pairs
\(n\): size of x and y
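The pair-counting procedure can be illustrated with a minimal R sketch. The small vectors below are made up for demonstration; with no ties, the result agrees with cor(x, y, method = "kendall"):
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 5)
n <- length(x)
nc <- 0; nd <- 0
# Count concordant and discordant pairs over all i < j
for (i in 1:(n - 1)) {
  for (j in (i + 1):n) {
    s <- sign(x[j] - x[i]) * sign(y[j] - y[i])
    if (s > 0) nc <- nc + 1 else if (s < 0) nd <- nd + 1
  }
}
(nc - nd) / (n * (n - 1) / 2)   # 0.6
cor(x, y, method = "kendall")   # 0.6, the same value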
Compute correlation in R
R functions
Correlation coefficient can be computed using the functions cor() or cor.test():
cor() computes the correlation coefficient
cor.test() tests for association/correlation between paired samples. It returns both the correlation coefficient and the significance level (or p-value) of the correlation.
The simplified formats are:
cor(x, y, method = c("pearson", "kendall", "spearman"))
cor.test(x, y, method=c("pearson", "kendall", "spearman"))
x, y: numeric vectors with the same length
method: correlation method
If your data contain missing values, use the following R code to handle missing values by case-wise deletion.
cor(x, y, method = "pearson", use = "complete.obs")
Import your data into R
Prepare your data as specified here: [url=/wiki/best-practices-for-preparing-your-data-set-for-r]Best practices for preparing your data set for R[/url]
Save your data in an external .txt (tab-delimited) or .csv file
Import your data into R as follows:
# If .txt tab file, use this
my_data <- read.delim(file.choose())
# Or, if .csv file, use this
my_data <- read.csv(file.choose())
Here, we’ll use the built-in R data set mtcars as an example.
We want to compute the correlation between the mpg and wt variables. The R code below loads the data set:
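A minimal setup (the name my_data matches the code used in the rest of this section):
my_data <- mtcars
head(my_data[, c("mpg", "wt")], 4)  # preview the two variables of interest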
Visualize your data using scatter plots
To use R base graphs, click this link: [url=/wiki/scatter-plots-r-base-graphs]scatter plot - R base graphs[/url]. Here, we’ll use the [url=/wiki/ggpubr-r-package-ggplot2-based-publication-ready-plots]ggpubr R package[/url].
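For instance, a scatter plot with a fitted regression line can be drawn with ggpubr::ggscatter(); the axis labels below are our own additions describing the mtcars units:
library("ggpubr")
ggscatter(my_data, x = "wt", y = "mpg",
          add = "reg.line", conf.int = TRUE,
          cor.coef = TRUE, cor.method = "pearson",
          xlab = "Weight (1000 lbs)", ylab = "Miles/(US) gallon")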
Preliminary test to check the test assumptions
Is the covariation linear? Yes, from the plot above, the relationship is linear. In situations where the scatter plots show curved patterns, we are dealing with a nonlinear association between the two variables.
Do the data from each of the two variables (x, y) follow a normal distribution?
Use the Shapiro-Wilk normality test -> R function: shapiro.test()
and look at the normality plot -> R function: ggpubr::ggqqplot()
Shapiro-Wilk test can be performed as follow:
Null hypothesis: the data are normally distributed
Alternative hypothesis: the data are not normally distributed
# Shapiro-Wilk normality test for mpg
shapiro.test(my_data$mpg) # => p = 0.1229
# Shapiro-Wilk normality test for wt
shapiro.test(my_data$wt) # => p = 0.09
From the output, the two p-values are greater than the significance level 0.05, implying that the distributions of the data are not significantly different from the normal distribution. In other words, we can assume normality.
Visual inspection of the data normality using Q-Q plots (quantile-quantile plots). Q-Q plot draws the correlation between a given sample and the normal distribution.
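The Q-Q plots can be drawn with ggpubr::ggqqplot(), as in this minimal sketch:
library("ggpubr")
# Q-Q plot for mpg
ggqqplot(my_data$mpg, ylab = "MPG")
# Q-Q plot for wt
ggqqplot(my_data$wt, ylab = "WT")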
From the normality plots, we conclude that both populations may come from normal distributions.
Note that, if the data are not normally distributed, it’s recommended to use the non-parametric correlation, including Spearman and Kendall rank-based correlation tests.
Pearson correlation test
Correlation test between mpg and wt variables:
res <- cor.test(my_data$wt, my_data$mpg,
method = "pearson")
res
Pearson's product-moment correlation
data: my_data$wt and my_data$mpg
t = -9.559, df = 30, p-value = 1.294e-10
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.9338264 -0.7440872
sample estimates:
cor
-0.8676594
In the result above :
t is the t-test statistic value (t = -9.559),
df is the degrees of freedom (df= 30),
p-value is the significance level of the t-test (p-value = \(1.294 \times 10^{-10}\)).
conf.int is the confidence interval of the correlation coefficient at 95% (conf.int = [-0.9338, -0.7441]);
sample estimates is the correlation coefficient (Cor.coeff = -0.87).
Interpretation of the result
The p-value of the test is \(1.294 \times 10^{-10}\), which is less than the significance level alpha = 0.05. We can conclude that wt and mpg are significantly correlated, with a correlation coefficient of -0.87 and a p-value of \(1.294 \times 10^{-10}\).
Access the values returned by the cor.test() function
The function cor.test() returns a list containing the following components:
p.value: the p-value of the test
estimate: the correlation coefficient
# Extract the p.value
res$p.value
[1] 1.293959e-10
# Extract the correlation coefficient
res$estimate
cor
-0.8676594
Kendall rank correlation test
The Kendall rank correlation coefficient or Kendall’s tau statistic is used to estimate a rank-based measure of association. This test may be used if the data do not necessarily come from a bivariate normal distribution.
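The call below (using the same my_data) produces the output shown; note that R may warn that an exact p-value cannot be computed in the presence of ties:
res2 <- cor.test(my_data$wt, my_data$mpg, method = "kendall")
res2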
Kendall's rank correlation tau
data: my_data$wt and my_data$mpg
z = -5.7981, p-value = 6.706e-09
alternative hypothesis: true tau is not equal to 0
sample estimates:
tau
-0.7278321
tau is the Kendall correlation coefficient.
The correlation coefficient between x and y is -0.7278 and the p-value is \(6.706 \times 10^{-9}\).
Spearman rank correlation coefficient
Spearman’s rho statistic is also used to estimate a rank-based measure of association. This test may be used if the data do not come from a bivariate normal distribution.
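As before, the output shown is produced by the following call (a warning about ties may again appear):
res3 <- cor.test(my_data$wt, my_data$mpg, method = "spearman")
res3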
Spearman's rank correlation rho
data: my_data$wt and my_data$mpg
S = 10292, p-value = 1.488e-11
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
-0.886422
rho is the Spearman’s correlation coefficient.
The correlation coefficient between x and y is -0.8864 and the p-value is \(1.488 \times 10^{-11}\).
Interpret correlation coefficient
The correlation coefficient ranges from -1 to 1:
-1 indicates a strong negative correlation: this means that every time x increases, y decreases (left panel of the figure below)
0 means that there is no association between the two variables (x and y) (middle panel)
1 indicates a strong positive correlation: this means that y increases with x (right panel)
(Figure: scatter plots illustrating correlation coefficients of -1, 0 and +1, from left to right.)
Online correlation coefficient calculator
You can compute correlation test between two variables, online, without any installation by clicking the following link: