Channel: Easy Guides

Histogram and Density Plots - R Base Graphs



Previously, we described the essentials of R programming and provided quick start guides for importing data into R.


Here, we’ll describe how to create histogram and density plots in R.


Preliminary tasks

  1. Launch RStudio as described here: Running RStudio and setting up your working directory

  2. Prepare your data as described here: Best practices for preparing your data, and save it as an external .txt (tab-delimited) or .csv file

  3. Import your data into R as described here: Fast reading of data from txt|csv files into R: readr package.

Create some data

The simulated data set contains weight values for two groups (e.g. by sex): 200 individuals with mean 55 and 200 with mean 65.

set.seed(1234)
x <- c(rnorm(200, mean=55, sd=5),
     rnorm(200, mean=65, sd=5))
head(x)
## [1] 48.96467 56.38715 60.42221 43.27151 57.14562 57.53028

Create histogram plots: hist()

  • A histogram can be created using the function hist(), whose simplified format is as follows:
hist(x, breaks = "Sturges")

  • x: a numeric vector
  • breaks: breakpoints between histogram cells.


  • Create histograms
hist(x, col = "steelblue", frame = FALSE)

# Change the number of breaks
hist(x, col = "steelblue", frame = FALSE,
     breaks = 30)

Create density plots: density()

The function density() is used to estimate kernel density.

# Compute the density data
dens <- density(mtcars$mpg)
# plot density
plot(dens, frame = FALSE, col = "steelblue", 
     main = "Density plot of mpg") 

# Fill the density plot using polygon()
plot(dens, frame = FALSE, col = "steelblue", 
     main = "Density plot of mpg") 
polygon(dens, col = "steelblue")

Infos

This analysis has been performed using R statistical software (ver. 3.2.4).


Dot Charts - R Base Graphs



Previously, we described the essentials of R programming and provided quick start guides for importing data into R.


Here, we’ll describe how to draw a Cleveland dot plot in R.


Preliminary tasks

  1. Launch RStudio as described here: Running RStudio and setting up your working directory

  2. Prepare your data as described here: Best practices for preparing your data, and save it as an external .txt (tab-delimited) or .csv file

  3. Import your data into R as described here: Fast reading of data from txt|csv files into R: readr package.

Here, we’ll use the R built-in mtcars data set.

Data

We start by ordering the data set according to the mpg variable.

mtcars <- mtcars[order(mtcars$mpg), ]

R base function: dotchart()

The function dotchart() is used to draw a Cleveland dot plot.

dotchart(x, labels = NULL, groups = NULL, 
         gcolor = par("fg"), color = par("fg"))

  • x: numeric vector or matrix
  • labels: a vector of labels for each point.
  • groups: a grouping variable indicating how the elements of x are grouped.
  • gcolor: color to be used for group labels and values.
  • color: the color(s) to be used for points and labels.


Dot chart of one numeric vector

# Dot chart of a single numeric vector
dotchart(mtcars$mpg, labels = row.names(mtcars),
         cex = 0.6, xlab = "mpg")

# Plot and color by groups cyl
grps <- as.factor(mtcars$cyl)
my_cols <- c("#999999", "#E69F00", "#56B4E9")
dotchart(mtcars$mpg, labels = row.names(mtcars),
         groups = grps, gcolor = my_cols,
         color = my_cols[grps],
         cex = 0.6,  pch = 19, xlab = "mpg")

Dot chart of a matrix

dotchart(VADeaths, cex = 0.6,
         main = "Death Rates in Virginia - 1940")

Infos

This analysis has been performed using R statistical software (ver. 3.2.4).

Plot Group Means and Confidence Intervals - R Base Graphs



Previously, we described the essentials of R programming and provided quick start guides for importing data into R.


Here, we’ll describe how to create mean plots with confidence intervals in R.


Preliminary tasks

  1. Launch RStudio as described here: Running RStudio and setting up your working directory

  2. Prepare your data as described here: Best practices for preparing your data, and save it as an external .txt (tab-delimited) or .csv file

  3. Import your data into R as described here: Fast reading of data from txt|csv files into R: readr package.

Here, we’ll use the R built-in ToothGrowth data set.

Data

head(ToothGrowth)
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5

Plot group means

The function plotmeans() [in gplots package] can be used.

library(gplots)
# Plot the mean of teeth length by dose groups
plotmeans(len ~ dose, data = ToothGrowth, frame = FALSE)

# Add mean labels (mean.labels = TRUE)
# Remove line connection (connect = FALSE)
plotmeans(len ~ dose, data = ToothGrowth, frame = FALSE,
          mean.labels = TRUE, connect = FALSE)

Infos

This analysis has been performed using R statistical software (ver. 3.2.4).

Lattice Graphs



Previously, we described the essentials of R programming and provided quick start guides for importing data into R. We also showed how to visualize data using R base graphs.


Here, we’ll present the basics of the lattice package, which is a powerful and elegant data visualization system that aims to improve on base R graphs.


Preliminary tasks

  1. Launch RStudio as described here: Running RStudio and setting up your working directory

  2. Prepare your data as described here: Best practices for preparing your data, and save it as an external .txt (tab-delimited) or .csv file

  3. Import your data into R as described here: Fast reading of data from txt|csv files into R: readr package.

Briefly, if your data is saved as an external .txt (tab-delimited) or .csv file, use the following script to import it into R:

# If .txt tab file use this:
my_data <- read.delim(file.choose())

# or if .csv file:
my_data <- read.csv(file.choose())

In the following sections, we’ll use R built-in data sets.

Installing and loading the lattice package

# Install
install.packages("lattice")

# Load
library("lattice")

Main functions in the lattice package

  • xyplot(): scatter plot
  • splom(): scatter plot matrix
  • cloud(): 3D scatter plot
  • stripplot(): strip plot (1-D scatter plot)
  • bwplot(): box plot
  • dotplot(): dot plot
  • barchart(): bar chart
  • histogram(): histogram
  • densityplot(): kernel density plot
  • qqmath(): theoretical quantile plot
  • qq(): two-sample quantile plot
  • contourplot(): 3D contour plot of surfaces
  • levelplot(): false color level plot of surfaces
  • parallel(): parallel coordinates plot
  • wireframe(): 3D wireframe graph

Note that other functions (ecdfplot() and mapplot()) are available in the latticeExtra package.
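As a quick sketch (assuming the latticeExtra package is installed), ecdfplot() follows the same formula interface as the other lattice functions:

```r
# Empirical cumulative distribution plot with latticeExtra
# (example sketch; requires the latticeExtra package)
library(latticeExtra)

# ECDF of tooth length, one curve per dose group
ecdfplot(~ len, groups = dose, data = ToothGrowth,
         auto.key = TRUE, xlab = "Tooth length")
```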

xyplot(): Scatter plot

  • R function: The R function xyplot() is used to produce bivariate scatter plots or time-series plots. The simplified format is as follows:
xyplot(y ~ x, data)
  • Data set: iris
my_data <- iris
head(my_data)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
  • Basic scatter plot: y ~ x
# Default plot
xyplot(Sepal.Length ~ Petal.Length, data = my_data)

# Color by groups
xyplot(Sepal.Length ~ Petal.Length, group = Species, 
       data = my_data, auto.key = TRUE)

# Show points ("p"), grids ("g") and smoothing line
# Change xlab and ylab
xyplot(Sepal.Length ~ Petal.Length, data = my_data,
       type = c("p", "g", "smooth"),
       xlab = "Petal length", ylab = "Sepal length")

  • Multiple panels by groups: y ~ x | group
xyplot(Sepal.Length ~ Petal.Length | Species, 
       group = Species, data = my_data,
       type = c("p", "smooth"),
       scales = "free")

cloud(): 3D scatter plot

  • Data set: iris
my_data <- iris
head(my_data)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
  • Scatter 3D plot: z ~ x * y
# Basic 3D scatter plot
cloud(Sepal.Length ~ Petal.Length * Petal.Width, 
       data = iris)

# Color by groups; auto.key = TRUE to show legend
cloud(Sepal.Length ~ Petal.Length * Petal.Width, 
       group = Species, data = iris,
       auto.key = TRUE)

Box plot, Dot plot, Strip plot

  • Data set: ToothGrowth
ToothGrowth$dose <- as.factor(ToothGrowth$dose)
head(ToothGrowth)
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5
  • Basic plot: Plot len by dose
# Basic box plot
bwplot(len ~ dose,  data = ToothGrowth,
       xlab = "Dose", ylab = "Length")

# Violin plot using panel = panel.violin
bwplot(len ~ dose,  data = ToothGrowth,
       panel = panel.violin,
       xlab = "Dose", ylab = "Length")
# Basic dot plot
dotplot(len ~ dose,  data = ToothGrowth,
        xlab = "Dose", ylab = "Length")

# Basic strip plot
stripplot(len ~ dose,  data = ToothGrowth,
          jitter.data = TRUE, pch = 19,
          xlab = "Dose", ylab = "Length")

  • Plot with multiple groups: the additional argument layout is used; c(3, 1) specifies the number of columns and rows, respectively
# Box plot
bwplot(len ~ supp | dose,  data = ToothGrowth,
       layout = c(3, 1),
        xlab = "Dose", ylab = "Length")

# Violin plot
bwplot(len ~ supp | dose,  data = ToothGrowth,
       layout = c(3, 1), panel = panel.violin,
        xlab = "Dose", ylab = "Length")

# Dot plot
dotplot(len ~ supp | dose,  data = ToothGrowth,
       layout = c(3, 1),
        xlab = "Dose", ylab = "Length")

# Strip plot
stripplot(len ~ supp | dose,  data = ToothGrowth,
       layout = c(3, 1), jitter.data = TRUE,
        xlab = "Dose", ylab = "Length")

Density plot and Histogram

  • Basic plots
densityplot(~ len, data = ToothGrowth,
            plot.points = FALSE)

histogram(~ len, data = ToothGrowth,
            breaks = 20)

  • Plot with multiple groups
densityplot(~ len, groups = dose, data = ToothGrowth,
            plot.points = FALSE, auto.key = TRUE)

Infos

This analysis has been performed using R statistical software (ver. 3.2.4).

Graphical parameters



This article provides a quick start guide to changing and customizing R graphical parameters, including:

  • adding titles, legends, texts, axes and straight lines
  • changing axis scales, plotting symbols, line types and colors

For each of these graphical parameters, you will learn the simplified format of the R functions to use and some examples.




Add and customize titles

How is this chapter organized?

  • Change main title and axis labels
  • title colors
  • The font style for titles
  • Change the font size
  • Use the title() function
  • Customize the titles using par() function.

Read more —> Add titles to a plot in R software.

Plot titles can be specified either directly to the plotting functions during the plot creation or by using the title() function (to add titles on an existing plot).

# Add titles
barplot(c(2,5), main="Main title",
        xlab="X axis title",
        ylab="Y axis title",
        sub="Sub-title",
        col.main="red", col.lab="blue", col.sub="black")

# Increase the size of titles
barplot(c(2,5), main="Main title",
        xlab="X axis title",
        ylab="Y axis title",
        sub="Sub-title",
        cex.main=2, cex.lab=1.7, cex.sub=1.2)
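The same titles can also be added after the plot has been created, using the title() function; a minimal sketch with base R only:

```r
# Create the plot first, without titles
barplot(c(2, 5))

# Then add titles to the existing plot
title(main = "Main title",
      xlab = "X axis title", ylab = "Y axis title",
      sub = "Sub-title", col.main = "red")
```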

Read more —> Add titles to a plot in R software.

Add legends

How is this chapter organized?

  • R legend function
  • Title, text font and background color of the legend box
  • Border of the legend box
  • Specify legend position by keywords

Read more —> Add legends to plots

The legend() function can be used. A simplified format is:

legend(x, y=NULL, legend, col)

  • x and y: the coordinates to be used for the legend. Keywords can also be used for x: "bottomright", "bottom", "bottomleft", "left", "topleft", "top", "topright", "right" and "center".
  • legend : the text of the legend
  • col : colors of lines and points beside the text for legends


# Generate some data
x<-1:10; y1=x*x; y2=2*y1
# First line plot
plot(x, y1, type="b", pch=19, col="red", xlab="x", ylab="y")
# Add a second line
lines(x, y2, pch=18, col="blue", type="b", lty=2)
# Add legends
legend("topleft", legend=c("Line 1", "Line 2"),
       col=c("red", "blue"), lty=1:2, cex=0.8)

Read more —> Add legends to plots

Add texts

How is this chapter organized?

  • Add texts within the graph
  • Add text in the margins of the graph
  • Add mathematical annotation to a plot

Read more —> Add text to a plot

To add text to a plot in R, the text() function (to draw text inside the plotting area) and the mtext() function (to put text in one of the four margins of the plot) can be used.

A simplified format for text() is:

text(x, y, labels)

  • x and y: the coordinates of the text
  • labels: vector of texts to be drawn


plot(cars[1:10,], pch=19)
text(cars[1:10,],  row.names(cars[1:10,]), 
     cex=0.65, pos=1,col="red") 

Read more —> Add text to a plot

Add straight lines

How is this chapter organized?

  • Add a vertical line
  • Add a horizontal line
  • Add regression line

Read more —> abline R function : An easy way to add straight lines to a plot using R software

The R function abline() can be used to add straight lines (vertical, horizontal or regression lines) to a graph.

A simplified format is:

abline(a=NULL, b=NULL, h=NULL, v=NULL, ...)

  • a, b : single values specifying the intercept and the slope of the line
  • h : the y-value(s) for horizontal line(s)
  • v : the x-value(s) for vertical line(s)


# Add horizontal and vertical lines
#++++++++++++++++++++++++++++++++++
plot(cars, pch=19)
abline(v=15, col="blue") # Add vertical line
# Add horizontal line, change line color, size and type
abline(h=60, col="red", lty=2, lwd=3)

# Fit regression line
#++++++++++++++++++++++++++++++++++
require(stats)
reg<-lm(dist ~ speed, data = cars)
coeff=coefficients(reg)
# equation of the regression line : 
eq = paste0("y = ", round(coeff[2],1), "*x ", round(coeff[1],1))
plot(cars, main=eq, pch=18)
abline(reg, col="blue", lwd=2)

Read more —> abline R function : An easy way to add straight lines to a plot using R software

Add an axis to a plot

The axis() function can be used.

A simplified format is:

axis(side, at=NULL, labels=TRUE)

  • side: the side of the graph the axis is to be drawn on. Possible values are 1 (below), 2 (left), 3 (above) and 4 (right).
  • at: the points at which tick-marks are to be drawn.
  • labels: vector of texts for the labels of tick-marks.


x<-1:4; y=x*x
plot(x, y, pch=18, col="red", type="b",
     frame=FALSE, xaxt="n") # Remove x axis
axis(1, 1:4, LETTERS[1:4], col.axis="blue")
axis(3, col = "darkgreen", lty = 2, lwd = 0.5)
axis(4, col = "violet", col.axis = "dark violet", lwd = 2)

Read more —> Add an axis to a plot with R software.

Change axis scale: minimum, maximum and log scale

The xlim and ylim arguments can be used to change the limits of the x and y axes. Format: xlim = c(min, max); ylim = c(min, max).

A log transformation can be performed using the parameters log = "x", log = "y" or log = "xy".

x<-1:10; y=x*x
plot(x, y) # Simple graph
plot(x, y, xlim=c(1,15), ylim=c(1,150))# Enlarge the scale
plot(x, y, log="y")# Log scale

Read more —> Axis scale in R software : minimum, maximum and log scale.

Customize tick mark labels

  • Color, font style and font size of tick mark labels
  • Orientation of tick mark labels
  • Hide tick marks
  • Change the string rotation of tick mark labels
  • Use the par() function
x<-1:10; y<-x*x
# Simple graph
plot(x, y)
# Custom plot : blue text, italic-bold, magnification
plot(x,y, col.axis="blue", font.axis=4, cex.axis=1.5)

Read more —> Customize tick mark labels.

Change plotting symbols

The following point symbols can be used in R:

Point symbols can be changed using the argument pch.

x<-c(2.2, 3, 3.8, 4.5, 7, 8.5, 6.7, 5.5)
y<-c(4, 5.5, 4.5, 9, 11, 15.2, 13.3, 10.5)
# Change plotting symbol using pch
plot(x, y, pch = 19, col="blue")
plot(x, y, pch = 18, col="red")
plot(x, y, pch = 24, cex=2, col="blue", bg="red", lwd=2)

Read more —> R plot pch symbols : The different point shapes available in R.

Change line types

The following line types are available in R:

Line types can be changed using the graphical parameter lty.

x=1:10; y=x*x
plot(x, y, type="l") # Solid line (by default)
plot(x, y, type="l", lty="dashed")# Use dashed line type
plot(x, y, type="l", lty="dashed", lwd=3)# Change line width

Read more —> Line types in R : lty.

Change colors

  • Built-in color names in R
  • Specifying colors by hexadecimal code
  • Using RColorBrewer palettes
  • Use Wes Anderson color palettes
  • Create a vector of n contiguous colors

Colors can be specified by name (e.g. col = "red") or by hexadecimal code (e.g. col = "#FFCC00").

# use color names
barplot(c(2,5), col=c("blue", "red"))
# use hexadecimal color code
barplot(c(2,5), col=c("#009999", "#0000FF"))

Hexadecimal color chart:

Hexadecimal color code
(Source: http://www.visibone.com)

The RColorBrewer package can also be used to create nice-looking color palettes.
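For example (a sketch assuming RColorBrewer is installed), a palette can be generated with brewer.pal() and passed to the col argument:

```r
library(RColorBrewer)

# Get 3 colors from the "Set1" Brewer palette
my_cols <- brewer.pal(3, "Set1")

# Use the palette in a bar plot
barplot(c(2, 5, 7), col = my_cols)
```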

Read more —> Colors in R.

Infos

This analysis has been performed using R statistical software (ver. 3.2.4).

R Base Graphs



Previously, we described the essentials of R programming and provided quick start guides for importing data into R.


This chapter contains articles describing how to visualize data using R base graphs.


Creating and saving graphs

  • Creating graphs
  • Saving graphs
  • File formats for exporting plots


Read more: —> Creating and Saving Graphs in R.

Generic plot types in R

Read more: —> Generic plot types in R.

Scatter plots

  • R base scatter plot: plot()
  • Enhanced scatter plots: car::scatterplot()
  • 3D scatter plots

Read more: —> Scatter Plots.

Scatter plot matrices

  • R base scatter plot matrices: pairs()
  • Use the R package psych

Read more —> Scatter Plot Matrices.

Box plots

  • R base box plots: boxplot()
  • Box plot with the number of observations: gplots::boxplot2()

Read more —> Box Plots.

Strip Charts: 1-D scatter Plots

Read more —> Strip Charts: 1-D scatter Plots.

Bar plots

  • Basic bar plots
    • Change group names
    • Change color
    • Change main title and axis labels
  • Stacked bar plots
  • Grouped bar plots

Read more —> Bar Plots.

Line plots

  • R base functions: plot() and lines()
  • Basic line plots
  • Plots with multiple lines

Read more —> Line Plots.

Pie charts

  • Create basic pie charts: pie()
  • Create 3D pie charts: plotrix::pie3D()

Read more —> Pie Charts.

Histogram and density plots

  • Create histogram plots: hist()
  • Create density plots: density()

Read more —> Histogram and Density Plots.

Dot charts

  • R base function: dotchart()
  • Dot chart of one numeric vector
  • Dot chart of a matrix
mtcars <- mtcars[order(mtcars$mpg), ]
# Plot and color by groups cyl
grps <- as.factor(mtcars$cyl)
my_cols <- c("#999999", "#E69F00", "#56B4E9")
dotchart(mtcars$mpg, labels = row.names(mtcars),
         groups = grps, gcolor = my_cols,
         color = my_cols[grps],
         cex = 0.6,  pch = 19, xlab = "mpg")

Read more —> Dot charts.

Plot group means and confidence intervals

R base graphical parameters

  • Add and customize titles
  • Add legends
  • Add texts
  • Add straight lines
  • Add an axis to a plot
  • Change axis scale : minimum, maximum and log scale
  • Customize tick mark labels
  • Change plotting symbols
  • Change line types
  • Change colors

Read more —> R base graphical Parameters.

Infos

This analysis has been performed using R statistical software (ver. 3.2.4).

ggplot2 - Essentials


Introduction

ggplot2 is a powerful and flexible R package, implemented by Hadley Wickham, for producing elegant graphics.

The concept behind ggplot2 divides a plot into three fundamental parts: Plot = data + Aesthetics + Geometry.

The principal components of every plot can be defined as follows:

  • data is a data frame
  • Aesthetics is used to indicate the x and y variables. It can also be used to control the color, the size or the shape of points, the height of bars, etc.
  • Geometry defines the type of graphics (histogram, box plot, line plot, density plot, dot plot, etc.)
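These three parts can be read directly from a minimal sketch (using the built-in mtcars data set; ggplot2 must be loaded):

```r
library(ggplot2)

# data: mtcars
# aesthetics: x, y and point color
# geometry: points (scatter plot)
ggplot(data = mtcars,
       aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point()
```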

There are two major functions in ggplot2 package: qplot() and ggplot() functions.

  • qplot() stands for quick plot; it can be used to easily produce simple plots.
  • ggplot() is more flexible and robust than qplot() for building a plot piece by piece.
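As a quick illustration, the two calls below draw the same scatter plot (a sketch; ggplot2 must be loaded):

```r
library(ggplot2)

# Quick plot with qplot()
qplot(x = wt, y = mpg, data = mtcars)

# The same plot, built piece by piece with ggplot()
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()
```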

This document provides R course material for producing different types of plots using ggplot2.

If you want to be highly effective, download our book: Guide to Create Beautiful Graphics in R


Install and load ggplot2 package

# Installation
install.packages('ggplot2')

# Loading
library(ggplot2)

Data format and preparation

The data should be a data.frame (columns are variables and rows are observations).

The data set mtcars is used in the examples below:

# Load the data
data(mtcars)
df <- mtcars[, c("mpg", "cyl", "wt")]
head(df)
##                    mpg cyl    wt
## Mazda RX4         21.0   6 2.620
## Mazda RX4 Wag     21.0   6 2.875
## Datsun 710        22.8   4 2.320
## Hornet 4 Drive    21.4   6 3.215
## Hornet Sportabout 18.7   8 3.440
## Valiant           18.1   6 3.460

Plotting with ggplot2

  1. qplot(): Quick plot with ggplot2
    • Scatter plots
    • Bar plot
    • Box plot, violin plot and dot plot
    • Histogram and density plots
  2. Box plots
    • Basic box plots
    • Box plot with dots
    • Change box plot colors by groups
      • Change box plot line colors
      • Change box plot fill colors
    • Change the legend position
    • Change the order of items in the legend
    • Box plot with multiple groups
    • Functions: geom_boxplot(), stat_boxplot(), stat_summary()

  3. Violin plots
    • Basic violin plots
    • Add summary statistics on a violin plot
      • Add mean and median points
      • Add median and quartile
      • Add mean and standard deviation
    • Violin plot with dots
    • Change violin plot colors by groups
      • Change violin plot line colors
      • Change violin plot fill colors
    • Change the legend position
    • Change the order of items in the legend
    • Violin plot with multiple groups
    • Functions: geom_violin(), stat_ydensity()

  4. Dot plots
    • Basic dot plots
    • Add summary statistics on a dot plot
      • Add mean and median points
      • Dot plot with box plot and violin plot
      • Add mean and standard deviation
    • Change dot plot colors by groups
    • Change the legend position
    • Change the order of items in the legend
    • Dot plot with multiple groups
    • Functions: geom_dotplot()

  5. Stripcharts
    • Basic stripcharts
    • Add summary statistics on a stripchart
      • Add mean and median points
      • Stripchart with box plot and violin plot
      • Add mean and standard deviation
    • Change point shapes by groups
    • Change stripchart colors by groups
    • Change the legend position
    • Change the order of items in the legend
    • Stripchart with multiple groups
    • Functions: geom_jitter(), stat_summary()

  6. Density plots
    • Basic density plots
    • Change density plot line types and colors
    • Change density plot colors by groups
      • Calculate the mean of each group
      • Change line colors
      • Change fill colors
    • Change the legend position
    • Combine histogram and density plots
    • Use facets
    • Functions: geom_density(), stat_density()

  7. Histogram plots
    • Basic histogram plots
    • Add mean line and density plot on the histogram
    • Change histogram plot line types and colors
    • Change histogram plot colors by groups
      • Calculate the mean of each group
      • Change line colors
      • Change fill colors
    • Change the legend position
    • Use facets
    • Functions: geom_histogram(), stat_bin(), position_identity(), position_stack(), position_dodge()

  8. Scatter plots
    • Basic scatter plots
    • Label points in the scatter plot
      • Add regression lines
      • Change the appearance of points and lines
    • Scatter plots with multiple groups
      • Change the point color/shape/size automatically
      • Add regression lines
      • Change the point color/shape/size manually
    • Add marginal rugs to a scatter plot
    • Scatter plots with the 2d density estimation
    • Scatter plots with ellipses
    • Scatter plots with rectangular bins
    • Scatter plot with marginal density distribution plot
    • Functions: geom_point(), geom_smooth(), stat_smooth(), geom_rug(), geom_density_2d(), stat_density_2d(), stat_bin_2d(), geom_bin2d(), stat_summary_2d(), geom_hex() (see stat_bin_hex()), stat_summary_hex()

  9. Bar plots
    • Basic bar plots
      • Bar plot with labels
      • Bar plot of counts
    • Change bar plot colors by groups
      • Change outline colors
      • Change fill colors
    • Change the legend position
    • Change the order of items in the legend
    • Bar plot with multiple groups
    • Bar plot with a numeric x-axis
    • Bar plot with error bars
    • Functions: geom_bar(), geom_errorbar()

  10. Line plots
    • Line types in R
    • Basic line plots
    • Line plot with multiple groups
      • Change globally the appearance of lines
      • Change automatically the line types by groups
      • Change manually the appearance of lines
    • Functions: geom_line(), geom_step(), geom_path(), geom_errorbar()

  11. Error bars
    • Add error bars to bar and line plots
      • Bar plot with error bars
      • Line plot with error bars
    • Dot plot with mean point and error bars
    • Functions: geom_errorbarh(), geom_errorbar(), geom_linerange(), geom_pointrange(), geom_crossbar(), stat_summary()
  12. Pie chart
    • Simple pie charts
    • Change the pie chart fill colors
    • Create a pie chart from a factor variable
    • Functions: coord_polar()

  13. QQ plots
    • Basic qq plots
    • Change qq plot point shapes by groups
    • Change qq plot colors by groups
    • Change the legend position
    • Functions: stat_qq()

  14. ECDF plots

  15. ggsave(): Save a ggplot
    • print(): print a ggplot to a file
    • ggsave(): save the last ggplot
    • Functions: print(), ggsave()

Graphical parameters

  1. Main title, axis labels and legend title
    • Change the main title and axis labels
    • Change the appearance of the main title and axis labels
    • Remove x and y axis labels
    • Functions: labs(), ggtitle(), xlab(), ylab(), update_labels()

  2. Legend position and appearance
    • Change the legend position
    • Change the legend title and text font styles
    • Change the background color of the legend box
    • Change the order of legend items
    • Remove the plot legend
    • Remove slashes in the legend of a bar plot
    • guides(): set or remove the legend for a specific aesthetic
    • Functions: guides(), guide_legend(), guide_colourbar()

  3. Change colors automatically and manually
    • Use a single color
    • Change colors by groups
      • Default colors
      • Change colors manually
      • Use RColorBrewer palettes
      • Use Wes Anderson color palettes
    • Use gray colors
    • Continuous colors: gradient colors
    • Functions:
      • Brewer palettes: scale_colour_brewer(), scale_fill_brewer(), scale_color_brewer()
      • Gray scales: scale_color_grey(), scale_fill_grey()
      • Manual colors: scale_color_manual(), scale_fill_manual()
      • Hue colors: scale_colour_hue()
      • Gradient, continuous colors: scale_color_gradient(), scale_fill_gradient(), scale_fill_continuous(), scale_color_continuous()
      • Gradient, diverging colors: scale_color_gradient2(), scale_fill_gradient2(), scale_colour_gradientn()

  4. Point shapes, colors and size
    • Change the point shapes, colors and sizes automatically
    • Change point shapes, colors and sizes manually
    • Functions: scale_shape_manual(), scale_color_manual(), scale_size_manual()

Point shapes available in R:

  5. Add text annotations to a graph
    • Text annotations using the function geom_text()
    • Change the text color and size by groups
    • Add a text annotation at a particular coordinate
    • annotation_custom(): add a static text annotation in the top-right, top-left, …
    • Functions: geom_text(), annotate(), annotation_custom()

  6. Line types
    • Line types in R
    • Basic line plots
    • Line plot with multiple groups
      • Change globally the appearance of lines
      • Change automatically the line types by groups
      • Change manually the appearance of lines
    • Functions: scale_linetype(), scale_linetype_manual(), scale_color_manual(), scale_size_manual()

  7. Themes and background colors
    • Quick functions to change plot themes
    • Customize the appearance of the plot background
      • Change the colors of the plot panel background and the grid lines
      • Remove plot panel borders and grid lines
      • Change the plot background color (not the panel)
    • Use a custom theme
      • theme_tufte(): a minimalist theme
      • theme_economist(): theme based on the plots in The Economist magazine
      • theme_stata(): theme based on Stata graph schemes
      • theme_wsj(): theme based on plots in The Wall Street Journal
      • theme_calc(): theme based on LibreOffice Calc
      • theme_hc(): theme based on Highcharts JS
      • Functions: theme(), theme_bw(), theme_grey(), theme_update(), theme_blank(), theme_classic(), theme_minimal(), theme_void(), theme_dark(), element_blank(), element_line(), element_rect(), element_text(), rel()

  8. Axis scales and transformations
    • Change x and y axis limits
      • Use xlim() and ylim() functions
      • Use expand_limits() function
      • Use scale_xx() functions
    • Axis transformations
      • Log and sqrt transformations
      • Format axis tick mark labels
      • Display log tick marks
    • Format date axes
      • Plot with dates
      • Format axis tick mark labels
      • Date axis limits
    • Functions:
      • xlim(), ylim(), expand_limits(): x, y axis limits
      • scale_x_continuous(), scale_y_continuous()
      • scale_x_log10(), scale_y_log10(): log10 transformation
      • scale_x_sqrt(), scale_y_sqrt(): sqrt transformation
      • coord_trans()
      • scale_x_reverse(), scale_y_reverse()
      • annotation_logticks()
      • scale_x_date(), scale_y_date()
      • scale_x_datetime(), scale_y_datetime()

  9. Axis ticks: customize tick marks and labels, reorder and select items
    • Change the appearance of the axis tick mark labels
    • Hide x and y axis tick mark labels
    • Change axis lines
    • Set axis ticks for discrete and continuous axes
      • Customize a discrete axis
        • Change the order of items
        • Change tick mark labels
        • Choose which items to display
      • Customize a continuous axis
        • Set the position of tick marks
        • Format the text of tick mark labels
    • Functions: theme(), scale_x_discrete(), scale_y_discrete(), scale_x_continuous(), scale_y_continuous()

  10. Add straight lines to a plot: horizontal, vertical and regression lines
    • geom_hline(): add horizontal lines
    • geom_vline(): add vertical lines
    • geom_abline(): add regression lines
    • geom_segment(): add a line segment
    • Functions: geom_hline(), geom_vline(), geom_abline(), geom_segment()

  11. Rotate a plot: flip and reverse
    • Horizontal plot: coord_flip()
    • Reverse y axis
    • Functions: coord_flip(), scale_x_reverse(), scale_y_reverse()

  12. Faceting: split a plot into a matrix of panels
    • Facet with one variable
    • Facet with two variables
    • Facet scales
    • Facet labels
    • facet_wrap()
    • Functions: facet_grid(), facet_wrap(), label_both(), label_bquote(), label_parsed()

Extensions to ggplot2: R packages and functions

Acknowledgment

Infos

This analysis was performed using R (ver. 3.2.4) and ggplot2 (ver. 2.1.0).

ggplot2 - Easy way to mix multiple graphs on the same page - R software and data visualization



To arrange multiple ggplot2 graphs on the same page, the standard R functions - par() and layout() - cannot be used.

This R tutorial will show you, step by step, how to put several ggplots on a single page.

The functions grid.arrange() [in the gridExtra package] and plot_grid() [in the cowplot package] will be used.

Install and load required packages

Install and load the package gridExtra

install.packages("gridExtra")
library("gridExtra")

Install and load the package cowplot

cowplot can be installed as follows:

install.packages("cowplot")

OR

or as follows, using the devtools package (devtools should be installed before running the code below):

devtools::install_github("wilkelab/cowplot")

Load cowplot:

library("cowplot")

Prepare some data

ToothGrowth data is used :

df <- ToothGrowth
# Convert the variable dose from a numeric to a factor variable
df$dose <- as.factor(df$dose)
head(df)
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5

Cowplot: Publication-ready plots

The cowplot package is an extension to ggplot2 that can be used to produce publication-ready plots.

Basic plots

library(cowplot)
# Default plot
bp <- ggplot(df, aes(x=dose, y=len, color=dose)) +
  geom_boxplot() + 
  theme(legend.position = "none")
bp

# Add gridlines
bp + background_grid(major = "xy", minor = "none")

Recall that, the function ggsave() [in ggplot2 package] can be used to save ggplots. However, when working with cowplot, the function save_plot() [in cowplot package] is preferred. It’s an alternative to ggsave with better support for multi-figure plots.

save_plot("mpg.pdf", bp,
          base_aspect_ratio = 1.3 # make room for figure legend
          )

Arranging multiple graphs using cowplot

# Scatter plot
sp <- ggplot(mpg, aes(x = cty, y = hwy, colour = factor(cyl)))+ 
  geom_point(size=2.5)
sp

# Bar plot
bp <- ggplot(diamonds, aes(clarity, fill = cut)) +
  geom_bar() +
  theme(axis.text.x = element_text(angle=70, vjust=0.5))
bp

Combine the two plots (the scatter plot and the bar plot):

plot_grid(sp, bp, labels=c("A", "B"), ncol = 2, nrow = 1)

The function draw_plot() can be used to place graphs at particular locations and with particular sizes. The format of the function is:

draw_plot(plot, x = 0, y = 0, width = 1, height = 1)
  • plot: the plot to place (ggplot2 or a gtable)
  • x: The x location of the lower left corner of the plot.
  • y: The y location of the lower left corner of the plot.
  • width, height: the width and the height of the plot

The function ggdraw() is used to initialize an empty drawing canvas.

plot.iris <- ggplot(iris, aes(Sepal.Length, Sepal.Width)) + 
  geom_point() + facet_grid(. ~ Species) + stat_smooth(method = "lm") +
  background_grid(major = 'y', minor = "none") + # add thin horizontal lines 
  panel_border() # and a border around each panel
# sp and bp were defined earlier
ggdraw() +
  draw_plot(plot.iris, 0, .5, 1, .5) +
  draw_plot(sp, 0, 0, .5, .5) +
  draw_plot(bp, .5, 0, .5, .5) +
  draw_plot_label(c("A", "B", "C"), c(0, 0, 0.5), c(1, 0.5, 0.5), size = 15)

grid.arrange: Create and arrange multiple plots

The R code below creates a box plot, a dot plot, a violin plot and a strip chart (jitter plot) :

library(ggplot2)
# Create a box plot
bp <- ggplot(df, aes(x=dose, y=len, color=dose)) +
  geom_boxplot() + 
  theme(legend.position = "none")

# Create a dot plot
# Add the mean point and the standard deviation
dp <- ggplot(df, aes(x=dose, y=len, fill=dose)) +
  geom_dotplot(binaxis='y', stackdir='center')+
  stat_summary(fun.data=mean_sdl, fun.args = list(mult=1), 
                 geom="pointrange", color="red")+
   theme(legend.position = "none")

# Create a violin plot
vp <- ggplot(df, aes(x=dose, y=len)) +
  geom_violin()+
  geom_boxplot(width=0.1)

# Create a stripchart
sc <- ggplot(df, aes(x=dose, y=len, color=dose, shape=dose)) +
  geom_jitter(position=position_jitter(0.2))+
  theme(legend.position = "none") +
  theme_gray()

Combine the plots using the function grid.arrange() [in gridExtra] :

library(gridExtra)
grid.arrange(bp, dp, vp, sc, ncol=2, nrow =2)

grid.arrange() and arrangeGrob(): Change column/row span of a plot

Using the R code below:

  • The box plot will live in the first column
  • The dot plot and the strip chart will live in the second column
grid.arrange(bp, arrangeGrob(dp, sc), ncol = 2)

It’s also possible to use the argument layout_matrix in grid.arrange(). In the R code below layout_matrix is a 3X2 matrix (three rows and two columns). The first column is all 1s — that’s where the first plot lives, spanning the three rows; the second column contains plots 2, 3 and 4, each occupying one row.

grid.arrange(bp, dp, sc, vp, ncol = 2, 
             layout_matrix = cbind(c(1,1,1), c(2,3,4)))

Add a common legend for multiple ggplot2 graphs

This can be done in four simple steps :

  1. Create the plots : p1, p2, ….
  2. Save the legend of the plot p1 as an external graphical element (called a “grob” in Grid terminology)
  3. Remove the legends from all plots
  4. Draw all the plots with only one legend in the right panel

To save the legend of a ggplot, the helper function below can be used :

library(gridExtra)
get_legend<-function(myggplot){
  tmp <- ggplot_gtable(ggplot_build(myggplot))
  leg <- which(sapply(tmp$grobs, function(x) x$name) == "guide-box")
  legend <- tmp$grobs[[leg]]
  return(legend)
}

(The function above is derived from this forum.)

# 1. Create the plots
#++++++++++++++++++++++++++++++++++
# Create a box plot
bp <- ggplot(df, aes(x=dose, y=len, color=dose)) +
  geom_boxplot()

# Create a violin plot
vp <- ggplot(df, aes(x=dose, y=len, color=dose)) +
  geom_violin()+
  geom_boxplot(width=0.1)+
  theme(legend.position="none")

# 2. Save the legend
#+++++++++++++++++++++++
legend <- get_legend(bp)

# 3. Remove the legend from the box plot
#+++++++++++++++++++++++
bp <- bp + theme(legend.position="none")

# 4. Arrange ggplot2 graphs with a specific width
grid.arrange(bp, vp, legend, ncol=3, widths=c(2.3, 2.3, 0.8))

Change legend position

# 1. Create the plots
#++++++++++++++++++++++++++++++++++
# Create a box plot with a top legend position
bp <- ggplot(df, aes(x=dose, y=len, color=dose)) +
  geom_boxplot()+theme(legend.position = "top")

# Create a violin plot
vp <- ggplot(df, aes(x=dose, y=len, color=dose)) +
  geom_violin()+
  geom_boxplot(width=0.1)+
  theme(legend.position="none")

# 2. Save the legend
#+++++++++++++++++++++++
legend <- get_legend(bp)

# 3. Remove the legend from the box plot
#+++++++++++++++++++++++
bp <- bp + theme(legend.position="none")

# 4. Create a blank plot
blankPlot <- ggplot()+geom_blank(aes(1,1)) + 
  cowplot::theme_nothing()

The legend position can be changed by changing the order of the plots, as in the R code below. A grid with four cells (2X2) is created, and the height of the legend zone is set to 0.2.

Top-left legend:

Top-left legend | Blank plot
Box plot        | Violin plot
# Top-left legend
grid.arrange(legend, blankPlot,  bp, vp,
             ncol=2, nrow = 2, 
             widths = c(2.7, 2.7), heights = c(0.2, 2.5))

Top-right legend:

Blank plot | Top-right legend
Box plot   | Violin plot
# Top-right
grid.arrange(blankPlot, legend,  bp, vp,
             ncol=2, nrow = 2, 
             widths = c(2.7, 2.7), heights = c(0.2, 2.5))

Bottom-right and bottom-left legends can be drawn as follows:

# Bottom-left legend
grid.arrange(bp, vp, legend, blankPlot,
             ncol=2, nrow = 2, 
             widths = c(2.7, 2.7), heights = c(2.5, 0.2))
# Bottom-right
grid.arrange( bp, vp, blankPlot, legend, 
             ncol=2, nrow = 2, 
             widths = c(2.7, 2.7), heights = c( 2.5, 0.2))

It’s also possible to use the argument layout_matrix to customize legend position. In the R code below, layout_matrix is a 2X2 matrix:

  • The first row (height = 2.5) is where the first plot (bp) and the second plot (vp) live
  • The second row (height = 0.2) is where the legend lives spanning 2 columns

Bottom-center legend:

grid.arrange(bp, vp, legend, ncol=2, nrow = 2, 
             layout_matrix = rbind(c(1,2), c(3,3)),
             widths = c(2.7, 2.7), heights = c(2.5, 0.2))

Top-center legend:

  • The legend (plot 1) lives in the first row (height = 0.2) spanning two columns
  • bp (plot 2) and vp (plot 3) live in the second row (height = 2.5)
grid.arrange(legend, bp, vp,  ncol=2, nrow = 2, 
             layout_matrix = rbind(c(1,1), c(2,3)),
             widths = c(2.7, 2.7), heights = c(0.2, 2.5))

Scatter plot with marginal density plots

Step 1/3. Create some data :

set.seed(1234)
x <- c(rnorm(500, mean = -1), rnorm(500, mean = 1.5))
y <- c(rnorm(500, mean = 1), rnorm(500, mean = 1.7))
group <- as.factor(rep(c(1,2), each=500))
df2 <- data.frame(x, y, group)
head(df2)
##             x          y group
## 1 -2.20706575 -0.2053334     1
## 2 -0.72257076  1.3014667     1
## 3  0.08444118 -0.5391452     1
## 4 -3.34569770  1.6353707     1
## 5 -0.57087531  1.7029518     1
## 6 -0.49394411 -0.9058829     1

Step 2/3. Create the plots :

# Scatter plot of x and y variables and color by groups
scatterPlot <- ggplot(df2,aes(x, y, color=group)) + 
  geom_point() + 
  scale_color_manual(values = c('#999999','#E69F00')) + 
  theme(legend.position=c(0,1), legend.justification=c(0,1))


# Marginal density plot of x (top panel)
xdensity <- ggplot(df2, aes(x, fill=group)) + 
  geom_density(alpha=.5) + 
  scale_fill_manual(values = c('#999999','#E69F00')) + 
  theme(legend.position = "none")

# Marginal density plot of y (right panel)
ydensity <- ggplot(df2, aes(y, fill=group)) + 
  geom_density(alpha=.5) + 
  scale_fill_manual(values = c('#999999','#E69F00')) + 
  theme(legend.position = "none")

Create a blank placeholder plot :

blankPlot <- ggplot()+geom_blank(aes(1,1))+
  theme(
    plot.background = element_blank(), 
   panel.grid.major = element_blank(),
   panel.grid.minor = element_blank(), 
   panel.border = element_blank(),
   panel.background = element_blank(),
   axis.title.x = element_blank(),
   axis.title.y = element_blank(),
   axis.text.x = element_blank(), 
   axis.text.y = element_blank(),
   axis.ticks = element_blank(),
   axis.line = element_blank()
     )

Step 3/3. Put the plots together:

Arrange ggplot2 with adapted height and width for each row and column :

library("gridExtra")
grid.arrange(xdensity, blankPlot, scatterPlot, ydensity, 
        ncol=2, nrow=2, widths=c(4, 1.4), heights=c(1.4, 4))

Create a complex layout using the function viewport()

The different steps are :

  1. Create plots : p1, p2, p3, ….
  2. Move to a new page on a grid device using the function grid.newpage()
  3. Create a layout 2X2 - number of columns = 2; number of rows = 2
  4. Define a grid viewport : a rectangular region on a graphics device
  5. Print a plot into the viewport
require(grid)
# Move to a new page
grid.newpage()

# Create layout : nrow = 2, ncol = 2
pushViewport(viewport(layout = grid.layout(2, 2)))

# A helper function to define a region on the layout
define_region <- function(row, col){
  viewport(layout.pos.row = row, layout.pos.col = col)
} 

# Arrange the plots
print(scatterPlot, vp=define_region(1, 1:2))
print(xdensity, vp = define_region(2, 1))
print(ydensity, vp = define_region(2, 2))

ggExtra: Add marginal distributions plots to ggplot2 scatter plots

The package ggExtra is an easy-to-use package developed by Dean Attali for adding marginal histograms, boxplots or density plots to ggplot2 scatter plots.

The package can be installed and used as follows:

# Install
install.packages("ggExtra")
# Load
library("ggExtra")

# Create some data
set.seed(1234)
x <- c(rnorm(500, mean = -1), rnorm(500, mean = 1.5))
y <- c(rnorm(500, mean = 1), rnorm(500, mean = 1.7))
df3 <- data.frame(x, y)

# Scatter plot of x and y variables and color by groups
sp2 <- ggplot(df3,aes(x, y)) + geom_point()

# Marginal density plot
ggMarginal(sp2 + theme_gray())

# Marginal histogram plot
ggMarginal(sp2 + theme_gray(), type = "histogram",
           fill = "steelblue", col = "darkblue")

Insert an external graphical element inside a ggplot

The function annotation_custom() [in ggplot2] can be used for adding tables, plots or other grid-based elements. The simplified format is :

annotation_custom(grob, xmin, xmax, ymin, ymax)

  • grob: the external graphical element to display
  • xmin, xmax : x location in data coordinates (horizontal location)
  • ymin, ymax : y location in data coordinates (vertical location)


The different steps are :

  1. Create a scatter plot of y = f(x)
  2. Add, for example, the box plot of the variables x and y inside the scatter plot using the function annotation_custom()

As the inset box plot overlaps with some points, a transparent background is used for the box plots.

# Create a transparent theme object
transparent_theme <- theme(
 axis.title.x = element_blank(),
 axis.title.y = element_blank(),
 axis.text.x = element_blank(), 
 axis.text.y = element_blank(),
 axis.ticks = element_blank(),
 panel.grid = element_blank(),
 axis.line = element_blank(),
 panel.background = element_rect(fill = "transparent",colour = NA),
 plot.background = element_rect(fill = "transparent",colour = NA))

Create the graphs :

p1 <- scatterPlot # see previous sections for the scatterPlot

# Box plot of the x variable
p2 <- ggplot(df2, aes(factor(1), x))+
  geom_boxplot(width=0.3)+coord_flip()+
  transparent_theme

# Box plot of the y variable
p3 <- ggplot(df2, aes(factor(1), y))+
  geom_boxplot(width=0.3)+
  transparent_theme

# Create the external graphical elements
# called a "grob" in Grid terminology
p2_grob = ggplotGrob(p2)
p3_grob = ggplotGrob(p3)

# Insert p2_grob inside the scatter plot
xmin <- min(x); xmax <- max(x)
ymin <- min(y); ymax <- max(y)
p1 + annotation_custom(grob = p2_grob, xmin = xmin, xmax = xmax, 
                       ymin = ymin-1.5, ymax = ymin+1.5)

# Insert p3_grob inside the scatter plot
p1 + annotation_custom(grob = p3_grob,
                       xmin = xmin-1.5, xmax = xmin+1.5, 
                       ymin = ymin, ymax = ymax)

If you have a solution to insert, at the same time, both p2_grob and p3_grob inside the scatter plot, please leave a comment. I got some errors trying to do this…

Mix table, text and ggplot2 graphs

The functions below are required :

  • tableGrob() [in the package gridExtra] : for adding a data table to a graphic device
  • splitTextGrob() [in the package RGraphics] : for adding a text to a graph

Make sure that the package RGraphics is installed.

library(RGraphics)
library(gridExtra)

# Table
p1 <- tableGrob(head(ToothGrowth))

# Text
text <- "ToothGrowth data describes the effect of Vitamin C on tooth growth in Guinea pigs.  Three dose levels of Vitamin C (0.5, 1, and 2 mg) with each of two delivery methods [orange juice (OJ) or ascorbic acid (VC)] are used."
p2 <- splitTextGrob(text)

# Box plot
p3 <- ggplot(df, aes(x=dose, y=len)) + geom_boxplot()

# Arrange the plots on the same page
grid.arrange(p1, p2, p3, ncol=1)

Infos

This analysis has been performed using R software (ver. 3.2.4) and ggplot2 (ver. 2.1.0)


3D graphics



Previously, we described the essentials of R programming and provided quick start guides for importing data into R as well as visualizing data using R base graphs.


This chapter describes how to create static and interactive three-dimensional (3D) graphs. We also provide an R package named graph3d to easily build and customize, step by step, 3D graphs in R software.


Simple 3D Scatter Plots: scatterplot3d Package

  • Install and load scatterplot3d
  • Prepare the data
  • The function scatterplot3d()
  • Basic 3D scatter plots
  • Change the main title and axis labels
  • Change the shape and the color of points
  • Change point shapes by groups
  • Change point colors by groups
  • Change the global appearance of the graph
    • Remove the box around the plot
    • Add grids on scatterplot3d
  • Add bars
  • Modification of scatterplot3d output
    • Add legends
    • Add point labels
    • Add regression plane and supplementary points

Read more: —>Simple 3D Scatter Plots: scatterplot3d Package.

Advanced 3D Graphs: plot3D Package

  • Install and load plot3D package
  • Prepare the data
  • Scatter plots
    • Basic scatter plot
    • Change the type of the box around the plot
    • Change the color by groups
    • Change the position of the legend
    • 3D viewing direction
    • Titles and axis labels
    • Tick marks and labels
    • Add points and text to an existing plot
  • Line plots
    • Add confidence interval
    • 3D fancy Scatter plot with small dots on basal plane
    • Regression plane
  • text3D: plot 3-dimensional texts
  • text3D and scatter3D
  • 3D Histogram
  • scatter2D: 2D scatter plot
  • text2D
  • Interactive plot

Read more: —>Advanced 3D Graphs: plot3D Package.

Interactive 3D Scatter Plots

  • Install and load required packages
  • Prepare the data
  • The function scatter3d
  • Basic 3D scatter plots
  • Plot the points by groups
    • Default plot
    • Remove the surfaces
    • Add concentration ellipsoids
    • Change point colors by groups
  • Axes
    • Change axis labels
    • Remove axis scales
    • Change axis colors
  • Add text labels for the points
  • Export images
3d scatter plot rgl

Read more: —>Interactive 3D Scatter Plots.

Guide to RGL 3D Visualization System

  • Install the RGL package
  • Load the RGL package
  • Prepare the data
  • Start and close RGL device
  • 3D scatter plot
    • Basic graph
    • Change the background and point colors
    • Change the shape of points
  • rgl_init(): A custom function to initialize RGL device
  • Add a bounding box decoration
  • Add axis lines and labels
  • Set the aspect ratios of the x, y and z axes
  • Change the color of points by groups
  • Change the shape of points
  • Add an ellipse of concentration
  • Regression plane
  • Create a movie of RGL scene
  • Export images as png or pdf
  • Export the plot into an interactive HTML file
  • Select a rectangle in an RGL scene
  • Identify points in a plot
  • R3D Interface
  • RGL functions
    • Device management
    • Shape functions
    • Scene management
    • Setup the environment
    • Appearance setup
    • Export screenshot
    • Assign focus to an RGL window
RGL movie 3d

Read more —>Guide to RGL 3D Visualization System.

Infos

This analysis has been performed using R statistical software (ver. 3.2.4).

Data Visualization



Previously, we described the essentials of R programming and provided quick start guides for importing data into R.


This chapter describes how to plot data in R and make elegant data visualization.


Lattice Graphs


Lattice package is a powerful and elegant data visualization system that aims to improve on base R graphs.


  • xyplot(): Scatter plot
  • cloud(): 3D scatter plot
  • Box plot, Dot plot, Strip plot
  • Density plot and Histogram

Read more: —>Lattice Graphs.

Infos

This analysis has been performed using R statistical software (ver. 3.2.4).

ggpubr R Package: ggplot2-Based Publication Ready Plots



Why ggpubr?

ggplot2 by Hadley Wickham is an excellent and flexible package for elegant data visualization in R. However, the plots it generates by default require some formatting before they can be sent for publication. Furthermore, the syntax for customizing a ggplot is opaque, which raises the level of difficulty for researchers without advanced R programming skills.

The ‘ggpubr’ package provides some easy-to-use functions for creating and customizing ‘ggplot2’- based publication ready plots.

Installation and loading

  • Install from CRAN as follows:
install.packages("ggpubr")
  • Or, install the latest version from GitHub as follows:
# Install
if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/ggpubr")
  • Load ggpubr as follows:
library(ggpubr)

Getting started

See the online documentation (http://www.sthda.com/english/rpkgs/ggpubr) for a complete list.

Density and histogram plots

  1. Create some data
set.seed(1234)
wdata = data.frame(
   sex = factor(rep(c("F", "M"), each=200)),
   weight = c(rnorm(200, 55), rnorm(200, 58)))
head(wdata, 4)
##   sex   weight
## 1   F 53.79293
## 2   F 55.27743
## 3   F 56.08444
## 4   F 52.65430
  1. Density plot with mean lines and marginal rug
# Change outline and fill colors by groups ("sex")
# Use custom palette
ggdensity(wdata, x = "weight",
   add = "mean", rug = TRUE,
   color = "sex", fill = "sex",
   palette = c("#00AFBB", "#E7B800"))


Note that:

  1. the argument palette is used for coloring or filling by groups. Allowed values include:
    • “grey” for grey color palettes;
    • brewer palettes e.g. “RdBu”, “Blues”, …; click here to see all brewer palettes.
    • or custom color palettes e.g. c(“blue”, “red”) or c(“#00AFBB”, “#E7B800”);
    • and scientific journal palettes from ggsci R package, e.g.: “npg”, “aaas”, “lancet”, “jco”, “ucscgb”, “uchicago”, “simpsons” and “rickandmorty”.
  2. the argument add can be used to add mean or median lines to density and to histogram plots. Allowed values are: “mean” and “median”.
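
To illustrate the second point, the same density plot can be drawn with median lines instead of mean lines. This is a sketch that recreates the wdata object from above; only the value of the add argument changes.

```r
library(ggpubr)

# Recreate the example data from above
set.seed(1234)
wdata <- data.frame(
   sex = factor(rep(c("F", "M"), each = 200)),
   weight = c(rnorm(200, 55), rnorm(200, 58)))

# add = "median" draws median lines instead of means
dp <- ggdensity(wdata, x = "weight",
   add = "median", rug = TRUE,
   color = "sex", fill = "sex",
   palette = c("#00AFBB", "#E7B800"))
dp
```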


  1. Histogram plot with mean lines and marginal rug
# Change outline and fill colors by groups ("sex")
# Use custom color palette
gghistogram(wdata, x = "weight",
   add = "mean", rug = TRUE,
   color = "sex", fill = "sex",
   palette = c("#00AFBB", "#E7B800"))

If you want to create the above histogram with the standard ggplot2 functions, the syntax is extremely complex for beginners (see the R script below). The ggpubr package is a wrapper around ggplot2 functions to make your life easier and to quickly produce a publication-ready plot.

# ggplot2 standard syntax for creating histogram
# +++++++++++++++++++++++++++++++++++++
# Compute group mean
library("dplyr")
mu <- wdata %>%
group_by(sex) %>%
summarise(grp.mean = mean(weight))
# Plot
ggplot(data = wdata, aes(weight)) +
  geom_histogram(aes(color = sex, fill = sex),
                 position = "identity", alpha = 0.5)+
  geom_vline(data = mu, aes(xintercept=grp.mean, color = sex),
             linetype="dashed", size=1) +
  scale_color_manual(values = c("#00AFBB", "#E7B800"))+
  scale_fill_manual(values = c("#00AFBB", "#E7B800"))+
  theme_classic()+
  theme(
    axis.text.x = element_text(size = 12, colour = "black",face = "bold"),
    axis.text.y = element_text(size = 12, colour = "black",face = "bold"),
    axis.line.x = element_line(colour = "black", size = 1),
    axis.line.y = element_line(colour = "black", size = 1),
    legend.position = "bottom"
    )

Box plots, violin plots, dot plots and strip charts

  1. Load data
data("ToothGrowth")
df <- ToothGrowth
head(df, 4)
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
  1. Box plots with jittered points
# Change outline colors by groups: dose
# Use custom color palette
# Add jitter points and change the shape by groups
 ggboxplot(df, x = "dose", y = "len",
    color = "dose", palette =c("#00AFBB", "#E7B800", "#FC4E07"),
    add = "jitter", shape = "dose")


Note that, when using ggpubr functions for drawing box plots, violin plots, dot plots, strip charts, bar plots, line plots or error plots, the argument add can be used for adding another plot element (e.g.: dot plot or error bars).

In this case, allowed values for the argument add are one or the combination of: “none”, “dotplot”, “jitter”, “boxplot”, “mean”, “mean_se”, “mean_sd”, “mean_ci”, “mean_range”, “median”, “median_iqr”, “median_mad”, “median_range”; see ?desc_statby for more details.
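
Since add accepts a combination of values, a box plot can, for example, overlay jittered points together with mean ± sd pointranges. The sketch below reuses the ToothGrowth data loaded above.

```r
library(ggpubr)

df <- ToothGrowth
# Combine two `add` values: jittered points plus mean +/- sd
bp2 <- ggboxplot(df, x = "dose", y = "len", color = "dose",
          palette = c("#00AFBB", "#E7B800", "#FC4E07"),
          add = c("jitter", "mean_sd"))
bp2
```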


  1. Violin plots with box plots inside
# Change fill color by groups: dose
# add boxplot with white fill color
ggviolin(df, x = "dose", y = "len", fill = "dose",
   palette = c("#00AFBB", "#E7B800", "#FC4E07"),
   add = "boxplot", add.params = list(fill = "white"))

  1. Dot plots with summary statistics
# Change outline and fill colors by groups: dose
# Add mean + sd
ggdotplot(df, x = "dose", y = "len", color = "dose", fill = "dose", 
          palette = c("#00AFBB", "#E7B800", "#FC4E07"),
          add = "mean_sd", add.params = list(color = "gray"))


Recall that, possible summary statistics include “boxplot”, “mean”, “mean_se”, “mean_sd”, “mean_ci”, “mean_range”, “median”, “median_iqr”, “median_mad”, “median_range”; see ?desc_statby for more details.


  1. Strip chart with summary statistics
# Change points size
# Change point colors and shapes by groups: dose
# Use custom color palette
 ggstripchart(df, "dose", "len",  size = 2, shape = "dose",
   color = "dose", palette = c("#00AFBB", "#E7B800", "#FC4E07"),
   add = "mean_sd")

Bar plots

  1. Basic plot with labels outside
# Data
df2 <- data.frame(dose=c("D0.5", "D1", "D2"),
   len=c(4.2, 10, 29.5))
print(df2)
##   dose  len
## 1 D0.5  4.2
## 2   D1 10.0
## 3   D2 29.5
# Change outline and fill colors by groups: dose
# Use custom color palette
# Add labels
 ggbarplot(df2, x = "dose", y = "len",
   fill = "dose", color = "dose",
   palette = c("#00AFBB", "#E7B800", "#FC4E07"),
   label = TRUE)


  • Use lab.pos = “in”, to put labels inside bars
  • Use lab.col, to change label colors


  1. Bar plot with multiple groups
# Create some data
df3 <- data.frame(supp=rep(c("VC", "OJ"), each=3),
   dose=rep(c("D0.5", "D1", "D2"),2),
   len=c(6.8, 15, 33, 4.2, 10, 29.5))
print(df3)
##   supp dose  len
## 1   VC D0.5  6.8
## 2   VC   D1 15.0
## 3   VC   D2 33.0
## 4   OJ D0.5  4.2
## 5   OJ   D1 10.0
## 6   OJ   D2 29.5
# Plot "len" by "dose" and change color by a second group: "supp"
# Add labels inside bars
ggbarplot(df3, x = "dose", y = "len",
  fill = "supp", color = "supp", palette = c("#00AFBB", "#E7B800"),
  label = TRUE, lab.col = "white", lab.pos = "in")

  1. Bar plot visualizing the mean of each group with error bars
# Data: the ToothGrowth data set will be used.
df <- ToothGrowth
head(df, 10)
##     len supp dose
## 1   4.2   VC  0.5
## 2  11.5   VC  0.5
## 3   7.3   VC  0.5
## 4   5.8   VC  0.5
## 5   6.4   VC  0.5
## 6  10.0   VC  0.5
## 7  11.2   VC  0.5
## 8  11.2   VC  0.5
## 9   5.2   VC  0.5
## 10  7.0   VC  0.5
# Visualize the mean of each group
# Change point and outline colors by groups: dose
# Add jitter points and errors (mean_se)
ggbarplot(df, x = "dose", y = "len", color = "dose",
          palette = c("#00AFBB", "#E7B800", "#FC4E07"),
          add = c("mean_se", "jitter"))

Line plots

  1. Line plots with multiple groups
# Plot "len" by "dose" and
# Change line types and point shapes by a second groups: "supp"
# Change color by groups "supp"
ggline(df3, x = "dose", y = "len",
  linetype = "supp", shape = "supp",
  color = "supp",  palette = c("#00AFBB", "#E7B800"))

  1. Line plot visualizing the mean of each group with error bars
# Visualize the mean of each group: dose
# Change colors by a second groups: supp
# Add jitter points and errors (mean_se)
ggline(df, x = "dose", y = "len", 
       color = "supp", 
       palette = c("#00AFBB", "#E7B800", "#FC4E07"),
       add = c("mean_se", "jitter"))

Pie chart

  1. Create some data
df4 <- data.frame(
  group = c("Male", "Female", "Child"),
  value = c(25, 25, 50))
head(df4)
##    group value
## 1   Male    25
## 2 Female    25
## 3  Child    50
  1. Pie chart
# Change fill color by group
# set outline line color to white
# Use custom color palette
# Show group names and value as labels
labs <- paste0(df4$group, " (", df4$value, "%)")
ggpie(df4, x = "value", fill = "group", color = "white",
   palette = c("#00AFBB", "#E7B800", "#FC4E07"),
   label = labs, lab.pos = "in", lab.font = "white")

Scatter plots

  1. Load and prepare data
data("mtcars")
df5 <- mtcars
df5$cyl <- as.factor(df5$cyl) # grouping variable
df5$name = rownames(df5) # for point labels
head(df5[, c("wt", "mpg", "cyl")], 3)
##                  wt  mpg cyl
## Mazda RX4     2.620 21.0   6
## Mazda RX4 Wag 2.875 21.0   6
## Datsun 710    2.320 22.8   4
  1. Scatter plots with regression line and confidence interval
ggscatter(df5, x = "wt", y = "mpg",
   color = "black", shape = 21, size = 4, # Points color, shape and size
   add = "reg.line",  # Add regression line
   add.params = list(color = "blue", fill = "lightgray"), # Customize reg. line
   conf.int = TRUE, # Add confidence interval
   cor.coef = TRUE # Add correlation coefficient
   )


Note that, when using ggpubr functions for drawing scatter plots, allowed values for the argument add are one of “none”, “reg.line” (for adding linear regression line) or “loess” (for adding local regression fitting).
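
For example, replacing “reg.line” by “loess” fits a local regression curve instead of a straight line. A sketch using the same mtcars-based data:

```r
library(ggpubr)

df5 <- mtcars
# add = "loess": local regression fitting with a confidence band
sp3 <- ggscatter(df5, x = "wt", y = "mpg",
   add = "loess", conf.int = TRUE)
sp3
```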


  1. Scatter plot with concentration ellipses and labels
# Change point colors and shapes by groups: cyl
# Use custom palette
# Add concentration ellipses with mean points (barycenters)
# Add marginal rug
# Add label and use repel = TRUE to avoid label overplotting
ggscatter(df5, x = "wt", y = "mpg",
   color = "cyl", shape = "cyl",
   palette = c("#00AFBB", "#E7B800", "#FC4E07"),
   ellipse = TRUE, mean.point = TRUE,
   rug = TRUE, label = "name", font.label = 10, repel = TRUE)


Note that, it’s possible to change the ellipse type by using the argument ellipse.type. Possible values are ‘convex’, ‘confidence’ or types supported by ggplot2::stat_ellipse() including one of c(“t”, “norm”, “euclid”).
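
As a quick sketch of the ellipse.type argument, the scatter plot can be redrawn with convex hulls instead of the default ellipses:

```r
library(ggpubr)

# Recreate the grouped data from above
df5 <- mtcars
df5$cyl <- as.factor(df5$cyl)

# ellipse.type = "convex" draws the convex hull of each group
sp4 <- ggscatter(df5, x = "wt", y = "mpg",
   color = "cyl", palette = c("#00AFBB", "#E7B800", "#FC4E07"),
   ellipse = TRUE, ellipse.type = "convex")
sp4
```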


Cleveland’s dot plots

# Change colors by  group cyl
ggdotchart(df5, x = "mpg", label = "name",
   group = "cyl", color = "cyl",
   palette = c("#00AFBB", "#E7B800", "#FC4E07") )

ggpar(): customize ggplot easily

The function ggpar() [in ggpubr] can be used to simply and easily customize any ggplot2-based graphs. The graphical parameters that can be changed using ggpar() include:

  • Main titles, axis labels and legend titles
  • Legend position and appearance
  • colors
  • Axis limits
  • Axis transformations: log and sqrt
  • Axis ticks
  • Themes
  • Rotate a plot

Note that all the arguments accepted by the function ggpar() can also be passed directly to the plotting functions in the ggpubr package.

We start by creating a basic box plot colored by groups as follows:

df <- ToothGrowth
p <- ggboxplot(df, x = "dose", y = "len",
               color = "dose")
print(p)

Main titles, axis labels and legend titles

# Change title texts and fonts
ggpar(p, main = "Plot of length \n by dose",
      xlab ="Dose (mg)", ylab = "Teeth length",
      legend.title = "Dose (mg)",
      font.main = c(14,"bold.italic", "red"),
      font.x = c(14, "bold", "#2E9FDF"),
      font.y = c(14, "bold", "#E7B800"))

# Hide titles
ggpar(p, xlab = FALSE, ylab = FALSE)


Note that,

  1. font.main, font.x and font.y are vectors of length 3 indicating the size (e.g.: 14), the style (e.g.: “plain”, “bold”, “italic”, “bold.italic”) and the color (e.g.: “red”) of the main title, xlab and ylab, respectively. For example, font.x = c(14, “bold”, “red”). Use font.x = 14 to change only the font size, or font.x = “bold” to change only the font face.
  2. you can use \n, to split long title into multiple lines.
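
As a minimal sketch of the shortcut form mentioned in the first note, passing a single value changes only the font size:

```r
library(ggpubr)

# Recreate the basic box plot from above
df <- ToothGrowth
p <- ggboxplot(df, x = "dose", y = "len", color = "dose")

# font.x = 14 changes only the size of the x axis label
p5 <- ggpar(p, xlab = "Dose (mg)", font.x = 14)
p5
```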


Legend position and appearance

ggpar(p,
 legend = "right", legend.title = "Dose (mg)",
 font.legend = c(10, "bold", "red"))


Note that, the legend argument is a character vector specifying the legend position. Allowed values are one of c(“top”, “bottom”, “left”, “right”, “none”). The default is the “bottom” position. To remove the legend, use legend = “none”. The legend position can also be specified as a numeric vector c(x, y), whose values should be between 0 and 1: c(0,0) corresponds to the “bottom left” and c(1,1) to the “top right” position.
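
To sketch the numeric form, the legend can be placed inside the plotting area at the top-right corner:

```r
library(ggpubr)

# Recreate the basic box plot from above
df <- ToothGrowth
p <- ggboxplot(df, x = "dose", y = "len", color = "dose")

# Relative coordinates: c(0, 0) = bottom-left, c(1, 1) = top-right
p6 <- ggpar(p, legend = c(1, 1), legend.title = "Dose (mg)")
p6
```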


Color palettes

As mentioned above, the argument palette is used to change group color palettes. Allowed values include:

  • Custom color palettes e.g. c(“blue”, “red”) or c(“#00AFBB”, “#E7B800”);
  • “grey” for grey color palettes;
  • brewer palettes e.g. “RdBu”, “Blues”, …; click here to see all brewer palettes.
  • and scientific journal palettes from ggsci R package, e.g.: “npg”, “aaas”, “lancet”, “jco”, “ucscgb”, “uchicago”, “simpsons” and “rickandmorty”.
# Use custom color palette
ggpar(p, palette = c("#00AFBB", "#E7B800", "#FC4E07"))

# Use brewer palette
ggpar(p, palette = "Dark2" )

# Use grey palette
ggpar(p, palette = "grey")
   

# Use scientific journal palette from ggsci package
# Allowed values: "npg", "aaas", "lancet", "jco", 
#   "ucscgb", "uchicago", "simpsons" and "rickandmorty".
ggpar(p, palette = "npg") # nature

Axis limits and scales

The following arguments can be used:


  • xlim, ylim: a numeric vector of length 2, specifying x and y axis limits (minimum and maximum values), respectively. e.g.: ylim = c(0, 50).
  • xscale, yscale: x and y axis scale, respectively. Allowed values are one of c(“none”, “log2”, “log10”, “sqrt”); e.g.: yscale=“log2”.
  • format.scale: logical value. If TRUE, axis tick mark labels will be formatted when xscale or yscale = “log2” or “log10”.


# Change y axis limits
ggpar(p, ylim = c(0, 50))

# Change y axis scale to log2
ggpar(p, yscale = "log2")

# Format axis scale
ggpar(p, yscale = "log2", format.scale = TRUE)

Axis ticks: customize tick marks and labels

The following arguments can be used:


  • ticks: logical value. Default is TRUE. If FALSE, hide axis tick marks.
  • tickslab: logical value. Default is TRUE. If FALSE, hide axis tick labels.
  • font.tickslab: Font style (size, face, color) for tick labels, e.g.: c(14, “bold”, “red”).
  • xtickslab.rt, ytickslab.rt: Rotation angle of x and y axis tick labels, respectively. Default value is 0.
  • xticks.by, yticks.by: numeric value controlling x and y axis breaks, respectively. For example, if yticks.by = 5, a tick mark is shown on every 5. Default value is NULL.
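The breaks arguments above can be sketched as follows (using the box plot p created earlier; the tick spacing is an illustrative choice):

```r
# A tick mark every 10 units on the y axis
ggpar(p, yticks.by = 10)
```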


# Axis tick labels style: "plain", "italic", "bold" or "bold.italic"
# Rotation angle = 45
ggpar(p, font.tickslab = c(12, "bold", "#2E9FDF"),
      xtickslab.rt = 45, ytickslab.rt = 45)

# Hide ticks and tickslab
ggpar(p, ticks = FALSE, tickslab = FALSE)

Themes

The R package ggpubr contains two main functions for changing the default ggplot theme to a publication ready theme:

  • theme_pubr(): change the theme to a publication ready theme
  • labs_pubr(): Format only plot labels to a publication ready style

theme_pubr() will produce plots with bold axis labels, bold tick mark labels and a legend at the bottom, leaving extra space for the plotting area.


The argument ggtheme can be used in any ggpubr plotting function to change the plot theme. The default value is theme_pubr(), a publication-ready theme. Allowed values include the official ggplot2 themes: theme_gray(), theme_bw(), theme_minimal(), theme_classic(), theme_void(), etc. It’s also possible to add a theme with the “+” operator.
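A minimal sketch of the ggtheme argument, assuming the ToothGrowth data frame df created above:

```r
# Pass a ggplot2 theme at plot creation time
ggboxplot(df, x = "dose", y = "len",
          color = "dose", ggtheme = theme_bw())
```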


# Gray theme
p + theme_gray()

# Minimal theme
p + theme_minimal()

# Format only plot labels to a publication ready style
# by using the function labs_pubr()
p + theme_minimal() + labs_pubr(base_size = 16)

Rotate a plot

  • Create some data
set.seed(1234)
wdata = data.frame(
   sex = factor(rep(c("F", "M"), each=200)),
   weight = c(rnorm(200, 55), rnorm(200, 58)))
  • Create a density plot and change plot orientation
# Basic density plot
p <- ggdensity(wdata, x = "weight") + theme_gray()
p

# Horizontal plot
ggpar(p, orientation = "horizontal" ) + theme_gray()

# y axis reversed
ggpar(p, orientation = "reverse" ) + theme_gray()

More

See the online documentation (http://www.sthda.com/english/rpkgs/ggpubr) for a complete list.

Infos

This analysis has been performed using R software (ver. 3.2.4) and ggpubr (ver. 0.1.0.999)

R packages



In this section, you’ll find R packages developed by STHDA for easy data analyses.


factoextra

factoextra lets you extract and create elegant ggplot2-based visualizations of multivariate data analysis results, including PCA, CA, MCA, MFA, HMFA and clustering methods.

Overview >>
factoextra Site Link >>
survminer

survminer provides functions for facilitating survival analysis and visualization.

Overview >>
survminer Site Link >>
ggpubr

The default plots generated by ggplot2 require some formatting before they can be sent for publication. The customization syntax is opaque, which raises the level of difficulty for researchers with no advanced R programming skills. ggpubr provides some easy-to-use functions for creating and customizing ggplot2-based publication ready plots.

Overview >>
ggpubr Site Link >>

Infos

This analysis has been performed using R software (ver. 3.2.4)

Hybrid hierarchical k-means clustering for optimizing clustering outputs - Unsupervised Machine Learning



Clustering algorithms are used to split a dataset into several groups (i.e clusters), so that the objects in the same group are as similar as possible and the objects in different groups are as dissimilar as possible.

The most popular clustering algorithms are:

  • [url=/wiki/partitioning-cluster-analysis-quick-start-guide-unsupervised-machine-learning]k-means clustering[/url], a partitioning method used for splitting a dataset into a set of k clusters.
  • [url=/wiki/hierarchical-clustering-essentials-unsupervised-machine-learning]hierarchical clustering[/url], an alternative approach to k-means clustering for identifying clustering in the dataset by using [url=/wiki/clarifying-distance-measures-unsupervised-machine-learning]pairwise distance matrix[/url] between observations as clustering criteria.

However, each of these two standard clustering methods has its limitations. K-means clustering requires the user to specify the number of clusters in advance and selects initial centroids randomly. Agglomerative hierarchical clustering is good at identifying small clusters but not large ones.

In this article, we document hybrid approaches for easily mixing the best of k-means clustering and hierarchical clustering.

1 How this article is organized

We’ll start by demonstrating why we should combine k-means and hierarchical clustering. An application is provided using R software.

Finally, we’ll provide an easy-to-use R function (in the factoextra package) for computing hybrid hierarchical k-means clustering.

2 Required R packages

We’ll use the R package factoextra, which is very helpful for simplifying clustering workflows and for visualizing clusters using the ggplot2 plotting system.

Install the factoextra package as follows:

if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/factoextra")

Load the package:

library(factoextra)

3 Data preparation

We’ll use the USArrests data set, and we start by scaling the data:

# Load the data
data(USArrests)
# Scale the data
df <- scale(USArrests)
head(df)
##                Murder   Assault   UrbanPop         Rape
## Alabama    1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska     0.50786248 1.1068225 -1.2117642  2.484202941
## Arizona    0.07163341 1.4788032  0.9989801  1.042878388
## Arkansas   0.23234938 0.2308680 -1.0735927 -0.184916602
## California 0.27826823 1.2628144  1.7589234  2.067820292
## Colorado   0.02571456 0.3988593  0.8608085  1.864967207

If you want to understand why the data are scaled before the analysis, then you should read this section: [url=/wiki/clarifying-distance-measures-unsupervised-machine-learning#distances-and-scaling]Distances and scaling[/url].

4 R function for clustering analyses

We’ll use the function eclust() [in factoextra] which provides several advantages as described in the previous chapter: [url=/wiki/visual-enhancement-of-clustering-analysis-unsupervised-machine-learning]Visual Enhancement of Clustering Analysis[/url].

eclust() stands for enhanced clustering. It simplifies the workflow of clustering analysis and can be used for computing [url=/wiki/hierarchical-clustering-essentials-unsupervised-machine-learning]hierarchical clustering[/url] and [url=/wiki/partitioning-cluster-analysis-quick-start-guide-unsupervised-machine-learning]partitioning clustering[/url] in a single function call.

4.1 Example of k-means clustering

We’ll split the data into 4 clusters using k-means clustering as follows:

library("factoextra")
# K-means clustering
km.res <- eclust(df, "kmeans", k = 4,
                 nstart = 25, graph = FALSE)
# k-means group number of each observation
head(km.res$cluster, 15)
##     Alabama      Alaska     Arizona    Arkansas  California    Colorado 
##           4           3           3           4           3           3 
## Connecticut    Delaware     Florida     Georgia      Hawaii       Idaho 
##           2           2           3           4           2           1 
##    Illinois     Indiana        Iowa 
##           3           2           1
# Visualize k-means clusters
fviz_cluster(km.res,  frame.type = "norm", frame.level = 0.68)

Clustering on principal component - Unsupervised Machine Learning

# Visualize the silhouette of clusters
fviz_silhouette(km.res)
##   cluster size ave.sil.width
## 1       1   13          0.37
## 2       2   16          0.34
## 3       3   13          0.27
## 4       4    8          0.39


Note that the silhouette coefficient measures how well an observation is clustered, by comparing its average distance to the other members of its own cluster with its average distance to the nearest neighboring cluster. Observations with a negative silhouette are probably placed in the wrong cluster. Read more here: [url=/wiki/clustering-validation-statistics-4-vital-things-everyone-should-know-unsupervised-machine-learning]cluster validation statistics[/url]

Samples with negative silhouette coefficient:

# Silhouette width of observation
sil <- km.res$silinfo$widths[, 1:3]
# Objects with negative silhouette
neg_sil_index <- which(sil[, 'sil_width'] < 0)
sil[neg_sil_index, , drop = FALSE]
##          cluster neighbor   sil_width
## Missouri       3        2 -0.07318144

Read more about k-means clustering: [url=/wiki/partitioning-cluster-analysis-quick-start-guide-unsupervised-machine-learning]K-means clustering[/url]

4.2 Example of hierarchical clustering

# Enhanced hierarchical clustering
res.hc <- eclust(df, "hclust", k = 4,
                method = "ward.D2", graph = FALSE) 
head(res.hc$cluster, 15)
##     Alabama      Alaska     Arizona    Arkansas  California    Colorado 
##           1           2           2           3           2           2 
## Connecticut    Delaware     Florida     Georgia      Hawaii       Idaho 
##           4           3           2           1           3           4 
##    Illinois     Indiana        Iowa 
##           2           3           4
# Dendrogram
fviz_dend(res.hc, rect = TRUE, show_labels = TRUE, cex = 0.5) 


# Visualize the silhouette of clusters
fviz_silhouette(res.hc)
##   cluster size ave.sil.width
## 1       1    7          0.40
## 2       2   12          0.26
## 3       3   18          0.38
## 4       4   13          0.35


It can be seen that three samples have negative silhouette coefficients, indicating that they are not in the right cluster. These samples are:

# Silhouette width of observation
sil <- res.hc$silinfo$widths[, 1:3]
# Objects with negative silhouette
neg_sil_index <- which(sil[, 'sil_width'] < 0)
sil[neg_sil_index, , drop = FALSE]
##             cluster neighbor    sil_width
## Alaska            2        1 -0.005212336
## Nebraska          4        3 -0.044172624
## Connecticut       4        3 -0.078016589

Read more about hierarchical clustering: [url=/wiki/hierarchical-clustering-essentials-unsupervised-machine-learning]Hierarchical clustering[/url]

5 Combining hierarchical clustering and k-means

5.1 Why?

Recall that, in the k-means algorithm, a random set of observations is chosen as the initial centers.

The final k-means clustering solution is very sensitive to this initial random selection of cluster centers. The result might be (slightly) different each time you compute k-means.
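This sensitivity can be observed directly by running kmeans() twice with single random starts (a quick sketch using the scaled df from above; the total within-cluster sum of squares may differ between runs):

```r
# Two independent runs from random initial centers
res1 <- kmeans(df, centers = 4, nstart = 1)
res2 <- kmeans(df, centers = 4, nstart = 1)
# Compare the compactness of the two solutions
res1$tot.withinss
res2$tot.withinss
```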

To avoid this, a solution is to use a hybrid approach combining hierarchical clustering and k-means. This process is named hybrid hierarchical k-means clustering (hkmeans).

5.2 How?

The procedure is as follow:

  1. Compute hierarchical clustering and cut the tree into k clusters
  2. Compute the center (i.e., the mean) of each cluster
  3. Compute k-means using the set of cluster centers (defined in step 2) as the initial cluster centers

Note that the k-means algorithm will improve the initial partitioning generated at step 1. Hence, the initial partitioning can be slightly different from the final partitioning obtained in step 3.

5.3 R codes

5.3.1 Compute hierarchical clustering and cut the tree into k-clusters:

res.hc <- eclust(df, "hclust", k = 4,
                method = "ward.D2", graph = FALSE) 
grp <- res.hc$cluster

5.3.2 Compute the centers of clusters defined by hierarchical clustering:

Cluster centers are defined as the means of variables in clusters. The function aggregate() can be used to compute the mean per group in a data frame.

# Compute cluster centers
clus.centers <- aggregate(df, list(grp), mean)
clus.centers
##   Group.1     Murder    Assault   UrbanPop        Rape
## 1       1  1.5803956  0.9662584 -0.7775109  0.04844071
## 2       2  0.7298036  1.1188219  0.7571799  1.32135653
## 3       3 -0.3250544 -0.3231032  0.3733701 -0.17068130
## 4       4 -1.0745717 -1.1056780 -0.7972496 -1.00946922
# Remove the first column
clus.centers <- clus.centers[, -1]
clus.centers
##       Murder    Assault   UrbanPop        Rape
## 1  1.5803956  0.9662584 -0.7775109  0.04844071
## 2  0.7298036  1.1188219  0.7571799  1.32135653
## 3 -0.3250544 -0.3231032  0.3733701 -0.17068130
## 4 -1.0745717 -1.1056780 -0.7972496 -1.00946922

5.3.3 K-means clustering using hierarchical clustering defined cluster-centers

km.res2 <- eclust(df, "kmeans", k = clus.centers, graph = FALSE)
fviz_silhouette(km.res2)
##   cluster size ave.sil.width
## 1       1    8          0.39
## 2       2   13          0.27
## 3       3   16          0.34
## 4       4   13          0.37


5.3.4 Compare the results of hierarchical clustering and hybrid approach

The R code below compares the initial clusters defined using only hierarchical clustering and the final ones defined using hierarchical clustering + k-means:

# res.hc$cluster: Initial clusters defined using hierarchical clustering
# km.res2$cluster: Final clusters defined using k-means
table(km.res2$cluster, res.hc$cluster)
##    
##      1  2  3  4
##   1  7  0  1  0
##   2  0 12  1  0
##   3  0  0 15  1
##   4  0  0  1 12

It can be seen that three of the observations assigned to cluster 3 by hierarchical clustering have been reclassified to clusters 1, 2 and 4 in the final solution defined by k-means clustering.

The difference can be easily visualized using the function fviz_dend() [in factoextra]. The labels are colored using k-means clusters:

fviz_dend(res.hc, k = 4, 
          k_colors = c("black", "red",  "blue", "green3"),
          label_cols =  km.res2$cluster[res.hc$order], cex = 0.6)


It can be seen that the hierarchical clustering result has been improved by the k-means algorithm.

5.3.5 Compare the results of standard k-means clustering and hybrid approach

# Clusters defined using k-means via eclust() (section 4.1)
km.clust <- km.res$cluster
# Standard k-means clustering
set.seed(123)
res.km <- kmeans(df, centers = 4, iter.max = 100)
# comparison
table(km.clust, res.km$cluster)
##         
## km.clust  1  2  3  4
##        1 13  0  0  0
##        2  0 16  0  0
##        3  0  0 13  0
##        4  0  0  0  8

In our current example, there was no further improvement of the k-means clustering result by the hybrid approach. An improvement might be observed using another dataset.

5.4 hkmeans(): Easy-to-use function for hybrid hierarchical k-means clustering

The function hkmeans() [in factoextra] can be used to easily compute hybrid hierarchical k-means clustering. The format of the result is similar to the one returned by the standard kmeans() function.

# Compute hierarchical k-means clustering
res.hk <- hkmeans(df, 4)
# Elements returned by hkmeans()
names(res.hk)
##  [1] "cluster"      "centers"      "totss"        "withinss"    
##  [5] "tot.withinss" "betweenss"    "size"         "iter"        
##  [9] "ifault"       "data"         "hclust"
# Print the results
res.hk
## Hierarchical K-means clustering with 4 clusters of sizes 8, 13, 16, 13
## 
## Cluster means:
##       Murder    Assault   UrbanPop        Rape
## 1  1.4118898  0.8743346 -0.8145211  0.01927104
## 2  0.6950701  1.0394414  0.7226370  1.27693964
## 3 -0.4894375 -0.3826001  0.5758298 -0.26165379
## 4 -0.9615407 -1.1066010 -0.9301069 -0.96676331
## 
## Clustering vector:
##        Alabama         Alaska        Arizona       Arkansas     California 
##              1              2              2              1              2 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              2              3              3              2              1 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              3              4              2              3              4 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              3              4              1              4              2 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              3              2              4              1              2 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              4              4              2              4              3 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              2              2              1              4              3 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              3              3              3              3              1 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              4              1              2              3              4 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              3              3              4              4              3 
## 
## Within cluster sum of squares by cluster:
## [1]  8.316061 19.922437 16.212213 11.952463
##  (between_SS / total_SS =  71.2 %)
## 
## Available components:
## 
##  [1] "cluster"      "centers"      "totss"        "withinss"    
##  [5] "tot.withinss" "betweenss"    "size"         "iter"        
##  [9] "ifault"       "data"         "hclust"
# Visualize the tree
fviz_dend(res.hk, cex = 0.6, rect = TRUE)


# Visualize the hkmeans final clusters
fviz_cluster(res.hk, frame.type = "norm", frame.level = 0.68)


6 Infos

This analysis has been performed using R software (ver. 3.2.1)

ExpressionSet and SummarizedExperiment

This analysis was performed using R (ver. 3.1.0).

Introduction

The ExpressionSet class is generally used for array-based experiments, where the rows are features, and the SummarizedExperiment class is generally used for sequencing-based experiments, where the rows are GenomicRanges. ExpressionSet is in the Biobase package.

There’s a package, GEOquery, which lets you pull down ExpressionSets by identifier.

library(Biobase)
# Download data from GEO
library(GEOquery)
geoq <- getGEO("GSE9514")
# The list has a single element
names(geoq)
## [1] "GSE9514_series_matrix.txt.gz"
# Save it to the variable e for simplicity
e <- geoq[[1]]

ExpressionSet

ExpressionSets are basically matrices with a lot of metadata around them. Here, we have a matrix which is 9,335 by 8. It has phenotypic data, feature and annotation information. You can use the functions dim(), ncol() and nrow() to get the dimensions, the number of columns and the number of rows, respectively.

The matrix of expression data is stored in the exprs slot, and you can access it with the exprs() function. The phenotypic data can be accessed using pData(), which gives us a data frame with information about the samples (the accession number, the submission date, etc.). The feature data are accessible with the fData() function; this is information about the genes or probe sets. The columns of the feature data include, for example, the gene title, the gene symbol, the ENTREZ_GENE_ID and Gene Ontology information, which might be useful for downstream analysis.
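To make the structure concrete, here is a hypothetical, minimal ExpressionSet built from scratch (a toy matrix and toy sample annotations, not the GEO data used above):

```r
library(Biobase)
# Toy expression matrix: 3 features x 2 samples
m <- matrix(1:6, nrow = 3,
            dimnames = list(c("f1", "f2", "f3"), c("s1", "s2")))
# Sample annotations: one row per column of the matrix
pd <- AnnotatedDataFrame(data.frame(condition = c("ctrl", "trt"),
                                    row.names = c("s1", "s2")))
toy <- ExpressionSet(assayData = m, phenoData = pd)
exprs(toy)  # the expression matrix
pData(toy)  # the sample information
```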

# Number of dimensions
dim(e)
## Features  Samples 
##     9335        8
# The expression matrix
exprs(e)[1:3,1:3]
##            GSM241146 GSM241147 GSM241148
## 10000_at       15.33     9.459     7.985
## 10001_at      283.47   300.729   270.016
## 10002_i_at   2569.45  2382.815  2711.814
# Phenotypic data: information about the samples
pData(e)[1:3,1:6]
##                                                                          title
## GSM241146 hem1 strain grown in YPD with 250 uM ALA (08-15-06_Philpott_YG_S98_1)
## GSM241147    WT strain grown in YPD under Hypoxia (08-15-06_Philpott_YG_S98_10)
## GSM241148    WT strain grown in YPD under Hypoxia (08-15-06_Philpott_YG_S98_11)
##           geo_accession                status submission_date
## GSM241146     GSM241146 Public on Nov 06 2007     Nov 02 2007
## GSM241147     GSM241147 Public on Nov 06 2007     Nov 02 2007
## GSM241148     GSM241148 Public on Nov 06 2007     Nov 02 2007
##           last_update_date type
## GSM241146      Aug 14 2011  RNA
## GSM241147      Aug 14 2011  RNA
## GSM241148      Aug 14 2011  RNA
dim(pData(e))
## [1]  8 31
# Column names of the phenotypic data
names(pData(e))
##  [1] "title"                   "geo_accession"          
##  [3] "status"                  "submission_date"        
##  [5] "last_update_date"        "type"                   
##  [7] "channel_count"           "source_name_ch1"        
##  [9] "organism_ch1"            "characteristics_ch1"    
## [11] "molecule_ch1"            "extract_protocol_ch1"   
## [13] "label_ch1"               "label_protocol_ch1"     
## [15] "taxid_ch1"               "hyb_protocol"           
## [17] "scan_protocol"           "description"            
## [19] "data_processing"         "platform_id"            
## [21] "contact_name"            "contact_email"          
## [23] "contact_department"      "contact_institute"      
## [25] "contact_address"         "contact_city"           
## [27] "contact_state"           "contact_zip/postal_code"
## [29] "contact_country"         "supplementary_file"     
## [31] "data_row_count"
# Feature data: information about genes or probe sets
fData(e)[1:3,1:3]
##                    ID     ORF SPOT_ID
## 10000_at     10000_at YLR331C        
## 10001_at     10001_at YLR332W        
## 10002_i_at 10002_i_at YLR333C        
dim(fData(e))
## [1] 9335   17
names(fData(e))
##  [1] "ID"                               "ORF"                             
##  [3] "SPOT_ID"                          "Species Scientific Name"         
##  [5] "Annotation Date"                  "Sequence Type"                   
##  [7] "Sequence Source"                  "Target Description"              
##  [9] "Representative Public ID"         "Gene Title"                      
## [11] "Gene Symbol"                      "ENTREZ_GENE_ID"                  
## [13] "RefSeq Transcript ID"             "SGD accession number"            
## [15] "Gene Ontology Biological Process" "Gene Ontology Cellular Component"
## [17] "Gene Ontology Molecular Function"
head(fData(e)$"Gene Symbol")
## [1] JIP3   MID2   RPS25B NUP2  
## 4869 Levels: ACO1 ARV1 ATP14 BOP2 CDA1 CDA2 CDC25 CDC3 CDD1 CTS1 ... Il4
head(rownames(e))
## [1] "10000_at"   "10001_at"   "10002_i_at" "10003_f_at" "10004_at"  
## [6] "10005_at"
# Experiment data: experimenter name, laboratory, contact, abstract.
# It is sometimes empty.
experimentData(e)
## Experiment data
##   Experimenter name:  
##   Laboratory:  
##   Contact information:  
##   Title:  
##   URL:  
##   PMIDs:  
##   No abstract available.
# Annotation platform
annotation(e)
## [1] "GPL90"

Summarized Experiment

We’re going to load a Bioconductor data package, parathyroidSE. The loaded data is a SummarizedExperiment, which summarizes counts of RNA sequencing reads in genes for an experiment on human cell cultures. The SummarizedExperiment object has 63,193 rows, which are genes, and 27 columns, which are samples; the matrix in this case is called counts. We also have the row names, which are Ensembl gene IDs, metadata about the row data, and metadata about the column data.

library(parathyroidSE)
# RNA sequencing reads
data(parathyroidGenesSE)
se <- parathyroidGenesSE
se
## class: SummarizedExperiment 
## dim: 63193 27 
## exptData(1): MIAME
## assays(1): counts
## rownames(63193): ENSG00000000003 ENSG00000000005 ... LRG_98 LRG_99
## rowData metadata column names(0):
## colnames: NULL
## colData names(8): run experiment ... study sample

The assay() function gives access to the counts of RNA sequencing reads. The column data, returned by colData(), is the equivalent of pData() on an ExpressionSet. Each row of this data frame corresponds to a column of the SummarizedExperiment; we can see that there are indeed 27 rows here, which give information about the columns. Each sample received one of two treatments or served as a control, and we can count the number of replicates for each treatment level.
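For example, the replicates per treatment level can be counted, and the object subset, directly through the column data (a sketch using the se object above):

```r
# Number of samples for each treatment level
table(colData(se)$treatment)
# Keep only the control samples; rows (genes) are unchanged
se.ctrl <- se[, colData(se)$treatment == "Control"]
dim(se.ctrl)
```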

# Dimensions of the SummarizedExperiment
dim(se)
## [1] 63193    27
# Access the counts of RNA sequencing reads, using the assay function
assay(se)[1:3,1:3]
##                 [,1] [,2] [,3]
## ENSG00000000003  792 1064  444
## ENSG00000000005    4    1    2
## ENSG00000000419  294  282  164
# The assay is a matrix with the same dimensions as the SummarizedExperiment
dim(assay(se))
## [1] 63193    27
# Get information about the samples
colData(se)[1:3,1:6]
## DataFrame with 3 rows and 6 columns
##         run experiment  patient treatment     time submission
##    <factor>   <factor> <factor>  <factor> <factor>   <factor>
## 1 SRR479052  SRX140503        1   Control      24h  SRA051611
## 2 SRR479053  SRX140504        1   Control      48h  SRA051611
## 3 SRR479054  SRX140505        1       DPN      24h  SRA051611
# Dimension of the column data
dim(colData(se))
## [1] 27  8
# Characteristics of the samples
names(colData(se))
## [1] "run"        "experiment" "patient"    "treatment"  "time"      
## [6] "submission" "study"      "sample"
# Access the treatment column of the sample characteristics
colData(se)$treatment
##  [1] Control Control DPN     DPN     OHT     OHT     Control Control
##  [9] DPN     DPN     DPN     OHT     OHT     OHT     Control Control
## [17] DPN     DPN     OHT     OHT     Control DPN     DPN     DPN    
## [25] OHT     OHT     OHT    
## Levels: Control DPN OHT

The rows in this case correspond to genes, and genes are collections of exons. The row data of the SummarizedExperiment is a GRangesList, where each element is a GRanges containing the exons that were used to count the RNA sequencing reads. Some metadata are included in the row data and are accessible with the metadata() function. This information tells us how the GRangesList was constructed: it was built with the GenomicFeatures package using a transcript database; the organism was Homo sapiens; the database was ENSEMBL GENES number 72; and so on. In addition, there’s some more information under the experiment data, accessed with exptData() and then specifying MIAME, which stands for minimal information about a microarray experiment. Although we’re not using microarrays, the same slots are used to describe extra information about this object.

# Extract a single GRanges object: 17 ranges and 2 metadata columns,
# including the Ensembl ID of each exon
rowData(se)[1]
## GRangesList of length 1:
## $ENSG00000000003 
## GRanges with 17 ranges and 2 metadata columns:
##        seqnames               ranges strand |   exon_id       exon_name
##           <Rle>            <IRanges>  <Rle> | <integer>     <character>
##    [1]        X [99883667, 99884983]      - |    664095 ENSE00001459322
##    [2]        X [99885756, 99885863]      - |    664096 ENSE00000868868
##    [3]        X [99887482, 99887565]      - |    664097 ENSE00000401072
##    [4]        X [99887538, 99887565]      - |    664098 ENSE00001849132
##    [5]        X [99888402, 99888536]      - |    664099 ENSE00003554016
##    ...      ...                  ...    ... .       ...             ...
##   [13]        X [99890555, 99890743]      - |    664106 ENSE00003512331
##   [14]        X [99891188, 99891686]      - |    664108 ENSE00001886883
##   [15]        X [99891605, 99891803]      - |    664109 ENSE00001855382
##   [16]        X [99891790, 99892101]      - |    664110 ENSE00001863395
##   [17]        X [99894942, 99894988]      - |    664111 ENSE00001828996
## 
## ---
## seqlengths:
##          1         2 ...    LRG_99
##  249250621 243199373 ...     13294
# rowData is indeed a GRangesList
class(rowData(se))
## [1] "GRangesList"
## attr(,"package")
## [1] "GenomicRanges"
# length gives the number of genes
length(rowData(se))
## [1] 63193
# Length of the first GRanges: the number of exons of the first gene
length(rowData(se)[[1]])
## [1] 17
head(rownames(se))
## [1] "ENSG00000000003" "ENSG00000000005" "ENSG00000000419" "ENSG00000000457"
## [5] "ENSG00000000460" "ENSG00000000938"
metadata(rowData(se))
## $genomeInfo
## $genomeInfo$`Db type`
## [1] "TranscriptDb"
## 
## $genomeInfo$`Supporting package`
## [1] "GenomicFeatures"
## 
## $genomeInfo$`Data source`
## [1] "BioMart"
## 
## $genomeInfo$Organism
## [1] "Homo sapiens"
## 
## $genomeInfo$`Resource URL`
## [1] "www.biomart.org:80"
## 
## $genomeInfo$`BioMart database`
## [1] "ensembl"
## 
## $genomeInfo$`BioMart database version`
## [1] "ENSEMBL GENES 72 (SANGER UK)"
## 
## $genomeInfo$`BioMart dataset`
## [1] "hsapiens_gene_ensembl"
## 
## $genomeInfo$`BioMart dataset description`
## [1] "Homo sapiens genes (GRCh37.p11)"
## 
## $genomeInfo$`BioMart dataset version`
## [1] "GRCh37.p11"
## 
## $genomeInfo$`Full dataset`
## [1] "yes"
## 
## $genomeInfo$`miRBase build ID`
## [1] NA
## 
## $genomeInfo$transcript_nrow
## [1] "213140"
## 
## $genomeInfo$exon_nrow
## [1] "737783"
## 
## $genomeInfo$cds_nrow
## [1] "531154"
## 
## $genomeInfo$`Db created by`
## [1] "GenomicFeatures package from Bioconductor"
## 
## $genomeInfo$`Creation time`
## [1] "2013-07-30 17:30:25 +0200 (Tue, 30 Jul 2013)"
## 
## $genomeInfo$`GenomicFeatures version at creation time`
## [1] "1.13.21"
## 
## $genomeInfo$`RSQLite version at creation time`
## [1] "0.11.4"
## 
## $genomeInfo$DBSCHEMAVERSION
## [1] "1.0"
exptData(se)$MIAME
## Experiment data
##   Experimenter name: Felix Haglund 
##   Laboratory: Science for Life Laboratory Stockholm 
##   Contact information: Mikael Huss 
##   Title: DPN and Tamoxifen treatments of parathyroid adenoma cells 
##   URL: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE37211 
##   PMIDs: 23024189 
## 
##   Abstract: A 251 word abstract is available. Use 'abstract' method.
abstract(exptData(se)$MIAME)
## [1] "Primary hyperparathyroidism (PHPT) is most frequently present in postmenopausal women. Although the involvement of estrogen has been suggested, current literature indicates that parathyroid tumors are estrogen receptor (ER) alpha negative. Objective: The aim of the study was to evaluate the expression of ERs and their putative function in parathyroid tumors. Design: A panel of 37 parathyroid tumors was analyzed for expression and promoter methylation of the ESR1 and ESR2 genes as well as expression of the ERalpha and ERbeta1/ERbeta2 proteins. Transcriptome changes in primary cultures of parathyroid adenoma cells after treatment with the selective ERbeta1 agonist diarylpropionitrile (DPN) and 4-hydroxytamoxifen were identified using next-generation RNA sequencing. Results: Immunohistochemistry revealed very low expression of ERalpha, whereas all informative tumors expressed ERbeta1 (n = 35) and ERbeta2 (n = 34). Decreased nuclear staining intensity and mosaic pattern of positive and negative nuclei of ERbeta1 were significantly associated with larger tumor size. Tumor ESR2 levels were significantly higher in female vs. male cases. In cultured cells, significantly increased numbers of genes with modified expression were detected after 48 h, compared to 24-h treatments with DPN or 4-hydroxytamoxifen, including the parathyroid-related genes CASR, VDR, JUN, CALR, and ORAI2. Bioinformatic analysis of transcriptome changes after DPN treatment revealed significant enrichment in gene sets coupled to ER activation, and a highly significant similarity to tumor cells undergoing apoptosis. Conclusions: Parathyroid tumors express ERbeta1 and ERbeta2. Transcriptional changes after ERbeta1 activation and correlation to clinical features point to a role of estrogen signaling in parathyroid function and disease."


ggplot2 ECDF plot : Quick start guide for Empirical Cumulative Density Function - R software and data visualization



This R tutorial describes how to create an ECDF plot (or Empirical Cumulative Distribution Function plot) using R software and the ggplot2 package. The ECDF reports, for any given value, the percentage of individuals that fall below that threshold.

The function stat_ecdf() can be used.

Create some data

set.seed(1234)
df <- data.frame(height = round(rnorm(200, mean=60, sd=15)))
head(df)
##   height
## 1     42
## 2     64
## 3     76
## 4     25
## 5     66
## 6     68

ECDF plots

library(ggplot2)

ggplot(df, aes(height)) + stat_ecdf(geom = "point")

ggplot(df, aes(height)) + stat_ecdf(geom = "step")

For any value, say height = 50, you can see that about 25% of the individuals are shorter than 50 inches.
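You can check this reading numerically with the base R function ecdf(), which returns the empirical CDF as a function you can evaluate at any value (a quick sketch, re-simulating the same df as above):

```r
set.seed(1234)
df <- data.frame(height = round(rnorm(200, mean = 60, sd = 15)))

# ecdf() returns a function: F(v) is the fraction of values <= v
F <- ecdf(df$height)
F(50)
```

With this seed the result is close to the theoretical value pnorm(50, mean = 60, sd = 15), about 0.25.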

Customized ECDF plots

# Basic ECDF plot
ggplot(df, aes(height)) + stat_ecdf(geom = "step")+
labs(title="Empirical Cumulative \n Density Function",
     y = "F(height)", x="Height in inch")+
theme_classic()

Infos

This analysis has been performed using R software (ver. 3.2.4) and ggplot2 (ver. 2.1.0)


Descriptive Statistics and Graphics




Descriptive statistics consist of describing simply the data using some summary statistics and graphics. Here, we’ll describe how to compute summary statistics using R software.


Descriptive statistics


Import your data into R

  1. Prepare your data as specified here: Best practices for preparing your data set for R

  2. Save your data in an external .txt tab or .csv files

  3. Import your data into R as follow:

# If .txt tab file, use this
my_data <- read.delim(file.choose())

# Or, if .csv file, use this
my_data <- read.csv(file.choose())

Here, we’ll use the built-in R data set named iris.

# Store the data in the variable my_data
my_data <- iris

Check your data

You can inspect your data using the functions head() and tail(), which will display the first and the last part of the data, respectively.

# Print the first 6 rows
head(my_data, 6)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

R functions for computing descriptive statistics

Some R functions for computing descriptive statistics:

Description                              R function
Mean                                     mean()
Standard deviation                       sd()
Variance                                 var()
Minimum                                  min()
Maximum                                  max()
Median                                   median()
Range of values (minimum and maximum)    range()
Sample quantiles                         quantile()
Generic function                         summary()
Interquartile range                      IQR()

The function mfv(), for most frequent value, [in modeest package] can be used to find the statistical mode of a numeric vector.
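For instance, assuming the modeest package is installed, mfv() simply returns the value(s) occurring most often:

```r
library(modeest)  # install.packages("modeest") if needed

mfv(c(1, 2, 2, 3, 3, 3))  # 3 is the most frequent value
```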

Descriptive statistics for a single group

Measure of central tendency: mean, median, mode

Roughly speaking, the central tendency measures the “average” or the “middle” of your data. The most commonly used measures include:

  • the mean: the average value. It’s sensitive to outliers.
  • the median: the middle value. It’s a robust alternative to mean.
  • and the mode: the most frequent value

In R,

  • The function mean() and median() can be used to compute the mean and the median, respectively;
  • The function mfv() [in the modeest R package] can be used to compute the mode of a variable.

The R code below computes the mean, median and the mode of the variable Sepal.Length [in my_data data set]:

# Compute the mean value
mean(my_data$Sepal.Length)
[1] 5.843333
# Compute the median value
median(my_data$Sepal.Length)
[1] 5.8
# Compute the mode
# install.packages("modeest")
require(modeest)
mfv(my_data$Sepal.Length)
[1] 5

Measure of variability

Measures of variability describe how “spread out” the data are.

Range: minimum & maximum

  • Range corresponds to the biggest value minus the smallest value. It gives you the full spread of the data.
# Compute the minimum value
min(my_data$Sepal.Length)
[1] 4.3
# Compute the maximum value
max(my_data$Sepal.Length)
[1] 7.9
# Range
range(my_data$Sepal.Length)
[1] 4.3 7.9

Interquartile range

Recall that, quartiles divide the data into 4 parts. Note that, the interquartile range (IQR) - corresponding to the difference between the first and third quartiles - is sometimes used as a robust alternative to the standard deviation.

  • R function:
quantile(x, probs = seq(0, 1, 0.25))

  • x: numeric vector whose sample quantiles are wanted.
  • probs: numeric vector of probabilities with values in [0,1].


  • Example:
quantile(my_data$Sepal.Length)
  0%  25%  50%  75% 100% 
 4.3  5.1  5.8  6.4  7.9 

By default, the function returns the minimum, the maximum and the three quartiles (the 0.25, 0.50 and 0.75 quantiles).

To compute deciles (0.1, 0.2, 0.3, …., 0.9), use this:

quantile(my_data$Sepal.Length, seq(0, 1, 0.1))

To compute the interquartile range, type this:

IQR(my_data$Sepal.Length)
[1] 1.3
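As a check, the IQR is indeed the third quartile minus the first quartile (6.4 - 5.1 = 1.3):

```r
q <- quantile(iris$Sepal.Length)

# difference between the 75% and 25% quantiles
iqr_manual <- unname(q["75%"] - q["25%"])
iqr_manual  # 1.3
all.equal(iqr_manual, IQR(iris$Sepal.Length))
```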

Variance and standard deviation

The variance represents the average squared deviation from the mean. The standard deviation is the square root of the variance. It measures the average deviation of the values, in the data, from the mean value.

# Compute the variance
var(my_data$Sepal.Length)
# Compute the standard deviation =
# square root of the variance
sd(my_data$Sepal.Length)
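A quick sanity check of the relationship between the two measures:

```r
x <- iris$Sepal.Length

# the standard deviation is, by definition, the square root of the variance
all.equal(sd(x), sqrt(var(x)))  # TRUE
```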

Median absolute deviation

The median absolute deviation (MAD) measures the deviation of the values, in the data, from the median value.

# Compute the median
median(my_data$Sepal.Length)
# Compute the median absolute deviation
mad(my_data$Sepal.Length)
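Note that, by default, mad() multiplies the raw median absolute deviation by the constant 1.4826, so that it estimates the standard deviation for normally distributed data; a manual check:

```r
x <- iris$Sepal.Length

raw_mad <- median(abs(x - median(x)))
all.equal(mad(x), 1.4826 * raw_mad)  # TRUE: the default constant is 1.4826
```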

Which measure to use?

  • Range. It’s not often used because it’s very sensitive to outliers.
  • Interquartile range. It’s pretty robust to outliers. It’s used a lot in combination with the median.
  • Variance. It’s hard to interpret directly because it’s expressed in squared units rather than the units of the data. It’s rarely used except as a mathematical tool.
  • Standard deviation. This is the square root of the variance. It’s expressed in the same units as the data. The standard deviation is often used in the situation where the mean is the measure of central tendency.
  • Median absolute deviation. It’s a robust way to estimate the standard deviation, for data with outliers. It’s not used very often.

In summary, the IQR and the standard deviation are the two most common measures used to report the variability of the data.

Computing an overall summary of a variable and an entire data frame

summary() function

The function summary() can be used to display several statistic summaries of either one variable or an entire data frame.

  • Summary of a single variable. Six values are returned in one single line call: the minimum, 1st quartile, median, mean, 3rd quartile and maximum:
summary(my_data$Sepal.Length)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  4.300   5.100   5.800   5.843   6.400   7.900 
  • Summary of a data frame. In this case, the function summary() is automatically applied to each column. The format of the result depends on the type of the data contained in the column. For example:
    • If the column is a numeric variable, mean, median, min, max and quartiles are returned.
    • If the column is a factor variable, the number of observations in each group is returned.
summary(my_data, digits = 1)
  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width        Species  
 Min.   :4     Min.   :2    Min.   :1     Min.   :0.1   setosa    :50  
 1st Qu.:5     1st Qu.:3    1st Qu.:2     1st Qu.:0.3   versicolor:50  
 Median :6     Median :3    Median :4     Median :1.3   virginica :50  
 Mean   :6     Mean   :3    Mean   :4     Mean   :1.2                  
 3rd Qu.:6     3rd Qu.:3    3rd Qu.:5     3rd Qu.:1.8                  
 Max.   :8     Max.   :4    Max.   :7     Max.   :2.5                  

sapply() function

It’s also possible to use the function sapply() to apply a particular function over a list or vector. For instance, we can use it, to compute for each column in a data frame, the mean, sd, var, min, quantile, …

# Compute the mean of each column
sapply(my_data[, -5], mean)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
    5.843333     3.057333     3.758000     1.199333 
# Compute quartiles
sapply(my_data[, -5], quantile)
     Sepal.Length Sepal.Width Petal.Length Petal.Width
0%            4.3         2.0         1.00         0.1
25%           5.1         2.8         1.60         0.3
50%           5.8         3.0         4.35         1.3
75%           6.4         3.3         5.10         1.8
100%          7.9         4.4         6.90         2.5

stat.desc() function

The function stat.desc() [in pastecs package], provides other useful statistics including:

  • the median
  • the mean
  • the standard error on the mean (SE.mean)
  • the confidence interval of the mean (CI.mean) at the p level (default is 0.95)
  • the variance (var)
  • the standard deviation (std.dev)
  • and the coefficient of variation (coef.var), defined as the standard deviation divided by the mean

  • Install pastecs package

install.packages("pastecs")
  • Use the function stat.desc() to compute descriptive statistics
# Compute descriptive statistics
library(pastecs)
res <- stat.desc(my_data[, -5])
round(res, 2)
             Sepal.Length Sepal.Width Petal.Length Petal.Width
nbr.val            150.00      150.00       150.00      150.00
nbr.null             0.00        0.00         0.00        0.00
nbr.na               0.00        0.00         0.00        0.00
min                  4.30        2.00         1.00        0.10
max                  7.90        4.40         6.90        2.50
range                3.60        2.40         5.90        2.40
sum                876.50      458.60       563.70      179.90
median               5.80        3.00         4.35        1.30
mean                 5.84        3.06         3.76        1.20
SE.mean              0.07        0.04         0.14        0.06
CI.mean.0.95         0.13        0.07         0.28        0.12
var                  0.69        0.19         3.12        0.58
std.dev              0.83        0.44         1.77        0.76
coef.var             0.14        0.14         0.47        0.64

Case of missing values

Note that, when the data contains missing values, some R functions will return errors or NA even if just a single value is missing.

For example, the mean() function will return NA if even one value is missing in a vector. This can be avoided using the argument na.rm = TRUE, which tells the function to remove any NAs before calculations. An example using the mean function is as follow:

mean(my_data$Sepal.Length, na.rm = TRUE)
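A minimal illustration with a vector containing a missing value:

```r
v <- c(1, 2, NA, 4)

mean(v)                # NA: the missing value propagates
mean(v, na.rm = TRUE)  # 2.333...: the NA is dropped before averaging
```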

Graphical display of distributions

The R package ggpubr will be used to create graphs.

Installation and loading ggpubr

  • Install the latest version from GitHub as follow:
# Install
if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/ggpubr")
  • Or, install from CRAN as follow:
install.packages("ggpubr")
  • Load ggpubr as follow:
library(ggpubr)

Box plots

ggboxplot(my_data, y = "Sepal.Length", width = 0.5)

Histogram


Histograms show the number of observations that fall within specified divisions (i.e., bins).


Histogram plot of Sepal.Length with mean line (dashed line).

gghistogram(my_data, x = "Sepal.Length", bins = 9, 
             add = "mean")

Empirical cumulative distribution function (ECDF)


ECDF is the fraction of data smaller than or equal to x.


ggecdf(my_data, x = "Sepal.Length")

Q-Q plots


A Q-Q plot is used to check whether the data are normally distributed.


ggqqplot(my_data, x = "Sepal.Length")

Descriptive statistics by groups

To compute summary statistics by groups, the functions group_by() and summarise() [in dplyr package] can be used.

  • We want to group the data by Species and then:
    • compute the number of element in each group. R function: n()
    • compute the mean. R function mean()
    • and the standard deviation. R function sd()

The pipe operator %>% is used to chain operations.

  • Install dplyr as follow:
install.packages("dplyr")
  • Descriptive statistics by groups:
library(dplyr)
group_by(my_data, Species) %>% 
summarise(
  count = n(), 
  mean = mean(Sepal.Length, na.rm = TRUE),
  sd = sd(Sepal.Length, na.rm = TRUE)
  )
Source: local data frame [3 x 4]

     Species count  mean        sd
      (fctr) (int) (dbl)     (dbl)
1     setosa    50 5.006 0.3524897
2 versicolor    50 5.936 0.5161711
3  virginica    50 6.588 0.6358796
  • Graphics for grouped data:
library("ggpubr")
# Box plot colored by groups: Species
ggboxplot(my_data, x = "Species", y = "Sepal.Length",
          color = "Species",
          palette = c("#00AFBB", "#E7B800", "#FC4E07"))

# Stripchart colored by groups: Species
ggstripchart(my_data, x = "Species", y = "Sepal.Length",
          color = "Species",
          palette = c("#00AFBB", "#E7B800", "#FC4E07"),
          add = "mean_sd")

Note that, when the number of observations per group is small, a strip chart is recommended over a box plot.

Frequency tables

A frequency table (or contingency table) is used to describe categorical variables. It contains the counts at each combination of factor levels.

R function to generate tables: table()

Create some data

Distribution of hair and eye color by sex of 592 students:

# Hair/eye color data
df <- as.data.frame(HairEyeColor)
hair_eye_col <- df[rep(row.names(df), df$Freq), 1:3]
rownames(hair_eye_col) <- 1:nrow(hair_eye_col)
head(hair_eye_col)
   Hair   Eye  Sex
1 Black Brown Male
2 Black Brown Male
3 Black Brown Male
4 Black Brown Male
5 Black Brown Male
6 Black Brown Male
# hair/eye variables
Hair <- hair_eye_col$Hair
Eye <- hair_eye_col$Eye

Simple frequency distribution: one categorical variable

  • Table of counts
# Frequency distribution of hair color
table(Hair)
Hair
Black Brown   Red Blond 
  108   286    71   127 
# Frequency distribution of eye color
table(Eye)
Eye
Brown  Blue Hazel Green 
  220   215    93    64 
  • Graphics: to create the graphics, we start by converting the table as a data frame.
# Compute table and convert as data frame
df <- as.data.frame(table(Hair))
df
   Hair Freq
1 Black  108
2 Brown  286
3   Red   71
4 Blond  127
# Visualize using bar plot
library(ggpubr)
ggbarplot(df, x = "Hair", y = "Freq")

Two-way contingency table: Two categorical variables

tbl2 <- table(Hair , Eye)
tbl2
       Eye
Hair    Brown Blue Hazel Green
  Black    68   20    15     5
  Brown   119   84    54    29
  Red      26   17    14    14
  Blond     7   94    10    16

It’s also possible to use the function xtabs(), which will create cross tabulation of data frames with a formula interface.

xtabs(~ Hair + Eye, data = hair_eye_col)
  • Graphics: to create the graphics, we start by converting the table as a data frame.
df <- as.data.frame(tbl2)
head(df)
   Hair   Eye Freq
1 Black Brown   68
2 Brown Brown  119
3   Red Brown   26
4 Blond Brown    7
5 Black  Blue   20
6 Brown  Blue   84
# Visualize using bar plot
library(ggpubr)
ggbarplot(df, x = "Hair", y = "Freq",
          color = "Eye", 
          palette = c("brown", "blue", "gold", "green"))

# position dodge
ggbarplot(df, x = "Hair", y = "Freq",
          color = "Eye", position = position_dodge(),
          palette = c("brown", "blue", "gold", "green"))

Multiway tables: More than two categorical variables

  • Hair and Eye color distributions by sex using xtabs():
xtabs(~Hair + Eye + Sex, data = hair_eye_col)
, , Sex = Male

       Eye
Hair    Brown Blue Hazel Green
  Black    32   11    10     3
  Brown    53   50    25    15
  Red      10   10     7     7
  Blond     3   30     5     8

, , Sex = Female

       Eye
Hair    Brown Blue Hazel Green
  Black    36    9     5     2
  Brown    66   34    29    14
  Red      16    7     7     7
  Blond     4   64     5     8
  • You can also use the function ftable() [for flat contingency tables]. It returns a more compact output than xtabs() when you have more than two variables:
ftable(Sex + Hair ~ Eye, data = hair_eye_col)
      Sex   Male                 Female                
      Hair Black Brown Red Blond  Black Brown Red Blond
Eye                                                    
Brown         32    53  10     3     36    66  16     4
Blue          11    50  10    30      9    34   7    64
Hazel         10    25   7     5      5    29   7     5
Green          3    15   7     8      2    14   7     8

Compute table margins and relative frequency

Table margins correspond to the sums of counts along rows or columns of the table. Relative frequencies express table entries as proportions of table margins (i.e., row or column totals).

The function margin.table() and prop.table() can be used to compute table margins and relative frequencies, respectively.

  1. Format of the functions:
margin.table(x, margin = NULL)

prop.table(x, margin = NULL)
  • x: table
  • margin: index number (1 for rows and 2 for columns)
  2. Compute table margins:
Hair <- hair_eye_col$Hair
Eye <- hair_eye_col$Eye
# Hair/Eye color table
he.tbl <- table(Hair, Eye)
he.tbl
       Eye
Hair    Brown Blue Hazel Green
  Black    68   20    15     5
  Brown   119   84    54    29
  Red      26   17    14    14
  Blond     7   94    10    16
# Margin of rows
margin.table(he.tbl, 1)
Hair
Black Brown   Red Blond 
  108   286    71   127 
# Margin of columns
margin.table(he.tbl, 2)
Eye
Brown  Blue Hazel Green 
  220   215    93    64 
  3. Compute relative frequencies:
# Frequencies relative to row total
prop.table(he.tbl, 1)
       Eye
Hair         Brown       Blue      Hazel      Green
  Black 0.62962963 0.18518519 0.13888889 0.04629630
  Brown 0.41608392 0.29370629 0.18881119 0.10139860
  Red   0.36619718 0.23943662 0.19718310 0.19718310
  Blond 0.05511811 0.74015748 0.07874016 0.12598425
# Table of percentages
round(prop.table(he.tbl, 1), 2)*100
       Eye
Hair    Brown Blue Hazel Green
  Black    63   19    14     5
  Brown    42   29    19    10
  Red      37   24    20    20
  Blond     6   74     8    13

To express the frequencies relative to the grand total, use this:

he.tbl/sum(he.tbl)
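As a sanity check, he.tbl/sum(he.tbl) is exactly what prop.table() computes when no margin is given, and the resulting proportions sum to 1 (rebuilding the table from the built-in HairEyeColor data):

```r
df <- as.data.frame(HairEyeColor)
hair_eye_col <- df[rep(row.names(df), df$Freq), 1:3]
he.tbl <- table(hair_eye_col$Hair, hair_eye_col$Eye)

all.equal(he.tbl / sum(he.tbl), prop.table(he.tbl))  # same result
sum(prop.table(he.tbl))                              # 1
```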

Infos

This analysis has been performed using R software (ver. 3.2.4).

QQ-plots: Quantile-Quantile plots - R Base Graphs



Previously, we described the essentials of R programming and provided quick start guides for importing data into R.


Here, we’ll describe how to create quantile-quantile plots in R. QQ plot (or quantile-quantile plot) draws the correlation between a given sample and the normal distribution. A 45-degree reference line is also plotted. QQ plots are used to visually check the normality of the data.


Preliminary tasks

  1. Launch RStudio as described here: Running RStudio and setting up your working directory

  2. Prepare your data as described here: Best practices for preparing your data and save it in an external .txt tab or .csv files

  3. Import your data into R as described here: Fast reading of data from txt|csv files into R: readr package.

Example data

Here, we’ll use the built-in R data set named ToothGrowth.

# Store the data in the variable my_data
my_data <- ToothGrowth

Create QQ plots

The R base functions qqnorm() and qqplot() can be used to produce quantile-quantile plots:

  • qqnorm(): produces a normal QQ plot of the variable
  • qqline(): adds a reference line
qqnorm(my_data$len, pch = 1, frame = FALSE)
qqline(my_data$len, col = "steelblue", lwd = 2)

It’s also possible to use the function qqPlot() [in car package]:

library("car")
qqPlot(my_data$len)

As all the points fall approximately along this reference line, we can assume normality.

Infos

This analysis has been performed using R statistical software (ver. 3.2.4).

Normality Test in R



Many statistical tests, including correlation, regression, t-test, and analysis of variance (ANOVA), assume certain characteristics about the data: they require the data to follow a normal (Gaussian) distribution. These tests are called parametric tests, because their validity depends on the distribution of the data.

Normality and the other assumptions made by these tests should be taken seriously to draw reliable interpretation and conclusions of the research.

Before using a parametric test, we should perform some preliminary tests to make sure that the test assumptions are met. In situations where the assumptions are violated, non-parametric tests are recommended.

Here, we’ll describe how to check the normality of the data by visual inspection and by significance tests.

Install required R packages

  1. dplyr for data manipulation
install.packages("dplyr")
  2. ggpubr for an easy ggplot2-based data visualization
  • Install the latest version from GitHub as follow:
# Install
if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/ggpubr")
  • Or, install from CRAN as follow:
install.packages("ggpubr")

Load required R packages

library("dplyr")
library("ggpubr")

Import your data into R

  1. Prepare your data as specified here: Best practices for preparing your data set for R

  2. Save your data in an external .txt tab or .csv files

  3. Import your data into R as follow:

# If .txt tab file, use this
my_data <- read.delim(file.choose())

# Or, if .csv file, use this
my_data <- read.csv(file.choose())

Here, we’ll use the built-in R data set named ToothGrowth.

# Store the data in the variable my_data
my_data <- ToothGrowth

Check your data

We start by displaying a random sample of 10 rows using the function sample_n() [in the dplyr package].

Show 10 random rows:

set.seed(1234)
dplyr::sample_n(my_data, 10)
    len supp dose
7  11.2   VC  0.5
37  8.2   OJ  0.5
36 10.0   OJ  0.5
58 27.3   OJ  2.0
49 14.5   OJ  1.0
57 26.4   OJ  2.0
1   4.2   VC  0.5
13 15.2   VC  1.0
35 14.5   OJ  0.5
27 26.7   VC  2.0

Assess the normality of the data in R

We want to test if the variable len (tooth length) is normally distributed.

Case of large sample sizes

If the sample size is large enough (n > 30), we can ignore the distribution of the data and use parametric tests.

The central limit theorem tells us that no matter what distribution things have, the sampling distribution tends to be normal if the sample is large enough (n > 30).

However, to be consistent, normality can be checked by visual inspection [normal plots (histograms), Q-Q plots (quantile-quantile plots)] or by significance tests.

Visual methods

Density plot and Q-Q plot can be used to check normality visually.

  1. Density plot: the density plot provides a visual judgment about whether the distribution is bell shaped.
library("ggpubr")
ggdensity(my_data$len, 
          main = "Density plot of tooth length",
          xlab = "Tooth length")

  2. Q-Q plot: Q-Q plot (or quantile-quantile plot) draws the correlation between a given sample and the normal distribution. A 45-degree reference line is also plotted.
library(ggpubr)
ggqqplot(my_data$len)

It’s also possible to use the function qqPlot() [in car package]:

library("car")
qqPlot(my_data$len)

As all the points fall approximately along this reference line, we can assume normality.

Normality test

Visual inspection, described in the previous section, is usually unreliable on its own. It’s possible to use a significance test comparing the sample distribution to a normal one in order to ascertain whether the data show a serious deviation from normality.

There are several methods for normality testing, such as the Kolmogorov-Smirnov (K-S) normality test and Shapiro-Wilk’s test.

The null hypothesis of these tests is that “sample distribution is normal”. If the test is significant, the distribution is non-normal.

Shapiro-Wilk’s method is widely recommended for normality testing, and it provides better power than K-S. It is based on the correlation between the data and the corresponding normal scores.

Note that normality tests are sensitive to sample size. Small samples most often pass normality tests. Therefore, it’s important to combine visual inspection and significance tests in order to make the right decision.
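This sample-size sensitivity is easy to demonstrate with clearly non-normal (exponential) data: a small sample will often pass the Shapiro-Wilk test, while a large sample from the same distribution fails it (a sketch; the exact p-values depend on the random seed):

```r
set.seed(1234)
small <- rexp(10)   # 10 draws from a skewed, non-normal distribution
large <- rexp(500)  # 500 draws from the same distribution

shapiro.test(small)$p.value  # often > 0.05: too few points to detect non-normality
shapiro.test(large)$p.value  # far below 0.05: deviation clearly detected
```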

The R function shapiro.test() can be used to perform the Shapiro-Wilk test of normality for one variable (univariate):

shapiro.test(my_data$len)

    Shapiro-Wilk normality test

data:  my_data$len
W = 0.96743, p-value = 0.1091

From the output, the p-value > 0.05 implies that the distribution of the data is not significantly different from a normal distribution. In other words, we can assume normality.

Infos

This analysis has been performed using R software (ver. 3.2.4).

Statistical Tests and Assumptions




Here we’ll describe research questions and the corresponding statistical tests, as well as, the test assumptions.

Statistical tests and assumptions


Research questions and corresponding statistical tests

The most popular research questions include:


  1. whether two variables (n = 2) are correlated (i.e., associated)
  2. whether multiple variables (n > 2) are correlated
  3. whether two groups (n = 2) of samples differ from each other
  4. whether multiple groups (n >= 2) of samples differ from each other
  5. whether the variability of two samples differ


Each of these questions can be answered using the following statistical tests:


  1. Correlation test between two variables
  2. Correlation matrix between multiple variables
  3. Comparing the means of two groups:
    • Student’s t-test (parametric)
    • Wilcoxon rank test (non-parametric)
  4. Comparing the means of more than two groups
    • ANOVA test (analysis of variance, parametric): extension of t-test to compare more than two groups.
    • Kruskal-Wallis rank sum test (non-parametric): extension of Wilcoxon rank test to compare more than two groups
  5. Comparing the variances:
    • Comparing the variances of two groups: F-test (parametric)
    • Comparison of the variances of more than two groups: Bartlett’s test (parametric), Levene’s test (parametric) and Fligner-Killeen test (non-parametric)


Statistical test requirements (assumptions)

Many of the statistical procedures, including correlation, regression, t-test, and analysis of variance, assume certain characteristics about the data. Generally, they assume that:

  • the data are normally distributed
  • and the variances of the groups to be compared are homogeneous (equal).

These assumptions should be taken seriously to draw reliable interpretation and conclusions of the research.

These tests - correlation, t-test and ANOVA - are called parametric tests, because their validity depends on the distribution of the data.

Before using a parametric test, we should perform some preliminary tests to make sure that the test assumptions are met. In situations where the assumptions are violated, non-parametric tests are recommended.

How to assess the normality of the data?

  1. With large enough sample sizes (n > 30) the violation of the normality assumption should not cause major problems (central limit theorem). This implies that we can ignore the distribution of the data and use parametric tests.

  2. However, to be consistent, we can use Shapiro-Wilk’s significance test comparing the sample distribution to a normal one in order to ascertain whether data show or not a serious deviation from normality.

How to assess the equality of variances?

The standard Student’s t-test (comparing two independent samples) and the ANOVA test (comparing multiple samples) assume also that the samples to be compared have equal variances.

If the samples, being compared, follow normal distribution, then it’s possible to use:

  • F-test to compare the variances of two samples
  • Bartlett’s Test or Levene’s Test to compare the variances of multiple samples.

Infos

This analysis has been performed using R software (ver. 3.2.4).

Correlation Test Between Two Variables in R



What is correlation test?


Correlation test is used to evaluate the association between two or more variables.


For instance, if we are interested in knowing whether there is a relationship between the heights of fathers and sons, a correlation coefficient can be calculated to answer this question.

If there is no relationship between the two variables (father and son heights), the average height of sons should be the same regardless of the height of the fathers, and vice versa.

Here, we’ll describe the different correlation methods and we’ll provide practical examples using R software.

Install and load required R packages

We’ll use the ggpubr R package for an easy ggplot2-based data visualization.

  • Install the latest version from GitHub as follow (recommended):
if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/ggpubr")
  • Or, install from CRAN as follow:
install.packages("ggpubr")
  • Load ggpubr as follow:
library("ggpubr")

Methods for correlation analyses

There are different methods to perform correlation analysis:

  • Pearson correlation (r), which measures a linear dependence between two variables (x and y). It’s also known as a parametric correlation test because it depends on the distribution of the data. It should be used only when x and y come from a normal distribution. The plot of y = f(x) is named the linear regression curve.

  • Kendall tau and Spearman rho, which are rank-based correlation coefficients (non-parametric)

The most commonly used method is the Pearson correlation method.

Correlation formula

In the formula below,

  • x and y are two vectors of length n
  • \(m_x\) and \(m_y\) correspond to the means of x and y, respectively.

Pearson correlation formula

\[ r = \frac{\sum{(x-m_x)(y-m_y)}}{\sqrt{\sum{(x-m_x)^2}\sum{(y-m_y)^2}}} \]


The p-value (significance level) of the correlation can be determined:

  1. by using the correlation coefficient table for the degrees of freedom \(df = n-2\), where \(n\) is the number of observations in the x and y variables.

  2. or by calculating the t value as follow:

\[ t = \frac{r}{\sqrt{1-r^2}}\sqrt{n-2} \]

In case 2), the corresponding p-value is determined using the t distribution table for \(df = n-2\).

If the p-value is < 5%, then the correlation between x and y is significant.
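These formulas can be verified against R’s built-in test: compute r with cor(), derive the t statistic and the two-sided p-value from the t distribution with n - 2 degrees of freedom, and compare with cor.test() (a quick check using the built-in mtcars data):

```r
x <- mtcars$mpg
y <- mtcars$wt
n <- length(x)

r <- cor(x, y)
t_stat <- r / sqrt(1 - r^2) * sqrt(n - 2)
p_val  <- 2 * pt(-abs(t_stat), df = n - 2)

res <- cor.test(x, y)
all.equal(unname(res$statistic), t_stat)  # TRUE
all.equal(res$p.value, p_val)             # TRUE
```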

Spearman correlation formula

The Spearman correlation method computes the correlation between the rank of x and the rank of y variables.

\[ rho = \frac{\sum{(x' - m_{x'})(y' - m_{y'})}}{\sqrt{\sum{(x' - m_{x'})^2}\sum{(y' - m_{y'})^2}}} \]

Where \(x' = rank(x)\) and \(y' = rank(y)\).
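In other words, Spearman’s rho is Pearson’s r computed on the ranks, which is easy to verify:

```r
x <- mtcars$mpg
y <- mtcars$wt

all.equal(cor(x, y, method = "spearman"),
          cor(rank(x), rank(y), method = "pearson"))  # TRUE
```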

Kendall correlation formula

The Kendall correlation method measures the correspondence between the ranking of x and y variables. The total number of possible pairings of x with y observations is \(n(n-1)/2\), where n is the size of x and y.

The procedure is as follow:

  • Begin by ordering the pairs by the x values. If x and y are correlated, then they would have the same relative rank orders.

  • Now, for each pair \((i, j)\) with \(i < j\), count the number of pairs with \(y_j > y_i\) (concordant pairs, c) and the number with \(y_j < y_i\) (discordant pairs, d).

Kendall correlation distance is defined as follow:

\[ tau = \frac{n_c - n_d}{\frac{1}{2}n(n-1)} \]

Where,

  • \(n_c\): total number of concordant pairs
  • \(n_d\): total number of discordant pairs
  • \(n\): size of x and y
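The formula can be checked by brute force on a small, tie-free pair of vectors, counting concordant and discordant pairs directly:

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 5)
n <- length(x)

nc <- 0; nd <- 0
for (i in 1:(n - 1)) {
  for (j in (i + 1):n) {
    s <- sign(x[j] - x[i]) * sign(y[j] - y[i])
    if (s > 0) nc <- nc + 1  # concordant pair
    if (s < 0) nd <- nd + 1  # discordant pair
  }
}

tau_manual <- (nc - nd) / (n * (n - 1) / 2)
tau_manual                                            # 0.6
all.equal(tau_manual, cor(x, y, method = "kendall"))  # TRUE
```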

Compute correlation in R

R functions

Correlation coefficient can be computed using the functions cor() or cor.test():


  • cor() computes the correlation coefficient
  • cor.test() tests for association/correlation between paired samples. It returns both the correlation coefficient and the significance level (or p-value) of the correlation.


The simplified formats are:

cor(x, y, method = c("pearson", "kendall", "spearman"))
cor.test(x, y, method=c("pearson", "kendall", "spearman"))

  • x, y: numeric vectors with the same length
  • method: correlation method


If your data contain missing values, use the following R code to handle them by case-wise deletion:

cor(x, y,  method = "pearson", use = "complete.obs")
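For instance, with the toy vectors below (our own example), only the positions where both vectors are observed enter the computation:

```r
# Case-wise deletion sketch: positions with an NA in either vector are dropped
x <- c(1, 2, NA, 4, 5)
y <- c(2, 4, 6, NA, 10)
# The complete pairs are (1, 2), (2, 4) and (5, 10), which lie exactly on
# the line y = 2x, so the Pearson correlation of the remaining pairs is 1
cor(x, y, method = "pearson", use = "complete.obs")
```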

Import your data into R

  1. Prepare your data as specified here: [url=/wiki/best-practices-for-preparing-your-data-set-for-r]Best practices for preparing your data set for R[/url]

  2. Save your data in an external .txt tab or .csv files

  3. Import your data into R as follow:

# If .txt tab file, use this
my_data <- read.delim(file.choose())
# Or, if .csv file, use this
my_data <- read.csv(file.choose())

Here, we’ll use the built-in R data set mtcars as an example.

The R code below computes the correlation between mpg and wt variables in mtcars data set:

my_data <- mtcars
head(my_data, 6)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

We want to compute the correlation between mpg and wt variables.

Visualize your data using scatter plots

To use R base graphs, click this link: [url=/wiki/scatter-plots-r-base-graphs]scatter plot - R base graphs[/url]. Here, we’ll use the [url=/wiki/ggpubr-r-package-ggplot2-based-publication-ready-plots]ggpubr R package[/url].

library("ggpubr")
ggscatter(my_data, x = "mpg", y = "wt", 
          add = "reg.line", conf.int = TRUE, 
          cor.coef = TRUE, cor.method = "pearson",
          xlab = "Miles/(US) gallon", ylab = "Weight (1000 lbs)")
(Scatter plot of mpg vs. wt with regression line, confidence interval and Pearson correlation coefficient)

Preliminary test to check the test assumptions

  1. Is the covariation linear? Yes, from the plot above, the relationship is linear. In the situation where the scatter plots show curved patterns, we are dealing with a nonlinear association between the two variables.

  2. Do the data from each of the two variables (x, y) follow a normal distribution?
    • Use the Shapiro-Wilk normality test –> R function: shapiro.test()
    • and look at the normality plot –> R function: ggpubr::ggqqplot()

The Shapiro-Wilk test can be performed as follows:
    • Null hypothesis: the data are normally distributed
    • Alternative hypothesis: the data are not normally distributed
# Shapiro-Wilk normality test for mpg
shapiro.test(my_data$mpg) # => p = 0.1229
# Shapiro-Wilk normality test for wt
shapiro.test(my_data$wt) # => p = 0.09

From the output, the two p-values are greater than the significance level 0.05, implying that the distributions of the data are not significantly different from the normal distribution. In other words, we can assume normality.

  • Visual inspection of the data normality using Q-Q plots (quantile-quantile plots). A Q-Q plot draws the correlation between a given sample and the normal distribution.
library("ggpubr")
# mpg
ggqqplot(my_data$mpg, ylab = "MPG")
# wt
ggqqplot(my_data$wt, ylab = "WT")
(Q-Q plots of mpg and wt)

From the normality plots, we conclude that both populations may come from normal distributions.

Note that, if the data are not normally distributed, it’s recommended to use the non-parametric correlation, including Spearman and Kendall rank-based correlation tests.

Pearson correlation test

Correlation test between mpg and wt variables:

res <- cor.test(my_data$wt, my_data$mpg, 
                    method = "pearson")
res

    Pearson's product-moment correlation
data:  my_data$wt and my_data$mpg
t = -9.559, df = 30, p-value = 1.294e-10
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.9338264 -0.7440872
sample estimates:
       cor 
-0.8676594 

In the result above :

  • t is the t-test statistic value (t = -9.559),
  • df is the degrees of freedom (df = 30),
  • p-value is the significance level of the t-test (p-value = \(1.294 \times 10^{-10}\)),
  • conf.int is the 95% confidence interval of the correlation coefficient (conf.int = [-0.9338, -0.7441]),
  • sample estimates is the correlation coefficient (Cor.coeff = -0.87).


Interpretation of the result

The p-value of the test is \(1.294 \times 10^{-10}\), which is less than the significance level alpha = 0.05. We can conclude that wt and mpg are significantly correlated, with a correlation coefficient of -0.87 and a p-value of \(1.294 \times 10^{-10}\).

Access to the values returned by cor.test() function

The function cor.test() returns a list containing the following components:

  • p.value: the p-value of the test
  • estimate: the correlation coefficient
# Extract the p.value
res$p.value
[1] 1.293959e-10
# Extract the correlation coefficient
res$estimate
       cor 
-0.8676594 
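Beyond p.value and estimate, the list returned by cor.test() contains other useful components; for example (a quick sketch):

```r
# Other components of the list returned by cor.test() (sketch)
res <- cor.test(mtcars$wt, mtcars$mpg, method = "pearson")
res$statistic  # the t-test statistic value
res$parameter  # the degrees of freedom
res$conf.int   # the 95% confidence interval of the correlation coefficient
```

See ?cor.test for the full list of returned components.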

Kendall rank correlation test

The Kendall rank correlation coefficient or Kendall’s tau statistic is used to estimate a rank-based measure of association. This test may be used if the data do not necessarily come from a bivariate normal distribution.

res2 <- cor.test(my_data$wt, my_data$mpg,  method="kendall")
res2

    Kendall's rank correlation tau
data:  my_data$wt and my_data$mpg
z = -5.7981, p-value = 6.706e-09
alternative hypothesis: true tau is not equal to 0
sample estimates:
       tau 
-0.7278321 

tau is the Kendall correlation coefficient.

The correlation coefficient between x and y is -0.7278 and the p-value is \(6.706 \times 10^{-9}\).

Spearman rank correlation coefficient

Spearman’s rho statistic is also used to estimate a rank-based measure of association. This test may be used if the data do not come from a bivariate normal distribution.

res2 <-cor.test(my_data$wt, my_data$mpg,  method = "spearman")
res2

    Spearman's rank correlation rho
data:  my_data$wt and my_data$mpg
S = 10292, p-value = 1.488e-11
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
-0.886422 

rho is the Spearman’s correlation coefficient.

The correlation coefficient between x and y is -0.8864 and the p-value is \(1.488 \times 10^{-11}\).

Interpret correlation coefficient

The correlation coefficient ranges from -1 to 1:


  • -1 indicates a strong negative correlation: every time x increases, y decreases (left panel figure)
  • 0 means that there is no association between the two variables (x and y) (middle panel figure)
  • 1 indicates a strong positive correlation: y increases with x (right panel figure)


(Example scatter plots illustrating correlation coefficients of -1, 0 and 1)
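These three situations can be reproduced with simulated data (our own sketch; the exact values depend on the random noise):

```r
# Simulated examples of negative, absent and positive correlation (sketch)
set.seed(123)
x <- rnorm(100)
cors <- c(negative = cor(x, -x + rnorm(100, sd = 0.2)),
          none     = cor(x,  rnorm(100)),
          positive = cor(x,  x + rnorm(100, sd = 0.2)))
round(cors, 2)  # close to -1, 0 and 1 respectively
```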

Online correlation coefficient calculator

You can compute the correlation test between two variables online, without any installation, by clicking the following link:



Summary


  • Use the function cor.test(x, y) to compute the correlation coefficient between two variables and to get the significance level of the correlation.
  • Three correlation methods are available in cor.test(x, y): pearson, kendall and spearman.


Infos

This analysis has been performed using R software (ver. 3.2.4).
