
ggplot2 axis scales and transformations



This R tutorial describes how to modify x and y axis limits (minimum and maximum values) using the ggplot2 package. Axis transformations (log scale, sqrt, …) and date axes are also covered in this article.


Prepare the data

ToothGrowth data is used in the following examples :

# Convert the dose column from a numeric to a factor variable
ToothGrowth$dose <- as.factor(ToothGrowth$dose)
head(ToothGrowth)
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5

Make sure that the dose column is converted to a factor using the R script above.

Example of plots

library(ggplot2)
# Box plot 
bp <- ggplot(ToothGrowth, aes(x=dose, y=len)) + geom_boxplot()
bp
# scatter plot
sp<-ggplot(cars, aes(x = speed, y = dist)) + geom_point()
sp

Change x and y axis limits

There are different functions to set axis limits :

  • xlim() and ylim()
  • expand_limits()
  • scale_x_continuous() and scale_y_continuous()

Use xlim() and ylim() functions

To change the range of a continuous axis, the functions xlim() and ylim() can be used as follows :

# x axis limits
sp + xlim(min, max)
# y axis limits
sp + ylim(min, max)

min and max are the minimum and the maximum values of each axis.

# Box plot : change y axis range
bp + ylim(0,50)
# scatter plots : change x and y limits
sp + xlim(5, 40)+ylim(0, 150)
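
One caveat is worth illustrating : xlim() and ylim() treat observations outside the range as missing before any statistic is computed, whereas coord_cartesian() only zooms the view. The sketch below compares the two on the cars data. The range 10 to 20 and the fitted line are arbitrary choices made for illustration; with them, the first fit uses fewer points than the second.

```r
library(ggplot2)

sp <- ggplot(cars, aes(x = speed, y = dist)) + geom_point()

# xlim() treats points outside [10, 20] as missing, so the
# regression line is fitted on the remaining points only
p_clip <- sp + xlim(10, 20) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# coord_cartesian() zooms the view but keeps all points in the fit
p_zoom <- sp + coord_cartesian(xlim = c(10, 20)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Number of observations that fall inside the chosen range
n_inside <- sum(cars$speed >= 10 & cars$speed <= 20)
n_inside
```

Printing p_clip emits a warning about removed rows, which is a useful hint that data were dropped rather than hidden.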

Use expand_limits() function

Note that the function expand_limits() only extends the axis range so that it includes the given values; it never shrinks it. It can be used to :

  • quickly make the x and y axes include the origin (0,0)
  • extend the limits of the x and y axes

# make sure that the x and y axes include (0,0)
sp + expand_limits(x=0, y=0)
# change the axis limits
sp + expand_limits(x=c(0,30), y=c(0, 150))
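
The behaviour of expand_limits() can be checked by reading the trained limits back from the plot. This is a minimal sketch : the helper x_limits() is a name introduced here for illustration, and it relies on layer_scales() to return the x scale of the built plot.

```r
library(ggplot2)

sp <- ggplot(cars, aes(x = speed, y = dist)) + geom_point()

# Trained x-axis limits of a plot (before the default padding is added)
x_limits <- function(p) layer_scales(p)$x$get_limits()

x_limits(sp)                          # range of the data
x_limits(sp + expand_limits(x = 0))   # extended to include 0
x_limits(sp + expand_limits(x = 10))  # unchanged : 10 is already inside
```

To cut an axis down to a narrower range, use xlim(), ylim() or the limits argument of the scale functions instead.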

Use scale_xx() functions

It is also possible to use the functions scale_x_continuous() and scale_y_continuous() to change x and y axis limits, respectively.

The simplified formats of the functions are :

scale_x_continuous(name, breaks, labels, limits, trans)
scale_y_continuous(name, breaks, labels, limits, trans)

  • name : x or y axis title
  • breaks : to control the breaks in the guide (axis ticks, grid lines, …). Among the possible values, there are :
    • NULL : hide all breaks
    • waiver() : the default break computation
    • a character or numeric vector specifying the breaks to display
  • labels : labels of axis tick marks. Allowed values are :
    • NULL for no labels
    • waiver() for the default labels
    • character vector to be used for break labels
  • limits : a numeric vector specifying x or y axis limits (min, max)
  • trans : axis transformation. Possible values are “log2”, “log10”, “sqrt”, …


The functions scale_x_continuous() and scale_y_continuous() can be used as follows :

# Change x and y axis labels, and limits
sp + scale_x_continuous(name="Speed of cars", limits=c(0, 30)) +
  scale_y_continuous(name="Stopping distance", limits=c(0, 150))
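
The breaks and labels arguments described above can be combined with name and limits. The sketch below places a tick every 5 units on the x axis and appends a unit to each label; the spacing and the "mph" suffix are illustrative choices (speed in the cars data is recorded in miles per hour).

```r
library(ggplot2)

sp <- ggplot(cars, aes(x = speed, y = dist)) + geom_point()

# Tick positions every 5 units, with a unit suffix on each label
x_breaks <- seq(0, 30, by = 5)
x_labels <- paste(x_breaks, "mph")

p <- sp + scale_x_continuous(name = "Speed of cars",
                             breaks = x_breaks,
                             labels = x_labels,
                             limits = c(0, 30))
p
```

The labels vector must have the same length as the breaks vector.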

Axis transformations

Log and sqrt transformations

Built-in functions for axis transformations are :

  • scale_x_log10(), scale_y_log10() : for log10 transformation
  • scale_x_sqrt(), scale_y_sqrt() : for sqrt transformation
  • scale_x_reverse(), scale_y_reverse() : to reverse coordinates
  • coord_trans(x = "log10", y = "log10") : possible values for x and y are "log2", "log10", "sqrt", …
  • scale_x_continuous(trans = "log2"), scale_y_continuous(trans = "log2") : another allowed value for the argument trans is "log10"

These functions can be used as follows :

# Default scatter plot
sp <- ggplot(cars, aes(x = speed, y = dist)) + geom_point()
sp
# Log transformation using scale_xx()
# possible values for trans : 'log2', 'log10','sqrt'
sp + scale_x_continuous(trans='log2') +
  scale_y_continuous(trans='log2')
# Sqrt transformation
sp + scale_y_sqrt()
# Reverse coordinates
sp + scale_y_reverse() 

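The function coord_trans() can also be used for axis transformations. Unlike the scale functions, which transform the data before statistics are computed, coord_trans() applies the transformation afterwards, when the plot is drawn :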

# Possible values for x and y : "log2", "log10", "sqrt", ...
sp + coord_trans(x="log2", y="log2")
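
The examples in this section use the trans argument. As far as I know, ggplot2 3.5.0 renamed it to transform and deprecated the old name. The sketch below is one hedged way to write code that runs on both old and new versions; the version threshold is an assumption and should be checked against the release notes of your installation.

```r
library(ggplot2)

sp <- ggplot(cars, aes(x = speed, y = dist)) + geom_point()

# Pick the argument name that matches the installed ggplot2 version
# (assumption : the rename happened in version 3.5.0)
arg_name <- if (utils::packageVersion("ggplot2") >= "3.5.0") "transform" else "trans"

# Build scale_y_continuous(<arg_name> = "log2") programmatically
log2_y <- do.call(scale_y_continuous, setNames(list("log2"), arg_name))

sp + log2_y
```

In a script that targets a single ggplot2 version, writing the argument name directly is simpler.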

Format axis tick mark labels

Axis tick marks can be set to show exponents. The scales package is required to access break formatting functions.

# Log2 scaling of the y axis (with visually-equal spacing)
library(scales)
sp + scale_y_continuous(trans = log2_trans())
# show exponents
sp + scale_y_continuous(trans = log2_trans(),
    breaks = trans_breaks("log2", function(x) 2^x),
    labels = trans_format("log2", math_format(2^.x)))

Note that many transformation functions are available using the scales package : log10_trans(), sqrt_trans(), etc. Use help(trans_new) for a full list.

Format axis tick mark labels :

library(scales)
# Percent
sp + scale_y_continuous(labels = percent)
# dollar
sp + scale_y_continuous(labels = dollar)
# scientific
sp + scale_y_continuous(labels = scientific)
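
The formatters passed to labels are ordinary functions from the scales package, so their effect can be previewed on a few values before using them in a plot. Note that percent() multiplies its input by 100, so it is best suited to proportions; the input values below are arbitrary.

```r
library(scales)

# The formatters are plain functions, so they can be tried on a few values
pct <- percent(0.25)       # treats the input as a proportion
usd <- dollar(1500)
sci <- scientific(123456)
c(pct, usd, sci)
```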

Display log tick marks

It is possible to add log tick marks using the function annotation_logticks().

Note that these tick marks make sense only for base 10.

The Animals data set, from the package MASS, is used :

library(MASS)
head(Animals)
##                     body brain
## Mountain beaver     1.35   8.1
## Cow               465.00 423.0
## Grey wolf          36.33 119.5
## Goat               27.66 115.0
## Guinea pig          1.04   5.5
## Dipliodocus     11700.00  50.0

The function annotation_logticks() can be used as follows :

library(MASS) # to access Animals data sets
library(scales) # to access break formatting functions
# x and y axis are transformed and formatted
p2 <- ggplot(Animals, aes(x = body, y = brain)) + geom_point() +
     scale_x_log10(breaks = trans_breaks("log10", function(x) 10^x),
              labels = trans_format("log10", math_format(10^.x))) +
     scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
              labels = trans_format("log10", math_format(10^.x))) +
     theme_bw()
# log-log plot without log tick marks
p2
# Show log tick marks
p2 + annotation_logticks()  

Note that, default log ticks are on bottom and left.

To specify the sides of the log ticks :

# Log ticks on left and right
p2 + annotation_logticks(sides="lr")
# All sides
p2+annotation_logticks(sides="trbl")

Allowed values for the argument sides are :

  • t : for top
  • r : for right
  • b : for bottom
  • l : for left
  • the combination of t, r, b and l

Format date axes

The functions scale_x_date() and scale_y_date() are used.

Example of data

Create some time series data

df <- data.frame(
  date = seq(Sys.Date(), len=100, by="1 day")[sample(100, 50)],
  price = runif(50)
)
df <- df[order(df$date), ]
head(df)
##          date      price
## 33 2016-09-21 0.07245190
## 3  2016-09-23 0.51772443
## 23 2016-09-25 0.05758921
## 43 2016-09-26 0.99389551
## 45 2016-09-27 0.94858770
## 29 2016-09-28 0.82420890

Plot with dates

# Plot with date
dp <- ggplot(data=df, aes(x=date, y=price)) + geom_line()
dp

Format axis tick mark labels

Load the package scales to access break formatting functions.

library(scales)
# Format : month/day
dp + scale_x_date(labels = date_format("%m/%d")) +
  theme(axis.text.x = element_text(angle=45))
# Format : Week
dp + scale_x_date(labels = date_format("%W"))
# Months only
dp + scale_x_date(breaks = date_breaks("months"),
  labels = date_format("%b"))

Note that, since ggplot2 v2.0.0, date and datetime scales now have date_breaks, date_minor_breaks and date_labels arguments so that you never need to use the long scales::date_breaks() or scales::date_format().
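
The date_breaks and date_labels arguments mentioned in the note can be used as sketched below. The data are simulated for illustration (the start date and the seed are arbitrary), and no function from the scales package is needed.

```r
library(ggplot2)

# Simulated daily prices (set.seed() only makes the example reproducible)
set.seed(123)
df <- data.frame(
  date  = seq(as.Date("2016-09-01"), by = "1 day", length.out = 100),
  price = runif(100)
)
dp <- ggplot(df, aes(x = date, y = price)) + geom_line()

# One break per month, labelled with the abbreviated month name
dp + scale_x_date(date_breaks = "1 month", date_labels = "%b")

# One break every two weeks, labelled as month/day
dp + scale_x_date(date_breaks = "2 weeks", date_labels = "%m/%d")
```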

Date axis limits

US economic time series data sets (from ggplot2 package) are used :

head(economics)
##         date   pce    pop psavert uempmed unemploy
## 1 1967-07-01 507.4 198712    12.5     4.5     2944
## 2 1967-08-01 510.5 198911    12.5     4.7     2945
## 3 1967-09-01 516.3 199113    11.7     4.6     2958
## 4 1967-10-01 512.9 199311    12.5     4.9     3143
## 5 1967-11-01 518.1 199498    12.5     4.7     3066
## 6 1967-12-01 525.8 199657    12.1     4.8     3018

Create the plot of psavert by date :

  • date : Month of data collection
  • psavert : personal savings rate

# Plot with dates
dp <- ggplot(data=economics, aes(x=date, y=psavert)) + geom_line()
dp
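# Axis limits c(min, max)
# (named min_date and max_date to avoid masking the functions min() and max())
min_date <- as.Date("2002-1-1")
max_date <- max(economics$date)
dp + scale_x_date(limits = c(min_date, max_date))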

Go further

See also the functions scale_x_datetime() and scale_y_datetime() to plot data containing both dates and times.
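
A minimal sketch of a date-time axis is given below. The hourly data are hypothetical and generated only for illustration; the break interval and the label format are arbitrary choices.

```r
library(ggplot2)

# Hypothetical hourly measurements over two days
set.seed(123)
dt <- data.frame(
  time  = seq(as.POSIXct("2016-09-21 00:00:00", tz = "UTC"),
              by = "1 hour", length.out = 48),
  value = cumsum(rnorm(48))
)

# One break every 6 hours, labelled as day, month and time
ggplot(dt, aes(x = time, y = value)) + geom_line() +
  scale_x_datetime(date_breaks = "6 hours", date_labels = "%d %b %H:%M") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

The column mapped to x must be of class POSIXct for scale_x_datetime() to apply.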

Infos

This analysis has been performed using R software (ver. 3.2.4) and ggplot2 (ver. )


ggplot2 line types : How to change line types of a graph in R software?



This R tutorial describes how to change line types of a graph generated using the ggplot2 package.


Line types in R

The different line types available in R software are : “blank”, “solid”, “dashed”, “dotted”, “dotdash”, “longdash”, “twodash”.

Note that line types can also be specified using numbers : 0, 1, 2, 3, 4, 5, 6. 0 is for “blank”, 1 is for “solid”, 2 is for “dashed”, and so on, in the order listed above.

A graph of the different line types is shown below :

[Figure : the line types available in R]
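
A chart of the line types can be drawn with a few lines of code. This is a sketch rather than the code behind the original figure : the data frame lt and its column names are introduced here for illustration.

```r
library(ggplot2)

# One row per line type, in the order of the numeric codes 0 to 6
lt <- data.frame(
  code = 0:6,
  name = c("blank", "solid", "dashed", "dotted",
           "dotdash", "longdash", "twodash"),
  stringsAsFactors = FALSE
)
lt$label <- paste0(lt$code, " : ", lt$name)

# Draw one horizontal segment per line type;
# scale_linetype_identity() uses the names in the data as they are
ggplot(lt) +
  geom_segment(aes(x = 0, xend = 1, y = label, yend = label,
                   linetype = name)) +
  scale_linetype_identity() +
  labs(x = NULL, y = NULL) +
  theme_minimal()
```

The row for “blank” stays empty, since that line type draws nothing.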

Basic line plots

Generate some data

df <- data.frame(time=c("Breakfast", "Lunch", "Dinner"),
                bill=c(10, 30, 15))
head(df)
##        time bill
## 1 Breakfast   10
## 2     Lunch   30
## 3    Dinner   15

Create line plots and change line types

The argument linetype is used to change the line type :

library(ggplot2)
# Basic line plot with points
ggplot(data=df, aes(x=time, y=bill, group=1)) +
  geom_line()+
  geom_point()
# Change the line type
ggplot(data=df, aes(x=time, y=bill, group=1)) +
  geom_line(linetype = "dashed")+
  geom_point()


Line plot with multiple groups

Create some data

df2 <- data.frame(sex = rep(c("Female", "Male"), each=3),
                  time=c("Breakfast", "Lunch", "Dinner"),
                  bill=c(10, 30, 15, 13, 40, 17) )
head(df2)
##      sex      time bill
## 1 Female Breakfast   10
## 2 Female     Lunch   30
## 3 Female    Dinner   15
## 4   Male Breakfast   13
## 5   Male     Lunch   40
## 6   Male    Dinner   17

Change globally the appearance of lines

In the graphs below, line types, colors and sizes are the same for the two groups :

library(ggplot2)
# Line plot with multiple groups
ggplot(data=df2, aes(x=time, y=bill, group=sex)) +
  geom_line()+
  geom_point()
# Change line types
ggplot(data=df2, aes(x=time, y=bill, group=sex)) +
  geom_line(linetype="dashed")+
  geom_point()
# Change line colors and sizes
ggplot(data=df2, aes(x=time, y=bill, group=sex)) +
  geom_line(linetype="dotted", color="red", size=2)+
  geom_point(color="blue", size=3)
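
The examples in this article set line thickness with size, which matches the ggplot2 version they were written for. As far as I know, ggplot2 3.4.0 introduced a separate linewidth argument for lines and deprecated size for that purpose. The sketch below picks the argument name from the installed version; the threshold is an assumption to verify against your release notes, and the data frame is rebuilt so that the block is self-contained.

```r
library(ggplot2)

df2 <- data.frame(sex = rep(c("Female", "Male"), each = 3),
                  time = rep(c("Breakfast", "Lunch", "Dinner"), 2),
                  bill = c(10, 30, 15, 13, 40, 17))

# Name of the line-thickness argument for the installed version
# (assumption : linewidth is available from ggplot2 3.4.0 onwards)
width_arg <- if (utils::packageVersion("ggplot2") >= "3.4.0") "linewidth" else "size"

# Equivalent to geom_line(linetype = "dotted", color = "red", <width_arg> = 2)
line_layer <- do.call(geom_line,
                      c(list(linetype = "dotted", color = "red"),
                        setNames(list(2), width_arg)))

ggplot(df2, aes(x = time, y = bill, group = sex)) +
  line_layer +
  geom_point(color = "blue", size = 3)
```

The size argument of geom_point() is unaffected : it still controls the point size.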


Change automatically the line types by groups

In the graphs below, line types and colors are changed automatically by the levels of the variable sex :

# Change line types by groups (sex)
ggplot(df2, aes(x=time, y=bill, group=sex)) +
  geom_line(aes(linetype=sex))+
  geom_point()+
  theme(legend.position="top")
# Change line types + colors
ggplot(df2, aes(x=time, y=bill, group=sex)) +
  geom_line(aes(linetype=sex, color=sex))+
  geom_point(aes(color=sex))+
  theme(legend.position="top")


Change manually the appearance of lines

The functions below can be used :

  • scale_linetype_manual() : to change line types
  • scale_color_manual() : to change line colors
  • scale_size_manual() : to change the size of lines

# Set line types manually
ggplot(df2, aes(x=time, y=bill, group=sex)) +
  geom_line(aes(linetype=sex))+
  geom_point()+
  scale_linetype_manual(values=c("twodash", "dotted"))+
  theme(legend.position="top")
# Change line colors and sizes
ggplot(df2, aes(x=time, y=bill, group=sex)) +
  geom_line(aes(linetype=sex, color=sex, size=sex))+
  geom_point()+
  scale_linetype_manual(values=c("twodash", "dotted"))+
  scale_color_manual(values=c('#999999','#E69F00'))+
  scale_size_manual(values=c(1, 1.5))+
  theme(legend.position="top")


Infos

This analysis has been performed using R software (ver. 3.1.2) and ggplot2 (ver. 1.0.0)

ggplot2 add straight lines to a plot : horizontal, vertical and regression lines



This tutorial describes how to add one or more straight lines to a graph generated using R software and the ggplot2 package.

The R functions below can be used :

  • geom_hline() for horizontal lines
  • geom_abline() for regression lines
  • geom_vline() for vertical lines
  • geom_segment() to add segments

geom_hline : Add horizontal lines

A simplified format of the function geom_hline() is :

geom_hline(yintercept, linetype, color, size)

It draws a horizontal line on the current plot at the specified ‘y’ coordinates :

library(ggplot2)
# Simple scatter plot
sp <- ggplot(data=mtcars, aes(x=wt, y=mpg)) + geom_point()
# Add horizontal line at y = 20
sp + geom_hline(yintercept=20)
# Change line type and color
sp + geom_hline(yintercept=20, linetype="dashed", color = "red")
# Change line size
sp + geom_hline(yintercept=20, linetype="dashed", 
                color = "red", size=2)
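
More than one horizontal line can be drawn in a single call, and the position can be computed from the data. In the sketch below, the intercepts 15, 20 and 25 are arbitrary values chosen for illustration.

```r
library(ggplot2)

sp <- ggplot(data = mtcars, aes(x = wt, y = mpg)) + geom_point()

# Several horizontal lines at once : pass a vector of intercepts
sp + geom_hline(yintercept = c(15, 20, 25), linetype = "dotted")

# A reference line at the mean of the plotted variable
mean_mpg <- mean(mtcars$mpg)
sp + geom_hline(yintercept = mean_mpg, linetype = "dashed", color = "red")
```

The same approach works with geom_vline() and its xintercept argument.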


Read more on line types here : Line types in R

geom_vline : Add vertical lines

A simplified format of the function geom_vline() is :

geom_vline(xintercept, linetype, color, size)

It draws a vertical line on the current plot at the specified ‘x’ coordinates :

library(ggplot2)
# Add a vertical line at x = 3
sp + geom_vline(xintercept = 3)
# Change line type, color and size
sp + geom_vline(xintercept = 3, linetype="dotted", 
                color = "blue", size=1.5)


geom_abline : Add regression lines

A simplified format of the function geom_abline() is :

geom_abline(intercept, slope, linetype, color, size)

The function lm() is used to fit linear models.

# Fit regression line
require(stats)
reg<-lm(mpg ~ wt, data = mtcars)
reg
## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Coefficients:
## (Intercept)           wt  
##      37.285       -5.344
coeff <- coefficients(reg)
# Equation of the line : 
eq <- paste0("y = ", round(coeff[2],1), "*x + ", round(coeff[1],1))
# Plot the fitted line using the estimated intercept and slope
sp + geom_abline(intercept = coeff[[1]], slope = coeff[[2]])+
  ggtitle(eq)
# Change line type, color and size
sp + geom_abline(intercept = coeff[[1]], slope = coeff[[2]], color="red", 
                 linetype="dashed", size=1.5)+
  ggtitle(eq)


Note that, the function stat_smooth() can be used for fitting smooth models to data.

sp + stat_smooth(method="lm", se=FALSE)


geom_segment : Add a line segment

A simplified format of the function geom_segment() is :

geom_segment(aes(x, y, xend, yend))

It’s possible to use it as follows :

# Add a vertical line segment
sp + geom_segment(aes(x = 4, y = 15, xend = 4, yend = 27))
# Add horizontal line segment
sp + geom_segment(aes(x = 2, y = 15, xend = 3, yend = 15))


Note that you can add an arrow at the end of the segment. The grid package is required :

library(grid)
sp + geom_segment(aes(x = 5, y = 30, xend = 3.5, yend = 25),
                  arrow = arrow(length = unit(0.5, "cm")))
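
When the coordinates of a segment are constants placed inside aes(), they are recycled for every row of the plot data, so the same segment is drawn once per observation. A hedged alternative for a single annotation is annotate(), sketched below; the object names p_aes and p_annot are introduced here for illustration.

```r
library(ggplot2)

sp <- ggplot(data = mtcars, aes(x = wt, y = mpg)) + geom_point()

# Constants inside aes() are recycled for every row of mtcars
p_aes <- sp + geom_segment(aes(x = 4, y = 15, xend = 4, yend = 27))

# annotate() draws the segment once, independently of the plot data
p_annot <- sp + annotate("segment", x = 4, y = 15, xend = 4, yend = 27)

# Number of segments drawn by the second layer of each plot
n_aes   <- nrow(suppressWarnings(layer_data(p_aes, 2)))
n_annot <- nrow(layer_data(p_annot, 2))
c(n_aes, n_annot)
```

The two plots look the same with solid opaque lines; the difference becomes visible with transparency (alpha) and matters for file size in vector formats.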


Infos

This analysis has been performed using R software (ver. 3.1.2) and ggplot2 (ver. )

ggplot2 point shapes



This R tutorial describes how to change the point shapes of a graph generated using R software and the ggplot2 package.


Point shapes in R

The different point shapes commonly used in R are illustrated in the figure below :

[Figure : point shapes available in R]

Create some data

mtcars data is used in the following examples.

df <- mtcars[, c("mpg", "cyl", "wt")]
df$cyl <- as.factor(df$cyl)
head(df)
##                    mpg cyl    wt
## Mazda RX4         21.0   6 2.620
## Mazda RX4 Wag     21.0   6 2.875
## Datsun 710        22.8   4 2.320
## Hornet 4 Drive    21.4   6 3.215
## Hornet Sportabout 18.7   8 3.440
## Valiant           18.1   6 3.460

Make sure to convert the column cyl from a numeric to a factor variable.

Basic scatter plots

Create a scatter plot and change point shapes using the argument shape :

library(ggplot2)
# Basic scatter plot
ggplot(df, aes(x=wt, y=mpg)) +
  geom_point()
# Change the point shape
ggplot(df, aes(x=wt, y=mpg)) +
  geom_point(shape=18)
# change shape, color, fill, size
ggplot(df, aes(x=wt, y=mpg)) +
  geom_point(shape=23, fill="blue", color="darkred", size=3)


Note that the argument fill has an effect only for the point shapes 21 to 25, which have both a border (color) and an interior (fill).
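
The difference can be seen by drawing the same plot with a fillable and a non-fillable shape. The colors below are arbitrary choices for illustration.

```r
library(ggplot2)

df <- mtcars[, c("mpg", "cyl", "wt")]

# Shape 21 is a circle with a border (color) and an interior (fill)
ggplot(df, aes(x = wt, y = mpg)) +
  geom_point(shape = 21, fill = "lightblue", color = "darkred", size = 3)

# Shape 16 is a solid circle : only color applies, fill is ignored
ggplot(df, aes(x = wt, y = mpg)) +
  geom_point(shape = 16, fill = "lightblue", color = "darkred", size = 3)
```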

Scatter plots with multiple groups

Change the point shapes, colors and sizes automatically

In the R code below, point shapes, colors and sizes are controlled automatically by the variable cyl :

library(ggplot2)
# Scatter plot with multiple groups
# shape depends on cyl
ggplot(df, aes(x=wt, y=mpg, group=cyl)) +
  geom_point(aes(shape=cyl))
# Change point shapes and colors
ggplot(df, aes(x=wt, y=mpg, group=cyl)) +
  geom_point(aes(shape=cyl, color=cyl))
# change point shapes,  colors and sizes
ggplot(df, aes(x=wt, y=mpg, group=cyl)) +
  geom_point(aes(shape=cyl, color=cyl, size=cyl))


Change point shapes, colors and sizes manually :

The functions below can be used :

  • scale_shape_manual() : to change point shapes
  • scale_color_manual() : to change point colors
  • scale_size_manual() : to change the size of points

# Change colors and shapes manually
ggplot(df, aes(x=wt, y=mpg, group=cyl)) +
  geom_point(aes(shape=cyl, color=cyl), size=2)+
  scale_shape_manual(values=c(3, 16, 17))+
  scale_color_manual(values=c('#999999','#E69F00', '#56B4E9'))+
  theme(legend.position="top")
# Change the point size manually
ggplot(df, aes(x=wt, y=mpg, group=cyl)) +
  geom_point(aes(shape=cyl, color=cyl, size=cyl))+
  scale_shape_manual(values=c(3, 16, 17))+
  scale_color_manual(values=c('#999999','#E69F00', '#56B4E9'))+
  scale_size_manual(values=c(2,3,4))+
  theme(legend.position="top")


Infos

This analysis has been performed using R software (ver. 3.1.2) and ggplot2 (ver. 1.0.0)

ggplot2 line plot : Quick start guide - R software and data visualization



This R tutorial describes how to create line plots using R software and the ggplot2 package.

In a line graph, observations are ordered by x value and connected.

The functions geom_line(), geom_step(), or geom_path() can be used.

x value (for x axis) can be :

  • date : for a time series data
  • texts
  • discrete numeric values
  • continuous numeric values



Basic line plots

Data

Data derived from the ToothGrowth data set are used. ToothGrowth describes the effect of Vitamin C on tooth growth in Guinea pigs.

df <- data.frame(dose=c("D0.5", "D1", "D2"),
                len=c(4.2, 10, 29.5))
head(df)
##   dose  len
## 1 D0.5  4.2
## 2   D1 10.0
## 3   D2 29.5

  • len : Tooth length
  • dose : Dose in milligrams (0.5, 1, 2)

Create line plots with points

library(ggplot2)
# Basic line plot with points
ggplot(data=df, aes(x=dose, y=len, group=1)) +
  geom_line()+
  geom_point()
# Change the line type
ggplot(data=df, aes(x=dose, y=len, group=1)) +
  geom_line(linetype = "dashed")+
  geom_point()
# Change the color
ggplot(data=df, aes(x=dose, y=len, group=1)) +
  geom_line(color="red")+
  geom_point()


Read more on line types : ggplot2 line types

You can add an arrow to the line using the grid package :

library(grid)
# Add an arrow
ggplot(data=df, aes(x=dose, y=len, group=1)) +
  geom_line(arrow = arrow())+
  geom_point()
# Add a closed arrow to the end of the line
myarrow=arrow(angle = 15, ends = "both", type = "closed")
ggplot(data=df, aes(x=dose, y=len, group=1)) +
  geom_line(arrow=myarrow)+
  geom_point()


Observations can be also connected using the functions geom_step() or geom_path() :

ggplot(data=df, aes(x=dose, y=len, group=1)) +
  geom_step()+
  geom_point()
ggplot(data=df, aes(x=dose, y=len, group=1)) +
  geom_path()+
  geom_point()



  • geom_line() : connects observations, ordered by x value
  • geom_path() : connects observations in their original order
  • geom_step() : connects observations with stairs
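
The difference between geom_line() and geom_path() only shows when the data are not already sorted by x. The sketch below uses three hypothetical observations recorded out of order and reads back the order in which each geom connects them.

```r
library(ggplot2)

# Hypothetical observations recorded out of x order
d <- data.frame(x = c(3, 1, 2), y = c(1, 2, 3))

p_line <- ggplot(d, aes(x, y)) + geom_line() + geom_point()
p_path <- ggplot(d, aes(x, y)) + geom_path() + geom_point()

# Order in which each geom connects the points
x_line <- layer_data(p_line, 1)$x   # sorted by x
x_path <- layer_data(p_path, 1)$x   # original row order
x_line
x_path
```

In the examples of this section the doses are already in increasing order, which is why geom_line() and geom_path() give the same picture there.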


Line plot with multiple groups

Data

Data derived from the ToothGrowth data set are used. ToothGrowth describes the effect of Vitamin C on tooth growth in Guinea pigs. Three dose levels of Vitamin C (0.5, 1, and 2 mg) with each of two delivery methods [orange juice (OJ) or ascorbic acid (VC)] are used :

df2 <- data.frame(supp=rep(c("VC", "OJ"), each=3),
                dose=rep(c("D0.5", "D1", "D2"),2),
                len=c(6.8, 15, 33, 4.2, 10, 29.5))
head(df2)
##   supp dose  len
## 1   VC D0.5  6.8
## 2   VC   D1 15.0
## 3   VC   D2 33.0
## 4   OJ D0.5  4.2
## 5   OJ   D1 10.0
## 6   OJ   D2 29.5

  • len : Tooth length
  • dose : Dose in milligrams (0.5, 1, 2)
  • supp : Supplement type (VC or OJ)

Create line plots

In the graphs below, line types, colors and sizes are the same for the two groups :

# Line plot with multiple groups
ggplot(data=df2, aes(x=dose, y=len, group=supp)) +
  geom_line()+
  geom_point()
# Change line types, colors and sizes
ggplot(data=df2, aes(x=dose, y=len, group=supp)) +
  geom_line(linetype="dashed", color="blue", size=1.2)+
  geom_point(color="red", size=3)


Change line types by groups

In the graphs below, line types and point shapes are controlled automatically by the levels of the variable supp :

# Change line types by groups (supp)
ggplot(df2, aes(x=dose, y=len, group=supp)) +
  geom_line(aes(linetype=supp))+
  geom_point()
# Change line types and point shapes
ggplot(df2, aes(x=dose, y=len, group=supp)) +
  geom_line(aes(linetype=supp))+
  geom_point(aes(shape=supp))


It is also possible to change manually the line types using the function scale_linetype_manual().

# Set line types manually
ggplot(df2, aes(x=dose, y=len, group=supp)) +
  geom_line(aes(linetype=supp))+
  geom_point()+
  scale_linetype_manual(values=c("twodash", "dotted"))


You can read more on line types here : ggplot2 line types

If you want to change also point shapes, read this article : ggplot2 point shapes

Change line colors by groups

Line colors are controlled automatically by the levels of the variable supp :

p<-ggplot(df2, aes(x=dose, y=len, group=supp)) +
  geom_line(aes(color=supp))+
  geom_point(aes(color=supp))
p


It is also possible to change manually line colors using the functions :

  • scale_color_manual() : to use custom colors
  • scale_color_brewer() : to use color palettes from RColorBrewer package
  • scale_color_grey() : to use grey color palettes

# Use custom color palettes
p+scale_color_manual(values=c("#999999", "#E69F00", "#56B4E9"))
# Use brewer color palettes
p+scale_color_brewer(palette="Dark2")
# Use grey scale
p + scale_color_grey() + theme_classic()


Read more on ggplot2 colors here : ggplot2 colors

Change the legend position

p <- p + scale_color_brewer(palette="Paired")+
  theme_minimal()
p + theme(legend.position="top")
p + theme(legend.position="bottom")
# Remove legend
p + theme(legend.position="none")


The allowed values for the argument legend.position are : “left”, “top”, “right”, “bottom” and “none” (to remove the legend).

Read more on ggplot legend : ggplot2 legend

Line plot with a numeric x-axis

If the variable on x-axis is numeric, it can be useful to treat it as a continuous or a factor variable depending on what you want to do :

# Create some data
df2 <- data.frame(supp=rep(c("VC", "OJ"), each=3),
                dose=rep(c("0.5", "1", "2"),2),
                len=c(6.8, 15, 33, 4.2, 10, 29.5))
head(df2)
##   supp dose  len
## 1   VC  0.5  6.8
## 2   VC    1 15.0
## 3   VC    2 33.0
## 4   OJ  0.5  4.2
## 5   OJ    1 10.0
## 6   OJ    2 29.5
# x axis treated as continuous variable
df2$dose <- as.numeric(as.vector(df2$dose))
ggplot(data=df2, aes(x=dose, y=len, group=supp, color=supp)) +
  geom_line() + geom_point()+
  scale_color_brewer(palette="Paired")+
  theme_minimal()
# Axis treated as discrete variable
df2$dose<-as.factor(df2$dose)
ggplot(data=df2, aes(x=dose, y=len, group=supp, color=supp)) +
  geom_line() + geom_point()+
  scale_color_brewer(palette="Paired")+
  theme_minimal()


Line plot with dates on x-axis

economics time series data sets are used :

head(economics)
##         date   pce    pop psavert uempmed unemploy
## 1 1967-06-30 507.8 198712     9.8     4.5     2944
## 2 1967-07-31 510.9 198911     9.8     4.7     2945
## 3 1967-08-31 516.7 199113     9.0     4.6     2958
## 4 1967-09-30 513.3 199311     9.8     4.9     3143
## 5 1967-10-31 518.5 199498     9.7     4.7     3066
## 6 1967-11-30 526.2 199657     9.4     4.8     3018

Plots :

# Basic line plot
ggplot(data=economics, aes(x=date, y=pop))+
  geom_line()
# Plot a subset of the data
ggplot(data=subset(economics, date > as.Date("2006-1-1")), 
       aes(x=date, y=pop))+geom_line()


Change line size :

# Change line size
ggplot(data=economics, aes(x=date, y=pop, size=unemploy/pop))+
  geom_line()


Line graph with error bars

The function below will be used to calculate the mean and the standard deviation, for the variable of interest, in each group :

#+++++++++++++++++++++++++
# Function to calculate the mean and the standard deviation
  # for each group
#+++++++++++++++++++++++++
# data : a data frame
# varname : the name of a column containing the variable
  # to be summarized
# groupnames : vector of column names to be used as
  # grouping variables
data_summary <- function(data, varname, groupnames){
  require(plyr)
  summary_func <- function(x, col){
    c(mean = mean(x[[col]], na.rm=TRUE),
      sd = sd(x[[col]], na.rm=TRUE))
  }
  data_sum<-ddply(data, groupnames, .fun=summary_func,
                  varname)
  data_sum <- rename(data_sum, c("mean" = varname))
  return(data_sum)
}

Summarize the data :

df3 <- data_summary(ToothGrowth, varname="len", 
                    groupnames=c("supp", "dose"))
head(df3)
##   supp dose   len       sd
## 1   OJ  0.5 13.23 4.459709
## 2   OJ  1.0 22.70 3.910953
## 3   OJ  2.0 26.06 2.655058
## 4   VC  0.5  7.98 2.746634
## 5   VC  1.0 16.77 2.515309
## 6   VC  2.0 26.14 4.797731
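
The same summary can be obtained without the plyr package. The sketch below uses aggregate() from base R; it is an alternative to data_summary(), not the method used in the original code, and the names tg, agg and df3_base are introduced here for illustration.

```r
# Fresh copy of the data set, with dose still numeric
tg <- datasets::ToothGrowth

# Mean and standard deviation of len for each supp x dose combination
agg <- aggregate(len ~ supp + dose, data = tg,
                 FUN = function(v) c(mean = mean(v), sd = sd(v)))

# aggregate() returns the two statistics as a matrix column : flatten it
df3_base <- do.call(data.frame, agg)
names(df3_base) <- c("supp", "dose", "len", "sd")
df3_base
```

The rows come out sorted by dose first and then by supp, so the order differs from df3, but the values are the same.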

The function geom_errorbar() can be used to produce a line graph with error bars :

# Standard deviation of the mean
ggplot(df3, aes(x=dose, y=len, group=supp, color=supp)) + 
    geom_errorbar(aes(ymin=len-sd, ymax=len+sd), width=.1) +
    geom_line() + geom_point()+
   scale_color_brewer(palette="Paired")+theme_minimal()
# Use position_dodge to move overlapped errorbars horizontally
ggplot(df3, aes(x=dose, y=len, group=supp, color=supp)) + 
    geom_errorbar(aes(ymin=len-sd, ymax=len+sd), width=.1, 
    position=position_dodge(0.05)) +
    geom_line() + geom_point()+
   scale_color_brewer(palette="Paired")+theme_minimal()


Customized line graphs

# Simple line plot
# Change point shapes and line types by groups
ggplot(df3, aes(x=dose, y=len, group = supp, shape=supp, linetype=supp))+ 
    geom_errorbar(aes(ymin=len-sd, ymax=len+sd), width=.1, 
    position=position_dodge(0.05)) +
    geom_line() +
    geom_point()+
    labs(title="Plot of length by dose",x="Dose (mg)", y = "Length")+
    theme_classic()
# Change color by groups
# Add error bars
p <- ggplot(df3, aes(x=dose, y=len, group = supp, color=supp))+ 
    geom_errorbar(aes(ymin=len-sd, ymax=len+sd), width=.1, 
    position=position_dodge(0.05)) +
    geom_line(aes(linetype=supp)) + 
    geom_point(aes(shape=supp))+
    labs(title="Plot of length by dose",x="Dose (mg)", y = "Length")+
    theme_classic()
p + theme_classic() + scale_color_manual(values=c('#999999','#E69F00'))


Change colors manually :

p + scale_color_brewer(palette="Paired") + theme_minimal()
# Greens
p + scale_color_brewer(palette="Greens") + theme_minimal()
# Reds
p + scale_color_brewer(palette="Reds") + theme_minimal()


Infos

This analysis has been performed using R software (ver. 3.1.2) and ggplot2 (ver. 1.0.0)

ggplot2 themes and background colors : The 3 elements



This R tutorial describes how to change the look of a plot theme (background color, panel background color and grid lines) using R software and ggplot2 package. You’ll also learn how to use the base themes of ggplot2 and to create your own theme.


Prepare the data

ToothGrowth data is used :

# Convert the column dose from numeric to factor variable
ToothGrowth$dose <- as.factor(ToothGrowth$dose)
head(ToothGrowth)
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5

Make sure that the variable dose is converted to a factor using the R script above.

Example of plot

library(ggplot2)
p <- ggplot(ToothGrowth, aes(x=dose, y=len)) + geom_boxplot()
p

Quick functions to change plot themes

Several functions are available in the ggplot2 package to quickly change the theme of a plot :

  • theme_gray : gray background color and white grid lines
  • theme_bw : white background and gray grid lines

p + theme_gray(base_size = 14)
p + theme_bw()

  • theme_linedraw : black lines around the plot
  • theme_light : light gray lines and axes (more attention towards the data)

p + theme_linedraw()
p + theme_light()

  • theme_minimal: no background annotations
  • theme_classic : theme with axis lines and no grid lines
p + theme_minimal()
p + theme_classic()

  • theme_void: Empty theme, useful for plots with non-standard coordinates or for drawings
  • theme_dark(): Dark background designed to make colours pop out
p + theme_void()
p + theme_dark()

The functions theme_xx() can take the two arguments below :

  • base_size : base font size (to change the size of all plot text elements)
  • base_family : base font family

The size of all the plot text elements can be easily changed at once :

# Example 1
theme_set(theme_gray(base_size = 20))
ggplot(ToothGrowth, aes(x=dose, y=len)) + geom_boxplot()
# Example 2
ggplot(ToothGrowth, aes(x=dose, y=len)) + geom_boxplot()+
  theme_classic(base_size = 25)

Note that, the function theme_set() changes the theme for the entire session.
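
Because theme_set() affects every later plot, it can be useful to restore the previous theme afterwards. The sketch below relies on theme_set() invisibly returning the theme that was active before the call; the variable name old_theme is only illustrative.

```r
library(ggplot2)

# theme_set() returns the previously active theme (invisibly)
old_theme <- theme_set(theme_bw(base_size = 20))

# Plots created from now on use theme_bw with larger text
ToothGrowth$dose <- as.factor(ToothGrowth$dose)
ggplot(ToothGrowth, aes(x = dose, y = len)) + geom_boxplot()

# Restore the theme that was active before the change
theme_set(old_theme)
```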

Customize the appearance of the plot background

The function theme() is used to control non-data parts of the graph including :

  • Line elements : axis lines, minor and major grid lines, axis ticks, etc.
  • Text elements : plot title, axis titles, legend title and text, axis tick mark labels, etc.
  • Rectangle elements : plot background, panel background, panel border, legend background, etc.

There is a specific function to modify each of these three elements :

  • element_line() to modify the line elements of the theme
  • element_text() to modify the text elements
  • element_rect() to change the appearance of the rectangle elements

Note that each theme element can be removed using the function element_blank().
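
As an illustration of the text elements, here is a minimal sketch that styles the axis titles, rotates the x axis tick labels and removes the axis ticks. The specific colours and angle are arbitrary choices for the example, and the object name p2 is hypothetical.

```r
library(ggplot2)

ToothGrowth$dose <- as.factor(ToothGrowth$dose)
p <- ggplot(ToothGrowth, aes(x = dose, y = len)) + geom_boxplot()

p2 <- p + theme(
  # Text elements: bold, dark blue axis titles
  axis.title = element_text(face = "bold", colour = "darkblue"),
  # Rotate the x axis tick labels by 45 degrees
  axis.text.x = element_text(angle = 45, hjust = 1),
  # Remove the axis ticks entirely
  axis.ticks = element_blank()
)
p2
```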

Change the colors of the plot panel background and the grid lines

  1. The functions theme() and element_rect() are used for changing the plot panel background color :
p + theme(panel.background = element_rect(fill, colour, size, 
                                          linetype, color))

  • fill : the fill color for the rectangle
  • colour, color : border color
  • size : border size


  2. The appearance of grid lines can be changed using the function element_line() as follows :
# change major and minor grid lines
p + theme(
  panel.grid.major = element_line(colour, size, linetype,
                                   lineend, color),
  panel.grid.minor = element_line(colour, size, linetype,
                                   lineend, color)
  )

  • colour, color : line color
  • size : line size
  • linetype : line type. Line type can be specified using either text (“blank”, “solid”, “dashed”, “dotted”, “dotdash”, “longdash”, “twodash”) or number (0, 1, 2, 3, 4, 5, 6). Note that linetype = “solid” is identical to linetype=1. The available line types in R are described here : Line types in R software
  • lineend : line end. Possible values for line end are : “round”, “butt” or “square”


The R code below illustrates how to modify the appearance of the plot panel background and grid lines :

# Change the colors of plot panel background to lightblue
# and the color of grid lines to white
p + theme(
  panel.background = element_rect(fill = "lightblue",
                                colour = "lightblue",
                                size = 0.5, linetype = "solid"),
  panel.grid.major = element_line(size = 0.5, linetype = 'solid',
                                colour = "white"), 
  panel.grid.minor = element_line(size = 0.25, linetype = 'solid',
                                colour = "white")
  )

Remove plot panel borders and grid lines

It is possible to hide plot panel borders and grid lines with the function element_blank() as follows :

# Remove panel borders and grid lines
p + theme(panel.border = element_blank(),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank())
# Hide panel borders and grid lines
# But change axis line
p + theme(panel.border = element_blank(),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          axis.line = element_line(size = 0.5, linetype = "solid",
                                   colour = "black"))

Change the plot background color (not the panel)

p + theme(plot.background = element_rect(fill = "darkblue"))

Use a custom theme

You can change the entire appearance of a plot by using a custom theme. Jeffrey Arnold has implemented the package ggthemes, which contains several custom themes.

To use these themes, install and load the ggthemes package as follows :

install.packages("ggthemes") # Install 
library(ggthemes) # Load

ggthemes package provides many custom themes and scales for ggplot.

theme_tufte : a minimalist theme

# scatter plot
ggplot(mtcars, aes(wt, mpg)) +
  geom_point() + geom_rangeframe() + 
  theme_tufte()

theme_economist : theme based on the plots in The Economist magazine

p <- ggplot(iris, aes(Sepal.Length, Sepal.Width, colour = Species))+
  geom_point()
# Use economist color scales
p + theme_economist() + 
  scale_color_economist()+
  ggtitle("Iris data sets")

Note that the function scale_fill_economist() is also available.

theme_stata: theme based on Stata graph schemes.

p + theme_stata() + scale_color_stata() +
  ggtitle("Iris data")

The Stata theme color scales can be used as follows :

scale_fill_stata(scheme = "s2color", ...)
scale_color_stata(scheme = "s2color", ...)

The allowed values for the argument scheme are one of “s2color”, “s1rcolor”, “s1color”, or “mono”.

theme_wsj: theme based on plots in the Wall Street Journal

p + theme_wsj()+ scale_colour_wsj("colors6")+
  ggtitle("Iris data")

The Wall Street Journal color and fill scales are :

scale_color_wsj(palette = "colors6", ...)
scale_fill_wsj(palette = "colors6", ...)

The color palette to use can be one of “rgby”, “red_green”, “black_green”, “dem_rep”, “colors6”.

theme_calc : theme based on LibreOffice Calc

This theme is based on the defaults in LibreOffice Calc.

p + theme_calc()+ scale_colour_calc()+
  ggtitle("Iris data")

theme_hc : theme based on Highcharts JS

p + theme_hc()+ scale_colour_hc()

Create a custom theme

  1. You can change the theme for the current R session using the function theme_set() as follows :
theme_set(theme_gray(base_size = 20))
  2. You can extract and modify the R code of theme_gray :
theme_gray
function (base_size = 11, base_family = "") 
{
 half_line <- base_size/2
theme(
  line = element_line(colour = "black", size = 0.5, 
                      linetype = 1, lineend = "butt"), 
  rect = element_rect(fill = "white", colour = "black",
                      size = 0.5, linetype = 1),
  text = element_text(family = base_family, face = "plain",
                      colour = "black", size = base_size,
                      lineheight = 0.9,  hjust = 0.5,
                      vjust = 0.5, angle = 0, 
                      margin = margin(), debug = FALSE), 
  
  axis.line = element_blank(), 
  axis.text = element_text(size = rel(0.8), colour = "grey30"),
  axis.text.x = element_text(margin = margin(t = 0.8*half_line/2), 
                             vjust = 1), 
  axis.text.y = element_text(margin = margin(r = 0.8*half_line/2),
                             hjust = 1),
  axis.ticks = element_line(colour = "grey20"), 
  axis.ticks.length = unit(half_line/2, "pt"), 
  axis.title.x = element_text(margin = margin(t = 0.8 * half_line,
                                          b = 0.8 * half_line/2)),
  axis.title.y = element_text(angle = 90, 
                              margin = margin(r = 0.8 * half_line,
                                          l = 0.8 * half_line/2)),
  
  legend.background = element_rect(colour = NA), 
  legend.margin = unit(0.2, "cm"), 
  legend.key = element_rect(fill = "grey95", colour = "white"),
  legend.key.size = unit(1.2, "lines"), 
  legend.key.height = NULL,
  legend.key.width = NULL, 
  legend.text = element_text(size = rel(0.8)),
  legend.text.align = NULL,
  legend.title = element_text(hjust = 0), 
  legend.title.align = NULL, 
  legend.position = "right", 
  legend.direction = NULL,
  legend.justification = "center", 
  legend.box = NULL, 
  
  panel.background = element_rect(fill = "grey92", colour = NA),
  panel.border = element_blank(), 
  panel.grid.major = element_line(colour = "white"), 
  panel.grid.minor = element_line(colour = "white", size = 0.25), 
  panel.margin = unit(half_line, "pt"), panel.margin.x = NULL, 
  panel.margin.y = NULL, panel.ontop = FALSE, 
  
  strip.background = element_rect(fill = "grey85", colour = NA),
  strip.text = element_text(colour = "grey10", size = rel(0.8)),
  strip.text.x = element_text(margin = margin(t = half_line,
                                              b = half_line)), 
  strip.text.y = element_text(angle = -90, 
                              margin = margin(l = half_line, 
                                              r = half_line)),
  strip.switch.pad.grid = unit(0.1, "cm"),
  strip.switch.pad.wrap = unit(0.1, "cm"), 
  
  plot.background = element_rect(colour = "white"), 
  plot.title = element_text(size = rel(1.2), 
                            margin = margin(b = half_line * 1.2)),
  plot.margin = margin(half_line, half_line, half_line, half_line),
  complete = TRUE)
}

Note that the function rel() specifies a size relative to the size inherited from the parent element (for text elements, ultimately base_size).
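
Rather than copying the whole body of theme_gray, a lighter option is to wrap an existing theme in a function and override only a few elements. The following is a minimal sketch under that assumption; the name theme_mine and the chosen settings are hypothetical.

```r
library(ggplot2)

# A hypothetical custom theme: start from theme_gray and override a few elements
theme_mine <- function(base_size = 12, base_family = "") {
  theme_gray(base_size = base_size, base_family = base_family) +
    theme(
      # White panel with a thin gray border
      panel.background = element_rect(fill = "white", colour = "grey50"),
      # No minor grid lines, light major grid lines
      panel.grid.minor = element_blank(),
      panel.grid.major = element_line(colour = "grey90"),
      # Title 1.4 times larger than the inherited text size
      plot.title = element_text(size = rel(1.4), face = "bold")
    )
}

g <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point() +
  ggtitle("Custom theme") + theme_mine(base_size = 14)
g
```

The custom theme can then be added to any plot, or passed to theme_set() to make it the session default.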

Infos

This analysis has been performed using R software (ver. 3.2.4) and ggplot2 (ver. 2.1.0)

Be Awesome in ggplot2: A Practical Guide to be Highly Effective - R software and data visualization


Basics

ggplot2 is a powerful and flexible R package, implemented by Hadley Wickham, for producing elegant graphics. The gg in ggplot2 stands for Grammar of Graphics, a graphic concept which describes plots by using a “grammar”.

According to the ggplot2 concept, a plot can be divided into different fundamental parts : Plot = data + Aesthetics + Geometry.

The principal components of every plot can be defined as follows:

  • data is a data frame
  • Aesthetics is used to indicate x and y variables. It can also be used to control the color, the size or the shape of points, the height of bars, etc.
  • Geometry corresponds to the type of graphics (histogram, box plot, line plot, density plot, dot plot, …)

Two main functions for creating plots are available in the ggplot2 package : qplot() and ggplot().

  • qplot() is a quick plot function which is easy to use for simple plots.
  • The ggplot() function is more flexible and robust than qplot for building a plot piece by piece.

The generated plot can be kept as a variable and then printed at any time using the function print().

After creating plots, two other important functions are:

  • last_plot(), which returns the last plot to be modified
  • ggsave(“plot.png”, width = 5, height = 5), which saves the last plot in the current working directory.
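
A minimal sketch of these two functions is shown below. It passes the plot explicitly through the plot argument of ggsave() instead of relying on the last plot, and it writes a PDF into a temporary directory; the file name and location are assumptions made for the example.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()
print(p)

# Retrieve the most recently created or modified plot
same_plot <- last_plot()

# Save a specific plot; width and height are in inches by default
out_file <- file.path(tempdir(), "plot.pdf")
ggsave(out_file, plot = p, width = 5, height = 5)
file.exists(out_file)
```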

This document describes how to create and customize different types of graphs using ggplot2. Many examples of code and graphics are provided.


Types of graphs for data visualization

The type of plots, to be created, depends on the format of your data. The ggplot2 package provides methods for visualizing the following data structures:

  1. One variable - x: continuous or discrete
  2. Two variables - x & y: continuous and/or discrete
  3. Continuous bivariate distribution - x & y (both continuous)
  4. Continuous function
  5. Error bar
  6. Maps
  7. Three variables

In the current document we’ll provide the essential ggplot2 functions for drawing each of these seven data formats.


Install and load ggplot2 package

Use the R code below:

# Installation
install.packages('ggplot2')
# Loading
library(ggplot2)

Data format and preparation

The data should be a data.frame (columns are variables and rows are observations).

The data set mtcars is used in the examples below:

# Load the data
data(mtcars)
df <- mtcars[, c("mpg", "cyl", "wt")]
# Convert cyl to a factor variable
df$cyl <- as.factor(df$cyl)
head(df)
##                    mpg cyl    wt
## Mazda RX4         21.0   6 2.620
## Mazda RX4 Wag     21.0   6 2.875
## Datsun 710        22.8   4 2.320
## Hornet 4 Drive    21.4   6 3.215
## Hornet Sportabout 18.7   8 3.440
## Valiant           18.1   6 3.460

mtcars : Motor Trend Car Road Tests.

Description: The data comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973 - 74 models).

Format: A data frame with 32 observations on 3 variables.

  • [, 1] mpg Miles/(US) gallon
  • [, 2] cyl Number of cylinders
  • [, 3] wt Weight (lb/1000)


qplot(): Quick plot with ggplot2

The qplot() function is very similar to the standard R plot() function. It can be used to quickly and easily create different types of graphs: scatter plots, box plots, violin plots, histograms and density plots.

A simplified format of qplot() is :

qplot(x, y = NULL, data, geom="auto")

  • x, y : x and y values, respectively. The argument y is optional depending on the type of graphs to be created.
  • data : data frame to use (optional).
  • geom : Character vector specifying geom to use. Defaults to “point” if x and y are specified, and “histogram” if only x is specified.


Other arguments such as main, xlab and ylab can also be used to add a main title and axis labels to the plot.

Read more about qplot(): Quick plot with ggplot2.

Scatter plots

The R code below creates basic scatter plots using the argument geom = “point”. It’s also possible to combine different geoms (e.g.: geom = c(“point”, “smooth”)).

# Basic scatter plot
qplot(x = mpg, y = wt, data = df, geom = "point")
# Scatter plot with smoothed line
qplot(mpg, wt, data = df, 
      geom = c("point", "smooth"))

The following R code will change the color and the shape of points by groups. The column cyl will be used as grouping variable. In other words, the color and the shape of points will be changed by the levels of cyl.

qplot(mpg, wt, data = df, colour = cyl, shape = cyl)
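
For comparison, the same grouped scatter plot can be written with ggplot(). This is a sketch of the equivalent call, not a replacement for the qplot() code above; the object name g is illustrative.

```r
library(ggplot2)

df <- mtcars[, c("mpg", "cyl", "wt")]
df$cyl <- as.factor(df$cyl)

# Map colour and shape to cyl inside aes(), then add the point layer
g <- ggplot(df, aes(x = mpg, y = wt, colour = cyl, shape = cyl)) +
  geom_point()
g
```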

Box plot, violin plot and dot plot

The R code below generates some data containing the weights by sex (M for male; F for female):

set.seed(1234)
wdata = data.frame(
        sex = factor(rep(c("F", "M"), each=200)),
        weight = c(rnorm(200, 55), rnorm(200, 58)))
head(wdata)
##   sex   weight
## 1   F 53.79293
## 2   F 55.27743
## 3   F 56.08444
## 4   F 52.65430
## 5   F 55.42912
## 6   F 55.50606
# Basic box plot from data frame
qplot(sex, weight, data = wdata, 
      geom= "boxplot", fill = sex)
# Violin plot
qplot(sex, weight, data = wdata, geom = "violin")
# Dot plot
qplot(sex, weight, data = wdata, geom = "dotplot",
      stackdir = "center", binaxis = "y", dotsize = 0.5)

Histogram and density plots

The histogram and density plots are used to display the distribution of data.

# Histogram  plot
# Change histogram fill color by group (sex)
qplot(weight, data = wdata, geom = "histogram",
      fill = sex)
# Density plot
# Change density plot line color by group (sex)
# change line type
qplot(weight, data = wdata, geom = "density",
    color = sex, linetype = sex)

ggplot(): build plots piece by piece

As mentioned above, there are two main functions in ggplot2 package for generating graphics:

  • The quick and easy-to-use function: qplot()
  • The more powerful and flexible function to build plots piece by piece: ggplot()

This section describes briefly how to use the function ggplot(). Recall that, the concept of ggplot divides a plot into three different fundamental parts: plot = data + Aesthetics + geometry.

  • data: a data frame.
  • Aesthetics: used to specify x and y variables, color, size, shape, ….
  • Geometry: the type of plots (histogram, boxplot, line, density, dotplot, bar, …)

To demonstrate how the function ggplot() works, we’ll draw a scatter plot. The function aes() is used to specify aesthetics. An alternative option is the function aes_string(), which generates mappings from strings. aes_string() is particularly useful when writing functions that create plots, because you can use strings to define the aesthetic mappings rather than having to use substitute() to generate a call to aes().

# Basic scatter plot
ggplot(data = mtcars, aes(x = wt, y = mpg)) + 
  geom_point()
# Change the point size, and shape
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(size = 2, shape = 23)

The function aes_string() can be used as follows:

ggplot(mtcars, aes_string(x = "wt", y = "mpg")) +
  geom_point(size = 2, shape = 23)
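
In more recent ggplot2 releases (3.0.0 and later, which is an assumption beyond the version used in this tutorial), aes_string() is soft-deprecated and the .data pronoun can be used inside aes() for the same purpose. A minimal sketch, with a hypothetical helper named scatter_by_name():

```r
library(ggplot2)

# Hypothetical helper: build a scatter plot from column names given as strings
scatter_by_name <- function(data, xvar, yvar) {
  ggplot(data, aes(x = .data[[xvar]], y = .data[[yvar]])) +
    geom_point(size = 2, shape = 23) +
    labs(x = xvar, y = yvar)
}

sp <- scatter_by_name(mtcars, "wt", "mpg")
sp
```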

Note that, some plots visualize a transformation of the original data set. In this case, an alternative way to build a layer is to use stat_*() functions.

In the following example, the function geom_density() does the same as the function stat_density():

# Use geometry function
ggplot(wdata, aes(x = weight)) + geom_density()
# OR use stat function
ggplot(wdata, aes(x = weight)) + stat_density()

For each plot type, we’ll provide the geom_*() function and the corresponding stat_*() function (if available).

One variable: Continuous

We’ll use weight data (wdata), generated in the previous sections.

head(wdata)
##   sex   weight
## 1   F 53.79293
## 2   F 55.27743
## 3   F 56.08444
## 4   F 52.65430
## 5   F 55.42912
## 6   F 55.50606

The following R code computes the mean value by sex:

library(plyr)
mu <- ddply(wdata, "sex", summarise, grp.mean=mean(weight))
head(mu)
##   sex grp.mean
## 1   F 54.94224
## 2   M 58.07325
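
If the plyr package is not available, the same group means can be computed with base R. The sketch below uses aggregate() and renames the result column to grp.mean so that it matches the data frame used in the rest of this section.

```r
# Recreate the weight data from the previous section
set.seed(1234)
wdata <- data.frame(
  sex = factor(rep(c("F", "M"), each = 200)),
  weight = c(rnorm(200, 55), rnorm(200, 58)))

# Mean weight by sex, using base R only
mu <- aggregate(weight ~ sex, data = wdata, FUN = mean)
names(mu)[names(mu) == "weight"] <- "grp.mean"
mu
```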

We start by creating a plot, named a, that we’ll finish in the next section by adding a layer.

a <- ggplot(wdata, aes(x = weight))

Possible layers are:

  • For one continuous variable:
    • geom_area() for area plot
    • geom_density() for density plot
    • geom_dotplot() for dot plot
    • geom_freqpoly() for frequency polygon
    • geom_histogram() for histogram plot
    • stat_ecdf() for empirical cumulative distribution function
    • stat_qq() for quantile - quantile plot
  • For one discrete variable:
    • geom_bar() for bar plot


geom_area(): Create an area plot

# Basic plot
a + geom_area(stat = "bin")
# change fill colors by sex
a + geom_area(aes(fill = sex), stat ="bin", alpha=0.6) +
  theme_classic()

Note that, by default, the y axis corresponds to the count of weight values. If you want the density on the y axis instead, the R code is as follows:

a + geom_area(aes(y = ..density..), stat ="bin")

To customize the plot, the following arguments can be used: alpha, color, fill, linetype, size. Learn more here: ggplot2 area plot.

  • Key function: geom_area()
  • Alternative function: stat_bin()
a + stat_bin(geom = "area")

geom_density(): Create a smooth density estimate

We’ll use the following functions:

  • geom_density() to create a density plot
  • geom_vline() to add vertical lines corresponding to group mean values
  • scale_color_manual() to change the color manually by groups
# Basic plot
a + geom_density()
# change line colors by sex
a + geom_density(aes(color = sex)) 
# Change fill color by sex
# Use semi-transparent fill: alpha = 0.4
a + geom_density(aes(fill = sex), alpha=0.4)
   
# Add mean line and Change color manually
a + geom_density(aes(color = sex)) +
  geom_vline(data=mu, aes(xintercept=grp.mean, color=sex),
             linetype="dashed") +
  scale_color_manual(values=c("#999999", "#E69F00"))

To customize the plot, the following arguments can be used: alpha, color, fill, linetype, size. Learn more here: ggplot2 density plot.

  • Key function: geom_density()
  • Alternative function: stat_density()
a + stat_density()

geom_dotplot(): Dot plot

In a dot plot, dots are stacked with each dot representing one observation.

# Basic plot
a + geom_dotplot()
# change fill and color by sex
a + geom_dotplot(aes(fill = sex)) 
# Change fill color manually 
a + geom_dotplot(aes(fill = sex)) +
  scale_fill_manual(values=c("#999999", "#E69F00"))

To customize the plot, the following arguments can be used: alpha, color, fill and dotsize. Learn more here: ggplot2 dot plot.

  • Key functions: geom_dotplot()

geom_freqpoly(): Frequency polygon

# Basic plot
a + geom_freqpoly() 
# change y axis to density value
# and change theme
a + geom_freqpoly(aes(y = ..density..)) +
  theme_minimal()
# change color and linetype by sex
a + geom_freqpoly(aes(color = sex, linetype = sex)) +
  theme_minimal()

To customize the plot, the following arguments can be used: alpha, color, linetype and size.

  • Key function: geom_freqpoly()
  • Alternative function: stat_bin()

geom_histogram(): Histogram

# Basic plot
a + geom_histogram()
# change line colors by sex
a + geom_histogram(aes(color = sex), fill = "white",
                   position = "dodge") 

If you want the density on the y axis instead, the R code is as follows:

a + geom_histogram(aes(y = ..density..))
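
In newer ggplot2 releases (3.3.0 and later, an assumption beyond the version used here), the same computed variable is usually written with after_stat(). The sketch below also sets bins explicitly, which is an arbitrary choice made to avoid the default binning message; on the density scale the bar areas sum to 1.

```r
library(ggplot2)

set.seed(1234)
wdata <- data.frame(
  sex = factor(rep(c("F", "M"), each = 200)),
  weight = c(rnorm(200, 55), rnorm(200, 58)))

# Density on the y axis, using after_stat() instead of ..density..
h <- ggplot(wdata, aes(x = weight)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30)
h
```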

To customize the plot, the following arguments can be used: alpha, color, fill, linetype and size. Learn more here: ggplot2 histogram plot.

  • Key functions: geom_histogram()
  • Position adjustments: “identity” (or position_identity()), “stack” (or position_stack()), “dodge” ( or position_dodge()). Default value is “stack”
  • Alternative function: stat_bin()
a + stat_bin(geom = "histogram")

stat_ecdf(): Empirical Cumulative Distribution Function

a + stat_ecdf()

To customize the plot, the following arguments can be used: alpha, color, linetype and size. Learn more here: ggplot2 ECDF.

  • Key function: stat_ecdf()
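
As with the other one-variable layers, the ECDF can be drawn per group by mapping color to sex. A minimal sketch (the object name ec is illustrative):

```r
library(ggplot2)

set.seed(1234)
wdata <- data.frame(
  sex = factor(rep(c("F", "M"), each = 200)),
  weight = c(rnorm(200, 55), rnorm(200, 58)))

# One ECDF step curve per sex
ec <- ggplot(wdata, aes(x = weight, color = sex)) +
  stat_ecdf(geom = "step")
ec
```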

stat_qq(): quantile - quantile plot

ggplot(mtcars, aes(sample=mpg)) + stat_qq()

To customize the plot, the following arguments can be used: alpha, color, shape and size. Learn more here: ggplot2 quantile - quantile plot.

  • Key function: stat_qq()
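
A reference line often makes a quantile - quantile plot easier to read. The sketch below adds one with stat_qq_line(), which is assumed to be available (it was added in ggplot2 3.0.0, later than the version used in this tutorial).

```r
library(ggplot2)

# Q-Q points plus a reference line through the quartiles
qq <- ggplot(mtcars, aes(sample = mpg)) +
  stat_qq() +
  stat_qq_line()
qq
```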

One variable: Discrete

The function geom_bar() can be used to visualize one discrete variable. In this case, the count of each level is plotted. We’ll use the mpg data set [in ggplot2 package]. The R code is as follows:

data(mpg)
b <- ggplot(mpg, aes(fl))
# Basic plot
b + geom_bar()
# Change fill color
b + geom_bar(fill = "steelblue", color ="steelblue") +
  theme_minimal()

To customize the plot, the following arguments can be used: alpha, color, fill, linetype and size. Learn more here: ggplot2 bar plot.

  • Key function: geom_bar()
  • Alternative function: stat_count()
b + stat_count()
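
The bar heights drawn by geom_bar() are simply the number of rows per level, which can be checked against table(). A short sketch (the object name bars is illustrative):

```r
library(ggplot2)

bars <- ggplot(mpg, aes(fl)) + geom_bar()

# Counts computed by the stat, one row per bar
layer_data(bars)[, c("x", "count")]

# The same counts computed directly from the data
table(mpg$fl)
```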

Two variables: Continuous X, Continuous Y

We’ll use the mtcars data set. The variable cyl is used as grouping variable.

data(mtcars)
mtcars$cyl <- as.factor(mtcars$cyl)
head(mtcars[, c("wt", "mpg", "cyl")])
##                      wt  mpg cyl
## Mazda RX4         2.620 21.0   6
## Mazda RX4 Wag     2.875 21.0   6
## Datsun 710        2.320 22.8   4
## Hornet 4 Drive    3.215 21.4   6
## Hornet Sportabout 3.440 18.7   8
## Valiant           3.460 18.1   6

We start by creating a plot, named b, that we’ll finish in the next section by adding a layer.

b <- ggplot(mtcars, aes(x = wt, y = mpg))

Possible layers include:

  • geom_point() for scatter plot
  • geom_smooth() for adding smoothed line such as regression line
  • geom_quantile() for adding quantile lines
  • geom_rug() for adding a marginal rug
  • geom_jitter() for avoiding overplotting
  • geom_text() for adding textual annotations


geom_point(): Scatter plot

# Basic plot
b + geom_point()
   
# change the color and the point 
# by the levels of cyl variable
b + geom_point(aes(color = cyl, shape = cyl)) 
# Change color manually
b + geom_point(aes(color = cyl, shape = cyl)) +
  scale_color_manual(values = c("#999999", "#E69F00", "#56B4E9"))+
  theme_minimal()

To customize the plot, the following arguments can be used: alpha, color, fill, shape and size. Learn more here: ggplot2 scatter plot.

  • key function: geom_point()

geom_smooth(): Add regression line or smoothed conditional mean

To add a regression line on a scatter plot, the function geom_smooth() is used in combination with the argument method = lm. lm stands for linear model.

# Regression line only
b + geom_smooth(method = lm)
  
# Point + regression line
# Remove the confidence interval 
b + geom_point() + 
  geom_smooth(method = lm, se = FALSE)
# loess method: local regression fitting
b + geom_point() + geom_smooth()
# Change color and shape by groups (cyl)
# (shape applies to the points only; geom_smooth() has no shape aesthetic)
b + geom_point(aes(color=cyl, shape=cyl)) + 
  geom_smooth(aes(color=cyl), 
              method=lm, se=FALSE, fullrange=TRUE)

To customize the plot, the following arguments can be used: alpha, color, fill, linetype and size. Learn more here: ggplot2 scatter plot

  • key function: geom_smooth()
  • Alternative function: stat_smooth()
b + stat_smooth(method = "lm")
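
To see what the regression line represents, it can help to fit the same model with lm() and inspect its coefficients. The sketch below assumes the default formula y ~ x, stated explicitly here, so the drawn line is the fit of mpg on wt.

```r
library(ggplot2)

# The model behind the regression line
fit <- lm(mpg ~ wt, data = mtcars)
coef(fit)

# Scatter plot with the fitted regression line
g <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
g
```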

geom_quantile(): Add quantile lines from a quantile regression

Quantile lines can be used as a continuous analogue of a geom_boxplot().

We’ll use the mpg data set [in ggplot2].

The function geom_quantile() can be used for adding quantile lines (it requires the quantreg package to be installed):

ggplot(mpg, aes(cty, hwy)) +
  geom_point() + geom_quantile() +
  theme_minimal()

An alternative to geom_quantile() is the function stat_quantile():

ggplot(mpg, aes(cty, hwy)) +
  geom_point() + stat_quantile(quantiles = c(0.25, 0.5, 0.75))

To customize the plot, the following arguments can be used: alpha, color, linetype and size. Learn more here: Continuous quantiles

  • Key function: geom_quantile()
  • Alternative function: stat_quantile()

geom_rug(): Add marginal rug to scatter plots

We’ll use faithful data set.

# Add marginal rugs using faithful data
ggplot(faithful, aes(x=eruptions, y=waiting)) +
  geom_point() + geom_rug()

To customize the plot, the following arguments can be used: alpha, color, linetype and size. Learn more here: ggplot2 scatter plot

  • key function: geom_rug()

geom_jitter(): Jitter points to reduce overplotting

The function geom_jitter() is a convenient shortcut for geom_point(position = ‘jitter’). The mpg data set [in ggplot2] is used in the following examples.

p <- ggplot(mpg, aes(displ, hwy))
# Default scatter plot
p + geom_point()
# Use jitter to reduce overplotting
p + geom_jitter(
    position = position_jitter(width = 0.5, height = 0.5))

To adjust the extent of jittering, the function position_jitter() is used with the arguments width and height:

  • width: degree of jitter in x direction.
  • height: degree of jitter in y direction.
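
The same adjustment can also be written by passing width and height directly to geom_jitter(). In the sketch below, set.seed() is called first because the jitter is random; the seed value itself is arbitrary.

```r
library(ggplot2)

set.seed(123)  # jitter is random; fix the seed for a reproducible plot
j <- ggplot(mpg, aes(displ, hwy)) +
  geom_jitter(width = 0.5, height = 0.5)
j
```

Each point is moved by at most 0.5 units in either direction on both axes.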

To customize the plot, the following arguments can be used: alpha, color, fill, shape and size. Learn more here: ggplot2 jitter

  • Key functions: geom_jitter(), position_jitter()

geom_text(): Textual annotations

The argument label is used to specify a vector of labels for point annotations.

b + geom_text(aes(label = rownames(mtcars)))

To customize the plot, the following arguments can be used: label, alpha, angle, color, family, fontface, hjust, lineheight, size, and vjust. Learn more here: ggplot2 add textual annotations

  • key function: geom_text(), annotation_custom()
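
With 32 car names the labels overlap heavily. One possible remedy, sketched here, is the check_overlap argument of geom_text(), which skips labels that would overlap earlier ones; the vjust and size values are arbitrary choices for the example.

```r
library(ggplot2)

lab <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  # Draw labels slightly above the points and drop overlapping ones
  geom_text(aes(label = rownames(mtcars)),
            check_overlap = TRUE, vjust = -0.5, size = 3)
lab
```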

Two variables: Continuous bivariate distribution

We start by using the diamonds data set [in ggplot2].

data(diamonds)
head(diamonds[, c("carat", "price")])
##   carat price
## 1  0.23   326
## 2  0.21   326
## 3  0.23   327
## 4  0.29   334
## 5  0.31   335
## 6  0.24   336

We start by creating a plot, named c, that we’ll finish in the next section by adding a layer.

c <- ggplot(diamonds, aes(carat, price))

Possible layers include:

  • geom_bin2d() for adding a heatmap of 2d bin counts (rectangular binning)
  • geom_hex() for adding hexagonal binning. The R package hexbin is required for this functionality
  • geom_density_2d() for adding contours from a 2d density estimate


geom_bin2d(): Add heatmap of 2d bin counts

The function geom_bin2d() produces a scatter plot with rectangular bins. The number of observations is counted in each bin and displayed as a heatmap.

# Default plot 
c + geom_bin2d()
# Change the number of bins
c + geom_bin2d(bins = 15)

To customize the plot, the following arguments can be used: xmax, xmin, ymax, ymin, alpha, color, fill, linetype and size. Learn more here: ggplot2 Scatter plots with rectangular bins

  • Key functions: geom_bin2d()
  • Alternative functions: stat_bin_2d(), stat_summary_2d()
c + stat_bin_2d()
c + stat_summary_2d(aes(z = depth))

geom_hex(): Add hexagon binning

The function geom_hex() produces a scatter plot with hexagon binning. The hexbin R package is required for hexagon binning. If you don’t have it, use the R code below to install it:

install.packages("hexbin")

The function geom_hex() can be used as follows:

require(hexbin)
# Default plot 
c + geom_hex()
# Change the number of bins
c + geom_hex(bins = 10)

To customize the plot, the following arguments can be used: alpha, color, fill and size. Learn more here: ggplot2 Scatter plots with rectangular bins

  • Key function: geom_hex()
  • Alternative functions: stat_bin_hex(), stat_summary_hex()
c + stat_bin_hex()
c + stat_summary_hex(aes(z = depth))

geom_density_2d(): Add contours from a 2d density estimate

The functions geom_density_2d() or stat_density_2d() can be used to add 2d density estimate to a scatter plot.

The faithful data set is used in this section. We first create a scatter plot (sp) as follows:

# Scatter plot 
sp <- ggplot(faithful, aes(x=eruptions, y=waiting)) 
# Default plot
sp + geom_density_2d()
# Add points
sp + geom_point() + geom_density_2d()
# Use stat_density_2d with geom = "polygon"
sp + geom_point() + 
  stat_density_2d(aes(fill = ..level..), geom="polygon")

To customize the plot, the following arguments can be used: alpha, color, linetype and size. Learn more here: ggplot2 Scatter plots with the 2d density estimation

  • Key function: geom_density_2d()
  • Alternative functions: stat_density_2d()
sp + stat_density_2d()
  • See also: stat_contour(), geom_contour()

Two variables: Continuous function

In this section, we’ll see how to connect observations by line. The economics data set [in ggplot2] is used.

data(economics)
head(economics)
##         date   pce    pop psavert uempmed unemploy
## 1 1967-07-01 507.4 198712    12.5     4.5     2944
## 2 1967-08-01 510.5 198911    12.5     4.7     2945
## 3 1967-09-01 516.3 199113    11.7     4.6     2958
## 4 1967-10-01 512.9 199311    12.5     4.9     3143
## 5 1967-11-01 518.1 199498    12.5     4.7     3066
## 6 1967-12-01 525.8 199657    12.1     4.8     3018

We start by creating a plot, named d, that we’ll finish in the next section by adding a layer.

d <- ggplot(economics, aes(x = date, y = unemploy))

Possible layers include:

  • geom_area() for area plot
  • geom_line() for line plot connecting observations, ordered by x
  • geom_step() for connecting observations by stairs


# Area plot
d + geom_area()
# Line plot: connecting observations, ordered by x
d + geom_line()
# Connecting observations by stairs
# a subset of economics data set is used
set.seed(1234)
ss <- economics[sample(1:nrow(economics), 20), ]
ggplot(ss, aes(x = date, y = unemploy)) + 
  geom_step()

To customize the plot, the following arguments can be used: alpha, color, linetype, size and fill (for geom_area only). Learn more here: ggplot2 line plot.

  • Key functions: geom_area(), geom_line(), geom_step()

Two variables: Discrete X, Continuous Y

The ToothGrowth data set will be used to plot the continuous variable len (for tooth length) by the discrete variable dose. The following R code converts the variable dose from a numeric to a discrete factor variable.

data("ToothGrowth")
ToothGrowth$dose <- as.factor(ToothGrowth$dose)
head(ToothGrowth)
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5

We start by creating a plot, named e, that we’ll finish in the next section by adding a layer.

e <- ggplot(ToothGrowth, aes(x = dose, y = len))

Possible layers include:

  • geom_boxplot() for box plot
  • geom_violin() for violin plot
  • geom_dotplot() for dot plot
  • geom_jitter() for stripchart
  • geom_line() for line plot
  • geom_bar() for bar plot


geom_boxplot(): Box and whiskers plot

# Default plot
e + geom_boxplot()
# Notched box plot
e + geom_boxplot(notch = TRUE)
# Color by group (dose)
e + geom_boxplot(aes(color = dose))
# Change fill color by group (dose)
e + geom_boxplot(aes(fill = dose))

# Box plot with multiple groups
ggplot(ToothGrowth, aes(x=dose, y=len, fill=supp)) +
  geom_boxplot()

To customize the plot, the following arguments can be used: alpha, color, linetype, shape, size and fill. Learn more here: ggplot2 box plot.

  • Key function: geom_boxplot()
  • Alternative functions: stat_boxplot()
e + stat_boxplot(coef = 1.5)

geom_violin(): Violin plot

Violin plots are similar to box plots, except that they also show the kernel probability density of the data at different values.

# Default plot
e + geom_violin(trim = FALSE)
# violin plot with mean points (+/- SD)
e + geom_violin(trim = FALSE) + 
  stat_summary(fun.data="mean_sdl",  fun.args = list(mult=1), 
               geom="pointrange", color = "red")
# Combine with box plot
e + geom_violin(trim = FALSE) + 
  geom_boxplot(width = 0.2)
# Color by group (dose) 
e + geom_violin(aes(color = dose), trim = FALSE)

To customize the plot, the following arguments can be used: alpha, color, linetype, size and fill. Learn more here: ggplot2 violin plot.

  • Key functions: geom_violin()
  • Alternative functions: stat_ydensity()
e + stat_ydensity(trim = FALSE)

geom_dotplot(): Dot plot

# Default plot
e + geom_dotplot(binaxis = "y", stackdir = "center")
# Dot plot with mean points (+/- SD)
e + geom_dotplot(binaxis = "y", stackdir = "center") + 
  stat_summary(fun.data="mean_sdl",  fun.args = list(mult=1), 
               geom="pointrange", color = "red")

# Combine with box plot
e + geom_boxplot() + 
  geom_dotplot(binaxis = "y", stackdir = "center") 
# Add violin plot
e + geom_violin(trim = FALSE) +
  geom_dotplot(binaxis='y', stackdir='center')
  
# Color and fill by group (dose) 
e + geom_dotplot(aes(color = dose, fill = dose), 
                 binaxis = "y", stackdir = "center")

To customize the plot, the following arguments can be used: alpha, color, dotsize and fill. Learn more here: ggplot2 dot plot.

  • Key functions: geom_dotplot(), stat_summary()

geom_jitter(): Strip charts

Strip charts are also known as one-dimensional scatter plots. They are a good alternative to box plots when sample sizes are small.

# Default plot
e + geom_jitter(position=position_jitter(0.2))
# Strip charts with mean points (+/- SD)
e + geom_jitter(position=position_jitter(0.2)) + 
  stat_summary(fun.data="mean_sdl",  fun.args = list(mult=1), 
               geom="pointrange", color = "red")

# Combine with dot plot
e + geom_jitter(position=position_jitter(0.2)) + 
  geom_dotplot(binaxis = "y", stackdir = "center") 
# Add violin plot
e + geom_violin(trim = FALSE) +
  geom_jitter(position=position_jitter(0.2))
  
# Change color and shape by group (dose) 
e +  geom_jitter(aes(color = dose, shape = dose),
                 position=position_jitter(0.2))

To customize the plot, the following arguments can be used: alpha, color, shape, size and fill. Learn more here: ggplot2 strip charts.

  • Key functions: geom_jitter(), stat_summary()

geom_line(): Line plot

Data derived from ToothGrowth data sets are used.

df <- data.frame(supp=rep(c("VC", "OJ"), each=3),
                dose=rep(c("D0.5", "D1", "D2"),2),
                len=c(6.8, 15, 33, 4.2, 10, 29.5))
head(df)
##   supp dose  len
## 1   VC D0.5  6.8
## 2   VC   D1 15.0
## 3   VC   D2 33.0
## 4   OJ D0.5  4.2
## 5   OJ   D1 10.0
## 6   OJ   D2 29.5

In the graphs below, line types and point shapes are controlled automatically by the levels of the variable supp :

# Change line types by groups (supp)
ggplot(df, aes(x=dose, y=len, group=supp)) +
  geom_line(aes(linetype=supp))+
  geom_point()
# Change line types, point shapes and colors
ggplot(df, aes(x=dose, y=len, group=supp)) +
  geom_line(aes(linetype=supp, color = supp))+
  geom_point(aes(shape=supp, color = supp))

To customize the plot, the following arguments can be used: alpha, color, linetype and size. Learn more here: ggplot2 line plot.

  • Key functions: geom_line(), geom_step()

geom_bar(): Bar plot

Data derived from ToothGrowth data sets are used.

df <- data.frame(dose=c("D0.5", "D1", "D2"),
                len=c(4.2, 10, 29.5))
head(df)
##   dose  len
## 1 D0.5  4.2
## 2   D1 10.0
## 3   D2 29.5
df2 <- data.frame(supp=rep(c("VC", "OJ"), each=3),
                dose=rep(c("D0.5", "D1", "D2"),2),
                len=c(6.8, 15, 33, 4.2, 10, 29.5))
head(df2)
##   supp dose  len
## 1   VC D0.5  6.8
## 2   VC   D1 15.0
## 3   VC   D2 33.0
## 4   OJ D0.5  4.2
## 5   OJ   D1 10.0
## 6   OJ   D2 29.5

We start by creating a simple bar plot (named f) using the df data set:

f <- ggplot(df, aes(x = dose, y = len))
# Basic bar plot
f + geom_bar(stat = "identity")
# Change fill color and add labels
f + geom_bar(stat="identity", fill="steelblue")+
  geom_text(aes(label=len), vjust=-0.3, size=3.5)+
  theme_minimal()
# Change bar plot line colors by groups
f + geom_bar(aes(color = dose),
             stat="identity", fill="white")
# Change bar plot fill colors by groups
f + geom_bar(aes(fill = dose), stat="identity")

Bar plot with multiple groups:

g <- ggplot(data=df2, aes(x=dose, y=len, fill=supp)) 
# Stacked bar plot
g + geom_bar(stat = "identity")
# Use position=position_dodge()
g + geom_bar(stat="identity", position=position_dodge())

To customize the plot, the following arguments can be used: alpha, color, fill, linetype and size. Learn more here: ggplot2 bar plot.

  • Key function: geom_bar()
  • Alternative function: stat_identity()
g + stat_identity(geom = "bar")
g + stat_identity(geom = "bar", position = "dodge")

Two variables: Discrete X, Discrete Y

The diamonds data set [in ggplot2] will be used to plot the discrete variable color (for diamond colors) by the discrete variable cut (for diamond cut types). The plot is created using the function geom_jitter().

ggplot(diamonds, aes(cut, color)) +
  geom_jitter(aes(color = cut), size = 0.5)

To customize the plot, the following arguments can be used: alpha, color, fill, shape and size.

  • Key function: geom_jitter()

Two variables: Visualizing error

The ToothGrowth data set will be used. We start by creating a data set named df which holds ToothGrowth data.

# ToothGrowth data set
df <- ToothGrowth
df$dose <- as.factor(df$dose)
head(df)
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5

The helper function below (data_summary()) will be used to calculate the mean and the standard deviation (used as error), for the variable of interest, in each group. The plyr package is required.

# Calculate the mean and the SD in each group
#+++++++++++++++++++++++++
# data : a data frame
# varname : the name of the variable to be summarized
# grps : column names to be used as grouping variables
data_summary <- function(data, varname, grps){
  require(plyr)
  summary_func <- function(x, col){
    c(mean = mean(x[[col]], na.rm=TRUE),
      sd = sd(x[[col]], na.rm=TRUE))
  }
  data_sum<-ddply(data, grps, .fun=summary_func, varname)
  data_sum <- rename(data_sum, c("mean" = varname))
 return(data_sum)
}

Using the function data_summary(), the following R code creates a data set named df2 which holds the mean and the SD of tooth length (len) by groups (dose).

df2 <- data_summary(df, varname="len", grps= "dose")
# Convert dose to a factor variable
df2$dose=as.factor(df2$dose)
head(df2)
##   dose    len       sd
## 1  0.5 10.605 4.499763
## 2    1 19.735 4.415436
## 3    2 26.100 3.774150
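
As a cross-check that does not depend on plyr, the same per-group summary can be sketched with base R's aggregate(). The names tg and df2_base are introduced here for illustration only and are not part of the original tutorial.

```r
# Base R sketch: mean and SD of len for each dose (no plyr needed).
# tg and df2_base are hypothetical names used only in this example.
tg <- ToothGrowth
tg$dose <- as.factor(tg$dose)

# One aggregate() call per statistic, both grouped by dose
means <- aggregate(len ~ dose, data = tg, FUN = mean)
sds   <- aggregate(len ~ dose, data = tg, FUN = sd)

# Same layout as df2: one row per dose, columns dose, len (mean) and sd
df2_base <- data.frame(dose = means$dose, len = means$len, sd = sds$len)
df2_base
```

The result should agree with the df2 table printed above, which makes it a quick way to validate the helper function.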

We start by creating a plot, named f, that we’ll finish in the next section by adding a layer.

f <- ggplot(df2, aes(x = dose, y = len, 
                     ymin = len-sd, ymax = len+sd))

Possible layers include:

  • geom_crossbar() for hollow bar with middle indicated by horizontal line
  • geom_errorbar() for error bars
  • geom_errorbarh() for horizontal error bars
  • geom_linerange() for drawing an interval represented by a vertical line
  • geom_pointrange() for creating an interval represented by a vertical line, with a point in the middle.


geom_crossbar(): Hollow bar with middle indicated by horizontal line

We’ll use the data set named df2, which holds the mean and the SD of tooth length (len) by groups (dose).

# Default plot
f + geom_crossbar()
# color by groups
f + geom_crossbar(aes(color = dose))
# Change color manually
f + geom_crossbar(aes(color = dose)) + 
  scale_color_manual(values = c("#999999", "#E69F00", "#56B4E9"))+
  theme_minimal()
# fill by groups and change color manually
f + geom_crossbar(aes(fill = dose)) + 
  scale_fill_manual(values = c("#999999", "#E69F00", "#56B4E9"))+
  theme_minimal()

Cross bar with multiple groups: Using the function data_summary(), we start by creating a data set named df3 which holds the mean and the SD of tooth length (len) by 2 groups (supp and dose).

df3 <- data_summary(df, varname="len", grps= c("supp", "dose"))
head(df3)
##   supp dose   len       sd
## 1   OJ  0.5 13.23 4.459709
## 2   OJ    1 22.70 3.910953
## 3   OJ    2 26.06 2.655058
## 4   VC  0.5  7.98 2.746634
## 5   VC    1 16.77 2.515309
## 6   VC    2 26.14 4.797731

The data set df3 is used to create cross bars with multiple groups. To this end, the variable len is plotted by dose and the color is changed by the levels of the factor supp.

f <- ggplot(df3, aes(x = dose, y = len, 
                     ymin = len-sd, ymax = len+sd))
# Default plot
f + geom_crossbar(aes(color = supp))
# Use position_dodge() to avoid overlap
f + geom_crossbar(aes(color = supp), 
                  position = position_dodge(1))

A simple alternative to geom_crossbar() is to use the function stat_summary() as follows. In this case, the mean and the SD are computed automatically.

f <- ggplot(df, aes(x = dose, y = len, color = supp)) 
# Use geom_crossbar()
f + stat_summary(fun.data="mean_sdl", fun.args = list(mult=1), 
                 geom="crossbar", width = 0.6, 
                 position = position_dodge(0.8))

To customize the plot, the following arguments can be used: alpha, color, fill, linetype and size. Learn more here: ggplot2 error bars.

  • Key functions: geom_crossbar(), stat_summary()

geom_errorbar(): Error bars

We’ll use the data set named df2, which holds the mean and the SD of tooth length (len) by groups (dose).

We start by creating a plot, named f, that we’ll finish next by adding a layer.

f <- ggplot(df2, aes(x = dose, y = len, 
                     ymin = len-sd, ymax = len+sd))
# Error bars colored by groups
f + geom_errorbar(aes(color = dose), width = 0.2)
# Combine with line plot
f + geom_line(aes(group = 1)) + 
  geom_errorbar(width = 0.2)
# Combine with bar plot, color by groups
f + geom_bar(aes(color = dose), stat = "identity", fill ="white") + 
  geom_errorbar(aes(color = dose), width = 0.2)

Error bars with multiple groups:

The data set df3 is used to create error bars with multiple groups. To this end, the variable len is plotted by dose and the color is changed by the levels of the factor supp.

f <- ggplot(df3, aes(x = dose, y = len, 
                     ymin = len-sd, ymax = len+sd))
# Default plot
f + geom_bar(aes(fill = supp), stat = "identity",
             position = "dodge") + 
  geom_errorbar(aes(color = supp),  position = "dodge")

To customize the plot, the following arguments can be used: alpha, color, linetype, size and width. Learn more here: ggplot2 error bars.

  • Key functions: geom_errorbar(), stat_summary()

geom_errorbarh(): Horizontal error bars

We’ll use the data set named df2, which holds the mean and the SD of tooth length (len) by groups (dose):

df2 <- data_summary(ToothGrowth, varname="len", grps = "dose")
head(df2)
##   dose    len       sd
## 1  0.5 10.605 4.499763
## 2    1 19.735 4.415436
## 3    2 26.100 3.774150

We start by creating a plot, named f, that we’ll finish next by adding a layer.

f <- ggplot(df2, aes(x = len, y = dose ,
                     xmin=len-sd, xmax=len+sd))

The arguments xmin and xmax are used for horizontal error bars.
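
A minimal sketch of how the plot f might be finished is given below. It rebuilds the summary table from the values printed above so that it runs on its own; the object name p_h, the added geom_point() layer and the height value are illustrative choices, not part of the original tutorial. Depending on your ggplot2 version, geom_errorbarh() may emit a deprecation notice.

```r
library(ggplot2)

# Summary table copied from the df2 output above (mean and SD of len by dose)
df2 <- data.frame(dose = factor(c("0.5", "1", "2")),
                  len  = c(10.605, 19.735, 26.100),
                  sd   = c(4.499763, 4.415436, 3.774150))

f <- ggplot(df2, aes(x = len, y = dose,
                     xmin = len - sd, xmax = len + sd))

# Horizontal error bars, plus a point marking each group mean
# (p_h is a hypothetical name used only in this example)
p_h <- f + geom_errorbarh(height = 0.2) + geom_point()
p_h
```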

To customize the plot, the following arguments can be used: alpha, color, linetype, size and height.

  • Key functions: geom_errorbarh()

geom_linerange() and geom_pointrange(): An interval represented by a vertical line

  • geom_linerange(): Add an interval represented by a vertical line
  • geom_pointrange(): Add an interval represented by a vertical line with a point in the middle

We’ll use the data set df2.

f <- ggplot(df2, aes(x = dose, y = len,
                     ymin=len-sd, ymax=len+sd))
# Line range
f + geom_linerange()
# Point range
f + geom_pointrange()

To customize the plot, the following arguments can be used: alpha, color, linetype, size, shape and fill (for geom_pointrange()).

Combine geom_dotplot and error bars

It’s also possible to combine geom_dotplot() and error bars. We’ll use the ToothGrowth data set. You don’t need to compute the mean and SD. This can be done automatically by using the function stat_summary() in combination with the argument fun.data = "mean_sdl".

We start by creating a dot plot, named g, that we’ll finish in the next section by adding error bar layers.

g <- ggplot(df, aes(x=dose, y=len)) + 
  geom_dotplot(binaxis='y', stackdir='center')
# use geom_crossbar()
g + stat_summary(fun.data="mean_sdl", fun.args = list(mult=1), 
                 geom="crossbar", width=0.5)
# Use geom_errorbar()
g + stat_summary(fun.data=mean_sdl, fun.args = list(mult=1), 
        geom="errorbar", color="red", width=0.2) +
  stat_summary(fun=mean, geom="point", color="red") # 'fun' replaces the deprecated 'fun.y' (ggplot2 >= 3.3.0)
   
# Use geom_pointrange()
g + stat_summary(fun.data=mean_sdl, fun.args = list(mult=1), 
                 geom="pointrange", color="red")
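
Note that mean_sdl relies on the Hmisc package being installed. If that is not an option, a small custom summary function can be passed to stat_summary() instead. The sketch below assumes a hypothetical helper named mean_sd (not part of ggplot2) and uses mean +/- 1 SD, matching mult = 1 above.

```r
library(ggplot2)

# Hypothetical helper (not part of ggplot2): returns the mean and mean +/- 1 SD
# in the y / ymin / ymax layout that stat_summary(fun.data = ...) expects.
mean_sd <- function(x) {
  m <- mean(x)
  s <- sd(x)
  data.frame(y = m, ymin = m - s, ymax = m + s)
}

# Work on a copy so the original ToothGrowth data set is left untouched
tg <- ToothGrowth
tg$dose <- as.factor(tg$dose)

# Dot plot with a red point range (mean +/- SD) for each dose
g2 <- ggplot(tg, aes(x = dose, y = len)) +
  geom_dotplot(binaxis = "y", stackdir = "center") +
  stat_summary(fun.data = mean_sd, geom = "pointrange", color = "red")
g2
```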

To customize the plot, the following arguments can be used: alpha, color, fill, linetype and size. Learn more here: ggplot2 error bars.

  • Key functions: geom_errorbarh(), geom_errorbar(), geom_linerange(), geom_pointrange(), geom_crossbar(), stat_summary()

Two variables: Maps

The function geom_map() can be used to create a map with ggplot2. The R package maps is required. It contains geographical data that makes it easy to draw maps in ggplot2.

Install the maps package (if you don’t have it):

install.packages("maps")

In the following R code, we’ll create a USA map and use the USArrests crime data to shade each state.

# Prepare the data
crimes <- data.frame(state = tolower(rownames(USArrests)), 
                     USArrests)
library(reshape2) # for melt
crimesm <- melt(crimes, id = 1)
# Get map data
require(maps) 
map_data <- map_data("state")
# Plot the map with Murder data
ggplot(crimes, aes(map_id = state)) + 
  geom_map(aes(fill = Murder), map = map_data) + 
  expand_limits(x = map_data$long, y = map_data$lat)

To customize the plot, the following arguments can be used: alpha, color, fill, linetype and size. Learn more here: ggplot2 map.

  • Key function: geom_map()

Three variables

The mtcars data set will be used. We first compute a correlation matrix, which will be visualized using specific ggplot2 functions.

Prepare the data:

df <- mtcars[, c(1,3,4,5,6,7)]
# Correlation matrix
cormat <- round(cor(df),2)
# Melt the correlation matrix
require(reshape2)
cormat <- melt(cormat)
head(cormat)
##   Var1 Var2 value
## 1  mpg  mpg  1.00
## 2 disp  mpg -0.85
## 3   hp  mpg -0.78
## 4 drat  mpg  0.68
## 5   wt  mpg -0.87
## 6 qsec  mpg  0.42

We start by creating a plot, named g, that we’ll finish in the next section by adding a layer.

g <- ggplot(cormat, aes(x = Var1, y = Var2))

Possible layers include:

  • geom_tile(): Tile plane with rectangles (similar to levelplot and image)
  • geom_raster(): High-performance rectangular tiling. This is a special case of geom_tile where all tiles are the same size.
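
Because every cell of a correlation matrix has the same size, geom_raster() can be used in the same way as geom_tile(). The sketch below is self-contained: it reshapes the matrix with base R (as.table() followed by as.data.frame()) instead of reshape2, which is a substitution made here only to avoid the extra dependency. The names long and g_raster are illustrative.

```r
library(ggplot2)

# Correlation matrix for the same six mtcars columns used above
cormat <- round(cor(mtcars[, c(1, 3, 4, 5, 6, 7)]), 2)

# Base R reshape to long format: columns Var1, Var2 and Freq
long <- as.data.frame(as.table(cormat))
names(long)[3] <- "value"   # rename Freq to match the melted data above

# One equally sized raster cell per pair of variables, filled by correlation
g_raster <- ggplot(long, aes(x = Var1, y = Var2)) +
  geom_raster(aes(fill = value))
g_raster
```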


We’ll use the function geom_tile() to visualize a correlation matrix.

Compute and visualize correlation matrix:

# 1. Compute correlation
cormat <- round(cor(df),2)
# 2. Reorder the correlation matrix by 
# Hierarchical clustering
hc <- hclust(as.dist(1-cormat)/2)
cormat.ord <- cormat[hc$order, hc$order]
# 3. Get the upper triangle
cormat.ord[lower.tri(cormat.ord)]<- NA
# 4. Melt the correlation matrix
require(reshape2)
melted_cormat <- melt(cormat.ord, na.rm = TRUE)
# Create the heatmap
ggplot(melted_cormat, aes(Var2, Var1, fill = value))+
  geom_tile(color = "white")+
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
   midpoint = 0, limits = c(-1,1), space = "Lab",
   name="Pearson\nCorrelation") + # Change gradient color
  theme_minimal()+ # minimal theme
 theme(axis.text.x = element_text(angle = 45, vjust = 1, 
                                  size = 12, hjust = 1))+
 coord_fixed()

To customize the plot, the following arguments can be used: alpha, color, fill, linetype and size. Learn more here: ggplot2 correlation matrix heatmap.

  • Key functions: geom_tile(), geom_raster()

Other types of graphs

Graphical primitives: polygon, path, ribbon, segment, rectangle

This section describes how to add graphical elements to a plot. The functions below will be used:


  • geom_polygon(): Add polygon, a filled path
  • geom_path(): Connect observations in original order
  • geom_ribbon(): Add ribbons, y range with continuous x values.
  • geom_segment(): Add a single line segment
  • geom_curve(): Add curves
  • geom_rect(): Add 2d rectangles


  1. The R code below draws France map using geom_polygon():
require(maps)
france = map_data('world', region = 'France')
ggplot(france, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = 'white', colour = 'black')

To customize the plot, the following arguments can be used: alpha, color, fill, linetype and size.

  2. The following R code uses the economics data set [in ggplot2] and produces a path, a ribbon and a rectangle.
h <- ggplot(economics, aes(date, unemploy))
# Path
h + geom_path()
# Ribbon
h + geom_ribbon(aes(ymin = unemploy-900, ymax = unemploy+900),
                fill = "steelblue") +
  geom_path(size = 0.8)
# Rectangle
h + geom_rect(aes(xmin = as.Date('1980-01-01'), ymin = -Inf, 
                 xmax = as.Date('1985-01-01'), ymax = Inf),
             fill = "steelblue") +
  geom_path(size = 0.8) 

To customize the plot, the following arguments can be used: alpha, color, fill (for ribbon only), linetype and size.

  3. Add line segments between points (x1, y1) and (x2, y2):
# Create a scatter plot
i <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
# Add segment
i + geom_segment(aes(x = 2, y = 15, xend = 3, yend = 15))
# Add arrow
require(grid)
i + geom_segment(aes(x = 5, y = 30, xend = 3.5, yend = 25),
                  arrow = arrow(length = unit(0.5, "cm")))

To customize the plot, the following arguments can be used: alpha, color, linetype and size. Learn more here: ggplot2 add line segment.

  4. Add curves between points (x1, y1) and (x2, y2):
i + geom_curve(aes(x = 2, y = 15, xend = 3, yend = 15))

  • Key functions: geom_path(), geom_ribbon(), geom_rect(), geom_segment()

Graphical parameters

Main title, axis labels and legend title

We start by creating a box plot using the data set ToothGrowth:

# Convert the variable dose from numeric to factor variable
ToothGrowth$dose <- as.factor(ToothGrowth$dose)
p <- ggplot(ToothGrowth, aes(x=dose, y=len)) + geom_boxplot()

The functions below can be used for changing titles and labels:


  • p + ggtitle("New main title"): Adds a main title above the plot
  • p + xlab("New X axis label"): Changes the X axis label
  • p + ylab("New Y axis label"): Changes the Y axis label
  • p + labs(title = "New main title", x = "New X axis label", y = "New Y axis label"): Changes main title and axis labels


The function labs() can also be used to change the legend title.

  1. Change main title and axis labels
# Default plot
print(p)
# Change title and axis labels
p <- p +labs(title="Plot of length \n by dose",
        x ="Dose (mg)", y = "Teeth length")
p

Note that \n is used to split a long title into multiple lines.

  2. Change the appearance of labels:

To change the appearance (color, size and face) of labels, the functions theme() and element_text() can be used.

The function element_blank() hides the labels.

# Change the appearance of labels
p + theme(
plot.title = element_text(color="red", size=14, face="bold.italic"),
axis.title.x = element_text(color="blue", size=14, face="bold"),
axis.title.y = element_text(color="#993333", size=14, face="bold")
)
# Hide labels
p + theme(plot.title = element_blank(), 
          axis.title.x = element_blank(),
          axis.title.y = element_blank())

  3. Change legend titles: Scale functions (fill, color, size, shape, …) are used to update legend titles.
# Default plot
p <- ggplot(ToothGrowth, aes(x=dose, y=len, fill=dose))+
  geom_boxplot()
p
# Modify legend titles
p + labs(fill = "Dose (mg)")

Learn more here: ggplot2 title: main, axis and legend titles.

Legend position and appearance

  1. Create a box plot
# Convert the variable dose from numeric to factor variable
ToothGrowth$dose <- as.factor(ToothGrowth$dose)
p <- ggplot(ToothGrowth, aes(x=dose, y=len, fill=dose))+
  geom_boxplot()
  2. Change legend position and appearance
# Change legend position: "left","top", "right", "bottom", "none"
p + theme(legend.position="top")
# Remove legends
p + theme(legend.position = "none")
# Change the appearance of legend title and labels
p + theme(legend.title = element_text(colour="blue"),
          legend.text = element_text(colour="red"))
# Change legend box background color
p + theme(legend.background = element_rect(fill="lightblue"))

  3. Customize legends using scale functions
    • Change the order of legend items: scale_x_discrete()
    • Set legend title and labels: scale_fill_discrete()
# Change the order of legend items
p + scale_x_discrete(limits=c("2", "0.5", "1"))
# Set legend title and labels
p + scale_fill_discrete(name = "Dose", labels = c("A", "B", "C"))
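
Note that scale_x_discrete(limits = ...) reorders the groups along the x axis. If the goal is to reorder the keys of the fill legend itself, one option is the breaks argument of the fill scale. The following is a minimal sketch under that assumption; the names tg and p_legend are illustrative.

```r
library(ggplot2)

# Work on a copy so the original ToothGrowth data set is left untouched
tg <- ToothGrowth
tg$dose <- as.factor(tg$dose)

p <- ggplot(tg, aes(x = dose, y = len, fill = dose)) +
  geom_boxplot()

# Legend keys appear in the order given by breaks; the x axis order is unchanged
p_legend <- p + scale_fill_discrete(breaks = c("2", "0.5", "1"))
p_legend
```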

Learn more here: ggplot2 legend position and appearance.

Change colors automatically and manually

ToothGrowth and mtcars data sets are used in the examples below.

# Convert dose and cyl columns from numeric to factor variables
ToothGrowth$dose <- as.factor(ToothGrowth$dose)
mtcars$cyl <- as.factor(mtcars$cyl)

We start by creating some plots which will be finished hereafter:

# Box plot
bp <- ggplot(ToothGrowth, aes(x=dose, y=len))
# Scatter plot
sp <- ggplot(mtcars, aes(x=wt, y=mpg))
  1. Draw plots: change fill and outline colors
# box plot
bp + geom_boxplot(fill='steelblue', color="red")
# scatter plot
sp + geom_point(color='darkblue')

  2. Change color by groups (levels of dose for the box plot, cyl for the scatter plot)
# Box plot
bp <- bp + geom_boxplot(aes(fill = dose))
bp
# Scatter plot
sp <- sp + geom_point(aes(color = cyl))
sp

  3. Change colors manually:
  • scale_fill_manual() for box plot, bar plot, violin plot, etc
  • scale_color_manual() for lines and points
# Box plot
bp + scale_fill_manual(values=c("#999999", "#E69F00", "#56B4E9"))
# Scatter plot
sp + scale_color_manual(values=c("#999999", "#E69F00", "#56B4E9"))

  4. Use RColorBrewer palettes: (Read more about RColorBrewer: color in R)
  • scale_fill_brewer() for box plot, bar plot, violin plot, etc
  • scale_color_brewer() for lines and points
# Box plot
bp + scale_fill_brewer(palette="Dark2")
# Scatter plot
sp + scale_color_brewer(palette="Dark2")

Available color palettes in the RColorBrewer package:

RColorBrewer palettes

  5. Use gray colors:
  • scale_fill_grey() for box plot, bar plot, violin plot, etc
  • scale_colour_grey() for points, lines, etc
# Box plot
bp + scale_fill_grey() + theme_classic()
# Scatter plot
sp + scale_color_grey() + theme_classic()

  6. Gradient or continuous colors:

Plots can be colored according to the values of a continuous variable using the functions :

  • scale_color_gradient(), scale_fill_gradient() for sequential gradients between two colors
  • scale_color_gradient2(), scale_fill_gradient2() for diverging gradients
  • scale_color_gradientn(), scale_fill_gradientn() for gradient between n colors

Gradient colors for scatter plots: The graphs are colored using the qsec continuous variable :

# Color by qsec values
sp2<-ggplot(mtcars, aes(x=wt, y=mpg)) +
  geom_point(aes(color = qsec))
sp2
# Change the low and high colors
# Sequential color scheme
sp2+scale_color_gradient(low="blue", high="red")
# Diverging color scheme
mid<-mean(mtcars$qsec)
sp2+scale_color_gradient2(midpoint=mid, low="blue", mid="white",
                          high="red", space = "Lab" )
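
The third family listed above, scale_color_gradientn(), is not shown in the code. A minimal sketch follows; the five-color rainbow() palette and the name sp2_n are arbitrary choices made for illustration.

```r
library(ggplot2)

# Scatter plot colored by the continuous variable qsec
sp2 <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(color = qsec))

# Gradient interpolated between n colors (here n = 5, an arbitrary palette)
sp2_n <- sp2 + scale_color_gradientn(colours = rainbow(5))
sp2_n
```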

Learn more here: ggplot2 colors.

Point shapes, colors and size

The different points shapes commonly used in R are shown in the image below:

r point shape

mtcars data is used in the following examples.

# Convert cyl as factor variable
mtcars$cyl <- as.factor(mtcars$cyl)

Create a scatter plot and change point shapes, colors and size:

# Basic scatter plot
ggplot(mtcars, aes(x=wt, y=mpg)) +
  geom_point(shape = 18, color = "steelblue", size = 4)
# Change point shapes and colors by groups
ggplot(mtcars, aes(x=wt, y=mpg)) +
  geom_point(aes(shape = cyl, color = cyl))

It’s also possible to manually change the appearance of points:

  • scale_shape_manual() : to change point shapes
  • scale_color_manual() : to change point colors
  • scale_size_manual() : to change the size of points
# Change colors and shapes manually
ggplot(mtcars, aes(x=wt, y=mpg, group=cyl)) +
  geom_point(aes(shape=cyl, color=cyl), size=2)+
  scale_shape_manual(values=c(3, 16, 17))+
  scale_color_manual(values=c('#999999','#E69F00', '#56B4E9'))+
  theme(legend.position="top")

Learn more here: ggplot2 point shapes, colors and size.

Add text annotations to a graph

There are three important functions for adding text to a plot:

  • geom_text(): Textual annotations taken from the data (one label per row)
  • annotate(): Textual annotations placed at fixed coordinates, independent of the data
  • annotation_custom(): Static annotations that are the same in every panel. These annotations are not affected by the plot scales.

A subset of mtcars data is used:

set.seed(1234)
df <- mtcars[sample(1:nrow(mtcars), 10), ]
df$cyl <- as.factor(df$cyl)

Scatter plots with textual annotations:

# Scatter plot
sp <- ggplot(df, aes(x=wt, y=mpg))+ geom_point() 
# Add text, change colors by groups
sp + geom_text(aes(label = rownames(df), color = cyl),
               size = 3, vjust = -1)
# Add text at a particular coordinate
sp + geom_text(x = 3, y = 30, label = "Scatter plot",
              color="red")
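
When a label belongs at a single fixed coordinate, annotate() is often a better fit than geom_text(), because geom_text() with constant x and y draws the same label once for every row of the data. A minimal sketch (the name sp_annot is illustrative):

```r
library(ggplot2)

# Same 10-row subset of mtcars as above
set.seed(1234)
df <- mtcars[sample(1:nrow(mtcars), 10), ]
df$cyl <- as.factor(df$cyl)

sp <- ggplot(df, aes(x = wt, y = mpg)) + geom_point()

# A single text label at (3, 30), not tied to any row of df
sp_annot <- sp + annotate("text", x = 3, y = 30,
                          label = "Scatter plot", color = "red")
sp_annot
```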

Learn more here: ggplot2 text: Add text annotations to a graph.

Line types

The different line types available in R are: "blank", "solid", "dashed", "dotted", "dotdash", "longdash", "twodash".

Note that line types can also be specified using numbers: 0, 1, 2, 3, 4, 5, 6. 0 is for "blank", 1 is for "solid", 2 is for "dashed", and so on.

A graph of the different line types is shown below :
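
Such a graph can be reproduced with a few lines of ggplot2. The sketch below is one possible way to draw it, not the code used for the original figure; the data frame d and the name p_lt are illustrative.

```r
library(ggplot2)

# The seven line type names, in the order of their numeric codes 0..6
lt <- c("blank", "solid", "dashed", "dotted", "dotdash", "longdash", "twodash")
d <- data.frame(code = 0:6, name = lt, stringsAsFactors = FALSE)

# One horizontal segment per line type; scale_linetype_identity() tells
# ggplot2 to use the values in 'name' directly as line types.
# The "blank" row (code 0) is intentionally invisible.
p_lt <- ggplot(d) +
  geom_segment(aes(x = 0, xend = 1, y = code, yend = code, linetype = name)) +
  scale_linetype_identity() +
  scale_y_continuous(breaks = d$code, labels = paste(d$code, d$name)) +
  labs(x = NULL, y = NULL)
p_lt
```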

  1. Basic line plot
# Create some data
df <- data.frame(time=c("breakfast", "Lunch", "Dinner"),
                bill=c(10, 30, 15))
head(df)
##        time bill
## 1 breakfast   10
## 2     Lunch   30
## 3    Dinner   15
# Basic line plot with points
# Change the line type
ggplot(data=df, aes(x=time, y=bill, group=1)) +
  geom_line(linetype = "dashed")+
  geom_point()

  2. Line plots with multiple groups
# Create some data
df2 <- data.frame(sex = rep(c("Female", "Male"), each=3),
                  time=c("breakfast", "Lunch", "Dinner"),
                  bill=c(10, 30, 15, 13, 40, 17) )
head(df2)
##      sex      time bill
## 1 Female breakfast   10
## 2 Female     Lunch   30
## 3 Female    Dinner   15
## 4   Male breakfast   13
## 5   Male     Lunch   40
## 6   Male    Dinner   17
# Line plot with multiple groups
# Change line types and colors by groups (sex)
ggplot(df2, aes(x=time, y=bill, group=sex)) +
  geom_line(aes(linetype = sex, color = sex))+
  geom_point(aes(color=sex))+
  theme(legend.position="top")

The functions below can be used to change the appearance of line types manually:

  • scale_linetype_manual() : to change line types
  • scale_color_manual() : to change line colors
  • scale_size_manual() : to change the size of lines
# Change line types, colors and sizes
ggplot(df2, aes(x=time, y=bill, group=sex)) +
  geom_line(aes(linetype=sex, color=sex, size=sex))+
  geom_point()+
  scale_linetype_manual(values=c("twodash", "dotted"))+
  scale_color_manual(values=c('#999999','#E69F00'))+
  scale_size_manual(values=c(1, 1.5))+
  theme(legend.position="top")

Learn more here: ggplot2 line types.

Themes and background colors

ToothGrowth data is used :

# Convert the column dose from numeric to factor variable
ToothGrowth$dose <- as.factor(ToothGrowth$dose)
  1. Create a box plot
p <- ggplot(ToothGrowth, aes(x=dose, y=len)) + 
  geom_boxplot()
  2. Change plot themes

Several functions are available in the ggplot2 package for quickly changing the theme of plots:

  • theme_gray(): gray background color and white grid lines
  • theme_bw() : white background and gray grid lines
p + theme_gray(base_size = 14)
p + theme_bw()

  • theme_linedraw(): black lines around the plot
  • theme_light(): light gray lines and axis (more attention towards the data)
p + theme_linedraw()
p + theme_light()

  • theme_minimal(): no background annotations
  • theme_classic(): theme with axis lines and no grid lines
p + theme_minimal()
p + theme_classic()

Learn more here: ggplot2 themes and background colors.

Axis limits: Minimum and Maximum values

Create a plot:

p <- ggplot(cars, aes(x = speed, y = dist)) + geom_point()

Different functions are available for setting axis limits:


  1. Without clipping (preferred):
    • p + coord_cartesian(xlim = c(5, 20), ylim = c(0, 50)): Cartesian coordinates. The Cartesian coordinate system is the most common type of coordinate system. It will zoom the plot (like you’re looking at it with a magnifying glass), without clipping the data.
  2. With clipping the data (removes unseen data points): Observations not in this range will be dropped completely and not passed to any other layers.
    • p + xlim(5, 20) + ylim(0, 50)
    • p + scale_x_continuous(limits = c(5, 20)) + scale_y_continuous(limits = c(0, 50))
  3. Expand the plot limits with data: expand_limits() is a thin wrapper around geom_blank() that makes it easy to add data to a plot.
    • p + expand_limits(x = 0, y = 0): set the intercept of x and y axes at (0,0)
    • p + expand_limits(x = c(5, 50), y = c(0, 150))


# Default plot
print(p)
# Change axis limits using coord_cartesian()
p + coord_cartesian(xlim =c(5, 20), ylim = c(0, 50))
# Use xlim() and ylim()
p + xlim(5, 20) + ylim(0, 50)
# Expand limits
p + expand_limits(x = c(5, 50), y = c(0, 150))
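
The difference between zooming and clipping matters most when a layer computes a statistic from the data. The sketch below is only an illustration: the geom_smooth() layer and the names zoomed, clipped, inside and n_dropped are additions that do not appear in the original example.

```r
library(ggplot2)

# Scatter plot with a linear fit, so that clipping has a visible effect
p <- ggplot(cars, aes(x = speed, y = dist)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Zoom only: the fit is still computed from all observations
zoomed <- p + coord_cartesian(xlim = c(5, 20), ylim = c(0, 50))

# Clip: observations outside the limits are dropped before the fit
clipped <- p + xlim(5, 20) + ylim(0, 50)

# Count the observations that fall outside the limits
inside <- with(cars, speed >= 5 & speed <= 20 & dist >= 0 & dist <= 50)
n_dropped <- sum(!inside)
n_dropped
```

Printing zoomed and clipped side by side should show two different regression lines, because the clipped version is fitted without the n_dropped observations.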

Learn more here: ggplot2 axis limits.

Note that date axis limits can be set using the functions scale_x_date() and scale_y_date(). Read more here: ggplot2 date axis.

Axis transformations: log and sqrt scales

  1. Create a scatter plot:
p <- ggplot(cars, aes(x = speed, y = dist)) + geom_point()
  2. ggplot2 functions for continuous axis transformations:

  • p + scale_x_log10(), p + scale_y_log10() : Plot x and y on log10 scale, respectively.

  • p + scale_x_sqrt(), p + scale_y_sqrt() : Plot x and y on square root scale, respectively.

  • p + scale_x_reverse(), p + scale_y_reverse() : Reverse direction of axes

  • p + coord_trans(x = "log10", y = "log10") : transformed cartesian coordinate system. Possible values for x and y are "log2", "log10", "sqrt", …

  • p + scale_x_continuous(trans = "log2"), p + scale_y_continuous(trans = "log2") : another allowed value for the argument trans is "log10"


  3. The R code below uses the function scale_xx_continuous() to transform axis scales:
# Default scatter plot
print(p)
# Log transformation using scale_xx()
# possible values for trans : 'log2', 'log10','sqrt'
p + scale_x_continuous(trans='log2') +
  scale_y_continuous(trans='log2')
# Format axis tick mark labels
require(scales)
p + scale_y_continuous(trans = log2_trans(),
    breaks = trans_breaks("log2", function(x) 2^x),
    labels = trans_format("log2", math_format(2^.x)))
# Reverse coordinates
p + scale_y_reverse() 
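
One practical difference between the two approaches is worth a small illustration: a scale transformation is applied before statistics are computed, whereas coord_trans() is applied afterwards. The following is a minimal sketch; the geom_smooth() layer and the names by_scale and by_coord are illustrative additions, not part of the original example.

```r
library(ggplot2)

p <- ggplot(cars, aes(x = speed, y = dist)) + geom_point()

# Scale transformation: speed is log-transformed first, then the
# linear model is fitted, so the fitted line is straight on the plot
by_scale <- p +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  scale_x_log10()

# Coordinate transformation: the model is fitted on the raw data,
# then the x axis is transformed, so the fitted line appears curved
by_coord <- p +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  coord_trans(x = "log10")

by_scale
by_coord
```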

Learn more here: ggplot2 axis limits.

Axis ticks: customize tick marks and labels, reorder and select items

  1. Functions for changing the style of axis tick mark labels:

  • element_text(face, color, size, angle): change text style
  • element_blank(): Hide text


  2. Create a box plot:
p <- ggplot(ToothGrowth, aes(x=dose, y=len)) + geom_boxplot()
# print(p)
  3. Change the style and the orientation angle of axis tick labels
# Change the style of axis tick labels
# face can be "plain", "italic", "bold" or "bold.italic"
p + theme(axis.text.x = element_text(face="bold", color="#993333", 
                           size=14, angle=45),
          axis.text.y = element_text(face="bold", color="blue", 
                           size=14, angle=45))
# Remove axis ticks and tick mark labels
p + theme(
  axis.text.x = element_blank(), # Remove x axis tick labels
  axis.text.y = element_blank(), # Remove y axis tick labels
  axis.ticks = element_blank()) # Remove ticks

  4. Customize continuous and discrete axes:
  • Discrete axes
    • scale_x_discrete(name, breaks, labels, limits): for X axis
    • scale_y_discrete(name, breaks, labels, limits): for y axis
  • Continuous axes
    • scale_x_continuous(name, breaks, labels, limits, trans): for X axis
    • scale_y_continuous(name, breaks, labels, limits, trans): for y axis

Briefly, the arguments have the following meaning:


  • name : x or y axis labels
  • breaks : vector specifying which breaks to display
  • labels : labels of axis tick marks
  • limits : vector indicating the data range
(Read more here: Set axis ticks for discrete and continuous axes)


scale_xx() functions can be used to change the following x or y axis parameters :

  • axis titles
  • axis limits (data range to display)
  • choose where tick marks appear
  • manually label tick marks

4.1. Discrete axes:

# Change x axis label and the order of items
p + scale_x_discrete(name ="Dose (mg)", 
                    limits=c("2","1","0.5"))
# Change tick mark labels
p + scale_x_discrete(breaks=c("0.5","1","2"),
        labels=c("Dose 0.5", "Dose 1", "Dose 2"))
# Choose which items to display
p + scale_x_discrete(limits=c("0.5", "2"))

4.2. Continuous axes:

# Default scatter plot
# +++++++++++++++++
sp <- ggplot(cars, aes(x = speed, y = dist)) + geom_point()
sp
# Customize the plot
#+++++++++++++++++++++
# 1. Change x and y axis labels, and limits
sp <- sp + scale_x_continuous(name="Speed of cars", limits=c(0, 30)) +
  scale_y_continuous(name="Stopping distance", limits=c(0, 150))
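# 2. Set tick marks on y axis: a tick mark is shown on every 50
# Note: adding a second scale_y_continuous() replaces the one set in step 1,
# so the name and the limits are repeated here
sp + scale_y_continuous(name="Stopping distance", limits=c(0, 150),
                        breaks=seq(0, 150, 50))
# Format the labels
# +++++++++++++++++
library(scales)
sp + scale_y_continuous(labels = percent) # labels as percents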

Learn more here: ggplot2 Axis ticks: tick marks and labels.

Add straight lines to a plot: horizontal, vertical and regression lines

The R functions below can be used:

  • geom_hline(yintercept, linetype, color, size): for horizontal lines
  • geom_vline(xintercept, linetype, color, size): for vertical lines
  • geom_abline(intercept, slope, linetype, color, size): for regression lines
  • geom_segment() to add segments
  1. Create a simple scatter plot
# Simple scatter plot
sp <- ggplot(data=mtcars, aes(x=wt, y=mpg)) + geom_point()
  2. Add straight lines
# Add horizontal line at y = 20; change line type and color
sp + geom_hline(yintercept=20, linetype="dashed", color = "red")
# Add vertical line at x = 3; change line type, color and size
sp + geom_vline(xintercept = 3, color = "blue", size=1.5)
# Add regression line
sp + geom_abline(intercept = 37, slope = -5, color="blue")+
  ggtitle("y = -5X + 37")
# Add horizontal line segment
sp + geom_segment(aes(x = 2, y = 15, xend = 3, yend = 15))
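
The intercept and slope passed to geom_abline() above are rounded by hand. As a sketch of how they could be obtained from the data instead, the code below fits a simple linear model with lm(); the names fit, intercept and slope are illustrative and not part of the original example.

```r
library(ggplot2)

# Fit a simple linear regression of mpg on wt
fit <- lm(mpg ~ wt, data = mtcars)
intercept <- coef(fit)[["(Intercept)"]]
slope <- coef(fit)[["wt"]]

sp <- ggplot(data = mtcars, aes(x = wt, y = mpg)) + geom_point()

# Draw the fitted regression line from its coefficients
sp + geom_abline(intercept = intercept, slope = slope, color = "blue")

# Alternatively, let ggplot2 fit and draw the same line
sp + geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
```

The fitted coefficients are close to the rounded values 37 and -5 used in the title above.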

Learn more here: ggplot2 add straight lines to a plot.

Rotate a plot: flip and reverse

  • coord_flip(): Create horizontal plots
  • scale_x_reverse(), scale_y_reverse(): Reverse the axes
set.seed(1234)
# Basic histogram
hp <- qplot(x=rnorm(200), geom="histogram")
hp
# Horizontal histogram
hp + coord_flip()
# Y axis reversed
hp + scale_y_reverse()

Learn more here: ggplot2 rotate a graph.

Faceting: split a plot into a matrix of panels

Facets divide a plot into subplots based on the values of one or more categorical variables.

There are two main functions for faceting :

  • facet_grid()
  • facet_wrap()

Create a box plot filled by groups:

p <- ggplot(ToothGrowth, aes(x=dose, y=len, group=dose)) + 
  geom_boxplot(aes(fill=dose))
p

The following functions can be used for facets:


  • p + facet_grid(supp ~ .): Facet in vertical direction based on the levels of supp variable.

  • p + facet_grid(. ~ supp): Facet in horizontal direction based on the levels of supp variable.

  • p + facet_grid(dose ~ supp): Facet in horizontal and vertical directions based on two variables: dose and supp.

  • p + facet_wrap(~ supp): Place facets side by side in a rectangular layout


  1. Facet with one discrete variable: Split by the levels of the group “supp”
# Split in vertical direction
p + facet_grid(supp ~ .)
# Split in horizontal direction
p + facet_grid(. ~ supp)

  2. Facet with two discrete variables: Split by the levels of the groups “dose” and “supp”
# Facet by two variables: dose and supp.
# Rows are dose and columns are supp
p + facet_grid(dose ~ supp)
# Facet by two variables: reverse the order of the 2 variables
# Rows are supp and columns are dose
p + facet_grid(supp ~ dose)

By default, all the panels have the same scales (scales = "fixed"). They can be made independent by setting scales to "free", "free_x", or "free_y".

p + facet_grid(dose ~ supp, scales='free')
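
The examples above all use facet_grid(). For completeness, here is a minimal sketch of facet_wrap() on the same data; the data frame name tg is an illustrative choice, used so that the built-in ToothGrowth data set is left unchanged.

```r
library(ggplot2)

# Work on a copy with dose converted to a factor
tg <- transform(ToothGrowth, dose = factor(dose))

p <- ggplot(tg, aes(x = dose, y = len, group = dose)) +
  geom_boxplot(aes(fill = dose))

# Wrap the panels of one variable into a single column
p + facet_wrap(~ supp, ncol = 1)

# Free y scales: each panel gets its own y axis range
p + facet_wrap(~ supp, scales = "free_y")
```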

Learn more here: ggplot2 facet : split a plot into a matrix of panels.

Position adjustments

Position adjustments determine how to arrange geoms. The argument position is used to adjust geom positions:

p <- ggplot(mpg, aes(fl, fill = drv))
# Arrange elements side by side
p + geom_bar(position = "dodge")
# Stack objects on top of one another, 
# and normalize to have equal height
p + geom_bar(position = "fill")

# Stack elements on top of one another
p + geom_bar(position = "stack")
# Add random noise to X and Y position 
# of each element to avoid overplotting
ggplot(mpg, aes(cty, hwy)) + 
  geom_point(position = "jitter")

Note that each of these position adjustments is also available as a function, and some of them accept arguments for fine-tuning:

  • position_dodge(width)
  • position_fill()
  • position_stack()
  • position_jitter(width, height)
p + geom_bar(position = position_dodge(width = 1))
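
As a further sketch, the function forms make the amount of dodging or jittering explicit. The widths and heights below are arbitrary values chosen for illustration.

```r
library(ggplot2)

# Dodge bars with an explicit dodging width
ggplot(mpg, aes(fl, fill = drv)) +
  geom_bar(position = position_dodge(width = 0.9))

# Jitter points by a controlled amount in each direction
ggplot(mpg, aes(cty, hwy)) +
  geom_point(position = position_jitter(width = 0.3, height = 0.3))
```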

Learn more here: ggplot2 bar plots.

Coordinate systems

p <- ggplot(mpg, aes(fl)) + geom_bar()

The coordinate systems in ggplot2 are:


  • p + coord_cartesian(xlim = NULL, ylim = NULL): Cartesian coordinate system (default). It’s the most familiar and common type of coordinate system.

  • p + coord_fixed(ratio = 1, xlim = NULL, ylim = NULL): Cartesian coordinates with fixed relationship between x and y scales. The ratio represents the number of units on the y-axis equivalent to one unit on the x-axis. The default, ratio = 1, ensures that one unit on the x-axis is the same length as one unit on the y-axis.

  • p + coord_flip(…): Flipped cartesian coordinates. Useful for creating horizontal plot by rotating.

  • p + coord_polar(theta = "x", start = 0, direction = 1): Polar coordinates. The polar coordinate system is most commonly used for pie charts, which are a stacked bar chart in polar coordinates.

  • p + coord_trans(x, y, limx, limy): Transformed cartesian coordinate system.

  • coord_map(): Map projections. Provides the full range of map projections available in the mapproj package.


  1. Arguments for coord_cartesian(), coord_fixed() and coord_flip()
    • xlim: limits for the x axis
    • ylim: limits for the y axis
    • ratio: aspect ratio, expressed as y/x
    • …: Other arguments passed on to coord_cartesian()
  2. Arguments for coord_polar()
    • theta: variable to map angle to (x or y)
    • start: offset of starting point from 12 o’clock in radians
    • direction: 1, clockwise; -1, anticlockwise
  3. Arguments for coord_trans()
    • x, y: transformers for x and y axes
    • limx, limy: limits for x and y axes.
p + coord_cartesian(ylim = c(0, 200))
p + coord_fixed(ratio = 1/50)
p + coord_flip()

p + coord_polar(theta = "x", direction = 1)
p + coord_trans(y = "sqrt")
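
To illustrate the remark that a pie chart is a stacked bar chart in polar coordinates, here is a minimal sketch. The names bar and pie are illustrative, and factor(1) is simply a dummy x position so that all counts stack into a single bar.

```r
library(ggplot2)

# One stacked bar: a single x position, filled by fuel type
bar <- ggplot(mpg, aes(x = factor(1), fill = fl)) +
  geom_bar(width = 1)
bar

# The same bar in polar coordinates, with the angle mapped to y (the counts)
pie <- bar + coord_polar(theta = "y")
pie
```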

Extensions to ggplot2: R packages and functions

  • factoextra: Extract and visualize the outputs of a multivariate analysis. factoextra provides easy-to-use functions to extract and visualize the output of PCA (Principal Component Analysis), CA (Correspondence Analysis) and MCA (Multiple Correspondence Analysis) functions from several packages (FactoMineR, stats, ade4 and MASS). It also contains many functions for simplifying clustering analysis workflows. The ggplot2 plotting system is used.

  • easyGgplot2: Easily create and customize plots with ggplot2. The idea behind ggplot2 is seductively simple, but the details can be difficult: the syntax for customizing a plot is sometimes a bit opaque, which raises the level of difficulty. The easyGgplot2 package (which depends on ggplot2) makes it quick to create and customize plots, including box plots, dot plots, strip charts, violin plots, histograms, density plots, scatter plots, bar plots and line plots.

  • ggplot2 - Easy way to mix multiple graphs on the same page: The R package gridExtra and cowplot are used.

  • ggplot2: Correlation matrix heatmap

  • ggfortify: Define fortify and autoplot functions to allow ggplot2 to handle some popular R packages. These include plotting 1) Matrix; 2) Linear Model and Generalized Linear Model; 3) Time Series; 4) PCA/Clustering; 5) Survival Curve; 6) Probability distribution

  • GGally: GGally extends ggplot2 by providing several functions including pairwise correlation matrix, scatterplot plot matrix, parallel coordinates plot, survival plot and several functions to plot networks.

  • ggRandomForests: Graphical analysis of random forests with the randomForestSRC and ggplot2 packages.

  • ggdendro: Create dendrograms and tree diagrams using ggplot2

  • ggmcmc: Tools for Analyzing MCMC Simulations from Bayesian Inference

  • ggthemes: Package with additional ggplot2 themes and scales

  • Theme used to create journal ready figures easily

Acknowledgment

Infos

This analysis was performed using R (ver. 3.2.4) and ggplot2 (ver 2.1.0).

ggplot2 density plot : Quick start guide - R software and data visualization


This R tutorial describes how to create a density plot using R software and ggplot2 package.

The function geom_density() is used. You can also add a line for the mean using the function geom_vline.


Prepare the data

This data will be used for the examples below :

set.seed(1234)
df <- data.frame(
  sex=factor(rep(c("F", "M"), each=200)),
  weight=round(c(rnorm(200, mean=55, sd=5),
                 rnorm(200, mean=65, sd=5)))
  )
head(df)
##   sex weight
## 1   F     49
## 2   F     56
## 3   F     60
## 4   F     43
## 5   F     57
## 6   F     58

Basic density plots

library(ggplot2)
# Basic density
p <- ggplot(df, aes(x=weight)) + 
  geom_density()
p
# Add mean line
p+ geom_vline(aes(xintercept=mean(weight)),
            color="blue", linetype="dashed", size=1)

Change density plot line types and colors

# Change line color and fill color
ggplot(df, aes(x=weight))+
  geom_density(color="darkblue", fill="lightblue")
# Change line type
ggplot(df, aes(x=weight))+
  geom_density(linetype="dashed")

Read more on ggplot2 line types : ggplot2 line types

Change density plot colors by groups

Calculate the mean of each group :

library(plyr)
mu <- ddply(df, "sex", summarise, grp.mean=mean(weight))
head(mu)
##   sex grp.mean
## 1   F    54.70
## 2   M    65.36
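
If the plyr package is not available, the same group means can be computed with base R. This is a sketch using aggregate(); it rebuilds the example data so that it runs on its own and renames the result column to grp.mean so that the later plotting code could use it unchanged.

```r
# Rebuild the example data
set.seed(1234)
df <- data.frame(
  sex = factor(rep(c("F", "M"), each = 200)),
  weight = round(c(rnorm(200, mean = 55, sd = 5),
                   rnorm(200, mean = 65, sd = 5)))
)

# Mean weight per group, using base R only
mu <- aggregate(weight ~ sex, data = df, FUN = mean)
names(mu)[2] <- "grp.mean"
mu
```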

Change line colors

Density plot line colors can be automatically controlled by the levels of sex :

# Change density plot line colors by groups
ggplot(df, aes(x=weight, color=sex)) +
  geom_density()
# Add mean lines
p<-ggplot(df, aes(x=weight, color=sex)) +
  geom_density()+
  geom_vline(data=mu, aes(xintercept=grp.mean, color=sex),
             linetype="dashed")
p

It is also possible to change density plot line colors manually using the functions:

  • scale_color_manual() : to use custom colors
  • scale_color_brewer() : to use color palettes from RColorBrewer package
  • scale_color_grey() : to use grey color palettes
# Use custom color palettes
p+scale_color_manual(values=c("#999999", "#E69F00", "#56B4E9"))
# Use brewer color palettes
p+scale_color_brewer(palette="Dark2")
# Use grey scale
p + scale_color_grey() + theme_classic()

Read more on ggplot2 colors here : ggplot2 colors

Change fill colors

Density plot fill colors can be automatically controlled by the levels of sex :

# Change density plot fill colors by groups
ggplot(df, aes(x=weight, fill=sex)) +
  geom_density()
# Use semi-transparent fill
p<-ggplot(df, aes(x=weight, fill=sex)) +
  geom_density(alpha=0.4)
p
# Add mean lines
p+geom_vline(data=mu, aes(xintercept=grp.mean, color=sex),
             linetype="dashed")

It is also possible to change density plot fill colors manually using the functions:

  • scale_fill_manual() : to use custom colors
  • scale_fill_brewer() : to use color palettes from RColorBrewer package
  • scale_fill_grey() : to use grey color palettes
# Use custom color palettes
p+scale_fill_manual(values=c("#999999", "#E69F00", "#56B4E9"))
# use brewer color palettes
p+scale_fill_brewer(palette="Dark2")
# Use grey scale
p + scale_fill_grey() + theme_classic()

Read more on ggplot2 colors here : ggplot2 colors

Change the legend position

p + theme(legend.position="top")
p + theme(legend.position="bottom")
p + theme(legend.position="none") # Remove legend

The allowed values for the argument legend.position are: “left”, “top”, “right”, “bottom” and “none” (to remove the legend).

Read more on ggplot legends : ggplot2 legends

Combine histogram and density plots

  • The histogram is plotted with density instead of count values on y-axis
  • Overlay with transparent density plot
# Histogram with density plot
ggplot(df, aes(x=weight)) + 
 geom_histogram(aes(y=..density..), colour="black", fill="white")+
 geom_density(alpha=.2, fill="#FF6666") 
# Color by groups
ggplot(df, aes(x=weight, color=sex, fill=sex)) + 
 geom_histogram(aes(y=..density..), alpha=0.5, 
                position="identity")+
 geom_density(alpha=.2) 
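
The overlay works because both layers are on the same scale: on the density scale, the histogram bars and the density curve each enclose a total area of about 1. The sketch below checks this with base R on the same simulated weights; the names h, d, hist_area and dens_area are illustrative.

```r
# Rebuild the simulated weights used in the examples
set.seed(1234)
weight <- round(c(rnorm(200, mean = 55, sd = 5),
                  rnorm(200, mean = 65, sd = 5)))

# Histogram on the density scale: bar heights times bin widths sum to 1
h <- hist(weight, plot = FALSE)
hist_area <- sum(h$density * diff(h$breaks))

# Kernel density estimate: approximate the area under the curve
# with a Riemann sum over the equally spaced evaluation grid
d <- density(weight)
dens_area <- sum(d$y) * diff(d$x[1:2])

hist_area
dens_area
```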

Use facets

Split the plot in multiple panels :

p<-ggplot(df, aes(x=weight))+
  geom_density()+facet_grid(sex ~ .)
p
# Add mean lines
# (color is set outside aes() so that it is a fixed color, not a mapping)
p+geom_vline(data=mu, aes(xintercept=grp.mean), color="red",
             linetype="dashed")

Read more on facets : ggplot2 facets

Customized density plots

# Basic density
ggplot(df, aes(x=weight, fill=sex)) +
  geom_density(fill="gray")+
  geom_vline(aes(xintercept=mean(weight)), color="blue",
             linetype="dashed")+
  labs(title="Weight density curve",x="Weight(kg)", y = "Density")+
  theme_classic()
# Change line colors by groups
p<- ggplot(df, aes(x=weight, color=sex)) +
  geom_density()+
  geom_vline(data=mu, aes(xintercept=grp.mean, color=sex),
             linetype="dashed")+
  labs(title="Weight density curve",x="Weight(kg)", y = "Density")
  
p + scale_color_manual(values=c("#999999", "#E69F00", "#56B4E9"))+
  theme_classic()

Change line colors manually :

# Brewer palette "Paired"
p + scale_color_brewer(palette="Paired") + theme_classic()
# Brewer palette "Dark2"
p + scale_color_brewer(palette="Dark2") + theme_minimal()
# Brewer palette "Accent"
p + scale_color_brewer(palette="Accent") + theme_minimal()

Read more on ggplot2 colors here : ggplot2 colors

Infos

This analysis has been performed using R software (ver. 3.1.2) and ggplot2 (ver. 1.0.0)


ggplot2 texts : Add text annotations to a graph in R software


This article describes how to add a text annotation to a plot generated using ggplot2 package.

The functions below can be used :

  • geom_text(): adds text directly to the plot
  • geom_label(): draws a rectangle underneath the text, making it easier to read.
  • annotate(): useful for adding small text annotations at a particular location on the plot
  • annotation_custom(): Adds static annotations that are the same in every panel

It’s also possible to use the R package ggrepel, an extension that provides geoms for ggplot2 to repel overlapping text labels away from each other.

We’ll start by describing how to use ggplot2 official functions for adding text annotations. In the last sections, examples using ggrepel extensions are provided.


Install required packages

# Install ggplot2
install.packages("ggplot2")
# Install ggrepel
install.packages("ggrepel")

Create some data

We’ll use a subset of mtcars data. The function sample() can be used to randomly extract 10 rows:

# Subset 10 rows
set.seed(1234)
ss <- sample(1:32, 10)
df <- mtcars[ss, ]

Text annotations using geom_text and geom_label

library(ggplot2)
# Simple scatter plot
sp <- ggplot(df, aes(wt, mpg, label = rownames(df)))+
  geom_point()
 
# Add texts
sp + geom_text()
# Change the size of the texts
sp + geom_text(size=6)
# Change vertical and horizontal adjustment
sp +  geom_text(hjust=0, vjust=0)
# Change fontface. Allowed values : 1(normal),
# 2(bold), 3(italic), 4(bold.italic)
sp + geom_text(aes(fontface=2))

  • Change font family
sp + geom_text(family = "Times New Roman")
  • geom_label() works like geom_text() but draws a rounded rectangle underneath each label. This is useful when you want to label plots that are dense with data.
sp + geom_label()


Other useful arguments for geom_text() and geom_label() are:

  • nudge_x and nudge_y: let you offset labels from their corresponding points. The function position_nudge() can be also used.
  • check_overlap = TRUE: for avoiding overplotting of labels
  • hjust and vjust can now be character vectors (ggplot2 v >= 2.0.0): “left”, “center”, “right”, “bottom”, “middle”, “top”. New options include “inward” and “outward” which align text towards and away from the center of the plot respectively.
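
The arguments listed above can be sketched as follows. The example rebuilds the 10-row subset of mtcars so that it runs on its own, and the nudge value of 0.8 is an arbitrary choice for illustration.

```r
library(ggplot2)

# Subset 10 rows of mtcars, as in the data preparation step
set.seed(1234)
df <- mtcars[sample(1:32, 10), ]

sp <- ggplot(df, aes(wt, mpg, label = rownames(df))) +
  geom_point()

# Shift labels slightly above their points and
# skip labels that would overlap a previously drawn label
sp + geom_text(nudge_y = 0.8, check_overlap = TRUE)

# Align labels towards the center of the plot
sp + geom_text(hjust = "inward", vjust = "inward")
```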


Change the text color and size by groups

It’s possible to change the appearance of the texts using aesthetics (color, size,…) :

sp2 <- ggplot(mtcars, aes(x=wt, y=mpg, label=rownames(mtcars)))+
  geom_point()
# Color by groups
sp2 + geom_text(aes(color=factor(cyl)))

# Set the size of the text using a continuous variable
sp2 + geom_text(aes(size=wt))

# Define size range
sp2 + geom_text(aes(size=wt)) + scale_size(range=c(3,6))

Add a text annotation at a particular coordinate

The functions geom_text() and annotate() can be used :

# Solution 1
sp2 + geom_text(x=3, y=30, label="Scatter plot")
# Solution 2
sp2 + annotate(geom="text", x=3, y=30, label="Scatter plot",
              color="red")

annotation_custom : Add a static text annotation in the top-right, top-left, …

The functions annotation_custom() and textGrob() are used to add static annotations which are the same in every panel. The grid package is required:

library(grid)
# Create a text
grob <- grobTree(textGrob("Scatter plot", x=0.1,  y=0.95, hjust=0,
  gp=gpar(col="red", fontsize=13, fontface="italic")))
# Plot
sp2 + annotation_custom(grob)

Facet : In the plot below, the annotation is at the same place (in each facet) even if the axis scales vary.

sp2 + annotation_custom(grob)+facet_wrap(~cyl, scales="free")

ggrepel: Avoid overlapping of text labels

There are two important functions in the ggrepel R package:

  • geom_label_repel()
  • geom_text_repel()

Scatter plots with text annotations

We start by creating a simple scatter plot using a subset of the mtcars data set containing 15 rows.

  1. Prepare some data:
# Take a subset of 15 random points
set.seed(1234)
ss <- sample(1:32, 15)
df <- mtcars[ss, ]
  2. Create a scatter plot:
p <- ggplot(df, aes(wt, mpg)) +
  geom_point(color = 'red') +
  theme_classic(base_size = 10)
  3. Add text labels:
# Add text annotations using ggplot2::geom_text
p + geom_text(aes(label = rownames(df)),
              size = 3.5)

# Use ggrepel::geom_text_repel
require("ggrepel")
set.seed(42)
p + geom_text_repel(aes(label = rownames(df)),
                    size = 3.5) 

# Use ggrepel::geom_label_repel and 
# Change color by groups
set.seed(42)
p + geom_label_repel(aes(label = rownames(df),
                    fill = factor(cyl)), color = 'white',
                    size = 3.5) +
   theme(legend.position = "bottom")

Volcano plot

genes <- read.table("https://gist.githubusercontent.com/stephenturner/806e31fce55a8b7175af/raw/1a507c4c3f9f1baaa3a69187223ff3d3050628d4/results.txt", header = TRUE)
genes$Significant <- ifelse(genes$padj < 0.05, "FDR < 0.05", "Not Sig")
ggplot(genes, aes(x = log2FoldChange, y = -log10(pvalue))) +
  geom_point(aes(color = Significant)) +
  scale_color_manual(values = c("red", "grey")) +
  theme_bw(base_size = 12) + theme(legend.position = "bottom") +
  geom_text_repel(
    data = subset(genes, padj < 0.05),
    aes(label = Gene),
    size = 5,
    box.padding = unit(0.35, "lines"),
    point.padding = unit(0.3, "lines")
  )

source

Infos

This analysis has been performed using R software (ver. 3.2.4) and ggplot2 (ver. )

One-Way ANOVA Test in R


What is one-way ANOVA test?


The one-way analysis of variance (ANOVA), also known as one-factor ANOVA, is an extension of the independent two-samples t-test for comparing means in a situation where there are more than two groups. In one-way ANOVA, the data is organized into several groups based on one single grouping variable (also called factor variable). This tutorial describes the basic principle of the one-way ANOVA test and provides practical ANOVA test examples in R software.


ANOVA test hypotheses:

  • Null hypothesis: the means of the different groups are the same
  • Alternative hypothesis: At least one sample mean is not equal to the others.

Note that, if you have only two groups, you can use t-test. In this case the F-test and the t-test are equivalent.


One-Way ANOVA Test


Assumptions of ANOVA test

Here we describe the requirements for the ANOVA test. It can be applied only when:


  • The observations are obtained independently and randomly from the population defined by the factor levels
  • The data of each factor level are normally distributed.
  • These normal populations have a common variance. (Levene’s test can be used to check this.)
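
As a rough sketch of how these assumptions could be checked in base R, the code below uses the built-in PlantGrowth data set. Two substitutions should be noted: Bartlett's test is used here in place of Levene's test (which, to our knowledge, is provided by the car package rather than base R), and normality is checked on the pooled model residuals rather than separately for each factor level. The names fit, norm_test and var_test are illustrative.

```r
# Fit the one-way ANOVA model on the built-in PlantGrowth data
fit <- aov(weight ~ group, data = PlantGrowth)

# Normality of the residuals (Shapiro-Wilk test)
norm_test <- shapiro.test(residuals(fit))

# Homogeneity of variances across groups (Bartlett's test)
var_test <- bartlett.test(weight ~ group, data = PlantGrowth)

norm_test$p.value
var_test$p.value
```

In both tests, a p-value above the chosen significance level gives no evidence against the corresponding assumption.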


How does the one-way ANOVA test work?

Assume that we have 3 groups (A, B, C) to compare:

  1. Compute the common variance, which is called variance within samples (\(S^2_{within}\)) or residual variance.
  2. Compute the variance between sample means as follows:
    • Compute the mean of each group
    • Compute the variance between sample means (\(S^2_{between}\))
  3. Produce F-statistic as the ratio of \(S^2_{between}/S^2_{within}\).

Note that a low ratio (ratio < 1) indicates that there is no significant difference between the means of the samples being compared, whereas a higher ratio implies that the variation among group means is significant.
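
The steps above can be sketched by hand in base R, using the built-in PlantGrowth data set (three groups of 10 plants). This is an illustration under the usual textbook definitions: the between and within variances are computed as mean squares (sums of squares divided by their degrees of freedom), and all variable names are illustrative.

```r
d <- PlantGrowth
k <- nlevels(d$group)   # number of groups
n <- nrow(d)            # total number of observations

# Group means, group sizes and overall mean
group_means <- tapply(d$weight, d$group, mean)
group_sizes <- tapply(d$weight, d$group, length)
grand_mean <- mean(d$weight)

# Between-group and within-group sums of squares
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)
ss_within <- sum((d$weight - ave(d$weight, d$group))^2)

# Mean squares (variances) and their ratio, the F-statistic
ms_between <- ss_between / (k - 1)
ms_within <- ss_within / (n - k)
f_stat <- ms_between / ms_within

# p-value from the F distribution with (k - 1, n - k) degrees of freedom
p_value <- pf(f_stat, df1 = k - 1, df2 = n - k, lower.tail = FALSE)

f_stat
p_value
```

These values should agree with the F value and Pr(>F) columns reported by summary(aov(weight ~ group, data = PlantGrowth)).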

Visualize your data and compute one-way ANOVA in R

Import your data into R

  1. Prepare your data as specified here: Best practices for preparing your data set for R

  2. Save your data in an external tab-delimited .txt file or a .csv file

  3. Import your data into R as follows:

# If .txt tab file, use this
my_data <- read.delim(file.choose())
# Or, if .csv file, use this
my_data <- read.csv(file.choose())

Here, we’ll use the built-in R data set named PlantGrowth. It contains the weight of plants obtained under a control and two different treatment conditions.

my_data <- PlantGrowth

Check your data

To get an idea of what the data look like, we use the function sample_n() [in dplyr package]. The sample_n() function randomly picks a few of the observations in the data frame to print out:

# Show a random sample
set.seed(1234)
dplyr::sample_n(my_data, 10)
   weight group
19   4.32  trt1
18   4.89  trt1
29   5.80  trt2
24   5.50  trt2
17   6.03  trt1
1    4.17  ctrl
6    4.61  ctrl
16   3.83  trt1
12   4.17  trt1
15   5.87  trt1

In R terminology, the column “group” is called a factor and the different categories (“ctrl”, “trt1”, “trt2”) are named factor levels. The levels are ordered alphabetically.

# Show the levels
levels(my_data$group)
[1] "ctrl" "trt1" "trt2"

If the levels are not automatically in the correct order, re-order them as follows:

my_data$group <- ordered(my_data$group,
                         levels = c("ctrl", "trt1", "trt2"))

It’s possible to compute summary statistics (mean and sd) by groups using the dplyr package.

  • Compute summary statistics by groups - count, mean, sd:
library(dplyr)
group_by(my_data, group) %>%
  summarise(
    count = n(),
    mean = mean(weight, na.rm = TRUE),
    sd = sd(weight, na.rm = TRUE)
  )
Source: local data frame [3 x 4]
   group count  mean        sd
  (fctr) (int) (dbl)     (dbl)
1   ctrl    10 5.032 0.5830914
2   trt1    10 4.661 0.7936757
3   trt2    10 5.526 0.4425733

Visualize your data

  • To use R base graphs read this: R base graphs. Here, we’ll use the ggpubr R package for an easy ggplot2-based data visualization.

  • Install the latest version of ggpubr from GitHub as follows (recommended):

# Install
if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/ggpubr")
  • Or, install from CRAN as follows:
install.packages("ggpubr")
  • Visualize your data with ggpubr:
# Box plots
# ++++++++++++++++++++
# Plot weight by group and color by group
library("ggpubr")
ggboxplot(my_data, x = "group", y = "weight", 
          color = "group", palette = c("#00AFBB", "#E7B800", "#FC4E07"),
          order = c("ctrl", "trt1", "trt2"),
          ylab = "Weight", xlab = "Treatment")

# Mean plots
# ++++++++++++++++++++
# Plot weight by group
# Add error bars: mean_se
# (other values include: mean_sd, mean_ci, median_iqr, ....)
library("ggpubr")
ggline(my_data, x = "group", y = "weight", 
       add = c("mean_se", "jitter"), 
       order = c("ctrl", "trt1", "trt2"),
       ylab = "Weight", xlab = "Treatment")

If you still want to use R base graphs, type the following scripts:

# Box plot
boxplot(weight ~ group, data = my_data,
        xlab = "Treatment", ylab = "Weight",
        frame = FALSE, col = c("#00AFBB", "#E7B800", "#FC4E07"))
# plotmeans
library("gplots")
plotmeans(weight ~ group, data = my_data, frame = FALSE,
          xlab = "Treatment", ylab = "Weight",
          main="Mean Plot with 95% CI") 

Compute one-way ANOVA test

We want to know if there is any significant difference between the average weights of plants in the 3 experimental conditions.

The R function aov() can be used to answer this question. The function summary.aov() is used to summarize the analysis of variance model.

# Compute the analysis of variance
res.aov <- aov(weight ~ group, data = my_data)
# Summary of the analysis
summary(res.aov)
            Df Sum Sq Mean Sq F value Pr(>F)  
group        2  3.766  1.8832   4.846 0.0159 *
Residuals   27 10.492  0.3886                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The output includes the columns F value and Pr(>F) corresponding to the p-value of the test.

Interpret the result of one-way ANOVA tests

As the p-value is less than the significance level 0.05, we can conclude that there are significant differences between the groups, as highlighted with “*” in the model summary.

Multiple pairwise-comparison between the means of groups

In one-way ANOVA test, a significant p-value indicates that some of the group means are different, but we don’t know which pairs of groups are different.

It’s possible to perform multiple pairwise comparisons to determine whether the mean difference between specific pairs of groups is statistically significant.

Tukey multiple pairwise-comparisons

As the ANOVA test is significant, we can compute Tukey HSD (Tukey Honest Significant Differences, R function: TukeyHSD()) for performing multiple pairwise-comparison between the means of groups.

The function TukeyHSD() takes the fitted ANOVA as an argument.

TukeyHSD(res.aov)
  Tukey multiple comparisons of means
    95% family-wise confidence level
Fit: aov(formula = weight ~ group, data = my_data)
$group
            diff        lwr       upr     p adj
trt1-ctrl -0.371 -1.0622161 0.3202161 0.3908711
trt2-ctrl  0.494 -0.1972161 1.1852161 0.1979960
trt2-trt1  0.865  0.1737839 1.5562161 0.0120064
  • diff: difference between means of the two groups
  • lwr, upr: the lower and the upper end point of the confidence interval at 95% (default)
  • p adj: p-value after adjustment for the multiple comparisons.

It can be seen from the output, that only the difference between trt2 and trt1 is significant with an adjusted p-value of 0.012.
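
When there are many groups, reading the table by eye becomes tedious. As a small sketch (assuming the PlantGrowth data and the 0.05 threshold used in this article; the names tukey_tab and significant are illustrative), the significant pairs can be extracted from the matrix returned by TukeyHSD():

```r
# Sketch: keep only the pairs with an adjusted p-value below 0.05
my_data <- PlantGrowth
res.aov <- aov(weight ~ group, data = my_data)

# TukeyHSD() returns a list with one matrix per factor
tukey_tab <- TukeyHSD(res.aov)$group

significant <- tukey_tab[tukey_tab[, "p adj"] < 0.05, , drop = FALSE]
print(significant)
```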

Multiple comparisons using multcomp package

It’s possible to use the function glht() [in multcomp package] to perform multiple comparison procedures for an ANOVA. glht stands for general linear hypothesis tests. The simplified format is as follows:

glht(model, linfct)
  • model: a fitted model, for example an object returned by aov().
  • linfct: a specification of the linear hypotheses to be tested. Multiple comparisons in ANOVA models are specified by objects returned from the function mcp().

Use glht() to perform multiple pairwise-comparisons for a one-way ANOVA:

library(multcomp)
summary(glht(res.aov, linfct = mcp(group = "Tukey")))

     Simultaneous Tests for General Linear Hypotheses
Multiple Comparisons of Means: Tukey Contrasts
Fit: aov(formula = weight ~ group, data = my_data)
Linear Hypotheses:
                 Estimate Std. Error t value Pr(>|t|)  
trt1 - ctrl == 0  -0.3710     0.2788  -1.331    0.391  
trt2 - ctrl == 0   0.4940     0.2788   1.772    0.198  
trt2 - trt1 == 0   0.8650     0.2788   3.103    0.012 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)

Pairwise t-test

The function pairwise.t.test() can also be used to calculate pairwise comparisons between group levels with corrections for multiple testing.

pairwise.t.test(my_data$weight, my_data$group,
                 p.adjust.method = "BH")

    Pairwise comparisons using t tests with pooled SD 
data:  my_data$weight and my_data$group 
     ctrl  trt1 
trt1 0.194 -    
trt2 0.132 0.013
P value adjustment method: BH 

The result is a table of p-values for the pairwise comparisons. Here, the p-values have been adjusted by the Benjamini-Hochberg method.
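
To make the adjustment step visible, the sketch below (an illustration, assuming the PlantGrowth data; the names raw, bh and manual_bh are hypothetical) computes the unadjusted p-values first and then applies the Benjamini-Hochberg correction with p.adjust():

```r
# Sketch: reproduce the BH-adjusted p-values from the raw ones
my_data <- PlantGrowth

# Unadjusted and BH-adjusted pairwise t-tests (pooled SD)
raw <- pairwise.t.test(my_data$weight, my_data$group,
                       p.adjust.method = "none")
bh  <- pairwise.t.test(my_data$weight, my_data$group,
                       p.adjust.method = "BH")

# The p-values are stored in the lower triangle of a matrix
idx <- lower.tri(raw$p.value, diag = TRUE)
manual_bh <- p.adjust(raw$p.value[idx], method = "BH")

print(round(manual_bh, 3))
print(round(bh$p.value[idx], 3))
```

Both printed vectors should be identical, which shows that the correction is applied to the whole family of pairwise comparisons at once.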

Check ANOVA assumptions: test validity?

The ANOVA test assumes that the data are normally distributed and that the variance across groups is homogeneous. We can check that with some diagnostic plots.

Check the homogeneity of variance assumption

The residuals versus fits plot can be used to check the homogeneity of variances.

In the plot below, there is no evident relationship between residuals and fitted values (the mean of each group), which is good. So, we can assume the homogeneity of variances.

# 1. Homogeneity of variances
plot(res.aov, 1)

Points 17, 15, 4 are detected as outliers, which can severely affect normality and homogeneity of variance. It can be useful to remove outliers to meet the test assumptions.

It’s also possible to use Bartlett’s test or Levene’s test to check the homogeneity of variances.

We recommend Levene’s test, which is less sensitive to departures from normal distribution. The function leveneTest() [in car package] will be used:

library(car)
leveneTest(weight ~ group, data = my_data)
Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)
group  2  1.1192 0.3412
      27               

From the output above we can see that the p-value is not less than the significance level of 0.05. This means that there is no evidence to suggest that the variance across groups is statistically significantly different. Therefore, we can assume the homogeneity of variances in the different treatment groups.

Relaxing the homogeneity of variance assumption

The classical one-way ANOVA test requires an assumption of equal variances for all groups. In our example, the homogeneity of variance assumption turned out to be fine: the Levene test is not significant.

How do we save our ANOVA test, in a situation where the homogeneity of variance assumption is violated?

An alternative procedure (i.e., the Welch one-way test), which does not require that assumption, has been implemented in the function oneway.test().

  • ANOVA test with no assumption of equal variances
oneway.test(weight ~ group, data = my_data)
  • Pairwise t-tests with no assumption of equal variances
pairwise.t.test(my_data$weight, my_data$group,
                 p.adjust.method = "BH", pool.sd = FALSE)
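
As a quick comparison, oneway.test() can be run with and without the equal-variance assumption. This is a sketch on the PlantGrowth data; the object names classic and welch are illustrative.

```r
# Sketch: classical vs Welch one-way test on the same data
my_data <- PlantGrowth

# var.equal = TRUE gives the classical one-way ANOVA F test
classic <- oneway.test(weight ~ group, data = my_data, var.equal = TRUE)
# var.equal = FALSE (the default) gives the Welch test
welch <- oneway.test(weight ~ group, data = my_data)

print(classic)
print(welch)
```

With var.equal = TRUE the F value matches the aov() result above, whereas the Welch version uses a corrected (non-integer) denominator degrees of freedom.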

Check the normality assumption

Normality plot of residuals. In the plot below, the quantiles of the residuals are plotted against the quantiles of the normal distribution. A 45-degree reference line is also plotted.

The normal probability plot of residuals is used to check the assumption that the residuals are normally distributed. It should approximately follow a straight line.

# 2. Normality
plot(res.aov, 2)

As all the points fall approximately along this reference line, we can assume normality.

The conclusion above is supported by the Shapiro-Wilk test on the ANOVA residuals (W = 0.97, p = 0.44), which finds no indication that normality is violated.

# Extract the residuals
aov_residuals <- residuals(object = res.aov )
# Run Shapiro-Wilk test
shapiro.test(x = aov_residuals )

    Shapiro-Wilk normality test
data:  aov_residuals
W = 0.96607, p-value = 0.4379

Non-parametric alternative to one-way ANOVA test

Note that a non-parametric alternative to one-way ANOVA is the Kruskal-Wallis rank sum test, which can be used when ANOVA assumptions are not met.

kruskal.test(weight ~ group, data = my_data)

    Kruskal-Wallis rank sum test
data:  weight by group
Kruskal-Wallis chi-squared = 7.9882, df = 2, p-value = 0.01842

Summary


  1. Import your data from a .txt tab file: my_data <- read.delim(file.choose()). Here, we used my_data <- PlantGrowth.
  2. Visualize your data: ggpubr::ggboxplot(my_data, x = "group", y = "weight", color = "group")
  3. Compute one-way ANOVA test: res.aov <- aov(weight ~ group, data = my_data); summary(res.aov)
  4. Tukey multiple pairwise-comparisons: TukeyHSD(res.aov)


Infos

This analysis has been performed using R software (ver. 3.2.4).

t test formula

t-test definition

Student t test is a statistical test which is widely used to compare the mean of two groups of samples. It is therefore used to evaluate whether the means of the two sets of data are statistically significantly different from each other.

There are many types of t test :

  • The one-sample t-test, used to compare the mean of a population with a theoretical value.
  • The unpaired two sample t-test, used to compare the mean of two independent samples.
  • The paired t-test, used to compare the means between two related groups of samples.

The aim of this article is to describe the different t test formula. Student’s t-test is a parametric test as the formula depends on the mean and the standard deviation of the data being compared.

Note that an online t-test calculator is available here to compute t-test statistics without any installation.

One-sample t-test formula

As mentioned above, one-sample t-test is used to compare the mean of a population to a specified theoretical mean (\(\mu\)).

Let X represent a set of values with size n, with mean m and with standard deviation s. The comparison of the observed mean (m) of the population to a theoretical value \(\mu\) is performed with the formula below :

\[ t = \frac{m-\mu}{s/\sqrt{n}} \]

To evaluate whether the difference is statistically significant, you first have to look up, in a t-test table, the critical value of Student’s t distribution corresponding to the significance level alpha of your choice (5%, for example). The degrees of freedom (df) used in this test are :

\[ df = n - 1 \]

If the absolute value of the t-test statistic (|t|) is greater than the critical value, then the difference is significant. Otherwise it isn’t. The p-value corresponds to the risk indicated by the t-test table for the calculated |t| value.

The t test can be used only when the data are normally distributed.
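
The formula can be checked numerically. Below is a minimal sketch in R, assuming a small made-up sample x and a hypothetical theoretical mean mu = 18 (both are illustrative values, not data from this article). The functions qt() and pt() play the role of the t-test table.

```r
# Sketch: one-sample t-test by hand, compared with t.test()
# Hypothetical data and theoretical mean (for illustration only)
x  <- c(17.6, 20.6, 22.2, 15.3, 20.9, 21.0, 18.9, 18.9, 18.9, 18.2)
mu <- 18

m <- mean(x)      # observed mean
s <- sd(x)        # standard deviation
n <- length(x)    # sample size

t_manual <- (m - mu) / (s / sqrt(n))
df <- n - 1

# Critical value for a two-sided test at alpha = 5%
t_crit <- qt(1 - 0.05 / 2, df)
# Two-sided p-value
p_manual <- 2 * pt(abs(t_manual), df, lower.tail = FALSE)

res <- t.test(x, mu = mu)
print(c(t_manual = t_manual, t_builtin = unname(res$statistic)))
print(c(p_manual = p_manual, p_builtin = res$p.value))
```

The difference is declared significant when abs(t_manual) exceeds t_crit, which is equivalent to p_manual being below 0.05.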

Independent two sample t-test

What is independent t-test ?

Independent (or unpaired two sample) t-test is used to compare the means of two unrelated groups of samples.

As an example, we have a cohort of 100 individuals (50 women and 50 men). The question is to test whether the average weight of women is significantly different from that of men?

In this case, we have two independent groups of samples and the unpaired t-test can be used to test whether the means are different.

Independent t-test formula

  • Let A and B represent the two groups to compare.
  • Let \(m_A\) and \(m_B\) represent the means of groups A and B, respectively.
  • Let \(n_A\) and \(n_B\) represent the sizes of group A and B, respectively.

The t test statistic value to test whether the means are different can be calculated as follows :

\[ t = \frac{m_A - m_B}{\sqrt{ \frac{S^2}{n_A} + \frac{S^2}{n_B} }} \]

\(S^2\) is an estimator of the common variance of the two samples. It can be calculated as follows, where the first sum runs over the values x of group A and the second over the values x of group B :

\[ S^2 = \frac{\sum_{x \in A}{(x-m_A)^2}+\sum_{x \in B}{(x-m_B)^2}}{n_A+n_B-2} \]

Once the t-test statistic value is determined, you have to look up, in a t-test table, the critical value of Student’s t distribution corresponding to the significance level alpha of your choice (5%, for example). The degrees of freedom (df) used in this test are :

\[ df = n_A + n_B -2 \]

If the absolute value of the t-test statistic (|t|) is greater than the critical value, then the difference is significant. Otherwise it isn’t. The p-value corresponds to the risk indicated by the t-test table for the calculated |t| value.

The test can be used only when the two groups of samples (A and B) being compared are each normally distributed and have equal variances.

If the variances of the two groups being compared are different, the Welch t test can be used.
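
As an illustration of the pooled-variance formula, here is a sketch in R. It assumes the built-in ToothGrowth data set, with the OJ and VC supplement groups standing in for groups A and B (a choice made only for this example).

```r
# Sketch: independent two-sample t-test by hand (equal variances)
a <- ToothGrowth$len[ToothGrowth$supp == "OJ"]  # group A
b <- ToothGrowth$len[ToothGrowth$supp == "VC"]  # group B

n_a <- length(a); n_b <- length(b)
m_a <- mean(a);   m_b <- mean(b)

# Pooled estimator of the common variance
s2 <- (sum((a - m_a)^2) + sum((b - m_b)^2)) / (n_a + n_b - 2)

t_manual <- (m_a - m_b) / sqrt(s2 / n_a + s2 / n_b)
df <- n_a + n_b - 2
p_manual <- 2 * pt(abs(t_manual), df, lower.tail = FALSE)

# Built-in equivalent; var.equal = FALSE (the default) would give Welch
res <- t.test(a, b, var.equal = TRUE)
print(c(t_manual = t_manual, t_builtin = unname(res$statistic)))
```

Dropping var.equal = TRUE in the t.test() call switches to the Welch t test mentioned above.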

Paired sample t-test

What is paired t-test ?

Paired Student’s t-test is used to compare the means of two related samples. That is when you have two values (pair of values) for the same samples.

For example, 20 mice received a treatment X for 3 months. The question is to test whether the treatment X has an impact on the weight of the mice at the end of the 3 months treatment. The weight of the 20 mice has been measured before and after the treatment. This gives us 20 sets of values before treatment and 20 sets of values after treatment from measuring twice the weight of the same mice.

In this case, paired t-test can be used as the two sets of values being compared are related. We have a pair of values for each mouse (one before and the other after treatment).

Paired t-test formula

To compare the means of the two paired sets of data, the differences between all pairs must be, first, calculated.

Let d represent the differences between all pairs. The average of the difference d is compared to 0. If there is any significant difference between the two pairs of samples, then the mean of d is expected to be far from 0.

The t-test statistic value can be calculated as follows :

\[ t = \frac{m}{s/\sqrt{n}} \]

m and s are the mean and the standard deviation of the difference (d), respectively. n is the size of d.

Once the t value is determined, you have to look up, in a t-test table, the critical value of Student’s t distribution corresponding to the significance level alpha of your choice (5%, for example). The degrees of freedom (df) used in this test are :

\[ df = n - 1 \]

If the absolute value of the t-test statistic (|t|) is greater than the critical value, then the difference is significant. Otherwise it isn’t. The p-value corresponds to the risk indicated by the t-test table for the calculated |t| value.

The test can be used only when the difference d is normally distributed.
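
The paired formula is simply a one-sample t-test applied to the differences. The sketch below illustrates this in R. Since the mice data are not available here, it assumes the built-in sleep data set, in which the same 10 subjects were measured under two drugs; treating these as the two related measurements is an assumption made for illustration.

```r
# Sketch: paired t-test by hand on the built-in sleep data
x <- sleep$extra[sleep$group == 1]  # first measurement per subject
y <- sleep$extra[sleep$group == 2]  # second measurement, same subjects

d <- y - x          # differences between pairs
m <- mean(d)
s <- sd(d)
n <- length(d)

t_manual <- m / (s / sqrt(n))
df <- n - 1
p_manual <- 2 * pt(abs(t_manual), df, lower.tail = FALSE)

res <- t.test(y, x, paired = TRUE)
print(c(t_manual = t_manual, t_builtin = unname(res$statistic)))
```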

Online t-test calculator

You no longer need SPSS or Excel to perform t-test.

An online t-test calculator is available here to perform Student’s t-test without any installation.

Infos

This analysis has been done using R (ver. 3.1.0).

Normality Test in R


Many statistical tests, including correlation, regression, t-test, and analysis of variance (ANOVA), assume certain characteristics about the data. They require the data to follow a normal distribution (Gaussian distribution). These tests are called parametric tests, because their validity depends on the distribution of the data.

Normality and the other assumptions made by these tests should be taken seriously to draw reliable interpretation and conclusions of the research.

Before using a parametric test, we should perform some preliminary tests to make sure that the test assumptions are met. In situations where the assumptions are violated, non-parametric tests are recommended.

Here, we’ll describe how to check the normality of the data by visual inspection and by significance tests.


Install required R packages

  1. dplyr for data manipulation
install.packages("dplyr")
  2. ggpubr for an easy ggplot2-based data visualization
  • Install the latest version from GitHub as follow:
# Install
if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/ggpubr")
  • Or, install from CRAN as follow:
install.packages("ggpubr")

Load required R packages

library("dplyr")
library("ggpubr")

Import your data into R

  1. Prepare your data as specified here: Best practices for preparing your data set for R

  2. Save your data in an external .txt tab or .csv file

  3. Import your data into R as follows:

# If .txt tab file, use this
my_data <- read.delim(file.choose())
# Or, if .csv file, use this
my_data <- read.csv(file.choose())

Here, we’ll use the built-in R data set named ToothGrowth.

# Store the data in the variable my_data
my_data <- ToothGrowth

Check your data

We start by displaying a random sample of 10 rows using the function sample_n()[in dplyr package].

Show 10 random rows:

set.seed(1234)
dplyr::sample_n(my_data, 10)
    len supp dose
7  11.2   VC  0.5
37  8.2   OJ  0.5
36 10.0   OJ  0.5
58 27.3   OJ  2.0
49 14.5   OJ  1.0
57 26.4   OJ  2.0
1   4.2   VC  0.5
13 15.2   VC  1.0
35 14.5   OJ  0.5
27 26.7   VC  2.0

Assess the normality of the data in R

We want to test if the variable len (tooth length) is normally distributed.

Case of large sample sizes

If the sample size is large enough (n > 30), we can ignore the distribution of the data and use parametric tests.

The central limit theorem tells us that, no matter what distribution the data have, the sampling distribution of the mean tends to be normal if the sample is large enough (n > 30).
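
A small simulation can make this concrete. The sketch below is an illustration only: the exponential distribution, the sample size n = 40 and the number of repetitions are arbitrary choices, not values from this article.

```r
# Sketch: sample means of a skewed distribution look approximately normal
set.seed(42)
n    <- 40     # size of each sample (> 30)
reps <- 5000   # number of simulated samples

# Exponential data (rate = 1) are strongly right-skewed,
# with mean 1 and standard deviation 1
sample_means <- replicate(reps, mean(rexp(n, rate = 1)))

# Theory: the means are centered on 1 with standard error 1 / sqrt(n)
print(c(mean = mean(sample_means),
        sd = sd(sample_means),
        expected_sd = 1 / sqrt(n)))

# Visual check: the histogram of the means should be roughly bell shaped
hist(sample_means, breaks = 40, main = "Distribution of sample means")
```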

However, to be consistent, normality can be checked by visual inspection [normal plots (histogram), Q-Q plot (quantile-quantile plot)] or by significance tests.

Visual methods

Density plot and Q-Q plot can be used to check normality visually.

  1. Density plot: the density plot provides a visual judgment about whether the distribution is bell shaped.
library("ggpubr")
ggdensity(my_data$len, 
          main = "Density plot of tooth length",
          xlab = "Tooth length")

  2. Q-Q plot: Q-Q plot (or quantile-quantile plot) draws the correlation between a given sample and the normal distribution. A 45-degree reference line is also plotted.
library(ggpubr)
ggqqplot(my_data$len)

It’s also possible to use the function qqPlot() [in car package]:

library("car")
qqPlot(my_data$len)

As all the points fall approximately along this reference line, we can assume normality.

Normality test

Visual inspection, described in the previous section, is usually unreliable. It’s possible to use a significance test comparing the sample distribution to a normal one in order to ascertain whether data show or not a serious deviation from normality.

There are several methods for normality test such as Kolmogorov-Smirnov (K-S) normality test and Shapiro-Wilk’s test.

The null hypothesis of these tests is that “sample distribution is normal”. If the test is significant, the distribution is non-normal.

Shapiro-Wilk’s method is widely recommended for normality test and it provides better power than K-S. It is based on the correlation between the data and the corresponding normal scores.

Note that, normality test is sensitive to sample size. Small samples most often pass normality tests. Therefore, it’s important to combine visual inspection and significance test in order to take the right decision.

The R function shapiro.test() can be used to perform the Shapiro-Wilk test of normality for one variable (univariate):

shapiro.test(my_data$len)

    Shapiro-Wilk normality test
data:  my_data$len
W = 0.96743, p-value = 0.1091

From the output, the p-value > 0.05, implying that the distribution of the data is not significantly different from a normal distribution. In other words, we can assume normality.
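
When the data will later be compared between groups, it can be useful to run the same test within each group. The following is a sketch of one possible approach, assuming the ToothGrowth data and grouping by dose (an illustrative choice); p_by_dose is a hypothetical name.

```r
# Sketch: Shapiro-Wilk p-value of tooth length within each dose group
my_data <- ToothGrowth

p_by_dose <- tapply(my_data$len, my_data$dose,
                    function(v) shapiro.test(v)$p.value)
print(round(p_by_dose, 3))
```

Each element is the p-value for one dose level; values above 0.05 give no evidence against normality in that group.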

Infos

This analysis has been performed using R software (ver. 3.2.4).

Two-Way ANOVA Test in R


What is two-way ANOVA test?


Two-way ANOVA test is used to evaluate simultaneously the effect of two grouping variables (A and B) on a response variable.


The grouping variables are also known as factors. The different categories (groups) of a factor are called levels. The number of levels can vary between factors. The level combinations of factors are called cells.


  • When the sample sizes within cells are equal, we have the so-called balanced design. In this case the standard two-way ANOVA test can be applied.

  • When the sample sizes within each level of the independent variables are not the same (case of unbalanced designs), the ANOVA test should be handled differently.


This tutorial describes how to compute two-way ANOVA test in R software for balanced and unbalanced designs.



Two-way ANOVA test hypotheses

  1. There is no difference in the means of factor A
  2. There is no difference in means of factor B
  3. There is no interaction between factors A and B

The alternative hypothesis for cases 1 and 2 is: the means are not equal.

The alternative hypothesis for case 3 is: there is an interaction between A and B.

Assumptions of two-way ANOVA test

Two-way ANOVA, like all ANOVA tests, assumes that the observations within each cell are normally distributed and have equal variances. We’ll show you how to check these assumptions after fitting ANOVA.

Compute two-way ANOVA test in R: balanced designs

Balanced designs correspond to the situation where we have equal sample sizes within the level combinations of our independent grouping variables.

Import your data into R

  1. Prepare your data as specified here: Best practices for preparing your data set for R

  2. Save your data in an external .txt tab or .csv file

  3. Import your data into R as follows:

# If .txt tab file, use this
my_data <- read.delim(file.choose())
# Or, if .csv file, use this
my_data <- read.csv(file.choose())

Here, we’ll use the built-in R data set named ToothGrowth. It contains data from a study evaluating the effect of vitamin C on tooth growth in guinea pigs. The experiment has been performed on 60 guinea pigs, where each animal received one of three dose levels of vitamin C (0.5, 1, and 2 mg/day) by one of two delivery methods: orange juice (coded as OJ) or ascorbic acid (a form of vitamin C, coded as VC). Tooth length was measured and a sample of the data is shown below.

# Store the data in the variable my_data
my_data <- ToothGrowth

Check your data

To get an idea of what the data look like, we display a random sample of the data using the function sample_n()[in dplyr package]. First, install dplyr if you don’t have it:

install.packages("dplyr")
# Show a random sample
set.seed(1234)
dplyr::sample_n(my_data, 10)
    len supp dose
38  9.4   OJ  0.5
36 10.0   OJ  0.5
37  8.2   OJ  0.5
50 27.3   OJ  1.0
59 29.4   OJ  2.0
1   4.2   VC  0.5
13 15.2   VC  1.0
56 30.9   OJ  2.0
27 26.7   VC  2.0
53 22.4   OJ  2.0
# Check the structure
str(my_data)
'data.frame':   60 obs. of  3 variables:
 $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
 $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
 $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

From the output above, R considers “dose” as a numeric variable. We’ll convert it to a factor variable (i.e., grouping variable) as follows.

# Convert dose as a factor and recode the levels
# as "D0.5", "D1", "D2"
my_data$dose <- factor(my_data$dose, 
                  levels = c(0.5, 1, 2),
                  labels = c("D0.5", "D1", "D2"))
head(my_data)
   len supp dose
1  4.2   VC D0.5
2 11.5   VC D0.5
3  7.3   VC D0.5
4  5.8   VC D0.5
5  6.4   VC D0.5
6 10.0   VC D0.5

Question: We want to know if tooth length depends on supp and dose.

  • Generate frequency tables:
table(my_data$supp, my_data$dose)
    
     D0.5 D1 D2
  OJ   10 10 10
  VC   10 10 10

We have 2X3 design cells with the factors being supp and dose and 10 subjects in each cell. Here, we have a balanced design. In the next sections I’ll describe how to analyse data from balanced designs, since this is the simplest case.
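
The balance of a design can also be checked programmatically rather than by reading the table. This is a small sketch, assuming the ToothGrowth data; cell_counts and is_balanced are illustrative names.

```r
# Sketch: a design is balanced when all cells have the same size
my_data <- ToothGrowth
my_data$dose <- factor(my_data$dose)

cell_counts <- table(my_data$supp, my_data$dose)
is_balanced <- length(unique(as.vector(cell_counts))) == 1

print(cell_counts)
print(is_balanced)
```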

Visualize your data

Box plots and line plots can be used to visualize group differences:

  • Box plot to plot the data grouped by the combinations of the levels of the two factors.
  • Two-way interaction plot, which plots the mean (or other summary) of the response for two-way combinations of factors, thereby illustrating possible interactions.

  • To use R base graphs read this: R base graphs. Here, we’ll use the ggpubr R package for an easy ggplot2-based data visualization.

  • Install the latest version of ggpubr from GitHub as follow (recommended):

# Install
if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/ggpubr")
  • Or, install from CRAN as follow:
install.packages("ggpubr")
  • Visualize your data with ggpubr:
# Box plot with multiple groups
# +++++++++++++++++++++
# Plot tooth length ("len") by groups ("dose")
# Color box plot by a second group: "supp"
library("ggpubr")
ggboxplot(my_data, x = "dose", y = "len", color = "supp",
          palette = c("#00AFBB", "#E7B800"))

# Line plots with multiple groups
# +++++++++++++++++++++++
# Plot tooth length ("len") by groups ("dose")
# Color lines by a second group: "supp"
# Add error bars: mean_se
# (other values include: mean_sd, mean_ci, median_iqr, ....)
library("ggpubr")
ggline(my_data, x = "dose", y = "len", color = "supp",
       add = c("mean_se", "dotplot"),
       palette = c("#00AFBB", "#E7B800"))

If you still want to use R base graphs, type the following scripts:

# Box plot with two factor variables
boxplot(len ~ supp * dose, data=my_data, frame = FALSE, 
        col = c("#00AFBB", "#E7B800"), ylab="Tooth Length")
# Two-way interaction plot
interaction.plot(x.factor = my_data$dose, trace.factor = my_data$supp, 
                 response = my_data$len, fun = mean, 
                 type = "b", legend = TRUE, 
                 xlab = "Dose", ylab="Tooth Length",
                 pch=c(1,19), col = c("#00AFBB", "#E7B800"))

Arguments used for the function interaction.plot():


  • x.factor: the factor to be plotted on x axis.
  • trace.factor: the factor to be plotted as lines
  • response: a numeric variable giving the response
  • type: the type of plot. Allowed values include p (for point only), l (for line only) and b (for both point and line).


Compute two-way ANOVA test

We want to know if tooth length depends on supp and dose.

The R function aov() can be used to answer this question. The function summary.aov() is used to summarize the analysis of variance model.

res.aov2 <- aov(len ~ supp + dose, data = my_data)
summary(res.aov2)
            Df Sum Sq Mean Sq F value   Pr(>F)    
supp         1  205.4   205.4   14.02 0.000429 ***
dose         2 2426.4  1213.2   82.81  < 2e-16 ***
Residuals   56  820.4    14.7                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The output includes the columns F value and Pr(>F) corresponding to the p-value of the test.

From the ANOVA table we can conclude that both supp and dose are statistically significant. dose is the most significant factor variable. These results would lead us to believe that changing delivery methods (supp) or the dose of vitamin C, will impact significantly the mean tooth length.

Note that the above fitted model is called an additive model. It assumes that the two factor variables are independent. If you think that these two variables might interact to create a synergistic effect, replace the plus symbol (+) by an asterisk (*), as follows.

# Two-way ANOVA with interaction effect
# These two calls are equivalent
res.aov3 <- aov(len ~ supp * dose, data = my_data)
res.aov3 <- aov(len ~ supp + dose + supp:dose, data = my_data)
summary(res.aov3)
            Df Sum Sq Mean Sq F value   Pr(>F)    
supp         1  205.4   205.4  15.572 0.000231 ***
dose         2 2426.4  1213.2  92.000  < 2e-16 ***
supp:dose    2  108.3    54.2   4.107 0.021860 *  
Residuals   54  712.1    13.2                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

It can be seen that the two main effects (supp and dose) are statistically significant, as well as their interaction.

Note that, in the situation where the interaction is not significant you should use the additive model.
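
One way to make this choice explicit is to compare the two nested models directly with anova(). The sketch below assumes the ToothGrowth data with dose converted to a factor as above; fit_add, fit_int and cmp are illustrative names.

```r
# Sketch: F test comparing the additive and the interaction models
my_data <- ToothGrowth
my_data$dose <- factor(my_data$dose,
                       levels = c(0.5, 1, 2),
                       labels = c("D0.5", "D1", "D2"))

fit_add <- aov(len ~ supp + dose, data = my_data)  # additive model
fit_int <- aov(len ~ supp * dose, data = my_data)  # with interaction

# The second row tests whether adding supp:dose improves the fit
cmp <- anova(fit_add, fit_int)
print(cmp)
```

The F value and p-value in the second row correspond to the supp:dose line of the interaction model summary. A non-significant result here would support keeping the additive model.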

Interpret the results

From the ANOVA results, you can conclude the following, based on the p-values and a significance level of 0.05:

  • the p-value of supp is 0.000231 (significant), which indicates that the levels of supp are associated with significantly different tooth lengths.
  • the p-value of dose is < 2e-16 (significant), which indicates that the levels of dose are associated with significantly different tooth lengths.
  • the p-value for the interaction between supp and dose is 0.02 (significant), which indicates that the relationship between dose and tooth length depends on the supp method.

Compute some summary statistics

  • Compute mean and SD by groups using dplyr R package:
require("dplyr")
group_by(my_data, supp, dose) %>%
  summarise(
    count = n(),
    mean = mean(len, na.rm = TRUE),
    sd = sd(len, na.rm = TRUE)
  )
Source: local data frame [6 x 5]
Groups: supp [?]
    supp   dose count  mean       sd
  (fctr) (fctr) (int) (dbl)    (dbl)
1     OJ   D0.5    10 13.23 4.459709
2     OJ     D1    10 22.70 3.910953
3     OJ     D2    10 26.06 2.655058
4     VC   D0.5    10  7.98 2.746634
5     VC     D1    10 16.77 2.515309
6     VC     D2    10 26.14 4.797731
  • It’s also possible to use the function model.tables() as follow:
model.tables(res.aov3, type="means", se = TRUE)

Multiple pairwise-comparison between the means of groups

In ANOVA test, a significant p-value indicates that some of the group means are different, but we don’t know which pairs of groups are different.

It’s possible to perform multiple pairwise-comparisons, to determine if the mean difference between specific pairs of groups is statistically significant.

Tukey multiple pairwise-comparisons

As the ANOVA test is significant, we can compute Tukey HSD (Tukey Honest Significant Differences, R function: TukeyHSD()) for performing multiple pairwise-comparisons between the means of groups. The function TukeyHSD() takes the fitted ANOVA as an argument.

We don’t need to perform the test for the “supp” variable because it has only two levels, which have been already proven to be significantly different by ANOVA test. Therefore, the Tukey HSD test will be done only for the factor variable “dose”.

TukeyHSD(res.aov3, which = "dose")
  Tukey multiple comparisons of means
    95% family-wise confidence level
Fit: aov(formula = len ~ supp + dose + supp:dose, data = my_data)
$dose
          diff       lwr       upr   p adj
D1-D0.5  9.130  6.362488 11.897512 0.0e+00
D2-D0.5 15.495 12.727488 18.262512 0.0e+00
D2-D1    6.365  3.597488  9.132512 2.7e-06
  • diff: difference between means of the two groups
  • lwr, upr: the lower and the upper end point of the confidence interval at 95% (default)
  • p adj: p-value after adjustment for the multiple comparisons.

It can be seen from the output, that all pairwise comparisons are significant with an adjusted p-value < 0.05.

Multiple comparisons using multcomp package

It’s possible to use the function glht() [in multcomp package] to perform multiple comparison procedures for an ANOVA. glht stands for general linear hypothesis tests. The simplified format is as follows:

glht(model, linfct)
  • model: a fitted model, for example an object returned by aov().
  • linfct: a specification of the linear hypotheses to be tested. Multiple comparisons in ANOVA models are specified by objects returned from the function mcp().

Use glht() to perform multiple pairwise-comparisons:

library(multcomp)
summary(glht(res.aov2, linfct = mcp(dose = "Tukey")))

     Simultaneous Tests for General Linear Hypotheses
Multiple Comparisons of Means: Tukey Contrasts
Fit: aov(formula = len ~ supp + dose, data = my_data)
Linear Hypotheses:
               Estimate Std. Error t value Pr(>|t|)    
D1 - D0.5 == 0    9.130      1.210   7.543   <1e-05 ***
D2 - D0.5 == 0   15.495      1.210  12.802   <1e-05 ***
D2 - D1 == 0      6.365      1.210   5.259   <1e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)

Pairwise t-test

The function pairwise.t.test() can be also used to calculate pairwise comparisons between group levels with corrections for multiple testing.

pairwise.t.test(my_data$len, my_data$dose,
                p.adjust.method = "BH")

    Pairwise comparisons using t tests with pooled SD 
data:  my_data$len and my_data$dose 
   D0.5    D1     
D1 1.0e-08 -      
D2 4.4e-16 1.4e-05
P value adjustment method: BH 

Check ANOVA assumptions: test validity?

ANOVA assumes that the data are normally distributed and that the variance across groups is homogeneous. We can check that with some diagnostic plots.

Check the homogeneity of variance assumption

The residuals versus fits plot is used to check the homogeneity of variances. In the plot below, there is no evident relationship between residuals and fitted values (the mean of each group), which is good. So, we can assume the homogeneity of variances.

# 1. Homogeneity of variances
plot(res.aov3, 1)

Points 32 and 23 are detected as outliers, which can severely affect normality and homogeneity of variance. It can be useful to remove outliers to meet the test assumptions.

Use the Levene’s test to check the homogeneity of variances. The function leveneTest() [in car package] will be used:

library(car)
leveneTest(len ~ supp*dose, data = my_data)
Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)
group  5  1.7086 0.1484
      54               

From the output above we can see that the p-value is not less than the significance level of 0.05. This means that there is no evidence to suggest that the variance across groups is statistically significantly different. Therefore, we can assume the homogeneity of variances in the different treatment groups.

Check the normality assumption

Normality plot of the residuals. In the plot below, the quantiles of the residuals are plotted against the quantiles of the normal distribution. A 45-degree reference line is also plotted.

The normal probability plot of residuals is used to verify the assumption that the residuals are normally distributed.

The normal probability plot of the residuals should approximately follow a straight line.

# 2. Normality
plot(res.aov3, 2)
Two-Way ANOVA Test in R

As all the points fall approximately along this reference line, we can assume normality.

The conclusion above is supported by the Shapiro-Wilk test on the ANOVA residuals (W = 0.98, p = 0.67), which finds no indication that normality is violated.

# Extract the residuals
aov_residuals <- residuals(object = res.aov3)
# Run Shapiro-Wilk test
shapiro.test(x = aov_residuals )

    Shapiro-Wilk normality test
data:  aov_residuals
W = 0.98499, p-value = 0.6694

Compute two-way ANOVA test in R for unbalanced designs

An unbalanced design has unequal numbers of subjects in each group.

There are three fundamentally different ways to run an ANOVA in an unbalanced design. They are known as Type-I, Type-II and Type-III sums of squares. To keep things simple, note that the recommended method is the Type-III sums of squares.

The three methods give the same result when the design is balanced. However, when the design is unbalanced, they don’t give the same results.
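
The following base R sketch illustrates the point for Type-I (sequential) sums of squares, which depend on the order of the terms once the design is unbalanced. It assumes the built-in ToothGrowth data and, for simplicity, an additive model without the interaction term; the helper ss_supp() and the object names are hypothetical.

```r
# Assumption: ToothGrowth with dose as a factor (balanced, 10 per cell)
tg <- ToothGrowth
tg$dose <- factor(tg$dose)

# Make the design unbalanced by dropping 5 observations from one cell
tg_unbal <- tg[-(1:5), ]

# Hypothetical helper: sequential (Type-I) sum of squares for supp
ss_supp <- function(formula, data) {
  anova(lm(formula, data = data))["supp", "Sum Sq"]
}

# Balanced design: the order of the terms does not matter
bal_first  <- ss_supp(len ~ supp + dose, tg)
bal_second <- ss_supp(len ~ dose + supp, tg)

# Unbalanced design: the order of the terms changes the result
unbal_first  <- ss_supp(len ~ supp + dose, tg_unbal)
unbal_second <- ss_supp(len ~ dose + supp, tg_unbal)

c(bal_first, bal_second)
c(unbal_first, unbal_second)
```

In the balanced case the two values coincide, whereas in the unbalanced case they differ, which is why the type of sums of squares has to be chosen explicitly.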

The function Anova() [in car package] can be used to compute two-way ANOVA test for unbalanced designs.

First install the package on your computer. In R, type install.packages("car"). Then:

library(car)
my_anova <- aov(len ~ supp * dose, data = my_data)
Anova(my_anova, type = "III")
Anova Table (Type III tests)
Response: len
             Sum Sq Df F value    Pr(>F)    
(Intercept) 1750.33  1 132.730 3.603e-16 ***
supp         137.81  1  10.450  0.002092 ** 
dose         885.26  2  33.565 3.363e-10 ***
supp:dose    108.32  2   4.107  0.021860 *  
Residuals    712.11 54                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Infos

This analysis has been performed using R software (ver. 3.2.4).

Paired Samples T-test in R



What is paired samples t-test?


The paired samples t-test is used to compare the means between two related groups of samples. In this case, you have two values (i.e., pair of values) for the same samples. This article describes how to compute paired samples t-test using R software.


As an example of data, 20 mice received a treatment X for 3 months. We want to know whether the treatment X has an impact on the weight of the mice.

To answer this question, the weight of the 20 mice has been measured before and after the treatment. This gives us 20 values before treatment and 20 values after treatment, obtained by measuring the weight of the same mice twice.

In such situations, paired t-test can be used to compare the mean weights before and after treatment.

Paired t-test analysis is performed as follows:

  1. Calculate the difference (\(d\)) between each pair of values
  2. Compute the mean (\(m\)) and the standard deviation (\(s\)) of \(d\)
  3. Compare the average difference to 0. If there is any significant difference between the two pairs of samples, then the mean of d (\(m\)) is expected to be far from 0.

Paired t-test can be used only when the difference \(d\) is normally distributed. This can be checked using Shapiro-Wilk test.


Paired samples t test


Research questions and statistical hypotheses

Typical research questions are:


  1. whether the mean difference (\(m\)) is equal to 0?
  2. whether the mean difference (\(m\)) is less than 0?
  3. whether the mean difference (\(m\)) is greater than 0?


In statistics, we can define the corresponding null hypothesis (\(H_0\)) as follows:

  1. \(H_0: m = 0\)
  2. \(H_0: m \leq 0\)
  3. \(H_0: m \geq 0\)

The corresponding alternative hypotheses (\(H_a\)) are as follows:

  1. \(H_a: m \ne 0\) (different)
  2. \(H_a: m > 0\) (greater)
  3. \(H_a: m < 0\) (less)

Note that:

  • Hypotheses 1) are called two-tailed tests
  • Hypotheses 2) and 3) are called one-tailed tests

Formula of paired samples t-test

The t-test statistic value can be calculated using the following formula:

\[ t = \frac{m}{s/\sqrt{n}} \]

where,

  • m is the mean of the differences d
  • n is the sample size (i.e., size of d).
  • s is the standard deviation of d

We can compute the p-value corresponding to the absolute value of the t-test statistics (|t|) for the degrees of freedom (df): \(df = n - 1\).

If the p-value is less than or equal to 0.05, we can conclude that the two paired samples are significantly different.
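
As a minimal sketch, the formula can be applied step by step in R and compared with the built-in t.test() function. The two vectors are the example mice weights used later in this article; the names d, m, s, n, t_stat, p_value and ref are chosen here for illustration.

```r
# Example weights of 10 mice before and after treatment
before <- c(200.1, 190.9, 192.7, 213, 241.4, 196.9, 172.2, 185.5, 205.2, 193.7)
after  <- c(392.9, 393.2, 345.1, 393, 434, 427.9, 422, 383.9, 392.3, 352.2)

d <- before - after          # difference for each pair
n <- length(d)               # sample size (number of pairs)
m <- mean(d)                 # mean of the differences
s <- sd(d)                   # standard deviation of the differences

t_stat  <- m / (s / sqrt(n)) # t statistic from the formula
df      <- n - 1             # degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df)  # two-tailed p-value

c(t = t_stat, df = df, p.value = p_value)

# Cross-check against the built-in function
ref <- t.test(before, after, paired = TRUE)
all.equal(t_stat, unname(ref$statistic))
```

The hand-computed statistic and p-value should match the values reported by t.test().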

Visualize your data and compute paired t-test in R

R function to compute paired t-test

To perform a paired samples t-test comparing the means of two paired samples (x & y), the R function t.test() can be used as follows:

t.test(x, y, paired = TRUE, alternative = "two.sided")

  • x,y: numeric vectors
  • paired: a logical value specifying that we want to compute a paired t-test
  • alternative: the alternative hypothesis. Allowed value is one of “two.sided” (default), “greater” or “less”.


Import your data into R

  1. Prepare your data as specified here: Best practices for preparing your data set for R

  2. Save your data in an external tab-delimited .txt file or a .csv file

  3. Import your data into R as follows:

# If .txt tab file, use this
my_data <- read.delim(file.choose())
# Or, if .csv file, use this
my_data <- read.csv(file.choose())

Here, we’ll use an example data set, which contains the weight of 10 mice before and after the treatment.

# Data in two numeric vectors
# ++++++++++++++++++++++++++
# Weight of the mice before treatment
before <-c(200.1, 190.9, 192.7, 213, 241.4, 196.9, 172.2, 185.5, 205.2, 193.7)
# Weight of the mice after treatment
after <-c(392.9, 393.2, 345.1, 393, 434, 427.9, 422, 383.9, 392.3, 352.2)
# Create a data frame
my_data <- data.frame( 
                group = rep(c("before", "after"), each = 10),
                weight = c(before,  after)
                )

We want to know whether there is any significant difference in the mean weights before and after treatment.

Check your data

# Print all data
print(my_data)
    group weight
1  before  200.1
2  before  190.9
3  before  192.7
4  before  213.0
5  before  241.4
6  before  196.9
7  before  172.2
8  before  185.5
9  before  205.2
10 before  193.7
11  after  392.9
12  after  393.2
13  after  345.1
14  after  393.0
15  after  434.0
16  after  427.9
17  after  422.0
18  after  383.9
19  after  392.3
20  after  352.2

Compute summary statistics (mean and sd) by groups using the dplyr package.

  • To install dplyr package, type this:
install.packages("dplyr")
  • Compute summary statistics by groups:
library("dplyr")
group_by(my_data, group) %>%
  summarise(
    count = n(),
    mean = mean(weight, na.rm = TRUE),
    sd = sd(weight, na.rm = TRUE)
  )
Source: local data frame [2 x 4]
   group count   mean       sd
  (fctr) (int)  (dbl)    (dbl)
1  after    10 393.65 29.39801
2 before    10 199.16 18.47354

Visualize your data using box plots

To use R base graphs read this: R base graphs. Here, we’ll use the ggpubr R package for an easy ggplot2-based data visualization.

  • Install the latest version of ggpubr from GitHub as follows (recommended):
# Install
if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/ggpubr")
  • Or, install from CRAN as follows:
install.packages("ggpubr")
  • Visualize your data:
# Plot weight by group and color by group
library("ggpubr")
ggboxplot(my_data, x = "group", y = "weight", 
          color = "group", palette = c("#00AFBB", "#E7B800"),
          order = c("before", "after"),
          ylab = "Weight", xlab = "Groups")
Paired Samples T-test in R

Box plots show you the increase, but lose the paired information. You can use the function plot.paired() [in PairedData package] to plot paired data (“before - after” plot).

  • Install PairedData package:
install.packages("PairedData")
  • Plot paired data:
# Subset weight data before treatment
before <- subset(my_data,  group == "before", weight,
                 drop = TRUE)
# subset weight data after treatment
after <- subset(my_data,  group == "after", weight,
                 drop = TRUE)
# Plot paired data
library(PairedData)
pd <- paired(before, after)
plot(pd, type = "profile") + theme_bw()
Paired Samples T-test in R

Preliminary test to check paired t-test assumptions

Assumption 1: Are the two samples paired?

Yes, since the data have been collected from measuring twice the weight of the same mice.

Assumption 2: Is this a large sample?

No, because n < 30. Since the sample size is not large enough (less than 30), we need to check whether the differences of the pairs follow a normal distribution.

How to check the normality?

Use Shapiro-Wilk normality test as described at: Normality Test in R.

  • Null hypothesis: the data are normally distributed
  • Alternative hypothesis: the data are not normally distributed
# compute the difference
d <- with(my_data, 
        weight[group == "before"] - weight[group == "after"])
# Shapiro-Wilk normality test for the differences
shapiro.test(d) # => p-value = 0.6141

From the output, the p-value is greater than the significance level 0.05, implying that the distribution of the differences (d) is not significantly different from a normal distribution. In other words, we can assume normality.

Note that, if the data are not normally distributed, it’s recommended to use the non parametric paired two-samples Wilcoxon test.

Compute paired samples t-test

Question: Is there any significant change in the weights of mice after treatment?

1) Compute paired t-test - Method 1: The data are saved in two different numeric vectors.

# Compute t-test
res <- t.test(before, after, paired = TRUE)
res

    Paired t-test
data:  before and after
t = -20.883, df = 9, p-value = 6.2e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -215.5581 -173.4219
sample estimates:
mean of the differences 
                -194.49 

2) Compute paired t-test - Method 2: The data are saved in a data frame.

# Compute t-test
res <- t.test(weight ~ group, data = my_data, paired = TRUE)
res

    Paired t-test
data:  weight by group
t = 20.883, df = 9, p-value = 6.2e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 173.4219 215.5581
sample estimates:
mean of the differences 
                 194.49 

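As you can see, the two methods give the same result up to the sign of the difference: method 1 computes before - after, whereas method 2 computes after - before, because the levels of the group variable are sorted alphabetically ("after" comes first).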
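
Depending on your R version, the formula interface may not accept paired = TRUE (this is an assumption about newer releases; check ?t.test on your installation). A version-independent sketch is to extract the two vectors from the data frame and call t.test() on them directly. It relies on the rows of each group being stored in the same mouse order; the names x_before, x_after and res_vec are hypothetical.

```r
# Example data from this article, in long format
before <- c(200.1, 190.9, 192.7, 213, 241.4, 196.9, 172.2, 185.5, 205.2, 193.7)
after  <- c(392.9, 393.2, 345.1, 393, 434, 427.9, 422, 383.9, 392.3, 352.2)
my_data <- data.frame(
  group = rep(c("before", "after"), each = 10),
  weight = c(before, after)
)

# Extract the weights of each group (same mouse order in both groups)
x_before <- my_data$weight[my_data$group == "before"]
x_after  <- my_data$weight[my_data$group == "after"]

# Paired t-test on after - before, matching the sign of method 2
res_vec <- t.test(x_after, x_before, paired = TRUE)
res_vec$estimate
res_vec$statistic
```

Passing the vectors explicitly also makes the direction of the difference (here after - before) visible in the call.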


In the result above :

  • t is the t-test statistic value (t = 20.88),
  • df is the degrees of freedom (df = 9),
  • p-value is the p-value of the t-test (p-value = \(6.2 \times 10^{-9}\)),
  • conf.int is the confidence interval of the mean difference at 95% (conf.int = [173.42, 215.56]),
  • sample estimates is the mean difference between pairs (mean = 194.49).



Note that:

  • if you want to test whether the average weight before treatment is less than the average weight after treatment, type this:
# The alternative refers to the mean of before - after
t.test(before, after, paired = TRUE,
        alternative = "less")
  • Or, if you want to test whether the average weight before treatment is greater than the average weight after treatment, type this:
t.test(before, after, paired = TRUE,
       alternative = "greater")


Interpretation of the result

The p-value of the test is \(6.2 \times 10^{-9}\), which is less than the significance level alpha = 0.05. We can then reject the null hypothesis and conclude that the average weight of the mice before treatment is significantly different from the average weight after treatment with a p-value = \(6.2 \times 10^{-9}\).

Access to the values returned by t.test() function

The result of t.test() function is a list containing the following components:


  • statistic: the value of the t test statistics
  • parameter: the degrees of freedom for the t test statistics
  • p.value: the p-value for the test
  • conf.int: a confidence interval for the mean appropriate to the specified alternative hypothesis.
  • estimate: the means of the two groups being compared (in the case of independent t test) or difference in means (in the case of paired t test).


The format of the R code to use for getting these values is as follows:

# printing the p-value
res$p.value
[1] 6.200298e-09
# printing the mean
res$estimate
mean of the differences 
                 194.49 
# printing the confidence interval
res$conf.int
[1] 173.4219 215.5581
attr(,"conf.level")
[1] 0.95

Online paired t-test calculator

You can perform paired-samples t-test, online, without any installation by clicking the following link:



Infos

This analysis has been performed using R software (ver. 3.2.4).

One-Sample T-test in R



What is one-sample t-test?


The one-sample t-test is used to compare the mean of one sample to a known standard (or theoretical/hypothetical) mean (\(\mu\)).


Generally, the theoretical mean comes from:

  • a previous experiment. For example, compare whether the mean weight of mice differs from 200 mg, a value determined in a previous study.
  • or from an experiment where you have control and treatment conditions. If you express your data as “percent of control”, you can test whether the average value of treatment condition differs significantly from 100.

Note that the one-sample t-test can be used only when the data are normally distributed. This can be checked using the Shapiro-Wilk test.


One Sample t-test


Research questions and statistical hypotheses

Typical research questions are:


  1. whether the mean (\(m\)) of the sample is equal to the theoretical mean (\(\mu\))?
  2. whether the mean (\(m\)) of the sample is less than the theoretical mean (\(\mu\))?
  3. whether the mean (\(m\)) of the sample is greater than the theoretical mean (\(\mu\))?


In statistics, we can define the corresponding null hypothesis (\(H_0\)) as follows:

  1. \(H_0: m = \mu\)
  2. \(H_0: m \leq \mu\)
  3. \(H_0: m \geq \mu\)

The corresponding alternative hypotheses (\(H_a\)) are as follows:

  1. \(H_a: m \ne \mu\) (different)
  2. \(H_a: m > \mu\) (greater)
  3. \(H_a: m < \mu\) (less)

Note that:

  • Hypotheses 1) are called two-tailed tests
  • Hypotheses 2) and 3) are called one-tailed tests

Formula of one-sample t-test

The t-statistic can be calculated as follows:

\[ t = \frac{m-\mu}{s/\sqrt{n}} \]

where,

  • m is the sample mean
  • n is the sample size
  • s is the sample standard deviation with \(n-1\) degrees of freedom
  • \(\mu\) is the theoretical value

We can compute the p-value corresponding to the absolute value of the t-test statistics (|t|) for the degrees of freedom (df): \(df = n - 1\).

How to interpret the results?

If the p-value is less than or equal to the significance level 0.05, we can reject the null hypothesis and accept the alternative hypothesis. In other words, we conclude that the sample mean is significantly different from the theoretical mean.
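
The formula can be checked by hand in R. The sketch below uses the 10 example mouse weights printed later in this article and a theoretical mean of 25; the names weight, mu0, t_stat, p_value and ref are chosen here for illustration.

```r
# Example weights of 10 mice (values printed later in this article)
weight <- c(17.6, 20.6, 22.2, 15.3, 20.9, 21.0, 18.9, 18.9, 18.9, 18.2)
mu0 <- 25                    # theoretical mean

n <- length(weight)          # sample size
m <- mean(weight)            # sample mean
s <- sd(weight)              # sample standard deviation

t_stat  <- (m - mu0) / (s / sqrt(n))  # t statistic from the formula
df      <- n - 1                      # degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df)   # two-tailed p-value

c(t = t_stat, df = df, p.value = p_value)

# Cross-check against the built-in function
ref <- t.test(weight, mu = mu0)
all.equal(t_stat, unname(ref$statistic))
```

The hand-computed values should agree with the output of t.test().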

Visualize your data and compute one-sample t-test in R

Install ggpubr R package for data visualization

You can draw R base graphs as described at this link: R base graphs. Here, we’ll use the ggpubr R package for an easy ggplot2-based data visualization.

  • Install the latest version from GitHub as follows (recommended):
# Install
if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/ggpubr")
  • Or, install from CRAN as follows:
install.packages("ggpubr")

R function to compute one-sample t-test

To perform a one-sample t-test, the R function t.test() can be used as follows:

t.test(x, mu = 0, alternative = "two.sided")

  • x: a numeric vector containing your data values
  • mu: the theoretical mean. Default is 0 but you can change it.
  • alternative: the alternative hypothesis. Allowed value is one of “two.sided” (default), “greater” or “less”.


Import your data into R

  1. Prepare your data as specified here: Best practices for preparing your data set for R

  2. Save your data in an external tab-delimited .txt file or a .csv file

  3. Import your data into R as follows:

# If .txt tab file, use this
my_data <- read.delim(file.choose())
# Or, if .csv file, use this
my_data <- read.csv(file.choose())

Here, we’ll use an example data set containing the weight of 10 mice.

We want to know whether the average weight of the mice differs from 25g.

set.seed(1234)
my_data <- data.frame(
  name = paste0(rep("M_", 10), 1:10),
  weight = round(rnorm(10, 20, 2), 1)
)

Check your data

# Print the first 10 rows of the data
head(my_data, 10)
   name weight
1   M_1   17.6
2   M_2   20.6
3   M_3   22.2
4   M_4   15.3
5   M_5   20.9
6   M_6   21.0
7   M_7   18.9
8   M_8   18.9
9   M_9   18.9
10 M_10   18.2
# Statistical summaries of weight
summary(my_data$weight)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  15.30   18.38   18.90   19.25   20.82   22.20 
  • Min.: the minimum value
  • 1st Qu.: The first quartile. 25% of values are lower than this.
  • Median: the median value. Half the values are lower; half are higher.
  • 3rd Qu.: the third quartile. 75% of values are lower than this.
  • Max.: the maximum value

Visualize your data using box plots

library(ggpubr)
ggboxplot(my_data$weight, 
          ylab = "Weight (g)", xlab = FALSE,
          ggtheme = theme_minimal())
One-Sample Student's T-test in R

Preliminary test to check one-sample t-test assumptions

  1. Is this a large sample? - No, because n < 30.
  2. Since the sample size is not large enough (less than 30, central limit theorem), we need to check whether the data follow a normal distribution.

How to check the normality?

Read this article: Normality Test in R.

Briefly, it’s possible to use the Shapiro-Wilk normality test and to look at the normality plot.

  1. Shapiro-Wilk test:
    • Null hypothesis: the data are normally distributed
    • Alternative hypothesis: the data are not normally distributed
shapiro.test(my_data$weight) # => p-value = 0.6993

From the output, the p-value is greater than the significance level 0.05, implying that the distribution of the data is not significantly different from a normal distribution. In other words, we can assume normality.

  • Visual inspection of the data normality using Q-Q plots (quantile-quantile plots). Q-Q plot draws the correlation between a given sample and the normal distribution.
library("ggpubr")
ggqqplot(my_data$weight, ylab = "Mouse weight (g)",
         ggtheme = theme_minimal())
One-Sample Student's T-test in R

From the normality plot, we conclude that the data may come from a normal distribution.

Note that, if the data are not normally distributed, it’s recommended to use the non parametric one-sample Wilcoxon rank test.

Compute one-sample t-test

We want to know whether the average weight of the mice differs from 25g (two-tailed test).

# One-sample t-test
res <- t.test(my_data$weight, mu = 25)
# Printing the results
res 

    One Sample t-test
data:  my_data$weight
t = -9.0783, df = 9, p-value = 7.953e-06
alternative hypothesis: true mean is not equal to 25
95 percent confidence interval:
 17.8172 20.6828
sample estimates:
mean of x 
    19.25 

In the result above :

  • t is the t-test statistic value (t = -9.078),
  • df is the degrees of freedom (df = 9),
  • p-value is the p-value of the t-test (p-value = \(7.953 \times 10^{-6}\)),
  • conf.int is the confidence interval of the mean at 95% (conf.int = [17.8172, 20.6828]),
  • sample estimates is the mean value of the sample (mean = 19.25).



Note that:

  • if you want to test whether the mean weight of mice is less than 25g (one-tailed test), type this:
t.test(my_data$weight, mu = 25,
              alternative = "less")
  • Or, if you want to test whether the mean weight of mice is greater than 25g (one-tailed test), type this:
t.test(my_data$weight, mu = 25,
              alternative = "greater")


Interpretation of the result

The p-value of the test is \(7.953 \times 10^{-6}\), which is less than the significance level alpha = 0.05. We can conclude that the mean weight of the mice is significantly different from 25g with a p-value = \(7.953 \times 10^{-6}\).

Access to the values returned by t.test() function

The result of t.test() function is a list containing the following components:


  • statistic: the value of the t test statistics
  • parameter: the degrees of freedom for the t test statistics
  • p.value: the p-value for the test
  • conf.int: a confidence interval for the mean appropriate to the specified alternative hypothesis.
  • estimate: the estimated mean (in the case of one-sample t test), the means of the two groups being compared (in the case of independent t test) or the mean of the differences (in the case of paired t test).


The format of the R code to use for getting these values is as follows:

# printing the p-value
res$p.value
[1] 7.953383e-06
# printing the mean
res$estimate
mean of x 
    19.25 
# printing the confidence interval
res$conf.int
[1] 17.8172 20.6828
attr(,"conf.level")
[1] 0.95

Online one-sample t-test calculator

You can perform one-sample t-test, online, without any installation by clicking the following link:



Infos

This analysis has been performed using R software (ver. 3.2.4).


One-Sample Wilcoxon Signed Rank Test in R



What’s one-sample Wilcoxon signed rank test?


The one-sample Wilcoxon signed rank test is a non-parametric alternative to one-sample t-test when the data cannot be assumed to be normally distributed. It’s used to determine whether the median of the sample is equal to a known standard value (i.e. theoretical value).


Note that the data should be distributed symmetrically around the median. In other words, the spread of the values above the median should roughly mirror the spread of the values below it.


One Sample Wilcoxon test


Research questions and statistical hypotheses

Typical research questions are:


  1. whether the median (\(m\)) of the sample is equal to the theoretical value (\(m_0\))?
  2. whether the median (\(m\)) of the sample is less than the theoretical value (\(m_0\))?
  3. whether the median (\(m\)) of the sample is greater than the theoretical value (\(m_0\))?


In statistics, we can define the corresponding null hypothesis (\(H_0\)) as follows:

  1. \(H_0: m = m_0\)
  2. \(H_0: m \leq m_0\)
  3. \(H_0: m \geq m_0\)

The corresponding alternative hypotheses (\(H_a\)) are as follows:

  1. \(H_a: m \ne m_0\) (different)
  2. \(H_a: m > m_0\) (greater)
  3. \(H_a: m < m_0\) (less)

Note that:

  • Hypotheses 1) are called two-tailed tests
  • Hypotheses 2) and 3) are called one-tailed tests

Visualize your data and compute one-sample Wilcoxon test in R

Install ggpubr R package for data visualization

You can draw R base graphs as described at this link: R base graphs. Here, we’ll use the ggpubr R package for an easy ggplot2-based data visualization

  • Install the latest version from GitHub as follows (recommended):
# Install
if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/ggpubr")
  • Or, install from CRAN as follows:
install.packages("ggpubr")

R function to compute one-sample Wilcoxon test

To perform a one-sample Wilcoxon test, the R function wilcox.test() can be used as follows:

wilcox.test(x, mu = 0, alternative = "two.sided")

  • x: a numeric vector containing your data values
  • mu: the theoretical mean/median value. Default is 0 but you can change it.
  • alternative: the alternative hypothesis. Allowed value is one of “two.sided” (default), “greater” or “less”.


Import your data into R

  1. Prepare your data as specified here: Best practices for preparing your data set for R

  2. Save your data in an external tab-delimited .txt file or a .csv file

  3. Import your data into R as follows:

# If .txt tab file, use this
my_data <- read.delim(file.choose())
# Or, if .csv file, use this
my_data <- read.csv(file.choose())

Here, we’ll use an example data set containing the weight of 10 mice.

We want to know whether the median weight of the mice differs from 25g.

set.seed(1234)
my_data <- data.frame(
  name = paste0(rep("M_", 10), 1:10),
  weight = round(rnorm(10, 20, 2), 1)
)

Check your data

# Print the first 10 rows of the data
head(my_data, 10)
   name weight
1   M_1   17.6
2   M_2   20.6
3   M_3   22.2
4   M_4   15.3
5   M_5   20.9
6   M_6   21.0
7   M_7   18.9
8   M_8   18.9
9   M_9   18.9
10 M_10   18.2
# Statistical summaries of weight
summary(my_data$weight)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  15.30   18.38   18.90   19.25   20.82   22.20 
  • Min.: the minimum value
  • 1st Qu.: The first quartile. 25% of values are lower than this.
  • Median: the median value. Half the values are lower; half are higher.
  • 3rd Qu.: the third quartile. 75% of values are lower than this.
  • Max.: the maximum value

Visualize your data using box plots

library(ggpubr)
ggboxplot(my_data$weight, 
          ylab = "Weight (g)", xlab = FALSE,
          ggtheme = theme_minimal())
One-Sample Wilcoxon Signed Rank Test in R

Compute one-sample Wilcoxon test

We want to know whether the median weight of the mice differs from 25g (two-tailed test).

# One-sample wilcoxon test
res <- wilcox.test(my_data$weight, mu = 25)
# Printing the results
res 

    Wilcoxon signed rank test with continuity correction
data:  my_data$weight
V = 0, p-value = 0.005793
alternative hypothesis: true location is not equal to 25
# print only the p-value
res$p.value
[1] 0.005793045

The p-value of the test is 0.005793, which is less than the significance level alpha = 0.05. We can reject the null hypothesis and conclude that the median weight of the mice is significantly different from 25g with a p-value = 0.005793.
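
To see where the statistic V in the output comes from, here is a small sketch that computes it by hand: V is the sum of the ranks of the absolute differences from the theoretical value, taken over the positive differences only. The weights are the 10 example values printed above; the names mu0, d, r, V and ref are chosen for illustration.

```r
# Example weights of 10 mice (values printed above)
weight <- c(17.6, 20.6, 22.2, 15.3, 20.9, 21.0, 18.9, 18.9, 18.9, 18.2)
mu0 <- 25                 # theoretical median

d <- weight - mu0         # differences from the theoretical value
d <- d[d != 0]            # zero differences are discarded
r <- rank(abs(d))         # ranks of the absolute differences (ties averaged)

V <- sum(r[d > 0])        # sum of the ranks of the positive differences
V

# Cross-check; warnings about ties are suppressed for this comparison
ref <- suppressWarnings(wilcox.test(weight, mu = mu0))
unname(ref$statistic)
```

Here every weight is below 25g, so there are no positive differences and V is 0.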


Note that:

  • if you want to test whether the median weight of mice is less than 25g (one-tailed test), type this:
wilcox.test(my_data$weight, mu = 25,
              alternative = "less")
  • Or, if you want to test whether the median weight of mice is greater than 25g (one-tailed test), type this:
wilcox.test(my_data$weight, mu = 25,
              alternative = "greater")


Infos

This analysis has been performed using R software (ver. 3.2.4).

Unpaired Two-Samples T-test in R



What is unpaired two-samples t-test?


The unpaired two-samples t-test is used to compare the means of two independent groups.


For example, suppose that we have measured the weight of 100 individuals: 50 women (group A) and 50 men (group B). We want to know if the mean weight of women (\(m_A\)) is significantly different from that of men (\(m_B\)).

In this case, we have two unrelated (i.e., independent or unpaired) groups of samples. Therefore, it’s possible to use an independent t-test to evaluate whether the means are different.

Note that, unpaired two-samples t-test can be used only under certain conditions:

  • when the two groups of samples (A and B), being compared, are normally distributed. This can be checked using Shapiro-Wilk test.
  • and when the variances of the two groups are equal. This can be checked using F-test.



Unpaired two-samples t-test

This article describes the formula of the independent t-test and provides practical examples in R.


Research questions and statistical hypotheses

Typical research questions are:


  1. whether the mean of group A (\(m_A\)) is equal to the mean of group B (\(m_B\))?
  2. whether the mean of group A (\(m_A\)) is less than the mean of group B (\(m_B\))?
  3. whether the mean of group A (\(m_A\)) is greater than the mean of group B (\(m_B\))?


In statistics, we can define the corresponding null hypothesis (\(H_0\)) as follows:

  1. \(H_0: m_A = m_B\)
  2. \(H_0: m_A \leq m_B\)
  3. \(H_0: m_A \geq m_B\)

The corresponding alternative hypotheses (\(H_a\)) are as follows:

  1. \(H_a: m_A \ne m_B\) (different)
  2. \(H_a: m_A > m_B\) (greater)
  3. \(H_a: m_A < m_B\) (less)

Note that:

  • Hypotheses 1) are called two-tailed tests
  • Hypotheses 2) and 3) are called one-tailed tests

Formula of unpaired two-samples t-test

  1. Classical t-test:

If the variances of the two groups are equal (homoscedasticity), the t-test value, comparing the two samples (\(A\) and \(B\)), can be calculated as follows.

\[ t = \frac{m_A - m_B}{\sqrt{ \frac{S^2}{n_A} + \frac{S^2}{n_B} }} \]

where,

  • \(m_A\) and \(m_B\) represent the mean value of the group A and B, respectively.
  • \(n_A\) and \(n_B\) represent the sizes of the group A and B, respectively.
  • \(S^2\) is an estimator of the pooled variance of the two groups. It can be calculated as follows:

\[ S^2 = \frac{\sum{(x-m_A)^2}+\sum{(x-m_B)^2}}{n_A+n_B-2} \]

with degrees of freedom (df): \(df = n_A + n_B - 2\).

  2. Welch t-statistic:

If the variances of the two groups being compared are different (heteroscedasticity), it’s possible to use the Welch t-test, an adaptation of Student’s t-test.

The Welch t-statistic is calculated as follows:

\[ t = \frac{m_A - m_B}{\sqrt{ \frac{S_A^2}{n_A} + \frac{S_B^2}{n_B} }} \]

where \(S_A\) and \(S_B\) are the standard deviations of the two groups A and B, respectively.

Unlike the classic Student’s t-test, the Welch t-test formula involves the variance of each of the two groups (\(S_A^2\) and \(S_B^2\)) being compared. In other words, it does not use the pooled variance \(S^2\).

The degrees of freedom of the Welch t-test are estimated as follows:

\[ df = \left(\frac{S_A^2}{n_A} + \frac{S_B^2}{n_B}\right)^2 / \left(\frac{S_A^4}{n_A^2(n_A-1)} + \frac{S_B^4}{n_B^2(n_B-1)}\right) \]

A p-value can be computed for the corresponding absolute value of t-statistic (|t|).

Note that, the Welch t-test is considered as the safer one. Usually, the results of the classical t-test and the Welch t-test are very similar unless both the group sizes and the standard deviations are very different.
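
Both statistics can be computed by hand and compared with t.test(). The sketch below uses the example weights of 9 women (group A) and 9 men (group B) given later in this article; all object names other than the two data vectors are introduced here for illustration.

```r
# Example data: weights of 9 women (group A) and 9 men (group B)
women_weight <- c(38.9, 61.2, 73.3, 21.8, 63.4, 64.6, 48.4, 48.8, 48.5)
men_weight   <- c(67.8, 60, 63.4, 76, 89.4, 73.3, 67.3, 61.3, 62.4)

mA <- mean(women_weight); mB <- mean(men_weight)     # group means
vA <- var(women_weight);  vB <- var(men_weight)      # group variances
nA <- length(women_weight); nB <- length(men_weight) # group sizes

# 1. Classical t-test with pooled variance
S2 <- ((nA - 1) * vA + (nB - 1) * vB) / (nA + nB - 2)
t_classic  <- (mA - mB) / sqrt(S2 / nA + S2 / nB)
df_classic <- nA + nB - 2

# 2. Welch t-test with separate variances
se2 <- vA / nA + vB / nB
t_welch  <- (mA - mB) / sqrt(se2)
df_welch <- se2^2 / ((vA / nA)^2 / (nA - 1) + (vB / nB)^2 / (nB - 1))

c(t_classic = t_classic, df_classic = df_classic,
  t_welch = t_welch, df_welch = df_welch)

# Cross-check against the built-in function
ref_classic <- t.test(women_weight, men_weight, var.equal = TRUE)
ref_welch   <- t.test(women_weight, men_weight, var.equal = FALSE)
```

Because both groups have the same size here, the two t values coincide and only the degrees of freedom differ, so the Welch p-value is slightly larger.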

How to interpret the results?

If the p-value is less than or equal to the significance level 0.05, we can reject the null hypothesis and accept the alternative hypothesis. In other words, we can conclude that the mean values of group A and B are significantly different.

Visualize your data and compute unpaired two-samples t-test in R

Install ggpubr R package for data visualization

You can draw R base graphs as described at this link: R base graphs. Here, we’ll use the ggpubr R package for an easy ggplot2-based data visualization

  • Install the latest version from GitHub as follows (recommended):
# Install
if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/ggpubr")
  • Or, install from CRAN as follows:
install.packages("ggpubr")

R function to compute unpaired two-samples t-test

To perform a two-samples t-test comparing the means of two independent samples (x & y), the R function t.test() can be used as follows:

t.test(x, y, alternative = "two.sided", var.equal = FALSE)

  • x,y: numeric vectors
  • alternative: the alternative hypothesis. Allowed value is one of “two.sided” (default), “greater” or “less”.
  • var.equal: a logical variable indicating whether to treat the two variances as being equal. If TRUE, the pooled variance is used to estimate the variance; otherwise, the Welch test is used.


Import your data into R

  1. Prepare your data as specified here: Best practices for preparing your data set for R

  2. Save your data in an external tab-delimited .txt file or a .csv file

  3. Import your data into R as follows:

# If .txt tab file, use this
my_data <- read.delim(file.choose())
# Or, if .csv file, use this
my_data <- read.csv(file.choose())

Here, we’ll use an example data set, which contains the weight of 18 individuals (9 women and 9 men):

# Data in two numeric vectors
women_weight <- c(38.9, 61.2, 73.3, 21.8, 63.4, 64.6, 48.4, 48.8, 48.5)
men_weight <- c(67.8, 60, 63.4, 76, 89.4, 73.3, 67.3, 61.3, 62.4) 
# Create a data frame
my_data <- data.frame( 
                group = rep(c("Woman", "Man"), each = 9),
                weight = c(women_weight,  men_weight)
                )

We want to know whether the average women’s weight differs from the average men’s weight.

Check your data

# Print all data
print(my_data)
   group weight
1  Woman   38.9
2  Woman   61.2
3  Woman   73.3
4  Woman   21.8
5  Woman   63.4
6  Woman   64.6
7  Woman   48.4
8  Woman   48.8
9  Woman   48.5
10   Man   67.8
11   Man   60.0
12   Man   63.4
13   Man   76.0
14   Man   89.4
15   Man   73.3
16   Man   67.3
17   Man   61.3
18   Man   62.4

It’s possible to compute summary statistics (mean and sd) by groups. The dplyr package can be used.

  • To install dplyr package, type this:
install.packages("dplyr")
  • Compute summary statistics by groups:
library(dplyr)
group_by(my_data, group) %>%
  summarise(
    count = n(),
    mean = mean(weight, na.rm = TRUE),
    sd = sd(weight, na.rm = TRUE)
  )
Source: local data frame [2 x 4]
   group count     mean        sd
  (fctr) (int)    (dbl)     (dbl)
1    Man     9 68.98889  9.375426
2  Woman     9 52.10000 15.596714
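
If you prefer not to install dplyr, the same group means and standard deviations can be obtained with the base R function aggregate(). This is a sketch that rebuilds the example data so it runs on its own; the names group_means and group_sds are introduced here for illustration.

```r
# Example data from this article
women_weight <- c(38.9, 61.2, 73.3, 21.8, 63.4, 64.6, 48.4, 48.8, 48.5)
men_weight <- c(67.8, 60, 63.4, 76, 89.4, 73.3, 67.3, 61.3, 62.4)
my_data <- data.frame(
  group = rep(c("Woman", "Man"), each = 9),
  weight = c(women_weight, men_weight)
)

# Mean and standard deviation of weight within each group (base R)
group_means <- aggregate(weight ~ group, data = my_data, FUN = mean)
group_sds <- aggregate(weight ~ group, data = my_data, FUN = sd)
group_means
group_sds
```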

Visualize your data using box plots

# Plot weight by group and color by group
library("ggpubr")
ggboxplot(my_data, x = "group", y = "weight", 
          color = "group", palette = c("#00AFBB", "#E7B800"),
        ylab = "Weight", xlab = "Groups")

Preliminary test to check independent t-test assumptions

Assumption 1: Are the two samples independent?

Yes, since the samples from men and women are not related.

Assumption 2: Do the data from each of the 2 groups follow a normal distribution?

Use Shapiro-Wilk normality test as described at: Normality Test in R.

  • Null hypothesis: the data are normally distributed
  • Alternative hypothesis: the data are not normally distributed

We’ll use the functions with() and shapiro.test() to compute Shapiro-Wilk test for each group of samples.

# Shapiro-Wilk normality test for Men's weights
with(my_data, shapiro.test(weight[group == "Man"]))# p = 0.1
# Shapiro-Wilk normality test for Women's weights
with(my_data, shapiro.test(weight[group == "Woman"])) # p = 0.6

From the output, the two p-values are greater than the significance level 0.05, implying that the distributions of the data are not significantly different from the normal distribution. In other words, we can assume normality.
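
The two calls above can also be written as a single step that returns one p-value per group. This is a sketch using the base R function tapply(); the name p_values is introduced here for illustration.

```r
# Example data from this article
women_weight <- c(38.9, 61.2, 73.3, 21.8, 63.4, 64.6, 48.4, 48.8, 48.5)
men_weight <- c(67.8, 60, 63.4, 76, 89.4, 73.3, 67.3, 61.3, 62.4)
my_data <- data.frame(
  group = rep(c("Woman", "Man"), each = 9),
  weight = c(women_weight, men_weight)
)

# Run the Shapiro-Wilk test within each group and keep only the p-values
p_values <- tapply(my_data$weight, my_data$group,
                   function(w) shapiro.test(w)$p.value)
p_values

# TRUE when normality is not rejected in any group at the 5% level
all(p_values > 0.05)
```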

Note that, if the data are not normally distributed, it’s recommended to use the non-parametric two-samples Wilcoxon rank sum test.

Assumption 3. Do the two populations have the same variances?

We’ll use the F-test to test for homogeneity of variances. This can be performed with the function var.test() as follows:

res.ftest <- var.test(weight ~ group, data = my_data)
res.ftest

    F test to compare two variances
data:  weight by group
F = 0.36134, num df = 8, denom df = 8, p-value = 0.1714
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.08150656 1.60191315
sample estimates:
ratio of variances 
         0.3613398 

The p-value of F-test is p = 0.1713596. It’s greater than the significance level alpha = 0.05. In conclusion, there is no significant difference between the variances of the two sets of data. Therefore, we can use the classic t-test, which assumes equality of the two variances.
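
To see where the F statistic comes from, the sketch below recomputes it by hand as the ratio of the two sample variances and derives the two-sided p-value from the F distribution. The names f_ratio and p_two_sided are introduced here for illustration.

```r
# Example data from this article
women_weight <- c(38.9, 61.2, 73.3, 21.8, 63.4, 64.6, 48.4, 48.8, 48.5)
men_weight <- c(67.8, 60, 63.4, 76, 89.4, 73.3, 67.3, 61.3, 62.4)

# F statistic: variance of the first group (Man) over the second (Woman)
f_ratio <- var(men_weight) / var(women_weight)

# Degrees of freedom: n - 1 in each group
df1 <- length(men_weight) - 1
df2 <- length(women_weight) - 1

# Two-sided p-value: twice the smaller tail probability
lower_tail <- pf(f_ratio, df1, df2)
p_two_sided <- 2 * min(lower_tail, 1 - lower_tail)

f_ratio
p_two_sided
```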

Compute unpaired two-samples t-test

Question : Is there any significant difference between women and men weights?

1) Compute independent t-test - Method 1: The data are saved in two different numeric vectors.

# Compute t-test
res <- t.test(women_weight, men_weight, var.equal = TRUE)
res

    Two Sample t-test
data:  women_weight and men_weight
t = -2.7842, df = 16, p-value = 0.01327
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -29.748019  -4.029759
sample estimates:
mean of x mean of y 
 52.10000  68.98889 

2) Compute independent t-test - Method 2: The data are saved in a data frame.

# Compute t-test
res <- t.test(weight ~ group, data = my_data, var.equal = TRUE)
res

    Two Sample t-test
data:  weight by group
t = 2.7842, df = 16, p-value = 0.01327
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  4.029759 29.748019
sample estimates:
  mean in group Man mean in group Woman 
           68.98889            52.10000 

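As you can see, the two methods give the same p-value. Only the sign of the t statistic and of the confidence interval is reversed: Method 1 computes women minus men, whereas the formula interface orders the groups alphabetically and computes Man minus Woman.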


In the result above :

  • t is the t-test statistic value (t = 2.784),
  • df is the degrees of freedom (df = 16),
  • p-value is the p-value of the t-test (p-value = 0.01327), to be compared with the significance level,
  • conf.int is the 95% confidence interval of the difference in means (conf.int = [4.0298, 29.748]),
  • sample estimates are the mean values of the two groups (mean = 68.9888889, 52.1).
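
The values reported by t.test() can be reproduced by hand from the pooled-variance formula. The sketch below does so for Method 1 (women minus men); the names sp2, se, t_stat and p_value are introduced here for illustration.

```r
# Example data from this article
women_weight <- c(38.9, 61.2, 73.3, 21.8, 63.4, 64.6, 48.4, 48.8, 48.5)
men_weight <- c(67.8, 60, 63.4, 76, 89.4, 73.3, 67.3, 61.3, 62.4)

n1 <- length(women_weight)
n2 <- length(men_weight)

# Pooled variance: weighted average of the two sample variances
sp2 <- ((n1 - 1) * var(women_weight) + (n2 - 1) * var(men_weight)) / (n1 + n2 - 2)

# Standard error of the difference in means
se <- sqrt(sp2 * (1 / n1 + 1 / n2))

# t statistic and two-sided p-value with n1 + n2 - 2 degrees of freedom
t_stat <- (mean(women_weight) - mean(men_weight)) / se
p_value <- 2 * pt(-abs(t_stat), df = n1 + n2 - 2)

t_stat
p_value
```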



Note that:

  • if you want to test whether the average men’s weight is less than the average women’s weight, type this:
t.test(weight ~ group, data = my_data,
        var.equal = TRUE, alternative = "less")
  • Or, if you want to test whether the average men’s weight is greater than the average women’s weight, type this
t.test(weight ~ group, data = my_data,
        var.equal = TRUE, alternative = "greater")


Interpretation of the result

The p-value of the test is 0.01327, which is less than the significance level alpha = 0.05. We can conclude that men’s average weight is significantly different from women’s average weight with a p-value = 0.01327.

Access to the values returned by t.test() function

The result of t.test() function is a list containing the following components:


  • statistic: the value of the t test statistics
  • parameter: the degrees of freedom for the t test statistics
  • p.value: the p-value for the test
  • conf.int: a confidence interval for the difference in means appropriate to the specified alternative hypothesis.
  • estimate: the means of the two groups being compared (in the case of independent t test) or difference in means (in the case of paired t test).


The format of the R code to use for getting these values is as follows:

# printing the p-value
res$p.value
[1] 0.0132656
# printing the mean
res$estimate
  mean in group Man mean in group Woman 
           68.98889            52.10000 
# printing the confidence interval
res$conf.int
[1]  4.029759 29.748019
attr(,"conf.level")
[1] 0.95

Online unpaired two-samples t-test calculator

You can perform unpaired two-samples t-test, online, without any installation by clicking the following link:



Infos

This analysis has been performed using R software (ver. 3.2.4).

Unpaired Two-Samples Wilcoxon Test in R


The unpaired two-samples Wilcoxon test (also known as Wilcoxon rank sum test or Mann-Whitney test) is a non-parametric alternative to the unpaired two-samples t-test, which can be used to compare two independent groups of samples. It’s used when your data are not normally distributed.



Unpaired two-samples wilcoxon test


This article describes how to compute two samples Wilcoxon test in R.

Visualize your data and compute Wilcoxon test in R

R function to compute Wilcoxon test

To perform a two-samples Wilcoxon test comparing two independent samples (x & y), the R function wilcox.test() can be used as follows:

wilcox.test(x, y, alternative = "two.sided")

  • x,y: numeric vectors
  • alternative: the alternative hypothesis. Allowed value is one of “two.sided” (default), “greater” or “less”.


Import your data into R

  1. Prepare your data as specified here: Best practices for preparing your data set for R

  2. Save your data in an external .txt (tab-delimited) or .csv file

  3. Import your data into R as follows:

# If .txt tab file, use this
my_data <- read.delim(file.choose())
# Or, if .csv file, use this
my_data <- read.csv(file.choose())

Here, we’ll use an example data set, which contains the weight of 18 individuals (9 women and 9 men):

# Data in two numeric vectors
women_weight <- c(38.9, 61.2, 73.3, 21.8, 63.4, 64.6, 48.4, 48.8, 48.5)
men_weight <- c(67.8, 60, 63.4, 76, 89.4, 73.3, 67.3, 61.3, 62.4) 
# Create a data frame
my_data <- data.frame( 
                group = rep(c("Woman", "Man"), each = 9),
                weight = c(women_weight,  men_weight)
                )

We want to know whether the median women’s weight differs from the median men’s weight.

Check your data

print(my_data)
   group weight
1  Woman   38.9
2  Woman   61.2
3  Woman   73.3
4  Woman   21.8
5  Woman   63.4
6  Woman   64.6
7  Woman   48.4
8  Woman   48.8
9  Woman   48.5
10   Man   67.8
11   Man   60.0
12   Man   63.4
13   Man   76.0
14   Man   89.4
15   Man   73.3
16   Man   67.3
17   Man   61.3
18   Man   62.4

It’s possible to compute summary statistics (median and interquartile range (IQR)) by groups. The dplyr package can be used.

  • To install dplyr package, type this:
install.packages("dplyr")
  • Compute summary statistics by groups:
library(dplyr)
group_by(my_data, group) %>%
  summarise(
    count = n(),
    median = median(weight, na.rm = TRUE),
    IQR = IQR(weight, na.rm = TRUE)
  )
Source: local data frame [2 x 4]
   group count median   IQR
  (fctr) (int)  (dbl) (dbl)
1    Man     9   67.3  10.9
2  Woman     9   48.8  15.0

Visualize your data using box plots

You can draw R base graphs as described at this link: R base graphs. Here, we’ll use the ggpubr R package for an easy ggplot2-based data visualization.

  • Install the latest version of ggpubr from GitHub as follows (recommended):
# Install
if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/ggpubr")
  • Or, install from CRAN as follow:
install.packages("ggpubr")
  • Visualize your data:
# Plot weight by group and color by group
library("ggpubr")
ggboxplot(my_data, x = "group", y = "weight", 
          color = "group", palette = c("#00AFBB", "#E7B800"),
          ylab = "Weight", xlab = "Groups")

Compute unpaired two-samples Wilcoxon test

Question : Is there any significant difference between women and men weights?

1) Compute two-samples Wilcoxon test - Method 1: The data are saved in two different numeric vectors.

res <- wilcox.test(women_weight, men_weight)
res

    Wilcoxon rank sum test with continuity correction
data:  women_weight and men_weight
W = 15, p-value = 0.02712
alternative hypothesis: true location shift is not equal to 0

It will give a warning message, saying that “cannot compute exact p-value with ties”. The Wilcoxon test assumes that the responses are continuous, so tied values prevent the computation of an exact p-value. You can suppress this message by adding the argument exact = FALSE; the result will be the same.

2) Compute two-samples Wilcoxon test - Method 2: The data are saved in a data frame.

res <- wilcox.test(weight ~ group, data = my_data,
                   exact = FALSE)
res

    Wilcoxon rank sum test with continuity correction
data:  weight by group
W = 66, p-value = 0.02712
alternative hypothesis: true location shift is not equal to 0
# Print the p-value only
res$p.value
[1] 0.02711657

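As you can see, the two methods give the same p-value. The statistic differs (W = 15 versus W = 66) only because the formula interface takes Man as the first group; the two values add up to 9 × 9 = 81.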

The p-value of the test is 0.02712, which is less than the significance level alpha = 0.05. We can conclude that men’s median weight is significantly different from women’s median weight with a p-value = 0.02712.
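
The W statistic can be recomputed by hand from the ranks of the pooled data, which also shows why the two methods report different values of W. This is a sketch; the names r, W_women and W_men are introduced here for illustration.

```r
# Example data from this article
women_weight <- c(38.9, 61.2, 73.3, 21.8, 63.4, 64.6, 48.4, 48.8, 48.5)
men_weight <- c(67.8, 60, 63.4, 76, 89.4, 73.3, 67.3, 61.3, 62.4)

n_w <- length(women_weight)
n_m <- length(men_weight)

# Rank all 18 values together; tied values receive the average rank
r <- rank(c(women_weight, men_weight))

# W for the first sample: sum of its ranks minus the smallest possible sum
W_women <- sum(r[seq_len(n_w)]) - n_w * (n_w + 1) / 2

# Swapping the two samples gives the complementary value
W_men <- n_w * n_m - W_women

W_women  # statistic when women are the first sample (Method 1)
W_men    # statistic when men are the first sample (Method 2)
```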


Note that:

  • if you want to test whether the median men’s weight is less than the median women’s weight, type this:
wilcox.test(weight ~ group, data = my_data, 
        exact = FALSE, alternative = "less")
  • Or, if you want to test whether the median men’s weight is greater than the median women’s weight, type this
wilcox.test(weight ~ group, data = my_data,
        exact = FALSE, alternative = "greater")


Online unpaired two-samples Wilcoxon test calculator

You can perform unpaired two-samples Wilcoxon test, online, without any installation by clicking the following link:



Infos

This analysis has been performed using R software (ver. 3.2.4).

Paired Samples Wilcoxon Test in R


The paired samples Wilcoxon test (also known as Wilcoxon signed-rank test) is a non-parametric alternative to paired t-test used to compare paired data. It’s used when your data are not normally distributed. This tutorial describes how to compute paired samples Wilcoxon test in R.

Differences between paired samples should be distributed symmetrically around the median.


Paired samples wilcoxon test


Visualize your data and compute paired samples Wilcoxon test in R

R function

The R function wilcox.test() can be used as follows:

wilcox.test(x, y, paired = TRUE, alternative = "two.sided")

  • x,y: numeric vectors
  • paired: a logical value specifying that we want to compute a paired Wilcoxon test
  • alternative: the alternative hypothesis. Allowed value is one of “two.sided” (default), “greater” or “less”.


Import your data into R

  1. Prepare your data as specified here: Best practices for preparing your data set for R

  2. Save your data in an external .txt (tab-delimited) or .csv file

  3. Import your data into R as follows:

# If .txt tab file, use this
my_data <- read.delim(file.choose())
# Or, if .csv file, use this
my_data <- read.csv(file.choose())

Here, we’ll use an example data set, which contains the weight of 10 mice before and after the treatment.

# Data in two numeric vectors
# ++++++++++++++++++++++++++
# Weight of the mice before treatment
before <-c(200.1, 190.9, 192.7, 213, 241.4, 196.9, 172.2, 185.5, 205.2, 193.7)
# Weight of the mice after treatment
after <-c(392.9, 393.2, 345.1, 393, 434, 427.9, 422, 383.9, 392.3, 352.2)
# Create a data frame
my_data <- data.frame( 
                group = rep(c("before", "after"), each = 10),
                weight = c(before,  after)
                )

We want to know whether there is any significant difference in the median weights before and after treatment.

Check your data

# Print all data
print(my_data)
    group weight
1  before  200.1
2  before  190.9
3  before  192.7
4  before  213.0
5  before  241.4
6  before  196.9
7  before  172.2
8  before  185.5
9  before  205.2
10 before  193.7
11  after  392.9
12  after  393.2
13  after  345.1
14  after  393.0
15  after  434.0
16  after  427.9
17  after  422.0
18  after  383.9
19  after  392.3
20  after  352.2

Compute summary statistics (median and inter-quartile range (IQR)) by groups using the dplyr package.

  • Install dplyr package:
install.packages("dplyr")
  • Compute summary statistics by groups:
library("dplyr")
group_by(my_data, group) %>%
  summarise(
    count = n(),
    median = median(weight, na.rm = TRUE),
    IQR = IQR(weight, na.rm = TRUE)
  )
Source: local data frame [2 x 4]
   group count median    IQR
  (fctr) (int)  (dbl)  (dbl)
1  after    10 392.95 28.800
2 before    10 195.30 12.575

Visualize your data using box plots

  • To use R base graphs read this: R base graphs. Here, we’ll use the ggpubr R package for an easy ggplot2-based data visualization.

  • Install the latest version of ggpubr from GitHub as follows (recommended):

# Install
if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/ggpubr")
  • Or, install from CRAN as follow:
install.packages("ggpubr")
  • Visualize your data:
# Plot weight by group and color by group
library("ggpubr")
ggboxplot(my_data, x = "group", y = "weight", 
          color = "group", palette = c("#00AFBB", "#E7B800"),
          order = c("before", "after"),
          ylab = "Weight", xlab = "Groups")

Box plots show you the increase, but lose the paired information. You can use the function plot.paired() [in PairedData package] to plot paired data (“before - after” plot).

  • Install PairedData package:
install.packages("PairedData")
  • Plot paired data:
# Subset weight data before treatment
before <- subset(my_data,  group == "before", weight,
                 drop = TRUE)
# subset weight data after treatment
after <- subset(my_data,  group == "after", weight,
                 drop = TRUE)
# Plot paired data
library(PairedData)
pd <- paired(before, after)
plot(pd, type = "profile") + theme_bw()

Compute paired-sample Wilcoxon test

Question : Is there any significant change in the weights of mice before and after treatment?

1) Compute paired Wilcoxon test - Method 1: The data are saved in two different numeric vectors.

res <- wilcox.test(before, after, paired = TRUE)
res

    Wilcoxon signed rank test
data:  before and after
V = 0, p-value = 0.001953
alternative hypothesis: true location shift is not equal to 0

2) Compute paired Wilcoxon-test - Method 2: The data are saved in a data frame.

# Compute paired Wilcoxon test
res <- wilcox.test(weight ~ group, data = my_data, paired = TRUE)
res

    Wilcoxon signed rank test
data:  weight by group
V = 55, p-value = 0.001953
alternative hypothesis: true location shift is not equal to 0
# print only the p-value
res$p.value
[1] 0.001953125

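As you can see, the two methods give the same p-value. The statistic differs (V = 0 versus V = 55) only because the formula interface orders the groups alphabetically and therefore computes after minus before.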

The p-value of the test is 0.001953, which is less than the significance level alpha = 0.05. We can conclude that the median weight of the mice before treatment is significantly different from the median weight after treatment with a p-value = 0.001953.
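
The V statistic is the sum of the ranks of the positive paired differences, so it can be checked by hand. The sketch below does this for both directions of the difference; the names d, V_before_after, V_after_before and p_exact are introduced here for illustration.

```r
# Example data from this article
before <- c(200.1, 190.9, 192.7, 213, 241.4, 196.9, 172.2, 185.5, 205.2, 193.7)
after <- c(392.9, 393.2, 345.1, 393, 434, 427.9, 422, 383.9, 392.3, 352.2)

# Paired differences and the ranks of their absolute values
d <- before - after
ranks <- rank(abs(d))

# V = sum of the ranks attached to positive differences
V_before_after <- sum(ranks[d > 0])  # every mouse gained weight, so no positive d
V_after_before <- sum(ranks[d < 0])  # same statistic computed on after - before

# Exact two-sided p-value from the signed rank distribution
p_exact <- 2 * psignrank(min(V_before_after, V_after_before), length(d))

V_before_after
V_after_before
p_exact
```

With 10 pairs and every difference in the same direction, this is the smallest two-sided p-value the exact test can return: 2 / 2^10.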


Note that:

  • if you want to test whether the median weight before treatment is less than the median weight after treatment, type this:
wilcox.test(before, after, paired = TRUE,
        alternative = "less")
  • Or, if you want to test whether the median weight before treatment is greater than the median weight after treatment, type this:
wilcox.test(before, after, paired = TRUE,
       alternative = "greater")


Online paired-sample Wilcoxon test calculator

You can perform paired-sample Wilcoxon test, online, without any installation by clicking the following link:



Infos

This analysis has been performed using R software (ver. 3.2.4).

MANOVA Test in R: Multivariate Analysis of Variance


What is MANOVA test?


In the situation where there are multiple response variables, you can test them simultaneously using a multivariate analysis of variance (MANOVA). This article describes how to compute MANOVA in R.


For example, we may conduct an experiment where we give two treatments (A and B) to two groups of mice, and we are interested in the weight and height of mice. In that case, the weight and height of mice are two dependent variables, and our hypothesis is that both together are affected by the difference in treatment. A multivariate analysis of variance could be used to test this hypothesis.


MANOVA Test


Assumptions of MANOVA

MANOVA can be used in certain conditions:

  • The dependent variables should be normally distributed within groups. The R function mshapiro.test() [in the mvnormtest package] can be used to perform the Shapiro-Wilk test for multivariate normality. This is useful in the case of MANOVA, which assumes multivariate normality.

  • Homogeneity of variances across the range of predictors.

  • Linearity between all pairs of dependent variables, all pairs of covariates, and all dependent variable-covariate pairs in each cell

Interpretation of MANOVA

If the global multivariate test is significant, we conclude that the corresponding effect (treatment) is significant. In that case, the next question is to determine if the treatment affects only the weight, only the height or both. In other words, we want to identify the specific dependent variables that contributed to the significant global effect.

To answer this question, we can use one-way ANOVA (or univariate ANOVA) to examine separately each dependent variable.

Compute MANOVA in R

Import your data into R

  1. Prepare your data as specified here: Best practices for preparing your data set for R

  2. Save your data in an external .txt (tab-delimited) or .csv file

  3. Import your data into R as follows:

# If .txt tab file, use this
my_data <- read.delim(file.choose())
# Or, if .csv file, use this
my_data <- read.csv(file.choose())

Here, we’ll use iris data set:

# Store the data in the variable my_data
my_data <- iris

Check your data

The R code below displays a random sample of our data using the function sample_n() [in dplyr package]. First, install dplyr if you don’t have it:

install.packages("dplyr")
# Show a random sample
set.seed(1234)
dplyr::sample_n(my_data, 10)
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
94           5.0         2.3          3.3         1.0 versicolor
91           5.5         2.6          4.4         1.2 versicolor
93           5.8         2.6          4.0         1.2 versicolor
127          6.2         2.8          4.8         1.8  virginica
150          5.9         3.0          5.1         1.8  virginica
2            4.9         3.0          1.4         0.2     setosa
34           5.5         4.2          1.4         0.2     setosa
96           5.7         3.0          4.2         1.2 versicolor
74           6.1         2.8          4.7         1.2 versicolor
98           6.2         2.9          4.3         1.3 versicolor

Question: We want to know if there is any significant difference, in sepal and petal length, between the different species.

Compute MANOVA test

The function manova() can be used as follows:

# MANOVA test
res.man <- manova(cbind(Sepal.Length, Petal.Length) ~ Species, data = iris)
summary(res.man)
           Df Pillai approx F num Df den Df    Pr(>F)    
Species     2 0.9885   71.829      4    294 < 2.2e-16 ***
Residuals 147                                            
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Look to see which differ
summary.aov(res.man)
 Response Sepal.Length :
             Df Sum Sq Mean Sq F value    Pr(>F)    
Species       2 63.212  31.606  119.26 < 2.2e-16 ***
Residuals   147 38.956   0.265                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
 Response Petal.Length :
             Df Sum Sq Mean Sq F value    Pr(>F)    
Species       2 437.10 218.551  1180.2 < 2.2e-16 ***
Residuals   147  27.22   0.185                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the output above, it can be seen that the two variables are highly significantly different among Species.
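
By default, summary() reports Pillai's trace for a manova fit. Other multivariate statistics can be requested through its test argument. The sketch below extracts Pillai's trace and Wilks' lambda for the same model; the names pillai and wilks are introduced here for illustration.

```r
# Same model as above, fitted on the built-in iris data
res.man <- manova(cbind(Sepal.Length, Petal.Length) ~ Species, data = iris)

# Pillai's trace (the default test)
pillai <- summary(res.man, test = "Pillai")$stats["Species", "Pillai"]

# Wilks' lambda: values close to 0 indicate strong group separation
wilks <- summary(res.man, test = "Wilks")$stats["Species", "Wilks"]

pillai
wilks
```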

See also

  • Analysis of variance (ANOVA, parametric):
    • One-Way ANOVA Test in R
    • Two-Way ANOVA Test in R
  • Kruskal-Wallis Test in R (non parametric alternative to one-way ANOVA)

Infos

This analysis has been performed using R software (ver. 3.2.4).
