ggfortify : Extension to ggplot2 to handle some popular packages - R software and data visualization


ggfortify extends ggplot2 for plotting some popular R packages in a unified way.

The following R packages and functions are covered:

Package name  Functions
base          matrix and table
cluster       clara, fanny and pam
changepoint   cpt
dlm           dlmFilter and dlmSmooth
fGarch        fGARCH
forecast      bats, forecast, ets and nnetar
fracdiff      fracdiff
glmnet        glmnet
KFAS          KFS and signal
lfda          klfda and self
MASS          isoMDS and sammon
stats         acf, ar, Arima, cmdscale, decomposed.ts, density, factanal, glm, HoltWinters, kmeans, lm, prcomp, princomp, spec, stepfun, stl and ts
survival      survfit and survfit.cox
strucchange   breakpoints and breakpointsfull
timeSeries    timeSeries
tseries       irts
vars          varprd
xts           xts
zoo           zooreg

Installation

ggfortify can be installed from GitHub or CRAN:

# GitHub
if(!require(devtools)) install.packages("devtools")
devtools::install_github("sinhrks/ggfortify")
# CRAN
install.packages("ggfortify")

Loading ggfortify

library("ggfortify")

Plotting matrix

The function autoplot.matrix() is used:

autoplot(object, geom = "tile")
  • object: an object of class matrix
  • geom: allowed values are “tile” (for heatmap) or “point” (for scatter plot)

The mtcars data set is used in the example below.

df <- mtcars[, c("mpg", "disp", "hp", "drat", "wt")]
df <- as.matrix(df)

Plot a heatmap:

# Heatmap
autoplot(scale(df))
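
The heatmap is drawn from scale(df) rather than df because the five variables are measured in very different units. As a minimal sketch in base R (the object names m and m_scaled are illustrative, not from the original), scale() centers each column to mean 0 and rescales it to standard deviation 1, so that no single variable dominates the color range:

```r
# Illustrative sketch: what scale() does to the matrix before plotting
m <- as.matrix(mtcars[, c("mpg", "disp", "hp", "drat", "wt")])
m_scaled <- scale(m)

# Raw columns span very different ranges
apply(m, 2, range)

# After scaling, every column has mean ~0 and standard deviation 1
round(colMeans(m_scaled), 10)
apply(m_scaled, 2, sd)
```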

To draw a scatter plot, the data should be a matrix with two columns named V1 and V2. The R code below plots mpg against wt. We start by renaming the columns.

# Extract the data
df2 <- df[, c("wt", "mpg")]
colnames(df2) <- c("V1", "V2")

# Scatter plot
autoplot(df2, geom = 'point') +
  labs(x = "wt", y = "mpg")

Plotting diagnostics for LM and GLM

The function autoplot.lm() is used to plot diagnostic plots for LM and GLM [in stats package].

autoplot(object, which = c(1:3, 5))
  • object: stats::lm instance
  • which: If a subset of the plots is required, specify a subset of the numbers 1:6.
  • ncol and nrow allow you to specify the number of subplot columns and rows.

Diagnostic plots for Linear Models (LM)

The iris data set is used to fit the linear model.

# Compute a linear model
m <- lm(Petal.Width ~ Petal.Length, data = iris)

# Create the plot
autoplot(m, which = 1:6, ncol = 2, label.size = 3)
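
The six panels are built from standard quantities of the fitted model. As a rough sketch using only base R (stats) functions, and not a description of ggfortify's internals, the values behind the plots can be inspected directly. The data frame name diag_df is illustrative:

```r
# Illustrative sketch: quantities that underlie lm diagnostic plots
m <- lm(Petal.Width ~ Petal.Length, data = iris)

diag_df <- data.frame(
  fitted    = fitted(m),          # x axis of residuals vs fitted
  resid     = residuals(m),       # raw residuals
  std_resid = rstandard(m),       # standardized residuals (Q-Q, scale-location)
  cooks     = cooks.distance(m),  # Cook's distance
  leverage  = hatvalues(m)        # leverage
)
head(diag_df)

# Observations with the largest Cook's distance
head(diag_df[order(-diag_df$cooks), ], 3)
```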

# Change the color by groups (species)
autoplot(m, which = 1:6, label.size = 3, data = iris,
         colour = 'Species')

Diagnostic plots with Generalized Linear Models (GLM)

USArrests data set is used.

# Compute a generalized linear model
m <- glm(Murder ~ Assault + UrbanPop + Rape,
         family = gaussian, data = USArrests)

# Create the plot
# Change the theme and colour
autoplot(m, which = 1:6, ncol = 2, label.size = 3,
         colour = "steelblue") + theme_bw()

Plotting time series

Plotting ts objects

  • Data set: AirPassengers
  • R Function: autoplot.ts()
autoplot(AirPassengers)

The function autoplot() can also handle objects from other time-series packages, including:

  • zoo::zooreg()
  • xts::xts()
  • timeSeries::timeSeries()
  • tseries::irts()
  • forecast::forecast()
  • vars::vars()

Plotting with changepoint package

The changepoint package provides a simple approach for identifying shifts in mean and/or variance in a time series.

ggfortify supports cpt objects from the changepoint package.

library(changepoint)
autoplot(cpt.meanvar(AirPassengers))

Plotting with strucchange package

strucchange is an R package for detecting jumps in data.

Data set: Nile

library(strucchange)
autoplot(breakpoints(Nile ~ 1))

Plotting PCA (Principal Component Analysis)

  • Data set: iris
  • Function: autoplot.prcomp()
# Prepare the data
df <- iris[, -5]

# Principal component analysis
pca <- prcomp(df, scale. = TRUE)

# Plot
autoplot(pca, loadings = TRUE, loadings.label = TRUE,
         data = iris, colour = 'Species')
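
To judge how much of the data a two-component plot captures, the share of variance explained by each component can be computed from the prcomp object. This is a minimal base R sketch; the name var_explained is illustrative:

```r
# Illustrative sketch: variance explained by each principal component
pca <- prcomp(iris[, -5], scale. = TRUE)

var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 3)

# Loadings of the first two components (the arrows drawn when loadings = TRUE)
pca$rotation[, 1:2]
```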

Plotting K-means

  • Data set: USArrests
  • Function: autoplot.kmeans()

The original data must be supplied because a kmeans object does not store it. Samples are colored by group (cluster).

autoplot(kmeans(USArrests, 3), data = USArrests,
         label = TRUE, label.size = 3, frame = TRUE)
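
A quick base R check illustrates why the data argument is needed: the list returned by kmeans() holds cluster assignments and centers, but not the input observations. The set.seed() call is an addition here, since kmeans() starts from random centers and results can otherwise differ between runs.

```r
set.seed(123)  # assumption: any fixed seed, used only for reproducibility
km <- kmeans(USArrests, 3)

# Components stored in the kmeans object; none of them is the input data
names(km)

# One cluster label per state, and one center per cluster
head(km$cluster)
km$centers
```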

Plotting cluster package

ggfortify supports the cluster::clara, cluster::fanny and cluster::pam classes. These functions return objects that contain the original data, so there is no need to pass it explicitly.

The R code below shows an example for pam() function:

library(cluster)
autoplot(pam(iris[-5], 3), frame = TRUE, frame.type = 'norm')

Plotting Local Fisher Discriminant Analysis

library(lfda)
# Local Fisher Discriminant Analysis (LFDA)
model <- lfda(iris[,-5], iris[, 5], 4, metric="plain")
autoplot(model, data = iris, frame = TRUE, frame.colour = 'Species')

Plotting survival curves

library(survival)
fit <- survfit(Surv(time, status) ~ sex, data = lung)
autoplot(fit)

Learn more

ggfortify

Infos

This analysis has been performed using R software (ver. 3.2.1) and ggplot2 (ver. 1.0.1)


ggsave : Save a ggplot - R software and data visualization


ggsave: save the last ggplot

ggsave is a convenient function for saving the last plot that you displayed. It also guesses the type of graphics device from the extension. This means the only argument you need to supply is the filename.

For example, create a plot on the screen and then save it with ggsave():

# 1. Create a plot
# The plot is displayed on the screen
ggplot(mtcars, aes(wt, mpg)) + geom_point()

# 2. Save the plot to a pdf
ggsave("myplot.pdf")

For saving to a png file, use:

ggsave("myplot.png")
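
ggsave() can also be given the plot object and the output size explicitly, which avoids relying on whichever plot was displayed last. A minimal sketch, assuming ggplot2 is installed; the object names p and out and the use of a temporary directory are illustrative choices:

```r
library(ggplot2)

# Keep the plot in a variable instead of relying on the last displayed plot
p <- ggplot(mtcars, aes(wt, mpg)) + geom_point()

# Write to a temporary location; width and height are in inches by default
out <- file.path(tempdir(), "myplot.png")
ggsave(out, plot = p, width = 5, height = 4)

file.exists(out)
```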

Infos

This analysis has been performed using R software (ver. 3.2.1) and ggplot2 (ver. )

Be Awesome in ggplot2: A Practical Guide to be Highly Effective - R software and data visualization

Basics

ggplot2 is a powerful and flexible R package, implemented by Hadley Wickham, for producing elegant graphics. The gg in ggplot2 stands for Grammar of Graphics, a concept that describes plots using a “grammar”.

According to the ggplot2 concept, a plot can be divided into fundamental parts: Plot = data + Aesthetics + Geometry.

The principal components of every plot can be defined as follows:

  • data is a data frame
  • Aesthetics is used to indicate the x and y variables. It can also be used to control the color, the size or the shape of points, the height of bars, etc.
  • Geometry corresponds to the type of graphics (histogram, box plot, line plot, density plot, dot plot, etc.)

Two main functions for creating plots are available in the ggplot2 package: qplot() and ggplot().

  • qplot() is a quick plot function which is easy to use for simple plots.
  • The ggplot() function is more flexible and robust than qplot for building a plot piece by piece.

The generated plot can be kept as a variable and then printed at any time using the function print().

After creating plots, two other important functions are:

  • last_plot(), which returns the last plot to be modified
  • ggsave("plot.png", width = 5, height = 5), which saves the last plot in the current working directory.
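
As a small illustration of keeping a plot in a variable (assuming ggplot2 is installed; the names p and p2 are illustrative), a plot can be stored, extended with further layers and printed later:

```r
library(ggplot2)

# Store the plot in a variable; nothing is drawn yet
p <- ggplot(mtcars, aes(wt, mpg)) + geom_point()

# Add a layer to the stored plot and keep the result
p2 <- p + geom_smooth(method = "lm", se = FALSE)

# Draw the plots explicitly
print(p)
print(p2)
```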

This document describes how to create and customize different types of graphs using ggplot2. Many examples of code and graphics are provided.

Note that the content provided here is also available as a book: ggplot2: The Elements for Elegant Data Visualization in R

Types of graphs for data visualization

The type of plot to create depends on the format of your data. The ggplot2 package provides methods for visualizing the following data structures:

  1. One variable - x: continuous or discrete
  2. Two variables - x & y: continuous and/or discrete
  3. Continuous bivariate distribution - x & y (both continuous)
  4. Continuous function
  5. Error bar
  6. Maps
  7. Three variables

In the current document we’ll provide the essential ggplot2 functions for drawing each of these seven data formats.

How this document is organized?

Install and load ggplot2 package

Use the R code below:

# Installation
install.packages('ggplot2')
# Loading
library(ggplot2)

Data format and preparation

The data should be a data.frame (columns are variables and rows are observations).

The data set mtcars is used in the examples below:

# Load the data
data(mtcars)
df <- mtcars[, c("mpg", "cyl", "wt")]

# Convert cyl to a factor variable
df$cyl <- as.factor(df$cyl)
head(df)
##                    mpg cyl    wt
## Mazda RX4         21.0   6 2.620
## Mazda RX4 Wag     21.0   6 2.875
## Datsun 710        22.8   4 2.320
## Hornet 4 Drive    21.4   6 3.215
## Hornet Sportabout 18.7   8 3.440
## Valiant           18.1   6 3.460


mtcars : Motor Trend Car Road Tests.

Description: The data comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973 - 74 models).

Format: A data frame with 32 observations on 3 variables.

  • [, 1] mpg Miles/(US) gallon
  • [, 2] cyl Number of cylinders
  • [, 3] wt Weight (lb/1000)


qplot(): Quick plot with ggplot2

The qplot() function is very similar to the standard R plot() function. It can be used to quickly create different types of graphs: scatter plots, bar plots, box plots, violin plots, histograms and density plots.

A simplified format of qplot() is :

qplot(x, y = NULL, data, geom="auto")

  • x, y : x and y values, respectively. The argument y is optional depending on the type of graphs to be created.
  • data : data frame to use (optional).
  • geom : Character vector specifying geom to use. Defaults to “point” if x and y are specified, and “histogram” if only x is specified.


Read more about qplot(): Quick plot with ggplot2.

Scatter plots

The R code below creates basic scatter plots using the argument geom = “point”. To add a regression line to the scatter plot, the argument method = “lm” is used in combination with geom = c(“point”, “smooth”). Note that lm stands for linear model.

# Basic scatter plot
qplot(x = mpg, y = wt, data = df, geom = "point")

# Change main title and axis labels
# (x is mpg and y is wt, so the labels follow that order)
qplot(x = mpg, y = wt, data = df, geom = "point",
      xlab = "Miles/(US) gallon", ylab = "Weight (lb/1000)",
      main = "Plot of wt by mpg")

# Combine point and line
qplot(x = mpg, y = wt, data = df,
      geom = c("line", "point"))

# Scatter plot with regression line
qplot(mpg, wt, data = df, 
      geom = c("point", "smooth"), method="lm")

The following R code will change the color and the shape of points by groups. The column cyl will be used as grouping variable. In other words, the color and the shape of points will be changed by the levels of cyl.

qplot(mpg, wt, data = df, colour = cyl, shape = cyl)

Bar plot

It’s possible to draw a bar plot using the arguments geom = “bar” and stat = “identity”. We’ll create a bar plot of the mpg variable. We start by creating a new variable named index, which holds the position of each mpg value.

# y represents values in the data
index <- 1:nrow(mtcars)
qplot(index, mpg, data = df, geom = "bar", stat = "identity")

# Change fill color by groups (cyl)
# Order the data by cyl and then by mpg values
df <- df[order(df$cyl, df$mpg), ]
qplot(index, mpg, data = df, geom = "bar", stat = "identity",
      fill = cyl)

Box plot, violin plot and dot plot

The R code below generates some data containing the weights by sex (M for male; F for female):

set.seed(1234)
wdata = data.frame(
        sex = factor(rep(c("F", "M"), each=200)),
        weight = c(rnorm(200, 55), rnorm(200, 58)))
head(wdata)
##   sex   weight
## 1   F 53.79293
## 2   F 55.27743
## 3   F 56.08444
## 4   F 52.65430
## 5   F 55.42912
## 6   F 55.50606
# Basic box plot from data frame
qplot(sex, weight, data = wdata, 
      geom= "boxplot", fill = sex)

# Violin plot
qplot(sex, weight, data = wdata, geom = "violin")

# Dot plot
qplot(sex, weight, data = wdata, geom = "dotplot",
      stackdir = "center", binaxis = "y", dotsize = 0.5)

Histogram and density plots

The histogram and density plots are used to display the distribution of data.

# Histogram  plot
# Change histogram fill color by group (sex)
qplot(weight, data = wdata, geom = "histogram",
    fill = sex, position = "dodge")

# Density plot
# Change density plot line color by group (sex)
# change line type
qplot(weight, data = wdata, geom = "density",
    color = sex, linetype = sex)

ggplot(): build plots piece by piece

As mentioned above, there are two main functions in ggplot2 package for generating graphics:

  • The quick and easy-to-use function: qplot()
  • The more powerful and flexible function to build plots piece by piece: ggplot()

This section briefly describes how to use the function ggplot(). Recall that the ggplot2 concept divides a plot into three fundamental parts: plot = data + Aesthetics + geometry.

  • data: a data frame.
  • Aesthetics: used to specify x and y variables, color, size, shape, ….
  • Geometry: the type of plots (histogram, boxplot, line, density, dotplot, bar, …)

To demonstrate how the function ggplot() works, we’ll draw a scatter plot. The function aes() is used to specify aesthetics. An alternative is the function aes_string(), which generates mappings from strings. aes_string() is particularly useful when writing functions that create plots, because the aesthetic mappings can be defined with strings rather than by using substitute() to generate a call to aes().

# Basic scatter plot
ggplot(data = mtcars, aes(x = wt, y = mpg)) + 
  geom_point()

# Change the point size, and shape
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(size = 2, shape = 23)

The function aes_string() can be used as follows:

ggplot(mtcars, aes_string(x = "wt", y = "mpg")) +
  geom_point(size = 2, shape = 23)
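
To illustrate the point about writing functions, here is a minimal sketch of a plotting helper that takes column names as strings. The function name scatter_by() is hypothetical. Note that recent ggplot2 releases deprecate aes_string() in favor of the .data pronoun inside aes(), so this call may emit a deprecation warning there.

```r
library(ggplot2)

# Hypothetical helper: build a scatter plot from column names given as strings
scatter_by <- function(data, xvar, yvar) {
  ggplot(data, aes_string(x = xvar, y = yvar)) +
    geom_point(size = 2, shape = 23)
}

# The same helper works for any pair of columns
p1 <- scatter_by(mtcars, "wt", "mpg")
p2 <- scatter_by(mtcars, "hp", "qsec")
print(p1)
```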

Note that some plots visualize a transformation of the original data set. In this case, an alternative way to build a layer is to use the stat_*() functions.

In the following example, the function geom_density() does the same as the function stat_density():

# Use geometry function
ggplot(wdata, aes(x = weight)) + geom_density()

# OR use stat function
ggplot(wdata, aes(x = weight)) + stat_density()

For each plot type, we’ll provide the geom_*() function and the corresponding stat_*() function (if available).

One variable: Continuous

We’ll use weight data (wdata), generated in the previous sections.

head(wdata)
##   sex   weight
## 1   F 53.79293
## 2   F 55.27743
## 3   F 56.08444
## 4   F 52.65430
## 5   F 55.42912
## 6   F 55.50606

The following R code computes the mean value by sex:

library(plyr)
mu <- ddply(wdata, "sex", summarise, grp.mean=mean(weight))
head(mu)
##   sex grp.mean
## 1   F 54.94224
## 2   M 58.07325
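
If plyr is not available, the same group means can be computed in base R. The sketch below regenerates wdata as in the earlier section and uses aggregate(); renaming the result column to grp.mean is done only to match the ddply() output.

```r
set.seed(1234)
wdata <- data.frame(
  sex = factor(rep(c("F", "M"), each = 200)),
  weight = c(rnorm(200, 55), rnorm(200, 58))
)

# Mean weight by sex using base R
mu <- aggregate(weight ~ sex, data = wdata, FUN = mean)
names(mu)[names(mu) == "weight"] <- "grp.mean"
mu
```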

We start by creating a plot, named a, that we’ll finish in the next section by adding a layer.

a <- ggplot(wdata, aes(x = weight))

Possible layers are:

  • For one continuous variable:
    • geom_area() for area plot
    • geom_density() for density plot
    • geom_dotplot() for dot plot
    • geom_freqpoly() for frequency polygon
    • geom_histogram() for histogram plot
    • stat_ecdf() for empirical cumulative distribution function
    • stat_qq() for quantile - quantile plot
  • For one discrete variable:
    • geom_bar() for bar plot


geom_area(): Create an area plot

# Basic plot
a + geom_area(stat = "bin")

# change fill colors by sex
a + geom_area(aes(fill = sex), stat ="bin", alpha=0.6) +
  theme_classic()

Note that, by default, the y axis corresponds to the count of weight values. To show the density on the y axis instead, use the following R code:

a + geom_area(aes(y = ..density..), stat ="bin")
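
The difference between the two y scales can be checked with base R's hist(), used here only as a stand-in for the binning step (ggplot2 chooses its own default bins). Counts are converted to densities by dividing by the number of observations times the bin width, so that the bar areas sum to 1:

```r
set.seed(1234)
x <- c(rnorm(200, 55), rnorm(200, 58))

# Bin the data without drawing anything
h <- hist(x, plot = FALSE)

# density = count / (n * bin width)
bin_width <- diff(h$breaks)
dens <- h$counts / (length(x) * bin_width)

all.equal(dens, h$density)   # same values as hist() reports
sum(dens * bin_width)        # total area under the histogram
```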

To customize the plot, the following arguments can be used: alpha, color, fill, linetype, size. Learn more here: ggplot2 area plot.

  • Key function: geom_area()
  • Alternative function: stat_bin()
a + stat_bin(geom = "area")

geom_density(): Create a smooth density estimate

We’ll use the following functions:

  • geom_density() to create a density plot
  • geom_vline() to add vertical lines corresponding to group mean values
  • scale_color_manual() to change the color manually by groups
# Basic plot
a + geom_density()

# change line colors by sex
a + geom_density(aes(color = sex)) 

# Change fill color by sex
# Use semi-transparent fill: alpha = 0.4
a + geom_density(aes(fill = sex), alpha=0.4)
   
# Add mean line and Change color manually
a + geom_density(aes(color = sex)) +
  geom_vline(data=mu, aes(xintercept=grp.mean, color=sex),
             linetype="dashed") +
  scale_color_manual(values=c("#999999", "#E69F00"))

To customize the plot, the following arguments can be used: alpha, color, fill, linetype, size. Learn more here: ggplot2 density plot.

  • Key function: geom_density()
  • Alternative function: stat_density()
a + stat_density()

geom_dotplot(): Dot plot

In a dot plot, dots are stacked with each dot representing one observation.

# Basic plot
a + geom_dotplot()

# change fill and color by sex
a + geom_dotplot(aes(fill = sex)) 

# Change fill color manually 
a + geom_dotplot(aes(fill = sex)) +
  scale_fill_manual(values=c("#999999", "#E69F00"))

To customize the plot, the following arguments can be used: alpha, color, fill and dotsize. Learn more here: ggplot2 dot plot.

  • Key functions: geom_dotplot()
  • Alternative function: stat_bindot()
a + stat_bindot()

geom_freqpoly(): Frequency polygon

# Basic plot
a + geom_freqpoly() 

# change y axis to density value
# and change theme
a + geom_freqpoly(aes(y = ..density..)) +
  theme_minimal()

# change color and linetype by sex
a + geom_freqpoly(aes(color = sex, linetype = sex)) +
  theme_minimal()

To customize the plot, the following arguments can be used: alpha, color, linetype and size.

  • Key function: geom_freqpoly()
  • Alternative function: stat_bin()
a + stat_bin(geom = "freqpoly")

geom_histogram(): Histogram

# Basic plot
a + geom_histogram()

# change line colors by sex
a + geom_histogram(aes(color = sex), fill = "white",
                   position = "dodge") 

To show the density on the y axis instead of the count, use the following R code:

a + geom_histogram(aes(y = ..density..))

To customize the plot, the following arguments can be used: alpha, color, fill, linetype and size. Learn more here: ggplot2 histogram plot.

  • Key functions: geom_histogram()
  • Position adjustments: “identity” (or position_identity()), “stack” (or position_stack()), “dodge” (or position_dodge()). The default value is “stack”.
  • Alternative function: stat_bin()
a + stat_bin(geom = "histogram")

stat_ecdf(): Empirical Cumulative Distribution Function

a + stat_ecdf()
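
The curve drawn by stat_ecdf() is the empirical cumulative distribution function: for each value x, the proportion of observations less than or equal to x. A short base R sketch with ecdf(), on weights simulated as in the earlier section, shows the idea. The names w and Fn are illustrative:

```r
set.seed(1234)
w <- c(rnorm(200, 55), rnorm(200, 58))

Fn <- ecdf(w)       # returns a step function

Fn(median(w))       # about half of the observations lie at or below the median
Fn(max(w))          # all observations lie at or below the maximum

# Fn(56) is the proportion of observations <= 56
Fn(56)
mean(w <= 56)
```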

To customize the plot, the following arguments can be used: alpha, color, linetype and size. Learn more here: ggplot2 ECDF.

  • Key function: stat_ecdf()

stat_qq(): quantile - quantile plot

ggplot(mtcars, aes(sample=mpg)) + stat_qq()
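
A quantile-quantile plot pairs the sorted sample values with the quantiles expected under a theoretical distribution, the normal distribution by default. As a sketch, base R's qqnorm() returns those pairs without plotting; the name q is illustrative:

```r
# Theoretical (x) and sample (y) quantiles for mpg
q <- qqnorm(mtcars$mpg, plot.it = FALSE)
head(data.frame(theoretical = q$x, sample = q$y))

# Points close to a straight line suggest approximate normality
cor(q$x, q$y)
```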

To customize the plot, the following arguments can be used: alpha, color, shape and size. Learn more here: ggplot2 quantile - quantile plot.

  • Key function: stat_qq()

One variable: Discrete

The function geom_bar() can be used to visualize one discrete variable. In this case, the count of each level is plotted. We’ll use the mpg data set [in ggplot2 package]. The R code is as follows:

data(mpg)
b <- ggplot(mpg, aes(fl))

# Basic plot
b + geom_bar()

# Change fill color
b + geom_bar(fill = "steelblue", color ="steelblue") +
  theme_minimal()
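
The bar heights are simply the number of rows at each level of fl. Assuming ggplot2 is installed (it provides the mpg data set), the counts can be checked with table(). The name fl_counts is illustrative:

```r
# Number of cars per fuel type; these are the bar heights
fl_counts <- table(ggplot2::mpg$fl)
fl_counts

# The counts add up to the number of rows in the data
sum(fl_counts) == nrow(ggplot2::mpg)
```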

To customize the plot, the following arguments can be used: alpha, color, fill, linetype and size. Learn more here: ggplot2 bar plot.

  • Key function: geom_bar()
  • Alternative function: stat_bin()
b + stat_bin()

Two variables: Continuous X, Continuous Y

We’ll use the mtcars data set. The variable cyl is used as grouping variable.

data(mtcars)
mtcars$cyl <- as.factor(mtcars$cyl)
head(mtcars[, c("wt", "mpg", "cyl")])
##                      wt  mpg cyl
## Mazda RX4         2.620 21.0   6
## Mazda RX4 Wag     2.875 21.0   6
## Datsun 710        2.320 22.8   4
## Hornet 4 Drive    3.215 21.4   6
## Hornet Sportabout 3.440 18.7   8
## Valiant           3.460 18.1   6

We start by creating a plot, named b, that we’ll finish in the next section by adding a layer.

b <- ggplot(mtcars, aes(x = wt, y = mpg))

Possible layers include:

  • geom_point() for scatter plot
  • geom_smooth() for adding smoothed line such as regression line
  • geom_quantile() for adding quantile lines
  • geom_rug() for adding a marginal rug
  • geom_jitter() for avoiding overplotting
  • geom_text() for adding textual annotations


geom_point(): Scatter plot

# Basic plot
b + geom_point()
# Change the color and the shape of points
# by the levels of the cyl variable
b + geom_point(aes(color = cyl, shape = cyl)) 

# Change color manually
b + geom_point(aes(color = cyl, shape = cyl)) +
  scale_color_manual(values = c("#999999", "#E69F00", "#56B4E9"))+
  theme_minimal()

To customize the plot, the following arguments can be used: alpha, color, fill, shape and size. Learn more here: ggplot2 scatter plot.

  • key function: geom_point()

geom_smooth(): Add regression line or smoothed conditional mean

To add a regression line on a scatter plot, the function geom_smooth() is used in combination with the argument method = lm. lm stands for linear model.

# Regression line only
b + geom_smooth(method = lm)
# Point + regression line
# Remove the confidence interval 
b + geom_point() + 
  geom_smooth(method = lm, se = FALSE)

# loess method: local regression fitting
b + geom_point() + geom_smooth()

# Change color and shape by groups (cyl)
b + geom_point(aes(color=cyl, shape=cyl)) + 
  geom_smooth(aes(color=cyl, shape=cyl), 
              method=lm, se=FALSE, fullrange=TRUE)
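
The straight line drawn with method = lm is the least-squares fit of y on x. As a sketch, the same intercept and slope can be obtained directly with lm() in base R; the object name fit is illustrative:

```r
# Least-squares fit of mpg on wt, the line drawn by geom_smooth(method = lm)
fit <- lm(mpg ~ wt, data = mtcars)
coef(fit)

# Separate fits per cylinder group, as in the grouped plot
sapply(split(mtcars, mtcars$cyl),
       function(d) coef(lm(mpg ~ wt, data = d)))
```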

To customize the plot, the following arguments can be used: alpha, color, fill, shape, linetype and size. Learn more here: ggplot2 scatter plot

  • key function: geom_smooth()
  • Alternative function: stat_smooth()
b + stat_smooth(method = "lm")

geom_quantile(): Add quantile lines from a quantile regression

Quantile lines can be used as a continuous analogue of geom_boxplot().

We’ll use the movies data set [in ggplot2]:

# Subset a sample
set.seed(1234)
msamp <- movies[sample(nrow(movies), 1000), c("year", "rating") ]
head(msamp)
##       year rating
## 6685  1996    5.0
## 36584 1989    3.4
## 35817 1994    8.1
## 36646 1989    3.5
## 50609 1987    4.3
## 37640 2001    5.4

The function geom_quantile() can be used for adding quantile lines:

ggplot(msamp, aes(year, rating)) +
  geom_point() + geom_quantile() +
  theme_minimal()

An alternative to geom_quantile() is the function stat_quantile():

ggplot(msamp, aes(year, rating)) +
  geom_point() + stat_quantile(quantiles = c(0.25, 0.5, 0.75))

To customize the plot, the following arguments can be used: alpha, color, linetype and size. Learn more here: Continuous quantiles

  • Key function: geom_quantile()
  • Alternative function: stat_quantile()

geom_rug(): Add marginal rug to scatter plots

We’ll use the faithful data set.

# Add marginal rugs using faithful data
ggplot(faithful, aes(x=eruptions, y=waiting)) +
  geom_point() + geom_rug()

To customize the plot, the following arguments can be used: alpha, color, linetype and size. Learn more here: ggplot2 scatter plot

  • key function: geom_rug()

geom_jitter(): Jitter points to reduce overplotting

The function geom_jitter() is a convenient shortcut for geom_point(position = "jitter"). The mpg data set [in ggplot2] is used in the following examples.

p <- ggplot(mpg, aes(displ, hwy))
# Default scatter plot
p + geom_point()

# Use jitter to reduce overplotting
p + geom_jitter(
    position = position_jitter(width = 0.5, height = 0.5))

To adjust the extent of jittering, the function position_jitter() is used with the arguments width and height:

  • width: degree of jitter in x direction.
  • height: degree of jitter in y direction.
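
Conceptually, jittering adds a small amount of random noise to each coordinate. The base R sketch below mimics that idea with uniform noise of at most 0.5 in either direction. It illustrates the principle under that assumption and is not ggplot2's exact implementation; the names x, width and x_jit are illustrative.

```r
set.seed(1234)
x <- c(1, 1, 1, 2, 2, 3)     # overlapping values

# Shift every point by a random amount in [-0.5, 0.5]
width <- 0.5
x_jit <- x + runif(length(x), min = -width, max = width)

cbind(original = x, jittered = round(x_jit, 2))
```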

To customize the plot, the following arguments can be used: alpha, color, fill, shape and size. Learn more here: ggplot2 jitter

  • Key functions: geom_jitter(), position_jitter()

geom_text(): Textual annotations

The argument label is used to specify a vector of labels for point annotations.

b + geom_text(aes(label = rownames(mtcars)))

To customize the plot, the following arguments can be used: label, alpha, angle, color, family, fontface, hjust, lineheight, size, and vjust. Learn more here: ggplot2 add textual annotations

  • key function: geom_text(), annotation_custom()

Two variables: Continuous bivariate distribution

We start by using the diamonds data set [in ggplot2].

data(diamonds)
head(diamonds[, c("carat", "price")])
##   carat price
## 1  0.23   326
## 2  0.21   326
## 3  0.23   327
## 4  0.29   334
## 5  0.31   335
## 6  0.24   336

We start by creating a plot, named c, that we’ll finish in the next section by adding a layer.

c <- ggplot(diamonds, aes(carat, price))

Possible layers include:

  • geom_bin2d() for adding a heatmap of 2d bin counts (rectangular binning)
  • geom_hex() for adding hexagon binning. The R package hexbin is required for this functionality
  • geom_density2d() for adding contours from a 2d density estimate


geom_bin2d(): Add heatmap of 2d bin counts

The function geom_bin2d() produces a scatter plot with rectangular bins. The number of observations in each bin is counted and displayed as a heatmap.

# Default plot 
c + geom_bin2d()

# Change the number of bins
c + geom_bin2d(bins = 15)
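
The counting behind a rectangular-bin heatmap can be sketched in base R with cut() and table(). The example uses the built-in faithful data rather than diamonds so that it runs without extra packages; the bin edges chosen by cut() are not necessarily those ggplot2 would use.

```r
# Split each variable into 15 equal-width intervals
xb <- cut(faithful$eruptions, breaks = 15)
yb <- cut(faithful$waiting, breaks = 15)

# 15 x 15 table of counts: one cell per rectangular bin
counts <- table(xb, yb)
dim(counts)
sum(counts)   # every observation falls in exactly one bin
```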

To customize the plot, the following arguments can be used: xmax, xmin, ymax, ymin, alpha, color, fill, linetype and size. Learn more here: ggplot2 Scatter plots with rectangular bins

  • Key functions: geom_bin2d()
  • Alternative functions: stat_bin2d(), stat_summary2d()
c + stat_bin2d()

c + stat_summary2d(aes(z = depth))

geom_hex(): Add hexagon binning

The function geom_hex() produces a scatter plot with hexagon binning. The hexbin R package is required for hexagon binning. If you don’t have it, use the R code below to install it:

install.packages("hexbin")

The function geom_hex() can be used as follow:

require(hexbin)
# Default plot 
c + geom_hex()

# Change the number of bins
c + geom_hex(bins = 10)

To customize the plot, the following arguments can be used: alpha, color, fill and size. Learn more here: ggplot2 Scatter plots with rectangular bins

  • Key function: geom_hex()
  • Alternative functions: stat_binhex(), stat_summary_hex()
c + stat_binhex()

c + stat_summary_hex(aes(z = depth))

geom_density2d(): Add contours from a 2d density estimate

The functions geom_density2d() or stat_density2d() can be used to add 2d density estimate to a scatter plot.

The faithful data set is used in this section. We start by creating a scatter plot (sp) as follows:

# Scatter plot 
sp <- ggplot(faithful, aes(x=eruptions, y=waiting)) 
# Default plot
sp + geom_density2d()

# Add points
sp + geom_point() + geom_density2d()

# Use stat_density2d with geom = "polygon"
sp + geom_point() + 
  stat_density2d(aes(fill = ..level..), geom="polygon")
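
The contours come from a two-dimensional kernel density estimate; the ggplot2 documentation states that stat_density2d() uses MASS::kde2d() for this. As a sketch, the estimate can be computed directly (MASS ships with standard R installations; the grid size n = 50 and the name dens are arbitrary choices here):

```r
library(MASS)

# 2d kernel density estimate on a 50 x 50 grid
dens <- kde2d(faithful$eruptions, faithful$waiting, n = 50)

str(dens)          # x and y grid coordinates plus the density matrix z
range(dens$z)      # contour levels are drawn within this range
```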

To customize the plot, the following arguments can be used: alpha, color, linetype and size. Learn more here: ggplot2 Scatter plots with the 2d density estimation

  • Key function: geom_density2d()
  • Alternative functions: stat_density2d()
sp + stat_density2d()
  • See also: stat_contour(), geom_contour()

Two variables: Continuous function

In this section, we’ll see how to connect observations with lines. The economics data set [in ggplot2] is used.

data(economics)
head(economics)
##         date   pce    pop psavert uempmed unemploy
## 1 1967-06-30 507.8 198712     9.8     4.5     2944
## 2 1967-07-31 510.9 198911     9.8     4.7     2945
## 3 1967-08-31 516.7 199113     9.0     4.6     2958
## 4 1967-09-30 513.3 199311     9.8     4.9     3143
## 5 1967-10-31 518.5 199498     9.7     4.7     3066
## 6 1967-11-30 526.2 199657     9.4     4.8     3018

We start by creating a plot, named d, that we’ll finish in the next section by adding a layer.

d <- ggplot(economics, aes(x = date, y = unemploy))

Possible layers include:

  • geom_area() for area plot
  • geom_line() for line plot connecting observations, ordered by x
  • geom_step() for connecting observations by stairs


# Area plot
d + geom_area()

# Line plot: connecting observations, ordered by x
d + geom_line()

# Connecting observations by stairs
# a subset of economics data set is used
set.seed(1234)
ss <- economics[sample(1:nrow(economics), 20), ]
ggplot(ss, aes(x = date, y = unemploy)) + 
  geom_step()


To customize the plot, the following arguments can be used: alpha, color, linetype, size and fill (for geom_area only). Learn more here: ggplot2 line plot.


  • Key functions: geom_area(), geom_line(), geom_step()

Two variables: Discrete X, Continuous Y

The ToothGrowth data set will be used to plot the continuous variable len (tooth length) by the discrete variable dose. The following R code converts the variable dose from a numeric to a discrete factor variable.

data("ToothGrowth")
ToothGrowth$dose <- as.factor(ToothGrowth$dose)
head(ToothGrowth)
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5

We start by creating a plot, named e, that we’ll finish in the next section by adding a layer.

e <- ggplot(ToothGrowth, aes(x = dose, y = len))

Possible layers include:

  • geom_boxplot() for box plot
  • geom_violin() for violin plot
  • geom_dotplot() for dot plot
  • geom_jitter() for stripchart
  • geom_line() for line plot
  • geom_bar() for bar plot



geom_boxplot(): Box and whiskers plot

# Default plot
e + geom_boxplot()

# Notched box plot
e + geom_boxplot(notch = TRUE)

# Color by group (dose)
e + geom_boxplot(aes(color = dose))

# Change fill color by group (dose)
e + geom_boxplot(aes(fill = dose))


# Box plot with multiple groups
ggplot(ToothGrowth, aes(x=dose, y=len, fill=supp)) +
  geom_boxplot()


To customize the plot, the following arguments can be used: alpha, color, linetype, shape, size and fill. Learn more here: ggplot2 box plot.


  • Key function: geom_boxplot()
  • Alternative functions: stat_boxplot()
e + stat_boxplot(coef = 1.5)

geom_violin(): Violin plot

Violin plots are similar to box plots, except that they also show the kernel probability density of the data at different values.

# Default plot
e + geom_violin(trim = FALSE)

# Violin plot with mean points (+/- SD)
# mean_sdl() requires the Hmisc package; mult is passed through fun.args
e + geom_violin(trim = FALSE) + 
  stat_summary(fun.data="mean_sdl", fun.args = list(mult = 1), 
               geom="pointrange", color = "red")

# Combine with box plot
e + geom_violin(trim = FALSE) + 
  geom_boxplot(width = 0.2)

# Color by group (dose) 
e + geom_violin(aes(color = dose), trim = FALSE)


To customize the plot, the following arguments can be used: alpha, color, linetype, size and fill. Learn more here: ggplot2 violin plot.


  • Key functions: geom_violin()
  • Alternative functions: stat_ydensity()
e + stat_ydensity(trim = FALSE)

geom_dotplot(): Dot plot

# Default plot
e + geom_dotplot(binaxis = "y", stackdir = "center")

# Dot plot with mean points (+/- SD)
e + geom_dotplot(binaxis = "y", stackdir = "center") + 
  stat_summary(fun.data="mean_sdl", fun.args = list(mult = 1), 
               geom="pointrange", color = "red")


# Combine with box plot
e + geom_boxplot() + 
  geom_dotplot(binaxis = "y", stackdir = "center") 


# Add violin plot
e + geom_violin(trim = FALSE) +
  geom_dotplot(binaxis='y', stackdir='center')

# Color and fill by group (dose) 
e + geom_dotplot(aes(color = dose, fill = dose), 
                 binaxis = "y", stackdir = "center")


To customize the plot, the following arguments can be used: alpha, color, dotsize and fill. Learn more here: ggplot2 dot plot.


  • Key functions: geom_dotplot(), stat_summary()

geom_jitter(): Strip charts

Strip charts are also known as one-dimensional scatter plots. They are a suitable alternative to box plots when sample sizes are small.

# Default plot
e + geom_jitter(position=position_jitter(0.2))

# Strip charts with mean points (+/- SD)
e + geom_jitter(position=position_jitter(0.2)) + 
  stat_summary(fun.data="mean_sdl", fun.args = list(mult = 1), 
               geom="pointrange", color = "red")


# Combine with box plot
e + geom_boxplot() + 
  geom_jitter(position=position_jitter(0.2))


# Add violin plot
e + geom_violin(trim = FALSE) +
  geom_jitter(position=position_jitter(0.2))
  

# Change color and shape by group (dose) 
e +  geom_jitter(aes(color = dose, shape = dose),
                 position=position_jitter(0.2))


To customize the plot, the following arguments can be used: alpha, color, shape, size and fill. Learn more here: ggplot2 strip charts.


  • Key functions: geom_jitter(), stat_summary()

geom_line(): Line plot

Data derived from the ToothGrowth data set are used.

df <- data.frame(supp=rep(c("VC", "OJ"), each=3),
                dose=rep(c("D0.5", "D1", "D2"),2),
                len=c(6.8, 15, 33, 4.2, 10, 29.5))

head(df)
##   supp dose  len
## 1   VC D0.5  6.8
## 2   VC   D1 15.0
## 3   VC   D2 33.0
## 4   OJ D0.5  4.2
## 5   OJ   D1 10.0
## 6   OJ   D2 29.5

In the graphs below, line types and point shapes are controlled automatically by the levels of the variable supp :

# Change line types by groups (supp)
ggplot(df, aes(x=dose, y=len, group=supp)) +
  geom_line(aes(linetype=supp))+
  geom_point()

# Change line types, point shapes and colors
ggplot(df, aes(x=dose, y=len, group=supp)) +
  geom_line(aes(linetype=supp, color = supp))+
  geom_point(aes(shape=supp, color = supp))


To customize the plot, the following arguments can be used: alpha, color, linetype and size. Learn more here: ggplot2 line plot.


  • Key functions: geom_line(), geom_step()

geom_bar(): Bar plot

Data derived from the ToothGrowth data set are used.

df <- data.frame(dose=c("D0.5", "D1", "D2"),
                len=c(4.2, 10, 29.5))

head(df)
##   dose  len
## 1 D0.5  4.2
## 2   D1 10.0
## 3   D2 29.5
df2 <- data.frame(supp=rep(c("VC", "OJ"), each=3),
                dose=rep(c("D0.5", "D1", "D2"),2),
                len=c(6.8, 15, 33, 4.2, 10, 29.5))

head(df2)
##   supp dose  len
## 1   VC D0.5  6.8
## 2   VC   D1 15.0
## 3   VC   D2 33.0
## 4   OJ D0.5  4.2
## 5   OJ   D1 10.0
## 6   OJ   D2 29.5

We start by creating a simple bar plot (named f) using the df data set:

f <- ggplot(df, aes(x = dose, y = len))
# Basic bar plot
f + geom_bar(stat = "identity")

# Change fill color and add labels
f + geom_bar(stat="identity", fill="steelblue")+
  geom_text(aes(label=len), vjust=-0.3, size=3.5)+
  theme_minimal()

# Change bar plot line colors by groups
f + geom_bar(aes(color = dose),
             stat="identity", fill="white")

# Change bar plot fill colors by groups
f + geom_bar(aes(fill = dose), stat="identity")
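As a side note, geom_bar(stat = "identity") can also be written with geom_col(), which uses the y values as bar heights by default (ggplot2 2.2.0 or later is assumed). A short sketch with the same values as df; the names bar_df and p_col are only illustrative.

```r
library(ggplot2)

# Same values as the df data set used for the basic bar plot
bar_df <- data.frame(dose = c("D0.5", "D1", "D2"),
                     len = c(4.2, 10, 29.5))

# geom_col() draws bars whose heights are the len values
p_col <- ggplot(bar_df, aes(x = dose, y = len)) +
  geom_col(fill = "steelblue")
p_col
```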


Bar plot with multiple groups:

g <- ggplot(data=df2, aes(x=dose, y=len, fill=supp)) 

# Stacked bar plot
g + geom_bar(stat = "identity")

# Use position=position_dodge()
g + geom_bar(stat="identity", position=position_dodge())


To customize the plot, the following arguments can be used: alpha, color, fill, linetype and size. Learn more here: ggplot2 bar plot.


  • Key function: geom_bar()
  • Alternative function: stat_identity()
g + stat_identity(geom = "bar")

g + stat_identity(geom = "bar", position = "dodge")

Two variables: Discrete X, Discrete Y

The diamonds data set [in ggplot2] will be used to plot the discrete variable color (diamond color) by the discrete variable cut (diamond cut type). The plot is created using the function geom_jitter().
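A minimal sketch of such a plot is given here. The point size and the coloring by cut are illustrative choices, not necessarily the settings behind the original figure, and p_jit is a hypothetical name.

```r
library(ggplot2)

# cut and color are both discrete (ordered factor) variables in diamonds;
# jittering spreads the points so that each cut/color cell becomes visible
p_jit <- ggplot(diamonds, aes(x = cut, y = color)) +
  geom_jitter(aes(color = cut), size = 0.5)
p_jit
```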


To customize the plot, the following arguments can be used: alpha, color, fill, shape and size.

  • Key function: geom_jitter()

Two variables: Visualizing error

The ToothGrowth data set will be used. We start by creating a data set named df which holds ToothGrowth data.

# ToothGrowth data set
df <- ToothGrowth
df$dose <- as.factor(df$dose)
head(df)
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5

The helper function below (data_summary()) will be used to calculate the mean and the standard deviation (used as error), for the variable of interest, in each group. The plyr package is required.

# Calculate the mean and the SD in each group
#+++++++++++++++++++++++++
# data : a data frame
# varname : the name of the variable to be summarized
# grps : column names to be used as grouping variables
data_summary <- function(data, varname, grps){
  require(plyr)
  summary_func <- function(x, col){
    c(mean = mean(x[[col]], na.rm=TRUE),
      sd = sd(x[[col]], na.rm=TRUE))
  }
  data_sum<-ddply(data, grps, .fun=summary_func, varname)
  data_sum <- rename(data_sum, c("mean" = varname))
 return(data_sum)
}
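For readers who prefer to avoid the plyr dependency, the same kind of summary can be sketched with base R's aggregate(). This is an alternative to data_summary() for a single grouping variable, not the helper itself; the names tg, agg and summ are only illustrative.

```r
# Work on a copy of ToothGrowth with dose converted to a factor
tg <- ToothGrowth
tg$dose <- as.factor(tg$dose)

# aggregate() returns the mean and SD as a two-column matrix named len
agg <- aggregate(len ~ dose, data = tg,
                 FUN = function(v) c(mean = mean(v), sd = sd(v)))

# Flatten the matrix column into ordinary columns
summ <- data.frame(dose = agg$dose,
                   len = agg$len[, "mean"],
                   sd = agg$len[, "sd"])
summ
```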

Using the function data_summary(), the following R code creates a data set named df2 which holds the mean and the SD of tooth length (len) by groups (dose).

df2 <- data_summary(df, varname="len", grps= "dose")
# Convert dose to a factor variable
df2$dose=as.factor(df2$dose)
head(df2)
##   dose    len       sd
## 1  0.5 10.605 4.499763
## 2    1 19.735 4.415436
## 3    2 26.100 3.774150

We start by creating a plot, named f, that we’ll finish in the next section by adding a layer.

f <- ggplot(df2, aes(x = dose, y = len, 
                     ymin = len-sd, ymax = len+sd))

Possible layers include:

  • geom_crossbar() for hollow bar with middle indicated by horizontal line
  • geom_errorbar() for error bars
  • geom_errorbarh() for horizontal error bars
  • geom_linerange() for drawing an interval represented by a vertical line
  • geom_pointrange() for creating an interval represented by a vertical line, with a point in the middle.



geom_crossbar(): Hollow bar with middle indicated by horizontal line

We’ll use the data set named df2, which holds the mean and the SD of tooth length (len) by groups (dose).

# Default plot
f + geom_crossbar()

# color by groups
f + geom_crossbar(aes(color = dose))

# Change color manually
f + geom_crossbar(aes(color = dose)) + 
  scale_color_manual(values = c("#999999", "#E69F00", "#56B4E9"))+
  theme_minimal()

# fill by groups and change color manually
f + geom_crossbar(aes(fill = dose)) + 
  scale_fill_manual(values = c("#999999", "#E69F00", "#56B4E9"))+
  theme_minimal()


Cross bar with multiple groups: Using the function data_summary(), we start by creating a data set named df3 which holds the mean and the SD of tooth length (len) by 2 groups (supp and dose).

df3 <- data_summary(df, varname="len", grps= c("supp", "dose"))
head(df3)
##   supp dose   len       sd
## 1   OJ  0.5 13.23 4.459709
## 2   OJ    1 22.70 3.910953
## 3   OJ    2 26.06 2.655058
## 4   VC  0.5  7.98 2.746634
## 5   VC    1 16.77 2.515309
## 6   VC    2 26.14 4.797731

The data set df3 is used to create cross bars with multiple groups. To this end, the variable len is plotted by dose and the color is changed by the levels of the factor supp.

f <- ggplot(df3, aes(x = dose, y = len, 
                     ymin = len-sd, ymax = len+sd))
# Default plot
f + geom_crossbar(aes(color = supp))

# Use position_dodge() to avoid overlap
f + geom_crossbar(aes(color = supp), 
                  position = position_dodge(1))


A simple alternative to geom_crossbar() is to use the function stat_summary() as follows. In this case, the mean and the SD are computed automatically.

f <- ggplot(df, aes(x = dose, y = len, color = supp)) 
# Use geom_crossbar()
f + stat_summary(fun.data="mean_sdl", fun.args = list(mult = 1), 
                 geom="crossbar", width = 0.6, 
                 position = position_dodge(0.8))


To customize the plot, the following arguments can be used: alpha, color, fill, linetype and size. Learn more here: ggplot2 error bars.

  • Key functions: geom_crossbar(), stat_summary()

geom_errorbar(): Error bars

We’ll use the data set named df2, which holds the mean and the SD of tooth length (len) by groups (dose).

We start by creating a plot, named f, that we’ll finish next by adding a layer.

f <- ggplot(df2, aes(x = dose, y = len, 
                     ymin = len-sd, ymax = len+sd))
# Error bars colored by groups
f + geom_errorbar(aes(color = dose), width = 0.2)

# Combine with line plot
f + geom_line(aes(group = 1)) + 
  geom_errorbar(width = 0.2)

# Combine with bar plot, color by groups
f + geom_bar(aes(color = dose), stat = "identity", fill ="white") + 
  geom_errorbar(aes(color = dose), width = 0.2)


Error bars with multiple groups:

The data set df3 is used to create error bars with multiple groups. To this end, the variable len is plotted by dose and the color is changed by the levels of the factor supp.

f <- ggplot(df3, aes(x = dose, y = len, 
                     ymin = len-sd, ymax = len+sd))
# Default plot
f + geom_bar(aes(fill = supp), stat = "identity",
             position = "dodge") + 
  geom_errorbar(aes(color = supp),  position = "dodge")


To customize the plot, the following arguments can be used: alpha, color, linetype, size and width.


  • Key functions: geom_errorbar(), stat_summary()

geom_errorbarh(): Horizontal error bars

We’ll use the data set named df2, which holds the mean and the SD of tooth length (len) by groups (dose):

df2 <- data_summary(ToothGrowth, varname="len", grps = "dose")
head(df2)
##   dose    len       sd
## 1  0.5 10.605 4.499763
## 2    1 19.735 4.415436
## 3    2 26.100 3.774150

We start by creating a plot, named f, that we’ll finish next by adding a layer.

f <- ggplot(df2, aes(x = len, y = dose ,
                     xmin=len-sd, xmax=len+sd))

The arguments xmin and xmax are used for horizontal error bars:
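A minimal sketch of the horizontal version is given here. The summary table is rebuilt with base R so the snippet runs on its own, and the names tg, summ_h, fh and p_h are only illustrative. Note that recent ggplot2 releases may emit a deprecation message for geom_errorbarh().

```r
library(ggplot2)

# Mean and SD of len by dose, computed with base R
tg <- ToothGrowth
tg$dose <- as.factor(tg$dose)
len_mean <- tapply(tg$len, tg$dose, mean)
len_sd <- tapply(tg$len, tg$dose, sd)
summ_h <- data.frame(dose = factor(names(len_mean)),
                     len = as.numeric(len_mean),
                     sd = as.numeric(len_sd))

# xmin and xmax define the horizontal extent of each error bar
fh <- ggplot(summ_h, aes(x = len, y = dose,
                         xmin = len - sd, xmax = len + sd))
p_h <- fh + geom_errorbarh() + geom_point()
p_h
```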


To customize the plot, the following arguments can be used: alpha, color, linetype, size and height.

  • Key functions: geom_errorbarh()

geom_linerange() and geom_pointrange(): An interval represented by a vertical line

  • geom_linerange(): Add an interval represented by a vertical line
  • geom_pointrange(): Add an interval represented by a vertical line with a point in the middle

We’ll use the data set df2.

f <- ggplot(df2, aes(x = dose, y = len,
                     ymin=len-sd, ymax=len+sd))
# Line range
f + geom_linerange()

# Point range
f + geom_pointrange()


To customize the plot, the following arguments can be used: alpha, color, linetype, size, shape and fill (for geom_pointrange()).

Combine geom_dotplot and error bars

It’s also possible to combine geom_dotplot() and error bars. We’ll use the ToothGrowth data set. You don’t need to compute the mean and SD. This can be done automatically by using the function stat_summary() in combination with the argument fun.data = “mean_sdl”.

We start by creating a dot plot, named g, that we’ll finish in the next section by adding error bar layers.

g <- ggplot(df, aes(x=dose, y=len)) + 
  geom_dotplot(binaxis='y', stackdir='center')
# use geom_crossbar()
g + stat_summary(fun.data="mean_sdl", fun.args = list(mult=1), 
                 geom="crossbar", width=0.5)

# Use geom_errorbar()
# (fun replaces the older fun.y argument in ggplot2 >= 3.3.0)
g + stat_summary(fun.data=mean_sdl, fun.args = list(mult=1), 
        geom="errorbar", color="red", width=0.2) +
  stat_summary(fun=mean, geom="point", color="red")

# Use geom_pointrange()
g + stat_summary(fun.data=mean_sdl, fun.args = list(mult=1), 
                 geom="pointrange", color="red")
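The summary function passed to fun.data only has to return a data frame with the columns y, ymin and ymax. The sketch below uses a hand-written helper, here called mean_sd() (a hypothetical name, not part of ggplot2), which computes the mean plus or minus mult standard deviations and so removes the need for the Hmisc package that mean_sdl() relies on.

```r
library(ggplot2)

# Hypothetical helper: mean +/- mult * SD, in the format stat_summary() expects
mean_sd <- function(x, mult = 1) {
  m <- mean(x)
  s <- sd(x)
  data.frame(y = m, ymin = m - mult * s, ymax = m + mult * s)
}

tg <- ToothGrowth
tg$dose <- as.factor(tg$dose)

# Dot plot with mean +/- 1 SD drawn as a point range
p_pr <- ggplot(tg, aes(x = dose, y = len)) +
  geom_dotplot(binaxis = "y", stackdir = "center") +
  stat_summary(fun.data = mean_sd, fun.args = list(mult = 1),
               geom = "pointrange", color = "red")
p_pr
```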


To customize the plot, the following arguments can be used: alpha, color, fill, linetype and size. Learn more here: ggplot2 error bars.

  • Key functions: geom_errorbarh(), geom_errorbar(), geom_linerange(), geom_pointrange(), geom_crossbar(), stat_summary()

Two variables: Maps

The function geom_map() can be used to create a map with ggplot2. The R package maps is required. It contains geographical information useful for easily drawing maps in ggplot2.

Install the maps package (if you don’t have it):

install.packages("maps")

In the following R code, we’ll create a USA map and use the USArrests crime data to shade each region.

# Prepare the data
crimes <- data.frame(state = tolower(rownames(USArrests)), 
                     USArrests)
library(reshape2) # for melt
crimesm <- melt(crimes, id = 1)

# Get map data
# (stored as states_map so that the function map_data() is not masked)
require(maps) 
states_map <- map_data("state")

# Plot the map with Murder data
ggplot(crimes, aes(map_id = state)) + 
  geom_map(aes(fill = Murder), map = states_map) + 
  expand_limits(x = states_map$long, y = states_map$lat)


To customize the plot, the following arguments can be used: alpha, color, fill, linetype and size. Learn more here: ggplot2 map.

  • Key function: geom_map()

Three variables

The mtcars data set will be used. We first compute a correlation matrix, which will be visualized using specific ggplot2 functions.

Prepare the data:

df <- mtcars[, c(1,3,4,5,6,7)]
# Correlation matrix
cormat <- round(cor(df),2)
# Melt the correlation matrix
require(reshape2)
cormat <- melt(cormat)
head(cormat)
##   Var1 Var2 value
## 1  mpg  mpg  1.00
## 2 disp  mpg -0.85
## 3   hp  mpg -0.78
## 4 drat  mpg  0.68
## 5   wt  mpg -0.87
## 6 qsec  mpg  0.42
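If reshape2 is not available, a similar long format can be obtained with base R. The sketch below relies on as.data.frame(as.table()), which is offered as an alternative to melt() and is not the method used in the rest of this section; the names cm and long_cor are only illustrative.

```r
# Correlation matrix of the same six mtcars columns
cm <- round(cor(mtcars[, c(1, 3, 4, 5, 6, 7)]), 2)

# as.table() + as.data.frame() gives one row per matrix cell
long_cor <- as.data.frame(as.table(cm))
names(long_cor) <- c("Var1", "Var2", "value")
head(long_cor)
```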

We start by creating a plot, named g, that we’ll finish in the next section by adding a layer.

g <- ggplot(cormat, aes(x = Var1, y = Var2))

Possible layers include:

  • geom_tile(): Tile plane with rectangles (similar to levelplot and image)
  • geom_raster(): High-performance rectangular tiling. This is a special case of geom_tile where all tiles are the same size.


We’ll use the function geom_tile() to visualize a correlation matrix.

Compute and visualize correlation matrix:

# 1. Compute correlation
cormat <- round(cor(df),2)

# 2. Reorder the correlation matrix by 
# Hierarchical clustering
hc <- hclust(as.dist(1-cormat)/2)
cormat.ord <- cormat[hc$order, hc$order]

# 3. Get the upper triangle
cormat.ord[lower.tri(cormat.ord)]<- NA

# 4. Melt the correlation matrix
require(reshape2)
melted_cormat <- melt(cormat.ord, na.rm = TRUE)

# Create the heatmap
ggplot(melted_cormat, aes(Var2, Var1, fill = value))+
  geom_tile(color = "white")+
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
   midpoint = 0, limit = c(-1,1), space = "Lab",
   name="Pearson\nCorrelation") + # Change gradient color
  theme_minimal()+ # minimal theme
 theme(axis.text.x = element_text(angle = 45, vjust = 1, 
                                  size = 12, hjust = 1))+
 coord_fixed()


To customize the plot, the following arguments can be used: alpha, color, fill, linetype and size. Learn more here: ggplot2 correlation matrix heatmap.

  • Key functions: geom_tile(), geom_raster()

Other types of graphs


Graphical primitives: polygon, path, ribbon, segment, rectangle

This section describes how to add graphical elements to a plot. The functions below will be used:


  • geom_polygon(): Add polygon, a filled path
  • geom_path(): Connect observations in original order
  • geom_ribbon(): Add ribbons, y range with continuous x values.
  • geom_segment(): Add a single line segment
  • geom_rect(): Add a 2d rectangle


  1. The R code below draws France map using geom_polygon():
require(maps)
france = map_data('world', region = 'France')
ggplot(france, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = 'white', colour = 'black')


To customize the plot, the following arguments can be used: alpha, color, fill, linetype and size.

  2. The following R code uses the economics data [in ggplot2] and produces a path, a ribbon and a rectangle.
h <- ggplot(economics, aes(date, unemploy))

# Path
h + geom_path()

# Ribbon
h + geom_ribbon(aes(ymin = unemploy-900, ymax = unemploy+900),
                fill = "steelblue") +
  geom_path(size = 0.8)

# Rectangle
h + geom_rect(aes(xmin = as.Date('1980-01-01'), ymin = -Inf, 
                 xmax = as.Date('1985-01-01'), ymax = Inf),
             fill = "steelblue") +
  geom_path(size = 0.8) 


To customize the plot, the following arguments can be used: alpha, color, fill (for ribbon only), linetype and size.

  3. Add line segments:
# Create a scatter plot
i <- ggplot(mtcars, aes(wt, mpg)) + geom_point()

# Add segment
i + geom_segment(aes(x = 2, y = 15, xend = 3, yend = 15))

# Add arrow
require(grid)
i + geom_segment(aes(x = 5, y = 30, xend = 3.5, yend = 25),
                  arrow = arrow(length = unit(0.5, "cm")))


To customize the plot, the following arguments can be used: alpha, color, linetype and size. Learn more here: ggplot2 add line segment.

  • Key functions: geom_path(), geom_ribbon(), geom_rect(), geom_segment()

Graphical parameters

Main title, axis labels and legend title

We start by creating a box plot using the data set ToothGrowth:

p <- ggplot(ToothGrowth, aes(x=dose, y=len)) + geom_boxplot()

The functions below can be used for changing titles and labels:


  • p + ggtitle("New main title"): Adds a main title above the plot
  • p + xlab("New X axis label"): Changes the X axis label
  • p + ylab("New Y axis label"): Changes the Y axis label
  • p + labs(title = "New main title", x = "New X axis label", y = "New Y axis label"): Changes main title and axis labels


The function labs() can also be used to change the legend title.

  1. Change main title and axis labels
# Default plot
print(p)

# Change title and axis labels
p <- p +labs(title="Plot of length \n by dose",
        x ="Dose (mg)", y = "Teeth length")
p


Note that \n is used to split a long title into multiple lines.

  2. Change the appearance of labels:

To change the appearance (color, size and face) of labels, the functions theme() and element_text() can be used.

The function element_blank() hides the labels.

# Change the appearance of labels
p + theme(
plot.title = element_text(color="red", size=14, face="bold.italic"),
axis.title.x = element_text(color="blue", size=14, face="bold"),
axis.title.y = element_text(color="#993333", size=14, face="bold")
)

# Hide labels
p + theme(plot.title = element_blank(), 
          axis.title.x = element_blank(),
          axis.title.y = element_blank())


  3. Change legend titles: the function labs() (or a scale function for fill, color, size, shape, …) can be used to update legend titles.
# Default plot
p <- ggplot(ToothGrowth, aes(x=dose, y=len, fill=dose))+
  geom_boxplot()
p

# Modify legend titles
p + labs(fill = "Dose (mg)")


Learn more here: ggplot2 title: main, axis and legend titles.

Legend position and appearance

  1. Create a box plot
p <- ggplot(ToothGrowth, aes(x=dose, y=len, fill=dose))+
  geom_boxplot()
  2. Change legend position and appearance
# Change legend position: "left","top", "right", "bottom", "none"
p + theme(legend.position="top")

# Remove legends
p + theme(legend.position = "none")

# Change the appearance of legend title and labels
p + theme(legend.title = element_text(colour="blue"),
          legend.text = element_text(colour="red"))

# Change legend box background color
p + theme(legend.background = element_rect(fill="lightblue"))


  3. Customize legends using scale functions
    • Change the order of items on the x axis: scale_x_discrete()
    • Set legend title and labels: scale_fill_discrete()
# Change the order of items on the x axis
p + scale_x_discrete(limits=c("2", "0.5", "1"))

# Set legend title and labels
p + scale_fill_discrete(name = "Dose", labels = c("A", "B", "C"))
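Note that scale_x_discrete(limits = ...) reorders the groups along the x axis. To reorder the legend keys themselves, the breaks argument of the fill scale can be used, as in the sketch below (tg and p_leg are illustrative names).

```r
library(ggplot2)

tg <- ToothGrowth
tg$dose <- as.factor(tg$dose)

# breaks sets which legend keys are shown and in which order
p_leg <- ggplot(tg, aes(x = dose, y = len, fill = dose)) +
  geom_boxplot() +
  scale_fill_discrete(name = "Dose", breaks = c("2", "0.5", "1"))
p_leg
```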


Learn more here: ggplot2 legend position and appearance.

Change colors automatically and manually

ToothGrowth and mtcars data sets are used in the examples below.

# Convert dose and cyl columns from numeric to factor variables
ToothGrowth$dose <- as.factor(ToothGrowth$dose)
mtcars$cyl <- as.factor(mtcars$cyl)

We start by creating some plots which will be finished hereafter:

# Box plot
bp <- ggplot(ToothGrowth, aes(x=dose, y=len))

# Scatter plot
sp <- ggplot(mtcars, aes(x=wt, y=mpg))
  1. Draw plots: change fill and outline colors
# box plot
bp + geom_boxplot(fill='steelblue', color="red")

# scatter plot
sp + geom_point(color='darkblue')


  2. Change color by groups, using the levels of the dose variable (box plot) and the cyl variable (scatter plot)
# Box plot
bp <- bp + geom_boxplot(aes(fill = dose))
bp

# Scatter plot
sp <- sp + geom_point(aes(color = cyl))
sp


  3. Change colors manually:
  • scale_fill_manual() for box plot, bar plot, violin plot, etc
  • scale_color_manual() for lines and points
# Box plot
bp + scale_fill_manual(values=c("#999999", "#E69F00", "#56B4E9"))

# Scatter plot
sp + scale_color_manual(values=c("#999999", "#E69F00", "#56B4E9"))


  4. Use RColorBrewer palettes (read more about RColorBrewer: color in R):
  • scale_fill_brewer() for box plot, bar plot, violin plot, etc
  • scale_color_brewer() for lines and points
# Box plot
bp + scale_fill_brewer(palette="Dark2")

# Scatter plot
sp + scale_color_brewer(palette="Dark2")


Available color palettes in the RColorBrewer package:

RColorBrewer palettes

  5. Use gray colors:
  • scale_fill_grey() for box plot, bar plot, violin plot, etc
  • scale_colour_grey() for points, lines, etc
# Box plot
bp + scale_fill_grey() + theme_classic()

# Scatter plot
sp + scale_color_grey() + theme_classic()


  6. Gradient or continuous colors:

Plots can be colored according to the values of a continuous variable using the functions :

  • scale_color_gradient(), scale_fill_gradient() for sequential gradients between two colors
  • scale_color_gradient2(), scale_fill_gradient2() for diverging gradients
  • scale_color_gradientn(), scale_fill_gradientn() for gradient between n colors

Gradient colors for scatter plots: The graphs are colored using the qsec continuous variable :

# Color by qsec values
sp2<-ggplot(mtcars, aes(x=wt, y=mpg)) +
  geom_point(aes(color = qsec))
sp2

# Change the low and high colors
# Sequential color scheme
sp2+scale_color_gradient(low="blue", high="red")

# Diverging color scheme
mid<-mean(mtcars$qsec)
sp2+scale_color_gradient2(midpoint=mid, low="blue", mid="white",
                          high="red", space = "Lab" )
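The third family listed above, scale_color_gradientn(), interpolates between any number of colors. A short sketch follows; the rainbow(5) palette is just an illustrative choice, and sp_n and p_n are hypothetical names.

```r
library(ggplot2)

# Scatter plot colored by the continuous variable qsec
sp_n <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(color = qsec))

# Gradient between n colors (here 5 colors from the rainbow palette)
p_n <- sp_n + scale_color_gradientn(colours = rainbow(5))
p_n
```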


Learn more here: ggplot2 colors.

Point shapes, colors and size

The different point shapes commonly used in R are shown in the image below:

r point shape

mtcars data is used in the following examples.

# Convert cyl to a factor variable
mtcars$cyl <- as.factor(mtcars$cyl)

Create a scatter plot and change point shapes, colors and size:

# Basic scatter plot
ggplot(mtcars, aes(x=wt, y=mpg)) +
  geom_point(shape = 18, color = "steelblue", size = 4)

# Change point shapes and colors by groups
ggplot(mtcars, aes(x=wt, y=mpg)) +
  geom_point(aes(shape = cyl, color = cyl))


It’s also possible to manually change the appearance of points:

  • scale_shape_manual() : to change point shapes
  • scale_color_manual() : to change point colors
  • scale_size_manual() : to change the size of points
# Change colors and shapes manually
ggplot(mtcars, aes(x=wt, y=mpg, group=cyl)) +
  geom_point(aes(shape=cyl, color=cyl), size=2)+
  scale_shape_manual(values=c(3, 16, 17))+
  scale_color_manual(values=c('#999999','#E69F00', '#56B4E9'))+
  theme(legend.position="top")


Learn more here: ggplot2 point shapes, colors and size.

Add text annotations to a graph

There are three important functions for adding text to a plot:

  • geom_text(): Textual annotations taken from the data (one label per row)
  • annotate(): A single annotation (text, segment, rectangle, …) placed at given coordinates
  • annotation_custom(): Static annotations that are the same in every panel. These annotations are not affected by the plot scales.

A subset of mtcars data is used:

set.seed(1234)
df <- mtcars[sample(1:nrow(mtcars), 10), ]
df$cyl <- as.factor(df$cyl)

Scatter plots with textual annotations:

# Scatter plot
sp <- ggplot(df, aes(x=wt, y=mpg))+ geom_point() 

# Add text, change colors by groups
sp + geom_text(aes(label = rownames(df), color = cyl),
               size = 3, vjust = -1)

# Add text at a particular coordinate
# annotate() draws the label once; geom_text() would repeat it for every row
sp + annotate("text", x = 3, y = 30, label = "Scatter plot",
              color = "red")
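
The other two functions in the list, annotate() and annotation_custom(), can be sketched as follows. This is an illustration under assumptions, not code from the original article: the object names sp_annotated, corner_label and sp_custom are made up, and textGrob() and gpar() come from the grid package that ships with R.

```r
library(ggplot2)
library(grid)  # provides textGrob() and gpar()

set.seed(1234)
df <- mtcars[sample(1:nrow(mtcars), 10), ]
sp <- ggplot(df, aes(x = wt, y = mpg)) + geom_point()

# annotate(): one label, positioned in data coordinates
sp_annotated <- sp + annotate("text", x = 3, y = 30,
                              label = "Scatter plot", color = "red")

# annotation_custom(): a static grob positioned relative to the panel.
# x and y in textGrob() are fractions of the panel, not data values.
corner_label <- textGrob("Top right", x = 0.95, y = 0.95, hjust = 1,
                         gp = gpar(col = "red", fontsize = 11))
sp_custom <- sp + annotation_custom(corner_label)
```

Because the grob is placed in panel fractions, the label stays in the same corner even when the axis limits change.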


Learn more here: ggplot2 text: Add text annotations to a graph.

Line types

The line types available in R are: “blank”, “solid”, “dashed”, “dotted”, “dotdash”, “longdash”, “twodash”.

Note that line types can also be specified using numbers: 0, 1, 2, 3, 4, 5, 6. 0 is for “blank”, 1 is for “solid”, 2 is for “dashed”, and so on.

A graph of the different line types is shown below :

ggplot2 - R software and data visualization
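
A chart of the seven line types can be drawn with a few lines of R. The code below is a minimal sketch, assuming ggplot2 is installed; the names lt and linetype_chart are illustrative.

```r
library(ggplot2)

# The seven named line types, in the order of their numeric codes 0 to 6
lt <- data.frame(
  code = 0:6,
  name = c("blank", "solid", "dashed", "dotted",
           "dotdash", "longdash", "twodash"),
  stringsAsFactors = FALSE
)

# Label each row as "code: name" and keep code 0 at the top of the y axis
labels <- paste0(lt$code, ": ", lt$name)
lt$label <- factor(labels, levels = rev(labels))

linetype_chart <- ggplot(lt) +
  geom_segment(aes(x = 0, xend = 1, y = label, yend = label,
                   linetype = name)) +
  # scale_linetype_identity() uses the strings in `name` as line types
  scale_linetype_identity() +
  labs(x = NULL, y = NULL)

linetype_chart
```

The row for “blank” is intentionally empty, since that line type draws nothing.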

  1. Basic line plot
# Create some data
df <- data.frame(time=c("Breakfast", "Lunch", "Dinner"),
                bill=c(10, 30, 15))
head(df)
##        time bill
## 1 Breakfast   10
## 2     Lunch   30
## 3    Dinner   15
# Basic line plot with points
# Change the line type
ggplot(data=df, aes(x=time, y=bill, group=1)) +
  geom_line(linetype = "dashed")+
  geom_point()

  2. Line plots with multiple groups
# Create some data
df2 <- data.frame(sex = rep(c("Female", "Male"), each=3),
                  time=c("Breakfast", "Lunch", "Dinner"),
                  bill=c(10, 30, 15, 13, 40, 17) )
head(df2)
##      sex      time bill
## 1 Female Breakfast   10
## 2 Female     Lunch   30
## 3 Female    Dinner   15
## 4   Male Breakfast   13
## 5   Male     Lunch   40
## 6   Male    Dinner   17
# Line plot with multiple groups
# Change line types and colors by groups (sex)
ggplot(df2, aes(x=time, y=bill, group=sex)) +
  geom_line(aes(linetype = sex, color = sex))+
  geom_point(aes(color=sex))+
  theme(legend.position="top")


The functions below can be used to change the appearance of line types manually:

  • scale_linetype_manual() : to change line types
  • scale_color_manual() : to change line colors
  • scale_size_manual() : to change the size of lines
# Change line colors and sizes
ggplot(df2, aes(x=time, y=bill, group=sex)) +
  geom_line(aes(linetype=sex, color=sex, size=sex))+
  geom_point()+
  scale_linetype_manual(values=c("twodash", "dotted"))+
  scale_color_manual(values=c('#999999','#E69F00'))+
  scale_size_manual(values=c(1, 1.5))+
  theme(legend.position="top")


Learn more here: ggplot2 line types.

Themes and background colors

ToothGrowth data is used :

# Convert the column dose from numeric to factor variable
ToothGrowth$dose <- as.factor(ToothGrowth$dose)
  1. Create a box plot
p <- ggplot(ToothGrowth, aes(x=dose, y=len)) + 
  geom_boxplot()
  2. Change plot themes

Several functions are available in the ggplot2 package for quickly changing the theme of a plot:

  • theme_gray(): gray background color and white grid lines
  • theme_bw() : white background and gray grid lines
p + theme_gray(base_size = 14)

p + theme_bw()

  • theme_linedraw(): black lines around the plot
  • theme_light(): light gray lines and axes (directs more attention towards the data)
p + theme_linedraw()

p + theme_light()

  • theme_minimal(): no background annotations
  • theme_classic(): theme with axis lines and no grid lines
p + theme_minimal()

p + theme_classic()


Learn more here: ggplot2 themes and background colors.

Axis limits: Minimum and Maximum values

Create a plot:

p <- ggplot(cars, aes(x = speed, y = dist)) + geom_point()

Different functions are available for setting axis limits:


  1. Without clipping (preferred):
    • p + coord_cartesian(xlim = c(5, 20), ylim = c(0, 50)): Cartesian coordinates, the most common type of coordinate system. Setting limits here zooms the plot (like looking at it through a magnifying glass) without clipping the data.
  2. With clipping (removes unseen data points): observations outside the range are dropped completely and are not passed to any other layers.
    • p + xlim(5, 20) + ylim(0, 50)
    • p + scale_x_continuous(limits = c(5, 20)) + scale_y_continuous(limits = c(0, 50))
  3. Expand the plot limits with data: expand_limits() is a thin wrapper around geom_blank() that makes it easy to add data to a plot.
    • p + expand_limits(x = 0, y = 0): make sure that both axes include the value 0
    • p + expand_limits(x = c(5, 50), y = c(0, 150))


# Default plot
print(p)

# Change axis limits using coord_cartesian()
p + coord_cartesian(xlim =c(5, 20), ylim = c(0, 50))

# Use xlim() and ylim()
p + xlim(5, 20) + ylim(0, 50)

# Expand limits
p + expand_limits(x = c(5, 50), y = c(0, 150))
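
The difference between zooming and clipping only becomes visible when a layer computes a statistic. The sketch below illustrates it with a box plot of the cars data. It assumes a ggplot2 release that exports layer_data(), which is newer than the version used in this article; the object names are illustrative.

```r
library(ggplot2)

# A single box plot of stopping distance from the cars data set
bp <- ggplot(cars, aes(x = "all cars", y = dist)) + geom_boxplot()

# Zoom: all 50 observations still enter the box plot statistics
zoomed <- bp + coord_cartesian(ylim = c(0, 50))

# Clip: distances above 50 are dropped before the statistics are computed
clipped <- bp + ylim(0, 50)

# layer_data() returns the values computed for a layer;
# `middle` is the median line of the box.
median_zoomed <- layer_data(zoomed)$middle
# ggplot2 warns about the removed rows, so the warning is silenced here
median_clipped <- suppressWarnings(layer_data(clipped)$middle)

c(zoomed = median_zoomed, clipped = median_clipped)
```

The zoomed plot keeps the median of the full data, while the clipped plot reports a lower median because the largest distances were removed first. This is why coord_cartesian() is the preferred way to zoom.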


Learn more here: ggplot2 axis limits.

Note that date axis limits can be set using the functions scale_x_date() and scale_y_date(). Read more here: ggplot2 date axis.
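
As a brief illustration of date limits, here is a minimal sketch using the economics data set that ships with ggplot2. The date range and the object names ts_plot and ts_limited are arbitrary choices made for this example.

```r
library(ggplot2)

# economics ships with ggplot2; its `date` column is of class Date
ts_plot <- ggplot(economics, aes(x = date, y = unemploy)) + geom_line()

# The limits of a date axis must themselves be Date objects
ts_limited <- ts_plot +
  scale_x_date(limits = as.Date(c("2000-01-01", "2010-01-01")))

ts_limited
```

As with scale_x_continuous(), observations outside the limits are dropped, so ggplot2 warns about removed rows when the plot is drawn.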

Axis transformations: log and sqrt scales

  1. Create a scatter plot:
p <- ggplot(cars, aes(x = speed, y = dist)) + geom_point()
  2. ggplot2 functions for continuous axis transformations:

  • p + scale_x_log10(), p + scale_y_log10() : Plot x and y on log10 scale, respectively.

  • p + scale_x_sqrt(), p + scale_y_sqrt() : Plot x and y on square root scale, respectively.

  • p + scale_x_reverse(), p + scale_y_reverse() : Reverse direction of axes

  • p + coord_trans(x = "log10", y = "log10") : transformed Cartesian coordinate system. Possible values for x and y are "log2", "log10", "sqrt", …

  • p + scale_x_continuous(trans = 'log2'), p + scale_y_continuous(trans = 'log2') : another allowed value for the argument trans is 'log10'


  3. The R code below uses the function scale_xx_continuous() to transform axis scales:
# Default scatter plot
print(p)

# Log transformation using scale_xx()
# possible values for trans : 'log2', 'log10','sqrt'
p + scale_x_continuous(trans='log2') +
  scale_y_continuous(trans='log2')

# Format axis tick mark labels
library(scales)
p + scale_y_continuous(trans = log2_trans(),
    breaks = trans_breaks("log2", function(x) 2^x),
    labels = trans_format("log2", math_format(2^.x)))

# Reverse coordinates
p + scale_y_reverse() 
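
Scale transformations and coord_trans() can look alike on screen, but they act at different stages: a scale transforms the data before anything is computed, whereas coord_trans() only changes how the result is drawn. The sketch below makes this visible. It assumes a ggplot2 release that exports layer_data() and accepts the argument name y in coord_trans(); the object names are illustrative.

```r
library(ggplot2)

p <- ggplot(cars, aes(x = speed, y = dist)) + geom_point()

# Scale transformation: the data are transformed before the layer sees them
p_scale <- p + scale_y_log10()

# Coordinate transformation: the data stay as they are, only the drawing changes
p_coord <- p + coord_trans(y = "log10")

# layer_data() returns the values each layer works with
y_scale <- layer_data(p_scale)$y   # log10 of the stopping distances
y_coord <- layer_data(p_coord)$y   # the raw stopping distances

head(cbind(raw = cars$dist, y_scale, y_coord))
```

In practice this matters for layers such as geom_smooth(): with a scale transformation the model is fitted on the transformed values, with coord_trans() it is fitted on the raw values and the fitted line is then bent by the axis.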


Learn more here: ggplot2 axis limits.

Axis ticks: customize tick marks and labels, reorder and select items

  1. Functions for changing the style of axis tick mark labels:

  • element_text(face, color, size, angle): change text style
  • element_blank(): Hide text


  2. Create a box plot:
p <- ggplot(ToothGrowth, aes(x=dose, y=len)) + geom_boxplot()
# print(p)
  3. Change the style and the orientation angle of axis tick labels
# Change the style of axis tick labels
# face can be "plain", "italic", "bold" or "bold.italic"
p + theme(axis.text.x = element_text(face="bold", color="#993333", 
                           size=14, angle=45),
          axis.text.y = element_text(face="bold", color="blue", 
                           size=14, angle=45))


# Remove axis ticks and tick mark labels
p + theme(
  axis.text.x = element_blank(), # Remove x axis tick labels
  axis.text.y = element_blank(), # Remove y axis tick labels
  axis.ticks = element_blank()) # Remove ticks

  4. Customize continuous and discrete axes:
  • Discrete axes
    • scale_x_discrete(name, breaks, labels, limits): for X axis
    • scale_y_discrete(name, breaks, labels, limits): for y axis
  • Continuous axes
    • scale_x_continuous(name, breaks, labels, limits, trans): for X axis
    • scale_y_continuous(name, breaks, labels, limits, trans): for y axis

Briefly, the arguments have the following meanings:


  • name : x or y axis labels
  • breaks : vector specifying which breaks to display
  • labels : labels of axis tick marks
  • limits : vector indicating the data range
(Read more here: Set axis ticks for discrete and continuous axes)


scale_xx() functions can be used to change the following x or y axis parameters :

  • axis titles
  • axis limits (data range to display)
  • choose where tick marks appear
  • manually label tick marks

4.1. Discrete axes:

# Change x axis label and the order of items
p + scale_x_discrete(name ="Dose (mg)", 
                    limits=c("2","1","0.5"))

# Change tick mark labels
p + scale_x_discrete(breaks=c("0.5","1","2"),
        labels=c("Dose 0.5", "Dose 1", "Dose 2"))

# Choose which items to display
p + scale_x_discrete(limits=c("0.5", "2"))


4.2. Continuous axes:

# Default scatter plot
# +++++++++++++++++
sp <- ggplot(cars, aes(x = speed, y = dist)) + geom_point()
sp

# Customize the plot
#+++++++++++++++++++++
# 1. Change x axis label and limits
sp <- sp + scale_x_continuous(name="Speed of cars", limits=c(0, 30))
# 2. Change y axis label and limits, and set tick marks:
# a tick mark is shown on every 50
# (all y axis settings go in one call: a second y scale would replace the first)
sp + scale_y_continuous(name="Stopping distance", limits=c(0, 150),
                        breaks=seq(0, 150, 50))

# Format the labels
# +++++++++++++++++
library(scales)
sp + scale_y_continuous(labels = percent) # labels as percents


Learn more here: ggplot2 Axis ticks: tick marks and labels.

Add straight lines to a plot: horizontal, vertical and regression lines

The R functions below can be used:

  • geom_hline(yintercept, linetype, color, size): for horizontal lines
  • geom_vline(xintercept, linetype, color, size): for vertical lines
  • geom_abline(intercept, slope, linetype, color, size): for regression lines
  • geom_segment() to add segments
  1. Create a simple scatter plot
# Simple scatter plot
sp <- ggplot(data=mtcars, aes(x=wt, y=mpg)) + geom_point()
  2. Add straight lines
# Add horizontal line at y = 20; change line type and color
sp + geom_hline(yintercept=20, linetype="dashed", color = "red")

# Add vertical line at x = 3; change line type, color and size
sp + geom_vline(xintercept = 3, color = "blue", size=1.5)

# Add regression line
sp + geom_abline(intercept = 37, slope = -5, color="blue")+
  ggtitle("y = -5X + 37")

# Add horizontal line segment
sp + geom_segment(aes(x = 2, y = 15, xend = 3, yend = 15))
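
The intercept and slope passed to geom_abline() do not have to be typed by hand. As a minimal sketch (the object names fit, coefs, sp_fit and sp_smooth are illustrative), they can be taken from a linear model fitted with lm():

```r
library(ggplot2)

# Fit mpg as a linear function of wt and extract the two coefficients
fit <- lm(mpg ~ wt, data = mtcars)
coefs <- coef(fit)   # named vector: "(Intercept)" and "wt"

sp <- ggplot(data = mtcars, aes(x = wt, y = mpg)) + geom_point()

# Draw the fitted line with geom_abline()
sp_fit <- sp + geom_abline(intercept = coefs[["(Intercept)"]],
                           slope = coefs[["wt"]], color = "blue")

# geom_smooth() fits and draws the same line in one step
sp_smooth <- sp + geom_smooth(method = "lm", se = FALSE)
```

The fitted coefficients are close to the rounded values 37 and -5 used in the example above.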


Learn more here: ggplot2 add straight lines to a plot.

Rotate a plot: flip and reverse

  • coord_flip(): Create horizontal plots
  • scale_x_reverse(), scale_y_reverse(): Reverse the axes
set.seed(1234)
# Basic histogram
hp <- qplot(x=rnorm(200), geom="histogram")
hp

# Horizontal histogram
hp + coord_flip()

# Y axis reversed
hp + scale_y_reverse()


Learn more here: ggplot2 rotate a graph.

Faceting: split a plot into a matrix of panels

Facets divide a plot into subplots based on the values of one or more categorical variables.

There are two main functions for faceting :

  • facet_grid()
  • facet_wrap()

Create a box plot filled by groups:

p <- ggplot(ToothGrowth, aes(x=dose, y=len, group=dose)) + 
  geom_boxplot(aes(fill=dose))
p


The following functions can be used for facets:


  • p + facet_grid(supp ~ .): Facet in vertical direction based on the levels of supp variable.

  • p + facet_grid(. ~ supp): Facet in horizontal direction based on the levels of supp variable.

  • p + facet_grid(dose ~ supp): Facet in horizontal and vertical directions based on two variables: dose and supp.

  • p + facet_wrap(~ supp): Place the facets side by side in a rectangular layout


  1. Facet with one discrete variable: Split by the levels of the group “supp”
# Split in vertical direction
p + facet_grid(supp ~ .)

# Split in horizontal direction
p + facet_grid(. ~ supp)

  2. Facet with two discrete variables: Split by the levels of the groups “dose” and “supp”
# Facet by two variables: dose and supp.
# Rows are dose and columns are supp
p + facet_grid(dose ~ supp)

# Facet by two variables: reverse the order of the 2 variables
# Rows are supp and columns are dose
p + facet_grid(supp ~ dose)

By default, all the panels have the same scales (scales = "fixed"). They can be made independent by setting scales to "free", "free_x", or "free_y".

p + facet_grid(dose ~ supp, scales='free')
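
The function facet_wrap() and the facet labels are not shown in the code above. The following is a minimal sketch of both, reusing the box plot from this section; the object names p_wrap and p_labelled are illustrative.

```r
library(ggplot2)

ToothGrowth$dose <- as.factor(ToothGrowth$dose)
p <- ggplot(ToothGrowth, aes(x = dose, y = len, group = dose)) +
  geom_boxplot(aes(fill = dose))

# One panel per level of supp, wrapped into a layout with two columns
p_wrap <- p + facet_wrap(~ supp, ncol = 2)

# label_both shows "variable: value" in the strip of each panel
p_labelled <- p + facet_grid(dose ~ supp, labeller = label_both)
```

facet_wrap() is mostly useful when a single variable has many levels, because the panels flow onto several rows instead of forming one long strip.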

Learn more here: ggplot2 facet : split a plot into a matrix of panels.

Position adjustments

Position adjustments determine how to arrange geoms. The argument position is used to adjust geom positions:

p <- ggplot(mpg, aes(fl, fill = drv))

# Arrange elements side by side
p + geom_bar(position = "dodge")

# Stack objects on top of one another, 
# and normalize to have equal height
p + geom_bar(position = "fill")


# Stack elements on top of one another
p + geom_bar(position = "stack")

# Add random noise to X and Y position 
# of each element to avoid overplotting
ggplot(mpg, aes(cty, hwy)) + 
  geom_point(position = "jitter")


Note that each of these position adjustments can also be applied using a function with manual width and height arguments:

  • position_dodge(width, height)
  • position_fill(width, height)
  • position_stack(width, height)
  • position_jitter(width, height)
p + geom_bar(position = position_dodge(width = 1))
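
The same idea applies to jittering. The sketch below sets the amount of noise explicitly with position_jitter(); the value 0.3 and the object name jittered are arbitrary choices for this example, and the last line assumes a ggplot2 release that exports layer_data().

```r
library(ggplot2)

# Move each point by up to 0.3 units in each direction, on both axes
jittered <- ggplot(mpg, aes(cty, hwy)) +
  geom_point(position = position_jitter(width = 0.3, height = 0.3))

# Largest horizontal displacement actually applied to a point
max_shift <- max(abs(layer_data(jittered)$x - mpg$cty))
```

Smaller values keep the points closer to their true positions, which matters when the variables are integers, as cty and hwy are here.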


Learn more here: ggplot2 bar plots.

Coordinate systems

p <- ggplot(mpg, aes(fl)) + geom_bar()

The coordinate systems in ggplot2 are:


  • p + coord_cartesian(xlim = NULL, ylim = NULL): Cartesian coordinate system (default). It’s the most familiar and most common type of coordinate system.

  • p + coord_fixed(ratio = 1, xlim = NULL, ylim = NULL): Cartesian coordinates with fixed relationship between x and y scales. The ratio represents the number of units on the y-axis equivalent to one unit on the x-axis. The default, ratio = 1, ensures that one unit on the x-axis is the same length as one unit on the y-axis.

  • p + coord_flip(…): Flipped cartesian coordinates. Useful for creating horizontal plot by rotating.

  • p + coord_polar(theta = "x", start = 0, direction = 1): Polar coordinates. The polar coordinate system is most commonly used for pie charts, which are stacked bar charts in polar coordinates.

  • p + coord_trans(xtrans, ytrans, limx, limy): Transformed cartesian coordinate system.

  • coord_map(): Map projections. Provides the full range of map projections available in the mapproj package.


  1. Arguments for coord_cartesian(), coord_fixed() and coord_flip()
    • xlim: limits for the x axis
    • ylim: limits for the y axis
    • ratio: aspect ratio, expressed as y/x
    • …: Other arguments passed onto coord_cartesian
  2. Arguments for coord_polar()
    • theta: variable to map angle to (x or y)
    • start: offset of starting point from 12 o’clock in radians
    • direction: 1, clockwise; -1, anticlockwise
  3. Arguments for coord_trans()
    • xtrans, ytrans: transformers for x and y axes (named x and y in later ggplot2 releases)
    • limx, limy: limits for x and y axes.
p + coord_cartesian(ylim = c(0, 200))

p + coord_fixed(ratio = 1/50)

p + coord_flip()


p + coord_polar(theta = "x", direction = 1)

# y is the argument name in later ggplot2 releases (ytrans in older ones)
p + coord_trans(y = "sqrt")
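
A pie chart, described above as a stacked bar chart in polar coordinates, can be sketched as follows. The object names stacked and pie are illustrative.

```r
library(ggplot2)

# A single stacked bar: one x position, segments filled by fuel type
stacked <- ggplot(mpg, aes(x = factor(1), fill = fl)) +
  geom_bar(width = 1)

# Mapping the angle to y turns the stacked bar into a pie chart
pie <- stacked + coord_polar(theta = "y") +
  labs(x = NULL, y = NULL)

pie
```

With theta = "x" instead, the same bar would become a set of concentric rings rather than pie slices.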


Extensions to ggplot2: R packages and functions

  • factoextra: Extract and visualize the outputs of a multivariate analysis. factoextra provides some easy-to-use functions to extract and visualize the output of PCA (Principal Component Analysis), CA (Correspondence Analysis) and MCA (Multiple Correspondence Analysis) functions from several packages (FactoMineR, stats, ade4 and MASS). It also contains many functions for simplifying clustering analysis workflows. The ggplot2 plotting system is used.

  • easyGgplot2: Create and customize plots easily with ggplot2. The idea behind ggplot2 is seductively simple but the detail is, yes, difficult. To customize a plot, the syntax is sometimes a tiny bit opaque and this raises the level of difficulty. The easyGgplot2 package (which depends on ggplot2) makes it quick to create and customize plots, including box plots, dot plots, strip charts, violin plots, histograms, density plots, scatter plots, bar plots, line plots, etc.

  • ggplot2 - Easy way to mix multiple graphs on the same page: The R package gridExtra and cowplot are used.

  • ggplot2: Correlation matrix heatmap

  • ggfortify: Define fortify and autoplot functions to allow ggplot2 to handle some popular R packages. These include plotting 1) Matrix; 2) Linear Model and Generalized Linear Model; 3) Time Series; 4) PCA/Clustering; 5) Survival Curve; 6) Probability distribution

  • GGally: GGally extends ggplot2 by providing several functions including pairwise correlation matrix, scatter plot matrix, parallel coordinates plot, survival plot and several functions to plot networks.

  • ggRandomForests: Graphical analysis of random forests with the randomForestSRC and ggplot2 packages.

  • ggdendro: Create dendrograms and tree diagrams using ggplot2

  • ggmcmc: Tools for Analyzing MCMC Simulations from Bayesian Inference

  • ggthemes: Package with additional ggplot2 themes and scales

  • Theme used to create journal ready figures easily

Acknowledgment

Infos

This analysis was performed using R (ver. 3.2.1) and ggplot2 (ver 1.0.1).

ggplot2 - Essentials


Introduction

ggplot2 is a powerful and flexible R package, implemented by Hadley Wickham, for producing elegant graphics.

The concept behind ggplot2 divides a plot into three fundamental parts: Plot = Data + Aesthetics + Geometry.

The principal components of every plot can be defined as follows:

  • Data is a data frame
  • Aesthetics indicate the x and y variables. They can also be used to control the color, the size or the shape of points, the height of bars, etc.
  • Geometry defines the type of graphic (histogram, box plot, line plot, density plot, dot plot, etc.)
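
The three parts map directly onto a ggplot() call. The code below is a minimal sketch using the mtcars data set; the object names scatter and boxes are illustrative.

```r
library(ggplot2)

scatter <- ggplot(data = mtcars,            # data: a data frame
                  aes(x = wt, y = mpg)) +   # aesthetics: x and y variables
  geom_point()                              # geometry: points (scatter plot)

# Changing the aesthetics and the geometry gives another type of plot
# from the same data
boxes <- ggplot(data = mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot()
```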

There are two major functions in the ggplot2 package: qplot() and ggplot().

  • qplot() stands for quick plot; it can be used to produce simple plots easily.
  • ggplot() is more flexible and robust than qplot() for building a plot piece by piece.

This document provides R course material for producing different types of plots using ggplot2.

Note that the content provided here is also available as a book: ggplot2: The Elements for Elegant Data Visualization in R

ggplot2 book

Install and load ggplot2 package

# Installation
install.packages('ggplot2')

# Loading
library(ggplot2)

Data format and preparation

The data should be a data.frame (columns are variables and rows are observations).

The data set mtcars is used in the examples below:

# Load the data
data(mtcars)
df <- mtcars[, c("mpg", "cyl", "wt")]
head(df)
##                    mpg cyl    wt
## Mazda RX4         21.0   6 2.620
## Mazda RX4 Wag     21.0   6 2.875
## Datsun 710        22.8   4 2.320
## Hornet 4 Drive    21.4   6 3.215
## Hornet Sportabout 18.7   8 3.440
## Valiant           18.1   6 3.460

Plotting with ggplot2

  1. qplot(): Quick plot with ggplot2
    • Scatter plots
    • Bar plot
    • Box plot, violin plot and dot plot
    • Histogram and density plots
  2. Box plots
    • Basic box plots
    • Box plot with dots
    • Change box plot colors by groups
      • Change box plot line colors
      • Change box plot fill colors
    • Change the legend position
    • Change the order of items in the legend
    • Box plot with multiple groups
    • Functions: geom_boxplot(), stat_boxplot(), stat_summary()

  3. Violin plots
    • Basic violin plots
    • Add summary statistics on a violin plot
      • Add mean and median points
      • Add median and quartile
      • Add mean and standard deviation
    • Violin plot with dots
    • Change violin plot colors by groups
      • Change violin plot line colors
      • Change violin plot fill colors
    • Change the legend position
    • Change the order of items in the legend
    • Violin plot with multiple groups
    • Functions: geom_violin(), stat_ydensity()

  4. Dot plots
    • Basic dot plots
    • Add summary statistics on a dot plot
      • Add mean and median points
      • Dot plot with box plot and violin plot
      • Add mean and standard deviation
    • Change dot plot colors by groups
    • Change the legend position
    • Change the order of items in the legend
    • Dot plot with multiple groups
    • Functions: geom_dotplot(), stat_bindot()

  5. Stripcharts
    • Basic stripcharts
    • Add summary statistics on a stripchart
      • Add mean and median points
      • Stripchart with box plot and violin plot
      • Add mean and standard deviation
    • Change point shapes by groups
    • Change stripchart colors by groups
    • Change the legend position
    • Change the order of items in the legend
    • Stripchart with multiple groups
    • Functions: geom_jitter(), stat_summary()

  6. Density plots
    • Basic density plots
    • Change density plot line types and colors
    • Change density plot colors by groups
      • Calculate the mean of each group
      • Change line colors
      • Change fill colors
    • Change the legend position
    • Combine histogram and density plots
    • Use facets
    • Functions: geom_density(), stat_density()

  7. Histogram plots
    • Basic histogram plots
    • Add mean line and density plot on the histogram
    • Change histogram plot line types and colors
    • Change histogram plot colors by groups
      • Calculate the mean of each group
      • Change line colors
      • Change fill colors
    • Change the legend position
    • Use facets
    • Functions: geom_histogram(), stat_bin(), position_identity(), position_stack(), position_dodge().

  8. Scatter plots
    • Basic scatter plots
    • Label points in the scatter plot
      • Add regression lines
      • Change the appearance of points and lines
    • Scatter plots with multiple groups
      • Change the point color/shape/size automatically
      • Add regression lines
      • Change the point color/shape/size manually
    • Add marginal rugs to a scatter plot
    • Scatter plots with the 2d density estimation
    • Scatter plots with ellipses
    • Scatter plots with rectangular bins
    • Scatter plot with marginal density distribution plot
    • Functions: geom_point(), geom_smooth(), stat_smooth(), geom_rug(), geom_density2d(), stat_density2d(), stat_bin2d(), geom_bin2d(), stat_summary2d(), geom_hex() (see stat_binhex()), stat_summary_hex()

  9. Bar plots
    • Basic bar plots
      • Bar plot with labels
      • Bar plot of counts
    • Change bar plot colors by groups
      • Change outline colors
      • Change fill colors
    • Change the legend position
    • Change the order of items in the legend
    • Bar plot with multiple groups
    • Bar plot with a numeric x-axis
    • Bar plot with error bars
    • Functions: geom_bar(), geom_errorbar()

  10. Line plots
    • Line types in R
    • Basic line plots
    • Line plot with multiple groups
      • Change globally the appearance of lines
      • Change automatically the line types by groups
      • Change manually the appearance of lines
    • Functions: geom_line(), geom_step(), geom_path(), geom_errorbar()

  11. Error bars
    • Add error bars to a bar and line plots
      • Bar plot with error bars
      • Line plot with error bars
    • Dot plot with mean point and error bars
    • Functions: geom_errorbarh(), geom_errorbar(), geom_linerange(), geom_pointrange(), geom_crossbar(), stat_summary()
  12. Pie chart
    • Simple pie charts
    • Change the pie chart fill colors
    • Create a pie chart from a factor variable
    • Functions: coord_polar()

  13. QQ plots
    • Basic qq plots
    • Change qq plot point shapes by groups
    • Change qq plot colors by groups
    • Change the legend position
    • Functions: stat_qq()

  14. ECDF plots

  15. ggsave(): Save a ggplot
    • print(): print a ggplot to a file
    • ggsave: save the last ggplot
    • Functions: print(), ggsave()

Graphical parameters

  1. Main title, axis labels and legend title
    • Change the main title and axis labels
    • Change the appearance of the main title and axis labels
    • Remove x and y axis labels
    • Functions: labs(), ggtitle(), xlab(), ylab(), update_labels()

  2. Legend position and appearance
    • Change the legend position
    • Change the legend title and text font styles
    • Change the background color of the legend box
    • Change the order of legend items
    • Remove the plot legend
    • Remove slashes in the legend of a bar plot
    • guides() : set or remove the legend for a specific aesthetic
    • Functions: guides(), guide_legend(), guide_colourbar()

  3. Change colors automatically and manually
    • Use a single color
    • Change colors by groups
      • Default colors
      • Change colors manually
      • Use RColorBrewer palettes
      • Use Wes Anderson color palettes
    • Use gray colors
    • Continuous colors: Gradient colors
    • Functions:
      • Brewer palettes: scale_colour_brewer(), scale_fill_brewer(), scale_color_brewer()
      • Gray scales: scale_color_grey(), scale_fill_grey()
      • Manual colors: scale_color_manual(), scale_fill_manual()
      • Hue colors: scale_colour_hue()
      • Gradient, continuous colors: scale_color_gradient(), scale_fill_gradient(), scale_fill_continuous(), scale_color_continuous()
      • Gradient, diverging colors: scale_color_gradient2(), scale_fill_gradient2(), scale_colour_gradientn()

  4. Point shapes, colors and size
    • Change the point shapes, colors and sizes automatically
    • Change point shapes, colors and sizes manually
    • Functions: scale_shape_manual(), scale_color_manual(), scale_size_manual()

Point shapes available in R:

r point shape

  5. Add text annotations to a graph
    • Text annotations using the function geom_text
    • Change the text color and size by groups
    • Add a text annotation at a particular coordinate
    • annotation_custom : Add a static text annotation in the top-right, top-left, …
    • Functions: geom_text(), annotate(), annotation_custom()

  6. Line types
    • Line types in R
    • Basic line plots
    • Line plot with multiple groups
      • Change globally the appearance of lines
      • Change automatically the line types by groups
      • Change manually the appearance of lines
    • Functions: scale_linetype(), scale_linetype_manual(), scale_color_manual(), scale_size_manual()

  7. Themes and background colors
    • Quick functions to change plot themes
    • Customize the appearance of the plot background
      • Change the colors of the plot panel background and the grid lines
      • Remove plot panel borders and grid lines
      • Change the plot background color (not the panel)
    • Use a custom theme
      • theme_tufte : a minimalist theme
      • theme_economist : theme based on the plots in the economist magazine
      • theme_stata: theme based on Stata graph schemes.
      • theme_wsj: theme based on plots in the Wall Street Journal
      • theme_calc : theme based on LibreOffice Calc
      • theme_hc : theme based on Highcharts JS
      • Functions: theme(), theme_bw(), theme_grey(), theme_update(), theme_blank(), theme_classic(), theme_minimal(), element_blank(), element_line(), element_rect(), element_text(), rel()

  8. Axis scales and transformations
    • Change x and y axis limits
      • Use xlim() and ylim() functions
      • Use expand_limits() function
      • Use scale_xx() functions
    • Axis transformations
      • Log and sqrt transformations
      • Format axis tick mark labels
      • Display log tick marks
    • Format date axes
      • Plot with dates
      • Format axis tick mark labels
      • Date axis limits
    • Functions:
      • xlim(), ylim(), expand_limits() : x, y axis limits
      • scale_x_continuous(), scale_y_continuous()
      • scale_x_log10(), scale_y_log10(): log10 transformation
      • scale_x_sqrt(), scale_y_sqrt(): sqrt transformation
      • coord_trans()
      • scale_x_reverse(), scale_y_reverse()
      • annotation_logticks()
      • scale_x_date(), scale_y_date()
      • scale_x_datetime(), scale_y_datetime()

  9. Axis ticks: customize tick marks and labels, reorder and select items
    • Change the appearance of the axis tick mark labels
    • Hide x and y axis tick mark labels
    • Change axis lines
    • Set axis ticks for discrete and continuous axes
      • Customize a discrete axis
        • Change the order of items
        • Change tick mark labels
        • Choose which items to display
      • Customize a continuous axis
        • Set the position of tick marks
        • Format the text of tick mark labels
    • Functions: theme(), scale_x_discrete(), scale_y_discrete(), scale_x_continuous(), scale_y_continuous()

  10. Add straight lines to a plot: horizontal, vertical and regression lines
    • geom_hline : Add horizontal lines
    • geom_vline : Add vertical lines
    • geom_abline : Add regression lines
    • geom_segment : Add a line segment
    • Functions: geom_hline(), geom_vline(), geom_abline(), geom_segment()

  11. Rotate a plot: flip and reverse
    • Horizontal plot : coord_flip()
    • Reverse y axis
    • Functions: coord_flip(), scale_x_reverse(), scale_y_reverse()

ggplot2 - R software and data visualization

  1. Faceting: split a plot into a matrix of panels
    • Facet with one variable
    • Facet with two variables
    • Facet scales
    • Facet labels
    • facet_wrap
    • Functions: facet_grid(), facet_wrap(), label_both(), label_bquote(), label_parsed()

ggplot2 - R software and data visualization

Extensions to ggplot2: R packages and functions

Acknowledgment

Infos

This analysis was performed using R (ver. 3.2.1) and ggplot2 (ver 1.0.1).

Beautiful dendrogram visualizations in R: 5+ must known methods - Unsupervised Machine Learning



A variety of functions exist in R for visualizing and customizing dendrograms. The aim of this article is to describe 5+ methods for drawing a beautiful dendrogram using R software.

We start by computing hierarchical clustering using the data set USArrests:

# Load data
data(USArrests)

# Compute distances and hierarchical clustering
dd <- dist(scale(USArrests), method = "euclidean")
hc <- hclust(dd, method = "ward.D2")

1 plot.hclust(): R base function

As you already know, the standard R function plot.hclust() can be used to draw a dendrogram from the results of hierarchical clustering analyses (computed using hclust() function).

A simplified format is:

plot(x, labels = NULL, hang = 0.1, 
     main = "Cluster dendrogram", sub = NULL,
     xlab = NULL, ylab = "Height", ...)

  • x: an object of the type produced by hclust()
  • labels: A character vector of labels for the leaves of the tree. The default value is row names. If labels = FALSE, no labels are drawn.
  • hang: The fraction of the plot height by which labels should hang below the rest of the plot. A negative value will cause the labels to hang down from 0.
  • main, sub, xlab, ylab: character strings for title.


# Default plot
plot(hc)

dendrogram visualization - Unsupervised Machine Learning

# Put the labels at the same height: hang = -1
plot(hc, hang = -1, cex = 0.6)

dendrogram visualization - Unsupervised Machine Learning

2 plot.dendrogram() function

To visualize the result of a hierarchical clustering analysis using the function plot.dendrogram(), we must first convert it to a dendrogram.

The format of the function plot.dendrogram() is:

plot(x, type = c("rectangle", "triangle"), horiz = FALSE)

  • x: an object of class dendrogram
  • type: the type of plot. Possible values are “rectangle” or “triangle”
  • horiz: logical indicating whether the dendrogram should be drawn horizontally


# Convert hclust into a dendrogram and plot
hcd <- as.dendrogram(hc)
# Default plot
plot(hcd, type = "rectangle", ylab = "Height")

dendrogram visualization - Unsupervised Machine Learning

# Triangle plot
plot(hcd, type = "triangle", ylab = "Height")

dendrogram visualization - Unsupervised Machine Learning

# Zoom in to the first dendrogram
plot(hcd, xlim = c(1, 20), ylim = c(1,8))

dendrogram visualization - Unsupervised Machine Learning

The above dendrogram can be customized using the arguments:

  • nodePar: a list of plotting parameters to use for the nodes (see ?points). Default value is NULL. The list may contain components named pch, cex, col, xpd, and/or bg each of which can have length two for specifying separate attributes for inner nodes and leaves.
  • edgePar: a list of plotting parameters to use for the edge segments (see ?segments). The list may contain components named col, lty and lwd (for the segments). As with nodePar, each can have length two for differentiating leaves and inner nodes.
  • leaflab: a string specifying how leaves are labeled. The default “perpendicular” writes text vertically; “textlike” writes text horizontally (in a rectangle), and “none” suppresses leaf labels.

# Define nodePar
nodePar <- list(lab.cex = 0.6, pch = c(NA, 19), 
                cex = 0.7, col = "blue")
# Customized plot; remove labels
plot(hcd, ylab = "Height", nodePar = nodePar, leaflab = "none")

dendrogram visualization - Unsupervised Machine Learning

# Horizontal plot
plot(hcd,  xlab = "Height",
     nodePar = nodePar, horiz = TRUE)

dendrogram visualization - Unsupervised Machine Learning

# Change edge color
plot(hcd,  xlab = "Height", nodePar = nodePar, 
     edgePar = list(col = 2:3, lwd = 2:1))

dendrogram visualization - Unsupervised Machine Learning

3 Phylogenetic trees

The package ape (Analyses of Phylogenetics and Evolution) can be used to produce a more sophisticated dendrogram.

The function plot.phylo() can be used for plotting a dendrogram. A simplified format is:

plot(x, type = "phylogram", show.tip.label = TRUE,
     edge.color = "black", edge.width = 1, edge.lty = 1,
     tip.color = "black")

  • x: an object of class “phylo”
  • type: the type of phylogeny to be drawn. Possible values are: “phylogram” (the default), “cladogram”, “fan”, “unrooted” and “radial”
  • show.tip.label: if true labels are shown
  • edge.color, edge.width, edge.lty: line color, width and type to be used for edge
  • tip.color: color used for labels


# install.packages("ape")
library("ape")
# Default plot
plot(as.phylo(hc), cex = 0.6, label.offset = 0.5)

dendrogram visualization - Unsupervised Machine Learning

# Cladogram
plot(as.phylo(hc), type = "cladogram", cex = 0.6, 
     label.offset = 0.5)

dendrogram visualization - Unsupervised Machine Learning

# Unrooted
plot(as.phylo(hc), type = "unrooted", cex = 0.6,
     no.margin = TRUE)

dendrogram visualization - Unsupervised Machine Learning

# Fan
plot(as.phylo(hc), type = "fan")

dendrogram visualization - Unsupervised Machine Learning

# Radial
plot(as.phylo(hc), type = "radial")

dendrogram visualization - Unsupervised Machine Learning

# Cut the dendrogram into 4 clusters
colors <- c("red", "blue", "green", "black")
clus4 <- cutree(hc, 4)
plot(as.phylo(hc), type = "fan", tip.color = colors[clus4],
     label.offset = 1, cex = 0.7)

dendrogram visualization - Unsupervised Machine Learning

# Change the appearance
# change edge and label (tip)
plot(as.phylo(hc), type = "cladogram", cex = 0.6,
     edge.color = "steelblue", edge.width = 2, edge.lty = 2,
     tip.color = "steelblue")

dendrogram visualization - Unsupervised Machine Learning

4 ggdendro package : ggplot2 and dendrogram

The R package ggdendro can be used to extract the plot data from dendrogram and for drawing a dendrogram using ggplot2.

4.1 Installation and loading

ggdendro can be installed as follows:

install.packages("ggdendro")

ggdendro requires the package ggplot2. Make sure that ggplot2 is installed and loaded before using ggdendro.

Load ggdendro as follows:

library("ggplot2")
library("ggdendro")

4.2 Visualize dendrogram using ggdendrogram() function

The function ggdendrogram() creates a dendrogram plot using ggplot2.

# Visualization using the default theme named theme_dendro()
ggdendrogram(hc)

dendrogram visualization - Unsupervised Machine Learning

# Rotate the plot and remove default theme
ggdendrogram(hc, rotate = TRUE, theme_dendro = FALSE)

dendrogram visualization - Unsupervised Machine Learning

4.3 Extract dendrogram plot data

The function dendro_data() can be used for extracting the data. It returns a list of data frames which can be extracted using the functions below:

  • segment(): To extract the data for dendrogram line segments
  • label(): To extract the labels

# Build dendrogram object from hclust results
dend <- as.dendrogram(hc)

# Extract the data (for rectangular lines)
# Type can be "rectangle" or "triangle"
dend_data <- dendro_data(dend, type = "rectangle")
# Components of dend_data
names(dend_data)
## [1] "segments"    "labels"      "leaf_labels" "class"
# Extract data for line segments
head(dend_data$segments)
##           x         y     xend      yend
## 1 19.771484 13.516242 8.867188 13.516242
## 2  8.867188 13.516242 8.867188  6.461866
## 3  8.867188  6.461866 4.125000  6.461866
## 4  4.125000  6.461866 4.125000  2.714554
## 5  4.125000  2.714554 2.500000  2.714554
## 6  2.500000  2.714554 2.500000  1.091092
# Extract data for labels
head(dend_data$labels)
##   x y          label
## 1 1 0        Alabama
## 2 2 0      Louisiana
## 3 3 0        Georgia
## 4 4 0      Tennessee
## 5 5 0 North Carolina
## 6 6 0    Mississippi

dend_data can be used to draw a customized dendrogram using ggplot2:

# Plot line segments and add labels
p <- ggplot(dend_data$segments) + 
  geom_segment(aes(x = x, y = y, xend = xend, yend = yend))+
  geom_text(data = dend_data$labels, aes(x, y, label = label),
            hjust = 1, angle = 90, size = 3)+
  ylim(-3, 15)
print(p)

dendrogram visualization - Unsupervised Machine Learning

5 dendextend package: Extending R’s dendrogram functionality

The package dendextend contains many functions for changing the appearance of a dendrogram and for comparing dendrograms.

In this section we’ll use the chaining operator (%>%) to simplify our code.

5.1 Chaining

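The chaining operator (%>%), provided by the magrittr package and also available once dendextend is loaded, turns x %>% f(y) into f(x, y), so that a sequence of operations can be read from left to right and top to bottom. For instance, the two pieces of R code below apply the same sequence of operations; note that the second one uses only the first five rows of USArrests.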

Standard R code for creating a dendrogram:

data <- scale(USArrests)
dist.res <- dist(data)
hc <- hclust(dist.res, method = "ward.D2")
dend <- as.dendrogram(hc)
plot(dend)

R code for creating a dendrogram using chaining operator:

dend <- USArrests[1:5,] %>% # data
        scale %>% # Scale the data
        dist %>% # calculate a distance matrix, 
        hclust(method = "ward.D2") %>% # Hierarchical clustering 
        as.dendrogram # Turn the object into a dendrogram.
plot(dend)
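
To make the rewriting rule x %>% f(y) = f(x, y) concrete, the chain above can also be written as ordinary nested calls. The sketch below uses base R only; the name dend_nested is introduced here purely for illustration.

```r
# Nested-call form of the chain: each step becomes the first
# argument of the next function, read from the inside out.
dend_nested <- as.dendrogram(
  hclust(
    dist(scale(USArrests[1:5, ])),  # scale, then distance matrix
    method = "ward.D2"              # hierarchical clustering
  )
)

# Same kind of object as the one produced by the chained version
class(dend_nested)
labels(dend_nested)
```

The nested form has to be read from the innermost call outwards, which is why the chained version is usually easier to follow.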

5.2 Installation and loading

Install the stable version as follows:

install.packages('dendextend')

Loading:

library(dendextend)

5.3 How to change a dendrogram

The function set() can be used to change the parameters with dendextend.

The format is:

set(object, what, value)

  1. object: a dendrogram object
  2. what: a character indicating what is the property of the tree that should be set/updated
  3. value: a vector with the value to set in the tree (the type of the value depends on the “what”).


Possible values for the argument what include:

  • labels: set the labels
  • labels_colors and labels_cex: set the color and the size of labels, respectively
  • leaves_pch, leaves_cex and leaves_col: set the point type, size and color for leaves, respectively
  • nodes_pch, nodes_cex and nodes_col: set the point type, size and color for nodes, respectively
  • hang_leaves: hang the leaves
  • branches_k_color: color the branches
  • branches_col, branches_lwd and branches_lty: set the color, the line width and the line type of branches, respectively
  • by_labels_branches_col, by_labels_branches_lwd and by_labels_branches_lty: set the color, the line width and the line type of branches with specific labels, respectively
  • clear_branches and clear_leaves: clear branches and leaves, respectively

5.4 Create a simple dendrogram

# Create a dendrogram and plot it
dend <- USArrests[1:5,] %>%  scale %>% 
        dist %>% hclust %>% as.dendrogram

dend %>% plot

dendrogram visualization - Unsupervised Machine Learning

# Get the labels of the tree
labels(dend)
## [1] "Alaska"     "Arizona"    "California" "Alabama"    "Arkansas"

5.5 Change labels

This section describes how to change label names as well as the color and the size for labels.

# Change the labels, and then plot:
dend %>% set("labels", c("a", "b", "c", "d", "e")) %>% plot

dendrogram visualization - Unsupervised Machine Learning

# Change color and size for labels
dend %>% set("labels_col", c("green", "blue")) %>% # change color
  set("labels_cex", 2) %>% # Change size
  plot(main = "Change the color \nand size") # plot

dendrogram visualization - Unsupervised Machine Learning

# Color labels by specifying the number of cluster (k)
dend %>% set("labels_col", value = c("green", "blue"), k=2) %>% 
          plot(main = "Color labels \nper cluster")
abline(h = 2, lty = 2)

dendrogram visualization - Unsupervised Machine Learning

In the R code above, the color vector is shorter than the number of labels. Hence, it is recycled.
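
As a rough illustration of what recycling means, the base R sketch below repeats a two-color vector until it covers five labels. It only shows R's general recycling rule; which leaf receives which color in the plot depends on how dendextend orders the labels, which is not modeled here. The names cols and recycled are illustrative.

```r
# A color vector shorter than the number of labels ...
cols <- c("green", "blue")
n_labels <- 5

# ... is repeated from the start until it is long enough
recycled <- rep_len(cols, n_labels)
recycled
## [1] "green" "blue"  "green" "blue"  "green"
```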

5.6 Change the points of a dendrogram nodes/leaves

# Change the type, the color and the size of node points
# +++++++++++++++++++++++++++++
dend %>% set("nodes_pch", 19) %>%  # node point type
  set("nodes_cex", 2) %>%  # node point size
  set("nodes_col", "blue") %>% # node point color
  plot(main = "Node points")

dendrogram visualization - Unsupervised Machine Learning

# Change the type, the color and the size of leaf points
# +++++++++++++++++++++++++++++
dend %>% set("leaves_pch", 19) %>%  # leaf point type
  set("leaves_cex", 2) %>%  # leaf point size
  set("leaves_col", "blue") %>% # leaf point color
  plot(main = "Leaves points")

dendrogram visualization - Unsupervised Machine Learning

# Specify different point types and colors for each leaf
dend %>% set("leaves_pch", c(17, 18, 19)) %>%  # leaf point type
  set("leaves_cex", 2) %>%  # leaf point size
  set("leaves_col", c("blue", "red", "green")) %>% # leaf point color
  plot(main = "Leaves points")

dendrogram visualization - Unsupervised Machine Learning

5.7 Change the color of branches

The branches can be colored according to the k clusters obtained by cutting the tree:

# Default colors
dend %>% set("branches_k_color", k = 2) %>% 
  plot(main = "Default colors")

# Customized colors
dend %>% set("branches_k_color", 
             value = c("red", "blue"), k = 2) %>% 
   plot(main = "Customized colors")

dendrogram visualization - Unsupervised Machine Learning

It’s also possible to use the function color_branches().

5.8 Adding colored rectangles

Clusters can be highlighted by adding colored rectangles. This is done using the rect.dendrogram() function (modeled on the rect.hclust() function). One advantage of rect.dendrogram() over rect.hclust() is that it also works on horizontally plotted trees:

# Vertical plot
dend %>% set("branches_k_color", k = 3) %>% plot
dend %>% rect.dendrogram(k=3, border = 8, lty = 5, lwd = 2)

# Horizontal plot
dend %>% set("branches_k_color", k = 3) %>% plot(horiz = TRUE)
dend %>% rect.dendrogram(k = 3, horiz = TRUE, border = 8, lty = 5, lwd = 2)

dendrogram visualization - Unsupervised Machine Learning

5.9 Adding colored bars

This is useful for annotating the items in the clusters:

grp <- c(1,1,1, 2,2)
k_3 <- cutree(dend,k = 3, order_clusters_as_data = FALSE) 
# The FALSE above makes sure we get the clusters in the order of the
# dendrogram, and not in that of the original data. It is like:
# cutree(dend, k = 3)[order.dendrogram(dend)]

the_bars <- cbind(grp, k_3)

dend %>% set("labels", "") %>% plot
colored_bars(colors = the_bars, dend = dend)

dendrogram visualization - Unsupervised Machine Learning
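
The reordering mentioned in the code comments above can be sketched with base R alone. The example assumes the same five rows of USArrests and default (complete linkage) clustering as in section 5.4; the names hc5, dend5, by_data and by_leaf are illustrative, and the cluster numbering returned by dendextend's cutree() may differ from the one shown by stats::cutree().

```r
# Cluster the first five states, as in section 5.4
hc5 <- hclust(dist(scale(USArrests[1:5, ])))
dend5 <- as.dendrogram(hc5)

# Memberships in the order of the original data rows
by_data <- cutree(hc5, k = 3)

# Same memberships, rearranged to follow the left-to-right leaf order
by_leaf <- by_data[order.dendrogram(dend5)]

# The reordered names now line up with the dendrogram labels
names(by_leaf)
labels(dend5)
```

Bars drawn under a dendrogram must follow the leaf order, which is why the memberships are rearranged before being passed to colored_bars().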

5.10 ggplot2 integration

The following 2 steps are used:

  1. Transform a dendrogram into a ggdend object using as.ggdend() function
  2. Make the plot using the function ggplot()

dend <- iris[1:30,-5] %>% scale %>% dist %>% 
   hclust %>% as.dendrogram %>%
   set("branches_k_color", k=3) %>% set("branches_lwd", 1.2) %>%
   set("labels_colors") %>% set("labels_cex", c(.9,1.2)) %>% 
   set("leaves_pch", 19) %>% set("leaves_col", c("blue", "red"))
# plot the dend in usual "base" plotting engine:
plot(dend)

dendrogram visualization - Unsupervised Machine Learning

Produce the same plot with ggplot2 using the functions as.ggdend() and ggplot():

library(ggplot2)
# Rectangle dendrogram using ggplot2
ggd1 <- as.ggdend(dend)
ggplot(ggd1) 

dendrogram visualization - Unsupervised Machine Learning

# Horizontal plot, using the default ggplot2 theme
ggplot(ggd1, horiz = TRUE, theme = NULL) 

dendrogram visualization - Unsupervised Machine Learning

# Theme minimal
ggplot(ggd1, theme = theme_minimal()) 

dendrogram visualization - Unsupervised Machine Learning

# Create a radial plot and remove labels
ggplot(ggd1, labels = FALSE) + 
  scale_y_reverse(expand = c(0.2, 0)) +
  coord_polar(theta="x")

dendrogram visualization - Unsupervised Machine Learning

5.11 pvclust and dendextend

The package dendextend can be used to enhance many packages, including pvclust. Recall that pvclust computes p-values for hierarchical clustering.

pvclust can be used as follows:

library(pvclust)
data(lung) # 916 genes for 73 subjects
set.seed(1234)
result <- pvclust(lung[1:100, 1:10], method.dist="cor", 
                  method.hclust="average", nboot=10)
## Bootstrap (r = 0.5)... Done.
## Bootstrap (r = 0.6)... Done.
## Bootstrap (r = 0.7)... Done.
## Bootstrap (r = 0.8)... Done.
## Bootstrap (r = 0.9)... Done.
## Bootstrap (r = 1.0)... Done.
## Bootstrap (r = 1.1)... Done.
## Bootstrap (r = 1.2)... Done.
## Bootstrap (r = 1.3)... Done.
## Bootstrap (r = 1.4)... Done.
# Default plot of the result
plot(result)
pvrect(result)

dendrogram visualization - Unsupervised Machine Learning

# pvclust and dendextend
result %>% as.dendrogram %>% 
  set("branches_k_color", k = 2, value = c("purple", "orange")) %>%
  plot
result %>% text
result %>% pvrect

dendrogram visualization - Unsupervised Machine Learning

6 Infos

This analysis has been performed using R software (ver. 3.2.1)

The Guide for Clustering Analysis on a Real Data: 4 steps you should know - Unsupervised Machine Learning



The large amounts of data collected every day in different fields (biomedical, security, marketing, web search, geospatial or other automatic equipment) exceed human abilities to analyze them. Consequently, unsupervised machine learning techniques, such as clustering, are used for discovering knowledge from big data.

Clustering approaches classify samples into groups (i.e., clusters) containing objects of similar profiles. In our previous post, we clarified distance measures for assessing similarity between observations.

In this chapter, we’ll describe the different steps to follow for clustering real data using k-means.



1 Required packages

The following packages will be used:

  • cluster for clustering analyses
  • factoextra for visualizing clusters using ggplot2 plotting system

Install the factoextra package as follows:

if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/factoextra")

The cluster package can be installed using the code below:

install.packages("cluster")

Load packages:

library(cluster)
library(factoextra)

2 Data preparation

We’ll use the built-in R data set USArrests, which can be loaded and prepared as follows:

# Load the data set
data(USArrests)

# Remove any missing values (i.e., NA, for "not available")
# that might be present in the data
USArrests <- na.omit(USArrests)

# View the first 6 rows of the data
head(USArrests, n = 6)
##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7

In this data set, columns are variables and rows are observations (i.e., samples).

To inspect the data before the K-means clustering we’ll compute some descriptive statistics such as the mean and the standard deviation of the variables.

The apply() function is used to apply a given function (e.g : min(), max(), mean(), …) on the data set. The second argument can take the value of:

  • 1: for applying the function on the rows
  • 2: for applying the function on the columns

desc_stats <- data.frame(
  Min = apply(USArrests, 2, min), # minimum
  Med = apply(USArrests, 2, median), # median
  Mean = apply(USArrests, 2, mean), # mean
  SD = apply(USArrests, 2, sd), # Standard deviation
  Max = apply(USArrests, 2, max) # Maximum
  )
desc_stats <- round(desc_stats, 1)
head(desc_stats)
##           Min   Med  Mean   SD   Max
## Murder    0.8   7.2   7.8  4.4  17.4
## Assault  45.0 159.0 170.8 83.3 337.0
## UrbanPop 32.0  66.0  65.5 14.5  91.0
## Rape      7.3  20.1  21.2  9.4  46.0
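
To make the role of the second argument of apply() concrete, here is a small sketch on a toy matrix (the matrix m is made up for illustration and is not part of the data set).

```r
# A 2 x 3 toy matrix, filled column by column:
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6
m <- matrix(1:6, nrow = 2)

row_sums <- apply(m, 1, sum)  # 1: one result per row
col_sums <- apply(m, 2, sum)  # 2: one result per column

row_sums
## [1]  9 12
col_sums
## [1]  3  7 11
```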

Note that the variables have very different means and variances. They must be standardized to make them comparable.

Standardization consists of transforming the variables such that they have mean zero and standard deviation one. The scale() function can be used as follows:

df <- scale(USArrests)
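
A quick check, sketched with base R, that the standardized columns do have mean zero and standard deviation one (up to floating-point error). The names col_means and col_sds are illustrative.

```r
df <- scale(USArrests)

# Column means should be numerically zero
col_means <- colMeans(df)
round(col_means, 10)

# Column standard deviations should all be one
col_sds <- apply(df, 2, sd)
col_sds
```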

3 Assessing the clusterability

The function get_clust_tendency() [in factoextra] can be used. It computes the Hopkins statistic and provides a visual approach.

library("factoextra")
res <- get_clust_tendency(df, 40, graph = FALSE)
# Hopkins statistic
res$hopkins_stat
## [1] 0.3440875
# Visualize the dissimilarity matrix
res$plot
## NULL

The value of the Hopkins statistic is well below 0.5, which, with the convention used here, indicates that the data is clusterable. Note that res$plot is NULL because the function was called with graph = FALSE; with graph = TRUE, it contains the ordered dissimilarity image, which can be inspected for patterns (i.e., clusters).

4 Estimate the number of clusters in the data

As k-means clustering requires the number of clusters to be specified in advance, we’ll use the function clusGap() [in cluster] to compute the gap statistic for estimating the optimal number of clusters. The function fviz_gap_stat() [in factoextra] is used to visualize the gap statistic plot.

library("cluster")
set.seed(123)
# Compute the gap statistic
gap_stat <- clusGap(df, FUN = kmeans, nstart = 25, 
                    K.max = 10, B = 500) 
# Plot the result
library(factoextra)
fviz_gap_stat(gap_stat)

Step by step guide for partitioning clustering - Unsupervised Machine Learning

The gap statistic suggests a 4-cluster solution.

It’s also possible to use the function NbClust() [in the NbClust package].
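
The suggested number of clusters can also be read off programmatically rather than from the plot. The sketch below uses maxSE() from the cluster package with the "firstSEmax" rule; this rule is an assumption made here, and other rules (for example "Tibs2001SEmax") may give a different answer. B is reduced to 50 only to keep the example fast, so the result may differ from the plot above. The name k_opt is illustrative.

```r
library(cluster)

df <- scale(USArrests)
set.seed(123)

# Gap statistic for k = 1..10 (fewer bootstrap samples than above)
gap_stat <- clusGap(df, FUNcluster = kmeans, nstart = 25,
                    K.max = 10, B = 50, verbose = FALSE)

# Pick k from the gap values and their simulation standard errors
k_opt <- maxSE(f = gap_stat$Tab[, "gap"],
               SE.f = gap_stat$Tab[, "SE.sim"],
               method = "firstSEmax")
k_opt
```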

5 Compute k-means clustering

K-means clustering with k = 4:

# Compute k-means
set.seed(123)
km.res <- kmeans(df, 4, nstart = 25)
head(km.res$cluster, 20)
##     Alabama      Alaska     Arizona    Arkansas  California    Colorado 
##           4           3           3           4           3           3 
## Connecticut    Delaware     Florida     Georgia      Hawaii       Idaho 
##           2           2           3           4           2           1 
##    Illinois     Indiana        Iowa      Kansas    Kentucky   Louisiana 
##           3           2           1           2           1           4 
##       Maine    Maryland 
##           1           3
# Visualize clusters using factoextra
fviz_cluster(km.res, USArrests)

Step by step guide for partitioning clustering - Unsupervised Machine Learning

6 Cluster validation statistics: Inspect cluster silhouette plot

Recall that the silhouette width (\(S_i\)) measures how similar an object \(i\) is to the other objects in its own cluster versus those in the neighbor cluster. \(S_i\) values range from 1 to -1:

  • A value of \(S_i\) close to 1 indicates that the object is well clustered. In other words, the object \(i\) is similar to the other objects in its group.
  • A value of \(S_i\) close to -1 indicates that the object is poorly clustered, and that assignment to some other cluster would probably improve the overall results.

sil <- silhouette(km.res$cluster, dist(df))
rownames(sil) <- rownames(USArrests)
head(sil[, 1:3])
##            cluster neighbor  sil_width
## Alabama          4        3 0.48577530
## Alaska           3        4 0.05825209
## Arizona          3        2 0.41548326
## Arkansas         4        2 0.11870947
## California       3        2 0.43555885
## Colorado         3        2 0.32654235
fviz_silhouette(sil)
##   cluster size ave.sil.width
## 1       1   13          0.37
## 2       2   16          0.34
## 3       3   13          0.27
## 4       4    8          0.39

Step by step guide for partitioning clustering - Unsupervised Machine Learning
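
For readers who want to see where these widths come from, the sketch below recomputes them by hand with base R, using the usual definition \(S_i = (b_i - a_i) / \max(a_i, b_i)\), where \(a_i\) is the mean distance from \(i\) to the other members of its cluster and \(b_i\) is the smallest mean distance from \(i\) to the members of another cluster. The helper sil_width() is hypothetical and written for illustration only; in practice silhouette() from the cluster package should be used.

```r
df <- scale(USArrests)
set.seed(123)
km <- kmeans(df, centers = 4, nstart = 25)
d <- as.matrix(dist(df))  # full pairwise distance matrix

# Silhouette width of observation i (illustrative helper)
sil_width <- function(i, cluster, d) {
  own <- cluster[i]
  same <- setdiff(which(cluster == own), i)
  if (length(same) == 0) return(0)  # convention for singleton clusters
  a <- mean(d[i, same])             # cohesion: own cluster
  others <- setdiff(unique(cluster), own)
  b <- min(sapply(others, function(k) mean(d[i, cluster == k])))
  (b - a) / max(a, b)               # separation vs cohesion
}

s <- sapply(seq_len(nrow(df)), sil_width, cluster = km$cluster, d = d)
names(s) <- rownames(df)
head(round(s, 3))
```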

It can be seen that some samples have negative silhouette values. Some natural questions are:

Which samples are these? To what cluster are they closer?

This can be determined from the output of the function silhouette() as follows:

neg_sil_index <- which(sil[, "sil_width"] < 0)
sil[neg_sil_index, , drop = FALSE]
##          cluster neighbor   sil_width
## Missouri       3        2 -0.07318144

7 eclust(): Enhanced clustering analysis

The function eclust() [in factoextra] provides several advantages compared to the standard packages used for clustering analysis:

  • It simplifies the workflow of clustering analysis
  • It can be used to compute hierarchical clustering and partitioning clustering in a single line function call
  • The function eclust() computes automatically the gap statistic for estimating the right number of clusters.
  • It automatically provides silhouette information
  • It draws beautiful graphs using ggplot2

7.1 K-means clustering using eclust()

# Compute k-means
res.km <- eclust(df, "kmeans")

Step by step guide for partitioning clustering - Unsupervised Machine Learning

# Gap statistic plot
fviz_gap_stat(res.km$gap_stat)

Step by step guide for partitioning clustering - Unsupervised Machine Learning

# Silhouette plot
fviz_silhouette(res.km)
##   cluster size ave.sil.width
## 1       1   13          0.27
## 2       2   13          0.37
## 3       3    8          0.39
## 4       4   16          0.34

Step by step guide for partitioning clustering - Unsupervised Machine Learning

7.2 Hierarchical clustering using eclust()

# Enhanced hierarchical clustering
res.hc <- eclust(df, "hclust") # compute hclust
fviz_dend(res.hc, rect = TRUE) # dendrogram

Step by step guide for partitioning clustering - Unsupervised Machine Learning

The R code below generates the silhouette plot and the scatter plot for hierarchical clustering.

fviz_silhouette(res.hc) # silhouette plot
fviz_cluster(res.hc) # scatter plot

8 Infos

This analysis has been performed using R software (ver. 3.2.3)

Model-Based Clustering - Unsupervised Machine Learning



1 Concept

The traditional clustering methods such as hierarchical clustering and partitioning algorithms (k-means and others) are heuristic and are not based on formal models.

An alternative is to use model-based clustering, in which the data are considered as coming from a distribution that is a mixture of two or more components (i.e., clusters) (Chris Fraley and Adrian E. Raftery, 2002 and 2012).

Each component k (i.e. group or cluster) is modeled by the normal or Gaussian distribution which is characterized by the parameters:

  • \(\mu_k\): mean vector,
  • \(\Sigma_k\): covariance matrix,
  • An associated probability in the mixture. Each point has a probability of belonging to each cluster.
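
In formula form, writing \(G\) for the number of components, \(\pi_k\) for the mixing probabilities, \(\Sigma_k\) for the covariance matrix of component \(k\) and \(\phi\) for the multivariate normal density (notation assumed here, following common usage), the mixture density described above is:

\[
f(x) = \sum_{k=1}^{G} \pi_k \, \phi(x;\, \mu_k, \Sigma_k), \qquad \pi_k \ge 0, \quad \sum_{k=1}^{G} \pi_k = 1 .
\]

The probability that a point \(x_i\) belongs to cluster \(k\) is then:

\[
z_{ik} = \frac{\pi_k \, \phi(x_i;\, \mu_k, \Sigma_k)}{\sum_{j=1}^{G} \pi_j \, \phi(x_i;\, \mu_j, \Sigma_j)} .
\]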

2 Model parameters

The model parameters can be estimated using the EM (Expectation-Maximization) algorithm initialized by hierarchical model-based clustering. Each cluster k is centered at the mean \(\mu_k\), with increased density for points near the mean.
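
The E and M steps can be sketched in a few lines of base R. The example makes strong simplifying assumptions: a univariate mixture with two Gaussian components, fitted to the waiting times of the faithful data set, and a crude median-split initialization instead of the hierarchical model-based initialization mentioned above. It is meant to illustrate the iterations, not to reproduce what mclust does internally; all variable names are illustrative.

```r
x <- faithful$waiting

# Crude initialization (assumption): split the data at the median
upper <- x > median(x)
pi_k <- c(mean(!upper), mean(upper))        # mixing probabilities
mu   <- c(mean(x[!upper]), mean(x[upper]))  # component means
sdv  <- c(sd(x[!upper]), sd(x[upper]))      # component std. deviations

loglik_old <- -Inf
for (iter in 1:200) {
  # E-step: weighted densities and membership probabilities z[i, k]
  dens <- cbind(pi_k[1] * dnorm(x, mu[1], sdv[1]),
                pi_k[2] * dnorm(x, mu[2], sdv[2]))
  z <- dens / rowSums(dens)

  # M-step: update parameters using z as soft cluster weights
  nk   <- colSums(z)
  pi_k <- nk / length(x)
  mu   <- colSums(z * x) / nk
  sdv  <- sqrt(colSums(z * outer(x, mu, "-")^2) / nk)

  # Stop when the log-likelihood no longer improves
  loglik <- sum(log(rowSums(dens)))
  if (abs(loglik - loglik_old) < 1e-8) break
  loglik_old <- loglik
}

round(c(pi = pi_k, mu = mu, sd = sdv), 2)
```

Each row of z sums to one, which is the "probability of belonging to each cluster" mentioned in the previous section.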

Geometric features (shape, volume, orientation) of each cluster are determined by the covariance matrix \(\Sigma_k\).

Different possible parameterizations of \(\Sigma_k\) are available in the R package mclust (see ?mclustModelNames).

The available model options, in mclust package, are represented by identifiers including: EII, VII, EEI, VEI, EVI, VVI, EEE, EEV, VEV and VVV.

Each identifier has three letters: the first refers to volume, the second to shape and the third to orientation. E stands for “equal”, V for “variable” and I for the identity matrix, that is, a spherical shape (second letter) or an orientation along the coordinate axes (third letter).

For example:

  • EVI denotes a model in which the volumes of all clusters are equal (E), the shapes of the clusters may vary (V), and the orientation is the identity (I) or “coordinate axes”.
  • EEE means that the clusters have the same volume, shape and orientation in p-dimensional space.
  • VEI means that the clusters have variable volume, the same shape and orientation equal to coordinate axes.

The mclust package uses maximum likelihood to fit all these models, with different covariance matrix parameterizations, for a range of k components. The “best model” is selected using the Bayesian Information Criterion or BIC. A large BIC score indicates strong evidence for the corresponding model.
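
With the sign convention under which a larger BIC is better, the criterion is computed as 2 × log-likelihood minus the number of parameters times log(n). The sketch below applies this to the values reported for the faithful data in the summary output of section 6 (log-likelihood -1126.361, 11 parameters, 272 observations). The helper name mclust_bic() is hypothetical and introduced here for illustration; it is not a function of the mclust package.

```r
# BIC with the "larger is better" sign convention
# (illustrative helper, not part of mclust)
mclust_bic <- function(loglik, df, n) {
  2 * loglik - df * log(n)
}

# Values taken from the summary(mc) output in section 6
bic_faithful <- mclust_bic(loglik = -1126.361, df = 11, n = 272)
round(bic_faithful, 3)
## [1] -2314.386
```

The result agrees with the BIC value printed by summary(mc), which shows how the penalty term df × log(n) discourages models with many parameters.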

3 Advantage of model-based clustering

The key advantage of model-based approach, compared to the standard clustering methods (k-means, hierarchical clustering, …), is the suggestion of the number of clusters and an appropriate model.

4 Example of data

We’ll use the bivariate faithful data set which contains the waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park (Wyoming, USA).

# Load the data
data("faithful")
head(faithful)
##   eruptions waiting
## 1     3.600      79
## 2     1.800      54
## 3     3.333      74
## 4     2.283      62
## 5     4.533      85
## 6     2.883      55

An illustration of the data can be drawn using the ggplot2 package as follows:

library("ggplot2")
ggplot(faithful, aes(x=eruptions, y=waiting)) +
  geom_point() +  # Scatter plot
  geom_density2d() # Add 2d density estimation

Model-Based Clustering - Unsupervised Machine Learning

5 Mclust(): R function for computing model-based clustering

The function Mclust() [in mclust package] can be used to compute model-based clustering.

Install and load the package as follows:

# Install
install.packages("mclust")

# Load
library("mclust")

The function Mclust() provides the optimal mixture model estimation according to BIC. A simplified format is:

Mclust(data, G = NULL)

  • data: A numeric vector, matrix or data frame. Categorical variables are not allowed. If a matrix or data frame, rows correspond to observations and columns correspond to variables.
  • G: An integer vector specifying the numbers of mixture components (clusters) for which the BIC is to be calculated. The default is G=1:9.


The function Mclust() returns an object of class ‘Mclust’ containing the following elements:

  • modelName: A character string denoting the model at which the optimal BIC occurs.
  • G: The optimal number of mixture components (i.e: number of clusters)
  • BIC: All BIC values
  • bic: Optimal BIC value
  • loglik: The loglikelihood corresponding to the optimal BIC
  • df: The number of estimated parameters
  • z: A matrix whose \([i,k]^{th}\) entry is the probability that observation \(i\) in the data belongs to the \(k^{th}\) class. Column names are cluster numbers, and rows are observations
  • classification: The cluster number of each observation, i.e. map(z)
  • uncertainty: The uncertainty associated with the classification

6 Example of cluster analysis using Mclust()

library(mclust)
# Model-based-clustering
mc <- Mclust(faithful)
# Print a summary
summary(mc)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm 
## ----------------------------------------------------
## 
## Mclust EEE (ellipsoidal, equal volume, shape and orientation) model with 3 components:
## 
##  log.likelihood   n df       BIC       ICL
##       -1126.361 272 11 -2314.386 -2360.865
## 
## Clustering table:
##   1   2   3 
## 130  97  45
# Values returned by Mclust()
names(mc)
##  [1] "call"           "data"           "modelName"      "n"             
##  [5] "d"              "G"              "BIC"            "bic"           
##  [9] "loglik"         "df"             "hypvol"         "parameters"    
## [13] "z"              "classification" "uncertainty"
# Optimal selected model
mc$modelName
## [1] "EEE"
# Optimal number of cluster
mc$G
## [1] 3
# Probability for an observation to be in a given cluster
head(mc$z)
##           [,1]         [,2]         [,3]
## 1 2.181744e-02 1.130837e-08 9.781825e-01
## 2 2.475031e-21 1.000000e+00 3.320864e-13
## 3 2.521625e-03 2.051823e-05 9.974579e-01
## 4 6.553336e-14 9.999998e-01 1.664978e-07
## 5 9.838967e-01 7.642900e-20 1.610327e-02
## 6 2.104355e-07 9.975388e-01 2.461029e-03
# Cluster assignment of each observation
head(mc$classification, 10)
##  1  2  3  4  5  6  7  8  9 10 
##  3  2  3  2  1  2  1  3  2  1
# Uncertainty associated with the classification
head(mc$uncertainty)
##            1            2            3            4            5 
## 2.181745e-02 3.321787e-13 2.542143e-03 1.664978e-07 1.610327e-02 
##            6 
## 2.461239e-03
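
The relationship between z, classification and uncertainty can be checked by hand. The sketch below is an illustration in base R, not mclust's own code: it copies the first three rows of mc$z printed above into a matrix, takes the column with the highest probability as the cluster, and computes the uncertainty as one minus that highest probability.

```r
# First three rows of mc$z, copied from the output above
z <- matrix(c(2.181744e-02, 1.130837e-08, 9.781825e-01,
              2.475031e-21, 1.000000e+00, 3.320864e-13,
              2.521625e-03, 2.051823e-05, 9.974579e-01),
            nrow = 3, byrow = TRUE)

# Cluster assignment: column with the largest membership probability
cls <- apply(z, 1, which.max)

# Uncertainty: 1 minus the largest membership probability
unc <- 1 - apply(z, 1, max)

cls
unc
```

The result agrees with the first three values of mc$classification (3, 2, 3) and, up to the rounding of the printed probabilities, with mc$uncertainty shown above.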

Model-based clustering results can be drawn using the function plot.Mclust():

plot(x, what = c("BIC", "classification", "uncertainty", "density"),
     xlab = NULL, ylab = NULL, addEllipses = TRUE, main = TRUE, ...)
# BIC values used for choosing the number of clusters
plot(mc, "BIC")


# Classification: plot showing the clustering
plot(mc, "classification")


# Classification uncertainty
plot(mc, "uncertainty")


# Estimated density. Contour plot
plot(mc, "density")


Clusters generated by Mclust() can be drawn using the function fviz_cluster() [in factoextra package]. Read more about [factoextra](http://www.sthda.com/english/wiki/factoextra-r-package-quick-multivariate-data-analysis-pca-ca-mca-and-visualization-r-software-and-data-mining).

library(factoextra)
fviz_cluster(mc, frame.type = "norm", geom = "point")


7 Infos

This analysis has been performed using R software (ver. 3.2.3)

  • Chris Fraley, A. E. Raftery, T. B. Murphy and L. Scrucca (2012). mclust Version 4 for R: Normal Mixture Modeling for Model-Based Clustering, Classification, and Density Estimation. Technical Report No. 597, Department of Statistics, University of Washington.
  • Chris Fraley and A. E. Raftery (2002). Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association 97:611-631.

Data visualization


The Elements of Choosing Colors for Great Data Visualization in R



Color is crucial for elegant data visualization. In our previous article, we described the list of color palettes available in R.

In this article, we define the basics of using color effectively, and we describe an R package and an online tool for generating beautiful color schemes.

Understanding color wheel

color wheel

(Image from Nancy Duarte, slide:ology)

The color wheel helps you visualize the relationships between colors.

The wheel is divided into pie slices which have the following components:

  • hue (true color): On the wheel above, the hue is four rings out of the center.
  • tints: correspond to the colors toward the center of the wheel (= hue + white color)
  • shades: correspond to the ring of colors on the outside of the hue (= hue + black color)
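
As a rough illustration of tints and shades, the sketch below mixes a hue with white or black in RGB space using base R only. The helper mix_color() and the mixing amount of 0.4 are assumptions made for this example; they are not part of any package discussed here.

```r
# Hypothetical helper: blend a color with another one in RGB space
# amount = 0 returns `col`, amount = 1 returns `with`
mix_color <- function(col, with, amount) {
  a <- col2rgb(col)[, 1]
  b <- col2rgb(with)[, 1]
  m <- unname(round((1 - amount) * a + amount * b))
  rgb(m[1], m[2], m[3], maxColorValue = 255)
}

hue   <- "steelblue"
tint  <- mix_color(hue, "white", 0.4)  # hue + white
shade <- mix_color(hue, "black", 0.4)  # hue + black

c(hue = hue, tint = tint, shade = shade)
```

Increasing the amount moves the tint toward the center of the wheel (more white) and the shade toward its outer ring (more black).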

Defining the basics for choosing colors

  1. Monochromatic: Variations of the same color.

  2. Analogous: Colors that touch on the wheel create a narrow, harmonious color scheme.

  3. Complementary: Colors from the opposite ends of the wheel provide the most contrast.

color wheel basics

(Image from Nancy Duarte, slide:ology)

  4. Split Complementary: A variation of the complementary scheme that uses two colors on either side of a directly complementary color. These colors have high visual contrast but with less visual tension than purely complementary colors.

  5. Triadic: Three colors equally spaced around the color wheel create vivid visual interest.

  6. Tetradic: Two pairs of complementary colors. This scheme is popular because it offers strong visual contrast while retaining harmony.

color wheel basics

(Image from Nancy Duarte, slide:ology)



colortools: R package for easily creating color schemes in R

The excellent R package colortools, developed by Gaston Sanchez, is an easy-to-use solution for generating color schemes in R.

Install colortools

install.packages("colortools")

Load colortools

library(colortools)

Color wheel

The function wheel() can be used to generate a color wheel for a given color:

wheel("darkblue", num = 12)
##  [1] "#00008B" "#46008B" "#8B008B" "#8B0046" "#8B0000" "#8B4500" "#8B8B00"
##  [8] "#468B00" "#008B00" "#008B45" "#008B8B" "#00468B"


Analogous color scheme

The function adjacent() or analogous() can be used:

analogous("darkblue")
## [1] "#00008B" "#46008B" "#00468B"


Complementary color scheme

complementary("steelblue")
## [1] "#4682B4" "#B47846"
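
The idea behind a complementary color can be sketched with base R alone: convert the color to HSV and rotate its hue by half a turn around the wheel. The helper rotate_hue() below is hypothetical and written for this example; it is not claimed to be how colortools computes its schemes, although for steelblue it gives the same complementary color as the output above.

```r
# Hypothetical helper: rotate the hue of a color by a fraction of a full turn
rotate_hue <- function(col, turn) {
  x <- rgb2hsv(col2rgb(col))
  hsv(h = (x["h", 1] + turn) %% 1, s = x["s", 1], v = x["v", 1])
}

# Opposite side of the wheel: half a turn
comp <- rotate_hue("steelblue", 0.5)
c("#4682B4", comp)
```

Other schemes follow the same logic with different rotations, for example steps of one third of a turn for a triadic scheme.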


Split Complementary Color Scheme

splitComp("steelblue")
## [1] "#4682B4" "#B4464B" "#B4AF46"


Tetradic Color Scheme

tetradic("steelblue")
## [1] "#4682B4" "#7846B4" "#B47846" "#82B446"


Square color scheme

square("steelblue")
## [1] "#4682B4" "#AF46B4" "#B47846" "#4BB446"


Sequential colors

sequential("steelblue")
##  [1] "#B4B4B4FF" "#ABB0B4FF" "#A2ACB4FF" "#99A8B4FF" "#90A4B4FF"
##  [6] "#87A0B4FF" "#7E9BB4FF" "#7597B4FF" "#6C93B4FF" "#638FB4FF"
## [11] "#5A8BB4FF" "#5187B4FF" "#4883B4FF" "#3F7FB4FF" "#367BB4FF"
## [16] "#2D77B4FF" "#2473B4FF" "#1B6EB4FF" "#126AB4FF" "#0966B4FF"
## [21] "#0062B4FF"


Design your color scheme online

The online tool Colors Scheme Designer can be used:

colors scheme designer

Read more

  • slide:ology: The Art and Science of Creating Great Presentations (by Nancy Duarte)
  • R package colortools by Gaston Sanchez.

Infos

This analysis has been performed using R software (ver. 3.2.3) and colortools (ver. 0.1.5)

ggcorrplot: Visualization of a correlation matrix using ggplot2



The easiest way to visualize a correlation matrix in R is to use the package corrplot.

In our previous article we also provided a quick-start guide for visualizing a correlation matrix using ggplot2.

Another solution is to use the function ggcorr() in the GGally package. However, the GGally package doesn’t provide any option for reordering the correlation matrix or for displaying the significance level.

In this article, we’ll describe the R package ggcorrplot for easily displaying a correlation matrix using ‘ggplot2’.

ggcorrplot main features

It provides a solution for reordering the correlation matrix and displays the significance level on the correlogram. It also includes a function for computing a matrix of correlation p-values. It’s inspired by the package corrplot.

Installation and loading

ggcorrplot can be installed from CRAN as follows:

install.packages("ggcorrplot")

Or, install the latest version from GitHub:

# Install
if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/ggcorrplot")

Loading:

library(ggcorrplot)

Getting started

Compute a correlation matrix

The mtcars data set will be used in the following R code. The function cor_pmat() [in ggcorrplot] computes a matrix of correlation p-values.

# Compute a correlation matrix
data(mtcars)
corr <- round(cor(mtcars), 1)
head(corr[, 1:6])
##       mpg  cyl disp   hp drat   wt
## mpg   1.0 -0.9 -0.8 -0.8  0.7 -0.9
## cyl  -0.9  1.0  0.9  0.8 -0.7  0.8
## disp -0.8  0.9  1.0  0.8 -0.7  0.9
## hp   -0.8  0.8  0.8  1.0 -0.4  0.7
## drat  0.7 -0.7 -0.7 -0.4  1.0 -0.7
## wt   -0.9  0.8  0.9  0.7 -0.7  1.0
# Compute a matrix of correlation p-values
p.mat <- cor_pmat(mtcars)
head(p.mat[, 1:4])
##               mpg          cyl         disp           hp
## mpg  0.000000e+00 6.112687e-10 9.380327e-10 1.787835e-07
## cyl  6.112687e-10 0.000000e+00 1.803002e-12 3.477861e-09
## disp 9.380327e-10 1.803002e-12 0.000000e+00 7.142679e-08
## hp   1.787835e-07 3.477861e-09 7.142679e-08 0.000000e+00
## drat 1.776240e-05 8.244636e-06 5.282022e-06 9.988772e-03
## wt   1.293959e-10 1.217567e-07 1.222311e-11 4.145827e-05
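
To show what a matrix of correlation p-values contains, the sketch below rebuilds one with base R by running cor.test() on every pair of columns. The function name cor_pmat_sketch() is hypothetical and the code is only an illustration; in practice, cor_pmat() from ggcorrplot is the function to use.

```r
# Hypothetical re-implementation, for illustration only
cor_pmat_sketch <- function(x, ...) {
  x <- as.matrix(x)
  n <- ncol(x)
  p <- matrix(0, n, n, dimnames = list(colnames(x), colnames(x)))
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      # p-value of the correlation test between columns i and j
      p[i, j] <- p[j, i] <- cor.test(x[, i], x[, j], ...)$p.value
    }
  }
  p
}

p_sketch <- cor_pmat_sketch(mtcars[, c("mpg", "cyl", "disp", "hp")])
signif(p_sketch, 3)
```

The matrix is symmetric with zeros on the diagonal, and each off-diagonal entry is the p-value of the test that the corresponding correlation is zero.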

Correlation matrix visualization

# Visualize the correlation matrix
# --------------------------------
# method = "square" (default)
ggcorrplot(corr)


# method = "circle"
ggcorrplot(corr, method = "circle")


# Reordering the correlation matrix
# --------------------------------
# using hierarchical clustering
ggcorrplot(corr, hc.order = TRUE, outline.col = "white")
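
To see what reordering by hierarchical clustering means, the sketch below reorders the matrix by hand with base R. Turning correlations into distances with 1 - corr and using the default hclust() settings are assumptions of this example; ggcorrplot's internal choices may differ.

```r
corr <- round(cor(mtcars), 1)

# Treat highly correlated variables as close to each other
d <- as.dist(1 - corr)

# Order of the variables along the dendrogram
ord <- hclust(d)$order

# Reordered correlation matrix: correlated variables end up side by side
corr_ordered <- corr[ord, ord]
colnames(corr_ordered)
```

Only the order of rows and columns changes; the correlation values themselves are untouched.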


# Types of correlogram layout
# --------------------------------
# Get the lower triangle
ggcorrplot(corr, hc.order = TRUE, type = "lower",
     outline.col = "white")


# Get the upper triangle
ggcorrplot(corr, hc.order = TRUE, type = "upper",
     outline.col = "white")


# Change colors and theme
# --------------------------------
# Argument colors
ggcorrplot(corr, hc.order = TRUE, type = "lower",
   outline.col = "white",
   ggtheme = ggplot2::theme_gray,
   colors = c("#6D9EC1", "white", "#E46726"))


# Add correlation coefficients
# --------------------------------
# argument lab = TRUE
ggcorrplot(corr, hc.order = TRUE, type = "lower",
   lab = TRUE)


# Add correlation significance level
# --------------------------------
# Argument p.mat
# Cross out the non-significant coefficients
ggcorrplot(corr, hc.order = TRUE,
    type = "lower", p.mat = p.mat)


# Leave the non-significant coefficients blank
ggcorrplot(corr, p.mat = p.mat, hc.order = TRUE,
    type = "lower", insig = "blank")


survminer R package: Survival Data Analysis and Visualization



Survival analysis focuses on the expected duration of time until occurrence of an event of interest. However, this failure time may not be observed within the study time period, producing the so-called censored observations.

The R package survival fits and plots survival curves using R base graphs. There are also several R packages/functions for drawing survival curves using the ggplot2 system:

  • ggsurv() function in the GGally R package
  • autoplot() function in the ggfortify R package
  • ggkm() - R function

These packages/functions are limited:

  • The default graph generated with the R package survival is ugly, and it requires programming skills to draw nice-looking survival curves. There is no option for displaying the ‘number at risk’ table.

  • GGally and ggfortify don’t contain any option for drawing the ‘number at risk’ table. You also need some knowledge of the ggplot2 plotting system to draw ready-to-publish survival curves.

  • There are different versions of the function ggkm() on the web. Most of them are not updated and don’t work with the current version of ggplot2.


Here, we present the survminer R package, which we developed to facilitate survival analysis and visualization.

survminer - Main features

The current version contains the function ggsurvplot() for easily drawing beautiful and ready-to-publish survival curves using ggplot2. ggsurvplot() also includes options for displaying the p-value and the ‘number at risk’ table under the survival curves.

Installation and loading

Install from CRAN:

install.packages("survminer")

Or, install the latest version from GitHub:

# Install
if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/survminer")
# Loading
library("survminer")

Getting started

The R package survival is required for fitting survival curves.

Draw survival curves without grouping

# Fit survival curves
require("survival")
fit <- survfit(Surv(time, status) ~ 1, data = lung)

# Drawing curves
ggsurvplot(fit, color = "#2E9FDF")
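
To make explicit what survfit() estimates when there is no grouping, the sketch below computes a Kaplan-Meier curve by hand in base R. The six observations are made up for this example, and the code is an illustration of the estimator, not the implementation used by the survival package.

```r
# Made-up data: follow-up time and status (1 = event, 0 = censored)
time   <- c(5, 8, 8, 12, 15, 20)
status <- c(1, 1, 0, 1, 0, 1)

# Distinct times at which at least one event occurred
event_times <- sort(unique(time[status == 1]))

# Number still at risk just before each event time
n_risk <- sapply(event_times, function(t) sum(time >= t))

# Number of events at each event time
n_event <- sapply(event_times, function(t) sum(time == t & status == 1))

# Kaplan-Meier estimate: running product of the conditional survival
surv <- cumprod(1 - n_event / n_risk)

data.frame(time = event_times, n_risk, n_event, surv)
```

Censored observations (status 0) never count as events, but they stay in the risk set until their censoring time, which is how incomplete follow-up still contributes to the estimate.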


Draw survival curves with two groups

Basic plots

# Fit survival curves
require("survival")
fit<- survfit(Surv(time, status) ~ sex, data = lung)

# Drawing survival curves
ggsurvplot(fit)


Change legend title, labels and position

# Change the legend title and labels
ggsurvplot(fit, legend = "bottom", 
           legend.title = "Sex",
           legend.labs = c("Male", "Female"))


# Specify legend position by its coordinates
ggsurvplot(fit, legend = c(0.2, 0.2))


Change line types and color palettes

# change line size --> 1
# Change line types by groups (i.e. "strata")
# and change color palette
ggsurvplot(fit,  size = 1,  # change line size
           linetype = "strata", # change line type by groups
           break.time.by = 250, # break time axis by 250
           palette = c("#E7B800", "#2E9FDF"), # custom color palette
           conf.int = TRUE, # Add confidence interval
           pval = TRUE # Add p-value
           )
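
The p-value added by pval = TRUE compares the survival curves of the groups. Assuming it comes from a log-rank test, which is the usual choice for this comparison, a comparable value can be computed directly with survdiff() from the survival package. This is a sketch under that assumption, not a description of ggsurvplot() internals.

```r
library(survival)

# Log-rank test comparing the two sex groups in the lung data
lr <- survdiff(Surv(time, status) ~ sex, data = lung)

# Chi-square p-value with (number of groups - 1) degrees of freedom
p_value <- 1 - pchisq(lr$chisq, df = length(lr$n) - 1)
p_value
```

A small p-value suggests that the survival curves of the two groups differ more than would be expected by chance.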


# Use brewer color palette "Dark2"
ggsurvplot(fit, linetype = "strata", 
           conf.int = TRUE, pval = TRUE,
           palette = "Dark2")


# Use grey palette
ggsurvplot(fit, linetype = "strata", 
           conf.int = TRUE, pval = TRUE,
           palette = "grey")


Add number at risk table

# Add Risk table
ggsurvplot(fit, pval = TRUE, conf.int = TRUE,
           risk.table = TRUE)


# Change color, linetype by strata, risk.table color by strata
ggsurvplot(fit, 
           pval = TRUE, conf.int = TRUE,
           risk.table = TRUE, # Add risk table
           risk.table.col = "strata", # Risk table color by groups
           linetype = "strata", # Change line type by groups
           ggtheme = theme_bw(), # Change ggplot2 theme
           palette = c("#E7B800", "#2E9FDF"))


Transform survival curves: plot cumulative events and hazard function

# Plot cumulative events
ggsurvplot(fit, conf.int = TRUE,
           palette = c("#FF9E29", "#86AA00"),
           risk.table = TRUE, risk.table.col = "strata",
           fun = "event")


# Plot the cumulative hazard function
ggsurvplot(fit, conf.int = TRUE, 
           palette = c("#FF9E29", "#86AA00"),
           risk.table = TRUE, risk.table.col = "strata",
           fun = "cumhaz")


# Arbitrary function
ggsurvplot(fit, conf.int = TRUE, 
          palette = c("#FF9E29", "#86AA00"),
           risk.table = TRUE, risk.table.col = "strata",
           pval = TRUE,
           fun = function(y) y*100)
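
The fun argument transforms the survival probability before plotting. The sketch below applies the three transformations used above to a made-up vector of survival probabilities. Reading "event" as 1 - y and "cumhaz" as -log(y) follows the standard definitions and is stated here as an assumption about ggsurvplot().

```r
# Made-up survival probabilities at successive time points
surv <- c(1, 0.9, 0.75, 0.5)

cum_event  <- 1 - surv     # cumulative events, as with fun = "event"
cum_hazard <- -log(surv)   # cumulative hazard, as with fun = "cumhaz"
percent    <- surv * 100   # arbitrary function, as with function(y) y*100

data.frame(surv, cum_event, cum_hazard, percent)
```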


Survival curves with multiple groups

# Fit (complex) survival curves
#++++++++++++++++++++++++++++++++++++
require("survival")
fit2 <- survfit( Surv(time, status) ~ rx + adhere,
    data = colon )

# Visualize
#++++++++++++++++++++++++++++++++++++
# Visualize: add p-value, break the time axis by 400
# and add the risk table
ggsurvplot(fit2, pval = TRUE, 
           break.time.by = 400,
           risk.table = TRUE)


# Adjust risk table and survival plot locations 
# ++++++++++++++++++++++++++++++++++++
# Adjust risk table location, shift to the left
ggsurvplot(fit2, pval = TRUE,
           break.time.by = 400, 
           risk.table = TRUE,
           risk.table.col = "strata",
           risk.table.adj = -2, # risk table location adj
           palette = "Dark2")


# Adjust survival plot location, shift to the right
# ++++++++++++++++++++++++++++++++++++
ggsurvplot(fit2, pval = TRUE,
           break.time.by = 400, 
           risk.table = TRUE,
           risk.table.col = "strata",
           surv.plot.adj = 4.9, # surv plot location adj
           palette = "Dark2")


# Risk table height
# ++++++++++++++++++++++++++++++++++++
ggsurvplot(fit2, pval = TRUE,
           break.time.by = 400, 
           risk.table = TRUE,
           risk.table.col = "strata",
           risk.table.height = 0.5, # Useful when you have multiple groups
           surv.plot.adj = 4.9, # surv plot location adj
           palette = "Dark2")


# Change legend labels
# ++++++++++++++++++++++++++++++++++++
ggsurvplot(fit2, pval = TRUE, 
           break.time.by = 400,
           risk.table = TRUE,
           risk.table.col = "strata",
           ggtheme = theme_bw(),
           legend.labs = c("A", "B", "C", "D", "E", "F"))


Infos

This article was built with:

##  setting  value                       
##  version  R version 3.2.3 (2015-12-10)
##  system   x86_64, darwin13.4.0        
##  ui       X11                         
##  language (EN)                        
##  collate  fr_FR.UTF-8                 
##  tz       Europe/Paris                
##  date     2016-01-17                  
## 
##  package      * version date       source        
##  colorspace     1.2-6   2015-03-11 CRAN (R 3.2.0)
##  dichromat      2.0-0   2013-01-24 CRAN (R 3.2.0)
##  digest         0.6.8   2014-12-31 CRAN (R 3.2.0)
##  ggplot2      * 2.0.0   2015-12-18 CRAN (R 3.2.3)
##  gridExtra      2.0.0   2015-07-14 CRAN (R 3.2.0)
##  gtable         0.1.2   2012-12-05 CRAN (R 3.2.0)
##  labeling       0.3     2014-08-23 CRAN (R 3.2.0)
##  magrittr       1.5     2014-11-22 CRAN (R 3.2.0)
##  MASS           7.3-45  2015-11-10 CRAN (R 3.2.3)
##  munsell        0.4.2   2013-07-11 CRAN (R 3.2.0)
##  plyr           1.8.3   2015-06-12 CRAN (R 3.2.0)
##  RColorBrewer   1.1-2   2014-12-07 CRAN (R 3.2.0)
##  Rcpp           0.12.2  2015-11-15 CRAN (R 3.2.2)
##  reshape2       1.4.1   2014-12-06 CRAN (R 3.2.0)
##  scales         0.3.0   2015-08-25 CRAN (R 3.2.0)
##  stringi        1.0-1   2015-10-22 CRAN (R 3.2.0)
##  stringr        1.0.0   2015-04-30 CRAN (R 3.2.0)
##  survival     * 2.38-3  2015-07-02 CRAN (R 3.2.3)
##  survminer    * 0.1.1   2016-01-17 local

ggplot2 texts : Add text annotations to a graph in R software



This article describes how to add a text annotation to a plot generated using ggplot2 package.

The functions below can be used :

  • geom_text(): adds text directly to the plot
  • geom_label(): draws a rectangle underneath the text, making it easier to read.
  • annotate(): useful for adding small text annotations at a particular location on the plot
  • annotation_custom(): Adds static annotations that are the same in every panel

It’s also possible to use the R package ggrepel, an extension of ggplot2 that provides geoms to repel overlapping text labels away from each other.

We’ll start by describing how to use ggplot2 official functions for adding text annotations. In the last sections, examples using ggrepel extensions are provided.

Install required packages

# Install ggplot2
install.packages("ggplot2")

# Install ggrepel
install.packages("ggrepel")

Create some data

We’ll use a subset of mtcars data. The function sample() can be used to randomly extract 10 rows:

# Subset 10 rows
set.seed(1234)
ss <- sample(1:32, 10)
df <- mtcars[ss, ]

Text annotations using geom_text and geom_label

library(ggplot2)

# Simple scatter plot
sp <- ggplot(df, aes(wt, mpg, label = rownames(df)))+
  geom_point()
# Add texts
sp + geom_text()

# Change the size of the texts
sp + geom_text(size=6)

# Change vertical and horizontal adjustment
sp +  geom_text(hjust=0, vjust=0)

# Change fontface. Allowed values : 1(normal),
# 2(bold), 3(italic), 4(bold.italic)
sp + geom_text(aes(fontface=2))

  • Change font family
sp + geom_text(family = "Times New Roman")
  • geom_label() works like geom_text() but draws a rounded rectangle underneath each label. This is useful when you want to label plots that are dense with data.
sp + geom_label()


Other useful arguments for geom_text() and geom_label() are:

  • nudge_x and nudge_y: let you offset labels from their corresponding points. The function position_nudge() can be also used.
  • check_overlap = TRUE: for avoiding overplotting of labels (supported by geom_text(); geom_label() may not support it, depending on the ggplot2 version)
  • hjust and vjust can now be character vectors (ggplot2 v >= 2.0.0): “left”, “center”, “right”, “bottom”, “middle”, “top”. New options include “inward” and “outward” which align text towards and away from the center of the plot respectively.
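
The arguments listed above can be combined in a single call. The sketch below is one possible combination on the same 10-row subset of mtcars; the nudge value of 0.8 is an arbitrary choice for this example.

```r
library(ggplot2)

# Same kind of subset as above: 10 random rows of mtcars
set.seed(1234)
df <- mtcars[sample(1:32, 10), ]

p <- ggplot(df, aes(wt, mpg, label = rownames(df))) +
  geom_point() +
  geom_text(
    nudge_y = 0.8,         # shift labels slightly above their points
    check_overlap = TRUE,  # drop labels that would overlap earlier ones
    hjust = "inward"       # align labels toward the center of the plot
  )
p
```

Note that check_overlap silently drops overlapping labels, so some points may end up without a label.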


Change the text color and size by groups

It’s possible to change the appearance of the texts using aesthetics (color, size,…) :

sp2 <- ggplot(mtcars, aes(x=wt, y=mpg, label=rownames(mtcars)))+
  geom_point()

# Color by groups
sp2 + geom_text(aes(color=factor(cyl)))

# Set the size of the text using a continuous variable
sp2 + geom_text(aes(size=wt))

# Define size range
sp2 + geom_text(aes(size=wt)) + scale_size(range=c(3,6))

Add a text annotation at a particular coordinate

The functions geom_text() and annotate() can be used :

# Solution 1
sp2 + geom_text(x=3, y=30, label="Scatter plot")

# Solution 2
sp2 + annotate(geom="text", x=3, y=30, label="Scatter plot",
              color="red")

annotation_custom : Add a static text annotation in the top-right, top-left, …

The functions annotation_custom() and textGrob() are used to add static annotations which are the same in every panel. The grid package is required :

library(grid)
# Create a text
grob <- grobTree(textGrob("Scatter plot", x=0.1,  y=0.95, hjust=0,
  gp=gpar(col="red", fontsize=13, fontface="italic")))
# Plot
sp2 + annotation_custom(grob)

Facet : In the plot below, the annotation is at the same place (in each facet) even if the axis scales vary.

sp2 + annotation_custom(grob)+facet_wrap(~cyl, scales="free")

ggrepel: Avoid overlapping of text labels

There are two important functions in the ggrepel R package:

  • geom_label_repel()
  • geom_text_repel()

Scatter plots with text annotations

We start by creating a simple scatter plot using a subset of the mtcars data set containing 15 rows.

  1. Prepare some data:
# Take a subset of 15 random points
set.seed(1234)
ss <- sample(1:32, 15)
df <- mtcars[ss, ]
  2. Create a scatter plot:
p <- ggplot(df, aes(wt, mpg)) +
  geom_point(color = 'red') +
  theme_classic(base_size = 10)
  3. Add text labels:
# Add text annotations using ggplot2::geom_text
p + geom_text(aes(label = rownames(df)),
              size = 3.5)

# Use ggrepel::geom_text_repel
require("ggrepel")
set.seed(42)
p + geom_text_repel(aes(label = rownames(df)),
                    size = 3.5) 

# Use ggrepel::geom_label_repel and 
# Change color by groups
set.seed(42)
p + geom_label_repel(aes(label = rownames(df),
                    fill = factor(cyl)), color = 'white',
                    size = 3.5) +
   theme(legend.position = "bottom")

Volcano plot

genes <- read.table("https://gist.githubusercontent.com/stephenturner/806e31fce55a8b7175af/raw/1a507c4c3f9f1baaa3a69187223ff3d3050628d4/results.txt", header = TRUE)
genes$Significant <- ifelse(genes$padj < 0.05, "FDR < 0.05", "Not Sig")

ggplot(genes, aes(x = log2FoldChange, y = -log10(pvalue))) +
  geom_point(aes(color = Significant)) +
  scale_color_manual(values = c("red", "grey")) +
  theme_bw(base_size = 12) + theme(legend.position = "bottom") +
  geom_text_repel(
    data = subset(genes, padj < 0.05),
    aes(label = Gene),
    size = 5,
    box.padding = unit(0.35, "lines"),
    point.padding = unit(0.3, "lines")
  )

source

Infos

This analysis has been performed using R software (ver. 3.2.3) and ggplot2 (ver. )

Fast Writing of Data From R to txt|csv Files: readr package



There are many solutions for writing data from R to txt (i.e., tsv: tab-separated values) or csv (comma-separated values) files. In our previous articles, we described R base functions (write.table() and write.csv()) for writing data from R to txt|csv files.


In this article, we’ll describe a more modern R package, readr, developed by Hadley Wickham, for fast reading and writing of delimited files. It contains the functions write_delim(), write_csv() and write_tsv() for easily exporting data from R.

Compared to R base functions (write.csv() and write.table()), readr functions:

  1. are much faster (X2),
  2. never write row names.



Preliminary tasks

Launch RStudio as described here: Running RStudio and setting up your working directory

Installing and loading readr

# Installing
install.packages("readr")

# Loading
library("readr")

readr functions for writing data

The function write_delim() [in readr package] is a general function to export a data table from R. Depending on the format of your file, you can also use:


  • write_csv(): to write comma (“,”) separated values
  • write_tsv(): to write tab (“\t”) separated values


The simplified formats of these functions are as follows:

# General function
write_delim(x, path, delim = " ")

# Write comma (",") separated value files
write_csv(x, path)

# Write tab ("\t") separated value files
write_tsv(x, path)

  • x: a data frame to be written
  • path: path to the result file
  • delim: Delimiter used to separate values. Must be single character.
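
A short sketch of write_delim() with a custom delimiter. The semicolon delimiter and the temporary file are choices made for this example; the data and the file are passed by position, so the call does not depend on the name of the second argument.

```r
library(readr)

# Temporary file used only for this example
out <- tempfile(fileext = ".txt")

# Write the first rows of mtcars with ";" between values
write_delim(head(mtcars, 3), out, delim = ";")

# Inspect the header line of the file
readLines(out, n = 1)
```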


Writing data to a file

The R code below exports the built-in R mtcars data set to a tab-separated (sep = “\t”) file called mtcars.txt and to a csv file called mtcars.csv in the current working directory:

# Loading mtcars data
data("mtcars")


library("readr")
# Writing mtcars data to a tsv file
write_tsv(mtcars, path = "mtcars.txt")

# Writing mtcars data to a csv file
write_csv(mtcars, path = "mtcars.csv")
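
Because readr functions never write row names, the car model names stored as row names of mtcars are not included in the files written above. A minimal sketch of one way to keep them is to copy the row names into a regular column first. The column name model and the temporary file are assumptions made for this example.

```r
library(readr)

# Move the row names (car models) into a regular column
mtcars_named <- cbind(model = rownames(mtcars), mtcars)

# Temporary file used only for this example
out <- tempfile(fileext = ".csv")
write_csv(mtcars_named, out)

# The first column of the file now holds the car models
head(read.csv(out)[, 1:3], 3)
```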

Summary


  • Write data from R to a txt (i.e., tsv) file: write_tsv(my_data, path = "my_data.txt")

  • Write data from R to a csv file: write_csv(my_data, path = "my_data.csv")


Infos

This analysis has been performed using R (ver. 3.2.3).

Writing Data From R to Excel Files (xls|xlsx)



Previously, we described the essentials of R programming and provided quick start guides for reading and writing txt and csv files using R base functions as well as using a more modern R package named readr, which is faster (X10) than R base functions. We also described different ways for reading data from Excel files into R.


Here, you’ll learn how to export data from R to Excel files (xls or xlsx file formats). We’ll use the xlsx R package.



Preliminary tasks

Launch RStudio as described here: Running RStudio and setting up your working directory

Writing Excel files using xlsx package

The xlsx package, a Java-based solution, is one of the most powerful R packages for reading, writing and formatting Excel files.

Installing and loading xlsx package

  • Install
install.packages("xlsx")
  • Load
library("xlsx")

Using xlsx package

There are two main functions in xlsx package for writing both xls and xlsx Excel files: write.xlsx() and write.xlsx2() [faster on big files compared to write.xlsx function].

The simplified formats are:

write.xlsx(x, file, sheetName = "Sheet1", 
  col.names = TRUE, row.names = TRUE, append = FALSE)

write.xlsx2(x, file, sheetName = "Sheet1",
  col.names = TRUE, row.names = TRUE, append = FALSE)

  • x: a data.frame to be written into the workbook
  • file: the path to the output file
  • sheetName: a character string to use for the sheet name.
  • col.names, row.names: a logical value specifying whether the column names/row names of x are to be written to the file
  • append: a logical value indicating if x should be appended to an existing file.


Example of usage: the following R code will write the R built-in data sets - USArrests, mtcars and iris - into the same Excel file:

library("xlsx")

# Write the first data set in a new workbook
write.xlsx(USArrests, file = "myworkbook.xlsx",
      sheetName = "USA-ARRESTS", append = FALSE)

# Add a second data set in a new worksheet
write.xlsx(mtcars, file = "myworkbook.xlsx", 
           sheetName="MTCARS", append=TRUE)

# Add a third data set
write.xlsx(iris, file = "myworkbook.xlsx",
           sheetName="IRIS", append=TRUE)

Summary


Write data from R to Excel files using xlsx package: write.xlsx(my_data, file = "result.xlsx", sheetName = "my_data", append = FALSE).


Infos

This analysis has been performed using R (ver. 3.2.3).

Saving Data into R Data Format: RDS and RDATA



In previous articles, we described the essentials of R programming and provided quick start guides for reading and writing txt and csv files using R base functions as well as using a more modern R package named readr, which is faster (X10) than R base functions. We also described different ways for reading and writing Excel files in R.

Writing data, in txt, csv or Excel file formats, is the best solution if you want to open these files with other analysis software, such as Excel. However, this solution doesn’t preserve data structures, such as column data types (numeric, character or factor). In order to do that, the data should be written out in R data format.


Here, you’ll learn how to save i) a single R object, ii) multiple R objects or iii) your entire workspace in a specified file.


Saving data into R data formats can reduce considerably the size of large files by compression.

Save data into R data formats

Preliminary tasks

Launch RStudio as described here: Running RStudio and setting up your working directory

Save one object to a file

It’s possible to use the function saveRDS() to write a single R object to a specified file (in rds file format). The object can be restored back using the function readRDS().

Note that it’s possible to restore the object under a different name.

The simplified syntax for saving and restoring is as follows:

# Save an object to a file
saveRDS(object, file = "my_data.rds")

# Restore the object
readRDS(file = "my_data.rds")
  • object: An R object to save
  • file: the name of the file where the R object is saved to or read from

In the R code below, we’ll save the mtcars data set and restore it under a different name:

# Save a single object to a file
saveRDS(mtcars, "mtcars.rds")

# Restore it under a different name
my_data <- readRDS("mtcars.rds")
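
To illustrate why the R data format preserves data structures while csv does not, the sketch below sends a small data frame through both formats and compares the results. The data frame and the temporary files are made up for this example.

```r
# Small data frame with a Date column and a factor with a custom level order
df <- data.frame(
  day   = as.Date("2016-01-17") + 0:2,
  level = factor(c("low", "high", "low"), levels = c("low", "high"))
)

# Round trip through the rds format
rds_file <- tempfile(fileext = ".rds")
saveRDS(df, rds_file)
from_rds <- readRDS(rds_file)

# Round trip through csv
csv_file <- tempfile(fileext = ".csv")
write.csv(df, csv_file, row.names = FALSE)
from_csv <- read.csv(csv_file)

identical(from_rds, df)         # checks that the rds copy matches exactly
inherits(from_csv$day, "Date")  # checks whether the Date class survived csv
```

The rds copy is identical to the original object, whereas the csv copy comes back with the date column stored as plain text that has to be converted again.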

Save multiple objects to a file

The function save() can be used to save one or more R objects to a specified file (in .RData or .rda file formats). The objects can be read back from the file using the function load().

Note that if you save your data with save(), it cannot be restored under a different name. The original object names are automatically used.

# Saving one object in RData format
save(data1, file = "data.RData")

# Save multiple objects
save(data1, data2, file = "data.RData")

# To load the data again
load("data.RData")
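
Since load() restores objects under their original names, it can overwrite objects that already exist in your session. A minimal sketch of a safer pattern, assuming two made-up objects data1 and data2, is to load the file into a separate environment and check which names were restored:

```r
# Made-up objects for this example
data1 <- 1:3
data2 <- letters[1:2]

# Save both objects to a temporary RData file
rdata_file <- tempfile(fileext = ".RData")
save(data1, data2, file = rdata_file)

# Load into a new environment instead of the current workspace
e <- new.env()
loaded <- load(rdata_file, envir = e)

loaded    # names of the restored objects
e$data1   # access a restored object from the environment
```

load() returns the names of the objects it restored, which is a convenient way to find out what a file contains.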

Save your entire workspace

It’s a good idea to save your workspace image when your work sessions are long.

This can be done at any time using the function save.image()

save.image() 

That stores your workspace to a file named .RData by default. This will ensure you don’t lose all your work in the event of system reboot, for instance.

When you close R/RStudio, it asks if you want to save your workspace. If you say yes, the next time you start R that workspace will be loaded. That saved file will be named .RData as well.

It’s also possible to specify the file name for saving your work space:

save.image(file = "my_work_space.RData")

To restore your workspace, type this:

load("my_work_space.RData")

Summary


  • Save and restore one single R object: saveRDS(object, file), my_data <- readRDS(file)

  • Save and restore multiple R objects: save(data1, data2, file = "my_data.RData"), load("my_data.RData")

  • Save and restore your entire workspace: save.image(file = "my_work_space.RData"), load("my_work_space.RData")


Infos

This analysis has been performed using R (ver. 3.2.3).


Add a table into a Word document using R software and ReporteRs package



The ReporteRs package is used to create a Word document from R software. The function addFlexTable() can be used to add a simple or customized table into the document.

  1. The first step is to create a table using one of the functions below :
    • FlexTable(), to create a ‘flexible’ table which can be easily formatted
    • vanilla.table(), a shortcut to quickly produce a nicely formatted FlexTable
  2. The second step is to add the created table to the Word document using the addFlexTable() function as follows :
# doc : docx object
# flextable : FlexTable object
addFlexTable(doc, flextable)

The aim of this R tutorial is to show you, step by step, how to add simple and formatted tables to a Word document.

In the following examples, we’ll add the first 5 rows of the iris data set to the Word document.

data<-iris[1:5, ]
data
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa

Add a simple table

library(ReporteRs)

doc <- docx()
data<-iris[1:5, ]

# Add a first table : Default table
doc <- addTitle(doc, "Default table")
doc <- addFlexTable( doc, FlexTable(data))

doc <- addParagraph(doc, c("", "")) # 2 line breaks

# Add a second table, theme : vanilla table
doc <- addTitle(doc, "Vanilla table")
doc <- addFlexTable( doc, vanilla.table(data))

writeDoc(doc, file = "r-reporters-word-document-add-table.docx")

R software and Reporters package, add table to a Word document



An optional argument of the addFlexTable() function is par.properties, whose value can be parRight(), parLeft(), parCenter() or parJustify() to set the table alignment. It can be used as follows :

doc <- addFlexTable( doc, vanilla.table(data),
                     par.properties = parCenter())


Note also that row names are not shown by default when the FlexTable() function is used to create a table. To make row names visible, use the add.rownames argument as follows :

doc <- addFlexTable( doc, FlexTable(data, add.rownames = TRUE))


Add a formatted table

Change the background colors of rows and columns

You should know three functions to change the appearance of table rows and columns :

  • setZebraStyle() : to color odd and even rows differently; for example, odd rows in gray color and even rows in white color.
  • setRowsColors() : to change color of a particular table row
  • setColumnsColors() : to change the color of particular table columns

These functions can be used as follows :

library(ReporteRs)

doc <- docx()

data<-iris[1:5, ]

# Zebra striped tables
doc <- addTitle(doc, "Zebra striped tables")
MyFTable <- vanilla.table(data)
MyFTable <- setZebraStyle(MyFTable, odd = '#eeeeee', even = 'white')
doc <- addFlexTable( doc, MyFTable)

# Change columns and rows background colors
doc <- addTitle(doc, "Change columns and rows background colors")
MyFTable = FlexTable(data = data )
# i : row index; j : column index
MyFTable = setRowsColors(MyFTable, i=2:3, colors = 'lightblue')
MyFTable = setColumnsColors(MyFTable, j=3, colors = 'pink' )
doc <- addFlexTable(doc, MyFTable)

writeDoc(doc, file = "r-reporters-word-document-formatted-table1.docx")

R software and Reporters package, add table to a Word document

Note that i and j are, respectively, the indexes of the rows and columns to change.

Change cell background and text colors

We can change the background colors of some cells according to their values using the function setFlexTableBackgroundColors().

As an example, we’ll set the background color of column 2 according to the value of the Sepal.Width variable (iris data set) :

  • Cells with Sepal.Width < 3.2 are colored in gray (“#DDDDDD”)
  • Cells with Sepal.Width >= 3.2 are colored in “orange”

The text of the table cells can also be customized, as demonstrated in the example below :

library(ReporteRs)

doc <- docx()

data<-iris[1:5, ]

# Change the background colors of column 2 according to Sepal.Width
#++++++++++++++++++++++++++++
doc <- addTitle(doc, "Change the background color of cells")
MyFTable <- FlexTable(data)
MyFTable <- setFlexTableBackgroundColors(MyFTable, j = 2,
  colors = ifelse(data$Sepal.Width < 3.2, '#DDDDDD', 'orange'))

doc <- addFlexTable( doc, MyFTable)

# Format the text of some cells (column 3:4)
#++++++++++++++++++++++++++++
doc <- addTitle(doc, "Format cell text values")
MyFTable = FlexTable(data)
MyFTable[, 3:4] = textProperties(color = 'blue')
doc <- addFlexTable( doc, MyFTable)

writeDoc(doc, file = "r-reporters-word-document-format-cells.docx")

R software and Reporters package, add table to a Word document
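
The coloring rule used for column 2 can be checked on its own, without creating a document. The base R sketch below only reproduces the ifelse() call from the example above and prints each Sepal.Width value next to the color it receives; cell_colors is an illustrative name.

```r
data <- iris[1:5, ]

# One color per row, chosen from the value of Sepal.Width
cell_colors <- ifelse(data$Sepal.Width < 3.2, "#DDDDDD", "orange")

# Show each value next to the color it receives
print(data.frame(Sepal.Width = data$Sepal.Width, color = cell_colors))
```

Note that a value of exactly 3.2 is not below the threshold, so it falls in the “orange” group.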

Analyze, format and export a correlation matrix into a Word document

library(ReporteRs)
doc <- docx()

data( mtcars )
cormatrix = cor(mtcars)

col =c("#B2182B", "#D6604D", "#F4A582", "#FDDBC7","#D1E5F0", "#92C5DE", "#4393C3", "#2166AC")

mycut = cut(cormatrix,
        breaks = c(-1,-0.75,-0.5,-0.25,0,0.25,0.5,0.75,1),
        include.lowest = TRUE, labels = FALSE )

color_palettes = col[mycut]

corrFT = FlexTable( round(cormatrix, 2), add.rownames = TRUE )

corrFT = setFlexTableBackgroundColors(corrFT,
        j = seq_len(ncol(cormatrix)) + 1,
        colors = color_palettes )


corrFT = setFlexTableBorders( corrFT
        , inner.vertical = borderProperties( style = "dashed", color = "white" )
        , inner.horizontal = borderProperties( style = "dashed", color = "white"  )
        , outer.vertical = borderProperties( width = 2, color = "white"  )
        , outer.horizontal = borderProperties( width = 2, color = "white"  )
)

doc <- addFlexTable( doc, corrFT)

writeDoc(doc, file = "r-reporters-word-document-correlation.docx")

R software and Reporters package, add table to a Word document
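
To see how the correlations are mapped to colors in the example above, the cut() step can be run in isolation. This is a base R sketch only; color_matrix is an illustrative name that does not appear in the original code, and reshaping the colors into a matrix is done here purely for inspection.

```r
cormatrix <- cor(mtcars)

# Eight colors, from strong negative (red) to strong positive (blue)
col <- c("#B2182B", "#D6604D", "#F4A582", "#FDDBC7",
         "#D1E5F0", "#92C5DE", "#4393C3", "#2166AC")

# cut() treats the matrix as a plain vector (column by column)
# and returns the index of the interval each correlation falls in
mycut <- cut(cormatrix,
             breaks = c(-1, -0.75, -0.5, -0.25, 0, 0.25, 0.5, 0.75, 1),
             include.lowest = TRUE, labels = FALSE)

# Reshape the colors back into the layout of the correlation matrix
color_matrix <- matrix(col[mycut], nrow = nrow(cormatrix),
                       dimnames = dimnames(cormatrix))

# The diagonal holds the correlation of each variable with itself,
# so it maps to the last color of the palette
print(unique(diag(color_matrix)))
```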

PowerPoint

A pptx object works the same way but does not require any parProperties.

Infos

This analysis has been performed using R (ver. 3.2.3).

Create an editable graph from R software



In this article, you’ll learn how to create editable vector graphics from R software. This is useful when you want to edit your graphics from PowerPoint: you can change line types, colors, point shapes, etc.

Editable plot from R software

Who is this article for ?

If you want to export your plot from R to PowerPoint automatically, this is for you.

If you want to bring your ggplot2 charts to PowerPoint, then this guide is for you.

If you’re looking for a package to create an editable plot and save it as a PowerPoint document, then you’ll love this tutorial.

If you’re a beginner in R programming, you’ll definitely learn something new that you can use if needed.

What R package to use ?

Editable vector graphics can be created and saved in a PowerPoint document using the ReporteRs package.

Install and load the package as follows :

install.packages('ReporteRs') # Install
library('ReporteRs') # Load

Create editable plots

Case of base graphs

The example below creates a PowerPoint document in your current working directory. The document contains one slide with 2 panels :

  • The first panel contains an editable box plot
  • The second panel contains the same box plot in raster format

The PowerPoint document created by the R code below is available here : R software and ReporteRs package - create an editable base graph

library('ReporteRs')
# Create a new powerpoint document
doc <- pptx()

# Add a new slide into the ppt document 
doc <- addSlide(doc, "Two Content" )

# add a slide title
doc<- addTitle(doc, "Editable vector graphics format versus raster format" )

# A function for creating a box plot
boxplotFunc<- function(){
      boxplot(len ~ dose, data = ToothGrowth, 
        col=2:4,main = "Guinea Pigs' Tooth Growth",
        xlab = "Vitamin C dose mg",
        ylab = "tooth length")
      }

# Add an editable box plot
doc <- addPlot(doc, boxplotFunc, vector.graphic = TRUE )

# Add a raster box plot
doc <- addPlot(doc, boxplotFunc, vector.graphic = FALSE )

# write the document to a file
writeDoc(doc, file = "editable-graph.pptx")

Open the created PowerPoint and try to edit the first box plot by changing the fill colors, line types, etc…

Editable plot from R software using ReporteRs package

Case of graphs generated using ggplot2

ggplot2 is a powerful R package, developed by Hadley Wickham, for producing visually appealing charts.

Install and load it as follows :

install.packages('ggplot2') # Install
library('ggplot2') # Load

Create an editable plot with ggplot2 :

library('ReporteRs')
library(ggplot2)
# Create a new powerpoint document
doc <- pptx()

# Add a new slide into the ppt document 
doc <- addSlide(doc, "Two Content" )

# add a slide title
doc<- addTitle(doc, "Editable vector graphics format versus raster format" )

# Create a box plot with ggplot2
bp <- ggplot(data=PlantGrowth, aes(x=group, y=weight, fill=group))+
        geom_boxplot()

# Add an editable box plot
doc <- addPlot(doc, function() print(bp), vector.graphic = TRUE )

# Add a raster box plot
doc <- addPlot(doc, function() print(bp), vector.graphic = FALSE )

# write the document to a file
writeDoc(doc, file = "editable-ggplot2.pptx")

The PowerPoint document created by the R code above is available here : R software and ReporteRs package - create an editable ggplot2

Editable plot from R software using ReporteRs package

Infos

This analysis has been performed using R (ver. 3.1.0).

You can read more about ReporteRs and download the source code at the following link :

GitHub (David Gohel): ReporteRs

Create and format PowerPoint documents from R software



Why is it important to be able to generate a PowerPoint report from R ?

There are at least two reasons for this, as described in the next sections.

Write a PowerPoint document using R software and ReporteRs package

Reason I : Many collaborators work with Microsoft Office tools

About 1 billion people worldwide use Microsoft Office (1 in 7 people on the planet; source: Microsoft).

Furthermore, many collaborators still work with MS Office software (Word, PowerPoint, Excel) for :

  • editing their text and tracking changes
  • copy-pasting texts, images and tables from multiple sources
  • saving and analyzing their data

In this context, a report generated as a PDF or HTML file is less useful to some collaborators.

Reason II : keeping beautiful R graphs beautiful for publications

R plots can be customized to be as beautiful as your imagination can make them. Unfortunately, preserving this beauty is not always an easy task when you want to publish these graphs or show them in a professional presentation.

Yes, this problem can be solved using knitr/rmarkdown/LaTeX/Beamer/Slidify. However, it would be very difficult to remake the whole presentation in a different format. Furthermore, many journals don’t accept LaTeX documents.


But wasn’t this problem solved already?

The answer to this question is yes and no. There have been several attempts to solve this problem, but many of them are not easy to use.

One of the previous solutions is the R2PPT package. Unfortunately, R2PPT is available for Windows only, and it depends on rcom or RDCOMClient for generating Microsoft PowerPoint presentations.

Objective

The goal of this R tutorial is to show you how to easily and quickly format and export R outputs (including data tables, plots, paragraphs of text and R scripts) from R statistical software to a Microsoft PowerPoint document (.pptx file format) using the ReporteRs package.

Write a PowerPoint document using R software and ReporteRs package

ReporteRs is a Java-based solution, so it works on Windows, Mac and Linux.

Install and load the ReporteRs package

Use the R code below :

install.packages('ReporteRs') # Install
library('ReporteRs') # Load

Note that ReporteRs relies on Java (>= 1.6); make sure you have a JRE installed.

The version of Java installed on your computer can be checked as follows :

system("java -version")
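
If no Java executable is available, the system() call above simply fails with a shell error. A slightly more defensive check, sketched below with base R only, first looks for java on the PATH; java_path is an illustrative name, and the message text is made up for this example.

```r
# Sys.which() returns the full path of an executable, or "" if it is not found
java_path <- Sys.which("java")

if (nzchar(java_path)) {
  # Java was found on the PATH: print its version
  system("java -version")
} else {
  message("No 'java' executable found on the PATH; install a JRE first.")
}
```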

Create a simple PowerPoint document

Four simple steps are required :

  1. Use the pptx() function to create a PowerPoint object.
  2. Use the addSlide() function to add a slide into the PowerPoint document.
  3. Add contents into the created slide using the functions below :
    • addTitle: Add a title
    • addParagraph: Add paragraphs of text
    • addFlexTable: Add a table
    • addPlot: Add a plot generated in R
    • addImage: Add external images
    • addRScript: Highlight and add R code
    • addDate : Add a date
    • addPageNumber : Add a page number
    • addFooter : Add a footer
  4. Write the document into a .pptx file using writeDoc() function

Slide layout

Before showing you an example of how to create and format a PowerPoint document from R software, let’s first discuss slide layouts. This is very important for understanding the examples provided in this tutorial.

When creating a new slide, you should specify the layout of the slide. The layouts available in the default “MS Office PowerPoint” template (on my computer) are illustrated in the figure below :

Write a PowerPoint document using R software and ReporteRs package

addSlide() function can be used to add a new slide into a PowerPoint document from R software. A simplified format of the function is :

doc <- addSlide(doc, slide.layout)

  • doc : a pptx object where slide has to be added
  • slide.layout : the layout to use for the slide.


Among the possible values for the argument slide.layout, there are : “Title Slide”, “Title and Content”, “Two Content”, “Section Header”, “Content with Caption”, “Title Only”, “Comparison”.

However, you should use only the slide layouts available on your computer.


To view the slide layouts available on your computer, use the R code below :

library(ReporteRs)
doc = pptx()
slide.layouts(doc)
 [1] "Title Slide"             "Title and Vertical Text" "Title and Content"       "Two Content"            
 [5] "Section Header"          "Vertical Title and Text" "Content with Caption"    "Title Only"             
 [9] "Comparison"              "Blank"                  


These layouts are illustrated below :

doc <- pptx()
layouts <- slide.layouts(doc) # All available layouts
# Plot each slide layout
for(i in layouts ){
  par(mar=c(0.5,0.5,2,0.5), cex=0.7)
  slide.layouts(doc, i )
    title(main = paste0("'", i, "'" ))
  if(interactive()) readline(prompt = "Show next slide layout")
}

Slide layouts available in the default template (one plot per layout)

Note that, the selected layout determines the contents you can add into the slide (See the figure above).

  1. Example 1 - If you choose ‘Title and Content’ as a slide layout, you can add only :
    • a title
    • and one content which can be texts, plots, images, tables or R code
  2. Example 2 - If you choose ‘Two Content’ as a slide layout, you can add :
    • a title
    • and two contents : For example, you can add a table in the left panel and a paragraph of texts in the right panel.
  3. Example 3 - If you choose ‘Comparison’, you can add a title and four contents (plots, tables, paragraphs, images)

Whatever the slide layout chosen, you can use the functions addDate(), addFooter() and addPageNumber() to add date, footer and slide number, respectively.

Generate a simple PowerPoint document from R software

The R code below creates a PowerPoint document with a title slide, plots, tables, and an R script :

library( ReporteRs )

# Create a PowerPoint document
doc = pptx( )

# Slide 1 : Title slide
#+++++++++++++++++++++++
doc <- addSlide(doc, "Title Slide")
doc <- addTitle(doc,"Create a PowerPoint document from R software")
doc <- addSubtitle(doc, "R and ReporteRs package")
doc <- addDate(doc)
doc <- addFooter(doc, "Isaac Newton")
doc <- addPageNumber(doc, "1/4")

# Slide 2 : Add plot
#+++++++++++++++++++++++
doc <- addSlide(doc, "Title and Content")
doc <- addTitle(doc, "Bar plot")
plotFunc<- function(){
  barplot(VADeaths, beside = TRUE,
          col = c("lightblue", "mistyrose", "lightcyan","lavender", "cornsilk"),
  legend = rownames(VADeaths), ylim = c(0, 100))
  title(main = "Death Rates in Virginia", font.main = 4)
}
doc <- addPlot(doc, plotFunc )
doc <- addPageNumber(doc, "2/4")

# Slide 3 : Add table 
#+++++++++++++++++++++++
doc <- addSlide(doc, "Two Content")
doc <- addTitle(doc,"iris data sets")
doc <- addFlexTable(doc, FlexTable(iris[1:10,] ))
doc <- addParagraph(doc, "iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.")
doc <- addPageNumber(doc, "3/4")

# Slide 4 : Add R script
#+++++++++++++++++++++
doc <- addSlide(doc, "Content with Caption")
doc <- addTitle(doc, "R Script for histogram plot")
doc <- addPlot(doc, function() hist(iris$Sepal.Width, col=4))
r_code ="data(iris)
hist(iris$Sepal.Width, col = 4)"
doc <- addRScript(doc, text=r_code)

# write the document 
writeDoc(doc, "r-reporters-powerpoint.pptx" )

The PowerPoint document created by the R code above is available here : R software and ReporteRs package - Example of creating a PowerPoint document



Note that you can use the addPageNumber() function without specifying the value of the slide number, as follows :

doc <- addPageNumber(doc)

In this case, the slide number is added using the default setting (e.g : 1 for slide 1, 2 for slide 2).

If you want to customize the numbering, use the function as follows :

doc <- addPageNumber(doc, "1/2") # Example 1
doc <- addPageNumber(doc, "I") # Example 2


Format the text of a PowerPoint document

Text properties : font, color and size

As illustrated in the figure below, text properties include :

  • Font : family (e.g : “Arial”), size (e.g : 11), style (e.g : “italic”)
  • Underlined text
  • Color (e.g : “blue”)
  • Vertical align (superscript, subscript)

Write a PowerPoint document using R software and ReporteRs package

The default font size and font family of the PowerPoint document can be modified as follows :

options( "ReporteRs-fontsize" = 18, "ReporteRs-default-font" = "Arial")
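
Since these are ordinary R options, they stay in effect for the rest of the session. If you only want them for one document, a possible pattern (a base R sketch, not something specific to ReporteRs) is to keep the previous values returned by options() and restore them afterwards. The names old_opts, current_size and current_font are illustrative.

```r
# options() returns the previous values, which makes them easy to restore
old_opts <- options("ReporteRs-fontsize" = 18, "ReporteRs-default-font" = "Arial")

# Check the values now in effect
current_size <- getOption("ReporteRs-fontsize")
current_font <- getOption("ReporteRs-default-font")
print(current_size)
print(current_font)

# ... create and write the document here ...

# Restore the previous settings when you are done
options(old_opts)
```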

Change the appearance of a “Piece Of Text” (pot)

Write a PowerPoint document using R software and ReporteRs package, format the text


The function pot() [Pieces Of Text] is used to modify the appearance of a text. It can be used also to create a hyperlink. The format is :

pot(value="", format = textProperties())

  • value : the text to be formatted
  • format : the properties to use for formatting the text


The allowed values for the argument format are the following functions :

  • textProperties() : the text formatting properties
  • textBold(), textItalic(), textBoldItalic() and textNormal() which are shortcuts for bold, italic, bold-italic and normal text, respectively.

These functions can take the arguments below :


  • color : font color; e.g : color=“#000000” or color = “black”.
  • font.size : an integer indicating the font size.
  • font.weight : the font weight. Possible values are “normal” or “bold”.
  • font.style : the font style. Possible values are “normal” or “italic”.
  • underlined : a logical value specifying if the text should be underlined.
  • font.family : the font family; e.g : “Arial”.
  • vertical.align : a character string indicating the vertical alignment of the font. Expected values are “baseline”, “subscript” or “superscript”. Default value is “baseline”.
  • shading.color : background color of the text (e.g “#000000” or “black”)


The R code below creates a PowerPoint document containing a formatted text and a hyperlink :

library( ReporteRs )

# Change the default font size and font family
options('ReporteRs-fontsize'= 18, 'ReporteRs-default-font'='Arial')

doc = pptx( )

doc <- addSlide(doc, "Two Content")
doc <- addTitle(doc,"Document with formatted texts")
doc <- addFlexTable(doc, FlexTable(iris[1:10,] ))

my_text <- pot("iris data set", textBold(color = "blue"))+" contains the measurements of " + 
          pot("sepal length", textBold(color="red"))+ " and width and petal length and width"
my_link <- pot('Click here to visit STHDA web site!', 
    hyperlink = 'http://www.sthda.com/english',
    format=textBoldItalic(color = 'blue', underlined = TRUE ))

doc <- addParagraph(doc, 
      value = set_of_paragraphs(my_text, " ",  my_link),
     par.properties=parProperties(text.align="justify")
    )

writeDoc(doc, "r-reporters-powerpoint-formatted.pptx" )

Write a PowerPoint document using R software and ReporteRs package, format the text

Add plots and images

The functions addPlot() and addImage() can be used for adding a plot or an external image to the document. addPlot() works with all R plots (base graphics, lattice, ggplot2 and grid).

These two functions can be used as follows :

# Add plots
# fun : R plotting function
# ... : other arguments to pass to the plotting function
addPlot(doc, fun, ...)

# Add images
# filename : path to the external image
addImage(doc, filename)

The R code below creates a PowerPoint document containing a histogram and an image (downloaded from R website) :

library( ReporteRs )

doc = pptx()

# Slide 1 : Title slide
doc <- addSlide(doc, "Title Slide")
doc <- addTitle(doc,"Document containing plots and images")
doc <- addSubtitle(doc, "R and ReporteRs package")

# Slide 2 : Add plot
doc <- addSlide(doc, "Two Content")
doc <- addTitle(doc,"Histogram plot")
doc <- addPlot(doc, function() hist(iris$Sepal.Width, col="lightblue"))
doc <- addParagraph(doc, "This histogram is generated using iris data sets")

# Slide 3 : Add  an image
# download an image from R website
download.file(url="http://www.r-project.org/hpgraphic.png",
              destfile="r-home-image.png", quiet=TRUE)
doc <- addSlide(doc, "Two Content")
doc <- addTitle(doc,"Image from R website")
doc <- addImage(doc, "r-home-image.png")
doc <- addParagraph(doc, "This image has been downloaded from R website")
writeDoc(doc, "r-reporters-powerpoint-plot-image.pptx")

The PowerPoint document created by the R code above is available here : R software and ReporteRs package - PowerPoint document containing plots and images



  1. Note that the addPlot() function can take other arguments, such as pointsize, to change the size of plotted text (default value is 12; in pixels).

  2. For the addImage() function, the allowed file formats are PNG, WMF, JPEG and GIF.


Add a table

addFlexTable() function is used to format and add a table into the PowerPoint.

It can be used as follows :

  • STEP 1 : Create a table using the FlexTable() or vanilla.table() function. These two functions generate a ‘flexible’ table which can be easily formatted before it is added to the slide.
  • STEP 2 : Add the created table to the document using the addFlexTable() function as follows :
# doc : pptx object
# flextable : FlexTable object

# Example 1
doc <- addFlexTable(doc, flextable = FlexTable(data))

# Example 2 
doc <- addFlexTable(doc, flextable = vanilla.table(data))

setZebraStyle() function can be used to color odd and even rows differently; for example, odd rows in gray color and even rows in white color.

The example below creates a PowerPoint document with 3 slides containing a simple table (slide 1), vanilla table (slide 2) and a zebra striped table (slide 3) :

doc = pptx()

data<-iris[1:5, ]

# Slide 1 : Simple table
doc <- addSlide(doc, "Title and Content")
doc <- addTitle(doc,"Simple table")
doc <- addFlexTable(doc, FlexTable(data))

# Slide 2 : vanilla table
doc <- addSlide(doc, "Title and Content")
doc <- addTitle(doc,"Vanilla table")
doc <- addFlexTable(doc, vanilla.table(data))

# Slide 3 : Zebra striped table
doc <- addSlide(doc, "Title and Content")
doc <- addTitle(doc,"Zebra striped table")
MyFTable <- vanilla.table(data)
MyFTable <- setZebraStyle(MyFTable, odd = '#eeeeee', even = 'white')
doc <- addFlexTable( doc, MyFTable)

writeDoc(doc, "r-reporters-powerpoint-add-table.pptx")

The PowerPoint document created by the R code above is available here : R software and ReporteRs package - PowerPoint document containing tables

Add ordered and unordered lists

Lists can be added using the addParagraph() function as follows :

doc  <- addParagraph(doc, 
  value = c('Item 1', "Item 2", "Item 3"),
  par.properties = parProperties(list.style = 'ordered', level = 1 ))

  • value : a set of items to be added as a list
  • par.properties : the paragraph formatting properties. It takes list.style and level as arguments :
    • list.style : possible values are ‘unordered’ and ‘ordered’
    • level : a numeric value indicating the level of the item to be added in the list


The example below generates a one-slide PowerPoint document containing an ordered and an unordered list :

doc <- pptx()

doc <- addSlide(doc, "Two Content")
doc <- addTitle(doc, "Ordered and unordered lists")

# 1. Ordered list
doc <- addParagraph(doc, value= c("Item 1", "Item 2", "Item 3"),
          par.properties =  parProperties(list.style = 'ordered'))

# 2. Unordered list
doc <- addParagraph(doc, value= c("Item 1", "Item 2", "Item 3"),
          par.properties =  parProperties(list.style = 'unordered'))

writeDoc(doc, file = "r-reporters-powerpoint-lists.pptx")

R software and Reporters package, add lists to a Powerpoint document

Create a PowerPoint document from a template file

This approach is useful in many situations :

  • If you work in a corporate environment and you want to generate a PowerPoint document based on a template with specific fonts, color, logos, etc.
  • If you want to modify and insert new contents into an existing PowerPoint document.
  • If you want to use text formatting styles and slide layouts from a given template file.

Note that, if you use a template file to create a PowerPoint document, slide layouts are those available in the template.

A template file can be specified to the pptx() function as follows :

# Create a PowerPoint document
doc <- pptx(template="path/to/your/powerpoint/template/file.pptx")

# ...............
# Add contents
# ...............

# Write the PowerPoint document to a file 
writeDoc(doc, file = "output-file.pptx")

In the next sections, we’ll :

  • Download a PowerPoint template file from STHDA website
  • Check the available slide layouts in the template file
  • Create a PowerPoint document based on the template

Download a template file

# Download a PowerPoint template file from STHDA website
download.file(url="http://www.sthda.com/sthda/RDoc/example-files/r-reporters-powerpoint-template.pptx",
    destfile="r-reporters-powerpoint-template.pptx", quiet=TRUE)

Slide layouts available in the template file

You can use one of the layouts below when adding a new slide to the PowerPoint document :

doc <- pptx(template="r-reporters-powerpoint-template.pptx")
layouts <-slide.layouts(doc) # All available layout
# Plot the layouts
for(i in layouts ){
  par(mar=c(0.5,0.5,2,0.5), cex=0.7)
  slide.layouts(doc, i )
  title(main = paste0("'", i, "'" ))
  if(interactive()) readline(prompt = "Show next slide layout")
}

Slide layouts available in the template file (one plot per layout)

Create a PowerPoint document from the template file

Note that the template file already contains one empty slide, which can be removed manually.

library( ReporteRs )

# Download a PowerPoint template file from STHDA website
download.file(url="http://www.sthda.com/sthda/RDoc/example-files/r-reporters-powerpoint-template.pptx",
    destfile="r-reporters-powerpoint-template.pptx", quiet=TRUE)

options('ReporteRs-fontsize'= 18, 'ReporteRs-default-font'='Arial')

doc <- pptx(template="r-reporters-powerpoint-template.pptx" )

# Slide 1 : Title slide
#+++++++++++++++++++++++
doc <- addSlide(doc, "Title Slide")
doc <- addTitle(doc,"Create a PowerPoint from template using R software")
doc <- addSubtitle(doc, "R and ReporteRs package")
doc <- addDate(doc)
doc <- addFooter(doc, "Isaac Newton")
doc <- addPageNumber(doc, "1/4")

# Slide 2 : Add plot
#+++++++++++++++++++++++
doc <- addSlide(doc, "Title and Content")
doc <- addTitle(doc, "Bar plot")
plotFunc<- function(){
  barplot(VADeaths, beside = TRUE,
          col = c("lightblue", "mistyrose", "lightcyan","lavender", "cornsilk"),
  legend = rownames(VADeaths), ylim = c(0, 100))
  title(main = "Death Rates in Virginia", font.main = 4)
}
doc <- addPlot(doc, plotFunc )
doc <- addPageNumber(doc, "2/4")

# Slide 3 : Add table 
#+++++++++++++++++++++++
doc <- addSlide(doc, "Two Content")
doc <- addTitle(doc,"iris data sets")
doc <- addFlexTable(doc, FlexTable(iris[1:4,] ))
doc <- addParagraph(doc, "iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.")
doc <- addPageNumber(doc, "3/4")

# Slide 4 : Add R script
#+++++++++++++++++++++
doc <- addSlide(doc, "Content with Caption")
doc <- addTitle(doc, "R Script for histogram plot")
doc <- addPlot(doc, function() hist(iris$Sepal.Width, col=4))
r_code ="data(iris)
hist(iris$Sepal.Width, col = 4)"
doc <- addRScript(doc, text=r_code)

# write the document 
writeDoc(doc, "r-reporters-powerpoint-from-template.pptx" )

The PowerPoint document created by the R code above is available here : R software and ReporteRs package - PowerPoint document from template

Infos

This analysis has been performed using R (ver. 3.1.0).

You can read more about ReporteRs and download the source code at the following link :

GitHub (David Gohel): ReporteRs

Import and export data using R


R Built-in Data Sets



R comes with several built-in data sets, which are generally used as demo data for playing with R functions.


In this article, we’ll first describe how to load and use R built-in data sets. Next, we’ll describe some of the most used R demo data sets: mtcars, iris, ToothGrowth, PlantGrowth and USArrests.


Preliminary tasks

Launch RStudio as described here: Running RStudio and setting up your working directory

List of pre-loaded data

To see the list of pre-loaded data, type the function data():

data()

The output is as follows:

R data sets
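
The same list can also be handled programmatically, for instance to search it. The sketch below assumes you only care about the data sets shipped in the datasets package; the names available and catalog are illustrative.

```r
# data() returns an object whose 'results' element is a character matrix
# with the columns Package, LibPath, Item and Title
available <- data(package = "datasets")$results

# Keep only the name and the short description of each data set
catalog <- as.data.frame(available[, c("Item", "Title")],
                         stringsAsFactors = FALSE)
print(head(catalog))

# Check that a given data set is listed
print("mtcars" %in% catalog$Item)
```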

Loading a built-in R data set

Load and print the mtcars data as follows:

# Loading
data(mtcars)

# Print the first 6 rows
head(mtcars, 6)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

If you want to learn more about the mtcars data set, type this:

?mtcars
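By default, data() places the data set in the current workspace. If you prefer to keep it apart from your other objects, one option is to load it into a separate environment through the envir argument of data(). The environment name e below is an arbitrary choice for this sketch.

```r
# Create an empty environment to hold the demo data
e <- new.env()

# Load mtcars into that environment rather than the workspace
data("mtcars", envir = e)

# List what the environment contains, then access the data with $
ls(e)
head(e$mtcars, 3)
```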

Most used R built-in data sets

mtcars: Motor Trend Car Road Tests

The data was extracted from the 1974 Motor Trend US magazine and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

  • View the content of mtcars data set:
# 1. Loading 
data("mtcars")
# 2. Print
head(mtcars)
  • It contains 32 observations and 11 variables:
# Number of rows (observations)
nrow(mtcars)
[1] 32
# Number of columns (variables)
ncol(mtcars)
[1] 11
  • Description of variables:
  1. mpg: Miles/(US) gallon
  2. cyl: Number of cylinders
  3. disp: Displacement (cu.in.)
  4. hp: Gross horsepower
  5. drat: Rear axle ratio
  6. wt: Weight (1000 lbs)
  7. qsec: 1/4 mile time
  8. vs: Engine (0 = V-shaped, 1 = straight)
  9. am: Transmission (0 = automatic, 1 = manual)
  10. gear: Number of forward gears
  11. carb: Number of carburetors

If you want to learn more about mtcars, type this:

?mtcars
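Several mtcars variables, such as am and vs, are stored as 0/1 numbers although they describe categories. A minimal sketch of how they might be recoded before analysis is shown below. The copy named cars and the labels "automatic" and "manual" are choices made for this example, based on the variable description above.

```r
data("mtcars")

# Work on a copy so the original data set stays untouched
cars <- mtcars

# Recode the transmission variable as a labelled factor
cars$am <- factor(cars$am, levels = c(0, 1),
                  labels = c("automatic", "manual"))

# Number of cars per transmission type
table(cars$am)

# Mean fuel consumption (mpg) per transmission type
tapply(cars$mpg, cars$am, mean)
```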

iris

The iris data set gives the measurements in centimeters of the variables sepal length, sepal width, petal length and petal width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.

data("iris")

head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
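Because iris has one grouping column (Species) and four numeric measurements, it is often summarised per species. The following is one possible way to do that with base R. The name species_means is introduced only for this example.

```r
data("iris")

# Number of flowers per species
table(iris$Species)

# Mean of every measurement, computed separately for each species
species_means <- aggregate(. ~ Species, data = iris, FUN = mean)
species_means
```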

ToothGrowth

The ToothGrowth data set contains the results of an experiment studying the effect of vitamin C on tooth growth in 60 guinea pigs. Each animal received one of three dose levels of vitamin C (0.5, 1, and 2 mg/day) by one of two delivery methods: orange juice (coded as OJ) or ascorbic acid (a form of vitamin C, coded as VC).

data("ToothGrowth")
head(ToothGrowth)
   len supp dose
1  4.2   VC  0.5
2 11.5   VC  0.5
3  7.3   VC  0.5
4  5.8   VC  0.5
5  6.4   VC  0.5
6 10.0   VC  0.5
  1. len: Tooth length
  2. supp: Supplement type (VC or OJ).
  3. dose: Dose in milligrams/day (numeric)
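The design described above (two supplement types crossed with three doses) can be checked directly from the data. The sketch below counts the animals in each combination and computes the mean tooth length per combination. The name mean_len is an illustrative choice.

```r
data("ToothGrowth")

# Cross-tabulate supplement type by dose to see the group sizes
table(ToothGrowth$supp, ToothGrowth$dose)

# Mean tooth length for each supplement/dose combination
mean_len <- aggregate(len ~ supp + dose, data = ToothGrowth, FUN = mean)
mean_len
```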

PlantGrowth

Results from an experiment to compare yields (as measured by dried weight of plants) obtained under a control and two different treatment conditions.

data("PlantGrowth")
head(PlantGrowth)
  weight group
1   4.17  ctrl
2   5.58  ctrl
3   5.18  ctrl
4   6.11  ctrl
5   4.50  ctrl
6   4.61  ctrl
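head() only shows rows from the control group, so the treatment groups are not visible in the output above. A short sketch for listing all groups and comparing their mean weights follows. The name group_means is introduced for this example.

```r
data("PlantGrowth")

# All levels of the grouping factor
levels(PlantGrowth$group)

# Number of plants per group
table(PlantGrowth$group)

# Mean dried weight per group
group_means <- tapply(PlantGrowth$weight, PlantGrowth$group, mean)
group_means
```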

USArrests

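This data set contains violent crime statistics by US state: arrests per 100,000 residents for murder, assault and rape in each of the 50 US states in 1973, along with the percentage of the population living in urban areas.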

data("USArrests")
head(USArrests)
           Murder Assault UrbanPop Rape
Alabama      13.2     236       58 21.2
Alaska       10.0     263       48 44.5
Arizona       8.1     294       80 31.0
Arkansas      8.8     190       50 19.5
California    9.0     276       91 40.6
Colorado      7.9     204       78 38.7
  1. Murder: Murder arrests (per 100,000)
  2. Assault: Assault arrests (per 100,000)
  3. UrbanPop: Percent urban population
  4. Rape: Rape arrests (per 100,000)
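In USArrests the state names are stored as row names rather than in a column, which matters when sorting or filtering. As a small sketch, the states can be ordered by murder arrest rate as follows. The name by_murder is chosen for illustration.

```r
data("USArrests")

# Sort states from highest to lowest murder arrest rate;
# the row names (state names) travel with the rows
by_murder <- USArrests[order(USArrests$Murder, decreasing = TRUE), ]

# Five states with the highest rates
head(by_murder, 5)

# State names are available through rownames()
rownames(by_murder)[1:5]
```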

Summary


  • Load a built-in R data set: data("dataset_name")

  • Inspect the data set: head(dataset_name)
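Beyond head(), a few other base R functions are commonly used to inspect a data set. The sketch below applies them to ToothGrowth. Any of the data sets described in this article could be substituted.

```r
data("ToothGrowth")

# Dimensions: number of rows and columns
dim(ToothGrowth)

# Structure: type of each column and a preview of its values
str(ToothGrowth)

# Summary statistics for each column
summary(ToothGrowth)
```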


Infos

This analysis has been performed using R (ver. 3.2.3).
