Previously, we described the essentials of R programming and provided quick start guides for importing data into R as well as converting your data into a tibble data format, which is the best and modern way to work with your data.

Here, you well learn how to organize (or reshape) your data in order to make the analysis easier. This process is called tidying your data.

tibble data format: tbl_df
[Figure adapted from RStudio data wrangling cheatsheet (see reference section)]

What is a tidy data set?

A data set is called tidy when:

each column represents a variable
and each row represents an observation

The opposite of tidy is messy data, which corresponds to any other arrangement of the data.

Tidy data

Having your data in tidy format is crucial for facilitating the tasks of data analysis including data manipulation, modeling and visualization.

The R package tidyr, developed by Hadley Wickham, provides functions to help you organize (or reshape) your data set into tidy format. Its particularly designed to work in combination with magrittr and dplyr to build a solid data analysis pipeline.

Preleminary tasks

Launch RStudio as described here: Running RStudio and setting up your working directory
Import your data as described here: Importing data into R

Reshaping data using tidyr package

The tidyr package, provides four functions to help you change the layout of your data set:

gather(): gather (collapse) columns into rows
spread(): spread rows into columns
separate(): separate one column into multiple
unite(): unite multiple columns into one

Installing and loading tidyr

# Installing
install.packages("tidyr")

# Loading
library("tidyr")

Example data sets

Well use the R built-in USArrests data sets. We start by subsetting a small data set, which will be used in the next sections as an example data set:

my_data <- USArrests[c(1, 10, 20, 30), ]
my_data

           Murder Assault UrbanPop Rape
Alabama      13.2     236       58 21.2
Georgia      17.4     211       60 25.8
Maryland     11.3     300       67 27.8
New Jersey    7.4     159       89 18.8

Row names are states, so lets use the function cbind() to add a column named state in the data. This will make the data tidy and the analysis easier.

my_data <- cbind(state = rownames(my_data), my_data)
my_data

                state Murder Assault UrbanPop Rape
Alabama       Alabama   13.2     236       58 21.2
Georgia       Georgia   17.4     211       60 25.8
Maryland     Maryland   11.3     300       67 27.8
New Jersey New Jersey    7.4     159       89 18.8

gather(): collapse columns into rows

The function gather() collapses multiple columns into key-value pairs. It produces a long data format from a wide one. Its an alternative of melt() function [in reshape2 package].

Simplified format:

gather(data, key, value, ...)

data: A data frame
key, value: Names of key and value columns to create in output
: Specification of columns to gather. Allowed values are:
- variable names
- if you want to select all variables between a and e, use a:e
- if you want to exclude a column name y use -y
- for more options, see: dplyr::select()

Examples of usage:

Gather all columns except the column state

my_data2 <- gather(my_data,
                   key = "arrest_attribute",
                   value = "arrest_estimate",
                   -state)
my_data2

        state arrest_attribute arrest_estimate
1     Alabama           Murder            13.2
2     Georgia           Murder            17.4
3    Maryland           Murder            11.3
4  New Jersey           Murder             7.4
5     Alabama          Assault           236.0
6     Georgia          Assault           211.0
7    Maryland          Assault           300.0
8  New Jersey          Assault           159.0
9     Alabama         UrbanPop            58.0
10    Georgia         UrbanPop            60.0
11   Maryland         UrbanPop            67.0
12 New Jersey         UrbanPop            89.0
13    Alabama             Rape            21.2
14    Georgia             Rape            25.8
15   Maryland             Rape            27.8
16 New Jersey             Rape            18.8

Note that, all column names (except state) have been collapsed into a single key column (here arrest_attribute). Their values have been put into a value column (here arrest_estimate).

Gather only Murder and Assault columns

my_data2 <- gather(my_data,
                   key = "arrest_attribute",
                   value = "arrest_estimate",
                   Murder, Assault)
my_data2

       state UrbanPop Rape arrest_attribute arrest_estimate
1    Alabama       58 21.2           Murder            13.2
2    Georgia       60 25.8           Murder            17.4
3   Maryland       67 27.8           Murder            11.3
4 New Jersey       89 18.8           Murder             7.4
5    Alabama       58 21.2          Assault           236.0
6    Georgia       60 25.8          Assault           211.0
7   Maryland       67 27.8          Assault           300.0
8 New Jersey       89 18.8          Assault           159.0

Note that, the two columns Murder and Assault have been collapsed and the remaining columns (state, UrbanPop and Rape) have been duplicated.

Gather all variables between Murder and UrbanPop

my_data2 <- gather(my_data,
                   key = "arrest_attribute",
                   value = "arrest_estimate",
                   Murder:UrbanPop)
my_data2

        state Rape arrest_attribute arrest_estimate
1     Alabama 21.2           Murder            13.2
2     Georgia 25.8           Murder            17.4
3    Maryland 27.8           Murder            11.3
4  New Jersey 18.8           Murder             7.4
5     Alabama 21.2          Assault           236.0
6     Georgia 25.8          Assault           211.0
7    Maryland 27.8          Assault           300.0
8  New Jersey 18.8          Assault           159.0
9     Alabama 21.2         UrbanPop            58.0
10    Georgia 25.8         UrbanPop            60.0
11   Maryland 27.8         UrbanPop            67.0
12 New Jersey 18.8         UrbanPop            89.0

The remaining state column is duplicated.

How to use gather() programmatically inside an R function?

You should use the function gather_() which takes character vectors, containing column names, instead of unquoted column names

The simplified syntax is as follow:

gather_(data, key_col, value_col, gather_cols)

data: a data frame
key_col, value_col: Strings specifying the names of key and value columns to create
gather_cols: Character vector specifying column names to be gathered together into pair of key-value columns.

As an example, type this:

gather_(my_data,
       key_col = "arrest_attribute",
       value_col = "arrest_estimate",
       gather_cols = c("Murder", "Assault"))

spread(): spread two columns into multiple columns

The function spread() does the reverse of gather(). It takes two columns (key and value) and spreads into multiple columns. It produces a wide data format from a long one. Its an alternative of the function cast() [in reshape2 package].

tidyr spread

Simplified format:

spread(data, key, value)

data: A data frame
key: The (unquoted) name of the column whose values will be used as column headings.
value:The (unquoted) names of the column whose values will populate the cells.

Examples of usage:

Spread my_data2 to turn back to the original data:

my_data3 <- spread(my_data2, 
                   key = "arrest_attribute",
                   value = "arrest_estimate"
                   )
my_data3

       state Rape Assault Murder UrbanPop
1    Alabama 21.2     236   13.2       58
2    Georgia 25.8     211   17.4       60
3   Maryland 27.8     300   11.3       67
4 New Jersey 18.8     159    7.4       89

How to use spread() programmatically inside an R function?

You should use the function spread_() which takes strings specifying key and value columns instead of unquoted column names

The simplified syntax is as follow:

spread_(data, key_col, value_col)

data: a data frame.
key_col, value_col: Strings specifying the names of key and value columns.

As an example, type this:

spread_(my_data2, 
       key = "arrest_attribute",
       value = "arrest_estimate"
       )

unite(): Unite multiple columns into one

The function unite() takes multiple columns and paste them together into one.

tidyr unite

Simplified format:

unite(data, col, ..., sep = "_")

data: A data frame
col: The new (unquoted) name of column to add.
sep: Separator to use between values

Examples of usage:

The R code below uses the data set my_data and unites the columns Murder and Assault

my_data4 <- unite(my_data,
                  col = "Murder_Assault",
                  Murder, Assault,
                  sep = "_")
my_data4

                state Murder_Assault UrbanPop Rape
Alabama       Alabama       13.2_236       58 21.2
Georgia       Georgia       17.4_211       60 25.8
Maryland     Maryland       11.3_300       67 27.8
New Jersey New Jersey        7.4_159       89 18.8

How to use unite() programmatically inside an R function?

You should use the function unite_() as follow.

unite_(data, col, from, sep = "_")

data: A data frame.
col: String giving the name of the new column to be added
from: Character vector specifying the names of existing columns to be united
sep: Separator to use between values.

As an example, type this:

unite_(my_data,
    col = "Murder_Assault",
    from = c("Murder", "Assault"),
    sep = "_")

separate(): separate one column into multiple

The function sperate() is the reverse of unite(). It takes values inside a single character column and separates them into multiple columns.

tidyr separate

Simplified format:

separate(data, col, into, sep = "[^[:alnum:]]+")

data: A data frame
col: Unquoted column names
into: Character vector specifying the names of new variables to be created.
sep: Separator between columns:
- If character, is interpreted as a regular expression.
- If numeric, interpreted as positions to split at. Positive values start at 1 at the far-left of the string; negative value start at -1 at the far-right of the string.

Examples of usage:

Separate the column Murder_Assault [in my_data4] into two columns Murder and Assault:

separate(my_data4,
         col = "Murder_Assault",
         into = c("Murder", "Assault"),
         sep = "_")

                state Murder Assault UrbanPop Rape
Alabama       Alabama   13.2     236       58 21.2
Georgia       Georgia   17.4     211       60 25.8
Maryland     Maryland   11.3     300       67 27.8
New Jersey New Jersey    7.4     159       89 18.8

How to use separate() programmatically inside an R function?

You should use the function separate_() as follow.

separate_(data, col, into, sep = "[^[:alnum:]]+")

data: A data frame.
col: String giving the name of the column to split
into: Character vector specifying the names of new columns to create
sep: Separator between columns (as above).

As an example, type this:

separate_(my_data4,
         col = "Murder_Assault",
         into = c("Murder", "Assault"),
         sep = "_")

Chaining multiple operations

Its possible to combine multiple operations using maggrittr forward-pipe operator : %>%.

For example, x %>% f is equivalent to f(x).

In the following R code:

first, my_data is passed to gather() function
next, the output of gather() is passed to unite() function

my_data %>% gather(key = "arrest_attribute",
                   value = "arrest_estimate",
                   Murder:UrbanPop) %>%
            unite(col = "attribute_estimate",
                  arrest_attribute, arrest_estimate)

        state Rape attribute_estimate
1     Alabama 21.2        Murder_13.2
2     Georgia 25.8        Murder_17.4
3    Maryland 27.8        Murder_11.3
4  New Jersey 18.8         Murder_7.4
5     Alabama 21.2        Assault_236
6     Georgia 25.8        Assault_211
7    Maryland 27.8        Assault_300
8  New Jersey 18.8        Assault_159
9     Alabama 21.2        UrbanPop_58
10    Georgia 25.8        UrbanPop_60
11   Maryland 27.8        UrbanPop_67
12 New Jersey 18.8        UrbanPop_89

Summary

You should tidy your data for easier data analysis using the R package tidyr, which provides the following functions.

Collapse multiple columns together into key-value pairs (long data format): gather(data, key, value, )
Spread key-value pairs into multiple columns (wide data format): spread(data, key, value)
Unite multiple columns into one: unite(data, col, )
Separate one columns into multiple: separate(data, col, into)

References

The figures illustrating tidyr functions have been adapted from RStudio data wrangling cheatsheet
Learn more about tidy data: Hadley Wickham. Tidy Data. Journal of Statistical Software, August 2014, Volume 59, Issue 10..

Infos

This analysis has been performed using R (ver. 3.2.3).

Tidyr: Crutial Step Reshaping Data with R for Easier Analyses

What is a tidy data set?

Preleminary tasks

Reshaping data using tidyr package

Installing and loading tidyr

Example data sets

gather(): collapse columns into rows

spread(): spread two columns into multiple columns

unite(): Unite multiple columns into one

separate(): separate one column into multiple

Chaining multiple operations

Summary

References

Infos

Trending Articles

Bath man appears in court charged with attempted murder of a man...

MACLEAN, Allan

Black Angus Grilled Artichokes

Practice Sheet of Right form of verbs for HSC Students

Police blotter for Jan. 12

99 God Status for Whatsapp, Facebook

Rajasthan Board 12th Science Result 2018 name wise- RBSE 12th commerce result...

Notorious Naushad of Ippa gang nabbed

Child Kidnapping: Amy McNeil was kidnapped on her way to school by 5 adults;...

Sonible Smartlimit v1.1.5-R2R

NCERT Solutions for Class 9th Sanskrit Chapter 3 पाथेयम्

मतलबी दोस्त स्टेट्स | Matlabi Dost Status in Hindi – Selfish Friends Status

Arrow Flash 2 – Sinhala Dubbed – Episode 23 – 20th March 2016

[GET] AI Traffic Goldmine

[E² Plugin] HDF-Radio

Universal Multi-Patch v1.3 By RADIXX11

IWAN – Thanks and Praise ( Throw Back Thursday )

RONALD P SONDERGAARD Arrested by Miami-Dade County Corrections on Mar 03, 2017

मुख मैथुन से उठाएं सेक्स का भरपूर मज़ा, जानें क्या है इसका सही तरीकामुख मैथुन...

HSSC Excise & Taxation Inspector Result 2017 Scorecard/ Category Wise Merit List

What is a tidy data set?

Preleminary tasks

Reshaping data using tidyr package

Installing and loading tidyr

Example data sets

gather(): collapse columns into rows

spread(): spread two columns into multiple columns

unite(): Unite multiple columns into one

separate(): separate one column into multiple

Chaining multiple operations

Summary

Related articles

References

Infos

Trending Articles