
Computing and Adding new Variables to a Data Frame in R




Previously, we described the essentials of R programming and provided quick start guides for importing data into R and for converting your data into the tibble format, which is the modern convention for working with data in R. We also described crucial steps to reshape your data with R for easier analyses.


Here, you’ll learn how to compute and add new variables to a data frame in R. This can be done easily using the functions mutate() and transmute() in the dplyr R package.


  • mutate(): Computes and adds new variable(s). Preserves existing variables. It’s similar to the R base function transform().
  • transmute(): Computes new variable(s). Drops existing variables.

Figure adapted from RStudio data wrangling cheatsheet

Preliminary tasks

  1. Launch RStudio as described here: Running RStudio and setting up your working directory

  2. Prepare your data as described here: Best practices for preparing your data and save it in an external tab-delimited .txt file or a .csv file

  3. Import your data into R as described here: Fast reading of data from txt|csv files into R: readr package.

Here, we’ll use the R built-in iris data set, which we start by converting to a tibble data frame (tbl_df). A tibble is a modern rethinking of the data frame that provides a nicer printing method. This is useful when working with large data sets.

# Create my_data
my_data <- iris[, -5]

# Convert to a tibble
library("tibble")
my_data <- as_tibble(my_data)  # as_data_frame() is deprecated in recent tibble versions

# Print
my_data
Source: local data frame [150 x 4]

   Sepal.Length Sepal.Width Petal.Length Petal.Width
          <dbl>       <dbl>        <dbl>       <dbl>
1           5.1         3.5          1.4         0.2
2           4.9         3.0          1.4         0.2
3           4.7         3.2          1.3         0.2
4           4.6         3.1          1.5         0.2
5           5.0         3.6          1.4         0.2
6           5.4         3.9          1.7         0.4
7           4.6         3.4          1.4         0.3
8           5.0         3.4          1.5         0.2
9           4.4         2.9          1.4         0.2
10          4.9         3.1          1.5         0.1
..          ...         ...          ...         ...

Install and load dplyr package for computing new variables

  • Install dplyr
install.packages("dplyr")
  • Load dplyr:
library("dplyr")

dplyr::mutate(): Add new variables by preserving existing ones

  • Add a new column (sepal_by_petal_l) while preserving the existing ones:
mutate(my_data,
       sepal_by_petal_l = Sepal.Length/Petal.Length
       )
Source: local data frame [150 x 5]

   Sepal.Length Sepal.Width Petal.Length Petal.Width sepal_by_petal_l
          (dbl)       (dbl)        (dbl)       (dbl)            (dbl)
1           5.1         3.5          1.4         0.2         3.642857
2           4.9         3.0          1.4         0.2         3.500000
3           4.7         3.2          1.3         0.2         3.615385
4           4.6         3.1          1.5         0.2         3.066667
5           5.0         3.6          1.4         0.2         3.571429
6           5.4         3.9          1.7         0.4         3.176471
7           4.6         3.4          1.4         0.3         3.285714
8           5.0         3.4          1.5         0.2         3.333333
9           4.4         2.9          1.4         0.2         3.142857
10          4.9         3.1          1.5         0.1         3.266667
..          ...         ...          ...         ...              ...
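
Several variables can be added in one mutate() call by separating them with commas. The following is a minimal sketch, assuming the dplyr and tibble packages are installed; the intermediate name res is ours and is used only for illustration.

```r
library("tibble")
library("dplyr")

# Rebuild my_data so that this block runs on its own
my_data <- as_tibble(iris[, -5])

# Add two ratio columns in a single mutate() call
res <- mutate(my_data,
              sepal_by_petal_l = Sepal.Length / Petal.Length,
              sepal_by_petal_w = Sepal.Width / Petal.Width)

# The four original columns are kept and the two new ones are appended
colnames(res)
```

The new columns are appended after the existing ones, in the order in which they are written in the call.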

dplyr::transmute(): Make new variables by dropping existing ones

  • Compute new columns (sepal_by_petal_*) and drop the existing ones:
transmute(my_data, 
            sepal_by_petal_l = Sepal.Length/Petal.Length,
            sepal_by_petal_w = Sepal.Width/Petal.Width
            )
Source: local data frame [150 x 2]

   sepal_by_petal_l sepal_by_petal_w
              (dbl)            (dbl)
1          3.642857         17.50000
2          3.500000         15.00000
3          3.615385         16.00000
4          3.066667         15.50000
5          3.571429         18.00000
6          3.176471          9.75000
7          3.285714         11.33333
8          3.333333         17.00000
9          3.142857         14.50000
10         3.266667         31.00000
..              ...              ...
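
If you want to keep a few of the original columns alongside the computed ones, you can name them explicitly in the transmute() call. Here is a short sketch, again assuming dplyr and tibble are installed; res is an illustrative name of our own.

```r
library("tibble")
library("dplyr")

# Rebuild my_data so that this block runs on its own
my_data <- as_tibble(iris[, -5])

# Keep Sepal.Length by naming it, and compute one new column;
# every column that is not mentioned is dropped
res <- transmute(my_data,
                 Sepal.Length,
                 sepal_by_petal_l = Sepal.Length / Petal.Length)

colnames(res)
```

Only the columns listed in the call appear in the result, so transmute() can be read as a combination of selecting and computing columns.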

Use mutate() and transmute() programmatically inside a function:


mutate() and transmute() are best suited for interactive use. The functions mutate_() and transmute_() should be used when calling from a function. In this case the input must be “quoted”.


There are three ways to quote inputs that dplyr understands:

  • With a formula, ~Sepal.Length.
  • With quote(), quote(Sepal.Length).
  • As a string: "Sepal.Length".
# Use formula
mutate_(my_data, 
            sepal_by_petal_l = ~Sepal.Length/Petal.Length,
            sepal_by_petal_w = ~Sepal.Width/Petal.Width
            )

# Or use quote
transmute_(my_data, 
            sepal_by_petal_l = quote(Sepal.Length/Petal.Length),
            sepal_by_petal_w = quote(Sepal.Width/Petal.Width)
            )

# Or use strings
transmute_(my_data, 
            sepal_by_petal_l = "Sepal.Length/Petal.Length",
            sepal_by_petal_w = "Sepal.Width/Petal.Width"
            )
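
Note that the underscore variants shown above are deprecated in more recent dplyr releases in favor of tidy evaluation. As a hedged sketch of the newer style, assuming a dplyr/rlang version that supports the {{ }} ("curly-curly") operator, a wrapper function could look like the following. The helper add_ratio() and its argument names are hypothetical and not part of dplyr.

```r
library("tibble")
library("dplyr")

# Rebuild my_data so that this block runs on its own
my_data <- as_tibble(iris[, -5])

# Hypothetical helper (not a dplyr function): adds a column named
# "ratio" computed from two columns passed as bare names
add_ratio <- function(data, numerator, denominator) {
  # {{ }} forwards the unquoted column names to mutate()
  mutate(data, ratio = {{ numerator }} / {{ denominator }})
}

res <- add_ratio(my_data, Sepal.Length, Petal.Length)
colnames(res)
```

With this approach the caller passes column names unquoted, exactly as in an interactive mutate() call, and no formula, quote() or string is needed.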

transform(): R base function to compute and add new variables

dplyr::mutate() works similarly to the R base function transform(), except that in mutate() you can refer to variables you’ve just created. This is not possible in transform().

my_data2 <- transform(my_data, neg_sepal_length = -Sepal.Length)
head(my_data2)
  Sepal.Length Sepal.Width Petal.Length Petal.Width neg_sepal_length
1          5.1         3.5          1.4         0.2             -5.1
2          4.9         3.0          1.4         0.2             -4.9
3          4.7         3.2          1.3         0.2             -4.7
4          4.6         3.1          1.5         0.2             -4.6
5          5.0         3.6          1.4         0.2             -5.0
6          5.4         3.9          1.7         0.4             -5.4
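
The difference between the two functions can be illustrated with a small sketch, assuming dplyr and tibble are installed. The column names sepal_area and sepal_area_half are made up for this example.

```r
library("tibble")
library("dplyr")

# Rebuild my_data so that this block runs on its own
my_data <- as_tibble(iris[, -5])

# mutate(): sepal_area_half can use sepal_area, created just before it
res <- mutate(my_data,
              sepal_area = Sepal.Length * Sepal.Width,
              sepal_area_half = sepal_area / 2)

# transform(): the same call fails, because sepal_area does not exist
# in the data (or in the calling environment) when the arguments are evaluated
res2 <- try(
  transform(as.data.frame(my_data),
            sepal_area = Sepal.Length * Sepal.Width,
            sepal_area_half = sepal_area / 2),
  silent = TRUE
)

# TRUE if transform() raised an error
inherits(res2, "try-error")
```

With transform(), the usual workaround is to call it twice, creating sepal_area in the first call and using it in the second.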

Summary


  • dplyr::mutate(iris, sepal = 2*Sepal.Length): Computes and appends new variable(s).
  • dplyr::transmute(iris, sepal = 2*Sepal.Length): Makes new variable(s) and drops existing ones.
  • transform(iris, sepal = 2*Sepal.Length): R base function similar to mutate().


Info

This analysis has been performed using R (ver. 3.2.4).

