Previously, we described the essentials of R programming and provided quick start guides for importing data into R and for converting your data into a tibble, a modern data frame format that is convenient to work with. We then described crucial steps for reshaping your data with R for easier analysis, and provided quick start guides for subsetting data frame rows based on logical criteria. In this article, you'll learn how to select and drop data frame columns by position and by name.
Preliminary tasks
Launch RStudio as described here: Running RStudio and setting up your working directory
Prepare your data as described here: Best practices for preparing your data, and save it in an external tab-delimited .txt file or a .csv file
Import your data into R as described here: Fast reading of data from txt|csv files into R: readr package.
Here, we'll use the R built-in iris data set, which we start by converting to a tibble (tbl_df). A tibble is a modern rethinking of the data frame that provides a nicer printing method, which is useful when working with large data sets.
# Create my_data
my_data <- iris
# Convert to a tibble
library("tibble")
my_data <- as_tibble(my_data)
# Print
my_data
Source: local data frame [150 x 5]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fctr>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
.. ... ... ... ... ...
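As a quick sanity check on the conversion above, the following minimal sketch confirms that the result is a tibble and still behaves as a regular data frame. It assumes a version of the tibble package that provides as_tibble() and is_tibble().

```r
library("tibble")

# Convert the built-in iris data set to a tibble
my_data <- as_tibble(iris)

# A tibble is still a data frame, with the extra classes "tbl_df" and "tbl"
is_tibble(my_data)               # TRUE
inherits(my_data, "data.frame")  # TRUE

# Dimensions are unchanged by the conversion: 150 rows, 5 columns
dim(my_data)
```

Because a tibble inherits from data.frame, the base R indexing used later in this article (for example my_data[, 1:2]) continues to work.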
Install and load the dplyr package
- Install dplyr:
install.packages("dplyr")
- Load dplyr:
library("dplyr")
Select columns by position
- Select columns 1 to 2:
my_data[, 1:2]
- Select columns 1 and 3 but not 2:
my_data[, c(1, 3)]
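The base R indexing above can also be expressed with dplyr, since select() accepts integer column positions. The sketch below is an illustrative equivalent, not part of the original workflow; the object names by_range and by_index are made up for this example.

```r
library("dplyr")

my_data <- tibble::as_tibble(iris)

# Columns 1 to 2, given as a range of positions
by_range <- select(my_data, 1:2)
names(by_range)   # "Sepal.Length" "Sepal.Width"

# Columns 1 and 3 but not 2
by_index <- select(my_data, 1, 3)
names(by_index)   # "Sepal.Length" "Petal.Length"
```

Selecting by position is compact, but it silently picks different columns if the column order changes, so selecting by name is usually safer in scripts.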
Select columns by names
- Select columns by names: Sepal.Length and Petal.Length
select(my_data, Sepal.Length, Petal.Length)
Source: local data frame [150 x 2]
Sepal.Length Petal.Length
(dbl) (dbl)
1 5.1 1.4
2 4.9 1.4
3 4.7 1.3
4 4.6 1.5
5 5.0 1.4
6 5.4 1.7
7 4.6 1.4
8 5.0 1.5
9 4.4 1.4
10 4.9 1.5
.. ... ...
- Select all columns from Sepal.Length to Petal.Length
select(my_data, Sepal.Length:Petal.Length)
Source: local data frame [150 x 3]
Sepal.Length Sepal.Width Petal.Length
(dbl) (dbl) (dbl)
1 5.1 3.5 1.4
2 4.9 3.0 1.4
3 4.7 3.2 1.3
4 4.6 3.1 1.5
5 5.0 3.6 1.4
6 5.4 3.9 1.7
7 4.6 3.4 1.4
8 5.0 3.4 1.5
9 4.4 2.9 1.4
10 4.9 3.1 1.5
.. ... ... ...
There are several special functions that can be used inside select(): starts_with(), ends_with(), contains(), matches(), one_of(), etc.
# Select columns whose names start with "Petal"
select(my_data, starts_with("Petal"))
# Select columns whose names end with "Width"
select(my_data, ends_with("Width"))
# Select columns whose names contain "etal"
select(my_data, contains("etal"))
# Select columns whose names match a regular expression
select(my_data, matches(".t."))
# Select the columns named in a character vector
select(my_data, one_of(c("Sepal.Length", "Petal.Length")))
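In more recent dplyr releases, one_of() has been superseded by all_of() and any_of(). The sketch below assumes dplyr 1.0.0 or later (which is newer than the version this article was written with) and shows how the two differ, along with combining helpers using a Boolean operator. The names wanted, strict, lenient and petal_widths are illustrative only.

```r
library("dplyr")

my_data <- tibble::as_tibble(iris)
wanted <- c("Sepal.Length", "Petal.Length")

# all_of() is strict: it raises an error if any requested name is missing
strict <- select(my_data, all_of(wanted))
names(strict)        # "Sepal.Length" "Petal.Length"

# any_of() is lenient: names that do not exist are silently skipped
lenient <- select(my_data, any_of(c(wanted, "Not.A.Column")))
names(lenient)       # "Sepal.Length" "Petal.Length"

# Helpers can be combined; here, names that start with "Petal" AND end with "Width"
petal_widths <- select(my_data, starts_with("Petal") & ends_with("Width"))
names(petal_widths)  # "Petal.Width"
```

Use all_of() when a missing column should be treated as a mistake, and any_of() when the set of columns may legitimately vary between data sets.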
Drop columns
To remove a column from a data frame, prefix its name with a minus sign (-).
- Dropping Sepal.Length and Petal.Length:
select(my_data, -Sepal.Length, -Petal.Length)
- Dropping columns from Sepal.Length to Petal.Length:
select(my_data, -(Sepal.Length:Petal.Length))
Source: local data frame [150 x 2]
Petal.Width Species
(dbl) (fctr)
1 0.2 setosa
2 0.2 setosa
3 0.2 setosa
4 0.2 setosa
5 0.2 setosa
6 0.4 setosa
7 0.3 setosa
8 0.2 setosa
9 0.2 setosa
10 0.1 setosa
.. ... ...
- Dropping columns whose name starts with Petal:
select(my_data, -starts_with("Petal"))
Source: local data frame [150 x 3]
Sepal.Length Sepal.Width Species
(dbl) (dbl) (fctr)
1 5.1 3.5 setosa
2 4.9 3.0 setosa
3 4.7 3.2 setosa
4 4.6 3.1 setosa
5 5.0 3.6 setosa
6 5.4 3.9 setosa
7 4.6 3.4 setosa
8 5.0 3.4 setosa
9 4.4 2.9 setosa
10 4.9 3.1 setosa
.. ... ... ...
Note that, if you want to drop columns by position, the syntax is as follows:
# Drop column 1
my_data[, -1]
# Drop columns 1 to 3
my_data[, -(1:3)]
# Drop columns 1 and 3 but not 2
my_data[, -c(1, 3)]
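The same position-based drops can be written with select(), which accepts negative positions. This is a minimal sketch offered as an alternative to the base R syntax above; the object names are hypothetical.

```r
library("dplyr")

my_data <- tibble::as_tibble(iris)

# Drop column 1
dropped_first <- select(my_data, -1)
names(dropped_first)  # "Sepal.Width" "Petal.Length" "Petal.Width" "Species"

# Drop columns 1 to 3
dropped_range <- select(my_data, -(1:3))
names(dropped_range)  # "Petal.Width" "Species"

# Drop columns 1 and 3 but not 2
dropped_pair <- select(my_data, -c(1, 3))
names(dropped_pair)   # "Sepal.Width" "Petal.Width" "Species"
```

The parentheses in -(1:3) matter: without them, -1:3 is the sequence from -1 to 3, which is not a set of columns to drop.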
Use select() programmatically inside an R function
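dplyr uses non-standard evaluation (NSE), which is great for interactive use and saves you typing. Behind the scenes, in the dplyr version used here, NSE is powered by the lazyeval package. Note that since dplyr 0.7.0, NSE is powered by rlang instead, and the underscore-suffixed verbs shown below (such as select_()) are deprecated.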
There are three ways to quote inputs that dplyr understands:
- With a formula, ~Sepal.Length.
- With quote(), quote(Sepal.Length).
- As a string: "Sepal.Length".
For example, you can select the column Sepal.Length by typing the following R code:
select_(my_data, ~Sepal.Length)
Or, by using this:
select_(my_data, "Sepal.Length")
It's also possible to use functions inside select_(). The R package lazyeval is required. It can be installed as follows:
install.packages("lazyeval")
Use the lazyeval package to interpret functions inside select_():
# Select column names that match ".t."
select_(my_data, lazyeval::interp(~matches(x), x = ".t."))
# Select column names that start with "Petal"
select_(my_data, lazyeval::interp(~starts_with(x), x = "Petal"))
# Dropping columns: Sepal.Length and Sepal.Width
select_(my_data, quote(-Sepal.Length), quote(-Sepal.Width))
# Or use this
select_(my_data, .dots = list(quote(-Petal.Length), quote(-Petal.Width)))
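For readers on a recent dplyr, where select_() is deprecated, the same programmatic selection can be sketched without lazyeval. The example below assumes dplyr 1.0.0 or later with a compatible rlang; the helper functions keep_columns(), drop_columns() and keep_unquoted() are hypothetical names invented for illustration, not part of dplyr.

```r
library("dplyr")

my_data <- tibble::as_tibble(iris)

# Hypothetical helper: keep the columns named in a character vector
keep_columns <- function(data, cols) {
  select(data, all_of(cols))
}

# Hypothetical helper: drop the columns named in a character vector
drop_columns <- function(data, cols) {
  select(data, -all_of(cols))
}

# Hypothetical helper: forward an unquoted column name with the embrace operator
keep_unquoted <- function(data, col) {
  select(data, {{ col }})
}

names(keep_columns(my_data, c("Sepal.Length", "Species")))
# "Sepal.Length" "Species"
names(drop_columns(my_data, c("Petal.Length", "Petal.Width")))
# "Sepal.Length" "Sepal.Width" "Species"
names(keep_unquoted(my_data, Sepal.Length))
# "Sepal.Length"
```

Passing column names as strings through all_of() suits cases where the names come from configuration or user input, while the embrace operator keeps the interactive, unquoted style inside your own functions.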
Summary
- Select columns by position: my_data[, 1:2]
- Select columns by name: dplyr::select(my_data, Sepal.Length, Petal.Length)
- Drop columns: dplyr::select(my_data, -Sepal.Length, -Petal.Length)
- Helper functions: starts_with(), ends_with(), contains(), matches(), one_of()
  - dplyr::select(my_data, starts_with("Petal"))
  - dplyr::select(my_data, ends_with("Length"))
Info
This analysis has been performed using R (ver. 3.2.3).