Previously, we described the essentials of R programming and provided quick start guides for importing data into R as well as converting your data into a tibble data format, which is the best and modern way to work with your data. We also described crutial steps to reshape your data with R for easier analyses.
Among the functions available in dplyr package, there are:
- filter(iris, Sepal.Length >7): Extract rows based on logical criteria
- distinct(iris): Remove duplicated rows
- sample_n(iris, 10, replace = FALSE): Select n random rows from a table
- sample_frac(iris, 0.5, replace = FALSE): Select a random fraction of rows
- slice(iris, 3:8): Select rows by position
- top_n(iris, 10, Sepal.Length): Select and order top n rows (by groups if grouped data)
Well start by describing how to subset rows based on some criteria, with the dplyr::filter() function as well as the R base function subset(). Next, well show you how to select rows randomly using sample_n() and sample_frac() functions. Finally, well describe how to select the top n elements in each group, ordered by a given variables.
Pleleminary tasks
Launch RStudio as described here: Running RStudio and setting up your working directory
Prepare your data as described here: Best practices for preparing your data and save it in an external .txt tab or .csv files
Import your data into R as described here: Fast reading of data from txt|csv files into R: readr package.
Here, well use the R built-in iris data set, which we start by converting to a tibble data frame (tbl_df). Tibble is a modern rethinking of data frame providing a nicer printing method. This is useful when working with large data sets.
# Create my_data
my_data <- iris
# Convert to a tibble
library("tibble")
my_data <- as_data_frame(my_data)
# Print
my_data
Source: local data frame [150 x 5]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fctr>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
.. ... ... ... ... ...
Install and load dplyr package
- Install dplyr
install.packages("dplyr")
- Load dplyr:
library("dplyr")
Extracting rows by position: dplyr::slice()
Select rows 1 to 6:
my_data[1:6, ]
or you can also use the function slice()[in dplyr]:
slice(my_data, 1:6)
Extracting rows by criteria: dplyr::filter()
Logical comparisons
Before continuing, we introduce the notion of logical comparisons and operators, which are important to know for filtering data.
The logical comparison operators available in R are:
- Logical comparisons
- <: for less than
- >: for greater than
- <=: for less than or equal to
- >=: for greater than or equal to
- ==: for equal to each other
- !=: not equal to each other
- %in%: group membership. For example, value %in% c(2, 3) means that value can takes 2 or 3.
- is.na(): is NA
- !is.na(): is not NA.
- Logical operators
- value == 2|3: means that the value equal 2 or (|) 3. value %in% c(2, 3) is a shortcut equivalent to value == 2|3.
- &: means and. For example sex == female & age > 25
The most frequent mistake made by beginners in R is to use = instead of == when testing for equality. Remember that, when you are testing for equality, you should always use == (not =).
Extracting rows based on logical criteria
- One-column based criteria: Extract rows where Sepal.Length > 7:
filter(my_data, Sepal.Length > 7)
Source: local data frame [12 x 5]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
(dbl) (dbl) (dbl) (dbl) (fctr)
1 7.1 3.0 5.9 2.1 virginica
2 7.6 3.0 6.6 2.1 virginica
3 7.3 2.9 6.3 1.8 virginica
4 7.2 3.6 6.1 2.5 virginica
5 7.7 3.8 6.7 2.2 virginica
6 7.7 2.6 6.9 2.3 virginica
7 7.7 2.8 6.7 2.0 virginica
8 7.2 3.2 6.0 1.8 virginica
9 7.2 3.0 5.8 1.6 virginica
10 7.4 2.8 6.1 1.9 virginica
11 7.9 3.8 6.4 2.0 virginica
12 7.7 3.0 6.1 2.3 virginica
- Multiple-column based criteria: Extract rows where Sepal.Length > 6.7 and Sepal.Width ≤ 3:
filter(my_data, Sepal.Length > 6.7, Sepal.Width <= 3)
Source: local data frame [10 x 5]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
(dbl) (dbl) (dbl) (dbl) (fctr)
1 6.8 2.8 4.8 1.4 versicolor
2 7.1 3.0 5.9 2.1 virginica
3 7.6 3.0 6.6 2.1 virginica
4 7.3 2.9 6.3 1.8 virginica
5 6.8 3.0 5.5 2.1 virginica
6 7.7 2.6 6.9 2.3 virginica
7 7.7 2.8 6.7 2.0 virginica
8 7.2 3.0 5.8 1.6 virginica
9 7.4 2.8 6.1 1.9 virginica
10 7.7 3.0 6.1 2.3 virginica
- Test for equality (==): Extract rows where Sepal.Length > 6.5 and Species = versicolor:
filter(my_data, Sepal.Length > 6.7, Species == "versicolor")
Source: local data frame [3 x 5]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
(dbl) (dbl) (dbl) (dbl) (fctr)
1 7.0 3.2 4.7 1.4 versicolor
2 6.9 3.1 4.9 1.5 versicolor
3 6.8 2.8 4.8 1.4 versicolor
- Using OR operator (|): Extract rows where Sepal.Length > 6.5 and (Species = versicolor or Species = virginica):
Use this:
filter(my_data, Sepal.Length > 6.7,
Species == "versicolor" | Species == "virginica" )
Or, equivalently, use this shortcut (%in% operator):
filter(my_data, Sepal.Length > 6.7,
Species %in% c("versicolor", "virginica" ))
Source: local data frame [20 x 5]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
(dbl) (dbl) (dbl) (dbl) (fctr)
1 7.0 3.2 4.7 1.4 versicolor
2 6.9 3.1 4.9 1.5 versicolor
3 6.8 2.8 4.8 1.4 versicolor
4 7.1 3.0 5.9 2.1 virginica
5 7.6 3.0 6.6 2.1 virginica
6 7.3 2.9 6.3 1.8 virginica
7 7.2 3.6 6.1 2.5 virginica
8 6.8 3.0 5.5 2.1 virginica
9 7.7 3.8 6.7 2.2 virginica
10 7.7 2.6 6.9 2.3 virginica
11 6.9 3.2 5.7 2.3 virginica
12 7.7 2.8 6.7 2.0 virginica
13 7.2 3.2 6.0 1.8 virginica
14 7.2 3.0 5.8 1.6 virginica
15 7.4 2.8 6.1 1.9 virginica
16 7.9 3.8 6.4 2.0 virginica
17 7.7 3.0 6.1 2.3 virginica
18 6.9 3.1 5.4 2.1 virginica
19 6.9 3.1 5.1 2.3 virginica
20 6.8 3.2 5.9 2.3 virginica
Note that, filter() works similarly to the R base function subset(), which will be described in the next sections.
Removing missing values
As described in the chapter named R programming basics, its possible to use the function is.na(x) to check whether a data contains missing value. It takes a vector x as an input and returns a logical vector in which the value TRUE specifies that the corresponding element in x is NA.
- Create a tbl with missing values using data_frame() [in dplyr]. In R NA (Not Available) is used to represent missing values:
# Create a data frame with missing data
friends_data <- data_frame(
name = c("Nicolas", "Thierry", "Bernard", "Jerome"),
age = c(27, 25, 29, 26),
height = c(180, NA, NA, 169),
married = c("yes", "yes", "no", "no")
)
# Print
friends_data
Source: local data frame [4 x 4]
name age height married
(chr) (dbl) (dbl) (chr)
1 Nicolas 27 180 yes
2 Thierry 25 NA yes
3 Bernard 29 NA no
4 Jerome 26 169 no
- Extract rows where height is NA:
filter(friends_data, is.na(height))
Source: local data frame [2 x 4]
name age height married
(chr) (dbl) (dbl) (chr)
1 Thierry 25 NA yes
2 Bernard 29 NA no
- Exclude (drop) rows where height is NA:
filter(friends_data, !is.na(height))
Source: local data frame [2 x 4]
name age height married
(chr) (dbl) (dbl) (chr)
1 Nicolas 27 180 yes
2 Jerome 26 169 no
In the R code above, !is.na() means that we dont want NAs.
Using filter() programmatically inside an R function
There are three ways to quote inputs that dplyr understands:
- With a formula, ~Sepal.Length.
- With quote(), quote(Sepal.Length).
- As a string: Sepal.Length.
# Extract rows where Sepal.Length > 7
filter_(my_data, "Sepal.Length > 7")
# Extract rows where Sepal.Length > 7 and Sepal.Width <= 3
filter_(my_data, "Sepal.Length > 7 & Sepal.Width <= 3")
# Extract rows where Sepal.Length > 6.5 and
# (Species = "versicolor" or Species = "virginica")
filter_(my_data, quote(Sepal.Length > 6.7 &
Species %in% c("versicolor", "virginica" )))
Extracting rows by criteria with R base functions: subset()
- Extract rows where Sepal.Length > 7 and Sepal.Width ≤ 3:
You can use this:
my_data[my_data$Sepal.Length > 7 & my_data$Sepal.Width <= 3, ]
Or use the R base function subset():
subset(my_data, Sepal.Length > 7 & Sepal.Width <= 3)
- Extract rows where Sepal.Length > 6.7 and (Species = versicolor or Species = virginica)
subset(my_data, Sepal.Length > 6.7,
Species %in% c("versicolor", "virginica" ))
subset() works also with vectors as follow.
my_vec <- 1:10
subset(my_vec, my_vec >5 & my_vec < 8)
[1] 6 7
Note that, R base functions require more typing than dplyr::filter(), so we recommend dplyr solutions.
Select random rows from a table
We first use the function set.seed() to initiate random number generator engine. This important for users to reproduce the analysis.
set.seed(1234)
# Extract 5 random rows without replacement
sample_n(my_data, 5, replace = FALSE)
Source: local data frame [5 x 5]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
(dbl) (dbl) (dbl) (dbl) (fctr)
1 5.1 3.5 1.4 0.3 setosa
2 5.8 2.6 4.0 1.2 versicolor
3 5.5 2.6 4.4 1.2 versicolor
4 6.1 3.0 4.6 1.4 versicolor
5 7.2 3.2 6.0 1.8 virginica
# Extract 5% of rows, randomly without replacement
sample_frac(my_data, 0.05, replace = FALSE)
Source: local data frame [8 x 5]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
(dbl) (dbl) (dbl) (dbl) (fctr)
1 5.7 2.9 4.2 1.3 versicolor
2 4.9 3.0 1.4 0.2 setosa
3 4.9 3.1 1.5 0.2 setosa
4 6.2 2.9 4.3 1.3 versicolor
5 6.6 3.0 4.4 1.4 versicolor
6 6.3 3.3 6.0 2.5 virginica
7 6.0 2.9 4.5 1.5 versicolor
8 5.0 3.5 1.3 0.3 setosa
Note that, its also possible to use the R base function sample(), but it requires more typing.
set.seed(1234)
my_data[sample(1:nrow(my_data), 5, replace = FALSE), , drop = FALSE]
Select top n rows ordered by a variable
- The format is as follow:
top_n(x, n, wt)
- x: Data table
- n: Number of rows to return. If x is grouped, this is the number of rows per group. May include more than n if there are ties.
- wt(Optional): The variable to use for ordering. If not specified, defaults to the last variable in the data table.
- Select the top 5 rows ordered by Sepal.Length
top_n(my_data, 5, Sepal.Length)
Source: local data frame [5 x 5]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
(dbl) (dbl) (dbl) (dbl) (fctr)
1 7.7 3.8 6.7 2.2 virginica
2 7.7 2.6 6.9 2.3 virginica
3 7.7 2.8 6.7 2.0 virginica
4 7.9 3.8 6.4 2.0 virginica
5 7.7 3.0 6.1 2.3 virginica
- Group by the column Species and select the top 5 of each group ordered by Sepal.Length:
my_data %>%
group_by(Species) %>%
top_n(5, Sepal.Length)
Note that, dplyr package allows to use the forward-pipe operator (%>%) for combining multiple operations. For example, x %>% f is equivalent to f(x). The output of each operation is passed to the next operation.
Summary
Filter rows by logical criteria: dplyr::filter(iris, Sepal.Length >7)
Select n random rows: dplyr::sample_n(iris, 10)
Select a random fraction of rows: dplyr::sample_frac(iris, 10)
- Select top n rows by values: dplyr::top_n(iris, 10, Sepal.Length)
Infos
This analysis has been performed using R (ver. 3.2.3).