Previously, we described the essentials of R programming and provided quick start guides for importing data into R as well as converting your data into a tibble data format, which is the best and modern way to work with your data. We also described crutial steps to reshape your data with R for easier analyses.

Here, you well learn how to subset (or filter) rows of a data frame based on certain criteria. This can be done easily using R functions provided by dplyr package. Its also possible to use the R base functions subset().

Among the functions available in dplyr package, there are:

filter(iris, Sepal.Length >7): Extract rows based on logical criteria
distinct(iris): Remove duplicated rows
sample_n(iris, 10, replace = FALSE): Select n random rows from a table
sample_frac(iris, 0.5, replace = FALSE): Select a random fraction of rows
slice(iris, 3:8): Select rows by position
top_n(iris, 10, Sepal.Length): Select and order top n rows (by groups if grouped data)

Well start by describing how to subset rows based on some criteria, with the dplyr::filter() function as well as the R base function subset(). Next, well show you how to select rows randomly using sample_n() and sample_frac() functions. Finally, well describe how to select the top n elements in each group, ordered by a given variables.

Subsetting Data Frame Rows in R

Pleleminary tasks

Launch RStudio as described here: Running RStudio and setting up your working directory
Prepare your data as described here: Best practices for preparing your data and save it in an external .txt tab or .csv files
Import your data into R as described here: Fast reading of data from txt|csv files into R: readr package.

Here, well use the R built-in iris data set, which we start by converting to a tibble data frame (tbl_df). Tibble is a modern rethinking of data frame providing a nicer printing method. This is useful when working with large data sets.

# Create my_data
my_data <- iris

# Convert to a tibble
library("tibble")
my_data <- as_data_frame(my_data)

# Print
my_data

Source: local data frame [150 x 5]

   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl>  <fctr>
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           5.0         3.4          1.5         0.2  setosa
9           4.4         2.9          1.4         0.2  setosa
10          4.9         3.1          1.5         0.1  setosa
..          ...         ...          ...         ...     ...

Install and load dplyr package

Install dplyr

install.packages("dplyr")

Load dplyr:

library("dplyr")

Extracting rows by position: dplyr::slice()

Select rows 1 to 6:

my_data[1:6, ]

or you can also use the function slice()[in dplyr]:

slice(my_data, 1:6)

Extracting rows by criteria: dplyr::filter()

The function filter() is used to filter rows that meet some logical criteria.

Logical comparisons

Before continuing, we introduce the notion of logical comparisons and operators, which are important to know for filtering data.

The logical comparison operators available in R are:

Logical comparisons
- <: for less than
- >: for greater than
- <=: for less than or equal to
- >=: for greater than or equal to
- ==: for equal to each other
- !=: not equal to each other
- %in%: group membership. For example, value %in% c(2, 3) means that value can takes 2 or 3.
- is.na(): is NA
- !is.na(): is not NA.
Logical operators
- value == 2|3: means that the value equal 2 or (|) 3. value %in% c(2, 3) is a shortcut equivalent to value == 2|3.
- &: means and. For example sex == female & age > 25

The most frequent mistake made by beginners in R is to use = instead of == when testing for equality. Remember that, when you are testing for equality, you should always use == (not =).

Extracting rows based on logical criteria

One-column based criteria: Extract rows where Sepal.Length > 7:

filter(my_data, Sepal.Length > 7)

Source: local data frame [12 x 5]

   Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
          (dbl)       (dbl)        (dbl)       (dbl)    (fctr)
1           7.1         3.0          5.9         2.1 virginica
2           7.6         3.0          6.6         2.1 virginica
3           7.3         2.9          6.3         1.8 virginica
4           7.2         3.6          6.1         2.5 virginica
5           7.7         3.8          6.7         2.2 virginica
6           7.7         2.6          6.9         2.3 virginica
7           7.7         2.8          6.7         2.0 virginica
8           7.2         3.2          6.0         1.8 virginica
9           7.2         3.0          5.8         1.6 virginica
10          7.4         2.8          6.1         1.9 virginica
11          7.9         3.8          6.4         2.0 virginica
12          7.7         3.0          6.1         2.3 virginica

Multiple-column based criteria: Extract rows where Sepal.Length > 6.7 and Sepal.Width ≤ 3:

filter(my_data, Sepal.Length > 6.7, Sepal.Width <= 3)

Source: local data frame [10 x 5]

   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
          (dbl)       (dbl)        (dbl)       (dbl)     (fctr)
1           6.8         2.8          4.8         1.4 versicolor
2           7.1         3.0          5.9         2.1  virginica
3           7.6         3.0          6.6         2.1  virginica
4           7.3         2.9          6.3         1.8  virginica
5           6.8         3.0          5.5         2.1  virginica
6           7.7         2.6          6.9         2.3  virginica
7           7.7         2.8          6.7         2.0  virginica
8           7.2         3.0          5.8         1.6  virginica
9           7.4         2.8          6.1         1.9  virginica
10          7.7         3.0          6.1         2.3  virginica

Test for equality (==): Extract rows where Sepal.Length > 6.5 and Species = versicolor:

filter(my_data, Sepal.Length > 6.7, Species == "versicolor")

Source: local data frame [3 x 5]

  Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
         (dbl)       (dbl)        (dbl)       (dbl)     (fctr)
1          7.0         3.2          4.7         1.4 versicolor
2          6.9         3.1          4.9         1.5 versicolor
3          6.8         2.8          4.8         1.4 versicolor

Using OR operator (|): Extract rows where Sepal.Length > 6.5 and (Species = versicolor or Species = virginica):

Use this:

filter(my_data, Sepal.Length > 6.7, 
       Species == "versicolor" | Species == "virginica" )

Or, equivalently, use this shortcut (%in% operator):

filter(my_data, Sepal.Length > 6.7, 
      Species %in% c("versicolor", "virginica" ))

Source: local data frame [20 x 5]

   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
          (dbl)       (dbl)        (dbl)       (dbl)     (fctr)
1           7.0         3.2          4.7         1.4 versicolor
2           6.9         3.1          4.9         1.5 versicolor
3           6.8         2.8          4.8         1.4 versicolor
4           7.1         3.0          5.9         2.1  virginica
5           7.6         3.0          6.6         2.1  virginica
6           7.3         2.9          6.3         1.8  virginica
7           7.2         3.6          6.1         2.5  virginica
8           6.8         3.0          5.5         2.1  virginica
9           7.7         3.8          6.7         2.2  virginica
10          7.7         2.6          6.9         2.3  virginica
11          6.9         3.2          5.7         2.3  virginica
12          7.7         2.8          6.7         2.0  virginica
13          7.2         3.2          6.0         1.8  virginica
14          7.2         3.0          5.8         1.6  virginica
15          7.4         2.8          6.1         1.9  virginica
16          7.9         3.8          6.4         2.0  virginica
17          7.7         3.0          6.1         2.3  virginica
18          6.9         3.1          5.4         2.1  virginica
19          6.9         3.1          5.1         2.3  virginica
20          6.8         3.2          5.9         2.3  virginica

Note that, filter() works similarly to the R base function subset(), which will be described in the next sections.

Removing missing values

As described in the chapter named R programming basics, its possible to use the function is.na(x) to check whether a data contains missing value. It takes a vector x as an input and returns a logical vector in which the value TRUE specifies that the corresponding element in x is NA.

Create a tbl with missing values using data_frame() [in dplyr]. In R NA (Not Available) is used to represent missing values:

# Create a data frame with missing data
friends_data <- data_frame(
  name = c("Nicolas", "Thierry", "Bernard", "Jerome"),
  age = c(27, 25, 29, 26),
  height = c(180, NA, NA, 169),
  married = c("yes", "yes", "no", "no")
)
# Print
friends_data

Source: local data frame [4 x 4]

     name   age height married
    (chr) (dbl)  (dbl)   (chr)
1 Nicolas    27    180     yes
2 Thierry    25     NA     yes
3 Bernard    29     NA      no
4  Jerome    26    169      no

Extract rows where height is NA:

filter(friends_data, is.na(height))

Source: local data frame [2 x 4]

     name   age height married
    (chr) (dbl)  (dbl)   (chr)
1 Thierry    25     NA     yes
2 Bernard    29     NA      no

Exclude (drop) rows where height is NA:

filter(friends_data, !is.na(height))

Source: local data frame [2 x 4]

     name   age height married
    (chr) (dbl)  (dbl)   (chr)
1 Nicolas    27    180     yes
2  Jerome    26    169      no

In the R code above, !is.na() means that we dont want NAs.

Using filter() programmatically inside an R function

filter() is best-suited for interactive use. The function filter_() should be used for calling from a function. In this case the input must be quoted.

There are three ways to quote inputs that dplyr understands:

With a formula, ~Sepal.Length.
With quote(), quote(Sepal.Length).
As a string: Sepal.Length.

# Extract rows where Sepal.Length > 7
filter_(my_data, "Sepal.Length > 7")

# Extract rows where Sepal.Length > 7 and Sepal.Width <= 3
filter_(my_data, "Sepal.Length > 7 & Sepal.Width <= 3")

# Extract rows where Sepal.Length > 6.5 and
# (Species = "versicolor" or Species = "virginica")
filter_(my_data, quote(Sepal.Length > 6.7 & 
      Species %in% c("versicolor", "virginica" )))

Extracting rows by criteria with R base functions: subset()

Extract rows where Sepal.Length > 7 and Sepal.Width ≤ 3:

You can use this:

my_data[my_data$Sepal.Length > 7 & my_data$Sepal.Width <= 3, ]

Or use the R base function subset():

subset(my_data, Sepal.Length > 7 & Sepal.Width <= 3)

Extract rows where Sepal.Length > 6.7 and (Species = versicolor or Species = virginica)

subset(my_data, Sepal.Length > 6.7, 
      Species %in% c("versicolor", "virginica" ))

subset() works also with vectors as follow.

my_vec <- 1:10
subset(my_vec, my_vec >5 & my_vec < 8)

[1] 6 7

Note that, R base functions require more typing than dplyr::filter(), so we recommend dplyr solutions.

Select random rows from a table

Its possible to select either n random rows with the function sample_n() or a random fraction of rows with sample_frac().

We first use the function set.seed() to initiate random number generator engine. This important for users to reproduce the analysis.

set.seed(1234)
# Extract 5 random rows without replacement
sample_n(my_data, 5, replace = FALSE)

Source: local data frame [5 x 5]

  Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
         (dbl)       (dbl)        (dbl)       (dbl)     (fctr)
1          5.1         3.5          1.4         0.3     setosa
2          5.8         2.6          4.0         1.2 versicolor
3          5.5         2.6          4.4         1.2 versicolor
4          6.1         3.0          4.6         1.4 versicolor
5          7.2         3.2          6.0         1.8  virginica

# Extract 5% of rows, randomly without replacement
sample_frac(my_data, 0.05, replace = FALSE)

Source: local data frame [8 x 5]

  Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
         (dbl)       (dbl)        (dbl)       (dbl)     (fctr)
1          5.7         2.9          4.2         1.3 versicolor
2          4.9         3.0          1.4         0.2     setosa
3          4.9         3.1          1.5         0.2     setosa
4          6.2         2.9          4.3         1.3 versicolor
5          6.6         3.0          4.4         1.4 versicolor
6          6.3         3.3          6.0         2.5  virginica
7          6.0         2.9          4.5         1.5 versicolor
8          5.0         3.5          1.3         0.3     setosa

Note that, its also possible to use the R base function sample(), but it requires more typing.

set.seed(1234)
my_data[sample(1:nrow(my_data), 5, replace = FALSE), , drop = FALSE]

Select top n rows ordered by a variable

As mentioned above, the function top_n(), can be used to select the top n entries in each group.

The format is as follow:

top_n(x, n, wt)

x: Data table
n: Number of rows to return. If x is grouped, this is the number of rows per group. May include more than n if there are ties.
wt(Optional): The variable to use for ordering. If not specified, defaults to the last variable in the data table.

Select the top 5 rows ordered by Sepal.Length

top_n(my_data, 5, Sepal.Length)

Source: local data frame [5 x 5]

  Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
         (dbl)       (dbl)        (dbl)       (dbl)    (fctr)
1          7.7         3.8          6.7         2.2 virginica
2          7.7         2.6          6.9         2.3 virginica
3          7.7         2.8          6.7         2.0 virginica
4          7.9         3.8          6.4         2.0 virginica
5          7.7         3.0          6.1         2.3 virginica

Group by the column Species and select the top 5 of each group ordered by Sepal.Length:

my_data %>% 
  group_by(Species) %>%
  top_n(5, Sepal.Length)

Note that, dplyr package allows to use the forward-pipe operator (%>%) for combining multiple operations. For example, x %>% f is equivalent to f(x). The output of each operation is passed to the next operation.

Summary

Filter rows by logical criteria: dplyr::filter(iris, Sepal.Length >7)
Select n random rows: dplyr::sample_n(iris, 10)
Select a random fraction of rows: dplyr::sample_frac(iris, 10)
Select top n rows by values: dplyr::top_n(iris, 10, Sepal.Length)

Infos

This analysis has been performed using R (ver. 3.2.3).

Subsetting Data Frame Rows in R

Pleleminary tasks

Install and load dplyr package

Extracting rows by position: dplyr::slice()

Extracting rows by criteria: dplyr::filter()

Logical comparisons

Extracting rows based on logical criteria

Removing missing values

Using filter() programmatically inside an R function

Extracting rows by criteria with R base functions: subset()

Select random rows from a table

Select top n rows ordered by a variable

Summary

Infos

Trending Articles

Bath man appears in court charged with attempted murder of a man...

MACLEAN, Allan

Black Angus Grilled Artichokes

Practice Sheet of Right form of verbs for HSC Students

Police blotter for Jan. 12

99 God Status for Whatsapp, Facebook

Rajasthan Board 12th Science Result 2018 name wise- RBSE 12th commerce result...

Notorious Naushad of Ippa gang nabbed

Child Kidnapping: Amy McNeil was kidnapped on her way to school by 5 adults;...

Sonible Smartlimit v1.1.5-R2R

NCERT Solutions for Class 9th Sanskrit Chapter 3 पाथेयम्

मतलबी दोस्त स्टेट्स | Matlabi Dost Status in Hindi – Selfish Friends Status

Arrow Flash 2 – Sinhala Dubbed – Episode 23 – 20th March 2016

[GET] AI Traffic Goldmine

[E² Plugin] HDF-Radio

Universal Multi-Patch v1.3 By RADIXX11

IWAN – Thanks and Praise ( Throw Back Thursday )

RONALD P SONDERGAARD Arrested by Miami-Dade County Corrections on Mar 03, 2017

मुख मैथुन से उठाएं सेक्स का भरपूर मज़ा, जानें क्या है इसका सही तरीकामुख मैथुन...

HSSC Excise & Taxation Inspector Result 2017 Scorecard/ Category Wise Merit List

Pleleminary tasks

Install and load dplyr package

Extracting rows by position: dplyr::slice()

Extracting rows by criteria: dplyr::filter()

Logical comparisons

Extracting rows based on logical criteria

Removing missing values

Using filter() programmatically inside an R function

Extracting rows by criteria with R base functions: subset()

Select random rows from a table

Select top n rows ordered by a variable

Summary

Related articles

Infos

Trending Articles