Quantcast
Channel: Easy Guides
Viewing all articles
Browse latest Browse all 183

Subsetting Data Frame Rows in R

$
0
0



Previously, we described the essentials of R programming and provided quick start guides for importing data into R as well as converting your data into a tibble data format, which is the best and modern way to work with your data. We also described crutial steps to reshape your data with R for easier analyses.


Here, you we’ll learn how to subset (or filter) rows of a data frame based on certain criteria. This can be done easily using R functions provided by dplyr package. It’s also possible to use the R base functions subset().


Among the functions available in dplyr package, there are:

  • filter(iris, Sepal.Length >7): Extract rows based on logical criteria
  • distinct(iris): Remove duplicated rows
  • sample_n(iris, 10, replace = FALSE): Select n random rows from a table
  • sample_frac(iris, 0.5, replace = FALSE): Select a random fraction of rows
  • slice(iris, 3:8): Select rows by position
  • top_n(iris, 10, Sepal.Length): Select and order top n rows (by groups if grouped data)

We’ll start by describing how to subset rows based on some criteria, with the dplyr::filter() function as well as the R base function subset(). Next, we’ll show you how to select rows randomly using sample_n() and sample_frac() functions. Finally, we’ll describe how to select the top n elements in each group, ordered by a given variables.

Subsetting Data Frame Rows in R

Pleleminary tasks

  1. Launch RStudio as described here: Running RStudio and setting up your working directory

  2. Prepare your data as described here: Best practices for preparing your data and save it in an external .txt tab or .csv files

  3. Import your data into R as described here: Fast reading of data from txt|csv files into R: readr package.

Here, we’ll use the R built-in iris data set, which we start by converting to a tibble data frame (tbl_df). Tibble is a modern rethinking of data frame providing a nicer printing method. This is useful when working with large data sets.

# Create my_data
my_data <- iris

# Convert to a tibble
library("tibble")
my_data <- as_data_frame(my_data)

# Print
my_data
Source: local data frame [150 x 5]

   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl>  <fctr>
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           5.0         3.4          1.5         0.2  setosa
9           4.4         2.9          1.4         0.2  setosa
10          4.9         3.1          1.5         0.1  setosa
..          ...         ...          ...         ...     ...

Install and load dplyr package

  • Install dplyr
install.packages("dplyr")
  • Load dplyr:
library("dplyr")

Extracting rows by position: dplyr::slice()

Select rows 1 to 6:

my_data[1:6, ]

or you can also use the function slice()[in dplyr]:

slice(my_data, 1:6)

Extracting rows by criteria: dplyr::filter()


The function filter() is used to filter rows that meet some logical criteria.


Logical comparisons

Before continuing, we introduce the notion of logical comparisons and operators, which are important to know for filtering data.

The “logical” comparison operators available in R are:


  1. Logical comparisons
    • <: for less than
    • >: for greater than
    • <=: for less than or equal to
    • >=: for greater than or equal to
    • ==: for equal to each other
    • !=: not equal to each other
    • %in%: group membership. For example, “value %in% c(2, 3)” means that value can takes 2 or 3.
    • is.na(): is NA
    • !is.na(): is not NA.
  2. Logical operators
    • value == 2|3: means that the value equal 2 or (|) 3. value %in% c(2, 3) is a shortcut equivalent to value == 2|3.
    • &: means and. For example sex == “female” & age > 25


The most frequent mistake made by beginners in R is to use = instead of == when testing for equality. Remember that, when you are testing for equality, you should always use == (not =).

Extracting rows based on logical criteria

  • One-column based criteria: Extract rows where Sepal.Length > 7:
filter(my_data, Sepal.Length > 7)
Source: local data frame [12 x 5]

   Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
          (dbl)       (dbl)        (dbl)       (dbl)    (fctr)
1           7.1         3.0          5.9         2.1 virginica
2           7.6         3.0          6.6         2.1 virginica
3           7.3         2.9          6.3         1.8 virginica
4           7.2         3.6          6.1         2.5 virginica
5           7.7         3.8          6.7         2.2 virginica
6           7.7         2.6          6.9         2.3 virginica
7           7.7         2.8          6.7         2.0 virginica
8           7.2         3.2          6.0         1.8 virginica
9           7.2         3.0          5.8         1.6 virginica
10          7.4         2.8          6.1         1.9 virginica
11          7.9         3.8          6.4         2.0 virginica
12          7.7         3.0          6.1         2.3 virginica
  • Multiple-column based criteria: Extract rows where Sepal.Length > 6.7 and Sepal.Width ≤ 3:
filter(my_data, Sepal.Length > 6.7, Sepal.Width <= 3)
Source: local data frame [10 x 5]

   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
          (dbl)       (dbl)        (dbl)       (dbl)     (fctr)
1           6.8         2.8          4.8         1.4 versicolor
2           7.1         3.0          5.9         2.1  virginica
3           7.6         3.0          6.6         2.1  virginica
4           7.3         2.9          6.3         1.8  virginica
5           6.8         3.0          5.5         2.1  virginica
6           7.7         2.6          6.9         2.3  virginica
7           7.7         2.8          6.7         2.0  virginica
8           7.2         3.0          5.8         1.6  virginica
9           7.4         2.8          6.1         1.9  virginica
10          7.7         3.0          6.1         2.3  virginica
  • Test for equality (==): Extract rows where Sepal.Length > 6.5 and Species = “versicolor”:
filter(my_data, Sepal.Length > 6.7, Species == "versicolor")
Source: local data frame [3 x 5]

  Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
         (dbl)       (dbl)        (dbl)       (dbl)     (fctr)
1          7.0         3.2          4.7         1.4 versicolor
2          6.9         3.1          4.9         1.5 versicolor
3          6.8         2.8          4.8         1.4 versicolor
  • Using OR operator (|): Extract rows where Sepal.Length > 6.5 and (Species = “versicolor” or Species = “virginica”):

Use this:

filter(my_data, Sepal.Length > 6.7, 
       Species == "versicolor" | Species == "virginica" )

Or, equivalently, use this shortcut (%in% operator):

filter(my_data, Sepal.Length > 6.7, 
      Species %in% c("versicolor", "virginica" ))
Source: local data frame [20 x 5]

   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
          (dbl)       (dbl)        (dbl)       (dbl)     (fctr)
1           7.0         3.2          4.7         1.4 versicolor
2           6.9         3.1          4.9         1.5 versicolor
3           6.8         2.8          4.8         1.4 versicolor
4           7.1         3.0          5.9         2.1  virginica
5           7.6         3.0          6.6         2.1  virginica
6           7.3         2.9          6.3         1.8  virginica
7           7.2         3.6          6.1         2.5  virginica
8           6.8         3.0          5.5         2.1  virginica
9           7.7         3.8          6.7         2.2  virginica
10          7.7         2.6          6.9         2.3  virginica
11          6.9         3.2          5.7         2.3  virginica
12          7.7         2.8          6.7         2.0  virginica
13          7.2         3.2          6.0         1.8  virginica
14          7.2         3.0          5.8         1.6  virginica
15          7.4         2.8          6.1         1.9  virginica
16          7.9         3.8          6.4         2.0  virginica
17          7.7         3.0          6.1         2.3  virginica
18          6.9         3.1          5.4         2.1  virginica
19          6.9         3.1          5.1         2.3  virginica
20          6.8         3.2          5.9         2.3  virginica

Note that, filter() works similarly to the R base function subset(), which will be described in the next sections.

Removing missing values

As described in the chapter named R programming basics, it’s possible to use the function is.na(x) to check whether a data contains missing value. It takes a vector x as an input and returns a logical vector in which the value TRUE specifies that the corresponding element in x is NA.

  • Create a tbl with missing values using data_frame() [in dplyr]. In R NA (Not Available) is used to represent missing values:
# Create a data frame with missing data
friends_data <- data_frame(
  name = c("Nicolas", "Thierry", "Bernard", "Jerome"),
  age = c(27, 25, 29, 26),
  height = c(180, NA, NA, 169),
  married = c("yes", "yes", "no", "no")
)
# Print
friends_data
Source: local data frame [4 x 4]

     name   age height married
    (chr) (dbl)  (dbl)   (chr)
1 Nicolas    27    180     yes
2 Thierry    25     NA     yes
3 Bernard    29     NA      no
4  Jerome    26    169      no
  • Extract rows where height is NA:
filter(friends_data, is.na(height))
Source: local data frame [2 x 4]

     name   age height married
    (chr) (dbl)  (dbl)   (chr)
1 Thierry    25     NA     yes
2 Bernard    29     NA      no
  • Exclude (drop) rows where height is NA:
filter(friends_data, !is.na(height))
Source: local data frame [2 x 4]

     name   age height married
    (chr) (dbl)  (dbl)   (chr)
1 Nicolas    27    180     yes
2  Jerome    26    169      no

In the R code above, !is.na() means that “we don’t want” NAs.

Using filter() programmatically inside an R function


filter() is best-suited for interactive use. The function filter_() should be used for calling from a function. In this case the input must be “quoted”.


There are three ways to quote inputs that dplyr understands:

  • With a formula, ~Sepal.Length.
  • With quote(), quote(Sepal.Length).
  • As a string: “Sepal.Length”.
# Extract rows where Sepal.Length > 7
filter_(my_data, "Sepal.Length > 7")

# Extract rows where Sepal.Length > 7 and Sepal.Width <= 3
filter_(my_data, "Sepal.Length > 7 & Sepal.Width <= 3")

# Extract rows where Sepal.Length > 6.5 and
# (Species = "versicolor" or Species = "virginica")
filter_(my_data, quote(Sepal.Length > 6.7 & 
      Species %in% c("versicolor", "virginica" )))

Extracting rows by criteria with R base functions: subset()

  • Extract rows where Sepal.Length > 7 and Sepal.Width ≤ 3:

You can use this:

my_data[my_data$Sepal.Length > 7 & my_data$Sepal.Width <= 3, ]

Or use the R base function subset():

subset(my_data, Sepal.Length > 7 & Sepal.Width <= 3)
  • Extract rows where Sepal.Length > 6.7 and (Species = “versicolor” or Species = “virginica”)
subset(my_data, Sepal.Length > 6.7, 
      Species %in% c("versicolor", "virginica" ))

subset() works also with vectors as follow.

my_vec <- 1:10
subset(my_vec, my_vec >5 & my_vec < 8)
[1] 6 7

Note that, R base functions require more typing than dplyr::filter(), so we recommend dplyr solutions.

Select random rows from a table


It’s possible to select either n random rows with the function sample_n() or a random fraction of rows with sample_frac().


We first use the function set.seed() to initiate random number generator engine. This important for users to reproduce the analysis.

set.seed(1234)
# Extract 5 random rows without replacement
sample_n(my_data, 5, replace = FALSE)
Source: local data frame [5 x 5]

  Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
         (dbl)       (dbl)        (dbl)       (dbl)     (fctr)
1          5.1         3.5          1.4         0.3     setosa
2          5.8         2.6          4.0         1.2 versicolor
3          5.5         2.6          4.4         1.2 versicolor
4          6.1         3.0          4.6         1.4 versicolor
5          7.2         3.2          6.0         1.8  virginica
# Extract 5% of rows, randomly without replacement
sample_frac(my_data, 0.05, replace = FALSE)
Source: local data frame [8 x 5]

  Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
         (dbl)       (dbl)        (dbl)       (dbl)     (fctr)
1          5.7         2.9          4.2         1.3 versicolor
2          4.9         3.0          1.4         0.2     setosa
3          4.9         3.1          1.5         0.2     setosa
4          6.2         2.9          4.3         1.3 versicolor
5          6.6         3.0          4.4         1.4 versicolor
6          6.3         3.3          6.0         2.5  virginica
7          6.0         2.9          4.5         1.5 versicolor
8          5.0         3.5          1.3         0.3     setosa

Note that, it’s also possible to use the R base function sample(), but it requires more typing.

set.seed(1234)
my_data[sample(1:nrow(my_data), 5, replace = FALSE), , drop = FALSE]

Select top n rows ordered by a variable


As mentioned above, the function top_n(), can be used to select the top n entries in each group.


  • The format is as follow:
top_n(x, n, wt)

  • x: Data table
  • n: Number of rows to return. If x is grouped, this is the number of rows per group. May include more than n if there are ties.
  • wt(Optional): The variable to use for ordering. If not specified, defaults to the last variable in the data table.


  • Select the top 5 rows ordered by Sepal.Length
top_n(my_data, 5, Sepal.Length)
Source: local data frame [5 x 5]

  Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
         (dbl)       (dbl)        (dbl)       (dbl)    (fctr)
1          7.7         3.8          6.7         2.2 virginica
2          7.7         2.6          6.9         2.3 virginica
3          7.7         2.8          6.7         2.0 virginica
4          7.9         3.8          6.4         2.0 virginica
5          7.7         3.0          6.1         2.3 virginica
  • Group by the column Species and select the top 5 of each group ordered by Sepal.Length:
my_data %>% 
  group_by(Species) %>%
  top_n(5, Sepal.Length)

Note that, dplyr package allows to use the forward-pipe operator (%>%) for combining multiple operations. For example, x %>% f is equivalent to f(x). The output of each operation is passed to the next operation.

Summary


  • Filter rows by logical criteria: dplyr::filter(iris, Sepal.Length >7)

  • Select n random rows: dplyr::sample_n(iris, 10)

  • Select a random fraction of rows: dplyr::sample_frac(iris, 10)

  • Select top n rows by values: dplyr::top_n(iris, 10, Sepal.Length)


Infos

This analysis has been performed using R (ver. 3.2.3).


Viewing all articles
Browse latest Browse all 183

Trending Articles



<script src="https://jsc.adskeeper.com/r/s/rssing.com.1596347.js" async> </script>