
Identifying and Removing Duplicate Data in R


Previously, we described the essentials of R programming and provided quick start guides for importing data into R and for converting your data to the tibble format, a modern take on the data frame that is convenient to work with.


Here, you’ll learn how to remove duplicate data using the R base functions duplicated() and unique(), as well as the function distinct() [in the dplyr package].



Preliminary tasks

  1. Launch RStudio as described here: Running RStudio and setting up your working directory

  2. Prepare your data as described here: Best practices for preparing your data, and save it in an external tab-delimited .txt file or a .csv file

  3. Import your data into R as described here: Fast reading of data from txt|csv files into R: readr package.

Here, we’ll use the R built-in iris data set, which we start by converting to a tibble data frame (tbl_df). A tibble is a modern rethinking of the data frame that provides a nicer printing method, which is useful when working with large data sets.

# Create my_data
my_data <- iris

# Convert to a tibble
# (as_tibble() replaces as_data_frame(), which is deprecated in current tibble)
library("tibble")
my_data <- as_tibble(my_data)

# Print
my_data
Source: local data frame [150 x 5]

   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl>  <fctr>
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           5.0         3.4          1.5         0.2  setosa
9           4.4         2.9          1.4         0.2  setosa
10          4.9         3.1          1.5         0.1  setosa
..          ...         ...          ...         ...     ...

R base functions


In this section, we’ll describe the function unique() [for extracting unique elements] and the function duplicated() [for identifying duplicated elements].


Find and drop duplicate elements: duplicated()

The function duplicated() returns a logical vector in which TRUE marks each element of a vector (or each row of a data frame) that repeats an earlier one. The first occurrence of a value is always FALSE.

Given the following vector:

x <- c(1, 1, 4, 5, 4, 6)
  • To flag the duplicate elements in x, use this:
duplicated(x)
[1] FALSE  TRUE FALSE FALSE  TRUE FALSE
  • Extract duplicate elements:
x[duplicated(x)]
[1] 1 4
  • If you want to remove duplicated elements, use !duplicated(), where ! is a logical negation:
x[!duplicated(x)]
[1] 1 4 5 6
  • In the same way, you can remove duplicate rows from a data frame based on the values of one column, as follows:
# Remove duplicates based on the Sepal.Width column
my_data[!duplicated(my_data$Sepal.Width), ]
Source: local data frame [23 x 5]

   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl>  <fctr>
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           4.4         2.9          1.4         0.2  setosa
9           5.4         3.7          1.5         0.2  setosa
10          5.8         4.0          1.2         0.2  setosa
..          ...         ...          ...         ...     ...

Here, !duplicated() keeps only the first row for each distinct value of Sepal.Width and drops the later rows that repeat it.
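Because duplicated() flags only the second and later occurrences, a few base R variations are often handy. The following is a minimal sketch on the same vector x; the variable names are illustrative only.

```r
x <- c(1, 1, 4, 5, 4, 6)

# Integer positions of the duplicates (second and later occurrences)
dup_positions <- which(duplicated(x))

# Scan from the end, so that the last occurrence counts as the original
dup_from_last <- duplicated(x, fromLast = TRUE)

# Flag every copy of a repeated value, including the first occurrence
all_copies <- duplicated(x) | duplicated(x, fromLast = TRUE)

# Number of duplicated elements
n_dups <- sum(duplicated(x))
```

Indexing with all_copies, as in x[all_copies], returns every element whose value occurs more than once.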

Extract unique elements: unique()

Given the following vector:

x <- c(1, 1, 4, 5, 4, 6)

You can extract unique elements as follows:

unique(x)
[1] 1 4 5 6

It’s also possible to apply unique() to a data frame to remove duplicated rows, as follows:

unique(my_data)
Source: local data frame [149 x 5]

   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl>  <fctr>
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           5.0         3.4          1.5         0.2  setosa
9           4.4         2.9          1.4         0.2  setosa
10          4.9         3.1          1.5         0.1  setosa
..          ...         ...          ...         ...     ...
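To see which row was dropped, and to check that unique() and !duplicated() agree on a data frame, here is a small sketch using the plain iris data frame (the variable names are illustrative):

```r
df <- iris  # plain data frame copy of the built-in data set

# Row index of each row that repeats an earlier row
dup_rows <- which(duplicated(df))

# Both approaches keep the first occurrence of each row
same_result <- identical(unique(df), df[!duplicated(df), ])
```

For iris, dup_rows contains a single index, which is why 149 of the 150 rows remain.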

Remove duplicate rows using dplyr


The function distinct() in the dplyr package keeps only the unique/distinct rows of a data frame. If there are duplicate rows, only the first one is preserved. It’s an efficient version of the R base function unique().


The dplyr package can be installed and loaded as follows:

# Install
install.packages("dplyr")

# Load
library("dplyr")
  • Remove duplicate rows based on all columns:
distinct(my_data)
Source: local data frame [149 x 5]

   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          (dbl)       (dbl)        (dbl)       (dbl)  (fctr)
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           5.0         3.4          1.5         0.2  setosa
9           4.4         2.9          1.4         0.2  setosa
10          4.9         3.1          1.5         0.1  setosa
..          ...         ...          ...         ...     ...
  • Remove duplicate rows based on certain columns (variables):
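# Remove duplicated rows based on Sepal.Length
# (in dplyr 0.5.0 and later, .keep_all = TRUE is needed to keep the other columns)
distinct(my_data, Sepal.Length, .keep_all = TRUE)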
Source: local data frame [35 x 5]

   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          (dbl)       (dbl)        (dbl)       (dbl)  (fctr)
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.4         2.9          1.4         0.2  setosa
8           4.8         3.4          1.6         0.2  setosa
9           4.3         3.0          1.1         0.1  setosa
10          5.8         4.0          1.2         0.2  setosa
..          ...         ...          ...         ...     ...
# Remove duplicated rows based on
# Sepal.Length and Petal.Width
# (.keep_all = TRUE keeps the other columns in dplyr 0.5.0 and later)
distinct(my_data, Sepal.Length, Petal.Width, .keep_all = TRUE)
Source: local data frame [110 x 5]

   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          (dbl)       (dbl)        (dbl)       (dbl)  (fctr)
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           4.4         2.9          1.4         0.2  setosa
9           4.9         3.1          1.5         0.1  setosa
10          5.4         3.7          1.5         0.2  setosa
..          ...         ...          ...         ...     ...
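For comparison, the same column-based de-duplication can be sketched in base R by calling duplicated() on a subset of columns. The names df, key_cols and dedup are illustrative and not part of the original example.

```r
df <- iris
key_cols <- c("Sepal.Length", "Petal.Width")

# duplicated() on the column subset flags rows whose combination
# of key values has already been seen in an earlier row
dedup <- df[!duplicated(df[, key_cols]), ]

# One row per distinct key combination remains, with all columns kept
nrow(dedup) == nrow(unique(df[, key_cols]))
```

As with distinct(), the first row of each group is the one that is kept.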

distinct() is best suited for interactive use. For calling from a function, the function distinct_() was provided; in this case the column names must be “quoted”. Note that distinct_() is deprecated in dplyr 0.7.0 and later.


distinct_(my_data,  "Sepal.Length", "Petal.Width")
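In current dplyr, the same programmatic use can be written with tidy evaluation instead of distinct_(). The following is a minimal sketch, assuming dplyr 0.7.0 or later with the rlang package installed; distinct_by() is a hypothetical helper name, not a dplyr function.

```r
library(dplyr)

# Hypothetical helper: drop rows that repeat the same values in `cols`,
# where `cols` is a character vector of column names
distinct_by <- function(data, cols) {
  # rlang::syms() turns the character vector into symbols;
  # !!! splices them into distinct() as if they were typed unquoted
  distinct(data, !!!rlang::syms(cols), .keep_all = TRUE)
}

res <- distinct_by(iris, c("Sepal.Length", "Petal.Width"))
```

Passing the column names as a character vector keeps the helper easy to call from other code, for example with names read from a configuration file.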

Summary


  • Remove duplicate rows based on one or more column values: dplyr::distinct(my_data, Sepal.Length, .keep_all = TRUE)

  • R base function to extract unique elements from vectors and data frames: unique(my_data)

  • R base function to determine duplicate elements: duplicated(my_data)


Infos

This analysis has been performed using R (ver. 3.2.3).

