Previously, we described the essentials of R programming and provided quick start guides for importing data into R and for converting your data into the tibble format, a modern alternative to the standard data frame.

Here, you will learn how to remove duplicate data using the R base functions duplicated() and unique(), as well as the function distinct() [in the dplyr package].

Identifying and Removing Duplicate Data in R

Preliminary tasks

Launch RStudio as described here: Running RStudio and setting up your working directory
Prepare your data as described here: Best practices for preparing your data, and save it in an external tab-delimited .txt file or a .csv file
Import your data into R as described here: Fast reading of data from txt|csv files into R: readr package.

Here, we'll use the R built-in iris data set, which we start by converting to a tibble (tbl_df). A tibble is a modern rethinking of the data frame that provides a nicer printing method. This is useful when working with large data sets.

# Create my_data
my_data <- iris

# Convert to a tibble
# (as_tibble() replaces as_data_frame(), which is deprecated in recent tibble versions)
library("tibble")
my_data <- as_tibble(my_data)

# Print
my_data

Source: local data frame [150 x 5]

   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl>  <fctr>
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           5.0         3.4          1.5         0.2  setosa
9           4.4         2.9          1.4         0.2  setosa
10          4.9         3.1          1.5         0.1  setosa
..          ...         ...          ...         ...     ...

R base functions

In this section, we'll describe the function unique() [for extracting unique elements] and the function duplicated() [for identifying duplicated elements].

Find and drop duplicate elements: duplicated()

The function duplicated() returns a logical vector in which TRUE marks each element of a vector (or each row of a data frame) that repeats an earlier one. The first occurrence of a value is not flagged.

Given the following vector:

x <- c(1, 1, 4, 5, 4, 6)

To flag the duplicate elements in x, use this:

duplicated(x)

[1] FALSE  TRUE FALSE FALSE  TRUE FALSE

Extract duplicate elements:

x[duplicated(x)]

[1] 1 4

If you want to remove duplicated elements, use !duplicated(), where ! is a logical negation:

x[!duplicated(x)]

[1] 1 4 5 6
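
The calls above return a logical vector. As a minimal sketch on the same vector x, the following shows how to get the integer positions of the duplicates, how the fromLast argument of duplicated() changes which copy counts as the original, and how to extract every copy of a repeated value. The variable names are illustrative only.

```r
x <- c(1, 1, 4, 5, 4, 6)

# Integer positions of the repeated elements (second and later occurrences)
dup_positions <- which(duplicated(x))
dup_positions

# Scan from the end, so the last occurrence is treated as the original
dup_from_last <- duplicated(x, fromLast = TRUE)
dup_from_last

# Every element whose value appears more than once, first occurrences included
all_copies <- x[duplicated(x) | duplicated(x, fromLast = TRUE)]
all_copies
```

Combining both scan directions is a common way to inspect all rows involved in a duplication before deciding which one to keep.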

In the same way, you can remove duplicate rows from a data frame based on the values of a column, as follows:

# Remove duplicates based on the Sepal.Width column
my_data[!duplicated(my_data$Sepal.Width), ]

Source: local data frame [23 x 5]

   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl>  <fctr>
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           4.4         2.9          1.4         0.2  setosa
9           5.4         3.7          1.5         0.2  setosa
10          5.8         4.0          1.2         0.2  setosa
..          ...         ...          ...         ...     ...

! is a logical negation. !duplicated() means that we don't want duplicate rows.
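
The same idea extends to a combination of columns: duplicated() also accepts a data frame, so you can pass it only the columns that define a duplicate. The sketch below works on the plain iris data frame so that it is self-contained; key_cols, by_width and by_pair are names chosen for illustration.

```r
my_data <- iris

# Deduplicate on one column: keep the first row for each Sepal.Width value
by_width <- my_data[!duplicated(my_data$Sepal.Width), ]

# Deduplicate on a combination of columns by passing a sub-data-frame
key_cols <- c("Sepal.Length", "Petal.Width")
by_pair <- my_data[!duplicated(my_data[, key_cols]), ]

# Number of rows kept in each case
nrow(by_width)
nrow(by_pair)
```

In both cases all five columns are kept; only the rows are filtered.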

Extract unique elements: unique()

Given the following vector:

x <- c(1, 1, 4, 5, 4, 6)

You can extract unique elements as follows:

unique(x)

[1] 1 4 5 6

It's also possible to apply unique() to a data frame to remove duplicated rows, as follows:

unique(my_data)

Source: local data frame [149 x 5]

   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl>  <fctr>
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           5.0         3.4          1.5         0.2  setosa
9           4.4         2.9          1.4         0.2  setosa
10          4.9         3.1          1.5         0.1  setosa
..          ...         ...          ...         ...     ...
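
The output above has 149 rows instead of 150, so one row of iris is an exact copy of another. As a small sketch (again on the plain iris data frame, with illustrative variable names), you can count the duplicated rows, display every copy, and check that unique() keeps the same rows as indexing with !duplicated():

```r
my_data <- iris

# Number of rows that exactly repeat an earlier row
n_dup <- sum(duplicated(my_data))
n_dup

# Show every copy of the duplicated rows, including the first occurrence
dup_mask <- duplicated(my_data) | duplicated(my_data, fromLast = TRUE)
my_data[dup_mask, ]

# unique() and !duplicated() keep the same rows
same <- identical(unique(my_data), my_data[!duplicated(my_data), ])
same
```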

Remove duplicate rows using dplyr

The function distinct() in the dplyr package keeps only the unique/distinct rows of a data frame. If there are duplicate rows, only the first row is preserved. It's an efficient version of the R base function unique().

The dplyr package can be installed and loaded as follows:

# Install
install.packages("dplyr")

# Load
library("dplyr")

Remove duplicate rows based on all columns:

distinct(my_data)

Source: local data frame [149 x 5]

   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          (dbl)       (dbl)        (dbl)       (dbl)  (fctr)
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           5.0         3.4          1.5         0.2  setosa
9           4.4         2.9          1.4         0.2  setosa
10          4.9         3.1          1.5         0.1  setosa
..          ...         ...          ...         ...     ...
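
Before dropping rows, it can be useful to see which rows are duplicated and how often. One possible approach, sketched here with dplyr's count() and filter() on the plain iris data frame, is to count the occurrences of each full row; the column name n_copies is an arbitrary choice.

```r
library(dplyr)

my_data <- iris

# Count how often each full row occurs, then keep rows seen more than once
dup_counts <- my_data %>%
  count(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species,
        name = "n_copies") %>%
  filter(n_copies > 1)

dup_counts

# Number of rows that distinct() removes
nrow(my_data) - nrow(distinct(my_data))
```

The number of rows removed by distinct() should match the surplus copies reported in dup_counts.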

Remove duplicate rows based on certain columns (variables):

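# Remove duplicated rows based on Sepal.Length
# (.keep_all = TRUE keeps the other columns; without it,
# recent dplyr versions return only Sepal.Length)
distinct(my_data, Sepal.Length, .keep_all = TRUE)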

Source: local data frame [35 x 5]

   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          (dbl)       (dbl)        (dbl)       (dbl)  (fctr)
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.4         2.9          1.4         0.2  setosa
8           4.8         3.4          1.6         0.2  setosa
9           4.3         3.0          1.1         0.1  setosa
10          5.8         4.0          1.2         0.2  setosa
..          ...         ...          ...         ...     ...

# Remove duplicated rows based on
# Sepal.Length and Petal.Width, keeping all columns
distinct(my_data, Sepal.Length, Petal.Width, .keep_all = TRUE)

Source: local data frame [110 x 5]

   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          (dbl)       (dbl)        (dbl)       (dbl)  (fctr)
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           4.4         2.9          1.4         0.2  setosa
9           4.9         3.1          1.5         0.1  setosa
10          5.4         3.7          1.5         0.2  setosa
..          ...         ...          ...         ...     ...
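
To make the "first row is preserved" rule and the role of the .keep_all argument concrete, here is a minimal sketch on a small made-up data frame (df, id and value are hypothetical and not part of the iris example):

```r
library(dplyr)

# Small hypothetical data frame: 'id' repeats, 'value' differs
df <- data.frame(id = c("a", "a", "b"),
                 value = c(1, 2, 3),
                 stringsAsFactors = FALSE)

# Without .keep_all, recent dplyr versions return only the listed columns
ids_only <- distinct(df, id)
ids_only

# With .keep_all = TRUE, all columns are kept and the first row per id is retained
first_rows <- distinct(df, id, .keep_all = TRUE)
first_rows
```

Because the first matching row is kept, sorting the data beforehand (for example with arrange()) is one way to control which copy survives.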

distinct() is best suited for interactive use. For calling from within a function, the dplyr version used here provides distinct_(), whose column names must be quoted. Note that distinct_() is deprecated since dplyr 0.7.0 in favor of tidy evaluation.

distinct_(my_data, "Sepal.Length", "Petal.Width")
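
In current dplyr versions, one way to pass column names as strings is tidy evaluation. The following is a sketch, not the only option: distinct_by() is a hypothetical helper written for this example, and it assumes the rlang package (installed alongside dplyr) to turn the strings into symbols.

```r
library(dplyr)

# Hypothetical helper: 'cols' is a character vector of column names
distinct_by <- function(data, cols, keep_all = TRUE) {
  # rlang::syms() converts the strings to symbols; !!! splices them into the call
  distinct(data, !!!rlang::syms(cols), .keep_all = keep_all)
}

res <- distinct_by(iris, c("Sepal.Length", "Petal.Width"))
nrow(res)
ncol(res)
```

Setting keep_all = FALSE in this helper would return only the columns named in cols.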

Summary

Remove duplicate rows based on one or more column values: dplyr::distinct(my_data, Sepal.Length, .keep_all = TRUE)
R base function to extract unique elements from vectors and data frames: unique(my_data)
R base function to identify duplicate elements: duplicated(my_data)

Info

This analysis has been performed using R (ver. 3.2.3).
