Previously, we described the essentials of R programming and provided quick start guides for importing data into R as well as converting your data into a tibble data format, which is the best and modern way to work with your data.
Pleleminary tasks
Launch RStudio as described here: Running RStudio and setting up your working directory
Prepare your data as described here: Best practices for preparing your data and save it in an external .txt tab or .csv files
Import your data into R as described here: Fast reading of data from txt|csv files into R: readr package.
Here, well use the R built-in iris data set, which we start by converting to a tibble data frame (tbl_df). Tibble is a modern rethinking of data frame providing a nicer printing method. This is useful when working with large data sets.
# Create my_data
my_data <- iris
# Convert to a tibble
library("tibble")
my_data <- as_data_frame(my_data)
# Print
my_data
Source: local data frame [150 x 5]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fctr>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
.. ... ... ... ... ...
R base functions
Find and drop duplicate elements: duplicated()
The function duplicated() returns a logical vector where TRUE specifies which elements of a vector or data frame are duplicates.
Given the following vector:
x <- c(1, 1, 4, 5, 4, 6)
- To find the position of duplicate elements in x, use this:
duplicated(x)
[1] FALSE TRUE FALSE FALSE TRUE FALSE
- Extract duplicate elements:
x[duplicated(x)]
[1] 1 4
- If you want to remove duplicated elements, use !duplicated(), where ! is a logical negation:
x[!duplicated(x)]
[1] 1 4 5 6
- Following this way, you can remove duplicate rows from a data frame based on a column values, as follow:
# Remove duplicates based on Sepal.Width columns
my_data[!duplicated(my_data$Sepal.Width), ]
Source: local data frame [23 x 5]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fctr>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 4.4 2.9 1.4 0.2 setosa
9 5.4 3.7 1.5 0.2 setosa
10 5.8 4.0 1.2 0.2 setosa
.. ... ... ... ... ...
! is a logical negation. !duplicated() means that we dont want duplicate rows.
Extract unique elements: unique()
Given the following vector:
x <- c(1, 1, 4, 5, 4, 6)
You can extract unique elements as follow:
unique(x)
[1] 1 4 5 6
Its also possible to apply unique() on a data frame, for removing duplicated rows as follow:
unique(my_data)
Source: local data frame [149 x 5]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fctr>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
.. ... ... ... ... ...
Remove duplicate rows using dplyr
The dplyr package can be loaded and installed as follow:
# Install
install.packages("dplyr")
# Load
library("dplyr")
- Remove duplicate rows based on all columns:
distinct(my_data)
Source: local data frame [149 x 5]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
(dbl) (dbl) (dbl) (dbl) (fctr)
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
.. ... ... ... ... ...
- Remove duplicate rows based on certain columns (variables):
# Remove duplicated rows based on Sepal.Length
distinct(my_data, Sepal.Length)
Source: local data frame [35 x 5]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
(dbl) (dbl) (dbl) (dbl) (fctr)
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.4 2.9 1.4 0.2 setosa
8 4.8 3.4 1.6 0.2 setosa
9 4.3 3.0 1.1 0.1 setosa
10 5.8 4.0 1.2 0.2 setosa
.. ... ... ... ... ...
# Remove duplicated rows based on
# Sepal.Length and Petal.Width
distinct(my_data, Sepal.Length, Petal.Width)
Source: local data frame [110 x 5]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
(dbl) (dbl) (dbl) (dbl) (fctr)
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 4.4 2.9 1.4 0.2 setosa
9 4.9 3.1 1.5 0.1 setosa
10 5.4 3.7 1.5 0.2 setosa
.. ... ... ... ... ...
distinct_(my_data, "Sepal.Length", "Petal.Width")
Summary
Remove duplicate rows based on one or more column values: dplyr::distinct(my_data, Sepal.Length)
R base function to extract unique elements from vectors and data frames: unique(my_data)
- R base function to determine duplicate elements: duplicated(my_data)
Infos
This analysis has been performed using R (ver. 3.2.3).