
Clarifying distance measures - Unsupervised Machine Learning



Large amounts of data are collected every day from satellite images, biomedical instruments, security systems, marketing, web search, geospatial sensors and other automatic equipment. Mining knowledge from such big data far exceeds human abilities. Consequently, unsupervised machine learning tools (i.e., clustering) for discovering knowledge are becoming more and more important for big data analyses.

Clustering corresponds to a set of tools used to classify data samples into groups (i.e., clusters), where each group contains objects with similar profiles. Classifying observations into groups requires a method for measuring the distance or the (dis)similarity between the observations. In other words, no unsupervised machine learning algorithm can operate without some notion of distance.

In this article, we describe the common distance measures used for assessing similarity between observations. Some R code for computing pairwise distances between observations is also provided. You'll also learn some methods for visualizing distance measures in R software.

1 Methods for measuring distances

The choice of distance measures is a critical step in clustering. It defines how the similarity of two elements (x, y) is calculated and it will influence the shape of the clusters.

There are different solutions for measuring the distance between observations in order to define clusters.

In this section, we’ll describe the formulas of the classical measures, such as Euclidean and Manhattan distances as well as correlation-based distances.

  1. Euclidean distance:

\[ d_{euc}(x,y) = \sqrt{\sum_{i=1}^n(x_i - y_i)^2} \]

  2. Manhattan distance:

\[ d_{man}(x,y) = \sum_{i=1}^n |x_i - y_i| \]

Where, x and y are two vectors of length n.
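As a quick sanity check of these two formulas, the sums can be computed directly in R and compared with the built-in dist() function (the vectors x and y below are arbitrary examples):

```r
# Two example vectors of length n = 4
x <- c(1, 2, 3, 4)
y <- c(4, 3, 2, 1)

# Euclidean distance: square root of the sum of squared differences
sqrt(sum((x - y)^2))   # sqrt(20)

# Manhattan distance: sum of absolute differences
sum(abs(x - y))        # 8

# The built-in dist() function gives the same values
dist(rbind(x, y), method = "euclidean")
dist(rbind(x, y), method = "manhattan")
```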

Other dissimilarity measures exist such as correlation-based distances which have been widely used for microarray data analyses. Correlation-based distance is defined by subtracting the correlation coefficient from 1. Different types of correlation methods can be used such as:

  3. Pearson correlation distance:

\[ d_{cor}(x, y) = 1 - \frac{\sum\limits_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum\limits_{i=1}^n(x_i - \bar{x})^2 \sum\limits_{i=1}^n(y_i -\bar{y})^2}} \]

Pearson correlation measures the degree of a linear relationship between two profiles.

  4. Eisen cosine correlation distance (Eisen et al., 1998):

It’s a special case of Pearson’s correlation with \(\bar{x}\) and \(\bar{y}\) both replaced by zero:

\[ d_{eisen}(x, y) = 1 - \frac{\left|\sum\limits_{i=1}^n x_iy_i\right|}{\sqrt{\sum\limits_{i=1}^n x^2_i \sum\limits_{i=1}^n y^2_i}} \]

  5. Spearman correlation distance:

The Spearman correlation method computes the correlation between the ranks of the x and y variables.

\[ d_{spear}(x, y) = 1 - \frac{\sum\limits_{i=1}^n (x'_i - \bar{x'})(y'_i - \bar{y'})}{\sqrt{\sum\limits_{i=1}^n(x'_i - \bar{x'})^2 \sum\limits_{i=1}^n(y'_i -\bar{y'})^2}} \]

Where \(x'_i = rank(x_i)\) and \(y'_i = rank(y_i)\).

  6. Kendall correlation distance:

The Kendall correlation method measures the correspondence between the rankings of the x and y variables. The total number of possible pairings of x with y observations is \(n(n-1)/2\), where n is the size of x and y. Begin by ordering the pairs by the x values. If x and y are correlated, they will have the same relative rank orders. Now, for each \(y_i\), count the number of \(y_j > y_i\) with \(j > i\) (concordant pairs (c)) and the number of \(y_j < y_i\) with \(j > i\) (discordant pairs (d)).

Kendall correlation distance is defined as follows:

\[ d_{kend}(x, y) = 1 - \frac{n_c - n_d}{\frac{1}{2}n(n-1)} \]

Where,

  • \(n_c\): total number of concordant pairs
  • \(n_d\): total number of discordant pairs
  • \(n\): size of x and y
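The pair-counting procedure described above can be sketched directly in R. The brute-force function below is only an illustration of the definition; cor(x, y, method = "kendall") is the standard tool, and the two agree when there are no ties:

```r
# Brute-force Kendall distance, following the pair-counting definition above.
# This O(n^2) sketch is for illustration only.
kendall_dist <- function(x, y) {
  n <- length(x)
  # Begin by ordering the pairs by the x values
  y <- y[order(x)]
  nc <- 0; nd <- 0
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      if (y[j] > y[i]) nc <- nc + 1   # concordant pair
      if (y[j] < y[i]) nd <- nd + 1   # discordant pair
    }
  }
  1 - (nc - nd) / (n * (n - 1) / 2)
}

x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 5)
kendall_dist(x, y)                      # 0.4
1 - cor(x, y, method = "kendall")       # 0.4 (same value, no ties here)
```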

Note that,

  • Pearson correlation analysis is the most commonly used method. It is also known as parametric correlation, because it depends on the distribution of the data.
  • Kendall and Spearman correlations are non-parametric; they are used to perform rank-based correlation analyses.

In the formulas above, x and y are two vectors of length n, with means \(\bar{x}\) and \(\bar{y}\), respectively. The distance between x and y is denoted \(d(x, y)\).

2 Distances and scaling

The value of distance measures is intimately related to the scale on which measurements are made. Therefore, variables are often scaled (i.e. standardized) before measuring the inter-observation dissimilarities. Generally variables are scaled to have standard deviation one and mean zero.

Why transforming the data?

The goal is to make the variables comparable, so that they have equal importance in the clustering algorithm. This is particularly recommended when variables are measured on different scales (e.g., kilograms, kilometers, centimeters, …); otherwise, the dissimilarity measures obtained will be severely affected.

The standardization of data is an approach widely used in the context of gene expression data analysis before clustering.

We might also want to scale the data when the mean and/or the standard deviation of variables are largely different.

Note also that, standardization makes the four distance measure methods - Euclidean, Manhattan, Correlation and Eisen - more similar than they would be with non-transformed data.

This issue of whether or not to scale the data before performing the analysis applies to any clustering method (e.g., k-means, hierarchical clustering, …).

When scaling variables, the data can be transformed as follows:

\[ \frac{x_i - center(x)}{scale(x)} \]

Where \(center(x)\) can be the mean or the median of the x values, and \(scale(x)\) can be the standard deviation (SD), the interquartile range, or the MAD (median absolute deviation).

The R base scale() function can be used to standardize the data. It takes a numeric matrix as an input and performs the scaling on the columns.
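To make the link between the formula and scale() explicit, the sketch below standardizes the columns of the USArrests dataset manually using the mean and SD, and also shows a more robust variant using the median and MAD as \(center(x)\) and \(scale(x)\):

```r
data(USArrests)

# Standard scaling: subtract the column mean, divide by the column SD
manual <- sweep(USArrests, 2, colMeans(USArrests), "-")
manual <- sweep(manual, 2, apply(USArrests, 2, sd), "/")

# scale() performs the same transformation
all.equal(as.matrix(manual), scale(USArrests), check.attributes = FALSE)  # TRUE

# Robust alternative: median as center(x), MAD as scale(x)
robust <- sweep(USArrests, 2, apply(USArrests, 2, median), "-")
robust <- sweep(robust, 2, apply(USArrests, 2, mad), "/")
```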

3 Data preparation

The built-in R dataset USArrests (violent crime rates by US state) will be used in this section. It contains statistics, in arrests per 100,000 residents, for assault, murder, and rape in each of the 50 US states in 1973. The percent of the population living in urban areas is also provided.

We’ll use only a subset of the data, by taking 10 random rows among the 50 rows in the dataset. This is done using the function sample(). The data can be prepared as follows:

# Load the dataset
data(USArrests)

# Subset of the data
set.seed(123)
ss <- sample(1:50, 10) # Take 10 random rows
df <- USArrests[ss, ] # Subset the 10 rows

# Remove any missing values (i.e., NA values for "not available")
# that might be present in the data
df <- na.omit(df)

# View the first 6 rows of the data
head(df, n = 6)
##              Murder Assault UrbanPop Rape
## Iowa            2.2      56       57 11.3
## Rhode Island    3.4     174       87  8.3
## Maryland       11.3     300       67 27.8
## Tennessee      13.2     188       59 26.9
## Utah            3.2     120       80 22.9
## Arizona         8.1     294       80 31.0

In this dataset, columns are variables and rows are observations (i.e., samples).

To inspect the data before computing distance measures, we’ll compute some descriptive statistics, such as the mean and the standard deviation of the variables.

3.1 Descriptive statistics

The apply() function is used to apply a given function (e.g., min(), max(), mean(), …) on the dataset. The second argument can take the value of:

  • 1: for applying the function on the rows
  • 2: for applying the function on the columns

desc_stats <- data.frame(
  Min = apply(USArrests, 2, min), # minimum
  Med = apply(USArrests, 2, median), # median
  Mean = apply(USArrests, 2, mean), # mean
  SD = apply(USArrests, 2, sd), # Standard deviation
  Max = apply(USArrests, 2, max) # Maximum
  )
desc_stats <- round(desc_stats, 1)
head(desc_stats)
##           Min   Med  Mean   SD   Max
## Murder    0.8   7.2   7.8  4.4  17.4
## Assault  45.0 159.0 170.8 83.3 337.0
## UrbanPop 32.0  66.0  65.5 14.5  91.0
## Rape      7.3  20.1  21.2  9.4  46.0

Note that the variables have largely different means and variances. They must be standardized to make them comparable.

Recall that standardization consists of transforming the variables such that they have mean zero and standard deviation one. The scale() function can be used as follows:

df.scaled <- scale(df)
head(round(df.scaled, 2))
##              Murder Assault UrbanPop  Rape
## Iowa          -0.95   -1.21    -0.62 -0.83
## Rhode Island  -0.72    0.06     1.58 -1.18
## Maryland       0.82    1.42     0.12  1.08
## Tennessee      1.19    0.21    -0.47  0.98
## Utah          -0.75   -0.52     1.07  0.51
## Arizona        0.20    1.35     1.07  1.45

4 R functions for computing distances

There are many functions to compute pairwise distances in R:

  • The standard dist() function [in stats package]
  • The function daisy() [in cluster package]

4.1 The standard dist() function

The standard R dist() function computes and returns pairwise distances using the specified distance measure. It returns an object of class dist containing the distances between the rows of the data, which can be either a matrix or a data frame.

A simplified format is:

dist(x, method = "euclidean")

  • x: a numeric matrix or a data frame
  • method: possible values include “euclidean”, “manhattan”, “maximum”, “canberra”, “binary” and “minkowski”


We want to compute pairwise distances between observations:

# Compute Euclidean pairwise distances
dist.eucl <- dist(df.scaled, method = "euclidean")
# View a subset of the distance matrix
round(as.matrix(dist.eucl)[1:6, 1:6], 1)
##              Iowa Rhode Island Maryland Tennessee Utah Arizona
## Iowa          0.0          2.6      3.8       3.1  2.3     4.0
## Rhode Island  2.6          0.0      3.4       3.5  1.9     3.1
## Maryland      3.8          3.4      0.0       1.4  2.7     1.2
## Tennessee     3.1          3.5      1.4       0.0  2.6     2.2
## Utah          2.3          1.9      2.7       2.6  0.0     2.3
## Arizona       4.0          3.1      1.2       2.2  2.3     0.0

In this dataset, the columns are variables. Hence, if we want to compute pairwise distances between variables, we must start by transposing the data, so that the variables are in the rows, before using the dist() function. The function t() is used for transposing the data.
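For instance, pairwise Euclidean distances between the four variables of the scaled data can be computed like this (the first lines simply rebuild the scaled subset used above, so that the snippet is self-contained):

```r
# Rebuild the scaled subset of USArrests used above
data(USArrests)
set.seed(123)
df.scaled <- scale(USArrests[sample(1:50, 10), ])

# Transpose so that variables become rows, then compute
# distances between variables instead of between observations
dist.var <- dist(t(df.scaled), method = "euclidean")
round(as.matrix(dist.var), 1)
```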

4.2 Correlation-based distance measures

In the example above, Euclidean distance has been used for measuring the dissimilarities between observations.

In this section, we’ll use a correlation-based distance, which can be computed using the function as.dist().

We start by computing the pairwise correlation matrix using the function cor(x, method). The correlation method can be either “pearson”, “spearman” or “kendall”. Next, the correlation matrix is converted to a distance matrix using the function as.dist().

The function cor() computes pairwise correlation coefficients between the columns of the data. In our case, columns are variables, but we want correlation coefficients between observations. So, the data must first be transposed using the function t(), in order to have observations in the columns of the data:

# Compute correlation matrix
res.cor <- cor(t(df.scaled),  method = "pearson")
# Compute distance matrix
dist.cor <- as.dist(1 - res.cor)

round(as.matrix(dist.cor)[1:6, 1:6], 1)
##              Iowa Rhode Island Maryland Tennessee Utah Arizona
## Iowa          0.0          0.6      1.9       1.3  0.2     1.0
## Rhode Island  0.6          0.0      1.7       1.9  0.5     0.9
## Maryland      1.9          1.7      0.0       0.5  1.7     0.7
## Tennessee     1.3          1.9      0.5       0.0  1.6     1.4
## Utah          0.2          0.5      1.7       1.6  0.0     0.5
## Arizona       1.0          0.9      0.7       1.4  0.5     0.0

4.3 The function daisy() in cluster package

The function daisy() can also be used to compute dissimilarity matrices between observations. It also returns the distances between the rows of the input data, which can be a matrix or a data frame.

Compared to dist(), whose input must be numeric, the main feature of daisy() is its ability to handle other variable types as well (e.g., nominal, ordinal, (a)symmetric binary). In that case, Gower’s coefficient will be automatically used as the metric. It’s one of the most popular measures of proximity for mixed data types. For more details, read the R documentation for the daisy() function (?daisy).

A simplified format of daisy() function is:

daisy(x, metric = c("euclidean", "manhattan", "gower"),
      stand = FALSE)

  • x: numeric matrix or data frame. Dissimilarities will be computed between the rows of x. If x is a data frame, columns of class factor are considered as nominal variables and columns of class ordered are recognized as ordinal variables.
  • metric: The metric to be used for distance measures. Possible values are “euclidean”, “manhattan” and “gower”. “Gower’s distance” is chosen by metric “gower” or automatically if some columns of x are not numeric.
  • stand: if TRUE, then the measurements in x are standardized before calculating the dissimilarities. Measurements are standardized for each variable (column), by subtracting the variable’s mean value and dividing by the variable’s mean absolute deviation.


The R code below applies daisy() on the flower data, which contains factor, ordered and numeric variables:

library(cluster)
# Load data
data(flower)
head(flower)
##   V1 V2 V3 V4 V5 V6  V7 V8
## 1  0  1  1  4  3 15  25 15
## 2  1  0  0  2  1  3 150 50
## 3  0  1  0  3  3  1 150 50
## 4  0  0  1  4  2 16 125 50
## 5  0  1  0  5  2  2  20 15
## 6  0  1  0  4  3 12  50 40
# Data structure
str(flower)
## 'data.frame':    18 obs. of  8 variables:
##  $ V1: Factor w/ 2 levels "0","1": 1 2 1 1 1 1 1 1 2 2 ...
##  $ V2: Factor w/ 2 levels "0","1": 2 1 2 1 2 2 1 1 2 2 ...
##  $ V3: Factor w/ 2 levels "0","1": 2 1 1 2 1 1 1 2 1 1 ...
##  $ V4: Factor w/ 5 levels "1","2","3","4",..: 4 2 3 4 5 4 4 2 3 5 ...
##  $ V5: Ord.factor w/ 3 levels "1"<"2"<"3": 3 1 3 2 2 3 3 2 1 2 ...
##  $ V6: Ord.factor w/ 18 levels "1"<"2"<"3"<"4"<..: 15 3 1 16 2 12 13 7 4 14 ...
##  $ V7: num  25 150 150 125 20 50 40 100 25 100 ...
##  $ V8: num  15 50 50 50 15 40 20 15 15 60 ...
# Distance matrix
dd <- as.matrix(daisy(flower))
head(round(dd[, 1:6], 2))
##      1    2    3    4    5    6
## 1 0.00 0.89 0.53 0.35 0.41 0.23
## 2 0.89 0.00 0.51 0.55 0.62 0.66
## 3 0.53 0.51 0.00 0.57 0.37 0.30
## 4 0.35 0.55 0.57 0.00 0.64 0.42
## 5 0.41 0.62 0.37 0.64 0.00 0.34
## 6 0.23 0.66 0.30 0.42 0.34 0.00

5 Visualizing distance matrices

A simple solution for visualizing the distance matrices is to use the function corrplot() [in the corrplot package]. Other, more specialized methods, such as hierarchical clustering dendrograms or heatmaps, will be comprehensively described in other chapters. A brief introduction is provided here using the following R code.

# install.packages("corrplot")
library("corrplot")
# Euclidean distance
corrplot(as.matrix(dist.eucl), is.corr = FALSE, method = "color")

[Plot: Euclidean distance matrix visualized with corrplot()]

# Visualize only the upper triangle
corrplot(as.matrix(dist.eucl), is.corr = FALSE, method = "color",
         order="hclust", type = "upper")

[Plot: upper triangle of the Euclidean distance matrix, reordered by hierarchical clustering]

# Use a hierarchical clustering dendrogram to visualize clusters
# of similar observations
plot(hclust(dist.eucl, method = "ward.D2"))

[Plot: hierarchical clustering dendrogram (Ward’s method)]

# Use heatmap
heatmap(as.matrix(dist.eucl), symm = TRUE,
        distfun = function(x) as.dist(x))

[Plot: heatmap of the Euclidean distance matrix]

6 Infos

This analysis has been performed using R software (ver. 3.2.1)

