Kruskal-Wallis Test in R

May 21, 2020, 11:35 pm

What is Kruskal-Wallis test?

The Kruskal-Wallis test by rank is a non-parametric alternative to the one-way ANOVA test. It extends the two-sample Wilcoxon test to situations where there are more than two groups. It’s recommended when the assumptions of the one-way ANOVA test are not met. This tutorial describes how to compute the Kruskal-Wallis test in R.

Visualize your data and compute Kruskal-Wallis test in R

Import your data into R

Prepare your data as specified here: Best practices for preparing your data set for R
Save your data in an external tab-delimited .txt file or a .csv file
Import your data into R as follows:

# If .txt tab file, use this
my_data <- read.delim(file.choose())
# Or, if .csv file, use this
my_data <- read.csv(file.choose())
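
One caveat when importing: since R 4.0.0, read.csv() and read.delim() keep text columns as character vectors by default, so a grouping column is not a factor unless you ask for it. The sketch below is only an illustration: it writes PlantGrowth to a temporary file (a stand-in for your own data file) and reads it back with stringsAsFactors = TRUE.

```r
# Write the built-in data to a temporary .csv file (stand-in for your own file)
tmp_file <- tempfile(fileext = ".csv")
write.csv(PlantGrowth, tmp_file, row.names = FALSE)

# stringsAsFactors = TRUE converts text columns such as "group" to factors
imported <- read.csv(tmp_file, stringsAsFactors = TRUE)
str(imported)
```

Without that argument, levels(imported$group) would return NULL on recent R versions, because the column would be a plain character vector.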

Here, we’ll use the built-in R data set named PlantGrowth. It contains the weight of plants obtained under a control and two different treatment conditions.

my_data <- PlantGrowth

Check your data

# print the head of the file
head(my_data)

  weight group
1   4.17  ctrl
2   5.58  ctrl
3   5.18  ctrl
4   6.11  ctrl
5   4.50  ctrl
6   4.61  ctrl

In R terminology, the column “group” is called a factor and the different categories (“ctrl”, “trt1”, “trt2”) are called factor levels. The levels are ordered alphabetically.

# Show the group levels
levels(my_data$group)

[1] "ctrl" "trt1" "trt2"

If the levels are not automatically in the correct order, re-order them as follows:

my_data$group <- ordered(my_data$group,
                         levels = c("ctrl", "trt1", "trt2"))
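
Note that ordered() creates an ordered factor, whereas factor() with an explicit levels argument only changes the display order. The following sketch uses a small made-up vector (not part of the tutorial data) to show the difference:

```r
# Hypothetical toy vector, used only to illustrate level ordering
g <- factor(c("trt1", "ctrl", "trt2", "ctrl"))
levels(g)                      # default: alphabetical order

# Re-order the levels without making the factor ordered
g_releveled <- factor(g, levels = c("trt2", "trt1", "ctrl"))
levels(g_releveled)
is.ordered(g_releveled)

# ordered() also declares that the levels have a rank order
g_ordered <- ordered(g, levels = c("trt2", "trt1", "ctrl"))
is.ordered(g_ordered)
```

Either form works for the box plots and for kruskal.test(); the distinction matters mainly for model functions that treat ordered factors differently.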

It’s possible to compute summary statistics by groups. The dplyr package can be used.

To install dplyr package, type this:

install.packages("dplyr")

Compute summary statistics by groups:

library(dplyr)
group_by(my_data, group) %>%
  summarise(
    count = n(),
    mean = mean(weight, na.rm = TRUE),
    sd = sd(weight, na.rm = TRUE),
    median = median(weight, na.rm = TRUE),
    IQR = IQR(weight, na.rm = TRUE)
  )

Source: local data frame [3 x 6]
   group count  mean        sd median    IQR
  (fctr) (int) (dbl)     (dbl)  (dbl)  (dbl)
1   ctrl    10 5.032 0.5830914  5.155 0.7425
2   trt1    10 4.661 0.7936757  4.550 0.6625
3   trt2    10 5.526 0.4425733  5.435 0.4675
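
If you would rather not install dplyr, roughly the same table can be built with base R. This is one possible sketch using aggregate(); the object name summary_by_group is chosen here for illustration.

```r
# Per-group summary statistics using only base R
summary_by_group <- aggregate(
  weight ~ group, data = PlantGrowth,
  FUN = function(x) c(count = length(x), mean = mean(x), sd = sd(x),
                      median = median(x), IQR = IQR(x))
)
# aggregate() returns the statistics as a matrix column; flatten it
summary_by_group <- do.call(data.frame, summary_by_group)
summary_by_group
```

The resulting columns are named weight.count, weight.mean, and so on.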

Visualize the data using box plots

R base graphs can also be used (see: R base graphs). Here, we’ll use the ggpubr R package for an easy ggplot2-based data visualization.
Install the latest version of ggpubr from GitHub as follows (recommended):

# Install
if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/ggpubr")

Or, install from CRAN as follows:

install.packages("ggpubr")

Visualize your data with ggpubr:

# Box plots
# ++++++++++++++++++++
# Plot weight by group and color by group
library("ggpubr")
ggboxplot(my_data, x = "group", y = "weight", 
          color = "group", palette = c("#00AFBB", "#E7B800", "#FC4E07"),
          order = c("ctrl", "trt1", "trt2"),
          ylab = "Weight", xlab = "Treatment")

# Mean plots
# ++++++++++++++++++++
# Plot weight by group
# Add error bars: mean_se
# (other values include: mean_sd, mean_ci, median_iqr, ....)
library("ggpubr")
ggline(my_data, x = "group", y = "weight", 
       add = c("mean_se", "jitter"), 
       order = c("ctrl", "trt1", "trt2"),
       ylab = "Weight", xlab = "Treatment")
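
For readers who prefer base graphics, a comparable box plot can be drawn without extra packages. This is a minimal sketch, reusing the colors from the ggpubr example above:

```r
# Base-graphics box plot of weight by group (no extra packages needed)
bp <- boxplot(weight ~ group, data = PlantGrowth,
              col = c("#00AFBB", "#E7B800", "#FC4E07"),
              xlab = "Treatment", ylab = "Weight")

# Overlay the individual observations with a little horizontal jitter
stripchart(weight ~ group, data = PlantGrowth, vertical = TRUE,
           method = "jitter", pch = 19, add = TRUE)
```

boxplot() invisibly returns the statistics it plotted (stored here in bp), which can be handy for checking group sizes and quartiles.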

Compute Kruskal-Wallis test

We want to know whether plant weights differ significantly among the 3 experimental conditions. Note that the Kruskal-Wallis test compares the groups through the ranks of the observations rather than through their means.

The test can be performed using the function kruskal.test() as follows:

kruskal.test(weight ~ group, data = my_data)


    Kruskal-Wallis rank sum test
data:  weight by group
Kruskal-Wallis chi-squared = 7.9882, df = 2, p-value = 0.01842

Interpret

As the p-value is less than the significance level 0.05, we can conclude that there are significant differences between the treatment groups.
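
To see where the statistic comes from, it can be recomputed by hand from the ranks. The sketch below uses the usual Kruskal-Wallis formula with a correction for ties, and the chi-squared approximation for the p-value; the variable names are chosen here for illustration.

```r
weight <- PlantGrowth$weight
group  <- PlantGrowth$group
N <- length(weight)

# Rank all observations together (ties receive the average rank)
r <- rank(weight)
rank_sums <- tapply(r, group, sum)
n_i <- tapply(r, group, length)

# Uncorrected H statistic
H <- 12 / (N * (N + 1)) * sum(rank_sums^2 / n_i) - 3 * (N + 1)

# Correction for tied values
ties <- table(weight)
H <- H / (1 - sum(ties^3 - ties) / (N^3 - N))

# p-value from the chi-squared distribution with (number of groups - 1) df
df <- nlevels(group) - 1
p_value <- pchisq(H, df = df, lower.tail = FALSE)
c(H = H, df = df, p_value = p_value)
```

These values should agree with the statistic and p-value reported by kruskal.test().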

Multiple pairwise-comparison between groups

From the output of the Kruskal-Wallis test, we know that there is a significant difference between groups, but we don’t know which pairs of groups are different.

It’s possible to use the function pairwise.wilcox.test() to calculate pairwise comparisons between group levels with corrections for multiple testing.

pairwise.wilcox.test(PlantGrowth$weight, PlantGrowth$group,
                 p.adjust.method = "BH")


    Pairwise comparisons using Wilcoxon rank sum test 
data:  PlantGrowth$weight and PlantGrowth$group 
     ctrl  trt1 
trt1 0.199 -    
trt2 0.095 0.027
P value adjustment method: BH

The pairwise comparison shows that only trt1 and trt2 are significantly different (p < 0.05).
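
The adjusted p-values can also be handled programmatically. The sketch below stores the result, lists the significant pairs, and shows that the "BH" adjustment is equivalent to applying p.adjust() to the unadjusted p-values. suppressWarnings() is used here only to silence possible warnings about ties; it does not change the results.

```r
# Adjusted pairwise p-values as a matrix (lower triangle filled)
res <- suppressWarnings(
  pairwise.wilcox.test(PlantGrowth$weight, PlantGrowth$group,
                       p.adjust.method = "BH")
)
p_mat <- res$p.value

# Row/column positions of the pairs with adjusted p < 0.05
which(p_mat < 0.05, arr.ind = TRUE)

# Same adjustment done by hand from the raw p-values
raw <- suppressWarnings(
  pairwise.wilcox.test(PlantGrowth$weight, PlantGrowth$group,
                       p.adjust.method = "none")
)$p.value
adj_by_hand <- p.adjust(raw[lower.tri(raw, diag = TRUE)], method = "BH")
adj_by_hand
```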

Infos

This analysis has been performed using R software (ver. 3.2.4).

F-Test: Compare Two Variances in R

June 23, 2020, 11:37 am

The F-test is used to assess whether the variances of two populations (A and B) are equal.

When do you use the F-test?

Comparing two variances is useful in several cases, including:

When you want to perform a two-sample t-test and need to check beforehand that the variances of the two samples are equal
When you want to compare the variability of a new measurement method to that of an old one. Does the new method reduce the variability of the measure?

Research questions and statistical hypotheses

Typical research questions are:

1) whether the variance of group A (\(\sigma^2_A\)) is equal to the variance of group B (\(\sigma^2_B\))?
2) whether the variance of group A (\(\sigma^2_A\)) is less than the variance of group B (\(\sigma^2_B\))?
3) whether the variance of group A (\(\sigma^2_A\)) is greater than the variance of group B (\(\sigma^2_B\))?

In statistics, we can define the corresponding null hypotheses (\(H_0\)) as follows:

1) \(H_0: \sigma^2_A = \sigma^2_B\)
2) \(H_0: \sigma^2_A \geq \sigma^2_B\)
3) \(H_0: \sigma^2_A \leq \sigma^2_B\)

The corresponding alternative hypotheses (\(H_a\)) are as follows:

1) \(H_a: \sigma^2_A \ne \sigma^2_B\) (different)
2) \(H_a: \sigma^2_A < \sigma^2_B\) (less)
3) \(H_a: \sigma^2_A > \sigma^2_B\) (greater)

Note that:

Hypotheses 1) are called two-tailed tests
Hypotheses 2) and 3) are called one-tailed tests

Formula of F-test

The test statistic can be obtained by computing the ratio of the two variances \(S_A^2\) and \(S_B^2\).

\[F = \frac{S_A^2}{S_B^2}\]

The degrees of freedom are \(n_A - 1\) (for the numerator) and \(n_B - 1\) (for the denominator).

Note that the more this ratio deviates from 1, the stronger the evidence for unequal population variances.

Note that the F-test requires the two samples to come from normally distributed populations.
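
As a rough illustration of the formula, the statistic and a two-sided p-value can be computed by hand. The sketch below uses the built-in ToothGrowth data set (the same one used in the practical part of this tutorial), with group A = OJ and group B = VC; the variable names are chosen here for illustration.

```r
# Split tooth length by supplement type
oj <- ToothGrowth$len[ToothGrowth$supp == "OJ"]
vc <- ToothGrowth$len[ToothGrowth$supp == "VC"]

# F statistic: ratio of the two sample variances
f_stat <- var(oj) / var(vc)

# Degrees of freedom for numerator and denominator
df_num <- length(oj) - 1
df_den <- length(vc) - 1

# Two-sided p-value: twice the smaller tail probability
p_lower <- pf(f_stat, df_num, df_den)
p_two_sided <- 2 * min(p_lower, 1 - p_lower)

c(F = f_stat, num_df = df_num, denom_df = df_den, p_value = p_two_sided)
```

The result should match what var.test(len ~ supp, data = ToothGrowth) reports.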

Compute F-test in R

R function

The R function var.test() can be used to compare two variances as follows:

# Method 1
var.test(values ~ groups, data, 
         alternative = "two.sided")
# or Method 2
var.test(x, y, alternative = "two.sided")

x,y: numeric vectors
alternative: the alternative hypothesis. Allowed value is one of “two.sided” (default), “greater” or “less”.
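
For example, a one-sided test can be requested through the alternative argument. The sketch below, run on the built-in ToothGrowth data, tests whether the variance of the first group (OJ, the first factor level) is less than that of the second (VC); the object name res_less is chosen here for illustration.

```r
# One-sided F-test: is var(OJ) / var(VC) less than 1?
res_less <- var.test(len ~ supp, data = ToothGrowth,
                     alternative = "less")
res_less$alternative
res_less$p.value
```

With the formula interface, the numerator is the variance of the first factor level, so the direction of "less" and "greater" depends on the level order.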

Import your data into R and check it

To import your data, use the following R code:

# If .txt tab file, use this
my_data <- read.delim(file.choose())
# Or, if .csv file, use this
my_data <- read.csv(file.choose())

Here, we’ll use the built-in R data set named ToothGrowth:

# Store the data in the variable my_data
my_data <- ToothGrowth

To get an idea of what the data look like, we start by displaying a random sample of 10 rows using the function sample_n() [in the dplyr package]:

library("dplyr")
sample_n(my_data, 10)

    len supp dose
43 23.6   OJ  1.0
28 21.5   VC  2.0
25 26.4   VC  2.0
56 30.9   OJ  2.0
46 25.2   OJ  1.0
7  11.2   VC  0.5
16 17.3   VC  1.0
4   5.8   VC  0.5
48 21.2   OJ  1.0
37  8.2   OJ  0.5

We want to test the equality of variances between the two groups OJ and VC in the column “supp”.

Preliminary test to check F-test assumptions

The F-test is very sensitive to departures from the normality assumption. You need to check whether the data are normally distributed before using the F-test.

The Shapiro-Wilk test can be used to test whether the normality assumption holds. It’s also possible to use a Q-Q plot (quantile-quantile plot) to graphically evaluate the normality of a variable. A Q-Q plot draws the correlation between a given sample and the normal distribution.

If there is doubt about normality, the better choice is to use Levene’s test or the Fligner-Killeen test, which are less sensitive to departures from the normality assumption.
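
These checks might be run as in the following sketch, which applies the Shapiro-Wilk test within each group of the built-in ToothGrowth data and then the Fligner-Killeen test, both from base R's stats package. Levene's test is not in base R; it is available in add-on packages (for example as leveneTest() in the car package) and is not shown here.

```r
# Shapiro-Wilk normality test within each supplement group
shapiro_p <- sapply(split(ToothGrowth$len, ToothGrowth$supp),
                    function(x) shapiro.test(x)$p.value)
shapiro_p

# Fligner-Killeen test of homogeneity of variances (robust to non-normality)
fligner_res <- fligner.test(len ~ supp, data = ToothGrowth)
fligner_res$p.value

# Q-Q plots can be drawn with qqnorm() and qqline() for a visual check
```

A small Shapiro-Wilk p-value in either group suggests a departure from normality, in which case the Fligner-Killeen result is the safer guide.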

Compute F-test

# F-test
res.ftest <- var.test(len ~ supp, data = my_data)
res.ftest


    F test to compare two variances
data:  len by supp
F = 0.6386, num df = 29, denom df = 29, p-value = 0.2331
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.3039488 1.3416857
sample estimates:
ratio of variances 
         0.6385951

Interpretation of the result

The p-value of the F-test is p = 0.2331, which is greater than the significance level 0.05. In conclusion, there is no significant difference between the two variances.

Access the values returned by the var.test() function

The function var.test() returns a list containing the following components:

statistic: the value of the F test statistic.
parameter: the degrees of freedom of the F distribution of the test statistic.
p.value: the p-value of the test.
conf.int: a confidence interval for the ratio of the population variances.
estimate: the ratio of the sample variances.

The format of the R code to use for getting these values is as follows: