
MASS package and factoextra : Correspondence Analysis - R software and data mining



As illustrated in my previous article, correspondence analysis (CA) is used to analyse the contingency table formed by two categorical variables.

This article describes how to perform correspondence analysis using the MASS package and how to visualize the results with factoextra.

Required packages

Two packages are used: MASS (for computing CA) and factoextra (for CA visualization).

These packages can be installed as follows:

install.packages("MASS")

# install.packages("devtools")
devtools::install_github("kassambara/factoextra")

Note that factoextra version 1.0.1 or later is required for this tutorial. If an older version is already installed on your computer, re-install it to get the latest version.

Load MASS and factoextra

library("MASS")
library("factoextra")

Data format

We’ll use the housetasks data set [in factoextra].

data(housetasks)
head(housetasks)
           Wife Alternating Husband Jointly
Laundry     156          14       2       4
Main_meal   124          20       5       4
Dinner       77          11       7      13
Breakfeast   82          36      15       7
Tidying      53          11       1      57
Dishes       32          24       4      53

The data is a contingency table containing 13 household tasks and their distribution within the couple:

  • rows are the different tasks
  • values are the frequencies of the tasks done :
    • by the wife only
    • alternately
    • by the husband only
    • or jointly
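
As a quick illustration of how such a table is read, the sketch below rebuilds only the six rows printed by head(housetasks) above (not the full 13-row data set) and computes row totals and row profiles, the row-wise proportions that CA compares. The object names tasks, row_totals and row_profiles are chosen here for illustration.

```r
# Six-row excerpt copied from the head(housetasks) output above.
# This is a subset, so results differ from those of the full data set.
tasks <- matrix(
  c(156, 14,  2,  4,
    124, 20,  5,  4,
     77, 11,  7, 13,
     82, 36, 15,  7,
     53, 11,  1, 57,
     32, 24,  4, 53),
  nrow = 6, byrow = TRUE,
  dimnames = list(
    c("Laundry", "Main_meal", "Dinner", "Breakfeast", "Tidying", "Dishes"),
    c("Wife", "Alternating", "Husband", "Jointly")
  )
)

# Total number of answers recorded for each task
row_totals <- rowSums(tasks)

# Row profiles: each row divided by its total, so every row sums to 1
row_profiles <- prop.table(tasks, margin = 1)
round(row_profiles, 2)
```

Tasks with similar row profiles end up close to each other on a CA map.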


Correspondence analysis (CA)

The function corresp() [in MASS package] can be used to compute CA. A simplified format is:

corresp(x,  nf = 1)

  • x : a data frame, matrix or table (contingency table)
  • nf : number of dimensions to be included in the output


Example of usage :

res.ca <- corresp(housetasks, nf = 3)

The output of the function corresp() is an object of class correspondence structured as a list including :

names(res.ca)
[1] "cor"    "rscore" "cscore" "Freq"  
  • cor: the canonical correlations, i.e. the square roots of the eigenvalues
  • rscore, cscore: the row and column scores
  • Freq: the initial contingency table
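
To make the relationship between these elements concrete, here is a minimal sketch that runs corresp() on the six rows shown in the head(housetasks) output (a subset, so the numbers differ from the full analysis). It assumes the usual CA convention that principal coordinates are the scores multiplied by the canonical correlations; the names tasks, fit, eig, row_mass and row_principal are illustrative.

```r
library(MASS)

# Six-row excerpt of housetasks, copied from the head() output above
tasks <- matrix(
  c(156, 14,  2,  4,
    124, 20,  5,  4,
     77, 11,  7, 13,
     82, 36, 15,  7,
     53, 11,  1, 57,
     32, 24,  4, 53),
  nrow = 6, byrow = TRUE,
  dimnames = list(
    c("Laundry", "Main_meal", "Dinner", "Breakfeast", "Tidying", "Dishes"),
    c("Wife", "Alternating", "Husband", "Jointly")
  )
)

fit <- corresp(tasks, nf = 3)

# Eigenvalues (principal inertias) are the squared canonical correlations
eig <- fit$cor^2

# Row masses: share of each row in the grand total
row_mass <- rowSums(tasks) / sum(tasks)

# Principal coordinates: scale each score column by its canonical correlation
row_principal <- sweep(fit$rscore, 2, fit$cor, `*`)

# The mass-weighted sum of squared coordinates on each axis
# should reproduce the eigenvalues
colSums(row_mass * row_principal^2)
eig
```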

Interpretation of CA outputs

For the interpretation of the results, read this article: Correspondence Analysis in R: The Ultimate Guide for the Analysis, the Visualization and the Interpretation.

Eigenvalues and scree plot

The proportion of inertia explained by the principal axes can be obtained using the function get_eigenvalue() [in factoextra] as follows:

eigenvalues <- get_eigenvalue(res.ca)
eigenvalues
      eigenvalue variance.percent cumulative.variance.percent
Dim.1  0.5428893         48.69222                    48.69222
Dim.2  0.4450028         39.91269                    88.60491
Dim.3  0.1270484         11.39509                   100.00000
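
The percentage columns follow directly from the eigenvalues: each eigenvalue is divided by their sum. The short sketch below redoes that arithmetic with the three eigenvalues printed above; the variable names are illustrative and no package is needed.

```r
# Eigenvalues copied from the get_eigenvalue() output above
eig <- c(Dim.1 = 0.5428893, Dim.2 = 0.4450028, Dim.3 = 0.1270484)

# Share of the total inertia carried by each dimension, in percent
variance_percent <- 100 * eig / sum(eig)

# Running total of those shares
cumulative_percent <- cumsum(variance_percent)

data.frame(
  eigenvalue = eig,
  variance.percent = variance_percent,
  cumulative.variance.percent = cumulative_percent
)
```

The first two dimensions together account for about 88.6% of the total inertia, which is why a two-dimensional map is usually considered sufficient here.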

The function fviz_screeplot() [in factoextra package] can be used to draw the scree plot (the percentages of inertia explained by the CA dimensions):

fviz_screeplot(res.ca)


Read more about eigenvalues and screeplot: Eigenvalues data visualization

Biplot of row and column variables

You can use the base R function biplot(res.ca) or the function fviz_ca_biplot() [in factoextra package] to draw a nice-looking plot:

fviz_ca_biplot(res.ca)


# Change the theme
fviz_ca_biplot(res.ca) +
  theme_minimal()


Read more about fviz_ca_biplot(): fviz_ca_biplot

Row variables

The function get_ca_row() [in factoextra] is used to extract the results for row variables. This function returns a list containing the coordinates, the cos2, the contributions and the inertia of the row variables. The function fviz_ca_row() [in factoextra] is used to visualize only the row points.

row <- get_ca_row(res.ca)
row
Correspondence Analysis - Results for rows
 ===================================================
  Name       Description                
1 "$coord"   "Coordinates for the rows" 
2 "$cos2"    "Cos2 for the rows"        
3 "$contrib" "contributions of the rows"
4 "$inertia" "Inertia of the rows"      
# Coordinates
head(row$coord)
                Dim.1      Dim.2       Dim.3
Laundry    -0.9918368 -0.4953220 -0.31672897
Main_meal  -0.8755855 -0.4901092 -0.16406487
Dinner     -0.6925740 -0.3081043 -0.20741377
Breakfeast -0.5086002 -0.4528038  0.22040453
Tidying    -0.3938084  0.4343444 -0.09421375
Dishes     -0.1889641  0.4419662  0.26694926
# Visualize row variables only 
fviz_ca_row(res.ca) +
  theme_minimal()
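
For readers who want to see what cos2 and contributions measure, the sketch below computes them by hand from a corresp() fit, using the textbook CA definitions rather than factoextra's internal code (which may differ in details). It uses only the six rows shown in the head(housetasks) output, so the values will not match those of the full data set; all object names are illustrative.

```r
library(MASS)

# Six-row excerpt of housetasks, copied from the head() output above
tasks <- matrix(
  c(156, 14,  2,  4,
    124, 20,  5,  4,
     77, 11,  7, 13,
     82, 36, 15,  7,
     53, 11,  1, 57,
     32, 24,  4, 53),
  nrow = 6, byrow = TRUE,
  dimnames = list(
    c("Laundry", "Main_meal", "Dinner", "Breakfeast", "Tidying", "Dishes"),
    c("Wife", "Alternating", "Husband", "Jointly")
  )
)

fit <- corresp(tasks, nf = 3)
row_mass <- rowSums(tasks) / sum(tasks)

# Principal coordinates of the rows (scores scaled by canonical correlations)
coord <- sweep(fit$rscore, 2, fit$cor, `*`)
colnames(coord) <- paste0("Dim.", 1:3)

# cos2: share of each row's squared distance to the origin captured by each axis.
# Rows sum to 1 here because all three available dimensions are kept.
cos2 <- coord^2 / rowSums(coord^2)

# Contribution (in %) of each row to the inertia of each axis
contrib <- 100 * sweep(row_mass * coord^2, 2, fit$cor^2, `/`)

round(cos2, 3)
round(contrib, 2)
```

A row with a high contribution on an axis helps define that axis, while a high cos2 means the row is well represented by it.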


Column variables

The result for columns gives the same information as described for rows.

col <- get_ca_col(res.ca)
# Coordinates
head(col$coord)
                  Dim.1      Dim.2       Dim.3
Wife        -0.83762154 -0.3652207 -0.19991139
Alternating -0.06218462 -0.2915938  0.84858939
Husband      1.16091847 -0.6019199 -0.18885924
Jointly      0.14942609  1.0265791 -0.04644302
# Visualize column variables only 
fviz_ca_col(res.ca) +
  theme_minimal()



Infos

This analysis has been performed using R software (ver. 3.1.2), FactoMineR (ver. ) and factoextra (ver. 1.0.2)

