As described here, correspondence analysis is used to analyse the contingency table formed by two qualitative variables.
This article describes how to perform a correspondence analysis using the ca package.
Required packages
The packages ca (for computing CA) and factoextra (for CA visualization) are used.
These packages can be installed as follows:
install.packages("ca")
# install.packages("devtools")
devtools::install_github("kassambara/factoextra")
Note that factoextra version 1.0.1 or later is required for this tutorial. If it's already installed on your computer, re-install it to get the latest version.
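A quick way to check the installed version before re-installing is sketched below. The helper name has_min_version() is hypothetical (it is not part of ca or factoextra); it only wraps the base R functions requireNamespace() and packageVersion().

```r
# Hypothetical helper: TRUE when `pkg` is installed at version >= `min_version`.
has_min_version <- function(pkg, min_version) {
  requireNamespace(pkg, quietly = TRUE) &&
    utils::packageVersion(pkg) >= min_version
}

# Warn if factoextra is absent or older than the version this tutorial needs.
if (!has_min_version("factoextra", "1.0.1")) {
  message("factoextra is missing or older than 1.0.1; (re-)install it first.")
}
```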
Load ca and factoextra
library("ca")
library("factoextra")
Data format
We'll use the housetasks data set, taken from the package ade4 (a copy is available once factoextra is loaded).
data(housetasks)
head(housetasks, 13)
           Wife Alternating Husband Jointly
Laundry     156          14       2       4
Main_meal   124          20       5       4
Dinner       77          11       7      13
Breakfeast   82          36      15       7
Tidying      53          11       1      57
Dishes       32          24       4      53
Shopping     33          23       9      55
Official     12          46      23      15
Driving      10          51      75       3
Finances     13          13      21      66
Insurance     8           1      53      77
Repairs       0           3     160       2
Holidays      0           1       6     153
The data is a contingency table of 13 household tasks and how they are shared within the couple:
- rows are the different tasks
- values are the frequencies with which each task is done:
  - by the wife only
  - alternately
  - by the husband only
  - or jointly
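To make the structure concrete, here is a minimal base R sketch that types the same table in by hand and looks at its margins. The object names counts and housetasks_manual are illustrative only; the numbers are copied from the printed table.

```r
# The 13 x 4 table from this section (rows = tasks, columns = who does the task).
tasks <- c("Laundry", "Main_meal", "Dinner", "Breakfeast", "Tidying", "Dishes",
           "Shopping", "Official", "Driving", "Finances", "Insurance",
           "Repairs", "Holidays")
who <- c("Wife", "Alternating", "Husband", "Jointly")
counts <- matrix(c(156, 14,   2,   4,
                   124, 20,   5,   4,
                    77, 11,   7,  13,
                    82, 36,  15,   7,
                    53, 11,   1,  57,
                    32, 24,   4,  53,
                    33, 23,   9,  55,
                    12, 46,  23,  15,
                    10, 51,  75,   3,
                    13, 13,  21,  66,
                     8,  1,  53,  77,
                     0,  3, 160,   2,
                     0,  1,   6, 153),
                 ncol = 4, byrow = TRUE, dimnames = list(tasks, who))
housetasks_manual <- as.data.frame(counts)

# Margins: how often each task was reported, and by whom.
rowSums(counts)
colSums(counts)
sum(counts)  # grand total

# Row profiles: the share of each task taken by each category.
round(prop.table(counts, margin = 1), 2)
```

A data frame built this way can be passed to ca() in the same way as the packaged data set.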
Correspondence analysis (CA)
The function ca() [in the ca package] can be used. A simplified format is:
ca(obj, nd = NA)
- obj: a data frame, matrix or table (contingency table)
- nd: number of dimensions to be included in the output
Example of usage:
res.ca <- ca(housetasks, nd = 3)
The output of the function ca() is a list with the following components:
names(res.ca)
[1] "sv" "nd" "rownames" "rowmass" "rowdist" "rowinertia" "rowcoord"
[8] "rowsup" "colnames" "colmass" "coldist" "colinertia" "colcoord" "colsup"
[15] "call"
The standard coordinates of the rows can be extracted as follows:
res.ca$rowcoord
                 Dim1       Dim2       Dim3
Laundry    -1.3461225 -0.7425167 -0.8885935
Main_meal  -1.1883460 -0.7347025 -0.4602894
Dinner     -0.9399625 -0.4618664 -0.5819061
Breakfeast -0.6902730 -0.6787794  0.6183521
Tidying    -0.5344773  0.6511077 -0.2643198
Dishes     -0.2564623  0.6625334  0.7489349
Shopping   -0.1597173  0.6045960  0.5684434
Official    0.3075858 -0.3801811  2.5905284
Driving     1.0067309 -0.9795065  1.5274961
Finances    0.3674852  0.9262210  0.0976236
Insurance   0.8782125  0.7102288 -0.8118104
Repairs     2.0748608 -1.2955835 -1.3244577
Holidays    0.3426748  2.1511592 -0.3635596
The standard coordinates of columns are:
res.ca$colcoord
                   Dim1       Dim2       Dim3
Wife        -1.13682130 -0.5474873 -0.5608580
Alternating -0.08439706 -0.4371162  2.3807453
Husband      1.57560041 -0.9023133 -0.5298508
Jointly      0.20280133  1.5389023 -0.1302974
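The summary() output for ca objects reports principal coordinates (multiplied by 1000), not the standard coordinates stored in rowcoord and colcoord. The two are related by the singular values in the sv component: a principal coordinate is the standard coordinate multiplied by the singular value of its dimension. A minimal sketch of the conversion, assuming housetasks can be loaded from factoextra (row_principal and col_principal are illustrative names):

```r
library(ca)
data(housetasks, package = "factoextra")
res.ca <- ca(housetasks, nd = 3)

# Singular values matching the columns of the coordinate matrices.
k  <- ncol(res.ca$rowcoord)
sv <- res.ca$sv[seq_len(k)]

# Principal coordinate = standard coordinate * singular value of that dimension.
row_principal <- sweep(res.ca$rowcoord, 2, sv, "*")
col_principal <- sweep(res.ca$colcoord, 2, sv, "*")

# Multiplied by 1000, these should match the k=1, k=2 and k=3 columns of summary(res.ca).
round(1000 * row_principal)
round(1000 * col_principal)
```

Note that the sign of a dimension is arbitrary, so a whole column may be flipped from one platform to another without changing the interpretation.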
Note that the print() and summary() methods are available for ca objects.
# printing method
print(x)
# Summary method
summary(object, scree = TRUE, rows = TRUE, columns = TRUE)
- x, object: CA object
- scree: If TRUE, the scree plot is included in the output
- rows: If TRUE, the results for rows are included in the output
- columns: If TRUE, the results for columns are included in the output
Summary of CA outputs
summary(res.ca)
Principal inertias (eigenvalues):
dim value % cum% scree plot
1 0.542889 48.7 48.7 ************
2 0.445003 39.9 88.6 **********
3 0.127048 11.4 100.0 ***
-------- -----
Total: 1.114940 100.0
Rows:
name mass qlt inr k=1 cor ctr k=2 cor ctr k=3 cor ctr
1 | Lndr | 101 1000 120 | -992 740 183 | -495 185 56 | -317 75 80 |
2 | Mn_m | 88 1000 81 | -876 742 124 | -490 232 47 | -164 26 19 |
3 | Dnnr | 62 1000 34 | -693 777 55 | -308 154 13 | -207 70 21 |
4 | Brkf | 80 1000 37 | -509 505 38 | -453 400 37 | 220 95 31 |
5 | Tdyn | 70 1000 22 | -394 440 20 | 434 535 30 | -94 25 5 |
6 | Dshs | 65 1000 18 | -189 118 4 | 442 646 28 | 267 236 36 |
7 | Shpp | 69 1000 13 | -118 64 2 | 403 748 25 | 203 189 22 |
8 | Offc | 55 1000 48 | 227 53 5 | -254 66 8 | 923 881 369 |
9 | Drvn | 80 1000 91 | 742 432 81 | -653 335 76 | 544 233 186 |
10 | Fnnc | 65 1000 27 | 271 161 9 | 618 837 56 | 35 3 1 |
11 | Insr | 80 1000 52 | 647 576 61 | 474 309 40 | -289 115 53 |
12 | Rprs | 95 1000 281 | 1529 707 407 | -864 226 159 | -472 67 166 |
13 | Hldy | 92 1000 176 | 252 30 11 | 1435 962 425 | -130 8 12 |
Columns:
name mass qlt inr k=1 cor ctr k=2 cor ctr k=3 cor ctr
1 | Wife | 344 1000 270 | -838 802 445 | -365 152 103 | -200 46 108 |
2 | Altr | 146 1000 106 | -62 5 1 | -292 105 28 | 849 890 825 |
3 | Hsbn | 218 1000 342 | 1161 772 542 | -602 208 178 | -189 20 61 |
4 | Jntl | 292 1000 282 | 149 21 12 | 1027 977 691 | -46 2 5 |
The result of the function summary() contains 3 tables:
- Table 1 (eigenvalues) contains the eigenvalues and the percentage of inertia retained by each dimension. Cumulative percentages and a scree plot are also shown.
- Table 2 contains the results for the rows (all values are multiplied by 1000):
  - The principal coordinates for the first 3 dimensions (k = 1, k = 2 and k = 3).
  - Squared correlations (cor or cos2) and contributions (ctr) of the points. Note that cor and ctr are expressed per mille.
  - mass: the mass of each point, i.e. its row total divided by the grand total (×1000).
  - qlt: the total quality (×1000) of representation of the point by the 3 included dimensions. It is the sum of the squared correlations over these dimensions. Here it equals 1000 for every point because a 13 × 4 table has only 3 dimensions in total.
  - inr: the inertia of the point (per mille of the total inertia).
- Table 3 contains the same results for the columns.
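The mass and inr columns can be reproduced by hand, which helps to see what they measure. The sketch below assumes housetasks can be loaded from factoextra and uses the rowmass and rowinertia components of the ca object; row_mass and row_inr are illustrative names.

```r
library(ca)
data(housetasks, package = "factoextra")
res.ca <- ca(housetasks, nd = 3)

N <- as.matrix(housetasks)

# mass: row total divided by the grand total.
row_mass <- rowSums(N) / sum(N)
round(1000 * row_mass)   # compare with the mass column of the rows table

# inr: each row's share of the total inertia.
row_inr <- res.ca$rowinertia / sum(res.ca$rowinertia)
round(1000 * row_inr)    # compare with the inr column of the rows table

# The hand-computed masses should agree with the ones stored by ca().
all.equal(unname(row_mass), as.numeric(res.ca$rowmass))
```

The same computation with colSums(), colmass and colinertia gives the values for the columns.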
The function summary.ca() returns a list: list(scree, rows, columns).
Use the R code below to get the table containing the results for rows:
summary(res.ca)$rows
name mass qlt inr k=1 cor ctr k=2 cor ctr k=3 cor ctr
1 Lndr 101 1000 120 -992 740 183 -495 185 56 -317 75 80
2 Mn_m 88 1000 81 -876 742 124 -490 232 47 -164 26 19
3 Dnnr 62 1000 34 -693 777 55 -308 154 13 -207 70 21
4 Brkf 80 1000 37 -509 505 38 -453 400 37 220 95 31
5 Tdyn 70 1000 22 -394 440 20 434 535 30 -94 25 5
6 Dshs 65 1000 18 -189 118 4 442 646 28 267 236 36
7 Shpp 69 1000 13 -118 64 2 403 748 25 203 189 22
8 Offc 55 1000 48 227 53 5 -254 66 8 923 881 369
9 Drvn 80 1000 91 742 432 81 -653 335 76 544 233 186
10 Fnnc 65 1000 27 271 161 9 618 837 56 35 3 1
11 Insr 80 1000 52 647 576 61 474 309 40 -289 115 53
12 Rprs 95 1000 281 1529 707 407 -864 226 159 -472 67 166
13 Hldy 92 1000 176 252 30 11 1435 962 425 -130 8 12
The summary for column variables is:
summary(res.ca)$columns
name mass qlt inr k=1 cor ctr k=2 cor ctr k=3 cor ctr
1 Wife 344 1000 270 -838 802 445 -365 152 103 -200 46 108
2 Altr 146 1000 106 -62 5 1 -292 105 28 849 890 825
3 Hsbn 218 1000 342 1161 772 542 -602 208 178 -189 20 61
4 Jntl 292 1000 282 149 21 12 1027 977 691 -46 2 5
Interpretation of CA outputs
The interpretation of correspondence analysis has been described in my previous post: Correspondence Analysis in R: The Ultimate Guide for the Analysis, the Visualization and the Interpretation.
Eigenvalues and scree plot
The proportion of inertia explained by the principal dimensions can be extracted using the function get_eigenvalue() [in factoextra] as follows:
eigenvalues <- get_eigenvalue(res.ca)
eigenvalues
      eigenvalue variance.percent cumulative.variance.percent
Dim.1  0.5428893         48.69222                    48.69222
Dim.2  0.4450028         39.91269                    88.60491
Dim.3  0.1270484         11.39509                   100.00000
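To see where these numbers come from, the eigenvalues can be recomputed in base R. In correspondence analysis, they are the squared singular values of the matrix of standardized residuals, and their sum (the total inertia) equals the chi-square statistic of the table divided by the grand total. This is a sketch of that computation, not the code used internally by ca() or factoextra; it only assumes that housetasks can be loaded from factoextra.

```r
data(housetasks, package = "factoextra")
N <- as.matrix(housetasks)

P        <- N / sum(N)      # correspondence matrix
row_mass <- rowSums(P)      # row masses
col_mass <- colSums(P)      # column masses

# Standardized residuals: (observed - expected) / sqrt(expected), on proportions.
expected <- row_mass %o% col_mass
S <- (P - expected) / sqrt(expected)

# Eigenvalues (principal inertias) = squared singular values of S.
n_dim <- min(dim(N)) - 1
eig   <- svd(S)$d[seq_len(n_dim)]^2
pct   <- 100 * eig / sum(eig)

data.frame(eigenvalue = eig,
           variance.percent = pct,
           cumulative.variance.percent = cumsum(pct),
           row.names = paste0("Dim.", seq_along(eig)))

# Total inertia = chi-square statistic / grand total.
unname(chisq.test(N)$statistic) / sum(N)
```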
The function fviz_screeplot() [in factoextra package] can be used to draw the scree plot (the percentages of inertia explained by the CA dimensions):
fviz_screeplot(res.ca)
Read more about eigenvalues and screeplot: Eigenvalues data visualization
Biplot of row and column variables
The plot() method for ca objects [in the ca package] can be used:
plot(res.ca)
It's also possible to use the function fviz_ca_biplot() [in factoextra]:
fviz_ca_biplot(res.ca)
Read more about fviz_ca_biplot(): fviz_ca_biplot
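The plot() method of the ca package also accepts a few arguments that are useful here: dim selects the two dimensions to draw and map selects the scaling of the map. The sketch below assumes housetasks can be loaded from factoextra. Looking at dimension 3 is worthwhile in this example because the summary shows that Alternating and Official contribute mostly to that dimension.

```r
library(ca)
data(housetasks, package = "factoextra")
res.ca <- ca(housetasks, nd = 3)

# Dimensions 1 and 3 instead of the default 1 and 2.
plot(res.ca, dim = c(1, 3))

# Asymmetric map: rows in principal coordinates, columns in standard coordinates.
plot(res.ca, map = "rowprincipal")
```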
References and further reading
- Correspondence Analysis in R: The Ultimate Guide for the Analysis, the Visualization and the Interpretation
- Correspondence Analysis using ade4 and factoextra
- Oleg Nenadic and Michael Greenacre. Correspondence Analysis in R, with Two- and Three-dimensional Graphics: The ca Package. Journal of Statistical Software, May 2007. http://www.jstatsoft.org/v20/i03/paper
Infos
This analysis has been performed using R software (ver. 3.1.2), ca (ver. 0.58) and factoextra (ver. 1.0.2).