ca package and factoextra : Correspondence Analysis - R software and data mining

Correspondence analysis (CA) is used to analyse the contingency table formed by two qualitative variables.

This article describes how to perform a correspondence analysis using the ca package.

Required packages

The ca (for computing CA) and factoextra (for CA visualization) packages are used.

These packages can be installed as follows:

install.packages("ca")

# install.packages("devtools")
devtools::install_github("kassambara/factoextra")

Note that factoextra version >= 1.0.1 is required for this tutorial. If an older version is already installed on your computer, re-install it to get the most recent version.

Load ca and factoextra

library("ca")
library("factoextra")

Data format

We’ll use the housetasks data set, taken from the package ade4.

data(housetasks)
head(housetasks, 13)
           Wife Alternating Husband Jointly
Laundry     156          14       2       4
Main_meal   124          20       5       4
Dinner       77          11       7      13
Breakfeast   82          36      15       7
Tidying      53          11       1      57
Dishes       32          24       4      53
Shopping     33          23       9      55
Official     12          46      23      15
Driving      10          51      75       3
Finances     13          13      21      66
Insurance     8           1      53      77
Repairs       0           3     160       2
Holidays      0           1       6     153

The data is a contingency table of 13 household tasks and how they are shared within the couple:

  • rows are the different tasks
  • values are the frequencies with which each task is done:
    • by the wife only
    • alternately
    • by the husband only
    • or jointly
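
As a quick check of the layout described above, the table can be rebuilt by hand and its margins inspected. This is only a sketch: the counts are copied from the printed output, and the object name `tasks` is chosen here for illustration.

```r
# Counts copied from the head(housetasks, 13) output above (one row per task).
tasks <- matrix(
  c(156, 14,   2,   4,
    124, 20,   5,   4,
     77, 11,   7,  13,
     82, 36,  15,   7,
     53, 11,   1,  57,
     32, 24,   4,  53,
     33, 23,   9,  55,
     12, 46,  23,  15,
     10, 51,  75,   3,
     13, 13,  21,  66,
      8,  1,  53,  77,
      0,  3, 160,   2,
      0,  1,   6, 153),
  ncol = 4, byrow = TRUE,
  dimnames = list(
    c("Laundry", "Main_meal", "Dinner", "Breakfeast", "Tidying", "Dishes",
      "Shopping", "Official", "Driving", "Finances", "Insurance", "Repairs",
      "Holidays"),
    c("Wife", "Alternating", "Husband", "Jointly")))

sum(tasks)       # grand total of the table
rowSums(tasks)   # total count for each task
colSums(tasks)   # total count for each arrangement

# Row profiles: each task's distribution over the four arrangements.
round(prop.table(tasks, margin = 1), 2)
```

Each row profile sums to 1, which makes tasks with very different totals directly comparable.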


Correspondence analysis (CA)

The function ca() [in ca package] can be used. A simplified format is :

ca(obj,  nd = NA)

  • obj : a data frame, matrix or table (contingency table)
  • nd : number of dimensions to be included in the output (NA, the default, keeps the maximum number of dimensions)


Example of usage :

res.ca <- ca(housetasks, nd = 3)

The output of the function ca() is structured as a list including :

names(res.ca)
 [1] "sv"         "nd"         "rownames"   "rowmass"    "rowdist"    "rowinertia" "rowcoord"  
 [8] "rowsup"     "colnames"   "colmass"    "coldist"    "colinertia" "colcoord"   "colsup"    
[15] "call"      

The standard coordinates of the rows can be extracted as follows:

res.ca$rowcoord
                 Dim1       Dim2       Dim3
Laundry    -1.3461225 -0.7425167 -0.8885935
Main_meal  -1.1883460 -0.7347025 -0.4602894
Dinner     -0.9399625 -0.4618664 -0.5819061
Breakfeast -0.6902730 -0.6787794  0.6183521
Tidying    -0.5344773  0.6511077 -0.2643198
Dishes     -0.2564623  0.6625334  0.7489349
Shopping   -0.1597173  0.6045960  0.5684434
Official    0.3075858 -0.3801811  2.5905284
Driving     1.0067309 -0.9795065  1.5274961
Finances    0.3674852  0.9262210  0.0976236
Insurance   0.8782125  0.7102288 -0.8118104
Repairs     2.0748608 -1.2955835 -1.3244577
Holidays    0.3426748  2.1511592 -0.3635596

The standard coordinates of columns are:

res.ca$colcoord
                   Dim1       Dim2       Dim3
Wife        -1.13682130 -0.5474873 -0.5608580
Alternating -0.08439706 -0.4371162  2.3807453
Husband      1.57560041 -0.9023133 -0.5298508
Jointly      0.20280133  1.5389023 -0.1302974
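
For readers curious where these numbers come from, the sketch below reproduces the standard coordinates with base R only, using the textbook CA computation (singular value decomposition of the standardized residuals). It is not the ca package's own code; the counts are copied from the data shown earlier, all object names are illustrative, and the sign of a whole dimension may come out flipped because an SVD only determines each axis up to sign.

```r
# Counts copied from the housetasks table shown earlier.
N <- matrix(
  c(156, 14,   2,   4,
    124, 20,   5,   4,
     77, 11,   7,  13,
     82, 36,  15,   7,
     53, 11,   1,  57,
     32, 24,   4,  53,
     33, 23,   9,  55,
     12, 46,  23,  15,
     10, 51,  75,   3,
     13, 13,  21,  66,
      8,  1,  53,  77,
      0,  3, 160,   2,
      0,  1,   6, 153),
  ncol = 4, byrow = TRUE,
  dimnames = list(
    c("Laundry", "Main_meal", "Dinner", "Breakfeast", "Tidying", "Dishes",
      "Shopping", "Official", "Driving", "Finances", "Insurance", "Repairs",
      "Holidays"),
    c("Wife", "Alternating", "Husband", "Jointly")))

P     <- N / sum(N)    # correspondence matrix (relative frequencies)
rmass <- rowSums(P)    # row masses
cmass <- colSums(P)    # column masses

# Standardized residuals: (observed - expected) / sqrt(expected)
E <- outer(rmass, cmass)
S <- (P - E) / sqrt(E)

dec <- svd(S)
nd  <- 3               # min(nrow, ncol) - 1 non-trivial dimensions
sv  <- dec$d[1:nd]     # singular values; sv^2 are the principal inertias

# Standard coordinates: singular vectors rescaled by the square root of the masses.
dims   <- paste0("Dim", 1:nd)
rowstd <- dec$u[, 1:nd] / sqrt(rmass)
colstd <- dec$v[, 1:nd] / sqrt(cmass)
dimnames(rowstd) <- list(rownames(N), dims)
dimnames(colstd) <- list(colnames(N), dims)

round(sv^2, 6)
round(colstd, 4)
```

Up to the sign of each dimension, `rowstd` and `colstd` should agree with res.ca$rowcoord and res.ca$colcoord, and `sv` with res.ca$sv.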

Note that print() and summary() methods are available for ca objects.

# printing method
print(x)

# Summary method
summary(object, scree = TRUE, rows = TRUE, columns = TRUE)

  • x, object: CA object
  • scree: If TRUE, the scree plot is included in the output
  • rows: If TRUE, the results for rows are included in the output
  • columns: If TRUE, the results for columns are included in the output


Summary of CA outputs

summary(res.ca)

Principal inertias (eigenvalues):

 dim    value      %   cum%   scree plot               
 1      0.542889  48.7  48.7  ************             
 2      0.445003  39.9  88.6  **********               
 3      0.127048  11.4 100.0  ***                      
        -------- -----                                 
 Total: 1.114940 100.0                                 


Rows:
     name   mass  qlt  inr    k=1 cor ctr    k=2 cor ctr    k=3 cor ctr  
1  | Lndr |  101 1000  120 | -992 740 183 | -495 185  56 | -317  75  80 |
2  | Mn_m |   88 1000   81 | -876 742 124 | -490 232  47 | -164  26  19 |
3  | Dnnr |   62 1000   34 | -693 777  55 | -308 154  13 | -207  70  21 |
4  | Brkf |   80 1000   37 | -509 505  38 | -453 400  37 |  220  95  31 |
5  | Tdyn |   70 1000   22 | -394 440  20 |  434 535  30 |  -94  25   5 |
6  | Dshs |   65 1000   18 | -189 118   4 |  442 646  28 |  267 236  36 |
7  | Shpp |   69 1000   13 | -118  64   2 |  403 748  25 |  203 189  22 |
8  | Offc |   55 1000   48 |  227  53   5 | -254  66   8 |  923 881 369 |
9  | Drvn |   80 1000   91 |  742 432  81 | -653 335  76 |  544 233 186 |
10 | Fnnc |   65 1000   27 |  271 161   9 |  618 837  56 |   35   3   1 |
11 | Insr |   80 1000   52 |  647 576  61 |  474 309  40 | -289 115  53 |
12 | Rprs |   95 1000  281 | 1529 707 407 | -864 226 159 | -472  67 166 |
13 | Hldy |   92 1000  176 |  252  30  11 | 1435 962 425 | -130   8  12 |

Columns:
    name   mass  qlt  inr    k=1 cor ctr    k=2 cor ctr    k=3 cor ctr  
1 | Wife |  344 1000  270 | -838 802 445 | -365 152 103 | -200  46 108 |
2 | Altr |  146 1000  106 |  -62   5   1 | -292 105  28 |  849 890 825 |
3 | Hsbn |  218 1000  342 | 1161 772 542 | -602 208 178 | -189  20  61 |
4 | Jntl |  292 1000  282 |  149  21  12 | 1027 977 691 |  -46   2   5 |

The result of the function summary() contains 3 tables:

  • Table 1 - Eigenvalues: contains the eigenvalues and the percentage of inertia retained by each dimension. Cumulative percentages and a scree plot are also shown.
  • Table 2 contains the results for the rows (all values are multiplied by 1000):
    • k=1, k=2 and k=3: the principal coordinates on the first 3 dimensions.
    • cor and ctr: the squared correlations (cor, also called cos2) and the contributions (ctr) of the points, both expressed per thousand.
    • mass: the mass of each point, i.e. its marginal total divided by the grand total (x1000).
    • qlt: the total quality of representation of the point by the 3 included dimensions (x1000). In our example, it is the sum of the squared correlations over the three included dimensions.
    • inr: the inertia of the point (per thousand of the total inertia).
  • Table 3 contains the same results for the columns.
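
To make the scaling concrete, the Laundry row (Lndr) can be recomputed from values printed in this article: its standard coordinates, the three principal inertias, and its row total (176 out of a grand total of 1744). This is a sketch of the usual CA formulas with illustrative variable names, not output from the ca package.

```r
std  <- c(-1.3461225, -0.7425167, -0.8885935)  # res.ca$rowcoord["Laundry", ]
eig  <- c(0.542889, 0.445003, 0.127048)        # principal inertias from summary()
mass <- (156 + 14 + 2 + 4) / 1744              # Laundry row total / grand total

principal <- std * sqrt(eig)      # principal = standard coordinate * singular value

# All 3 possible dimensions are kept here, so the sum of squared principal
# coordinates is the full squared distance of the point to the centroid.
d2  <- sum(principal^2)
cor <- principal^2 / d2           # squared correlation with each dimension
ctr <- mass * principal^2 / eig   # contribution of the point to each dimension
inr <- mass * d2 / sum(eig)       # share of the total inertia due to the point

round(1000 * principal)                                   # Lndr row: -992 -495 -317
round(1000 * cor)                                         # Lndr row:  740  185   75
round(1000 * ctr)                                         # Lndr row:  183   56   80
round(1000 * c(mass = mass, qlt = sum(cor), inr = inr))   # Lndr row:  101 1000  120
```

The quality is exactly 1000 here only because every available dimension is included; with nd smaller than the maximum it would drop below 1000.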

The function summary.ca() returns a list with three components: scree, rows and columns.

Use the R code below to get the table containing the results for rows:

summary(res.ca)$rows
   name mass  qlt  inr  k=1 cor ctr  k=2 cor ctr  k=3 cor ctr
1  Lndr  101 1000  120 -992 740 183 -495 185  56 -317  75  80
2  Mn_m   88 1000   81 -876 742 124 -490 232  47 -164  26  19
3  Dnnr   62 1000   34 -693 777  55 -308 154  13 -207  70  21
4  Brkf   80 1000   37 -509 505  38 -453 400  37  220  95  31
5  Tdyn   70 1000   22 -394 440  20  434 535  30  -94  25   5
6  Dshs   65 1000   18 -189 118   4  442 646  28  267 236  36
7  Shpp   69 1000   13 -118  64   2  403 748  25  203 189  22
8  Offc   55 1000   48  227  53   5 -254  66   8  923 881 369
9  Drvn   80 1000   91  742 432  81 -653 335  76  544 233 186
10 Fnnc   65 1000   27  271 161   9  618 837  56   35   3   1
11 Insr   80 1000   52  647 576  61  474 309  40 -289 115  53
12 Rprs   95 1000  281 1529 707 407 -864 226 159 -472  67 166
13 Hldy   92 1000  176  252  30  11 1435 962 425 -130   8  12

The summary for column variables is:

summary(res.ca)$columns
  name mass  qlt  inr  k=1 cor ctr  k=2 cor ctr  k=3 cor ctr
1 Wife  344 1000  270 -838 802 445 -365 152 103 -200  46 108
2 Altr  146 1000  106  -62   5   1 -292 105  28  849 890 825
3 Hsbn  218 1000  342 1161 772 542 -602 208 178 -189  20  61
4 Jntl  292 1000  282  149  21  12 1027 977 691  -46   2   5

Interpretation of CA outputs

The interpretation of correspondence analysis has been described in my previous post: Correspondence Analysis in R: The Ultimate Guide for the Analysis, the Visualization and the Interpretation.

Eigenvalues and scree plot

The proportion of inertia explained by the principal dimensions can be extracted using the function get_eigenvalue() [in factoextra] as follows:

eigenvalues <- get_eigenvalue(res.ca)
eigenvalues
      eigenvalue variance.percent cumulative.variance.percent
Dim.1  0.5428893         48.69222                    48.69222
Dim.2  0.4450028         39.91269                    88.60491
Dim.3  0.1270484         11.39509                   100.00000
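
The percentage columns are simply each eigenvalue divided by the total inertia. A minimal base R sketch, using the eigenvalues printed above (the variable names are illustrative):

```r
eig <- c(Dim.1 = 0.5428893, Dim.2 = 0.4450028, Dim.3 = 0.1270484)

pct <- 100 * eig / sum(eig)   # percentage of inertia per dimension
cum <- cumsum(pct)            # cumulative percentage

round(cbind(eigenvalue = eig,
            variance.percent = pct,
            cumulative.variance.percent = cum), 5)
```

The sum of the eigenvalues is the total inertia reported in the summary (about 1.11494).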

The function fviz_screeplot() [in factoextra package] can be used to draw the scree plot (the percentages of inertia explained by the CA dimensions):

fviz_screeplot(res.ca)

[Figure: scree plot of the percentage of inertia explained by each CA dimension]

Read more about eigenvalues and screeplot: Eigenvalues data visualization

Biplot of row and column variables

The plot() method [in ca package] can be used:

plot(res.ca)

[Figure: CA biplot of rows and columns drawn with plot()]

It’s also possible to use the function fviz_ca_biplot() [in factoextra]:

fviz_ca_biplot(res.ca)

[Figure: CA biplot of rows and columns drawn with fviz_ca_biplot()]

Read more about fviz_ca_biplot(): fviz_ca_biplot

Infos

This analysis was performed using R software (ver. 3.1.2), ca (ver. 0.58) and factoextra (ver. 1.0.2).

