Multivariate statistics is “a subdivision of statistics encompassing the simultaneous observation and analysis of more than one outcome variable” (Wikipedia).
We could split multivariate methods into:
Variable interrelation
Classification (supervised and unsupervised)
Graphical methods
However, some methods can be used for both variable interrelation and classification (e.g. PCA), and most require their own plotting techniques. Furthermore, statisticians cannot even agree on a clear definition of multivariate statistics. Hence, we will look at the most common (and a few modern) multivariate analyses in no particular order.
But first, install/load the following packages and example data set:
# install (if needed) and/or load the required packages
pkgs <- c("ggplot2", "MASS", "corrplot", "scatterplot3d", "cluster", "randomForest", "caret")
out <- lapply(pkgs, function(y) {
  if (!y %in% installed.packages()[, "Package"]) install.packages(y)
  require(y, character.only = TRUE)
})
#read from website
hylaeformis_data <- read.csv("https://marceloarayasalas.weebly.com/uploads/2/5/5/2/25524573/hylaeformis_data.csv", stringsAsFactors = FALSE)
# or download manually and read from local file
hylaeformis_data <- read.csv("hylaeformis_data.csv", stringsAsFactors = FALSE)
# remove the 'selec' column (column 2)
hylaeformis_data <- hylaeformis_data[, -2]
theme_set(theme_classic(base_size = 50))
Many plotting methods for multivariate data are extensions of univariate tools. The goal is the same: to explore the data and look for patterns (or anomalies).
Let’s use the “iris” data set to create scatter plots:
pairs(iris[ , -5], pch = 16)
Now let’s try with a larger data set:
pairs(hylaeformis_data[ , sapply(hylaeformis_data, is.numeric)], pch = 16)
In this case the plot looks a little too busy. A useful alternative is the ‘corrplot’ package:
cm <- cor(hylaeformis_data[ , sapply(hylaeformis_data, is.numeric)])
corrplot(cm)
corrplot(cm, method = "ellipse")
corrplot.mixed(cm, upper = "ellipse")
Check out the ‘corrplot’ vignette for other options.
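One option worth highlighting is reordering the variables so that correlated blocks sit next to each other. A minimal sketch using the built-in mtcars data (not the frog data):

```r
library(corrplot)

# correlation matrix of the built-in mtcars data
M <- cor(mtcars)

# order = "hclust" groups correlated variables together;
# addrect draws rectangles around the requested number of variable clusters
corrplot(M, method = "ellipse", order = "hclust", addrect = 3)
```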
Another option is to look at the relationship of 3 variables at a time using 3D plots:
scatterplot3d(x = iris$Sepal.Length, y = iris$Sepal.Width,
z = iris$Petal.Length, highlight.3d=TRUE, col.axis="blue",
pch=20)
# with lines
scatterplot3d(x = iris$Petal.Width, y = iris$Sepal.Width,type = "h",
z = iris$Petal.Length, highlight.3d=TRUE, col.axis="blue",
pch=20)
# with frog data, flipping the axes
scatterplot3d(z = hylaeformis_data$duration, y = hylaeformis_data$meanfreq,
type = "h", color = as.numeric(as.factor(hylaeformis_data$Locality)),
x = hylaeformis_data$dfslope, highlight.3d=FALSE, col.axis="blue",
pch=20)
# frog data, default axis order
scatterplot3d(x = hylaeformis_data$duration, y = hylaeformis_data$meanfreq,
type = "h", color = as.numeric(as.factor(hylaeformis_data$Locality)),
z = hylaeformis_data$dfslope, highlight.3d=FALSE, col.axis="blue",
pch=20)
PCA performs an orthogonal transformation of correlated variables. It is used to:
- Reduce collinearity
- Reduce dimensionality
- Explore natural grouping of the data
pca <- prcomp(hylaeformis_data[ , sapply(hylaeformis_data, is.numeric)], scale. = TRUE)
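To see what prcomp() is doing under the hood, here is a minimal sketch (using the built-in iris data so it runs on its own) that reproduces the PCA via an eigendecomposition of the correlation matrix:

```r
X <- scale(iris[, -5])        # center and scale the numeric columns
eig <- eigen(cov(X))          # cov of scaled data = correlation matrix
scores <- X %*% eig$vectors   # the orthogonal transformation: PC scores

# compare against prcomp(): the variances and (up to sign) the scores match
pca_iris <- prcomp(iris[, -5], scale. = TRUE)
all.equal(eig$values, pca_iris$sdev^2)
all.equal(abs(scores), abs(pca_iris$x), check.attributes = FALSE)
```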
There are 3 things to look for when performing PCA:
Variance explained by component:
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 2.553 1.833 1.489 1.1284 0.9354 0.779 0.6308 0.5584
## Proportion of Variance 0.407 0.210 0.139 0.0796 0.0547 0.038 0.0249 0.0195
## Cumulative Proportion 0.407 0.617 0.756 0.8354 0.8901 0.928 0.9529 0.9724
## PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.38681 0.33831 0.2857 0.22460 0.16353 0.10432
## Proportion of Variance 0.00935 0.00715 0.0051 0.00315 0.00167 0.00068
## Cumulative Proportion 0.98173 0.98888 0.9940 0.99714 0.99881 0.99949
## PC15 PC16
## Standard deviation 0.08648 0.02663
## Proportion of Variance 0.00047 0.00004
## Cumulative Proportion 0.99996 1.00000
We can look at the variance explained by each principal component by making a scree plot, which we can do in R using the “screeplot()” function:
screeplot(pca, type="lines", col = "red2", pch = 20, cex = 3)
or simply a barplot:
imp <- summary(pca)$importance
barplot(imp[2, 1:10], col = adjustcolor("blue", alpha.f = 0.3))
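A common rule of thumb is to keep enough components to explain most of the variance (the 90% cutoff below is an arbitrary choice for illustration, not a fixed rule). A sketch with the built-in iris data:

```r
pca_iris <- prcomp(iris[, -5], scale. = TRUE)

# cumulative proportion of variance explained
cum_var <- cumsum(pca_iris$sdev^2) / sum(pca_iris$sdev^2)

# smallest number of components reaching 90%
n_keep <- which(cum_var >= 0.90)[1]
n_keep  # in iris, 2 components already explain > 95% of the variance
```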
We should also look at the “loadings”: the contribution of each variable to each of the principal components:
pca$rotation
| | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | PC10 | PC11 | PC12 | PC13 | PC14 | PC15 | PC16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
duration | 0.371 | -0.102 | 0.046 | -0.078 | 0.127 | -0.057 | 0.160 | -0.063 | 0.340 | -0.107 | -0.120 | 0.197 | 0.119 | -0.174 | 0.747 | 0.110 |
meanfreq | -0.047 | -0.479 | 0.190 | 0.103 | -0.149 | -0.138 | -0.384 | 0.090 | -0.208 | -0.061 | 0.296 | 0.434 | 0.433 | 0.096 | 0.042 | -0.053 |
sd | 0.236 | -0.334 | -0.004 | 0.005 | -0.362 | -0.268 | -0.402 | 0.193 | 0.033 | -0.189 | -0.368 | -0.427 | -0.258 | -0.093 | 0.008 | 0.028 |
freq.IQR | 0.010 | -0.334 | -0.270 | 0.023 | -0.568 | 0.197 | 0.602 | 0.170 | -0.076 | -0.042 | 0.136 | 0.122 | -0.144 | -0.007 | 0.001 | 0.028 |
time.median | 0.361 | -0.165 | 0.089 | -0.091 | 0.062 | 0.094 | -0.012 | -0.065 | 0.069 | 0.336 | 0.194 | -0.135 | 0.050 | 0.174 | -0.217 | 0.746 |
time.Q25 | 0.353 | -0.177 | 0.080 | -0.099 | 0.033 | 0.132 | -0.033 | -0.067 | 0.002 | 0.521 | 0.328 | -0.264 | -0.071 | -0.094 | 0.062 | -0.581 |
time.IQR | 0.363 | -0.108 | 0.079 | -0.081 | 0.112 | -0.118 | 0.165 | -0.008 | 0.484 | -0.253 | -0.128 | 0.269 | 0.056 | 0.153 | -0.555 | -0.262 |
kurt | 0.289 | 0.114 | 0.144 | 0.081 | 0.286 | -0.501 | 0.322 | 0.431 | -0.475 | -0.071 | 0.124 | -0.020 | -0.088 | -0.014 | -0.007 | 0.007 |
time.ent | -0.302 | 0.190 | 0.002 | 0.069 | -0.230 | -0.465 | 0.024 | 0.347 | 0.536 | 0.366 | 0.190 | -0.043 | 0.085 | 0.077 | 0.062 | 0.028 |
mindom | -0.170 | -0.125 | 0.520 | 0.241 | -0.025 | -0.037 | 0.228 | -0.296 | 0.124 | -0.362 | 0.295 | -0.413 | -0.048 | 0.258 | 0.107 | -0.019 |
maxdom | -0.276 | -0.332 | -0.025 | -0.112 | 0.328 | -0.004 | 0.020 | 0.103 | 0.175 | -0.143 | 0.270 | -0.070 | -0.114 | -0.704 | -0.171 | 0.120 |
dfrange | -0.171 | -0.288 | -0.327 | -0.246 | 0.405 | 0.177 | -0.047 | 0.383 | 0.086 | -0.127 | 0.048 | -0.185 | -0.028 | 0.533 | 0.165 | -0.066 |
modindx | -0.150 | -0.132 | -0.174 | -0.617 | -0.055 | -0.504 | 0.131 | -0.488 | -0.154 | 0.034 | -0.005 | -0.040 | 0.054 | 0.081 | 0.001 | -0.013 |
startdom | -0.195 | -0.379 | 0.039 | 0.339 | 0.214 | -0.066 | 0.291 | -0.067 | -0.054 | 0.354 | -0.545 | -0.142 | 0.333 | -0.028 | -0.061 | -0.036 |
enddom | -0.211 | -0.168 | 0.492 | -0.189 | 0.051 | 0.044 | -0.006 | 0.064 | -0.012 | 0.249 | -0.209 | 0.383 | -0.608 | 0.122 | 0.062 | 0.022 |
dfslope | -0.054 | 0.141 | 0.442 | -0.533 | -0.179 | 0.258 | 0.115 | 0.339 | -0.054 | -0.056 | -0.167 | -0.187 | 0.432 | -0.121 | -0.038 | -0.019 |
Exercise 1
Make a ggplot facet graph with the loadings for each principal component on an individual panel.
Once the results of the PCA have been explored, we can look at the grouping of our data based on these new dimensions. To accomplish this, we need to extract the scores for each PC:
pcs <- as.data.frame(pca$x)
head(pcs)
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9
## 1 -0.424 0.770 -1.477 0.9932 -0.141 -0.813 -0.5045 0.36835 -0.2193
## 2 -1.401 0.524 1.313 1.1524 -0.676 0.322 -0.6450 -0.03217 0.2597
## 3 -1.611 1.007 0.672 -0.9232 -0.516 -0.424 -0.6112 -0.17224 -0.6613
## 4 -1.784 0.874 -0.218 -1.3177 0.860 -0.248 0.0106 0.66222 -0.0634
## 5 -3.511 -1.205 2.797 -0.0395 0.314 0.844 0.1752 0.00871 0.1391
## 6 -3.034 -0.852 2.861 -1.1812 -0.124 1.491 -0.2048 1.26911 -0.0202
## PC10 PC11 PC12 PC13 PC14 PC15 PC16
## 1 0.0376 0.0306 0.00188 0.3229 -0.12368 -0.0218 -0.02406
## 2 0.1521 -0.1340 0.04512 0.0516 -0.02768 0.1260 0.00872
## 3 0.1365 -0.1263 0.07417 0.0949 0.00483 0.1203 -0.00773
## 4 0.7211 -0.7491 -0.32497 -0.1728 0.12733 0.1114 -0.01912
## 5 0.0834 0.0400 -0.02238 0.2161 -0.04933 0.1561 0.04586
## 6 -0.3972 -0.0936 0.14146 0.3600 -0.13859 -0.1374 -0.00721
Now the first two components can be plotted:
pcs$locality <- hylaeformis_data$Locality
ggplot(pcs, aes(x = PC1, y = PC2, color = locality)) +
stat_ellipse() +
geom_point() +
theme_classic()
We can also use a 3D visualization to plot 3 PCs at a time:
scatterplot3d(x = pcs$PC1, y = pcs$PC3, z = pcs$PC2, type = "h",
color = as.numeric(as.factor(hylaeformis_data$Locality)),
pch=20)
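When plotting scores it also helps to label the axes with the proportion of variance each component explains, so readers can judge how much of the total structure the plot actually captures. A sketch with the built-in iris data:

```r
library(ggplot2)

pca_iris <- prcomp(iris[, -5], scale. = TRUE)
scores <- as.data.frame(pca_iris$x)
scores$species <- iris$Species

# percent of variance explained by each component
var_pct <- round(100 * pca_iris$sdev^2 / sum(pca_iris$sdev^2), 1)

ggplot(scores, aes(x = PC1, y = PC2, color = species)) +
  geom_point() +
  stat_ellipse() +
  labs(x = paste0("PC1 (", var_pct[1], "%)"),
       y = paste0("PC2 (", var_pct[2], "%)")) +
  theme_classic()
```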
Discriminant function analysis (DFA) aims to predict a categorical variable based on a set of continuous variables. It is widely used in the biological sciences (e.g. morphometrics).
First, we have to create a DFA model:
fd <- lda(Locality ~ ., hylaeformis_data[, !names(hylaeformis_data) %in% c("sound.files", "Species")])
And then apply to the original data set:
pred <- predict(fd, hylaeformis_data[, !names(hylaeformis_data) %in% c("sound.files", "Species", "Locality")])
We can look at the accuracy of the prediction to see how well the model performed. This is typically shown by the “confusion matrix”:
ct <- table(pred$class, hylaeformis_data$Locality)
prop.table(ct, 1)
##
## Dantas Monserrat Tapanti
## Dantas 0.8750 0.1250 0.0000
## Monserrat 0.0556 0.9444 0.0000
## Tapanti 0.0000 0.0000 1.0000
diag(prop.table(ct, 1))
## Dantas Monserrat Tapanti
## 0.875 0.944 1.000
sum(diag(prop.table(ct)))
## [1] 0.927
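Note that predicting on the same data used to fit the model overestimates accuracy. ‘lda()’ can instead return leave-one-out cross-validated predictions via its CV argument; a sketch with the built-in iris data:

```r
library(MASS)

# CV = TRUE returns the leave-one-out predicted class for each observation
fd_cv <- lda(Species ~ ., data = iris, CV = TRUE)

ct_cv <- table(fd_cv$class, iris$Species)
sum(diag(ct_cv)) / sum(ct_cv)  # cross-validated overall accuracy
```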
Similar to PCA, we can plot the discriminant functions resulting from the DFA to visualize the separation among groups:
plot(fd)
The default graph is not the best, though. Let’s try our own:
fd.v <- as.data.frame(pred$x)
fd.v$locality <- hylaeformis_data$Locality
ggplot(fd.v, aes(x = LD1, y = LD2, color = locality)) +
stat_ellipse() +
geom_point() +
theme_classic()
Exercise 2
The ‘MASS’ package can also perform a quadratic discriminant function analysis.
2.1 Find the quadratic discriminant function analysis in ‘MASS’ and run it on the frog call data (“hylaeformis_data”)
2.2 Compare the confusion matrix to the one of the linear discriminant function analysis
2.3 Plot the results using a scatterplot on ‘ggplot2’
Cluster analysis describes a range of algorithms for investigating structure in data; the main interest is in finding groups of objects that are more similar to each other than to the rest. Several algorithms have been developed for cluster analysis:
K-means clustering is the most popular partitioning method. It requires the analyst to specify the number of clusters to extract in advance. A plot of the within-groups sum of squares against the number of clusters extracted can help determine the appropriate number of clusters:
# prepare data
x <- as.matrix(hylaeformis_data[, sapply(hylaeformis_data, is.numeric)])
rownames(x) <- hylaeformis_data$Locality
# K-Means Cluster Analysis
fit <- kmeans(x, 3) # 3 cluster solution
# plot result
clusplot(x, fit$cluster, color=TRUE, shade=TRUE,
labels=2, lines=0)
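The within-groups sum of squares plot mentioned above can be sketched as follows (with the built-in iris data; the “elbow” in the curve suggests a reasonable number of clusters):

```r
x_iris <- scale(iris[, -5])

# total within-groups sum of squares for k = 1..8 clusters
set.seed(10)
wss <- sapply(1:8, function(k)
  kmeans(x_iris, centers = k, nstart = 25)$tot.withinss)

plot(1:8, wss, type = "b", pch = 20,
     xlab = "Number of clusters",
     ylab = "Within-groups sum of squares")
```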
Cluster analysis also encompasses a wide range of hierarchical clustering approaches. Let’s try a simple example with Ward’s method:
d <- dist(x, method = "euclidean") # distance matrix
fit <- hclust(d, method="ward.D")
plot(fit, main = "")
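‘cutree()’ extracts the group memberships from a dendrogram at a chosen number of clusters, which makes it easy to compare the clustering to known labels. A sketch with the built-in iris data:

```r
x_iris <- scale(iris[, -5])
fit_h <- hclust(dist(x_iris, method = "euclidean"), method = "ward.D")

# cut the dendrogram into 3 groups and compare to the known species
groups <- cutree(fit_h, k = 3)
table(groups, iris$Species)
```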
Divisive analysis works in the opposite, top-down direction: it starts with one large cluster containing all n observations and repeatedly splits clusters, within each subcluster in turn, until every cluster contains only a single observation.
Divisive cluster analysis can be done in R using the ‘cluster’ package:
hyla.diana <- diana(x)
plot(hyla.diana, which = 2, cex = 0.6,
main = "Dendrogram", xlab = "")
Note that these methods use slightly different algorithms for estimating the clusters. You should understand the requirements and assumptions of each method before applying it to your data set.
There are also some fancy visualizations of clustering results, like the ‘heatmap’ function from the default ‘stats’ package:
rc <- rainbow(nrow(x), start=0, end=.3)
cc <- rainbow(ncol(x), start=0, end=.3)
hv <- heatmap(x, col = cm.colors(256), scale="column",
RowSideColors = rc, ColSideColors = cc, margin=c(5,10))
Exercise 3
Try 2 other clustering algorithms and compare the results to those from the ‘ward.D’ method (look at the ‘hclust’ function documentation).
Random forest is an ensemble learning method for classification, regression and other tasks. It works by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes of the individual trees (Denisko & Hoffman 2018). Thousands of decision trees can be produced to find the combination of variables that best splits the groups. Random decision forests correct for decision trees’ habit of overfitting their training set.
Random forest can be implemented in the ‘randomForest’ package:
rfModel <- randomForest(x = hylaeformis_data[, sapply(hylaeformis_data, is.numeric)], y = as.factor(hylaeformis_data$Locality), importance = F, proximity = F, ntree = 10000, mtry = 6)
conf.m <- confusionMatrix(data = rfModel$predicted, reference = as.factor(hylaeformis_data$Locality))
conf.m
## Confusion Matrix and Statistics
##
## Reference
## Prediction Dantas Monserrat Tapanti
## Dantas 20 2 0
## Monserrat 2 18 0
## Tapanti 0 0 13
##
## Overall Statistics
##
## Accuracy : 0.927
## 95% CI : (0.824, 0.98)
## No Information Rate : 0.4
## P-Value [Acc > NIR] : 2.36e-16
##
## Kappa : 0.888
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Dantas Class: Monserrat Class: Tapanti
## Sensitivity 0.909 0.900 1.000
## Specificity 0.939 0.943 1.000
## Pos Pred Value 0.909 0.900 1.000
## Neg Pred Value 0.939 0.943 1.000
## Prevalence 0.400 0.364 0.236
## Detection Rate 0.364 0.327 0.236
## Detection Prevalence 0.400 0.364 0.236
## Balanced Accuracy 0.924 0.921 1.000
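Setting importance = TRUE (the chunk above used importance = F) also lets randomForest report how much each variable contributes to the classification. A sketch with the built-in iris data:

```r
library(randomForest)

set.seed(1)
rf_iris <- randomForest(Species ~ ., data = iris,
                        importance = TRUE, ntree = 500)

importance(rf_iris)  # mean decrease in accuracy and in Gini index per predictor
varImpPlot(rf_iris)  # dotchart of the same measures
```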
Check the CRAN Multivariate Task View for a more comprehensive list of multivariate tools in R.
Denisko D, Hoffman MM (2018). “Classification and interaction in random forests”. Proceedings of the National Academy of Sciences of the United States of America. 115 (8): 1690–1692.
Session information
## R version 3.4.3 (2017-11-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/openblas-base/libblas.so.3
## LAPACK: /usr/lib/libopenblasp-r0.2.18.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] caret_6.0-78 lattice_0.20-35 randomForest_4.6-12
## [4] kableExtra_0.7.0 cluster_2.0.6 scatterplot3d_0.3-40
## [7] corrplot_0.84 MASS_7.3-49 knitr_1.20
## [10] RColorBrewer_1.1-2 ggplot2_2.2.1
##
## loaded via a namespace (and not attached):
## [1] httr_1.3.1 ddalpha_1.3.1 tidyr_0.7.2
## [4] sfsmisc_1.1-1 viridisLite_0.2.0 splines_3.4.3
## [7] foreach_1.4.4 prodlim_1.6.1 assertthat_0.2.0
## [10] stats4_3.4.3 DRR_0.0.3 yaml_2.1.16
## [13] robustbase_0.92-8 ipred_0.9-6 pillar_1.0.1
## [16] backports_1.1.2 glue_1.2.0 digest_0.6.13
## [19] rvest_0.3.2 colorspace_1.3-2 recipes_0.1.2
## [22] htmltools_0.3.6 Matrix_1.2-11 plyr_1.8.4
## [25] psych_1.7.8 timeDate_3042.101 pkgconfig_2.0.1
## [28] CVST_0.2-1 broom_0.4.3 purrr_0.2.4
## [31] scales_0.5.0 gower_0.1.2 lava_1.6
## [34] tibble_1.4.1 withr_2.1.1 nnet_7.3-12
## [37] lazyeval_0.2.1 mnormt_1.5-5 survival_2.41-3
## [40] magrittr_1.5 evaluate_0.10.1 nlme_3.1-131
## [43] xml2_1.2.0 dimRed_0.1.0 foreign_0.8-69
## [46] class_7.3-14 tools_3.4.3 hms_0.4.0
## [49] stringr_1.2.0 kernlab_0.9-25 munsell_0.4.3
## [52] bindrcpp_0.2 e1071_1.6-8 compiler_3.4.3
## [55] RcppRoll_0.2.2 rlang_0.1.6 grid_3.4.3
## [58] iterators_1.0.9 labeling_0.3 rmarkdown_1.8
## [61] gtable_0.2.0 ModelMetrics_1.1.0 codetools_0.2-15
## [64] reshape2_1.4.3 R6_2.2.2 lubridate_1.7.1
## [67] dplyr_0.7.4 bindr_0.1 rprojroot_1.3-2
## [70] readr_1.1.1 stringi_1.1.6 parallel_3.4.3
## [73] Rcpp_0.12.14 rpart_4.1-13 tidyselect_0.2.3
## [76] DEoptimR_1.0-8