Multivariate statistics is “a subdivision of statistics encompassing the simultaneous observation and analysis of more than one outcome variable” (Wikipedia).
We could split multivariate methods into:
Variable interrelation
Classification (supervised and unsupervised)
Graphical methods
However, some methods can be used for both variable interrelation and classification (e.g. PCA), and most require their own plotting techniques. Furthermore, statisticians cannot even agree on a clear definition of multivariate statistics. Hence, we will look at the most common (and a few modern) multivariate analyses in no particular order.
But first, install/load the following packages and example data set:
# install (if needed) and/or load the required packages
pkgs <- c("ggplot2", "MASS", "corrplot", "scatterplot3d", "cluster", "randomForest", "caret")
out <- lapply(pkgs, function(y) {
  if (!y %in% installed.packages()[, "Package"]) install.packages(y)
  require(y, character.only = TRUE)
})
#read from website
hylaeformis_data <- read.csv("https://marceloarayasalas.weebly.com/uploads/2/5/5/2/25524573/hylaeformis_data.csv", stringsAsFactors = FALSE)
# or download manually and read from local file
hylaeformis_data <- read.csv("hylaeformis_data.csv", stringsAsFactors = FALSE)
# remove the 'selec' column (column 2)
hylaeformis_data <- hylaeformis_data[, -2]
theme_set(theme_classic(base_size = 50))
Many plotting methods for multivariate data are extensions of univariate tools. The goal is the same: to explore the data and look for patterns (or anomalies).
Let’s use the “iris” data set to create scatter plots:
pairs(iris[ , -5], pch = 16)
Now let’s try with a larger data set:
pairs(hylaeformis_data[ , sapply(hylaeformis_data, is.numeric)], pch = 16)
In this case the plot looks a little too busy. A useful alternative is the ‘corrplot’ package:
cm <- cor(hylaeformis_data[ , sapply(hylaeformis_data, is.numeric)])
corrplot(cm)
corrplot(cm, method = "ellipse")
corrplot.mixed(cm, upper = "ellipse")
Check out the ‘corrplot’ vignette for other options.
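One option worth highlighting is reordering the variables so that correlated blocks sit next to each other. A minimal sketch using the built-in mtcars data (not the frog data):

```r
library(corrplot)

# correlation matrix of the built-in mtcars data
M <- cor(mtcars)

# order = "hclust" groups correlated variables together;
# addrect draws rectangles around the requested number of variable clusters
corrplot(M, method = "ellipse", order = "hclust", addrect = 3)
```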
Another option is to look at the relationship of 3 variables at a time using 3D plots:
scatterplot3d(x = iris$Sepal.Length, y = iris$Sepal.Width,
z = iris$Petal.Length, highlight.3d=TRUE, col.axis="blue",
pch=20)
# with lines
scatterplot3d(x = iris$Petal.Width, y = iris$Sepal.Width,type = "h",
z = iris$Petal.Length, highlight.3d=TRUE, col.axis="blue",
pch=20)
# with frog data, flipping the axes
scatterplot3d(z = hylaeformis_data$duration, y = hylaeformis_data$meanfreq,
type = "h", color = as.numeric(as.factor(hylaeformis_data$Locality)),
x = hylaeformis_data$dfslope, highlight.3d=FALSE, col.axis="blue",
pch=20)
# frog data, default axis order
scatterplot3d(x = hylaeformis_data$duration, y = hylaeformis_data$meanfreq,
type = "h", color = as.numeric(as.factor(hylaeformis_data$Locality)),
z = hylaeformis_data$dfslope, highlight.3d=FALSE, col.axis="blue",
pch=20)
PCA performs an orthogonal transformation of correlated variables. It is used to:
- Reduce collinearity
- Reduce dimensionality
- Explore natural grouping of the data
pca <- prcomp(hylaeformis_data[ , sapply(hylaeformis_data, is.numeric)], scale. = TRUE)
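To see what prcomp() is doing under the hood, here is a minimal sketch (using the built-in iris data so it runs on its own) that reproduces the PCA via an eigendecomposition of the correlation matrix:

```r
X <- scale(iris[, -5])        # center and scale the numeric columns
eig <- eigen(cov(X))          # cov of scaled data = correlation matrix
scores <- X %*% eig$vectors   # the orthogonal transformation: PC scores

# compare against prcomp(): the variances and (up to sign) the scores match
pca_iris <- prcomp(iris[, -5], scale. = TRUE)
all.equal(eig$values, pca_iris$sdev^2)
all.equal(abs(scores), abs(pca_iris$x), check.attributes = FALSE)
```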
There are 3 things to look for when performing PCA:
Variance explained by component:
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 2.553 1.833 1.489 1.1284 0.9354 0.779 0.6308 0.5584
## Proportion of Variance 0.407 0.210 0.139 0.0796 0.0547 0.038 0.0249 0.0195
## Cumulative Proportion 0.407 0.617 0.756 0.8354 0.8901 0.928 0.9529 0.9724
## PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.38681 0.33831 0.2857 0.22460 0.16353 0.10432
## Proportion of Variance 0.00935 0.00715 0.0051 0.00315 0.00167 0.00068
## Cumulative Proportion 0.98173 0.98888 0.9940 0.99714 0.99881 0.99949
## PC15 PC16
## Standard deviation 0.08648 0.02663
## Proportion of Variance 0.00047 0.00004
## Cumulative Proportion 0.99996 1.00000
We can look at the variance explained by each principal component by making a scree plot, which we can do in R using the “screeplot()” function:
screeplot(pca, type="lines", col = "red2", pch = 20, cex = 3)
or simply a barplot:
imp <- summary(pca)$importance
barplot(imp[2, 1:10], col = adjustcolor("blue", alpha.f = 0.3))
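A common rule of thumb is to keep enough components to explain most of the variance (the 90% cutoff below is an arbitrary choice for illustration, not a fixed rule). A sketch with the built-in iris data:

```r
pca_iris <- prcomp(iris[, -5], scale. = TRUE)

# cumulative proportion of variance explained
cum_var <- cumsum(pca_iris$sdev^2) / sum(pca_iris$sdev^2)

# smallest number of components reaching 90%
n_keep <- which(cum_var >= 0.90)[1]
n_keep  # in iris, 2 components already explain > 95% of the variance
```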
We should also look at the “loadings”: the contribution of each variable to each of the principal components:
pca$rotation
| | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | PC10 | PC11 | PC12 | PC13 | PC14 | PC15 | PC16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
duration | 0.371 | -0.102 | 0.046 | -0.078 | 0.127 | -0.057 | 0.160 | -0.063 | 0.340 | -0.107 | -0.120 | 0.197 | 0.119 | -0.174 | 0.747 | 0.110 |
meanfreq | -0.047 | -0.479 | 0.190 | 0.103 | -0.149 | -0.138 | -0.384 | 0.090 | -0.208 | -0.061 | 0.296 | 0.434 | 0.433 | 0.096 | 0.042 | -0.053 |
sd | 0.236 | -0.334 | -0.004 | 0.005 | -0.362 | -0.268 | -0.402 | 0.193 | 0.033 | -0.189 | -0.368 | -0.427 | -0.258 | -0.093 | 0.008 | 0.028 |
freq.IQR | 0.010 | -0.334 | -0.270 | 0.023 | -0.568 | 0.197 | 0.602 | 0.170 | -0.076 | -0.042 | 0.136 | 0.122 | -0.144 | -0.007 | 0.001 | 0.028 |
time.median | 0.361 | -0.165 | 0.089 | -0.091 | 0.062 | 0.094 | -0.012 | -0.065 | 0.069 | 0.336 | 0.194 | -0.135 | 0.050 | 0.174 | -0.217 | 0.746 |
time.Q25 | 0.353 | -0.177 | 0.080 | -0.099 | 0.033 | 0.132 | -0.033 | -0.067 | 0.002 | 0.521 | 0.328 | -0.264 | -0.071 | -0.094 | 0.062 | -0.581 |
time.IQR | 0.363 | -0.108 | 0.079 | -0.081 | 0.112 | -0.118 | 0.165 | -0.008 | 0.484 | -0.253 | -0.128 | 0.269 | 0.056 | 0.153 | -0.555 | -0.262 |
kurt | 0.289 | 0.114 | 0.144 | 0.081 | 0.286 | -0.501 | 0.322 | 0.431 | -0.475 | -0.071 | 0.124 | -0.020 | -0.088 | -0.014 | -0.007 | 0.007 |
time.ent | -0.302 | 0.190 | 0.002 | 0.069 | -0.230 | -0.465 | 0.024 | 0.347 | 0.536 | 0.366 | 0.190 | -0.043 | 0.085 | 0.077 | 0.062 | 0.028 |
mindom | -0.170 | -0.125 | 0.520 | 0.241 | -0.025 | -0.037 | 0.228 | -0.296 | 0.124 | -0.362 | 0.295 | -0.413 | -0.048 | 0.258 | 0.107 | -0.019 |
maxdom | -0.276 | -0.332 | -0.025 | -0.112 | 0.328 | -0.004 | 0.020 | 0.103 | 0.175 | -0.143 | 0.270 | -0.070 | -0.114 | -0.704 | -0.171 | 0.120 |
dfrange | -0.171 | -0.288 | -0.327 | -0.246 | 0.405 | 0.177 | -0.047 | 0.383 | 0.086 | -0.127 | 0.048 | -0.185 | -0.028 | 0.533 | 0.165 | -0.066 |
modindx | -0.150 | -0.132 | -0.174 | -0.617 | -0.055 | -0.504 | 0.131 | -0.488 | -0.154 | 0.034 | -0.005 | -0.040 | 0.054 | 0.081 | 0.001 | -0.013 |
startdom | -0.195 | -0.379 | 0.039 | 0.339 | 0.214 | -0.066 | 0.291 | -0.067 | -0.054 | 0.354 | -0.545 | -0.142 | 0.333 | -0.028 | -0.061 | -0.036 |
enddom | -0.211 | -0.168 | 0.492 | -0.189 | 0.051 | 0.044 | -0.006 | 0.064 | -0.012 | 0.249 | -0.209 | 0.383 | -0.608 | 0.122 | 0.062 | 0.022 |
dfslope | -0.054 | 0.141 | 0.442 | -0.533 | -0.179 | 0.258 | 0.115 | 0.339 | -0.054 | -0.056 | -0.167 | -0.187 | 0.432 | -0.121 | -0.038 | -0.019 |
Exercise 1
Make a ggplot facet graph with the loadings for each principal component on an individual panel.
Once the results of the PCA have been explored, we can look at the grouping of our data based on these new dimensions. To accomplish this, we need to extract the scores for each PC:
pcs <- as.data.frame(pca$x)
head(pcs)
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9
## 1 -0.424 0.770 -1.477 0.9932 -0.141 -0.813 -0.5045 0.36835 -0.2193
## 2 -1.401 0.524 1.313 1.1524 -0.676 0.322 -0.6450 -0.03217 0.2597
## 3 -1.611 1.007 0.672 -0.9232 -0.516 -0.424 -0.6112 -0.17224 -0.6613
## 4 -1.784 0.874 -0.218 -1.3177 0.860 -0.248 0.0106 0.66222 -0.0634
## 5 -3.511 -1.205 2.797 -0.0395 0.314 0.844 0.1752 0.00871 0.1391
## 6 -3.034 -0.852 2.861 -1.1812 -0.124 1.491 -0.2048 1.26911 -0.0202
## PC10 PC11 PC12 PC13 PC14 PC15 PC16
## 1 0.0376 0.0306 0.00188 0.3229 -0.12368 -0.0218 -0.02406
## 2 0.1521 -0.1340 0.04512 0.0516 -0.02768 0.1260 0.00872
## 3 0.1365 -0.1263 0.07417 0.0949 0.00483 0.1203 -0.00773
## 4 0.7211 -0.7491 -0.32497 -0.1728 0.12733 0.1114 -0.01912
## 5 0.0834 0.0400 -0.02238 0.2161 -0.04933 0.1561 0.04586
## 6 -0.3972 -0.0936 0.14146 0.3600 -0.13859 -0.1374 -0.00721
Now the first two components can be plotted:
pcs$locality <- hylaeformis_data$Locality
ggplot(pcs, aes(x = PC1, y = PC2, color = locality)) +
stat_ellipse() +
geom_point() +
theme_classic()
We can also use a 3D visualization to plot 3 PCs at a time:
scatterplot3d(x = pcs$PC1, y = pcs$PC3, z = pcs$PC2, type = "h",
color = as.numeric(as.factor(hylaeformis_data$Locality)),
pch=20)
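When plotting scores it also helps to label the axes with the proportion of variance each component explains, so readers can judge how much of the total structure the plot actually captures. A sketch with the built-in iris data:

```r
library(ggplot2)

pca_iris <- prcomp(iris[, -5], scale. = TRUE)
scores <- as.data.frame(pca_iris$x)
scores$species <- iris$Species

# percent of variance explained by each component
var_pct <- round(100 * pca_iris$sdev^2 / sum(pca_iris$sdev^2), 1)

ggplot(scores, aes(x = PC1, y = PC2, color = species)) +
  geom_point() +
  stat_ellipse() +
  labs(x = paste0("PC1 (", var_pct[1], "%)"),
       y = paste0("PC2 (", var_pct[2], "%)")) +
  theme_classic()
```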
Discriminant function analysis (DFA) aims to predict a categorical variable based on a set of continuous variables. It is widely used in the biological sciences (e.g. morphometrics).
First, we have to create a DFA model:
fd <- lda(Locality ~ ., hylaeformis_data[, !names(hylaeformis_data) %in% c("sound.files", "Species")])
And then apply to the original data set:
pred <- predict(fd, hylaeformis_data[, !names(hylaeformis_data) %in% c("sound.files", "Species", "Locality")])
We can look at the accuracy of the prediction to see how well the model performed. This is typically shown by the “confusion matrix”:
ct <- table(pred$class, hylaeformis_data$Locality)
prop.table(ct, 1)
##
## Dantas Monserrat Tapanti
## Dantas 0.8750 0.1250 0.0000
## Monserrat 0.0556 0.9444 0.0000
## Tapanti 0.0000 0.0000 1.0000
diag(prop.table(ct, 1))
## Dantas Monserrat Tapanti
## 0.875 0.944 1.000
sum(diag(prop.table(ct)))
## [1] 0.927
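Note that predicting on the same data used to fit the model overestimates accuracy. ‘lda()’ can instead return leave-one-out cross-validated predictions via its CV argument; a sketch with the built-in iris data:

```r
library(MASS)

# CV = TRUE returns the leave-one-out predicted class for each observation
fd_cv <- lda(Species ~ ., data = iris, CV = TRUE)

ct_cv <- table(fd_cv$class, iris$Species)
sum(diag(ct_cv)) / sum(ct_cv)  # cross-validated overall accuracy
```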
Similar to PCA, we can plot the discriminant functions resulting from the DFA to visualize the separation among groups:
plot(fd)
The default graph is not the best, though. Let’s try our own:
fd.v <- as.data.frame(pred$x)
fd.v$locality <- hylaeformis_data$Locality
ggplot(fd.v, aes(x = LD1, y = LD2, color = locality)) +
stat_ellipse() +
geom_point() +
theme_classic()
Exercise 2
The ‘MASS’ package can also perform a quadratic discriminant function analysis.
2.1 Find the quadratic discriminant function analysis in ‘MASS’ and run it on the frog call data (“hylaeformis_data”)
2.2 Compare the confusion matrix to the one of the linear discriminant function analysis
2.3 Plot the results using a scatterplot on ‘ggplot2’
Cluster analysis describes a range of algorithms for investigating structure in data; the main interest is in finding groups of objects that are more similar to each other than to the rest. Several algorithms have been developed for cluster analysis:
K-means clustering is the most popular partitioning method. It requires the analyst to specify the number of clusters to extract in advance. A plot of the within-groups sum of squares against the number of clusters extracted can help determine the appropriate number of clusters:
# prepare data
x <- as.matrix(hylaeformis_data[, sapply(hylaeformis_data, is.numeric)])
rownames(x) <- hylaeformis_data$Locality
# K-Means Cluster Analysis
fit <- kmeans(x, 3) # 3 cluster solution
# plot result
clusplot(x, fit$cluster, color=TRUE, shade=TRUE,
labels=2, lines=0)
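The within-groups sum of squares plot mentioned above can be sketched as follows (with the built-in iris data; the “elbow” in the curve suggests a reasonable number of clusters):

```r
x_iris <- scale(iris[, -5])

# total within-groups sum of squares for k = 1..8 clusters
set.seed(10)
wss <- sapply(1:8, function(k)
  kmeans(x_iris, centers = k, nstart = 25)$tot.withinss)

plot(1:8, wss, type = "b", pch = 20,
     xlab = "Number of clusters",
     ylab = "Within-groups sum of squares")
```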
Cluster analysis also encompasses a wide range of hierarchical clustering approaches. Let’s try a simple example with Ward’s method:
d <- dist(x, method = "euclidean") # distance matrix
fit <- hclust(d, method="ward.D")
plot(fit, main = "")
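‘cutree()’ extracts the group memberships from a dendrogram at a chosen number of clusters, which makes it easy to compare the clustering to known labels. A sketch with the built-in iris data:

```r
x_iris <- scale(iris[, -5])
fit_h <- hclust(dist(x_iris, method = "euclidean"), method = "ward.D")

# cut the dendrogram into 3 groups and compare to the known species
groups <- cutree(fit_h, k = 3)
table(groups, iris$Species)
```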
Divisive analysis works in the opposite, top-down direction: it starts with one large cluster containing all n observations and repeatedly splits clusters, within each subcluster in turn, until every cluster contains only a single observation.
Divisive cluster analysis can be done in R using the ‘cluster’ package:
hyla.diana <- diana(x)
plot(hyla.diana, which = 2, cex = 0.6,
main = "Dendrogram", xlab = "")
Note that these methods use slightly different algorithms for estimating the clusters. You should understand the requirements and assumptions of each method before applying it to your data set.
There are also some fancy visualizations of clustering results, like the ‘heatmap’ function from the default ‘stats’ package:
rc <- rainbow(nrow(x), start=0, end=.3)
cc <- rainbow(ncol(x), start=0, end=.3)
hv <- heatmap(x, col = cm.colors(256), scale="column",
RowSideColors = rc, ColSideColors = cc, margin=c(5,10))
Exercise 3
Try 2 other clustering algorithms and compare the results to those from the ‘ward.D’ method (look at the ‘hclust’ function documentation).
Random forest is an ensemble learning method for classification, regression and other tasks. It works by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes of the individual trees (Denisko & Hoffman 2018). Thousands of decision trees can be produced to find the combination of variables that best splits the groups. Random decision forests correct for decision trees’ habit of overfitting their training set.
Random forest can be implemented in the ‘randomForest’ package:
rfModel <- randomForest(x = hylaeformis_data[, sapply(hylaeformis_data, is.numeric)], y = as.factor(hylaeformis_data$Locality), importance = F, proximity = F, ntree = 10000, mtry = 6)
conf.m <- confusionMatrix(data = rfModel$predicted, reference = as.factor(hylaeformis_data$Locality))
conf.m
## Confusion Matrix and Statistics
##
## Reference
## Prediction Dantas Monserrat Tapanti
## Dantas 20 2 0
## Monserrat 2 18 0
## Tapanti 0 0 13
##
## Overall Statistics
##
## Accuracy : 0.927
## 95% CI : (0.824, 0.98)
## No Information Rate : 0.4
## P-Value [Acc > NIR] : 2.36e-16
##
## Kappa : 0.888
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Dantas Class: Monserrat Class: Tapanti
## Sensitivity 0.909 0.900 1.000
## Specificity 0.939 0.943 1.000
## Pos Pred Value 0.909 0.900 1.000
## Neg Pred Value 0.939 0.943 1.000
## Prevalence 0.400 0.364 0.236
## Detection Rate 0.364 0.327 0.236
## Detection Prevalence 0.400 0.364 0.236
## Balanced Accuracy 0.924 0.921 1.000
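Setting importance = TRUE (the chunk above used importance = F) also lets randomForest report how much each variable contributes to the classification. A sketch with the built-in iris data:

```r
library(randomForest)

set.seed(1)
rf_iris <- randomForest(Species ~ ., data = iris,
                        importance = TRUE, ntree = 500)

importance(rf_iris)  # mean decrease in accuracy and in Gini index per predictor
varImpPlot(rf_iris)  # dotchart of the same measures
```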
Check the CRAN Multivariate Task View for a more comprehensive list of multivariate tools in R.
Denisko D, Hoffman MM (2018). “Classification and interaction in random forests”. Proceedings of the National Academy of Sciences of the United States of America. 115 (8): 1690–1692.
Session information
## R version 3.4.3 (2017-11-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/openblas-base/libblas.so.3
## LAPACK: /usr/lib/libopenblasp-r0.2.18.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] caret_6.0-78 lattice_0.20-35 randomForest_4.6-12
## [4] kableExtra_0.7.0 cluster_2.0.6 scatterplot3d_0.3-40
## [7] corrplot_0.84 MASS_7.3-49 knitr_1.20
## [10] RColorBrewer_1.1-2 ggplot2_2.2.1
##
## loaded via a namespace (and not attached):
## [1] httr_1.3.1 ddalpha_1.3.1 tidyr_0.7.2
## [4] sfsmisc_1.1-1 viridisLite_0.2.0 splines_3.4.3
## [7] foreach_1.4.4 prodlim_1.6.1 assertthat_0.2.0
## [10] stats4_3.4.3 DRR_0.0.3 yaml_2.1.16
## [13] robustbase_0.92-8 ipred_0.9-6 pillar_1.0.1
## [16] backports_1.1.2 glue_1.2.0 digest_0.6.13
## [19] rvest_0.3.2 colorspace_1.3-2 recipes_0.1.2
## [22] htmltools_0.3.6 Matrix_1.2-11 plyr_1.8.4
## [25] psych_1.7.8 timeDate_3042.101 pkgconfig_2.0.1
## [28] CVST_0.2-1 broom_0.4.3 purrr_0.2.4
## [31] scales_0.5.0 gower_0.1.2 lava_1.6
## [34] tibble_1.4.1 withr_2.1.1 nnet_7.3-12
## [37] lazyeval_0.2.1 mnormt_1.5-5 survival_2.41-3
## [40] magrittr_1.5 evaluate_0.10.1 nlme_3.1-131
## [43] xml2_1.2.0 dimRed_0.1.0 foreign_0.8-69
## [46] class_7.3-14 tools_3.4.3 hms_0.4.0
## [49] stringr_1.2.0 kernlab_0.9-25 munsell_0.4.3
## [52] bindrcpp_0.2 e1071_1.6-8 compiler_3.4.3
## [55] RcppRoll_0.2.2 rlang_0.1.6 grid_3.4.3
## [58] iterators_1.0.9 labeling_0.3 rmarkdown_1.8
## [61] gtable_0.2.0 ModelMetrics_1.1.0 codetools_0.2-15
## [64] reshape2_1.4.3 R6_2.2.2 lubridate_1.7.1
## [67] dplyr_0.7.4 bindr_0.1 rprojroot_1.3-2
## [70] readr_1.1.1 stringi_1.1.6 parallel_3.4.3
## [73] Rcpp_0.12.14 rpart_4.1-13 tidyselect_0.2.3
## [76] DEoptimR_1.0-8