Analyzing large data sets in R can take minutes, hours or even days. In these cases, code optimization can greatly improve performance and save a good amount of time. In this class we cover some optimization techniques in R.
Let’s install/load these packages first:
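The session information at the end of this document lists the attached packages; a minimal setup sketch (assuming none are installed yet) could look like this:

```r
# install the CRAN packages used in this class (only needed once)
pkgs <- c("microbenchmark", "ggplot2", "pbapply")
to_install <- setdiff(pkgs, rownames(installed.packages()))
if (length(to_install) > 0) install.packages(to_install)

# load them, plus the base packages 'parallel' and 'compiler'
library(microbenchmark)
library(ggplot2)    # provides the autoplot() method used below
library(pbapply)    # pblapply() with progress bar and parallel support
library(parallel)   # mclapply() and cluster functions
library(compiler)   # cmpfun()
```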
Performance is simply the time that it takes to do something in R. Hence, optimizing performance is just making processes go faster. The time it takes to run a process can be measured as follows:
# square root of a normally distributed variable of length 10^7
system.time(sqrt(rnorm(10^7, mean = 100)))
## user system elapsed
## 0.748 0.008 0.753
# square root of a uniformly distributed variable of length 10^7
system.time(sqrt(runif(10^7, min = 80, max = 120)))
## user system elapsed
## 0.352 0.004 0.359
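The three values reported by system.time() are ‘user’ (CPU time spent by R itself), ‘system’ (CPU time spent by the operating system on behalf of R) and ‘elapsed’ (wall-clock time). The same kind of measurement can be taken manually with Sys.time(); a minimal sketch:

```r
# manual wall-clock timing, equivalent to the 'elapsed' value
start <- Sys.time()
invisible(sqrt(rnorm(10^7, mean = 100)))
Sys.time() - start  # prints the elapsed time difference
```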
Another option is the ‘microbenchmark’ function from the package ‘microbenchmark’. It is slightly more accurate as it repeats the process several times to get a mean and distribution of process timing:
mb <- microbenchmark(sqrt(runif(10^5)), sqrt(rnorm(10^5, mean = 100)))
mb
## Unit: milliseconds
## expr min lq mean median
## sqrt(runif(10^5)) 2.877298 3.054743 3.320532 3.174672
## sqrt(rnorm(10^5, mean = 100)) 6.450303 6.795570 7.392067 7.073305
## uq max neval cld
## 3.404752 4.99594 100 a
## 7.446446 12.33274 100 b
Both options return the time the process takes to run, although ‘microbenchmark’ gives much more detail. It also allows us to get a nice plot:
autoplot(mb)
Long processes in R typically involve loops. for loops are probably the most common way to loop a process:
a<-rnorm(10^5, mean = 100)
b<-NULL
system.time(for(i in 1:length(a))
{
c <- ifelse(a[i] >= 0, "pos", "neg")
b<-append(b, c)
}
)
## user system elapsed
## 25.980 0.112 26.170
There are a few tricks that can be used to improve loop (or even function) performance. The first is to replace ifelse() with a plain if/else statement:
b<-NULL
system.time(for(i in 1:length(a))
{
if(a[i] >= 0) c <- "pos" else c <- "neg"
b<-append(b, c)
}
)
## user system elapsed
## 24.332 0.212 24.567
b<-NULL
system.time(for(i in 1:length(a))
{
if(a[i] >= 0) c <- "pos"
if(a[i] < 0) c <- "neg"
b<-append(b, c)
})
## user system elapsed
## 25.112 0.156 25.299
The biggest gain, however, comes from pre-allocating the output vector instead of growing it at each iteration with append():
b<-NULL
system.time(for(i in 1:length(a))
{
c <- ifelse(a[i] >= 0, "pos", "neg")
b<-append(b, c)
}
)
## user system elapsed
## 24.288 0.124 24.437
Compare that to pre-allocating b and assigning by index:
b<-vector(length = length(a))
system.time(for(i in 1:length(a))
{
c <- ifelse(a[i] >= 0, "pos", "neg")
b[i]<-c
}
)
## user system elapsed
## 0.176 0.000 0.176
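For this particular task the loop can be avoided entirely: ifelse() is itself vectorized, so it can be applied to the whole vector at once. A quick sketch (this variant is not part of the timings above):

```r
# vectorized version: no loop, no per-element assignment
a <- rnorm(10^5, mean = 100)
system.time(b <- ifelse(a >= 0, "pos", "neg"))
```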
Most for loops can be rewritten using Xapply functions. As we have mentioned before, Xapply loops have several advantages. For instance, we can use sapply to run the above process:
# remove all objects
rm(list = ls())
a<-rnorm(10^4)
b<-NULL
system.time(for(i in 1:length(a))
{if(a[i] >= 0) c1 <- "pos"
if(a[i] < 0) c1 <- "neg"
b<-append(b, c1)
})
## user system elapsed
## 0.236 0.000 0.235
# check whether 'c1' and 'i' exist in the working environment
exists("c1")
## [1] TRUE
exists("i")
## [1] TRUE
# remove all objects
rm(list = ls())
a<-rnorm(10^4)
system.time(b <- sapply(1:length(a), function(i)
{if(a[i] >= 0) c1 <- "pos"
if(a[i] < 0) c1 <- "neg"
return(c1)
}))
## user system elapsed
## 0.012 0.000 0.012
# check whether 'c1' and 'i' exist in the working environment
exists("c1")
## [1] FALSE
exists("i")
## [1] FALSE
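Note that the sapply() call did not leave c1 or i behind: the anonymous function works in its own environment. A type-safe variant of the same idea (not used in the original timings) is vapply(), which additionally checks that every element matches the declared type and length:

```r
# vapply() enforces the output type declared in FUN.VALUE
a <- rnorm(10^4)
b <- vapply(1:length(a), function(i)
  if (a[i] >= 0) "pos" else "neg",
  FUN.VALUE = character(1))
head(b)
```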
The tips shown above also apply to functions (if vs ifelse, pre-allocated vector vs append). We can compare loop performance by converting each loop to a function. The for loop function:
a<-rnorm(10^4)
# function with a for loop:
sign.for <- function(a)
{b<-NULL
for(i in 1:length(a))
{if(a[i] >= 0) c <- "pos"
if(a[i] < 0) c <- "neg"
b<-append(b, c)
}
return(b)}
system.time(sign.for(a))
## user system elapsed
## 0.320 0.004 0.324
The sapply function:
sign.sapply <- function(a) {b <- sapply(1:length(a), function(i)
{
if(a[i] >= 0) c <- "pos" else c <- "neg"
return(c)
}, USE.NAMES = F)
return(b)
}
system.time(sign.sapply(a))
## user system elapsed
## 0.012 0.000 0.011
We can also compare them using ‘microbenchmark’:
a<-rnorm(10^4)
compare <- microbenchmark(sign.sapply(a), sign.for(a), times = 10)
compare
## Unit: milliseconds
## expr min lq mean median uq
## sign.sapply(a) 5.084761 5.419886 5.888929 5.542048 5.682089
## sign.for(a) 218.409331 224.732995 242.210787 232.316350 244.625825
## max neval cld
## 9.761594 10 a
## 317.061148 10 b
autoplot(compare)
Functions can be compiled with the cmpfun() function from the compiler package. Compiling functions avoids having to re-interpret the function body each time it is used:
cmpsign.sapply <- cmpfun(sign.sapply)
Now compare the function before and after compiling:
compare <- microbenchmark(sign.sapply(a), cmpsign.sapply(a), times = 10)
compare
## Unit: milliseconds
## expr min lq mean median uq max
## sign.sapply(a) 5.281843 5.423659 5.859677 5.549090 6.691979 6.794536
## cmpsign.sapply(a) 5.312670 5.474172 5.627506 5.519907 5.849997 6.190058
## neval cld
## 10 a
## 10 a
autoplot(compare)
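The near-identical timings are expected: since R 3.4.0 the just-in-time compiler byte-compiles functions automatically on first use, so sign.sapply was effectively already compiled. This can be checked with enableJIT(), a sketch:

```r
library(compiler)
old <- enableJIT(0)  # disable JIT; returns the previous level
old                  # typically 3 in a default R >= 3.4.0 session
enableJIT(old)       # restore the previous JIT level
```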
Exercise 1
Create a function to obtain the standard deviation and compare its performance to the sd function from base R.
Parallel computing is the use of multiple cores in your computer to conduct processes simultaneously instead of sequentially. The parallel package has several functions to simplify parallelization. Here we compare the lapply function from base R with mclapply from parallel:
system.time(lapply(1:8, function(i) Sys.sleep(1)))
## user system elapsed
## 0.004 0.000 8.012
# using 2 cores
system.time(mclapply(1:8, function(i) Sys.sleep(1), mc.cores = 2))
## user system elapsed
## 0.004 0.012 4.021
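The number of available cores can be queried with detectCores(); a common rule of thumb (an assumption, not from the original text) is to leave one core free for the system. Note also that mclapply() relies on forking and therefore runs sequentially on Windows:

```r
library(parallel)
n_cores <- max(1, detectCores() - 1)  # leave one core free
system.time(mclapply(1:8, function(i) Sys.sleep(1), mc.cores = n_cores))
```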
An elegant alternative is the package pbapply, which also allows users to run loops in parallel (with a progress bar):
system.time(pblapply(1:8, function(i) Sys.sleep(1), cl = 2))
## user system elapsed
## 0.004 0.040 4.018
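On Windows, where forking is unavailable, a socket cluster can be used instead; a sketch with parLapply() (the cluster must be stopped when done):

```r
library(parallel)
cl <- makeCluster(2)  # socket cluster with 2 worker processes
system.time(parLapply(cl, 1:8, function(i) Sys.sleep(1)))
stopCluster(cl)       # always release the workers
```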
Session information
## R version 3.4.4 (2018-03-15)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/openblas-base/libblas.so.3
## LAPACK: /usr/lib/libopenblasp-r0.2.18.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel compiler stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] pbapply_1.3-4 doParallel_1.0.11 iterators_1.0.9
## [4] foreach_1.4.4 ggplot2_2.2.1 microbenchmark_1.4-4
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.17 pillar_1.2.3 plyr_1.8.4 tools_3.4.4
## [5] digest_0.6.13 evaluate_0.10.1 tibble_1.4.2 gtable_0.2.0
## [9] lattice_0.20-35 rlang_0.2.0 Matrix_1.2-14 yaml_2.1.16
## [13] mvtnorm_1.0-6 stringr_1.2.0 knitr_1.20 rprojroot_1.3-2
## [17] grid_3.4.4 survival_2.42-3 rmarkdown_1.8 multcomp_1.4-8
## [21] TH.data_1.0-8 magrittr_1.5 backports_1.1.2 scales_0.5.0
## [25] codetools_0.2-15 htmltools_0.3.6 splines_3.4.4 MASS_7.3-50
## [29] colorspace_1.3-2 sandwich_2.4-0 stringi_1.1.6 lazyeval_0.2.1
## [33] munsell_0.4.3 zoo_1.8-1