Analyzing large data sets in R can take several minutes, hours or even days. In this cases code optimization can largely improve performance and save us a good amount of time. In this class we see some optimization techniques in R.

Let’s install/load this packages first:

 

Evaluate performance

Performance is simply the time that it takes to do something in R. Hence, optimizing performance is just making processes go faster. The time it takes to run a process can be measured as follows:

# raiz cuadrada de una variable de largo 10^7 distribuida normalmente
system.time(sqrt(rnorm(10^7, mean = 100)))
##    user  system elapsed 
##   0.748   0.008   0.753
# raiz cuadrada de una variable de largo 10^7 distribuida uniformemente
system.time(sqrt(runif(10^7, min = 80, max = 120)))
##    user  system elapsed 
##   0.352   0.004   0.359

 

Another option is the ‘microbenchmark’ function from the package ‘microbenchmark’. It is slightly more accurate as it repeats the process several times to get a mean and distribution of process timing:

mb <- microbenchmark(sqrt(runif(10^5)), sqrt(rnorm(10^5, mean = 100)))

mb
## Unit: milliseconds
##                           expr      min       lq     mean   median
##              sqrt(runif(10^5)) 2.877298 3.054743 3.320532 3.174672
##  sqrt(rnorm(10^5, mean = 100)) 6.450303 6.795570 7.392067 7.073305
##        uq      max neval cld
##  3.404752  4.99594   100  a 
##  7.446446 12.33274   100   b

 

Both options return the time the process takes to run, although ‘microbenchmark’ gives much more details. It also allow us to get a nice plot:

autoplot(mb)

 

Optimizing loops

Long processes in R typically involve loops. for loops are probably the most common way to loop a process:

a<-rnorm(10^5, mean = 100)

b<-NULL
system.time(for(i in 1:length(a))
{
  c <- ifelse(a[i] >= 0, "pos", "neg")
b<-append(b, c)
}
)
##    user  system elapsed 
##  25.980   0.112  26.170

 

There are a few trick that can be used for improving loop (or even function) performance

  • ‘ifelse()’ is faster than ‘if()’:
b<-NULL
system.time(for(i in 1:length(a))
{
  if(a[i] >= 0) c <- "pos" else c <- "neg"
b<-append(b, c)
}
)
##    user  system elapsed 
##  24.332   0.212  24.567
b<-NULL
system.time(for(i in 1:length(a))
{
  if(a[i] >= 0) c <- "pos" 
if(a[i] < 0) c  <- "neg"
b<-append(b, c)
})
##    user  system elapsed 
##  25.112   0.156  25.299

 

  • Use a vector with a previously defined length to store results is better than append():
b<-NULL
system.time(for(i in 1:length(a))
{
  ifelse(a[i] >= 0, "pos", "neg")
b<-append(b, c)
}
)
##    user  system elapsed 
##  24.288   0.124  24.437
b<-vector(length = length(a))
system.time(for(i in 1:length(a))
{
  ifelse(a[i] >= 0, "pos", "neg")
b[i]<-c
}
)
##    user  system elapsed 
##   0.176   0.000   0.176

 

 

‘Xapply’ loops

Most for loops can be run using Xapply functions. As we have mentioned before Xapply loops can several advantages. For instance we can use lapply to run the above process:

#remover todos los objetos
rm(list = ls())

a<-rnorm(10^4)

b<-NULL
system.time(for(i in 1:length(a))
{if(a[i] >= 0) c1 <- "pos" 
if(a[i] < 0) c1  <- "neg"
b<-append(b, c1)
})
##    user  system elapsed 
##   0.236   0.000   0.235
# revisar si existe 'c' e 'i' en el ambiente de trabajo
exists("c1")
## [1] TRUE
exists("i")
## [1] TRUE
#remover todos los objetos
rm(list = ls())

a<-rnorm(10^4)

system.time(b <- sapply(1:length(a), function(i)
{if(a[i] >= 0) c1 <- "pos" 
if(a[i] < 0) c1  <- "neg"
return(c1)
}))
##    user  system elapsed 
##   0.012   0.000   0.012
# revisar si existe 'c' en el ambiente de trabajo
exists("c1")
## [1] FALSE
exists("i")
## [1] FALSE

 

Optimizing functions

The tips shown above also apply to functions (if vs ifelse, vector vs append). We can compare loop performance converting each loop to a function.

for loop function:

a<-rnorm(10^4)

# funcion con bucle for:
sign.for <- function(a)
{b<-NULL
for(i in 1:length(a))
{if(a[i] >= 0) c <- "pos" 
if(a[i] < 0) c  <- "neg"
b<-append(b, c)
}
return(b)}

system.time(sign.for(a))
##    user  system elapsed 
##   0.320   0.004   0.324

 

sapply function:

sign.sapply <- function(a) {b <- sapply(1:length(a), function(i)
{
  
if(a[i] > 0) c <- "pos" else c <- "neg"
return(c)
}, USE.NAMES = F)
return(b)
}

system.time(sign.sapply(a))
##    user  system elapsed 
##   0.012   0.000   0.011

 

We can also compared them using ‘microbenchmark’:

a<-rnorm(10^4)

compare <- microbenchmark(sign.sapply(a), sign.for(a), times = 10)

compare
## Unit: milliseconds
##            expr        min         lq       mean     median         uq
##  sign.sapply(a)   5.084761   5.419886   5.888929   5.542048   5.682089
##     sign.for(a) 218.409331 224.732995 242.210787 232.316350 244.625825
##         max neval cld
##    9.761594    10  a 
##  317.061148    10   b
autoplot(compare) 

 

Function compilation

Functions can be compiled with the cmpfun() function from compiler. Compiling functions avoids having to evaluate the function each time is used:

cmpsign.sapply <- cmpfun(sign.sapply)

 

Now compare the function before and after compiling:

compare <- microbenchmark(sign.sapply(a), cmpsign.sapply(a), times = 10)

compare
## Unit: milliseconds
##               expr      min       lq     mean   median       uq      max
##     sign.sapply(a) 5.281843 5.423659 5.859677 5.549090 6.691979 6.794536
##  cmpsign.sapply(a) 5.312670 5.474172 5.627506 5.519907 5.849997 6.190058
##  neval cld
##     10   a
##     10   a
autoplot(compare) 

 


Exercise 1


Create a function to obtain the standard deviation and compare its performance to the sd function from base R.


Parallel computing

Parallel computing is the use of the multiple cores in your computer to conduct processes simultaneously instead of sequentially. The parallel package has several functions to simplify parallelization. Here we compare the lapply function from base R with the mclapply from parallel:

system.time(lapply(1:8, function(i) Sys.sleep(1)))
##    user  system elapsed 
##   0.004   0.000   8.012
# using 2 cores
system.time(mclapply(1:8, function(i) Sys.sleep(1), mc.cores = 2))
##    user  system elapsed 
##   0.004   0.012   4.021

 

An elegant alternative is the package pbapply which now allow users to run loops in parallel:

system.time(pblapply(1:8, function(i) Sys.sleep(1), cl = 2))
##    user  system elapsed 
##   0.004   0.040   4.018

Session information

## R version 3.4.4 (2018-03-15)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.4 LTS
## 
## Matrix products: default
## BLAS: /usr/lib/openblas-base/libblas.so.3
## LAPACK: /usr/lib/libopenblasp-r0.2.18.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  compiler  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
## [1] pbapply_1.3-4        doParallel_1.0.11    iterators_1.0.9     
## [4] foreach_1.4.4        ggplot2_2.2.1        microbenchmark_1.4-4
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.17     pillar_1.2.3     plyr_1.8.4       tools_3.4.4     
##  [5] digest_0.6.13    evaluate_0.10.1  tibble_1.4.2     gtable_0.2.0    
##  [9] lattice_0.20-35  rlang_0.2.0      Matrix_1.2-14    yaml_2.1.16     
## [13] mvtnorm_1.0-6    stringr_1.2.0    knitr_1.20       rprojroot_1.3-2 
## [17] grid_3.4.4       survival_2.42-3  rmarkdown_1.8    multcomp_1.4-8  
## [21] TH.data_1.0-8    magrittr_1.5     backports_1.1.2  scales_0.5.0    
## [25] codetools_0.2-15 htmltools_0.3.6  splines_3.4.4    MASS_7.3-50     
## [29] colorspace_1.3-2 sandwich_2.4-0   stringi_1.1.6    lazyeval_0.2.1  
## [33] munsell_0.4.3    zoo_1.8-1