Chapter 2 Vectorization, *apply and for loops

This chapter covers the basics of vectorization, the *apply family of functions and for loops.

2.1 Vectorization

Almost everything in R is a vector. A scalar is really a vector of length 1 and a data.frame is a collection of vectors. A nice feature of R is its vectorized capabilities. Vectorization means that a function operates on a whole vector of values at once rather than on a single value. If you have ever taken a basic linear algebra course, this concept will be familiar to you. Take for example the sum of two vectors:

\[ \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} + \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} = \begin{bmatrix} 2 \\ 4 \\ 6 \end{bmatrix} \]

The corresponding R code is given by:

a <- c(1, 2, 3)
b <- c(1, 2, 3)
a + b
## [1] 2 4 6
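
To see what vectorization saves you from writing, here is a minimal sketch of the same addition done with an explicit loop. The names loopSum and vecSum are only illustrative and do not appear elsewhere in this chapter.

```r
# Element-wise addition written as an explicit loop (for comparison only)
a <- c(1, 2, 3)
b <- c(1, 2, 3)

loopSum <- vector("double", length(a))
for (i in seq_along(a)) {
    loopSum[i] <- a[i] + b[i]
}

# The vectorized form gives the same result in a single expression
vecSum <- a + b
identical(loopSum, vecSum)
## [1] TRUE
```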

Many of the base functions in R are already vectorized. Here are some common examples:

# generate a sequence of numbers from 1 to 10
(a <- 1:10)
##  [1]  1  2  3  4  5  6  7  8  9 10
# sum the numbers from 1 to 10
sum(a)
## [1] 55
# calculate sums of each column
colSums(iris[, -5])
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##        876.5        458.6        563.7        179.9

Exercise: What happens when you sum two vectors of different lengths?

2.2 Family of *apply functions

  • apply, lapply and sapply are among the most commonly used functions in R
  • *apply functions are not necessarily faster than loops, but can be easier to read (and vice versa)
  • apply is used when you need to perform an operation on every row or column of a matrix or data.frame
  • lapply and sapply differ in the format of the output. The former always returns a list, while the latter simplifies the result to a vector (or matrix) when possible
  • There are other *apply functions such as tapply, vapply and mapply with similar functionality and purpose
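
The difference in output format can be sketched with a small example. The list x and the result names below are made up for illustration; only lapply, sapply and vapply themselves come from base R.

```r
x <- list(a = 1:3, b = 4:6)

# lapply always returns a list
resList <- lapply(x, mean)
class(resList)
## [1] "list"

# sapply simplifies the list to a named numeric vector when it can
resVec <- sapply(x, mean)
resVec
## a b 
## 2 5

# vapply does the same but checks each result against a declared template
resChecked <- vapply(x, mean, FUN.VALUE = numeric(1))
identical(resVec, resChecked)
## [1] TRUE
```

Because vapply fails loudly when a result does not match the template, it is often the safer choice inside functions, while sapply is convenient for interactive work.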

2.2.1 Loops vs. Apply

# Getting the row means of two columns
# Generate data
N <- 10000
x1 <- runif(N)
x2 <- runif(N)
d <- as.data.frame(cbind(x1, x2))
head(d)
##           x1         x2
## 1 0.93196866 0.81751342
## 2 0.14861694 0.47933846
## 3 0.64465639 0.09915633
## 4 0.31383613 0.38192113
## 5 0.28983386 0.42311260
## 6 0.09529535 0.49011556
# Loop: create a vector to store the results in
rowMeanFor <- vector("double", N)

for (i in seq_len(N)) {
    rowMeanFor[[i]] <- mean(c(d[i, 1], d[i, 2]))
}

# Apply:
rowMeanApply <- apply(d, 1, mean)

# are the results equal?
all.equal(rowMeanFor, rowMeanApply)
## [1] TRUE
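
For this particular task base R also ships a vectorized function, rowMeans, which avoids both the loop and apply. The sketch below regenerates the data so that it runs on its own; the seed value and the name rowMeanVec are assumptions made for illustration.

```r
# Self-contained version of the comparison, with a seed for reproducibility
set.seed(1)  # the seed value is an arbitrary choice for this sketch
N <- 10000
d <- data.frame(x1 = runif(N), x2 = runif(N))

# apply: calls mean() once per row
rowMeanApply <- apply(d, 1, mean)

# rowMeans: vectorized over all rows at once
rowMeanVec <- rowMeans(d)

# compare the values only, ignoring any names attached to the results
all.equal(unname(rowMeanApply), unname(rowMeanVec))
## [1] TRUE
```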

2.2.2 Descriptive Statistics using *apply

data(women)
# data structure
str(women)
## 'data.frame':    15 obs. of  2 variables:
##  $ height: num  58 59 60 61 62 63 64 65 66 67 ...
##  $ weight: num  115 117 120 123 126 129 132 135 139 142 ...
# calculate the mean for each column
apply(women, 2, mean)
##   height   weight 
##  65.0000 136.7333
# apply 'fivenum' function to each column
vapply(women, fivenum, c(Min. = 0, `1st Qu.` = 0, Median = 0, `3rd Qu.` = 0, 
    Max. = 0))
##         height weight
## Min.      58.0  115.0
## 1st Qu.   61.5  124.5
## Median    65.0  135.0
## 3rd Qu.   68.5  148.0
## Max.      72.0  164.0

2.2.3 Creating new columns using sapply

You can apply a user-defined function to a column or to the entire data frame:

# the output of sapply is a vector; the 's' in sapply stands for
# 'simplified' apply
mtcars$gear2 <- sapply(mtcars$gear, function(i) if (i == 4) "alot" else "some")

head(mtcars)[, c("gear", "gear2")]
##                   gear gear2
## Mazda RX4            4  alot
## Mazda RX4 Wag        4  alot
## Datsun 710           4  alot
## Hornet 4 Drive       3  some
## Hornet Sportabout    3  some
## Valiant              3  some
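
Since the test i == 4 is itself vectorized, the same column can also be built without any *apply function. The following is a sketch of that alternative using ifelse; the names gear2Vec and gear2Sapply are hypothetical and used only to compare the two approaches.

```r
# Vectorized alternative: ifelse() tests the whole column at once
gear2Vec <- ifelse(mtcars$gear == 4, "alot", "some")

# the element-by-element version, for comparison
gear2Sapply <- sapply(mtcars$gear, function(i) if (i == 4) "alot" else "some")
identical(gear2Vec, gear2Sapply)
## [1] TRUE
```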

2.2.4 Applying functions to subsets using tapply

# Fisher's famous dataset
data(iris)
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
# mean sepal length by species
tapply(iris$Sepal.Length, iris$Species, mean)
##     setosa versicolor  virginica 
##      5.006      5.936      6.588
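
One way to think about tapply is as a split followed by an apply. The sketch below spells out those two steps with split and sapply; the name bySpecies is only illustrative.

```r
# split the measurements into one vector per species, then summarise each
bySpecies <- split(iris$Sepal.Length, iris$Species)
sapply(bySpecies, mean)
##     setosa versicolor  virginica 
##      5.006      5.936      6.588
```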

2.2.5 Nested for loops using mapply

mapply is my favorite base R function and here are some reasons why:

  • mapply iterates over several arguments in parallel, so it replaces a for loop that indexes multiple vectors at once (and, combined with expand.grid, nested for loops), while being more readable and less prone to errors
  • It is an effective way of conducting simulations because it iterates over many arguments

Let’s say you want to generate random samples from a normal distribution with varying means and standard deviations. Of course, the brute force way would be to write out the command once, copy and paste it as many times as you want, and then manually change the arguments for mean and sd in the rnorm function like so:

v1 <- rnorm(100, mean = 5, sd = 1)
v2 <- rnorm(100, mean = 10, sd = 5)
v3 <- rnorm(100, mean = -3, sd = 10)

This isn’t too bad for three vectors. But what if you want to generate many more combinations of means and sds? Furthermore, how can you keep track of the parameters you used? Now let’s consider the mapply function:

means <- c(5, 10, -3)
sds <- c(1, 5, 10)

# MoreArgs is a list of arguments that don't change
randomNormals <- mapply(rnorm, mean = means, sd = sds, MoreArgs = list(n = 100))

head(randomNormals)
##          [,1]      [,2]       [,3]
## [1,] 5.400492  3.606588 -10.544957
## [2,] 4.025367  4.395509   1.248023
## [3,] 5.001900  8.994643 -10.234892
## [4,] 5.004534  2.210005 -10.172234
## [5,] 4.004708  5.368140  -6.539932
## [6,] 4.478162 14.107530   6.502228

The following diagram (from r4ds) describes exactly what is going on in the above function call to mapply:

Advantages:

  1. Result is automatically stored in a matrix
  2. The parameters are also saved in R objects so that they can be easily manipulated and/or recovered
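
To make the pairing of arguments concrete, the mapply call can be sketched as an explicit loop in which the i-th mean is used together with the i-th sd. The seed value and the names viaMapply and viaLoop are assumptions introduced only for this comparison.

```r
means <- c(5, 10, -3)
sds <- c(1, 5, 10)

# mapply: the i-th mean is paired with the i-th sd
set.seed(123)  # arbitrary seed so both versions draw the same numbers
viaMapply <- mapply(rnorm, mean = means, sd = sds, MoreArgs = list(n = 100))

# the same computation as a single loop over the paired parameters
set.seed(123)
viaLoop <- matrix(NA_real_, nrow = 100, ncol = length(means))
for (i in seq_along(means)) {
    viaLoop[, i] <- rnorm(n = 100, mean = means[i], sd = sds[i])
}

all.equal(viaMapply, viaLoop)
## [1] TRUE
```

The loop version needs a pre-allocated matrix and explicit indexing, which is exactly the bookkeeping mapply takes care of.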

Consider a more complex scenario where you want to examine every combination of several means and sds. We take advantage of the expand.grid function to create a data.frame of simulation parameters, with one row per combination:

simParams <- expand.grid(means = 1:10, sds = 1:10)

randomNormals <- mapply(rnorm, mean = simParams$means, sd = simParams$sds, MoreArgs = list(n = 100))

dim(randomNormals)
## [1] 100 100
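
Because column j of the result was generated from row j of simParams, summaries of the simulated samples can be stored right next to the parameters that produced them. The sketch below assumes an arbitrary seed and introduces two hypothetical columns, sampleMean and sampleSd; no output is shown because the values depend on the random draws.

```r
simParams <- expand.grid(means = 1:10, sds = 1:10)

set.seed(42)  # arbitrary seed, for reproducibility only
randomNormals <- mapply(rnorm, mean = simParams$means, sd = simParams$sds,
    MoreArgs = list(n = 100))

# column j of randomNormals corresponds to row j of simParams,
# so column summaries line up with the parameter rows
simParams$sampleMean <- colMeans(randomNormals)
simParams$sampleSd <- apply(randomNormals, 2, sd)

head(simParams, 3)
```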

2.3 Creating dynamic documents with mapply

mapply together with the rmarkdown package (Allaire et al. 2016) can be very useful for creating dynamic documents for exploratory analysis. We illustrate this using the Motor Trend Car Road Tests data, which comes pre-loaded in R.

The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

Copy the code below into a file called mapplyRmarkdown.Rmd:

Copy the code below into a file called boxplotTemplate: