5 min read

Heatmaps in R

In every statistical analysis, the first thing one should do is try and visualise the data before any modeling. In microarray studies, a common visualisation is a heatmap of gene expression data. In this post I simulate some gene expression data and visualise it using the pheatmap function from the pheatmap package in R. You will also need the mvrnorm function from the MASS library to simulate from a multivariate normal distribution, and the brewer.pal function from the RColorBrewer library for easier customization of colors.

Components of a Heatmap

There are four main components that should be considered when drawing a heatmap:

  1. Formatting the data
  2. Choosing the color scheme
  3. Annotating the rows and/or columns
  4. Clustering

Formatting the data

First I simulate some gene expression data, based on a function that I created, for genes which are correlated conditional on an exposure status (the function definition is given at the end of this post):


n = 100 ; n0 = 50 ; n1 = 50; p = 100
genes <- sim.expr.data(n = 100, n0 = 50, p = 100, 
                       rho.0 = 0.01, rho.1 = 0.95)

In order to properly label the heatmap, we must label the matrix of gene expressions:


colnames(genes) <- paste0("Gene", 1:p)
rownames(genes) <- paste0("Subject", 1:n)
genes[1:5, 1:5]

##               Gene1      Gene2      Gene3      Gene4      Gene5
## Subject1 -0.3330860  0.2515602 -0.7263684 -0.1257963  1.3159837
## Subject2 -0.4128702 -0.9761655  0.2132626  0.7267015 -0.3634427
## Subject3  0.9432688  0.2973164 -1.1210232  0.9017424 -0.8098157
## Subject4  0.2894936 -0.6708301 -0.8209677  0.2204200 -1.0602915
## Subject5 -1.2131869 -1.2346285 -0.1120417  0.9439494  0.5893691

Choosing the color scheme

To avoid wasting time choosing colors, I recommend using the RColorBrewer package based on the design of geographer Cynthia Brewer. From the RColorBrewer help page:

There are 3 types of palettes, sequential, diverging, and qualitative:

  1. Sequential palettes are suited to ordered data that progress from low to high. Lightness steps dominate the look of these schemes, with light colors for low data values to dark colors for high data values.
  2. Diverging palettes put equal emphasis on mid-range critical values and extremes at both ends of the data range. The critical class or break in the middle of the legend is emphasized with light colors and low and high extremes are emphasized with dark colors that have contrasting hues.
  3. Qualitative palettes do not imply magnitude differences between legend classes, and hues are used to create the primary visual differences between classes. Qualitative schemes are best suited to representing nominal or categorical data

To see the palettes available for coloring the heatmap:


pacman::p_load(RColorBrewer)
RColorBrewer::display.brewer.all()

plot of chunk unnamed-chunk-4

You need to provide the RColorBrewer::brewer.pal function with two arguments; the number of values (All the diverging palettes are available in variations from 3 different values up to 11 different values), and the name of the palette as shown in the figure above. We will use the Reds palette which has a maximum number of 9 colors:


(col.pal <- RColorBrewer::brewer.pal(9, "Reds"))

## [1] "#FFF5F0" "#FEE0D2" "#FCBBA1" "#FC9272" "#FB6A4A" "#EF3B2C"
## [7] "#CB181D" "#A50F15" "#67000D"

Annotating the rows and/or columns

If the subjects can be contrasted, it is useful to display this information on the heatmap e.g. case/control status or exposed vs. unexposed. To do so, we first need to create a separate data frame which contains that information. This data frame can contain many columns or just one column. (Note that the rownames of this data frame need to correspond to the rownames i.e. Subjects IDs of the gene expression data created above). In this example we create a data frame which has exposure status and tumor type for each subject:


annotation_col <- data.frame(
        Exposure = factor(c(rep("X=0",n0), c(rep("X=1", n1)))),
        Type = factor(sample(c("T-cell","B-cell"),n, replace=T)))

rownames(annotation_col) = paste0("Subject", 1:n)

head(annotation_col)

##          Exposure   Type
## Subject1      X=0 B-cell
## Subject2      X=0 T-cell
## Subject3      X=0 T-cell
## Subject4      X=0 B-cell
## Subject5      X=0 B-cell
## Subject6      X=0 T-cell

We also want to annotate information on the genes, such as pathway membership. To do so, we create another data frame which has the gene annotations. Note once again that the rownames of this data frame need to correspond to the columnames i.e. Gene IDs of the gene expression data created above.


annotation_row <- data.frame(
                  Pathway = factor(rep(1:4,each=25)))

rownames(annotation_row) = paste0("Gene", 1:n)

head(annotation_row)

##       Pathway
## Gene1       1
## Gene2       1
## Gene3       1
## Gene4       1
## Gene5       1
## Gene6       1

Clustering

You need to decide if its important to cluster the rows and/or columns of your heatmap. If you decide to cluster, you must then choose the distance metric to use and the clustering method. The pheatmap comes with lots of customizations (see the help page for a complete list of options). In this example I only want to cluster the genes (i.e. the rows), and place a gap between subject who were exposed and unexposed. Note that we must pass the transpose of the matrix for the pheatmap function, which is not the case for other functions such as gplots::heatmap.2.


pacman::p_load(pheatmap)
pheatmap::pheatmap(t(genes), 
                   cluster_row = T,
                   cluster_cols = F,
                   annotation_col = annotation_col,
                   annotation_row = annotation_row,
                   color = col.pal, 
                   fontsize = 6.5,
                   fontsize_row=6, 
                   fontsize_col = 6,
                   gaps_col=50)

plot of chunk unnamed-chunk-8


Update: June 25, 2015

Interactive Heatmaps using d3heatmap

It is also possible to create Interactive heatmaps (in the sense that you can see the actual values by hovering your mouse over the plot) using the d3heatmap pacakge available on github:


pacman::p_install_gh("rstudio/d3heatmap")
pacman::p_load(webshot)

This is useful if you are producible markdown reports. The syntax is standard, though does not allow for multiple annotations as in pheatmap.


library(d3heatmap)
d3heatmap(t(genes), colors = "Reds", Colv = FALSE)

plot of chunk unnamed-chunk-10

For some reason, this map is not showing up on this website, but it should work when compiling Rmarkdown scripts and viewing the resulting HTML document in your browser or within RStudio.

Update: August 7, 2015

Interactive Heatmaps using plotly

After some user setup (see the plotly help page), the following code creates an interactive heatmap:


expression <- t(genes)
plotly::plot_ly(z = expression, colorscale = col.pal, type = "heatmap")

Simulate Gene Expression Data function


sim.expr.data <- function(n, n0, p, rho.0, rho.1){
# Initiate Simulation parameters
# n: total number of subjects
# n0: number of subjects with X=0
# n1: number of subjects with X=1
# p: number of genes
# rho.0: rho between Z_i and Z_j when X=0
# rho.1: rho between Z_i and Z_j when X=1

# Simulate gene expression values according to exposure X=0, X=1, according to a centered multivariate normal distribution with covariance between Z_i and Z_j being rho^|i-j|
  times = 1:p # used for creating covariance matrix
  H <- abs(outer(times, times, "-"))
  V0 <- rho.0^H
  V1 <- rho.1^H

  # rows are people, columns are genes
  genes0 <- MASS::mvrnorm(n = n0, mu = rep(0,p), Sigma = V0)
  genes1 <- MASS::mvrnorm(n = n1, mu = rep(0,p), Sigma = V1)
  genes <- rbind(genes0,genes1)
  return(genes)
}