In every statistical analysis, the first thing one should do is try and visualise the data before any modeling. In microarray studies, a common visualisation is a heatmap of gene expression data. In this post I simulate some gene expression data and visualise it using the pheatmap
function from the pheatmap package in R
. You will also need the mvrnorm
function from the MASS library to simulate from a multivariate normal distribution, and the brewer.pal
function from the RColorBrewer library for easier customization of colors.
Components of a Heatmap
There are four main components that should be considered when drawing a heatmap:
Formatting the data
First I simulate some gene expression data, based on a function that I created, for genes which are correlated conditional on an exposure status (the function definition is given at the end of this post):
n = 100 ; n0 = 50 ; n1 = 50; p = 100
genes <- sim.expr.data(n = 100, n0 = 50, p = 100,
rho.0 = 0.01, rho.1 = 0.95)
In order to properly label the heatmap, we must label the matrix of gene expressions:
colnames(genes) <- paste0("Gene", 1:p)
rownames(genes) <- paste0("Subject", 1:n)
genes[1:5, 1:5]
## Gene1 Gene2 Gene3 Gene4 Gene5
## Subject1 -0.3330860 0.2515602 -0.7263684 -0.1257963 1.3159837
## Subject2 -0.4128702 -0.9761655 0.2132626 0.7267015 -0.3634427
## Subject3 0.9432688 0.2973164 -1.1210232 0.9017424 -0.8098157
## Subject4 0.2894936 -0.6708301 -0.8209677 0.2204200 -1.0602915
## Subject5 -1.2131869 -1.2346285 -0.1120417 0.9439494 0.5893691
Choosing the color scheme
To avoid wasting time choosing colors, I recommend using the RColorBrewer package based on the design of geographer Cynthia Brewer. From the RColorBrewer help page:
There are 3 types of palettes, sequential, diverging, and qualitative:
- Sequential palettes are suited to ordered data that progress from low to high. Lightness steps dominate the look of these schemes, with light colors for low data values to dark colors for high data values.
- Diverging palettes put equal emphasis on mid-range critical values and extremes at both ends of the data range. The critical class or break in the middle of the legend is emphasized with light colors and low and high extremes are emphasized with dark colors that have contrasting hues.
- Qualitative palettes do not imply magnitude differences between legend classes, and hues are used to create the primary visual differences between classes. Qualitative schemes are best suited to representing nominal or categorical data
To see the palettes available for coloring the heatmap:
pacman::p_load(RColorBrewer)
RColorBrewer::display.brewer.all()
You need to provide the RColorBrewer::brewer.pal
function with two arguments; the number of values (All the diverging palettes are available in variations from 3 different values up to 11 different values), and the name of the palette as shown in the figure above. We will use the Reds palette which has a maximum number of 9 colors:
(col.pal <- RColorBrewer::brewer.pal(9, "Reds"))
## [1] "#FFF5F0" "#FEE0D2" "#FCBBA1" "#FC9272" "#FB6A4A" "#EF3B2C"
## [7] "#CB181D" "#A50F15" "#67000D"
Annotating the rows and/or columns
If the subjects can be contrasted, it is useful to display this information on the heatmap e.g. case/control status or exposed vs. unexposed. To do so, we first need to create a separate data frame which contains that information. This data frame can contain many columns or just one column. (Note that the rownames of this data frame need to correspond to the rownames i.e. Subjects IDs of the gene expression data created above). In this example we create a data frame which has exposure status and tumor type for each subject:
annotation_col <- data.frame(
Exposure = factor(c(rep("X=0",n0), c(rep("X=1", n1)))),
Type = factor(sample(c("T-cell","B-cell"),n, replace=T)))
rownames(annotation_col) = paste0("Subject", 1:n)
head(annotation_col)
## Exposure Type
## Subject1 X=0 B-cell
## Subject2 X=0 T-cell
## Subject3 X=0 T-cell
## Subject4 X=0 B-cell
## Subject5 X=0 B-cell
## Subject6 X=0 T-cell
We also want to annotate information on the genes, such as pathway membership. To do so, we create another data frame which has the gene annotations. Note once again that the rownames of this data frame need to correspond to the columnames i.e. Gene IDs of the gene expression data created above.
annotation_row <- data.frame(
Pathway = factor(rep(1:4,each=25)))
rownames(annotation_row) = paste0("Gene", 1:n)
head(annotation_row)
## Pathway
## Gene1 1
## Gene2 1
## Gene3 1
## Gene4 1
## Gene5 1
## Gene6 1
Clustering
You need to decide if its important to cluster the rows and/or columns of your heatmap. If you decide to cluster, you must then choose the distance metric to use and the clustering method.
The pheatmap comes with lots of customizations (see the help page for a complete list of options). In this example I only want to cluster the genes (i.e. the rows), and place a gap between subject who were exposed and unexposed. Note that we must pass the transpose of the matrix for the pheatmap
function, which is not the case for other functions such as gplots::heatmap.2
.
pacman::p_load(pheatmap)
pheatmap::pheatmap(t(genes),
cluster_row = T,
cluster_cols = F,
annotation_col = annotation_col,
annotation_row = annotation_row,
color = col.pal,
fontsize = 6.5,
fontsize_row=6,
fontsize_col = 6,
gaps_col=50)
Update: June 25, 2015
Interactive Heatmaps using d3heatmap
It is also possible to create Interactive heatmaps (in the sense that you can see the actual values by hovering your mouse over the plot) using the d3heatmap
pacakge available on github:
pacman::p_install_gh("rstudio/d3heatmap")
pacman::p_load(webshot)
This is useful if you are producible markdown reports. The syntax is standard, though does not allow for multiple annotations as in pheatmap
.
library(d3heatmap)
d3heatmap(t(genes), colors = "Reds", Colv = FALSE)
For some reason, this map is not showing up on this website, but it should work when compiling Rmarkdown scripts and viewing the resulting HTML document in your browser or within RStudio.
Update: August 7, 2015
Interactive Heatmaps using plotly
After some user setup (see the plotly help page), the following code creates an interactive heatmap:
expression <- t(genes)
plotly::plot_ly(z = expression, colorscale = col.pal, type = "heatmap")
Simulate Gene Expression Data function
sim.expr.data <- function(n, n0, p, rho.0, rho.1){
# Initiate Simulation parameters
# n: total number of subjects
# n0: number of subjects with X=0
# n1: number of subjects with X=1
# p: number of genes
# rho.0: rho between Z_i and Z_j when X=0
# rho.1: rho between Z_i and Z_j when X=1
# Simulate gene expression values according to exposure X=0, X=1, according to a centered multivariate normal distribution with covariance between Z_i and Z_j being rho^|i-j|
times = 1:p # used for creating covariance matrix
H <- abs(outer(times, times, "-"))
V0 <- rho.0^H
V1 <- rho.1^H
# rows are people, columns are genes
genes0 <- MASS::mvrnorm(n = n0, mu = rep(0,p), Sigma = V0)
genes1 <- MASS::mvrnorm(n = n1, mu = rep(0,p), Sigma = V1)
genes <- rbind(genes0,genes1)
return(genes)
}