March 12, 2015


(Ernst and Kellis 2015)

Main Idea

Exploit the correlated nature of epigenetic signals, across both marks and samples, for large-scale prediction of additional datasets

Matrix of Observed and Imputed Data


1. Leverage other marks in same sample


2. Leverage same mark in different sample
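
The two sources of information above can be made concrete with a small sketch. This is a toy illustration only, not the ChromImpute implementation: the sample names, mark names, signal values, and the helpers `build_features` and `naive_same_mark_baseline` are all hypothetical. It assembles, for one unobserved (sample, mark) pair, the per-bin features coming from (1) other marks in the same sample and (2) the same mark in other samples, and then computes a naive baseline that simply averages the same mark across the other samples.

```python
from statistics import mean

# Toy signal tracks: (sample, mark) -> signal per genomic bin.
# All names and values are made up for illustration.
observed = {
    ("S1", "H3K4me1"): [1.0, 4.0, 0.5, 2.0],
    ("S1", "H3K27ac"): [0.8, 5.0, 0.2, 1.5],
    ("S2", "H3K4me1"): [1.2, 3.5, 0.4, 2.2],
    ("S2", "H3K27ac"): [1.0, 4.5, 0.3, 1.8],
    ("S3", "H3K4me1"): [0.9, 3.8, 0.6, 2.1],
    # ("S3", "H3K27ac") is the unobserved target we want to impute.
}


def build_features(tracks, target_sample, target_mark):
    """Return (feature_names, rows), with one row of feature values per bin."""
    # 1. Other marks mapped in the same sample.
    same_sample = sorted(
        k for k in tracks if k[0] == target_sample and k[1] != target_mark
    )
    # 2. The same mark mapped in different samples.
    same_mark = sorted(
        k for k in tracks if k[1] == target_mark and k[0] != target_sample
    )
    names = same_sample + same_mark
    if not names:
        raise ValueError("no informative tracks available for this target")
    n_bins = len(tracks[names[0]])
    rows = [[tracks[k][i] for k in names] for i in range(n_bins)]
    return names, rows


def naive_same_mark_baseline(tracks, target_sample, target_mark):
    """Average the target mark over all other samples, bin by bin."""
    others = [
        v for (s, m), v in tracks.items()
        if m == target_mark and s != target_sample
    ]
    if not others:
        raise ValueError("target mark not observed in any other sample")
    return [mean(vals) for vals in zip(*others)]


names, rows = build_features(observed, "S3", "H3K27ac")
baseline = naive_same_mark_baseline(observed, "S3", "H3K27ac")
```

In practice the feature rows would be passed to a trained regression model; the averaging baseline is included only to show how little is needed to get a first, crude prediction from the second source of information.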


Types of data used to impute


Advantages of Imputation

  • Beneficial even if observed data is available
    • Combining information → robust to experimental noise and confounders
    • Effectively achieves a higher sequencing depth → higher signal-to-noise ratio
  • Improve GWAS enrichments → epigenomic maps as an unbiased approach for discovering disease-relevant tissues and cell types
  • Quality control → are there discrepancies between imputed and observed datasets?
  • Feature importance
  • Chromatin state annotation
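
The quality-control use can be sketched as a simple agreement check between each observed track and its imputed counterpart. This is a minimal sketch under stated assumptions: Pearson correlation is used here as one possible agreement measure, the 0.5 threshold is arbitrary, and the dataset names, values, and the `flag_discrepant` helper are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length signal tracks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("tracks must have the same length (at least 2 bins)")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant track")
    return cov / sqrt(vx * vy)


def flag_discrepant(observed, imputed, threshold=0.5):
    """Names of datasets whose observed and imputed tracks agree poorly.

    The threshold is an arbitrary illustrative cutoff, not a recommended value.
    """
    return [
        name
        for name in sorted(observed)
        if name in imputed and pearson(observed[name], imputed[name]) < threshold
    ]


# Made-up example: dataset "A" agrees with its imputation, dataset "B" does not.
observed = {"A": [1.0, 4.0, 0.5, 2.0], "B": [1.0, 4.0, 0.5, 2.0]}
imputed = {"A": [1.1, 3.8, 0.6, 2.1], "B": [3.0, 0.5, 4.0, 1.0]}

flagged = flag_discrepant(observed, imputed)
```

A flagged dataset is only a candidate for closer inspection: low agreement can point to a problem with the experiment, but also to signal that the imputation cannot capture.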


Limitations of Imputation

  • If a mark's signal is highly specific to one or a few samples, and it either does not correlate with other marks mapped in that sample or has a different correlation structure than in the samples used for training, the mark cannot be imputed accurately at those locations
  • When the target mark has been mapped in only a few samples, features derived from the same mark in other samples may be less informative or more biased, e.g., transcription factor binding sites (TFBS)
  • For tissue samples that are mixtures of multiple cell types, the imputed maps will most likely reflect the same mixture as the observed data; deconvolution of mixed samples is a potentially important direction for future work

ChromImpute Software