Exploit the correlated nature of epigenetic signals, across both marks and samples, for large-scale prediction of additional datasets
Matrix of Observed and Imputed Data
1. Leverage other marks in the same sample
2. Leverage the same mark in different samples
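The two feature classes above can be sketched as follows. This is a minimal illustration only, not the actual imputation implementation: the `signal` dictionary, the sample and mark names, the signal values, and the helper `build_features` are all hypothetical.

```python
# Hypothetical toy data: signal[sample][mark] is a list of per-bin signal values.
# None means the experiment was not performed (a gap in the matrix to be imputed).
signal = {
    "sampleA": {"H3K4me3": [5.0, 0.5, 0.2], "H3K27ac": [4.0, 1.0, 0.1]},
    "sampleB": {"H3K4me3": [4.5, 0.4, 0.3], "H3K27ac": None},
    "sampleC": {"H3K4me3": [5.5, 0.6, 0.1], "H3K27ac": [4.4, 1.2, 0.2]},
}


def build_features(signal, target_sample, target_mark, bin_index):
    """Collect both feature classes for one (sample, mark, bin) target."""
    # 1. Other marks observed in the same sample.
    same_sample = {
        mark: track[bin_index]
        for mark, track in signal[target_sample].items()
        if mark != target_mark and track is not None
    }
    # 2. The same mark observed in different samples.
    same_mark = {
        sample: marks[target_mark][bin_index]
        for sample, marks in signal.items()
        if sample != target_sample and marks.get(target_mark) is not None
    }
    return same_sample, same_mark


same_sample, same_mark = build_features(signal, "sampleB", "H3K27ac", 0)
print(same_sample)  # {'H3K4me3': 4.5}
print(same_mark)    # {'sampleA': 4.0, 'sampleC': 4.4}
```

Any regression model could then be trained on these features; which model is used is left open here.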
Types of data used to impute
Advantages of Imputation
Beneficial even if observed data is available
Combining information → robust to experimental noise and confounders
Achieve an effectively higher sequencing depth → higher signal-to-noise ratio
Improve GWAS enrichments → epigenomic maps as an unbiased approach for discovering disease-relevant tissues and cell types
Quality control → are there discrepancies between imputed and observed datasets?
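Why combining information helps with noise can be illustrated with a small simulation. This is an idealized sketch, not the imputation model itself: it assumes several tracks that share one true signal and differ only by independent Gaussian noise, and it combines them by plain averaging. All names and parameter values (`N_BINS`, `N_TRACKS`, `NOISE_SD`) are hypothetical.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is reproducible

N_BINS = 2000    # number of genomic bins (hypothetical)
N_TRACKS = 8     # number of noisy tracks being combined (hypothetical)
NOISE_SD = 1.0   # standard deviation of the experimental noise (hypothetical)

# Hypothetical "true" signal shared by all tracks.
true_signal = [random.uniform(0.0, 5.0) for _ in range(N_BINS)]

# Each track is the true signal plus independent Gaussian noise.
tracks = [
    [value + random.gauss(0.0, NOISE_SD) for value in true_signal]
    for _ in range(N_TRACKS)
]


def rmse(estimate, truth):
    """Root-mean-square error between an estimate and the true signal."""
    squared = sum((e - t) ** 2 for e, t in zip(estimate, truth))
    return math.sqrt(squared / len(truth))


# Combine information by averaging the tracks bin by bin.
combined = [sum(values) / N_TRACKS for values in zip(*tracks)]

single_rmse = rmse(tracks[0], true_signal)
combined_rmse = rmse(combined, true_signal)
print(f"single track RMSE: {single_rmse:.3f}")
print(f"combined RMSE:     {combined_rmse:.3f}")
```

Under these assumptions the error of the average shrinks roughly with the square root of the number of tracks. Real marks and samples are only partially correlated, so the gain in practice will differ.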
Feature importance
Chromatin state annotation
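The quality-control use listed above (comparing imputed with observed datasets) can be sketched as a per-dataset correlation check. This is one possible approach under stated assumptions, not a prescribed procedure: the helper names, the toy tracks, and the `min_correlation` threshold of 0.5 are all hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length tracks (assumes neither is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def flag_discrepant(observed, imputed, min_correlation=0.5):
    """Return (dataset name, correlation) for datasets whose observed and
    imputed tracks agree less than the hypothetical threshold."""
    flagged = []
    for name, obs_track in observed.items():
        r = pearson(obs_track, imputed[name])
        if r < min_correlation:
            flagged.append((name, r))
    return flagged


# Hypothetical toy tracks; the second observed track is deliberately discordant.
observed = {
    "sampleA_H3K4me3": [5.0, 0.5, 0.2, 3.0, 0.1],
    "sampleB_H3K4me3": [0.2, 4.0, 0.1, 0.3, 5.0],
}
imputed = {
    "sampleA_H3K4me3": [4.6, 0.7, 0.3, 2.8, 0.2],
    "sampleB_H3K4me3": [4.8, 0.4, 0.3, 3.1, 0.2],
}

flagged = flag_discrepant(observed, imputed)
for name, r in flagged:
    print(f"{name}: r = {r:.2f}, inspect this dataset")
```

A flagged dataset is only a candidate for inspection: a low correlation may reflect a problem with the experiment or genuine sample-specific biology.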
Limitations
If a mark's signal at a location is highly specific to one or a few samples, and it either does not correlate with other marks mapped in that sample or has a different correlation structure from that in the samples used for training, then the mark cannot be imputed accurately at those locations
When the target mark has been mapped in only a few samples, the features derived from the same mark in other samples may be less informative or more biased (e.g., TFBS)
For tissue samples that reflect mixtures of multiple cell types, our imputed maps will most likely reflect the same mixture as the observed data, though deconvolution of mixed samples is a potentially important direction for future work
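The first limitation above can be illustrated with a toy example. Here a naive cross-sample average stands in for the real predictor, which is an assumption made purely for illustration; the track values, sample names, and the helper `impute_from_other_samples` are hypothetical.

```python
# Hypothetical per-bin signal for one mark in four samples.
# Bin 1 is a peak shared by all samples; bin 2 is a peak found only in "target".
tracks = {
    "target":  [0.2, 3.0, 6.0, 0.1],
    "sampleA": [0.3, 2.8, 0.2, 0.2],
    "sampleB": [0.1, 3.2, 0.1, 0.1],
    "sampleC": [0.2, 3.1, 0.3, 0.0],
}


def impute_from_other_samples(tracks, target):
    """Naive stand-in predictor: average the same mark across all other samples."""
    others = [track for name, track in tracks.items() if name != target]
    return [sum(values) / len(values) for values in zip(*others)]


imputed = impute_from_other_samples(tracks, "target")

# Absolute error per bin between the observed and imputed target track.
errors = [abs(obs - imp) for obs, imp in zip(tracks["target"], imputed)]
worst_bin = max(range(len(errors)), key=errors.__getitem__)
print(worst_bin)  # 2
```

The shared peak in bin 1 is recovered closely, while the sample-specific peak in bin 2 is missed because no other sample carries information about it.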