March 12, 2015

(Ernst and Kellis 2015)

Main Idea

Exploit the correlated nature of epigenetic signals, across both marks and samples, for large-scale prediction of additional datasets
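As a rough illustration of this idea, the sketch below assembles two kinds of predictors for one target (sample, mark) track from a toy signal array: the other marks in the same sample, and the same mark in the other samples. The array layout, the random values, and the helper name `build_features` are assumptions made for illustration; this is a simplification and not the ChromImpute implementation.

```python
import random

random.seed(0)

# Toy signal: signal[sample][mark] is a list of values over genomic positions.
# All values are synthetic.
N_SAMPLES, N_MARKS, N_POSITIONS = 4, 3, 50
signal = [
    [[random.random() for _ in range(N_POSITIONS)] for _ in range(N_MARKS)]
    for _ in range(N_SAMPLES)
]


def build_features(signal, target_sample, target_mark):
    """Return one feature row per position for a single target track."""
    n_samples = len(signal)
    n_marks = len(signal[0])
    n_positions = len(signal[0][0])
    rows = []
    for pos in range(n_positions):
        # 1. Other marks in the same sample.
        same_sample = [
            signal[target_sample][m][pos]
            for m in range(n_marks) if m != target_mark
        ]
        # 2. Same mark in different samples.
        same_mark = [
            signal[s][target_mark][pos]
            for s in range(n_samples) if s != target_sample
        ]
        rows.append(same_sample + same_mark)
    return rows


features = build_features(signal, target_sample=0, target_mark=1)
print(len(features), len(features[0]))  # positions, then 2 + 3 features
```

A regression model trained on rows like these, with the observed target signal as the response, could then predict the target track at positions or in samples where it was not measured.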

Matrix of Observed and Imputed Data

1. Leverage other marks in same sample

2. Leverage same mark in different sample

Types of data used to impute

Advantages of Imputation

  • Beneficial even if observed data is available
    • Combining information → robust to experimental noise, confounders
    • Achieve a higher sequencing depth → higher signal-to-noise ratio
  • Improve GWAS enrichments → epigenomic maps as an unbiased approach for discovering disease-relevant tissues and cell types
  • Quality Control → Are there discrepancies between imputed and observed datasets?
  • Feature importance
  • Chromatin state annotation

Limitations

  • If the presence of mark signal is highly specific to one or a few samples, and it does not correlate with other marks mapped in the sample or has a different correlation structure than in samples used for training, then it would not be possible to accurately impute the mark at those locations
  • When the target mark has been mapped in only a few samples, the features pertaining to the same mark in other samples may be less informative or more biased (e.g., transcription factor binding sites, TFBS)
  • For tissue samples that reflect mixtures of multiple cell types, our imputed maps will most likely reflect the same mixture as the observed data, though deconvolution of mixed samples is a potentially important direction for future work

ChromImpute Software

Not a new idea

Leo Breiman (1928-2005)

MissForest

Introduction to Regression Trees

Some intuition behind the imputation approach

total sales = 7.1 + 0.0475 × (number of TVs sold)
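A minimal sketch of how the fitted line above is used for prediction. The function name and the input values are illustrative; only the intercept and slope come from the slide.

```python
def predict_total_sales(tvs_sold):
    """Evaluate the fitted line: intercept 7.1, slope 0.0475 per TV sold."""
    intercept = 7.1
    slope = 0.0475
    return intercept + slope * tvs_sold


# Each additional TV sold raises the predicted total by the slope,
# no matter where on the line we start.
difference = predict_total_sales(101) - predict_total_sales(100)
print(round(predict_total_sales(100), 4), round(difference, 4))
```

A linear model changes its prediction by the same amount everywhere, whereas the tree-based methods that follow predict a separate constant in each region of the predictor space.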

Tree-based Methods

  • Involves splitting the predictor space into simple regions
  • Since the set of splitting rules used to segment the predictor space can be summarized in a tree, these types of approaches are known as decision-tree methods (James et al. 2013)
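The idea of segmenting the predictor space can be sketched with a single predictor and one splitting rule. The data and the cutpoint below are synthetic and hypothetical; the point is only that each region predicts the mean response of the training points that fall in it.

```python
from statistics import mean

# Toy training data (synthetic): one predictor x and a response y.
xs = [1, 2, 3, 6, 7, 8]
ys = [10.0, 12.0, 11.0, 30.0, 32.0, 31.0]

CUTPOINT = 4.5  # hypothetical splitting rule: x < 4.5 versus x >= 4.5


def region_of(x):
    """Map a predictor value to the name of the region it falls in."""
    return "left" if x < CUTPOINT else "right"


# The prediction for each region is the mean response of its training points.
region_means = {
    name: mean(y for x, y in zip(xs, ys) if region_of(x) == name)
    for name in ("left", "right")
}


def predict(x):
    """Predict the response for a new x as the mean of its region."""
    return region_means[region_of(x)]


print(predict(2.5), predict(10))
```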

Baseball Data

Predict salary based on Hits and Years Played

How to split the data

Regression Tree for Baseball data

  • Years is the most important factor in determining Salary
  • Given that a player is less experienced, the number of Hits that he made in the previous year seems to play little role in his Salary
  • Among players who have been in the major leagues for five or more years, the number of Hits made in the previous year does affect Salary
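The three observations above describe a tree with two splits and three leaves, which can be written as nested rules. In this sketch the Years cutoff follows the "five or more years" wording, while the Hits cutoff is a hypothetical value chosen only for illustration; the function returns region labels rather than salaries so that no data is invented.

```python
YEARS_CUT = 5    # from the text: "five or more years" in the major leagues
HITS_CUT = 100   # hypothetical threshold, chosen only for illustration


def salary_region(years, hits):
    """Return the leaf (region) of the tree that a player falls into."""
    if years < YEARS_CUT:
        # Less experienced players: Hits is not consulted at all.
        return "R1: fewer than five years"
    if hits < HITS_CUT:
        return "R2: five or more years, fewer hits"
    return "R3: five or more years, more hits"


for years, hits in [(2, 150), (8, 60), (8, 150)]:
    print(years, hits, "->", salary_region(years, hits))
```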

Decision Tree

More Details of Tree Building

  • The goal is to find boxes \(R_1,\ldots, R_J\) that minimize the residual sum of squares, given by
  • \[ \sum_{j=1}^J \sum_{i \in R_j} (y_i - \hat{y}_{R_j})^2 \]
  • \(y_i\) is the response of subject \(i\), and \(\hat{y}_{R_j}\) is the mean response in box \(j\)
  • Computationally infeasible to consider every single partition of the feature space into J boxes
  • Solution: take a top-down, greedy approach
  • Begin at the top of the tree with all observations in one box, choose the best split at each step, and never revisit earlier splits
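One step of the top-down, greedy search can be sketched as follows: every candidate cutpoint of every predictor is tried, and the split with the smallest total residual sum of squares over the two resulting boxes is kept. The function names and the data are assumptions for illustration, and this is a teaching sketch rather than an optimized implementation.

```python
def rss(values):
    """Residual sum of squares of values around their own mean."""
    if not values:
        return 0.0
    centre = sum(values) / len(values)
    return sum((v - centre) ** 2 for v in values)


def best_split(rows, targets):
    """Return (feature index, cutpoint, RSS) of the best single split.

    Every midpoint between consecutive distinct values of every feature is
    tried, and the one with the smallest total RSS over the two boxes wins.
    Returns None if no feature has two distinct values.
    """
    best = None
    for j in range(len(rows[0])):
        levels = sorted({row[j] for row in rows})
        for low, high in zip(levels, levels[1:]):
            cut = (low + high) / 2
            left = [y for row, y in zip(rows, targets) if row[j] < cut]
            right = [y for row, y in zip(rows, targets) if row[j] >= cut]
            total = rss(left) + rss(right)
            if best is None or total < best[2]:
                best = (j, cut, total)
    return best


# Synthetic data: the response jumps when feature 0 crosses 4.5,
# while feature 1 is unrelated to the response.
rows = [(1, 7), (2, 3), (3, 9), (6, 4), (7, 8), (8, 2)]
targets = [10.0, 12.0, 11.0, 30.0, 32.0, 31.0]
print(best_split(rows, targets))
```

To grow a full tree, the same search would be repeated inside each of the two boxes, and then inside their children, until a stopping rule (for example, a minimum number of observations per box) is met.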

Pros and Cons

  • Tree-based methods are simple and useful for interpretation
  • Highly sensitive to the first split
  • Solution: Combining a large number of trees can often result in dramatic improvements in prediction accuracy, at the expense of some loss of interpretability.
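A minimal sketch of combining many trees by bagging: each tree is fit to a bootstrap sample (rows drawn with replacement from the training data) and the predictions are averaged. To keep the example short, each "tree" is a one-split stump on a single predictor; the data, the function names, and the number of trees are all assumptions for illustration.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split regression tree on a single predictor."""
    levels = sorted(set(xs))
    overall = sum(ys) / len(ys)
    best = (float("inf"), None, overall, overall)
    for low, high in zip(levels, levels[1:]):
        cut = (low + high) / 2
        left = [y for x, y in zip(xs, ys) if x < cut]
        right = [y for x, y in zip(xs, ys) if x >= cut]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((y - left_mean) ** 2 for y in left)
        error += sum((y - right_mean) ** 2 for y in right)
        if error < best[0]:
            best = (error, cut, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    """Predict with one stump; fall back to the overall mean if unsplit."""
    cut, left_mean, right_mean = stump
    if cut is None:  # the bootstrap sample had a single distinct x value
        return left_mean
    return left_mean if x < cut else right_mean


def bagged_stumps(xs, ys, n_trees, rng):
    """Fit one stump per bootstrap sample (drawn with replacement)."""
    n = len(xs)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return stumps


def predict_bagged(stumps, x):
    """Average the predictions of all stumps."""
    return sum(predict_stump(s, x) for s in stumps) / len(stumps)


# Synthetic data with a jump between x = 3 and x = 6.
xs = [1, 2, 3, 6, 7, 8]
ys = [10.0, 12.0, 11.0, 30.0, 32.0, 31.0]
stumps = bagged_stumps(xs, ys, n_trees=200, rng=random.Random(0))
print(predict_bagged(stumps, 2), predict_bagged(stumps, 7))
```

Because each stump sees a slightly different sample, individual cutpoints vary, and averaging smooths out the dependence on any single split. The price is that the ensemble can no longer be read as one simple tree.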

Bagging

The Bootstrap

(James et al. 2013)

Pull yourself up by your bootstraps

Random Forests

Acknowledgements

Regression tree slides are based on

Leo Breiman (1928-2005)

References

Breiman, Leo. 2001. “Random Forests.” Mach. Learn. 45 (1). Hingham, MA, USA: Kluwer Academic Publishers: 5–32. doi:10.1023/A:1010933404324.

Ernst, Jason, and Manolis Kellis. 2015. “Large-Scale Imputation of Epigenomic Datasets for Systematic Annotation of Diverse Human Tissues.” Nature Biotechnology. Nature Publishing Group.

James, G, D Witten, T Hastie, and R Tibshirani. 2013. An Introduction to Statistical Learning.

Stekhoven, Daniel J, and Peter Bühlmann. 2012. “MissForest—non-Parametric Missing Value Imputation for Mixed-Type Data.” Bioinformatics 28 (1). Oxford Univ Press: 112–18.