When analyzing large amounts of genetic and genomic data, the first line of analysis is usually some sort of univariate test. That is, conduct a statistical test for each SNP or CpG site or Gene and then correct for multiple testing. The limma package on Bioconductor is a popular method for computing moderated t-statistics using a combination of the limma::lmFit
and limma::eBayes
functions. In this post, I show how to calculate the ordinary t-statistics from limma
output.
First we load the required packages
# clear workspace
rm(list=ls())
if (!requireNamespace("pacman", quietly = TRUE))
install.packages("pacman")
# this will install (if you dont already have it) and load the package
pacman::p_load("minfi")
pacman::p_load("minfiData")
pacman::p_load("limma")
pacman::p_load("CpGassoc")
# check the data that is available in the minfiData package
pacman::p_data("minfiData")
## Data Description
## 1 MsetEx An example dataset for Illumina's Human Methylation 450k dataset, after preprocessing.
## 2 MsetEx.sub An example dataset for Illumina's Human Methylation 450k dataset, after preprocessing.
## 3 RGsetEx An example dataset for Illumina's Human Methylation 450k dataset.
## 4 RGsetEx.sub An example dataset for Illumina's Human Methylation 450k dataset.
Next, we extract some sample data and create a covariate of interest
# get the M-values for the sample data
DT <- minfi::getM(MsetEx.sub)
dim(DT)
## [1] 600 6
# rows are CpGs and columns are samples
head(DT)
## 5723646052_R02C02 5723646052_R04C01 5723646052_R05C02
## cg00050873 3.502348 0.4414491 4.340695
## cg00212031 -3.273751 0.9234662 -2.614777
## cg00213748 2.076816 -0.1309465 1.260995
## cg00214611 -3.438838 1.7463950 -2.270551
## cg00455876 1.839010 -0.9927320 1.619479
## cg01707559 -3.303987 -0.6433201 -3.540887
## 5723646053_R04C02 5723646053_R05C02 5723646053_R06C02
## cg00050873 0.24458355 -0.3219281 0.2744392
## cg00212031 -0.21052257 -0.6861413 -0.1397595
## cg00213748 -1.10373279 -1.6616553 -0.1270869
## cg00214611 0.29029649 -0.2103599 -0.6138630
## cg00455876 -0.09504721 -0.2854655 0.6361273
## cg01707559 -0.74835377 -0.4678048 -1.1345421
# create a fake covariate. 3 intact and 3 degraded samples
tissue <- factor(c(rep("Intact",3), rep("Degraded",3)))
design <- model.matrix(~ tissue)
Then we calculate the moderated and ordinary t-statistics and compare them:
# limma fit
fit <- lmFit(DT, design)
fit_ebayes <- eBayes(fit)
# ordinary t-statistic
ordinary.t <- fit$coefficients[,"tissueIntact"] /
fit$stdev.unscaled[,"tissueIntact"] /
fit$sigma
plot(x = ordinary.t,
y = fit_ebayes$t[,"tissueIntact"],
xlab = "ordinary t-statistic",
ylab = "moderated t-statistic")
abline(a = 0, b = 1, col = "red")
We can calculate the corresponding p-values from the ordinary t-statistics. This is given by
# t-distribution with n-p-1 degrees of freedom
# (n: number of samples, p:number of covariates excluding intercept)
ordinary.pvalue <- pt(q = abs(ordinary.t),
df = dim(DT)[2]-(ncol(design)-1)-1,
lower.tail = F) * 2
We can also use the CpGassoc
package to calculate ordinary t-statistics and compare the result to our manual calculations:
# compare result with CpGassoc package (regular t-test)
t_test_mvalues <- CpGassoc::cpg.assoc(
beta.val = minfi::getBeta(MsetEx.sub),
indep = tissue,
fdr.cutoff = 0.05,
logit.transform = TRUE)
# these should be equal
plot(t_test_mvalues$results$P.value, ordinary.pvalue)
abline(a = 0, b = 1, col = "red")
# t-statistic is the squared F.statistic (in certain cases)
plot(t_test_mvalues$results$F.statistic, ordinary.t^2)
abline(a = 0, b = 1, col = "red")
Session Info
## R version 3.3.2 (2016-10-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.1 LTS
##
## attached base packages:
## [1] stats4 parallel methods stats graphics grDevices
## [7] utils datasets base
##
## other attached packages:
## [1] CpGassoc_2.55
## [2] nlme_3.1-128
## [3] limma_3.30.10
## [4] minfiData_0.20.0
## [5] IlluminaHumanMethylation450kanno.ilmn12.hg19_0.6.0
## [6] IlluminaHumanMethylation450kmanifest_0.4.0
## [7] minfi_1.20.2
## [8] bumphunter_1.14.0
## [9] locfit_1.5-9.1
## [10] iterators_1.0.8
## [11] foreach_1.4.3
## [12] Biostrings_2.42.1
## [13] XVector_0.14.0
## [14] SummarizedExperiment_1.4.0
## [15] GenomicRanges_1.26.2
## [16] GenomeInfoDb_1.10.2
## [17] IRanges_2.8.1
## [18] S4Vectors_0.12.1
## [19] Biobase_2.34.0
## [20] BiocGenerics_0.20.0
## [21] knitr_1.15.1
##
## loaded via a namespace (and not attached):
## [1] mclust_5.2.2 base64_2.0
## [3] Rcpp_0.12.9 lattice_0.20-34
## [5] Rsamtools_1.26.1 digest_0.6.12
## [7] R6_2.2.0 plyr_1.8.4
## [9] RSQLite_1.1-2 evaluate_0.10
## [11] highr_0.6 httr_1.2.1
## [13] zlibbioc_1.20.0 GenomicFeatures_1.26.2
## [15] data.table_1.10.4 annotate_1.52.1
## [17] Matrix_1.2-7.1 preprocessCore_1.36.0
## [19] splines_3.3.2 BiocParallel_1.8.1
## [21] stringr_1.2.0 RCurl_1.95-4.8
## [23] biomaRt_2.30.0 rtracklayer_1.34.1
## [25] multtest_2.30.0 pkgmaker_0.22
## [27] openssl_0.9.6 GEOquery_2.40.0
## [29] quadprog_1.5-5 codetools_0.2-15
## [31] matrixStats_0.51.0 XML_3.98-1.5
## [33] reshape_0.8.6 GenomicAlignments_1.10.0
## [35] MASS_7.3-45 bitops_1.0-6
## [37] grid_3.3.2 xtable_1.8-2
## [39] registry_0.3 DBI_0.5-1
## [41] pacman_0.4.1 magrittr_1.5
## [43] stringi_1.1.2 genefilter_1.56.0
## [45] doRNG_1.6 nor1mix_1.2-2
## [47] RColorBrewer_1.1-2 siggenes_1.48.0
## [49] tools_3.3.2 illuminaio_0.16.0
## [51] rngtools_1.2.4 survival_2.40-1
## [53] AnnotationDbi_1.36.2 beanplot_1.2
## [55] memoise_1.0.0