In this post, I describe how to estimate a kinship matrix and subsequently fit a mixed model using that estimated kinship. In particular, I show how this can be done on an arbitrary matrix of genotype data, which is not stored in plink format. I also show how to deal with missing genotypes.

# install.packages("gaston")
library(gaston)


## Set Simulation Parameters

# number of subjects
n <- 1000

# number of genotypes
p <- 1e4

# number of causal genotypes
p_causal <- 50

# Signal to noise ratio
signal_to_noise_ratio <- 2

# vector of allele frequencies from which to sample
probs <- c(0.05, 0.1, 0.3, 0.4)


## Generate Sample Data with Missing Genotypes

set.seed(345321)
geno <- replicate(p, rbinom(n, 2, sample(probs, 1)))
dim(geno)

## [1]  1000 10000

geno[1:5,1:5]

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0    0    1    1    0
## [2,]    1    0    0    1    0
## [3,]    1    1    0    0    0
## [4,]    2    1    1    0    0
## [5,]    1    2    1    0    0

geno[sample(1:p, 100)] <- NA
geno[1:5,1:5]

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0    0    1    1    0
## [2,]    1    0    0    1    0
## [3,]    1    1    0    0    0
## [4,]    2    1    1    0    0
## [5,]    1    2    1    0    0


## Convert to BED Matrix

This is so that we can use the functions in the gaston package for fitting mixed models

DT <- gaston::as.bed.matrix(geno)

# can access the data by
as.matrix(DT)[1:5, 1:5]

##   M_1 M_2 M_3 M_4 M_5
## 1   0   0   1   1   0
## 2   1   0   0   1   0
## 3   1   1   0   0   0
## 4   2   1   1   0   0
## 5   1   2   1   0   0

# see the contents of DT
slotNames(DT)

## [1] "ped"                  "snps"                 "bed"
## [4] "p"                    "mu"                   "sigma"
## [7] "standardize_p"        "standardize_mu_sigma"

# to access the different contents of DT use @
DT@snps[1:5,]

##   chr  id dist pos A1 A2  N0  N1  N2 NAs N0.f N1.f N2.f NAs.f callrate
## 1  NA M_1   NA  NA NA NA 480 411 101   8   NA   NA   NA    NA    0.992
## 2  NA M_2   NA  NA NA NA 349 477 167   7   NA   NA   NA    NA    0.993
## 3  NA M_3   NA  NA NA NA 461 442  85  12   NA   NA   NA    NA    0.988
## 4  NA M_4   NA  NA NA NA 781 200   9  10   NA   NA   NA    NA    0.990
## 5  NA M_5   NA  NA NA NA 886  95   3  16   NA   NA   NA    NA    0.984
##          maf         hz
## 1 0.30897177 0.41431452
## 2 0.40835851 0.48036254
## 3 0.30971660 0.44736842
## 4 0.11010101 0.20202020
## 5 0.05132114 0.09654472

DT@ped[1:5,]

##   famid id father mother sex pheno   N0   N1  N2 NAs N0.x N1.x N2.x NAs.x
## 1     1  1      0      0   0     0 6431 2916 652   1    0    0    0     0
## 2     2  2      0      0   0     0 6450 2873 677   0    0    0    0     0
## 3     3  3      0      0   0     0 6441 2899 660   0    0    0    0     0
## 4     4  4      0      0   0     0 6374 2986 640   0    0    0    0     0
## 5     5  5      0      0   0     0 6339 2978 683   0    0    0    0     0
##   N0.y N1.y N2.y NAs.y N0.mt N1.mt N2.mt NAs.mt callrate    hz callrate.x
## 1    0    0    0     0     0     0     0      0     -Inf -2916        NaN
## 2    0    0    0     0     0     0     0      0      NaN   Inf        NaN
## 3    0    0    0     0     0     0     0      0      NaN   Inf        NaN
## 4    0    0    0     0     0     0     0      0      NaN   Inf        NaN
## 5    0    0    0     0     0     0     0      0      NaN   Inf        NaN
##   hz.x callrate.y hz.y callrate.mt hz.mt
## 1  NaN        NaN  NaN         NaN   NaN
## 2  NaN        NaN  NaN         NaN   NaN
## 3  NaN        NaN  NaN         NaN   NaN
## 4  NaN        NaN  NaN         NaN   NaN
## 5  NaN        NaN  NaN         NaN   NaN

# p contains the alternate allele frequency
# mu is equal to 2*p and is the expected value of the genotype (coded in 0, 1, 2)
# sigma is the genotype standard error
DT@p[1:10]

##  [1] 0.30897177 0.40835851 0.30971660 0.11010101 0.05132114 0.05393145
##  [7] 0.28556911 0.04984894 0.30502513 0.39635996

DT@mu[1:10]

##  [1] 0.61794355 0.81671702 0.61943320 0.22020202 0.10264228 0.10786290
##  [7] 0.57113821 0.09969789 0.61005025 0.79271992

DT@sigma[1:10]

##  [1] 0.6607853 0.6950724 0.6350671 0.4338020 0.3110142 0.3155285 0.6256291
##  [8] 0.3053246 0.6458457 0.6949752

plot(2*DT@p, DT@mu)
abline(a=0,b=1, col = "red")


If the Hardy-Weinberg equilibrium holds, sigma should be close to $\sqrt{2*p(1-p)}$. This is illustrated on the figure below

plot(DT@p, DT@sigma, xlim=c(0,1))
t <- seq(0,1,length=101);
lines(t, sqrt(2*t*(1-t)), col="red")


## Standardized SNP matrix

# this will center the columns of DT to mean 0, and standard deviation sqrt(2p(1-p))
gaston::standardize(DT) <- "p"
X <- as.matrix(DT)
X[1:5, 1:5]

##          M_1        M_2        M_3        M_4      M_5
## 1 -0.9456415 -1.1749151  0.5819959  1.7615748 -0.32893
## 2  0.5846625 -1.1749151 -0.9472912  1.7615748 -0.32893
## 3  0.5846625  0.2636678 -0.9472912 -0.4974395 -0.32893
## 4  2.1149665  0.2636678  0.5819959 -0.4974395 -0.32893
## 5  0.5846625  1.7022506  0.5819959 -0.4974395 -0.32893


## Dealing With Missing Values

In standardized matrices, the NA values are replaced by zeroes, which amount to impute the missing genotypes by the mean genotype.

X[is.na(X)] <- 0
X[1:5,1:5]

##          M_1        M_2        M_3        M_4      M_5
## 1 -0.9456415 -1.1749151  0.5819959  1.7615748 -0.32893
## 2  0.5846625 -1.1749151 -0.9472912  1.7615748 -0.32893
## 3  0.5846625  0.2636678 -0.9472912 -0.4974395 -0.32893
## 4  2.1149665  0.2636678  0.5819959 -0.4974395 -0.32893
## 5  0.5846625  1.7022506  0.5819959 -0.4974395 -0.32893


The object X is what will be used as the data matrix in the LMM analysis. We also need to create the kinship matrix, which we do next.

## Calculate Kinship Matrix

If $X_s$ is a standardized $n \times p$ matrix of genotypes, a Genetic Relationship Matrix of individuals can be computed as

where $p$ is the number of SNPs and $n$ is the number of individuals. This computation is done by the gaston::GRM function. Note that we could also use

(1 / (p - 1)) * tcrossprod(X)


to calculate the kinship (covariance) matrix, but the gaston::GRM function is faster.

Note that the gaston::GRM function internally standardizes the genotype data, which is why we provide it the object DT. The object X will be used for fitting the model. We specify autosome.only = FALSE because we don’t have that information.

kin <- gaston::GRM(DT, autosome.only = FALSE)
kin[1:5,1:5]

##              1            2             3             4             5
## 1  0.994037390 -0.020637164 -2.610047e-03  1.937922e-02 -0.0050506615
## 2 -0.020637164  0.970255971  1.689422e-03 -1.211568e-02  0.0014872842
## 3 -0.002610047  0.001689422  9.864578e-01 -9.975582e-05  0.0006858376
## 4  0.019379225 -0.012115683 -9.975582e-05  9.979130e-01  0.0061614369
## 5 -0.005050662  0.001487284  6.858376e-04  6.161437e-03  1.0134803057


## Principal Components

From the GRM, we can compute the Principal components. The eigenvectors are normalized. The Principal Components (PC) can be computed by multiplying them by the square root of the associated eigenvalues

eiK <- eigen(kin)

# deal with a small negative eigen value
eiK$values[ eiK$values < 0 ] <- 0

PC <- sweep(eiK$vectors, 2, sqrt(eiK$values), "*")
dim(PC)

## [1] 1000 1000

plot(PC[,1], PC[,2])


## Simulate Phenotype

p_causal SNPs are randomly assigned to a Uniform(0.9,1.1) distribution.

beta <- rep(0, p)
beta[sample(1:p, p_causal)] <- runif(p_causal, min = 0.9, max = 1.1)

y.star <- X %*% beta

error <- stats::rnorm(n)

k <- as.numeric(sqrt(stats::var(y.star)/(signal_to_noise_ratio*stats::var(error))))

Y <- y.star + k*error


## Run Univariate LMM

There are two packages we can use to fit uni/multivariate LMMs.

### gaston package

Here we use the gaston package (Perdry & Dandine-Roulland, 2017).

# make design matrix with intercept
x1 <- cbind(1, X[,1,drop=FALSE])

# with 1 random effect this function is faster than lmm.aireml
# in gaston::lmm.diago you provide the eigen decomposition
# in gaston::lmm.aireml you provide the kinship matrix
fit <- gaston::lmm.diago(Y, x1, eigenK = eiK)

## Optimization in interval [0, 1]
## Optimizing with p = 0
## [Iteration 1] Current point = 0 df = 28.579
## [Iteration 2] Current point = 0.54494 df = 2.00363
## [Iteration 3] Current point = 0.585132 df = -0.018836
## [Iteration 4] Current point = 0.584762 df = -1.81023e-06

# equivalently you can also fit using
# fit <- gaston::lmm.aireml(Y, x1, K = kin, verbose = FALSE)

# the second coefficient is x1, the first is the intercept
(z_score <- fit$BLUP_beta[2]/sqrt(fit$varbeta[2,2]))

## [1] 0.4642883

# pvalue
2*pnorm(z_score, lower.tail = F)

## [1] 0.6424412

# random effect variance
fit$tau  ## [1] 44.29653  # error variance fit$sigma2

## [1] 31.45491

# error sd
sqrt(fit\$sigma2)

## [1] 5.608468


### coxme package

Here we use the coxme package (Therneau, 2015).

# install.packages("coxme")
library(coxme)

# need an ID variable
dat <- data.frame(Y, x=X[,1], id = 1:n)

# provide the kinship matrix
gfit1 <- lmekin(Y ~ x + (1|id), data=dat, varlist=kin)
gfit1

## Linear mixed-effects kinship model fit by maximum likelihood
##   Data: dat
##   Log-likelihood = -3572.846
##   n= 1000
##
##
## Model:  Y ~ x + (1 | id)
## Fixed coefficients
##                 Value Std Error    z     p
## (Intercept) 0.3185476 0.1720343 1.85 0.064
## x           0.1320349 0.2748453 0.48 0.630
##
## Random effects
##  Group Variable Std Dev   Variance
##  id    Vmat.1    6.790932 46.116756
## Residual error= 5.440202


## References

1. gaston: Genetic Data Handling (QC, GRM, LD, PCA) & Linear Mixed Models. R package version 1.5 2017 [Website]
2. coxme: Mixed Effects Cox Models. R package version 2.2-5 2015 [Website]

## Session Info

##  setting  value
##  version  R version 3.4.1 (2017-06-30)
##  system   x86_64, linux-gnu
##  ui       X11
##  language en_US
##  collate  en_US.UTF-8
##  date     2017-10-12
##
##  package      * version date       source
##  backports      1.1.0   2017-05-22 cran (@1.1.0)
##  base         * 3.4.1   2017-07-08 local
##  bdsmatrix    * 1.3-2   2014-08-22 CRAN (R 3.4.1)
##  compiler       3.4.1   2017-07-08 local
##  coxme        * 2.2-5   2015-06-15 CRAN (R 3.4.1)
##  datasets     * 3.4.1   2017-07-08 local
##  devtools       1.13.3  2017-08-02 CRAN (R 3.4.1)
##  digest         0.6.12  2017-01-27 CRAN (R 3.4.1)
##  evaluate       0.10.1  2017-06-24 cran (@0.10.1)
##  gaston       * 1.5     2017-05-25 CRAN (R 3.4.1)
##  graphics     * 3.4.1   2017-07-08 local
##  grDevices    * 3.4.1   2017-07-08 local
##  grid           3.4.1   2017-07-08 local
##  htmltools      0.3.6   2017-04-28 cran (@0.3.6)
##  knitr          1.17    2017-08-10 cran (@1.17)
##  lattice        0.20-35 2017-03-25 CRAN (R 3.3.3)
##  magrittr       1.5     2014-11-22 CRAN (R 3.4.1)
##  Matrix         1.2-11  2017-08-16 CRAN (R 3.4.1)
##  memoise        1.1.0   2017-04-21 CRAN (R 3.4.1)
##  methods      * 3.4.1   2017-07-08 local
##  nlme           3.1-131 2017-02-06 CRAN (R 3.4.0)
##  Rcpp         * 0.12.12 2017-07-15 CRAN (R 3.4.1)
##  RcppParallel * 4.3.20  2016-08-16 CRAN (R 3.4.1)
##  rmarkdown      1.6     2017-06-15 CRAN (R 3.4.1)
##  rprojroot      1.2     2017-01-16 CRAN (R 3.4.1)
##  splines        3.4.1   2017-07-08 local
##  stats        * 3.4.1   2017-07-08 local
##  stringi        1.1.5   2017-04-07 CRAN (R 3.4.1)
##  stringr        1.2.0   2017-02-18 CRAN (R 3.4.1)
##  survival     * 2.41-3  2017-04-04 CRAN (R 3.4.0)
##  tools          3.4.1   2017-07-08 local
##  utils        * 3.4.1   2017-07-08 local
##  withr          2.0.0   2017-09-18 Github (jimhester/withr@d1f0957)
##  yaml           2.1.14  2016-11-12 cran (@2.1.14)