August 7, 2014 Statistical Genetics Journal Club

About me

  • http://sahirbhatnagar.com
  • 2nd year PhD student (Biostatistics) with Celia
  • I'm interested in
    • anything applied that can contribute to society
    • big data
    • reproducible research
    • how to give a good talk
  • Big believer in open source software

Disclaimer

  1. I will ask a lot of questions
  2. I need your help
  3. Your participation is necessary for this to be useful
  4. Interrupt me often

Introduction

What is Transmission Ratio Distortion (TRD)

  • A statistical departure from the Mendelian 1:1 inheritance ratio
  • Occurs when one of the two alleles from either parent is preferentially transmitted to the offspring
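
As a rough illustration of what "departure from 1:1" means in practice, the sketch below runs an exact two-sided binomial test on transmission counts from heterozygous parents. The function name and the counts are hypothetical and not taken from the GAW19 data; only the Python standard library is used.

```python
from math import comb

def exact_binomial_p(k, n, p=0.5):
    """Two-sided exact binomial p-value for k transmissions of one allele out of n."""
    probs = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    # Sum the probabilities of all outcomes no more likely than the observed one.
    cutoff = probs[k] * (1 + 1e-9)
    return min(1.0, sum(q for q in probs if q <= cutoff))

# Hypothetical counts: allele A transmitted 14 times by 20 heterozygous parents.
print(exact_binomial_p(14, 20))
```

Under Mendelian inheritance each heterozygous parent transmits either allele with probability 0.5, so a small p-value here suggests distortion.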


What is Transmission Ratio Distortion (TRD)

Properties of TRD

  • Can act independently of disease status
    • there are biological processes that cause TRD
    • leads to false positives in linkage/association studies
  • Knowledge of the extent of TRD and its influence in the human genome remains incomplete
    • we didn't find any studies that looked at TRD in WGS data

Biological Mechanisms of TRD

How can we assess TRD?

The Transmission Disequilibrium Test (TDT)

Caveats of assessing TRD

  • Can only be observed in family-based studies
    • Costs
    • Not always feasible to genotype unaffected family members
    • Getting in touch with family members

A note on Test Statistics

Test Statistics

What we care about

Test Statistics

What they actually are

  • A single measure of a sample
  • It reduces the data to one value
  • Need to know its sampling distribution to conduct hypothesis tests

\[ \textrm{Pearson} = \sum \frac{(\textrm{Observed} - \textrm{Expected})^2}{\textrm{Expected}} \sim \chi^2_{(df)} \]

\[ Z = \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \sim \mathcal{N}(0,1) \]

Test Statistics

In general…

\[\textrm{Test Statistic} = \frac{\textrm{a measure of deviation from the ``truth''}}{\textrm{a scaling factor to account for variability in your sample}} \]
  • For example:
  • \[H_0: \textrm{mean height} = 170 \textrm{ cm} \rightarrow \textrm{``truth''} \]
  • \[ \textrm{sample mean } \bar{X} = 220 \textrm{ cm} \]
  • \[ \textrm{sample sd } = 60 \textrm{ cm} \]
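
To make the height example concrete, here is a minimal sketch that plugs the numbers above into the "deviation over scaling factor" recipe. The slide does not give a sample size, so `n = 36` is an assumption chosen purely for illustration, and the scaling factor is taken to be the standard error (sample sd divided by the square root of n), with a normal approximation for the p-value.

```python
from math import erfc, sqrt

def z_statistic(sample_mean, null_mean, sample_sd, n):
    """Deviation from the null value, scaled by the standard error of the mean."""
    standard_error = sample_sd / sqrt(n)
    return (sample_mean - null_mean) / standard_error

def two_sided_normal_p(z):
    """Two-sided p-value under a standard normal sampling distribution."""
    return erfc(abs(z) / sqrt(2))

n = 36  # hypothetical sample size; not stated in the example above
z = z_statistic(sample_mean=220, null_mean=170, sample_sd=60, n=n)
print(z, two_sided_normal_p(z))
```

The same 50 cm deviation gives a smaller statistic when n is smaller or the sd is larger, which is the whole point of the scaling factor.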

What do we mean by a Chi-Square Test?

  • Any statistical hypothesis test in which the sampling distribution of the test statistic is a chi-squared distribution when the null hypothesis is true
  • Examples of chi-square tests
    • Pearson's goodness of fit
    • McNemar's test
    • Tukey's test of additivity
    • Likelihood ratio test
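
As one instance of the list above, the sketch below computes Pearson's goodness-of-fit statistic for offspring genotype counts against the Mendelian 1:2:1 expectation from AB x AB matings. The counts are made up for illustration. With three categories and fully specified expected proportions there are 2 degrees of freedom, and for 2 degrees of freedom the chi-square tail probability has the closed form exp(-x/2), so no statistics library is needed.

```python
from math import exp

def pearson_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi2_sf_2df(x):
    """Upper-tail probability of a chi-square distribution with 2 degrees of freedom."""
    return exp(-x / 2)

observed = [30, 50, 20]            # hypothetical AA, AB, BB offspring counts
total = sum(observed)
expected = [0.25 * total, 0.50 * total, 0.25 * total]  # Mendelian 1:2:1

stat = pearson_statistic(observed, expected)
print(stat, chi2_sf_2df(stat))
```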

Data

Genetic Analysis Workshop 19 (GAW19)

  • GAW19 data drawn from T2D-GENES Project
  • Whole genome sequencing
    • 20 Mexican American Pedigrees from San Antonio
    • Enriched for type 2 diabetes
  • Genotype calls cleaned of Mendelian errors for 959 individuals
    • 464 directly sequenced
    • 495 imputed
    • 8.4 million markers
    • odd-numbered autosomes

Genetic Analysis Workshop 19

20 Mexican American Pedigrees

Genetic Analysis Workshop 19

Imputation Procedure

A novel population-based imputation approach: Prephasing Imputation

  1. Haplotypes are estimated for each individual
  2. Estimated haplotypes used directly for imputation of sequence variants
  3. Imputation ignores family structure
    • to improve quality, the data were cleaned of Mendelian errors
  4. For each missing genotype:
    • the probabilities of each possible genotype were calculated in the context of the local haplotypes
    • the resulting probabilities were then used to generate an appropriately weighted gene dosage variable

Objective

GAW19 Objective

Identify potentially distorted regions in the genome using family based association methods

GAW19 Objective

My Reaction

GAW19 Objective: Take 2

  • See if this inflation in TRD \(p\) values is replicated in other Family Based Tests
    • Pedigree Disequilibrium Test (PDT)
    • Family Based Association Test (FBAT)
  • Compare methods across different subsets of the data
    • Everyone (n=1387)
    • Sequenced only (n=464)
    • 1 Nuclear family per pedigree (n=136)

Methods

Transmission Disequilibrium Test

                A non-transmitted   B non-transmitted   Total
A transmitted   a                   b                   a+b
B transmitted   c                   d                   c+d
Total           a+c                 b+d                 2n
  • \[ \chi^2_{(TDT)} = \frac{(b-c)^2}{b+c} \sim \chi^2_{(1)} \]
  • This is also known as McNemar's test
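
A minimal sketch of the statistic above, using only the discordant counts b and c from the table. The counts are hypothetical. The p-value uses the identity that a chi-square variable with 1 degree of freedom is a squared standard normal, so its tail probability is erfc(sqrt(x/2)).

```python
from math import erfc, sqrt

def tdt(b, c):
    """TDT (McNemar) statistic and p-value from the two discordant cell counts."""
    if b + c == 0:
        raise ValueError("no informative transmissions")
    stat = (b - c) ** 2 / (b + c)
    p_value = erfc(sqrt(stat / 2))  # chi-square upper tail, 1 degree of freedom
    return stat, p_value

# Hypothetical counts: A transmitted 60 times and B transmitted 40 times
# by heterozygous parents.
print(tdt(60, 40))
```

Cells a and d do not enter the statistic: they come from homozygous parents, whose transmissions carry no information about distortion.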

Transmission Disequilibrium Test

Informative Trios

A trio is informative only if at least one parent is heterozygous
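
The sketch below shows one way to turn trio genotypes into TDT counts for a biallelic marker. Genotypes are assumed to be two-letter strings such as "AB"; the function name and this encoding are illustrative choices, not the format used by PLINK or FBAT.

```python
def count_transmissions(father, mother, child, allele="A"):
    """Return (times `allele` was transmitted, times the other allele was
    transmitted) by heterozygous parents in one trio."""
    parents = [father, mother]
    n_het = sum(1 for g in parents if g[0] != g[1])
    n_hom_allele = sum(1 for g in parents if g[0] == g[1] == allele)
    # Copies of `allele` in the child that must have come from heterozygous parents.
    from_het = child.count(allele) - n_hom_allele
    if not 0 <= from_het <= n_het:
        raise ValueError("genotypes are not Mendelian-consistent")
    return from_het, n_het - from_het

print(count_transmissions("AB", "AA", "AA"))  # one heterozygous parent
print(count_transmissions("AA", "BB", "AB"))  # no heterozygous parent
```

A trio with no heterozygous parent returns (0, 0) and so contributes nothing to b or c.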

Transmission Disequilibrium Test

How to assess TRD

Pedigree Disequilibrium Test (PDT)

Consider a marker with two alleles \(A\) and \(B\).

Informative families are:

  1. At least one affected child, both parents genotyped, and at least one heterozygous parent
  2. Discordant sibships (1 affected, 1 unaffected) with different genotypes, parental genotypes not required

Pedigree Disequilibrium Test (PDT)

Informative Trios


Pedigree Disequilibrium Test (PDT)

Informative Discordant Sibships


Pedigree Disequilibrium Test (PDT)

Within an informative nuclear family define:

  • \(X_T\) = (#\(A\) transmitted) - (#\(A\) not transmitted) \(\rightarrow\) \(n_T\) trios
  • \(X_S\) = (#\(A\) affected sib) - (#\(A\) unaffected sib) \(\rightarrow\) \(n_S\) sibships

\[D=\frac{1}{n_T + n_S} \left( \sum X_{T} + \sum X_{S} \right) \]

For \(k=1,\ldots,N\) unrelated informative pedigrees

\[ PDT_{test} = \frac{\sum D_k}{\sqrt{\sum D_k^2 }} \sim \mathcal{N}(0,1) \]
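
The two formulas above can be sketched as follows. Each pedigree is represented by its list of \(X_T\) values (one per informative trio) and \(X_S\) values (one per informative discordant sibship); this data layout and the example numbers are assumptions for illustration and do not reflect the input format of the PDT command line tools.

```python
from math import sqrt

def pedigree_d(x_trios, x_sibships):
    """Average of X_T and X_S over the informative units of one pedigree."""
    n_units = len(x_trios) + len(x_sibships)
    if n_units == 0:
        raise ValueError("pedigree has no informative trios or sibships")
    return (sum(x_trios) + sum(x_sibships)) / n_units

def pdt_statistic(d_values):
    """Sum of D_k divided by the square root of the sum of D_k squared."""
    denominator = sqrt(sum(d * d for d in d_values))
    if denominator == 0:
        raise ValueError("all pedigree summaries are zero")
    return sum(d_values) / denominator

# Hypothetical pedigrees: (X_T values, X_S values).
pedigrees = [([1, 1, -1], [1]), ([2], []), ([0, 1], [-1, 1])]
d_values = [pedigree_d(xt, xs) for xt, xs in pedigrees]
print(pdt_statistic(d_values))
```

Averaging within each pedigree first means large pedigrees do not dominate, and treating each \(D_k\) as one observation is what allows related individuals within a pedigree to be used.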

Family Based Association Test (FBAT)

A unified approach to family based tests of association that can handle:
  1. Different genetic models
  2. Sampling designs
  3. Multiallelic markers
  4. Quantitative traits
  5. Missing parental genotype information

Family Based Association Test (FBAT)

  • The FBAT statistic is based on the covariance between genotype and phenotype
  • What is covariance?
  • A measure of how much two random variables change together
    • The correlation coefficient \(r\) is a normalized version of the covariance
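
For readers who want to see the definition in action, here is a small sketch of the sample covariance and its normalized form, the correlation coefficient. The genotype and trait values are invented for illustration.

```python
from math import sqrt

def covariance(x, y):
    """Sample covariance with the usual n - 1 denominator."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)

def correlation(x, y):
    """Covariance rescaled by both standard deviations; lies in [-1, 1]."""
    return covariance(x, y) / sqrt(covariance(x, x) * covariance(y, y))

genotype = [0, 1, 2, 1, 0, 2]              # hypothetical allele counts
trait = [1.0, 2.1, 2.9, 1.8, 1.2, 3.1]     # hypothetical trait values
print(covariance(genotype, trait), correlation(genotype, trait))
```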

Family Based Association Test (FBAT)

Test Statistic

\[ U = \sum T \left[ X-E(X|P) \right] \rightarrow \textrm{Covariance} \]

\[ FBAT = \frac{U^2}{var(U)} \sim \chi^2_{(1)} \rightarrow \textrm{Test Statistic} \]

  • \(X\): translates offspring's genotype to a numeric value e.g. count of A alleles (random)
  • \(P\): genotype of offspring's parents (fixed)
  • \(T\): offspring's trait (fixed)
  • summation is over all offspring in the sample
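
A minimal sketch of the statistic for the simplest case: independent trios with one offspring each, both parents genotyped, a biallelic marker, and additive coding (\(X\) = number of A alleles). Under Mendelian transmission each parent passes on A with probability equal to half its A count, which gives \(E(X|P)\) and \(var(X|P)\) directly. The data layout and names are hypothetical; the FBAT software handles far more general designs than this.

```python
def fbat_trios(trios):
    """FBAT statistic for independent trios.

    trios: list of (father, mother, child, trait) with genotypes as
    two-letter strings such as "AB" and trait as a number T.
    """
    u = 0.0
    var_u = 0.0
    for father, mother, child, trait in trios:
        p_f = father.count("A") / 2   # chance the father transmits A
        p_m = mother.count("A") / 2   # chance the mother transmits A
        expected = p_f + p_m                            # E(X | P)
        variance = p_f * (1 - p_f) + p_m * (1 - p_m)    # var(X | P)
        u += trait * (child.count("A") - expected)
        var_u += trait ** 2 * variance
    if var_u == 0:
        raise ValueError("no informative trios")
    return u * u / var_u

# Hypothetical affected offspring (T = 1): A transmitted 3 times, B once.
trios = [("AB", "BB", "AB", 1)] * 3 + [("AB", "BB", "BB", 1)]
print(fbat_trios(trios))
```

With \(T = 1\) for every affected offspring, each transmission from a heterozygous parent contributes plus or minus 0.5 to \(U\) and 0.25 to \(var(U)\), so the statistic reduces to \((b-c)^2/(b+c)\), the TDT.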

Family Based Association Test (FBAT)

Why so flexible?

  • Of interest is the offspring's genotype
    • Missing parental genotypes are nuisance parameters
  • Standard approach to handling nuisance parameters:
    • Find sufficient statistics for them
    • Condition on the sufficient statistics
    • Conditional distribution does not depend on nuisance parameters

Family Based Association Test (FBAT)

Comparison of Methods

attribute                     TDT   PDT   FBAT
affected child                ✓     ✓     ✓
1 heterozygous parent         ✓     ✓     ✓
biallelic markers             ✓     ✓     ✓
trios                         ✓     ✓     ✓
missing parental genotypes          ✓     ✓
discordant sibships                 ✓     ✓
pedigrees                           ✓     ✓
multiallelic markers                      ✓
quantitative traits                       ✓

Results

Discussion

Observations

  • Recently developed imputation algorithms giving no Mendelian errors can lead to inflated TDT results
  • These false positive signals can arise, for example, from
    • mistyping homozygote parents as heterozygotes
    • missed calls among heterozygotes
  • PDT and FBAT were not sensitive to imputation
    • more informative families being used \(\rightarrow\) more power

Possible Reasons for Inflated TDT p-values

Pair counts by sequenced status. Information extracted using PEDSTATS

Dosage Analysis

Imputed subjects within threshold at each marker by chromosome

Unexpected HWE vs. MAF

Hardy Weinberg Equilibrium p value vs. MAF for All Individuals

Gene Set Enrichment Analysis

Example of Competitive Testing

Gene Set Enrichment Analysis

Gene Set Enrichment Analysis (GSEA)

Using Biological Processes from Gene Ontology

  • 3.3 million variants input (PDT Sequenced TRD p-value < 0.70)
  • 2.2 million variants used
  • 10,632 genes mapped
  • 19 gene sets selected (GO terms related to biological mechanisms of TRD)
  • 1 result (\(p<0.001\), \(FDR=0.0070\))
    • GO:0045910: negative regulation of DNA recombination

GSEA Results

gene     mapped   significant   variant      -log10(p)   function
RAD18    ✓        ✓             rs529369     2.35        Postreplication repair of UV-damaged DNA
BLM      ✓        ✓             rs404623     2.04        Participates in DNA replication and repair
LIG3     ✓        ✓             rs3135966    2.03        Interacts with DNA-repair protein XRCC1
MSH3     ✓        ✓             rs7737445    1.94        Component of the post-replicative DNA mismatch repair system
MSH2     ✓                      rs72475989   0.50
MSH5
MSH6
ZRANB3

Software

ALL OPEN SOURCE!

  • Family Based Tests: FBAT, PLINK 1.9, PDT command line tools
  • GSEA-SNP: http://gsea4gwas.psych.ac.cn/
  • Plots: qqman, ggplot2, kinship2 packages in R, cranefoot
  • Data Cleaning: awk, bash, data.table package in R, GRinux
  • Slides: R Markdown, pandoc
  • Results App: Shiny package in R
  • Code: available on GitHub
  • Paper: LaTeX
