- http://sahirbhatnagar.com
- 2nd year PhD student (Biostatistics) with Celia
- I'm interested in
- anything applied that can contribute to society
- big data
- reproducible research
- how to give a good talk

- Big believer in open source software

August 7, 2014 Statistical Genetics Journal Club

- I will ask a lot of questions
- I need your help
- Your participation is necessary for this to be useful
- Interrupt me often

- A statistical departure from the Mendelian 1:1 inheritance ratio
- Occurs when one of the two alleles from either parent is preferentially transmitted to the offspring
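
As a rough illustration of the definition above, the sketch below tests whether transmissions from heterozygous parents depart from 1:1 using a Pearson goodness-of-fit statistic. The counts and the function name `trd_chisq` are hypothetical, not taken from the GAW19 data. The p-value relies on the fact that a chi-squared variable with 1 df is a squared standard normal.

```python
import math

def trd_chisq(n_a, n_b):
    """Pearson goodness-of-fit test of a 1:1 transmission ratio.

    n_a, n_b: number of times allele A / allele B was transmitted
    by heterozygous parents (hypothetical counts).
    Returns (chi-squared statistic, p-value on 1 df).
    """
    total = n_a + n_b
    if total == 0:
        raise ValueError("no transmissions observed")
    expected = total / 2.0  # Mendelian expectation for each allele
    stat = ((n_a - expected) ** 2 + (n_b - expected) ** 2) / expected
    # For 1 df: P(chi2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

stat, p = trd_chisq(60, 40)  # made-up counts, for illustration only
print(f"chi2 = {stat:.2f}, p = {p:.4f}")
```

A small p-value here only says the ratio departs from 1:1; it does not say why.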

- Can act independently of disease status
- there are biological processes that cause TRD
- leads to false positives in linkage/association studies

- Our knowledge of the extent of TRD and its influence in the human genome remains incomplete
- we didn't find any studies that looked at TRD in WGS data

- Can only be observed in family-based studies
- Costs
- Not always feasible to genotype unaffected
- Getting in touch with family members

- A single measure of a sample
- It reduces the data to one value
- Need to know its sampling distribution to conduct hypothesis tests

\[ \textrm{Pearson} = \sum \frac{(\textrm{Observed} - \textrm{Expected})^2}{\textrm{Expected}} \sim \chi^2_{(df)} \]

\[ Z = \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \sim \mathcal{N}(0,1) \]

\[\textrm{Test Statistic} = \frac{\textrm{a measure of deviation from the "truth"}}{\textrm{a scaling factor to account for variability in your sample}} \]

- For example:

- \[H_0: \textrm{mean height}= 170 cm \rightarrow \textrm{``truth''} \]

- \[ \textrm{sample mean } \bar{X}=220cm\]

- \[ \textrm{sample sd } = 60cm \]
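
To make the "deviation over scaling" idea concrete, here is a minimal sketch that plugs the numbers from the example into a Z statistic. The sample size is not given in the example, so `n = 25` is purely an assumption, and the sample sd is treated as if it were the known population sd so that a normal reference distribution applies.

```python
import math

mu_0 = 170.0   # hypothesised mean height (cm): the "truth" under H0
x_bar = 220.0  # sample mean (cm)
sd = 60.0      # sample standard deviation (cm)
n = 25         # ASSUMPTION: sample size is not stated in the example

deviation = x_bar - mu_0        # measure of deviation from the "truth"
scaling = sd / math.sqrt(n)     # standard error of the sample mean
z = deviation / scaling

# Two-sided p-value from the standard normal: P(|Z| > |z|)
p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
print(f"z = {z:.2f}, two-sided p = {p_two_sided:.2g}")
```

With a smaller assumed `n` the scaling factor grows and the same 50 cm deviation becomes less surprising.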

- Any statistical hypothesis test in which the sampling distribution of the test statistic is a chi-squared distribution when the null hypothesis is true

**Chi-Square Test**

- Pearson's goodness of fit
- McNemar's test
- Tukey's test of additivity
- Likelihood ratio test

- GAW19 data drawn from the **T2D-GENES** Project
- Whole genome sequencing
- 20 Mexican American Pedigrees from San Antonio
- Enriched for type 2 diabetes

- Genotype calls cleaned of Mendelian errors for 959 individuals
- 464 directly sequenced
- 495 imputed
- 8.4 million markers
- odd-numbered autosomes

A novel population-based imputation approach: **Prephasing Imputation**

- Haplotypes are estimated for each individual
- Estimated haplotypes used directly for imputation of sequence variants
- Imputation ignores family structure
- To improve quality, the data were cleaned of Mendelian errors

- For each missing genotype:
- the probabilities of each possible genotype were calculated in the context of the local haplotypes
- the resulting probabilities were then used to generate an appropriately weighted gene dosage variable
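
The exact weighting is not spelled out here. A common choice, assumed in this sketch, is the expected count of one allele under the genotype probabilities. The function name `gene_dosage` and the probabilities are hypothetical.

```python
import math

def gene_dosage(p_aa, p_ab, p_bb):
    """Expected number of B alleles given genotype probabilities.

    ASSUMPTION: dosage = 0*P(AA) + 1*P(AB) + 2*P(BB); the source only
    says the probabilities are used as weights.
    """
    total = p_aa + p_ab + p_bb
    if not math.isclose(total, 1.0, abs_tol=1e-6):
        raise ValueError("genotype probabilities must sum to 1")
    return 0.0 * p_aa + 1.0 * p_ab + 2.0 * p_bb

# Made-up probabilities for one missing genotype
dosage = gene_dosage(0.1, 0.7, 0.2)
print(f"dosage = {dosage:.2f}")
```

A dosage near 0, 1 or 2 indicates a confidently imputed genotype; values in between carry the imputation uncertainty forward.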

Identify potentially distorted regions in the genome using family based association methods

- See if this inflation in TRD \(p\) values is replicated in other Family Based Tests
- Pedigree Disequilibrium Test (PDT)
- Family Based Association Test (FBAT)

- Compare methods across different subsets of the data
- Everyone (n=1387)
- Sequenced only (n=464)
- 1 Nuclear family per pedigree (n=136)

| | A non transmitted | B non transmitted | Total |
|---|---|---|---|
| A transmitted | a | b | a+b |
| B transmitted | c | d | c+d |
| Total | a+c | b+d | 2n |

- \[ \chi^2_{(TDT)} = \frac{(b-c)^2}{b+c} \sim \chi^2_{(1)} \]

- This is also known as a McNemar Test

Need to have at least 1 heterozygous parent for trio to be informative
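
The TDT formula above uses only the off-diagonal cells of the table, that is, transmissions from heterozygous parents. A minimal sketch with made-up counts; the function name `tdt` is hypothetical.

```python
import math

def tdt(b, c):
    """TDT (McNemar) statistic from the discordant cells of the table.

    b: heterozygous parents transmitting A (and not B)
    c: heterozygous parents transmitting B (and not A)
    Returns (chi-squared statistic, p-value on 1 df).
    """
    informative = b + c
    if informative == 0:
        raise ValueError("no heterozygous parents: nothing to test")
    stat = (b - c) ** 2 / informative
    # For 1 df: P(chi2 > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

tdt_stat, tdt_p = tdt(45, 25)  # hypothetical counts
print(f"TDT chi2 = {tdt_stat:.2f}, p = {tdt_p:.4f}")
```

The cells a and d never enter the calculation, which is why homozygous parents are uninformative.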

Consider a marker with two alleles \(A\) and \(B\).

**Informative families** are:

- At least one affected child, both parents genotyped, at least one heterozygous parent
- Discordant sibships (1 affected, 1 unaffected) with different genotypes, parental genotypes not required

Within an informative nuclear family define:

- \(X_T\) = (#\(A\) transmitted) - (#\(A\) not transmitted) \(\rightarrow\) \(n_T\) *trios*
- \(X_S\) = (#\(A\) affected sib) - (#\(A\) unaffected sib) \(\rightarrow\) \(n_S\) *sibships*

\[D=\frac{1}{n_T + n_S} \left( \sum X_{T} + \sum X_{S} \right) \]

For \(k=1,\ldots,N\) unrelated informative pedigrees

\[ PDT_{test} = \frac{\sum D_k}{\sqrt{\sum D_k^2 }} \sim \mathcal{N}(0,1) \]
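
The two formulas above can be sketched as follows: compute \(D\) within each pedigree, then combine across pedigrees. The \(X_T\) and \(X_S\) values are invented for illustration and the function names are hypothetical.

```python
import math

def pedigree_d(x_trios, x_sibs):
    """Average of X_T and X_S over the informative trios and
    discordant sibships of one pedigree."""
    n_units = len(x_trios) + len(x_sibs)
    if n_units == 0:
        raise ValueError("pedigree has no informative families")
    return (sum(x_trios) + sum(x_sibs)) / n_units

def pdt_statistic(d_values):
    """PDT statistic: sum of D_k over sqrt(sum of D_k^2)."""
    denom = math.sqrt(sum(d * d for d in d_values))
    if denom == 0.0:
        raise ValueError("all D_k are zero: statistic undefined")
    return sum(d_values) / denom

# Hypothetical data: (X_T per trio, X_S per sibship) for 3 pedigrees
pedigrees = [
    ([1, 0, 1], [1]),
    ([-1, 1], []),
    ([1, 1], [0, 1]),
]
d_values = [pedigree_d(xt, xs) for xt, xs in pedigrees]
pdt = pdt_statistic(d_values)
# Two-sided p-value against the standard normal
pdt_p = math.erfc(abs(pdt) / math.sqrt(2.0))
print(f"D = {d_values}, PDT = {pdt:.3f}, p = {pdt_p:.3f}")
```

Each pedigree contributes a single \(D_k\), so large pedigrees do not dominate simply by containing more trios.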

A unified approach to family based tests of association that can handle:

- Different genetic models
- Sampling designs
- Multiallelic markers
- Quantitative traits
- Missing parental genotype information

- The FBAT statistic is based on the **covariance** between genotype and phenotype

- What is covariance ?

- A measure of how much two random variables change together
- The correlation \(r\) is a normalized version of the covariance, and \(R^2\) is its square
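
A small sketch of covariance and its normalized form, using invented genotype counts and trait values. The function names are hypothetical and only the standard library is used.

```python
import math

def covariance(xs, ys):
    """Sample covariance (n - 1 denominator) of two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Covariance rescaled by both standard deviations, so it lies in [-1, 1]."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

genotype = [0, 1, 1, 2, 2]         # made-up counts of allele A
trait = [1.0, 2.0, 3.0, 4.0, 5.0]  # made-up quantitative trait

cov = covariance(genotype, trait)
r = correlation(genotype, trait)
print(f"cov = {cov:.2f}, r = {r:.3f}, r^2 = {r * r:.3f}")
```

The covariance depends on the units of both variables; the correlation does not.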

\[ U = \sum T^* \left[ X-E(X|P) \right] \rightarrow \textrm{Covariance} \]

\[ FBAT = \frac{U^2}{var(U)} \sim \chi^2_{(1)} \rightarrow \textrm{Test Statistic} \]

- \(X\): translates offspring's genotype to a numeric value e.g. count of A alleles (random)
- \(P\): genotype of offspring's parents (fixed)
- \(T^*\): offspring's trait (fixed)
- summation is over all offspring in the sample
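
A minimal sketch of the statistic under simplifying assumptions that are not in the source: a biallelic marker, additive coding (\(X\) = count of A alleles), both parents genotyped, and one offspring per independent family. Under Mendelian transmission each heterozygous parent then contributes mean 0.5 and variance 0.25 to \(X\). The function names and data are hypothetical.

```python
import math

def transmit_moments(parent_a_count):
    """Mean and variance of the number of A alleles (0 or 1) a parent
    transmits, given that the parent carries 0, 1 or 2 A alleles."""
    mean = parent_a_count / 2.0
    var = 0.25 if parent_a_count == 1 else 0.0
    return mean, var

def fbat(trios):
    """trios: iterable of (trait, offspring_a_count, father_a_count,
    mother_a_count). ASSUMES one offspring per independent family."""
    u = 0.0
    var_u = 0.0
    for trait, x, father, mother in trios:
        mean_f, var_f = transmit_moments(father)
        mean_m, var_m = transmit_moments(mother)
        u += trait * (x - (mean_f + mean_m))     # T * [X - E(X|P)]
        var_u += trait ** 2 * (var_f + var_m)    # T^2 * Var(X|P)
    if var_u == 0.0:
        raise ValueError("no informative families")
    stat = u * u / var_u
    p_value = math.erfc(math.sqrt(stat / 2.0))  # chi-squared, 1 df
    return stat, p_value

# Hypothetical affected offspring, trait coded as 1
trios = [(1.0, 2, 1, 1), (1.0, 1, 1, 0), (1.0, 2, 1, 2), (1.0, 1, 1, 1)]
fbat_stat, fbat_p = fbat(trios)
print(f"FBAT chi2 = {fbat_stat:.3f}, p = {fbat_p:.3f}")
```

In this special case (all offspring affected with trait coded 1) the statistic matches the TDT computed from the same transmissions: here b = 5 and c = 1, giving 16/6.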

- Of interest is the offspring's genotype
- Missing parental genotypes are *nuisance* parameters

- Standard approach to handling *nuisance* parameters:
- Find sufficient statistics for them
- Condition on the sufficient statistics
- Conditional distribution does not depend on *nuisance* parameters