Lecture 19: Association Studies II

Lecture 19: Association Studies II Date: 10/29/02 • Finish case-control • TDT • Relative Risk

REVIEW Case-Control – Derivation VIII

CORRECTION Case-Control – Hypothesis Testing • Recall that the trait allele frequencies are treated as fixed in order to calculate the trait prevalence K. • Model 1 (HWE, no LE): There are 2n distinct haplotypes, thus there are 2n – 2 degrees of freedom. • Restricted Model 0 (HWE, LE): There are n distinct alleles, thus there are n – 1 degrees of freedom. • 2(lnL1 – lnL0), with (2n – 2) – (n – 1) = n – 1 degrees of freedom, tests for LE under the assumption of HWE. • Calculate the MLE for Model 1 with a modified EM algorithm.

Estimating Genetic Parameters • h = p1, p2, f11, f12, f22 are genetic parameters underlying the theoretical distribution of genotypes in the case-control approach. • When the genetic model and thus h are unknown, then one resorts to contingency tables. • Can the data be used to estimate h?

Estimating Genetic Parameters • One could estimate the haplotype frequencies h1i, h2i simultaneously with the genetic parameters h. • Then, 2[lnL(h1i, h2i, h) – lnL(qi, h)] is a statistic for testing linkage equilibrium without conditioning on known genetic parameters. • However, the G statistic above has an unknown distribution: under linkage equilibrium the marker locus and disease locus are independent, so L(qi, h) does not actually depend on h.

Spurious Associations (4.6.4) • Population subdivision, or any of the other causes of linkage disequilibrium we discussed last time, can cause spurious associations, i.e. linkage disequilibrium not caused by tight linkage. • Population subdivision is probably the most common source of spurious associations. • Other sources of spurious association cannot be accommodated so easily, except to know your population and know what is greater than “normal” association in this population.

Population Subdivision – Identifying Subpopulations • Identify subpopulations where matings occur randomly. These are subpopulations which will differ in trait and marker allele frequencies. Sometimes, a priori information is available about subpopulations in which these allele frequencies differ. • Often subdivide by ethnicity, location, religion, social class, and age.

Population Subdivision - Sampling Designs • Sample only from one identified subdivision. • Match case and control by subdivision. • In complex traits, there may be multiple loci associated with a disease, and these loci may vary between subpopulations. Which sampling scheme do you recommend?

Hidden Population Stratification • One cannot anticipate all sources of spurious association. • Internal checks may indicate presence of remaining spurious association. • Test HWE on individual markers. • Test markers on different chromosomes for spurious association. • Trait loci that associate tightly with multiple distant markers are a sign of trouble.
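
The "Test HWE on individual markers" check can be sketched as a Pearson chi-square test at a single biallelic marker. This is a minimal illustration rather than the lecture's own procedure: the function name `hwe_chi_square` and the genotype counts in the usage note are hypothetical, and the p-value uses the closed form for a chi-square variable with 1 degree of freedom.

```python
import math

def hwe_chi_square(n_AA, n_Aa, n_aa):
    """Pearson chi-square test of HWE at one biallelic marker (1 df).

    Assumes both alleles are present in the sample, so that every
    expected genotype count is positive.
    """
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)  # estimated frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_AA, n_Aa, n_aa)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

For example, `hwe_chi_square(40, 20, 40)` (made-up counts with a strong heterozygote deficit) gives a statistic of 36, which would flag the marker for a closer look at possible stratification.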

Using Families – Removing Spurious Association • The effect of spurious association can be removed by comparing the chromosomes of affected children to their relatives. • The most common relative to use? Parents. • This does NOT mean that we are returning to family-based linkage analysis. As you will see, we still use information from multiple generations of recombination.

Moving to Biallelic Model [Figure: linkage disequilibrium versus linkage equilibrium]

TDT – Assumptions • Depends on the presence of linkage disequilibrium at the population level. • Assumes random mating.

TDT – Genetic Model • Marker locus A and disease locus D, separated by recombination fraction q. • Allele frequencies: P(A) = pA, P(a) = 1 – pA, P(D) = pD, P(d) = 1 – pD. • Linkage disequilibrium: DAD = hAD – pApD.

TDT – Haplotype Frequencies
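
The slide's table is not preserved here. As a hedged reconstruction, the four haplotype frequencies follow from the allele frequencies and the disequilibrium coefficient defined on the genetic-model slide, written here as D_{AD} = h_{AD} - p_A p_D. These are the standard two-locus identities, not necessarily the exact layout used in the lecture.

```latex
\begin{aligned}
h_{AD} &= p_A\,p_D + D_{AD} \\
h_{Ad} &= p_A\,(1 - p_D) - D_{AD} \\
h_{aD} &= (1 - p_A)\,p_D - D_{AD} \\
h_{ad} &= (1 - p_A)(1 - p_D) + D_{AD}
\end{aligned}
```

The four frequencies sum to 1, and each row or column sum recovers the corresponding allele frequency (for instance h_{AD} + h_{Ad} = p_A).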

TDT – The Test • Assume we randomly sample affected individuals and then genotype that individual and his/her two parents for marker A. • Take those families where the parents are heterozygous for the marker. • Record the data as transmitted and nontransmitted alleles. A table as shown on the next slide is typically used.

TDT – The Table N is the number of affected children sampled.

TDT – Filling the Table • Parents: Aa × Aa; affected child: AA. • n12 += _____ n21 += _____

TDT – Filling the Table • Parents: Aa × Aa; affected child: Aa. • n12 += _____ n21 += _____
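
The bookkeeping in these two exercises can be sketched in code. The function `tdt_counts` and its genotype-string input format are hypothetical. The sketch also assumes that n12 counts heterozygous parents who transmit A (and not a), and that n21 counts those who transmit a (and not A); the table slide that would confirm this labelling is not preserved here.

```python
from itertools import product

def tdt_counts(trios, a1="A", a2="a"):
    """Count transmissions from heterozygous parents.

    Each trio is (father, mother, child); each genotype is a two-character
    string such as "Aa". Returns (n12, n21) under the assumed labelling:
    n12 = a1 transmitted, a2 not; n21 = a2 transmitted, a1 not.
    """
    n12 = n21 = 0
    for father, mother, child in trios:
        # Pick one (paternal, maternal) allele pair consistent with the child.
        transmitted = next(
            (pair for pair in product(father, mother)
             if sorted(pair) == sorted(child)),
            None,
        )
        if transmitted is None:
            raise ValueError(f"child {child} incompatible with {father} x {mother}")
        for parent, allele in zip((father, mother), transmitted):
            if set(parent) != {a1, a2}:
                continue  # homozygous parents are not used by the TDT
            if allele == a1:
                n12 += 1
            else:
                n21 += 1
    return n12, n21
```

When both parents are Aa and the child is Aa, it is ambiguous which parent transmitted which allele, but the counts are the same either way, so picking any consistent assignment is enough for a biallelic marker.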

TDT – Statistic
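
The formula on this slide is not preserved. The form usually given for the TDT (a McNemar-type statistic, as in Spielman et al. 1993) is shown below, with n_{12} and n_{21} taken to be the two off-diagonal counts of the transmitted/nontransmitted table; that labelling is an assumption.

```latex
\chi^2_{\mathrm{TDT}} = \frac{(n_{12} - n_{21})^2}{n_{12} + n_{21}}
```

Under the null hypothesis, a heterozygous parent transmits either allele with equal probability, and the statistic is approximately chi-square distributed with 1 degree of freedom.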

TDT – Derivation Nontransmitted Transmitted Under H0 the expected frequencies are equal.

TDT – Example • Search for a marker associated with insulin-dependent diabetes mellitus (IDDM) (Spielman et al. 1993). • 94 families were included in the study. • 62 families had heterozygous parents at a marker on chromosome 11 with possible alleles “1” and “X”. • 78 “1” alleles were transmitted to affected children; 124 – 78 = 46 “X” alleles were transmitted to affected children.

TDT – Example (cont)
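
As a quick check of the arithmetic, the counts from the example (78 and 46 transmissions) can be plugged into the McNemar-type TDT statistic. The variable names are illustrative, and the p-value uses the closed-form upper tail of a chi-square with 1 degree of freedom.

```python
import math

transmitted_1 = 78  # "1" alleles transmitted by heterozygous parents
transmitted_X = 46  # "X" alleles transmitted (124 - 78)

# TDT statistic: squared difference over the total number of transmissions
chi2 = (transmitted_1 - transmitted_X) ** 2 / (transmitted_1 + transmitted_X)

# Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(chi2 / 2.0))

print(f"chi-square = {chi2:.2f}, p-value = {p_value:.4f}")
```

The statistic is 1024/124, about 8.26, which exceeds the 1% critical value of 6.63 for 1 degree of freedom.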

TDT - Power • How do we calculate the power of a TDT test? Make assumptions

TDT – Power (cont) • Statistical power is given by

TDT – Power (cont) • Power increases with sample size (number of affected children). • Power increases as the recombination fraction decreases. • Power increases as linkage disequilibrium in the population increases. • Power increases as trait allele frequency decreases (trait is rare). • Power is only slightly affected by marker allele frequencies.

TDT – Power Compared • TDT has lower power than a simple test for linkage disequilibrium in a random population sample. • TDT loses power by ignoring some of the data (only heterozygous parents considered) and because homozygous parents provide much information about linkage disequilibrium. • Why is TDT used then?

TDT – Advantages • TDT is a test for linkage and linkage disequilibrium, not just linkage disequilibrium. • Linkage disequilibrium from non-linkage sources can only change the genotypes of the parents. • TDT tests transmission from heterozygous parents, and only linkage can produce a significant result. • TDT can also detect segregation distortion at the marker locus, which is another reason to check marker alleles for segregation distortion.

TDT – Advantages (cont) [Figure: transmission of haplotypes AD, Ad, and aD from parents to offspring when the loci are unlinked versus linked]

Relative Risk Method • Analogous to the general disequilibrium test on a random population sample when the trait or marker is dominant or recessive (two genotype classes indistinguishable). • Observe two independent groups, defined by their marker genotype. • Determine the risk of being affected conditional on group, P(affected | marker group). • Then, the relative risk is the ratio of the two conditional risks: RR = P(affected | marker group 1) / P(affected | marker group 2).

Relative Risk – Data

Relative Risk – Statistic
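
The slide's formula is not preserved. The following is a minimal sketch of estimating the relative risk from counts in the two marker groups; the function name `relative_risk` and the counts in the usage note are hypothetical.

```python
def relative_risk(affected_1, unaffected_1, affected_2, unaffected_2):
    """Estimate P(affected | group 1) / P(affected | group 2) from counts."""
    risk_1 = affected_1 / (affected_1 + unaffected_1)
    risk_2 = affected_2 / (affected_2 + unaffected_2)
    return risk_1 / risk_2
```

For example, with made-up counts of 30 affected and 70 unaffected in marker group 1, and 10 affected and 90 unaffected in group 2, the estimated risks are 0.3 and 0.1, giving a relative risk of about 3.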

Relative Risk – Conditional Probabilities

Relative Risk – Null Distribution

Relative Risk – Statistical Test • Chi-squared test for independence on the table. • Likelihood ratio test: 2 degrees of freedom

Haplotype Relative Risk • Parents: AB × BC; affected child: BB. • case genotype: _____ control genotype: _____

Haplotype-Based HRR (HHRR) • Focus on alleles rather than genotypes. • There are two transmitted and two non-transmitted alleles in every pair of parents with one affected offspring. • Treat the two allele samples as independent case-control samples.

HHRR – II • Parents: AB × BC; affected child: BB. • case alleles: _____ control alleles: _____
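
The HRR and HHRR exercises use the same split of parental alleles into transmitted and nontransmitted sets, which can be sketched as follows. The helper `split_transmitted` and its genotype-string format are hypothetical; the trio is the one from the slides (parents AB and BC, affected child BB).

```python
def split_transmitted(father, mother, child):
    """Return (transmitted, nontransmitted) allele lists for one trio.

    Genotypes are two-character strings; the first consistent assignment
    of one paternal and one maternal allele to the child is used.
    """
    for i, j in ((0, 0), (0, 1), (1, 0), (1, 1)):
        if sorted((father[i], mother[j])) == sorted(child):
            transmitted = [father[i], mother[j]]
            nontransmitted = [father[1 - i], mother[1 - j]]
            return transmitted, nontransmitted
    raise ValueError("child genotype is incompatible with the parents")

transmitted, nontransmitted = split_transmitted("AB", "BC", "BB")

# HRR: compare genotypes (the child's versus the nontransmitted pair)
case_genotype = "".join(sorted(transmitted))
control_genotype = "".join(sorted(nontransmitted))

# HHRR: treat the same alleles as two separate allele samples
case_alleles = sorted(transmitted)
control_alleles = sorted(nontransmitted)

print(case_genotype, control_genotype, case_alleles, control_alleles)
```

For this trio the case genotype is BB and the control (nontransmitted) genotype is AC. HHRR uses the same alleles individually, with B and B as case alleles and A and C as control alleles.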

HHRR – III

HRR & HHRR • Most powerful when linkage is 0. • Both assume random mating when they assume the parents provide an independent control genotype or alleles. • HHRR is more powerful than TDT because it uses information from homozygous parents. • HHRR is a valid test statistic for DAD = 0 and q = 0.
