Case-control association techniques in genetic studies. March 10, 2011. Karen Curtin, Ph.D. Division of Genetic Epidemiology and HCI Pedigree & Population Resource (PPR). Presentation outline. Background (genetics concepts). Basic case-control association.

Case-control association techniques in genetic studies

March 10, 2011

Karen Curtin, Ph.D.

Division of Genetic Epidemiology and

HCI Pedigree & Population Resource (PPR)


Presentation outline

  • Background (genetics concepts)
  • Basic case-control association
  • Complex case-control association
  • Genome-wide association
The Human Genome: 6 billion DNA bases (Adenine, Cytosine, Guanine, or Thymine)

License: Creative Commons Attribution 2.0

Genotype and Haplotype

…AGCCAAACTGAATTC…

…AGCCAAATTGGATTC…

At any locus (position on a chromosome):

  • Read across both chromosomes: genotype CT
  • Read along a chromosome: haplotypes C-A and T-G

If allele T can predict allele G, the two alleles are in Linkage Disequilibrium (LD).

90% of genomic variants are SNPs

Single Nucleotide Polymorphism

Two alternate forms (alleles) that differ in sequence at one point in a DNA segment

Source: David Hall, Creative Commons Attribution 2.5 license

Genetic variants: Germline vs. Somatic
  • Germline variant/mutations
    • Inherited/In-born mutation
    • In all cells
    • In particular, in germline haploid cells
      • Heritable
    • Cell division - meiosis
  • Somatic variants/mutations
    • Acquired mutation
    • Only in an isolated number of cells (tumor site)
      • Generally not heritable
    • Cell division - mitosis
Hereditary mutation - meiosis

[Figure: parent germ cells produce HAPLOID daughter cells; two haploid cells combine (X) to form new DIPLOID zygotes]


Presentation outline

  • Background (genetics concepts)
  • Basic case-control association
  • Complex case-control association
  • Genome-wide association
Genetic variants in association studies

Association: two characteristics (disease & genetic variant) occur more often together than expected by chance

  • Direct Association / Causal

Functional variant → Disease

    • Functional variant is involved in disease
    • Functional variant is associated with the disease
  • Indirect Association

Genetic variant ↔ Functional variant → Disease

    • Genetic variant (SNP) is associated/correlated with underlying functional variant
    • Functional variant is involved in disease
    • Genetic variant (marker) is associated with disease (an initial step; the ultimate goal is to discover the causal variant)
Genetic association study designs
  • Observational
    • Exposure variables
      • Genetic variants
      • Environmental factors
  • Classical association study designs
    • Unit of interest is an individual
    • Cohort study (cross-sectional or longitudinal)
    • Case-control study
  • Family-based association study
    • Unit of interest is a family unit

Case-Control Study

  • Sample individuals based on disease status and without knowledge of exposure status (e.g. genotype)
    • CASES (with disease)
    • CONTROLS (no disease)
  • Usually balanced design (#cases = #controls)
  • Retrospective
  • Neither prevalence nor incidence can be estimated
Types of Case-Control Study
  • Population-based
    • Risk estimates can be extrapolated to the source population
    • Could be nested in a cohort study
  • Selected sampling
    • Increases power to detect associations
      • Antoniou & Easton (2003)
    • Tests of independence are valid
    • True positive risks are exaggerated
      • Can not be extrapolated
Case-Control: Population-based
  • Source population
    • All individuals satisfying predefined criteria
  • Source cohort
    • A group that is ‘representative’ of the source population
    • CASES and CONTROLS occur in relation to population prevalence
  • CASES
    • Cases selected are ‘representative’ of cases in the source cohort
    • In particular, in terms of the exposure variables
  • CONTROLS
    • Controls selected are ‘representative’ of controls in the source cohort
    • In particular, in terms of the exposure variables
  • Odds Ratio (estimate of the relative risk) can be extrapolated back to the source population
    • Population Attributable Risk (PAR)
Case-Control: Selected Sampling
  • Source population
    • All individuals satisfying predefined criteria
  • Source cohort
    • A group that is ‘representative’ of the source population
    • CASES and CONTROLS occur in relation to population prevalence
  • CASES
    • Cases selected are in effect selectively sampled from cases in source cohort
    • Family history of disease, severe disease, early onset,…
  • CONTROLS
    • Controls selected are in effect selectively sampled from controls in source cohort
    • Screened negative, no family history,…
  • Association analyses are still valid and power may be increased
  • BUT…
    • Odds Ratio (estimate of the relative risk) cannot be extrapolated back to the source population
Case-Control Study: Odds Ratio

                           Exposure
                           Yes    No
Disease  Cases (Yes)        a      b
         Controls (No)      c      d

$$\mathrm{OR} = \frac{a/b}{c/d} = \frac{a \times d}{b \times c}$$

H0: OR = 1 same risk (no association)

OR > 1 indicates increased risk

OR < 1 indicates decreased risk (protective)

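95% confidence intervals for the Odds Ratio

Lower and upper bounds for the risk estimates.

Two common methods:

1) $\left(e^{\ln(\mathrm{OR}) - 1.96\,\mathrm{se}(\ln \mathrm{OR})},\; e^{\ln(\mathrm{OR}) + 1.96\,\mathrm{se}(\ln \mathrm{OR})}\right)$

where $\mathrm{se}(\ln \mathrm{OR}) = \sqrt{1/a + 1/b + 1/c + 1/d}$

2) $\left(\mathrm{OR}^{1 - 1.96/\chi},\; \mathrm{OR}^{1 + 1.96/\chi}\right)$, where $\chi$ is the square root of the chi-square statistic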
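As a rough illustration of the first (log-based) interval, the sketch below computes the odds ratio and its 95% limits from the four cells a, b, c, d of the 2×2 table. The function name and the example counts are assumptions made for illustration; the standard error is the square root of the summed reciprocal cell counts.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and log-based confidence interval for a 2x2 table.

    a, b = exposed / unexposed cases; c, d = exposed / unexposed controls.
    """
    odds_ratio = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # standard error of ln(OR)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper

# Hypothetical counts, not taken from the slides.
odds_ratio, lower, upper = odds_ratio_ci(a=100, b=120, c=60, d=120)
print(f"OR = {odds_ratio:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

An interval that excludes 1 corresponds to rejecting H0: OR = 1 at roughly the 5% level.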

chi-square test

Compares observed values (O) with those expected under independence between rows and columns

$$E = \frac{\text{row total} \times \text{column total}}{N}$$

chi-square statistic, with (rows-1) × (columns-1) degrees of freedom

$$\chi^2 = \sum \frac{(O - E)^2}{E} \sim \chi^2_{(\text{rows}-1)(\text{columns}-1)}$$
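A minimal sketch of this calculation, assuming the table is given as a list of rows of observed counts. The function name and the example 2×3 table are hypothetical; the p-value line uses the closed form of the chi-square tail that holds only for 2 degrees of freedom.

```python
import math

def pearson_chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # E = row total x column total / N
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

# Hypothetical genotype table (rows: controls, cases; columns: CC, CT, TT).
stat, df = pearson_chi_square([[120, 60, 20], [100, 60, 40]])
p_value = math.exp(-stat / 2)  # chi-square upper tail, valid for df = 2 only
print(f"chi-square = {stat:.2f}, df = {df}, p = {p_value:.4f}")
```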

Test for Non-independence

H0: Disease and exposure (genotype)

are independent

chi-square tests: contingency tables

2×3 genotype table (2 df)

2×2 grouped genotype table (1 df)

  • Dominant or recessive

2×3 ‘dose-dependent’ table

  • Armitage test for trend (1 df)

2×2 allele table (1 df)

Modeling genetic exposures
  • Exposure = genotype
  • Single variant with 2 alleles (SNP)
  • Three genotypes: CC, CT, TT
  • 2×3 contingency table
    • Chi-sq 2df
    • Chi-sq 1df (impose a linear dependency between columns)

             CC    CT    TT
Controls
Cases

Mode of Expression / Inheritance
  • Let allele C be disease causing
  • Examples of modes of expression are:
    • Dominant: TT vs. (TC, CC)
      • Being heterozygous or homozygous for the C allele gives rise to the disease
    • Recessive: (TT, TC) vs. CC
      • Only homozygosity for the C allele results in disease
    • Codominant: TT vs. TC vs. CC
      • All three genotypes can be distinguished phenotypically
      • ‘Additive’ model: TC has r-fold risk, CC has 2r-fold risk
chi-square test

Expected counts (E):

              CC     CT     TT    Totals
Controls     120     50     30       200
Cases        120     50     30       200
Totals       240    100     60       400

$$\chi^2 = \frac{(120-120)^2}{120} + \frac{(40-50)^2}{50} + \frac{(20-30)^2}{30} + \frac{(120-120)^2}{120} + \frac{(60-50)^2}{50} + \frac{(40-30)^2}{30} = 10.67$$

p-value = 0.0048 (for a chi-square distribution with 2 df)

Genotypic relative risk
  • Assess risk (OR) for each genotype relative to the homozygous common genotype

$$\mathrm{OR}_{het}\ (\text{CT vs. CC}) = \frac{a \times e}{b \times d} \qquad \mathrm{OR}_{hzv}\ (\text{TT vs. CC}) = \frac{a \times f}{c \times d}$$

Genotype (exposure)

             CC    CT    TT
Controls      a     b     c
Cases         d     e     f

chi-square test / genotypic relative risk

Expected counts (E):

              CC     CT     TT    Totals
Controls     120     50     30       200
Cases        120     50     30       200
Totals       240    100     60       400

$$\chi^2 = \frac{(120-120)^2}{120} + \frac{(40-50)^2}{50} + \frac{(20-30)^2}{30} + \frac{(120-120)^2}{120} + \frac{(60-50)^2}{50} + \frac{(40-30)^2}{30} = 10.67$$

p-value = 0.0048 (for a chi-square distribution with 2 df)

OR het (CT vs. CC) = 1.5; OR hzv (TT vs. CC) = 2.0

Test for Non-independence

H0: Disease and exposure (genotype)

are independent

chi-square tests: contingency tables

2×3 genotype table (2 df)

2×2 grouped genotype table (1 df)

  • Dominant or recessive

2×3 ‘dose-dependent’ table

  • Armitage test for trend (1 df)

2×2 allele table (1 df)

Dominant model for exposure

Exposure = CT & TT genotypes - 2×2 test with 1 df

$$\mathrm{OR}_{dom} = \frac{a \times (e+f)}{d \times (b+c)} = 1.67$$

Genotype     CC    CT or TT
Controls      a    (b+c)
Cases         d    (e+f) = 100

Recessive model for exposure

Exposure = TT genotype (vs. CC & CT) - 2×2 test with 1 df

$$\mathrm{OR}_{rec} = \frac{(a+b) \times f}{(d+e) \times c} = 1.78$$

Genotype     CC or CT         TT
Controls     (a+b) = 160       c
Cases        (d+e) = 180       f
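The grouped (dominant and recessive) odds ratios can be sketched by collapsing the three genotype columns into exposed and unexposed. The individual cell counts below are an assumption: they are chosen to agree with the sums quoted on these slides ((a+b) = 160, (d+e) = 180, (e+f) = 100) and are not printed in the deck itself. Function names are hypothetical.

```python
def odds_ratio(case_exp, case_unexp, ctrl_exp, ctrl_unexp):
    """Odds of exposure in cases divided by odds of exposure in controls."""
    return (case_exp * ctrl_unexp) / (case_unexp * ctrl_exp)

def collapse(counts, exposed_genotypes):
    """Collapse genotype counts into (exposed, unexposed) totals."""
    exposed = sum(n for g, n in counts.items() if g in exposed_genotypes)
    return exposed, sum(counts.values()) - exposed

# Assumed cell counts, consistent with the marginal sums on the slides.
controls = {"CC": 120, "CT": 40, "TT": 20}  # a, b, c
cases = {"CC": 120, "CT": 60, "TT": 40}     # d, e, f

or_dominant = odds_ratio(*collapse(cases, {"CT", "TT"}), *collapse(controls, {"CT", "TT"}))
or_recessive = odds_ratio(*collapse(cases, {"TT"}), *collapse(controls, {"TT"}))
print(f"OR dominant = {or_dominant:.2f}, OR recessive = {or_recessive:.2f}")
```

Under these assumed counts the two values round to the 1.67 and 1.78 shown on the slides.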

Test for Non-independence

H0: Disease and exposure (genotype)

are independent

chi-square tests: contingency tables

2×3 genotype table (2 df)

2×2 grouped genotype table (1 df)

  • Dominant or recessive

2×3 ‘dose-dependent’ table

  • Armitage’s trend test (1 df)

2×2 allele table (1 df)

Armitage Trend Test (2×3 with 1 df)

Assess departures from a fitted trend

[Table: controls and cases by genotype with scores CC (x1=0), CT (x2=1), TT (x3=2); row total R, column totals n1, n2, n3, grand total N]
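One common way to write the trend statistic is sketched below. It assumes that R denotes the total number of cases, n1..n3 the genotype column totals and N the grand total; that reading of the slide's notation, the function name and the example counts are all assumptions.

```python
import math

def armitage_trend(cases, controls, scores=(0, 1, 2)):
    """Cochran-Armitage trend chi-square (1 df) for a 2 x k table with ordered scores."""
    n = [r + s for r, s in zip(cases, controls)]  # column totals n_i
    R = sum(cases)                                # total number of cases (assumed meaning of R)
    N = sum(n)                                    # grand total
    sum_xr = sum(x * r for x, r in zip(scores, cases))
    sum_xn = sum(x * k for x, k in zip(scores, n))
    sum_x2n = sum(x * x * k for x, k in zip(scores, n))
    numerator = N * (N * sum_xr - R * sum_xn) ** 2
    denominator = R * (N - R) * (N * sum_x2n - sum_xn ** 2)
    stat = numerator / denominator
    p_value = math.erfc(math.sqrt(stat / 2))      # chi-square upper tail for 1 df
    return stat, p_value

# Hypothetical counts in CC, CT, TT order.
stat, p_value = armitage_trend(cases=(120, 60, 40), controls=(120, 40, 20))
print(f"trend chi-square = {stat:.2f}, p = {p_value:.4f}")
```

Because the scores 0, 1, 2 count copies of the T allele, this is the 'dose-dependent' 1 df test rather than the 2 df genotype test.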

Test for Non-independence

H0: Disease and exposure (genotype)

are independent

chi-square tests: contingency tables

2×3 genotype table (2 df)

2×2 grouped genotype table (1 df)

  • Dominant or recessive

2×3 ‘dose-dependent’ table

  • Armitage’s trend test (1 df)

2×2 allelic table (1 df)

Allelic Test
  • Exposure = Allele (T vs. C)
  • 2 × 2 table (1 df) for a single SNP
  • Count every allele (2 per person)
    • Doubles the sample size

$$\mathrm{OR}_{allele} = \frac{(2a+b) \times (2f+e)}{(2c+b) \times (2d+e)}$$

Allele         C        T
Controls     2a+b     2c+b
Cases        2d+e     2f+e

OR = 1.633 T vs. C allele
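The allele-counting step can be sketched as follows. The genotype counts are assumed values (the deck does not print the individual cells), chosen so that the result matches the allelic OR quoted on the slide.

```python
def allele_counts(cc, ct, tt):
    """Return (C alleles, T alleles); every person contributes two alleles."""
    return 2 * cc + ct, 2 * tt + ct

# Assumed genotype counts in CC, CT, TT order.
control_c, control_t = allele_counts(120, 40, 20)
case_c, case_t = allele_counts(120, 60, 40)

# Odds of carrying T (vs. C) among case alleles relative to control alleles.
or_allele = (case_t * control_c) / (case_c * control_t)
print(f"allelic OR (T vs. C) = {or_allele:.3f}")
```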

Example – allelic association

[Table by genotype (11, 12, 22); Xue et al. Arch Oral Bio 2009]

More flexible techniques
  • If other factors may have an effect on disease status (affected/unaffected, case/control)
    • We want to account for these as covariates
    • We want to adjust for matching variables (age, sex, etc.)
  • Logistic regression
    • Logistic transformation (logit)
    • $\ln(p/(1-p)) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots$
    • Coefficients $\alpha$ and $\beta$’s are estimated using maximum likelihood estimation (MLE)
    • Test H0: $\beta = 0$ against H1: $\beta = \hat{\beta}$ using a likelihood ratio test (LRT)
  • Must decide on how to model the genetic exposure
    • genotype categories (i.e. CC, CT,TT), dominant, recessive, additive (allele dose)..
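The choice of genetic model determines how a genotype is turned into the covariate(s) entered in the regression. A minimal sketch of the four codings listed above, assuming T is the risk allele and genotypes are two-letter strings; the function name is hypothetical.

```python
def encode_genotype(genotype, model, risk_allele="T"):
    """Code a genotype such as 'CT' as a regression covariate under a genetic model."""
    dose = genotype.count(risk_allele)  # 0, 1, or 2 copies of the risk allele
    if model == "additive":
        return dose                     # allele dose
    if model == "dominant":
        return int(dose >= 1)           # carriers vs. non-carriers
    if model == "recessive":
        return int(dose == 2)           # homozygous variant vs. the rest
    if model == "genotype":
        # Two indicator variables: (heterozygote, homozygous variant).
        return (int(dose == 1), int(dose == 2))
    raise ValueError(f"unknown model: {model}")

for g in ("CC", "CT", "TT"):
    codes = {m: encode_genotype(g, m) for m in ("additive", "dominant", "recessive", "genotype")}
    print(g, codes)
```

The genotype-category coding spends 2 degrees of freedom on the SNP, while the other three spend 1.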


Example of logistic regression model with genetic exposure and covariates

Slattery et al. IJC 2010

Assumptions for Validity
  • Independence of all individuals
    • Independent and identically distributed (iid)
  • Reasonable sample sizes
    • Contingency tables
      • Expected values all > 1 and 80% > 5
    • Logistic regression
      • Minimum of 15-20 individuals per group
  • If violated
    • Simulate the null distribution for testing
      • Permutation test
        • e.g. Fisher’s exact test is an exhaustive permutation test
      • Monte Carlo simulation
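A Monte Carlo permutation test can be sketched as below: case/control labels are shuffled many times and the statistic is recomputed to build an empirical null. The data, the function names, the number of permutations and the seed are all assumptions made for illustration.

```python
import random

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for a 2x2 table (0.0 if any margin is empty)."""
    n = a + b + c + d
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / margins if margins else 0.0

def statistic(exposed, is_case):
    a = sum(e and y for e, y in zip(exposed, is_case))        # exposed cases
    b = sum((not e) and y for e, y in zip(exposed, is_case))  # unexposed cases
    c = sum(e and (not y) for e, y in zip(exposed, is_case))  # exposed controls
    d = len(exposed) - a - b - c                              # unexposed controls
    return chi_square_2x2(a, b, c, d)

def permutation_p_value(exposed, is_case, n_perm=2000, seed=1):
    rng = random.Random(seed)
    observed = statistic(exposed, is_case)
    labels = list(is_case)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # break any exposure-disease link
        if statistic(exposed, labels) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction avoids p = 0

# Hypothetical small sample: 12 cases (8 exposed) and 12 controls (3 exposed).
exposed = [True] * 8 + [False] * 4 + [True] * 3 + [False] * 9
is_case = [True] * 12 + [False] * 12
p_value = permutation_p_value(exposed, is_case)
print(f"empirical p = {p_value:.3f}")
```

Enumerating every relabeling instead of sampling them gives the exhaustive version of the same idea.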

Presentation outline

  • Background (genetics concepts)
  • Basic case-control association
  • Complex case-control association
  • Genome-wide association
Performing haplotype analyses
  • Single locus
    • We observe genotypes, so testing is straightforward counting into a contingency table

CC CT TT

Controls

Cases

Performing haplotype analyses
  • Multi-locus
    • Haplotypes are not directly observed
    • But can be estimated (EM/Bayesian…)
    • For some individuals, their haplotype pair can be inferred unambiguously
    • For many individuals it cannot
      • “Phase uncertainty”
    • All analyses of haplotypes must take into account the phase uncertainty in the data
      • Otherwise, increase in type 1 errors
Haplotypes / Genotypes

…AGCTAAACTGGATT…

…AGCCAAACTGGATT…

Two-locus haplotypes: the haplotype pair must be C-G and C-G (UNAMBIGUOUS)

Estimating haplotypes

Genotypes
Locus 1   Locus 2   Haplotypes
CC        GG        C-G & C-G
CC        GA        C-G & C-A
CC        AA        C-A & C-A
CT        GG        C-G & T-G
CT        GA        ? (C-G & T-A) or (C-A & T-G) ?
CT        AA        C-A & T-A
TT        GG        T-G & T-G
TT        GA        T-G & T-A
TT        AA        T-A & T-A
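The table can be reproduced by enumerating, for each pair of unphased genotypes, every haplotype pair that is consistent with it. This is a small sketch with a hypothetical function name; only the double heterozygote turns out to be ambiguous.

```python
from itertools import product

def haplotype_pairs(genotype1, genotype2):
    """All unordered haplotype pairs consistent with two unphased genotypes."""
    pairs = set()
    # Try both ways of placing each locus's alleles on the two chromosomes.
    for (a1, a2), (b1, b2) in product(
        [genotype1, genotype1[::-1]], [genotype2, genotype2[::-1]]
    ):
        pairs.add(tuple(sorted((f"{a1}-{b1}", f"{a2}-{b2}"))))
    return sorted(pairs)

for g1, g2 in product(["CC", "CT", "TT"], ["GG", "GA", "AA"]):
    options = haplotype_pairs(g1, g2)
    status = "unambiguous" if len(options) == 1 else "ambiguous"
    print(g1, g2, options, status)
```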

Estimating haplotypes
  • Expectation-maximization (EM) algorithm
    • SNPHAP (Johnson et al 2001)
    • GCHap (Thomas 2003)
  • Bayesian MCMC approach
    • PHASE (Stephens et al 2001)
  • Both approaches assume independent individuals
  • Use to estimate
    • Population haplotype frequencies estimated from a set of individuals
    • Most likely haplotype pair for each individual
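To make the EM idea concrete, here is a minimal two-locus sketch. It is not the algorithm as implemented in SNPHAP, GCHap or PHASE; it assumes independent individuals and Hardy-Weinberg proportions, and the function names and example genotypes are hypothetical.

```python
from collections import Counter
from itertools import product

def haplotype_pairs(g1, g2):
    """Unordered haplotype pairs consistent with two unphased genotypes."""
    pairs = set()
    for (a1, a2), (b1, b2) in product([g1, g1[::-1]], [g2, g2[::-1]]):
        pairs.add(tuple(sorted((f"{a1}-{b1}", f"{a2}-{b2}"))))
    return sorted(pairs)

def em_haplotype_frequencies(genotypes, n_iter=50):
    """Estimate haplotype frequencies from (locus1, locus2) genotype strings."""
    options = [haplotype_pairs(g1, g2) for g1, g2 in genotypes]
    haplotypes = sorted({h for opts in options for pair in opts for h in pair})
    freq = {h: 1 / len(haplotypes) for h in haplotypes}  # uniform starting values
    for _ in range(n_iter):
        expected = Counter()
        for opts in options:
            # E-step: weight each compatible pair by the product of current frequencies.
            weights = [freq[h1] * freq[h2] for h1, h2 in opts]
            total = sum(weights)
            for (h1, h2), w in zip(opts, weights):
                expected[h1] += w / total
                expected[h2] += w / total
        # M-step: expected haplotype count divided by the number of chromosomes.
        freq = {h: expected[h] / (2 * len(genotypes)) for h in haplotypes}
    return freq

# Hypothetical sample: only the two CT/GA individuals are phase-ambiguous.
genotypes = [("CC", "GG")] * 4 + [("TT", "AA")] * 3 + [("CT", "GA")] * 2 + [("CT", "GG")]
freq = em_haplotype_frequencies(genotypes)
print({h: round(f, 3) for h, f in freq.items()})
```

The per-individual weights computed in the E-step are the same quantities that identify each person's most likely haplotype pair.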
Traditional methods for phase uncertainty
  • Likelihood based approach
    • Each individual can have multiple different haplotype pairs that are consistent with the genotype data
      • Some pairs of haplotypes are more or less likely than others
      • Each pair is given a weight
      • All possible haplotype pairs are considered in the case-control analysis
        • weighted by their probabilities
Simulation methods for phase uncertainty
  • Sample over the observed data
  • Instead of weighting all the possible haplotype pairs for every individual and incorporating all at once into the analysis
    • Sample one pair for each individual
      • Randomly and in proportion to the weights, select a haplotype pair for each individual
      • Perform the analysis as if those were observed
      • Repeat 1,000 times…
      • Average
  • SIMHAP (McCaskie et al.)
Simulation methods for phase uncertainty
  • Monte Carlo testing
    • Simulate the null – matched to the real data
  • Instead of weighting all the possible haplotype pairs for every individual and incorporating all at once into the analysis
    • Assign each individual their most likely haplotype pair
      • Cases and controls separately
    • Simulate null haplotype data
      • Null: Convert haplotypes to genotypes
      • Null: Estimate haplotypes
      • Null: Assign each individual their most likely haplotype pair
    • Real and null are matched
    • Test real data (with most likely haplotype pairs assigned) against the simulated null
  • hapMC (Thomas et al.)
Exponential explosion… high dimensional data
  • 1 SNP
    • 2 alleles → 1 test
    • 3 genotypes → 1+ tests
  • 2 SNP loci
    • 4 haplotypes
  • 3 SNP loci
    • 8 haplotypes
  • 10 SNP loci
    • 1024 haplotypes → many tests
Multi-locus… but how many, and which loci to test?
  • For example…20 tSNPs
    • Only perform single SNP analyses?
    • Perform tests on all 20-locus haplotypes?
      • Group all ‘rare’ haplotypes together
      • Cluster to reduce dimension
    • Multi-locus tests with subsets of 20 SNPs?
      • Subsets of which SNPs?
Data mining approach to haplotype construction – hapConstructor (Abo et al.)
  • Automatically builds haplotypes (or composite genotypes)
    • Non-contiguous SNPs
    • In a case-control framework
    • All SNP haplotypes are phased during 1st stage and used in all subset analyses
    • Starts with each single SNP locus
      • Forward-backward process driven by significance thresholds
  • Significance and false discovery rates (p-values and q-values) reported for the building process
  • Computationally challenging, potentially time intensive
Multilocus model building example using hapConstructor

16 SNPs

Curtin et al. BMC Med Genet 2010

Meta-association in case-control studies
  • Association: two characteristics occur more often together than expected by chance
    • Disease
    • Genetic variants
  • Meta-Association: study of association across case-control data collected by multiple study sites (collaborative effort)
    • NARAC: North American Rheumatoid Arthritis Consortium
    • BCAC: Breast Cancer Association Consortium

VS. “Meta-analysis of individual level data from participants in a systematically ascertained group of studies” (Petitti definition)

Meta-analysis of multi-study case-control data: general concepts
  • simple pooling – combine individual level data from multiple studies and compute association statistics
  • fixed effects models – inference is conditional on the studies actually done
    • in genetic association, assumes same genetic effect size across studies
  • random effects models – inference is based on assuming studies in the analysis are a ‘random sample’ of a hypothetical population of studies
Fixed effects models
  • Methods and effect measures
    • Mantel-Haenszel: Odds ratio; also rate, risk ratio
      • well-known method for calculating summary estimate of effect across strata (i.e. multiple studies)
    • Peto: Ratio (can approximate odds ratio)
      • modification of M-H method
    • General variance-based: Ratio (all types) and rate differences
Mantel-Haenszel method (fixed effects)

where i is the ith stratum (study)

Mantel-Haenszel method (fixed effects) summary odds ratio

weight_i = 1/variance_i

where variance_i is the variance component of the effect size within studies only
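For reference, the usual closed form of the Mantel-Haenszel summary odds ratio is the sum of a_i·d_i/n_i over studies divided by the sum of b_i·c_i/n_i, which is equivalent to weighting each study's OR by b_i·c_i/n_i. The sketch below uses hypothetical study tables and a hypothetical function name.

```python
def mantel_haenszel_or(strata):
    """Mantel-Haenszel summary odds ratio across 2x2 tables.

    Each stratum is (a, b, c, d): exposed cases, unexposed cases,
    exposed controls, unexposed controls for one study.
    """
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return numerator / denominator

# Three hypothetical studies.
studies = [(40, 60, 30, 70), (25, 75, 20, 80), (90, 110, 70, 130)]
or_mh = mantel_haenszel_or(studies)
print(f"M-H summary OR = {or_mh:.3f}")
```

With a single stratum the expression reduces to the ordinary a×d / (b×c).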

Mantel-Haenszel method (fixed effects) summary odds ratio
  • Strengths
    • Optimal statistical properties (uniformly most powerful test)
    • When the M-H estimate OR = 1, the M-H chi-square = 0 (mathematical connection of effect with summary statistic)
    • Widely available in statistical software
  • Limitations
    • Requires data to complete 2x2 table for all studies (potential exclusion bias)
    • Ignores confounding not taken into account by study design (e.g. age- and sex-matched controls)
      • Could use a logistic regression estimate of OR to simultaneously model confounding variables and adjust for study site
CMH chi-square general association test of independence (fixed-effect method)
  • Extension of Cochran-Mantel-Haenszel (CMH) test to sets of (X by Y) contingency tables (i.e. studies)
  • Formulas for the CMH statistics are more easily defined in terms of matrices (Landis and Koch 1978)
  • Assumes study strata are independent, and that the marginal totals of each stratum are fixed
    • H0 : there is no association between X (disease status) and Y (genotype) in any of the strata
    • corresponding model is the multiple hypergeometric
Heterogeneity
  • If H0: homogeneity is rejected, studies are not measuring an effect of the same size
  • Tests of Heterogeneity
    • Q test ~Chisq. with d.f.= #studies – 1
      • Mantel-Haenszel method:
    • Logistic regression: add a term for interaction between study and genotypes in model (test using Wald or Likelihood Ratio)
  • When heterogeneity is not extreme, fixed- and random- effects models yield similar results
Random effects models
  • Methods and effect measures
    • DerSimonian-Laird (1986): Ratio (all types) and difference
    • Bagos and Nikolopoulos (2007): Odds ratio
      • study-specific coefficient in logistic regression model representing deviation of study i’s true genotypic effect to overall mean effect
  • incorporates between-study component of variance, CI’s at least as wide (wider) than fixed effects
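A sketch of how the between-study variance enters, using inverse-variance weights on ln(OR), Cochran's Q, and the DerSimonian-Laird moment estimate of the between-study variance (tau²). The study tables and names are hypothetical, and the per-study variance uses the 1/a + 1/b + 1/c + 1/d approximation.

```python
import math

def log_or_and_variance(a, b, c, d):
    """ln(OR) and its approximate variance for one 2x2 table."""
    return math.log(a * d / (b * c)), 1 / a + 1 / b + 1 / c + 1 / d

def pooled_estimates(strata):
    effects, variances = zip(*(log_or_and_variance(*s) for s in strata))
    w = [1 / v for v in variances]                       # fixed-effects weights
    fixed = sum(wi * y for wi, y in zip(w, effects)) / sum(w)
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, effects))  # heterogeneity Q
    k = len(strata)
    tau2 = max(0.0, (q - (k - 1)) / (sum(w) - sum(wi ** 2 for wi in w) / sum(w)))
    w_star = [1 / (v + tau2) for v in variances]         # random-effects weights
    pooled_random = sum(wi * y for wi, y in zip(w_star, effects)) / sum(w_star)
    return {
        "q": q,
        "tau2": tau2,
        "or_fixed": math.exp(fixed),
        "se_fixed": math.sqrt(1 / sum(w)),
        "or_random": math.exp(pooled_random),
        "se_random": math.sqrt(1 / sum(w_star)),
    }

# Three hypothetical, deliberately heterogeneous studies.
result = pooled_estimates([(40, 60, 30, 70), (25, 75, 40, 60), (90, 110, 50, 150)])
print({key: round(value, 3) for key, value in result.items()})
```

Because tau² is added to every study's variance, the random-effects standard error can never be smaller than the fixed-effects one, which is why its confidence interval is at least as wide.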
Fixed- vs. Random- Assumptions
  • analysis under fixed model addresses the question:

Was there a genotype-phenotype association in the consortium of case-control studies used in the meta analysis?

  • under the random model, question:

Will there be a genotype-phenotype association “on average?”

Independent individuals
  • If study cases and controls are independent (unrelated) individuals, meta-association is straightforward...

Straightforward...
  • Adjust for ‘study site’ in a logistic regression
  • Use Cochran Mantel Haenszel (CMH) techniques, controlling for study
    • CMH test of association
    • CMH test of trend
    • meta odds ratio estimate
Cox et al, Nature Genetics (2007)
  • Test of H0: no association included terms for genotype and BCAC study
  • Trend test included 1 parameter for allele dose and a term for BCAC study
  • Genotype-specific risks estimated as ORs using logistic regression with BCAC study as a covariate (fixed-effects)
  • Tested heterogeneity between studies by comparing logistic regression models with and without a genotype x study interaction term
  • Data also analyzed using a random-effects model, test for heterogeneity
Meta Association – Related individuals
  • But what if some study individuals (cases or controls) are related in multi-study collaborations? ..sibships, trios, pedigrees-or mixed, in families

meta analysis of data from multiple sites is more difficult..

Genie to the rescue..

Genie overview
  • Allen-Brady et al. (2006), Curtin et al. (2007)
  • Simulation-based technique
    • Monte Carlo approach
    • Null distribution is simulated for the statistic of interest matching the pedigree structure
  • Equivalent to an empirical version of the variance correction method with prior probabilities
  • Flexible in type of statistic that can be analyzed
    • Classical association statistics and effect measure (OR)
    • Meta association statistics (fixed-effects approach)
  • Dichotomous and quantitative traits

http://www-genepi.med.utah.edu/Genie/index.html

Genie: Empirical null
  • Generate the empirical null
  • Using appropriate allele frequencies perform a gene-drop through the pedigree
    • Null genotypic configuration
  • Calculate the statistic of interest using the null data ignoring relatedness
    • Null statistic
  • Repeat thousands of times
    • Empirical estimate of the null distribution
  • Assess the significance of the observed statistic by assessing where it lies in the null distribution
Creating the Simulated Null Distribution

  • Population allele frequencies
  • Assign alleles randomly to pedigree founders
  • Gene drop: simulated Mendelian inheritance
  • Null Genotype Configuration
  • Calculate NULL statistic
  • Repeat
  • Empirical Null Distribution
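The gene-drop loop can be sketched as follows. This is not Genie's code: the pedigree layout, the allele frequency, the toy statistic and all function names are assumptions chosen only to show the steps in order.

```python
import random

def gene_drop(pedigree, allele_freq, rng):
    """Simulate one null genotype configuration.

    pedigree: list of (person, father, mother) with parents listed before
    their children; founders have father = mother = None.
    """
    genotypes = {}
    for person, father, mother in pedigree:
        if father is None and mother is None:
            # Founder: draw two alleles from the population frequency of T.
            genotypes[person] = tuple(
                "T" if rng.random() < allele_freq else "C" for _ in range(2)
            )
        else:
            # Non-founder: one randomly chosen allele from each parent.
            genotypes[person] = (rng.choice(genotypes[father]), rng.choice(genotypes[mother]))
    return genotypes

def t_allele_difference(genotypes, affected):
    """Toy statistic: T-allele frequency in affected minus unaffected, ignoring relatedness."""
    def t_freq(people):
        alleles = [a for p in people for a in genotypes[p]]
        return alleles.count("T") / len(alleles)
    unaffected = [p for p in genotypes if p not in affected]
    return t_freq(affected) - t_freq(unaffected)

def empirical_p_value(pedigree, affected, observed, allele_freq, n_sim=1000, seed=0):
    rng = random.Random(seed)
    null_stats = [
        t_allele_difference(gene_drop(pedigree, allele_freq, rng), affected)
        for _ in range(n_sim)
    ]
    return sum(s >= observed for s in null_stats) / n_sim

# Hypothetical three-generation pedigree and observed statistic.
pedigree = [
    ("grandfather", None, None), ("grandmother", None, None),
    ("father", "grandfather", "grandmother"), ("mother", None, None),
    ("child1", "father", "mother"), ("child2", "father", "mother"),
]
affected = {"father", "child1"}
p_value = empirical_p_value(pedigree, affected, observed=0.5, allele_freq=0.3)
print(f"empirical p = {p_value:.3f}")
```

Because each simulated configuration keeps the real pedigree structure, the null distribution reflects the relatedness that the statistic itself ignores.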

Genie Meta-association
  • Fixed effects approach – assumes same genetic effect size across studies
  • Generalized CMH approach – chi-square general association test of independence
    • extension to >2x2 tables across multiple studies

  • CMH chi-square test of trend – mean score statistic where ordered genotypes (i.e. genotypes aa, aA, and AA) lie on an ordinal scale
  • Meta ORs – M-H common odds ratio estimate for 2x2 tables (CT vs CC, TT vs CC)
    • 95% CI estimated empirically
Empirical 95% Confidence Interval

Distribution of OR estimates from 1,000 configurations in PedGenie null

Why Genie Meta-association?
  • Ability to combine family-based and independent case-control resources and use all available data
    • Genie software corrects for relationships in family-based resources; all family members with phenotype and genotype data can be included
    • increases the utility of pedigrees previously ascertained for linkage and can provide increased power to detect associations, particularly in stratified and subset analyses that may lead to small sample sizes in individual studies

    • needs a logistic regression framework (underway)
Association of XRCC2 tag-SNPs with CRC in 4-study meta analysis

(Curtin et al. CEBP 2009)

*Empirical Cochran-Mantel-Haenszel χ2 test for trend or recessive model based on 10,000 simulations.

Association of XRCC2 rs3218499 G>C with CRC in 4-study meta analysis

*Empirical Cochran-Mantel-Haenszel χ2 test for recessive model based on 10,000 simulations.

Genomewide (case-control) Association GWA: an approach to the study of common diseases
  • Complex architecture
    • Multiple genes likely involved
    • Multiple environmental factors
    • Individually low risks
  • Argument that the underlying variants may be common and of modest effect..
    • Common variants (>0.05, >0.01)
    • Not under intense negative selection
  • Agnostic.. no hypothesis
    • Hypothesis generating vs. hypothesis driven (candidate gene or pathway)
GWA: What is required?
  • Large set of SNPs
  • Stringent significance thresholds
    • ~5 × 10^-8
  • Large case-control sample size
    • Example
      • Allele frequency 0.15
      • OR=1.25
      • 80% power
      • 6,000 cases and 6,000 controls
Large set of SNPs
  • Linkage-disequilibrium (LD)-based
    • Genomewide tag-SNP set
    • Made possible by HAPMAP
    • 500,000-1,000,000 SNPs
    • High-density arrays with 2 million SNPs
      • Not optimal for rare variants…
        • tag-SNP methods ignore them
Stringent significance thresholds
  • Very few ‘hits’ per study
    • 1, 3, 4, 5 significant hits per genome using GWA
    • If we don’t correct and use a nominal 0.05
      • In 500,000 markers
      • Can expect 25,000 false positives
    • Need to use a correction for multiple testing
      • significance threshold of ~5 × 10^-8 (Dudbridge & Koeleman ASHG 2004)
  • Good… but not great
    • But we’re expecting many more genes to be found… right?
  • Less stringency and instead use replication?
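The arithmetic behind these numbers is short enough to check directly. The Bonferroni line is an assumption about where a threshold of this size can come from (0.05 spread over about one million independent tests); it is a back-of-envelope reading, not necessarily the derivation used in the cited work.

```python
n_markers = 500_000
alpha = 0.05

# With no correction, every null marker has a 5% chance of a false positive.
expected_false_positives = n_markers * alpha

# Bonferroni correction: divide the nominal level by the number of tests.
bonferroni_500k = alpha / n_markers
bonferroni_1m = alpha / 1_000_000

print(f"expected false positives: {expected_false_positives:,.0f}")
print(f"Bonferroni threshold, 500,000 tests: {bonferroni_500k:.1e}")
print(f"Bonferroni threshold, 1,000,000 tests: {bonferroni_1m:.1e}")
```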
Multistage strategies in GWA

Hirschhorn & Daly Nature Reviews 2005

Interactions
  • An increased (or decreased) effect of one exposure given another.
  • Gene-environment interaction
    • Risk (genotype AA / no smoke) = 4
    • Risk (genotype AA / smoke)= 6
  • Gene-gene interaction
    • Epistasis
    • Risk (genotype AA / genotype bb) = 4
    • Risk (genotype AA / genotype Bb,BB) = 6
Statistical Interactions
  • Multiplicative model
    • Most commonly used
    • Natural to a risk framework
      • Logistic regression
    • Independent loci
      • multiply risk OR11=OR10×OR01
    • Interaction
      • OR11≠OR10×OR01
Multiplicative model
  • Multiplicative risk for alleles at each locus
  • First locus
    • aa 1.00
    • aA 2.20
    • AA 4.84 (= 2.20²)
  • Second locus
    • bb 1.00
    • bB 1.50
    • BB 2.25 (= 1.50²)
Statistical Interactions
  • Additive model
    • Less popular
    • Independent loci
      • Add risks
      • OR11= 1 + (OR10-1) + (OR01-1)
    • Interaction
      • OR11≠ 1+ (OR10-1) + (OR01-1)
Additive model
  • Additive risk for alleles for each single locus
  • First locus
    • aa 1.00
    • aA 2.20
    • AA 3.40 (= 2 × 2.20 − 1)
  • Second locus
    • bb 1.00
    • bB 1.50
    • BB 2.00 (= 2 × 1.50 − 1)
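The two no-interaction definitions give different expected joint risks for the same single-locus effects. A small sketch using the heterozygote risks from these slides (2.20 and 1.50); the function names are hypothetical.

```python
def joint_or_multiplicative(or10, or01):
    """Expected joint OR for independent loci on the multiplicative scale."""
    return or10 * or01

def joint_or_additive(or10, or01):
    """Expected joint OR for independent loci on the additive scale."""
    return 1 + (or10 - 1) + (or01 - 1)

or_aA, or_bB = 2.20, 1.50  # single-locus risks for aA and bB
expected_mult = joint_or_multiplicative(or_aA, or_bB)
expected_add = joint_or_additive(or_aA, or_bB)
print(f"aA/bB joint risk: multiplicative {expected_mult:.2f}, additive {expected_add:.2f}")
```

An observed joint OR between the two values would count as a negative interaction on the multiplicative scale but a positive one on the additive scale, so the scale has to be stated whenever an interaction is reported.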
No main effects???
  • No main effects
  • Only interaction effects
  • Problem:
    • In a stepwise procedure, if you aren’t able to identify the main effects, then how do you know to test the interaction?
  • HOWEVER… Thus far, no biological model has been put forth that supports the lack of main effects
Case-control design: ORs
  • Testing in the Odds Ratio framework
  • H0: OR11=OR10×OR01
  • H0: IOR11=1.0
Case-control design: ORs
  • $\mathrm{IOR}_{11} = \dfrac{\mathrm{OR}_{11}}{\mathrm{OR}_{10} \times \mathrm{OR}_{01}}$

  • Under the null, IOR11 = 1
  • Can do several IORs
    • 11, 12, 21 and 22
  • Can construct confidence intervals to test for a significant interaction
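A sketch of the interaction odds ratio with a large-sample confidence interval. It assumes two binary exposures, so each group has four cells keyed by (G1, G2), and it uses a Woolf-type variance: each of the eight cells enters ln(IOR) exactly once, so the variance is approximated by the sum of the eight reciprocal counts. The counts and names are hypothetical.

```python
import math

def interaction_or(cases, controls, z=1.96):
    """IOR = OR11 / (OR10 x OR01) with an approximate confidence interval.

    cases, controls: dicts with exactly the keys (0, 0), (1, 0), (0, 1), (1, 1).
    """
    def or_vs_reference(key):
        return (cases[key] * controls[(0, 0)]) / (controls[key] * cases[(0, 0)])

    ior = or_vs_reference((1, 1)) / (or_vs_reference((1, 0)) * or_vs_reference((0, 1)))
    se = math.sqrt(sum(1 / n for n in list(cases.values()) + list(controls.values())))
    lower = math.exp(math.log(ior) - z * se)
    upper = math.exp(math.log(ior) + z * se)
    return ior, lower, upper

# Hypothetical counts by exposure combination.
cases = {(0, 0): 100, (1, 0): 60, (0, 1): 50, (1, 1): 90}
controls = {(0, 0): 200, (1, 0): 80, (0, 1): 70, (1, 1): 40}
ior, lower, upper = interaction_or(cases, controls)
print(f"IOR = {ior:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

An interval that excludes 1 suggests departure from the multiplicative null.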
Case-control design: logistic regression

$$\operatorname{logit} P(Y=1 \mid G_1, G_2) = \alpha + \beta_1 G_1 + \beta_2 G_2 + \beta_3 G_1 \times G_2$$

  • Parameter $\beta_3$ is an estimator for ln(IOR) under a multiplicative model
  • G1 and G2 can be modeled several ways
    • Dominant
    • Recessive
    • Additive
    • 3 levels
Methods: MDR
  • Multifactor-Dimensionality Reduction (MDR)
    • Ritchie et al (2001) Am J Hum Genet
    • Combinatorial partitioning
    • Data mining
    • http://www.epistasis.org/software.html
MDR
  • Divide sample into 10 equal partitions
    • Model on 9/10 (1…9)
    • Test on 1/10 of data (10)
    • Repeat 10 times and average the misclassification
  • Pick n loci from the total N SNPs
    • Exhaustively assess all combinations
    • Cells with cases > controls are labeled high-risk
    • Cells with cases < controls are labeled low-risk
    • Group the high-risk cells and the low-risk cells
  • Repeat for all possible n of N
  • May be too many… doesn’t scale well
Machine learning
  • Machine learning
    • Classification trees (e.g. CART)
      • Greedy algorithms
      • Not optimal
      • Cook et al (2004) Stat Med
    • Artificial Neural Networks (ANNs)
      • GPNN software
      • Motsinger et al (2006) BMC Bioinformatics
    • Support Vector Machine Approach
      • Combinatorial optimization techniques
        • Local search
        • Genetic algorithms
      • Weng et al (2007) Genet Epidemiol
Other approaches
  • Logistic regression framework
    • tagSNPs and powerful models for epistasis
    • Chapman and Clayton (2007) Genet Epidemiol
  • Case-control
    • Haplotype interactions
    • FAMHAP
    • Becker et al (2005) Genet Epidemiol
Quantitative traits
  • Simple comparisons
    • 2 groups (e.g. alleles, dominant)
      • Normal test: large sample sizes
      • T-test: small sample sizes
      • Mann-Whitney: non-parametric
    • >2 groups (e.g. genotypes)
      • ANOVA (F-test)
      • Kruskal-Wallis: non-parametric
  • Including Covariates
    • Linear regression $y = \alpha + \beta x$
    • Again, need to model genetic exposure

Family-Based Methods
  • Parent-Offspring Trios
    • Haplotype Relative Risk (HRR)
    • Transmission/Disequilibrium Test (TDT)
    • Quantitative TDT (QTDT)
    • Generalized Estimating Equations (GEE)
  • Nuclear Families
    • Sibling TDT (STDT)
    • FBAT
    • QTDT
    • GEEs
Family-Based Methods
  • General Pedigrees (small to moderate size)
    • PDT
    • FBAT
    • QTDT
    • Variance correction (posterior probability)
    • CCREL
  • Extended Pedigrees
    • Variance correction (prior probability)
    • Quasi-Likelihood Score (QLS)
    • PedGenie
Transmission/Disequilibrium Test (TDT)
  • Transmission method
  • Spielman et al (1993)
  • Trio method
    • Requires genotype data on all three individuals
  • The statistic considers only {parent, affected-offspring} pairs from the trio for which the parent is heterozygous
    • Compare the number of times each of the different alleles is transmitted to the affected offspring
    • Is there evidence for preferential transmission of one allele over the other?
TDT: Validity
  • H0: δ(1 − 2θ) = 0, where δ is the linkage disequilibrium coefficient and θ is the recombination fraction
  • A test for both association and linkage
  • Robust to stratification
TDT
  • Parents CT and CC, offspring CT
    • One heterozygous parent
    • Transmits T to offspring
  • Parents CT and CC, offspring CC
    • One heterozygous parent
    • Transmits C to offspring
  • Parents CT and CT, offspring CT
    • Two heterozygous parents
    • One parent: transmits C to offspring
    • Other parent: transmits T to offspring
  • Parents CC and CC, offspring CT
    • No heterozygous parents
    • No data to record

TDT: Tabulation

                            Allele transmitted
                              C      T
Allele NOT transmitted   C    a      b
                         T    c      d
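From this table the TDT is usually computed as a McNemar-type statistic on the two off-diagonal cells, (b − c)² / (b + c), referred to a chi-square distribution with 1 df. The sketch below tallies transmissions from heterozygous parents in genotyped trios; the trio data and function names are hypothetical.

```python
import math

def transmissions(father, mother, child):
    """(transmitted, not transmitted) alleles for each heterozygous parent of a trio."""
    for from_father in father:
        for from_mother in mother:
            if sorted(from_father + from_mother) == sorted(child):
                result = []
                for parent, given in ((father, from_father), (mother, from_mother)):
                    if parent[0] != parent[1]:  # only heterozygous parents are informative
                        kept = parent[0] if parent[1] == given else parent[1]
                        result.append((given, kept))
                return result
    raise ValueError("trio genotypes are not Mendelian-consistent")

def tdt(trios, allele="T"):
    """Return b, c, the TDT chi-square and its 1 df p-value."""
    b = c = 0
    for father, mother, child in trios:
        for transmitted, _ in transmissions(father, mother, child):
            if transmitted == allele:
                b += 1  # T transmitted, C not transmitted
            else:
                c += 1  # C transmitted, T not transmitted
    stat = (b - c) ** 2 / (b + c)
    return b, c, stat, math.erfc(math.sqrt(stat / 2))

# Hypothetical trios: (father, mother, affected child).
trios = (
    [("CT", "CC", "CT")] * 30 + [("CT", "CC", "CC")] * 15
    + [("CT", "CT", "CT")] * 5 + [("CC", "CC", "CC")] * 10
)
b, c, stat, p_value = tdt(trios)
print(f"b = {b}, c = {c}, TDT chi-square = {stat:.2f}, p = {p_value:.3f}")
```

Trios with no heterozygous parent contribute nothing, and a trio with two heterozygous parents and a heterozygous child contributes one transmission of each allele.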