FastANOVA: an Efficient Algorithm for Genome-Wide Association Study


Xiang Zhang Fei Zou Wei Wang

University of North Carolina at Chapel Hill

Speaker: Xiang Zhang

Genotype-phenotype association study
  • Goal: finding genetic factors causing phenotypic difference

[Figure: mouse genome and phenotype variation. Image sources: http://www.bcgsc.ca, http://www.jax.org/]

Genotype-phenotype association study

  • Single Nucleotide Polymorphism (SNP)
    • Mutation of a single nucleotide (A, C, T, G)
    • The most abundant source of genotypic variation
    • Serve as genetic markers of locations in the genome
    • High-throughput genotyping -- thousands to millions of SNPs

Example sequences around two SNP loci (Chrom1 bp3,568,717 and Chrom6 bp120,323,342), one row per individual:

…… A A A C G …… A A T C C ……
…… A A A C G …… A A T C C ……
…… A A A C G …… A A T C G ……
…… A A A C G …… A A T C G ……
…… A A A C G …… A A T C G ……
…… A A A C G …… A A T C G ……
…… A A T C G …… A A T C C ……
…… A A T C G …… A A T C C ……
…… A A T C G …… A A T C G ……
…… A A T C G …… A A T C C ……
…… A A T C G …… A A T C C ……
…… A A T C G …… A A T C C ……

Genotype-phenotype association study
  • Genotype
    • SNPs can be represented as binary {0,1} (e.g. inbred mouse strains)
  • Quantitative phenotypes
    • Body weight, blood pressure, tumor size, cancer susceptibility, ……
  • Question
    • Which SNPs are the most highly associated with the phenotype?

SNPs                     Phenotype value
…… 0 0 0 1 0 1 ……        8
…… 0 0 0 0 0 0 ……        7
…… 0 1 1 0 0 1 ……        12
…… 0 1 0 0 1 0 ……        11
…… 0 1 0 1 0 1 ……        9
…… 0 1 0 0 0 0 ……        13
…… 1 0 1 1 1 1 ……        6
…… 1 0 0 0 1 0 ……        4
…… 1 1 1 1 1 1 ……        2
…… 1 0 0 1 0 0 ……        5
…… 1 0 0 1 0 1 ……        0
…… 1 0 1 1 0 0 ……        3

A simple example: single marker association study
  • Partition individuals into groups according to genotype of a SNP
  • Perform a statistical test (t-test or ANOVA)
  • Repeat for each SNP

SNPs                     Phenotype value
…… 0 0 0 1 0 1 ……        8
…… 0 0 0 0 0 0 ……        7
…… 0 1 1 0 0 1 ……        12
…… 0 1 0 0 1 0 ……        11
…… 0 1 0 1 0 1 ……        9
…… 0 1 0 0 0 0 ……        13
…… 1 0 1 1 1 1 ……        6
…… 1 0 0 0 1 0 ……        4
…… 1 1 1 1 1 1 ……        2
…… 1 0 0 1 0 0 ……        5
…… 1 0 0 1 0 1 ……        0
…… 1 0 1 1 0 0 ……        3
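
The single-marker procedure can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it uses only the six SNP columns visible in the table (the columns hidden behind "……" are not reproduced), and the function name `f_statistic` is a hypothetical helper that computes the standard one-way ANOVA F-statistic.

```python
import numpy as np

# Six visible SNP columns from the slide; elided columns are not included.
X = np.array([
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 1],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 0, 1, 0, 1],
    [0, 1, 0, 0, 0, 0],
    [1, 0, 1, 1, 1, 1],
    [1, 0, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1],
    [1, 0, 0, 1, 0, 0],
    [1, 0, 0, 1, 0, 1],
    [1, 0, 1, 1, 0, 0],
])
y = np.array([8, 7, 12, 11, 9, 13, 6, 4, 2, 5, 0, 3], dtype=float)


def f_statistic(labels, y):
    """One-way ANOVA F-statistic of y for the grouping given by labels."""
    groups = [y[labels == g] for g in np.unique(labels)]
    k, m = len(groups), len(y)
    if k < 2 or m <= k:
        return 0.0  # no between-group comparison is possible
    ssb = sum(len(g) * (g.mean() - y.mean()) ** 2 for g in groups)
    ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)
    if ssw == 0:
        return float("inf")
    return (ssb / (k - 1)) / (ssw / (m - k))


# Partition by each SNP in turn, test, and keep the strongest association.
f_values = [f_statistic(X[:, i], y) for i in range(X.shape[1])]
best = int(np.argmax(f_values))
print(f"strongest single-marker association: column {best}, F = {f_values[best]:.2f}")
```

On this toy table the first visible column separates the high and low phenotype values most cleanly, which is what the example is meant to show.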

Two-locus association mapping
  • Many phenotypes are complex traits
    • Due to the joint effect of multiple genes
    • Single marker approach may not suffice
  • Consider SNP-SNP interactions
    • Four possible genotype combinations for each SNP-pair: 00, 01, 10, 11
    • Split mice into four groups according to the genotype of each SNP-pair
    • Perform a statistical test for each SNP-pair
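
As a rough sketch of the SNP-pair test, the two binary genotypes can be combined into one label in {0, 1, 2, 3} (for 00, 01, 10, 11) and fed to a one-way ANOVA. The helper name `pair_f_statistic` is hypothetical, and the two genotype vectors are the first two visible columns of the example table shown earlier.

```python
import numpy as np


def pair_f_statistic(xi, xj, y):
    """ANOVA F-statistic for the groups induced by a SNP-pair (xi, xj)."""
    labels = 2 * xi + xj  # 00 -> 0, 01 -> 1, 10 -> 2, 11 -> 3
    groups = [y[labels == g] for g in np.unique(labels)]
    k, m = len(groups), len(y)
    if k < 2 or m <= k:
        return 0.0
    ssb = sum(len(g) * (g.mean() - y.mean()) ** 2 for g in groups)
    ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)
    if ssw == 0:
        return float("inf")
    return (ssb / (k - 1)) / (ssw / (m - k))


x1 = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
x2 = np.array([0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0])
y = np.array([8, 7, 12, 11, 9, 13, 6, 4, 2, 5, 0, 3], dtype=float)

f_pair = pair_f_statistic(x1, x2, y)
print(f"F(X1X2, Y) = {f_pair:.3f}")
```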
Statistical issue
  • Multiple test problem
    • Performing n independent tests, each with Type I error α, gives a family-wise error rate of 1 − (1 − α)^n
  • Example
    • Performing 20 tests with Type I error=0.05, family-wise error rate = 0.64
    • 64% probability to get at least one spurious result
  • Solution
    • permutation test
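
The 0.64 figure in the example can be checked directly. The sketch below assumes independent tests, which is the usual assumption behind the formula 1 − (1 − α)^n; the function name is illustrative.

```python
def family_wise_error_rate(n_tests, alpha):
    """Probability of at least one false positive among n independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


fwer = family_wise_error_rate(20, 0.05)
print(round(fwer, 2))
```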
Permutation test
  • K permutations of phenotype values
  • For each permutation, find the maximum test value
  • Given Type I error α, the critical value Fα is the αK-th largest value among the K maximum values
  • SNP-pairs whose test values are greater than Fα are significant
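
The steps above can be sketched as a straightforward (unoptimized) permutation test. Everything here is an illustrative assumption: the data are random, the helper names are hypothetical, and the rank αK is rounded to the nearest integer with a minimum of 1.

```python
import numpy as np
from itertools import combinations


def f_statistic(labels, y):
    """One-way ANOVA F-statistic of y for the grouping given by labels."""
    groups = [y[labels == g] for g in np.unique(labels)]
    k, m = len(groups), len(y)
    if k < 2 or m <= k:
        return 0.0
    ssb = sum(len(g) * (g.mean() - y.mean()) ** 2 for g in groups)
    ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)
    return float("inf") if ssw == 0 else (ssb / (k - 1)) / (ssw / (m - k))


def critical_value(X, y, n_perm, alpha, seed=0):
    """alpha*K-th largest of the K per-permutation maximum SNP-pair F values."""
    rng = np.random.default_rng(seed)
    pairs = list(combinations(range(X.shape[1]), 2))
    maxima = []
    for _ in range(n_perm):
        y_perm = rng.permutation(y)  # shuffle phenotype values across individuals
        maxima.append(max(f_statistic(2 * X[:, i] + X[:, j], y_perm)
                          for i, j in pairs))
    maxima.sort(reverse=True)
    rank = max(1, round(alpha * n_perm))
    return maxima[rank - 1]


rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(20, 8))  # 20 individuals, 8 binary SNPs (toy data)
y = rng.normal(size=20)
f_alpha = critical_value(X, y, n_perm=100, alpha=0.05)
print(f"critical value F_alpha = {f_alpha:.3f}")
```

SNP-pairs whose F value on the unpermuted phenotype exceeds `f_alpha` would then be reported as significant.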
Genome-wide association study
  • What’s GWA?
    • Simple Idea: search for the associations in the whole genome
  • Hard to implement
    • Enormous search space: with 10,000 SNPs and 1,000 permutations, the number of SNP-pair tests is about 5 × 10¹⁰
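
The size of the search space follows from counting unordered SNP-pairs and multiplying by the number of permutations, as this short check shows (variable names are illustrative):

```python
from math import comb

n_snps, n_perm = 10_000, 1_000
n_pairs = comb(n_snps, 2)    # unordered SNP-pairs: N(N-1)/2
n_tests = n_pairs * n_perm   # one test per pair per permutation
print(f"{n_pairs:,} pairs, {n_tests:.1e} tests")
```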
Preliminary: ANOVA test and F-statistic
  • ANOVA test
    • To determine whether the group means are significantly different
    • Partition Total sum of squares into Between-group sum of squares and Within-group sum of squares
  • F-statistic
    • Given SNPs {X1, X2, …, XN} and a quantitative phenotype Y
    • Single SNP test -- F(Xi, Y)
    • SNP-pair test -- F(XiXj, Y)

SST = SSB + SSW
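
For reference, the standard one-way ANOVA definitions behind this partition are written out below. The notation is an assumption and does not come from the slides: G is the number of groups (2 for a single SNP, up to 4 for a SNP-pair), n_g is the size of group g, \bar{y}_g its mean, and \bar{y} the overall mean of the M phenotype values.

```latex
\begin{aligned}
SST &= \sum_{m=1}^{M} (y_m - \bar{y})^2, \\
SSB &= \sum_{g=1}^{G} n_g\,(\bar{y}_g - \bar{y})^2, \\
SSW &= \sum_{g=1}^{G} \sum_{m \in g} (y_m - \bar{y}_g)^2, \\
SST &= SSB + SSW, \\
F   &= \frac{SSB / (G - 1)}{SSW / (M - G)}.
\end{aligned}
```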

Problem Formalization
  • Dataset: M individuals, N SNPs {X1, X2, …, XN}, a quantitative phenotype Y, and its K permutations {Y1, Y2, …, YK}.
  • Maximum ANOVA test (F-statistic) value of permutation Yk

FYk = max {F(XiXj, Yk)|1≤i<j≤N}

  • Problem 1: Given Type I error threshold α, find the critical value Fα, which is the αK-th largest value among {FYk|1≤k≤K}
  • Problem 2: Given the threshold Fα, find all significant SNP-pairs such that F(XiXj, Y)≥ Fα
Brute force approach
  • Problem 1: Permutation test to find critical value
    • For permutation Yk, test all SNP-pairs to find the maximum test value FYk
    • Repeat for all permutations
    • Report αK-th largest value in {FYk|1≤k≤K}
  • Problem 2: Finding significant SNP-pairs
    • For phenotype Y, test all SNP-pairs and report the SNP-pairs whose test values are above Fα

Problem 1 is more demanding due to the large number of permutations

Overview of FastANOVA
  • Goal: Scale large permutation tests to genome-wide studies
  • Question: Do we have to perform ANOVA tests for every SNP-pair and repeat for all permutations?
  • Idea:
    • Develop an upper bound: to filter out SNP-pairs having no chance to become significant (all nodes on the same level of the search tree, no sub-tree pruning, how?)
    • Efficiently compute the upper bound: calculate the upper bound for a group of SNP-pairs together (possible?)
    • Identify redundant computations in the permutation tests (reuse computations, how?)
The upper bound
  • For any SNP-pair (XiXj), F(XiXj, Y) ≥ Fα is equivalent to SSB(XiXj, Y) ≥ θ, where θ is fixed for a given Fα
  • Bound on SSB: the upper bound needs to be at least θ for (XiXj) to be significant
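
The equivalence can be derived from the standard F-statistic definition. The derivation below is a sketch under assumptions not stated on the slide: G denotes the number of genotype groups (4 when all combinations 00, 01, 10, 11 occur), M > G, and SSW > 0. The explicit form of θ is a consequence of these assumptions, not a formula taken from the slides.

```latex
\begin{aligned}
F(X_iX_j, Y) \ge F_\alpha
&\iff \frac{SSB / (G - 1)}{(SST - SSB) / (M - G)} \ge F_\alpha \\
&\iff SSB\,\bigl[(M - G) + (G - 1)F_\alpha\bigr] \ge (G - 1)\,F_\alpha\,SST \\
&\iff SSB \ge \theta, \qquad
\theta = \frac{(G - 1)\,F_\alpha\,SST}{(M - G) + (G - 1)\,F_\alpha}.
\end{aligned}
```

Because permuting the phenotype values does not change SST, θ depends only on Fα once M and G are given.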

The upper bound

Given Xi, Xj, and Y:

[Formula not preserved; its terms are annotated "Constant", "f(na)", and "f(nb)", with the note "Only depend on the genotype of Xj"]
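
Since the slide's formula is not preserved, the following is only one way to build a bound with the annotated shape (constant + f(na) + f(nb)); it should not be read as the authors' exact formula. It assumes group A holds the individuals with Xi = 0 and group B those with Xi = 1, that na and nb are the sizes of the smaller Xj-subgroup inside A and B, and it uses the nested ANOVA identity SSB(XiXj, Y) = SSB(Xi, Y) + (SSB of the Xj-split inside A) + (SSB of the Xj-split inside B). Each split term is bounded by the most extreme split of the same size. All function names are hypothetical.

```python
import numpy as np
from itertools import combinations


def ssb(y, labels):
    """Between-group sum of squares of y for the grouping given by labels."""
    grand = y.mean()
    return sum((labels == g).sum() * (y[labels == g].mean() - grand) ** 2
               for g in np.unique(labels))


def max_split_ssb(values, n):
    """Largest SSB of any split of values into subgroups of size n and len(values) - n."""
    m = len(values)
    if n == 0 or n == m:
        return 0.0
    s = np.sort(values)
    # The extreme split takes either the n smallest or the n largest values.
    d = max(abs(s[:n].mean() - s.mean()), abs(s[-n:].mean() - s.mean()))
    return n * m / (m - n) * d ** 2


def ssb_upper_bound(xi, xj, y):
    """Constant term SSB(xi, y) plus f(na) and f(nb); xj enters only via na, nb."""
    in_a, in_b = xi == 0, xi == 1
    ones_a, ones_b = int(xj[in_a].sum()), int(xj[in_b].sum())
    na = min(ones_a, int(in_a.sum()) - ones_a)
    nb = min(ones_b, int(in_b.sum()) - ones_b)
    return ssb(y, xi) + max_split_ssb(y[in_a], na) + max_split_ssb(y[in_b], nb)


rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(20, 10))  # toy data: 20 individuals, 10 SNPs
y = rng.normal(size=20)
gaps = [ssb_upper_bound(X[:, i], X[:, j], y) - ssb(y, 2 * X[:, i] + X[:, j])
        for i, j in combinations(range(X.shape[1]), 2)]
print(f"smallest (bound - SSB) over all pairs: {min(gaps):.4f}")
```

The gap is never negative (up to rounding), so a pair whose bound falls below θ cannot be significant.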

Applying the upper bound

For a given Xi , let AP= {(XiXj)|i+1≤j≤N}.

Index the SNP-pairs in AP in the 2D space of (na, nb).

[Figure: SNP-pairs placed in the (na, nb) grid. Entry (1,3): (X1X3), (X1X5); entry (3,3): (X1X6); entry (2,1): (X1X2), (X1X4)]

Key properties


  • Maximum possible size:
  • Many SNP-pairs share the same entry
  • All SNP-pairs in the same entry have the same upper bound
  • The indexing structure does not depend on the phenotype permutations
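
A minimal sketch of such an index is shown below. It makes two assumptions: the six visible columns of the example table are treated as X1..X6, and na (nb) is taken to be the size of the smaller Xj-subgroup among individuals with Xi = 0 (Xi = 1). The second choice is a guess, made because it reproduces the entries (1,3), (3,3), and (2,1) shown on the earlier slide. The function name `build_index` is hypothetical.

```python
import numpy as np
from collections import defaultdict

# Six visible SNP columns of the example table, treated here as X1..X6.
X = np.array([
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 1],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 0, 1, 0, 1],
    [0, 1, 0, 0, 0, 0],
    [1, 0, 1, 1, 1, 1],
    [1, 0, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1],
    [1, 0, 0, 1, 0, 0],
    [1, 0, 0, 1, 0, 1],
    [1, 0, 1, 1, 0, 0],
])


def build_index(X, i):
    """Map each (na, nb) entry to the column indices j > i that fall into it."""
    in_a, in_b = X[:, i] == 0, X[:, i] == 1
    size_a, size_b = int(in_a.sum()), int(in_b.sum())
    index = defaultdict(list)
    for j in range(i + 1, X.shape[1]):
        ones_a, ones_b = int(X[in_a, j].sum()), int(X[in_b, j].sum())
        key = (min(ones_a, size_a - ones_a), min(ones_b, size_b - ones_b))
        index[key].append(j)
    return index


index = build_index(X, 0)  # only genotypes are used, never the phenotype
for entry, cols in sorted(index.items()):
    print(entry, [f"X1X{j + 1}" for j in cols])
```

Because the index is built from genotypes alone, it can be constructed once per Xi and reused for every phenotype permutation.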


Schema of FastANOVA (for permutation test)
  • For each Xi , index the SNP-pairs {(XiXj)|i+1≤j≤N} in the 2D space of (na, nb)
  • For each permutation, find the candidate SNP-pairs by accessing the indexing structure
    • Candidates are SNP-pairs whose upper bounds are above the threshold.
    • The dynamic threshold is the maximum test value found so far.
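
The schema can be sketched end to end as follows. This is an illustrative simplification, not the authors' implementation: it tracks the maximum SSB rather than the maximum F value (with SST fixed and the degrees of freedom held constant, F increases with SSB), it reuses the assumed bound of the form SSB(Xi, Y) + f(na) + f(nb) with na, nb taken as smaller-subgroup sizes, and all names and data are hypothetical.

```python
import numpy as np
from collections import defaultdict
from itertools import combinations


def ssb(y, labels):
    """Between-group sum of squares of y for the grouping given by labels."""
    grand = y.mean()
    return sum((labels == g).sum() * (y[labels == g].mean() - grand) ** 2
               for g in np.unique(labels))


def max_split_ssb(values, n):
    """Largest SSB of any split of values into sizes n and len(values) - n."""
    m = len(values)
    if n == 0 or n == m:
        return 0.0
    s = np.sort(values)
    d = max(abs(s[:n].mean() - s.mean()), abs(s[-n:].mean() - s.mean()))
    return n * m / (m - n) * d ** 2


def build_index(X, i):
    """Group columns j > i by (na, nb); depends on genotypes only."""
    in_a, in_b = X[:, i] == 0, X[:, i] == 1
    size_a, size_b = int(in_a.sum()), int(in_b.sum())
    index = defaultdict(list)
    for j in range(i + 1, X.shape[1]):
        ones_a, ones_b = int(X[in_a, j].sum()), int(X[in_b, j].sum())
        index[(min(ones_a, size_a - ones_a),
               min(ones_b, size_b - ones_b))].append(j)
    return index


def max_pair_ssb(X, y, indexes):
    """Maximum SNP-pair SSB for one phenotype vector, skipping hopeless entries."""
    best, evaluated = 0.0, 0
    for i, index in enumerate(indexes):
        xi = X[:, i]
        a, b = y[xi == 0], y[xi == 1]
        base = ssb(y, xi)  # constant for this Xi and this permutation
        for (na, nb), cols in index.items():
            # One bound per entry; the dynamic threshold is the best value so far.
            if base + max_split_ssb(a, na) + max_split_ssb(b, nb) <= best:
                continue
            for j in cols:
                evaluated += 1
                best = max(best, ssb(y, 2 * xi + X[:, j]))
    return best, evaluated


rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(20, 30))  # toy data: 20 individuals, 30 SNPs
y = rng.normal(size=20)
indexes = [build_index(X, i) for i in range(X.shape[1] - 1)]  # built once
results = []
for _ in range(3):
    y_perm = rng.permutation(y)
    best, evaluated = max_pair_ssb(X, y_perm, indexes)
    results.append((y_perm, best, evaluated))
    print(f"max SSB = {best:.3f}, pairs evaluated = {evaluated} of {30 * 29 // 2}")
```

The index list is built once and shared by all permutations; only the per-entry bounds are recomputed for each permuted phenotype.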
Complexity of FastANOVA
  • Time complexity
    • FastANOVA: O(N²M + KNM² + CM)
    • Brute force: O(KN²M)
  • Space complexity
    • O((N+K)M)

N = # SNPs, M = # individuals, K = # permutations, C = # candidates; M << N

Brute force vs. FastANOVA

Two orders of magnitude faster than the brute force alternative

Experimental setting: #SNPs = 44k, #individuals = 26, phenotype: metabolism (water intake)

SNP and phenotype data available at http://www.jax.org

Future work
  • Association study involving more than two SNPs
    • Computationally much more demanding
    • Three loci vs. two loci: the search space grows by a factor on the order of the number of SNPs
  • Association study for heterozygous case
    • SNPs are encoded as ternary variables {0, 1, 2}

Thank you!

Questions?
