FastANOVA: an Efficient Algorithm for Genome-Wide Association Study

Xiang Zhang, Fei Zou, Wei Wang

University of North Carolina at Chapel Hill

Speaker: Xiang Zhang


Genotype-phenotype association study

  • Goal: finding genetic factors causing phenotypic difference

[Figures: mouse genome; phenotype variation. Image sources: http://www.bcgsc.ca, http://www.jax.org/]


Genotype-phenotype association study

Chrom1 bp 3,568,717

Chrom6 bp 120,323,342

  • Single Nucleotide Polymorphism

    • Mutation of a single nucleotide (A,C,T,G)

    • The most abundant source of genotypic variation

    • Serve as genetic markers of locations in the genome

    • High throughput genotyping -- thousands to millions of SNPs

…… A A A C G …… A A T C C ……

…… A A A C G …… A A T C C ……

…… A A A C G …… A A T C G ……

…… A A A C G …… A A T C G ……

…… A A A C G …… A A T C G ……

…… A A A C G …… A A T C G ……

…… A A T C G …… A A T C C ……

…… A A T C G …… A A T C C ……

…… A A T C G …… A A T C G ……

…… A A T C G …… A A T C C ……

…… A A T C G …… A A T C C ……

…… A A T C G …… A A T C C ……

Thousands to millions of SNPs


Genotype-phenotype association study

  • Genotype

    • SNPs can be represented as binary {0,1} (e.g. inbred mouse strains)

  • Quantitative phenotypes

    • Body weight, blood pressure, tumor size, cancer susceptibility, ……

  • Question

    • Which SNPs are the most highly associated with the phenotype?

SNPs                  Phenotype value
…… 0 0 0 1 0 1 ……     8
…… 0 0 0 0 0 0 ……     7
…… 0 1 1 0 0 1 ……     12
…… 0 1 0 0 1 0 ……     11
…… 0 1 0 1 0 1 ……     9
…… 0 1 0 0 0 0 ……     13
…… 1 0 1 1 1 1 ……     6
…… 1 0 0 0 1 0 ……     4
…… 1 1 1 1 1 1 ……     2
…… 1 0 0 1 0 0 ……     5
…… 1 0 0 1 0 1 ……     0
…… 1 0 1 1 0 0 ……     3


A simple example: single marker association study

  • Partition individuals into groups according to genotype of a SNP

  • Perform a statistical test (t-test or ANOVA)

  • Repeat for each SNP

SNPs                  Phenotype value
…… 0 0 0 1 0 1 ……     8
…… 0 0 0 0 0 0 ……     7
…… 0 1 1 0 0 1 ……     12
…… 0 1 0 0 1 0 ……     11
…… 0 1 0 1 0 1 ……     9
…… 0 1 0 0 0 0 ……     13
…… 1 0 1 1 1 1 ……     6
…… 1 0 0 0 1 0 ……     4
…… 1 1 1 1 1 1 ……     2
…… 1 0 0 1 0 0 ……     5
…… 1 0 0 1 0 1 ……     0
…… 1 0 1 1 0 0 ……     3
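
The three steps above (partition, test, repeat) can be sketched in a few lines of Python. This is only an illustration, not the authors' code: it uses the six SNP columns visible in the example table, a standard one-way ANOVA F-statistic, and hypothetical helper names (`f_statistic`, `single_marker_scan`).

```python
# SNP rows and phenotype values copied from the example table
# (only the six visible SNP columns are used).
GENOTYPES = [
    [0, 0, 0, 1, 0, 1], [0, 0, 0, 0, 0, 0], [0, 1, 1, 0, 0, 1],
    [0, 1, 0, 0, 1, 0], [0, 1, 0, 1, 0, 1], [0, 1, 0, 0, 0, 0],
    [1, 0, 1, 1, 1, 1], [1, 0, 0, 0, 1, 0], [1, 1, 1, 1, 1, 1],
    [1, 0, 0, 1, 0, 0], [1, 0, 0, 1, 0, 1], [1, 0, 1, 1, 0, 0],
]
PHENOTYPE = [8, 7, 12, 11, 9, 13, 6, 4, 2, 5, 0, 3]


def f_statistic(labels, y):
    """One-way ANOVA F-statistic for phenotype y grouped by labels."""
    groups = {}
    for label, value in zip(labels, y):
        groups.setdefault(label, []).append(value)
    m, g = len(y), len(groups)
    if g < 2 or m <= g:
        return 0.0  # nothing to compare
    grand_mean = sum(y) / m
    sst = sum((v - grand_mean) ** 2 for v in y)
    ssb = sum(len(vals) * (sum(vals) / len(vals) - grand_mean) ** 2
              for vals in groups.values())
    ssw = sst - ssb
    if ssw <= 1e-12:
        return float("inf")  # groups separate the phenotype perfectly
    return (ssb / (g - 1)) / (ssw / (m - g))


def single_marker_scan(genotypes, y):
    """Return the F-statistic of every SNP column, in column order."""
    n_snps = len(genotypes[0])
    return [f_statistic([row[j] for row in genotypes], y)
            for j in range(n_snps)]


scores = single_marker_scan(GENOTYPES, PHENOTYPE)
best = max(range(len(scores)), key=scores.__getitem__)
print(f"best SNP column: {best}, F = {scores[best]:.2f}")
```

On this toy table the first visible column scores highest, which matches the eye: its 0-group holds the six largest phenotype values.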


Two-locus association mapping

  • Many phenotypes are complex traits

    • Due to the joint effect of multiple genes

    • Single marker approach may not suffice

  • Consider SNP-SNP interactions

    • Four possible genotype combinations for each SNP-pair: 00, 01, 10, 11

    • Split mice into four groups according to the genotype of each SNP-pair

    • Perform a statistical test for each SNP-pair
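
A minimal sketch of the SNP-pair test, assuming a standard one-way ANOVA over the (up to) four genotype groups. The two SNP vectors are the first two visible columns of the earlier example table; the names `X1`, `X2`, and `f_statistic` are illustrative only.

```python
from collections import Counter

# First two visible SNP columns and phenotype from the example table.
X1 = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
X2 = [0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0]
Y = [8, 7, 12, 11, 9, 13, 6, 4, 2, 5, 0, 3]


def f_statistic(labels, y):
    """One-way ANOVA F-statistic for phenotype y grouped by labels."""
    groups = {}
    for label, value in zip(labels, y):
        groups.setdefault(label, []).append(value)
    m, g = len(y), len(groups)
    if g < 2 or m <= g:
        return 0.0
    grand_mean = sum(y) / m
    sst = sum((v - grand_mean) ** 2 for v in y)
    ssb = sum(len(vals) * (sum(vals) / len(vals) - grand_mean) ** 2
              for vals in groups.values())
    ssw = sst - ssb
    if ssw <= 1e-12:
        return float("inf")
    return (ssb / (g - 1)) / (ssw / (m - g))


# Each individual is labelled by its genotype combination: 00, 01, 10 or 11.
pair_labels = list(zip(X1, X2))
group_sizes = Counter(pair_labels)
pair_f = f_statistic(pair_labels, Y)
print(dict(group_sizes), round(pair_f, 2))
```

Note that a group can be very small (here only one mouse has genotype 11), which is one reason the degrees of freedom are taken from the number of non-empty groups in this sketch.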


Statistical issue

  • Multiple testing problem

    • Performing n independent tests, each with Type I error α, gives a family-wise error rate of 1 − (1 − α)^n

  • Example

    • Performing 20 tests with Type I error = 0.05 gives a family-wise error rate of 0.64

    • That is a 64% probability of getting at least one spurious result

  • Solution

    • permutation test
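
The 0.64 figure in the example can be checked directly. The function name below is illustrative, and the formula assumes the tests are independent.

```python
def family_wise_error_rate(n_tests, alpha):
    """Probability of at least one false positive among independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


fwer = family_wise_error_rate(20, 0.05)
print(round(fwer, 2))  # 0.64
```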


Permutation test

  • K permutations of phenotype values

  • For each permutation, find the maximum test value

  • Given Type I error α, the critical value Fα is the αK-th largest value among the K maximum values

  • SNP-pairs whose test values are greater than Fα are significant
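
The procedure above can be sketched as a plain (brute-force) permutation test. This is an assumption-laden illustration, not the FastANOVA algorithm: the data are the six visible columns of the earlier example table, the statistic is a standard one-way ANOVA F, and `critical_value`, `max_pair_statistic`, and the seed are hypothetical choices.

```python
import random
from itertools import combinations

GENOTYPES = [
    [0, 0, 0, 1, 0, 1], [0, 0, 0, 0, 0, 0], [0, 1, 1, 0, 0, 1],
    [0, 1, 0, 0, 1, 0], [0, 1, 0, 1, 0, 1], [0, 1, 0, 0, 0, 0],
    [1, 0, 1, 1, 1, 1], [1, 0, 0, 0, 1, 0], [1, 1, 1, 1, 1, 1],
    [1, 0, 0, 1, 0, 0], [1, 0, 0, 1, 0, 1], [1, 0, 1, 1, 0, 0],
]
PHENOTYPE = [8, 7, 12, 11, 9, 13, 6, 4, 2, 5, 0, 3]


def f_statistic(labels, y):
    """One-way ANOVA F-statistic for phenotype y grouped by labels."""
    groups = {}
    for label, value in zip(labels, y):
        groups.setdefault(label, []).append(value)
    m, g = len(y), len(groups)
    if g < 2 or m <= g:
        return 0.0
    grand_mean = sum(y) / m
    sst = sum((v - grand_mean) ** 2 for v in y)
    ssb = sum(len(vals) * (sum(vals) / len(vals) - grand_mean) ** 2
              for vals in groups.values())
    ssw = sst - ssb
    if ssw <= 1e-12:
        return float("inf")
    return (ssb / (g - 1)) / (ssw / (m - g))


def pair_statistics(genotypes, y):
    """F-statistic of every SNP-pair (i, j) with i < j."""
    n_snps = len(genotypes[0])
    return {(i, j): f_statistic([(row[i], row[j]) for row in genotypes], y)
            for i, j in combinations(range(n_snps), 2)}


def max_pair_statistic(genotypes, y):
    return max(pair_statistics(genotypes, y).values())


def critical_value(genotypes, y, n_perm, alpha, seed=0):
    """Return (F_alpha, sorted maxima) from n_perm phenotype permutations."""
    rng = random.Random(seed)
    maxima = []
    for _ in range(n_perm):
        shuffled = list(y)
        rng.shuffle(shuffled)  # permute phenotype values only
        maxima.append(max_pair_statistic(genotypes, shuffled))
    maxima.sort(reverse=True)
    rank = max(1, round(alpha * n_perm))  # the alpha*K-th largest maximum
    return maxima[rank - 1], maxima


f_alpha, maxima = critical_value(GENOTYPES, PHENOTYPE, n_perm=100, alpha=0.05)
significant = [pair for pair, f in pair_statistics(GENOTYPES, PHENOTYPE).items()
               if f > f_alpha]
print(f"F_alpha = {f_alpha:.2f}, significant pairs: {significant}")
```

Taking the maximum over all SNP-pairs within each permutation is what controls the family-wise error rate: the threshold is calibrated against the best result chance alone can produce.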


Genome-wide association study

  • What’s GWA?

    • Simple Idea: search for the associations in the whole genome

  • Hard to implement

    • Enormous search space: with 10,000 SNPs and 1,000 permutations, the number of SNP-pair tests is about 5 × 10^10
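
The size of the search space follows from counting unordered SNP-pairs and multiplying by the number of permutations; a quick check (variable names are illustrative):

```python
import math

n_snps = 10_000
n_permutations = 1_000

n_pairs = math.comb(n_snps, 2)       # unordered SNP-pairs
n_tests = n_pairs * n_permutations   # one test per pair per permutation
print(f"{n_pairs:,} SNP-pairs, {n_tests:.1e} pair tests")
```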


Preliminary: ANOVA test and F-statistic

  • ANOVA test

    • To determine whether the group means are significantly different

    • Partition Total sum of squares into Between-group sum of squares and Within-group sum of squares

  • F-statistic

    • SNPs {X1, X2, …, XN},

    • a quantitative phenotype Y

    • Single SNP test -- F(Xi, Y)

    • SNP-pair test -- F(XiXj, Y)

SST = SSB + SSW
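
For reference, the standard one-way ANOVA quantities can be written out as follows. The slide's own formulas are not preserved in the transcript, so this is the textbook form, with G denoting the number of non-empty genotype groups (an added symbol: G = 2 for a single SNP, up to 4 for a SNP-pair), n_g the size of group g, and ȳ_g its mean.

```latex
\begin{aligned}
SST &= \sum_{m=1}^{M} (y_m - \bar{y})^2 \\
SSB &= \sum_{g=1}^{G} n_g \, (\bar{y}_g - \bar{y})^2 \\
SSW &= \sum_{g=1}^{G} \sum_{m \in g} (y_m - \bar{y}_g)^2 \;=\; SST - SSB \\
F   &= \frac{SSB / (G - 1)}{SSW / (M - G)}
\end{aligned}
```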


Problem Formalization

  • Dataset: M individuals, N SNPs {X1, X2, …, XN}, a quantitative phenotype Y, and its K permutations {Y1, Y2, …, YK}.

  • Maximum ANOVA test (F-statistic) value of permutation Yk:

    FYk = max {F(XiXj, Yk) | 1 ≤ i < j ≤ N}

  • Problem 1: Given Type I error threshold α, find the critical value Fα, which is the αK-th largest value among {FYk | 1 ≤ k ≤ K}

  • Problem 2: Given the threshold Fα, find all significant SNP-pairs, i.e., those with F(XiXj, Y) ≥ Fα


Brute force approach

  • Problem 1: Permutation test to find critical value

    • For permutation Yk, test all SNP-pairs to find the maximum test value FYk

    • Repeat for all permutations

    • Report αK-th largest value in {FYk|1≤k≤K}

  • Problem 2: Finding significant SNP-pairs

    • For phenotype Y, test all SNP-pairs and report the SNP-pairs whose test values are above Fα

Problem 1 is more demanding because of the large number of permutations


Overview of FastANOVA

  • Goal: scale large permutation tests to genome-wide data

  • Question: Do we have to perform ANOVA tests for every SNP-pair and repeat for all permutations?

  • Idea:

    • Develop an upper bound: to filter out SNP-pairs having no chance to become significant (all nodes on the same level of the search tree, no sub-tree pruning, how?)

    • Efficiently compute the upper bound: calculate the upper bound for a group of SNP-pairs together (possible?)

    • Identify redundant computations in the permutation tests (reuse computations, how?)


The upper bound

  • For any SNP-pair (XiXj), F(XiXj, Y) ≥ Fα is equivalent to SSB(XiXj, Y) ≥ θ, where θ is fixed for a given Fα

  • Bound on SSB: the upper bound must reach θ for (XiXj) to have a chance of being significant
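
The equivalence between the F threshold and the SSB threshold can be reconstructed from the standard ANOVA definitions. The slide's own expression for θ did not survive in the transcript, so the derivation below is a sketch under stated assumptions: G non-empty groups (an added symbol), M > G, and SSW > 0.

```latex
\begin{aligned}
F \ge F_\alpha
 &\iff \frac{SSB/(G-1)}{(SST - SSB)/(M-G)} \ge F_\alpha \\
 &\iff SSB \,\bigl[(M-G) + (G-1)F_\alpha\bigr] \ge (G-1)\,F_\alpha\, SST \\
 &\iff SSB \ge \theta, \qquad
 \theta = \frac{(G-1)\,F_\alpha}{(M-G) + (G-1)\,F_\alpha}\; SST
\end{aligned}
```

Because SST depends only on the phenotype values, it is unchanged by permuting them, so under these assumptions θ is a single number once Fα and G are fixed.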


The upper bound

Given Xi, Xj, and Y, the bound combines a constant term with two terms, f(na) and f(nb), that depend only on the genotype of Xj.

[Equation not preserved in the transcript]


Applying the upper bound

For a given Xi, let AP = {(XiXj) | i+1 ≤ j ≤ N}.

Index the SNP-pairs in AP in the 2D space of (na, nb).

[Figure: example index for X1. Entry (1,3): (X1X3), (X1X5); entry (3,3): (X1X6); entry (2,1): (X1X2), (X1X4)]
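
The indexing step can be sketched as a dictionary keyed by (na, nb). The transcript does not define na and nb, so the code assumes one reading: within the individuals with Xi = 0 (group A) and Xi = 1 (group B), na and nb are the sizes of the smaller genotype subgroup of Xj. That assumption is not confirmed by the source, but applied to the six visible columns of the earlier example table it reproduces the entries (1,3), (3,3), and (2,1) shown on the slide. `build_index` is a hypothetical name.

```python
from collections import defaultdict

GENOTYPES = [
    [0, 0, 0, 1, 0, 1], [0, 0, 0, 0, 0, 0], [0, 1, 1, 0, 0, 1],
    [0, 1, 0, 0, 1, 0], [0, 1, 0, 1, 0, 1], [0, 1, 0, 0, 0, 0],
    [1, 0, 1, 1, 1, 1], [1, 0, 0, 0, 1, 0], [1, 1, 1, 1, 1, 1],
    [1, 0, 0, 1, 0, 0], [1, 0, 0, 1, 0, 1], [1, 0, 1, 1, 0, 0],
]


def build_index(genotypes, i):
    """Group SNP-pairs (Xi, Xj), j > i, by their assumed (na, nb) entry."""
    group_a = [row for row in genotypes if row[i] == 0]
    group_b = [row for row in genotypes if row[i] == 1]
    index = defaultdict(list)
    for j in range(i + 1, len(genotypes[0])):
        ones_a = sum(row[j] for row in group_a)
        ones_b = sum(row[j] for row in group_b)
        # Assumption: fold each count to the smaller subgroup size.
        na = min(ones_a, len(group_a) - ones_a)
        nb = min(ones_b, len(group_b) - ones_b)
        index[(na, nb)].append(j)
    return dict(index)


# Columns are 0-based here, so column 0 is X1 and column j is X(j+1).
index = build_index(GENOTYPES, 0)
for entry, cols in sorted(index.items()):
    print(entry, [f"X1X{j + 1}" for j in cols])
```

The index is built from genotypes alone, so it can be computed once and reused for every phenotype permutation.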


Key properties

[Figure: the 2D index, annotated with f(na) and f(nb)]

  • Maximum possible size:

  • Many SNP-pairs share the same entry

  • All SNP-pairs in the same entry have the same upper bound

  • The indexing structure does not depend on the phenotype permutations



Schema of FastANOVA (for permutation test)

  • For each Xi , index the SNP-pairs {(XiXj)|i+1≤j≤N} in the 2D space of (na, nb)

  • For each permutation, find the candidate SNP-pairs by accessing the indexing structure

    • Candidates are SNP-pairs whose upper bounds are above the threshold.

    • The dynamic threshold is the maximum test value found so far.
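
The candidate-filtering idea with a dynamic threshold can be sketched generically. The actual bound formula is not preserved in the transcript, so the bound is passed in as a function; `max_with_pruning` and all numbers below are made up for illustration, and visiting entries in decreasing order of bound is a design choice of this sketch rather than something the slides state.

```python
def max_with_pruning(entries, upper_bound, exact_value):
    """Return (maximum exact value, number of exact evaluations).

    entries maps an index entry to the SNP-pairs stored in it;
    upper_bound(entry) must be >= exact_value(pair) for every pair in it.
    """
    best = float("-inf")  # dynamic threshold: maximum test value so far
    evaluated = 0
    # Visit entries from the largest bound down so the threshold rises early.
    for entry in sorted(entries, key=upper_bound, reverse=True):
        if upper_bound(entry) <= best:
            break  # no remaining entry can beat the current maximum
        for pair in entries[entry]:
            best = max(best, exact_value(pair))
            evaluated += 1
    return best, evaluated


# Hypothetical toy numbers, chosen only to exercise the pruning logic.
ENTRIES = {(1, 3): ["X1X3", "X1X5"], (3, 3): ["X1X6"], (2, 1): ["X1X2", "X1X4"]}
BOUNDS = {(1, 3): 10.0, (3, 3): 5.0, (2, 1): 8.0}
VALUES = {"X1X3": 9.0, "X1X5": 7.5, "X1X6": 4.0, "X1X2": 6.0, "X1X4": 3.0}

best, evaluated = max_with_pruning(ENTRIES, BOUNDS.get, VALUES.get)
print(best, evaluated)
```

In this toy run only the pairs in the first entry are tested exactly; the other two entries are discarded because their bounds cannot exceed the maximum already found.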


Complexity of FastANOVA

  • Time complexity

    • FastANOVA: O(N²M + KNM² + CM)

    • Brute force: O(KN²M)

  • Space complexity

    • O((N+K)M)

N = # SNPs

M = # individuals

K = # permutations

C = # candidates

M << N
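
To get a feel for the two expressions, the terms can be evaluated for illustrative sizes. The values below borrow N = 10,000 and K = 1,000 from the search-space example and M = 26 from the experiment slide; the candidate count C is unknown, so it is set to 0 here. Big-O notation hides constant factors, so the ratio is only an order-of-magnitude indication, not a runtime prediction.

```python
def brute_force_ops(n, m, k):
    """Leading term of the brute-force cost, K * N^2 * M."""
    return k * n**2 * m


def fastanova_ops(n, m, k, c):
    """Leading terms of the FastANOVA cost, N^2*M + K*N*M^2 + C*M."""
    return n**2 * m + k * n * m**2 + c * m


N, M, K = 10_000, 26, 1_000  # illustrative sizes, see the note above
brute = brute_force_ops(N, M, K)
fast = fastanova_ops(N, M, K, c=0)  # C unknown; the candidate term is omitted
print(f"brute force ~{brute:.2e}, FastANOVA ~{fast:.2e}, ratio ~{brute / fast:.0f}")
```

The gain comes from M << N: the per-permutation term drops from N²M to NM², trading one factor of N for one factor of M.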


Brute force vs. FastANOVA

Two orders of magnitude faster than the brute force alternative

#SNPs = 44k, #individuals = 26, phenotype: metabolism (water intake)

SNP and phenotype data available at http://www.jax.org


Pruning power of the bound


Runtime of each component

One time cost


Future work

  • Association study involving more than two SNPs

    • Computationally much more demanding

    • Three loci vs. two loci: the number of tests grows by a factor on the order of the number of SNPs

  • Association study for heterozygous case

    • SNPs are encoded as ternary variables {0, 1, 2}
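
Both points can be made concrete with a little counting. The sketch below uses N = 10,000 SNPs as an illustrative size; the variable names are not from the slides.

```python
import math

N = 10_000  # illustrative number of SNPs

pairs = math.comb(N, 2)
triples = math.comb(N, 3)
growth = triples / pairs  # equals (N - 2) / 3, i.e. on the order of N

# Genotype combinations per SNP-pair: binary vs. ternary encoding.
binary_pair_groups = 2 ** 2   # 00, 01, 10, 11
ternary_pair_groups = 3 ** 2  # each SNP takes a value in {0, 1, 2}

print(f"{pairs:,} pairs, {triples:,} triples, growth factor ~{growth:.0f}")
print(binary_pair_groups, ternary_pair_groups)
```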


Thank you!

Questions?