### FastANOVA: an Efficient Algorithm for Genome-Wide Association Study

### Thank You !

Xiang Zhang, Fei Zou, Wei Wang

University of North Carolina at Chapel Hill

Speaker: Xiang Zhang

Genotype-phenotype association study

- Goal: finding genetic factors causing phenotypic difference

Mouse genome

Phenotype variation

http://www.bcgsc.ca

http://www.jax.org/

Genotype-phenotype association study

Chrom1 bp3,568,717

Chrom6 bp120,323,342

- Single Nucleotide Polymorphism
- Mutation of a single nucleotide (A,C,T,G)
- The most abundant source of genotypic variation
- Serve as genetic markers of locations in the genome
- High throughput genotyping -- thousands to millions of SNPs

…… A A A C G …… A A T C C ……

…… A A A C G …… A A T C C ……

…… A A A C G …… A A T C G ……

…… A A A C G …… A A T C G ……

…… A A A C G …… A A T C G ……

…… A A A C G …… A A T C G ……

…… A A T C G …… A A T C C ……

…… A A T C G …… A A T C C ……

…… A A T C G …… A A T C G ……

…… A A T C G …… A A T C C ……

…… A A T C G …… A A T C C ……

…… A A T C G …… A A T C C ……

Thousands to millions of SNPs

Genotype-phenotype association study

- Genotype
- SNPs can be represented as binary {0,1} (e.g. inbred mouse strains)
- Quantitative phenotypes
- Body weight, blood pressure, tumor size, cancer susceptibility, ……
- Question
- Which SNPs are the most highly associated with the phenotype?

Phenotype value

SNPs

…… 0 0 0 1 0 1 …… 8

…… 0 0 0 0 0 0 …… 7

…… 0 1 1 0 0 1 …… 12

…… 0 1 0 0 1 0 …… 11

…… 0 1 0 1 0 1 …… 9

…… 0 1 0 0 0 0 …… 13

…… 1 0 1 1 1 1 …… 6

…… 1 0 0 0 1 0 …… 4

…… 1 1 1 1 1 1 …… 2

…… 1 0 0 1 0 0 …… 5

…… 1 0 0 1 0 1 …… 0

…… 1 0 1 1 0 0 …… 3

A simple example: single marker association study

- Partition individuals into groups according to genotype of a SNP
- Perform a statistical test (t-test or ANOVA)
- Repeat for each SNP
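The three steps above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: it uses the six SNP columns visible in the slide's table as toy data (the table's elided columns are left out), and the helper names `f_statistic` and `single_marker_scan` are made up for this sketch.

```python
def f_statistic(groups):
    """One-way ANOVA F-statistic for a list of non-empty groups of numbers.

    Assumes at least two groups and non-zero within-group variance.
    """
    values = [v for g in groups for v in g]
    m, k = len(values), len(groups)
    grand_mean = sum(values) / m
    means = [sum(g) / len(g) for g in groups]
    ssb = sum(len(g) * (mu - grand_mean) ** 2 for g, mu in zip(groups, means))
    ssw = sum((v - mu) ** 2 for g, mu in zip(groups, means) for v in g)
    return (ssb / (k - 1)) / (ssw / (m - k))


def single_marker_scan(snps, phenotype):
    """Return {snp_index: F} for every SNP that splits the individuals in two."""
    scores = {}
    for j in range(len(snps[0])):
        groups = [[y for row, y in zip(snps, phenotype) if row[j] == g] for g in (0, 1)]
        if all(groups):  # skip SNPs where every individual has the same genotype
            scores[j] = f_statistic(groups)
    return scores


# Toy data: one string per individual, one character per visible SNP column.
ROWS = ["000101", "000000", "011001", "010010", "010101", "010000",
        "101111", "100010", "111111", "100100", "100101", "101100"]
SNPS = [[int(c) for c in row] for row in ROWS]
PHENOTYPE = [8, 7, 12, 11, 9, 13, 6, 4, 2, 5, 0, 3]

scores = single_marker_scan(SNPS, PHENOTYPE)
best = max(scores, key=scores.get)
print("most associated SNP column:", best, "F =", round(scores[best], 2))
```

On this toy table the first visible column separates high and low phenotype values most cleanly, so it receives the largest F value.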

Phenotype value

SNPs

…… 0 0 0 1 0 1 …… 8

…… 0 0 0 0 0 0 …… 7

…… 0 1 1 0 0 1 …… 12

…… 0 1 0 0 1 0 …… 11

…… 0 1 0 1 0 1 …… 9

…… 0 1 0 0 0 0 …… 13

…… 1 0 1 1 1 1 …… 6

…… 1 0 0 0 1 0 …… 4

…… 1 1 1 1 1 1 …… 2

…… 1 0 0 1 0 0 …… 5

…… 1 0 0 1 0 1 …… 0

…… 1 0 1 1 0 0 …… 3

Two-locus association mapping

- Many phenotypes are complex traits
- Due to the joint effect of multiple genes
- Single marker approach may not suffice
- Consider SNP-SNP interactions
- Four possible genotype combinations for each SNP-pair: 00, 01, 10, 11
- Split mice into four groups according to the genotype of each SNP-pair
- Perform a statistical test for each SNP-pair
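A minimal sketch of the two-locus test, under the same assumptions as any toy example: the data are the six visible SNP columns of the slide's table, the function names (`pair_f`, `two_locus_scan`) are hypothetical, and a plain one-way ANOVA over the non-empty genotype groups stands in for whatever test the authors' implementation uses.

```python
from itertools import combinations


def f_statistic(groups):
    """One-way ANOVA F-statistic; assumes >= 2 groups and non-zero within-group variance."""
    values = [v for g in groups for v in g]
    m, k = len(values), len(groups)
    grand_mean = sum(values) / m
    means = [sum(g) / len(g) for g in groups]
    ssb = sum(len(g) * (mu - grand_mean) ** 2 for g, mu in zip(groups, means))
    ssw = sum((v - mu) ** 2 for g, mu in zip(groups, means) for v in g)
    return (ssb / (k - 1)) / (ssw / (m - k))


def pair_f(snps, phenotype, i, j):
    """F-statistic after grouping individuals by the joint genotype (00, 01, 10, 11)."""
    groups = {}
    for row, y in zip(snps, phenotype):
        groups.setdefault((row[i], row[j]), []).append(y)
    return f_statistic(list(groups.values()))


def two_locus_scan(snps, phenotype):
    """Test every SNP-pair (i < j) and return {(i, j): F}."""
    return {(i, j): pair_f(snps, phenotype, i, j)
            for i, j in combinations(range(len(snps[0])), 2)}


ROWS = ["000101", "000000", "011001", "010010", "010101", "010000",
        "101111", "100010", "111111", "100100", "100101", "101100"]
SNPS = [[int(c) for c in row] for row in ROWS]
PHENOTYPE = [8, 7, 12, 11, 9, 13, 6, 4, 2, 5, 0, 3]

pair_scores = two_locus_scan(SNPS, PHENOTYPE)
top_pair = max(pair_scores, key=pair_scores.get)
print("top SNP-pair:", top_pair, "F =", round(pair_scores[top_pair], 2))
```

With N SNPs the scan performs N(N-1)/2 tests, which is what makes the pairwise search expensive at genome scale.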

Statistical issue

- Multiple testing problem
- With n independent tests, each at Type I error α, the family-wise error rate is 1 − (1 − α)^n
- Example
- Performing 20 tests with Type I error = 0.05 gives a family-wise error rate of 0.64
- 64% probability of getting at least one spurious result
- Solution
- permutation test
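The 0.64 figure in the example can be checked directly. The snippet assumes the n tests are independent, in which case the family-wise error rate is 1 − (1 − α)^n; the function name is made up for illustration.

```python
def family_wise_error_rate(alpha, n_tests):
    """Probability of at least one false positive among n independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


fwer = family_wise_error_rate(0.05, 20)
print(round(fwer, 2))
```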

Permutation test

- K permutations of phenotype values
- For each permutation, find the maximum test value
- Given Type I error α, the critical value Fα is the αK-th largest value among the K maximum values
- SNP-pairs whose test values are greater than Fα are significant
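The permutation procedure can be sketched as follows. This is a brute-force illustration under stated assumptions, not FastANOVA itself: the toy data are the six visible SNP columns of the earlier table, the helper names are hypothetical, and the random seed is fixed only to make the run repeatable.

```python
import random
from itertools import combinations


def f_statistic(groups):
    """One-way ANOVA F-statistic; assumes >= 2 groups and non-zero within-group variance."""
    values = [v for g in groups for v in g]
    m, k = len(values), len(groups)
    grand_mean = sum(values) / m
    means = [sum(g) / len(g) for g in groups]
    ssb = sum(len(g) * (mu - grand_mean) ** 2 for g, mu in zip(groups, means))
    ssw = sum((v - mu) ** 2 for g, mu in zip(groups, means) for v in g)
    return (ssb / (k - 1)) / (ssw / (m - k))


def max_pair_f(snps, phenotype):
    """Maximum F-statistic over all SNP-pairs for one phenotype vector."""
    best = 0.0
    for i, j in combinations(range(len(snps[0])), 2):
        groups = {}
        for row, y in zip(snps, phenotype):
            groups.setdefault((row[i], row[j]), []).append(y)
        best = max(best, f_statistic(list(groups.values())))
    return best


def critical_value(snps, phenotype, n_perm, alpha, seed=0):
    """Return (F_alpha, maxima): the (alpha*K)-th largest of K permutation maxima."""
    rng = random.Random(seed)
    maxima = []
    for _ in range(n_perm):
        permuted = phenotype[:]
        rng.shuffle(permuted)  # break any genotype-phenotype association
        maxima.append(max_pair_f(snps, permuted))
    maxima.sort(reverse=True)
    rank = max(1, round(alpha * n_perm))
    return maxima[rank - 1], maxima


ROWS = ["000101", "000000", "011001", "010010", "010101", "010000",
        "101111", "100010", "111111", "100100", "100101", "101100"]
SNPS = [[int(c) for c in row] for row in ROWS]
PHENOTYPE = [8, 7, 12, 11, 9, 13, 6, 4, 2, 5, 0, 3]

f_alpha, maxima = critical_value(SNPS, PHENOTYPE, n_perm=100, alpha=0.05)
print("critical value:", round(f_alpha, 2))
```

Because each permutation contributes only its maximum, a SNP-pair that exceeds the resulting threshold is significant after accounting for all pairs tested.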

Genome-wide association study

- What’s GWA?
- Simple Idea: search for the associations in the whole genome
- Hard to implement
- Enormous search space: with 10,000 SNPs and 1,000 permutations, about 5 × 10^10 SNP-pair tests are needed
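The size of the search space follows from counting unordered SNP-pairs and multiplying by the number of permutations; a quick check of the arithmetic:

```python
import math

n_snps, n_permutations = 10_000, 1_000
pairs = math.comb(n_snps, 2)        # unordered SNP-pairs: N(N-1)/2
tests = pairs * n_permutations      # one test per pair per permutation
print(f"{pairs:,} pairs, {tests:.1e} tests")
```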

Preliminary: ANOVA test and F-statistic

- ANOVA test
- To determine whether the group means are significantly different
- Partition Total sum of squares into Between-group sum of squares and Within-group sum of squares
- F-statistic
- SNPs {X1, X2, …, XN},
- a quantitative phenotype Y
- Single SNP test -- F(Xi, Y)
- SNP-pair test -- F(XiXj, Y)

SST = SSB + SSW
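For reference, the standard one-way ANOVA definitions behind these quantities are written out below. The slide's own formulas did not survive, so the notation here is an assumption: M individuals are split into k groups, group g has n_g members with mean \bar{y}_g, and \bar{y} is the overall mean.

```latex
\begin{align*}
SST &= \sum_{m=1}^{M} (y_m - \bar{y})^2, &
SSB &= \sum_{g=1}^{k} n_g\,(\bar{y}_g - \bar{y})^2, &
SSW &= \sum_{g=1}^{k} \sum_{m \in g} (y_m - \bar{y}_g)^2, \\
SST &= SSB + SSW, &
F &= \frac{SSB/(k-1)}{SSW/(M-k)}.
\end{align*}
```

For a single binary SNP k = 2; for a SNP-pair k = 4 when all four genotype combinations occur.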

Problem Formalization

- Dataset: M individuals, N SNPs {X1, X2, …, XN}, a quantitative phenotype Y, and its K permutations {Y1, Y2, …, YK}.
- Maximum ANOVA test (F-statistic) value of permutation Yk

FYk = max {F(XiXj, Yk)|1≤i<j≤N}

- Problem 1: Given Type I error threshold α, find the critical value Fα, which is the αK-th largest value among {FYk|1≤k≤K}
- Problem 2: Given the threshold Fα, find all significant SNP-pairs such that F(XiXj, Y)≥ Fα

Brute force approach

- Problem 1: Permutation test to find critical value
- For permutation Yk, test all SNP-pairs to find the maximum test value FYk
- Repeat for all permutations
- Report αK-th largest value in {FYk|1≤k≤K}
- Problem 2: Finding significant SNP-pairs
- For phenotype Y, test all SNP-pairs and report the SNP-pairs whose test values are above Fα

Problem 1 is more demanding due to the large number of permutations

Overview of FastANOVA

- Goal: Scale large permutation tests to genome-wide data
- Question: Do we have to perform ANOVA tests for every SNP-pair and repeat for all permutations?
- Idea:
- Develop an upper bound: to filter out SNP-pairs having no chance to become significant (all nodes on the same level of the search tree, no sub-tree pruning, how?)
- Efficiently compute the upper bound: calculate the upper bound for a group of SNP-pairs together (possible?)
- Identify redundant computations in the permutation tests (reuse computations, how?)

The upper bound

- For any SNP-pair (XiXj), F(XiXj, Y) ≥ Fα is equivalent to SSB(XiXj, Y) ≥ θ, where θ is fixed for a given Fα
- Bound on SSB: the upper bound needs to reach θ for (XiXj) to be significant
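The equivalence between the F threshold and an SSB threshold follows from standard ANOVA algebra. The slide's own expression for θ is not preserved, so the derivation below is a reconstruction under the assumption that the number of non-empty groups k is fixed (k = 4 for a SNP-pair) and that SST depends only on the phenotype values, not on the SNP-pair.

```latex
\begin{align*}
F = \frac{SSB/(k-1)}{(SST - SSB)/(M-k)} \ge F_\alpha
&\iff (M-k)\,SSB \ge (k-1)\,F_\alpha\,(SST - SSB) \\
&\iff SSB \ge \theta := \frac{(k-1)\,F_\alpha\,SST}{(M-k) + (k-1)\,F_\alpha}.
\end{align*}
```

Since permuting the phenotype does not change SST, θ stays the same across permutations for a given Fα.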

Applying the upper bound

For a given Xi , let AP= {(XiXj)|i+1≤j≤N}.

Index the SNP-pairs in AP in the 2D space of (na, nb).

[Figure: SNP-pairs of X1 placed in the (na, nb) grid]

- (X1X3), (X1X5): entry (1,3)
- (X1X6): entry (3,3)
- (X1X2), (X1X4): entry (2,1)

Key properties

f(na)

f(nb)

- Maximum possible size:
- Many SNP-pairs share the same entry
- All SNP-pairs in the same entry have the same upper bound
- The indexing structure does not depend on the phenotype permutations
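A sketch of how such an index could be built. The slide does not define na and nb, so the reading used here is an assumption: among individuals with Xi = 0 (group A) and Xi = 1 (group B), na and nb are the sizes of the smaller Xj subgroup. With the six visible SNP columns of the earlier table, this reading reproduces the slide's entries (1,3), (3,3) and (2,1) for X1. The function name `build_index` is hypothetical.

```python
def build_index(snps, i):
    """Group SNP-pairs (Xi, Xj), j > i, by their (na, nb) entry.

    Assumption: na / nb = size of the smaller Xj subgroup among individuals
    with Xi == 0 / Xi == 1. Only genotypes are used, never the phenotype.
    """
    a_rows = [row for row in snps if row[i] == 0]
    b_rows = [row for row in snps if row[i] == 1]
    index = {}
    for j in range(i + 1, len(snps[0])):
        ones_a = sum(row[j] for row in a_rows)
        ones_b = sum(row[j] for row in b_rows)
        key = (min(ones_a, len(a_rows) - ones_a),
               min(ones_b, len(b_rows) - ones_b))
        index.setdefault(key, []).append(j)
    return index


ROWS = ["000101", "000000", "011001", "010010", "010101", "010000",
        "101111", "100010", "111111", "100100", "100101", "101100"]
SNPS = [[int(c) for c in row] for row in ROWS]

index = build_index(SNPS, 0)  # column 0 plays the role of X1
for entry, partners in sorted(index.items()):
    print(entry, ["X%d" % (j + 1) for j in partners])
```

Because the index is built from genotypes alone, it can be constructed once per Xi and reused for every phenotype permutation.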

Same upper bound value

Schema of FastANOVA (for permutation test)

- For each Xi , index the SNP-pairs {(XiXj)|i+1≤j≤N} in the 2D space of (na, nb)
- For each permutation, find the candidate SNP-pairs by accessing the indexing structure
- Candidates are SNP-pairs whose upper bounds are above the threshold.
- The dynamic threshold is the maximum test value found so far.
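The filter-and-refine loop can be illustrated on one phenotype vector. This is a simplified reconstruction, not the authors' algorithm: it ranks pairs by SSB (which orders pairs the same way as F when all four genotype groups are non-empty), and it uses a bound derived here from the fact that, for a fixed subgroup size, the split gaining the most SSB takes the smallest or the largest values. All function names and the toy data (the six visible columns of the earlier table) are assumptions.

```python
def split_ssb(total_sum, total_n, part_sum, part_n):
    """Between-group sum of squares gained by splitting one group into two parts."""
    if part_n == 0 or part_n == total_n:
        return 0.0  # one part is empty, so nothing is gained
    rest_sum, rest_n = total_sum - part_sum, total_n - part_n
    return part_sum ** 2 / part_n + rest_sum ** 2 / rest_n - total_sum ** 2 / total_n


def max_split_ssb(sorted_vals, n):
    """Largest split_ssb over all ways to cut off n values (an extreme subset wins)."""
    total, size = sum(sorted_vals), len(sorted_vals)
    lowest, highest = sum(sorted_vals[:n]), sum(sorted_vals[size - n:])
    return max(split_ssb(total, size, lowest, n), split_ssb(total, size, highest, n))


def max_pair_ssb(snps, y):
    """Return (largest SSB over all SNP-pairs, number of pairs evaluated exactly)."""
    m, n_snps = len(y), len(snps[0])
    best, evaluated = 0.0, 0
    for i in range(n_snps - 1):
        a = [r for r in range(m) if snps[r][i] == 0]
        b = [r for r in range(m) if snps[r][i] == 1]
        ya, yb = sorted(y[r] for r in a), sorted(y[r] for r in b)
        base = split_ssb(sum(y), m, sum(ya), len(a))  # SSB of Xi alone
        index = {}  # (na, nb) -> pairs sharing that entry; genotype-only
        for j in range(i + 1, n_snps):
            ones_a = [r for r in a if snps[r][j] == 1]
            ones_b = [r for r in b if snps[r][j] == 1]
            key = (min(len(ones_a), len(a) - len(ones_a)),
                   min(len(ones_b), len(b) - len(ones_b)))
            index.setdefault(key, []).append((ones_a, ones_b))
        for (na, nb), pairs in index.items():
            bound = base + max_split_ssb(ya, na) + max_split_ssb(yb, nb)
            if bound <= best:
                continue  # no pair in this entry can beat the current maximum
            for ones_a, ones_b in pairs:
                exact = (base
                         + split_ssb(sum(ya), len(a), sum(y[r] for r in ones_a), len(ones_a))
                         + split_ssb(sum(yb), len(b), sum(y[r] for r in ones_b), len(ones_b)))
                evaluated += 1
                best = max(best, exact)
    return best, evaluated


ROWS = ["000101", "000000", "011001", "010010", "010101", "010000",
        "101111", "100010", "111111", "100100", "100101", "101100"]
SNPS = [[int(c) for c in row] for row in ROWS]
PHENOTYPE = [8, 7, 12, 11, 9, 13, 6, 4, 2, 5, 0, 3]

best_ssb, n_evaluated = max_pair_ssb(SNPS, PHENOTYPE)
print("max SSB:", round(best_ssb, 3), "pairs evaluated exactly:", n_evaluated, "of 15")
```

For simplicity the sketch rebuilds the index inside the search. In a permutation test the index would be built once per Xi and only the bounds recomputed for each permuted phenotype.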

Complexity of FastANOVA

- Time complexity
- FastANOVA: O(N²M + KNM² + CM)
- Brute force: O(KN²M)
- Space complexity
- O((N+K)M)

N = # SNPs

M = # individuals

K = # permutations

C = # candidates

M << N

Brute force vs. FastANOVA

Two orders of magnitude faster than the brute force alternative

#SNPs = 44k, #individuals = 26,

phenotype: metabolism (water intake)

SNP and phenotype data available at http://www.jax.org

Runtime of each component

One time cost

Future work

- Association study involving more than two SNPs
- Computationally much more demanding
- Three loci vs. two loci: the search space grows by a factor on the order of the number of SNPs
- Association study for heterozygous case
- SNPs are encoded as ternary variables {0, 1, 2}

Questions?
