
Yufeng Wu UC Davis RECOMB 2007


Association Mapping of Complex Diseases with Ancestral Recombination Graphs: Models and Efficient Algorithms

Yufeng Wu

UC Davis

RECOMB 2007

Cases

Controls

Diploid: two sequences per individual

SNPs (alleles coded 0/1)

Problem: Where are (unobserved) disease mutations? This talk: Genealogy-based approach

Disease mutation

- Tells how individuals in a population are related
- Helps to explain diseases: disease mutations occur on branches, and all descendants carry the mutations
- Problem: How to determine the genealogy for “unrelated” individuals?
- Not easy with recombination

Diseased (case)

Healthy (control)

Individuals in current population

Suffix

Prefix

11000

0000001111

Breakpoint

- One of the principal genetic forces shaping sequence variation within species
- Two equal-length sequences generate a third, new sequence of the same length in the genealogy
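A minimal sketch of single-crossover recombination as described above: the new sequence takes the prefix of one parent and the suffix of the other. The function name `recombine` and the 0-based breakpoint convention are illustrative assumptions; the two parent sequences and the breakpoint after site 5 are read off this slide's figure.

```python
def recombine(prefix_parent: str, suffix_parent: str, breakpoint: int) -> str:
    """Return prefix_parent[:breakpoint] + suffix_parent[breakpoint:]."""
    if len(prefix_parent) != len(suffix_parent):
        raise ValueError("parents must have equal length")
    if not 0 < breakpoint < len(prefix_parent):
        raise ValueError("breakpoint must fall strictly inside the sequence")
    return prefix_parent[:breakpoint] + suffix_parent[breakpoint:]


# Prefix 11000 comes from the first parent, suffix 0000001111 from the second.
child = recombine("110001111111001", "000110000001111", 5)
print(child)  # 110000000001111
```

The recombinant has the same length as both parents, which is the property the second bullet states.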

110001111111001

000110000001111

00

1 0

0 1

10

1 1

Mutations

Recombination

10

01

00

10

11

01

00

S1 = 00

S2 = 01

S3 = 10

S4 = 11

Assumption:

At most one mutation per site

S1 = 00

S2 = 01

S3 = 10

S4 = 10
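One way to see why the first set of sequences needs a recombination while the second does not is the standard four-gamete check, which the slide does not name: if each site mutates at most once, a pair of sites showing all four combinations 00, 01, 10, 11 cannot be explained by mutations alone. The sketch below is illustrative; the function name is an assumption.

```python
from itertools import combinations


def incompatible_site_pairs(sequences):
    """Return the site pairs (i, j) at which all four gametes 00, 01, 10, 11 occur."""
    n_sites = len(sequences[0])
    pairs = []
    for i, j in combinations(range(n_sites), 2):
        gametes = {(s[i], s[j]) for s in sequences}
        if len(gametes) == 4:  # binary data: four distinct pairs means all four gametes
            pairs.append((i, j))
    return pairs


needs_recombination = ["00", "01", "10", "11"]  # S1..S4 from the first example
mutations_only = ["00", "01", "10", "10"]       # S1..S4 from the second example

print(incompatible_site_pairs(needs_recombination))  # [(0, 1)]
print(incompatible_site_pairs(mutations_only))       # []
```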

- “..the best information that we could possibly get about association is to know the full coalescent genealogy…” – Zollner and Pritchard, 2005
- But we do not know the true ARG!
- Goal: infer ARGs from sequences for association mapping
- Not easy; approximations are often used (e.g. Zollner and Pritchard)

- First practical ARG association mapping method (Minichiello and Durbin, 2006)
- Use plausible ARGs: heuristic

- My work: generate ARGs with a provable property and work with a well-defined complex disease model
- minARGs: Most parsimonious ARGs that use the minimum number of recombinations.
- Uniform sampling of minARGs: generate one minARG from the space of all minARGs with equal probability. (Sampling is a scheme often used in genealogy-based approaches)

N1=124

N2=32

Recursion

N = 124*1 + 32*2 = 188

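Reduced matrix after removing 11011 (rows): 00000, 01000, 01100, 01101, 11100, 00010, 00011

Reduced matrix after removing 01101 (rows): 00000, 01000, 01100, 11100, 00010, 11011, 00011

It turns out no other row choices contribute to the minARG space.

Rows removed: 11011, 01101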

Input matrix (rows): 00000, 01000, 01100, 01101, 11100, 00010, 11011, 00011

Assume only input sequences are generated.


2. Pick 11011 as last row to derive

3. Move to reduced matrix

188 minARGs

Input matrix (rows): 00000, 01000, 01100, 01101, 11100, 00010, 11011, 00011

Idea: Use minARG counts to select the order in which sequences are generated.

Can easily be extended to weighted sampling, e.g. generate less frequent sequences later.

1. Random value Rnd = 0.3 < 0.66

Select 11011 with prob = 124/188 = 0.66, and 01101 with prob = 32*2/188 = 0.34
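The selection step can be sketched as sampling proportional to counts. The weights below reproduce the slide's arithmetic (124*1 and 32*2, total 188); the multipliers 1 and 2 are taken as given from the slide, and the helper name `pick_last_row` is an assumption for illustration, not the TMARG implementation.

```python
from itertools import accumulate


def pick_last_row(weights, rnd):
    """Pick a key with probability proportional to its weight, using rnd in [0, 1)."""
    total = sum(weights.values())
    keys = list(weights)
    cumulative = accumulate(weights[k] / total for k in keys)
    for key, bound in zip(keys, cumulative):
        if rnd < bound:
            return key
    return keys[-1]  # guard against floating-point round-off


# Number of minARGs contributed by each choice of last row, as on the slide.
weights = {"11011": 124 * 1, "01101": 32 * 2}
total = sum(weights.values())

print(total)                        # 188
print(pick_last_row(weights, 0.3))  # 11011, since 0.3 < 124/188
```

Repeating this choice recursively on each reduced matrix, with counts recomputed at every step, is what would make the final minARG uniformly distributed over all 188.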

Possible disease mutation

- Clear separation of cases/controls: NOT expected for complex diseases!

Case

Control

1 2

Multiple disease mutations!

Cases

Controls

Diploid: two sequences per individual

Incomplete penetrance

Trying to find one tree branch that clearly separates cases and controls may not work for complex diseases!

Solution: Inference on a well-defined disease model.

SNPs

Probability that a disease mutation occurs on the branch (computed from mutation rate and branch length)

A formal model of the complex disease is needed to assess the significance of a chosen marginal tree for real data.

[Figure: tree with branch values 0.02, 0.1, 0.05, 0.08, 0.03, 0.01, 0.06, 0.07]

Disease mutations: Poisson Process

Two alleles: wild-type and mutant

cAse

PA,1: probability that a mutant sequence becomes a case

PC,1 = 1.0 - PA,1

PA,0: probability that a wild-type sequence becomes a case

PC,0 = 1.0 - PA,0

Control

[Figure: tree with branch values 0.02, 0.1, 0.05, 0.08, 0.03, 0.01, 0.06, 0.07]

PA,1 = 0.8, PC,1 = 0.2

PA,0 = 0.1, PC,0 = 0.9

- The disease model specifies a probabilistic way of assigning phenotypes for a given tree.
- But there are many trees: on which tree do the disease mutations occur?
- Phenotype likelihood: given a tree T and the case/control phenotypes of its leaves, what is the probability of observing these phenotypes on T?
- High phenotype likelihood: disease mutations may occur in T
- Computable in linear time and adopted in this work
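A minimal sketch of how such a likelihood can be computed in one bottom-up pass over the tree. It rests on assumptions that the slides only partly state: the root is wild-type, each branch independently acquires a mutation with its given probability, a mutant allele is inherited by all descendants, and each leaf becomes a case with probability PA,1 (mutant) or PA,0 (wild-type). The class and function names are illustrative, not taken from TMARG.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    branch_prob: float = 0.0          # mutation probability on the branch above this node
    phenotype: Optional[str] = None   # "case" or "control" for leaves, None otherwise
    children: List["Node"] = field(default_factory=list)


def likelihoods(node, pa_mut, pa_wild):
    """Return (L_mut, L_wild): likelihood of the leaves below node if node is mutant / wild-type."""
    if not node.children:
        is_case = node.phenotype == "case"
        return (pa_mut if is_case else 1.0 - pa_mut,
                pa_wild if is_case else 1.0 - pa_wild)
    l_mut, l_wild = 1.0, 1.0
    for child in node.children:
        c_mut, c_wild = likelihoods(child, pa_mut, pa_wild)
        l_mut *= c_mut  # below a mutant node, everything stays mutant
        l_wild *= child.branch_prob * c_mut + (1.0 - child.branch_prob) * c_wild
    return l_mut, l_wild


def phenotype_likelihood(root, pa_mut=0.8, pa_wild=0.1):
    """Likelihood of the observed leaf phenotypes, assuming a wild-type root."""
    return likelihoods(root, pa_mut, pa_wild)[1]


# Toy tree: one case leaf (branch 0.1) and one control leaf (branch 0.2).
root = Node(children=[Node(0.1, "case"), Node(0.2, "control")])
print(phenotype_likelihood(root))  # about 0.1292 = 0.17 * 0.76
```

Each node is visited once, so the running time is linear in the size of the tree under these assumptions.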

- We need to assess statistical significance of computed phenotype likelihood.
- Null model: randomly permute case/control status of leaves in the given tree.
- P-value by permutation tests: computational bottleneck!
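A generic permutation test, sketched to show where the cost comes from: the statistic is re-evaluated once per shuffled labelling. The function name, the add-one estimate, and the toy statistic are assumptions for illustration; in the setting above the statistic would be the phenotype likelihood of the tree.

```python
import random


def permutation_p_value(labels, statistic, n_perm=1000, seed=0):
    """Estimate P(statistic >= observed) under random relabelling of the leaves."""
    rng = random.Random(seed)
    observed = statistic(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)          # null model: permute case/control status
        if statistic(shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)   # add-one estimate, never exactly zero


def cases_in_first_half(labels):
    """Toy stand-in statistic: number of cases among the first half of the leaves."""
    half = len(labels) // 2
    return sum(1 for lab in labels[:half] if lab == "case")


labels = ["case"] * 4 + ["control"] * 4
p = permutation_p_value(labels, cases_in_first_half)
print(p)
```

With `n_perm` shuffles the statistic is computed `n_perm + 1` times, which is why a likelihood that is cheap for one tree still becomes a bottleneck across many trees and many permutations.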

- My result: O(n^3) algorithm computing expected value (and variance) of phenotype likelihood.
- Exact, fully deterministic method.
- But, computing P-value precisely and efficiently remains open.

Case

Control

Diploid (e.g. humans): two sequences per individual

Diploid penetrance:

PA,00: probability that an individual with two wild-type sequences becomes a case

PA,01: probability that an individual with one wild-type and one mutant sequence becomes a case

PA,11: …

Efficient computation of phenotype likelihood: stated but unresolved in Zollner and Pritchard

My result: computing phenotype likelihood with diploid penetrance is NP-hard

Simulation Results

- Average mapping error for 50 simulated datasets from Zollner and Pritchard
- Average over 50 genealogies
- Date: January, 2007

Comparison: TMARG, LATAG (Zollner and Pritchard), MARGARITA (Minichiello and Durbin).

TMARG (my program) and MARGARITA are much faster (20 times or more) than LATAG. This is important for whole-genome scans.

- Software available at: http://wwwcsif.cs.ucdavis.edu/~wuyu
- I want to thank
- Dan Gusfield
- Dan Brown
- Chuck Langley
- Yun S. Song