combinatorial algorithms for haplotype inference n.
Skip this Video
Loading SlideShow in 5 Seconds..
Combinatorial Algorithms for Haplotype Inference PowerPoint Presentation
Download Presentation
Combinatorial Algorithms for Haplotype Inference

Loading in 2 Seconds...

play fullscreen
1 / 25

Combinatorial Algorithms for Haplotype Inference - PowerPoint PPT Presentation

  • Uploaded on

Combinatorial Algorithms for Haplotype Inference. Pure Parsimony Dan Gusfield. SNP Data. A SNP is a Single Nucleotide Polymorphism - a site in the genome where two different nucleotides appear with sufficient frequency in the population (say each with 5% frequency or more).

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
Download Presentation

PowerPoint Slideshow about 'Combinatorial Algorithms for Haplotype Inference' - teleri

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
snp data
SNP Data
  • A SNP is a Single Nucleotide Polymorphism - a site in the genome where two different nucleotides appear with sufficient frequency in the population (say each with 5% frequency or more).
  • SNP maps have been compiled with a density of about 1 site per 1000.
  • SNP data is what is mostly collected in populations - it is much cheaper to collect than full sequence data, and focuses on variation in the population, which is what is of interest.
genotypes and haplotypes
Genotypes and Haplotypes

Each individual has two “copies” of each chromosome.

At each site, each chromosome has one of two alleles (states) denoted by 0 and 1 (motivated by


0 1 1 1 0 0 1 1 0

1 1 0 1 0 0 1 0 0

Two haplotypes per individual

Merge the haplotypes

2 1 2 1 0 0 1 2 0

Genotype for the individual

haplotype map project hapmap
Haplotype Map Project: HAPMAP
  • NIH lead project ($100M) to find common haplotypes in the Human population.
  • Used to try to associate genetic-influenced diseases with specific haplotypes, to either find causal haplotypes, or to find the region near causal mutations.
  • Haplotyping individuals is expensive.
haplotyping problem
Haplotyping Problem
  • Biological Problem: For disease association studies, haplotype data is more valuable than genotype data, but haplotype data is hard to collect. Genotype data is easy to collect.
  • Computational Problem: Given a set of n genotypes, determine the original set of n haplotypepairs that generated the n genotypes. This is hopeless without a genetic model.
the pure parsimony objective
For a set of genotypes, find a Smallest set H of haplotypes, such that each genotype can be explained by a pair of haplotypes in H.

For each genotype G in the input, assign a pair of haplotypes in H to explain G.

The Pure Parsimony Objective reflects simple

genetic models of how haplotypes evolve in a


The Pure Parsimony Objective
example of parsimony
Example of Parsimony




3 distinct haplotypes

set S has size 3







pure parsimony is np hard
Pure Parsimony is NP-hard

Earl Hubbel (Affymetrix) showed that Pure Parsimony

is NP-hard.

However, for a range of parameters of current interest

(50 sites and 50 genotypes) a True Parsimony

solution can be computed efficiently, using Integer

Linear Programming, and two speed-up tricks.

For larger parameters (100 sites and 50 genotypes)

A near-parsimony solution can be found efficiently.

why i did this work
Why I did this work

I wanted to answer two questions:

First, can a pure parsimony solution be computed efficiently

for a range of problem sizes of current interest in biology?

Second, how accurate is the pure parsimony solution,

compared to the correct solution (in simulations and in the

available real data), and compared to solutions given by other

existing computational methods such as PHASE.

Accuracy is measured by the number of genotypes whose

originating pair of haplotypes are returned in the solution.

the conceptual integer programming formulation
The Conceptual Integer Programming Formulation

For each genotype (individual) j, create one integer

programming variable Yij for each pair of haplotypes

whose merge creates genotype j. If j has k 2’s, then

This creates 2^(k-1) Y variables.

Create one integer programming variable Xq for

Each distinct haplotype q that appears in one of the

pairs for a Y variable.

conceptual ip
Conceptual IP

For each genotype, create an equality that says that

exactly one of its Y variables must be set to 1.

For each variable Yij, whose two haplotypes are

given variables Xq and Xq’, include an inequality

that says that if variable Yij is set to 1, then both

variables Xq and Xq’ must be set to 1.

Then the objective function is to Minimize the

sum of the X variables.


02120 Creates a Y variable Y1 for pair 00100 X1

01110 X2

and a Y variable Y2 for pair 01100 X3

00110 X4

Include the following (in)equalities into the IP

The objective function will

include the subexpression

X1 + X2 + X3 + X4

But any X variable is included

exactly once no matter how many

Y variables it is associated with.

Y1 + Y2 = 1

Y1 - X1 <= 0

Y1 - X2 <= 0

Y2 - X3 <= 0

Y2 - X4 <= 0

efficiency tricks
Efficiency Tricks

Ignore any Y variable and its two X variables if those X

variables are associated with no other Y variable. The

Resulting IP is much smaller, and can be used to find

the optimal to the conceptual IP.

Also, we need not enumerate all X pairs for a given

genotype, but can efficiently recognize the pairs we


avoiding enumeration of unneeded haplotypes
Avoiding Enumeration of unneeded haplotypes

For each pair of genotypes, G1, G2 it is easy to find all the

haplotypes that appear in an explanation for G1 and in

an explanation for G2.

Example: 0 2 1 1 0 2 0 2

0 1 1 1 2 2 0 2

0 1 1 1 0 V 0 2 V and then generate all combinations

of 0,1’s over the V sites.

So the time is O(m x # haps in both explanation sets)

the apoe data a case where the haplotypes were molecularly determined
The APOE Data: A case where the haplotypes were molecularly determined

This data has 9 sites, and 47 genotypes, each with at least two

ambiguous sites.

There are 17 distinct haplotypes in the real data.

The IP finds a True Parsimony Solution with 15 distinct haplotypes.

PHASE and HAPLOTYPER each use 15 haplotypes also.

Over 10,000 executions of Clarks method, the fewest haplotypes it

used in any solutions was 20.


Recombination is a process whereby a prefix of one sequence

is concatenated to a suffix of another sequence to create a third


Ex. ABCDEFG and TUVWXYZ could recombine to create


DNA sequences evolve by mutations of different types, but also

by recombinations.

recombination helps efficiency
Recombination Helps Efficiency

As the level of recombination increases, the efficiency

of the IP increases, because the variable elimination

trick becomes more effective, reducing the size of the

IP. The reason is that recombination makes the underlying

haplotypes in the population more varied, and also increases

the number of haplotypes in the population. Hence, each

haplotype is less likely to be part of a potential explanation of

any given genotype.

recombination hurts accuracy
Recombination Hurts Accuracy

For almost the same reason as recombination helps efficiency,

it hurts accuracy. As recombination increases, the number of

haplotypes that can be part of the explanation of more than

one genotype in the data decreases. That helps efficiency,

but it reduces the level of structure and dependency among the

potential explanations, and hence the parsimony criteria is less


how fast how good
How Fast? How Good?

Depends on the level of recombination in the underlying

data. Pure Parsimony can be computed in seconds to

minutes for most cases with 50 genotypes and up to 60

sites, faster as the level of recombination increases.

As the level of recombination increases, the accuracy

of the Pure Parsimony Solution falls, but remains within

5% of the quality of PHASE (for comparison).


For 10 sites and moderate recombination, the Pure

Parsimony solutions have the same accuracy as

PHASE and HAPLOTYPER solutions. As the

number of sites and the level of recombination increases,

PHASE and HAPLOTYPER tend to be more accurate

than the Pure Parsimony solution, but the gap is moderate.

a hybrid approach for large data sets
A Hybrid Approach for Large Data Sets

We are interested in handling 100 genotypes and 150 sites.

This is too large for the IP approach, but we can use a

hybrid approach based on Clarks Method and an IP version

of it.

generic clark method
Generic Clark Method

Basic Step:

Given a known haplotype H (original homozygote or single-site

heterozygote, or previously inferred), and an unresolved

genotype G, if G can be explained by H and another vector H’,

then call H’ a known haplotype, available for additional inferrals.

example: H 0 1 0 0 1

G 2 1 0 2 2 G is “resolved” by H and H’


H’ 1 1 0 1 0

In a single run, repeat the basic step until stuck - resolve as many

genotypes as possible in the data.

Clark (1990) Randomize choices, and do the computations many

times to find an execution (run) that explains the most genotypes.

many variations of clark
Many variations of Clark
  • Variations based on which parts are randomized.
  • We closely examine eight variations on a real data set. Variation 1 randomizes every decision - probably more than Clark originally intended.
  • Truth in advertising - we implemented our own Clark versions - did not actually use Clark’s software.
clark parsimony hybrid
Clark/Parsimony Hybrid

For low recombination, large (>60) sites

Find an execution of Clark’s method that

maximizes the number of genotypes resolved

minimizes the number of distinct haplotypes used

We can do this by mixing the Digraph View of Clark’s method

(Gusfield 2001) with the parsimony criteria, and truly find

an execution of Clark’s method that minimizes the number of

distinct haplotypes used.

On datasets where we can compute True Parsimony, this

hybrid does only a bit worse than True Parsimony.

other uses of ip
Other uses of IP

On datasets where we know the solution, find the best

that a Clark method can ever do. IP can find the best

possible execution.

On the APOE data, Clark’s method can get all get 47 correct!

In fact in a huge number of ways. (But the best we found

by actually running Clark’s method was 42 correct).

This kind of test is not possible for Statistical methods.