Combinatorial Algorithms for Haplotype Inference. Pure Parsimony Dan Gusfield. SNP Data. A SNP is a Single Nucleotide Polymorphism - a site in the genome where two different nucleotides appear with sufficient frequency in the population (say each with 5% frequency or more).
Each individual has two “copies” of each chromosome.
At each site, each chromosome has one of two alleles (states) denoted by 0 and 1 (motivated by
0 1 1 1 0 0 1 1 0
1 1 0 1 0 0 1 0 0
Two haplotypes per individual
Merge the haplotypes
2 1 2 1 0 0 1 2 0
Genotype for the individual
For each genotype G in the input, assign a pair of haplotypes in H to explain G.
The Pure Parsimony Objective reflects simple
genetic models of how haplotypes evolve in a
population.The Pure Parsimony Objective
3 distinct haplotypes
set S has size 3
Earl Hubbel (Affymetrix) showed that Pure Parsimony
However, for a range of parameters of current interest
(50 sites and 50 genotypes) a True Parsimony
solution can be computed efficiently, using Integer
Linear Programming, and two speed-up tricks.
For larger parameters (100 sites and 50 genotypes)
A near-parsimony solution can be found efficiently.
I wanted to answer two questions:
First, can a pure parsimony solution be computed efficiently
for a range of problem sizes of current interest in biology?
Second, how accurate is the pure parsimony solution,
compared to the correct solution (in simulations and in the
available real data), and compared to solutions given by other
existing computational methods such as PHASE.
Accuracy is measured by the number of genotypes whose
originating pair of haplotypes are returned in the solution.
For each genotype (individual) j, create one integer
programming variable Yij for each pair of haplotypes
whose merge creates genotype j. If j has k 2’s, then
This creates 2^(k-1) Y variables.
Create one integer programming variable Xq for
Each distinct haplotype q that appears in one of the
pairs for a Y variable.
For each genotype, create an equality that says that
exactly one of its Y variables must be set to 1.
For each variable Yij, whose two haplotypes are
given variables Xq and Xq’, include an inequality
that says that if variable Yij is set to 1, then both
variables Xq and Xq’ must be set to 1.
Then the objective function is to Minimize the
sum of the X variables.
02120 Creates a Y variable Y1 for pair 00100 X1
and a Y variable Y2 for pair 01100 X3
Include the following (in)equalities into the IP
The objective function will
include the subexpression
X1 + X2 + X3 + X4
But any X variable is included
exactly once no matter how many
Y variables it is associated with.
Y1 + Y2 = 1
Y1 - X1 <= 0
Y1 - X2 <= 0
Y2 - X3 <= 0
Y2 - X4 <= 0
Ignore any Y variable and its two X variables if those X
variables are associated with no other Y variable. The
Resulting IP is much smaller, and can be used to find
the optimal to the conceptual IP.
Also, we need not enumerate all X pairs for a given
genotype, but can efficiently recognize the pairs we
For each pair of genotypes, G1, G2 it is easy to find all the
haplotypes that appear in an explanation for G1 and in
an explanation for G2.
Example: 0 2 1 1 0 2 0 2
0 1 1 1 2 2 0 2
0 1 1 1 0 V 0 2 V and then generate all combinations
of 0,1’s over the V sites.
So the time is O(m x # haps in both explanation sets)
This data has 9 sites, and 47 genotypes, each with at least two
There are 17 distinct haplotypes in the real data.
The IP finds a True Parsimony Solution with 15 distinct haplotypes.
PHASE and HAPLOTYPER each use 15 haplotypes also.
Over 10,000 executions of Clarks method, the fewest haplotypes it
used in any solutions was 20.
Recombination is a process whereby a prefix of one sequence
is concatenated to a suffix of another sequence to create a third
Ex. ABCDEFG and TUVWXYZ could recombine to create
DNA sequences evolve by mutations of different types, but also
As the level of recombination increases, the efficiency
of the IP increases, because the variable elimination
trick becomes more effective, reducing the size of the
IP. The reason is that recombination makes the underlying
haplotypes in the population more varied, and also increases
the number of haplotypes in the population. Hence, each
haplotype is less likely to be part of a potential explanation of
any given genotype.
For almost the same reason as recombination helps efficiency,
it hurts accuracy. As recombination increases, the number of
haplotypes that can be part of the explanation of more than
one genotype in the data decreases. That helps efficiency,
but it reduces the level of structure and dependency among the
potential explanations, and hence the parsimony criteria is less
Depends on the level of recombination in the underlying
data. Pure Parsimony can be computed in seconds to
minutes for most cases with 50 genotypes and up to 60
sites, faster as the level of recombination increases.
As the level of recombination increases, the accuracy
of the Pure Parsimony Solution falls, but remains within
5% of the quality of PHASE (for comparison).
For 10 sites and moderate recombination, the Pure
Parsimony solutions have the same accuracy as
PHASE and HAPLOTYPER solutions. As the
number of sites and the level of recombination increases,
PHASE and HAPLOTYPER tend to be more accurate
than the Pure Parsimony solution, but the gap is moderate.
We are interested in handling 100 genotypes and 150 sites.
This is too large for the IP approach, but we can use a
hybrid approach based on Clarks Method and an IP version
Given a known haplotype H (original homozygote or single-site
heterozygote, or previously inferred), and an unresolved
genotype G, if G can be explained by H and another vector H’,
then call H’ a known haplotype, available for additional inferrals.
example: H 0 1 0 0 1
G 2 1 0 2 2 G is “resolved” by H and H’
H’ 1 1 0 1 0
In a single run, repeat the basic step until stuck - resolve as many
genotypes as possible in the data.
Clark (1990) Randomize choices, and do the computations many
times to find an execution (run) that explains the most genotypes.
For low recombination, large (>60) sites
Find an execution of Clark’s method that
maximizes the number of genotypes resolved
minimizes the number of distinct haplotypes used
We can do this by mixing the Digraph View of Clark’s method
(Gusfield 2001) with the parsimony criteria, and truly find
an execution of Clark’s method that minimizes the number of
distinct haplotypes used.
On datasets where we can compute True Parsimony, this
hybrid does only a bit worse than True Parsimony.
On datasets where we know the solution, find the best
that a Clark method can ever do. IP can find the best
On the APOE data, Clark’s method can get all get 47 correct!
In fact in a huge number of ways. (But the best we found
by actually running Clark’s method was 42 correct).
This kind of test is not possible for Statistical methods.