PH.D candidate: Lan Liu

Some Algorithmic Problems Concerning the Inference and Analysis of TagSNPs, Haplotypes and Pedigrees
Ph.D. candidate: Lan Liu
Advisor: Tao Jiang

Outline
• The haplotype inference problem
• The tagSNP selection problem
• The minimum common integer partition problem

Outline
• The haplotype inference problem
  - Biological background
  - Approximation and complexity of MRHC
  - Efficient algorithms for ZRHC
  - A linear-time algorithm for loop-free ZRHC
• The tagSNP selection problem
• The minimum common integer partition problem

Introduction
• Basic concepts
• Mendelian law: one haplotype comes from the mother and the other comes from the father.
[Figure: paternal and maternal haplotypes. Example: Mendelian experiment]

Notations and Recombinant
[Figure: a father, a mother and a child shown as genotypes and as haplotype configurations, one configuration with 0 recombinants and one with 1 recombinant; the recombinant haplotype of the child is marked.]

Pedigree
• An example: British Royal Family
• A mating loop: a cycle inside the pedigree.

Haplotype Reconstruction
• Haplotype: useful, but expensive to obtain
• Genotype: cheaper to obtain
• Reconstruct haplotypes from genotypes

Problem Definitions
• MRHC: given a pedigree and the genotype information for each member, find a haplotype configuration for each member that obeys the Mendelian law, such that the number of recombinants is minimized.
• ZRHC: the zero-recombinant version of the problem.
• Loop-free ZRHC: zero recombinants, on a pedigree with no mating loops.

Approximation and Complexity of MRHC
• The known hardness results for MRHC
  - 2-locus-MRHC: 2 loci
  - Tree-MRHC: pedigree with no mating loops

Our Hardness and Approximation Results
• Tree-MRHC: no mating loops
• Binary-tree-MRHC: 1 mate, 1 child
• Binary-tree-MRHC*: 1 mate, 1 child, with missing data
• 2-locus-MRHC: 2 loci
• 2-locus-MRHC*: 2 loci, with missing data

The ZRHC problem
• Problem definition: given a pedigree and the genotype information for each member, find a recombination-free haplotype configuration for each member that obeys the Mendelian law of inheritance.

Previous work
• Li and Jiang introduced a system of linear equations over $\mathbb{F}_2$ and presented an $O(m^3 n^3)$ time algorithm for ZRHC [LJ03], where m is the number of loci and n is the number of members in the pedigree.
• Recently, Chan et al. proposed a linear-time algorithm [CCC+06], which only works for pedigrees without mating loops.
• Methods based on fast matrix multiplication could achieve an asymptotic running time of $O(k^{2.376})$ on k equations with k unknowns.
• The Lanczos and conjugate gradient algorithms are only heuristics [GV96]. The Wiedemann algorithm has expected quadratic running time [W86].

Our Result
• We present a much faster algorithm for ZRHC.
[Figure: three linear systems Ax = b, linked by a transformation step and a redundancy elimination step, with the labels O(n) and $O(n \log^2 n \log\log n)$.]

The New Linear System
• m: number of loci; n: number of members in the pedigree.
Unknowns
• $p_j$: the paternal haplotype vector of a member j.
• $h_{j_1,j}$: the scalar describing the inheritance information between a parent $j_1$ and a child j.

The New Linear System
[Figure: a father $j_1$, a mother $j_2$ and a child j over four loci. Each member has haplotypes $p$ and $p + w$; the edges from the parents to the child carry $h_{j_1,j}$ and $h_{j_2,j}$. Values shown in the example include $p_{j_1,2} = 1$ and $p_{j_1,3} = 0$.]

The Linear System
• O(mn) equations on O(mn) unknowns.
• Given a homozygous locus i on a member j (with a child $j_1$), $p_j[i]$ and $p_{j_1}[i]$ are pre-determined.

Pedigree Graph
• A pedigree with genotype information and the corresponding pedigree graph G.
• #edges ≤ 2n.
[Figure: a 9-member pedigree annotated with genotypes, next to its pedigree graph G.]

Locus Graph
• Locus graph $G_i = (V, E_i)$, where $E_i = \{(k, j) \mid k \text{ is a parent of } j,\ w_k[i] = 1\}$.
• p-variables: variables on vertices.
• h-variables: variables on edges, shared by all locus graphs.
[Figure: (a) genotype information for a 9-member pedigree; (b) the locus graph for the 3rd locus, with zero-weight edges marked.]

An Observation
• For any cycle, or any path connecting two pre-determined vertices, in a locus graph, the sum of the h-variables along the path is a constant.
• We can use paths to denote constraints!
(proof sketch) Consider a path $j_0, j_1, \ldots, j_k$ in the locus graph $G_i$ connecting two pre-determined vertices $j_0$ and $j_k$. Along the path:
$$p_{j_0}[i] = p_{j_1}[i] + d_{j_0,j_1} + h_{j_0,j_1}$$
$$p_{j_1}[i] = p_{j_2}[i] + d_{j_1,j_2} + h_{j_1,j_2}$$
$$p_{j_2}[i] = p_{j_3}[i] + d_{j_2,j_3} + h_{j_2,j_3}$$
$$\cdots$$
$$p_{j_{k-1}}[i] = p_{j_k}[i] + d_{j_{k-1},j_k} + h_{j_{k-1},j_k}$$
Adding these equations shows that $h_{j_0,j_1} + h_{j_1,j_2} + \cdots + h_{j_{k-1},j_k}$ is a constant.
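The summation behind this observation can be written out explicitly. The following is a sketch that assumes all arithmetic is over $\mathbb{F}_2$ (so every term is its own negative) and that the $d$ terms are known constants, as the slide appears to treat them.

```latex
\begin{align*}
\sum_{t=0}^{k-1} p_{j_t}[i]
  &= \sum_{t=0}^{k-1} \left( p_{j_{t+1}}[i] + d_{j_t,j_{t+1}} + h_{j_t,j_{t+1}} \right) \\
\Longrightarrow \quad
\sum_{t=0}^{k-1} h_{j_t,j_{t+1}}
  &= p_{j_0}[i] + p_{j_k}[i] + \sum_{t=0}^{k-1} d_{j_t,j_{t+1}}
\end{align*}
```

The intermediate values $p_{j_1}[i], \ldots, p_{j_{k-1}}[i]$ appear on both sides and cancel. When $j_0$ and $j_k$ are pre-determined the right-hand side is known, and on a cycle ($j_k = j_0$) the two endpoint terms cancel as well.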

Examples of Linear Constraints
• (a) 1st locus graph: $h_{6,8} + h_{8,9} = 1$
• (b) 2nd locus graph: $h_{3,5} + h_{3,6} + h_{2,5} + h_{2,6} = 0$
• (c) 3rd locus graph: $h_{4,9} + h_{2,4} + h_{2,5} + h_{3,5} + h_{3,6} + h_{6,8} = 0$
[Figure: the three locus graphs of the 9-member pedigree, with pre-determined vertices labeled 0 or 1 and the remaining vertices labeled "?".]

Linear Constraints
• Obviously, the linear constraints are necessary. We can also show that these constraints are sufficient.
• Moreover, we can bound the number of constraints in each locus graph by O(n), while the trivial analysis gives an upper bound of $O(n^2)$.
• Total number of constraints = O(mn).

Traditional method
• Solve h-variables and p-variables together.
• O(mn) equations on O(mn) unknowns: O(mn) p-variables and O(n) h-variables.
Our method
• Solve h-variables and p-variables separately.
• O(mn) linear equations on O(n) h-variables.
The ZRHC-PHASE algorithm
Algorithm ZRHC_PHASE
input: a pedigree G = (V, E) and genotypes $\{g_j\}$
output: a general solution of $\{p_j\}$
begin
  Step 1. Preprocessing.
  Step 2. Linear constraint generation on h-variables.
  Step 3. Solve h-variables by Gaussian elimination.
  Step 4. Solve the p-variables by propagation from pre-determined p-variables to others.
end
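Step 3 relies on Gaussian elimination over $\mathbb{F}_2$. A minimal Python sketch of such a solver follows. It is not the authors' implementation: the function name `solve_gf2` and the sparse row encoding (each equation is a list of variable indices) are assumptions made for illustration.

```python
def solve_gf2(rows, rhs, num_vars):
    """Solve A x = b over F_2 by Gauss-Jordan elimination.

    rows[r] lists the variable indices with coefficient 1 in equation r,
    and rhs[r] is 0 or 1. Returns one solution as a list of bits (free
    variables set to 0), or None if the system is inconsistent.
    """
    var_mask = (1 << num_vars) - 1
    pivots = {}  # pivot column -> reduced equation (bit num_vars holds the rhs)
    for cols, b in zip(rows, rhs):
        eq = b << num_vars
        for v in cols:
            eq ^= 1 << v  # XOR, so a variable listed twice cancels out
        # Clear every existing pivot column from the new equation.
        for col, pivot_eq in pivots.items():
            if (eq >> col) & 1:
                eq ^= pivot_eq
        if (eq & var_mask) == 0:
            if eq >> num_vars:
                return None  # the equation reduced to 0 = 1
            continue  # redundant equation, nothing to add
        col = (eq & var_mask).bit_length() - 1
        # Clear the new pivot column from the older pivot equations.
        for c in pivots:
            if (pivots[c] >> col) & 1:
                pivots[c] ^= eq
        pivots[col] = eq
    x = [0] * num_vars
    for col, pivot_eq in pivots.items():
        x[col] = (pivot_eq >> num_vars) & 1
    return x
```

For example, `solve_gf2([[0, 1]], [1], 2)` encodes a single constraint of the form $h_a + h_b = 1$ and returns an assignment in which exactly one of the two variables is 1.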

Our Method
[Figure: three linear systems Ax = b, linked by a transformation step and a redundancy elimination step, with the labels O(n) and $O(n \log^2 n \log\log n)$.]

Redundant Equation Elimination
Key lemma
• An observation: given a cycle $j_0, j_1, \ldots, j_k$, assume that there are constraints between each pair of vertices.
• Originally, there are $O(k^2)$ constraints. Notice that they are not independent.
• We can replace the original constraints by an equivalent set of constraints of size O(k).
• Remove the redundant equations without solving them!
[Figure: a cycle $j_0, j_1, j_2, \ldots, j_{k-2}, j_{k-1}, j_k$ with pairwise constraints such as $j_0 \sim j_2$, $j_2 \sim j_{k-1}$ and $j_0 \sim j_{k-1}$.]

Redundant Equation Elimination
• Given a spanning tree, the stretch of an edge (k, j) is defined as the length of the unique path between k and j on the tree.
• Elkin, Emek, Spielman and Teng show that any graph has a low-stretch spanning tree with average stretch $O(\log^2 n \log\log n)$.
• The number of irredundant constraints can be bounded by the sum of cycle lengths, which is further bounded by the sum of stretches, $O(n \log^2 n \log\log n)$.
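To make the definition of stretch concrete, here is a small Python sketch that computes the stretch of every edge with respect to a given spanning tree. It does not construct a low-stretch tree; the function name `tree_stretches` and the convention that vertices are numbered 0 to n-1 with the tree rooted at vertex 0 are assumptions for illustration.

```python
from collections import deque


def tree_stretches(n, tree_edges, all_edges):
    """Map each edge in all_edges to its stretch: the number of tree edges
    on the unique tree path between its endpoints.

    Assumes tree_edges forms a spanning tree on vertices 0..n-1.
    """
    adj = {v: [] for v in range(n)}
    for u, v in tree_edges:
        adj[u].append(v)
        adj[v].append(u)

    # BFS from vertex 0 records the parent and depth of every vertex.
    parent = {0: None}
    depth = {0: 0}
    queue = deque([0])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in depth:
                parent[v] = u
                depth[v] = depth[u] + 1
                queue.append(v)

    def path_length(u, v):
        # Walk the deeper endpoint upward until the two endpoints meet.
        dist = 0
        while u != v:
            if depth[u] < depth[v]:
                u, v = v, u
            u = parent[u]
            dist += 1
        return dist

    return {(u, v): path_length(u, v) for u, v in all_edges}
```

A non-tree edge with stretch s closes a cycle of length s + 1 with the tree path, which is why the sum of stretches controls the sum of cycle lengths.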

The Loop-Free ZRHC problem
• Problem definition: given a pedigree without mating loops and the genotype information for each member, find a recombination-free haplotype configuration for each member that obeys the Mendelian law of inheritance.

Constraint Graphs
• Given the constraints in a pedigree graph, we can construct the corresponding constraint graph.
[Figure: an example. (a) A pedigree graph with constraints; (b) the corresponding constraint graph.]

A Key Lemma
• There exists a solution to the loop-free ZRHC problem if and only if the weight sum of every cycle C in the corresponding constraint graph is 0.
(proof sketch)
• "=>": Each h-variable occurs an even number of times in the constraint set S corresponding to C, so the sum of the h-variables in S is 0 over $\mathbb{F}_2$. This sum is equal to the weight sum of C, so the weight sum of C is 0. The constraints in S are not independent!
• "<=": Done by a construction later.
[Figure: (a) the pedigree graph; (b) the corresponding constraint graph.]
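The cycle condition in the lemma can be tested without enumerating cycles. The Python sketch below assumes the constraint graph is given as a list of edges `(u, v, w)` with weights `w` in {0, 1} (the slides do not spell out the encoding, so this format and the function name are illustrative). It assigns each vertex a 0/1 potential by BFS and checks that every edge weight equals the XOR of its endpoint potentials, which holds for all edges exactly when every cycle has weight sum 0 over $\mathbb{F}_2$.

```python
from collections import deque


def all_cycles_have_zero_weight(num_vertices, weighted_edges):
    """Return True iff every cycle has even total weight (sum 0 over F_2)."""
    adj = [[] for _ in range(num_vertices)]
    for u, v, w in weighted_edges:
        adj[u].append((v, w))
        adj[v].append((u, w))

    potential = [None] * num_vertices
    for start in range(num_vertices):
        if potential[start] is not None:
            continue  # already reached from an earlier component root
        potential[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v, w in adj[u]:
                expected = potential[u] ^ w
                if potential[v] is None:
                    potential[v] = expected
                    queue.append(v)
                elif potential[v] != expected:
                    return False  # this edge closes a cycle of odd weight
    return True
```

The check visits each vertex and edge a constant number of times, so it runs in time linear in the size of the constraint graph.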

A Mapping from Constraints to Edges
• The constraints forming a spanning forest in the constraint graph are sufficient to represent all constraints.
• There are at most n-1 independent constraints.
• We can construct an injective mapping f from the independent constraints to edges in the pedigree graph. Each constraint is mapped to an edge on the path corresponding to the constraint.
[Figure: (a) a spanning forest for the constraint graph; (b) the pedigree graph.]

The ZRHC-PHASE algorithm
Algorithm ZRHC_PHASE
input: a pedigree G = (V, E) and genotypes $\{g_j\}$
output: a general solution of $\{p_j\}$
begin
  Step 1. Preprocessing.
  Step 2. Linear constraint generation on h-variables.
  Step 3. Solve h-variables by Gaussian elimination. (It takes $O(n^3)$ time!)
  Step 4. Solve the p-variables by propagation from pre-determined p-variables to others.
end

Solving h-variables
• In order to obtain a linear-time algorithm, we want to avoid the Gaussian elimination method.
An observation
Given a constraint along a path $j_0, j_1, \ldots, j_{k-1}, j_k$:
$$h_{j_0,j_1} + h_{j_1,j_2} + \cdots + h_{j_{k-1},j_k} = b$$
We can solve the constraint in the following way:
• Assign the h-variables on edges $(j_0, j_1), (j_1, j_2), \ldots, (j_{k-2}, j_{k-1})$ arbitrarily.
• Assign the h-variable on the last edge $(j_{k-1}, j_k)$ a fixed value to satisfy the constraint:
$$h_{j_{k-1},j_k} = h_{j_0,j_1} + \cdots + h_{j_{k-2},j_{k-1}} + b$$
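The observation for a single path constraint can be sketched in a few lines of Python. The function name `solve_path_constraint` and the use of a seeded random generator for the "arbitrary" choices are assumptions for illustration; any choice of the first k-1 values works.

```python
import random


def solve_path_constraint(k, b, rng=None):
    """Return bits h[0..k-1] with h[0] + ... + h[k-1] = b over F_2 (k >= 1)."""
    rng = rng or random.Random(0)
    # The first k-1 variables are free: pick them arbitrarily.
    h = [rng.randint(0, 1) for _ in range(k - 1)]
    # The last variable is forced: it equals b plus the sum of the others.
    last = b
    for bit in h:
        last ^= bit
    h.append(last)
    return h
```

No elimination is needed: the last variable absorbs whatever parity the free choices produced.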

Solving h-variables Based on the Mapping f
• We have constructed the injective mapping f: S -> E, where S is the constraint set and E is the edge set.
• We solve h-variables as follows:
  - For each h-variable corresponding to an edge e not in f(S), assign an arbitrary value.
  - For each h-variable corresponding to an edge e in f(S), assign a fixed value based on the constraint $f^{-1}(e)$, such that the constraint is satisfied.
• h-variables can be solved by a single BFS traversal.

Motivation
• With the rapid development of genotyping technologies, there are more than 10 million verified single-nucleotide polymorphisms (SNPs) in dbSNP.
• We aim to select a subset of informative SNPs (i.e. tagSNPs) to reduce the cost of genotyping all SNPs and performing disease association mapping.

Linkage Disequilibrium Statistics
• Given a pair of genetic markers 1 and 2, the $r^2$ statistic is
$$r^2 = \frac{(p_{AB} - p_{A\cdot}\, p_{\cdot B})^2}{p_{A\cdot}(1 - p_{A\cdot})\, p_{\cdot B}(1 - p_{\cdot B})}$$
• If $r^2$ is no less than a given threshold $r_0$, marker 1 (or marker 2) can tag marker 2 (or marker 1, respectively).
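As a worked illustration of the $r^2$ formula, the Python sketch below evaluates it from three frequencies. The parameter names are assumptions: `p_ab` stands for the frequency of the haplotype carrying allele A at marker 1 and allele B at marker 2, and `p_a`, `p_b` for the marginal frequencies of A and B.

```python
def r_squared(p_ab, p_a, p_b):
    """r^2 = (p_ab - p_a * p_b)^2 / (p_a (1 - p_a) p_b (1 - p_b))."""
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when a marker is monomorphic")
    return (p_ab - p_a * p_b) ** 2 / denom


def can_tag(p_ab, p_a, p_b, r0):
    """Two markers can tag each other when r^2 is no less than r0."""
    return r_squared(p_ab, p_a, p_b) >= r0
```

With `p_a = p_b = p_ab = 0.5` the two markers always co-occur and the statistic is 1; with `p_ab = p_a * p_b` the markers are independent and it is 0.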

The TagSNP Selection Problem
• Given a set V of SNP markers and LD patterns $E = \{(v_{j_1}, v_{j_2}) \mid r^2(v_{j_1}, v_{j_2}) \ge r_0,\ v_{j_1}, v_{j_2} \in V\}$ for a given threshold $r_0$, we want to select a subset V' of minimum cardinality such that for any v in V there exists a v' in V' with $r^2(v, v') \ge r_0$.
• If we define G = (V, E), a tagSNP set is equivalent to a dominating set on G.
[Figure: (a) SNP markers and their LD patterns in a population; (b) tagSNPs for the population.]
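Because a tagSNP set is a dominating set of G, the standard greedy heuristic for dominating set gives a simple baseline. The Python sketch below is that generic heuristic, not the authors' algorithm, and it does not guarantee a minimum set; the function name and the input format (a list of markers and a list of LD pairs) are assumptions.

```python
def greedy_tag_snps(markers, ld_pairs):
    """Greedy dominating set: repeatedly pick the marker that tags the most
    still-untagged markers. Every marker tags itself."""
    neighbors = {v: {v} for v in markers}
    for a, b in ld_pairs:
        neighbors[a].add(b)
        neighbors[b].add(a)

    untagged = set(markers)
    chosen = []
    while untagged:
        # Ties are broken by the order of `markers`.
        best = max(markers, key=lambda v: len(neighbors[v] & untagged))
        chosen.append(best)
        untagged -= neighbors[best]
    return chosen
```

For a star-shaped LD pattern with center 0, leaves 1 to 3 and an isolated marker 4, the heuristic picks the center first and then the isolated marker.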

TagSNP Selection across Populations
• In two populations with different evolutionary histories, a pair of SNPs having remarkably different marker frequencies and very weak LD may show strong LD in the admixed population.
• Therefore, tagSNPs picked from the combined populations or from one of the populations might not be sufficient to capture the variations in all populations.

Problem Definition
• Given a set of SNP markers and LD patterns in multiple populations, we want to find a minimum common tagSNP set, i.e. a smallest set of SNPs that is a tagSNP set for each of the populations.
• This problem is called the minimum common tagSNP selection problem (MCTS).
[Figure: (a) SNP markers and their LD patterns in two populations; (b) the minimum tagSNP set for these two populations.]

Our Algorithms
• The MCTS problem can be easily formulated as an integer linear program.
• We first apply some data reduction rules, then use one of the following algorithms:
  - A greedy algorithm: GreedyTag
  - A Lagrangian relaxation algorithm: LRTag
• We calculate both the upper bound (i.e. the number of tagSNPs obtained by our algorithms) and the lower bound (i.e. the minimum number of tagSNPs needed).
  - Lower bound: GreedyTag_lb and LRTag_lb
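The slides do not give the details of GreedyTag, so the following Python sketch is only a generic greedy heuristic for the common tagSNP problem, with the data reduction rules omitted. It treats every (population, SNP) pair as an item to be covered and repeatedly picks the SNP covering the most uncovered pairs. The function name and the input format (a dict from population name to a list of LD pairs) are assumptions.

```python
def greedy_common_tags(markers, ld_by_population):
    """Pick SNPs until every SNP is tagged in every population."""
    # covers[v] = set of (population, snp) pairs that choosing v would tag.
    covers = {v: set() for v in markers}
    for pop, pairs in ld_by_population.items():
        for v in markers:
            covers[v].add((pop, v))  # a SNP tags itself in each population
        for a, b in pairs:
            covers[a].add((pop, b))
            covers[b].add((pop, a))

    uncovered = {(pop, v) for pop in ld_by_population for v in markers}
    chosen = []
    while uncovered:
        # Ties are broken by the order of `markers`.
        best = max(markers, key=lambda v: len(covers[v] & uncovered))
        chosen.append(best)
        uncovered -= covers[best]
    return chosen
```

In the usage below, SNP 0 alone tags everything in population "A", but SNP 2 has no LD partner in population "B", so a common set needs both.

```python
ld = {"A": [(0, 1), (0, 2)], "B": [(0, 1)]}
print(greedy_common_tags([0, 1, 2], ld))  # [0, 2]
```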

Experimental Result
• We apply our algorithms to real HapMap data (release #19, NCBI build 34, October 2005).
• There are four populations in the HapMap data:
  - CEU: people of European descent.
  - CHB: Chinese people from Beijing.
  - JPT: Japanese people from Tokyo.
  - YRI: Yoruba people of Ibadan, Nigeria.
• We select tagSNPs for the following two datasets:
  - ENCODE regions: all 10 ENCODE regions, with 10,859 markers in total.
  - Human genome: chromosomes 1 to 22, with 2,862,454 markers in total.

Experimental Result for ENCODE Regions
• We compare our GreedyTag and LRTag with MultiPop-TagSelect (MPS).
• With the $r^2$ threshold set to 0.5, the gap between LRTag_lb and LRTag is at most two for each ENCODE region and six in total over all ENCODE regions.
• With the $r^2$ threshold set to 0.8, there is no gap.

Experimental Result for Human Genome
• On the entire human genome with 2,862,454 SNPs, the gap between our solution and the lower bound is 1061 SNPs with the $r^2$ threshold set to 0.5.
• The gap is 142 SNPs with the $r^2$ threshold set to 0.8.
• The numbers of tagSNPs selected by our algorithms are almost optimal.

Problem Definitions
• P(n): given an integer n, a partition is a multiset of positive integers, say $\{n_1, n_2, \ldots, n_r\}$, such that $\sum_{i=1}^{r} n_i = n$.
  Example: given n = 4, {2,2} is a P(4); given n = 3, {3} is a P(3).
• IP(S): given a multiset $S = \{x_1, \ldots, x_m\}$, an integer partition of S is a disjoint union of partitions $P(x_i)$, one for each $x_i$ in S.
  Example: given S = {3, 3, 4}, {2,2,3,3} is an IP({3,3,4}).
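The definition of IP(S) can be checked by brute force on small inputs. The Python sketch below assumes that a multiset is an IP(S) exactly when its parts can be grouped into blocks whose sums are the elements of S, which is how the example {2,2,3,3} for S = {3,3,4} reads. The function name is illustrative, and the backtracking search is exponential in the worst case.

```python
def is_integer_partition(parts, S):
    """Return True iff `parts` can be split into groups summing to the
    elements of S, one group per element (all numbers positive)."""
    if sum(parts) != sum(S) or any(x <= 0 for x in parts):
        return False
    remaining = list(S)                   # capacity still to fill per element
    parts = sorted(parts, reverse=True)   # placing large parts first prunes early

    def place(i):
        if i == len(parts):
            return True  # totals match, so every capacity is exactly 0
        tried = set()
        for b in range(len(remaining)):
            cap = remaining[b]
            if cap >= parts[i] and cap not in tried:
                tried.add(cap)            # equal capacities are interchangeable
                remaining[b] -= parts[i]
                if place(i + 1):
                    return True
                remaining[b] += parts[i]  # undo and try another element
        return False

    return place(0)
```

For instance, `is_integer_partition([2, 2, 3, 3], [3, 3, 4])` holds because 3, 3 and 2 + 2 reproduce the elements of S.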

Examples
• CIP($S_1, S_2, \ldots, S_k$): given multisets $S_1, S_2, \ldots, S_k$, a common integer partition is a multiset that is an integer partition of every $S_i$.
  Example: given S = {3, 3, 4} and T = {2,2,6}, {2,2,3,3} is a CIP(S,T); {1,1,2,2,4} is also a CIP(S,T).
• MCIP($S_1, S_2, \ldots, S_k$): a common integer partition with the minimum cardinality.
  Example: {2,2,3,3} is an MCIP(S,T).
• #P(100) = 190,569,292
• MCIP is NP-hard.
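The count #P(100) can be reproduced with the standard dynamic program for counting partitions, which illustrates how quickly the search space grows. The function name `count_partitions` is an assumption for illustration.

```python
def count_partitions(n):
    """Number of ways to write n as an unordered sum of positive integers."""
    ways = [1] + [0] * n  # ways[t] = partitions of t using the parts seen so far
    for part in range(1, n + 1):
        for total in range(part, n + 1):
            ways[total] += ways[total - part]
    return ways[n]


print(count_partitions(100))  # 190569292
```

Even a single integer as small as 100 has about 190 million partitions, so enumerating candidate partitions is not a practical way to solve MCIP.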

Biological Applications (1)
• Genetic distance between two genomes
• The distance between two strings, e.g. "abcdefghijkh" and "hijkhefgabcd"
• Minimum Common Substring Partition
[Figure: the two strings cut into the common blocks "abcd", "efg" and "hijkh", which appear in a different order in each string.]

Biological Applications (2)
• MCIP is a special case of Minimum Common Substring Partition (MCSP).
[Figure: correspondence between MCSP(S, T) and MCIP(S', T'), where $S' = \{x_1, x_2, \ldots, x_m\}$ and $T' = \{y_1, y_2, \ldots, y_n\}$.]
