Combinatorial Optimization in Computational Biology

Combinatorial Optimization in Computational Biology Dan Gusfield Computer Science, UC Davis

Some Applications of optimization methods Dynamic Programming in sequence alignment TSP and Euler paths in DNA sequencing Integer Programming in Haplotype inference Graphic Matroid Recognition in Haplotype inference Integer Programming for Virus Identification Weighted matching in genetic partition analysis

Apology This talk focuses (mostly) on my own work, for simplicity of preparation. But, many other wonderful examples exist in the literature. At present, there is no coherent story about the ways that combinatorial optimization arises in Computational Biology. Must be alert for meaningful applications.

Topic 1: Sequence Alignment Intended to show the similarity of two sequences (strings). Definition: An alignment of two strings S and S’ is obtained by inserting spaces into, or at the ends of, the two strings, so that the resulting strings (including the spaces) have the same lengths.

Matches, Mismatches and Indels Two aligned, identical characters in an alignment are a match. Two aligned, unequal characters are a mismatch. A character aligned with a space, represents an indel (insertion/deletion). A A C T A C T -- C C T A A C A C T -- -- -- -- C T C C T A C C T -- -- T A C T T T the bars show mismatches 10 matches, 2 mismatches, 7 indels

Basic Algorithmic Problem Find an alignment of the two strings that Maximizes the # matches - # mismatches - # spaces in the alignment of strings S and S’. That maximum number defines the similarity of the two strings. Also called Optimal Global Alignment Biologically, under the point mutation model, a mismatch represents a mutation, while an indel represents a historical insertion or deletion of a single character.

Solution By Dynamic Programming Def: Let V(I, J) be the Maximimum Value obtainable in any alignment of the first I characters of S and the first J characters of S’. If S has N characters and S’ has M characters, then we want to compute V(N, M), and the associated alignment. Let M(I, J) = -1 if S(I) mismatches S’(J) and M(I, J) = 1 if S(I) matches S’(J)

DP Solution The DP recurrences are: Base Cases General Cases I and J aligned I opposite a space J opposite a space They can be evaluated in O(NM) operations.

Many Variations on the Theme The most important alignment variant is Optimal Local Alignment: Given S and S’ find a substring A of S, and a substring B of S’ such that the value of the Global Alignment of A and B is maximized over all pairs of substrings of S and S’. The point is to find a pair of substrings of S and S’ which are highly similar to each other, even if S and S’ are not. Many biological reasons for this: DNA problems, Protein Domains etc. Naively, we would need to do O((NM)**2) global alignments, for a time bound of O((NM)**3) operations overall.

DP solution to Local Alignment But the optimal value is no longer V(N, M) as in global alignment. Instead the optimal Local Alignment Value is the largest V value computed over all the NM values. The time is again O(NM). This is the Smith-Waterman Local Alignment Algorithm.

Adding Operation Weights Find an alignment of the two strings that Maximizes Vm x (# matches) - Cm x (# mismatches) - Cs x (# spaces) in the alignment of strings S and S’. The modification to the DP is immediate, but now the modeling question is what Vm, Cm and Cs should be. Those values greatly determine the biological utility of the alignment. This leads to Parametric Alignment: Find the optimal alignment as a function of unknown values of Vm, Cm, and Cs

Parametric Sequence Alignment As a function of two selected weight parameters Vm, Cm, Cs, Cg, decompose the two dimensional parameter space into convex polygons, so that in each polygon, an alignment inside the polygon is optimal for any point in the polygon, and nowhere else. The time for this decomposition is O(NM) per polygon. In the case of global alignment, the number of polygons is bounded by 0.88(N^(2/3)), where N < M. package XPARAL to find the polygons is available on the web. Gusfield, Naor, Stelling in several papers (see the web)

Adding Gaps to the Model A Gap in an alignment is a contiguous run of spaces in one sequence. has six spaces in three gaps Example: A T T -- -- -- C C -- -- -- T C T G C C C Now the objective function for alignment is Maximize Vm x (#matches) - Cm x (#mis) - Cs x (#spaces) - Cg x (#gaps) This is the affine gap model, the dominant model for studying protein sequence similarity. The choice of Cs and Cg are critical for the biological utility. The DP in this case can also be solved in O(NM) time.

Topic 2: Sequencing By Hybridization A proposed approach to sequencing DNA The technology allows you to determine all the K-tuples, for a fixed K, that are contained in a DNA string S. The goal is to try to determine the string S from the given K-tuples. One can cast this problem as a Hamilton Path problem on a graph, but it is more productive to view it as an Euler Path Problem. This approach was introduced by Pavel Pavzner.

I will focus on the results in Graph Traversals, Genes and Matroids: An Efficient Special Case of the Traveling Salesman Problem D. Gusfield, R. Karp, L. Wang, P. Stelling Discrete Applied Math 88 special issue on Computational Molecular Biology, 1998, p. 167-180.

Sequencing by Hybridization Example: S = ACTTAAACGCGCACAA K = 3 So we learn that ACT, CTT etc. are in S, but we learn these in no predictable order.

The Hamilton Path View CGC ACG add a C on the left Each node represents a K-tuple, and a directed edge from one node to another if the second K-tuple can be obtained from the first by deleting the left character, and adding a right character: i.e, a shift and add. Then a Hamilton Path in this graph generates a sequence S’ that has the same K-tuples as S

The Euler Path View The Euler Path view is more productive, leading to more efficient algorithms both for the base problem, and for variants of the problem. Due to P. Pevzner

Graph G: One node for every k-1 tuple, and one edge for every observed k-tuple A A C AA AC CA A C A A GC G T C TA A T CG CT TT Observed triples from S: AAA, AAC, ACA, CAA, CAC, ACG, CGC, GCA, GCC, ACT, CTT, TTA, TAA

Any Euler path in G produces a string which has the same K-tuples as the original string S. So if there is only one Euler path in G, then you have reconstructed S. The problem is that typically there is more than one Euler path. So how to choose one Euler path as a most promising one? That is, one that is most likely to correspond to S? Frequently there is additional information about the K-tuples, for example the rough distances between K-tuples in S. So, we can give a value to each pair of successive edges in G, which roughly reflects the probability that the two edges would appear successively in the Euler Path for S. Another way to state this is that we evaluate the goodness of S’ according to the (k+1)-tuples that it contains.

Back to the Hamilton View Now we are back to the Hamilton Path view, where each node is a K-tuple of S, but now each edge has a distance reflecting the goodness of any (K+1)-tuple that would be generated by a Hamilton path, and we want the Maximum distance Hamilton Path in the graph. This sounds hard, but there is structure in this problem: the graphs are not arbitrary.

The structure of the Hamilton Graph The Hamilton Graph for the data, where nodes represent K-tuples in S, and edges represent (K+1)-tuples in a solution string S’, is the line di-graph of the Euler Graph for the data, where nodes represent (K-1) tuples, and edges represent K-tuples.

How Euler Relates to Hamilton The Hamilton graph is the line di-graph L(G) of the Euler Graph G. Each edge of G becomes a node in L(G), and each edge in L(G) represents a successive pair of edges in G. Each red edge has the value of the successive pair of black edges. So the problem now is to find a Maximum Value TSPath.

In the case that each node has only two in and two out edges (which happens when using a simplified DNA alphabet), this TSP problem has a polynomial time solution. In fact, if L(G) is any line digraph where every node has in and out degrees at most 2, then even for arbitrary edge distances, the TSP problem can be solved in O(n log n) time. Moreover, if every edge value is distinct, then the TSP solution is unique. The Hamilton paths for such 2-in, 2-out-degree graphs have a matroid structure that reveals many features of the TSP solutions. When degrees are greater than 2, we branch-and-bound down to cases of degree 2, which we then can solve optimally.

Approximations When the in and out-degrees are bounded by X, a greedy algorithm is guaranteed to find a Hamilton path whose value is within 1/X that of the optimal. So 1/3 or 1/4 for the case of DNA. Use of this approximation can speed up the branch-and- bound until the in and out-degree bound of 2 is reached.

Topic 3: Combinatorial Algorithms for Haplotype Inference True Parsimony and Perfect Phylogeny Dan Gusfield

Genotypes and Haplotypes Each individual has two “copies” of each chromosome. At each site, each chromosome has one of two alleles (states) denoted by 0 and 1 (motivated by SNPs) 0 1 1 1 0 0 1 1 0 1 1 0 1 0 0 1 0 0 Two haplotypes per individual Merge the haplotypes 2 1 2 1 0 0 1 2 0 Genotype for the individual

Haplotyping Problem • Biological Problem: For disease association studies, haplotype data is more valuable than genotype data, but haplotype data is hard to collect. Genotype data is easy to collect. • Computational Problem: Given a set of n genotypes, determine the original set of n haplotypepairs that generated the n genotypes. This is hopeless without a genetic model.

We consider two models in this talk • Pure (True) Parsimony Model, Gusfield unpublished (Note added 3/18/03 technical report is available on the web, and will appear in the 2003 CPM conference) • Perfect Phylogeny Model, Gusfield RECOMB Proceedings April 2002

Topic 3a: The Pure Parsimony Objective For a set of genotypes, find a Smallest set H of haplotypes, such that each genotype can be explained by a pair of haplotypes in H.

Example of Parsimony 00100 01110 02120 3 distinct haplotypes set S has size 3 01110 10110 22110 00100 10110 20120

Pure Parsimony is NP-hard Earl Hubbel (Affymetrix) showed that Pure Parsimony is NP-hard. However, for a range of parameters of current interest (number of sites and genotypes) a Pure Parsimony solution can be computed efficiently, using Integer Linear Programming, and one speed-up trick. For larger parameters (100 sites and 50 genotypes) A near-parsimony solution can be found efficiently.

The Conceptual Integer Programming Formulation For each genotype (individual) j, create one integer programming variable Yij for each pair of haplotypes whose merge creates genotype j. If j has k 2’s, then This creates 2^(k-1) Y variables. Create one integer programming variable Xq for Each distinct haplotype q that appears in one of the pairs for a Y variable.

Conceptual IP For each genotype, create an equality that says that exactly one of its Y variables must be set to 1. For each variable Yij, whose two haplotypes are given variables Xq and Xq’, include an inequality that says that if variable Yij is set to 1, then both variables Xq and Xq’ must be set to 1. Then the objective function is to Minimize the sum of the X variables.

Example 02120 Creates a Y variable Y1 for pair 00100 X1 01110 X2 and a Y variable Y2 for pair 01100 X3 00110 X4 Include the following (in)equalities into the IP The objective function will include the subexpression X1 + X2 + X3 + X4 But any X variable is included exactly once no matter how many Y variables it is associated with. Y1 + Y2 = 1 Y1 - X1 <= 0 Y1 - X2 <= 0 Y2 - X3 <= 0 Y2 - X4 <= 0

Efficiency Trick Ignore any Y variable and its two X variables if those X variables are associated with no other Y variable. The Resulting IP is much smaller, and can be used to find the optimal to the conceptual IP. Also, we need not first enumerate all X pairs for a given genotype, but can efficiently recognize the pairs we need. Another simple trick.

How Fast? How Good? Depends on the level of recombination in the underlying data. Pure Parsimony can be computed in seconds to minutes for most cases with 50 genotypes and up to 60 sites, faster as the level of recombination increases. As the level of recombination increases, the accuracy of the Pure Parsimony Solution falls, but remains within 10% of the quality of PHASE (for comparison).

Topic 3b: Haplotyping via Perfect Phylogeny Conceptual Framework and Efficient (almost linear-time) Solutions Dan Gusfield U.C. Davis

The Perfect Phylogeny Model We assume that the evolution of extant haplotypes can be displayed on a rooted, directed tree, with the all-0 haplotype at the root, where each site changes from 0 to 1 on exactly one edge, and each extant haplotype is created by accumulating the changes on a path from the root to a leaf, where that haplotype is displayed. In other words, the extant haplotypes evolved along a perfect phylogeny with all-0 root.

The Perfect Phylogeny Model sites 12345 Ancestral haplotype 00000 1 4 Site mutations on edges 3 00010 2 10100 5 10000 01010 01011 Extant haplotypes at the leaves

Justification for Perfect Phylogeny Model • In the absence of recombination each haplotype of any individual has a single parent, so tracing back the history of the haplotypes in a population gives a tree. • Recent strong evidence for long regions of DNA with no recombination. Key to the NIH haplotype mapping project. (See NYT October 30, 2002) • Mutations are rare at selected sites, so are assumed non-recurrent. • Connection with coalescent models.

The Haplotype PhylogenyProblem Given a set of genotypes S, find an explaining set of haplotypes that fits a perfect phylogeny. sites A haplotype pair explains a genotype if the merge of the haplotypes creates the genotype. Example: The merge of 0 1 and 1 0 explains 2 2. S Genotype matrix

The Haplotype PhylogenyProblem Given a set of genotypes, find an explaining set of haplotypes that fits a perfect phylogeny

The Haplotype PhylogenyProblem Given a set of genotypes, find an explaining set of haplotypes that fits a perfect phylogeny 00 1 2 b 00 a a b c c 01 01 10 10 10

The Alternative Explanation No tree possible for this explanation

When does a set of haplotypes to fit a perfect phylogeny? Classic NASC: Arrange the haplotypes in a matrix, two haplotypes for each individual. Then (with no duplicate columns), the haplotypes fit a unique perfect phylogeny if and only if no two columns contain all three pairs: 0,1 and 1,0 and 1,1 This is the 3-Gamete Test

The Alternative Explanation No tree possible for this explanation

The Tree Explanation Again 0 0 1 2 b 0 0 a b a c c 0 1 0 1

Solving the Haplotype Phylogeny Problem (PPH) • Simple Tools based on classical Perfect Phylogeny Problem. • Complex Tools based on Graph Realization Problem (graphic matroid realization).

Combinatorial Optimization in Computational Biology

Combinatorial Optimization in Computational Biology

Presentation Transcript

Computational Biology

Combinatorial Optimization in Computational Biology: three topics that use Perfect Phylogeny

Combinatorial Optimization

Computational Biology

Computational Biology

Computational Biology

Computational Biology

Combinatorial Optimization in Container Scheduling

Combinatorial Optimization and Combinatorial Structure in Computational Biology

Distributed Combinatorial Optimization

Computational Protein Design: A problem in combinatorial optimization

Computational Biology

Combinatorial Optimization

Computational Biology

Combinatorial Optimization and Combinatorial Structure in Computational Biology

Computational Optimization

Problems in Combinatorial Optimization

Computational Biology

Computational Biology

Combinatorial Optimization in Computational Biology: three topics that use Perfect Phylogeny