Unlocking Biological Challenges with Dynamic Programming Approaches

Dynamic programming:one algorithmic key to many biological locks Mikhail Gelfand RTCB, IITP, RA S and FBB, MSU 2010-2011

BIOINFORMATICS FOR BIOLOGISTS Pavel Pevzner and Ron Shamir, eds. (Cambridge University Press, 2011) Ch. 4. DYNAMIC PROGRAMMING: ONE ALGORITHMIC KEY FOR MANY BIOLOGICAL LOCKS Mikhail Gelfand Research and Training Center “Bioinformatics” of the Institute for Information Transmission Problems, RAS and Faculty of Bioengineering and Bioinformatics, M.V.Lomonosov Moscow State University

Alignment Three (of many) alignments of two sequences. Plus denotes a match; dot, a mismatch, minus, a gap. (a) Two matches, five mismatches, (b) three matches, one mismatch, two gaps of size three (six indels, that is one-nucleotide insertions/deletions), (c) four matches, two gaps of size three (six indels).

The number of alignments is large # of alignments of two sequences of length N~ (1+√2)2N+1√N at N = 1000 # ≈ 10767 # of elementary particles in the Universe ≈ 1080 at N = 100 # ≈ 1076 assume 1 operation per alignment, 1012 operations per second => need 1057 years => we cannot consider them one by one

Gene recognition Segmentation of a genomic fragment into protein-coding and non-coding regions based on differences in statistical properties of these regions difficult in eukaryotes due to the existence of introns, non-coding regions within genes

Toy example How many operations are needed to calculate ∑i=1…m, j=1…nxi∙yj = = x1∙y1 + x1∙y2 + … + x1∙yn + + x2∙y1 + x2∙y2 + … + x2∙yn + + … + + xm∙y1 + xm∙y2 + … + xm∙yn Naïve answer: mn multiplications and mn–1 additions

but rewrite as… (x1 + x2 + … + xm) ∙ (y1 + y2 + … + yn) = = ∑i=1…mxi ∙ ∑j=1…nyj and it becomes m+n–2 additions and just 1 multiplication

Quiz How many multiplications do we need to calculate x1y1∙ x1y2 ∙ … ∙ x1yn ∙ x2y1 ∙ x2y2 ∙ … ∙ x2yn ∙ … ∙ ∙ xmy1 ∙ xmy2 ∙ … ∙ xmyn = ∏ i=1…m, j=1…nxiyj if we are • naïve? (b) sophisticated? (c) What if in addition to multiplication, we have an operation “taking to the power”? (d) if we may perform not only multiplication, but also addition?

Lesson Restructuring the order of calculations using properties of the data may sharply decrease the number of operations

Graphs Vertices/nodes: v1, v2, …, vn Arcs /edges– directed pairs of vertices: am(vi, vj) multiple sources and sinks contains cycles

“bad” graphs and not graphs multiple arcs multiple components loop undirected graph not a graph (hanging arc)

Sources, sinks, paths, cycles Source is a vertex that is not an end vertex for any arc Sink is a vertex that is not a start vertex for any arc. Walk p of length N is an ordered set of N arcs w = (a1, …, aN) such that the end vertex of arc an = (bn, en) coincides with the start vertex of arc an+1, en=bn+1, for all n = 1, …, N–1. v1 v1 v2 v3 v1 v2 v2 v4 v5 v6 v3 v4 multiple sources and sinks no source and sink one source and one sink w=(a(v1,v3),a(v3,v2,),a(v2,v4,),a(v3,v4), a(v3,v1), a(v1,v3)) w=(a(v2,v1)) w=(a(v4,v5),a(v5,v3))

Sources, sinks, paths, cycles In a graph without loops and multiple arcs, each walk may also be defined as an ordered set of vertices w = (v1, …, vN+1) such that for each pair of adjacent vertices vn, vn+1 there is an arc an = (vn, vn+1), n = 1, …, N. v1 v1 v2 v3 v1 v2 v2 v4 v5 v6 v3 v4 multiple sources and sinks no source and sink one source and one sink w=(v2,v1) w=(v4,v5,v3) w=(v1,v3,v2,v4,v3,v1,v3)

Sources, sinks, paths, cycles A path is a walk in which no arc is passed twice. Cycle is a path in which the end vertex of the last arc aN coincides with the start vertex of the first arc a1, eN=b1. Acyclic graph contains no cycles. Acyclic graph Acyclic graph Cyclic graph v1 v1 v2 v3 v1 v2 v2 v4 v5 v6 v3 v4 multiple sources and sinks no source and sink one source and one sink p=(v2,v1) p=(v4,v5,v3) p=c=(v1,v3,v2,v4,v3,v1)

Quiz (a) Draw all acyclic connected oriented graphs with three vertices (up to vertex labels). (b) How many oriented graphs will there be if we label vertices with symbols A, B and C? (c) Prove that in an acyclic graph there is at least one source and at least one sink. (d) Draw sinks and sources in the graphs of (a).

Problem Consider an acyclic graph with one source and one sink. Assign each arc with a number called a weight. For a given path, its path score is defined as the sum of the weights of its arcs. Given a weighted acyclic graph, find the highest scoring path from the sink to the source.

Observation If two subpathsP and Q end at the same vertex v, and the score of P is larger than the score of Q, then for all pairs of paths P* and Q* that start with P and Q, respectively, and coincide after v, the score of P* is higher than the score of Q*. Hence, we do not need to consider all paths, as it is sufficient to construct the highest scoring subpath from the source to each vertex, finishing at the sink. Q P P*,Q* v P* > Q* P > Q

Let’s do it for this graph 1 3 3 2 2 6 8 5 6 5 5 2 1 3 4 1 4 2 2 1

Step 1 Step 2 1 1 3 3 3 2 3 2 2 2 6 8 5 6 8 5 6 5 6 5 4 1 6 5 2 5 2 5 3 1 1 3 4 1 3 4 1 4 2 4 2 2 1 2 1 2 3 2

Step 3 Step 4 1 1 3 3 10 3 2 3 2 11 2 2 6 8 5 6 8 5 6 5 6 5 4 10 6 7 5 2 5 2 5 5 1 1 3 4 1 3 4 1 4 2 4 2 2 1 2 1 2 3 2 3

Step 5 Step 6 1 1 3 3 12 18 3 2 16 3 2 11 11 2 2 6 8 5 6 8 5 6 5 6 5 10 10 7 7 5 2 5 2 5 5 1 1 3 4 1 3 4 1 4 2 4 2 2 1 2 1 2 3 2 3

Step 7 Step 8 19 1 1 3 3 18 19 16 3 2 16 3 2 11 11 2 2 6 8 5 6 8 5 6 5 6 5 10 10 7 7 5 2 5 2 5 5 1 1 3 4 1 3 4 1 4 2 4 2 2 1 2 1 2 3 2 3

Step 9 20 Backtracing 20 1 1 3 3 19 19 3 2 3 2 11 11 16 16 2 2 6 8 5 6 8 5 6 5 6 5 10 10 7 7 5 2 5 2 5 5 1 1 3 4 1 3 4 1 4 2 4 2 2 1 2 1 2 3 2 3

Quiz At what steps did we have more than one vertex with all incoming arcs processed?

Algorithm Data types and definitions: vertices: v, u, Source, Sink; arcs: (v,u), a; start vertex of arc a: Begining_vertex(a); weight of arc (v,u): W(v,u); path: BestPath; // defined as a set of arcs the highest score of subpath ending at v: Score (v); the highest score of subpath coming through (v,u) and ending at u : Top_score(v,u); the last arc of the highest scoring subpath ending at u: Last_arc(u).

Initialize: for each vertex v:Score (v) := minus_infinity. Forward process: while There are unprocessed vertices: v := arbitrary unprocessed vertex with all incoming arcs processed; for each arc (v,u): // consider all arcs starting at v Top_score(v,u) := Score (v)+W(v,u); ifTop_score(v,u)>Score (u) // subpath coming through v is better than the //current best subpath ending at u then: // update the data for u Score (u) := Top_score(v,u); Last_acr(u) := (v,u); endif; (v,u) := processed_arc; endfor; v := processed_vertex; endwhile. Backtracing: BestPath= empty_set; // initialize v := Sink; // go from the sink backwards by marked arcs untilv=Source Add Last_arc(v) to BestPath; // add the last arc of the best path ending at the //current vertex v :=Beging_vertex(Last_arc(v)); // go to the start vertex of this arc enduntil. OutputBestPath.

The number of operations The limiting procedure is processing vertices and adding arcs to paths, and we consider each arc only once Hence the number of operations is linear in the number of arcs A:the run time of the algorithm is O (A)

1 3 3 2 2 6 8 5 6 5 5 2 1 3 4 1 4 2 2 1 Greedy algorithm Start at the source and select the highest-weighted arc at each step. 13 <20 It does not work.

Quiz • Construct the simplest possible graph in which the greedy algorithm yields the highest scoring path. (b) Construct a graph with three vertices in which the greedy algorithm does not yield the highest scoring path. (c) Construct a graph with three vertices in which the greedy algorithm does yield the highest scoring path. (d) Assign new weights to the arcs of the above graph so that the greedy algorithm will yield the highest scoring path.

Quiz cont’d (e) Write an algorithm for construction of the path with the maximum number of arcs and apply it to the above graph. Hint: do not change the algorithm, set proper arc weights. (f) Modify the maximum score algorithm so as to construct the path with the minimal score and find this path for the above graph. (g) Provide a greedy algorithm for finding the path of minimal score in a graph, and apply it to the above graph. (h) For the above graph, find the path with the minimal number of arcs.

Lesson The generic dynamic programming algorithm may be applied to differentproblems. The common feature of these problems is that each one can be decomposedinto an ordered set of smaller subproblems, and to solve a more complex subproblemone needs to know only the solutions of the simpler ones, but not the entire set ofpossibilities.

Note There exist path optimization problems that cannot be solved by the dynamic programming. Traveling salesman problem. Given a non-oriented graph with weighted arcs, we need to construct the lowest scoring path passing through all the vertices (the salesman needs to visit all cities with travel time between the cities given by the arc weights, while spending the least amount of time traveling). All cities need to be visited in a single trip => NP-complete problem. No efficient algorithms are known. Most computer scientists believe that for all NP-complete problems the number of operations required to provide an optimal solution is exponential in the problem size.

Alignment Given two symbol sequences (nucleotides or amino acids) of lengths M and N, set a correspondence between these sequences so that some symbols are set in pairs, matching or mismatching, whereas other symbols are ignored (indels). The order of corresponding symbols in the subsequences should coincide. The alignment score is the sum of match premiums r per matching pair minus the sum of mismatch penalties p per mismatching pair and deletion penalties q per ignored symbol. The goal is to construct the highest scoring alignment.

Quiz What are the scores of the alignments

Reduction to the optimal path problem Construct a graph. Vertices correpond to pairs of positions (endpoint of partial alignments). Outcoming arcs (for each vertex) are of three types: • match (weight r ) or mismatch (weight(–p)); total M∙N arcs • deletion in the 1st sequence (weight (–q)); total M∙(N+1) arcs • deletion in the 2nd sequence (weight (–q); total (M+1)∙N) arcs

g e l f a n d g a n d a l f Alignment graph

g e l f a n d q q q q q q q g p q r q p q p q q p q p q p q q q q q q q q a p q p q q p q q p q p q q p p r q q q q q q q n q q q p q p q p q p q p q r p q q q q q q q d p q q p q p q p q p q p q r q q q q q q q q a q p q p q p q p q r q p q p q q q q q q q q l q p q p q q p q p q r q p q p q q q q q q q f q p q p q q p q p q p q r q p q q q q q q q Alignment graph with weights

g e l f a n d g a n d a l f Paths for the three alignments

Variants • Hanging-end alignment (genome assembly) • zero-weight arcs from the source to the top and left “perimeter” and from the right and bottom perimeter to the sink • Local alignment • zero-weight arcs from the source to all internal vertices and from internal vertices to the sink

Weights • Amino-acid substitution weight matrices • evolutionary • PAM (sure alignment of closely related proteins, take matrix to the power) • BLOSUM (alignment of conservative regions in distantly related proteins) • based on physical and chemical properties of residues • Deletion penalty • affine penalties (opening and extension penalties) • Structural alignment as the gold standard

Quiz For the above alignments, assuming match premium r=10, what combinations of mismatch and deletion penalties would yield optimal alignments (a), (b), and (c)?

Multiple alignment • triple  cubic graph • etc • for K sequences of length N requires O(NK) operations • soon becomes unworkable • progressive alignment • all pairwise alignments, distance matrices • guide tree • alignment of partial alignment

Lesson Weights matter. The same graph with differently assigned arc weights will yield different types of alignment.

Gene recognition Define a gene as a sequence fragment consisting of exons and introns. The boundaries between them are donor sites (between exons and introns, usually GT) and acceptor sites (between introns and exons, usually AG). Each exon and intron is assigned a weight, measuring coding affinity (respectively, non-coding affinity) of its sequence. The gene’s score is the sum of weights of constituent exons and introns. The goal is, given a sequence and a set of candidate donor and acceptor sites, construct the highest-scoring exon–intron structure for a gene.

(a) actgagactgcagacggacgtacggcactgacgtataagccccacagtccttacgtctga (b) actgagactgcagACGGACGTACGGCACTGACgtataagCCCCACAGTCCTTACgtctga Construct a graph

Complexity Assume even distribution of sites (leave out details) => O(L) vertices, O(L2) arcs Can we do better?

(a) (a) actgagactgcagacggacgtacggcactgacgtataagccccacagtccttacgtctga (b) (b) actgagactgcagACGGACGTACGGCACTGACgtataagCCCCACAGTCCTTACgtctga It makes sense to assume that the segment weights are additive (we assume that for exons anyhow). Then we have just O(L) arcs

Quiz There are two paths in the segment graph that describe exon–intron structures not represented in the exon–intron graph. What are they? What arcs need to be added to the exon–intron graph to represent these structures?

Lesson Structure matters. The same problem may be represented by different graphs, and the conceptually simplest representation is not necessarily the most efficient one.

Return to the toy problem calculate the standard trick would not work because x∙z + y∙z = (x + y)∙ z (before) holds, but (x+z) ∙ (y+z) = x∙y + z generallydoes not. Quiz. When (x+z) ∙ (y+z) = x∙y + z ?

Unlocking Biological Challenges with Dynamic Programming Approaches