Optimizing Sequence Transformation: Edit Distance and Dynamic Programming Techniques

This material in not in your text (except as exercises) • Sequence Comparisons • Problems in molecular biology involve finding the minimum number of edit steps which are required to change one string into another. • Three types of edit steps: insert, delete, replace. • Example: abbc babb • abbc  bbc  bbb  babb (3 steps) • abbc  babbc  babb (2 steps) • We are trying to minimize the number of steps.

Idea: look at making just one position right. Find all the ways you could use. • Count how long each would take and recursively figure total cost. • Orderly way of limiting the exponential number of combinations to think about. • For ease in coding, we make the last character right (rather than any other). (Then I can tell the routine the beginning address and number of positions without having the string begin in a different location. In C it would be no problem, but other languages use a different technique.)

There are four possibilities (pick the cheapest) • If we delete an, we need to change A(n-1) to B(m). The cost is C(n,m) = C(n-1,m) + 1 C(n,m) is the cost of changing the first n of str1 to the first m of str2. 2. If we insert a new value at the end of A(n) to match bm, we would still have to change A(n) to B(m-1). The cost is C(n,m) = C(n,m-1) + 1 3. If we replace an with bm, we still have to change A(n-1) to B(m-1). The cost is C(n,m) = C(n-1,m-1) + 1 4. If we match an with bm, we still have to change A(n-1) to B(m-1). The cost is C(n,m) = C(n-1,m-1)

We have turned one problem into three problems - just slightly smaller. • Bad situation - unless we can reuse results. Dynamic Programming. • We store the results of C(i,j) for i = 1,n and j = 1,m. • If we need to reconstruct how we would achieve the change, we store both the cost and an indication of which set of subproblems was used.

M(i,j) which indicates which of the four decisions lead to the best result. • Complexity: O(mn) - but needs O(mn) space as well. • Consider changing do to redo: • Consider changing mane to mean:

Longest Increasing Subsequence Find the longest increasing subsequence in a sequence of distinct integers. Idea 1. Given a sequence of size less than m, can find the longest sequence of it. (Induction) The problem is that we don't know how to increase the length. Case 1: It either can be added to the longest subsequence or not Case 2: It is possible that it can be added to a non-selected subsequence (creating a sequence of equal length - but having a smaller ending point) Case 3: It can be added to a non-selected sub-sequence creating a sequence of smaller length but successors make it a good choice. Example: 5 1 10 2 20 30 40 4 5 6 7 8 9 10 11

Idea 2. Given a sequence of size string < m, we know how to find all the longest increasing subsequences. • Hard. There are many, and we need it for all lengths.

Idea 3Given a sequence of size < m, can find the longest subsequence with the smallest ending point. • We might have to create a smaller subsequence, before we create a longer one.

Idea 4. Given a sequence of size <m, can find the best increasing sum (BIS) for every length (k < m-1). • For each new item in the sequence, when we add it to the sequence of length 3 will it be better than the current sequence of length 4?

For s= 1 to n (or recursively the other way) For k = s downto 1 until find correct spot If BIS(k) > As and BIS(k-1) < As BIS(k) = As

Actually, we don't need the sequential search as can do a binary search. 5 1 10 2 12 8 15 18 45 6 7 3 8 9 Length BIS 1 1 2 2 3 3 4 7 5 8 6 9 • To output the sequence would be difficult as you don't know where the sequence is. You would have to reconstruct.

Probabilistic Algorithms • Suppose we wanted to find a number that is greater than the median (the number for which half are bigger). • We could sort them - O(n log n) and then select one. • We could find the biggest - but stop looking half way through. O(n/2) • Cannot guarantee one in the upper half in less than n/2 comparisons. • What if you just wanted good odds? • Pick two numbers, pick the larger one. What is probability it is in the lower half?

There are four possibilities: • both are lower • the first is lower the other higher. • the first is higher the other lower • both are higher. We will be right 75% of the time! We only lose if both are in the lowest half.

Select k elements and pick the biggest, the probability is 1 - 1/2k . Good odds - controlled odds. • Termed a Monte Carlo algorithm. It may • give the wrong result with very small probability. Another type of probabilistic algorithm is one that never gives a wrong result, but its running time is not guaranteed. • Termed Las Vegas algorithm as you are guaranteed success if you try long enough.

A coloring Problem: Las Vegas Style • Let S be a set with n elements. (n only effects complexity not algorithm) • Let S1, S2... Sk be a collection of • distinct (in some way different) • subsets of S, each containing exactly r elements such that k 2r-2 . (Use this bound below) • Color each element of S with one of two colors (red or blue) such that each • subset Si contains at least one red and one blue element.

Idea • Try coloring them randomly and they just checking to see if you happen to • win. Checking is fast, as you can quit checking each subset when you see • one of each. You can quit checking the collection when any single color subset is found. • What is the probability that all items in a set are red? 1/2r • as equal probability that each color is assigned and r items in the set.

What is the probability that any one of the collection is all red? • k/2r • Since we are looking for the {\it or} of a set of probabilities, we add. • k is bound by 2r-2 so k*1/2r <= 1/4 • The probability of all blue or all red in a single set is one half. (double probability of all red) • If our random coloring fails, we simply try again until success. • Our expected number of tries is 2.

Finding a Majority • Let E be a sequence of integers x1,x2,x3, ... xn The multiplicity of x in E is the number of times x appears in E. A number z • is a majority in E if its multiplicity is greater than n/2. • Problem: given a sequence of numbers, find the majority in the sequence or determine that none exists.

For example, suppose there is an election. Candidates are represented as integers. Votes are represented as a list of candidate numbers. • We are assuming no limit of the number of possible candidates.

Ideas 1. sort the list O(n log n) 2. If have a balanced tree of candidate names would be n log c (where c is number of candidates) Note, if we don’t know how many candidates, we can’t give them indices. 3. See if median (kth largest algorithm) occurs more than n/2 times. O(n) 4. Take a small sample. Find the majority - then count how many times it occurs in the whole list. This is a probabilistic approach. 5. Make one pass - Discard elements that won’t affect majority.

Note: • if xi xj and we remove both of them, then the majority in the old list is the majority in the new list. • If xi is a majority there are m x_i s out of n, where m > n/2. If we remove two elements, (m-1 > (n-2)/2). • The converse is not true. If there is no majority, removing two may make something a majority in the smaller list: 1,2,4,5,5.

Thus, our algorithm will find a possible majority. • Algorithm: find two unequal elements. Delete them. Find the majority in the smaller list. Then see if it is a majority in the original list. • How do we remove elements? It is easy. We scan the list in order. • We are looking for a pair to eliminate. • Let i be the current position. All the items before xi which have not been eliminated have the same value. All you really need to keep is the number of times, Occurs, this candidate, C value occurs (which has not been deleted).

For example: List: 1 4 6 3 4 4 4 2 9 0 2 4 1 4 2 2 3 2 4 2 Occurs: X X 1 X 1 2 3 2 1 X 1 X 1 X 1 2 1 2 1 2 Candidate: 1 6 4 4 4 4 4 ? 2 ? 1 ? 2 2 2 2 2 2 2 is a candidate, but is not a majority in the whole list. • Complexity: n-1 compares to find a candidate. n-1 compares to test if it is a majority. So why do this over other ways? Simple to code. No different in terms of complexity, but interesting to think about.

Optimizing Sequence Transformation: Edit Distance and Dynamic Programming Techniques