Genomic Sorting with Length-Weighted Intervals - PowerPoint PPT Presentation

Genomic Sorting with Length-Weighted Intervals. 236818 - Seminar in Bioinformatics Advanced Algorithms in Computational Biology Spring 2005, Technion Asaf Merschon. What we saw so far.

Genomic Sorting with Length-Weighted Intervals

236818 - Seminar in Bioinformatics Advanced Algorithms in Computational Biology

Spring 2005, Technion

Asaf Merschon

• Current algorithms for genome rearrangement ignore the length of reversals; they only count their number.

• Traditionally, such analysis assumes that each reversal is of unit cost.

• The assumption of unit cost reversals is not completely defensible biologically:

• A longer genomic reversal will cause more upheaval to the organism, resulting in a lower likelihood of the organism surviving to pass the mutation.

• The mechanics of genome reversal may suggest that the probability of a reversal depends on its length (among other factors).

• On top of the surface:

• Introduction to Genomic Sorting with Length-Weighted Intervals.

• Lower and upper bounds on complexity of solution.

• Proofs (Partial).

• Down under:

• Improved bounds on Sorting with Length-Weighted Reversals (Extended Abstract).

• Concept and examples.

• Sorting by Length-weighted Reversals: Dealing with Signs and Circularity.

• General approach to solutions.

• Find an algorithm that efficiently sorts one sequence into another by reversals under length sensitive cost models.

• Focus is on sorting unsigned permutations by reversals.

• The problem remains NP-hard in our new model and hence we will try to reach approximation results.

• Let the function $f(x)$ denote the cost of a reversal of length $x$.

• Traditionally, $f(x) = 1$.

• We say a function is:

• Additive if $f(x + y) = f(x) + f(y)$

• Subadditive if $f(x + y) \le f(x) + f(y)$

• Superadditive if $f(x + y) \ge f(x) + f(y)$
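The three classes can be checked numerically for a candidate cost function. The sketch below assumes the standard definitions (additive: f(x+y) = f(x) + f(y); subadditive: f(x+y) ≤ f(x) + f(y); superadditive: f(x+y) ≥ f(x) + f(y)); the helper name `classify` and the tested range of lengths are illustrative choices, not part of the presentation.

```python
def classify(f, max_len=40, tol=1e-9):
    """Classify f by comparing f(x + y) with f(x) + f(y) for all x + y <= max_len."""
    diffs = [f(x + y) - (f(x) + f(y))
             for x in range(1, max_len)
             for y in range(1, max_len - x + 1)]
    if all(abs(d) <= tol for d in diffs):
        return "additive"
    if all(d <= tol for d in diffs):
        return "subadditive"
    if all(d >= -tol for d in diffs):
        return "superadditive"
    return "neither"


print(classify(lambda x: 1))       # unit cost
print(classify(lambda x: x))       # linear cost
print(classify(lambda x: x ** 2))  # quadratic cost
```

A finite check like this can only refute a property, not prove it for all lengths.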

• A Reversal Graph of permutations of length n is a graph where:

• The vertices are all the permutations of length n.

• There is an edge (p1,p2) of weight $f(x)$ if there exists one $x$-reversal (a reversal of length $x$) that transforms the permutation p1 into the permutation p2.

• Minimize the cost sufficient to sort any permutation of n elements (actually achieving an upper bound). Equivalent to computing the diameter of the reversal graph under the shortest-path metric.

• Approximate the minimum-cost reversal sequence for a given permutation. We would like a heuristic that assures the resulting sequence costs no more than a slowly growing function of n times that of the optimal sequence.
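For very small n the diameter of the reversal graph can be computed by brute force. The following is a minimal sketch, assuming edge weights equal the cost of the reversal length; the function names are hypothetical. It runs Dijkstra's algorithm from the identity permutation only, which suffices because relabeling the elements maps the graph onto itself, so every vertex has the same eccentricity.

```python
import heapq
import math


def reversal_neighbors(perm, cost):
    """Yield (neighbor, weight) for every reversal of length >= 2."""
    n = len(perm)
    for i in range(n):
        for j in range(i + 2, n + 1):
            yield perm[:i] + perm[i:j][::-1] + perm[j:], cost(j - i)


def diameter(n, cost):
    """Largest shortest-path cost from the identity to any permutation of 1..n."""
    identity = tuple(range(1, n + 1))
    dist = {identity: 0}
    heap = [(0, identity)]
    while heap:
        d, p = heapq.heappop(heap)
        if d > dist[p]:
            continue  # stale heap entry
        for q, w in reversal_neighbors(p, cost):
            nd = d + w
            if nd < dist.get(q, math.inf):
                dist[q] = nd
                heapq.heappush(heap, (nd, q))
    return max(dist.values())


print(diameter(4, lambda length: 1))       # unit cost
print(diameter(4, lambda length: length))  # linear cost
```

The state space has n! vertices, so this is only practical for single-digit n.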

• The relatively coarse bounds generated by the following techniques limit how directly they can be applied to biological data.

• The work presented leads to interesting algorithmic results and raises some interesting questions as a basis for further bioinformatics studies.

• Sorting by unit-cost, unsigned reversals was shown to be NP-hard by Caprara. Our problem inherits hardness under more general metrics from this result.

• Kececioglu & Sankoff gave approximation algorithms for reversal distance that guarantee results at most 2 times optimal.

• Bafna & Pevzner improved this to a factor of 7/4.

• Berman et al improved this factor to 1.375.

• Minimum-cost unsigned reversal sorting has also been studied under models where cost increases so steeply with length that only length-2 reversals are affordable.

• Experiments were done both on the mitochondrial genomes of two fungi and on random samples. They suggest that length may play an important role in biasing certain rearrangement patterns.

• By bounding the diameter of the Reversal Graph, we establish an upper bound on the cost of sorting any n-element permutation.

• Standard sorting algorithms exhibit interesting performance on highly subadditive and superadditive functions, but not additive measures. The primary result of this section is a new reversal-based sorting algorithm which performs well on additive cost functions. (Examples in next slides).

• Subadditive: A reversal-based version of selection sort performs at most n-1 reversals, a fraction of which are potentially $\Theta(n)$ in length. Thus selection sort gives an $O(n f(n))$ diameter algorithm.

• Especially efficient for highly subadditive cost functions.

• Superadditive: Bubble sort and insertion sort perform transpositions of neighboring elements, one for each inversion in the input permutation. This gives an $O(n^2 f(2))$ diameter algorithm.

• Particularly efficient for highly superadditive cost functions.
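The two standard algorithms above can be written as reversal sequences with a running cost. This is an illustrative sketch: the function names are hypothetical, permutations are assumed to hold the elements 1..n, and `cost` is any function of the reversal length.

```python
def selection_sort_by_reversals(perm, cost):
    """Bring element i+1 to position i with one reversal; at most n-1 reversals."""
    p = list(perm)
    total = 0
    for i in range(len(p) - 1):
        j = p.index(i + 1)             # current position of the element that belongs at i
        if j != i:
            p[i:j + 1] = p[i:j + 1][::-1]
            total += cost(j - i + 1)   # reversal of length j - i + 1
    return p, total


def bubble_sort_by_reversals(perm, cost):
    """Use only length-2 reversals (adjacent swaps), one per inversion."""
    p = list(perm)
    total = 0
    swapped = True
    while swapped:
        swapped = False
        for i in range(len(p) - 1):
            if p[i] > p[i + 1]:
                p[i], p[i + 1] = p[i + 1], p[i]
                total += cost(2)
                swapped = True
    return p, total


example = [3, 1, 4, 2]
print(selection_sort_by_reversals(example, lambda length: 1))
print(bubble_sort_by_reversals(example, lambda length: length ** 2))
```

Selection sort keeps the number of reversals low, which suits cost functions that barely grow with length; bubble sort keeps every reversal short, which suits cost functions that grow steeply.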

• Additive functions, particularly $f(x) = x$

• Presented is an algorithm for sorting any permutation of n elements in $O(f(n) \lg^2 n)$ cost using divide and conquer.

• The key operation is MedianEject.

Sorting a permutation involves putting element i in position i.

• Let $p(i)$ denote the element in the $i$-th position in the permutation.

• Let $p^{-1}(i)$ denote the position of the element $i$ in the permutation.

• An element x is wrong-sided if x and its position $p^{-1}(x)$ are on different sides of the median m, meaning $x \le m < p^{-1}(x)$ or vice versa.

• We apply MedianEject to portions of the permutation from position a to b. One round of MedianEject moves all wrong-sided elements in the interval [a,b] to the correct side relative to its median in the following manner:

• MedianEject(a,b) =

• Identify the r maximal runs of wrong-sided elements and the median of [a,b], at position a + (b-a)/2.

• for (i = 1 to log r): reduce the number of wrong-sided runs by half using non-overlapping reversals, none crossing the median.

• With two reversals, move the remaining wrong-sided runs to the median boundary.

• Reverse the left and right wrong-sided runs using a single reversal.
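The first step of MedianEject, identifying the maximal runs of wrong-sided elements, can be sketched as follows. The sketch treats the whole permutation as the interval and makes several assumptions not stated in the transcript: elements and positions are 1-based, the median is m = n // 2, an element or position counts as "left" when it is at most m, and a run never extends across the median. The function name is hypothetical.

```python
def wrong_sided_runs(perm):
    """Return maximal runs of wrong-sided elements as (first_pos, last_pos), 1-based."""
    n = len(perm)
    m = n // 2                                   # assumed median convention
    runs = []
    for pos, x in enumerate(perm, start=1):
        wrong = (x <= m) != (pos <= m)           # element and position on different sides
        if not wrong:
            continue
        same_side = bool(runs) and (runs[-1][0] <= m) == (pos <= m)
        if runs and runs[-1][1] == pos - 1 and same_side:
            runs[-1] = (runs[-1][0], pos)        # extend the current run
        else:
            runs.append((pos, pos))              # start a new run
    return runs


print(wrong_sided_runs([5, 1, 6, 2, 3, 7, 4, 8]))
print(wrong_sided_runs([5, 6, 1, 2, 3, 4, 7, 8]))
```

The halving rounds of MedianEject then merge neighboring runs on each side until one run per side remains.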

MedianEject – Sample Run

• Lemma 1: MedianEject costs O(f(b-a) log r) for any additive cost function f.

• Proof (intuitively): There are O(log r) rounds of reversals, since with each pass there are half as many maximal runs of wrong-sided elements on each side of the median. The reversals within a round do not overlap, so together they reverse at most b-a elements and, f being additive, each round costs O(f(b-a)), for a total of O(f(b-a) log r).

• MedianEject is the partitioning operation of the following Quicksort-like algorithm:

• Lemma 2: ReversalSort sorts at a cost of $O(f(n) \lg^2 n)$ for any additive cost function f(n).

• Proof: By the master theorem, the recurrence $T(n) = 2T(n/2) + O(f(n) \lg n)$ evaluates to $O(f(n) \lg^2 n)$.
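As a worked check of the master-theorem step, take the linear cost f(n) = n and assume each partitioning step on m elements costs at most c·m·lg m, as Lemma 1 suggests with r ≤ m. The constant c, the base case T(1) = 0, and this exact form of the recurrence are assumptions, since the formula is not legible in this transcript. Unrolling the recursion level by level gives:

```latex
\begin{align*}
T(n) &\le 2\,T(n/2) + c\,n \lg n \\
     &\le \sum_{i=0}^{\lg n - 1} 2^{i} \cdot c\,\frac{n}{2^{i}} \lg\frac{n}{2^{i}} \\
     &= c\,n \sum_{i=0}^{\lg n - 1} (\lg n - i) \\
     &= c\,n \cdot \frac{\lg n\,(\lg n + 1)}{2} \\
     &= O(n \lg^{2} n).
\end{align*}
```

Each of the lg n levels contributes at most c·n·lg n, which is where the second logarithmic factor comes from.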

• From a biological point of view, constructing the least expensive transformation from a given permutation A to another permutation B is more interesting than minimizing diameter. This is because we want to reconstruct the evolutionary history from A and B, a history which presumably took the most parsimonious possible path.

• We now show that for all permutations, the reversal sorting algorithm yields a cost which is $O(\lg^2 n)$ times optimal for any additive cost function.

• Our analysis requires the definition of a weighted graph G(p) associated with a given permutation p.

• The vertices of G(p) will be the n elements (positions) of p. There will be an edge (i,j) in G(p) where . The weight of this edge is .

• G(p) may be used to provide lower bounds on the optimal cost of sorting. However, these bounds can be very coarse.

• Instead, we bound the optimal cost in terms of the weight of the heaviest non-crossing matching M(G(p)).

• We say a matching M(G(p)) (namely a set of edges from G(p)) is non-crossing if no two of its edges, viewed as intervals, overlap or nest. Such a maximum-weight matching can be found easily using dynamic programming.
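One way to realize the dynamic program is the classic weighted interval scheduling recurrence. The sketch below assumes that "non-crossing" means the chosen edges, viewed as closed intervals [i, j], are pairwise disjoint (the later proof speaks of intervals that neither overlap nor nest). How the edges of G(p) and their weights are derived from p is not legible in this transcript, so the edge list is simply taken as input; the function name is hypothetical.

```python
import bisect


def heaviest_disjoint_matching(edges):
    """edges: list of (i, j, weight) with i < j. Returns (total_weight, chosen_edges)."""
    edges = sorted(edges, key=lambda e: e[1])    # order by right endpoint
    ends = [e[1] for e in edges]
    best = [(0, [])]                             # best[k]: optimum over the first k edges
    for k, (i, j, w) in enumerate(edges):
        # number of earlier edges that end strictly before position i
        compatible = bisect.bisect_left(ends, i, 0, k)
        take = best[compatible][0] + w
        if take > best[k][0]:
            best.append((take, best[compatible][1] + [(i, j, w)]))
        else:
            best.append(best[k])
    return best[-1]


sample_edges = [(1, 3, 3), (2, 5, 4), (4, 6, 3), (7, 8, 2)]
print(heaviest_disjoint_matching(sample_edges))
```

Sorting dominates the running time, so the sketch runs in O(E log E) for E edges apart from the cost of copying the chosen-edge lists.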

• Theorem 3: The greedy breakpoint-merging heuristic can yield a reversal sequence whose cost is optimal.

• Proof: Won’t be provided in this presentation.

• Lemma 4: The weight of M(G(p)) is a lower bound on the reversal-sorting cost for permutation p under additive weight functions.

• Proof: Consider the simpler task of just placing the elements defining edges from M(G(p)) into their proper position. This task can be done in cost f(w), where w is the total weight of M(G(p)), by performing the reversals defined by the edges in the matching. Because none of the intervals overlap or nest, no longer reversal can be helpful to move multiple elements into the proper position; because the cost function is additive we cannot benefit by using shorter reversals.

• To argue that the weight of M(G(p)) is a good lower bound, we will bound certain properties of p and G(p) in terms of the size of this matching.

• Lemma 5:Let denote the kth edge of M(G(p)), where . Let be a function which equals 1 if intersects the interval [i,…, j] and is zero otherwise. Then edge if

• Proof: By definition, M(G(p)) is the maximum cost non-crossing matching. Hence such an edge (i, j) cannot exist in G(p), for if so we could remove all intersected matching edges and insert (i, j) into M(G(p)) to yield a higher cost non-crossing matching.

• Lemma 6: The number of out-of-position elements in p is at most .

• Proof: Won’t be provided in this presentation.

• Lemma 7: No element outside of the penumbra moves during the execution of MedianEject.

• Definition: The penumbra is the set of positions where out-of-position elements potentially lie unioned with all positions overlapped by edges of M(G(p)).

• Proof: Won’t be provided in this presentation.

• Implied (By Lemma 7): Every round of non-overlapping reversals costs at most throughout the execution of ReversalSort.

• Corollary 1: The cost of the each round of MedianEject is , and therefore ReversalSort costs .

• Theorem 8: The ReversalSort heuristic solution is at most a factor of $O(\lg^2 n)$ times the optimal solution.

• Proof: Derived from the previous lemmas.

• Improved bounds on Sorting with Length-Weighted Reversals (Extended Abstract).

• Sorting by Length-weighted Reversals: Dealing with Signs and Circularity.

• Conclusions, Suggestions & Questions raised.

• Comments!?

• We will now approach the problem of sorting integer sequences by length-weighted reversals using a wider range of cost functions.

• For the cost function we consider a wide class of functions, namely $f(l) = l^\alpha$ for $\alpha \ge 0$, where l is the length of the reversal.

• So far we have mainly dealt with the case where $\alpha = 1$.

• To sort a sequence of 0’s and 1’s.

• Recursively sort the left and right halves.

• Perform one more reversal across the median, for a sorting cost of: $T(n) \le 2T(n/2) + f(n)$
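The divide-and-conquer steps above can be sketched as follows. After both halves are sorted the sequence has the shape 0…0 1…1 0…0 1…1, so a single reversal of the middle 1…1 0…0 stretch finishes the job. The function name and the default linear cost are illustrative assumptions.

```python
def sort01(bits, cost=lambda length: length):
    """Sort a 0/1 list by reversals; return (sorted_list, total_cost)."""
    n = len(bits)
    if n <= 1:
        return list(bits), 0
    mid = n // 2
    left, cost_left = sort01(bits[:mid], cost)
    right, cost_right = sort01(bits[mid:], cost)
    merged = left + right                 # shape: 0..0 1..1 0..0 1..1
    start = left.count(0)                 # index of the first 1 in the left half
    end = mid + right.count(0)            # one past the last 0 of the right half
    total = cost_left + cost_right
    if start < mid < end:                 # left half has 1s and right half has 0s
        merged[start:end] = merged[start:end][::-1]
        total += cost(end - start)        # the single reversal across the median
    return merged, total


print(sort01([0, 1] * 4))
```

With a linear cost the reversal across the median costs at most n, so the total is bounded by n lg n when n is a power of two.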

• Pinter and Skiena used this algorithm to obtain an upper bound of $O(n \lg^2 n)$ on diameter for linear cost reversals.

• As was shown in first part of the presentation.

• The table summarizes the bounds and approximation ratios found for different values of $\alpha$.

• Proofs for some of the bounds and approximation ratios will be presented as proof of concept.

• In the case of additive cost functions we saw that the upper bound on sorting any given permutation is $O(f(n) \lg^2 n)$.

• Similarly, we would like to find such bounds for other functions in the class we are using (i.e. ).

• To do this, we will use the concept of sorting sequences of 0’s and 1’s.

• Case 1 – :

• Consider the divide and conquer sorting algorithm described in the previous slide. The recursion relation for sorting the 0’s and 1’s becomes:

• For permutations, the cost for the recursion sorting becomes:

• Obviously, these results are upper bounds.

• Case 2 – :

• Consider the divide and conquer sorting algorithm described in the previous slide. The recursion relation for sorting the 0’s and 1’s becomes:

• For permutations, the cost for the recursion sorting becomes:

• Obviously, these results are upper bounds.
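The recurrences for these cases are not legible in this transcript. As a hedged illustration only, assume the 0/1 recurrence has the form T(n) = 2T(n/2) + n^α with T(1) = 0 (two recursive halves plus one reversal of length at most n). Evaluating it numerically shows the three growth regimes the master theorem predicts: linear for α < 1, n lg n for α = 1, and n^α for α > 1. The function name is hypothetical.

```python
import math


def recurrence_cost(n, alpha):
    """Evaluate T(n) = 2*T(n/2) + n**alpha with T(1) = 0, for n a power of two."""
    if n <= 1:
        return 0.0
    return 2 * recurrence_cost(n // 2, alpha) + n ** alpha


n = 2 ** 20
for alpha, growth in [(0.5, n), (1.0, n * math.log2(n)), (1.5, n ** 1.5)]:
    # The ratio stays bounded by a constant when the growth rate is the right one.
    print(alpha, recurrence_cost(n, alpha) / growth)
```

The bounds for permutations may follow a different recurrence, so this check speaks only to the assumed 0/1 case.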

• Case 3 – $\alpha \ge 2$:

• This case has no use for reversals of more than two elements. As such, bubble sort is an asymptotically optimal solution.

• As a result of this, a tight bound (Upper and Lower) on the diameter is: $\Theta(n^2)$

Lower Bounds on Diameter: Concept

• Proving the lower bounds on the diameters for different values of $\alpha$ is much more complex than proving the upper bounds.

• We will see the proof of a lower bound for a linear cost function $f(l) = l$.

• Tighter than what we have already seen.

• Theorem 2.3: The cost to sort n elements by reversals with a linear cost function is $\Omega(n \lg n)$, even when all elements are 0’s and 1’s.

• Thus, our bounds for sorting 0/1 sequences are tight (same Upper and Lower Bounds), but a multiplicative gap of $O(\lg n)$ exists for sorting permutations.

• We will approach the problem by exhibiting a difficult sorting instance.

• Specifically, we will prove a lower bound of $\Omega(n \lg n)$ on the cost of sorting the length-n sequence 010101…01 by reversals.

• The proof follows a potential function argument.

Definitions

• Before the sorting begins, we match the $i$-th 0 with the $i$-th 1. Throughout the sorting algorithm we will keep this matching.

• Let $d_i(t)$ be the current distance between the $i$-th 0 and the $i$-th 1 after the $t$-th reversal.

• When there is no ambiguity, we abbreviate $d_i(t)$ by $d_i$.

• The potential function is: $P(t) = \sum_{i=1}^{n/2} \lg d_i(t)$

Lemmas

• Lemma 2.1: The initial value of the potential function is 0, and the final value is $\Omega(n \lg n)$.

• We will show how a reversal affects the value of $d_i$ in the potential function by considering the $i$-th (0,1) pair.

• Observation 2.1: The distance $d_i$ can only change when one element of the pair is inside the reversal and the other is outside.

• Lemma 2.2: A reversal of length k increases the potential P(t) by at most 4k.

• Proving these two lemmas yields Theorem 2.3.

• Proof: Suppose that for a reversal of length k, one of the elements of a (0,1) pair is inside the reversal and the other is outside, so that $d_i$ is affected by the reversal.

• At the most, the distance between the two elements of this pair can increase by k because each element is moved at most by a distance k.

[Figure: the sequence before and after the reversal]

Proof of Lower Bound on Diameter for the Linear Cost Function (3)

• Let us assume by symmetry that 0 is outside the reversed sequence and the 1 is inside. Suppose that the distance from the 0 to the closest element in the reversal is l.

• The increase of the potential caused by the change in $d_i$ for this pair is at most: $\lg(l + k) - \lg l = \lg\frac{l + k}{l}$

• The distance l must be a natural number and occurs at most twice in one reversal, once on the left side and once on the right side of the reversed sequence.

• According to observation 2.1, there are at most k such pairs whose distance changes the value of the potential function.

• As a result, the value of the potential function increases by at most: $2 \sum_{l=1}^{k} \lg\frac{l + k}{l} = 2 \lg\binom{2k}{k}$

• Notice that $\lg\frac{l + k}{l}$ grows as l gets smaller.

• By Stirling’s approximation, $\binom{2k}{k} \le 4^k$, therefore $2 \lg\binom{2k}{k} \le 4k$, and the potential thus increases by at most $4k$.
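The bound of Lemma 2.2 can be checked exhaustively on a tiny instance. The potential formula is not legible in this transcript; the sketch assumes it is the sum, over matched (0,1) pairs, of the base-2 logarithm of the distance between the two elements, which is consistent with an initial value of 0 on 0101…01 and an increase of at most 4k per reversal. Elements are modeled as (bit, pair_id) tuples so the matching can be tracked; all names are hypothetical.

```python
import itertools
import math


def potential(seq):
    """Sum of lg(distance) over matched pairs; seq holds (bit, pair_id) tuples."""
    pos = {elem: idx for idx, elem in enumerate(seq)}
    pairs = len(seq) // 2
    return sum(math.log2(abs(pos[(0, i)] - pos[(1, i)])) for i in range(pairs))


def reverse(seq, start, end):
    """Return a copy of seq with positions start..end-1 reversed."""
    return seq[:start] + seq[start:end][::-1] + seq[end:]


def max_excess(pairs):
    """Largest (potential increase - 4k) over all arrangements and all reversals."""
    elems = [(bit, i) for i in range(pairs) for bit in (0, 1)]
    worst = -math.inf
    for arrangement in itertools.permutations(elems):
        before = potential(arrangement)
        for start in range(len(arrangement)):
            for end in range(start + 2, len(arrangement) + 1):
                k = end - start
                after = potential(reverse(arrangement, start, end))
                worst = max(worst, after - before - 4 * k)
    return worst


start_seq = [(0, 0), (1, 0), (0, 1), (1, 1), (0, 2), (1, 2)]   # 010101
print(potential(start_seq))
print(max_excess(3) <= 0)
```

Since a reversal of length k costs k under the linear cost function and raises the potential by at most 4k, the total cost is at least a quarter of the final potential.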

• Abstract:

• Sorting linear and circular permutations and 0/1 sequences by reversals in a length sensitive cost model.

• We consider both the signed and unsigned case.

What Lies Ahead

• Lower and upper bounds on the various cases.

• Mentions of some approximation algorithms that guarantee the bounds shown.

• Partial proofs of some of the bounds and approximations.

• Cost functions are still of the class $f(l) = l^\alpha$.

A Word (or Two) on Circularity

• Circularity generally offers more opportunities to reduce the optimal cost to sort a given permutation by reversals.

• At the same time, it presents a greater challenge of finding a more efficient solution.

• A non-unit cost model exacerbates these problems even further.

• Take as an example the permutation .

• One can sort it by using two reversals.

• In the circular case, where the two ends of the permutation meet, one can sort it by using one reversal.

• In the case of a unit cost model, the ratio of the costs is 2.

• However, in the case of a linear cost model, the ratio is .

Relationship of Costs for the Different Cases

• The following relationships hold for the four different cases:

Lower and upper bounds for SBR of signed or unsigned and linear or circular 0/1 sequences and permutations.

Approximation ratios for SBR of signed linear as well as signed and unsigned circular 0/1 sequences and permutations.

Bounds and Approximation Ratios

Approximation Algorithms for Sorting 0/1 Sequences

• We will now introduce lower bounds for sorting linear signed as well as circular unsigned 0/1 sequences.

• We will see an approximation algorithm for linear signed 0/1 sequences.

• We will deal with the case of .

SBR of Circular Unsigned 0/1 Sequences – Definitions

• Given a circular sequence S, denote the length of the 0 and 1 blocks contained in S by and respectively.

• Let and .

• We define the potential function P(S) as follows:

SBR of Circular Unsigned 0/1 Sequences – Lemmas

• Lemma 1: A reversal of length r acting on a circular sequence S increases the value of the potential function P(S) by at most .

• Proof: Won’t be provided in this presentation.

• Lemma 2: The function V(S) is a lower bound on the cost of sorting an unsigned circular sequence S.

• Proof: By induction (next slide).

SBR of Circular Unsigned 0/1 Sequences – Proof

• Let m be the number of reversals in some optimal sorting solution. We want to prove that if a sorting solution uses exactly m reversals it costs at least V(S).

• Base case: m=0 trivial.

• Induction step: Suppose the claim holds for all m ≤ k. Consider a 0/1 sequence S that has an optimal sorting series of k+1 reversals. Let r be the length of the first reversal and let S’ be the sequence it produces. S’ can be sorted by k reversals and hence V(S’) is a lower bound for sorting S’. Lemma 1 bounds the change in the potential caused by the first reversal, and the definition of V relates V(S) to V(S’). Therefore:

• as needed.

SBR of Linear Signed 0/1 Sequences – Definitions

• Consider a linear signed 0/1 sequence. Define a block in the sequence to be a contiguous segment of 0’s or 1’s of the same sign.

• Notice that there are four kinds of blocks in such a sequence.

• We represent the sequence as a series of . Let us denote as the potential function for such a linear sequence S.

SBR of Linear Signed 0/1 Sequences – Lemmas

• Lemma 3: The potential V(S) is a lower bound on the cost of sorting linear signed sequences.

• Proof: Won’t be provided in this presentation.

• Theorem 2: The algorithm signedImprovedDC is an O(1) approximation algorithm.

• Proof: Won’t be provided in this presentation.

• And the algorithm? In the next slide…

• Given a signed sequence S, let unsign(S) represent the sequence without the signs.

• signedImprovedDC(S)

• U ← unsign(S)

• u ← improvedDC(U)

• Mimic the reversals used to sort U on S. Denote the resulting sequence as S’.

• Reverse the elements of S’ that have a negative sign. Let s be the cost of this step.

• Output s + u

• improvedDC(S) is an O(1) approximation algorithm for unsigned sorting of linear 0/1 sequences when . (Not supplied and not proved in this presentation.)
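The control flow of signedImprovedDC can be sketched in runnable form under several loudly stated assumptions. improvedDC is not supplied in this presentation, so a plain divide-and-conquer 0/1 sort (`dc_reversals`, a hypothetical stand-in) takes its place, and u is taken to be the total cost of its reversals. A signed reversal is assumed to reverse the order and flip every sign in the segment, and the final step is assumed to fix each negative element with its own length-1 reversal. Elements are modeled as (bit, sign) tuples with sign +1 or -1.

```python
def dc_reversals(bits, lo=0, hi=None, ops=None):
    """Stand-in for improvedDC: sort bits[lo:hi] in place, return reversals as (start, end)."""
    if hi is None:
        hi = len(bits)
    if ops is None:
        ops = []
    if hi - lo <= 1:
        return ops
    mid = (lo + hi) // 2
    dc_reversals(bits, lo, mid, ops)
    dc_reversals(bits, mid, hi, ops)
    start = lo + bits[lo:mid].count(0)       # first 1 of the sorted left half
    end = mid + bits[mid:hi].count(0)        # one past the last 0 of the sorted right half
    if start < mid < end:
        bits[start:end] = bits[start:end][::-1]
        ops.append((start, end))
    return ops


def signed_improved_dc(signed, cost=lambda length: length):
    """Follow the listed steps; return (sorted all-positive sequence, s + u)."""
    unsigned = [bit for bit, _ in signed]                  # U <- unsign(S)
    ops = dc_reversals(unsigned)                           # reversals that sort U
    u = sum(cost(end - start) for start, end in ops)
    s_prime = list(signed)
    for start, end in ops:                                 # mimic each reversal on S
        segment = s_prime[start:end]
        s_prime[start:end] = [(bit, -sign) for bit, sign in reversed(segment)]
    s = 0
    for idx, (bit, sign) in enumerate(s_prime):            # fix remaining negative signs
        if sign < 0:
            s_prime[idx] = (bit, 1)
            s += cost(1)
    return s_prime, s + u


print(signed_improved_dc([(1, 1), (0, -1), (1, -1), (0, 1)]))
```

Because the stand-in sorter is not improvedDC, this sketch illustrates the structure of the algorithm only and carries no approximation guarantee.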

Summary – What We’ve Seen (1)

• The introduction of Length Weighted Models for Sorting By Reversals.

• Incentive:

• Unit cost isn’t biologically defensible.

• Experiments show that length weighted models may help substantially in biasing between two evolutionary paths.

• Lower and Upper Bounds on sorting with additive cost functions.

• Upper Bounds: For any given permutation.

• Lower Bounds: For a specific permutation p.

Summary – What We’ve Seen (2)

• Improved Bounds on Cost of Length Weighted Sorting By Reversals:

• Dealing with a wider range of functions $f(l) = l^\alpha$.

• Improved Upper Bounds on sorting unsigned 0/1 sequences and permutations for all values of $\alpha$.

• Improved Lower Bound for the case of $\alpha = 1$.

• An improvement from something we’ve already seen.

Summary – What We’ve Seen (3)

• Sorting By Reversals by Length Weighted Models – Dealing with Signs and Circularity:

• Still with the same family of functions.

• Lower Bounds for circular unsigned and linear signed 0/1 sequences.

• Approximation algorithm for the sorting of linear signed 0/1 sequences.

• Many lemmas, theorems and corollaries 

Questions Raised

• Aside from what hasn’t been covered in this presentation (which is, other than more bounds and approximation algorithms, another gargantuan set of lemmas, theorems and corollaries) there are many questions left open.

• What is the right cost function, or what are the right cost functions for various types of sequences?

• Is the family of functions presented in this presentation large enough to contain the right one(s)?

• Is the real cost function defined differently over different ranges? Should it be species specific?

• Should we include more data (other than length) when computing a reversal’s cost, e.g. the position of the reversal or the sequences being reversed?

And least but not last… (as far as this presentation goes)

• Questions?

• Comments!?

Bibliography

• Pinter, R.Y., and Skiena, S., "Sorting with length-weighted reversals", Proceedings of the 13th International Conference on Genome Informatics (GIW 2002), December 2002, pp. 103-111.

• M. A. Bender, D. Ge, S. He, H. Hu, R. Y. Pinter, S. Skiena, and F. Swidan, "Improved Bounds on Sorting with Length-Weighted Reversals (Extended Abstract)", Proceedings of the 15th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2004, pp. 912-921.

• F. Swidan, M. A. Bender, D. Ge, S. He, H. Hu, and R. Pinter, "Sorting by length-weighted reversals: Dealing with signs and circularity", Proceedings of the 15th Annual Symposium on Combinatorial Pattern Matching (CPM), Lecture Notes in Computer Science (LNCS), Vol. 3109, July 2004, pp. 32-46.