
Fast Edit Distance and LCS Calculation Using Dominance Relationships

by Pascual Oliu


Presentation Transcript


  1. Fast Edit Distance and LCS Calculation Using Dominance Relationships, by Pascual Oliu

  2. Introduction
     • Edit Distance and LCS calculation are important problems
       • Computational Biology
       • diff
       • Pattern Matching
     • There are a few different algorithms for solving them
       • Some are inefficient but simple
       • Some are efficient but complex
     • Dr. Papamichail is working on a new algorithm for this
       • Complex and efficient
       • Theoretically superior to previous algorithms
     • Want to study its practical usefulness
       • Use the Edit Distance solution to solve the LCS problem
       • Implement all algorithms as efficiently as possible
       • Compare performance to other algorithms

  3. What is the problem?
     • Want to know how similar two strings are
     • Edit Distance helps us measure similarity
       • Minimum number of steps to change one string into the other
       • The operations:
         • Insertion and Deletion ("Indels")
         • Substitution
         • Others possible, but unnecessary and uncommon
       • The smaller the distance, the more similar the strings
     • LCS is another measure of similarity
       • Length of the Longest Common Subsequence
       • Sequence of characters in the same order in both
       • Similar to edit distance, but without substitutions
       • The longer the LCS, the more similar the two strings
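The two definitions on this slide can be written down directly as recursions. The following is a minimal sketch for short strings only; the function names `edit_distance` and `lcs_length` are illustrative and do not come from the presentation.

```python
from functools import lru_cache


def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""

    @lru_cache(maxsize=None)
    def d(i: int, j: int) -> int:
        # d(i, j) is the distance between the prefixes a[:i] and b[:j].
        if i == 0:
            return j  # insert the remaining j characters
        if j == 0:
            return i  # delete the remaining i characters
        if a[i - 1] == b[j - 1]:
            return d(i - 1, j - 1)  # matching characters cost nothing
        return 1 + min(d(i - 1, j),      # deletion
                       d(i, j - 1),      # insertion
                       d(i - 1, j - 1))  # substitution

    return d(len(a), len(b))


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""

    @lru_cache(maxsize=None)
    def l(i: int, j: int) -> int:
        if i == 0 or j == 0:
            return 0
        if a[i - 1] == b[j - 1]:
            return 1 + l(i - 1, j - 1)
        return max(l(i - 1, j), l(i, j - 1))

    return l(len(a), len(b))
```

For "kitten" and "sitting", the edit distance is 3 and the LCS ("ittn") has length 4. The recursion depth grows with the string lengths, so this form is only suitable for small inputs.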

  4. Older Algorithms
     • Needleman-Wunsch
       • The most basic algorithm
       • Uses Dynamic Programming
         • Cheapest route to each cell
         • Solution is bottom-right cell
       • O(mn) time (m and n are the string lengths)
       • O(mn) space, O(m+n) possible
     • Ukkonen's Algorithm
       • Based on diagonals
       • Expanding diagonal bands
       • Score-based iterations
       • Eliminates many unnecessary cells
       • O(s * min(m, n)) time (s is the edit distance)
       • O(min(s,m,n)) space
       • Added complexity
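The two older approaches can be sketched as follows. This is a simplified illustration, not the code measured in the presentation: `nw_distance` is the plain dynamic program keeping only two rows, and `ukkonen_style_distance` is a threshold-doubling band in the spirit of Ukkonen's idea rather than his exact algorithm. All function names are assumptions made for this example.

```python
def nw_distance(a: str, b: str) -> int:
    """Full dynamic program, keeping only two rows of the matrix."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the shorter string along the row
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # match or substitution
        prev = cur
    return prev[-1]  # the bottom-right cell


def banded_distance(a: str, b: str, t: int):
    """Return the distance if it is at most t, else None.

    Only cells with |i - j| <= t are filled; any cell outside that band
    has a distance greater than t and is treated as t + 1.
    """
    m, n = len(a), len(b)
    if abs(m - n) > t:
        return None
    inf = t + 1
    prev = {j: j for j in range(min(n, t) + 1)}
    for i in range(1, m + 1):
        cur = {}
        for j in range(max(0, i - t), min(n, i + t) + 1):
            if j == 0:
                cur[j] = i
                continue
            cur[j] = min(prev.get(j - 1, inf) + (a[i - 1] != b[j - 1]),
                         prev.get(j, inf) + 1,
                         cur.get(j - 1, inf) + 1)
        prev = cur
    d = prev.get(n, inf)
    return d if d <= t else None


def ukkonen_style_distance(a: str, b: str) -> int:
    """Widen the band until the answer fits inside it."""
    t = max(1, abs(len(a) - len(b)))
    while True:
        d = banded_distance(a, b, t)
        if d is not None:
            return d
        t *= 2
```

When the strings are similar, the band stays narrow and most of the matrix is never touched, which is where the savings over the full dynamic program come from.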

  5. Papamichail Algorithm
     • Uses a linked list instead of a matrix
     • Uses new scoring system
     • Uses idea of dominance
       • One cell dominates previous cells
       • Removes dominated cells from list
     • Cell A dominates Cell B if:
       • No path through B is cheaper than a path through A
       • When A is in list, we don't have to consider B again
     • Score-based iteration
       • Add all cells that can be reached from cells in the list
       • Remove any newly dominated nodes from the list
     • Avoids many unnecessary computations
     • O(min(m,n,s)^2 + m + n) time
     • O(m+n) space

  6. LCS Algorithm
     • Can modify these algorithms to calculate length of LCS
       • |LCS| = (m+n - s') / 2
       • s' is the edit distance with no substitutions allowed
     • Needleman-Wunsch and Ukkonen Algorithms
       • Simply disallow substitutions
       • Do not change much
     • Papamichail Algorithm
       • Remember, this algorithm uses the following scoring:
         • 2 for indels in one direction, 0 for the other
         • 1 for substitutions
       • If we eliminate substitutions, s' must be a multiple of 2
       • We can increment by two each iteration
       • Only half the original iterations are necessary
       • Makes this much faster!
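The identities on this slide can be checked with an ordinary weighted dynamic program. The sketch below is not the Papamichail algorithm; it only evaluates the scoring schemes. The slide does not say which indel direction costs 2, so charging 2 for insertions and 0 for deletions is an assumption, as are the names `weighted_distance` and `check_identities`.

```python
def weighted_distance(a: str, b: str, ins: int, dele: int, sub):
    """Cheapest cost of turning a into b; sub=None disallows substitutions."""
    n = len(b)
    prev = [j * ins for j in range(n + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * dele]
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                diag = prev[j - 1]           # match, no cost
            elif sub is not None:
                diag = prev[j - 1] + sub     # substitution
            else:
                diag = float("inf")          # substitutions not allowed
            cur.append(min(diag, prev[j] + dele, cur[j - 1] + ins))
        prev = cur
    return prev[n]


def check_identities(a: str, b: str) -> dict:
    m, n = len(a), len(b)
    s = weighted_distance(a, b, 1, 1, 1)           # ordinary edit distance
    s_prime = weighted_distance(a, b, 1, 1, None)  # indels only
    return {
        "s": s,
        "s_prime": s_prime,
        "lcs": (m + n - s_prime) // 2,             # |LCS| = (m + n - s') / 2
        # Assumed direction: insertions cost 2, deletions cost 0.
        "skewed": weighted_distance(a, b, 2, 0, 1),
        "skewed_no_sub": weighted_distance(a, b, 2, 0, None),
    }
```

Under the assumed direction, every alignment has (insertions - deletions) = n - m, so the skewed score equals the ordinary score plus n - m. With substitutions removed, the skewed score is twice the number of insertions, which is why it is always even and the iteration can advance by two. For "kitten" and "sitting" this gives s = 3, s' = 5, |LCS| = 4, and skewed scores of 4 and 6.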

  7. Efficient Implementation of the Papamichail Algorithm
     • Doubly or Singly Linked List?
       • Originally used a singly linked list
       • Modified it to use a doubly linked list
     • Efficient allocation
       • Linked list constantly allocating and deallocating nodes
       • Have to wait for the system to manage memory
       • This is wasted time!
     • Instead, we can use a stack
       • Stack of pre-allocated nodes
       • Allocation and deallocation become pop() and push()
     • Two styles
       • Allocate worst-case number of nodes, or
       • Allocate a new node every time stack is empty
       • Both reduce time spent waiting for the system
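The stack-of-nodes idea might look like the sketch below. The `Node` fields and the `NodePool` class are hypothetical and not taken from the presentation. In Python this only illustrates the bookkeeping; the time savings described on the slide would apply in a language where each node allocation goes through the system allocator.

```python
class Node:
    """A doubly linked list node; the fields are assumed for illustration."""
    __slots__ = ("row", "col", "prev", "next")

    def __init__(self):
        self.row = self.col = 0
        self.prev = self.next = None


class NodePool:
    """Stack of reusable nodes.

    preallocate > 0 corresponds to the "allocate worst-case number of nodes"
    style; preallocate = 0 creates a node only when the stack is empty.
    """

    def __init__(self, preallocate: int = 0):
        self._free = [Node() for _ in range(preallocate)]
        self.allocations = preallocate  # number of nodes ever created

    def acquire(self, row: int, col: int) -> Node:
        if self._free:
            node = self._free.pop()     # reuse a node: allocation is a pop
        else:
            node = Node()               # stack empty: fall back to a real allocation
            self.allocations += 1
        node.row, node.col = row, col
        node.prev = node.next = None
        return node

    def release(self, node: Node) -> None:
        self._free.append(node)         # deallocation is a push
```

A list that removes dominated cells would call `release` on each removed node and `acquire` for each newly reached cell, so nodes cycle through the pool instead of being created and destroyed on every iteration.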

  8. Results
     • Time performance with strings of length 15000 and 10000

  9. Results
     • Time performance calculating LCS with two strings, both of length 20000

  10. Conclusion
      • Efficiently implemented, the Papamichail algorithm
        • Excels at comparing strings of different lengths
        • Excels at calculating the length of the LCS between two strings
        • Still trails Ukkonen's algorithm for calculating edit distance between strings of identical length
      • Further research might explore
        • Using the LCS calculation to speed up edit distance calculation
        • Other LCS-specific algorithms, to compare performance

  11. Fast Edit Distance and LCS Calculation Using Dominance Relationships, by Pascual Oliu
      • Any Questions?
