Fast Edit Distance and LCS Calculation Using Dominance Relationships
by Pascual Oliu
Introduction
• Edit Distance and LCS Calculation are important problems
  • Computational Biology
  • diff
  • Pattern Matching
• There are a few different algorithms for solving them
  • Some are inefficient but simple
  • Some are efficient but complex
• Dr. Papamichail is working on a new algorithm for this
  • Complex and efficient
  • Theoretically superior to previous algorithms
• Want to study its practical usefulness
  • Use the Edit Distance solution to solve LCS problem
  • Implement all algorithms as efficiently as possible
  • Compare performance to other algorithms
What is the problem?
• Want to know how similar two strings are
• Edit Distance helps us measure similarity
  • Minimum number of steps to change one to the other
  • The operations:
    • Insertion and Deletion ("Indels")
    • Substitution
    • Others possible, but unnecessary and uncommon
  • The smaller the distance, the more similar the strings
• LCS is another measure of similarity
  • Length of the Longest Common Subsequence
  • Sequence of characters in the same order in both
  • Similar to edit distance, but without substitutions
  • The longer the LCS, the more similar the two strings
Older Algorithms
(m and n are the string lengths; s is the edit distance)
• Needleman-Wunsch
  • The most basic algorithm
  • Uses Dynamic Programming
  • Cheapest route to each cell
  • Solution is bottom-right cell
  • O(mn) time
  • O(mn) space, O(m+n) possible
• Ukkonen's Algorithm
  • Based on diagonals
  • Expanding diagonal bands
  • Score-Based iterations
  • Eliminates many unnecessary cells
  • O(s * min(m, n)) time
  • O(min(s,m,n)) space
  • Added Complexity
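The two older approaches can be sketched in Python as follows. This is a minimal illustration, not the implementation benchmarked in the talk. `edit_distance` fills the dynamic-programming table row by row, keeping only the previous row. `banded_distance` and `edit_distance_doubling` show the band idea in its simplest textbook form, a fixed band of diagonals with a threshold that doubles until the answer fits, which is an assumed simplification of Ukkonen's score-based iteration. All function names are hypothetical.

```python
INF = float("inf")

def edit_distance(a: str, b: str) -> int:
    """Full dynamic program over all cells, keeping only the previous row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # match or substitute
        prev = cur
    return prev[-1]

def banded_distance(a: str, b: str, t: int):
    """Same recurrence restricted to cells with |i - j| <= t.

    Returns the distance if it is at most t, otherwise None.
    """
    m, n = len(a), len(b)
    if abs(m - n) > t:
        return None                      # the end cell lies outside the band
    prev = {j: j for j in range(min(n, t) + 1)}
    for i in range(1, m + 1):
        cur = {}
        for j in range(max(0, i - t), min(n, i + t) + 1):
            best = prev.get(j, INF) + 1
            if j > 0:
                best = min(best,
                           cur.get(j - 1, INF) + 1,
                           prev.get(j - 1, INF) + (a[i - 1] != b[j - 1]))
            cur[j] = best
        prev = cur
    d = prev.get(n, INF)
    return d if d <= t else None

def edit_distance_doubling(a: str, b: str) -> int:
    """Widen the band until the distance fits inside it."""
    t = max(1, abs(len(a) - len(b)))
    while True:
        d = banded_distance(a, b, t)
        if d is not None:
            return d
        t *= 2
```

When the two strings are similar, the band stays narrow and most cells of the table are never touched, which is where the savings over the full table come from.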
Papamichail Algorithm
• Uses a linked list instead of a matrix
• Uses new scoring system
• Uses idea of dominance
  • One cell dominates previous cells
  • Removes dominated cells from list
• Cell A dominates Cell B if:
  • No path through B is cheaper than a path through A
  • When A is in list, we don't have to consider B again
• Score-based iteration
  • Add all cells that can be reached from cells in the list
  • Remove any newly dominated nodes from the list
• Avoids many unnecessary computations
• O(min(m,n,s)^2 + m + n) time
• O(m+n) space
LCS Algorithm
• Can modify these algorithms to calculate length of LCS
  • |LCS| = (m+n - s') / 2
  • s' is the edit distance with no substitutions allowed
• Needleman-Wunsch and Ukkonen Algorithms
  • Simply disallow substitutions
  • Do not change much
• Papamichail Algorithm
  • Remember, this algorithm uses the following scoring:
    • 2 for indels in one direction, 0 for the other
    • 1 for substitutions
  • If we eliminate substitutions, s' must be a multiple of 2
  • We can increment by two each iteration
  • Only half the original iterations are necessary
  • Makes this much faster!
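The formula |LCS| = (m+n - s') / 2 and the parity claim can be checked with a small dynamic program. The sketch below is a plain table-based check, not the Papamichail algorithm itself. The function `weighted_distance` and its parameters are hypothetical, and which indel direction costs 2 is an assumption here, since the slides do not say. Under that assumption every indel-only path costs twice its number of insertions, so the score is always even and equals s' + (n - m).

```python
def weighted_distance(a, b, ins, dele, sub):
    """Cheapest edit script turning a into b under the given costs.

    ins  = cost of inserting a character of b
    dele = cost of deleting a character of a
    sub  = cost of a substitution, or None to forbid substitutions
    Matching characters always cost 0.
    """
    prev = [j * ins for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        cur = [i * dele]
        for j, cb in enumerate(b, start=1):
            best = min(prev[j] + dele, cur[j - 1] + ins)
            if ca == cb:
                best = min(best, prev[j - 1])          # free match
            elif sub is not None:
                best = min(best, prev[j - 1] + sub)    # substitution
            cur.append(best)
        prev = cur
    return prev[-1]

def lcs_length(a, b):
    """|LCS| = (m + n - s') / 2, with s' the indel-only edit distance."""
    s_prime = weighted_distance(a, b, ins=1, dele=1, sub=None)
    return (len(a) + len(b) - s_prime) // 2

a, b = "AGGTAB", "GXTXAYB"
s_prime = weighted_distance(a, b, 1, 1, None)     # unit-cost indels only
skewed = weighted_distance(a, b, 2, 0, None)      # assumed 2/0 indel scoring
print(lcs_length(a, b), s_prime, skewed)          # prints: 4 5 6
```

For this pair, m = 6 and n = 7, so the skewed score 6 equals s' + (n - m) = 5 + 1. Because the skewed score only takes even values, a score-based iteration can step by two.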
Efficient Implementation of the Papamichail Algorithm
• Doubly or Singly Linked List?
  • Originally used a singly linked list
  • Modified it to use a doubly linked list
• Efficient allocation
  • Linked list constantly allocating and deallocating nodes
  • Have to wait for the system to manage memory
  • This is wasted time!
  • Instead, we can use a stack
    • Stack of pre-allocated nodes
    • Allocation and deallocation become pop() and push()
  • Two styles
    • Allocate worst-case number of nodes, or
    • Allocate a new node every time stack is empty
  • Both reduce time spent waiting for the system
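The stack-of-nodes idea can be sketched as a small free-list pool. This is an illustration in Python under assumed names (`Node`, `NodePool`, `acquire`, `release` are all hypothetical); the talk's implementation language and node layout are not given in the slides. Passing a capacity pre-allocates nodes up front (the first style), while a capacity of 0 creates a node only when the stack is empty (the second style).

```python
class Node:
    """Doubly linked list node; the fields are assumed for illustration."""
    __slots__ = ("value", "prev", "next")

    def __init__(self):
        self.value = None
        self.prev = None
        self.next = None

class NodePool:
    """Free nodes are kept on a stack so they can be reused."""

    def __init__(self, capacity=0):
        # Style 1: pre-allocate `capacity` nodes (e.g. a worst-case count).
        self._free = [Node() for _ in range(capacity)]
        self.created = capacity

    def acquire(self, value):
        """Allocation: pop a free node, or create one if the stack is empty."""
        if self._free:
            node = self._free.pop()
        else:
            node = Node()            # Style 2: grow only on demand
            self.created += 1
        node.value = value
        node.prev = node.next = None
        return node

    def release(self, node):
        """Deallocation: clear the node and push it back on the stack."""
        node.value = None
        node.prev = node.next = None
        self._free.append(node)

pool = NodePool(capacity=2)
a = pool.acquire("cell A")
b = pool.acquire("cell B")
pool.release(a)                      # a goes back on the stack
c = pool.acquire("cell C")           # reuses the node that held "cell A"
print(c is a, pool.created)          # prints: True 2
```

Either style means a released node is reused rather than handed back to the memory manager, so the list can add and remove cells without waiting on the system each time.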
Results
Time performance with strings of length 15000 and 10000
Results
Time performance calculating LCS with two strings, both of length 20000
Conclusion
• Efficiently implemented, the Papamichail algorithm
  • Excels at comparing strings of different lengths
  • Excels at calculating the length of the LCS between two strings
  • Still trails Ukkonen's algorithm for calculating edit distance between strings of identical length
• Further research might explore
  • Using the LCS Calculation to speed Edit Distance Calculation
  • Other LCS-specific algorithms, to compare performance
Any Questions?