Tuning Algorithms for Jumbled Matching
Tuning Algorithms for Jumbled Matching • Tamanna Chhabra, Sukhpal Singh Ghuman, Jorma Tarhio
Jumbled matching • An interesting variation of string matching. • The task is to find all substrings of T that are permutations of P. • For example: P = abcb occurs in T = aababcaabc as the substring babc.
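The problem statement on this slide can be illustrated with a brute-force sketch. It is not one of the algorithms from the talk; the function name `jumbled_matches` is made up for illustration, and it simply compares character counts of every window with those of the pattern.

```python
from collections import Counter

def jumbled_matches(pattern, text):
    """Return the start indices of all windows of text that are permutations of pattern."""
    m = len(pattern)
    target = Counter(pattern)
    # Compare the character counts of every length-m window with those of the pattern.
    return [i for i in range(len(text) - m + 1)
            if Counter(text[i:i + m]) == target]

print(jumbled_matches("abcb", "aababcaabc"))  # [2]
```

The single hit at index 2 is the window babc. This baseline takes O(nm) time and is useful mainly as a reference when checking faster variants.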
Jumbled matching • Parikh vector: the pattern can be described by its Parikh vector. • This is the vector of multiplicities of the characters. • p(S) = (1,2,1,0) for S = abcb over the alphabet {a,b,c,d}.
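A short sketch of the Parikh vector from this slide, assuming the alphabet is given as an ordered string; the helper name `parikh_vector` is hypothetical.

```python
def parikh_vector(s, alphabet):
    """Return the multiplicity of each alphabet character in s, in alphabet order."""
    return tuple(s.count(c) for c in alphabet)

print(parikh_vector("abcb", "abcd"))  # (1, 2, 1, 0)
```

Two strings are permutations of each other exactly when their Parikh vectors are equal, so `parikh_vector("bbca", "abcd")` gives the same tuple.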
Approximate Permutation Matching • A string P´ is a k-approximate permutation of P, 0 <= k < m, if |P´| = |P| = m and the condition of the definition holds. • Here set(P´) is the set of characters in P´, and cc(u,c) is the number of occurrences of a character c in a string u.
Motivation • Alignment of strings • SNP discovery • Discovery of repeated patterns • Interpretation of mass spectrometry data
Previous Algorithms • Key idea: scan the text forward while maintaining counts of characters. • These algorithms work in linear time. • They were developed as filtration methods for online approximate string matching.
Previous Algorithms • Grossi & Luccio’s (Information Processing Letters 1989) and Navarro’s (Proc. WSP 1997) solutions are based on the frequency of characters. • Navarro’s counting algorithm - sliding window approach.
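The sliding-window counting idea can be sketched as follows. This is a generic illustration, not a transcription of Navarro's code: the names `sliding_count_matches`, `need`, and `missing` are invented here, and the treatment of k > 0 (a window is reported when at most k pattern characters remain uncovered) is an assumption about how the counter is used as a filter.

```python
from collections import defaultdict

def sliding_count_matches(pattern, text, k=0):
    """Report window starts where at most k pattern characters are left uncovered."""
    m, n = len(pattern), len(text)
    if m == 0 or n < m:
        return []
    # need[c] = occurrences of c the window still lacks (negative means surplus).
    need = defaultdict(int)
    for c in pattern:
        need[c] += 1
    missing = m  # pattern characters not yet covered by the current window
    matches = []
    for j, c in enumerate(text):
        # Character entering the window on the right.
        if need[c] > 0:
            missing -= 1
        need[c] -= 1
        # Character leaving the window on the left.
        if j >= m:
            out = text[j - m]
            need[out] += 1
            if need[out] > 0:
                missing += 1
        if j >= m - 1 and missing <= k:
            matches.append(j - m + 1)
    return matches

print(sliding_count_matches("abcb", "aababcaabc"))  # [2]
```

Each text character is touched twice, once entering and once leaving, so the scan is linear in the text length regardless of k.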
Previous Algorithms • Grossi and Luccio’s (Information Processing Letters 1989) solution maintains a queue of characters. • The queue grows with the acceptable characters. • Navarro presented Mcount for multiple patterns (Proc. WSP 1997).
Previous Algorithms • Cantone and Faro (Proc. PSC 2014) presented the BAM algorithm (Bit-parallel Abelian Matcher). • It associates a counter (bin) with each distinct character in P. • A single 1-bit counter serves the remaining characters of the alphabet.
Previous Algorithms • At the start of processing a window, every overflow bit is zero. • The 1-bit counter reserved for all the characters not occurring in P is initially zero. • It gets set as soon as any character not in P is encountered in the text window. • At that point it is clear that the text window cannot be a permutation of the pattern P.
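Bit-parallel simulation • [Figure: counter fields for P = abbccc, one each for c, b, and a, plus one for all other characters.]
Initialization of the state vector • [Figure: initial field values for P = abbccc, with fields for c, b, a, and all other characters.]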
New solutions • Solutions for both exact and approximate jumbled matching. • We present two algorithms that are modifications of BAM. • ABAM (approximate BAM). • BAM2 (enhanced BAM with 2-grams).
Key Idea: Counters • We use bit fields to store counters. • There is one counter for each character that appears in the pattern. • There is one counter for all other characters. • The highest bit of each field is an overflow indicator. • Each field has room for the number of times the character appears in the pattern plus the maximum error count k.
State Vector D • Counters are stored in the state vector D. • If they do not fit in one word, several different characters can share one field, but then matches must be verified. • Initial values of D are fetched from a precomputed word. • Each text character tj is processed using the array M[tj], which has a one in the field for tj. • The value of D is updated by D ← D + M[tj].
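The counter layout and the update D ← D + M[tj] can be sketched for the exact case (k = 0). This is a simplified illustration under several assumptions: Python integers stand in for a machine word, so bins are never shared; every window is rescanned from scratch, without the shifting that a tuned implementation would use; and the names `build_tables`, `bam_like_matches`, `inc`, and `other` are hypothetical.

```python
def build_tables(pattern):
    """Lay out one bit field per distinct pattern character plus a 1-bit field for all others."""
    counts = {}
    for c in pattern:
        counts[c] = counts.get(c, 0) + 1
    init = 0       # initial value of the state vector D
    overflow = 0   # mask of the overflow (highest) bit of every field
    inc = {}       # inc[c] plays the role of M[c]: a one in the lowest bit of c's field
    shift = 0
    for c, cnt in counts.items():
        width = cnt.bit_length() + 1          # value bits for 0..cnt plus one overflow bit
        high = 1 << (width - 1)
        init |= (high - 1 - cnt) << shift     # cnt increments fill the field, one more overflows
        overflow |= high << shift
        inc[c] = 1 << shift
        shift += width
    other = 1 << shift                        # 1-bit field for all non-pattern characters
    overflow |= other
    return init, overflow, inc, other

def bam_like_matches(pattern, text):
    """Return window starts where no counter overflows, i.e. exact jumbled matches."""
    m = len(pattern)
    init, overflow, inc, other = build_tables(pattern)
    matches = []
    for i in range(len(text) - m + 1):
        d = init
        for j in range(i, i + m):
            d += inc.get(text[j], other)      # D <- D + M[t_j]
            if d & overflow:                  # some character occurs too often: stop early
                break
        else:
            matches.append(i)                 # m characters consumed without overflow
    return matches

print(bam_like_matches("abcb", "aababcaabc"))  # [2]
```

Because the window has exactly m characters, the absence of any overflow means every count equals the pattern's count. Stopping at the first overflow also keeps the 1-bit field from carrying into a neighbour.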
Initialization of state vector D and M[ ] for pattern P = abbccc • [Figure: fields for x (all other characters), c, b, and a; rows I, M[a], M[b], M[c], and M[x].]
Variations of BAM • BAMs • Some bins are shared if necessary. • If bins are shared, each match candidate needs to be verified. • BAM2 • Handles two text characters (a 2-gram) at a time. • Separate loops for patterns of even and odd length. • Reads four characters before testing D for the first time. • Hence the minimum width of a field is four bits instead of two.
ABAM • ABAM: Approximate BAM. • C is the error counter. • F[tj] is a mask for testing overflow bits.
EBL (Exact Backward for Large alphabets) • EBL is based on SBNDM2. • Instead of representing occurrence vectors, array B states whether a character is present in the pattern. • Acceptable characters are those that appear in the pattern. • When the alignment window contains only acceptable characters, the window is a match candidate. • The update step is simply D = D & B[ti+j-1].
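The filtering idea behind EBL can be sketched without bit-parallelism. This is a simplification, not the SBNDM2-based code from the talk: a plain set stands in for array B, the window is scanned backward one character at a time, and candidates are verified with character counts. The name `ebl_like_matches` is hypothetical.

```python
from collections import Counter

def ebl_like_matches(pattern, text):
    """Scan each window backward; skip past unacceptable characters, verify candidates."""
    m, n = len(pattern), len(text)
    if m == 0 or n < m:
        return []
    acceptable = set(pattern)      # stands in for array B
    target = Counter(pattern)
    matches = []
    i = 0
    while i <= n - m:
        j = i + m - 1
        while j >= i and text[j] in acceptable:
            j -= 1
        if j >= i:
            i = j + 1              # no window containing text[j] can match
        else:
            # All m characters are acceptable: the window is a match candidate.
            if Counter(text[i:i + m]) == target:
                matches.append(i)
            i += 1
    return matches

print(ebl_like_matches("abcb", "xbabcxbcab"))  # [1, 6]
```

On large alphabets most windows contain an unacceptable character near the right end, so the scan can skip many positions at a time. That is presumably why this approach is aimed at large alphabets.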
EFS (Exact Forward for Small alphabets) and AFL (Approximate Forward for Large alphabets) • EFS: the update step is D ← D + M[ti] - M[ti-m]. • AFL is a modification of Mcount tuned for a single pattern. • It uses a different initial value of the counter.
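The EFS update step can be sketched with packed counters that slide along the text. Several details here are assumptions rather than facts from the talk: each field is made wide enough to count up to m so that no field can carry into its neighbour, a match is detected by comparing D with the packed Parikh vector of the pattern, the text contains only characters of the given alphabet, and Python integers replace a machine word. The name `efs_like_matches` is hypothetical.

```python
def efs_like_matches(pattern, text, alphabet):
    """Slide packed per-character counters over text; alphabet must cover every text character."""
    m, n = len(pattern), len(text)
    if m == 0 or n < m:
        return []
    width = m.bit_length()                     # enough bits to count 0..m in each field
    inc = {c: 1 << (idx * width) for idx, c in enumerate(alphabet)}   # M[c]
    target = sum(inc[c] for c in pattern)      # packed Parikh vector of the pattern
    d = sum(inc[c] for c in text[:m])          # packed counts of the first window
    matches = [0] if d == target else []
    for i in range(m, n):
        d += inc[text[i]] - inc[text[i - m]]   # D <- D + M[t_i] - M[t_{i-m}]
        if d == target:
            matches.append(i - m + 1)
    return matches

print(efs_like_matches("abcb", "aababcaabc", "abc"))  # [2]
```

With a small alphabet such as DNA, all fields fit comfortably in one word, and each text position costs one addition, one subtraction, and one comparison.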
ABS (Approximate Backward for Small Alphabets) • The error count C is updated without conditional code by shifting the corresponding overflow bit to the lowest bit and then masking it. • The shift uses array o[ ], which contains the positions of the overflow bits.
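The branch-free error update can be sketched for a single window. This is an illustration under assumptions, not the ABS code: every field is made wide enough for m increments so an overflow bit stays set once raised, a window is taken to be no longer than the pattern, and an "error" is taken to mean a window character beyond the count the pattern allows. The names `count_errors`, `inc`, and `other_o` are hypothetical; `o` mirrors the array named on the slide.

```python
def count_errors(pattern, window):
    """Count surplus characters in window (len(window) <= len(pattern)) without branching."""
    m = len(pattern)
    width = m.bit_length() + 1           # value bits for up to m increments plus an overflow bit
    high = 1 << (width - 1)
    chars = sorted(set(pattern))
    d = 0
    inc, o = {}, {}                      # inc[c] ~ M[c]; o[c] = position of c's overflow bit
    for idx, c in enumerate(chars):
        shift = idx * width
        d |= (high - 1 - pattern.count(c)) << shift   # overflows after count(c) + 1 increments
        inc[c] = 1 << shift
        o[c] = shift + width - 1
    other_shift = len(chars) * width     # field shared by all non-pattern characters
    d |= (high - 1) << other_shift       # overflows on the first such character
    other_inc, other_o = 1 << other_shift, other_shift + width - 1
    errors = 0
    for t in window:
        d += inc.get(t, other_inc)                    # D <- D + M[t]
        errors += (d >> o.get(t, other_o)) & 1        # shift overflow bit down, mask, add to C
    return errors

print(count_errors("abcb", "abab"))  # 1
```

Under the stated assumption, a window would be accepted as a k-approximate match when the returned count is at most k. The loop body contains no comparison on the counter itself; the overflow bit is added to the error count directly.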
Experimental Results • English data • BAM2a is more than twice as fast as the previous algorithms. • DNA data • EFS runs at double the speed of the previous algorithms. • Protein data • BAM2a is the fastest and takes less than half the time of the previous algorithms.
Concluding remarks • We introduced new variations of jumbled matching algorithms. • All the forward algorithms are clearly linear. • The speed of AFL does not depend on the value of k. • The technique of shared bins proved useful for jumbled matching.