Efficient Computation of Substring Equivalence Classes with Suffix Arrays

Kazuyuki Narisawa, Shunsuke Inenaga, Hideo Bannai and Masayuki Takeda
Kyushu University, Japan

Contents
• Introduction
• Problem definition
• Suffix tree based algorithm
• Simulation by suffix array
• Computational experiment
• Application
• Summary

Main contribution
Time- and space-efficient computation of substring equivalence classes [Blumer et al. 1987] with suffix arrays
• Runs in linear time and space
• Is faster and requires less memory than suffix tree and CDAWG based methods

Equivalence relation and classes
Given a string w, the maximal extension of a substring x is ext(x) = αxβ, where
• every time x occurs in w, it is preceded by α and followed by β;
• the strings α and β are the longest possible.
Equivalence relation: x ≡ y ⇔ ext(x) = ext(y)
Equivalence class: [x] = { y | y ≡ x }
An equivalence class groups substrings with essentially identical occurrences in w.
Example: w = Betty–bought–a–bit–of–better–butter–and–made–a–better–batter–after–breakfast.
ext(bet) = –better–b, so bet ∈ [–better–b]
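The definition of the maximal extension can be checked on small inputs with a brute-force sketch. This is only an illustration under assumptions: the function names `occurrences` and `maximal_extension` are hypothetical, the running time is far from linear, and ASCII hyphens stand in for the dashes of the example sentence.

```python
def occurrences(w: str, x: str) -> list[int]:
    """Start positions of every (possibly overlapping) occurrence of x in w."""
    return [i for i in range(len(w) - len(x) + 1) if w.startswith(x, i)]


def maximal_extension(w: str, x: str) -> str:
    """Return alpha + x + beta with alpha and beta as long as possible."""
    starts = occurrences(w, x)
    if not x or not starts:
        raise ValueError("x must be a non-empty substring of w")
    length = len(x)
    # Grow alpha: every occurrence must be preceded by the same character.
    while min(starts) > 0 and len({w[i - 1] for i in starts}) == 1:
        starts = [i - 1 for i in starts]
        length += 1
    # Grow beta: every occurrence must be followed by the same character.
    while max(starts) + length < len(w) and len({w[i + length] for i in starts}) == 1:
        length += 1
    return w[starts[0]:starts[0] + length]


sentence = ("Betty-bought-a-bit-of-better-butter-and-"
            "made-a-better-batter-after-breakfast.")
print(maximal_extension(sentence, "bet"))  # -better-b
```

Two substrings are equivalent exactly when this function returns the same string for both, so it can serve as a reference when testing a faster implementation.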

Problem
• Input: a string w of length n
• Output: the equivalence classes on w
• Difficulty: the total number of elements in the equivalence classes (ECs for short) is O(n²).
• Solution: the number of ECs is O(n), and each EC can be succinctly represented in O(1) space.

Succinct representation of the ECs
• representative: the longest element (the maximal extension)
• minimal strings: the elements that fall into another EC when the leftmost or rightmost character is deleted
[x] = Substring(x) ∩ ( ∪_{y is minimal} Superstring(y) )
The elements of [x] can be enumerated from the representative and the minimal strings.
[Figure: example marking the representative and the minimal strings in Betty-bought-a-bit-of-better-butter-and-made-a-better-batter-after-breakfast.]
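The enumeration formula can be read as "all substrings of the representative that contain at least one minimal string". A minimal sketch of that reading follows. The function names are hypothetical, and the sample class is taken from the string ababbbabbc$ used in the suffix tree figures, assuming its minimal strings are ba and abb.

```python
def substrings(s: str) -> set[str]:
    """All non-empty substrings of s."""
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}


def enumerate_class(representative: str, minimal_strings: list[str]) -> set[str]:
    """Substring(x) intersected with the union of Superstring(y) over minimal y."""
    return {s for s in substrings(representative)
            if any(y in s for y in minimal_strings)}


members = enumerate_class("babb", ["ba", "abb"])
print(sorted(members))  # ['abb', 'ba', 'bab', 'babb']
```

The representative plus the minimal strings are therefore enough to recover every member without storing the members themselves.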

Problem
• Input: a string w of length n
• Output: succinct representations of the equivalence classes on w
• Additionally, for each EC we output
  • size (the number of elements in the EC)
  • frequency (the number of occurrences of the elements in the EC)
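For testing purposes, the requested output (representative, size and frequency of each EC) can be produced by a quadratic brute-force reference. This is a sketch under assumptions, not the linear-time method of the talk: `class_statistics` and its helpers are hypothetical names, and every substring is examined explicitly.

```python
from collections import defaultdict


def occurrences(w, x):
    """Start positions of every occurrence of x in w."""
    return [i for i in range(len(w) - len(x) + 1) if w.startswith(x, i)]


def maximal_extension(w, x):
    """Longest alpha + x + beta shared by all occurrences of x in w."""
    starts, length = occurrences(w, x), len(x)
    while min(starts) > 0 and len({w[i - 1] for i in starts}) == 1:
        starts, length = [i - 1 for i in starts], length + 1
    while max(starts) + length < len(w) and len({w[i + length] for i in starts}) == 1:
        length += 1
    return w[starts[0]:starts[0] + length]


def class_statistics(w):
    """Map each representative to (size, frequency) of its equivalence class."""
    members = defaultdict(set)
    for i in range(len(w)):
        for j in range(i + 1, len(w) + 1):
            x = w[i:j]
            members[maximal_extension(w, x)].add(x)
    return {rep: (len(xs), len(occurrences(w, rep)))
            for rep, xs in members.items()}


stats = class_statistics("ababbbabbc$")
print(stats["babb"])  # (4, 2): four members, each occurring twice
```

All substrings that occur only once end up in a single class whose representative is the whole string, which corresponds to the trivial EC formed by the leaves.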

Possible solutions
• Suffix tree [Weiner 1973]
• Compact Directed Acyclic Word Graph (CDAWG) [Blumer et al. 1985]
• ECs can be computed with either data structure in linear time and space.

Suffix tree (with suffix links)
[Figure: suffix tree of ababbbabbc$, leaves numbered 1 to 11.]
Leaves are ignored here because they form a trivial EC.

Equivalence classes on suffix tree
• Equivalence relation: two nodes are equivalent when they are connected by a suffix link and their subtrees have the same number of leaves.
• EC definition: substrings with essentially the same occurrences.
[Figure: suffix tree of ababbbabbc$ with two ECs marked; the visible labels are abb, babb, bab, ba and b.]

Suffix tree algorithm
foreach node v in suffix tree {
    if (node v is representative of EC [v]≡) {
        follow suffix link;
        while (node is in EC [v]≡) {
            follow suffix link;
            compute size and minimal strings;
        }
    }
    output succinct representation of EC [v]≡;
}

Algorithm with suffix tree
[Figure: the traversal annotated on the suffix tree of ababbbabbc$.]
• Representative? Check whether the number of incoming suffix links is 1 or not.
• From a representative, follow the suffix link and check whether the next node is in the same EC (same number of leaves?).
• If the next node is in another EC, go back to the representative and output the representative, minimal strings, size and frequency.
• If the node is not a representative, continue the suffix tree traversal.
• The suffix tree requires a large amount of memory.

Suffix array [Manber and Myers 1993]
• Can simulate traversal of the suffix tree using the lcp and rank arrays [Kasai et al. 2003]
• Can simulate traversal of suffix links using an additional data structure, the suffix link table [Abouelhoda et al. 2004]
Our algorithm simulates traversal of suffix links without the suffix link table.

Suffix array
The suffix array lists the suffixes of the string in lexicographic order.
[Figure: suffix tree of ababbbabbc$ with the suffix array drawn under its leaves.]
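As a concrete instance, the suffix array of the example string can be built by sorting suffix start positions. This is a deliberately naive sketch (sorting whole suffixes, not a linear-time construction). Positions are 0-based here, whereas the figure numbers leaves from 1, and it is assumed that $ sorts before the letters, as it does in ASCII.

```python
def suffix_array(w: str) -> list[int]:
    """Start positions of the suffixes of w in lexicographic order (0-based)."""
    return sorted(range(len(w)), key=lambda i: w[i:])


sa = suffix_array("ababbbabbc$")
print(sa)                   # [10, 0, 2, 6, 1, 5, 4, 3, 7, 8, 9]
print([i + 1 for i in sa])  # 1-based leaf numbers: [11, 1, 3, 7, 2, 6, 5, 4, 8, 9, 10]
```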

Lcp array
lcp[i]: the length of the longest common prefix of the i-th and (i – 1)-th suffixes in the suffix array
[Figure: suffix tree of ababbbabbc$ with the lcp array and the suffix array drawn under its leaves.]

Rank array
rank[SA[i]] = i
[Figure: suffix tree of ababbbabbc$ with the rank array and the suffix array drawn under its leaves.]
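The rank and lcp arrays can both be derived from the suffix array. The sketch below inverts the suffix array to get rank and then fills lcp with the well-known linear-time scan usually attributed to Kasai et al.; whether the talk's implementation uses exactly this scan is an assumption. Indices are 0-based, lcp[0] is set to 0 by convention, and the function names are illustrative.

```python
def rank_array(sa: list[int]) -> list[int]:
    """Inverse of the suffix array: rank[sa[i]] = i."""
    rank = [0] * len(sa)
    for i, start in enumerate(sa):
        rank[start] = i
    return rank


def lcp_array(w: str, sa: list[int]) -> list[int]:
    """lcp[i] = longest common prefix length of suffixes sa[i-1] and sa[i]."""
    n = len(w)
    rank = rank_array(sa)
    lcp = [0] * n
    h = 0  # current match length, reused between consecutive text positions
    for start in range(n):
        r = rank[start]
        if r == 0:
            h = 0
            continue
        prev = sa[r - 1]
        while start + h < n and prev + h < n and w[start + h] == w[prev + h]:
            h += 1
        lcp[r] = h
        if h > 0:
            h -= 1  # dropping the first character loses at most one match
    return lcp


w = "ababbbabbc$"
sa = sorted(range(len(w)), key=lambda i: w[i:])  # naive suffix array
print(rank_array(sa))    # [1, 4, 2, 7, 6, 5, 3, 8, 9, 10, 0]
print(lcp_array(w, sa))  # [0, 0, 2, 3, 0, 4, 1, 2, 2, 1, 0]
```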

Suffix array has less information
Information available during traversal for each data structure, when visiting node v:
Suffix tree
1. label from root to each node
2. label from parent to each node
3. number of leaves in each subtree
4. parent of each node
5. children of each node
6. suffix link of each node
Suffix array
• length of the label from the root to v
• length of the label from the root to the parent of v
• leftmost leaf ID in the subtree rooted at v
• rightmost leaf ID in the subtree rooted at v

Suffix array has less information
[Figure: suffix tree of ababbbabbc$ over its suffix array; for the highlighted node, the label length from the root is 4 and the length of the parent's label from the root is 1.]
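The four pieces of information available for a node can be obtained from the lcp array alone with the usual stack-based enumeration of lcp intervals. The sketch below is one way to do it, under assumptions: all names are hypothetical, indices are 0-based suffix array positions, the arrays are built naively, and the parent's label length is taken as max(lcp[L], lcp[R + 1]) with lcp treated as 0 past the end.

```python
import os


def build(w):
    """Naive suffix, rank and lcp arrays (0-based); fine for short strings."""
    sa = sorted(range(len(w)), key=lambda i: w[i:])
    rank = [0] * len(w)
    for i, start in enumerate(sa):
        rank[start] = i
    lcp = [0] + [len(os.path.commonprefix([w[sa[i - 1]:], w[sa[i]:]]))
                 for i in range(1, len(w))]
    return sa, rank, lcp


def lcp_intervals(lcp):
    """Internal nodes as (label length, leftmost index, rightmost index)."""
    n, nodes, stack = len(lcp), [], [(0, 0)]  # stack of (label length, left)
    for i in range(1, n + 1):
        current = lcp[i] if i < n else -1     # sentinel closes every interval
        left = i - 1
        while stack and stack[-1][0] > current:
            depth, left = stack.pop()
            nodes.append((depth, left, i - 1))
        if not stack or stack[-1][0] < current:
            stack.append((current, left))
    return nodes


def parent_depth(lcp, left, right):
    """Label length of the parent of the node with interval [left, right]."""
    after = lcp[right + 1] if right + 1 < len(lcp) else 0
    return max(lcp[left], after)


sa, rank, lcp = build("ababbbabbc$")
for depth, left, right in lcp_intervals(lcp):
    print(depth, parent_depth(lcp, left, right), left, right)
```

For the node with label length 4 (the string babb), the sketch reports a parent label length of 1, matching the values shown on the slide. The root is reported with label length 0 and has no parent, so its parent value should be ignored.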

Suffix array algorithm
foreach v in suffix tree (simulated by suffix array) {
    if (node v is representative of EC [v]≡) {          ← difficulty 1
        follow suffix link;
        while (node is in EC [v]≡) {                     ← difficulty 2
            follow suffix link;
            compute size and minimal strings;            ← difficulty 3
        }
    }
    output succinct representation of EC [v]≡;
}
These steps are difficult because the suffix array has less information.

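Solving difficulty 1 (representative judge)
Let x be the label of node v, and let l and r be the leftmost and rightmost leaf IDs in the subtree rooted at v.
Suffix array indices: L = rank(l), R = rank(r), L' = rank(l – 1), R' = rank(r – 1)
Lemma: x is the representative of its EC ⇔ 1., 2., or 3.
1. R – L ≠ R' – L' (different number of leaves?)
2. w[l – 1] ≠ w[r – 1] (different first character?)
3. l – 1 = 0 or r – 1 = 0 (first character in the string?)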
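The three conditions of the lemma translate almost directly into code. The following is a sketch with 0-based text positions, so "l – 1 = 0" on the slide becomes "l == 0" here. The name `is_representative` and the naive array construction are assumptions made for illustration.

```python
def is_representative(w, sa, rank, L, R):
    """True if the node with suffix array interval [L, R] is a representative."""
    l, r = sa[L], sa[R]              # text positions of the two boundary leaves
    if l == 0 or r == 0:             # condition 3: an occurrence starts the string
        return True
    if w[l - 1] != w[r - 1]:         # condition 2: different preceding characters
        return True
    # condition 1: the interval one position to the left has a different width
    return R - L != rank[r - 1] - rank[l - 1]


w = "ababbbabbc$"
sa = sorted(range(len(w)), key=lambda i: w[i:])  # naive suffix array
rank = [0] * len(w)
for i, start in enumerate(sa):
    rank[start] = i

print(is_representative(w, sa, rank, 4, 5))  # node babb: True
print(is_representative(w, sa, rank, 2, 3))  # node abb: False, always preceded by b
```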

Solving difficulty 2 (equivalence relation judge)
Let ax be the label from the root to node v and x the label from the root to the node reached by the suffix link; l and r are the leftmost and rightmost leaf IDs in the subtree rooted at v.
Suffix array indices: L = rank(l), R = rank(r), L' = rank(l + 1), R' = rank(r + 1)
Lemma: ax ≡ x ⇔ 1., 2., and 3.
1. R – L = R' – L' (same number of leaves?)
2. lcp(L') < |ax| – 1 (leftmost?)
3. lcp(R' + 1) < |ax| – 1 (rightmost?)
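A suffix link step can be simulated by shifting both boundary leaves one position to the right and testing the three conditions. This is a sketch under assumptions: the function name is hypothetical, indices are 0-based, lcp is treated as 0 past the end of the array, the string ends with a unique terminator, and the node has label length at least 1.

```python
import os


def build(w):
    """Naive suffix, rank and lcp arrays (0-based); fine for short strings."""
    sa = sorted(range(len(w)), key=lambda i: w[i:])
    rank = [0] * len(w)
    for i, start in enumerate(sa):
        rank[start] = i
    lcp = [0] + [len(os.path.commonprefix([w[sa[i - 1]:], w[sa[i]:]]))
                 for i in range(1, len(w))]
    return sa, rank, lcp


def suffix_link_if_equivalent(sa, rank, lcp, depth, L, R):
    """Interval of x if ax (interval [L, R], |ax| = depth) is equivalent to x."""
    Lp, Rp = rank[sa[L] + 1], rank[sa[R] + 1]
    after = lcp[Rp + 1] if Rp + 1 < len(lcp) else 0
    same_leaves = (R - L == Rp - Lp)      # condition 1
    is_leftmost = lcp[Lp] < depth - 1     # condition 2
    is_rightmost = after < depth - 1      # condition 3
    return (Lp, Rp) if same_leaves and is_leftmost and is_rightmost else None


sa, rank, lcp = build("ababbbabbc$")
print(suffix_link_if_equivalent(sa, rank, lcp, 4, 4, 5))  # babb -> abb: (2, 3)
print(suffix_link_if_equivalent(sa, rank, lcp, 3, 2, 3))  # abb -> bb: None
```

Returning the interval of x on success lets a caller keep following links until the class changes, without a separate suffix link table.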

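Solving difficulty 3 (size computation)
[Figure: three cases for the suffix array interval [L, R] of a node; the label length of the parent is annotated as lcp(R + 1), lcp(R + 1) and lcp(L) in cases 1, 2 and 3 respectively.]
Lemma: size = Σ { lcp(R) – max{ lcp(L), lcp(R + 1) } }, summed over the nodes of the EC, where max{ lcp(L), lcp(R + 1) } is the label length of the parent.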
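The idea behind the size computation is that each node of an EC contributes the length of the edge label from its parent. The sketch below sums those contributions for a given list of nodes. It is an assumption-laden illustration: each node's label length is passed in explicitly instead of being read off as lcp(R), the lcp array is the 0-based one of ababbbabbc$ written out by hand, and `class_size` is a hypothetical name.

```python
def class_size(lcp, nodes):
    """Sum of (label length - parent label length) over the nodes of one EC.

    nodes: list of (label length, L, R) with 0-based suffix array indices.
    """
    total = 0
    for depth, L, R in nodes:
        after = lcp[R + 1] if R + 1 < len(lcp) else 0
        total += depth - max(lcp[L], after)   # length of the incoming edge label
    return total


# lcp array of ababbbabbc$ with suffixes sorted so that $ comes first
lcp = [0, 0, 2, 3, 0, 4, 1, 2, 2, 1, 0]
# The EC of babb consists of the nodes babb (interval [4, 5]) and abb ([2, 3]).
print(class_size(lcp, [(4, 4, 5), (3, 2, 3)]))  # 3 + 1 = 4
```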

Computational experiment
• Comparison of algorithms
  • suffix tree
  • CDAWG
  • suffix array
• Data
  • two English and two genome corpora
  • Canterbury corpus, Protein corpus
• Machine spec.
  • Red Hat Linux
  • 2.8 GHz CPU, 1 GB memory

Experimental result

Application: spam detection
Many copies of the same spam message are sent, so the equivalence classes formed by spam are larger than those formed by non-spam.
[Figure: plot of the number of equivalence classes against the size of the equivalence class.]
[Image: a Japanese spam message that uses the word "Sushi"; the spam itself is unrelated to this study.]

Application: spam detection
"Unsupervised Spam Detection based on String Alienness Measures" by Kazuyuki Narisawa, Hideo Bannai, Kohei Hatano and Masayuki Takeda
Accepted at the Tenth International Conference on Discovery Science, Sendai, Japan, 1-4 October 2007 (DS '07).
If you are interested in our study and want to come to the conference, search for "Discovery Science 2007" rather than "DS 07".

Summary
• Presented an algorithm for computing the equivalence classes with a suffix array
  • simulating traversal of the suffix tree and its suffix links
  • using only the lcp and rank arrays
  • running in linear time and space
• Compared with other data structures
  • less memory
  • faster computation
• Can be applied to spam detection [DS '07]

Thank You

Compute size of the EC
The size is the sum of the lengths of the labels from the parent to each node of the EC: 1 + 3 = 4.
[Figure: suffix tree of ababbbabbc$ with the nodes of one EC highlighted.]

Compute minimal strings of the EC
• case 1: if the node is the representative, the label "xx1" is one of the minimal strings
• case 2: if the two label lengths satisfy k > m, the label "zz1" is one of the minimal strings
[Figure: nodes x, y and z with edge-label characters x1, x2, …, xk; y1, …, yk; and z1, z2, …, zm.]

Suffix tree
• each node has:
  • parent
  • leftmost child
  • right sibling
  • suffix link
  • label of the incoming edge
