
Efficient Computation of Substring Equivalence Classes with Suffix Arrays

Efficient Computation of Substring Equivalence Classes with Suffix Arrays. Kazuyuki Narisawa, Shunsuke Inenaga, Hideo Bannai and Masayuki Takeda, Kyushu University, Japan. Contents: Introduction, Problem definition, Suffix tree based algorithm, Simulation by suffix array


Presentation Transcript


  1. Efficient Computation of Substring Equivalence Classes with Suffix Arrays Kazuyuki Narisawa, Shunsuke Inenaga, Hideo Bannai and Masayuki Takeda, Kyushu University, Japan

  2. Contents • Introduction • Problem definition • Suffix tree based algorithm • Simulation by suffix array • Computational experiment • Application • Summary

  3. Main contribution Time- and space-efficient computation of substring equivalence classes [Blumer et al. 1987] with suffix arrays • Runs in linear time and space • Is faster and requires less memory than suffix tree and CDAWG based methods.

  4. Equivalence relation and classes Given a string w, the maximal extension of a substring x is $\overleftrightarrow{x} = \alpha x \beta$, where ・Every time x occurs in w, it is preceded by α and followed by β. ・Strings α and β are the longest possible. Equivalence relation: x ≡ y ⇔ $\overleftrightarrow{x} = \overleftrightarrow{y}$. Equivalence class: [x] = { y | y ≡ x }, i.e. the substrings with essentially identical occurrences in w. Example: w = Betty–bought–a–bit–of–better–butter–and–made–a–better–batter–after–breakfast. Here $\overleftrightarrow{\text{bet}}$ = –better–b, so bet ∈ [–better–b].
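The definition on this slide can be checked by brute force on short strings. The following is a minimal sketch, not the linear-time method of the talk; the function names `occurrences`, `maximal_extension` and `equivalence_classes` are hypothetical, and plain hyphens stand in for the dashes of the example sentence.

```python
from collections import defaultdict


def occurrences(w, x):
    """Start indices (0-based) of every occurrence of x in w, overlaps included."""
    return [i for i in range(len(w) - len(x) + 1) if w.startswith(x, i)]


def maximal_extension(w, x):
    """Return alpha + x + beta with alpha and beta as long as possible."""
    occ = occurrences(w, x)
    if not occ:
        raise ValueError("x must occur in w")
    m = len(x)
    # Grow alpha while every occurrence is preceded by the same character.
    left = 0
    while min(occ) - left - 1 >= 0 and len({w[i - left - 1] for i in occ}) == 1:
        left += 1
    # Grow beta while every occurrence is followed by the same character.
    right = 0
    while max(occ) + m + right < len(w) and len({w[i + m + right] for i in occ}) == 1:
        right += 1
    start = occ[0]
    return w[start - left:start + m + right]


def equivalence_classes(w):
    """Group all substrings of w by their maximal extension (brute force)."""
    classes = defaultdict(set)
    for i in range(len(w)):
        for j in range(i + 1, len(w) + 1):
            x = w[i:j]
            classes[maximal_extension(w, x)].add(x)
    return classes
```

The dictionary keys are the representatives, so `equivalence_classes("ababbbabbc$")["babb"]` collects every substring whose maximal extension is `babb`. The cost is far from linear, which is the gap the suffix array algorithm is meant to close.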

  5. Problem • Input: string w of length n • Output: the equivalence classes on w • Difficulty • The total number of elements in the equivalence classes (ECs for short) is O(n²). • Solution • The number of ECs is O(n). • Each EC can be succinctly represented in O(1) space.

  6. Succinct representation of the ECs • representative ・・・ the longest element (the maximal extension) • minimal strings ・・・ the elements which belong to another EC when the leftmost or rightmost character is deleted. $[x] = \mathrm{Substring}(\overleftrightarrow{x}) \cap \bigl( \bigcup_{y \text{ is minimal}} \mathrm{Superstring}(y) \bigr)$, so the elements of [x] can be enumerated from the representative and the minimal strings. [Example figure: the sentence Betty-bought-a-bit-of-better-butter-and-made-a-better-batter-after-breakfast. with a representative and its minimal strings marked.]
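As an illustration of how the succinct representation expands back into a full class, here is a small sketch of the formula on this slide. The helper names `substrings` and `expand_class` are hypothetical, and the input pair used in the usage note (representative `babb`, minimal strings `ba` and `abb`) was worked out by hand for w = ababbbabbc$ rather than taken from the slides.

```python
def substrings(s):
    """All non-empty substrings of s."""
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}


def expand_class(representative, minimal_strings):
    """Substring(representative) intersected with the union of Superstring(y)."""
    return {
        x
        for x in substrings(representative)
        if any(y in x for y in minimal_strings)  # x is a superstring of some minimal y
    }
```

For example, `expand_class("babb", ["ba", "abb"])` yields the four members `babb`, `bab`, `ba` and `abb`, while substrings such as `b` or `bb` are excluded because they contain no minimal string.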

  7. Problem • Input: string w of length n • Output: succinct representations of the equivalence classes on w • Additionally, for each EC we will output • size (the number of elements in the EC) • frequency (the number of occurrences of the elements in the EC)

  8. Possible solutions • Suffix Tree [Weiner 1973] • Compact Directed Acyclic Word Graph (CDAWG) [Blumer et al. 1985] • ECs can be computed with either of the data structures in linear time and space.

  9. Suffix tree (with suffix links) of ababbbabbc$ [Figure: suffix tree with leaves numbered 1 to 11.] Ignore leaves here because they form a trivial EC.

  10. Equivalence classes on suffix tree (ababbbabbc$) Equivalence relation: two nodes are connected by a suffix link and their subtrees have the same number of leaves. EC definition: substrings with essentially the same occurrences. [Figure: suffix tree of ababbbabbc$ with the ECs marked; labels shown include abb, babb, bab, ba and b.]

  11. Suffix tree algorithm
      foreach node v in suffix tree {
          if (node v is representative of EC [v]≡) {
              follow suffix link;
              while (node is in EC [v]≡) {
                  follow suffix link;
                  compute size and minimal strings;
              }
          }
          output succinct representation of EC [v]≡;
      }

  12. Algorithm with suffix tree [Figure: walk-through on the suffix tree of ababbbabbc$.] Callouts: • Representative? (Is the number of incoming suffix links 1 or not?) • Follow suffix link. • In the same EC? (Same number of leaves?) • In another EC: back to the representative. • Output: representative, minimal strings, size, frequency. • Not representative: continue suffix tree traversal. Note: the suffix tree requires large memory space.

  13. Suffix array [Manber and Myers 1993] • Can simulate traversal on the suffix tree using lcp and rank arrays [Kasai et al. 2003] • Can simulate traversal on suffix links using an additional data structure, the suffix link table [Abouelhoda et al. 2004] Our algorithm simulates traversal on suffix links without the suffix link table.

  14. Suffix array of ababbbabbc$: the suffixes sorted lexicographically. [Figure: suffix tree shown next to the suffix array.]

  15. Lcp array of ababbbabbc$ lcp[i]: the length of the longest common prefix of the i-th and (i – 1)-th suffixes in the suffix array. [Figure: suffix tree shown next to the lcp array and suffix array.]

  16. Rank array of ababbbabbc$ rank[SA[i]] = i. [Figure: suffix tree shown next to the rank array and suffix array.]
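The three arrays introduced on slides 14 to 16 can be built as follows. This is a sketch under stated assumptions: indices are 0-based (the slides count from 1), the suffix array is built by naive sorting rather than a linear-time construction, and the function names are hypothetical. The lcp computation follows the well-known scheme of Kasai et al., which walks the text positions in order and reuses the previous match length.

```python
def suffix_array(w):
    """Naive construction: sort suffix start positions by the suffix text."""
    return sorted(range(len(w)), key=lambda i: w[i:])


def rank_array(sa):
    """Inverse of the suffix array: rank[sa[i]] == i."""
    rank = [0] * len(sa)
    for i, p in enumerate(sa):
        rank[p] = i
    return rank


def lcp_array(w, sa, rank):
    """lcp[i] = longest common prefix of suffixes sa[i-1] and sa[i]; lcp[0] = 0."""
    n = len(w)
    lcp = [0] * n
    h = 0  # length of the match carried over from the previous text position
    for p in range(n):
        i = rank[p]
        if i == 0:
            h = 0
            continue
        q = sa[i - 1]
        while p + h < n and q + h < n and w[p + h] == w[q + h]:
            h += 1
        lcp[i] = h
        if h > 0:
            h -= 1
    return lcp
```

For w = ababbbabbc$ this gives the 0-based suffix array [10, 0, 2, 6, 1, 5, 4, 3, 7, 8, 9]; adding 1 to each entry recovers the leaf numbering used in the figures.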

  17. Suffix array has less information Information available during traversal for each data structure, when visiting node v. Suffix Tree: 1. label from root to each node 2. label from parent to each node 3. num. leaves in each subtree 4. parent of each node 5. children of each node 6. suffix link of each node. Suffix Array: • length of label from root to v • length of label from root to the parent of v • leftmost leaf ID in subtree rooted at v • rightmost leaf ID in subtree rooted at v

  18. Suffix array has less information [Figure: suffix tree of ababbbabbc$ above its suffix array, with one node highlighted.] For the highlighted node: length of parent label from root: 1; label length from root: 4.
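The four pieces of information listed for the suffix array can be recovered from the lcp array alone with a standard stack-based scan over lcp intervals. The sketch below is an illustration of that general technique, not code from the talk; `internal_nodes` is a hypothetical name, indices are 0-based, and the hard-coded lcp array is the one for ababbbabbc$.

```python
def internal_nodes(lcp):
    """Simulate a bottom-up suffix tree traversal using only the lcp array.

    Returns one tuple (depth, parent_depth, left, right) per internal node
    other than the root, where depth is the label length from the root and
    [left, right] is the range of suffix array indices under the node.
    """
    n = len(lcp)
    nodes = []
    stack = [(0, 0)]  # (label length from root, leftmost suffix array index)
    for i in range(1, n + 1):
        cur = lcp[i] if i < n else 0  # past the last suffix: close all open nodes
        left = i - 1
        while stack[-1][0] > cur:
            depth, left = stack.pop()
            # The parent is either the next open node or a node opened at depth cur.
            parent_depth = max(stack[-1][0], cur)
            nodes.append((depth, parent_depth, left, i - 1))
        if stack[-1][0] < cur:
            stack.append((cur, left))
    return nodes


# lcp array of "ababbbabbc$" (0-based, lcp[0] = 0)
EXAMPLE_LCP = [0, 0, 2, 3, 0, 4, 1, 2, 2, 1, 0]
```

On `EXAMPLE_LCP` the scan reports five nodes, including `(4, 1, 4, 5)`: a node with label length 4 whose parent has label length 1, matching the numbers quoted on this slide.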

  19. Suffix array algorithm
      foreach v in suffix tree (simulated by suffix array) {
          if (node v is representative of EC [v]≡) {        // difficulty 1
              follow suffix link;
              while (node is in EC [v]≡) {                   // difficulty 2
                  follow suffix link;
                  compute size and minimal strings;          // difficulty 3
              }
          }
          output succinct representation of EC [v]≡;
      }
      These steps are difficult because the suffix array has less information.

  20. Solving difficulty 1 (representative judge) Node v with label x covers suffix array indices L = rank(l) to R = rank(r); let L' = rank(l – 1) and R' = rank(r – 1). Lemma: $\overleftrightarrow{x} = x$ ⇔ 1, 2, or 3 holds: 1. R – L ≠ R' – L' (different number of leaves?) 2. w[l – 1] ≠ w[r – 1] (different first char?) 3. l – 1 = 0 or r – 1 = 0 (first char in string?)
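The three conditions of this lemma translate into a short test. The sketch below assumes 0-based indices, so the slide's "l – 1 = 0" check becomes `l == 0`; the names `suffix_and_rank_arrays` and `is_representative` are hypothetical, and the arrays are built naively for illustration only.

```python
def suffix_and_rank_arrays(w):
    """Naive construction, fine for short examples (not linear time)."""
    sa = sorted(range(len(w)), key=lambda i: w[i:])
    rank = [0] * len(w)
    for i, p in enumerate(sa):
        rank[p] = i
    return sa, rank


def is_representative(w, sa, rank, L, R):
    """Is the node covering suffix array indices L..R the longest member of its EC?"""
    l, r = sa[L], sa[R]  # text positions of the leftmost and rightmost leaves
    if l == 0 or r == 0:
        return True  # condition 3: an occurrence starts the text, nothing precedes it
    if w[l - 1] != w[r - 1]:
        return True  # condition 2: occurrences are preceded by different characters
    # condition 1: prepending the shared character loses some occurrences
    return rank[r - 1] - rank[l - 1] != R - L
```

For w = ababbbabbc$ the node for `babb` covers indices 4..5 and passes the test, while the node for `abb` (indices 2..3) fails it because both of its occurrences are preceded by `b`.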

  21. Solving difficulty 2 (equivalence relation judge) Node v has label ax from the root and covers suffix array indices L = rank(l) to R = rank(r); the node reached by the suffix link has label x from the root. Let L' = rank(l + 1) and R' = rank(r + 1). Lemma: ax ≡ x ⇔ 1, 2, and 3 hold: 1. R – L = R' – L' (same number of leaves?) 2. lcp(L') < |ax| – 1 (leftmost?) 3. lcp(R' + 1) < |ax| – 1 (rightmost?)
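A possible rendering of this lemma in code is shown below. It is a sketch with several assumptions: indices are 0-based, the arrays are built naively, a missing entry lcp(R' + 1) past the end of the array is treated as satisfying condition 3, and the names `naive_arrays` and `follow_suffix_link_in_class` are hypothetical. The function returns the interval of the suffix link target when it lies in the same EC, and `None` otherwise.

```python
def naive_arrays(w):
    """Suffix, rank and lcp arrays by direct comparison (short inputs only)."""
    n = len(w)
    sa = sorted(range(n), key=lambda i: w[i:])
    rank = [0] * n
    for i, p in enumerate(sa):
        rank[p] = i
    lcp = [0] * n
    for i in range(1, n):
        a, b = w[sa[i - 1]:], w[sa[i]:]
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        lcp[i] = k
    return sa, rank, lcp


def follow_suffix_link_in_class(sa, rank, lcp, L, R, depth):
    """Node with label ax (length depth) covers indices L..R; test ax ≡ x."""
    n = len(sa)
    Lp, Rp = rank[sa[L] + 1], rank[sa[R] + 1]
    if Rp - Lp != R - L:
        return None  # condition 1 fails: different number of leaves
    x_len = depth - 1
    if not lcp[Lp] < x_len:
        return None  # condition 2 fails: Lp is not the leftmost suffix starting with x
    if Rp + 1 < n and not lcp[Rp + 1] < x_len:
        return None  # condition 3 fails: Rp is not the rightmost suffix starting with x
    return (Lp, Rp)
```

On ababbbabbc$ the node `babb` (indices 4..5, depth 4) leads to `abb` at indices 2..3 in the same EC, whereas following the link again from `abb` returns `None`, since `bb` occurs more often than `abb`.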

  22. Solving difficulty 3 (size computation) [Figure: three cases of suffix array intervals with text positions l, r, r' and indices L, R, R + 1; the label length of the parent is lcp(R + 1) in cases 1 and 2 and lcp(L) in case 3.] Lemma: size = sum of { lcp(R) – max{ lcp(L), lcp(R + 1) } }
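The size rule can be sketched in a few lines: each node of an EC contributes the length of the edge from its parent, and the parent's label length is the larger of the two lcp values just outside the node's interval. In this sketch the label length of each node is passed in explicitly, as a simulated traversal would supply it; indices are 0-based, the names `edge_length` and `class_size` are hypothetical, and the lcp array is the one for ababbbabbc$.

```python
def edge_length(lcp, L, R, depth):
    """Number of substrings ending on the edge into the node covering L..R."""
    n = len(lcp)
    right = lcp[R + 1] if R + 1 < n else 0  # no suffix to the right of the last one
    parent_depth = max(lcp[L], right)       # label length of the parent node
    return depth - parent_depth


def class_size(lcp, nodes):
    """Sum of edge lengths over the nodes (L, R, depth) that form one EC."""
    return sum(edge_length(lcp, L, R, depth) for (L, R, depth) in nodes)


# lcp array of "ababbbabbc$" (0-based, lcp[0] = 0)
EXAMPLE_LCP = [0, 0, 2, 3, 0, 4, 1, 2, 2, 1, 0]
```

For the EC made of the nodes `babb` (indices 4..5, depth 4) and `abb` (indices 2..3, depth 3), the contributions are 3 and 1, giving a size of 4.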

  23. Computational experiment • Comparison of algorithms • suffix tree • CDAWG • suffix array • Data • two English and two Genome corpora • Canterbury corpus, Protein corpus • Machine spec. • Red Hat Linux • CPU 2.8GHz, 1 GB memory

  24. Experimental result

  25. Application: spam detection The sizes of the equivalence classes formed by spam are larger than those of non-spam, because many copies of the same message are sent. [Figure: plot of the number of equivalence classes against the size of the equivalence class, with the SPAM region marked.] • The example shown is a Japanese spam message that uses "Sushi", but this spam does not relate to this study.

  26. Application: spam detection "Unsupervised Spam Detection based on String Alienness Measures" by Kazuyuki Narisawa, Hideo Bannai, Kohei Hatano and Masayuki Takeda. Accepted at the Tenth International Conference on Discovery Science (DS '07), Sendai, Japan, 1-4 October 2007. If you are interested in our study and want to come to the conference, search for "Discovery Science 2007" rather than "DS 07".

  27. Summary • Presented an algorithm for computing the equivalence classes with a suffix array • simulating traversal on suffix tree + suffix links • using only lcp and rank arrays • running in linear time and space • Compared with other data structures • less memory • faster computation • Can be applied to spam detection [DS '07]

  28. Thank You

  29. Compute size of the EC [Figure: suffix tree of ababbbabbc$ with the nodes of one EC highlighted.] The size is the sum of the lengths of the labels from the parent to each node of the EC; in the example, 1 + 3 = 4.

  30. Compute minimal strings of the EC Case 1: if the node is the representative, the label "xx1" is one of the minimal strings. Case 2: if the two label lengths satisfy k > m, the label "zz1" is one of the minimal strings. [Figure: nodes x, y and z with edge labels x1 … xk, y1 … yk and z1 … zm.]

  31. suffix tree • each node has: • parent • leftmost child • right sibling • suffix link • label of the incoming edge
