1 / 94

Suffix Tree and Suffix Array

Suffix Tree and Suffix Array. Outline. Motivation Exact Matching Problem Suffix Tree Building issues Suffix Array Build Search Longest common prefixes Extra topics discussion Suffix Tree VS. Suffix Array. Motivation. Text search Need fast searching algorithm(with low space cost)

dwallace
Download Presentation

Suffix Tree and Suffix Array

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Suffix Tree and Suffix Array

  2. Outline • Motivation • Exact Matching Problem • Suffix Tree • Building issues • Suffix Array • Build • Search • Longest common prefixes • Extra topics discussion • Suffix Tree VS. Suffix Array

  3. Motivation • Text search • Need fast searching algorithm(with low space cost) • DNA sequences and protein sequences are too large to search by traditional algorithms • Some improved algorithms perform efficiently • KMP, BM algorithms for string matching • Suffix Tree with linear construction and searching time • Suffix Array with Suffix Tree based construction

  4. Exact Matching Problem • Find ‘ssi’ in ‘mississippi’ poulin at cs_ualberta_ca http://www.cs.ualberta.ca/~poulin/

  5. Exact Matching Problem • Find ‘ssi’ in ‘mississippi’

  6. Exact Matching Problem • Find ‘ssi’ in ‘mississippi’ si s

  7. Exact Matching Problem • Find ‘ssi’ in ‘mississippi’ ssippi si s Every leaf below this point in the tree marks the starting location of ‘ssi’ in ‘mississippi’. (ie. ‘ssissippi’ and ‘ssippi’)

  8. Exact Matching Problem • Find ‘sissy’ in ‘mississippi’

  9. Exact Matching Problem • Find ‘sissy’ in ‘mississippi’

  10. Exact Matching Problem • Find ‘sissy’ in ‘mississippi’ s i ss

  11. Exact Matching Problem • Find ‘sissy’ in ‘mississippi’ s i ss

  12. Exact Matching Problem • So what? Knuth-Morris-Pratt and Boyer-Moore both achieve this worst case bound. • O(m+n) when the text and pattern are presented together. • Suffix trees are much faster when the text is fixed and known first while the patterns vary. • O(m) for single time processing the text, then only O(n) for each new pattern. • Aho-Corasick is faster for searching a number of patterns at one time against a single text.

  13. Boyer-Moore Algorithm • For string matching(exact matching problem) • Time complexity O(m+n) for worst case and O(n/m) for absense • Method: backward matching with 2 jumping arrays(bad character table and good suffix table)

  14. What are suffix arrays and trees? • Text indexing data structures • not word based • allow search for patterns or • computation of statistics • Important Properties • Size • Speed of exact matching • Space required for construction • Time required for construction

  15. Suffix Tree

  16. Properties of a Suffix Tree • Each tree edge is labeled by a substring of S. • Each internal node has at least 2 children. • Each S(i) has its corresponding labeled path from root to a leaf, for 1in . • There are n leaves. • No edges branching out from the same internal node can start with the same character.

  17. Building the Suffix Tree • How do we build a suffix tree? while suffixes remain: add next shortest suffix to the tree

  18. Building the Suffix Tree • papua

  19. Building the Suffix Tree • papua papua

  20. Building the Suffix Tree • papua apua papua

  21. Building the Suffix Tree • papua apua p apua ua

  22. Building the Suffix Tree • papua apua p apua ua ua

  23. Building the Suffix Tree • papua pua a p apua ua ua

  24. Building the Suffix Tree • papua pua a p apua ua ua

  25. Building the Suffix Tree • How do we build a suffix tree? while suffixes remain: add next shortest suffix to the tree Naïve method - O(m2) (m = text size)

  26. Building the Suffix Tree in O(m) Time • In the previous example, we assumed that the tree can be built in O(m) time. • Weiner showed original O(m) algorithm (Knuth is claimed to have called it “the algorithm of 1973”) • More space efficient algorithm by McCreight in 1976 • Simpler ‘on-line’ algorithm by Ukkonen in 1995

  27. Ukkonen’s Algorithm • Build suffix tree T for string S[1..m] • Build the tree in m phases, one for each character. At the end of phase i, we will have tree Ti, which is the tree representing the prefix S[1..i]. • In each phase i, we have i extensions, one for each character in the current prefix. At the end of extension j, we will have ensured that S[j..i] is in the tree Ti. NTHU Make Lab http://make.cs.nthu.edu.tw

  28. Ukkonen’s Algorithm • 3 possible ways to extend S[j..i] with character i+1. • S[j..i] ends at a leaf. Add the character i+1 to the end of the leaf edge. • There is a path through S[j..i], but no match for the i+1 character. Split the edge and create a new node if necessary, then add a new leaf with character i+1. • There is already a path through S[j..i+1]. Do nothing.

  29. Ukkonen’s Algorithm - mississippi

  30. Ukkonen’s Algorithm - mississippi

  31. Ukkonen’s Algorithm - mississippi

  32. Ukkonen’s Algorithm - mississippi

  33. Ukkonen’s Algorithm - mississippi

  34. Ukkonen’s Algorithm - mississippi

  35. Ukkonen’s Algorithm - mississippi

  36. Ukkonen’s Algorithm - mississippi

  37. Ukkonen’s Algorithm - mississippi

  38. Ukkonen’s Algorithm - mississippi

  39. Ukkonen’s Algorithm - mississippi

  40. Ukkonen’s Algorithm - mississippi

  41. Ukkonen’s Algorithm • In the form just presented, this is an O(m3) time, O(m2) space algorithm. • We need a few implementation speed-ups to achieve the O(m) time and O(m) space bounds.

  42. Suffix Array

  43. The Suffix Array Definition:Given a string D thesuffix arraySA for this string is the sorted list of pointers to all suffixes of D. (Manber, Myers 1990)

  44. 中山大學資工系--楊昌彪教授 The Suffix Array http://par.cse.nsysu.edu.tw/~cbyang/ • In a suffix array, all suffixes of S are in the non-decreasing lexical order. • For example, S=“ATCACATCATCA”

  45. fin

  46. How do we build it ? • Build a suffix tree • Traverse the tree in DFS, lexicographically picking edges outgoing from each node and fill the suffix array. • O(n) time • Suffix tree construction loses some of the advantage that the suffix array has over the suffix tree

  47. Direct suffix array construction algorithm • Unfortunately, it is difficult to solve this problem with the suffix array Pos alone because Pos has lost the information on tree topology. In direct algorithm, the array Height (saving lcp information) has the information on the tree topology which is lost in the suffix array P “Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications”

  48. Skew-algorithm • Step 1: SA≠ 0 = sort the suffixes starting at position i ≠ 0 mod 3. • Step 2: SA= 0= sort the suffixes starting at position i = 0 mod 3. • Step 3: SA = merge SA= 0and SA≠ 0. 0 12 3 45 6 78 9 10 s = m i s s i s s i p p i

  49. Step 1: SA≠ 0 = sort the suffixes starting at position i ≠ 0 mod 3. 11 12 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 s = m i s s i s s i p p i $ $ m i s s i s s i p p i Radixsort 3 3 2 1 5 5 4 1 4 7 10 2 5 8 Let S12 = [ 3 3 2 1 5 5 4 ] => SA≠0= [ 10 7 4 1 8 5 2 ] in T(2n/3)

  50. 1 4 7 102 5 8 s12= [ 3 3 2 1 5 5 4 ] s121= 3 3 2 1 5 5 4 s124 = 3 2 1 5 5 4 s127 = 2 1 5 5 4 s1210 = 1 5 5 4 s122 = 5 5 4 S125= 5 4 s128 = 4 s = m i s s i s s i p p i s1 = i s s i s s i p p i s4 = i s s i p p i s7= i p p i s10 = i s2= s s i s s i p p i s5= s s i p p i s8= p p i SA≠ 0 = [ 107418 5 2 ], It suffices to show that S12i < S12j <=> si < sj.

More Related