
Suffix Arrays

Suffix Arrays: A New Method for Online String Searches, by U. Manber and G. Myers.


Presentation Transcript


  1. Suffix Arrays: A New Method for Online String Searches • U. Manber and G. Myers

  2. Introduction - String matching • Let A = a_0 a_1 ... a_{N-1} be a large text of length N • Let W = w_0 w_1 ... w_{P-1} be a word of length P • Is W a substring of A?

  3. Introduction - Suffix Trees • Build time: O(N) • Search time: O(P) • Structure space: O(N), with a big constant • Dependent on |Σ|

  4. Suffix Arrays • An array of all the suffixes of A • Sorted in lexicographic order • Example: A = aababa

  5. Suffix Arrays • A_i = a_i a_{i+1} ... a_{N-1}: the suffix of A that starts at position i • Position array (Pos) • Pos[k] is the start position of the kth smallest suffix • A_Pos[k] is the suffix that Pos[k] points to, i.e. the kth smallest suffix • Example: A = aababa
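A minimal sketch of the Pos array, assuming a naive construction that simply sorts the suffixes with Python's built-in sort. The name `build_pos` is hypothetical, and this is not the O(NlogN) sort described later in the talk; comparing whole suffixes can cost far more.

```python
def build_pos(text: str) -> list[int]:
    """Return Pos: the start positions of the suffixes of text, in lexicographic order."""
    # Naive approach: each comparison may scan O(N) symbols.
    return sorted(range(len(text)), key=lambda i: text[i:])


text = "aababa"
pos = build_pos(text)
for k, start in enumerate(pos):
    # k-th smallest suffix A_Pos[k] and its start position Pos[k]
    print(k, start, text[start:])
# Pos for "aababa" is [5, 0, 3, 1, 4, 2]
```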

  6. Searching • “Is W a substring of A?” • W is a substring of A ⇔ some suffix A_i starts with W • i is W’s location • All the instances of W must match consecutive suffixes in the array • Find the array interval that contains those suffixes

  7. Searching - Definitions • For a string u: u^p = u_0 u_1 ... u_{p-1} (the first p symbols of u) • For strings u, v: u ≤_p v ⇔ u^p ≤ v^p • Same for ≠, =, >, ... • For any p, Pos is ordered according to ≤_p

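  8. Searching - Definitions • W = w_0 w_1 ... w_{P-1} • L_W = min(k : W ≤_P A_Pos[k] or k = N): the first suffix that is ≥_P W • R_W = max(k : A_Pos[k] ≤_P W or k = -1): the last suffix that is ≤_P W • (Figure: W >_P A_Pos[k] to the left of L_W, W =_P A_Pos[k] between L_W and R_W, W <_P A_Pos[k] to the right of R_W)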

  9. Search Algorithm • k ∈ [L_W, R_W] ⇔ W =_P A_Pos[k] • To find W’s instances, find [L_W, R_W] • The number of occurrences of W is R_W - L_W + 1 • Matches are A_Pos[L_W], ..., A_Pos[R_W] • The suffix array is sorted, so use binary search

  10. Binary Search • Search interval [L, R] • Midpoint M • Compare W to A_Pos[M] • Decide where to search next • W ≤_P A_Pos[M]: search in the left half (R = M) • W >_P A_Pos[M]: search in the right half (L = M) • O(PlogN) • (Figure: W = abc, with L, M and R marked on the array)
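The plain O(PlogN) search can be sketched as two binary searches, one for L_W and one for R_W. This is an illustrative sketch: the function names are hypothetical, it uses half-open search ranges rather than the (R = M) / (L = M) updates on the slide, and it builds Pos naively.

```python
def build_pos(text):
    # Naive suffix sort, used here only to obtain a sorted Pos array.
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_interval(text, pos, w):
    """Return (L_W, R_W). W occurs in text exactly when L_W <= R_W."""
    n, p = len(pos), len(w)

    def prefix(k):
        # First P symbols of the k-th smallest suffix.
        return text[pos[k]:pos[k] + p]

    # L_W: smallest k with W <=_P A_Pos[k], or N if there is none.
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi) // 2
        if w <= prefix(mid):
            hi = mid
        else:
            lo = mid + 1
    lw = lo

    # R_W: largest k with A_Pos[k] <=_P W, or -1 if there is none.
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= w:
            lo = mid + 1
        else:
            hi = mid
    rw = lo - 1
    return lw, rw


text = "aababa"
pos = build_pos(text)
lw, rw = find_interval(text, pos, "ab")
print(lw, rw, sorted(pos[lw:rw + 1]))  # 2 3 [1, 3]
```

Each of the O(logN) steps compares up to P symbols, which is where the O(PlogN) bound comes from.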

  11. Search Algorithm • Observation: • We can use information from one comparison to speed up the next comparisons • Use additional information • lcp = longest common prefix

  12. Search Algorithm - lcp • lcp(v,w) = the length of the longest common prefix of v and w • Obtained by comparing v and w and stopping at the first unequal symbol • Use precomputed lcp information to reduce the number of comparisons to O(P + logN)
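As a small illustration of the definition, lcp can be computed by scanning both strings until the first mismatch. The function name `lcp` follows the slide; the implementation is only a sketch.

```python
def lcp(v: str, w: str) -> int:
    """Length of the longest common prefix of v and w."""
    length = 0
    for a, b in zip(v, w):
        if a != b:
            break  # stop at the first unequal symbol
        length += 1
    return length


print(lcp("assassin", "assin"))  # 3, the common prefix is "ass"
print(lcp("sin", "ssin"))        # 1
```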

  13. Search Algorithm • Consider all possible midpoints • M = 1, ..., N-2 • Every midpoint corresponds to a triplet [L_M, M, R_M] • Suppose we precomputed two arrays: • Llcp[M] = lcp(A_Pos[L_M], A_Pos[M]) • Rlcp[M] = lcp(A_Pos[M], A_Pos[R_M])

  14. Search Algorithm • Maintain two more variables • l = lcp(A_Pos[L], W) • r = lcp(W, A_Pos[R]) • (Figure: W = abcd, l = 2, r = 1, Llcp[M] = 1, Rlcp[M] = 1)

  15. Search Algorithm • Assume l ≥ r • Compare l with Llcp[M] • If l < Llcp[M]: • W >_{l+1} A_Pos[L_M] • A_Pos[L_M] =_{l+1} A_Pos[M] • W >_{l+1} A_Pos[M] • Go right; l remains unchanged • (Figure: W = abcd, l = 2, r = 1, Llcp[M] = 3, Rlcp[M] = 1)

  16. Search Algorithm • If l > Llcp[M]: • A_Pos[L_M] <_l A_Pos[M] • W =_l A_Pos[L_M] • W <_l A_Pos[M] • Go left; r = Llcp[M] • (Figure: W = abcd, l = 2, r = 1, Llcp[M] = 1, Rlcp[M] = 1)

  17. Search Algorithm • If l = Llcp[M]: • W can be in either half • Start comparing W and A_Pos[M] from the (l+1)th symbol • The first unequal symbol determines whether to go right or left • r or l will be updated to l+j • j+1 comparisons • (Figure: W = abcd, l = 2, r = 1, Llcp[M] = 2, Rlcp[M] = 1)

  18. Search Algorithm - Complexity • In each iteration: • Let h = max(l, r) • We start comparing from the hth symbol and continue to the (h+j+1)th • j+1 symbol comparisons • Next time we will start from the (h+j)th symbol • j symbols out of the j+1 will not be compared again

  19. Search Algorithm - Complexity • Every symbol in W will be successfully matched at most once • O(P) successful comparisons • At most one symbol will be unsuccessfully matched in each iteration • O(logN) unsuccessful comparisons • Total: O(P + logN) comparisons
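The idea of reusing l and r can be sketched with a simpler variant than the one on the slides: since every suffix between L and R agrees with W on the first min(l, r) symbols, each comparison may skip that many symbols. This sketch does not use Llcp/Rlcp, so it does not achieve the O(P + logN) worst-case bound; all function names are hypothetical and a non-empty text is assumed.

```python
def match_len(text, start, w, skip):
    """lcp of w and the suffix text[start:], given that the first `skip` symbols match."""
    k = skip
    while k < len(w) and start + k < len(text) and text[start + k] == w[k]:
        k += 1
    return k


def w_leq_suffix(text, start, w, m):
    """Decide W <=_P suffix, given m = lcp(W, suffix)."""
    if m == len(w):
        return True          # W is a prefix of the suffix: W =_P suffix
    if start + m == len(text):
        return False         # the suffix ended first, so suffix <_P W
    return w[m] < text[start + m]


def find_lw(text, pos, w):
    """Return L_W = min(k : W <=_P A_Pos[k] or k = N)."""
    n = len(pos)
    l = match_len(text, pos[0], w, 0)
    if w_leq_suffix(text, pos[0], w, l):
        return 0
    r = match_len(text, pos[n - 1], w, 0)
    if not w_leq_suffix(text, pos[n - 1], w, r):
        return n
    left, right = 0, n - 1   # invariant: A_Pos[left] <_P W <=_P A_Pos[right]
    while right - left > 1:
        mid = (left + right) // 2
        # All suffixes in (left, right) share min(l, r) symbols with W.
        m = match_len(text, pos[mid], w, min(l, r))
        if w_leq_suffix(text, pos[mid], w, m):
            right, r = mid, m
        else:
            left, l = mid, m
    return right


text = "aababa"
pos = sorted(range(len(text)), key=lambda i: text[i:])
print(find_lw(text, pos, "ab"))  # 2
```

The full algorithm on slides 13 to 17 additionally compares l against Llcp[M] (or r against Rlcp[M]) to decide a direction without touching any symbols when the two values differ.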

  20. Build Suffix Array • So far… • An O(P + logN) search algorithm • Given a sorted suffix array • Given lcp information (Llcp, Rlcp) • Next… • Sort the suffix array in O(NlogN) • Compute the lcp’s while sorting the array

  21. Sort Algorithm • First stage • Sort the suffixes into buckets, according to first symbol • Inductive stage • Assume the array is bucket sorted according to the first H symbols • Every H-bucket holds suffixes with the same first H symbols • Buckets are ordered according to the ≤_H relation • Sort according to the first 2H symbols

  22. Sort Algorithm – Intuition • Let A_i, A_j be two suffixes in the same H-bucket • A_i =_H A_j • The next H symbols of A_i and A_j are the first H symbols of A_{i+H} and A_{j+H} • To determine the ≤_{2H} order of A_i and A_j, look at the ≤_H order of A_{i+H} and A_{j+H} • (Figure: A = aababaa, H = 2, with A_i, A_j, A_{i+H} and A_{j+H} marked)

  23. Sort Algorithm – Main Idea • Let A_i be a suffix in the first H-bucket • A_i starts with the smallest H-symbol string • A_{i-H} should be the first in its 2H-bucket • (Figure: A = aababa, H = 1)

  24. Sort Algorithm • In stage H • Go over all the suffixes in the ≤_H order • For each A_i, move A_{i-H} to the next available place in its H-bucket • The suffixes are now sorted according to the ≤_{2H} order • Go on to stage 2H to produce the ≤_{4H} order

  25. Sort Algorithm - Example • A = assassin • Order after H = 1: assin, assassin, in, n, ssassin, sin, ssin, sassin • Order after H = 2: assassin, assin, in, n, sassin, sin, ssin, ssassin

  26. Sort Algorithm - Example • A = assassin (positions 0 1 2 3 4 5 6 7) • Order after H = 2: assassin, assin, in, n, sassin, sin, ssin, ssassin • Order after H = 4: assassin, assin, in, n, sassin, sin, ssassin, ssin
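The doubling idea behind the example can be sketched as follows. This is a simplified stand-in, not the bucket-moving procedure from the slides: each stage re-sorts with a comparison sort on the pair (rank of A_i, rank of A_{i+H}), which costs O(NlogN) per stage instead of O(N). The function name and the -1 sentinel for suffixes shorter than H are assumptions.

```python
def suffix_array_doubling(text):
    """Sort suffixes by doubling: after a stage with step h, order is by the first 2h symbols."""
    n = len(text)
    # Stage H = 1: order by first symbol; rank[i] identifies the 1-bucket of suffix i.
    pos = sorted(range(n), key=lambda i: text[i])
    rank = [ord(c) for c in text]
    h = 1
    while h < n:
        def key(i):
            # The <=_2H order of A_i is decided by its H-bucket and the H-bucket of A_{i+H}.
            return (rank[i], rank[i + h] if i + h < n else -1)

        pos.sort(key=key)
        new_rank = [0] * n
        for k in range(1, n):
            # A new 2H-bucket starts whenever the key changes.
            new_rank[pos[k]] = new_rank[pos[k - 1]] + (key(pos[k]) != key(pos[k - 1]))
        rank = new_rank
        h *= 2
        if rank[pos[-1]] == n - 1:
            break  # every suffix is alone in its bucket, so the sort is complete
    return pos


text = "assassin"
pos = suffix_array_doubling(text)
print([text[p:] for p in pos])
# ['assassin', 'assin', 'in', 'n', 'sassin', 'sin', 'ssassin', 'ssin']
```

The final order matches the H = 4 row of the example.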

  27. Sort Algorithm - Complexity • First Stage • Bucket sort according to first symbol • O(NlogN) • Inductive Stages • O(logN) stages • O(N) per stage • Total O(NlogN) • Space • Can be implemented using two N-sized integer arrays

  28. Finding Longest Common Prefixes • The search algorithm uses lcp information: • Llcp[M] = lcp(A_Pos[L_M], A_Pos[M]) • Rlcp[M] = lcp(A_Pos[M], A_Pos[R_M]) • We want to compute this information while we are sorting the array

  29. Finding Longest Common Prefixes • Show how to compute lcp’s for suffixes in adjacent H-buckets during the sort algorithm • Use that to compute the lcp’s of all the suffixes that are consecutive in the sorted suffix array • Show how to compute lcps for all the necessary suffixes

  30. Finding LCP for adjacent buckets • After the first sort stage, the lcp’s of suffixes in adjacent buckets are 0 • Assume that after stage H we know the lcp’s between suffixes in adjacent H-buckets • Suppose A_p and A_q are in the same H-bucket but not in the same 2H-bucket • H ≤ lcp(A_p, A_q) < 2H • lcp(A_p, A_q) = H + lcp(A_{p+H}, A_{q+H}) • lcp(A_{p+H}, A_{q+H}) < H

  31. Finding LCP for adjacent buckets • Let i, j be the positions of A_{p+H}, A_{q+H} in the suffix array • Assume i < j • The array is ordered according to the ≤_H order • lcp(A_Pos[i], A_Pos[j]) = min(lcp(A_Pos[k-1], A_Pos[k])) over k ∈ [i+1, j] • (Figure: H = 1, positions i and j marked, adjacent lcp values 2, 1, 0)

  32. LCP Data Structures – Hgt[] • We need a data structure that will allow us to: • get the lcp’s of consecutive suffixes • get their minimum • Hgt[] – an N-1 sized array • Hgt[i] = lcp(A_Pos[i-1], A_Pos[i])
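To make the role of Hgt concrete, here is a sketch that builds Hgt directly from a fully sorted array and checks that the lcp of any two suffixes is the minimum of the Hgt values between them. It computes each entry by brute force rather than inductively during the sort, and the helper names are hypothetical.

```python
def lcp(v, w):
    length = 0
    for a, b in zip(v, w):
        if a != b:
            break
        length += 1
    return length


def build_hgt(text, pos):
    """hgt[i] = lcp(A_Pos[i-1], A_Pos[i]) for i = 1..N-1; hgt[0] is an unused placeholder."""
    return [None] + [lcp(text[pos[i - 1]:], text[pos[i]:]) for i in range(1, len(pos))]


def lcp_via_hgt(hgt, i, j):
    """lcp(A_Pos[i], A_Pos[j]) for i < j, as the minimum of Hgt[k] over k in [i+1, j]."""
    return min(hgt[i + 1:j + 1])


text = "assassin"
pos = sorted(range(len(text)), key=lambda i: text[i:])
hgt = build_hgt(text, pos)
print(hgt[1:])                 # [3, 0, 0, 0, 1, 1, 2]
print(lcp_via_hgt(hgt, 4, 7))  # lcp(sassin, ssin) = 1
```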

  33. LCP Data Structures – Hgt[] • Hgt will be computed inductively throughout the sort • Initialized to N+1 • Hgt[i] is updated in stage 2H ⇔ A_Pos[i] started a new 2H-bucket • To update Hgt[i]: • Let a, b be the array positions of A_{Pos[i-1]+H} and A_{Pos[i]+H} • Assume a ≤ b • Hgt[i] = H + min(Hgt[k]) over k ∈ [a+1, b]

  34. Finding LCP - Example • A = assassin, so Hgt entries start at N+1 = 9 • Hgt after H = 1: 9 0 0 0 9 9 9 • Hgt after H = 2: 9 0 0 0 1 1 9 • Hgt after H = 4: 3 0 0 0 1 1 2 • lcp(sin, ssin) = 1 + lcp(in, sin) = 1 + min(lcp(in, n), lcp(n, sassin), lcp(sassin, sin)) = 1 + 0 = 1 • lcp(sassin, sin) = 1 + lcp(assin, in) = 1

  35. LCP Data Structures - Interval Tree • We need the following operations for Hgt[]: • Set(i, h) – sets Hgt[i] to h • Min_height(i, j) – determines min(Hgt[k]) over k ∈ [i, j] • We need a way to find the lcp’s for all the necessary suffixes – not just the ones in consecutive positions

  36. LCP Data Structures - Interval Tree • A full and balanced binary tree • N-1 leaves, correspond to Hgt[] • O(logN) height, N-2 interior vertices • Keep a Hgt value for each interior vertex as well: • Hgt[v] = min(Hgt[left(v)], Hgt[right(v)])

  37. LCP Data Structures - Interval Tree • Operations implementation: • Set(i, h) • Set Hgt[i] to h and update the Hgt values on the path from i to the root • Min_height(i, j) • Finds the minimal Hgt value by scanning O(logN) vertices in the tree • Operations complexity – O(logN)
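A sketch of the two operations, assuming an array-backed minimum tree (leaves stored after the interior vertices) rather than the explicit full binary tree drawn on the slides. The class name `MinTree` and the 0-based leaf indexing are assumptions; both operations touch O(logN) vertices.

```python
class MinTree:
    """Array-backed tree over `size` leaves; each interior vertex stores the min of its children."""

    def __init__(self, size, initial):
        self.size = size
        # tree[size + i] is leaf i; tree[1 .. size-1] are interior vertices; tree[0] is unused.
        self.tree = [initial] * (2 * size)

    def set(self, i, h):
        """Set leaf i to h and refresh the values on the path from i to the root."""
        v = i + self.size
        self.tree[v] = h
        v //= 2
        while v >= 1:
            self.tree[v] = min(self.tree[2 * v], self.tree[2 * v + 1])
            v //= 2

    def min_height(self, i, j):
        """Minimum over leaves i..j inclusive (requires i <= j)."""
        lo, hi = i + self.size, j + self.size + 1
        best = self.tree[lo]
        while lo < hi:
            if lo & 1:           # lo is a right child: take it and move right
                best = min(best, self.tree[lo])
                lo += 1
            if hi & 1:           # hi is exclusive: step left and take that vertex
                hi -= 1
                best = min(best, self.tree[hi])
            lo //= 2
            hi //= 2
        return best


values = [3, 0, 0, 0, 1, 1, 2]   # final Hgt values for A = assassin
tree = MinTree(len(values), 9)   # 9 = N + 1 for N = 8, the "not yet set" marker
for leaf, h in enumerate(values):
    tree.set(leaf, h)
print(tree.min_height(4, 6))     # 1
```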

  38. Finding LCP – Interval Tree • (Figure: interval tree with leaves (0,1), (1,2), (2,3), (3,4), (4,5), (5,6), (6,7); each vertex is labeled with its Hgt value)

  39. Finding LCP - Complexity • In stage 2H we update Hgt[i] for all the leaves that started new buckets • Each update is one set operation and one Min_height - O(logN) • Throughout the algorithm every leaf is updated exactly once - O(N) updates • Updates complexity: O(NlogN) • In each stage we scan the array to see which suffixes opened new buckets • Scans complexity: O(NlogN) • Total LCP complexity O(NlogN)

  40. Finding LCP - Llcp[] and Rlcp[] • We want Llcp[] and Rlcp[] to be available directly from the interval tree at the end of the sort • Use an interval tree that represents a binary search • Each interior node corresponds to (L_M, R_M) for some M • For each interior node (L_M, R_M) • Left(L_M, R_M) = (L_M, M) • Right(L_M, R_M) = (M, R_M) • N-2 interior nodes • Leaves correspond to (i-1, i) • Leaf(i-1, i) = Hgt[i]

  41. Finding LCP - Llcp[] and Rlcp[] • According to the interval tree structure: • Hgt[(L, R)] = min(Hgt[k]) over k ∈ [L+1, R] • Hgt[(L, R)] = lcp(A_Pos[L], A_Pos[R]) • Llcp[M] = Hgt[(L_M, M)] • Rlcp[M] = Hgt[(M, R_M)]
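Given a finished Hgt array, Llcp and Rlcp can be read off by walking the binary-search intervals recursively. The sketch below assumes the midpoint rule M = (L + R) // 2 and a hypothetical function name; it works from a plain list instead of the interval tree built during the sort.

```python
def build_llcp_rlcp(hgt, n):
    """hgt[i] = lcp(A_Pos[i-1], A_Pos[i]) for i in 1..n-1 (hgt[0] unused)."""
    llcp = [None] * n
    rlcp = [None] * n

    def interval_lcp(left, right):
        # Returns Hgt[(left, right)] = lcp(A_Pos[left], A_Pos[right]).
        if right - left == 1:
            return hgt[right]              # leaf (i-1, i) holds Hgt[i]
        mid = (left + right) // 2
        llcp[mid] = interval_lcp(left, mid)
        rlcp[mid] = interval_lcp(mid, right)
        return min(llcp[mid], rlcp[mid])   # interior vertex = min of its children

    if n >= 2:
        interval_lcp(0, n - 1)
    return llcp, rlcp


hgt = [None, 3, 0, 0, 0, 1, 1, 2]          # Hgt for A = assassin
llcp, rlcp = build_llcp_rlcp(hgt, 8)
print(llcp)  # [None, 3, 0, 0, 0, 0, 1, None]
print(rlcp)  # [None, 0, 0, 0, 1, 1, 2, None]
```

Only the midpoints M = 1, ..., N-2 receive values, matching the N-2 interior nodes.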

  42. Worst Case Complexity • Suffix Array: build time O(NlogN); search time O(P + logN); structure space O(N), 2N - 3N integers; independent of |Σ| • Suffix Tree: build time O(N); search time O(P); structure space O(N), with a big constant; dependent on |Σ|

  43. Expected Time Improvements • Improve the expected case time of • Search Algorithm • Sort Algorithm • LCP computation • Use the following assumption • All N-symbol strings are equally likely • Under this assumption: • The expected length of the longest repeated substring of A is O(log_|Σ| N)

  44. Expected Case Improvements - Main Idea • Let T = ⌊log_|Σ| N⌋ • Let Int_T(u) = integer encoding in base |Σ| of the T-symbol prefix of u • Example: • T = 3 • Σ = {a, b} • u = abaa • Int_T(u) = 010 (base 2) = 2 • There are |Σ|^T ≤ N possible T-symbol prefixes • Int_T(u) is a number in [0, N-1] • Map each suffix A_p to Int_T(A_p) • Can be done in O(N) time
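The encoding can be sketched in a few lines. The function names are hypothetical, digits are assigned in sorted alphabet order, and padding strings shorter than T with the smallest digit is an assumption that the slides do not specify.

```python
def choose_t(n, sigma):
    """Largest T with sigma**T <= n, computed with integers to avoid float rounding."""
    t = 0
    while sigma ** (t + 1) <= n:
        t += 1
    return t


def int_t(u, alphabet, t):
    """Base-|alphabet| integer encoding of the T-symbol prefix of u."""
    digit = {c: d for d, c in enumerate(sorted(alphabet))}
    value = 0
    for k in range(t):
        # Assumption: a string shorter than T is padded with the smallest digit.
        d = digit[u[k]] if k < len(u) else 0
        value = value * len(digit) + d
    return value


print(int_t("abaa", "ab", 3))  # 2, since "aba" reads as 010 in base 2
print(choose_t(8, 2))          # 3
```

Encoding every suffix this way from scratch costs O(NT); reaching the O(N) bound on the slide would need an incremental update from one suffix to the next, which is not shown here.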

  45. Expected Case Improvements - Search Algorithm • Use an additional array Buck[] • Think of the sorted array as buckets, based on the Int_T encoding • Buck[k] = min{ i | Int_T(A_Pos[i]) = k } • The first position that contains a suffix that is mapped to k • Compute Buck[] • at the end of the sort algorithm • O(N) additional time

  46. Expected Case Improvements - Search Algorithm • Given a word W • We need to find L_W and R_W • Let k = Int_T(W) • L_W and R_W must be in k’s bucket • (Buck[k], Buck[k+1]) • We only need to search one bucket

  47. Expected Case Improvements - Search Algorithm • Number of buckets = |Σ|^T ≤ N • Average number of elements in a bucket = O(1) • In the binary search for W • Expected size of bucket to search = O(1) • Expected number of search steps: O(1) • Expected case time: O(P)
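A sketch of the bucket idea: build Buck[] from a sorted Pos array, then look for W only inside its own bucket. The names are hypothetical, the bucket is scanned linearly instead of binary searched, len(W) ≥ T is assumed, and short suffixes are padded with the smallest digit (an assumption carried by `int_t`).

```python
def int_t(u, digit, t):
    """Base-|Σ| encoding of the T-symbol prefix of u (short strings padded with digit 0)."""
    value = 0
    for k in range(t):
        value = value * len(digit) + (digit[u[k]] if k < len(u) else 0)
    return value


def build_buck(text, pos, digit, t):
    n, buckets = len(pos), len(digit) ** t
    buck = [None] * buckets + [n]          # sentinel entry so that Buck[k+1] always exists
    for i in range(n - 1, -1, -1):         # right to left, so the smallest i wins
        buck[int_t(text[pos[i]:pos[i] + t], digit, t)] = i
    for k in range(buckets - 1, -1, -1):   # an empty bucket becomes a zero-width range
        if buck[k] is None:
            buck[k] = buck[k + 1]
    return buck


def occurrences(text, pos, buck, digit, t, w):
    """Start positions of w (len(w) >= t assumed), examining only w's bucket."""
    k = int_t(w, digit, t)
    return sorted(pos[i] for i in range(buck[k], buck[k + 1])
                  if text.startswith(w, pos[i]))


text = "aababa"
digit = {"a": 0, "b": 1}
t = 2                                      # |Σ|^T = 4 <= N = 6
pos = sorted(range(len(text)), key=lambda i: text[i:])
buck = build_buck(text, pos, digit, t)
print(buck)                                              # [0, 2, 4, 6, 6]
print(occurrences(text, pos, buck, digit, t, "ab"))      # [1, 3]
```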

  48. Expected Case Improvements - Sort Algorithm • First stage of sort • Sort according to first symbol • Replace the first stage with a sort according to Int_T • Equivalent to sorting according to the first T symbols • Can be done in O(N) time • We changed the base case of the sort from H = 1 to H = T

  49. Expected Case Improvements - Sort Algorithm Observation: • Let C be the length of the longest repeated substring of A • Sort is in fact complete once we have reached (C+1)-buckets • Suppose some (C+1)-bucket contains more than one suffix • Then we have two suffixes with lcp > C • This prefix is a repeated substring longer than C - contradiction
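The observation can be checked on the running example: C is the largest lcp between neighbours in the sorted array, and the doubling sort is finished at the first stage H with H ≥ C+1. The helper names below are hypothetical, and C is computed by brute force for illustration only.

```python
def lcp(v, w):
    length = 0
    for a, b in zip(v, w):
        if a != b:
            break
        length += 1
    return length


def longest_repeat_length(text):
    """C = length of the longest repeated substring = max lcp of adjacent sorted suffixes."""
    pos = sorted(range(len(text)), key=lambda i: text[i:])
    return max((lcp(text[a:], text[b:]) for a, b in zip(pos, pos[1:])), default=0)


def final_stage(c):
    """Smallest power of two H with H >= C + 1, i.e. the stage at which all buckets are singletons."""
    h = 1
    while h < c + 1:
        h *= 2
    return h


c = longest_repeat_length("assassin")
print(c, final_stage(c))  # 3 4: "ass" repeats, and the example is fully sorted at H = 4
```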

  50. Expected Case Improvements - Sort Algorithm • Expected case: • C = O(log_|Σ| N) = O(T) • Number of stages: O(1) • Expected case time: O(N)
