1 / 29

Succinct Data Structures

Succinct Data Structures. Ian Munro University of Waterloo Joint work with David Benoit, Andrej Brodnik, D, Clark, F. Fich, M. He, J. Horton, A. L ó pez-Ortiz, S. Srinivasa Rao, Rajeev Raman, Venkatesh Raman, Adam Storm …

oliana
Download Presentation

Succinct Data Structures

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Succinct Data Structures Ian Munro University of Waterloo Joint work with David Benoit, Andrej Brodnik, D, Clark, F. Fich, M. He, J. Horton, A. López-Ortiz, S. Srinivasa Rao, Rajeev Raman, Venkatesh Raman, Adam Storm … How do we encode a large tree or other combinatorial object of specialized information … even a static one in a small amount of space and still perform queries in constant time ???

  2. Example of a Succinct Data Structure: The (Static) Bounded Subset Given: Universe of n elements [0,...n-1] and m arbitrary elements from this universe Create: a static structure to support search in constant time (lg n bit word and usual operations) Using: Essentially minimum possible # bits ... Operation: Member query in O(1) time (Brodnik & M.)

  3. Focus on Trees .. Because Computer Science is .. Arbophilic - Directories (Unix, all the rest) - Search trees (B-trees, binary search trees, digital trees or tries) - Graph structures (we do a tree based search) - Search indices for text (including DNA)

  4. A Big Patricia Trie / Suffix Trie 0 1 • Given a large text file; treat it as bit vector • Construct a trie with leaves pointing to unique locations in text that “match” path in trie (paths must start at character boundaries) • Skip the nodes where there is no branching ( n-1 internal nodes) 0 1 • 1 0 0 0 1 1

  5. Space for Trees Abstract data type: binary tree Size:n-1 internal nodes, n leaves Operations: child, parent, subtree size, leaf data Motivation: “Obvious” representation of an n node tree takes about 6 n lg n words (up, left, right, size, memory manager, leaf reference) i.e. full suffix tree takes about 5 or 6 times the space of suffix array (i.e. leaf references only)

  6. Succinct Representations of Trees Start with Jacobson, then others: There are about 4n/(πn)3/2 ordered rooted trees, and same number of binary trees Lower bound on specifying is about 2n bits What are the natural representations?

  7. Arbitrary Ordered Trees • Use parenthesis notation • Represent the tree • As the binary string (((())())((())()())):traverse tree as “(“ for node, then subtrees, then “)” • Each node takes 2 bits

  8. Heap-like Notation for a Binary Tree Add external nodes Enumerate level by level Store vector 11110111001000000 length2n+1 (Here don’t know size of subtrees; can be overcome. Could use isomorphism to flip between notations) 1 1 1 1 0 1 1 1 0 0 0 0 1 0 0 0 0

  9. How do we Navigate? Jacobson’s key suggestion:Operations on a bit vector rank(x) = # 1’s up to & including x select(x) = position of xth1 So in the binary tree leftchild(x) = 2 rank(x) rightchild(x) = 2 rank(x) + 1 parent(x) = select(x/2)

  10. Rank & Select Rank -Auxiliary storage ~ 2nlglg n / lg n bits #1’s up to each (lg n)2rd bit #1’s within these too each lg nth bit Table lookup after that Select -more complicated but similar notions Key issue: Rank & Select take O(1) time with lg n bit word (M. et al) Aside: Interesting data type by itself

  11. Other Combinatorial Objects Planar Graphs (Lu et al) Permutations [n]→ [n] Or more generally Functions [n] → [n] But what are the operations? Clearly π(i), but also π-1(i) And then πk(i) and π-k(i) Suffix Arrays (special permutations) in linear space

  12. Permutations: a Shortcut Notation Let P be a simple array giving π; P[i] = π[i] Also have B[i] be a pointer t positions back in (the cycle of) the permutation; B[i]= π-t[i] .. But only define B for every tth position in cycle. (t is a constant; ignore cycle length “round-off”) So array representation P = [8 4 12 5 13 x x 3 x 2 x 10 1] 1 2 3 4 5 6 7 8 9 10 11 12 13 2 4 5 13 1 8 3 12 10

  13. Representing Shortcuts In a cycle there is a B every t positions … But these positions can be in arbitrary order Which i’s have a B, and how do we store it? Keep a vector of all positions 0 indicates no B 1 indicates a B Rank gives the position of B[“i”] in B array So: π(i) and π-1(i) in O(1) time & (1+ε)n lg n bits Theorem: Under a pointer machine model with space (1+ ε) n references, we need time 1/ε to answer πandπ-1 queries; i.e. this is as good as it gets.

  14. Getting n lg n Bits: an Aside This is the best we can do for O(1) operations But using Benes networks: 1-Benes network is a 2 input/2 output switch r+1-Benes network … join tops to tops 1 2 3 4 5 6 7 8 3 5 7 8 1 6 4 2 R-Benes Network R-Benes Network

  15. A Benes Network Realizing the permutation (3 5 7 8 1 6 4 2) 1 2 3 4 5 6 7 8 3 5 7 8 1 6 4 2

  16. What can we do with it? Divide into blocks of lg lg n gates … encode their actions in a word. Taking advantage of regularity of address mechanism and also Modify approach to avoid power of 2 issue Can trace a path in time O(lg n/(lg lg n) This is the best time we are able get for π and π-1 in minimum space. Observe: This method “violates” the pointer machine lower bound by using “micropointers”.

  17. Back to the main track: Powers of π Consider the cycles of π ( 2 6 8)( 3 5 9 10)( 4 1 7) Keep a bit vector to indicate the start of each cycle ( 2 6 8 3 5 9 10 41 7) Ignoring parentheses, view as new permutation, ψ. Note: ψ-1(i) is position containing i … So we have ψ and ψ-1 as before Use ψ-1(i) to find i, then bit vector (rank, select) to find πkorπ-k

  18. Functions Now consider arbitrary functions [n]→[n] “A function is just a hairy permutation” All tree edges lead to a cycle

  19. Challenges here Essentially write down the components in a convenient order and use the n lg n bits to describe the mapping (as per permutations) To get fk(i): Find the level ancestor (k levels up) in a tree Or Go up to root and apply f the remaining number of steps around a cycle

  20. Level Ancestors There are several level ancestor techniques using O(1) time and O(n) WORDS. Adapt Bender & Farach-Colton to work in O(n) bits But going the other way …

  21. f-k is a set Moving Down the tree requires care f-3( ) = ( ) The trick: Report all nodes on a given level of a tree in time proportional to the number of nodes, and Don’t waste time on trees with no answers

  22. Final Function Result Given an arbitrary function f: [n]→[n] With an n lg n + O(n) bit representation we can compute fk(i) in O(1) time and f-k(i) in time O(1 + size of answer).

  23. Back to Text … And Suffix Arrays Text T[1..n] over (a,b)*#(a<#<b) There are 2n-1 such texts, which of the n! suffix arrays are valid? 1 2 3 4 5 6 7 8 SA=4 7 5 1 8 3 6 2 is a b b a a b a # SA-1=4 8 6 1 3 7 2 5 M=4 7 1 5 8 2 3 6isn’t ..why?

  24. Ascending to Max M is a permutation so M-1 is its inverse i.e. M-1[i] says where i is in M Ascending-to-Max:  1  i  n-2 • M-1[i] < M-1[n] and M-1[i+1] < M-1[n] M-1[i] < M-1[i+1] • M-1[i] > M-1[n] and M-1[i+1] > M-1[n] M-1[i] > M-1[i+1] 4 7 5 1 8 3 6 2 OK 4 7 1 5 82 3 6 NO

  25. Non-Nesting Non-Nesting: 1  i,j  n-1 andM-1[i]<M-1[j] • M-1[i] < M-1[i+1] and M-1[j] < M-1[j+1] M-1[i+1] < M-1[j+1] • M-1[i] > M-1[i+1] and M-1[j] > M-1[j+1] M-1[i+1] < M-1[j+1] 4 7 5 1 8 3 6 2 OK 4 7 1 582 36NO

  26. Characterization Theorem for Suffix Arrays on Binary Texts Theorem:Ascending to Max & Non-nesting  Suffix Array Corollary: Clean method of breaking SA into segments Corollary: Linear time algorithm to check whether SA is valid

  27. Cardinality Queries T= a b a a a b b a a a b a a b b # Remember lengths longest run of a’s and of b’s SA (broken by runs, but not stored explicitly) 8 3 | 9 4 12| 1 10 5 13|16 | 7 2 11 15|6 14 Ba, bit vector .. If SA-1[i-1] in an “a” section store 1 in Ba,[SA-1[i]], else 0 Ba 0 0 1 1 0 0 1 1 1 0 0 1 1 0 1 1 Create rank structure on Ba, and similarlyBb, (Note these are reversed except at #) Algorithm Count(T,P) s ← 1; e ←n; i ← m; while i>0 and se do if P[i[=a then s← rank1(Ba,s-1)+1; e←rank1(Ba,e) else s← na + 2 + rank1(Bb,s-1); e←na + 1 +rank1(Bb,e) i ← i-1 Return max(e-s+1,0) Time: O(length of query)

  28. Listing Queries Complex methods Key idea: for queries of length at least d, index every dth position .. For T and forT(reversed) So we have matches for T[i..n] and T[1,i-1] View these as points in 2 space (Ferragina & Manzini and Grossi & Vitter) Do a range query (Alstrup et al) Variety of results follow

  29. General Conclusion Interesting, and useful, combinatorial objects can be: Stored succinctly … O(lower bound) +o() So that Natural queries are performed in O(1) time (or at least very close) This can make the difference between using them and not …

More Related