Space-Efficient Data Structures for Top-k Completion

Space-Efficient Data Structures for Top-k Completion 蔡晓华

Outline • Motivation • Completion Trie • RMQ Trie • Score-Decomposed Trie • Q & A

Motivation: we focus on the case where the string set is so large that compression is needed to fit the data structure in memory. Contribution: three different trie-based data structures that address this problem.

Definition (scored string set): the completion suggestions are drawn from a set of strings, each associated with a score. We call such a set a scored string set. Problem (top-k completion): given a string p and an integer k, a top-k completion query in the scored string set S returns the k highest-scored pairs in S whose strings have p as a prefix.
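As a reference point, the definition above can be read directly as a filter followed by a selection over the whole set. This is only a baseline sketch under assumed data: the dictionary `S` and the name `top_k_bruteforce` are hypothetical and do not come from the slides.

```python
import heapq

def top_k_bruteforce(scored, p, k):
    """Return the k highest-scored (string, score) pairs whose string starts with p."""
    matches = [(s, r) for s, r in scored.items() if s.startswith(p)]
    # nlargest keeps the k best pairs by score without sorting everything.
    return heapq.nlargest(k, matches, key=lambda pair: pair[1])

# Hypothetical scored string set used only for illustration.
S = {"ab": 2, "caca": 3, "caccc": 1, "cbac": 2, "bab": 1}
print(top_k_bruteforce(S, "c", 2))
```

This scans every string on every query, which is the cost the trie-based structures in the rest of the talk are meant to avoid.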

Completion Trie • Trie: in a simple trie, each edge represents a single character • Compacted trie: allows a sequence of characters to be associated with each edge (except at the root)

Completion Trie Score: (1) Assign to each leaf node the score of the string it represents. (2) Assign to each intermediate node the maximum score among its descendant leaf nodes. By construction, the score of each non-leaf node is simply the maximum score among its children.

Completion Trie Compacted trie with max scores in each node

Completion Trie How to find the top-k completions: a) Find the locus node for the input prefix p. b) Add the locus node to a priority queue ordered by score. c) Repeatedly extract the highest-scored node: if it is a leaf node, report it as a completion; otherwise insert all of its children into the priority queue. Stop once k completions have been reported.

Completion Trie Improvement: instead of inserting all children of each expanded node into the priority queue, keep the children sorted in order of decreasing score. Result: when a node is expanded, only its first child and its next sibling (if any) need to be added.
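The search with sorted children can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses a plain one-character-per-edge trie instead of a compacted one, appends an assumed terminator `END` so that every string ends at a leaf, and all names (`Node`, `build`, `top_k`, the set `S`) are hypothetical.

```python
import heapq
from itertools import count

END = "\0"  # assumed terminator: guarantees every string ends at a leaf

class Node:
    def __init__(self):
        self.score = float("-inf")  # maximum score among descendant leaves
        self.kids = {}              # char -> Node, used while building
        self.ordered = []           # (char, Node) pairs by decreasing score

def build(scored):
    root = Node()
    for s, r in scored.items():
        node = root
        node.score = max(node.score, r)
        for ch in s + END:
            node = node.kids.setdefault(ch, Node())
            node.score = max(node.score, r)
    stack = [root]
    while stack:  # sort every child list by decreasing score
        node = stack.pop()
        node.ordered = sorted(node.kids.items(), key=lambda kv: -kv[1].score)
        stack.extend(node.kids.values())
    return root

def top_k(root, p, k):
    node = root
    for ch in p:  # find the locus node
        node = node.kids.get(ch)
        if node is None:
            return []
    tie = count()  # tie-breaker so the heap never compares Node objects
    heap = [(-node.score, next(tie), node, p, None, 0)]
    results = []
    while heap and len(results) < k:
        neg, _, node, text, siblings, i = heapq.heappop(heap)
        if siblings is not None and i + 1 < len(siblings):
            ch, sib = siblings[i + 1]  # next sibling of the expanded node
            heapq.heappush(heap, (-sib.score, next(tie), sib,
                                  text[:-1] + ch, siblings, i + 1))
        if text.endswith(END):         # leaf: one completion found
            results.append((text[:-1], -neg))
        elif node.ordered:             # internal node: first child only
            ch, first = node.ordered[0]
            heapq.heappush(heap, (-first.score, next(tie), first,
                                  text + ch, node.ordered, 0))
    return results

S = {"ab": 2, "caca": 3, "caccc": 1, "cbac": 2, "bab": 1}
print(top_k(build(S), "c", 2))
```

Each pop adds at most two entries to the queue, regardless of how many children the expanded node has.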

Completion Trie • Reduces the time complexity of finding the top-k completions • In practice, also reduces the number of comparisons needed to find the locus node

Completion Trie example • Input: prefix = c, k = 2 • Locus node: C • Strings returned: caca, cbac

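Advantages of the algorithm • In a traditional trie, each leaf node stores the score of its string; to answer a top-k query, one has to find all leaf nodes matching the prefix, sort them on the fly, and return the first k • Problem: when the prefix is short, there are many matches, and sorting costs both time and space • One alternative is to preprocess the data, finding the k completions of every prefix and storing them with it • Problem: k must be known in advance, and it is fixed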

Compressed Encoding Motivation • Improve the theoretical time complexity • Improve the locality of memory access (random access to RAM or a hard drive is much slower than access to the CPU cache)

Compressed Encoding Two strategies. BFS: store each group of child nodes consecutively. When finding the locus node, accessing the next sibling is then less likely to incur a cache miss.

Prefix=c Cache size=2

DFS: encode the nodes in DFS order. As each internal node is assigned the maximum score of its children and the children are sorted by decreasing score, repeatedly following the first child is guaranteed to reach a leaf node whose score matches that of the internal node. This typically incurs only one cache miss per completion.

Encoding for each node • Character sequence associated with its incoming edge • Score • Whether it is the last sibling • An offset pointer to its first child (0 if it has none) • Size for a character sequence of length l: (l+1)+4+1+4 = l+10 bytes

Variable-byte encoding is applied to scores and offsets • Store only the score difference between the current node and its previous sibling • Store the first-child offset as a delta relative to the previous sibling
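To illustrate why deltas help, here is a sketch of one common variable-byte scheme (7 data bits per byte, high bit as a continuation flag) applied to sibling scores sorted in decreasing order. The byte layout is an assumption chosen for illustration and is not necessarily the layout used by the authors; the first score is stored as-is here, which is also an assumption.

```python
def vbyte_encode(n):
    """Encode a non-negative integer using 7 data bits per byte."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # high bit set: more bytes follow
        n >>= 7
    out.append(n)
    return bytes(out)

def vbyte_decode(buf, pos=0):
    """Decode one integer starting at pos; return (value, next position)."""
    n = shift = 0
    while True:
        b = buf[pos]
        pos += 1
        n |= (b & 0x7F) << shift
        if b < 0x80:
            return n, pos
        shift += 7

def encode_sibling_scores(scores):
    """scores must be in decreasing order, so every delta is non-negative."""
    out = bytearray()
    prev = None
    for s in scores:
        out += vbyte_encode(s if prev is None else prev - s)
        prev = s
    return bytes(out)

def decode_sibling_scores(buf, n):
    scores, pos, prev = [], 0, None
    for _ in range(n):
        delta, pos = vbyte_decode(buf, pos)
        prev = delta if prev is None else prev - delta
        scores.append(prev)
    return scores

scores = [1000, 990, 990, 3]  # hypothetical sibling scores
encoded = encode_sibling_scores(scores)
print(len(encoded), "bytes instead of", 4 * len(scores))
```

Small differences between neighbouring siblings fit in a single byte, whereas fixed-width storage spends 4 bytes per score.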

Implementation Details How to get the string that matches a leaf node? DFS: reconstruct the string by starting from the root node and iteratively finding the child whose subtrie offset range includes the target leaf node. Reduce this cost by keeping additional bookkeeping in the search algorithm: • Store the nodes to be inserted into the queue in an array, along with the index of each node's parent in the array • The path from each completion node to the locus node can then be retrieved by following the parent indices
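The parent-index bookkeeping can be sketched as below. The array layout (a list of `(edge_label, parent_index)` pairs with `-1` marking the locus) and the sample entries are assumptions made for illustration.

```python
def suffix_from_locus(entries, idx):
    """Follow parent indices from entry idx back to the locus node and
    return the concatenated edge labels in root-to-leaf order."""
    parts = []
    while idx != -1:
        label, parent = entries[idx]
        parts.append(label)
        idx = parent
    return "".join(reversed(parts))

# Hypothetical bookkeeping array built during a search for prefix "c":
# entry 0 is the locus node itself, so its label is empty.
entries = [("", -1), ("a", 0), ("b", 0), ("ca", 1), ("ac", 2)]
locus_string = "c"
print(locus_string + suffix_from_locus(entries, 3))
print(locus_string + suffix_from_locus(entries, 4))
```

The walk touches only nodes the search has already visited, so it does not have to descend again from the root of the trie.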

End of the First Section Questions?

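RMQ Trie What is RMQ? • RMQ is short for Range Minimum Query, a data structure that returns the position of the minimum element in a given interval of an array • The RMQ Trie combines it with a dictionary that maps the set of strings to consecutive integers in lexicographic order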

RMQ Trie • If the string set S is represented with a trie, the set Sp of strings prefixed by p is a subtrie • If the scores are arranged in DFS order within an array R, the scores of Sp are those in an interval R[a,b] • PrefixRange(p): an operation that, given p, returns the pair (a,b), or null if no string is prefixed by p
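A minimal way to see what PrefixRange computes is to replace the trie with a lexicographically sorted list, since DFS order over a trie visits the strings in that same order. The sketch below uses two binary searches; the list `strings` and the name `prefix_range` are assumptions, and a sorted list is a stand-in for the compressed dictionary, not the structure from the talk.

```python
from bisect import bisect_left

def prefix_range(sorted_strings, p):
    """Return (a, b) such that sorted_strings[a..b] are exactly the strings
    prefixed by p, or None if there are none."""
    a = bisect_left(sorted_strings, p)
    # From index a onward, strings prefixed by p form a contiguous block,
    # so the first index that fails startswith can be found by bisection.
    lo, hi = a, len(sorted_strings)
    while lo < hi:
        mid = (lo + hi) // 2
        if sorted_strings[mid].startswith(p):
            lo = mid + 1
        else:
            hi = mid
    return (a, lo - 1) if lo > a else None

strings = sorted(["ab", "caca", "caccc", "cbac", "bab"])  # hypothetical set
print(prefix_range(strings, "c"))
print(prefix_range(strings, "x"))
```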

RMQ Trie Build an RMQ data structure on top of R using an inverted ordering, i.e. the minimum is the highest score. Strategy: • The index of the first completion is i = RMQ(a,b) • The second completion is the one with the higher score among RMQ(a,i-1) and RMQ(i+1,b) • Recursive splitting: in general, the index of the next completion is the highest-scored RMQ among all the intervals obtained so far • The intervals are maintained in a priority queue ordered by score
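The recursive splitting can be sketched as follows. A linear scan stands in for the RMQ structure here, which is an assumption made to keep the example short; a real RMQ answers each query far faster. The array `R` and the function names are hypothetical.

```python
import heapq

def range_argmax(R, a, b):
    """Stand-in for RMQ with inverted ordering: index of the highest score
    in R[a..b] (leftmost on ties). A real RMQ avoids this linear scan."""
    return max(range(a, b + 1), key=R.__getitem__)

def top_k_in_range(R, a, b, k):
    """Return the indices of the k highest scores in R[a..b], best first."""
    heap, out = [], []

    def push(lo, hi):
        if lo <= hi:
            i = range_argmax(R, lo, hi)
            heapq.heappush(heap, (-R[i], i, lo, hi))

    push(a, b)
    while heap and len(out) < k:
        _, i, lo, hi = heapq.heappop(heap)
        out.append(i)
        push(lo, i - 1)  # split the interval around the reported index
        push(i + 1, hi)
    return out

# Scores of the hypothetical strings ab, bab, caca, caccc, cbac
# in lexicographic order.
R = [2, 1, 3, 1, 2]
print(top_k_in_range(R, 2, 4, 2))
```

Each reported completion adds at most two new intervals to the queue, so k completions cost a number of RMQ calls linear in k.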

RMQ Trie • Advantage • Simplicity and modularity (re-uses an existing dictionary data structure without any significant modification) • Disadvantage • Hard to implement the operation PrefixRange • The cost of PrefixRange is significantly worse

End of the Second Section Q&A

Score-Decomposed Trie • Path decompositions: let T be the trie built on the strings of the scored string set S. A path decomposition of T is a tree Tc whose nodes correspond to node-to-leaf paths in T. It is built by choosing a root-to-leaf path Π in T and associating it with the root node of Tc; the children of the root node are defined recursively as path decompositions of the subtries hanging off the path Π

Score-Decomposed Trie • Find a root-to-leaf path • Let the path be the root node of the new trie Tc • Recursively define the children of the root node

Score-Decomposed Trie Note that while each string s in S corresponds to a root-to-leaf path in T, in Tc it corresponds to a root-to-node path. • Max-score path decomposition is one way to choose the path • Choose the path to the leaf with the highest score. The subtries at the same level are arranged in decreasing order of score (the score of a subtrie is defined as the highest score in the subtrie)

Score-Decomposed Trie • r represents the highest score in the subtrie rooted at Vi • Add r to the label of the edge leading to the corresponding child, so that the label becomes the pair (b,r), where b is the branching character
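A max-score path decomposition can be sketched directly on the strings, without materialising T first. This is an illustration under stated assumptions, not the authors' construction: the set is assumed prefix-free (no string is a prefix of another), each node is a `(path, children)` pair, each child entry is `(position, b, r, subtree)`, and the names `decompose` and `S` are hypothetical.

```python
from collections import defaultdict

def common_prefix_len(x, y):
    n = 0
    while n < len(x) and n < len(y) and x[n] == y[n]:
        n += 1
    return n

def decompose(items, depth=0):
    """items: (string, score) pairs sharing their first `depth` characters.
    Returns (path, children) for the max-score path decomposition."""
    best, _ = max(items, key=lambda it: it[1])  # highest-scored string
    groups = defaultdict(list)
    for s, r in items:
        if s != best:
            i = common_prefix_len(s, best)      # where s leaves the path
            groups[(i, s[i])].append((s, r))    # s[i] is the branching char
    children = [
        (i - depth, b, max(r for _, r in group), decompose(group, i + 1))
        for (i, b), group in groups.items()
    ]
    # Order by branching position, then by decreasing subtrie score.
    children.sort(key=lambda c: (c[0], -c[2]))
    return (best[depth:], children)

S = {"ab": 2, "caca": 3, "caccc": 1, "cbac": 2, "bab": 1, "cbb": 1}
root_score = max(S.values())        # the root score is stored separately
tree = decompose(list(S.items()))
print(root_score, tree)
```

In this example the root node holds the path `caca`, and the string `cbb` ends in a node with an empty path because all of its characters are already covered by its ancestors and its incoming edge.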

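Score-Decomposed Trie example • The first path is root + B; B is a leaf node, so the path ends there. ab has two siblings, hence the label 2ab • The edges of the decomposed tree are nodes of the original trie, e.g. c,3 and b,2 on the second level • Take c,3 on the left as an example: after following the first path, recursively find the path of the subtrie, i.e. nodes C + E + G. The path is aca; because ac and a have one sibling each, the label is 1ac1a • G has only one sibling, so the next path is H by itself. The string cc is split in two: the edge is c,1 and the node is c • The rest is handled in the same way. Note the rightmost node K: its label is just 1, because its character sequence is empty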

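[Figure: the score-decomposed trie for the example. Node labels: 2ab, 1ac1a, 1, c, 1ac, a, a. Edge labels: (c,3), (b,2), (c,1), (b,2), (b,1), (b,1).]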

Score-Decomposed Trie How to support top-k completion enumeration? 1) Because of the max-score decomposition strategy, the highest score in each subtrie is exactly the score of the decomposition path for that subtrie. 2) The tree has the heap property: the score of each node is less than or equal to the score of its parent

How to support top-k completion enumeration? 3) This implies that for each (s,r) in S, if u is the node corresponding to s, then r is stored in the incoming edge of u, except when u is the root, whose score is stored separately.

Score-Decomposed Trie • First, follow the algorithm of the Lookup operation until the prefix p is exhausted, leading to the locus node u, the highest node whose corresponding string contains p. Report it as the first completion • Find the next completions: the prefix p ends at some position i in the label Lu. Thus all the other completions must be in the subtrees whose roots are the children of u branching after position i; add these children to a priority queue • Repeatedly extract the highest-scored node from the priority queue, report the string corresponding to it, and add all its children to the priority queue
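The enumeration steps can be sketched on a small hand-built decomposition. Everything here is assumed for illustration: the node layout `(path, children)` with child entries `(position, b, r, subtree)`, the literal `TREE` (a max-score decomposition of the hypothetical set ab:2, caca:3, caccc:1, cbac:2, bab:1, cbb:1), and the function names.

```python
import heapq
from itertools import count

TREE = ("caca", [
    (0, "a", 2, ("b", [])),
    (0, "b", 1, ("ab", [])),
    (1, "b", 2, ("ac", [(0, "b", 1, ("", []))])),
    (3, "c", 1, ("c", [])),
])
ROOT_SCORE = 3  # the root score is stored separately from the tree

def common_prefix_len(x, y):
    n = 0
    while n < len(x) and n < len(y) and x[n] == y[n]:
        n += 1
    return n

def top_k(tree, root_score, p, k):
    # Lookup: descend until the prefix is exhausted inside a node's path.
    node, score, before = tree, root_score, ""
    while True:
        path, children = node
        rest = p[len(before):]
        i = common_prefix_len(path, rest)
        if i == len(rest):
            break  # p ends at position i of the locus node's path
        nxt = [(r, child) for pos, b, r, child in children
               if pos == i and b == rest[i]]
        if not nxt:
            return []  # no string is prefixed by p
        score, node = nxt[0]
        before += path[:i] + rest[i]
    tie = count()  # tie-breaker so the heap never compares nodes
    heap = [(-score, next(tie), before, node, i)]
    out = []
    while heap and len(out) < k:
        neg, _, before, (path, children), min_pos = heapq.heappop(heap)
        out.append((before + path, -neg))
        for pos, b, r, child in children:
            if pos >= min_pos:  # for the locus: only branches after p ends
                heapq.heappush(
                    heap, (-r, next(tie), before + path[:pos] + b, child, 0))
    return out

print(top_k(TREE, ROOT_SCORE, "cac", 2))
```

Because of the heap property, every node popped from the queue is itself a completion, so no work is spent on nodes that do not end a string.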

Score-Decomposed Trie example • Prefix = cac, k = 2 • Locus node: 1ac1a • Completions: caca, caccc

Thank You! Questions ?
