Efficient Keyword Search for Smallest LCAs in XML Database

Yu Xu, Department of Computer Science & Engineering, University of California, San Diego
Yannis Papakonstantinou, Department of Computer Science & Engineering, University of California, San Diego

Abstract
• Keyword search is a proven, user-friendly way to query HTML documents on the World Wide Web.
• The paper presents efficient algorithms for keyword search in XML documents, which are modeled as labeled trees.
• The answer to a query is the set of smallest trees containing all keywords.

Abstract
• Core contribution: the Indexed Lookup Eager algorithm. It exploits key properties of smallest trees and is used when the query contains keywords with significantly different frequencies.
• The Scan Eager algorithm is tuned for keywords with similar frequencies.
• The algorithms are evaluated analytically and experimentally.
• The XKSearch system is presented. It uses the Indexed Lookup Eager, Scan Eager and Stack algorithms.

Outline
• Introduction
• Notation
• Algorithms for finding the SLCA of keyword lists
• The Indexed Lookup Eager Algorithm (IL)
• Scan Eager Algorithm
• The Stack Algorithm
• XKSearch System Implementation
• Experiments
• Conclusions

Introduction
According to the Smallest Lowest Common Ancestor (SLCA) semantics, the result of a keyword query is the set of nodes that:
• contain the keywords either in their labels or in the labels of their descendant nodes, and
• have no descendant node that also contains all keywords.

Introduction
• Example: if you ask for the relation between John and Ben, the answer is the node list [0.1.1, 0.1.2, 0.2.0.0].
• XQuery: the equivalent query is complex and difficult to execute efficiently.

Notation
• Each node v of the tree corresponds to an XML element and is labeled with a tag λ(v).
• Each node v is assigned a numerical id pre(v) that is compatible with preorder numbering.
• The XKSearch implementation uses Dewey numbers as the ids. They provide a straightforward solution to locating the LCA of two nodes and are compatible with preorder numbering, e.g. 0.1.0.0.0 < 0.1.1.1.
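
The ordering on Dewey numbers can be illustrated with a minimal Python sketch. It assumes, as a choice made here and not stated in the slides, that a Dewey number is held as a tuple of integers; `parse_dewey` is a hypothetical helper name. Python compares tuples component by component and treats a proper prefix as smaller, so an ancestor sorts before its descendants, as a preorder-compatible id should.

```python
def parse_dewey(text):
    """Turn a dotted Dewey number such as "0.1.1.1" into a tuple of ints.

    Assumption for this sketch: components are non-negative integers.
    """
    return tuple(int(part) for part in text.split("."))


# The comparison quoted on the slide: 0.1.0.0.0 < 0.1.1.1
a = parse_dewey("0.1.0.0.0")
b = parse_dewey("0.1.1.1")
slide_example_holds = a < b

# An ancestor (a proper prefix) sorts before its descendant.
ancestor_first = parse_dewey("0.1") < parse_dewey("0.1.0")

# Comparing the raw strings would be wrong once a component has two digits:
# "0.10" sorts before "0.9" as text, but 10 > 9 as a component.
string_order = "0.10" < "0.9"
tuple_order = parse_dewey("0.10") < parse_dewey("0.9")
```

Comparing the parsed tuples rather than the dotted strings avoids the two-digit pitfall shown in the last two lines.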

Notation
For a list of k keywords w₁, ..., wₖ and an input XML tree T:
• an answer subtree is a subtree of T that contains at least one instance of each keyword w₁, ..., wₖ.
• a smallest answer subtree is an answer subtree none of whose subtrees is an answer subtree.
• slca(w₁, ..., wₖ) = the set of the roots of all smallest answer subtrees of w₁, ..., wₖ.

Notation
• Sᵢ: the keyword list of wᵢ (i.e. the list of nodes whose label directly contains wᵢ, sorted by id)
• u ≺ v: the node u is an ancestor of node v
• u ≼ v: u ≺ v or u = v
• a node u of a set S is an ancestor node in S if there exists a node v in S such that u ≺ v
• if u ≺ v then pre(u) < pre(v)
• lca(v₁, ..., vₖ): the lowest common ancestor of the nodes v₁, ..., vₖ, e.g. lca(0.1.1.1.0, 0.1.1.2.0) = 0.1.1
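
With Dewey numbers, the lowest common ancestor is the longest common prefix. The following Python sketch reproduces the lca example from this slide; it assumes Dewey numbers are stored as tuples of integers, and `is_ancestor` is a hypothetical helper name for the ancestor relation.

```python
def lca(*nodes):
    """Lowest common ancestor of Dewey numbers = their longest common prefix."""
    prefix = []
    for components in zip(*nodes):      # zip stops at the shortest node
        if len(set(components)) != 1:   # first position where the nodes differ
            break
        prefix.append(components[0])
    return tuple(prefix)


def is_ancestor(u, v):
    """True if u is a proper ancestor of v, i.e. u is a proper prefix of v."""
    return len(u) < len(v) and v[:len(u)] == u


example = lca((0, 1, 1, 1, 0), (0, 1, 1, 2, 0))   # the slide's example
```

Each call touches at most d components per node, where d is the depth of the tree, which is why an lca operation on Dewey numbers is cheap.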

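Notation
• Given sets of nodes S₁, ..., Sₖ, a node v ∈ lca(S₁, ..., Sₖ) if there exist v₁ ∈ S₁, ..., vₖ ∈ Sₖ such that v = lca(v₁, ..., vₖ).
• v belongs to the smallest lowest common ancestor (SLCA) set slca(S₁, ..., Sₖ) if v ∈ lca(S₁, ..., Sₖ) and no node of lca(S₁, ..., Sₖ) is a descendant of v.
• The result is slca(S₁, ..., Sₖ) = removeAncestor(lca(S₁, ..., Sₖ)), where removeAncestor removes ancestor nodes from its input.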

Notation
• rm(v, S) (lm(v, S)) = right (left) match of v in the set S: the node of S that has the smallest (biggest) id that is greater (smaller) than or equal to pre(v).
• rm (lm) returns null when there is no right (left) match node.
• Cost: O(log |S|) steps to find the right (left) match, and O(d) to compare two Dewey numbers, where d is the depth of the tree. Total cost: O(d log |S|).
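
The two match functions can be sketched with a binary search over a sorted keyword list. This is an illustrative Python version, not the XKSearch code: it assumes the list is an in-memory sorted list of Dewey tuples and uses `None` for the null result. The sample list is the second keyword list from the Stack algorithm example later in the deck.

```python
from bisect import bisect_left, bisect_right


def rm(v, S):
    """Right match: the node of S with the smallest id >= v, or None."""
    i = bisect_left(S, v)            # first index whose node is >= v
    return S[i] if i < len(S) else None


def lm(v, S):
    """Left match: the node of S with the biggest id <= v, or None."""
    i = bisect_right(S, v)           # first index whose node is > v
    return S[i - 1] if i > 0 else None


# Second keyword list of the deck's example, sorted by id.
john = [(0, 0, 0), (0, 1, 1, 1, 0), (0, 2, 0, 0)]
```

For instance, `lm((0, 1, 0), john)` is `(0, 0, 0)` and `rm((0, 1, 0), john)` is `(0, 1, 1, 1, 0)`. Each call performs O(log |S|) tuple comparisons, and each comparison inspects at most d components.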

Algorithms for finding the SLCA
• A brute-force solution to the SLCA problem computes the LCAs of all node combinations and then removes ancestor nodes.
• Complexity: O(kd|S₁|···|Sₖ|)
• It is blocking: after it computes an LCA v = lca(v₁, ..., vₖ) for some v₁ ∈ S₁, ..., vₖ ∈ Sₖ, it cannot report v as an answer, since there might be another set of k nodes u₁, ..., uₖ such that lca(u₁, ..., uₖ) is a descendant of v.
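
The brute-force approach can be written down directly. The Python sketch below assumes Dewey numbers are tuples of integers; the function names are illustrative. The input is the three keyword lists used in the Stack algorithm example later in the deck, for which the deck reports 0.1.1 as the SLCA.

```python
from itertools import product


def lca(*nodes):
    """Longest common prefix of the given Dewey tuples."""
    prefix = []
    for components in zip(*nodes):
        if len(set(components)) != 1:
            break
        prefix.append(components[0])
    return tuple(prefix)


def remove_ancestors(nodes):
    """Keep only the nodes that have no proper descendant in the input."""
    unique = set(nodes)
    return sorted(
        u for u in unique
        if not any(len(u) < len(w) and w[:len(u)] == u for w in unique)
    )


def slca_brute_force(keyword_lists):
    """LCA of every combination of one node per list, then drop ancestors."""
    all_lcas = [lca(*combo) for combo in product(*keyword_lists)]
    return remove_ancestors(all_lcas)


class_nodes = [(0, 1, 0), (0, 1, 1)]
john = [(0, 0, 0), (0, 1, 1, 1, 0), (0, 2, 0, 0)]
ben = [(0, 1, 0, 0, 0), (0, 1, 1, 2, 0), (0, 2, 0, 1)]
result = slca_brute_force([class_nodes, john, ben])
```

Nothing can be reported before `remove_ancestors` has seen every LCA, which is the blocking behaviour described above.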

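Indexed Lookup Eager Algorithm
• Preferred when the keyword search includes at least one low-frequency keyword.
• Based on four properties of SLCAs.
Property (1): slca({v}, S) contains a single node, the deeper of lca(v, lm(v, S)) and lca(v, rm(v, S)).
Observations:
• for any two nodes v₁, v₂ to the right of a node v, if pre(v) < pre(v₁) < pre(v₂) then lca(v, v₂) ≼ lca(v, v₁)
• for any two nodes v₁, v₂ to the left of a node v, if pre(v₂) < pre(v₁) < pre(v) then lca(v, v₂) ≼ lca(v, v₁)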

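Indexed Lookup Eager Algorithm
Property (2): slca({v}, S₂, ..., Sₖ) = slca(slca({v}, S₂, ..., Sₖ₋₁), Sₖ) for k>2
Property (3): slca(S₁, ..., Sₖ) = removeAncestor(slca({v₁}, S₂, ..., Sₖ) ∪ ... ∪ slca({vₙ}, S₂, ..., Sₖ)), where v₁, ..., vₙ are the nodes of S₁
Property (3) leads to an algorithm to compute slca(S₁, ..., Sₖ):
- it computes {xᵢ} = slca({vᵢ}, S₂, ..., Sₖ) for each vᵢ ∈ S₁ (1≤i≤n)
- the answer is removeAncestor({x₁, ..., xₙ})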

Indexed Lookup Eager Algorithm
• Benefit over brute force: for each node v₁ in S₁, the algorithm does not compute lca(v₁, v₂, ..., vₖ) for all v₂ ∈ S₂, ..., vₖ ∈ Sₖ.
• It computes a single lca(v₁, v₂, ..., vₖ), where each vᵢ (2≤i≤k) is computed by the match functions (lm and rm).
• Complexity: O(d|S₁|(log |S₂| + ... + log |Sₖ|)), or O(kd|S₁| log |S|), where |S| is the size of the largest keyword list.
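
A minimal Python sketch of the lookup idea follows. It is the simple, blocking formulation (one candidate per node of the smallest list, ancestors removed at the end), not the eager buffered version used by XKSearch. It assumes that the SLCA of a single node v and a list S is the deeper of lca(v, lm(v, S)) and lca(v, rm(v, S)), and that this step can be chained list by list. All function names and the tuple representation are choices made for this sketch.

```python
from bisect import bisect_left, bisect_right


def lca(u, v):
    """Longest common prefix of two Dewey tuples."""
    n = 0
    while n < min(len(u), len(v)) and u[n] == v[n]:
        n += 1
    return u[:n]


def lm(v, S):
    i = bisect_right(S, v)
    return S[i - 1] if i > 0 else None


def rm(v, S):
    i = bisect_left(S, v)
    return S[i] if i < len(S) else None


def slca_single(v, lists):
    """SLCA of {v} and the given sorted lists, or None if a list is empty."""
    x = v
    for S in lists:
        candidates = [lca(x, m) for m in (lm(x, S), rm(x, S)) if m is not None]
        if not candidates:
            return None
        x = max(candidates, key=len)   # both are prefixes of x: keep the deeper
    return x


def remove_ancestors(nodes):
    """In sorted order a node with a descendant is directly followed by one."""
    unique = sorted(set(nodes))
    return [u for i, u in enumerate(unique)
            if not (i + 1 < len(unique) and unique[i + 1][:len(u)] == u)]


def indexed_lookup_slca(keyword_lists):
    lists = sorted(keyword_lists, key=len)   # drive from the smallest list
    first, rest = lists[0], lists[1:]
    found = []
    for v in first:
        x = slca_single(v, rest)
        if x is not None:
            found.append(x)
    return remove_ancestors(found)


class_nodes = [(0, 1, 0), (0, 1, 1)]
john = [(0, 0, 0), (0, 1, 1, 1, 0), (0, 2, 0, 0)]
ben = [(0, 1, 0, 0, 0), (0, 1, 1, 2, 0), (0, 2, 0, 1)]
result = indexed_lookup_slca([class_nodes, john, ben])
```

Only two lookups per list are made for each node of the smallest list, which is where the dependence on |S₁| and log |S| comes from.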

Indexed Lookup Eager Algorithm
Example:
• Result = {} ∪ {0.1.1} = {0.1.1}
• Result = {0.1.1} ∪ {1.2.0} = {0.1.1, 1.2.0}

Indexed Lookup Eager Algorithm
Derivation of the algorithm to compute slca(S₁, ..., Sₖ)
Property (4): slca(S₁, ..., Sₖ) = slca(slca(S₁, ..., Sₖ₋₁), Sₖ) for k>2
• Applied directly, this gives a blocking algorithm: it only processes the last keyword list after it completely processes the first k-1 keyword lists.

Indexed Lookup Eager Algorithm
• All nodes in xᵢ except the last one are guaranteed to be SLCAs.
• The last node is carried over to the next operation.
• The operation is repeated for all groups of P nodes of Sᵢ.
• The smaller P is, the faster the algorithm produces the first SLCA.
• No operations are needed to remove ancestor nodes from a set.

Indexed Lookup Eager Algorithm
Example: “Class”, “John” and “Ben”, P=1
• B=[0.1.0]; Output; B=ø
• B={}; B=[0.1.1]
• Output v=0.1.1 (line #13)

Scan Eager Algorithm
• Used when the occurrences of keywords do not differ significantly.
• Its lm and rm implementations scan keyword lists to find matches, keeping a cursor for each keyword list.
• Observation: nodes from different lists may not be accessed in order.
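
The cursor-based matching can be sketched as follows. This Python fragment is an illustration under a stated assumption: the lookups against one list arrive in non-decreasing id order, so the cursor only ever moves forward. The class name `ScanList` and its method are hypothetical.

```python
class ScanList:
    """A sorted keyword list with a forward-only cursor.

    Assumption for this sketch: successive queries against the same list
    are in non-decreasing id order, so earlier nodes never need revisiting.
    """

    def __init__(self, nodes):
        self.nodes = nodes
        self.cursor = 0        # index of the first node not smaller than the last query

    def matches(self, v):
        """Return (lm, rm) of v in this list; either may be None."""
        while self.cursor < len(self.nodes) and self.nodes[self.cursor] < v:
            self.cursor += 1   # skip nodes that lie strictly to the left of v
        i = self.cursor
        right = self.nodes[i] if i < len(self.nodes) else None
        if right == v:
            left = v           # v itself is in the list
        else:
            left = self.nodes[i - 1] if i > 0 else None
        return left, right


john = ScanList([(0, 0, 0), (0, 1, 1, 1, 0), (0, 2, 0, 0)])
first = john.matches((0, 1, 0))
second = john.matches((0, 1, 1))
third = john.matches((0, 2, 0, 0))
fourth = john.matches((0, 3))
```

Over a whole run the cursor of a list advances at most once per node of that list, so the matching cost is linear in the list size instead of logarithmic per lookup.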

Scan Eager Algorithm
• Complexity: O(kd|S₁| + d(|S₂| + ... + |Sₖ|)), or O(kd|S|), where |S| is the size of the largest keyword list.

The Stack Algorithm
• Each stack entry has a pair of components (id, keywords).
• id: the id components from the bottom entry to a stack entry en, read in order, form the Dewey number of the node that en denotes.
• keywords: an array of length k of boolean values. keywords[i]=T means that the subtree rooted at the node denoted by the stack entry directly or indirectly contains the keyword wᵢ.

The Stack Algorithm
Example: “Class”, “John” and “Ben”
Keyword lists: [0.1.0, 0.1.1], [0.0.0, 0.1.1.1.0, 0.2.0.0], [0.1.0.0.0, 0.1.1.2.0, 0.2.0.1]
• Initially: the stack is empty.
• First iteration: v=0.0.0, p=NULL; add the non-matching components to the stack.
• Second iteration: v=0.1.0 (the next smallest node), p=lca(stack, v)=0; pop the top 2 entries of the stack (the important information is carried over); add the non-matching components.

The Stack Algorithm
…
• Seventh iteration: v=0.2.0.0, p=lca(0.1.1.2.0, 0.2.0.0)=0; pop the top 4 entries of the stack.
• A popped entry that is not an SLCA passes its keyword witness information to the top entry.
• When popping the third component we find an SLCA: 0.1.1 is output as an SLCA.
• Complexity: measured by the number of lca operations and the number of Dewey number comparisons.
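
The walk-through above can be reproduced with a short Python sketch. It is an interpretation of the slides, not the XKSearch code: the keyword lists are merged in id order with `heapq.merge`, each stack entry holds one Dewey component and a boolean keyword array, and a third field `blocked` is added here as an assumption to stop ancestors of a reported SLCA from being reported themselves.

```python
import heapq


def slca_stack(keyword_lists):
    k = len(keyword_lists)
    # One sorted stream of (node, keyword index) pairs per keyword list.
    streams = [[(node, i) for node in S] for i, S in enumerate(keyword_lists)]
    stack = []      # entries: [component, keywords, blocked]
    results = []

    def pop_entry():
        component, keywords, blocked = stack.pop()
        if not blocked and all(keywords):
            # Every keyword is witnessed below this node and no SLCA lies
            # beneath it: the path from the stack bottom is an SLCA.
            results.append(tuple(e[0] for e in stack) + (component,))
            blocked = True
        if stack:
            parent = stack[-1]
            if blocked:
                parent[2] = True            # ancestors can no longer be SLCAs
            else:
                for i in range(k):          # pass keyword witness information up
                    parent[1][i] = parent[1][i] or keywords[i]

    for node, i in heapq.merge(*streams):   # nodes in increasing id order
        p = 0                               # length of the shared prefix
        while p < len(stack) and p < len(node) and stack[p][0] == node[p]:
            p += 1
        while len(stack) > p:               # pop the non-matching entries
            pop_entry()
        for component in node[p:]:          # push the non-matching components
            stack.append([component, [False] * k, False])
        stack[-1][1][i] = True              # this node directly contains keyword i
    while stack:
        pop_entry()
    return results


class_nodes = [(0, 1, 0), (0, 1, 1)]
john = [(0, 0, 0), (0, 1, 1, 1, 0), (0, 2, 0, 0)]
ben = [(0, 1, 0, 0, 0), (0, 1, 1, 2, 0), (0, 2, 0, 1)]
result = slca_stack([class_nodes, john, ben])
```

On the example lists the seventh merged node is 0.2.0.0, four entries are popped, and the third of them, 0.1.1, is reported, matching the iteration described above.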

The Stack Algorithm
The Scan Eager algorithm has several advantages over the Stack algorithm:
• The Scan Eager algorithm starts from the smallest keyword list and does not have to scan to the end of every keyword list, so it may terminate much earlier than the Stack algorithm.
• The number of lca operations of the Scan Eager algorithm is usually much smaller than that of the Stack algorithm.
• The Stack algorithm operates on a stack whose depth is bounded by the depth of the input tree, while the Scan Eager algorithm with P=1 only needs to keep three nodes in the whole process, and no push/pop operations are involved.

XKSearch System Implementation
• The Indexed Lookup Eager, Scan Eager and Stack algorithms are implemented in Java using the Apache Xerces XML parser and Berkeley DB.

XKSearch System Implementation
• The architecture: the B-tree structure allows an efficient implementation of the match operations.

XKSearch System Implementation
• The level table LT has d entries (d = depth of the input tree).
• LT(i) = the maximum number of bits needed to store the i-th component in a Dewey number: LT(i) = ⌈log₂ c⌉, where c is the number of children of the node at level i-1 that has the maximum number of children among all nodes at the same level.
• In general, the LT entries of the components of a Dewey number are added up, and that many bits, rounded up to whole bytes, are needed for a Dewey number of a node at level i.
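
The space saving from a level table can be illustrated with a small Python sketch. The details are assumptions made for the illustration, not the XKSearch encoding: each level gets just enough bits for the largest child index at that level (with a floor of one bit), components are packed most significant first, and the number of components is assumed to be known when decoding. All function names are hypothetical.

```python
from math import ceil


def build_level_table(max_children_per_level):
    """Bits per level: enough for child indexes 0..c-1, at least one bit."""
    return [max(1, (c - 1).bit_length()) for c in max_children_per_level]


def pack(dewey, level_table):
    """Pack a Dewey tuple into bytes using the per-level bit widths."""
    value = 0
    for component, bits in zip(dewey, level_table):
        if component >= (1 << bits):
            raise ValueError("component does not fit its level width")
        value = (value << bits) | component
    total_bits = sum(level_table[:len(dewey)])
    return value.to_bytes(ceil(total_bits / 8), "big")


def unpack(data, length, level_table):
    """Inverse of pack; `length` is the number of Dewey components."""
    value = int.from_bytes(data, "big")
    components = []
    for bits in reversed(level_table[:length]):
        components.append(value & ((1 << bits) - 1))
        value >>= bits
    return tuple(reversed(components))


# Hypothetical tree shape: at most 1, 3, 2, 3 and 1 children per level.
table = build_level_table([1, 3, 2, 3, 1])
packed = pack((0, 1, 1, 2, 0), table)
restored = unpack(packed, 5, table)
```

With these widths the five-component number 0.1.1.2.0 needs 7 bits, so it fits in a single byte.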

XKSearch System Implementation
Indexed Lookup Eager algorithm
• The keyword lists are stored in a single B+ tree where keywords are the primary key and Dewey numbers are the secondary key.
• For a keyword w and a Dewey number p, it takes a single scan operation to find the right and left match of p in the keyword list of w.
• The number of disk accesses: - cannot be more than (Bᵢ = the number of blocks of keyword list Sᵢ)
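
The composite-key layout can be imitated in memory with a sorted list of (keyword, Dewey number) pairs. The Python sketch below stands in for the B+ tree and does not use the Berkeley DB API; the function name `matches` and the sample data are assumptions for illustration. One positioning step finds the right match, and the entry just before it is the left match, provided both belong to the same keyword.

```python
from bisect import bisect_left


def matches(index, w, p):
    """Return (left match, right match) of Dewey tuple p in the list of keyword w."""
    i = bisect_left(index, (w, p))       # position of the first key >= (w, p)
    right = index[i][1] if i < len(index) and index[i][0] == w else None
    if right == p:
        left = p                          # p itself is in the keyword list
    elif i > 0 and index[i - 1][0] == w:
        left = index[i - 1][1]            # predecessor key, same keyword
    else:
        left = None
    return left, right


# Composite keys sorted by keyword first, then by Dewey number.
index = sorted(
    [("John", n) for n in [(0, 0, 0), (0, 1, 1, 1, 0), (0, 2, 0, 0)]]
    + [("Ben", n) for n in [(0, 1, 0, 0, 0), (0, 1, 1, 2, 0), (0, 2, 0, 1)]]
)
john_match = matches(index, "John", (0, 1, 0))
ben_match = matches(index, "Ben", (0, 1))
```

Checking the keyword of the neighbouring entries keeps a lookup from spilling into the list of an adjacent keyword.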

XKSearch System Implementation
Scan Eager algorithm
• The keys in the B+ tree are simply keywords.
• The data associated with each key w is the list of Dewey numbers of the nodes directly containing the keyword w.
• The number of disk accesses:

Experiments
• The cold cache experiments show similarities among the Scan Eager, Indexed Lookup Eager and Stack algorithms. However, the differences between the performance of the algorithms for cold cache are not as significant as those in the hot cache experiments. The reason is that most keyword lists do not take many pages.
• The size of the keyword lists and the time to construct them are proportional to the size of the input XML documents.
• XKSearchB stores Dewey numbers without using a level table.
• On average, the size of the indexes constructed by XKSearch is 65% of that of XKSearchB.
• The construction time of XKSearch is 55% of that of XKSearchB.
• The query response time of XKSearch for hot cache is 70% of that of XKSearchB.

Conclusions
• The XKSearch system takes a list of keywords as input and returns the set of Smallest Lowest Common Ancestor nodes.
• The complexity of the Indexed Lookup Eager algorithm: O(kd|S₁| log |S|), where k is the number of keywords, d is the depth of the tree, and |S₁| and |S| are the sizes of the smallest and largest keyword lists.
• The Indexed Lookup Eager algorithm outperforms other algorithms, often by orders of magnitude, when the keywords have different frequencies.
• The Scan Eager algorithm is the best variant for the case where the keywords have similar frequencies.
