Relational Keyword Search Efficiency Techniques for Structured Data Retrieval

Keyword Search over Relational Tables and Streams ALEXANDER MARKOWETZ University of Bonn YIN YANG and DIMITRIS PAPADIAS Hong Kong University of Science and Technology Doklea Meci (A.M 2152) May 2012 University Of Crete Department Of Computer Science

The challenges of accessing structured data • Query languages: • Numerous complex SQL statements • Schemas: • Complex or nontrivial schemas • R-KWS queries: • replace numerous complex SQL statements • liberate users from studying a database schema • allow querying for terms in unknown locations (tables/attributes)

Introduction • Keyword Search (KWS) • each document/Web page constitutes one unit of information • a document is a result if it contains a subset of the query’s keywords • has been applied to relational DBMS • allows data retrieval without SQL • Relational Keyword Search (R-KWS) • the basic unit of information is a record/tuple • queries cannot be answered by inspecting records individually • results have to be constructed by joining tuples

Outline • Introduction • Relational Keyword Search On Tables • Graph-Based Processing • Operator-Based Processing • Optimizations For Continuous GB • Predecessor-KL • Time-KL • Optimizations For Continuous OB • Operator Mesh • Demand-Driven Operator Execution • Partial-Mesh • Experimental Evaluation • Snapshot R-KWS Queries over Tables • Continuous R-KWS Queries over Streams • Summary of Experimental Evaluation • Conclusion

Relational Keyword Search On Tables • Goal: methods for GB and OB processing that • avoid the shortcomings of prior systems • improve performance of R-KWS in conventional databases

Graph-Based Processing • Basic Idea: • given an inverted index I (on disk), the method traverses an undirected data graph G (in memory), searching for MTJNT (Minimal Total Join Networks of Tuples) results • Join Networks of Tuples (JNT) are connected acyclic components of G • A JNT is total if it contains all query keywords • A JNT is called a Minimal Total JNT (MTJNT) iff it is impossible to remove any node and find the remainder to be total

GSearch Algorithm • Basic Idea: the algorithm enumerates all possible trees in G rooted at sn • Result: a tree that corresponds to an MTJNT

GSearch Algorithm • GSearch maintains a queue Q of trees, each constituting a fraction of a potential MTJNT • Every tree is de-queued and expanded by adding one new node, resulting in a new tree • The new tree falls into one of three categories: • It forms an MTJNT and is included in the result set • It has the potential to become an MTJNT and is inserted in Q to be expanded later • Neither of the above holds, and the tree can be safely discarded • The algorithm terminates when Q becomes empty
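The queue-driven enumeration above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the adjacency-dict graph, the `keywords_of` map, the `max_nodes` size bound and the `seen` set used to suppress duplicate trees are all assumptions made for this example.

```python
from collections import deque

def gsearch(graph, keywords_of, query, root, max_nodes):
    """Enumerate minimal total trees that contain `root` (illustrative sketch)."""
    query = frozenset(query)

    def covered(nodes):
        found = set()
        for n in nodes:
            found |= keywords_of.get(n, set()) & query
        return found

    def is_minimal(nodes, edges):
        # A total tree is minimal iff dropping any leaf loses a keyword.
        if len(nodes) == 1:
            return True
        degree = {n: 0 for n in nodes}
        for edge in edges:
            for n in edge:
                degree[n] += 1
        return all(covered(nodes - {n}) != query
                   for n, d in degree.items() if d == 1)

    queue, seen, results = deque(), set(), []

    def classify(tree):
        if tree in seen:                      # assumed duplicate suppression
            return
        seen.add(tree)
        nodes, edges = tree
        if covered(nodes) == query:
            if is_minimal(nodes, edges):
                results.append(tree)          # category 1: an MTJNT
        elif len(nodes) < max_nodes:
            queue.append(tree)                # category 2: may still grow into one
        # category 3: everything else is discarded

    classify((frozenset([root]), frozenset()))
    while queue:
        nodes, edges = queue.popleft()
        for u in nodes:
            for v in graph.get(u, ()):
                if v not in nodes:
                    classify((nodes | {v}, edges | {frozenset((u, v))}))
    return results

# Hypothetical data graph: a-b-c and a-d; "x" occurs in a, "y" in c and d.
graph = {"a": ["b", "d"], "b": ["a", "c"], "c": ["b"], "d": ["a"]}
keywords_of = {"a": {"x"}, "c": {"y"}, "d": {"y"}}
found = gsearch(graph, keywords_of, {"x", "y"}, "a", max_nodes=3)
```

A total tree is never re-queued in this sketch, because any tree grown from it would contain a removable leaf and therefore could not be minimal.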

GSearch Algorithm • GSearch computes the set of MTJNT containing node sn, and so GB answers an R-KWS query q • correctly, • completely, • without duplicates.

Operator-Based Processing • Basic Idea: • Query processing relies on Candidate Networks (CN) • Candidate Networks (CN) are projections of MTJNT onto the expanded schema • a tuple s of relation S maps to node S{K} EG(q), iff s contains all keywords in K, but does not contain any other term in q\K • An MTJNT projects to a unique CN
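The mapping of a tuple to its expanded-schema node S{K} can be illustrated with a short sketch. The function name and the representation of a tuple as a set of terms are assumptions made for this example only.

```python
def expanded_node(relation, tuple_terms, query):
    """Return the expanded-schema node (relation, K) a tuple maps to.

    K is exactly the set of query keywords the tuple contains, so the
    tuple contains every keyword in K and no keyword of q\\K.
    """
    contained = frozenset(query) & frozenset(tuple_terms)
    return (relation, contained)

# Hypothetical tuples and query, for illustration only.
query = {"ivory", "red"}
node_a = expanded_node("Part", {"ivory", "blue"}, query)   # Part{ivory}
node_b = expanded_node("Part", {"green"}, query)           # Part{}
```

Because K is determined by the tuple and the query alone, every tuple maps to exactly one node, which is consistent with an MTJNT projecting to a unique CN.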

Example

Optimizations For Continuous GB • Basic Idea: Keyword labeling • a simple and effective method to summarize the keywords reachable from a given node • improves performance by avoiding unnecessary calls to GSearch and constraining graph traversals • A keyword label (KL) of format [k, h], stored at node n, indicates a path of h edges in the data graph, connecting n to an occurrence of keyword k.
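As a rough illustration of keyword labeling, the sketch below computes, for a static graph, the shortest hop count from every node to each keyword with a bounded breadth-first search, and then uses the labels to decide whether a search from a node is worthwhile. The graph encoding, the `max_hops` bound, the h = 0 label for a node that itself contains the keyword, and both function names are assumptions for this example; the incremental maintenance over streams is not shown.

```python
from collections import deque

def keyword_labels(graph, keywords_of, max_hops):
    """labels[n][k] = fewest edges from n to an occurrence of k (if <= max_hops)."""
    labels = {n: {} for n in graph}
    all_keywords = set().union(*keywords_of.values())
    for k in all_keywords:
        # Multi-source BFS starting at every node that contains k.
        frontier = deque()
        for n, kws in keywords_of.items():
            if k in kws:
                labels[n][k] = 0
                frontier.append((n, 0))
        while frontier:
            n, h = frontier.popleft()
            if h == max_hops:
                continue
            for m in graph[n]:
                if k not in labels[m]:
                    labels[m][k] = h + 1
                    frontier.append((m, h + 1))
    return labels

def worth_searching(labels, node, query):
    """A node lacking a label for some query keyword cannot join a result."""
    return all(k in labels[node] for k in query)

# Hypothetical path graph a-b-c with "x" in a and "y" in c.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
keywords_of = {"a": {"x"}, "c": {"y"}}
labels = keyword_labels(graph, keywords_of, max_hops=1)
```

With `max_hops=1`, node a has no label for "y", so a search starting at a could be skipped, while node b reaches both keywords within one hop.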

Example • s:[,2] corresponds to the path connecting s to an occurrence of , via 2 edges

Benefits of a min-complete labeling • GSearch(G, q, s) is called only if node s can reach all query terms, that is, only if the node stores a KL for every k ∈ q. • In any other case, s is guaranteed not to participate in an MTJNT. • KL-aware GSearch algorithm: • Inserts into Q iff there exists a set NL of labels that meets the criteria below: • The KL in NL can reach all missing keywords; that is, NL

Example - Intermediate trees abandoned by KL-aware GSearch. (=9) • lacking keyword • new nodes can only be added to • node can reach in four hops, the shortest path to • 2nd criterion not satisfied! while = 6; + 4 FAIL! 6+4

Predecessor-KL implementation • Basic Idea: • A predecessor-KL is a triplet of the form [k, h, p] • it indicates a path of length h, connecting n to an occurrence of keyword k • p is n’s predecessor on that path • Every node n must contain a predecessor-KL [k, h, p] for the shortest path leading from n through p to the occurrence of k • An arriving tuple s can itself contain a keyword, or create new paths between keywords and nodes • both cases require KL insertions and updates • each path contains at most edges

Predecessor-KL Example • must keep both KL [] , KL[,1, ] • represent the shortest path via predecessors and • both paths (to and ) share the same predecessor • suffices to keep KL [] through node

Time-KL • Basic Idea: • More efficient labeling that does not require explicit removal • A time-KL is a triplet [k, h, t] indicating a path of length h to an occurrence of keyword k, which exists until time t • KL [k, h1, t1] dominates another KL [k, h2, t2] iff (h1 ≤ h2 and t1 ≥ t2) • Result: • the graph stores all KL that are not dominated by others
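The dominance rule can be sketched as a small label-maintenance routine. The comparison operators were lost in the transcript, so this sketch assumes that a label dominates another one for the same keyword when its path is no longer and it expires no earlier. The tuple encoding `(k, h, t)`, the function names and the expiry times 40 and 25 in the example are made up for illustration.

```python
def dominates(a, b):
    """True if time-KL `a` makes time-KL `b` redundant (assumed rule)."""
    (ka, ha, ta), (kb, hb, tb) = a, b
    return ka == kb and ha <= hb and ta >= tb and (ha, ta) != (hb, tb)

def insert_label(labels, new):
    """Return the node's labels after offering `new`, keeping only non-dominated ones."""
    if any(old == new or dominates(old, new) for old in labels):
        return labels
    return [old for old in labels if not dominates(new, old)] + [new]

# Three labels for the same keyword at one node (expiry times are hypothetical).
labels = []
labels = insert_label(labels, ("k", 2, 40))   # 2 hops, valid until 40
labels = insert_label(labels, ("k", 1, 25))   # shorter, but expires sooner: kept
labels = insert_label(labels, ("k", 3, 21))   # longer and expires first: dropped
```

No explicit removal is needed when a path expires: a label whose time has passed can simply be ignored, and a longer-lived label for the same keyword is already stored if one exists.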

Time-KL example • is connected to in via 2 hops • is connected to in via 1 hop • is connected to in via 3 hops and node expires at 21 • Result: • (1) and (2) must be stored as each indicates the shortest path for some period of time. • (3) is not recorded as it expires sooner than the other two

Optimizations For Continuous OB • Basic Idea: • On tables, if a selection (e.g., T{}) returns no tuples, all operator trees using this input can be discarded immediately • For data streams, this is not permissible: even though the selection T{} does not currently produce tuples, it may do so in the future, and all operator trees must thus be maintained • Solution: • optimizations that enable efficient OB R-KWS over data streams

Operator Mesh (1/3) • Basic Idea: • sharing common subexpressions • all operator trees are integrated into an operator mesh, reducing CPU cost (for evaluating joins) as well as memory overhead (for intermediate results) • The mesh has |SR|*clusters • |SR| is the number of streaming relations • |K| is the number of query keywords • Each cluster contains the operator trees for all CN (Candidate Networks) discovered from a certain • The entire operator mesh has |SR|*leaves/sources, one for each node of the extended schema • Maximum depth of the mesh is +1 • The number of edges depends on the schema complexity • Different clusters are interconnected only through their source operators • Joins from different clusters do not connect directly
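The effect of sharing common subexpressions can be illustrated with a small sketch. It is a simplification: each operator tree is reduced to a left-deep join order (a list of source names), and a join is identified by the pair of its left child and its right source, so two trees with the same prefix reuse the same join. The source names and the `build_mesh` helper are hypothetical.

```python
def build_mesh(join_orders):
    """Merge left-deep join orders into one set of shared operators."""
    sources, joins = set(), set()
    for order in join_orders:
        sources.update(order)
        left = order[0]
        for right in order[1:]:
            # A join is keyed by (left child, right source); an identical
            # key means the operator already exists and is shared.
            left = (left, right)
            joins.add(left)
    return sources, joins

# Four hypothetical operator trees that share the prefix S{k1} join T{}.
join_orders = [
    ["S{k1}", "T{}", "U{k2}"],
    ["S{k1}", "T{}", "V{k2}"],
    ["S{k1}", "T{k2}"],
    ["S{k1}", "T{}", "U{k2}", "V{}"],
]
sources, joins = build_mesh(join_orders)
unshared = sum(len(order) - 1 for order in join_orders)
```

Evaluated separately, these four trees would need eight joins; in the merged structure five joins remain, since the shared prefixes are computed once.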

Operator Mesh example • shows the shared execution of four operator trees

Operator Mesh example • Algorithm: • The first node in a cluster corresponds to the root node , from which CNGen starts • Whenever the algorithm generates a new tree from (by adding a new child to a parent ), a join .op is added to the mesh • The left child of .op is .op (the operator that was inserted when was created) • The right child is the source of • For each tree t in CNGen, a pointer is maintained to the corresponding operator t.op, to decide where to place subsequent joins when t is expanded • The algorithm is initialized with t first .op pointing to the source of

Problems with Operator Mesh approach • Example: • Assume tuples from S{} and T{} and V{},U{, },V {, } are empty • none of the joins , , or requires the output of because they do not receive right input Worst case: • ’s results expire before the arrival of any tuples from V{},U{, } or V {, } • The join has wasted CPU and memory, without any contribution to the query

Demand-Driven Operator Execution (2/3) • This mesh is maintained in main memory throughout the lifespan of the query. • A join is considered to be either • running - operators process input • sleeping - operators ignore input • A join operator is sent to sleep if: • it has no input from the right child (a source), or • all its parents are sleeping • Sending operators to sleep does not affect the result’s correctness or completeness because either: • the operator cannot produce output, or • its output would not be consumed

Demand-Driven Operator Execution - example • Shows the state diagram for a join operator

Demand-Driven Operator Execution - example • States are characterized by two binary flags: • d, indicating that at least one parent operator is running, • and r, specifying that the operator’s right input is not empty. • An operator only runs in the topmost state (d/r) • Operators exchange messages regarding their state, in order to ensure that all d and r flags are up-to-date. • When an operator leaves this state (transition 2 or 3) it goes to sleep (or halts), to wake up (or restart) later (transitions 9 and 10) • a join operator communicates changes (running/sleeping) to its left child, which adjusts its d flag
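The interplay of the d and r flags can be sketched with a small class. This is an illustrative model rather than the authors' implementation: the class and method names are hypothetical, the d flag is derived from a counter of running parents, and joins that emit final results are assumed to be always in demand (the `emits_results` parameter).

```python
class Join:
    """Join operator that runs only while it is demanded (d) and fed (r)."""

    def __init__(self, name, left=None, emits_results=False):
        self.name = name
        self.left = left                  # left child join, or None for a source
        self.emits_results = emits_results
        self.right_nonempty = False       # r flag
        self.running_parents = 0          # d flag is running_parents > 0
        self.running = False

    def demanded(self):
        return self.emits_results or self.running_parents > 0

    def set_right_input(self, nonempty):
        """The right source reports whether it currently holds tuples."""
        self.right_nonempty = nonempty
        self._refresh()

    def _refresh(self):
        should_run = self.demanded() and self.right_nonempty
        if should_run == self.running:
            return
        self.running = should_run
        if self.left is not None:
            # Tell the left child that one more (or one fewer) parent runs.
            self.left.running_parents += 1 if should_run else -1
            self.left._refresh()

# Hypothetical mesh fragment: j2 and j3 both consume the output of j1.
j1 = Join("j1")
j2 = Join("j2", left=j1, emits_results=True)
j3 = Join("j3", left=j1, emits_results=True)
j1.set_right_input(True)    # r is set, but no parent runs yet: j1 sleeps
j2.set_right_input(True)    # j2 starts running and wakes j1
j3.set_right_input(True)
j2.set_right_input(False)   # j2 sleeps; j1 keeps running because of j3
```

If the right input of j3 also dries up, the counter of j1 drops to zero and j1 goes to sleep; it wakes up again as soon as one of its parents restarts.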

Demand-Driven Operator Execution - example • Assume U{, } stops producing output • Result: • turns off its r flag, goes to sleep (transition 2) • calls its left child, which decreases its counter of running parents • no further actions for as there are other running parents ,

Demand-Driven Operator Execution - example • If T{}, V{, } dry up too, then goes to sleep • Operator decreases its counter (rParents=0) • Transition 3

Example - Considering that the only running join operators are and • Join does not generate results, due to lack of left input • When T{} begins producing output, it causes to adjust its r flag, wake up (transition 9), and call .Pstart • operator restarts and informs

Example - All joins run again except and • Note: • this method is not restricted to keyword search; it can equally benefit other data stream applications.

Partial-Mesh (3/3) • Basic Idea: • A Partial-Mesh (PM) is built at runtime and breaks the distinction between • operator initialization • tuple processing • The method maintains relatively few active operators in memory • It is each operator’s responsibility to create its parents before it can produce output • An operator destroys its parents (and other operators up the tree) if it cannot supply them with input • In large meshes, many operators are idle • Their absence does not affect the result’s completeness, but dramatically reduces memory consumption
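The create-on-output and destroy-on-starvation behavior can be sketched as follows. This is a loose model under stated assumptions: the parents of an operator are looked up in a static, hypothetical `parents_of` table (the slides derive them at runtime with TreeGen instead), operators are plain names, and only the set of instantiated operators is tracked.

```python
class PartialMesh:
    """Tracks which operators are currently instantiated (illustrative only)."""

    def __init__(self, sources, parents_of):
        self.parents_of = parents_of      # left child name -> names of its parents
        self.active = set(sources)        # sources always exist

    def start_output(self, op):
        """`op` produced its first output: it creates its parents."""
        for parent in self.parents_of.get(op, ()):
            self.active.add(parent)

    def stop_output(self, op):
        """`op` can no longer supply input: destroy parents up the tree."""
        for parent in self.parents_of.get(op, ()):
            if parent in self.active:
                self.stop_output(parent)
                self.active.discard(parent)

# Hypothetical fragment: S feeds joins S*T and S*U; S*T feeds S*T*V.
parents_of = {"S": ["S*T", "S*U"], "S*T": ["S*T*V"]}
mesh = PartialMesh(sources={"S"}, parents_of=parents_of)
mesh.start_output("S")        # instantiates S*T and S*U
mesh.start_output("S*T")      # instantiates S*T*V
mesh.stop_output("S")         # removes every operator above S again
```

Only operators that can currently receive left input exist at any time, which is where the memory savings over a fully materialized mesh would come from.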

Partial-Mesh Example • When the leftmost source S{} first produces output • It creates its direct parents and • when generates results, it creates its own parents

Partial-Mesh Example • when outputs a first tuple t and instantiates , this operator immediately probes t against T {}

Partial-Mesh Algorithm • Basic Idea: • TreeGen is an algorithm for reconstructing a tree I • it decides which parents to create • The algorithm checks the join condition of .op • If is the source joined with then is generated by adding as the rightmost child of in

Partial-Mesh Examples of TreeGen • TreeGen(S{}) returns a tree that contains a single node S{} • parent is inserted in the mesh and connected to its left and right inputs • The call TreeGen() returns the tree • The expansion of reveals the parents of (e.g., , ,)

Outline • Introduction • Relational Keyword Search On Tables • Graph-Based Processing • Operator-Based Processing • Optimizations For Continuous GB • Predecessor-KL • Time-KL • Optimizations For Continuous OB • Operator Mesh • Demand-Driven Operator Execution • Partial-Mesh • Experimental Evaluation • Snapshot R-KWS Queries over Tables • Continuous R-KWS Queries over Streams • Conclusion

Snapshot R-KWS Queries over Tables (1/3) • Comparing GB and OB implementations: • Experiments are focused on the TPC-H tables • Part (0.2M entries), Supplier (10K), PartSupp (0.8M), Customer (150K), Orders (1.5M), and LineItem (6M) • Two tables can join if and only if there is a foreign-key to primary-key relationship between them • The length of join sequences is restricted to Tmax, which ranges between 4 and 6.

Example

Example - seven sets of R-KWS queries QS 1 -QS 7 QS 1, QS 2 : people’s or companies’ names (denoted as PeopleName), which appear in the columns Customer.Name, Supplier.Name, and Orders.Clerk (retrieve connections between multiple people); QS 3, QS 4 : terms from the name of a part, for example, “ivory”, from the Part.Name attribute;

Example - seven sets of R-KWS queries QS 1 -QS 7 QS 5, QS 6 : years, which are present in LineItem.ShipDate, LineItem.CommitDate, LineItem.ReceiptDate, Orders.OrderDate; QS 7 : terms from Part.Brand, Part.Mfgr, Part.Size, and Part.Container

Example - processing time for queries QS 1 -QS 7 • The figure depicts the total runtime (y-axis) of GB and OB • and the result set cardinality |R| (below the x-axis) for the seven query sets • Reported values are the medians after setting Tmax to 4, 5, and 6.

Snapshot R-KWS Queries over Tables – Conclusion • GB • (+) • For conventional tables, GB is more efficient than OB • GSearch avoids duplicate results, which reduces the total cost • GB is preferable for datasets with frequent updates • (-) • Not efficient for queries involving numerous keywords and/or a large value of Tmax • consumes a large amount of main memory to store the data graph • Conclusion: On servers dedicated to R-KWS queries, GB is the best choice due to its high performance • OB • (+) • OB utilizes the functionality provided by a DBMS and can thus answer R-KWS queries using much less memory than GB • Conclusion: On servers running multiple applications and only answering R-KWS queries infrequently, OB might be preferable due to its low memory footprint

Continuous R-KWS Queries over Streams (2/2)

Continuous R-KWS Queries over Streams
