
A Graph Method for Keyword-based Selection of the top-K Databases

Quang Hieu Vu (1), Beng Chin Ooi (1), Dimitris Papadias (2), Anthony K. H. Tung (1). (1) National University of Singapore; (2) Hong Kong University of Science and Technology.


Presentation Transcript


  1. A Graph Method for Keyword-based Selection of the top-K Databases • Quang Hieu Vu (1), Beng Chin Ooi (1), Dimitris Papadias (2), Anthony K. H. Tung (1) • (1) National University of Singapore • (2) Hong Kong University of Science and Technology

  2. Outline • Motivation • Problem definition • An existing approach • System architecture • Query processing • Experimental study • Conclusion

  3. Motivation • Challenge: to issue a query in a DBMS, users need to know • The database schema • A data manipulation language (e.g. SQL) • In distributed systems: the heterogeneity of the different database schemas • Solution: Keyword Search (KS) • The basic unit of information is a tuple • Each result of a query is a set of tuples that • Contain all or most of the query keywords • Can be joined together in a meaningful way (via Primary Key – Foreign Key relationships)

  4. Problem definition • Given a set of relational databases stored at different nodes in a distributed system and a keyword query • Select the top-K databases most likely to contribute results (K is an input parameter) • Purpose: to minimize the total cost of processing the query without sacrificing precision and recall

  5. An existing approach: M-KS [1] • Each DBMS builds a keyword relationship matrix (KRM) acting as its summary • For each pair of terms (ti, tj), there is an entry in the KRM that records the frequencies of occurrences of the two terms having a relationship at different distances • Two terms have a relationship if • They are in the same tuple: relationship distance = 0 • They are in different tuples, but these tuples can be joined together via d join operations: relationship distance = d • [1] Bei Yu et al. Effective keyword-based selection of relational databases. SIGMOD'07
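
As a rough illustration of the KRM described on this slide, the following sketch counts, for every pair of terms, how often they are related at each distance. The input layout (`tuples` as a map from tuple id to a set of terms, `joins` as a list of joinable tuple-id pairs) and the function names are assumptions made for this example, and "frequency" is interpreted here as the number of tuple pairs at that shortest join distance, which the slide does not spell out.

```python
from collections import defaultdict, deque
from itertools import combinations


def tuple_distances(tuple_ids, joins, max_d):
    """Shortest join distance between tuples, via BFS, capped at max_d.

    Returns {(u, v): d} with u <= v, including (u, u): 0.
    """
    adj = defaultdict(set)
    for u, v in joins:
        adj[u].add(v)
        adj[v].add(u)
    dist = {}
    for src in tuple_ids:
        seen = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            if seen[u] == max_d:
                continue  # do not expand beyond the maximum distance
            for v in adj[u]:
                if v not in seen:
                    seen[v] = seen[u] + 1
                    queue.append(v)
        for v, d in seen.items():
            if src <= v:  # keep each unordered tuple pair once
                dist[(src, v)] = d
    return dist


def build_krm(tuples, joins, max_d):
    """Returns {(ti, tj): {d: frequency}} with ti < tj (hypothetical layout)."""
    krm = defaultdict(lambda: defaultdict(int))
    for (u, v), d in tuple_distances(sorted(tuples), joins, max_d).items():
        if u == v:
            # Terms in the same tuple: distance 0.
            pairs = combinations(sorted(tuples[u]), 2)
        else:
            # Terms in two tuples that are d join operations apart.
            pairs = ((a, b) for a in tuples[u] for b in tuples[v] if a != b)
        for a, b in pairs:
            krm[(min(a, b), max(a, b))][d] += 1
    return {pair: dict(freq) for pair, freq in krm.items()}


# Toy database: tuple 1 joins tuple 2, and tuple 2 joins tuple 3.
tuples = {1: {"graph", "keyword"}, 2: {"database"}, 3: {"keyword", "search"}}
joins = [(1, 2), (2, 3)]
krm = build_krm(tuples, joins, max_d=2)
```

In this toy input, "graph" and "keyword" share tuple 1 (distance 0) and are also related at distance 2 through tuples 1 and 3, so the entry for that pair holds two distances.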

  6-8. An example of KRM (figure: a database and the KRM of the database)

  9. Disadvantages of M-KS • Uses only binary relationships between terms to eliminate non-promising databases • This yields numerous false positives • Records only the frequency of term co-occurrences • This is unsuitable for ranking based on IR measures • Is designed to support only AND semantics • Real applications usually support queries under OR semantics

  10. G-KS • G-KS summarizes the terms and their relationships in each DBMS using a keyword relationship graph (KRG) • A node corresponds to a term and has a weight • If two terms have a relationship at distance d, there is an edge between their corresponding nodes in the graph, and the distance d between them is marked on the edge • When two terms can be connected through multiple paths of different distances, each distinct value of d is recorded • Every distance value in the graph is associated with a weight
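
One possible in-memory layout for a KRG as described on this slide is sketched below. The class and method names are hypothetical, and the weights are passed in by the caller because the weighting formulas are not part of this transcript.

```python
from dataclasses import dataclass, field


@dataclass
class KeywordRelationshipGraph:
    # term -> node weight (supplied by the caller; formula not shown here)
    node_weight: dict = field(default_factory=dict)
    # frozenset({t1, t2}) -> {distance d: weight attached to that distance}
    edges: dict = field(default_factory=dict)

    def add_term(self, term, weight):
        """Add a node for a term, or update its weight."""
        self.node_weight[term] = weight

    def add_relationship(self, t1, t2, d, weight):
        """Record that t1 and t2 are related at distance d."""
        if t1 == t2:
            raise ValueError("an edge needs two distinct terms")
        for t in (t1, t2):
            if t not in self.node_weight:
                raise KeyError(f"unknown term: {t}")
        # One edge per pair of terms; each distinct d keeps its own weight.
        self.edges.setdefault(frozenset((t1, t2)), {})[d] = weight

    def distances(self, t1, t2):
        """All distinct distances recorded between two terms."""
        return set(self.edges.get(frozenset((t1, t2)), {}))


# Placeholder weights, chosen only to make the example concrete.
krg = KeywordRelationshipGraph()
for term in ("graph", "keyword", "database"):
    krg.add_term(term, 1.0)
krg.add_relationship("graph", "keyword", d=0, weight=0.9)
krg.add_relationship("graph", "keyword", d=2, weight=0.3)
krg.add_relationship("keyword", "database", d=1, weight=0.5)
```

Keying edges by an unordered pair keeps the graph undirected, and the inner dictionary holds one weight per distinct distance.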

  11-12. An example of KRG (figure: a database and the KRG of the database)

  13. Weight of a node

  14. Weight of an edge

  15. Graph Compression • Observation: a large percentage of terms in a DBMS appear only once • If such terms occur in the same tuple • They have the same weight • They have the same set of connections to other nodes, and these connections are of equal weight • Graph compression: 2 types of nodes • Single nodes: contain one term • Compound nodes: consist of multiple terms • The weight of a compound node, as well as the weights of its edges, is computed using any of the included terms
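
The grouping step behind graph compression can be sketched as follows: terms that occur only once in the database and share a tuple are merged into one node, and every other term gets a node of its own. The function name and input layout are assumptions for this example, and a term's occurrence count is taken to be the number of tuples that contain it.

```python
from collections import Counter


def build_nodes(tuples):
    """tuples: {tuple_id: set of terms}. Returns a list of frozensets of terms.

    A frozenset with several terms plays the role of a compound node;
    a frozenset with one term is a single node.
    """
    occurrences = Counter(t for terms in tuples.values() for t in terms)
    nodes = []
    for tid in sorted(tuples):
        # Terms that appear only in this tuple collapse into one node.
        once = frozenset(t for t in tuples[tid] if occurrences[t] == 1)
        if once:
            nodes.append(once)
    # Terms that appear in several tuples stay as single nodes.
    repeated = sorted(t for t, n in occurrences.items() if n > 1)
    nodes.extend(frozenset([t]) for t in repeated)
    return nodes


tuples = {
    1: {"graph", "keyword", "vu"},
    2: {"keyword", "database"},
    3: {"ooi", "tung", "database"},
}
nodes = build_nodes(tuples)
```

Here "graph" and "vu" end up in one compound node, "ooi" and "tung" in another, while "keyword" and "database" remain single nodes because each appears in two tuples.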

  16-18. An example of a compressed KRG (figure: a database and the compressed KRG of the database)

  19. Graph construction • Create nodes • Compound nodes for terms that occur only once in the database and are in the same tuple • Single nodes for other terms • Create edges • Nodes representing terms in the same tuple: an edge at distance 0 • Nodes representing terms in two tuples that can be connected by d join operations: an edge at distance d

  20. Join keyword tree (JKT) • Given a subgraph SG of a KRG, JKT(SG) is a tree satisfying • Each tree vertex maps to a non-empty set of nodes of SG, and the tree vertices collectively contain all nodes in SG • Each edge connecting two vertices is associated with a single distance d • Mapping rules • If two SG nodes map to the same tree vertex, there must exist a relationship at distance 0 between them in SG • If two SG nodes map to different tree vertices, there must exist a relationship at distance d' between them in SG, where d' is the sum of the distances on the path connecting the two tree vertices
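
The JKT conditions on this slide can be turned into a small validity check. The sketch below only verifies a given tree against a given subgraph; it does not search for a JKT. All names and the input layout (`sg_edges` mapping an unordered node pair to its set of recorded distances, `vertex_of` mapping each SG node to a tree vertex, `tree_edges` mapping a vertex pair to a distance) are assumptions for this example.

```python
from itertools import combinations


def tree_path_lengths(vertices, tree_edges):
    """Sum of edge distances between all vertex pairs, or None if not a tree."""
    if len(tree_edges) != len(vertices) - 1:
        return None
    adj = {v: [] for v in vertices}
    for (u, v), d in tree_edges.items():
        if u not in adj or v not in adj:
            return None  # edge touches a vertex that holds no SG node
        adj[u].append((v, d))
        adj[v].append((u, d))
    lengths = {}
    for src in vertices:
        seen = {src: 0}
        stack = [src]
        while stack:
            u = stack.pop()
            for v, d in adj[u]:
                if v not in seen:
                    seen[v] = seen[u] + d
                    stack.append(v)
        if len(seen) != len(vertices):
            return None  # disconnected, so not a tree
        for v, total in seen.items():
            lengths[(src, v)] = total
    return lengths


def is_valid_jkt(sg_nodes, sg_edges, vertex_of, tree_edges):
    """Check the two mapping rules for every pair of SG nodes."""
    if set(vertex_of) != set(sg_nodes):
        return False  # the vertices must collectively contain all SG nodes
    lengths = tree_path_lengths(set(vertex_of.values()), tree_edges)
    if lengths is None:
        return False
    for a, b in combinations(sg_nodes, 2):
        recorded = sg_edges.get(frozenset((a, b)), set())
        # Same vertex gives a path length of 0; otherwise the summed distances.
        if lengths[(vertex_of[a], vertex_of[b])] not in recorded:
            return False
    return True


sg_nodes = ["a", "b", "c"]
sg_edges = {
    frozenset(("a", "b")): {0},
    frozenset(("a", "c")): {1, 2},
    frozenset(("b", "c")): {1},
}
vertex_of = {"a": "v1", "b": "v1", "c": "v2"}
ok = is_valid_jkt(sg_nodes, sg_edges, vertex_of, {("v1", "v2"): 1})
bad = is_valid_jkt(sg_nodes, sg_edges, vertex_of, {("v1", "v2"): 2})
```

With a tree edge of distance 1 both rules hold; with distance 2 the pair (b, c) fails, because SG records only distance 1 between them.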

  21-22. Example of a JKT (figure: mapping from SG to JKT(SG))

  23. Example of a JKT (figure: JKT(SG) and the database)

  24. Candidate graph (CG) • Given a query q and a KRG, CG(KRG, q) is a subgraph SG of the KRG satisfying • SG includes all nodes of the KRG containing the query keywords, and only these nodes • SG is complete • There exists at least one JKT(SG)
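
The first two candidate-graph conditions are easy to test directly, as in the sketch below. It deliberately leaves out the third condition (existence of a JKT), and it additionally requires every query keyword to appear in some node, which is an assumption that matches AND semantics. Function and parameter names are hypothetical.

```python
from itertools import combinations


def candidate_subgraph(node_terms, edges, query):
    """Select the nodes holding query keywords and check completeness.

    node_terms: {node_id: set of terms stored in that node}
    edges: {frozenset({n1, n2}): set of recorded distances}
    Returns the selected node ids, or None if no candidate graph can exist.
    The existence of a JKT over the selected nodes is NOT checked here.
    """
    keywords = set(query)
    selected = [n for n, terms in node_terms.items() if terms & keywords]
    covered = set().union(*(node_terms[n] for n in selected)) & keywords
    if covered != keywords:
        return None  # some keyword does not occur in this database at all
    for a, b in combinations(selected, 2):
        if not edges.get(frozenset((a, b))):
            return None  # SG must be complete: every pair needs an edge
    return selected


node_terms = {"n1": {"graph"}, "n2": {"keyword"}, "n3": {"database"}}
edges = {
    frozenset(("n1", "n2")): {0},
    frozenset(("n2", "n3")): {1},
}
hit = candidate_subgraph(node_terms, edges, ["graph", "keyword"])
miss = candidate_subgraph(node_terms, edges, ["graph", "database"])
```

A database whose KRG fails this check can be discarded early; one that passes still needs the JKT test before it counts as having a candidate graph.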

  25. Important theorems • Theorem 1: if a database contains a result with all keywords of a query q, then the corresponding KRG must have a candidate graph CG(KRG, q) • Theorem 2: the existence of a candidate graph CG(KRG, q) in the KRG does not guarantee that the corresponding database has results for q • Note: CG(KRG, q) indicates a high probability of the database having results

  26. Query processing

  27. Experimental study • Use the DBLP dataset to generate 81 databases according to bibliography types • Compare G-KS vs M-KS • Effectiveness measurement • Use a brute-force (BF) method that sends the query to all databases and selects the top-K databases from the returned results • Let K_BF(M) and K_G-KS(M) be the total number of results with M keywords in BF and G-KS • Recall: K_G-KS(M) / K_BF(M) • Precision: K_G-KS(M) / K
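
The two ratios on this slide translate directly into a small helper. The function and argument names are assumptions for this example, and the counts used below are made up purely to show the arithmetic; they are not results from the study.

```python
def recall_precision(k_gks_m, k_bf_m, k):
    """Recall and precision as defined on the slide.

    k_gks_m: count for G-KS with M keywords
    k_bf_m:  count for the brute-force baseline with M keywords
    k:       the number of databases selected (top-K)
    """
    if k_bf_m <= 0 or k <= 0:
        raise ValueError("k_bf_m and k must be positive")
    recall = k_gks_m / k_bf_m
    precision = k_gks_m / k
    return recall, precision


# Hypothetical counts, for illustration only.
recall, precision = recall_precision(k_gks_m=6, k_bf_m=8, k=10)
```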

  28. Pre-processing cost

  29-30. Effect of varying #keywords in a query (figures)

  31-32. Effect of varying top-K selected DBs (figures)

  33-34. Effect of varying max relationship distance (figures)

  35. Conclusion • G-KS: A method that selects the top-K databases for processing a relational keyword search query • G-KS summarizes each database as a keyword relationship graph where • Nodes correspond to terms • Edges capture distance relationships between terms • IR techniques are applied to weight nodes and edges • An algorithm is designed to consider all keywords as a whole in query processing in order to minimize the number of false positives

  36. Thank you! Questions & Answers
