The Selim and Rachel Benin School of Engineering and Computer Science

Dagstuhl Seminar 08111 on Ranked XML Querying March 2008. Keyword Proximity Search over Data Graphs. Benny Kimelfeld. Joint work with Yehoshua Sagiv. The Selim and Rachel Benin School of Engineering and Computer Science. Extracting Data from Databases. Exposure to many databases

Presentation Transcript


  1. Dagstuhl Seminar 08111 on Ranked XML Querying March 2008 Keyword Proximity Search over Data Graphs Benny Kimelfeld Joint work with Yehoshua Sagiv The Selim and Rachel Benin School of Engineering and Computer Science

  2. Extracting Data from Databases • Nowadays… • Exposure to many databases • Different types (relational, XML, RDF…) • Different schemas • Traditional paradigms of querying (e.g., SQL, XQuery, SPARQL) require a thorough understanding of the schema for extracting data • Goal: Enable users to instantly pose (inaccurate) queries without knowing the schema • The natural (and popular) option: Keyword Search • Problem: Inherently different from standard IR

  3. Example: Search in RDB • Query: Belgium, Brussels • Tables: Cities, Organizations, Countries, Memberships

  4. Brussels is the capital city of Belgium • Query: Belgium, Brussels • Tables: Cities, Organizations, Countries, Memberships

  5. Brussels hosts EU and Belgium is a member • Query: Belgium, Brussels • Tables: Cities, Organizations, Countries, Memberships

  6. Example: Search in XML • Query: Yannakakis, Approximation

  7. Yannakakis wrote a paper about Approximation • Query: Yannakakis, Approximation

  8. Yannakakis is cited by a paper about Approximation • Query: Yannakakis, Approximation

  9. Keyword Proximity Search (KPS) • The Goal: Extract meaningful parts of data w.r.t. the keywords • Data have varying degrees of structure • Relational (w/ foreign keys), XML (w/ id-references) • Natural representation by a graph • Usually, data-centric databases • A query is a set of keywords • No structural constraints • Hristidis et al., VLDB’02,03, ICDE’03 • Bhalotia et al., VLDB’05 • Agrawal et al., ICDE’02 • Ding et al., ICDE’07 • Kacholia et al., VLDB’06 • Liu et al., SIGMOD’06 • Luo et al., SIGMOD’07 • Wang et al., VLDB’06

  10. Enumeration of Answers • Each of the existing systems has an algorithmic component that generates answers • The goal is usually to enumerate the answers by increasing size (more generally, weight) • In some systems, answers are printed immediately after their generation • In others, additional ranking functions are applied to further sort the answers • Original focus on answer generation • What’s done?

  11. Data Graphs • Structural and keyword nodes • Edges may have weights – Weak relationships are penalized by large weights • Each keyword has one occurrence in the data graph (technical)

  12. Queries • Queries are sets of keywords from the data graph • Q = {Summers, Cohen, coffee}

  13. A Query Answer is a Reduced Subtree • An answer is a subtree of the data graph • Contains all keywords of the query • Has no redundant edges (and nodes) • 3 variants: directed, undirected, strong (undirected, kw’s are leaves)
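
A minimal sketch of the reducedness test for the undirected variant, assuming a candidate answer is given as explicit node and edge collections. The function name `is_reduced_answer` and the structural node names (`dept`, `kitchen`, `office`) are hypothetical; only the three keywords come from the query on slide 12. The check relies on the fact that an undirected subtree containing all keywords has no redundant edges exactly when every leaf is a keyword node.

```python
from collections import defaultdict


def is_reduced_answer(nodes, edges, keywords):
    """Return True if (nodes, edges) is an undirected tree that contains
    every keyword and whose leaves are all keyword nodes."""
    nodes, keywords = set(nodes), set(keywords)
    if not keywords <= nodes:
        return False  # some keyword is missing
    if len(edges) != len(nodes) - 1:
        return False  # a tree on n nodes has exactly n - 1 edges
    adjacent = defaultdict(set)
    for u, v in edges:
        adjacent[u].add(v)
        adjacent[v].add(u)
    # Connectivity check by depth-first search from an arbitrary node.
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adjacent[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    if seen != nodes:
        return False
    # Reduced: a non-keyword leaf could be removed, so none may exist.
    return all(n in keywords for n in nodes if len(adjacent[n]) <= 1)


# Hypothetical data: "dept" and "kitchen" are made-up structural nodes.
keywords = {"Summers", "Cohen", "coffee"}
edges = [("dept", "Summers"), ("dept", "Cohen"),
         ("dept", "kitchen"), ("kitchen", "coffee")]
nodes = {n for edge in edges for n in edge}
print(is_reduced_answer(nodes, edges, keywords))
```

Attaching an extra non-keyword leaf (say, a hypothetical `office` node under `dept`) makes the same call return `False`, because that edge is redundant.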

  14. Find the Answers in this Example!

  15. The BANKS Approach [Bhalotia et al., ICDE’02, VLDB’05] • Answers are directed subtrees • ∀ nodes v (in a “good” order) and keyword occurrences: generate the min-height subtree emanating from v

  16. The BANKS Approach [Bhalotia et al., ICDE’02, VLDB’05] • Answers are directed subtrees • ∀ nodes v (in a “good” order) and keyword occurrences: generate the min-height subtree emanating from v • What about this answer? Never generated!
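
The per-root generation step can be illustrated with a brute-force sketch. This is not the BANKS implementation (which the cited papers describe as a search expanding from the keywords); it simply runs a forward breadth-first search from every candidate root, assumes unit-length edges, and keeps the union of shortest paths to the keywords. The graph, its node names, and the function names are hypothetical. Because at most one tree is produced per root, other answers sharing that root are never generated, which is the gap the slide points out.

```python
from collections import deque


def min_height_subtree(graph, root, keywords):
    """Return (height, edges) of the union of shortest directed paths from
    root to every keyword, or None if no reduced answer is rooted here."""
    parent, depth = {root: None}, {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in graph.get(u, ()):
            if v not in parent:
                parent[v], depth[v] = u, depth[u] + 1
                queue.append(v)
    if any(k not in parent for k in keywords):
        return None  # some keyword is unreachable from this root
    edges = set()
    for k in keywords:
        node = k
        while parent[node] is not None:
            edges.add((parent[node], node))
            node = parent[node]
    # A non-keyword root with fewer than two children is redundant.
    children = sum(1 for p, _ in edges if p == root)
    if children < 2 and root not in keywords:
        return None
    return max(depth[k] for k in keywords), sorted(edges)


def answers_by_height(graph, keywords):
    """One candidate answer per root, sorted by increasing height."""
    found = []
    for root in graph:
        result = min_height_subtree(graph, root, keywords)
        if result is not None:
            found.append((result[0], root, result[1]))
    return sorted(found)


# Hypothetical directed data graph loosely modeled on slides 3 to 5.
graph = {
    "city:Brussels": ["Brussels", "country:Belgium"],
    "country:Belgium": ["Belgium"],
    "membership": ["country:Belgium", "org:EU"],
    "org:EU": ["city:Brussels"],
}
for height, root, edges in answers_by_height(graph, ["Belgium", "Brussels"]):
    print(height, root, edges)
```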

  17. The NUITS Approach [Ding et al., ICDE’07] • Answers are undirected subtrees • ∀ nodes v (in a “good” order): generate the min-weight subtree that includes v

  18. The NUITS Approach [Ding et al., ICDE’07] • Answers are undirected subtrees • ∀ nodes v (in a “good” order): generate the min-weight subtree that includes v • This node is redundant • It is actually the previous answer!

  19. The NUITS Approach [Ding et al., ICDE’07] • Answers are undirected subtrees • ∀ nodes v (in a “good” order): generate the min-weight subtree that includes v • This node is redundant • Again, the previous answer!

  20. The NUITS Approach [Ding et al., ICDE’07] • Answers are undirected subtrees • ∀ nodes v (in a “good” order): generate the min-weight subtree that includes v • What about this answer? Never generated!

  21. The DISCOVER / DBXplorer Approach [Hristidis et al., VLDB’02,03, ICDE’03] [Agrawal et al., ICDE’02] • ∀ possible queries Q (from the schema) in inc. size: evaluate Q over the database • Easy to implement! • All answers are generated! • DBMS queries, no in-mem. graph algorithms

  22. The DISCOVER / DBXplorer Approach [Hristidis et al., VLDB’02,03, ICDE’03] [Agrawal et al., ICDE’02] • ∀ possible queries Q (from the schema) in inc. size: evaluate Q over the database • Worst case: exponential in the data • But many queries do not generate any answer at all! • Inefficient!

  23. We Need Generators w/ Guarantees! • All answers are generated • In particular, each of the “relevant” answers is produced at some point (100% recall is achievable) • Controlled order of answers • For instance, increasing weight, increasing height, approximate (what is the ratio?) / heuristic order • Efficiency • The top-k answers should be generated efficiently • Bound on time between successive answers

  25. Not a Trivial Task! Generating all the answers efficiently is not a trivial task, even regardless of order For illustration, the next two slides show why a naïve recursion fails

  26. Query Reduction • Example query: K = {A, B, C} • 1. Remove one keyword from the query • 2. Find all answers for the smaller query • 3. Extend each answer to include the missing keyword, in every possible way

  27. Query Reduction is Inefficient! • Problem: A subset of the query may have many more answers than the query itself • 2^n results for {A,B} • 1 result for {A,B,C} • Exponential total time!

  28. We Need Generators w/ Guarantees! • All answers are generated • In particular, each of the “relevant” answers is produced at some point (100% recall is achievable) • Controlled order of answers • For instance, increasing weight, increasing height, approximate (what is the ratio?) / heuristic order • Efficiency • The top-k answers should be generated efficiently • Bound on time between successive answers

  29. Order by Increasing Weight • If answer a is printed before answer b, then weight(a) ≤ weight(b) • The first k answers printed are the top-k answers

  30. Approximate and Heuristic Orders • Heuristic order: Intuitively, expected to be close to the optimal order, but there is no guarantee • Approximate order: There is a provable bound on the extent to which the order can deviate from the optimal one

  31. C-Approximate Order (inc. Weight) • If answer a is printed before answer b, then weight(a) ≤ C · weight(b) • The first k answers printed are a C-approximation of the top-k answers [Fagin et al., PODS’01]
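
A small checker makes the ordering condition concrete. It assumes the usual definition of a C-approximate order: whenever one answer is printed before another, its weight is at most C times the weight of the later one (C = 1 gives the exact order by increasing weight). The function name and the sample weight sequences are illustrative, not taken from the talk.

```python
def is_c_approximate_order(weights, c=1.0):
    """Check that for every i < j, weights[i] <= c * weights[j].

    weights lists the answer weights in the order they were printed."""
    heaviest_so_far = float("-inf")
    for w in weights:
        # It suffices to compare against the heaviest earlier answer.
        if heaviest_so_far > c * w:
            return False
        heaviest_so_far = max(heaviest_so_far, w)
    return True


exact = [2, 3, 3, 5]      # sorted by weight
shuffled = [4, 3, 5, 6]   # 4 is printed before the lighter 3
print(is_c_approximate_order(exact))            # exact order holds
print(is_c_approximate_order(shuffled))         # exact order is violated
print(is_c_approximate_order(shuffled, c=2.0))  # but 4 <= 2 * 3
```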

  32. We Need Generators w/ Guarantees! • All answers are generated • In particular, each of the “relevant” answers is produced at some point (100% recall is achievable) • Controlled order of answers • For instance, increasing weight, increasing height, approximate (what is the ratio?) / heuristic order • Efficiency • The top-k answers should be generated efficiently • Bound on time between successive answers

  33. Efficiency of Answer Generation • Problem: Exponentially many answers • Even for 2 keywords • Polynomial time in the input: not suitable • Instead (Johnson & Yannakakis, 1988): • Polynomial Total Time: Polynomial running time in the combined size of the input and the output • Polynomial Delay (stronger than poly. total time): The running time between two successive results is polynomial in the size of the input

  34. Efficiency of Enumeration • Polynomial delay: Polynomial time between successive answers • Generate the first few results quickly • Efficiently return results in pages • This is what we usually need!
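
Polynomial delay can be illustrated on the two-keyword undirected case, where the answers are exactly the simple paths between the two keyword nodes. The sketch below is a textbook-style path enumerator, not one of the algorithms from the cited papers: before extending a path it checks that the target is still reachable, so every recursive call leads to at least one output and the work between successive answers stays polynomial. The graph builder `diamond_chain` and all node names are hypothetical.

```python
from itertools import islice


def reachable(graph, source, target, blocked):
    """Depth-first search from source to target that avoids blocked nodes."""
    stack, seen = [source], {source}
    while stack:
        u = stack.pop()
        if u == target:
            return True
        for v in graph[u]:
            if v not in seen and v not in blocked:
                seen.add(v)
                stack.append(v)
    return False


def simple_paths(graph, source, target):
    """Lazily yield every simple path from source to target."""
    path, on_path = [source], {source}

    def extend():
        u = path[-1]
        if u == target:
            yield tuple(path)
            return
        for v in graph[u]:
            # Descend only if this branch is guaranteed to produce a path.
            if v not in on_path and reachable(graph, v, target, on_path):
                path.append(v)
                on_path.add(v)
                yield from extend()
                path.pop()
                on_path.remove(v)

    yield from extend()


def diamond_chain(n):
    """Undirected chain of n diamonds: 2**n simple paths from x0 to xn."""
    graph = {}

    def link(u, v):
        graph.setdefault(u, []).append(v)
        graph.setdefault(v, []).append(u)

    for i in range(n):
        for mid in (f"a{i}", f"b{i}"):
            link(f"x{i}", mid)
            link(mid, f"x{i + 1}")
    return graph


# A first page of 10 answers out of 2**30, without enumerating the rest.
first_page = list(islice(simple_paths(diamond_chain(30), "x0", "x30"), 10))
print(len(first_page))
```

Because the generator is lazy, fetching the next page is just another `islice` call on the same generator object.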

  35. Minimal Answers are Steiner Trees • Finding the minimal answer (a.k.a. the Steiner-tree problem) is intractable • Therefore, one cannot enumerate all answers by increasing weight with polynomial delay (unless P = NP) • The heuristic approaches in the literature are always justified by this fact • However, the minimal answer can be found efficiently under data complexity • That is, the number of keywords is fixed • Approximations can be found efficiently under query-and-data complexity • There is a lot of work on Steiner-tree approximations
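
The data-complexity claim can be made concrete with the classic Dreyfus-Wagner style dynamic program over subsets of the keywords, sketched here for undirected weighted graphs. It is a standard textbook technique, not necessarily the algorithm used by the authors. Its running time is exponential only in the number of keywords, so it is polynomial when that number is fixed. The function name and the toy graph are assumptions made for illustration.

```python
import heapq
from math import inf


def min_steiner_weight(graph, terminals):
    """Weight of a minimum tree connecting all terminals.

    graph maps each node to a dict {neighbor: edge_weight} (undirected)."""
    full = (1 << len(terminals)) - 1
    # best[mask][v]: minimum weight of a tree that connects the terminals
    # selected by mask and also contains node v.
    best = [{v: inf for v in graph} for _ in range(full + 1)]
    for i, t in enumerate(terminals):
        best[1 << i][t] = 0
    for mask in range(1, full + 1):
        cost = best[mask]
        # Merge two trees over complementary terminal subsets that meet at v.
        sub = (mask - 1) & mask
        while sub:
            left, right = best[sub], best[mask ^ sub]
            for v in graph:
                cost[v] = min(cost[v], left[v] + right[v])
            sub = (sub - 1) & mask
        # Grow the tree along edges with a Dijkstra-style relaxation.
        heap = [(d, v) for v, d in cost.items() if d < inf]
        heapq.heapify(heap)
        while heap:
            d, v = heapq.heappop(heap)
            if d > cost[v]:
                continue  # stale heap entry
            for u, w in graph[v].items():
                if d + w < cost[u]:
                    cost[u] = d + w
                    heapq.heappush(heap, (d + w, u))
    return min(best[full].values())


# Toy graph: hub "c" connects the keyword nodes a, b, d more cheaply
# than the direct edges between them.
graph = {
    "a": {"c": 1, "b": 2},
    "b": {"c": 1, "a": 2, "d": 2},
    "d": {"c": 1, "b": 2},
    "c": {"a": 1, "b": 1, "d": 1},
}
print(min_steiner_weight(graph, ["a", "b", "d"]))
```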

  36. Answer Generators w/ Poly. Delay • 3 different versions (directed, undirected and strong) • Arbitrary order [Kimelfeld & Sagiv, DBPL’05]: simple, easy to implement; polynomial space; basis for heuristic order • Exact order of increasing weight [Kimelfeld & Sagiv, PODS’06]: necessarily, #keywords is fixed (data complexity) • Approx. order of increasing weight [Kimelfeld & Sagiv, PODS’06]: the ratio is (Steiner-tree approximation ratio) + 1; we can utilize the vast literature on Steiner-tree approximation! • 2-approx. by increasing height [Golenberg, Kimelfeld & Sagiv, SIGMOD’08]: directed variant; implemented as part of an engine (under develop.)

  37. Enumeration in Arbitrary Order Why do we need an efficient enumeration in an arbitrary order? • Can be used directly if not too many answers are expected (e.g., extracting join expressions from schemas [Cohen et al., CIKM’05]) • Can be used as a basis of an efficient enumeration in a heuristic order:

  38. No Order → Heuristic Order [Kimelfeld & Sagiv, DBPL’05] • Start with a small neighborhood of the keywords in the data graph • Enumerate all answers in the neighborhood • Enlarge the neighborhood • Repeat • Avoid printing old answers
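
The loop above can be sketched for the two-keyword undirected case, where answers are simple paths between the two keyword nodes. Everything here is an illustrative assumption: the neighborhood is taken to be all nodes within a given number of hops of a keyword, and the inner enumerator is a naive depth-first search rather than the polynomial-delay algorithm from the cited paper. Function and node names are hypothetical.

```python
from collections import deque


def neighborhood(graph, keywords, radius):
    """Nodes within `radius` hops of at least one keyword node."""
    dist = {k: 0 for k in keywords}
    queue = deque(keywords)
    while queue:
        u = queue.popleft()
        if dist[u] == radius:
            continue  # do not expand beyond the radius
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist)


def paths_within(graph, allowed, source, target, path=None):
    """Naive enumeration of simple paths that use only allowed nodes."""
    path = path or [source]
    if path[-1] == target:
        yield tuple(path)
        return
    for v in graph[path[-1]]:
        if v in allowed and v not in path:
            yield from paths_within(graph, allowed, source, target, path + [v])


def heuristic_enumeration(graph, source, target, max_radius):
    """Enlarge the neighborhood step by step and print only new answers."""
    printed = set()
    for radius in range(max_radius + 1):
        allowed = neighborhood(graph, [source, target], radius)
        for answer in paths_within(graph, allowed, source, target):
            if answer not in printed:  # avoid printing old answers
                printed.add(answer)
                yield radius, answer


# Hypothetical graph: a short connection via m, a long one via p, q, r.
graph = {
    "s": ["m", "p"], "m": ["s", "t"], "t": ["m", "r"],
    "p": ["s", "q"], "q": ["p", "r"], "r": ["q", "t"],
}
for radius, answer in heuristic_enumeration(graph, "s", "t", 3):
    print(radius, answer)
```

Answers confined to a small neighborhood surface first, which is what makes the resulting order heuristic rather than guaranteed.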

  39. Enumeration by Inc. Weight (PODS’06) • A similar approach implemented (SIGMOD’08) • Goal: Enumerating answers by inc. weight • Adapting Lawler’s method: reduces to finding the top answer under constraints (the intricate part …) • Transformation of constraints: reduces to finding minimal supertrees • Collapse and restore: reduces to finding Steiner trees • Efficient alg. (data complexity) • Many known approximations
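
To show what adapting Lawler's method involves, here is the general scheme on a deliberately simple toy problem rather than on subtrees of a data graph: pick one value from each column and enumerate the picks by increasing total. The scheme keeps a priority queue of disjoint subspaces, each described by constraints and represented by its best solution. After printing a solution, it splits the rest of that subspace into smaller constrained subspaces. In the talk's setting, the hard part is the "best solution under constraints" step, which is trivial here because the columns are sorted. All names are hypothetical, and every column is assumed to be non-empty.

```python
import heapq


def ranked_choices(columns):
    """Yield (total, values), one value per column, by increasing total."""
    cols = [sorted(c) for c in columns]
    m = len(cols)

    def best(prefix, lo):
        # Best solution in the subspace where the first len(prefix) indices
        # are fixed, the next index is at least lo, and the rest are free.
        j = len(prefix)
        if lo >= len(cols[j]):
            return None  # empty subspace
        solution = prefix + (lo,) + (0,) * (m - j - 1)
        return sum(cols[i][x] for i, x in enumerate(solution)), solution

    heap = []
    first = best((), 0)
    if first is not None:
        heapq.heappush(heap, (first[0], first[1], 0))
    while heap:
        weight, solution, j = heapq.heappop(heap)
        yield weight, tuple(cols[i][x] for i, x in enumerate(solution))
        # Partition the remaining subspace: agree with the printed solution
        # on positions before i, and differ (go higher) at position i.
        for i in range(j, m):
            candidate = best(solution[:i], solution[i] + 1)
            if candidate is not None:
                heapq.heappush(heap, (candidate[0], candidate[1], i))


for total, values in ranked_choices([[1, 4], [2, 3]]):
    print(total, values)
```

The subspaces pushed after each printed solution are pairwise disjoint, so no solution is produced twice and no set of already printed solutions needs to be stored.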

  40. Our Algorithms: Order vs. Efficiency • From more desirable to more efficient: Exact Order • Approximate Order • Heuristic Order (no guaranteed approximation) • No Order

  41. Answers with High Similarity

  42. Combinations of Connections • But each individual answer is relevant!

  43. Repeated Information • This problem has been ignored in the past • Only “absolute” measures were considered • Mainly because experiments were done on db’s that have simple schemas, e.g., IMDB, DBLP • In complex data graphs (e.g., Mondial), this problem is critical • We currently study techniques for re-ranking after printing each answer • The next answer is not only the best absolutely; it will hold the best “additional” information • [Golenberg, Kimelfeld & Sagiv, SIGMOD’08]

  44. Answer Presentation • On the Web, we instantly understand the meaning of an answer (Web page) by reading the <title> element, the URL and, possibly, a snapshot of the text • In KPS, understanding the meaning of a subtree is cumbersome since we need to derive the hierarchy from the presentation

  45. What’s the Meaning of this Answer? • A snapshot of the BANKS demo on IMDB (http://www.cse.iitb.ac.in/banks/) • Impressive, detailed demo! • Harder for XML! • What information is needed to describe a node?
