
Capability-Sensitive Query Processing on Internet Sources

Hector Garcia-Molina, Wilburt Labio, Ramana Yerneni. Presented by Bimbi Koduru. Date: 03/29/2007.


Presentation Transcript


  1. Capability-Sensitive Query Processing on Internet Sources • Hector Garcia-Molina, Wilburt Labio, Ramana Yerneni • Presented by Bimbi Koduru • Date: 03/29/2007

  2. Brief Overview • On the Internet, the limited query-processing capabilities of sources make answering even the simplest queries challenging. • To solve this, a scheme was developed called GenCompact. • GenCompact is used for generating capability-sensitive plans for queries on Internet sources.

  3. Overview Contd. • Advantages of GenCompact over query plans generated by existing query-processing systems: • Sources are guaranteed to support the query plans. • The plans take advantage of the source capabilities. • The plans are more efficient since a larger space of plans is examined.

  4. Introduction • Processing queries over sources with a wide range of query-processing capabilities poses some interesting challenges. • Examples: • A bookstore (think BarnesAndNoble!) • A car shopping guide (the autobytel website)

  5. Example 1 • Consider BarnesAndNoble. • It won’t allow searching for two authors on the same topic in a single query. • A good plan is to break up the query into two, one per author. • At the end, we take the union of the results of the two queries to obtain the answer.
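The split-and-union plan from Example 1 can be sketched as follows. This is only an illustration: the catalog, the titles, the author names, and the helper `search_by_author` are hypothetical stand-ins for the bookstore's search form, which is assumed to accept one author per query.

```python
def search_by_author(catalog, author):
    """Hypothetical source query: the form accepts a single author per call."""
    return {book["title"] for book in catalog if author in book["authors"]}


def search_either_author(catalog, author_a, author_b):
    # One supported source query per author, then a union at the mediator.
    return search_by_author(catalog, author_a) | search_by_author(catalog, author_b)


# Illustrative data only.
CATALOG = [
    {"title": "Book A", "authors": ["Author X"]},
    {"title": "Book B", "authors": ["Author Y"]},
    {"title": "Book C", "authors": ["Author X", "Author Y"]},
    {"title": "Book D", "authors": ["Author Z"]},
]

result = search_either_author(CATALOG, "Author X", "Author Y")
```

Because the results are combined as sets, a book written by both authors appears only once in the answer.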

  6. Example 2 • Suppose we want to look up information on midsize or compact sedans: Toyotas under $20k and BMWs under $40k. • The query condition in this case is: • (style = “sedan” ^ (size = “compact” v size = “midsize”) ^ ((make = “Toyota” ^ price <= 20000) v (make = “BMW” ^ price <= 40000))).

  7. Example 2 Contd. • This query condition cannot be supported directly by the web source. • Breaking the condition up into two source queries is feasible: • (style = “sedan” ^ make = “Toyota” ^ price <= 20000 ^ (size = “compact” v size = “midsize”)) • (style = “sedan” ^ make = “BMW” ^ price <= 40000 ^ (size = “compact” v size = “midsize”))

  8. Explanation • In this example, both DNF-based and CNF-based query-processing systems produce less desirable plans. • A DNF system transforms the query into one with four terms. In this case our two-query plan is preferable, since it needs fewer source queries. • A CNF system transforms the query into one with six clauses. Here, CNF may transform many more queries than necessary.
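The term and clause counts quoted for Example 2 can be checked with a small sketch. The nested-tuple encoding of the condition and the function names `dnf` and `cnf` are assumptions made for this illustration; no simplification (such as absorption) is applied.

```python
from itertools import product


def dnf(cond):
    """Return the DNF of cond as a set of terms (each a frozenset of atoms)."""
    if isinstance(cond, str):
        return {frozenset([cond])}
    op, *children = cond
    parts = [dnf(child) for child in children]
    if op == "or":
        return set().union(*parts)
    # "and": pick one term from every child and merge them.
    return {frozenset().union(*combo) for combo in product(*parts)}


def cnf(cond):
    """Return the CNF of cond as a set of clauses (each a frozenset of atoms)."""
    if isinstance(cond, str):
        return {frozenset([cond])}
    op, *children = cond
    parts = [cnf(child) for child in children]
    if op == "and":
        return set().union(*parts)
    # "or": pick one clause from every child and merge them.
    return {frozenset().union(*combo) for combo in product(*parts)}


# The condition from Example 2, encoded as ("and" | "or", child, ...).
QUERY = (
    "and",
    'style = "sedan"',
    ("or", 'size = "compact"', 'size = "midsize"'),
    (
        "or",
        ("and", 'make = "Toyota"', "price <= 20000"),
        ("and", 'make = "BMW"', "price <= 40000"),
    ),
)

print(len(dnf(QUERY)), "DNF terms")    # 4 DNF terms
print(len(cnf(QUERY)), "CNF clauses")  # 6 CNF clauses
```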

  9. Problems caused by the query capabilities of Internet sources • It is difficult to generate plans that are known a priori to be feasible. • The size of the plan space for even moderately complex queries can be very large. • In some cases, existing systems choose infeasible plans when feasible plans exist. • In other cases, they choose inefficient plans when much more efficient feasible plans exist.

  10. Different query-processing systems • Very few query-processing systems take source capabilities into account. • Conventional systems such as: • System R • DB2 • NonStop SQL assume relational source capabilities without limitations.

  11. Query-processing systems Contd. • Relatively new systems like • Information Manifold • TSIMMIS • Garlic • DISCO have addressed issues surrounding limited source capabilities.

  12. Notation • The efficient feasible plans for a given target query can be generated in the form • The condition expression “c” is represented by a condition tree (CT). • Leaf nodes – atomic conditions • Non-leaf nodes – Boolean connectors

  13. Notation • Given a condition expression C, we denote the set of attributes in C as Attr(C). • An alternative notation for the target query is SP(C, A, R). • In the case of a node n of some CT, SP(n, A, R) is shorthand for SP(Cond(n), A, R).
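A condition tree and Attr(C) can be sketched as below. The class names `Atom` and `Connector` and the function `attr` are hypothetical, chosen for this illustration only: leaves hold atomic conditions and inner nodes hold Boolean connectors.

```python
from dataclasses import dataclass


@dataclass
class Atom:
    """Leaf node: an atomic condition such as price <= 20000."""
    attribute: str
    op: str
    value: object


@dataclass
class Connector:
    """Non-leaf node: a Boolean connector ("and" / "or") over child nodes."""
    kind: str
    children: list


def attr(node):
    """Attr(C): the set of attributes mentioned anywhere in the condition."""
    if isinstance(node, Atom):
        return {node.attribute}
    found = set()
    for child in node.children:
        found |= attr(child)
    return found


# The condition from Example 2 as a condition tree.
ct = Connector("and", [
    Atom("style", "=", "sedan"),
    Connector("or", [Atom("size", "=", "compact"), Atom("size", "=", "midsize")]),
    Connector("or", [
        Connector("and", [Atom("make", "=", "Toyota"), Atom("price", "<=", 20000)]),
        Connector("and", [Atom("make", "=", "BMW"), Atom("price", "<=", 40000)]),
    ]),
])

attributes = attr(ct)
```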

  14. Source Capabilities • Internet sources have a wide variety of query-processing limitations. • Condition-Attribute Restrictions • Condition-Expression-Size Restrictions • Condition-Expression-Structure Restrictions.

  15. SSDL • Simple Source-Description Language (SSDL) is a powerful language that describes the wide range of query capabilities. • SSDL is based on context-free grammars (CFGs). • Using SSDL, standard parsing technology can be used to check for the supportability of a source query very efficiently.

  16. SSDL Example • Consider R( make, model, year, color, price ) • The query capabilities of R can be described in SSDL as follows:

  17. SSDL Example: Rules • CFG rules – describe the condition expressions R can evaluate. • Rule (2) – R can evaluate conditions like (make = “BMW” ^ price < 40000) • Rule (3) – R can evaluate conditions like (make = “BMW” ^ color = “Red”) • The last two rules indicate the attributes that can be exported by R.
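The idea of checking a source query against a capability description can be illustrated with a simplified stand-in. This is not SSDL syntax and not grammar-based parsing: it uses a plain lookup table instead of a CFG. The table `SUPPORTED_FORMS`, its exported-attribute sets, and the function `check` are all hypothetical; only the two supported condition shapes (make with price, make with color) are taken from the slide.

```python
# Hypothetical capability table for R(make, model, year, color, price).
# Keys: attribute sequences the source accepts as a conjunction.
# Values: attributes the source can export for that form (illustrative only).
SUPPORTED_FORMS = {
    ("make", "price"): {"make", "model", "year", "color", "price"},
    ("make", "color"): {"make", "model", "year", "color"},
}


def check(conjuncts):
    """Return the exportable attributes if the conjunction is supported, else None.

    conjuncts is a list of (attribute, operator, value) triples. The match is
    order-sensitive in this sketch.
    """
    form = tuple(attribute for attribute, _op, _value in conjuncts)
    exported = SUPPORTED_FORMS.get(form)
    return set(exported) if exported is not None else None


supported = check([("make", "=", "BMW"), ("price", "<", 40000)])
unsupported = check([("color", "=", "Red"), ("price", "<", 40000)])
```

Here `supported` is a set of attributes, while `unsupported` is `None` because no table entry matches a color condition followed by a price condition.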

  18. A modular scheme - GenModular • GenModular is used for generating efficient feasible query plans for target queries. • GenModular considers various rewritings of the target query condition and chooses the least expensive among the plans. • GenModular identifies parts of the condition that can be answered by the source and pieces together the source queries.

  19. GenModular • Four Modules work together in GenModular.

  20. Rewrite Module • It produces a set of equivalent rewritings of the target-query condition. • Starts from the condition tree (CT) for the target query and generates the CTs for the rewritings. • The rewrite module applies a set of rewrite rules that are supplied as input to the module. • Examples – Commutative, Associative and Distributive transformations of condition expressions.
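As an illustration of such rewrite rules, the sketch below applies one distributive step and then an associative flattening to the condition from Example 2, yielding the two conjunctive queries of slide 7 (up to the order of conjuncts). The nested-tuple encoding and the names `distribute_over` and `flatten` are assumptions made for this example, not the module's actual interface.

```python
def distribute_over(cond, index):
    """Distributive rule: a AND (b OR c)  =>  (a AND b) OR (a AND c).

    index selects which child of the AND node is the OR to distribute over.
    """
    op, *children = cond
    target = children[index]
    if op != "and" or isinstance(target, str) or target[0] != "or":
        raise ValueError("expected an AND node whose selected child is an OR node")
    rest = children[:index] + children[index + 1:]
    return ("or", *(("and", *rest, alternative) for alternative in target[1:]))


def flatten(cond):
    """Associative rule: merge nested nodes that share the same connector."""
    if isinstance(cond, str):
        return cond
    op, *children = cond
    flat = []
    for child in map(flatten, children):
        if not isinstance(child, str) and child[0] == op:
            flat.extend(child[1:])
        else:
            flat.append(child)
    return (op, *flat)


SIZE = ("or", 'size = "compact"', 'size = "midsize"')
QUERY = (
    "and",
    'style = "sedan"',
    SIZE,
    ("or",
     ("and", 'make = "Toyota"', "price <= 20000"),
     ("and", 'make = "BMW"', "price <= 40000")),
)

rewritten = flatten(distribute_over(QUERY, 2))
```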

  21. Mark Module • Determines which parts of each CT produced by the rewrite module can be evaluated by the source. • Example Description: • Each node n in the CT has a field n.export that records the set of attributes that can be exported by the source when asked to evaluate Cond(n). • By using the Check function described earlier, the mark module computes the export fields of all the nodes in the CT.
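The marking step can be sketched as a tree walk that stores the result of a check function in every node. The `Node` class, the `toy_check` function, and its table are hypothetical; in particular, using an empty export set to mean "the source cannot evaluate this condition" is an assumption of this sketch.

```python
class Node:
    """A condition-tree node with the export field filled in by mark()."""

    def __init__(self, cond, children=()):
        self.cond = cond
        self.children = list(children)
        self.export = set()


def mark(node, check):
    """Record, for every node, what the source can export for Cond(node)."""
    node.export = check(node.cond)
    for child in node.children:
        mark(child, check)


# Hypothetical capability table standing in for a real Check function.
TOY_EXPORTS = {
    'make = "BMW"': {"make", "model", "price"},
}


def toy_check(cond):
    # Assumption: an empty set means the condition is not supported.
    return set(TOY_EXPORTS.get(cond, ()))


leaf_make = Node('make = "BMW"')
leaf_year = Node("year > 2000")
root = Node('make = "BMW" ^ year > 2000', [leaf_make, leaf_year])
mark(root, toy_check)
```

After marking, the root and the `year` leaf have empty export sets, while the `make` leaf is marked as a part of the condition the toy source can answer.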

  22. Generate Module • Generate module uses an algorithm called Exhaustive Plan Generator (EPG) that computes the feasible plans for a given CT. • Generate module produces the set of feasible plans by repeatedly invoking EPG on each of the CTs passed on by the mark module.

  23. Exhaustive Plan Generator • EPG generates a plan for n by combining the plans for all the children of n. • EPG also explores the possibility of downloading the relevant portion of the source contents and evaluating the condition expression corresponding to n at the mediator.

  24. GenCompact • GenCompact generates the same plans in a much more efficient manner than GenModular. • The major disadvantage of GenModular is that it is very inefficient in generating feasible plans for target queries.

  25. GenCompact vs. GenModular • GenCompact improves upon GenModular with two techniques. • Intelligent Plan Generation • Pruning techniques

  26. Rewrite Module • GenCompact employs a rewrite module to generate a set of CTs equivalent to the CT representing the target-query condition. • However, GenCompact can work with far fewer CTs than GenModular by firing fewer rewrite rules, without compromising the optimality of the plans being generated.

  27. Cost Model • Given a plan for the target query that uses a set of source queries SQ, the cost of the plan is: • where, • k1 and k2 are constants that depend on the source referred to by the target query.
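The cost formula itself is not preserved in this transcript. The sketch below therefore assumes one plausible form, in which each source query in SQ pays a fixed overhead k1 plus k2 per result returned. This form, the function name `plan_cost`, and the numbers used are assumptions for illustration only.

```python
def plan_cost(result_sizes, k1, k2):
    """Assumed cost form: sum over source queries of (k1 + k2 * result size).

    result_sizes holds one result size per source query in SQ.
    k1 and k2 are the source-dependent constants.
    """
    return sum(k1 + k2 * size for size in result_sizes)


# Illustrative numbers: two source queries returning 120 and 45 results.
two_query_plan = plan_cost([120, 45], k1=10.0, k2=0.5)  # 102.5
# The same 165 results fetched with four source queries costs more overhead.
four_query_plan = plan_cost([60, 60, 25, 20], k1=10.0, k2=0.5)  # 122.5
```

Under this assumed form, plans that send fewer source queries and transfer fewer results are cheaper.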

  28. Pruning rules • Based on the cost model, the following pruning rules can be formulated: • PR1: Prune impure plans when pure plans exist. • PR2: Prune locally sub-optimal plans. • PR3: Prune dominated plans. • These rules are used in the plan-generation module.

  29. Plan Generation Module • Plan-generation module takes each CT produced by the rewrite module and generates a single query plan for the CT. • The associativity and copy rules are not used in the rewrite module. • To compensate, the plan generation module has to do more work on each CT it receives from the rewrite module.

  30. Conclusion • GenCompact produces excellent feasible plans for queries over Internet sources with limited capabilities. • GenCompact is a flexible scheme and can be adapted to other source-capability-description languages and cost models.

  31. Questions?

  32. Thank You!
