1 / 42

Mediator Cost Models for Heterogeneous Data Sources

Mediator Cost Models for Heterogeneous Data Sources. Hubert Naacke. Outline. Context : DISCO a mediation system. Context : DISCO a mediation system Problem and Objective cost evaluation of query plan, use diverse cost info Proposition extensible cost model, declarative language

lael
Download Presentation

Mediator Cost Models for Heterogeneous Data Sources

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Mediator Cost Models for Heterogeneous Data Sources Hubert Naacke

  2. Outline Context : DISCO a mediation system • Context : DISCO a mediation system • Problem and Objective cost evaluation of query plan, use diverse cost info • Proposition extensible cost model, declarative language • Detailed solution cost formulas hierarchical integration, cost evaluation • Validation experimentation on Web data sources • Related solutions comparison with: calibration, historical cost, query sampling • Conclusion

  3. Mediation Systems • goals • intelligent integration of information • exploit existing sources, reuse data and power • systems • IRO-DB Univ. Versailles • Hermes Univ. Maryland • Garlic IBM • Tsimmis Stanford • Cords Univ. Waterloo • DISCO Bull + INRIADistributed Information Search Component

  4. Administrator Mediator Mediator Wrapper 1 Wrapper n Data Sources Registration phase User Application Query Mediator Wrapper 1 Wrapper n Data Sources Mediator-Wrapper Architecture Interface Wrapper 1 Wrapper n Data Sources Query processing phase Result

  5. Registration Query processing Catalog Catalog Execution Engine Query Decomposition Optimization Query Recomposition DISCO Architecture Application Administrator Query Tuples Mediator Mediator Schema Capabilities Cost model Wrapper 1 Wrapper n Data Sources Relational Algebra Tuples Wrapper 1 Wrapper n Data Sources

  6. Outline • Context : DISCO a mediation system • Problem cost evaluation of query plan in the mediator • Proposition extensible cost model, declarative language • Detailed solution cost formulas integration, cost evaluation • Validation experimentation on Web data sources • Related solutions comparison with: calibration, historical cost, query sampling • Conclusion Problem cost evaluation of a query plan

  7. Cost Based Optimization Problem • plan generation: 1 query ð n plans (sub-plans) • logical: operator reordering, permutation, distribution • physical: 1 operator ð n algorithms • cost estimation of plans (sub-plans) compare plans ð 1 candidate for execution • objective minimize response time, resource, memory

  8. Diversity of cost information Problem • sources autonomy • no cost info: black box • statistics only • statistics + cost formulas • instability of execution environment • network communication (e.g. daily contention) • workload (e.g. low load at night) • Objective: take into account all available cost info

  9. Outline • Context : DISCO a mediation system • Problem cost evaluation of query plan • Proposition extensible cost model, hierarchical cost formulas • Detailed solution cost formulas hierarchical integration, cost evaluation • Validation experimentation on Web data sources • Related solutions comparison with: calibration, historical cost, query sampling • Conclusion Proposition extensible cost model

  10. expressiveness: sufficient for heterogeneous sources extensible framework Proposition Proposed solution • declarative language • describe wrapper cost model • based on rules • hierarchical classification • integrates generic/specialized cost info • cost evaluation algorithm

  11. Assumptions • more cost info ð better plan • cost info comes from • mediator: generic cost model • wrapper: specialized cost model • cost model used by mediator • input for query plan evaluation • no cost evaluation in wrapper

  12. Requirements Proposed solution • cost model specified by wrapper implementor • statistics • formulas • description may be incomplete • missing statistics • missing formulas • integration: wrapper ðmediator • transfer at registration time • merge wrapper cost models into mediator

  13. Cost Communication Language Proposed solution • exporting statistics • collections stat: total size, card, tuple size • attribute stat: index, min, max, distinct values • exporting formulas • math. formulas: selectivity, statistic of intermediate collections • cost model = set of cost rules • interface for transfer • extension of wrapper interface • textual form parsed by mediator

  14. Cost Rule Definition Proposed solution • head is a plan (or sub-plan): e.g., scan(Publication) • body: 1 formula per cost vector component • cost vector = [Total time, Total size, Cardinality] • depends on statistics • genericity: head contains variables • unify a set of plans • e.g., scan(Collection)

  15. Cost Integration Detailed solution • hierarchical rule classification • criteria: based on rule head variables • any wrapper = generic scope generic rule for fully autonomous sources • any collection = wrapper scope e.g., source latency • any predicate= collection scope e.g., access method specificity • ...

  16. Count = ... TotalTime = ... ... Cost rules hierarchy Detailed solution any wrapper Generic-scope rules select (Collection, Predicate) TotalSize = ... wrapper 2: wrapper 1: Wrapper-scope rules select (Collection, Predicate) select (Collection, Predicate) TotalTime = ... TotalSize = ... Collection- scope rules select(author, Pred) select(publication, Predicate) TotalSize = ... TotalTime = ... Predicate-scope rules select(pub, year < S) select(pub, title = T) TotalTime = ... TotalTime = ... Query specific rules

  17. 2 1 2 Cost Evaluation Algorithm Detailed solution • traversing the plan tree • top-down = attach node « cost formulas • most specific formulas of the hierarchy • may attach many formulas • bottom-up = cost computation • depends on sub-nodes cost • sorting costs • many dimensions (response time, size) • compare: cost(plan1) < cost(plan2) plan

  18. select(scan(Person), age) select(X, P) scan(X) scan scan scan Person Pub Pub Example Detailed solution "get all persons younger than 25 who publish papers" join(select(Person, age < 25), Pub, person.name = pub.author) scan(X) scan(X) select age < 25 select age < 25 scan Person

  19. Validation experimentation on Web sources Outline • Context : DISCO a mediation system • Problem cost evaluation of query plan • Proposition extensible cost model, declarative language • Detailed solution cost formulas hierarchical integration, cost evaluation • Validation experimentation on Web data sources • Related solutions comparison with: calibration, historical cost, query sampling • Conclusion

  20. Validation Validation • objectives • efficiency of cost model : generic vs. specialized • limited efficiency of generic cost model • maximal efficiency of specialized cost model • cost language power: • can describe heterogeneous execution models • experimentation • data: real sources on the Web, TPC-D data • queries: selection, projection, inter-site join

  21. Validation method Validation • evaluate the cost of plans • P5 : best cost plan • execute the plans • P3 : best response time plan • compare response time P3 / P5 • according to the cost model • generic or specialized • according to the query and data • varying selectivity, cardinality of collections

  22. Experimental system: Sources • Validation • bibliographic data sources on the Web • ACM : 80MB card: 30000 • DBLP: 60 MB card: 90000 • data from TPC-D benchmark • stored in Oracle, access via JDBC • size: 1GB (scaling factor=1) • collections card: 150K - 1.5M

  23. Wrappers for bibliographical sources • Validation • Wrapper for ACM source • schema: Publication(author, title, conf, year) • capability: select on author(optionally conf, year) • cost info: only statistics, no formula • Wrapper for DBLP source • schema: Publication(author, title, conf, year) • capability: select on author,select on conf and year (=) • cost info: select(Pub, pred(conf=x2, year=x3)) :- totalTime = S0 + card(Pub) * sel(pred) * S1

  24. Experimental system : Mediator • Validation • schema • Publication : ACM Pub or DBLP Pub (replication) • generic cost model • cost of select: • select(Pub, Pred) : totalTime = S0 + card(Pub) * S1 • cost of select on indexed attribute : • select(Pub, Pred) : totalTime = S0 + card(Pub) * sel(pred) * S1

  25. Cost model for select operation Optimizer choice ? s s s = submit to wrapper • Validation select * fromPublication where author = x1 and conf = x2 and year < x3 plan P2: union xmin<xi<x3 plan P1: select author = x1 Select author = x1 conf = x2 year < x3 generic cost model P1 materialized select conf = x2 year = xi specialized cost model P1 P2 ACM DBLP

  26. Performance results • Validation Execution of P1 vs. P2 Efficiency loss for low selectivity

  27. Performance results • Validation Execution of P1 vs. P2 Efficiency loss for high selectivity

  28. Cost model for join operation a.author=b.author s s s s • Validation Inter-site join select * from ACM a, DBLP b where a.author = b.author and a.conf=SIGMOD plan P1: plan P2: hash-join a.author=b.author dep-join DBLP select conf = SIGMOD select conf = SIGMOD select author = X ACM ACM DBLP

  29. Performance results • Validation Response time and cost of joins Efficiency of specialized cost model

  30. o.custkey=c.custkey s s s s s Specific projection within TPC-D • Validation Order(orderkey, custkey, comment...), Customer(custkey, name) select * from Order o, Customer c where o.custkey = c.custkey plan P2: tmp.orderkey=c.orderkey plan P1: tmp.custkey=c.custkey select orderkey=X dep-join materialized Order Order select custkey=X select custkey=X proj orderkey, custkey Customer Customer Order

  31. Performance results • Validation Response time: plan 2 / plan1 Generic cost model efficiency

  32. Related work comparison with other approaches Outline • Context : DISCO a mediation system • Problem cost evaluation of query plan • Proposition extensible cost model, declarative language • Detailed solution cost formulas hierarchical integration, cost evaluation • Validation experimentation on Web data sources • Related solutions comparison with: calibration, historical cost, query sampling • Conclusion

  33. Known Solutions (1) • IRO-DB [Gardarin et al. in CoopIS'97] • hypothesis: generic cost model is unique • calibration of parameters • Hermes [Subrahmanian et al. in SIGMOD'96] • hypothesis: black box, stability over time • historical costs: cost[initial delay, total time, cardinality] = table[query]

  34. Known Solutions (2) • Cords [Zhu, Larson in Distributed & Parallel DB, 98] • hypothesis: similar query have similar cost • query sampling + regression cost model • Garlic [Roth, Ozcan, Haas in VLDB ’99] • multimedia context, cost of image retrieval • cost [cold resp. time, hot resp. time, cardinality] • override wrapper cost functions

  35. Comparison with related work • Conclusion • generalizes calibration and sampling approach • integrates Hermes historical cost • bottom layer of cost rules hierarchy • useful for query caching • Garlic confirmation • cost models do matter: need for wrapper input • extensible : cost of method call • trade-off expressiveness/abstraction

  36. Conclusion • Conclusion • Disco’s cost model is efficient • easy specialization of cost model for access method • yields 100% efficiency for typical web queries • improves logical and physical optimization • Extensible • may specify constraints imposed by source • periodic variation, resource limitation • Flexible • fine granularity of cost formulas

  37. Directions for Future Work • Conclusion • update cost model at runtime • polling / notification • cost rules for query caching • optimize cost evaluation algorithm • stop condition : response time < Tmax or result size < Smax • extensions • EC : cost = price, new parameters • cost of path expression, full-text search

  38. References • Disco’s cost model definition • Hubert Naacke, Georges Gardarin, and Anthony Tomasic. Leveraging Mediator Cost Models with Heterogeneous Data Sources (Extended version). ICDE 1998. • Early version in BDA 1997. • Disco’s cost model validation • Hubert Naacke, Anthony Tomasic, and Patrick Valduriez. Validating Mediator Cost Models with Disco. to appear, NISJ 1999. • Disco implementation • Anthony Tomasic, Rémy Amouroux, Philippe Bonnet, Olga Kapitskaia, Hubert Naacke, and Louiqa Raschid. The Distributed Information Search Component (Disco) and the World Wide Web. In ACM SIGMOD 1997, Research Prototype Demonstration. • MIRO-WEB Esprit Project, Spanish Hospital Application.

  39. T • Validation

  40. DBLP Wrapper : Materialization • Validation • materialized view • V(conf, year) : select * fromPublication where conf = C and year = Y materialized view

  41. Cost model for projection select * from Publication pub, Person pers where pub.author = pers.name plan 1: s s s s s materialized Person select author=X Publication • Validation Person(name, picture), Publication(author, ...) plan 2: select name=X Person project name select author=X Person Publication

  42. T • Validation

More Related