
Schema Summarization

Schema Summarization. Cong Yu and H. V. Jagadish, University of Michigan, Ann Arbor. VLDB 2006, Seoul, Korea, September 13th, 2006. Many Databases Are Complex. Number of elements = #tables + #columns (relational) or #elements + #attributes (XML). Reactome Schema.


Presentation Transcript


  1. Schema Summarization. Cong Yu and H. V. Jagadish, University of Michigan, Ann Arbor. VLDB 2006, Seoul, Korea, September 13th, 2006

  2. Many Databases Are Complex • Number of elements = #tables + #columns (relational) or #elements + #attributes (XML)

  3. Reactome Schema

  4. What’s the Problem? • Why are complex schemas difficult to deal with? • For data integration administrators (DIAs): difficult to grasp the major topics of a complex schema • For ordinary users: difficult to identify the small subset of relevant schema elements • Can we avoid them? • Probably not: scientific databases are in fact getting more and more complex – MiMI is an example

  5. Existing Approaches • Ignore the schema • Keyword-based search over relational and XML databases • Guess the schema • Schema-Free XQuery, FleXPath, etc. • Limitations: • Provide imprecise (and sometimes incorrect) answers • No help in understanding the schema (and the database) itself

  6. Our Approach • Summarize the schema • Represent the original complex schema with a simpler schema, i.e., a summary of the original schema • Help users explore the schema via the summary • Illustrates the main topics of the database • Filters away irrelevant parts of the schema • Challenge: how to create a good summary?

  7. Talk Outline • Motivation • Background Definitions • Desiderata of Schema Summary • Efficient Schema Summarization • Evaluation • Conclusion and Related Work

  8. Schema • A labeled, directed graph • Nodes: • Relational: table and column • Hierarchical: element and attribute • Links: • Structural links: parent/child constraints • Value links: inclusion constraints (key / foreign key) • [Figure: example schema graph with elements warehouse, state*, authors, store*, author*, contact, book*, isbn, price, title and attributes @id, @name, @address]
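As an illustration of the definition on this slide, a schema graph can be held in a small Python structure. This is a minimal sketch: the class name `SchemaGraph`, its methods, and the handful of links added below are assumptions made for the example (the talk's diagram is not reproduced in the transcript), not the authors' data structure.

```python
from dataclasses import dataclass, field


@dataclass
class SchemaGraph:
    """Labeled, directed schema graph (hypothetical helper, not from the talk)."""
    nodes: set = field(default_factory=set)
    # Each link is (source, target, kind); kind is "structural" or "value".
    links: list = field(default_factory=list)

    def add_link(self, source, target, kind):
        if kind not in ("structural", "value"):
            raise ValueError(f"unknown link kind: {kind}")
        self.nodes.update((source, target))
        self.links.append((source, target, kind))

    def neighbors(self, node):
        """Nodes connected to `node` by a link in either direction."""
        found = set()
        for source, target, _ in self.links:
            if source == node:
                found.add(target)
            elif target == node:
                found.add(source)
        return found


# Assumed fragment of the warehouse example: parent/child links are
# structural, and the book-to-author reference is modeled as a value link.
g = SchemaGraph()
g.add_link("warehouse", "state", "structural")
g.add_link("state", "store", "structural")
g.add_link("store", "book", "structural")
g.add_link("book", "isbn", "structural")
g.add_link("book", "author", "value")
```

Keeping the link kind on every edge lets later steps treat structural and value links differently if needed, while `neighbors` ignores direction for simple exploration.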

  9. Schema Summary • A schema itself, but: • Fewer elements, hence simpler • Contains abstract elements and links • Abstract element: represents a group of original elements • Abstract link: connects at least one abstract element • [Figure: the original example schema next to a summary containing warehouse, author* and book*]

  10. Talk Outline • Motivation • Background Definitions • Desiderata of Schema Summary • Efficient Schema Summarization • Evaluation • Conclusion and Related Work

  11. What Makes a Good Schema Summary? • Which one should be the summary? • [Figure: three candidate schema graphs, each rooted at warehouse]

  12. What Information Do We Need? • A schema summary is not only a summary of the “schema” but in fact also a summary of the “database”! • Needed: schema structure and data distribution • [Figure: example warehouse schema graph]

  13. Desired Properties of Schema Summary • Small enough (in terms of number of elements) to comprehend – Summary Complexity • Show elements in which users are more likely to be interested – Summary Importance • Show elements that represent the entire database well – Summary Coverage • Importance and Coverage calculation will need to consider both schema structure and data distribution

  14. Intuition Behind Importance • Not all schema elements are created equal! • First observation (schema): more links, more important • Second observation (data): more popular, more important • [Figure: example warehouse schema graph]

  15. Compute Summary Importance • Schema Element Importance • W: Neighbor Weight – the percentage of e_j’s information that flows into e, estimated using relative cardinalities • Summary Importance
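The formulas for this slide did not survive in the transcript. The sketch below shows one way such an importance score could be iterated. It assumes a PageRank-style update in which each element keeps a fraction `p` of its importance and receives the rest from its neighbors in proportion to the neighbor weights. The function names, the parameter `p`, the initialization by cardinality, and the sample numbers are all assumptions for illustration, not the authors' definitions.

```python
def neighbor_weights(rel_card):
    """Normalize relative cardinalities so each source's outgoing weights sum to 1."""
    totals = {}
    for (src, _), rc in rel_card.items():
        totals[src] = totals.get(src, 0.0) + rc
    return {(src, dst): rc / totals[src] for (src, dst), rc in rel_card.items()}


def element_importance(cardinality, rel_card, p=0.5, rounds=200, tol=1e-9):
    """Iterate importance values until they stop changing (assumed update rule)."""
    weights = neighbor_weights(rel_card)
    importance = {e: float(c) for e, c in cardinality.items()}
    for _ in range(rounds):
        # Each element keeps a fraction p of its own importance ...
        updated = {e: p * v for e, v in importance.items()}
        # ... and receives the remainder from neighbors, scaled by neighbor weight.
        for (src, dst), w in weights.items():
            updated[dst] += (1 - p) * w * importance[src]
        delta = max(abs(updated[e] - importance[e]) for e in importance)
        importance = updated
        if delta < tol:
            break
    return importance


def summary_importance(summary, importance):
    """Share of total importance held by the summary's elements (assumed definition)."""
    return sum(importance[e] for e in summary) / sum(importance.values())


# Hypothetical statistics: 10 stores, 100 books, 100 ISBNs.
cardinality = {"store": 10, "book": 100, "isbn": 100}
rel_card = {
    ("store", "book"): 10.0, ("book", "store"): 1.0,
    ("book", "isbn"): 1.0, ("isbn", "book"): 1.0,
}
imp = element_importance(cardinality, rel_card)
```

Because the weights leaving each element sum to one, the total importance stays constant across rounds and only its distribution over elements changes.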

  16. Intuition Behind Coverage • Important ≠ Inclusion in the summary • Elements can be too “close” to each other • Two basic notions • Element Affinity • Element Coverage • [Figure: example warehouse schema graph]

  17. Intuition Behind Coverage, cont’d • Element Affinity: • fewer hops, higher affinity • higher relative cardinality, lower affinity • Element Coverage: based on • Element Affinity • Neighbor Weight • [Figure: example warehouse schema graph]

  18. Compute Summary Coverage • Schema element affinity from ea to eb • Schema element coverage of eb by ea • Summary Coverage
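The coverage formulas are also missing from the transcript, so the following is only a stand-in that respects the two stated intuitions: affinity drops with every hop and drops further when the relative cardinality of a hop is high. The per-hop factor `decay / max(1, rc)`, the `decay` parameter, the function names, and the simplified coverage (best affinity times cardinality, without the neighbor-weight term the slides mention) are all assumptions.

```python
import heapq


def affinities_from(source, rel_card, decay=0.8):
    """Best (maximum-product) affinity from `source` to every reachable element."""
    adjacency = {}
    for (src, dst), rc in rel_card.items():
        adjacency.setdefault(src, []).append((dst, rc))
    best = {source: 1.0}
    heap = [(-1.0, source)]  # max-heap via negated scores
    while heap:
        neg_score, node = heapq.heappop(heap)
        score = -neg_score
        if score < best.get(node, 0.0):
            continue  # stale heap entry
        for dst, rc in adjacency.get(node, []):
            # Assumed per-hop factor: every hop costs `decay`,
            # and a high relative cardinality lowers affinity further.
            candidate = score * decay / max(1.0, rc)
            if candidate > best.get(dst, 0.0):
                best[dst] = candidate
                heapq.heappush(heap, (-candidate, dst))
    return best


def summary_coverage(summary, cardinality, rel_card, decay=0.8):
    """Fraction of the data covered when each element counts via its closest summary element."""
    reach = {ea: affinities_from(ea, rel_card, decay) for ea in summary}
    covered = 0.0
    for eb, card in cardinality.items():
        covered += max((reach[ea].get(eb, 0.0) for ea in summary), default=0.0) * card
    return covered / sum(cardinality.values())


# Hypothetical statistics, matching a store -> book -> isbn fragment.
cardinality = {"store": 10, "book": 100, "isbn": 100}
rel_card = {
    ("store", "book"): 10.0, ("book", "store"): 1.0,
    ("book", "isbn"): 1.0, ("isbn", "book"): 1.0,
}
from_book = affinities_from("book", rel_card)
from_store = affinities_from("store", rel_card)
```

In this toy setup `book` reaches both neighbors in one low-cardinality hop, so a summary containing only `book` covers more of the data than one containing only `store`.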

  19. What Makes a Good Schema Summary? • [Diagram relating data distribution and schema structure to summary importance and summary coverage]

  20. Talk Outline • Motivation • Background Definitions • Desiderata of Schema Summary • Efficient Schema Summarization • Evaluation • Conclusion and Related Work

  21. Overview • Input: database schema and summary size K • (1) Annotating the schema graph (computing statistics) • (2.1) Calculating importance (Algorithm MaxImportance): list L of elements sorted by importance • (2.2) Calculating coverage (Algorithm MaxCoverage): set of K elements with high coverage; set S of coverage domination pairs • (3) Determine K summary elements (Algorithm BalanceSummary) • (4) Cluster original schema elements • Output: balanced summary of size K

  22. Algorithm MaxImportance • MaxImportance generates a summary of a given size k, maximizing summary importance • Compute steady-state element importance values (convergence is proved in [MGR02]) • Sort and pick top-k important elements • Compute assignments of remaining elements • Complexity: O(N^2 + N log N)
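The last two steps of MaxImportance can be sketched as follows, assuming importance values have already been computed. The function name, the tie-breaking by element name, and in particular the assignment rule (each remaining element joins the summary element that reaches it first in a breadth-first search) are assumptions for illustration; the slide does not say how assignments are computed.

```python
from collections import deque


def max_importance_summary(importance, adjacency, k):
    """Pick the k most important elements and assign the rest to them."""
    # Sort by descending importance; ties broken by name for determinism.
    ranked = sorted(importance, key=lambda e: (-importance[e], e))
    summary = ranked[:k]

    # Multi-source BFS: each remaining element is assigned to the summary
    # element that reaches it in the fewest hops (assumed rule).
    assignment = {e: e for e in summary}
    queue = deque(summary)
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in assignment:
                assignment[neighbor] = assignment[node]
                queue.append(neighbor)
    return summary, assignment


# Hypothetical importance values and undirected adjacency lists.
importance = {"warehouse": 1, "state": 5, "store": 20,
              "book": 60, "isbn": 30, "author": 40}
adjacency = {
    "warehouse": ["state", "author"],
    "state": ["warehouse", "store"],
    "store": ["state", "book"],
    "book": ["store", "isbn"],
    "isbn": ["book"],
    "author": ["warehouse"],
}
summary, assignment = max_importance_summary(importance, adjacency, k=2)
```

The sort accounts for the N log N term in the stated complexity; the importance iteration itself is assumed to have been done beforehand.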

  23. Algorithm MaxCoverage • MaxCoverage generates a summary of a given size k, maximizing summary coverage in a heuristic way • Eliminate elements being dominated • Compute summary coverage for all element sets of size k • Compute coverage dominance (bottom up with A/D pairs) • Pick the set with highest coverage • Complexity: O(kN2nk) • See paper for details on coverage dominance

  24. Generate Balanced Summary • No single optimal criterion to balance the two desired properties • A heuristic approach: • Pick elements in the order of their importance • Ignore elements that are dominated by elements already in the summary • Works well in practice
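The heuristic on this slide can be sketched in a few lines. The sketch assumes the two inputs named in the overview are already available: a list of elements sorted by importance and a set of coverage domination pairs. The function name `balance_summary`, the pair orientation `(dominator, dominated)`, and the sample data are assumptions for illustration.

```python
def balance_summary(ranked, dominates, k):
    """Greedy pick: walk elements by importance, skipping dominated ones."""
    summary = []
    for element in ranked:
        if len(summary) == k:
            break
        # Skip an element that some already-chosen element dominates.
        if any((chosen, element) in dominates for chosen in summary):
            continue
        summary.append(element)
    return summary


# Hypothetical inputs: isbn is important but already well covered by book.
ranked = ["book", "isbn", "author", "store"]
dominates = {("book", "isbn")}
picked = balance_summary(ranked, dominates, k=2)
```

With these inputs the second most important element is skipped in favor of one that adds coverage, which is the balance the slide describes.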

  25. Talk Outline • Motivation • Background Definitions • Desiderata of Schema Summary • Efficient Schema Summarization • Evaluation • Conclusion and Related Work

  26. Evaluation Strategies • Observation • Comparing automatic summaries with summaries generated by human experts • In general, automatic summaries agree well with human-generated summaries (~80%) • An objective evaluation framework • Models query behavior based on schema exploration • Query discovery cost: the number of extra elements visited in order to construct a correct query from a query intention

  27. Query Discovery Cost Example • Query Intention: Retrieve ISBN of all books • Query: for $b in doc()/state/store/book return $b/isbn • [Figure: two versions of the example warehouse schema graph, annotated Cost = 3 and Cost = 5]

  28. Data Sets

  29. Summary Benefits

  30. Contributions of Schema Structure and Data Distribution

  31. Impact of Balancing Importance and Coverage • Percentage in parentheses shows the reduction in savings

  32. Talk Outline • Motivation • Background Definitions • Desiderata of Schema Summary • Efficient Schema Summarization • Evaluation • Conclusion and Related Work

  33. Related Work • First study on summarizing schemas • Related to ER model abstraction • Limitations of ER model abstraction • Does not reflect the data distribution • ER models may not be available and may be out-of-date • For most database schemas, structure or value links are semantics-free, so ER model abstraction methods are ineffective (tagging those links involves a significant amount of manual effort)

  34. Related Work, cont’d • Summary element importance calculation is partially inspired by PageRank • Summary element affinity calculation (used in summary coverage) is partially inspired by similar measurements in social network analysis

  35. Conclusions and Contributions • Introduced the concept of schema summary • Defined summary importance and summary coverage as desiderata of schema summary • Emphasized both schema structure and data distribution as essential features for importance and coverage calculation • Designed and implemented efficient schema summarization algorithms • Proposed an objective evaluation framework

  36. Questions ?
