
Nested Mappings: Schema Mapping Reloaded



Presentation Transcript


  1. Nested Mappings: Schema Mapping Reloaded (Clio) • P. Papotti, Università Roma Tre • M.A. Hernandez, H. Ho, L. Popa, IBM Almaden Research Center • A. Fuxman, R.J. Miller, University of Toronto

  2. The Problem of Mapping Generation • Schemas can be arbitrarily different • E.g., different normalization & naming, missing/extra elements • Input: correspondences between atomic schema elements • (Automatic discovery) • Schema mappings: logical and declarative expressions of relationships between schemas • Abstraction for data interoperability tasks • Simpler than actual implementations of data exchange (SQL/XQuery/XSLT) • Must generate a transformation that: • Preserves data relationships: pname-dname, pname-ename, etc. • Creates new target values (pid) • Produces “correct” groupings Nested Mappings: Schema Mapping Reloaded - VLDB'06 - Paolo Papotti

  3. Outline • Schema mapping generation • [VLDB’02] Fagin, Hernandez, Popa (IBM Almaden), Miller, Velegrakis (Univ. of Toronto) • From basic to nested: • Issues with basic mappings • Nested mappings and their advantages • Generation algorithm • Performance impact • Conclusion • Related work • Future directions

  4. Schema Mapping Generation [Diagram: source schema S and target schema T, linked by schema correspondences; source concepts and target concepts (relational views); output: mappings] • Step 1. Extraction of “concepts” (in each schema). • Concept = one category of data that can exist in the schema • Step 2. Mapping generation • Enumerate all non-redundant maps between pairs of concepts

  5. Example
     Source: proj: Set [ dname, pname, emps: Set [ ename, salary ] ]
     Target: dept: Set [ dname, budget, emps: Set [ ename, salary, worksOn: Set [ pid ] ], projects: Set [ pid, pname ] ]
     • m1 maps proj to dept-projects (the concept of “project of a department”)
       m1: ∀(p0 in proj) → ∃(d in dept) (p in d.projects) p0.dname = d.dname ∧ p0.pname = p.pname
     • m2 maps proj-emps to dept-emps-worksOn-projects (the concept of “project of an employee of a department”)
       m2: ∀(p0 in proj) (e0 in p0.emps) → ∃(d in dept) (p in d.projects) (e in d.emps) (w in e.worksOn) w.pid = p.pid ∧ p0.dname = d.dname ∧ p0.pname = p.pname ∧ e0.ename = e.ename ∧ e0.salary = e.salary
       (the four target generators together with w.pid = p.pid form the expression for dept-emps-worksOn-projects)
     • Two ‘basic’ mappings (or source-to-target tgds or GLAV formulas)

  6. Outline • Schema mapping generation • [VLDB’02] Fagin, Hernandez, Popa (IBM Almaden), Miller, Velegrakis (Univ. of Toronto) • From basic to nested: • Issues with basic mappings • Nested mappings and their advantages • Generation algorithm • Performance impact • Conclusion • Related work • Future directions

  7. Issue 1: Many Small Uncorrelated Formulas [Diagram: the proj (source) and dept (target) schemas with mappings m1 and m2] • m1: “for every proj tuple there must be dept and project tuples such that …” • m2: “for every emp of a proj tuple there must be: dept, emp, worksOn, project …” • If we also had dependents under employees, then: “for every dependent of an emp of a proj …” and so on … • There is a lot of common mapping behavior that is repeated • E.g., m2 repeats the mapping behavior of m1 (although for a “subconcept”)

  8. Issue 2: Redundancy in the Generated Data
     Input (proj): (CS, uSearch, emps: { (Alice, 120K), (John, 90K) })
     Possible output (dept):
       (CS, B1, emps: { }, projects: { (X1, uSearch) }), required to exist based on m1
       (CS, B2, emps: { (Alice, 120K, worksOn: { X2 }) }, projects: { (X2, uSearch) }), required to exist based on m2
       (CS, B3, emps: { (John, 90K, worksOn: { X3 }) }, projects: { (X3, uSearch) }), required to exist based on m2
     • m2 repeats the mapping behavior of m1: • “duplicate” dept and project tuples • “duplicate” nulls (pid values: X2 and X3, and budget values) • Moreover, this duplication happens for each joining emp tuple in the source

  9. Issue 3: No Grouping in the Target
     Input (proj): (CS, uSearch, emps: { (Alice, 120K), (John, 90K) })
     Possible output (dept):
       (CS, B1, emps: { }, projects: { (X1, uSearch) }), required to exist based on m1
       (CS, B2, emps: { (Alice, 120K, worksOn: { X2 }) }, projects: { (X2, uSearch) }), required to exist based on m2
       (CS, B3, emps: { (John, 90K, worksOn: { X3 }) }, projects: { (X3, uSearch) }), required to exist based on m2
     Grouped tuple also shown on the slide:
       (CS, B2, emps: { (Alice, 120K, worksOn: { X2 }), (John, 90K, worksOn: { X3 }) }, projects: { (X3, uSearch) })
     • Alice and John are in different singleton sets (E and E’) • There can be as many singleton sets as emp tuples in the source nested set • It is desirable to enforce the grouping on the target data
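The redundancy and missing grouping described in slides 8 and 9 can be reproduced with a small Python sketch. It is only an illustration under stated assumptions: the function `apply_basic_mappings` and the dictionary layout are hypothetical and are not Clio code, and the labeled nulls are numbered so that they match the B1..B3 and X1..X3 values on the slides.

```python
from itertools import count

# Source instance from the slides: one proj tuple with two employees.
proj = [
    {
        "dname": "CS",
        "pname": "uSearch",
        "emps": [
            {"ename": "Alice", "salary": "120K"},
            {"ename": "John", "salary": "90K"},
        ],
    }
]


def apply_basic_mappings(proj):
    """Apply m1 and m2 independently (hypothetical helper, not Clio code)."""
    dept = []
    ids = count(1)  # numbering for the labeled nulls B<i> (budget) and X<i> (pid)

    # m1: every proj tuple needs a dept tuple with a matching project.
    for p0 in proj:
        i = next(ids)
        dept.append({
            "dname": p0["dname"],
            "budget": f"B{i}",
            "emps": [],
            "projects": [{"pid": f"X{i}", "pname": p0["pname"]}],
        })

    # m2: every (proj, emp) pair needs its own dept, emp, worksOn and project.
    for p0 in proj:
        for e0 in p0["emps"]:
            i = next(ids)
            dept.append({
                "dname": p0["dname"],
                "budget": f"B{i}",
                "emps": [{"ename": e0["ename"], "salary": e0["salary"],
                          "worksOn": [f"X{i}"]}],
                "projects": [{"pid": f"X{i}", "pname": p0["pname"]}],
            })
    return dept


dept = apply_basic_mappings(proj)
for d in dept:
    print(d["dname"], d["budget"], [e["ename"] for e in d["emps"]], d["projects"])
```

One source tuple yields three dept tuples, each with its own budget and pid nulls, and Alice and John end up in separate singleton emps sets.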

  10. Summary of issues • Fragmentation of the specification • (Too) many small tgds • Fragmentation of the data • Generate redundant data (which later needs to be removed or fused) • No grouping enforced on the target data (need additional phase to enforce any grouping)

  11. Idea [Diagram: the proj (source) and dept (target) schemas with mappings m1 and m2] • We would like to reuse (in m2) the “dept” and “project” tuples that the simpler mapping m1 asserts. • Make m2 assert only the “extra” information • Also accumulate the corresponding employees into one set • Idea: Correlate the mapping formulas based on their common part

  12. Correlating Mapping Formulas
      m1: ∀(p0 in proj) → ∃(d in dept) (p in d.projects) p0.dname = d.dname ∧ p0.pname = p.pname
      m2: ∀(p0 in proj) (e0 in p0.emps) → ∃(d in dept) (p in d.projects) (e in d.emps) (w in e.worksOn) w.pid = p.pid ∧ p0.dname = d.dname ∧ p0.pname = p.pname ∧ e0.ename = e.ename ∧ e0.salary = e.salary
      Replace with the nested mapping n:
      n: ∀(p0 in proj) → ∃(d in dept) (p in d.projects) p0.dname = d.dname ∧ p0.pname = p.pname ∧ [ ∀(e0 in p0.emps) → ∃(e in d.emps) (w in e.worksOn) w.pid = p.pid ∧ e0.ename = e.ename ∧ e0.salary = e.salary ]
      • proj tuples are mapped only once • The bracketed part is a submapping, correlated to the parent mapping • For every proj tuple, we map all employees, as a group. • (Source grouping is preserved)
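A minimal Python sketch of what the nested mapping n asks for, on the same example input as slide 8. The function `apply_nested_mapping` and the dictionary layout are assumptions made for illustration and are not the authors' implementation: the outer loop plays the role of the parent mapping, and the inner comprehension plays the role of the submapping, which reuses the d and p created by the parent.

```python
# Source instance from the slides: one proj tuple with two employees.
proj = [
    {
        "dname": "CS",
        "pname": "uSearch",
        "emps": [
            {"ename": "Alice", "salary": "120K"},
            {"ename": "John", "salary": "90K"},
        ],
    }
]


def apply_nested_mapping(proj):
    """Apply the nested mapping n (hypothetical helper, not Clio code)."""
    dept = []
    for i, p0 in enumerate(proj, start=1):
        # Parent mapping: one dept tuple and one project per proj tuple.
        pid = f"X{i}"  # labeled null for the unknown project id
        # Submapping: all employees of p0 go into the emps set of the SAME
        # dept tuple and point at the SAME project through worksOn.
        emps = [
            {"ename": e0["ename"], "salary": e0["salary"], "worksOn": [pid]}
            for e0 in p0["emps"]
        ]
        dept.append({
            "dname": p0["dname"],
            "budget": f"B{i}",  # labeled null for the unknown budget
            "emps": emps,
            "projects": [{"pid": pid, "pname": p0["pname"]}],
        })
    return dept


nested_dept = apply_nested_mapping(proj)
print(nested_dept)
```

Here the single proj tuple produces a single dept tuple with one budget null and one pid null, and both employees sit in the same emps set, in contrast with the three dept tuples of slide 8.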

  13. Advantages of Nested Mappings • Nested tgds can exploit the natural hierarchy that exists on the concepts of a schema • e.g., proj-emps is a “subconcept” of proj, in the source schema • Map higher concept only once; use submappings for subconcepts • Nested mappings are strictly more expressive: There is no set of source-to-target tgds that is equivalent to n.

  14. Nesting Algorithm: Sketch • Step 1. Discovery: construct a DAG of basic mappings based on the concept hierarchy • Step 2. Correlation: construct nested mappings by traversing the DAG, starting from each root, and repeatedly applying the nesting step sketched earlier (replace a basic mapping by a submapping of its parent). • We get a forest of nested mappings

  15. Nesting Algorithm: Example
      Source: proj: Set of [ dname, pname, emps: Set of [ ename, salary ] ]
      Target: dept: Set of [ dname, budget, emps: Set of [ ename, salary, worksOn: Set of [ pid ] ], projects: Set of [ pid, pname ] ]
      Concept labels in the diagram: P, PE (source); D, DE, DP, DEPW (target); an X mark also appears
      A DAG of basic mappings: P → DP, PE → DEPW
      Resulting nested mapping (outer level from P → DP, inner level from PE → DEPW):
      for p in proj exists d’ in dept, p’ in d’.projects
      where d’.dname=p.dname and p’.pname=p.pname and
        ( for e in p.emps exists e’ in d’.emps, w in e’.worksOn
          where w.pid=p’.pid and e’.ename=e.ename and e’.salary=e.salary )
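The two steps of the nesting algorithm (build a DAG of basic mappings, then traverse it from the roots) can be sketched in Python as follows. This is a simplified illustration, not the algorithm from the paper: the class `BasicMapping`, the functions `is_submapping`, `build_dag` and `nest`, and in particular the subset test used to decide when one mapping can be nested inside another are all assumptions. Each concept is represented simply by the set of nested sets it iterates over.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BasicMapping:
    name: str
    source: frozenset  # sets iterated over in the source ("for" part)
    target: frozenset  # sets iterated over in the target ("exists" part)


def is_submapping(child, parent):
    """Assumed nesting test: the child strictly extends the parent's source
    concept and covers the parent's target concept."""
    return parent.source < child.source and parent.target <= child.target


def build_dag(mappings):
    """Step 1 (discovery): edge m -> c whenever c can be nested inside m."""
    return {m.name: [c.name for c in mappings if is_submapping(c, m)]
            for m in mappings}


def nest(dag):
    """Step 2 (correlation): walk the DAG from its roots and return a forest
    of nested mappings as nested dicts {mapping name: submappings}."""
    has_parent = {c for children in dag.values() for c in children}

    def direct_children(name):
        # Skip children that are already nested inside another child.
        kids = dag[name]
        return [k for k in kids
                if not any(k in dag[other] for other in kids if other != k)]

    def subtree(name):
        return {k: subtree(k) for k in direct_children(name)}

    return {root: subtree(root) for root in dag if root not in has_parent}


# The two basic mappings of the running example.
m1 = BasicMapping("P->DP", frozenset({"proj"}),
                  frozenset({"dept", "dept.projects"}))
m2 = BasicMapping("PE->DEPW", frozenset({"proj", "proj.emps"}),
                  frozenset({"dept", "dept.emps", "dept.emps.worksOn",
                             "dept.projects"}))

dag = build_dag([m1, m2])
forest = nest(dag)
print(dag)     # {'P->DP': ['PE->DEPW'], 'PE->DEPW': []}
print(forest)  # {'P->DP': {'PE->DEPW': {}}}
```

With only two basic mappings the forest has a single tree, which corresponds to the nested mapping of slide 15. In this sketch a mapping with several parents would appear under each of them.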

  16. Experimental evaluation • Goal: show empirically that nested mappings can dramatically: • reduce the cost of producing a target instance • improve the quality of the generated data • DBLP-like schema, on both source and target, with four levels of nesting/grouping: • authors – level 1 • conferences – level 2 • years – level 3 • publications – level 4 • Mappings are implemented by generating queries (in XQuery) • Qbasic based on basic mappings • Qnested based on nested mappings

  17. Example Queries – 2 Levels Only
      Qbasic: • Multiple query terms (one per basic mapping) • Need re-grouping (over entire data) • Generate duplicates
        let $doc0 := fn:doc("instance.xml")
        return
        <authorDB>
          {
            for $x0 in $doc0/authorDB/author, $x1 in $x0/conf
            return
              <author>
                <name> { $x0/name/text() } </name>
                {
                  for $x0L1 in $doc0/authorDB/author, $x1L1 in $x0L1/conf
                  where $x0/name/text()=$x0L1/name/text()
                  return
                    <conf>
                      <name> { $x1L1/name/text() } </name>
                    </conf>
                }
              </author>
          }
          {
            for $x0 in $doc0/authorDB/author
            return
              <author>
                <name> { $x0/name/text() } </name>
              </author>
          }
        </authorDB>
      Qnested: • Single pass over the data • No duplicates
        let $doc0 := fn:doc("instance.xml")
        return
        <authorDB>
          {
            for $x0 in $doc0/authorDB/author
            return
              <author>
                <name> { $x0/name/text() } </name>
                {
                  for $x1 in $x0/conf
                  return
                    <conf>
                      <name>{ $x1/name/text() }</name>
                    </conf>
                }
              </author>
          }
        </authorDB>
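To see why Qbasic generates duplicates while Qnested does not, the loop structure of the two queries can be mimicked in Python. This is a rough analogue under stated assumptions, not a translation tool: the functions `q_basic` and `q_nested` are hypothetical, and the small `authors` list is made-up sample data that does not come from the experiments.

```python
# Made-up two-level instance: authors with their conferences.
authors = [
    {"name": "author1", "confs": ["VLDB", "SIGMOD"]},
    {"name": "author2", "confs": ["PODS"]},
]


def q_basic(authors):
    """Mimics Qbasic: one query term per basic mapping."""
    out = []
    # Term for the author-conf mapping: one <author> per (author, conf) pair,
    # with the confs re-grouped by a join on the author name over all data.
    for a in authors:
        for _conf in a["confs"]:
            regrouped = [c for a2 in authors if a2["name"] == a["name"]
                         for c in a2["confs"]]
            out.append({"name": a["name"], "confs": regrouped})
    # Term for the author mapping: one <author> per author, without confs.
    for a in authors:
        out.append({"name": a["name"], "confs": []})
    return out


def q_nested(authors):
    """Mimics Qnested: a single pass, confs emitted inside their author."""
    return [{"name": a["name"], "confs": list(a["confs"])} for a in authors]


basic_out = q_basic(authors)
nested_out = q_nested(authors)
print(len(basic_out), "author elements from the basic-style query")
print(len(nested_out), "author elements from the nested-style query")
```

On this sample the basic-style query emits five author elements for two authors (author1 appears three times), while the nested-style query emits exactly two.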

  18. Execution time comparison • Plot: Qbasic execution time / Qnested execution time • Logarithmic scale • Execution time for basic: 22 minutes • Execution time for nested: 1.1 seconds

  19. Output file size comparison • Plot: Qbasic output file size / Qnested output file size • Logarithmic scale • Size of generated data for basic (including duplicates): 45MB • Size of generated data for nested: 552KB • The nested mapping results in much more efficient execution with less redundant data
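The ratios plotted on slides 18 and 19 can be worked out from the reported figures. The short Python calculation below is a sketch that assumes 1 MB = 1000 KB (the slides do not say which convention is used) and that the quoted numbers describe comparable runs; the variable names are illustrative.

```python
import math

# Figures reported on slides 18 and 19.
basic_seconds = 22 * 60   # 22 minutes
nested_seconds = 1.1
basic_kb = 45 * 1000      # 45 MB, assuming 1 MB = 1000 KB
nested_kb = 552

speedup = basic_seconds / nested_seconds
size_ratio = basic_kb / nested_kb

print(f"execution time ratio: {speedup:.0f}x "
      f"(log10 = {math.log10(speedup):.2f})")
print(f"output size ratio: {size_ratio:.1f}x "
      f"(log10 = {math.log10(size_ratio):.2f})")
```

Under these assumptions the nested query is about 1200 times faster (roughly three orders of magnitude) and its output is about 80 times smaller (almost two orders of magnitude), which is why a logarithmic scale is used.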

  20. Related work • Both embedded mappings [Melnik et al. SIGMOD’05] and HePToX [Bonifati et al. VLDB’05] support nested data, but do not support nesting of mappings. • Nested mappings are less general than languages used for composition [Fagin et al. PODS’04, Nash et al. PODS’05], but are more compact and easier to understand/program • The generation algorithm identifies common expressions within mappings, in the same spirit as work in query optimization [e.g., Roy et al. SIGMOD’00]. • But query optimization preserves query equivalence, while our techniques lead to mappings with better semantics (they do not preserve query equivalence). • There are already commercial tools that use similar paradigms (e.g., IBM Ascential DataStage TX), but most of the mapping generation work is manual.

  21. Conclusion • Nested tgds: better specification language for transformation • Use correlation (hierarchy) between concepts • Less redundancy in the output, more efficient • Naturally preserve source grouping • For more complex mappings we expose Skolem functions to let users alter the default grouping behavior • Nested tgds are more compact and easier to understand/program • Humans think top-down: map top concepts, then submappings, etc. • Can be generated too!

  22. Future Directions • Extend existing solutions to use nested mappings • Data integration, mapping analysis and reasoning, schema evolution, etc. • Nested tgds are more complex as a logic formalism! • Study the formal foundation of nested mappings • More generally, develop methods for deciding when and why one schema mapping specification is “better” than another • Need to look at issues such as: • preservation of the source data (associations, correlations, etc.) • minimization of incompleteness
