
Efficient Discovery of XML Data Redundancies

Efficient Discovery of XML Data Redundancies. Cong Yu and H. V. Jagadish, University of Michigan, Ann Arbor. VLDB 2006, Seoul, Korea, September 12th, 2006. Talk Outline: Motivating Example; A Comprehensive Notion of XML FD; XML Redundancy Discovery Algorithms; Experimental Evaluation.



  1. Efficient Discovery of XML Data Redundancies Cong Yu and H. V. Jagadish University of Michigan, Ann Arbor - VLDB 2006, Seoul, Korea September 12th, 2006

  2. Talk Outline • Motivating Example • A Comprehensive Notion of XML FD • XML Redundancy Discovery Algorithms • Experimental Evaluation • Conclusion

  3. An Example XML Document [Figure: XML data tree. The root warehouse contains state nodes; each state contains store nodes with a name (“Borders”, “Amazon”) and book children; each book has an ISBN (“… 269”), a title (“DB”), a price (“$59.9” or “$51.1”) and, for some books, au children (“R.R.”, “J.G.”).]

  4. Constraints on XML Data • An example constraint: For any two books, if they have the same ISBN, then they have the same title. • Parts of the constraint: Target (books), Condition Element(s) (ISBN), Implication Element(s) (title) • Similar to Equality Generating Dependencies (EGDs) [BV84] and Nested EGDs [YP04]

  5. Data Redundancies • E.g., title is redundantly stored • Result of “non-optimal” design of the database schema in the presence of constraints • Lead to: • Update anomalies • Increased cost for data transfer and manipulation • Constraints are the properties of data • May not be known at the design phase

  6. Goal Efficiently Discover Redundancies From the XML Database By Discovering Satisfied Constraints

  7. Main Contributions • A comprehensive notion of XML FD • Capturing a semantically richer set of XML constraints • Definition of XML data redundancy in terms of XML FDs and XML Keys • Efficient algorithms for discovering FDs and data redundancies from an XML database • Experimental Evaluation

  8. Talk Outline • Motivating Example • A Comprehensive Notion of XML FD • XML Redundancy Discovery Algorithms • Experimental Evaluation • Conclusion

  9. Example XML Constraints • Hierarchical: condition and/or implication elements can come from multiple hierarchies [Figure: the example warehouse data tree with state, store (name) and book (ISBN, title, price, au) nodes]

  10. Example XML Constraints, Cont’d • Set elements: condition and/or implication elements can involve set elements [Figure: the example warehouse data tree with state, store (name) and book (ISBN, title, price, au) nodes]

  11. Functional Dependencies (FDs) • FDs are used to describe constraints in relational databases • A similar notion of FD is needed for XML • Challenges: • Target is difficult to specify due to the hierarchical structure • Set elements introduce new semantics • XML FD needs richer semantics!

  12. Previous Notions • Path Based Notion [LLL02,VLL04] • Example: {/warehouse/state/store/book/ISBN} → /warehouse/state/store/book/title • Format: LHS → RHS • Semantics: for any two RHS nodes, same (associated) LHS indicates same RHS • Tree Tuple Based Notion [AL04] • A tree tuple is a data tree, with exactly one data node for each schema element • Format: LHS → RHS • Semantics: for any two tree tuples, same LHS indicates same RHS

  13. Previous Notions, cont’d • Both capture hierarchical constraints • Neither can capture set constraints • {/store/book/ISBN} → /store/book/au • Violated under the previous notions • Satisfied if the two au nodes are treated as a single set • {/store/book/title, /store/book/au} → /store/book/ISBN • Undefined under the previous notions • Intuitive if the au nodes are treated as a single set [Figure: a store named “Borders” with one book that has ISBN “… 269”, title “DB”, price “$59.9” and two au children, “R.R.” and “J.G.”]

  14. A New Comprehensive Notion • Generalized Tree Tuple • A data tree constructed around a pivot data node (n_p) • Entire subtree rooted at n_p is kept • All ancestors of n_p and their “attributes” are kept • Tuple Class C_P • The set of all generalized tree tuples whose pivot nodes share the same path P (called pivot path)

  15. Example Generalized Tree Tuple [Figure: the example warehouse data tree with one node marked as the pivot]

  16. Example Generalized Tree Tuple [Figure: the example warehouse data tree with one node marked as the pivot]

  17. XML FD • <C_P, LHS, RHS>: LHS → RHS w.r.t. C_P • Semantics: for any two generalized tree tuples t1, t2 in C_P, if they share the same LHS, they have the same RHS. • E.g., {./title, ./au} → ./ISBN, w.r.t. C_{/warehouse/state/store/book}

  18. Repeatable Elements Are Special [Figure: the example warehouse data tree, in which state, store, book and au nodes occur repeatedly under their parents]

  19. Essential Tuple Classes • Definition: Tuple classes with pivot paths that correspond to repeatable schema elements • C_{/warehouse/state/store/book} is essential • C_{/warehouse/state/store/name} is not • Essential tuple classes can express all XML FDs that are expressible with non-essential tuple classes • See paper for detailed proof

  20. XML Key and Data Redundancy • Let attribute @key uniquely identify each node in the entire data tree • <C_P, LHS> is an XML Key when the database satisfies the XML FD: LHS → ./@key w.r.t. C_P • Similar to the relative key notion proposed in [BDF+01] • Data redundancy exists if the database: • Satisfies the XML FD <C_P, LHS, RHS>, • But <C_P, LHS> is not an XML key ⇒ RHS is redundantly stored.
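The redundancy definition above can be sketched in a few lines of Python. This is an illustrative simplification, not the authors' implementation: it assumes each book tuple has already been flattened into a row (the three rows are the ones shown in the talk's R_book table), and the helper names `fd_holds`, `is_key` and `is_redundant` are hypothetical.

```python
from itertools import combinations

# Flattened book tuples, keyed by @key (rows taken from the talk's R_book table).
R_BOOK = {
    6:  {"parent": 4,  "ISBN": "…269", "title": "DB", "price": "$59.9"},
    13: {"parent": 12, "ISBN": "…269", "title": "DB", "price": "$51.1"},
    20: {"parent": 19, "ISBN": "…269", "title": "DB", "price": "$59.9"},
}

def value(key, row, attr):
    """Return the value of attr for a tuple; @key is the tuple's own identifier."""
    return key if attr == "@key" else row[attr]

def fd_holds(relation, lhs, rhs):
    """LHS -> RHS holds if every pair of tuples agreeing on LHS agrees on RHS."""
    for (k1, r1), (k2, r2) in combinations(relation.items(), 2):
        same_lhs = all(value(k1, r1, a) == value(k2, r2, a) for a in lhs)
        same_rhs = all(value(k1, r1, a) == value(k2, r2, a) for a in rhs)
        if same_lhs and not same_rhs:
            return False
    return True

def is_key(relation, lhs):
    """<C_P, LHS> is a key when LHS -> ./@key holds."""
    return fd_holds(relation, lhs, {"@key"})

def is_redundant(relation, lhs, rhs):
    """RHS is redundantly stored when the FD holds but LHS is not a key."""
    return fd_holds(relation, lhs, rhs) and not is_key(relation, lhs)
```

On these rows, `is_redundant(R_BOOK, {"ISBN"}, {"title"})` is true: ISBN determines title, yet ISBN alone does not identify a book node, so the title is stored more than once.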

  21. Talk Outline • Motivating Example • A Comprehensive Notion of XML FD • XML Redundancy Discovery Algorithms • Experimental Evaluation • Conclusion

  22. Strategy • Discover satisfied XML FDs and Keys • Data redundancies can then be discovered based on the definition • First, we need an efficient representation of the XML data

  23. Hierarchical Representation of XML Data • Each essential tuple class → a relation • Similar to nested relations [OY87,MNE96] • All relations together form a hierarchy • Tree tuples can be reconstructed by joining @key with parent

      R_state
      @key  parent
      2     root
      3     root
      18    root
      …     …

      R_store
      @key  parent  name
      4     3       Borders
      12    3       Amazon
      19    18      Borders

      R_book
      @key  parent  ISBN   title  price
      6     4       …269   DB     $59.9
      13    12      …269   DB     $51.1
      20    19      …269   DB     $59.9

      R_au
      @key  parent  @text
      10    6       R.R.
      11    6       J.G.
      24    20      R.R.
      25    20      J.G.
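As a rough illustration of how a tree tuple can be reassembled by joining @key with parent, the sketch below stores the four relations as Python dictionaries and rebuilds the view of one book. The dictionary layout and the function name `book_tuple` are assumptions made for this example only; the R_state rows elided on the slide are left out.

```python
# Relations from the slide, keyed by @key. Only the rows shown are included.
R_STATE = {2: {"parent": "root"}, 3: {"parent": "root"}, 18: {"parent": "root"}}
R_STORE = {
    4:  {"parent": 3,  "name": "Borders"},
    12: {"parent": 3,  "name": "Amazon"},
    19: {"parent": 18, "name": "Borders"},
}
R_BOOK = {
    6:  {"parent": 4,  "ISBN": "…269", "title": "DB", "price": "$59.9"},
    13: {"parent": 12, "ISBN": "…269", "title": "DB", "price": "$51.1"},
    20: {"parent": 19, "ISBN": "…269", "title": "DB", "price": "$59.9"},
}
R_AU = {
    10: {"parent": 6,  "@text": "R.R."},
    11: {"parent": 6,  "@text": "J.G."},
    24: {"parent": 20, "@text": "R.R."},
    25: {"parent": 20, "@text": "J.G."},
}

def book_tuple(book_key):
    """Rebuild one book-pivoted tuple by following parent pointers upward
    (book -> store -> state) and collecting au children downward."""
    book = R_BOOK[book_key]
    store_key = book["parent"]
    store = R_STORE[store_key]
    authors = sorted(
        row["@text"] for row in R_AU.values() if row["parent"] == book_key
    )
    return {
        "state": store["parent"],
        "store": store_key,
        "../name": store["name"],
        "./ISBN": book["ISBN"],
        "./title": book["title"],
        "./price": book["price"],
        "./au": authors,
    }
```

For example, `book_tuple(13)` picks up the store name "Amazon" from R_store and an empty author list, since no R_au row has parent 13.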

  24. Intra-Relation FDs • {./ISBN} → ./title, w.r.t. C_{/warehouse/state/store/book} [Figure: the example warehouse data tree; ISBN and title are both children of book nodes]

  25. Inter-Relation FDs • {../name, ./ISBN} → ./price, w.r.t. C_{/warehouse/state/store/book} [Figure: the example warehouse data tree; ../name is present in R_store, while ./ISBN and ./price are present in R_book]

  26. Overview of the Discovery Process • Only interested in minimal FDs • Bottom-Up • At each relation • Discover intra-relation FDs and Keys • Discover inter-relation FDs and Keys involving descendant relations • Generate candidate inter-relation FDs and Keys for examination at the parent level • Attribute Partition as the basic data structure

  27. Attribute Partition • Groups tuples according to the attribute value • ∏{price} for C_book = { {t6,t20}, {t13} } • ∏{@key} for C_book = { {t6}, {t20}, {t13} } • ∏{price, @key} for C_book = { {t6}, {t20}, {t13} } • FD: LHS → RHS w.r.t. C_P is satisfied iff: ∏_{LHS∪RHS} = ∏_{LHS}

      R_book
      @key  parent  ISBN   title  price
      6     4       …269   DB     $59.9
      13    12      …269   DB     $51.1
      20    19      …269   DB     $59.9
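The partition test can be sketched as follows. This is a minimal illustration under the assumption that a relation is a dictionary from @key to a row; the function names `partition` and `fd_holds` are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

# R_book rows from the slide; tuple t6 has @key 6, and so on.
R_BOOK = {
    6:  {"parent": 4,  "ISBN": "…269", "title": "DB", "price": "$59.9"},
    13: {"parent": 12, "ISBN": "…269", "title": "DB", "price": "$51.1"},
    20: {"parent": 19, "ISBN": "…269", "title": "DB", "price": "$59.9"},
}

def partition(relation, attrs):
    """Group tuple keys by their combined values on attrs."""
    groups = defaultdict(set)
    for key, row in relation.items():
        # @key is the tuple's own identifier rather than a stored column.
        value = tuple(key if a == "@key" else row[a] for a in sorted(attrs))
        groups[value].add(key)
    return {frozenset(group) for group in groups.values()}

def fd_holds(relation, lhs, rhs):
    """LHS -> RHS holds iff adding RHS does not split any LHS group."""
    return partition(relation, set(lhs) | set(rhs)) == partition(relation, set(lhs))
```

With these rows, `partition(R_BOOK, {"price"})` yields the two groups {6, 20} and {13}, matching ∏{price} on the slide, and `fd_holds(R_BOOK, {"ISBN"}, {"title"})` is true while `fd_holds(R_BOOK, {"ISBN"}, {"price"})` is false.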

  28. Set Attribute Partition • Generated through refinement: (1) Initialize ∏{au} for R_book to be { {t6, t13, t20} } (2) Compute ∏{@text} for R_au = { {t10, t24}, {t11, t25} } (3) Convert to parent: { {t6, t20}, {t6, t20} } (4) Refine ∏{au} using the converted partitions of ∏{@text}: ∏{au} for R_book = { {t6, t20}, {t13} } • ∏{au} can then be used as a normal partition

      R_au
      @key  parent  @text
      10    6       R.R.
      11    6       J.G.
      24    20      R.R.
      25    20      J.G.

      R_book
      @key  parent  ISBN   title  price
      6     4       …269   DB     $59.9
      13    12      …269   DB     $51.1
      20    19      …269   DB     $59.9
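The refinement steps on this slide can be sketched as below. It is one possible reading of the slide rather than the paper's code: `refine` and `set_attribute_partition` are hypothetical names, and each relation is assumed to be a dictionary from @key to a row.

```python
from collections import defaultdict

# R_au rows from the slide and the keys of the three R_book tuples.
R_AU = {
    10: {"parent": 6,  "@text": "R.R."},
    11: {"parent": 6,  "@text": "J.G."},
    24: {"parent": 20, "@text": "R.R."},
    25: {"parent": 20, "@text": "J.G."},
}
BOOK_KEYS = {6, 13, 20}

def refine(partition, splitter):
    """Split every block into the part inside splitter and the part outside."""
    refined = set()
    for block in partition:
        inside, outside = block & splitter, block - splitter
        refined.update(part for part in (inside, outside) if part)
    return refined

def set_attribute_partition(parent_keys, child_relation, attr):
    """Partition parents so that two parents share a block exactly when
    they have the same set of child values for attr."""
    # Partition the child relation on attr, already converted to parent keys.
    parents_by_value = defaultdict(set)
    for row in child_relation.values():
        parents_by_value[row[attr]].add(row["parent"])
    # Start from a single block and refine once per child-value group.
    result = {frozenset(parent_keys)}
    for parents in parents_by_value.values():
        result = refine(result, frozenset(parents))
    return result
```

Running `set_attribute_partition(BOOK_KEYS, R_AU, "@text")` reproduces ∏{au} = { {t6, t20}, {t13} }: books 6 and 20 carry the same author set, and book 13 has no authors.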

  29. Discovery Algorithms • DiscoverFD: • Discover intra-relation FDs and Keys • Similar to existing relational algorithms • DiscoverXFD: • Discover inter-relation FDs and Keys • Key component: • Candidate inter-relation XML FD generation

  30. Generating Candidate Inter-Relation FDs • Let P' be a parent relation of P • Parent satisfaction property • For LHS∪X → RHS w.r.t. C_P to hold for any attribute set X in relation P', LHS∪{./parent} → RHS w.r.t. C_P must hold • Child implication property • For LHS∪X → RHS w.r.t. C_P to be a non-trivial FD for any attribute set X in relation P', LHS → RHS w.r.t. C_P must not hold • An FD is a candidate inter-relation FD if it satisfies both properties
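The two pruning properties can be illustrated with a small check. This is a sketch under assumptions: relations are dictionaries from @key to a row, the `parent` column stands for ./parent, the names `fd_holds` and `is_candidate` are hypothetical, and the `HYPOTHETICAL` relation is invented purely to show a failing case.

```python
# R_book rows from the talk.
R_BOOK = {
    6:  {"parent": 4,  "ISBN": "…269", "title": "DB", "price": "$59.9"},
    13: {"parent": 12, "ISBN": "…269", "title": "DB", "price": "$51.1"},
    20: {"parent": 19, "ISBN": "…269", "title": "DB", "price": "$59.9"},
}

# Invented rows (not from the talk): same store, same ISBN, different price.
HYPOTHETICAL = {
    1: {"parent": 4, "ISBN": "A", "price": "$1"},
    2: {"parent": 4, "ISBN": "A", "price": "$2"},
}

def fd_holds(relation, lhs, rhs):
    """LHS -> RHS holds if each LHS value maps to a single RHS value."""
    seen = {}
    for row in relation.values():
        left = tuple(row[a] for a in sorted(lhs))
        right = tuple(row[a] for a in sorted(rhs))
        if seen.setdefault(left, right) != right:
            return False
    return True

def is_candidate(relation, lhs, rhs):
    """Keep LHS -> RHS for examination at the parent level only if
    (1) LHS + {parent} -> RHS holds (parent satisfaction) and
    (2) LHS -> RHS does not already hold (child implication)."""
    parent_satisfaction = fd_holds(relation, set(lhs) | {"parent"}, rhs)
    child_implication = not fd_holds(relation, lhs, rhs)
    return parent_satisfaction and child_implication
```

On R_book, `{ISBN} → price` is a candidate (it fails alone but holds once the parent is added), which is consistent with the inter-relation FD {../name, ./ISBN} → ./price shown earlier in the talk. `{ISBN} → title` is not a candidate because it already holds within R_book, and in the invented relation no parent attribute can rescue `{ISBN} → price`.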

  31. Talk Outline • Motivating Example • A Comprehensive Notion of XML FD • XML Redundancy Discovery Algorithms • Experimental Evaluation • Conclusion

  32. Real Datasets • DBLP contains a fair amount of redundancy, as noted earlier in [AL04] as well • ~10% redundancies in PIR (measured as # of redundant elements over total # of elements); schema modification reported to PIR

  33. Scalability on XMark • Linear in terms of scale factor (# of elements) – even though exponential in theory • Orders of magnitude faster than direct application of a state-of-the-art relational discovery algorithm • The latter takes over 3 hours to run on XMark scale factor 1

  34. Related Work • XML Integrity Constraints (FDs and Keys) • [BDF+01], [LLL02], [FS03] • XML Normal Form • [AL04], [VLL04] • Nested Relation Normal Form • [OY87], [MNE96] • Relational FD discovery • FUN, Dep-Miner, TANE, fdep, FastFDs

  35. Conclusion • A comprehensive notion of XML FDs and Keys, capturing set semantics • A system for detecting XML data redundancies through the discovery of FDs and Keys • The system is practical for real datasets and outperforms direct application of the best available relational algorithm by orders of magnitude.

  36. Questions ?
