Well-designed XML Data

Well-designed XML Data. Marcelo Arenas and Leonid Libkin University of Toronto. Outline. Part 1 - Database Normalization from the 1970s and 1980s Part 2 - Classical theory revisited: normalizing XML documents Part 3 - Classical theory re-done: new justifications for normalization.

Presentation Transcript


  1. Well-designed XML Data Marcelo Arenas and Leonid Libkin University of Toronto

  2. Outline • Part 1 - Database Normalization from the 1970s and 1980s • Part 2 - Classical theory revisited: normalizing XML documents • Part 3 - Classical theory re-done: new justifications for normalization

  3. Part 1: Classical Normalization • Design: decide how to represent the information in a particular data model. • Even for simple application domains there is a large number of ways of representing the data of interest. • We have to design the schema of the database. • Set of relations. • Set of attributes for each relation. • Set of data dependencies.

  4. Designing a Database: An Example • Attributes: number, title, section, room. • Data dependency: every course number is associated with only one title. • Relational schema, BAD alternative: [schema table not captured in transcript]

  5. Problems with BAD: Update Anomaly Title of CSC258 is changed to Computer Organization I.

  7. Problems with BAD: Update Anomaly Title of CSC258 is changed to Computer Organization I. The instance stores redundant information.

  8. Deletion Anomaly CSC434 is not given in this term.

  10. Deletion Anomaly CSC434 is not given in this term. Additional effect: all the information about CSC434 was deleted.

  11. Insertion Anomaly A new course is created: (CSC336, Numerical Methods)

  13. Insertion Anomaly A new course is created: (CSC336, Numerical Methods) The instance stores attributes that are not directly related.

  14. Avoiding Update Anomalies Title of CSC258 is changed to Computer Organization I.

  16. Avoiding Update Anomalies Title of CSC258 is changed to Computer Organization I. CSC434 is not given in this term. The instance does not store redundant information.

  17. Avoiding Update Anomalies CSC434 is not given in this term.

  18. Avoiding Update Anomalies CSC434 is not given in this term. A new course is created: (CSC336, Numerical Methods) The title of CSC434 is not removed from the instance.

  19. Avoiding Update Anomalies A new course is created: (CSC336, Numerical Methods)

  20. Avoiding Update Anomalies A new course is created: (CSC336, Numerical Methods) No information about sections has to be provided. Each relation stores attributes that are directly related.
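The contrast between the BAD design and the decomposed one can be sketched in Python. The slide tables are not reproduced in this transcript, so the rows below are hypothetical sample data: only the attribute names, the course numbers, and the titles "Computer Organization I" and "Numerical Methods" come from the slides. The relation names Course and Section are assumptions as well.

```python
# Hypothetical rows for the single-relation (BAD) design.
# Section numbers, rooms and the original titles are illustrative assumptions.
bad = [
    {"number": "CSC258", "title": "Computer Organization", "section": 1, "room": "R101"},
    {"number": "CSC258", "title": "Computer Organization", "section": 2, "room": "R102"},
    {"number": "CSC434", "title": "Database Systems", "section": 1, "room": "R103"},
]

def decompose(rows):
    """Split into Course(number, title) and Section(number, section, room)."""
    course = {r["number"]: r["title"] for r in rows}
    section = [(r["number"], r["section"], r["room"]) for r in rows]
    return course, section

def join(course, section):
    """Natural join on number; rebuilds rows of the single-relation design."""
    return [
        {"number": n, "title": course[n], "section": s, "room": room}
        for (n, s, room) in section
    ]

course, section = decompose(bad)
assert join(course, section) == bad  # nothing is lost for this instance

# Update: the title is stored once, so one write is enough.
course["CSC258"] = "Computer Organization I"

# Deletion: CSC434 is not given this term, but its title is kept.
section = [row for row in section if row[0] != "CSC434"]

# Insertion: a new course can be added without inventing a section.
course["CSC336"] = "Numerical Methods"
```

Each of the three operations touches exactly one relation, which is the behaviour the slides describe for the good design.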

  21. Normalization Theory • Main idea: a normal form defines a condition that a well-designed database should satisfy. • Normal form: syntactic condition on the database schema. • Defined for a class of data dependencies. • Main problems: • How to test whether a database schema is in a particular normal form. • How to transform a database schema into an equivalent one satisfying a particular normal form.

  22. BCNF: a Normal Form for FDs • Functional dependency (FD) over R(A1, …, An): X → Y, where X, Y ⊆ {A1, …, An}. • X → Y: two rows with the same X-values must have the same Y-values. • number → title: two rows with the same course number must have the same title. • Key dependency: X → A1 … An. • X is a key: two distinct rows must have distinct X-values.
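The two definitions on this slide translate directly into checks on an instance. The sketch below is illustrative only: the function names `satisfies_fd` and `is_key` and the sample rows are assumptions, not part of the presentation.

```python
def satisfies_fd(rows, lhs, rhs):
    """True if any two rows with the same lhs-values have the same rhs-values."""
    seen = {}
    for row in rows:
        x = tuple(row[a] for a in lhs)
        y = tuple(row[a] for a in rhs)
        # Remember the first rhs-values seen for x; any later mismatch violates the FD.
        if seen.setdefault(x, y) != y:
            return False
    return True

def is_key(rows, attrs):
    """True if distinct rows always have distinct values on attrs."""
    projected = [tuple(row[a] for a in attrs) for row in rows]
    return len(set(projected)) == len(projected)

# Hypothetical instance over (number, title, section).
rows = [
    {"number": "CSC258", "title": "Computer Organization", "section": 1},
    {"number": "CSC258", "title": "Computer Organization", "section": 2},
    {"number": "CSC434", "title": "Database Systems", "section": 1},
]

fd_holds = satisfies_fd(rows, ["number"], ["title"])   # number -> title
number_is_key = is_key(rows, ["number"])               # CSC258 appears twice
pair_is_key = is_key(rows, ["number", "section"])
```

Here number → title holds although number is not a key, which is exactly the situation that BCNF rules out.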

  23. BCNF: a Normal Form for FDs • Σ is a set of FDs over R(A1, …, An). • Relation schema (R(A1, …, An), Σ) is in BCNF if for every nontrivial X → Y in Σ, X is a key. • A relational schema is in BCNF if every relation schema is in BCNF.
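A schema-level BCNF test can be sketched with the standard attribute-closure computation: X is a key exactly when the closure of X under Σ contains every attribute. This is a minimal sketch, not the presenters' algorithm. The function names and the example schemas are assumptions, and trivial FDs (Y ⊆ X) are skipped, following the usual convention.

```python
def closure(attrs, fds):
    """Attributes determined by attrs under the FDs (pairs of attribute tuples)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def is_bcnf(schema, fds):
    """True if the left-hand side of every nontrivial FD is a key of schema."""
    schema = set(schema)
    for lhs, rhs in fds:
        if set(rhs) <= set(lhs):
            continue  # trivial FD, imposes no condition
        if not schema <= closure(lhs, fds):
            return False
    return True

fds = [(("number",), ("title",))]

# Single-relation design from the example: number is not a key.
bad_ok = is_bcnf(("number", "title", "section", "room"), fds)

# Hypothetical decomposition: Course(number, title) and Section(number, section, room).
course_ok = is_bcnf(("number", "title"), fds)
section_ok = is_bcnf(("number", "section", "room"), [])
```

For a single relation schema it is enough to inspect the FDs given in Σ, so the loop does not need to enumerate all implied dependencies.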

  24. Normalization Theory Today • Normalization theory for relational databases was developed in the 70s and 80s. • Why do we need normalization theory today? • New data models have emerged: XML. • XML documents can contain redundant information. • Redundant information in XML documents: • Can be discovered if the user provides semantic information. • Can be eliminated.

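  25. XML Documents [Tree diagram: root courses with two course children, @cno “CSC258” and @cno “CSC434”; each course has a taken_by child containing student elements with attributes @sno, @name, @grade. The student with @sno “st1” and @name “Fox” appears twice, with grades “B+” and “A+”.]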

  26. XML Databases XML Schema: (D, Σ) • D: [DTD not captured in transcript] • Σ: two students with the same @sno value must have the same name.
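The dependency in Σ can be checked mechanically on a document. The sketch below uses Python's `xml.etree.ElementTree`; the XML text is an assumed serialization of the tree on the previous slide (attributes written without the @ prefix), and the function name `sno_determines_name` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed serialization of the example tree; the slides show it only as a diagram.
DOC = """
<courses>
  <course cno="CSC258">
    <taken_by>
      <student sno="st1" name="Fox" grade="B+"/>
    </taken_by>
  </course>
  <course cno="CSC434">
    <taken_by>
      <student sno="st1" name="Fox" grade="A+"/>
    </taken_by>
  </course>
</courses>
""".strip()

def sno_determines_name(root):
    """True if any two student elements with the same sno have the same name."""
    names = {}
    for student in root.iter("student"):
        sno, name = student.get("sno"), student.get("name")
        if names.setdefault(sno, name) != name:
            return False
    return True

root = ET.fromstring(DOC)
holds = sno_determines_name(root)

# The name of st1 is stored once per course taken: this is the redundancy.
copies_of_name = sum(1 for s in root.iter("student") if s.get("sno") == "st1")
```

The dependency holds, yet the name "Fox" is stored in two places, so a rename has to touch every copy to keep the document consistent.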

  27. Redundancy in XML [Tree diagram: the courses tree from the previous slides (course elements with @cno “CSC258” and “CSC434”, taken_by, and student elements with @sno, @name, @grade), together with an info element carrying @sno “st1” and @name “Fox”.]

  28. XML Database Normalization • DTD: [not captured in transcript] • Data dependency: two students with the same @sno value must have the same name.

  29. XML Database Normalization • DTD: [not captured in transcript; only the fragment “, info*” survives] • Data dependency: @sno is the identifier of info elements. Two students with the same @sno value must have the same name.

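  30. A “Non-relational” Example [Tree diagram: root DBLP with conf children; a conf has @title “ICDT” and issue children; issues contain article elements with attributes @year and @title. Three articles are shown, with @year “2001”, “1999” and “1999”; the values “2001” and “1999” also appear as @year attributes higher in the tree.]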

  31. XNF: XML Normal Form • Proposed in [AL02]. • It eliminates two types of anomalies. • It was defined for XML functional dependencies: • DBLP.conf.@title → DBLP.conf • DBLP.conf.issue → DBLP.conf.issue.article.@year

  32. Part 3: What was Missing? Justification! • What is a good database design? • Well-known solutions: BCNF, 4NF, … • But what is it that makes a database design good? • Elimination of update anomalies. • Existence of algorithms that produce good designs: lossless decomposition, dependency preservation. • Previous work was specific for the relational model. • Classical problems have to be revisited in the XML context.

  33. Justification of Normal Forms • Problematic to evaluate XML normal forms. • No XML update language has been standardized. • No XML query language yet has the same “yardstick” status as relational algebra. • We do not even know if implication of XML FDs is decidable! • We need a different approach. • It must be based on some intrinsic characteristics of the data. • It must be applicable to new data models. • It must be independent of query/update/constraint issues. • Our approach is based on information theory.

  34. Information Theory • Entropy measures the amount of information provided by a certain event. • Assume that an event can have n different outcomes with probabilities p1, …, pn. The entropy is Σ pi log(1/pi). • Entropy is maximal if each pi = 1/n, in which case it equals log n.
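As a quick numeric illustration, the entropy of a finite distribution can be computed as below. Base-2 logarithms are an assumption here, chosen because the next slide's value log 5 ≈ 2.322 matches base 2; the function name `entropy` is illustrative.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; outcomes with probability 0 contribute nothing."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

uniform = entropy([1 / 4] * 4)           # maximal for n = 4, equals log2(4)
skewed = entropy([0.5, 0.25, 0.25])      # strictly below log2(3)
certain = entropy([1.0, 0.0, 0.0])       # a certain outcome carries no information
```

The uniform case reaches log n, while any skew lowers the value, down to 0 when one outcome is certain.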

  35. Entropy and Redundancies • Database schema: R(A,B,C), A → B • Instance I: [table not captured in transcript] • Pick a domain properly containing adom(I): {1, …, 6} • Probability distribution: P(4) = 0 and P(a) = 1/5, a ≠ 4 • Entropy: log 5 ≈ 2.322 • Pick a domain properly containing adom(I): {1, …, 6} • Probability distribution: P(2) = 1 and P(a) = 0, a ≠ 2 • Entropy: log 1 = 0
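The two computations on this slide can be reproduced with a small sketch. The instance table is missing from the transcript, so the two-row instance below is an assumption chosen because it yields the slide's numbers. The rule that a replacement value must not create a duplicate row (set semantics) is likewise an assumption about how P(4) = 0 arises.

```python
import math

def satisfies_fd(rows, lhs, rhs):
    """True if rows agreeing on the lhs columns also agree on the rhs columns."""
    seen = {}
    for row in rows:
        x = tuple(row[i] for i in lhs)
        y = tuple(row[i] for i in rhs)
        if seen.setdefault(x, y) != y:
            return False
    return True

def position_entropy(rows, row_idx, col, domain, fds):
    """Entropy (bits) of the uniform distribution over values allowed at one position."""
    allowed = []
    for a in domain:
        candidate = [list(r) for r in rows]
        candidate[row_idx][col] = a
        tuples = [tuple(r) for r in candidate]
        if len(set(tuples)) != len(tuples):
            continue  # assumed rule: the value would merge two rows
        if all(satisfies_fd(tuples, lhs, rhs) for lhs, rhs in fds):
            allowed.append(a)
    return allowed, math.log2(len(allowed))

# Assumed instance of R(A,B,C); columns are indexed A=0, B=1, C=2.
instance = [(1, 2, 3), (1, 2, 4)]
fds = [((0,), (1,))]          # A -> B
domain = range(1, 7)          # {1, ..., 6}, properly contains adom(I)

c_values, c_entropy = position_entropy(instance, 0, 2, domain, fds)
b_values, b_entropy = position_entropy(instance, 0, 1, domain, fds)
```

Under these assumptions the C position of the first row admits five values (everything except 4), giving log 5, while the B position is forced to 2 by A → B, giving entropy 0: the redundant position carries no information.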

  36. Entropy and Normal Forms • Let Σ be a set of FDs over a schema S. • Theorem: (S, Σ) is in BCNF if and only if for every instance I of (S, Σ) and for every domain properly containing adom(I), each position carries a non-zero amount of information (entropy > 0). • This is a clean characterization of BCNF, but the measure is not accurate enough ...

  37. Problems with the Measure • The measure cannot distinguish between different types of data dependencies. • It cannot distinguish between different instances of the same schema: R(A,B,C), A → B [two instances shown, each labelled entropy = 0]

  38. A General Measure Instance I of schema R(A,B,C), A → B:

  39. A General Measure Instance I of schema R(A,B,C), A → B: Initial setting: pick a position p ∈ Pos(I) and pick k such that adom(I) ⊆ {1, …, k}. For example, k = 7.

  41. A General Measure Instance I of schema R(A,B,C), A → B: Initial setting: pick a position p ∈ Pos(I) and pick k such that adom(I) ⊆ {1, …, k}. For example, k = 7. Computation: for every X ⊆ Pos(I) − {p}, compute the probability distribution P(a | X), a ∈ {1, …, k}.

  42. A General Measure Instance I of schema R(A,B,C), A → B: Computation: for every X ⊆ Pos(I) − {p}, compute the probability distribution P(a | X), a ∈ {1, …, k}.

  44. A General Measure Instance I of schema R(A,B,C), A → B: Computation: for every X ⊆ Pos(I) − {p}, compute the probability distribution P(a | X), a ∈ {1, …, k}. P(2 | X) =

  48. A General Measure Instance I of schema R(A,B,C), A → B: Computation: for every X ⊆ Pos(I) − {p}, compute the probability distribution P(a | X), a ∈ {1, …, k}. P(2 | X) = 48/

  49. A General Measure Instance I of schema R(A,B,C), A → B: Computation: for every X ⊆ Pos(I) − {p}, compute the probability distribution P(a | X), a ∈ {1, …, k}. P(2 | X) = 48/ For a ≠ 2, P(a | X) =
