Well-designed XML Data

Well-designed XML Data Marcelo Arenas and Leonid Libkin University of Toronto

Outline • Part 1 - Database Normalization from the 1970s and 1980s • Part 2 - Classical theory revisited: normalizing XML documents • Part 3 - Classical theory re-done: new justifications for normalization

Part 1: Classical Normalization • Design: decide how to represent the information in a particular data model. • Even for simple application domains there is a large number of ways of representing the data of interest. • We have to design the schema of the database: the set of relations, the set of attributes for each relation, and the set of data dependencies.

Designing a Database: An Example • Attributes: number, title, section, room. • Data dependency: every course number is associated with only one title. • Relational Schema: BAD alternative:

Problems with BAD: Update Anomaly • Title of CSC258 is changed to Computer Organization I. • The instance stores redundant information.

Deletion Anomaly • CSC434 is not given in this term. • Additional effect: all the information about CSC434 was deleted.

Insertion Anomaly • A new course is created: (CSC336, Numerical Methods). • The instance stores attributes that are not directly related.

Avoiding Update Anomalies • Title of CSC258 is changed to Computer Organization I. • The instance does not store redundant information.

Avoiding Update Anomalies • CSC434 is not given in this term. • The title of CSC434 is not removed from the instance.

Avoiding Update Anomalies • A new course is created: (CSC336, Numerical Methods). • No information about sections has to be provided. • Each relation stores attributes that are directly related.
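The three anomalies and their repair can be replayed on a small example. The sketch below assumes the BAD design is a single relation over (number, title, section, room) and that the better design splits it into Course(number, title) and Section(number, section, room); the slide tables are not preserved, so the rows, the title of CSC434, and the helper names `decompose` and `join` are all hypothetical.

```python
# Hypothetical rows of the single-relation ("BAD") design:
# (number, title, section, room). All values are illustrative only.
bad = {
    ("CSC258", "Computer Organization", "1", "LP266"),
    ("CSC258", "Computer Organization", "2", "GB258"),
    ("CSC258", "Computer Organization", "3", "GB248"),
    ("CSC434", "Database Systems", "1", "GB248"),
}

def decompose(rows):
    """Split into Course(number, title) and Section(number, section, room)."""
    course = {(number, title) for number, title, _, _ in rows}
    section = {(number, sec, room) for number, _, sec, room in rows}
    return course, section

def join(course, section):
    """Natural join on number; rebuilds rows of the single-relation design."""
    titles = dict(course)
    return {(n, titles[n], s, r) for n, s, r in section}

course, section = decompose(bad)

# Update: the title of CSC258 is stored once, so one tuple changes.
course = {(n, "Computer Organization I" if n == "CSC258" else t)
          for n, t in course}

# Deletion: removing the sections of CSC434 keeps its title in Course.
section = {row for row in section if row[0] != "CSC434"}

# Insertion: a new course needs no section information.
course.add(("CSC336", "Numerical Methods"))
```

Joining the two relations produced by `decompose(bad)` gives back exactly the rows of `bad`, which is the lossless-decomposition property mentioned later in the talk.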

Normalization Theory • Main idea: a normal form defines a condition that a well-designed database should satisfy. • Normal form: syntactic condition on the database schema. • Defined for a class of data dependencies. • Main problems: • How to test whether a database schema is in a particular normal form. • How to transform a database schema into an equivalent one satisfying a particular normal form.

BCNF: a Normal Form for FDs • Functional dependency (FD) over R(A1, …, An): X → Y, where X, Y ⊆ {A1, …, An}. • X → Y: two rows with the same X-values must have the same Y-values. • number → title: two rows with the same course number must have the same title. • Key dependency: X → A1 … An. • X is a key: two distinct rows must have distinct X-values.
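The definition of an FD translates directly into a check on an instance. This is a minimal sketch, assuming rows are represented as dictionaries from attribute names to values; the function name `satisfies_fd` and the sample rows are hypothetical.

```python
def satisfies_fd(rows, lhs, rhs):
    """Return True if any two rows that agree on lhs also agree on rhs."""
    seen = {}  # X-values -> the Y-values first seen with them
    for row in rows:
        x = tuple(row[a] for a in lhs)
        y = tuple(row[a] for a in rhs)
        # setdefault stores y for a new x, otherwise returns the stored value
        if seen.setdefault(x, y) != y:
            return False
    return True

# Hypothetical instance; values are illustrative only.
rows = [
    {"number": "CSC258", "title": "Computer Organization", "section": "1"},
    {"number": "CSC258", "title": "Computer Organization", "section": "2"},
    {"number": "CSC434", "title": "Database Systems", "section": "1"},
]

holds = satisfies_fd(rows, ["number"], ["title"])        # number -> title
number_is_key = satisfies_fd(rows, ["number"], ["number", "title", "section"])
```

In this instance number → title holds, but number alone is not a key, because two rows share the number CSC258 and differ on section.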

BCNF: a Normal Form for FDs • Σ is a set of FDs over R(A1, …, An). • Relation schema (R(A1, …, An), Σ) is in BCNF if for every nontrivial X → Y in Σ, X is a key. • A relational schema is in BCNF if every relation schema in it is in BCNF.
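The BCNF test can be sketched with the standard attribute-closure computation. The code below reads "X is a key" as "X determines every attribute of the relation" (a superkey), which is an assumption about the slide's wording; the names `closure` and `is_bcnf` are hypothetical.

```python
def closure(attrs, fds):
    """Attributes determined by attrs under fds (list of (lhs, rhs) sets)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def is_bcnf(attributes, fds):
    """True if the left-hand side of every nontrivial FD is a (super)key."""
    for lhs, rhs in fds:
        if set(rhs) <= set(lhs):
            continue  # trivial FD, always allowed
        if not set(attributes) <= closure(lhs, fds):
            return False
    return True

fds = [({"number"}, {"title"})]
bad_in_bcnf = is_bcnf({"number", "title", "section", "room"}, fds)
course_in_bcnf = is_bcnf({"number", "title"}, fds)
```

With number → title as the only FD, the four-attribute relation fails the test because number does not determine section or room, while a relation over just (number, title) passes.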

Normalization Theory Today • Normalization theory for relational databases was developed in the 70s and 80s. • Why do we need normalization theory today? • New data models have emerged: XML. • XML documents can contain redundant information. • Redundant information in XML documents: • Can be discovered if the user provides semantic information. • Can be eliminated.

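XML Documents [Figure: tree of an XML document. The root courses has course children; each course has a @cno attribute (“CSC258”, “CSC434”) and a taken_by child containing student elements with attributes @sno, @name and @grade. Student values shown: (“st1”, “Fox”, “B+”) and (“st1”, “Fox”, “A+”).]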

XML Databases • XML Schema: (D, Σ). • D: DTD (not preserved in this transcript). • Σ: Two students with the same @sno value must have the same name.
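The dependency on @sno and @name can be checked mechanically on a document. This is a minimal sketch using Python's standard `xml.etree.ElementTree`; the sample document is a hypothetical rendering of the courses tree (attributes written without the `@` prefix), and `sno_determines_name` is an illustrative name.

```python
import xml.etree.ElementTree as ET

# Hypothetical document modelled on the courses example.
doc = ET.fromstring("""
<courses>
  <course cno="CSC258">
    <taken_by>
      <student sno="st1" name="Fox" grade="B+"/>
    </taken_by>
  </course>
  <course cno="CSC434">
    <taken_by>
      <student sno="st1" name="Fox" grade="A+"/>
    </taken_by>
  </course>
</courses>
""")

def sno_determines_name(root):
    """True if students with the same sno always have the same name."""
    names = {}
    for student in root.iter("student"):
        sno, name = student.get("sno"), student.get("name")
        if names.setdefault(sno, name) != name:
            return False
    return True

# How many times the name of student st1 is stored in the document.
copies = sum(1 for s in doc.iter("student") if s.get("sno") == "st1")
```

The dependency holds here, and the name of st1 is stored once per course taken, which is the redundancy the talk goes on to remove.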

Redundancy in XML [Figure: the courses tree from the previous example, with an added info child of courses carrying @sno “st1” and @name “Fox”; the student elements under taken_by are shown with @sno, @name and @grade, with values (“st1”, “Fox”, “B+”) and (“st1”, “Fox”, “A+”).]

XML Database Normalization • DTD: (not preserved in this transcript). • Data dependency: Two students with the same @sno value must have the same name.

XML Database Normalization • DTD: (not preserved in this transcript; the surviving fragment reads “, info*”). • Data dependency: @sno is the identifier of info elements. Two students with the same @sno value must have the same name.
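One way to picture the normalization step is as a document transformation: the name is removed from every student element and stored once in an info element identified by its sno. The sketch below is an assumption about the intended transformation, since the DTDs are not preserved; the sample document and the function name `normalize` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical document modelled on the courses example.
doc = ET.fromstring("""
<courses>
  <course cno="CSC258">
    <taken_by><student sno="st1" name="Fox" grade="B+"/></taken_by>
  </course>
  <course cno="CSC434">
    <taken_by><student sno="st1" name="Fox" grade="A+"/></taken_by>
  </course>
</courses>
""")

def normalize(root):
    """Move each student's name into a single info element per sno."""
    names = {}
    for student in root.iter("student"):
        # Remove the name from the student; remember the first one per sno.
        name = student.attrib.pop("name")
        names.setdefault(student.get("sno"), name)
    for sno, name in names.items():
        # Assumption: info elements are children of the root, keyed by sno.
        ET.SubElement(root, "info", sno=sno, name=name)
    return root

normalize(doc)
normalized_xml = ET.tostring(doc, encoding="unicode")
```

After the transformation each name is stored exactly once, so the dependency "same @sno implies same name" follows from @sno identifying info elements.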

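A “Non-relational” Example [Figure: tree of a DBLP document. The root DBLP has conf children; a conf has a @title (“ICDT”) and issue children; issues contain article elements, each with @year and @title. Values shown: @year labels “2001” and “1999”, and article years “2001”, “1999”, “1999” with titles elided as “. . .”.]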

XNF: XML Normal Form • Proposed in [AL02]. • It eliminates two types of anomalies. • It was defined for XML functional dependencies: • DBLP.conf.@title → DBLP.conf • DBLP.conf.issue → DBLP.conf.issue.article.@year

Part 3: What was Missing? Justification! • What is a good database design? • Well-known solutions: BCNF, 4NF, … • But what is it that makes a database design good? • Elimination of update anomalies. • Existence of algorithms that produce good designs: lossless decomposition, dependency preservation. • Previous work was specific for the relational model. • Classical problems have to be revisited in the XML context.

Justification of Normal Forms • Problematic to evaluate XML normal forms. • No XML update language has been standardized. • No XML query language yet has the same “yardstick” status as relational algebra. • We do not even know if implication of XML FDs is decidable! • We need a different approach. • It must be based on some intrinsic characteristics of the data. • It must be applicable to new data models. • It must be independent of query/update/constraint issues. • Our approach is based on information theory.

Information Theory • Entropy measures the amount of information provided by a certain event. • Assume that an event can have n different outcomes with probabilities p1, …, pn. • Entropy is maximal, and equal to log n, if each pi = 1/n.
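The formula on this slide is not preserved in the transcript. For reference, the standard Shannon entropy, which is presumably what the slide showed, can be written as follows, with logarithms taken in base 2 and terms with p_i = 0 contributing 0:

```latex
H(p_1, \dots, p_n) \;=\; \sum_{i=1}^{n} p_i \log_2 \frac{1}{p_i}
\;\le\; \log_2 n,
\qquad \text{with equality if and only if } p_1 = \dots = p_n = \tfrac{1}{n}.
```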

Entropy and Redundancies • Database schema: R(A,B,C), A → B. • Instance I: (table not preserved in this transcript). • First position: pick a domain properly containing adom(I): {1, …, 6}. • Probability distribution: P(4) = 0 and P(a) = 1/5, a ≠ 4. • Entropy: log 5 ≈ 2.322. • Second position: pick a domain properly containing adom(I): {1, …, 6}. • Probability distribution: P(2) = 1 and P(a) = 0, a ≠ 2. • Entropy: log 1 = 0.
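The two entropy values on this slide can be reproduced numerically. This is a small sketch assuming base-2 logarithms (consistent with log 5 ≈ 2.322) and the domain {1, …, 6}; the function name `entropy` is illustrative.

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits; outcomes with probability 0 contribute 0."""
    return sum(p * log2(1 / p) for p in probs if p > 0)

domain = range(1, 7)  # the domain {1, ..., 6}

# First distribution: P(4) = 0 and P(a) = 1/5 for a != 4.
first = [0 if a == 4 else 1 / 5 for a in domain]

# Second distribution: P(2) = 1 and P(a) = 0 for a != 2.
second = [1 if a == 2 else 0 for a in domain]

h_first = entropy(first)    # equals log2(5)
h_second = entropy(second)  # equals log2(1) = 0
```

A position whose value is forced by the FD has entropy 0, so it carries no information; that is the redundancy the measure is meant to detect.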

Entropy and Normal Forms • Let Σ be a set of FDs over a schema S. • Theorem: (S, Σ) is in BCNF if and only if for every instance I of (S, Σ) and for every domain properly containing adom(I), each position carries a non-zero amount of information (entropy > 0). • This is a clean characterization of BCNF, but the measure is not accurate enough ...

Problems with the Measure • The measure cannot distinguish between different types of data dependencies. • It cannot distinguish between different instances of the same schema: for R(A,B,C) with A → B, the two example instances (tables not preserved) both give entropy = 0.

A General Measure • Instance I of schema R(A,B,C), A → B. • Initial setting: pick a position p ∈ Pos(I) and pick k such that adom(I) ⊆ {1, …, k}. For example, k = 7. • Computation: for every X ⊆ Pos(I) – {p}, compute the probability distribution P(a | X), a ∈ {1, …, k}.

A General Measure • Instance I of schema R(A,B,C), A → B. • Computation: for every X ⊆ Pos(I) – {p}, compute the probability distribution P(a | X), a ∈ {1, …, k}. • P(2 | X) = 48/… • For a ≠ 2, P(a | X) = …
