
NF-SS: A Normal Form for Semistructured Schemata



Presentation Transcript


  1. DASWIS 2001. NF-SS: A Normal Form for Semistructured Schemata. Xiaoying Wu, Tok Wang Ling, Sin Yeung Lee, Mong Li Lee (National University of Singapore); Gillian Dobbie (University of Auckland, New Zealand)

  2. Outline
     • Motivation
     • Semistructured schema and its data tree
     • Integrity constraints for semistructured data
     • NF-SS: Normal Form for Semistructured Schemata
     • Designing a semistructured schema into NF-SS
     • Discussion of the design approach
     • Comparison with related proposals
     • Summary

  3. 1. Motivation: Example 1
     [Figure: schema tree. department (name) has one or more (+) course (cid, title); course has zero or more (*) student (sid, name, age); student has an optional (?) grade.]
     <!ELEMENT department (course+)>
     <!ATTLIST department name ID #REQUIRED>
     <!ELEMENT course (student*)>
     <!ATTLIST course cid ID #REQUIRED
                      title CDATA #IMPLIED>
     <!ELEMENT student (grade?)>
     <!ATTLIST student sid ID #REQUIRED
                       name CDATA #REQUIRED
                       age CDATA #IMPLIED>
     <!ELEMENT grade (#PCDATA)>

  4. 1. Motivation (cont.)
     • Redundancy: a student's name and age are repeated under every course the student takes.
     • Update anomalies:
       • Insertion
       • Rewriting
       • Deletion

  5. 1. Motivation: Example 2
     <!ELEMENT teacher (ClassRoom*)>
     <!ATTLIST teacher tid ID #REQUIRED
                       name CDATA #REQUIRED>
     <!ELEMENT ClassRoom (subject*)>
     <!ATTLIST ClassRoom room# ID #REQUIRED>
     <!ELEMENT subject (time)>
     <!ATTLIST subject cid ID #REQUIRED>
     <!ELEMENT time EMPTY>
     <!ATTLIST time day CDATA #REQUIRED
                    hour CDATA #REQUIRED>
     • Path anomaly: the schema doesn't reflect the integrity constraint tid, day, hour → cid, room#

  6. 2. Semistructured Schema and Data Tree
     [Figure: the schema tree of Example 1, annotated with E (object types), A (attributes), r (root object type) and the multiplicity symbols.]
     A semistructured schema is defined to be D = (E, A, B, P, R, r), where
     • E is a finite set of object types in D.
     • A is a finite set of attributes, disjoint from E.
     • B is a set of basic domain types such as string, integer and Boolean.
     • P is a function from E to object type definitions, in which each component carries a symbol from {*, +, ?, 1} called its multiplicity. E.g.: P(course) = student*
     • R is a function from E to the power set of A. E.g.: R(student) = {sid, name, age}
     • r ∈ E is called the object type of the root. E.g.: r = department
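
The tuple D = (E, A, B, P, R, r) maps naturally onto a small record type. The following is a minimal sketch, not the authors' implementation: the class name `Schema`, the dictionary encoding of P and R, and the `path` helper are assumptions made for illustration, and B (the basic domain types) is left out because nothing here uses it.

```python
from dataclasses import dataclass


@dataclass
class Schema:
    E: set    # object types
    A: set    # attributes (disjoint from E)
    P: dict   # object type -> list of (child object type, multiplicity)
    R: dict   # object type -> set of attribute names
    r: str    # root object type

    def path(self, target):
        """Return the schema path of an object type, or None if it is absent."""
        def walk(node, prefix):
            here = f"{prefix}/{node}"
            if node == target:
                return here
            for child, _multiplicity in self.P.get(node, []):
                found = walk(child, here)
                if found:
                    return found
            return None
        return walk(self.r, "")


# The department/course/student schema of Example 1.
D = Schema(
    E={"department", "course", "student", "grade"},
    A={"name", "cid", "title", "sid", "age"},
    P={"department": [("course", "+")],
       "course": [("student", "*")],
       "student": [("grade", "?")]},
    R={"department": {"name"},
       "course": {"cid", "title"},
       "student": {"sid", "name", "age"}},
    r="department",
)

print(D.path("student"))  # /department/course/student
```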

  7. 2. Semistructured Schema and Data Tree (cont.)
     A data tree T with respect to a semistructured schema D = (E, A, B, P, R, r) is defined to be a tree T = (V, lab, obj, att, val, root) that represents a database instance.
     [Figure: data tree. department (name: CS) has courses including cs5220 (title: data Mining) and cs4221 (title: database design). Student s01 (name: Jack, age: 21) occurs twice; student s02 (name: Tom) occurs once; one grade node has the value "A".]

  8. 2. Semistructured Schema and Data Tree (cont.)
     • The path of a node n in semistructured schema D is denoted PathD(n). E.g.: PathD for student is /department/course/student
     • The path of a node v in data tree T is denoted PathT(v). E.g.: PathT for student "s02" is /department/course/student
     • The target set of node n in T, T[n], is {v : v ∈ V, n ∈ E ∪ A, PathT(v) = PathD(n)}. E.g.: the target set T[student] includes the student node with sid "s02", among others.
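
A target set can be computed with one traversal of the data tree that compares each node's path with the schema path. The sketch below assumes a simple `Node` record (label, optional value, children) rather than the full tuple (V, lab, obj, att, val, root), and the assignment of students to courses in the sample tree is illustrative.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    label: str                     # tag, e.g. "student" or "sid"
    value: Optional[str] = None    # set for attribute nodes only
    children: List["Node"] = field(default_factory=list)


def target_set(root, schema_path):
    """Collect every node whose root-to-node path equals schema_path."""
    found = []

    def walk(node, prefix):
        path = f"{prefix}/{node.label}"
        if path == schema_path:
            found.append(node)
        for child in node.children:
            walk(child, path)

    walk(root, "")
    return found


def student(sid, name):
    return Node("student", children=[Node("sid", sid), Node("name", name)])


# Illustrative instance: which course holds which student is an assumption.
tree = Node("department", children=[
    Node("name", "CS"),
    Node("course", children=[Node("cid", "cs5220"), student("s01", "Jack")]),
    Node("course", children=[Node("cid", "cs4221"),
                             student("s01", "Jack"), student("s02", "Tom")]),
])

students = target_set(tree, "/department/course/student")
sids = [n.children[0].value for n in students]
print(sids)  # ['s01', 's01', 's02']
```

Student s01 appears twice in the result, which is the redundancy the motivation slides point at.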

  9. 2. Semistructured Schema and Data Tree (cont.)
     • Two nodes from two data trees w.r.t. schema D satisfy value equality (written =v) iff
       • they are attribute nodes with the same tag and the same value; or
       • they are object nodes with the same tag and their children are pairwise value equal.
     • Let T1 and T2 be two data trees w.r.t. schema D = (E, A, B, P, R, r), and let X ∈ E ∪ A. T1 and T2 agree on X, denoted T1 =X T2, iff the following condition holds: t1 ∈ T1[X], t2 ∈ T2[X], such that t1 =v t2.
     [Figure: the data tree of slide 7, repeated.]
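
Value equality is a recursive comparison, which a short function makes concrete. This is a sketch under two assumptions that the slide does not settle: nodes are encoded as `(tag, payload)` pairs, where the payload is a string for an attribute node and a list of children for an object node, and "pairwise" is read as an order-insensitive matching of children.

```python
def value_equal(a, b):
    """Compare two (tag, payload) nodes for value equality."""
    tag_a, payload_a = a
    tag_b, payload_b = b
    if tag_a != tag_b:
        return False
    # Attribute nodes: same tag and same value.
    if isinstance(payload_a, str) or isinstance(payload_b, str):
        return payload_a == payload_b
    # Object nodes: children must match one-to-one.
    if len(payload_a) != len(payload_b):
        return False
    remaining = list(payload_b)
    for child in payload_a:
        match = next((c for c in remaining if value_equal(child, c)), None)
        if match is None:
            return False
        remaining.remove(match)
    return True


s01_first = ("student", [("sid", "s01"), ("name", "Jack"), ("age", "21")])
s01_second = ("student", [("name", "Jack"), ("age", "21"), ("sid", "s01")])
s02 = ("student", [("sid", "s02"), ("name", "Tom")])

print(value_equal(s01_first, s01_second))  # True
print(value_equal(s01_first, s02))         # False
```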

  10. 3. Integrity Constraints for Semistructured Data
     • Extended Functional Dependency (EFD): Let D = (E, A, B, P, R, r) be a semistructured schema, and let X ⊆ E ∪ A and Y ⊆ E ∪ A. Y is extended functionally dependent on X, denoted X → Y. Let S denote a set of data trees that are images of D. S satisfies X → Y iff any two data trees T1, T2 in S that agree on every component of X also agree on Y; that is, for all T1, T2 ∈ S, if T1 =x T2 for every x ∈ X, then T1 =y T2 for every y ∈ Y.
     • Inference rules for EFDs, for any X, Y, Z ⊆ E ∪ A:
       E1 (reflexivity): if Y ⊆ X, then X → Y
       E2 (augmentation): if X → Y, then XZ → YZ
       E3 (transitivity): if X → Y and Y → Z, then X → Z
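
Rules E1 to E3 have the same shape as Armstrong's axioms for relational functional dependencies, so the usual closure computation from relational theory can be borrowed to test whether an EFD is derivable. The slides do not give this algorithm; the sketch below is that borrowed technique, and the string encoding of components (for example `"course@cid"`) is an assumption. The sample EFDs are taken from Example 5.1.

```python
def closure(start, efds):
    """Return every component derivable from `start` with rules E1-E3."""
    result = set(start)              # E1: a set determines its own members
    changed = True
    while changed:
        changed = False
        for lhs, rhs in efds:
            # E2 + E3: once lhs is covered, rhs can be added.
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def derivable(lhs, rhs, efds):
    """True if lhs -> rhs follows from the given EFDs."""
    return set(rhs) <= closure(lhs, efds)


EFDS = [
    ({"course@cid"}, {"course@title"}),
    ({"course@cid", "student@sid"}, {"grade"}),
    ({"student@sid"}, {"student@name", "student@age"}),
]

print(sorted(closure({"student@sid"}, EFDS)))
print(derivable({"course@cid", "student@sid"}, {"student@age"}, EFDS))  # True
```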

  11. 3. Integrity Constraints for Semistructured Data (cont.)
     • Notation: an EFD is written O1[@X1], …, Oi[@Xi], …, On-1[@Xn-1] → On[@Xn]
     • An EFD X → Y is a partial EFD if there exists an X' ⊂ X such that X' → Y; otherwise it is a full EFD. E.g.:
       (1) course[@cid], student[@sid] → student[@name] is a partial EFD
       (2) student[@sid] → student[@name] is a full EFD
     • X → Y is said to be coherent iff /X/Y is a path in D; otherwise it is called an incoherent EFD. E.g.: teacher[@tid], time[@day, @hour] → subject[@cid] is an incoherent EFD, since /teacher/time/subject is not a path in the schema.
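
Both tests on this slide can be sketched in a few lines. Two assumptions are made here and are not taken from the slides: `is_partial` only consults a set of explicitly listed EFDs (it does not derive new ones), and `is_coherent` reads "/X/Y is a path in D" as "the object types occur in this order along a schema path, not necessarily adjacent".

```python
from itertools import combinations


def is_partial(lhs, rhs, known_efds):
    """True if a proper, non-empty subset of lhs is listed as determining rhs."""
    items = sorted(lhs)
    for size in range(1, len(items)):
        for subset in combinations(items, size):
            if (frozenset(subset), frozenset(rhs)) in known_efds:
                return True
    return False


def is_coherent(object_types, schema_path):
    """True if object_types occur in this order along schema_path."""
    position, previous = -1, None
    for name in object_types:
        if name == previous:       # same object type mentioned twice in a row
            continue
        try:
            position = schema_path.index(name, position + 1)
        except ValueError:
            return False
        previous = name
    return True


KNOWN = {(frozenset({"student@sid"}), frozenset({"student@name"}))}
TEACHER_PATH = ["teacher", "ClassRoom", "subject", "time"]
COURSE_PATH = ["department", "course", "student", "grade"]

print(is_partial({"course@cid", "student@sid"}, {"student@name"}, KNOWN))  # True
print(is_partial({"student@sid"}, {"student@name"}, KNOWN))                # False
print(is_coherent(["teacher", "time", "subject"], TEACHER_PATH))           # False
print(is_coherent(["course", "student", "grade"], COURSE_PATH))            # True
```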

  12. 3. Integrity Constraints for Semistructured Data (cont.)
     • If there exists Y ⊆ E ∪ A such that X → Y, Y → Z and Y ↛ X, then Z is transitively extended functionally dependent on X via Y. E.g.: age is transitively dependent on course via student, since
       (1) course[@cid] → student[@sid]
       (2) student[@sid] → student[@age]
       (3) student[@sid] ↛ course[@cid]
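
The three conditions translate directly into a membership test. As a simplifying assumption, this sketch checks only EFDs that are listed explicitly in `known_efds` and treats any EFD that is not listed as not holding; the string encoding of components is also illustrative.

```python
def is_transitive(x, y, z, known_efds):
    """True if Z depends on X via Y: X -> Y, Y -> Z, and not Y -> X."""
    fs = frozenset
    return ((fs(x), fs(y)) in known_efds
            and (fs(y), fs(z)) in known_efds
            and (fs(y), fs(x)) not in known_efds)


KNOWN = {
    (frozenset({"course@cid"}), frozenset({"student@sid"})),
    (frozenset({"student@sid"}), frozenset({"student@age"})),
}

# age depends on course via student.
print(is_transitive({"course@cid"}, {"student@sid"}, {"student@age"}, KNOWN))  # True

# If sid also determined cid, the dependency would no longer be transitive.
WITH_REVERSE = KNOWN | {(frozenset({"student@sid"}), frozenset({"course@cid"}))}
print(is_transitive({"course@cid"}, {"student@sid"}, {"student@age"}, WITH_REVERSE))  # False
```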

  13. 3. Integrity Constraints for Semistructured Data (cont.)
     • Theorem: Let D = (E, A, B, P, R, r) be a semistructured schema and X, Y, Z ⊆ E ∪ A. If Z is transitively dependent on X via Y, then there exists a data tree of D in which a rewriting anomaly occurs upon updating the values of Z.

  14. 3. Integrity Constraints for Semistructured Data (cont.)
     • Key constraints: based on EFD semantics
     • Notation: KO = O1[@X1]/…/Oi[@Xi]/…/On[@Xn]/O[@X] denotes a key of an object type O in semistructured schema D, where /O1/…/O is a path in D. If the key consists of the single step O[@X], KO is called an absolute key; otherwise it is called a relative key.
     • Examples
       • Kbook = book[@isbn]. Kbook is an absolute key.
       • Kchapter = book[@isbn]/chapter[@number]. Kchapter is a relative key.
       • Ksection = book[@isbn]/chapter[@number]/section[@number]. Ksection is a relative key.
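
Following the three examples, a key with a single step is absolute and a key with ancestor steps is relative. The helper names `parse_key` and `key_kind` below are hypothetical, and the parser is a sketch that only handles the simple `name[@a, @b]` step syntax used on this slide.

```python
def parse_key(key):
    """Split 'book[@isbn]/chapter[@number]' into (object type, attributes) steps."""
    steps = []
    for step in key.split("/"):
        name, _, attrs = step.partition("[")
        attrs = attrs.rstrip("]")
        names = [a.strip().lstrip("@") for a in attrs.split(",") if a.strip()]
        steps.append((name.strip(), names))
    return steps


def key_kind(key):
    """Classify a key by its number of steps, as in the slide's examples."""
    return "absolute" if len(parse_key(key)) == 1 else "relative"


print(parse_key("book[@isbn]/chapter[@number]"))
print(key_kind("book[@isbn]"))                                   # absolute
print(key_kind("book[@isbn]/chapter[@number]/section[@number]"))  # relative
```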

  15. 3. Integrity Constraints for Semistructured Data (cont.)
     Let D be a semistructured schema and O be its root object type. The set of basic dependencies of D, denoted BD(D), is defined as follows:
     • Let X, Y be children of O. Non-trivial extended functional dependencies of the form X → Y, where X is a key of O or Y is part of a key of O, are in BD(D).
     • Let O1 be a sub-object type of O, and let D1 be the schema tree rooted at O1 with KO added as attribute(s) of O1. Then BD(D1) ⊆ BD(D).
     • No other non-trivial dependency is in BD(D).

  16. 4. NF-SS
     Let D be a semistructured schema and O be its root object type. D is in Normal Form for Semistructured Schemata (NF-SS) iff
     1. O has at least one key.
     2. For any non-trivial EFD of the form X → Y satisfied by O, where X and Y are attributes of O, either X is a key of O or Y is part of a key of O.
     3. For any sub-object type O1 of O:
        (a) the schema tree rooted at O1, with KO added to O1 as components and everything else unchanged, is in NF-SS;
        (b) KO ∩ KO1 = ∅ or KO ⊆ KO1, where KO and KO1 are the keys of O and O1 respectively;
        (c) O1 is not transitively dependent on KO.
     4. Any non-trivial EFD in D can be derived from BD(D) by using the inference rules for EFDs.

  17. 5. Designing Semistructured Schema into NF-SS
     • We adopt a restructuring approach to design.
     • We propose four heuristic restructuring rules, which work by
       • decomposing object types;
       • creating new object types;
       • regrouping the components of an object type.
     • Objective: remove transitive, partial and incoherent EFDs from the given dependency and key constraints.

  18. 5. Designing Semistructured Schema into NF-SS (cont.)
     Rule 1 (Remove Transitive Dependency by Decomposition). Given an object type O in a semistructured schema D, suppose some non-prime component(s) Y of O are transitively dependent on some key of O, i.e., KO → X, X → Y and X ↛ KO, and X ∩ KO = ∅. Then restructure the schema as follows:
     1. Duplicate X to form new node(s) Z.
     2. Move Y, all the descendants of Y and their corresponding edges under Z.
     3. Make X a foreign key of O, and add a reference edge from the original node X to Z.

  19. 5. Designing Semistructured Schema into NF-SS (cont.)
     • Example 5.1: schema D satisfies the following EFDs
       (1) department[@name] → course[@cid]
       (2) course[@cid] → department
       (3) course[@cid] → course[@title]
       (4) course[@cid] → student[@sid]
       (5) course[@cid], student[@sid] → grade
       (6) student[@sid] → student[@name, @age]
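
At the instance level, applying Rule 1 to EFD (6) amounts to moving each student's name and age into a separate student object and leaving sid behind as a reference. The sketch below works on nested dictionaries, which is an encoding chosen for illustration. Placing the new student objects directly under the department, the field name `sid_ref`, and the position of the grade "A" in the sample data are all assumptions.

```python
def decompose_students(department):
    """Move student name/age out of each course, keeping sid as a reference."""
    students = {}      # new object type, one entry per sid (node Z in Rule 1)
    courses = []
    for course in department["courses"]:
        enrolments = []
        for s in course["students"]:
            # Y = {name, age} moves under the new student node.
            students.setdefault(s["sid"], {"name": s["name"], "age": s.get("age")})
            # X = sid stays behind as a foreign key, next to the grade.
            enrolments.append({"sid_ref": s["sid"], "grade": s.get("grade")})
        courses.append({"cid": course["cid"], "title": course["title"],
                        "students": enrolments})
    return {"name": department["name"], "courses": courses, "students": students}


department = {
    "name": "CS",
    "courses": [
        {"cid": "cs5220", "title": "data Mining", "students": [
            {"sid": "s01", "name": "Jack", "age": "21", "grade": "A"}]},
        {"cid": "cs4221", "title": "database design", "students": [
            {"sid": "s01", "name": "Jack", "age": "21"},
            {"sid": "s02", "name": "Tom"}]},
    ],
}

result = decompose_students(department)
print(result["students"])
print(result["courses"][1]["students"])
```

After the rewrite, Jack's name and age are stored once, so changing them touches a single node.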

  20. 5. Designing Semistructured Schema into NF-SS (cont.)
     Rule 2 (Remove Path Anomaly by Path Splitting). Given a semistructured schema D, suppose there exists an incoherent EFD O1[@X1], …, On[@Xn] → Y, where Y is either an object type or an attribute, and there exists a path P that contains {O1, …, On, Y}. Path P can be split into two sub-paths P1 and P2, where P1 contains only {O1, …, On} and Y, while P2 contains {O1, …, On} and (P − Y).

  21. 5. Designing Semistructured Schema into NF-SS (cont.)
     • Example 5.2: schema D satisfies the following EFDs
       (1) teacher[@tid], time → ClassRoom
       (2) teacher[@tid], time → subject

  22. 5. Designing Semistructured Schema into NF-SS (cont.)
     Rule 3 (Removing Partial Dependency by Creating a New Object Type). Given an object type O in a semistructured schema, let X be a set of prime attributes of O, and let Y be the set of O's attributes. Let O1 be a sub-object type of O. If (KO − X) → O1 and no proper superset of X satisfies this property, then restructure the schema as follows:
     1. (KO ∩ Y − X) becomes the only attribute(s) of O, while O1 remains its sub-object type.
     2. Create a new object type O2 that is a direct component of O.
     3. Move the rest of the components of O, with all their descendants and corresponding edges, under O2.

  23. 5. Designing Semistructured Schema into NF-SS (cont.)
     • Example 5.3: schema D is shown in Figure (a). It satisfies the EFDs {O[@A, @B] → D, O[@A, @B] → O2, O[@A] → O1, O[@A] → E}, and the key of O is {A, B}.

  24. 5. Designing Semistructured Schema into NF-SS (cont.)
     Rule 4 (Restructuring to Satisfy Condition 3(b) of the NF-SS Definition). Given an object type O in a semistructured schema D, let X be a set of O's attributes and single-valued atomic sub-object types, and let O1 be a complex sub-object type of O. Suppose O1 has relative key KO1, but KO ⊄ KO1 and KO1 ⊄ KO. Let Y be KO ∩ KO1 ∩ X, with Y ≠ ∅. D is restructured as follows:
     1. O1 remains a sub-object type of O.
     2. Make Y components of O.
     3. Create a new object type O2 as a child of O; the remaining components of O (excluding Y) become children of O2.

  25. 5. Designing Semistructured Schema into NF-SS (cont.)
     • Example 5.4: schema D in Figure (a) satisfies the EFDs (1) O[@K, @A] → O1 and (2) O[@K, @B] → O2, and the key of O is {K, A, B}.

  26. 5. Designing Semistructured Schema into NF-SS (cont.)
     Algorithm 1: Restructuring Algorithm
     Input: a set S of semistructured schemas, and a set of EFDs for S.
     Output: a set of semistructured schemas that are in NF-SS.
     Begin
     1. for each semistructured schema D in S do
          if D is not in NF-SS then repeat until no further change:
            (1) if there exists a transitive EFD KO → X, X → Y and X ↛ KO for an object type O in D:
                  Case X ∩ KO = ∅: apply Rule 1 to remove the transitive EFD.
                  Case X ⊂ KO: apply Rule 3 to remove the transitive EFD.
                  Case X ∩ KO ≠ ∅: apply Rule 4 to remove the transitive EFD.
            (2) if there exists an incoherent EFD then apply Rule 2 to remove it.
     2. output S.
     End
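
The control flow of Algorithm 1 is a fixpoint loop: find a violation, apply the matching rule, and start over until nothing fires. The sketch below shows only that loop. The function `restructure`, the `(detect, fix)` pairs and the `max_rounds` guard are hypothetical names introduced for illustration, and the toy schema merely lists violation labels instead of modelling real EFDs or Rules 1 to 4.

```python
def restructure(schema, steps, max_rounds=100):
    """Apply (detect, fix) pairs until no detector reports a violation."""
    for _ in range(max_rounds):
        for detect, fix in steps:
            violation = detect(schema)
            if violation is not None:
                schema = fix(schema, violation)
                break                  # re-check everything after each change
        else:
            return schema              # no detector fired: fixpoint reached
    raise RuntimeError("no fixpoint reached within max_rounds")


# Toy stand-ins: a schema is just a list of outstanding violation labels.
def detector(kind):
    return lambda s: kind if kind in s["violations"] else None


def remove_violation(s, kind):
    remaining = list(s["violations"])
    remaining.remove(kind)
    return {"violations": remaining}


schema = {"violations": ["transitive", "incoherent", "transitive"]}
steps = [
    (detector("transitive"), remove_violation),   # stands in for Rules 1, 3, 4
    (detector("incoherent"), remove_violation),   # stands in for Rule 2
]

print(restructure(schema, steps))  # {'violations': []}
```

The order of the entries in `steps` decides which violation is repaired first, so different orders can lead to different final schemas.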

  27. 6. Discussion of the Restructuring Approach
     • Are the restructuring rules complete? No.
       • Covering is not guaranteed.
       • Dependency preservation is not guaranteed.
     • Does it give a unique solution? No.
       • The result depends on the order in which the dependencies are examined.
     • The design task becomes easier when more semantics are available.
       • In [5], we proposed another approach to designing semistructured databases using ORA-SS, a semantically rich model.
     • Nevertheless, the approach gives practical heuristics and provides insights into the normalization task for semistructured databases.

  28. 7. Comparison with Related Proposals
     • The first attempt to define a normal form for semistructured data ([ER'99] S. Y. Lee, M. L. Lee, T. W. Ling and L. A. Kalinichenko) [3]
       • Defines a schema called S3-Graph, which makes no distinction between element nodes and attribute nodes and has no cardinality specification.
       • Proposes S3-NF, but omits key constraints, an essential part of database design.
       • The decomposition method may not be able to remove some other kinds of anomalies that may exist in a schema, such as partial dependency and path anomaly.
     • The most recent proposal: XNF (XML Normal Form) ([ER 2001] D. W. Embley and W. Y. Mok) [2]
       • It mainly provides algorithms to translate a schema, represented in a conceptual model called CM hypergraphs, into a scheme-tree forest in XNF.
       • Like the S3-Graph, a scheme tree doesn't lend itself to XML definition.
       • XNF isn't formulated with the concept of key.
       • The algorithms given suffer from efficiency problems.
       • A large set of results is expected.

  29. 8. Summary
     • A normal form for semistructured schemata
       • It incorporates integrity constraints.
       • It guarantees no redundancy, and hence no undesirable update anomalies, for the conforming semistructured databases.
       • It gives more reasonable representations of real-world semantics.
     • A restructuring approach for designing semistructured databases
       • A set of heuristic restructuring rules is proposed.
       • An algorithm for iteratively restructuring a schema into NF-SS is developed.
       • It provides insights into the normalization task for semistructured databases.

  30. References
     1. J. Clark and S. DeRose. XML Path Language (XPath). W3C Working Draft, November 1999. http://www.w3.org/TR/xpath
     2. D. W. Embley and W. Y. Mok. Developing XML Documents with Guaranteed "Good" Properties. Proceedings of the 20th International Conference on Conceptual Modeling (ER), 2001.
     3. S. Y. Lee, M. L. Lee, T. W. Ling and L. A. Kalinichenko. Designing Good Semi-structured Databases. Proceedings of the 18th International Conference on Conceptual Modeling (ER), 1999.
     4. T. W. Ling and L. L. Yan. NF-NR: A Practical Normal Form for Nested Relations. Journal of Systems Integration, Vol. 4, 1994, pp. 309-340.
     5. Xiaoying Wu, Tok Wang Ling, Mong Li Lee and Gillian Dobbie. Designing Semistructured Databases Using the ORA-SS Model. Accepted for publication in Proceedings of the 2nd International Conference on Web Information Systems Engineering (WISE), IEEE Computer Society, Kyoto, Japan, December 2001.

  31. Q&A
