XML Constraints

Presentation Transcript

  1. XML Constraints Wenfei Fan University of Edinburgh and Bell Laboratories

  2. Outline of Part IV • XML Specifications: types and integrity constraints • Specification of XML constraints: • keys, foreign keys, FDs • absolute vs. relative constraints • Analysis of XML constraints • Consistency analysis • Implication analysis • Applications of XML constraints, and research issues • Relational storage of XML data via constraint propagation • Schema-directed XML integration • Normal forms, query optimization, updates, data cleaning . . .

  3. Introduction to XML specification • XML Specification: • types • integrity constraints • the need for XML constraints

  4. XML data - an example [Figure: XML tree rooted at db, with province and capital elements, @name and @inProvince attributes, and text values such as “Limburg” and “Hasselt”.] Rooted, node-labeled tree • elements: db, province, capital, city, subtree/sub-document elements/subelements, e.g., the capital child of province • @attributes: @name, @inProvince, carrying text • text nodes, with text but no label, e.g., “Hasselt”

  5. XML specification: DTD (type) [Figure: XML tree rooted at db, with province and capital elements, @name and @inProvince attributes, and text values such as “Limburg” and “Hasselt”.] • Production: constrains the subelement list of each element: <!ELEMENT db (province+, capital+)> <!ELEMENT province (city*, capital)> • Attributes: uniquely identified by name for each element, unordered; province: @name, capital: @inProvince

  6. XML specification: integrity constraints [Figure: XML tree rooted at db, with province and capital elements, @name and @inProvince attributes, and text values such as “Limburg” and “Hasselt”.] Keys and foreign keys (vs. relational constraints): • key: the value of a @name uniquely identifies a province: province.@name $\to$ province, capital.@inProvince $\to$ capital • FK: @inProvince of a capital references @name of a province: capital.@inProvince $\subseteq$ province.@name

  7. XML specification • A type (DTD) $D$ • A set of integrity constraints, $\Sigma$ Example: • DTD $D$: structure of the document, vs. types in a PL: <!ELEMENT db (province+, capital+)> <!ELEMENT province (city*, capital)> with attributes province.@name, capital.@inProvince • Constraints $\Sigma$: defined in terms of data values across elements: province.@name $\to$ province, capital.@inProvince $\to$ capital, capital.@inProvince $\subseteq$ province.@name

  8. Why XML constraints? Supported by the W3C XML standard and XML Schema. In databases (supported by the SQL standard), constraints are: • an essential part of the semantics of data, • fundamental to conceptual design, • useful for choosing efficient storage and access methods, • central to update anomaly prevention, • data cleaning … In the XML setting, constraints have proved useful in: • database storage of XML data (via constraint propagation), • schema-directed database publishing/integration in XML, • XML query optimization and formulation, • design theory for XML specifications: normal forms, • data cleaning, …

  9. Data exchange on the Web: XML publishing All members of a community (or industry) agree on a schema and exchange data w.r.t. the schema: e-commerce, health-care, ... Schema-Directed XML Publishing/Integration: • mapping data from traditional databases to XML • satisfying the predefined DTD and constraints [Figure: databases DB1 and DB2 published through an XML view Q as XML documents on the Web, subject to a DTD and constraints.]

  10. Data exchange on the Web: XML shredding XML shredding: • mapping XML data to relations • relational design: normalization via constraint propagation from XML to relations • optimal relational storage of XML data • semantic connection: query/update optimization [Figure: XML documents from the Web are shredded into relational databases DB1 and DB2; XML keys are propagated to relational FDs.]

  11. XML constraints • Specification of XML constraints: • keys, foreign keys, FDs • absolute vs. relative constraints

  12. The limitations of the XML standard (DTD) <!ATTLIST country name ID #REQUIRED> <!ATTLIST province capital ID #REQUIRED> <!ATTLIST capital inProvince IDREF #REQUIRED> • Scoping: • an ID is unique within the entire document (like oids), while a key needs only to uniquely identify a tuple within a relation • IDREF is untyped: one has no control over what it points to -- you point to something, but you don’t know what it is! <student id="01" name="Saddam" taking="qsx"/> <student id="02" name="Bush" taking="qsx 01"/> <course id="qsx"/>
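
The untyped nature of IDREF can be illustrated with a small sketch. It assumes Python's standard `xml.etree.ElementTree`, which does not validate against a DTD, so the ID/IDREFS behaviour is emulated by hand; the helper name `idref_targets`, the `db` wrapper element and the student names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical document mirroring the slide's student/course example.
DOC = ET.fromstring("""
<db>
  <student id="01" name="Ann" taking="qsx"/>
  <student id="02" name="Bob" taking="qsx 01"/>
  <course id="qsx"/>
</db>
""")

def idref_targets(root, ref_attr):
    """Map each referring element's id to the tags of the elements it points to."""
    # ID values live in one document-wide namespace, whatever the element type.
    by_id = {e.get("id"): e for e in root.iter() if e.get("id") is not None}
    targets = {}
    for e in root.iter():
        refs = e.get(ref_attr)
        if refs is not None:
            # Assumes every reference resolves; a dangling one raises KeyError.
            targets[e.get("id")] = [by_id[r].tag for r in refs.split()]
    return targets

RESULT = idref_targets(DOC, "taking")
print(RESULT)
```

Student 02 ends up "taking" another student: nothing in the ID/IDREF mechanism restricts the reference to course elements.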

  13. The limitations of the XML standard (DTD) • keys need to be multi-valued, while IDs must be single-valued (unary) enroll (sid: string, cid: string, grade:string) • a relation may have multiple keys, while an element can have at most one ID (primary) • ID/IDREF can only be defined in a DTD, while XML data may not come with a DTD/schema • ID/IDREF, even relational keys/foreign keys, fail to capture the semantics of hierarchical data – will be seen shortly A mixture of relational keys and object identities (oids) Mild extensions of relational constraints do not work for XML!

  14. Absolute constraints [Figure: XML tree rooted at db, with province and capital elements, @name and @inProvince attributes, and text values such as “Limburg” and “Hasselt”.] Absolute keys and foreign keys are to hold on the entire document: province.@name $\to$ province, capital.@inProvince $\to$ capital, capital.@inProvince $\subseteq$ province.@name. Extensions of relational counterparts.

  15. Absolute keys and foreign keys [PODS’00, 01, JACM] • key: $\tau[X] \to \tau$. An XML document satisfies the key iff $\forall x, y \in ext(\tau)\ (\bigwedge_{l \in X} (x.l = y.l) \to x = y)$ • foreign key (FK): a combination of an inclusion constraint $\tau_1[X] \subseteq \tau_2[Y]$ and a key $\tau_2[Y] \to \tau_2$. A document satisfies the FK iff it satisfies the key and $\forall x \in ext(\tau_1)\ \exists y \in ext(\tau_2)\ (x[X] = y[Y])$ • $\tau, \tau_1, \tau_2$: element types; $X, Y$: sets (lists) of attributes; • $ext(\tau)$: the set of $\tau$ elements in an XML document. Equality issue: • (string) value equality: when comparing attributes • node identity: when comparing XML elements Unary keys and foreign keys: defined in terms of a single attribute.
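
The two satisfaction conditions above can be checked directly on a small document. The following is a minimal sketch, assuming Python's standard `xml.etree.ElementTree`; the helper names `ext`, `satisfies_key` and `satisfies_fk` and the sample document are illustrative, and every element is assumed to carry the attributes in question.

```python
import xml.etree.ElementTree as ET

def ext(root, tag):
    """ext(tau): all elements of type `tag` anywhere in the document."""
    return list(root.iter(tag))

def satisfies_key(root, tag, attrs):
    """tau[X] -> tau: no two distinct tau elements agree on every attribute in X."""
    seen = set()
    for e in ext(root, tag):
        value = tuple(e.get(a) for a in attrs)  # string value equality
        if value in seen:
            return False
        seen.add(value)
    return True

def satisfies_fk(root, tag1, attrs1, tag2, attrs2):
    """tau1[X] included in tau2[Y], together with the key tau2[Y] -> tau2."""
    if not satisfies_key(root, tag2, attrs2):
        return False
    targets = {tuple(e.get(a) for a in attrs2) for e in ext(root, tag2)}
    return all(tuple(e.get(a) for a in attrs1) in targets
               for e in ext(root, tag1))

# Hypothetical instance: one province, two capitals naming the same province.
DOC = ET.fromstring("""
<db>
  <province name="Limburg"><capital inProvince="Limburg"/></province>
  <capital inProvince="Limburg"/>
</db>
""")

print(satisfies_key(DOC, "province", ["name"]))
print(satisfies_key(DOC, "capital", ["inProvince"]))
print(satisfies_fk(DOC, "capital", ["inProvince"], "province", ["name"]))
```

On this instance the province key and the foreign key hold, while the key on capital fails because two capital elements share the same @inProvince value.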

  16. Relative constraints [WWW’01, PODS’02, SICOMP] An XML tree specifies countries, provinces, province capitals. • What is a key for a province? • What does @inProvince of a capital reference? [Figure: XML tree rooted at db with country elements (@name “Belgium”, “Holland”); each country has province and capital subelements; both countries have a province named “Limburg”, and capital values include “Hasselt” and “Maastricht”.]

  17. Examples of relative constraints Relative constraints: on a subdocument rooted at a country: • key: country (province.@name $\to$ province), country (capital.@inProvince $\to$ capital) • FK: country (capital.@inProvince $\subseteq$ province.@name) Absolute: on the entire document: country.@name $\to$ country [Figure: XML tree rooted at db with country elements (@name “Belgium”, “Holland”); each country has province and capital subelements; both countries have a province named “Limburg”, and capital values include “Hasselt” and “Maastricht”.]

  18. Relative keys and foreign keys • key: $\tau(\tau_1[X] \to \tau_1)$. A document satisfies the key iff $\forall c \in ext(\tau)\ \forall y, z \in ext(\tau_1)\ ((y \prec c) \wedge (z \prec c) \wedge \bigwedge_{l \in X} (y.l = z.l) \to y = z)$ • foreign key (FK): $\tau(\tau_1[X] \subseteq \tau_2[Y])$ and a key $\tau(\tau_2[Y] \to \tau_2)$. A document satisfies the FK iff it satisfies the key and $\forall c \in ext(\tau)\ \forall y \in ext(\tau_1)\ ((y \prec c) \to \exists z \in ext(\tau_2)\ ((z \prec c) \wedge y[X] = z[Y]))$ where • $(y \prec c)$: $y$ is a descendant of $c$ ($y$ is in the subtree rooted at $c$); • $\tau$: context type; • $ext(\tau)$: the set of $\tau$ elements in an XML document.
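
The relative key condition differs from the absolute one only in that uniqueness is checked separately inside each context subtree. A minimal sketch, assuming Python's standard `xml.etree.ElementTree`; the function name `satisfies_relative_key` and the sample document are illustrative.

```python
import xml.etree.ElementTree as ET

def satisfies_relative_key(root, context, tag, attrs):
    """context(tag[X] -> tag): within each context element c, no two distinct
    descendants of type `tag` agree on all attributes in X."""
    for c in root.iter(context):
        seen = set()
        for y in c.iter(tag):
            if y is c:          # iter() also yields c itself; skip it
                continue
            value = tuple(y.get(a) for a in attrs)
            if value in seen:
                return False
            seen.add(value)
    return True

# Hypothetical instance: two countries, each with a province named Limburg.
DOC = ET.fromstring("""
<db>
  <country name="Belgium"><province name="Limburg"/></country>
  <country name="Holland"><province name="Limburg"/></country>
</db>
""")

print(satisfies_relative_key(DOC, "country", "province", ["name"]))
print(satisfies_relative_key(DOC, "db", "province", ["name"]))
```

With `country` as the context the key holds; with the root type `db` as the context it becomes an absolute key and fails, since both provinces are named Limburg.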

  19. Relative vs. Absolute • Absolute constraints are a special case of relative ones: country.@name $\to$ country is the same as db (country.@name $\to$ country); absolute means a fixed context type, the root type r • Absolute constraints are scoped within the entire document, whereas relative ones are scoped within the context of a subdocument: country (province.@name $\to$ province), country (capital.@inProvince $\to$ capital), country (capital.@inProvince $\subseteq$ province.@name), country.@name $\to$ country. Together they specify constraints on the entire document • Beyond relational constraints; important for hierarchically structured data: XML, scientific databases, biomedical data, ...

  20. Define keys with path expressions [Figure: XML tree rooted at db with company, government and university elements; employee elements occur at different depths (e.g., under dept), with name and @id; a name may have firstName and lastName subelements.] • XML data is hierarchically structured! “name” as a key for employees of companies only: the target set is identified with a path expression: //company//employee • XML data is semistructured: it may not have a DTD/schema! • key paths may be missing or have multiple occurrences • key specification should be independent of types

  21. Path expressions Path expression: navigating XML trees A simple yet powerful path language: $q ::= \epsilon \mid l \mid q/q \mid //$ • $\epsilon$: empty path • $l$: tag • $q/q$: concatenation • $//$: descendants and self – recursively descending downward
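
A small evaluator makes the four constructs of this path language concrete. This is a sketch under assumptions: a path is given as a list of steps, where each step is either a tag (read here as "children with that tag") or the string "//", and the empty list stands for the empty path. The function name `evaluate` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

def evaluate(node, steps):
    """Return the nodes reached from `node` by following `steps` in order."""
    current = [node]
    for step in steps:
        reached = []
        for n in current:
            if step == "//":
                reached.extend(n.iter())  # n itself and all its descendants
            else:
                reached.extend(child for child in n if child.tag == step)
        # Drop duplicates by node identity while keeping the order.
        seen, current = set(), []
        for n in reached:
            if id(n) not in seen:
                seen.add(id(n))
                current.append(n)
    return current

DOC = ET.fromstring("""
<db>
  <company><dept><employee id="e1"/></dept><employee id="e2"/></company>
  <university><employee id="e3"/></university>
</db>
""")

# //company//employee: employees at any depth below a company.
COMPANY_EMPLOYEES = evaluate(DOC, ["//", "company", "//", "employee"])
print(sorted(e.get("id") for e in COMPANY_EMPLOYEES))
```

The empty path returns the start node itself, and `//` followed by a tag reaches elements with that tag at any depth.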

  22. Absolute path constraints [WWW’01] Absolute key: $(Q, \{P_1, \ldots, P_k\})$ • Path expressions $Q$, $P_i$: XPath, regular path expressions, … • target path $Q$: to identify a target set $[[Q]]$ of nodes on which the key is defined (vs. relation) • a set of key paths $\{P_1, \ldots, P_k\}$: to provide an identification for nodes in $[[Q]]$ (vs. key attributes) • semantics: for any two nodes in $[[Q]]$, if they have all the key paths and agree on them by value equality (existential), then they must be the same node (value equality and node identity) Examples: • (//company//employees, {name, phone}) -- composite key • (//company//employees, {//@id}) -- multiple keys • (//., {@id}) -- capturing ID attributes in DTDs

  23. Value equality on trees [Figure: XML tree rooted at db with four person elements carrying @phone attributes and name subelements; a name has firstName and lastName children with text values such as “George” and “Bush”.] Two nodes are value equal iff • either they are text nodes (PCDATA) with the same value; • or they are attributes with the same tag and the same value; • or they are elements having the same tag and their children are pairwise value equal E.g.: two value-equal names ...
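
The recursive definition translates almost line by line into code. A minimal sketch, assuming Python's standard `xml.etree.ElementTree`, where an element's text and attributes are stored on the element rather than as separate nodes; mixed content (text following a child element) is ignored for simplicity. The function name `value_equal` is illustrative.

```python
import xml.etree.ElementTree as ET

def value_equal(a, b):
    """Same tag, same attributes, same text, and pairwise value-equal children."""
    if a.tag != b.tag:
        return False
    if a.attrib != b.attrib:                       # attributes are unordered
        return False
    if (a.text or "").strip() != (b.text or "").strip():
        return False
    if len(a) != len(b):                           # same number of subelements
        return False
    return all(value_equal(x, y) for x, y in zip(a, b))

A = ET.fromstring("<name><firstName>George</firstName><lastName>Bush</lastName></name>")
B = ET.fromstring("<name><firstName>George</firstName><lastName>Bush</lastName></name>")
C = ET.fromstring("<name><firstName>George</firstName></name>")

print(value_equal(A, B), value_equal(A, C), A is B)
```

`A` and `B` are value equal but are still two distinct nodes, which is exactly the distinction between value equality and node identity used in the key semantics.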

  24. Capturing the semistructured nature [Figure: XML tree rooted at db with four person elements; some have @phone, some have a structured name (firstName, lastName), and one has a plain-text name “JohnDoe”.] • independent of types • no structural requirement: tolerating missing/multiple paths Example keys: (person, {name}) and (person, {name, @phone})
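
The tolerance for missing or repeated key paths can be seen in a small checker for keys of the form (Q, {P1, ..., Pk}). This is a sketch under assumptions: key paths are restricted to a child tag or an `@attribute`, two nodes "agree" on a path when the sets of values they reach share a value-equal member, and value equality is computed through a canonical tuple form. The names `canon`, `key_values` and `satisfies_path_key` and the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET

def canon(e):
    """Canonical form: two elements are value equal iff their forms are equal."""
    return (e.tag, tuple(sorted(e.attrib.items())),
            (e.text or "").strip(), tuple(canon(c) for c in e))

def key_values(node, path):
    """Values reached from `node` via one key path ('@attr' or a child tag)."""
    if path.startswith("@"):
        v = node.get(path[1:])
        return set() if v is None else {("@", path[1:], v)}
    return {canon(c) for c in node if c.tag == path}

def satisfies_path_key(targets, key_paths):
    """No two distinct target nodes agree on every key path."""
    for i, x in enumerate(targets):
        for y in targets[i + 1:]:
            # A missing path yields an empty set, so the pair cannot agree on it.
            if all(key_values(x, p) & key_values(y, p) for p in key_paths):
                return False
    return True

DOC = ET.fromstring("""
<db>
  <person phone="123-4567"><name><firstName>George</firstName><lastName>Bush</lastName></name></person>
  <person phone="234-5678"><name><firstName>George</firstName><lastName>Bush</lastName></name></person>
  <person><name>JohnDoe</name></person>
</db>
""")
PERSONS = DOC.findall("person")

print(satisfies_path_key(PERSONS, ["name"]))
print(satisfies_path_key(PERSONS, ["name", "@phone"]))
```

On this data (person, {name}) is violated by the first two persons, while (person, {name, @phone}) holds; the third person has no @phone and is simply not constrained by the second key.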

  25. Relative path constraints [WWW’01] Relative key: $(Q, K)$ • path $Q$ identifies a set $[[Q]]$ of nodes, called the context path; • $K = (Q', \{P_1, \ldots, P_k\})$ is a key on sub-documents rooted at nodes in $[[Q]]$ (relative to $Q$). Example: (//country, (province, {@capital})); (//country, {@name}) -- absolute key • Absolute keys are a special case of relative keys: $(Q, K)$ when $Q$ is the empty path • Similarly for foreign keys Specification of XML constraints is more involved than its relational counterpart

  26. Keys and foreign keys in XML Schema key: $(Q, \{P_1, \ldots, P_k\})$ • Path expressions $Q$, $P_i$: fragments of XPath • Uniqueness and existence: for each node $x$ in $[[Q]]$ and each $i$ in $[1, k]$, there exists a unique node $y_i$ reached via $P_i$, and $y_i$ is either a text node or an attribute Foreign keys: $(Q, \{P_1, \ldots, P_k\}) \subseteq (S, \{S_1, \ldots, S_k\})$ • $(S, \{S_1, \ldots, S_k\})$ is a key • Uniqueness and existence: both $P_i$ and $S_i$ The uniqueness and existence condition complicates the consistency and implication analyses. Absolute constraints.

  27. Other constraints for XML Functional dependencies: $\{P_1, \ldots, P_k\} \to \{S_1, \ldots, S_k\}$ • Generalizations of relational FDs – for deriving an extension of relational-schema normal forms • Absolute constraints [Arenas and Libkin, PODS’02] XICs: $\forall x_1 \ldots \forall x_n\ (B(x_1, \ldots, x_n) \to \bigvee_{i \in [1, l]} \exists y_1 \ldots \exists y_k\ C_i(x_1, \ldots, x_n, y_1, \ldots, y_k))$ • Generalization of relational embedded constraints • $B$, $C_i$: conjunctions of simple XPath expressions • Subsuming relative keys and foreign keys (Deutsch and Tannen, [KRDB’01])

  28. Constraint analysis • Analysis of XML constraints • Consistency analysis • Implication analysis • Absolute, relative, path-expression constraints

  29. Consistency of XML specifications Given $D$: a DTD; $\Sigma$: a set of integrity constraints over $D$. Consistency: Is there an XML document that both conforms to $D$ and satisfies $\Sigma$? One wants to know whether XML specifications make sense! Run-time check: attempts to validate documents with $(D, \Sigma)$. This would not tell us whether repeated failures are due to a bad specification or problems with the documents $\Rightarrow$ static analysis is desirable

  30. An inconsistent specification The specification with $D$ and $\Sigma$ is inconsistent! • DTD $D$: <!ELEMENT db (province+, capital+)> <!ELEMENT province (city*, capital)> with attributes province.@name, capital.@inProvince • Constraints $\Sigma$: province.@name $\to$ province, capital.@inProvince $\to$ capital, capital.@inProvince $\subseteq$ province.@name In contrast, one can specify keys and foreign keys in SQL without worrying about their consistency with the schema.

  31. Cardinality constraints by keys, foreign keys Constraints $\Sigma$: province.@name $\to$ province, capital.@inProvince $\to$ capital, capital.@inProvince $\subseteq$ province.@name Notation: • $ext(\tau)$: the set of $\tau$ elements in an XML document • $ext(\tau.l)$: the set of $l$ attribute values of all $\tau$ elements $\Rightarrow$ $|ext(province.@name)| = |ext(province)|$, $|ext(capital.@inProvince)| = |ext(capital)|$, $|ext(capital.@inProvince)| \le |ext(province.@name)|$ $\Rightarrow$ $|ext(capital)| \le |ext(province)|$

  32. Cardinality constraints imposed by DTDs DTD $D$: <!ELEMENT db (province+, capital+)> <!ELEMENT province (city*, capital)> Variables: • $X_{province}$: the number of province elements under the root • $X_{capital}$: the number of capital subelements of the root • $Y_{capital}$: the number of capital subelements of provinces Constraints from $D$: $X_{province} \ge 1$, $X_{capital} \ge 1$, $|ext(province)| = X_{province}$, $X_{province} = Y_{capital}$, $|ext(capital)| = X_{capital} + Y_{capital}$ $\Rightarrow$ $|ext(capital)| > |ext(province)|$

  33. The interaction [Figure: XML tree rooted at db, with province and capital elements, @name and @inProvince attributes, and text values such as “Limburg” and “Hasselt”.] Contradiction: • From the constraints $\Sigma$: $|ext(capital)| \le |ext(province)|$ • From the DTD $D$: $|ext(capital)| > |ext(province)|$ Thus there exists NO XML document that both conforms to $D$ and satisfies $\Sigma$.
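
The two cardinality bounds can be put side by side as one short derivation. It uses only the equalities and inequalities stated on the two preceding slides; the annotations on the right name the constraint or DTD production each step comes from.

```latex
\begin{align*}
|ext(\text{capital})| &= |ext(\text{capital}.@\text{inProvince})| && \text{(key on capital)}\\
  &\le |ext(\text{province}.@\text{name})| && \text{(foreign key)}\\
  &= |ext(\text{province})| && \text{(key on province)}\\[4pt]
|ext(\text{capital})| &= X_{capital} + Y_{capital} && \text{(capitals under db and under provinces)}\\
  &= X_{capital} + X_{province} && \text{(one capital per province)}\\
  &\ge 1 + |ext(\text{province})| > |ext(\text{province})| && (X_{capital} \ge 1)
\end{align*}
```

The first chain gives an upper bound and the second a strict lower bound by the same quantity, so no document can meet both.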

  34. Consistency analysis [PODS’01, 02, JACM, SICOMP] • Trivial for relational databases: given any schema and keys, foreign keys, one can always find a nonempty instance of the schema satisfying the constraints. • Hard for XML: XML specifications may not be consistent! • Both DTDs and constraints impose cardinality constraints • The interaction between these two classes of cardinality constraints is rather complicated.

  35. Consistency analysis of XML constraints Theorem: The consistency problem is • undecidable for multi-attribute absolute keys and foreign keys; • NP-complete for unary absolute keys and foreign keys, even for primary keys (primary: at most one key for each element type); • in NEXPTIME for primary multi-attribute absolute keys and unary foreign keys; • in 2NEXPTIME and PSPACE-hard for unary absolute regular keys and foreign keys (target path: $\beta.\tau$, where $\beta$ is a regular path expression and $\tau$ an element type; key paths: attributes); • undecidable for relative keys and foreign keys, even when all the constraints are unary and primary. As opposed to the trivial analysis of the relational counterpart.

  36. Proof ideas • Multi-attribute constraints: reduction from the implication problem for functional and inclusion dependencies in RDBs. • Unary keys and foreign keys: • a nontrivial encoding of DTDs and unary constraints in terms of linear integer constraints ($O(n^2 \log n)$ time); • polynomially equivalent to LIP, linear integer programming • Multi-attribute primary keys and unary foreign keys: • polynomially equivalent to the Prequadratic Diophantine Problem (PDE): satisfiability of linear integer constraints and prequadratic constraints of the form $x \le y \cdot z$; • the precise complexity of PDE, a restriction of Hilbert’s 10th problem, is open -- nontrivial.

  37. Proof idea for relative constraints Theorem: The consistency problem is undecidable for relative keys and foreign keys, even when all the constraints are unary and are under the primary key restriction. As opposed to the NP complexity of its absolute counterpart. Proof idea: reduction from Hilbert’s 10th problem. Diophantine equation problem: $P_1(x_1, \ldots, x_k) = Q_1(x_1, \ldots, x_k) + c_1$, $\ldots$, $P_n(x_1, \ldots, x_k) = Q_n(x_1, \ldots, x_k) + c_n$

  38. More on regular-expression constraints [Figure: XML tree rooted at db with university, government and company elements; employee elements (with @eid) occur directly and under dept; student elements carry @taughtBy.] XML data is hierarchically structured: • define @eid as a key of employees of companies and schools; • define @taughtBy as a foreign key of students referencing @eid of school employees.

  39. Examples of regular constraints [Figure: XML tree rooted at db with university, government and company elements; employee elements (with @eid) occur directly and under dept; student elements carry @taughtBy.] Key: (university._* + company._*).employee.@eid $\to$ (university._* + company._*).employee FK: _*.student.@taughtBy $\subseteq$ university._*.employee.@eid • _: wildcard that matches any label • _*: the Kleene closure of _

  40. Regular path expression Vertical regular expressions: $\beta ::= \epsilon \mid \tau \mid \_ \mid \beta.\beta \mid \beta + \beta \mid \beta^*$ • $\epsilon$: empty word; $\tau$: element type; _: wildcard; “., +, *”: concatenation, disjunction, Kleene star Example: (university._* + company._*).employee, university._*.employee $nodes(\beta.\tau)$: the set of $\tau$ elements in an XML document that are reachable from the root by following $\beta$
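
One way to compute such node sets is to translate a vertical regular expression into an ordinary regular expression over root-to-node label paths. The sketch below makes several assumptions: the path of a node lists the labels below the root, each followed by "/"; labels are plain identifiers; and the expression passed in already includes the final element type. The names `to_regex` and `nodes` and the sample document are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

TOKEN = re.compile(r"\s*([A-Za-z][\w-]*|_|[.+*()])")

def to_regex(beta):
    """Translate e.g. '(university._* + company._*).employee' to a Python regex."""
    beta = beta.strip()
    parts, pos = [], 0
    while pos < len(beta):
        m = TOKEN.match(beta, pos)
        if m is None:
            raise ValueError(f"unexpected character at position {pos}")
        tok, pos = m.group(1), m.end()
        if tok == "_":
            parts.append("(?:[^/]+/)")      # wildcard: any single label
        elif tok == ".":
            continue                         # concatenation is juxtaposition
        elif tok == "+":
            parts.append("|")                # disjunction
        elif tok == "*":
            parts.append("*")                # Kleene star
        elif tok == "(":
            parts.append("(?:")
        elif tok == ")":
            parts.append(")")
        else:
            parts.append("(?:" + re.escape(tok) + "/)")
    return "".join(parts)

def nodes(root, beta):
    """Elements whose label path from the root matches beta, in document order."""
    pattern = re.compile(to_regex(beta))
    found = []
    def visit(node, path):
        for child in node:
            child_path = path + child.tag + "/"
            if pattern.fullmatch(child_path):
                found.append(child)
            visit(child, child_path)
    visit(root, "")
    return found

DOC = ET.fromstring("""
<db>
  <university><dept><employee eid="u1"/><student taughtBy="u1"/></dept></university>
  <company><employee eid="c1"/></company>
  <government><employee eid="g1"/></government>
</db>
""")

TARGETS = nodes(DOC, "(university._* + company._*).employee")
print([e.get("eid") for e in TARGETS])
```

The government employee is excluded from the target set, whereas `_*.employee` selects every employee element in the document.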

  41. Regular expression constraints • key: $\beta.\tau[X] \to \beta.\tau$. A document satisfies the key iff $\forall x, y \in nodes(\beta.\tau)\ (\bigwedge_{l \in X} (x.l = y.l) \to x = y)$ • foreign key: $\beta_1.\tau_1[X] \subseteq \beta_2.\tau_2[Y]$ and a key $\beta_2.\tau_2[Y] \to \beta_2.\tau_2$. A document satisfies the FK iff it satisfies the key and $\forall x \in nodes(\beta_1.\tau_1)\ \exists y \in nodes(\beta_2.\tau_2)\ (x[X] = y[Y])$ where $nodes(\beta.\tau)$ is the set of $\tau$ elements reachable from the root by following $\beta$.

  42. Regular: an extension of absolute constraints Example: Key: (university._* + company._*).employee.@eid $\to$ (university._* + company._*).employee FK: _*.student.@taughtBy $\subseteq$ university._*.employee.@eid Observation: $nodes(\_^*.\tau) = ext(\tau)$ Recall absolute constraints: • key: $\tau[X] \to \tau$ corresponds to $\_^*.\tau[X] \to \_^*.\tau$ • foreign key: $\tau_1[X] \subseteq \tau_2[Y]$, $\tau_2[Y] \to \tau_2$ corresponds to $\_^*.\tau_1[X] \subseteq \_^*.\tau_2[Y]$, $\_^*.\tau_2[Y] \to \_^*.\tau_2$

  43. Consistency analysis of regular constraints Corollary: The consistency problem is undecidable for multi-attribute regular keys and foreign keys. Theorem: It is decidable in 2NEXPTIME and is PSPACE-hard for unary regular constraints. 2NEXPTIME: an involved encoding in terms of LIP • regular expressions in a DTD interact with (vertical) regular path expressions: reduce the DTD to a simple normal form • regular path expressions interact with each other: introduce exponentially many variables for all Boolean combinations • encoding “reachability” ($nodes(\beta.\tau)$) of a path expression: tag variables with states of finite state automata

  44. Some tractable cases • Restrictions on constraints. Theorem: For multi-attribute relative keys only, the consistency problem is in linear time for arbitrary DTDs. Recall relative keys: country (province.@name $\to$ province) In contrast, due to the existence and uniqueness condition: Theorem: It is intractable for unary keys alone in XML Schema. • Restrictions on DTDs: Theorem: When the DTD is fixed, the consistency problem is in PTIME for absolute unary keys and foreign keys. In practice, a DTD is designed once, but constraints are written in stages: constraints are incrementally added.

  45. Implication analysis [PODS’00, 01, 02, DBPL’01] Given $D$: a DTD; $\Sigma$: a set of constraints expressed in $C$; $\varphi$: a property (a constraint of $C$). Implication (for $C$): Is it the case that for any XML document, if it conforms to $D$ and satisfies $\Sigma$, then it must satisfy $\varphi$? $C$: a constraint language The need for studying implication: • data integration: constraint checking at virtual views • optimization of XML queries and XML relational storage • design theory for XML specifications: normalization

  46. Some complexity results for implication analysis Theorem: The implication problem is • undecidable for multi-attribute absolute keys and foreign keys, and for unary relative keys and foreign keys; • PSPACE-hard for unary regular absolute keys and foreign keys; • coNP-complete for unary absolute keys and foreign keys. • coNP-hard for XML-Schema unary keys • in linear time for absolute multi-attribute keys; • in PTIME for arbitrary absolute keys and foreign keys when the DTD is fixed, and • in PTIME for relative path keys in the absence of DTDs The analysis of XML constraints is far more intricate than its relational counterpart

  47. Applications • Application of XML constraints, and open problems • Constraint propagation • Schema-directed XML integration • Normal form • Query rewriting/optimization • Update processing • Data cleaning • . . .

  48. XML shredding: relational storage of XML data XML shredding: • mapping XML data to relations • relational design: normalization • optimal relational storage of XML data • semantic connection: query/update optimization [Figure: XML documents from the Web are shredded into relational databases DB1 and DB2; XML keys are propagated to relational FDs.]

  49. Example: XML constraints [Figure: XML tree rooted at db with book elements; a book has isbn, title and chapter subelements; a chapter has number, title and section subelements; a section has number and text; values include “XML”, “XPath” and “DTD”.] • (//book, {isbn}) -- isbn is an (absolute) key of book • (//book, (chapter, {number})) -- number is a key of chapter relative to book • (//book, (title, {})) -- each book has a unique title

  50. Mapping from XML to a predefined relation [Figure: XML tree rooted at db with book elements; a book has isbn, title and chapter subelements; a chapter has number, title and section subelements; a section has number and text; values include “XML”, “XPath” and “DTD”.] Predefined RDB: chapter(bookTitle, chapterNum, chapterTitle) • Mapping: for each book, extract its title, and the numbers and titles of all its chapters • Predefined relational key: (bookTitle, chapterNum) Can the XML data be mapped to the RDB without violating the key?
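
The question can be explored on a concrete instance. The sketch below assumes Python's standard `xml.etree.ElementTree`, models isbn, title and number as subelements, and uses hypothetical helper names (`shred`, `violates_key`) and made-up isbn values. The sample document satisfies the three XML keys of the previous slide.

```python
import xml.etree.ElementTree as ET

def shred(root):
    """Build chapter(bookTitle, chapterNum, chapterTitle) rows from the document."""
    rows = []
    for book in root.iter("book"):
        book_title = book.findtext("title")
        for chapter in book.findall("chapter"):
            rows.append((book_title,
                         chapter.findtext("number"),
                         chapter.findtext("title")))
    return rows

def violates_key(rows, key_columns):
    """True if two rows agree on all key columns."""
    seen = set()
    for row in rows:
        k = tuple(row[i] for i in key_columns)
        if k in seen:
            return True
        seen.add(k)
    return False

# Two different books (distinct isbn) that happen to share the title "XML".
DOC = ET.fromstring("""
<db>
  <book><isbn>111</isbn><title>XML</title>
    <chapter><number>1</number><title>XPath</title></chapter></book>
  <book><isbn>222</isbn><title>XML</title>
    <chapter><number>1</number><title>DTD</title></chapter></book>
</db>
""")

ROWS = shred(DOC)
print(ROWS)
print(violates_key(ROWS, (0, 1)))   # key (bookTitle, chapterNum)
```

For this instance the relational key (bookTitle, chapterNum) is violated even though isbn is a key of book and number is a key of chapter relative to book, because the title does not identify a book.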