Validating Streaming XML Documents Luc Segoufin & Victor Vianu



  1. Validating Streaming XML Documents. Luc Segoufin & Victor Vianu. Presented by Harel Paz

  2. The Challenge • XML becoming a standard for data exchange on the Web. • Need: on-line processing of large amounts of data in XML format, using limited memory. • Our focus: validating XML documents against given DTDs.

  3. Validating Streaming XML Documents • Restrictions on the validation: • In a single pass. • Using a fixed amount of memory, depending on the DTD. [Figure: an input stream of XML tags is read by an FSA with start and accept states, which answers Yes/No.]

  4. The Problem in 2 Flavors • There are 2 flavors to the problem: • Strong validation: validation that includes checking well-formedness. • Validation: checking satisfaction of the DTD, under the assumption that the input is a well-formed XML document.

  5. Tree Document • XML documents are abstracted by “tree documents”. • A tree document over a finite alphabet Σ is a finite unranked tree with labels in Σ and an order on the children of each node. [Figure: example tree document with root r and nodes labeled a, b and c.]

  6. String Representation • XML documents are a string representation of trees using opening and closing tags for each element. • For each a ∈ Σ, • a represents the opening tag. • ā represents the closing tag for a. • Notation: Σ̄ = {ā | a ∈ Σ}. [Figure: the example tree document with root r and nodes labeled a, b and c.]
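The encoding of a tree as a tag sequence can be sketched in a few lines of Python. This is only an illustration: the `Node` class and the convention of writing the closing tag of a label `a` as `"/a"` are assumptions of the sketch, not notation from the slides.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical tree-document node: a label plus ordered children."""
    label: str
    children: List["Node"] = field(default_factory=list)


def to_tags(node: Node) -> List[str]:
    """Depth-first serialization: opening tag, children in order, closing tag."""
    tags = [node.label]
    for child in node.children:
        tags.extend(to_tags(child))
    tags.append("/" + node.label)  # "/a" stands in for the closing tag of a
    return tags


tree = Node("r", [Node("a", [Node("b"), Node("c")])])
print(" ".join(to_tags(tree)))
```

Each node contributes exactly one opening and one closing tag, so the resulting sequence is well-balanced by construction.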

  7. DTDs • A DTD consists of an extended context-free grammar over alphabet Σ. • A tree document over Σ satisfies a DTD if it is a derivation tree of the grammar. • Example DTD: • r → a* • a → bc • b → c? • c → ε [Figure: a tree document with root r and nodes labeled a, b and c that satisfies this DTD.]
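Checking that a tree is a derivation tree of the grammar amounts to matching the children of every node against the rule of its label. A minimal sketch, assuming single-character labels (so the children of a node can be matched as a plain string with Python's `re` module) and a hypothetical `Node` class:

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """Hypothetical tree-document node: a label plus ordered children."""
    label: str
    children: List["Node"] = field(default_factory=list)


# The example DTD: r -> a*, a -> bc, b -> c?, c -> empty.
DTD: Dict[str, str] = {"r": "a*", "a": "bc", "b": "c?", "c": ""}


def satisfies(node: Node, dtd: Dict[str, str]) -> bool:
    """True if every node's sequence of child labels matches its rule."""
    if node.label not in dtd:
        return False
    word = "".join(child.label for child in node.children)
    if re.fullmatch(dtd[node.label], word) is None:
        return False
    return all(satisfies(child, dtd) for child in node.children)


good = Node("r", [Node("a", [Node("b", [Node("c")]), Node("c")]),
                  Node("a", [Node("b"), Node("c")])])
bad = Node("r", [Node("a", [Node("c"), Node("b")])])  # children in wrong order
print(satisfies(good, DTD), satisfies(bad, DTD))
```

This check works on a tree held in memory; it is not the streaming, constant-memory validation the presentation is about.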

  8. DTDs – cont’ • Each DTD d has a unique rule a → R_a for each symbol a ∈ Σ. • R_a denotes the regular expression. • L(d) is the language over Σ ∪ Σ̄ consisting of the string representations of all tree documents satisfying d.

  9. Strong Validation of Streaming XML Documents • The problem: validating an XML document with respect to a given DTD. • Need to characterize the DTDs d for which L(d) can be recognized by an FSA. • Such DTDs are called strongly recognizable.

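  10. Strong Validation – Example 1 • DTD d: • r → a • a → a? • L(d) = {r aⁿ āⁿ r̄ | n ≥ 1}. • L(d) is not regular, so d cannot be strongly validated by an FSA. • d is not strongly recognizable. [Figure: a tree document consisting of the root r above a chain of a nodes.]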

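  11. Strong Validation – Example 2 • DTD d: • r → a* • a → b|c • L(d) = r (a (b b̄ | c c̄) ā)* r̄. • L(d) is regular, so d is strongly recognizable. [Figure: a tree document with root r whose a children each have a single child b or c.]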

  12. More Definitions • Let d be a DTD over Σ. • The dependency graph of d, G_d, is the graph constructed as follows: • Its set of vertices is Σ. • For each rule a → R_a in d, there is an edge from a to b, for each b occurring in R_a.

  13. More Definitions (cont’) • Two labels, a and b, are mutually recursive if they belong to some cycle of G_d. • a is recursive if it is mutually recursive with itself. • DTD d is non-recursive iff G_d is acyclic. • A DTD is fully recursive if all labels from which recursive labels are reachable in G_d are mutually recursive.

  14. Dependency Graph – Examples • DTD d: • r → a • a → a? • G_d is not acyclic. • d is not fully recursive. • a is recursive. • DTD d: • r → a* • a → b|c • d is non-recursive. [Figure: the dependency graphs of the two DTDs.]
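The two examples can be classified mechanically from the dependency graph. The sketch below is one possible reading of the definitions, not code from the paper: it assumes single lowercase-letter labels, rules given as Python-style regular expressions, and it interprets "fully recursive" as "every pair of labels that can reach a recursive label lies on a common cycle".

```python
import re
from typing import Dict, Set


def dependency_graph(dtd: Dict[str, str]) -> Dict[str, Set[str]]:
    """Edge a -> b for each label b occurring in the rule of a."""
    graph = {a: set(re.findall(r"[a-z]", rule)) for a, rule in dtd.items()}
    # Labels that occur only on right-hand sides get an empty rule.
    for targets in list(graph.values()):
        for b in targets:
            graph.setdefault(b, set())
    return graph


def reachable(graph: Dict[str, Set[str]], start: str) -> Set[str]:
    """Labels reachable from `start` by a path of length >= 1."""
    seen: Set[str] = set()
    stack = list(graph[start])
    while stack:
        b = stack.pop()
        if b not in seen:
            seen.add(b)
            stack.extend(graph[b])
    return seen


def classify(dtd: Dict[str, str]) -> str:
    graph = dependency_graph(dtd)
    reach = {a: reachable(graph, a) for a in graph}
    recursive = {a for a in graph if a in reach[a]}
    if not recursive:
        return "non-recursive"
    # Labels from which some recursive label is reachable.
    sources = {a for a in graph if reach[a] & recursive}
    fully = all(b in reach[a] and a in reach[b]
                for a in sources for b in sources)
    return "fully recursive" if fully else "recursive, not fully recursive"


print(classify({"r": "a", "a": "a?"}))
print(classify({"r": "a*", "a": "b|c"}))
```

In the first DTD the root r reaches the recursive label a without lying on a cycle itself, which is what prevents it from being fully recursive.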

  15. Characterization of Strongly Recognizable DTDs Theorem 3.1 (partial): A DTD d is strongly recognizable iff it is non-recursive. • Proof sketch: • If d is a strongly recognizable DTD, there is an FSA recognizing exactly L(d). Suppose towards a contradiction that d is recursive, and show using the pumping lemma that this FSA also accepts strings that are not well-balanced. • If d is non-recursive, an algorithm to build an FSA recognizing L(d) is given.
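For the non-recursive direction, one way to see that the language of string representations is regular is to inline the rules into a single regular expression. The sketch below does this with Python's `re` module; it is not the algorithm from the paper, and it assumes single lowercase-letter labels, with the uppercase letter standing in for the closing tag.

```python
import re
from typing import Dict


def language_regex(dtd: Dict[str, str], label: str) -> str:
    """Regex for the string representations of trees rooted at `label`.

    Every label in the rule is replaced by the regex of its own subtrees.
    The recursion terminates only when the DTD is non-recursive.
    """
    rule = dtd.get(label, "")  # a label without a rule has no children
    body = "".join(
        "(?:" + language_regex(dtd, ch) + ")" if ch.islower() else ch
        for ch in rule
    )
    return label + "(?:" + body + ")" + label.upper()


DTD = {"r": "a*", "a": "b|c"}
pattern = re.compile(language_regex(DTD, "r"))

print(bool(pattern.fullmatch("rabBAacCAR")))  # r with children a(b) and a(c)
print(bool(pattern.fullmatch("rabBcCAR")))    # an a with two children
```

The generated expression uses only concatenation, alternation, `*` and `?`, so it denotes a regular language even though Python evaluates it with a backtracking matcher rather than an explicit FSA. On a recursive DTD such as r → a, a → a? the inlining would never finish, which matches the theorem.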

  16. Validating Well-Formed XML Documents • The problem: validating an XML document with respect to a given DTD d, assuming the XML document is well-formed. • Validation using an FSA. • DTDs that can be validated this way are called recognizable. • The requirement that L(d) should be regular is now too strong. • The FSA should only work correctly on well-balanced strings representing trees.

  17. Validation - Example 1 • DTD d: • r → a • a → a? • d is not strongly recognizable. • But, it is recognizable: • If the input is known to be well balanced, the FSA should just check that the string is of the form r a* ā* r̄ (more precisely r a⁺ ā⁺ r̄).
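The difference between the two flavors can be checked on this example. In the sketch below (an illustration under stated assumptions, not code from the paper) uppercase letters stand in for closing tags, `fsa_accepts` is the regular shape check "r, one or more a, one or more A, R", and `in_language` is the exact, non-regular membership test with equal counts.

```python
import re

SHAPE = re.compile(r"ra+A+R")  # regular: counts of a and A are not compared


def fsa_accepts(s: str) -> bool:
    """What a finite automaton can check: only the shape of the string."""
    return SHAPE.fullmatch(s) is not None


def well_balanced(s: str) -> bool:
    """Every closing tag matches the most recent unmatched opening tag."""
    stack = []
    for ch in s:
        if ch.islower():
            stack.append(ch)
        elif not stack or stack.pop() != ch.lower():
            return False
    return not stack


def in_language(s: str) -> bool:
    """Exact membership: r, then n opening a, n closing A (n >= 1), then R."""
    m = re.fullmatch(r"r(a+)(A+)R", s)
    return m is not None and len(m.group(1)) == len(m.group(2))


print(fsa_accepts("raaAAR"), in_language("raaAAR"))  # well-balanced input
print(fsa_accepts("raaAR"), in_language("raaAR"))    # not well-balanced
```

On well-balanced inputs the two functions agree, because balance already forces the counts to be equal. They differ only on inputs such as `raaAR`, which the well-formedness assumption rules out.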

  18. Validation - Example 2 • DTD d: • a → (ab|ca|ε) • b → ε • c → ε • d is not recognizable. • An FSA cannot store enough information to recall, when it reads ā, whether the corresponding node has a left sibling c (in which case b is not allowed to its right). [Figure: example tree document with nodes labeled a, b and c.]

  19. Characterizing Recognizable DTDs • Which DTDs are recognizable? • Non-recursive DTDs. • What about recursive DTDs? • Not a trivial question. • Are there any necessary conditions of being a recognizable DTD? • Are there any sub-groups of DTDs for which the necessary conditions are also sufficient?

  20. Necessary Condition for a Recognizable DTD Lemma 4.2: Let be a recognizable DTD. Then the following hold, where are words over while (possibly subscripted) are individual symbols: Let be a positive integer and , be mutually recursive symbols of (not necessarily distinct). If , and for , then must be in .

  21. Fully Recursive DTDs • The necessary condition for recognizability stated in Lemma 4.2 is also sufficient when the DTD is fully recursive. • Next, we’ll see how to construct an FSA A_d for a DTD d, which accepts all words in L(d) (and possibly more). • For fully recursive DTDs satisfying the conditions of Lemma 4.2, the well-balanced words accepted by A_d are precisely the words in L(d) (A_d may also accept non well-balanced words).

  22. The Standard FSA • Let d be a DTD over alphabet Σ. • Equivalence relation ≡ on Σ: • Equivalence classes are the strongly connected components of G_d. • Let ≤ be a partial order on the classes of ≡, where C ≤ C′ iff for some a ∈ C and b ∈ C′ there is an edge from a to b in G_d. • ≤ may have several maximal classes, but only one minimum class.

  23. Example • DTD d: • r → aa • a → a? • The classes of ≡ are {r} and {a}. • {r} ≤ {a}. [Figure: the dependency graph, with an edge from r to a and a loop on a.]

  24. Example – cont’ • DTD d: • r → aa • a → a? • Constructing the FSA for the rule of a, then the FSA for the string representation of class {a}. [Figure: the two automata, with the transitions added for each edge labeled a.]

  25. Example – cont’ • DTD d: • r → aa • a → a?

  26. Example – cont’ • DTD d: • r → aa • a → a? • The above FSA recognizes all well-balanced words produced by the above DTD. • But it also accepts other well-balanced words. • There is no automaton recognizing this DTD.
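The over-acceptance can be reproduced with a small NFA simulation. The transition table below was derived by hand from the appendix construction as read here (one copy of the class {a} automaton per a-transition of the rule r → aa, with the original a-transitions replaced by call and return transitions); the state names are hypothetical, uppercase letters stand in for closing tags, and the table should be treated as an assumption rather than the automaton drawn on the slides.

```python
from typing import Dict, Set, Tuple

# States: s/f are the new start/final states, p0..p2 come from the FSA for
# "aa", and c1*/c2* are two disjoint copies of the class {a} automaton.
DELTA: Dict[Tuple[str, str], Set[str]] = {
    ("s", "r"): {"p0"},
    ("p0", "a"): {"c1q0"},          # enter copy 1 on the first child a
    ("c1q0", "a"): {"c1q0"},        # nested a: re-enter the same copy
    ("c1q0", "A"): {"c1q1", "p1"},  # return inside the copy, or back to r
    ("c1q1", "A"): {"c1q1", "p1"},
    ("p1", "a"): {"c2q0"},          # enter copy 2 on the second child a
    ("c2q0", "a"): {"c2q0"},
    ("c2q0", "A"): {"c2q1", "p2"},
    ("c2q1", "A"): {"c2q1", "p2"},
    ("p2", "R"): {"f"},
}


def accepts(word: str) -> bool:
    """Standard subset simulation of the NFA on `word`."""
    current = {"s"}
    for ch in word:
        current = set().union(*(DELTA.get((q, ch), set()) for q in current))
        if not current:
            return False
    return "f" in current


print(accepts("raAaAR"))    # r with two a children: satisfies the DTD
print(accepts("raaAaAAR"))  # r with one a child that has two a children
```

The second word is well-balanced but violates both rules, since r has a single child and that child has two children. A finite automaton forgets, after the inner closing tag, which level it has returned to, so it cannot tell the two situations apart.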

  27. Recognizable Fully Recursive DTDs Theorem 4.1: The following are equivalent for each fully recursive DTD d: (i) d is recognizable. (ii) d satisfies the conditions of Lemma 4.2. (iii) The set of well-balanced strings accepted by the FSA A_d is precisely L(d).

  28. Recognizable DTDs • Which DTDs are recognizable? • Non-recursive DTDs. • Fully recursive DTDs satisfying the conditions of Lemma 4.2. • And others… • But, characterization in the general case remains an open question. • Partial progress: necessary conditions for recognizability.

  29. Alternative Validation Approaches • 2 alternative approaches for validating DTDs that are not recognizable: • Relax the constant memory requirement. • Refine the original DTD.

  30. Validation with Bounded Stack • Relaxing the constant memory requirement. • Use a stack whose depth is bounded by the depth of the XML document. • Validation done in a single deterministic pass. • Appealing approach in practice. • For each DTD, there exists a deterministic PDA that accepts precisely its language. • Example: the DTD • r → aa • a → a?
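A stack-based validator for this example DTD might look as follows. It is a sketch under stated assumptions: the content model of each label is given as a hand-written deterministic automaton (a hypothetical representation chosen for the example), closing tags are written `"/a"`, and each stack frame stores only the open label and the current state of its content-model automaton.

```python
from typing import Dict, Iterable, List, Set, Tuple

# label -> (transitions, accepting states); state 0 is always initial.
ContentDFA = Tuple[Dict[Tuple[int, str], int], Set[int]]

DTD: Dict[str, ContentDFA] = {
    "r": ({(0, "a"): 1, (1, "a"): 2}, {2}),  # r -> aa
    "a": ({(0, "a"): 1}, {0, 1}),            # a -> a?
}


def validate(tags: Iterable[str], dtd: Dict[str, ContentDFA], root: str = "r") -> bool:
    """Single pass over the tag stream; stack depth equals document depth."""
    stack: List[Tuple[str, int]] = []  # (open label, content-model state)
    seen_root = False
    for tag in tags:
        if not tag.startswith("/"):              # opening tag
            if tag not in dtd:
                return False
            if stack:
                parent, state = stack[-1]
                nxt = dtd[parent][0].get((state, tag))
                if nxt is None:                  # child not allowed here
                    return False
                stack[-1] = (parent, nxt)
            elif seen_root or tag != root:       # a second root or wrong root
                return False
            seen_root = True
            stack.append((tag, 0))
        else:                                    # closing tag
            if not stack:
                return False
            label, state = stack.pop()
            if label != tag[1:] or state not in dtd[label][1]:
                return False
    return seen_root and not stack


print(validate("r a /a a /a /r".split(), DTD))
print(validate("r a a /a a /a /a /r".split(), DTD))
```

Because the closing tag is compared with the label on top of the stack, this sketch also checks well-formedness, so it performs strong validation. Its memory use grows with the depth of the document, not with its length.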

  31. Refining the DTD • Refining a DTD means providing in the tags additional information that can be used for validation. • Example: [Figure: a DTD and its refined DTD.] • The refined DTD can be validated by an FSA. • For every DTD, there exists an equivalent refined DTD of quadratic size that is recognizable.

  32. Summary • First step towards the formal investigation of processing streaming XML. • Provided conditions under which validation can be done in a single pass and constant memory, using an FSA. • Considered alternative approaches, when validation using an FSA is not possible.

  33. Appendix The Standard FSA Construction

  34. The Standard FSA • A_d is inductively constructed starting from the maximal elements of ≤. • Let C be a maximal element of ≤. • For each regular expression R_a (a ∈ C), a non-deterministic FSA A_a is built. • Disjoint states for different a’s. • Each A_a has one initial state and a set of final states.

  35. The Standard FSA – cont’ • C is a maximal element of ≤. • Build A_C: • Its states are the union of the states of the FSAs A_a for a ∈ C. • Transitions: for each transition from p to q labeled b in some A_a (b must belong to C), add to A_C the transitions: • from p, labeled b, to the initial state of A_b. • from each final state of A_b, labeled b̄, to q.

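  36. The Standard FSA – cont’ • Build A_C for non-maximal elements C of ≤, when all FSAs A_C′ of elements C′ ≠ C such that C ≤ C′ are already constructed: • Unlike the maximal elements case, some A_a (a ∈ C) has transitions from p to q labeled b, where b ∉ C (i.e., b ∈ C′ for some C′ above C). • For such transitions, we add to A_C: • A new disjoint copy of A_C′. • A transition from p, labeled b, to the initial state of A_b in that copy. • A transition labeled b̄ to q from each final state of A_b in that copy.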

  37. The Standard FSA – cont’ • The final FSA A_d is obtained by adding to the FSA of the minimum class (containing the root label r): • A new start state with a transition labeled r to the start state of A_r. • A final state with a transition labeled r̄ from each final state of A_r.

  38. The Standard FSA - Lemma Lemma 4.3: For each DTD d, let A_d be the automaton described. We have: (i) Every word in L(d) is accepted by A_d. (ii) A_d can be constructed from d in exponential time. • The complexity of A_d’s construction depends on: • the maximum size of an FSA for a regular expression of d. • the depth of the partial order ≤.
