
Streamed Validation

Ksenia Rybenko, TU Dresden



  1. Streamed Validation Ksenia Rybenko, TU Dresden

  2. XML-processing • Querying • Computing running aggregates of streams • Validating XML-documents against given DTDs

  3. Statement of the problem • Verify that an XML-document is valid with respect to a given DTD in a single pass, using a fixed amount of memory that depends on the DTD but not on the XML-document • Validation is performed by a finite state automaton (FSA) that makes one pass over the XML-document as it streams through the network

  4. Example of DTD

  5. Validation • Strong validation (additionally checks well-formedness) -> strongly recognizable DTDs • Validation -> recognizable DTDs

  6. XML as a tree document • A tree document over ∑ is a finite unranked tree with labels in ∑ and an order on the children of each node • Let T be a set of tree documents. ζ(T) is the language consisting of the string representations of the tree documents in T

  7. String representation • The string associated with a tree document t is denoted [t], where a stands for the opening tag and ā for the closing tag of label a • [t] is defined by induction • If t is a single root labeled a then [t]=aā • If t consists of a root labeled a and subtrees t1…tk then [t]=a[t1]…[tk]ā
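The inductive definition of [t] can be sketched in a few lines of Python. This is only an illustration: the `Tree` class and the token encoding (a label for the opening tag, `"/"` plus the label standing in for the barred closing tag) are assumptions made for the example, not notation from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Tree:
    """Hypothetical unranked, ordered tree: a label plus a list of subtrees."""
    label: str
    children: list["Tree"] = field(default_factory=list)


def string_rep(t: Tree) -> list[str]:
    """Return [t] as a list of tokens; "/a" plays the role of the closing tag of a."""
    tokens = [t.label]                      # opening tag of the root
    for child in t.children:                # [t1] ... [tk], in document order
        tokens.extend(string_rep(child))
    tokens.append("/" + t.label)            # closing tag of the root
    return tokens


doc = Tree("r", [Tree("a"), Tree("a", [Tree("b")])])
print(" ".join(string_rep(doc)))  # prints: r a /a a b /b /a /r
```

A single-node tree yields exactly two tokens, which matches the base case [t]=aā.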

  8. Context Free Grammar (CFG) • Context-free grammar G = (V, ∑, P, S) • V - set of nonterminals • ∑ - set of terminals • S in V - start symbol • P - finite set of productions of the form A -> α, where A is in V and α is in (V U ∑)* • A tree document with root r over ∑ satisfies a DTD d if it is a derivation tree of the CFG G = (∑, ø, P, r) for d, where regular expressions are allowed on the right-hand side of the productions
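As a rough sketch of what "satisfies a DTD" means operationally, the check below walks a tree and tests, at every node, that the word formed by the children's labels matches the regular expression of the node's label. The DTD shown, the tuple encoding of trees, and the restriction to single-character labels are assumptions chosen for the example.

```python
import re

# Hypothetical DTD: each label maps to a regular expression over child labels.
# Labels are single characters so that a sequence of children is a plain string.
DTD = {"r": "a*", "a": "a?b", "b": ""}


def valid_subtree(tree, dtd) -> bool:
    """Check the production of every node in the subtree."""
    label, children = tree
    word = "".join(child[0] for child in children)
    if label not in dtd or re.fullmatch(dtd[label], word) is None:
        return False
    return all(valid_subtree(child, dtd) for child in children)


def satisfies(tree, dtd, root) -> bool:
    """A tree satisfies the DTD if its root is the start symbol and all nodes are valid."""
    return tree[0] == root and valid_subtree(tree, dtd)


good = ("r", [("a", [("b", [])]), ("a", [("a", [("b", [])]), ("b", [])])])
bad = ("r", [("b", [])])
print(satisfies(good, DTD, "r"), satisfies(bad, DTD, "r"))
```

This version needs the whole tree in memory, so it is not the streaming validation discussed in the talk; it only makes the definition concrete.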

  9. Regular languages • Ø, ε and a for a in ∑ are in Reg∑ • If L, L1, L2 are in Reg∑ then so are L1 U L2, L1·L2 = {u·v | u in L1 and v in L2}, and L* = {u1…un | n ≥ 0 and ui in L} • Example: (ab)*a is regular, while {a^n b^n | n ≥ 0} is not
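The two example languages can be compared with a short Python sketch. The first membership test uses a regular expression directly; the second has to compare the two halves of the word, which hints at why no fixed amount of memory is enough. The function names are made up for the illustration, and the second language is taken with n ≥ 0.

```python
import re


def in_ab_star_a(word: str) -> bool:
    """Membership in the regular language (ab)*a."""
    return re.fullmatch(r"(ab)*a", word) is not None


def in_an_bn(word: str) -> bool:
    """Membership in {a^n b^n | n >= 0}; needs to count, so no FSA can do this."""
    n = len(word) // 2
    return len(word) % 2 == 0 and word == "a" * n + "b" * n


print(in_ab_star_a("aba"), in_ab_star_a("ab"))
print(in_an_bn("aabb"), in_an_bn("abab"))
```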

  10. Example of DTD in terms of CFG

  11. Useful notation • The production a -> Ra is unique for each a in ∑ (two productions a -> Ra1 and a -> Ra2 are merged into a -> Ra1|Ra2) • The set of tree documents satisfying a DTD d is denoted by SAT(d) • ζ(d) is the language consisting of all string representations of elements of SAT(d)

  12. Finite State Automaton • A = (Q, ∑, I, Δ, F) • a finite set of states Q • a finite alphabet ∑ • a set of initial states I ⊆ Q • a transition relation Δ ⊆ Q × ∑ × Q • a set of final states F ⊆ Q • A path in the automaton is a sequence q0 a1 q1 a2 … an qn, written q0 -> a1…an -> qn, where (qi-1, ai, qi) is in Δ for 1 ≤ i ≤ n • A path is successful if q0 is in I and qn is in F • Accepted language L(A) = {w in ∑* | q0 -> w -> qn is a successful path in A} • Kleene's theorem: L is recognizable iff it is regular
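The acceptance condition above can be simulated by tracking the set of states reachable after each input symbol. The sketch below does this for an automaton given as a set of triples; the function name and the example automaton (one that accepts (ab)*a) are assumptions made for the illustration.

```python
def accepts(word, initial, delta, final) -> bool:
    """Return True if some successful path of the automaton is labeled by word."""
    current = set(initial)                  # states reachable after the prefix read so far
    for symbol in word:
        current = {q2 for (q1, a, q2) in delta if q1 in current and a == symbol}
    return bool(current & set(final))       # successful: ends in a final state


# Example automaton for (ab)*a: state 0 is initial, state 1 is final.
DELTA = {(0, "a", 1), (1, "b", 0)}

print(accepts("aba", {0}, DELTA, {1}))  # prints: True
print(accepts("ab", {0}, DELTA, {1}))   # prints: False
```

The memory used is one subset of Q, which is bounded by the automaton and independent of the length of the input.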

  13. Strong validation of XML-documents • DTD d is strongly recognizable if ζ(d) can be recognized by an FSA • Strong validation includes also checking well-formedness of the XML-document

  14. Example of recognizability [Figure: two examples, one labeled "not regular" and one labeled "regular"]

  15. Dependency graph for DTDs • Construction of Gd: • the set of vertices is ∑ • for each production a -> Ra, add an edge from a to b for each b occurring in some word in Ra • Two labels a and b are mutually recursive if they belong to some cycle of Gd, and a is recursive if it is mutually recursive with itself • Example: Gd with edges (r,a), (a,a), (a,b)
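A minimal sketch of the construction of Gd, assuming a DTD given as a dictionary from single-character labels to regular expressions in which every label that appears can actually occur in some word. The concrete DTD is hypothetical; it was chosen only so that the resulting edges are (r,a), (a,a), (a,b) as in the example.

```python
# Hypothetical DTD with single-character labels.
DTD = {"r": "a*", "a": "a?b", "b": ""}


def dependency_graph(dtd):
    """Edge a -> b for each label b that appears in the expression Ra."""
    return {a: {ch for ch in regex if ch in dtd} for a, regex in dtd.items()}


def reachable(graph, start):
    """Labels reachable from start by a path with at least one edge."""
    seen, stack = set(), list(graph[start])
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(graph[v])
    return seen


def mutually_recursive(graph, a, b) -> bool:
    """a and b lie on a common cycle iff each is reachable from the other."""
    return b in reachable(graph, a) and a in reachable(graph, b)


def is_recursive(graph, a) -> bool:
    return mutually_recursive(graph, a, a)


G = dependency_graph(DTD)
print(sorted((a, b) for a in G for b in G[a]))
print([a for a in G if is_recursive(G, a)])
```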

  16. Recursivity of DTDs • DTD d is nonrecursive iff Gd is acyclic • A specialized DTD d = (∑, ∑', d', μ) is nonrecursive iff the DTD d' over ∑' is nonrecursive • DTD d is fully recursive if all labels from which recursive labels are reachable in Gd are mutually recursive

  17. Recognizability condition (1) • DTD is strongly recognizable iff it is nonrecursive
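Since a DTD is nonrecursive exactly when Gd is acyclic, the condition for strong recognizability reduces to a cycle check on the dependency graph. The following sketch uses a depth-first search with three colors; the graph encoding (a dictionary from labels to sets of successor labels) and the two sample graphs are assumptions for the illustration.

```python
def is_nonrecursive(graph) -> bool:
    """Return True iff the dependency graph has no cycle (self-loops included)."""
    WHITE, GREY, BLACK = 0, 1, 2            # unvisited, on the current path, finished
    color = {v: WHITE for v in graph}

    def visit(v) -> bool:
        color[v] = GREY
        for w in graph[v]:
            if color[w] == GREY:            # back edge: w is on the current path
                return False
            if color[w] == WHITE and not visit(w):
                return False
        color[v] = BLACK
        return True

    return all(color[v] != WHITE or visit(v) for v in graph)


recursive_graph = {"r": {"a"}, "a": {"a", "b"}, "b": set()}
flat_graph = {"r": {"a"}, "a": {"b"}, "b": set()}
print(is_nonrecursive(recursive_graph), is_nonrecursive(flat_graph))
```

For a nonrecursive DTD the nesting depth of valid documents is bounded by the number of labels, which is the intuition behind a finite automaton being able to check well-formedness as well.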

  18. Validating well-formed XML-documents • Let ζ(Tree) denote the language consisting of all string representations of trees over ∑. The DTD d is recognizable and can be validated by an FSA iff there exists some regular language R such that ζ(d) = ζ(Tree) ∩ R
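To make the intersection with a regular language R concrete, here is a small sketch with a hypothetical recursive DTD r -> a, a -> a?, whose documents are chains of nested a elements under r. Lowercase letters encode opening tags and uppercase letters the matching closing tags; this encoding and the choice of R are assumptions for the example.

```python
import re

# Hypothetical DTD: r -> a, a -> a?   (a chain of nested a's under r).
# Candidate regular language R for this DTD.
R = re.compile(r"ra+A+R")


def is_tree_string(s: str) -> bool:
    """Membership in the set of tree strings: well-balanced with a single root."""
    stack = []
    for i, ch in enumerate(s):
        if ch.islower():
            if i > 0 and not stack:         # a second root would start here
                return False
            stack.append(ch)
        elif not stack or stack.pop() != ch.lower():
            return False
    return bool(s) and not stack


def in_zeta_d(s: str) -> bool:
    """Intersection of the tree strings with R."""
    return is_tree_string(s) and R.fullmatch(s) is not None


print(in_zeta_d("raaAAR"))                   # a valid document
print(R.fullmatch("raaAR") is not None)      # R alone accepts an unbalanced string
print(in_zeta_d("raaAR"))                    # the intersection rejects it
```

R on its own over-accepts, but once the input is known to be well-formed, checking R with a finite automaton is enough for this DTD.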

  19. Example of recognizable DTD

  20. Condition of recognizability (2) • Lemma 1: Let d be a recognizable DTD. Then the following holds, where α, β, u, v, w (possibly primed or subscripted) are words over ∑ while x, y, z (possibly subscripted) are individual symbols: Let k be a positive integer and let x_i, z_i, 1 ≤ i ≤ k, be mutually recursive symbols of d (not necessarily distinct). If α x_1 β is in R_{z_1}, α' x_k β' is in R_{z_1}, and u_i x_{i-1} v_i x_i w_i is in R_{z_i} for 2 ≤ i ≤ k, then α x_1 v_2 x_2 … v_k x_k β' must be in R_{z_1}

  21. Example of a non-recognizable DTD • The condition of Lemma 1 does not hold for k=2: a and b are mutually recursive, Ra contains a and b, Rb contains ab, but Ra does not contain the required ab

  22. Constructing a standard FSA Ad, which accepts ζ(d) • Ad is constructed from the separate automata for every regular expression, connected by additional transitions, with new initial and final states • The procedure is based on induction on the order of the edges in Gd, starting from the maximal element

  23. Example of constructing Ad • Note: Ad also accepts additional words such as raaāaāāŕ

  24. Condition of recognizability (3) • A fully recursive DTD d is recognizable • iff the set of well-balanced strings accepted by the standard FSA Ad is precisely ζ(d) • iff d satisfies the conditions of Lemma 1

  25. Alternative approaches to validation • Validation with bounded stack • Refining the DTD

  26. Validation with bounded stack • The memory requirement is relaxed • A stack whose depth is bounded by the depth of the XML-document is allowed as auxiliary memory • Formally, this can be done by a pushdown automaton (PDA): a finite automaton with control of both an input tape and a stack, where the stack is a string of symbols over some alphabet
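The idea of validating with a stack bounded by the document depth can be sketched as a one-pass validator that keeps, for every open element, its label and the current state of a deterministic automaton for its content model. The content-model automata below are written by hand for a hypothetical DTD (r -> a*, a -> (a|b)*, b -> ε), and the token encoding (`"a"` for an opening tag, `"/a"` for a closing tag) is likewise an assumption.

```python
# Hypothetical content models: label -> (start state, final states, transitions).
MODELS = {
    "r": (0, {0}, {(0, "a"): 0}),                 # r -> a*
    "a": (0, {0}, {(0, "a"): 0, (0, "b"): 0}),    # a -> (a|b)*
    "b": (0, {0}, {}),                            # b -> empty content
}


def validate_stream(tokens, models, root) -> bool:
    """Single pass over the tokens; the stack never exceeds the document depth."""
    stack = []                                    # one (label, state) pair per open element
    seen_root = False
    for tok in tokens:
        if tok.startswith("/"):                   # closing tag
            if not stack:
                return False
            label, state = stack.pop()
            if label != tok[1:] or state not in models[label][1]:
                return False
        else:                                     # opening tag
            if tok not in models:
                return False
            if stack:                             # advance the parent's automaton
                parent, state = stack[-1]
                nxt = models[parent][2].get((state, tok))
                if nxt is None:
                    return False
                stack[-1] = (parent, nxt)
            elif seen_root or tok != root:        # only one root, and it must be the start symbol
                return False
            seen_root = True
            stack.append((tok, models[tok][0]))
    return seen_root and not stack


print(validate_stream("r a /a a b /b /a /r".split(), MODELS, "r"))
print(validate_stream("r b /b /r".split(), MODELS, "r"))
```

Because closing tags are matched against the stack, this sketch also checks well-formedness, which corresponds to strong validation.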

  27. Validation of a DTD by PDA • The class of languages accepted by PDAs is precisely the class of context-free languages. Thus, every DTD can be strongly validated by some PDA

  28. Refining the DTD • Refining a DTD means adding information to its tags (specialization) • For every DTD d there exists an equivalent specialized DTD dspec of size quadratic in d such that dspec is recognizable

  29. DTDs with specialization • A specialized DTD over ∑ is a tuple d = (∑, ∑', d', μ) where • ∑ and ∑' are finite alphabets • d' is a DTD over ∑' • μ is a mapping from ∑' to ∑ • A tree document t over ∑ satisfies a specialized DTD d if t is in μ(SAT(d'))

  30. Example of DTD with specialization • μ(ca) = μ(cb) = c, i.e. the tag c is specialized into ca and cb
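Applying μ to a tree document simply relabels every node, which the following sketch illustrates. The tuple encoding of trees and the concrete mapping (specialized tags `ca` and `cb` both mapped to `c`, all other labels mapped to themselves) are assumptions for the example.

```python
# Hypothetical mapping from the specialized alphabet to the original alphabet.
MU = {"r": "r", "ca": "c", "cb": "c"}


def apply_mu(tree, mu):
    """Relabel every node of a tree over the specialized alphabet with its image under mu."""
    label, children = tree
    return (mu[label], [apply_mu(child, mu) for child in children])


specialized = ("r", [("ca", []), ("cb", [("ca", [])])])
print(apply_mu(specialized, MU))
# prints: ('r', [('c', []), ('c', [('c', [])])])
```

A tree over the original alphabet satisfies the specialized DTD when it is the image under μ of some tree that satisfies d'; the sketch only shows the relabeling step, not the search for such a preimage.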

  31. Example of refining a DTD

  32. Conclusion • Conditions under which validation can be done in a single pass with constant memory are provided • Whenever a DTD is recognizable it can be validated by the standard FSA • Other options for validation are considered: a PDA and specializing the DTD


  34. Thank you for your attention
