
Typing semistructured data

Typing semistructured data. Serge Abiteboul 2008. Organization. Motivations Automata Automata on words Ranked tree automata Unranked tree automata Automata and monadic second-order logic Automata – to compute XML typing: DTD, XML schema Graphs and bisimulation. Motivation.


Presentation Transcript


  1. Typing semistructured data. Serge Abiteboul, 2008

  2. Organization • Motivations • Automata • Automata on words • Ranked tree automata • Unranked tree automata • Automata and monadic second-order logic • Automata – to compute • XML typing: DTD, XML schema • Graphs and bisimulation

  3. Motivation

  4. XML typing • Not compulsory • Simplify writing software for XML • Improve interoperability between programs • Improve storage and performance • Ease querying: data guide • Simplify data protection • Reject illegal updates, as with relational dependencies

  5. Improve storage • Lower-bound schema • Store the rest in an overflow graph • [Figure: schema graph with nodes Root, Employee and Company; edge labels person, company, works-for, managed-by, c.e.o., name and address; leaf type string]

  6. Improve performance • select X.title from Bib._ X where X.*.zip = "12345" • select X.title from Bib.book X where X.address.zip = "12345" • [Figure: type tree for Bib with element names paper, book, address, year, title, journal, author, last name, first name, zip, street and city, and leaf types string and int]

  7. Type checking • Who checks • XML editor: check that the data conforms to its type • XML exchange, e.g., with Web service • Server when delivering the data • Client/application: when receiving it • Dynamic verification: after the data is produced • Static verification: verification of the program that generates the data

  8. Static verification • Input: input type T and code of function f • f is XQuery, XPath, XSLT, etc. • Verification of T′ • Is it true that for every d, d ⊨ T implies f(d) ⊨ T′? • Type inference • Find the smallest T′ such that for every d, d ⊨ T implies f(d) ⊨ T′ • Quickly becomes undecidable because of "joins"

  9. Example • for $p in doc("parts.xml")//part[color="red"] return <part> <name>$p/name</name> <desc>$p/desc</desc> </part> • Result type: (part (name (string) desc (any)))* • If the type of parts.xml//part/desc is string: (part (name (string) desc (string)))*

  10. Difficulty • for $X in Input, $Y in Input do { print(<b/>) } • Input: <a/> <a/> • Result: <b/> <b/> <b/> <b/> • Problem: { bⁱ | i = n² for n ≥ 0 } cannot be described in XML Schema • There is no "best" result: • b* • ε + b² b* • ε + b² + b⁴ b* • ε + b² + b⁴ + b⁹ b* • …

  11. Why tree automata? • XML = unranked trees • No theory for XML • Rich theory for strings: Automata • Extend to rich theory for ranked trees: Tree automata • Nice algorithms • Nice theorems • Can this carry over to unranked trees and XML? • Yes!

  12. From strings to trees • Word: finite-state automata • Binary tree…: ranked tree automata • Unranked tree (no bound on the number of children): unranked tree automata • [Figure: a word, a binary tree and an unranked tree over the labels a and b]

  13. Only unranked tree automata? • Missing practical gadgets • Complexity of verification • Goal: typing at reasonable cost • Unranked tree automata + …

  14. Automata: automata on words

  15. Finite state automata on words • Alphabet • States • Initial state • Accepting states • Transitions
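The five components listed on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the automaton drawn on the slides: the states q0, q1, q2 and the language (words over {a, b} ending in "ab") are assumptions chosen for the example.

```python
# Hypothetical nondeterministic automaton: accepts words over {a, b} ending in "ab".
ALPHABET = {"a", "b"}
INITIAL = "q0"
ACCEPTING = {"q2"}
# Transition relation: (state, letter) -> set of possible next states.
DELTA = {
    ("q0", "a"): {"q0", "q1"},  # nondeterministic choice on "a"
    ("q0", "b"): {"q0"},
    ("q1", "b"): {"q2"},
}


def accepts(word):
    """Track the set of states reachable after each letter."""
    current = {INITIAL}
    for letter in word:
        current = {t for s in current for t in DELTA.get((s, letter), set())}
    # The word is accepted if some run ends in an accepting state.
    return bool(current & ACCEPTING)
```

For instance, accepts("aab") holds because one of the runs ends in q2, while accepts("aba") does not.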

  16. Nondeterministic automaton: Example • [Figure: runs of a nondeterministic automaton with states q0, q1 and q2 on two words over {a, b}; one run is accepted (OK), the other is rejected (KO)]

  17. Reminder • Deterministic • No ε-transitions • No alternative transitions (two transitions from the same state on the same letter) • Determinization • It is possible to obtain an equivalent deterministic automaton • State of the new automaton = set of states of the original one • Possible exponential blow-up • Minimization • Limitations: cannot handle context-free languages in general • Essential tool, e.g., lexical analysis
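The determinization step ("state of the new automaton = set of states of the original one") can be sketched as the classical subset construction. The example automaton below is hypothetical (words ending in "ab") and does not come from the slides; only subsets reachable from the initial one are built, which is what keeps the blow-up small in easy cases.

```python
# Hypothetical NFA over {a, b}: accepts words ending in "ab".
ALPHABET = ["a", "b"]
NFA_DELTA = {
    ("q0", "a"): {"q0", "q1"},
    ("q0", "b"): {"q0"},
    ("q1", "b"): {"q2"},
}
NFA_INITIAL = "q0"
NFA_ACCEPTING = {"q2"}


def determinize(alphabet, delta, initial, accepting):
    """Subset construction, restricted to reachable subsets."""
    start = frozenset({initial})
    dfa_delta = {}
    seen = {start}
    todo = [start]
    while todo:
        subset = todo.pop()
        for letter in alphabet:
            target = frozenset(
                t for s in subset for t in delta.get((s, letter), ())
            )
            dfa_delta[(subset, letter)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    # A subset is accepting if it contains an accepting NFA state.
    dfa_accepting = {s for s in seen if s & accepting}
    return seen, dfa_delta, start, dfa_accepting


def dfa_accepts(dfa_delta, start, dfa_accepting, word):
    state = start
    for letter in word:
        state = dfa_delta[(state, letter)]
    return state in dfa_accepting


STATES, DFA_DELTA, START, DFA_ACCEPTING = determinize(
    ALPHABET, NFA_DELTA, NFA_INITIAL, NFA_ACCEPTING
)
```

In the worst case the number of reachable subsets is exponential in the number of original states, which is the blow-up mentioned on the slide.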

  18. Reminder (2) • L(A) = set of words accepted by automaton A • Regular languages • Can be described by regular expressions, e.g. a(b+c)*d • Closed under complement • Closed under union, intersection • Product automaton with states (s,s′) where s is from A and s′ is from A′
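The product construction behind closure under intersection can be sketched as follows. The two deterministic automata (an even number of a's; last letter is b) are hypothetical examples, and both are assumed to be complete, so every (state, letter) pair has a transition.

```python
ALPHABET = ["a", "b"]

# Hypothetical complete DFA A: even number of a's.
EVEN_A = (
    {("e", "a"): "o", ("e", "b"): "e", ("o", "a"): "e", ("o", "b"): "o"},
    "e",
    {"e"},
)
# Hypothetical complete DFA A': the word ends with b.
ENDS_B = (
    {("n", "a"): "n", ("n", "b"): "y", ("y", "a"): "n", ("y", "b"): "y"},
    "n",
    {"y"},
)


def product(dfa1, dfa2, alphabet):
    """Build the product automaton on pairs (s, s') for the intersection."""
    delta1, start1, acc1 = dfa1
    delta2, start2, acc2 = dfa2
    start = (start1, start2)
    delta, seen, todo = {}, {start}, [start]
    while todo:
        s1, s2 = todo.pop()
        for letter in alphabet:
            target = (delta1[(s1, letter)], delta2[(s2, letter)])
            delta[((s1, s2), letter)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    # Intersection: both components must accept.
    accepting = {(s1, s2) for (s1, s2) in seen if s1 in acc1 and s2 in acc2}
    return delta, start, accepting


def run(dfa, word):
    delta, state, accepting = dfa
    for letter in word:
        state = delta[(state, letter)]
    return state in accepting


BOTH = product(EVEN_A, ENDS_B, ALPHABET)
```

Replacing "and" by "or" in the definition of the accepting pairs gives the union instead of the intersection.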

  19. Automata on words versus trees • Words: left to right vs. right to left: no difference • Trees: top down vs. bottom up: differences • [Figure: a word and a tree over the labels a and b]

  20. Automata: automata on ranked trees

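  21. Binary tree automata • Bottom up • Parallel evaluation • For leaves: the state depends on the label only • For other nodes: the state depends on the label and on the states of the two children • [Figure: a binary tree over the labels a and b annotated with states q, q′, q″, q1 and q2]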

  22. Bottom-up tree automata • Bottom-up: if a node labeled a has its children in states q, q′, then the node moves nondeterministically to state r or r′ • Accepts if the root is in some state in F • Not deterministic if there are alternatives or ε-transitions

  23. Example: deterministic bottom-up

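  24. Boolean circuit evaluation • [Figure: a Boolean circuit drawn as a binary tree with gates at the internal nodes and the constants 0 and 1 at the leaves; the run ends with OK]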
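A deterministic bottom-up automaton for Boolean circuit evaluation can be sketched as below. The gate labels "and" and "or", the state names q0 and q1, and the tuple encoding of trees are assumptions made for this sketch; the automaton accepts exactly the circuits that evaluate to 1.

```python
# Trees are either a leaf "0" / "1" or a tuple (gate, left, right).
LEAF_RULES = {"0": "q0", "1": "q1"}
VALUE = {"q0": 0, "q1": 1}
STATE = {0: "q0", 1: "q1"}

# One rule per (gate, left state, right state): the automaton is deterministic.
NODE_RULES = {}
for left in ("q0", "q1"):
    for right in ("q0", "q1"):
        NODE_RULES[("and", left, right)] = STATE[VALUE[left] & VALUE[right]]
        NODE_RULES[("or", left, right)] = STATE[VALUE[left] | VALUE[right]]

ACCEPTING = {"q1"}


def run(tree):
    """Compute the state of the root, bottom up."""
    if isinstance(tree, str):
        return LEAF_RULES[tree]
    gate, left, right = tree
    return NODE_RULES[(gate, run(left), run(right))]


def accepts(tree):
    return run(tree) in ACCEPTING
```

For example, accepts(("or", ("and", "1", "0"), "1")) holds, while accepts(("and", "1", "0")) does not.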

  25. Regular tree language = set of trees accepted by a bottom-up tree automaton

  26. Regular tree languages • The following are equivalent: • L is a regular tree language • L is accepted by a nondeterministic bottom-up automaton • L is accepted by a deterministic bottom-up automaton • L is accepted by a nondeterministic top-down automaton • Deterministic top-down is weaker

  27. Top-down tree automata • Top-down: if a node labeled a is in state q″, then its left child moves to state q (and its right child to q′) • Accepts if all leaves are in states in F • Not deterministic if some label and state have alternative transitions

  28. Why is deterministic top-down weaker? • Consider the language L = { f(a,b), f(b,a) } • It can be accepted by a bottom-up TA • Exercise: write a BUTA A such that L = L(A) • Suppose that B is a deterministic top-down TA with L = L(B) • Exercise: show that B also accepts f(a,a) • A contradiction • Fact: no deterministic top-down tree automaton accepts L
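One possible answer to the first exercise, sketched in Python: a bottom-up automaton that remembers in the state of each leaf whether it is an a or a b. The state names qa, qb, ok and the tuple encoding of trees are assumptions of this sketch.

```python
# Trees are either a leaf "a" / "b" or a tuple ("f", left, right).
LEAF_RULES = {"a": "qa", "b": "qb"}
NODE_RULES = {
    ("f", "qa", "qb"): "ok",  # f(a, b)
    ("f", "qb", "qa"): "ok",  # f(b, a)
}
ACCEPTING = {"ok"}


def run(tree):
    """Return the state of the root, or None if no rule applies."""
    if isinstance(tree, str):
        return LEAF_RULES.get(tree)
    label, left, right = tree
    return NODE_RULES.get((label, run(left), run(right)))


def accepts(tree):
    return run(tree) in ACCEPTING
```

The second exercise follows from the shape of top-down rules: a deterministic top-down automaton sends the left child of f to one fixed state and the right child to another, before seeing the leaves. To accept both f(a,b) and f(b,a) it must accept a and b in both of these states, so it also accepts f(a,a).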

  29. Ranked tree automata: Properties • As for words, only with higher complexity • Determinization • Minimization • Closed under • Complement • Intersection • Union

  30. But… • XML documents are unranked • The kind of things we want to do: book (intro,section*,conclusion)

  31. Automata: automata on unranked trees

  32. Unranked tree automata • Issue: represent an infinite set of transitions • Solution: a regular language

  33. Unranked tree automata (2) • Rule: relates a label a, a regular language L(Q) over the states, and states r1,…,rm • Meaning: if the states of the children of some node labeled a form a word in L(Q), this node moves to some state in {r1,…,rm}
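A minimal sketch of such rules, using Python regular expressions as the regular languages over states. The labels follow the book (intro, section*, conclusion) example; the one-character state names i, s, c, k and the (label, children) encoding of trees are assumptions made so that a word of child states is simply a string.

```python
import re
from itertools import product

# label -> list of rules (regular expression over child states, target states).
RULES = {
    "intro": [("", {"i"})],
    "section": [("", {"s"})],
    "conclusion": [("", {"c"})],
    "book": [("is*c", {"k"})],
}
ACCEPTING = {"k"}


def states(tree):
    """Set of states the root of `tree` can reach, computed bottom up."""
    label, children = tree
    child_sets = [states(child) for child in children]
    result = set()
    for pattern, targets in RULES.get(label, []):
        # Try every word obtained by picking one state per child.
        for word in product(*child_sets):
            if re.fullmatch(pattern, "".join(word)):
                result |= targets
                break
    return result


def accepts(tree):
    return bool(states(tree) & ACCEPTING)


BOOK = ("book", [("intro", []), ("section", []), ("section", []), ("conclusion", [])])
```

Enumerating all combinations of child states is only reasonable for tiny examples; it is used here to keep the nondeterministic semantics explicit.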

  34. Building on ranked trees • Ranked tree: FirstChild-NextSibling • F: encoding into a ranked tree • F is a bijection • F⁻¹: decoding • [Figure: an unranked tree over the labels a and b and its FirstChild-NextSibling encoding]
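The FirstChild-NextSibling encoding and its inverse can be sketched as follows. The representations are assumptions of this sketch: an unranked tree is a pair (label, list of children), and a binary node is a triple (label, first child, next sibling) with None for a missing child.

```python
def encode_forest(trees):
    """Encode a list of sibling trees into one binary tree (or None)."""
    if not trees:
        return None
    (label, children), rest = trees[0], trees[1:]
    # Left pointer: first child. Right pointer: next sibling.
    return (label, encode_forest(children), encode_forest(rest))


def encode(tree):
    """F: the root has no sibling, so its right pointer is always None."""
    return encode_forest([tree])


def decode_forest(node):
    if node is None:
        return []
    label, first_child, next_sibling = node
    return [(label, decode_forest(first_child))] + decode_forest(next_sibling)


def decode(node):
    """Inverse of encode; rejects binary trees whose root has a sibling."""
    forest = decode_forest(node)
    if len(forest) != 1:
        raise ValueError("not the encoding of a single tree")
    return forest[0]


TREE = ("a", [("b", []), ("a", [("b", []), ("b", [])]), ("b", [])])
```

Here decode(encode(TREE)) gives back TREE, which is the round trip behind the bijection claimed on the slide.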

  35. Building on bottom-up ranked trees (2) • For each unranked TA A, there is a ranked TA accepting F(L(A)) • For each ranked TA A, there is an unranked TA accepting F⁻¹(L(A)) • Both are easy to construct • Consequence: unranked TA are closed under union, intersection, complement

  36. Determinization • Always possible for bottom-up • Can we use the FirstChild-NextSibling encoding? • No: it does not preserve determinism

  37. Top-down? • This is more delicate • Transition δ(a,q) = A(a,q) • The state of the automaton A(a,q) when reading the labels of the children of a node labeled a determines the states of the children of that node • Accepts if all the leaves are in accepting states

  38. Boolean circuit evaluation • It rejects if some leaf is neither 0 in state q0 nor 1 in state q1; otherwise the circuit is accepted • [Figure: Boolean circuits drawn as trees with gates at the internal nodes and the constants 0 and 1 at the leaves]

  39. Automata: automata and monadic second-order logic

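  40. Monadic second-order logic • Representation of a tree as a logical structure • E(1,2), E(1,3)… E(3,9) • S(2,3), S(3,4), S(4,5)… S(8,9) • a(1), a(4), a(8) • b(2), b(3), b(5), b(6), b(7), b(9) • [Figure: a tree with nodes numbered 1 to 9; node 1 is labeled a, nodes 2 to 5 are labeled b, b, a, b, and nodes 6 to 9 are labeled b, b, a, b]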

  41. Monadic second-order logic • MSO syntax • Set variables • Quantification over a set variable • E(1,2), E(1,3)… E(3,9) • S(2,3), S(3,4), S(4,5)… S(8,9) • a(1), a(4), a(8) • b(2), b(3), b(5), b(6), b(7), b(9)

  42. Example of MSO • Each a node has a b-descendant • This corresponds to the formula ∀x (a(x) → ∀X ((x ∈ X ∧ ∀y ∀z ((y ∈ X ∧ E(y,z)) → z ∈ X)) → ∃y (y ∈ X ∧ b(y)))) • For each node x labeled a: each set X that contains x and is closed under descendant contains some y labeled b
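On a small tree, the set quantifier of this example can be evaluated by brute force over all subsets of nodes. The sketch below uses the tree given as a logical structure on the earlier slide; only E(1,2), E(1,3) and E(3,9) are spelled out there, so the remaining edges are an assumption (node 1 has children 2 to 5, node 3 has children 6 to 9).

```python
from itertools import combinations

NODES = range(1, 10)
# Child relation E; edges other than (1,2), (1,3), (3,9) are assumed.
CHILD = {(1, 2), (1, 3), (1, 4), (1, 5), (3, 6), (3, 7), (3, 8), (3, 9)}
LABEL = {1: "a", 2: "b", 3: "b", 4: "a", 5: "b", 6: "b", 7: "b", 8: "a", 9: "b"}


def subsets(nodes):
    nodes = list(nodes)
    for size in range(len(nodes) + 1):
        for combo in combinations(nodes, size):
            yield set(combo)


def closed_under_child(node_set, child):
    # Whenever y is in the set and E(y, z) holds, z must be in the set too.
    return all(z in node_set for (y, z) in child if y in node_set)


def each_a_has_b_descendant(nodes, child, label):
    """For each x labeled a, every closed set containing x must contain a b."""
    for x in nodes:
        if label[x] != "a":
            continue
        for node_set in subsets(nodes):
            if x in node_set and closed_under_child(node_set, child):
                if not any(label[y] == "b" for y in node_set):
                    return False
    return True
```

Under the assumed edges, nodes 4 and 8 are leaves labeled a, so the set {4} is a counterexample and the property fails on this tree. The enumeration is exponential in the number of nodes; the point of the automata correspondence is precisely to avoid it.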

  43. Bridge • Theorem: for a set L of trees, the following are equivalent: • L = L(A) for some bottom-up tree automaton A, i.e., L is definable with bottom-up tree automata • L = {T | T satisfies φ} for some MSO formula φ, i.e., L is definable in MSO

  44. XML typing: DTDs

  45. DTD • Describe the children of a node labeled a by a regular expression • Bizarre syntax <!ELEMENT populationdata (continent*) > <!ELEMENT continent (name, country*) > <!ELEMENT country (name, province*)> <!ELEMENT province (name, city*) > <!ELEMENT city (name, pop) > <!ELEMENT name (#PCDATA) > <!ELEMENT pop (#PCDATA) >
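To make the "regular expression over children" reading concrete, here is a rough sketch that turns the content models of this DTD into Python regular expressions and checks a tree against them. The helper names, the (label, children) tree encoding and the ";" separator are assumptions of the sketch; text content is ignored, and only plain element content models are handled.

```python
import re

DTD = {
    "populationdata": "(continent*)",
    "continent": "(name, country*)",
    "country": "(name, province*)",
    "province": "(name, city*)",
    "city": "(name, pop)",
    "name": "(#PCDATA)",
    "pop": "(#PCDATA)",
}


def content_model_to_regex(model):
    """Translate a DTD content model into a regex over 'name;' tokens."""
    if model.strip() == "(#PCDATA)":
        return ""  # text only: no element children allowed
    compact = re.sub(r"\s+", "", model)
    # Wrap each element name as one token; the operators *, +, ?, | are kept.
    tokens = re.sub(r"[A-Za-z_][\w.-]*", lambda m: "(?:" + m.group(0) + ";)", compact)
    return tokens.replace(",", "")  # a comma is plain concatenation


def valid(tree):
    label, children = tree
    if label not in DTD:
        return False
    word = "".join(child[0] + ";" for child in children)
    pattern = content_model_to_regex(DTD[label])
    return re.fullmatch(pattern, word) is not None and all(valid(c) for c in children)


CITY = ("city", [("name", []), ("pop", [])])
```

With these definitions, content_model_to_regex("(name, country*)") yields "((?:name;)(?:country;)*)", and valid(CITY) holds.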

  46. DTD and determinism • Regular expressions in DTD should be deterministic • Complicated definition • Intuition: the corresponding automaton should be deterministic • (a+b)*a is not • When reading <a>, one cannot tell whether it is an a from (a+b) or if it is the a of the end • (b*a)(b*a)* is an equivalent expression that is deterministic
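The claim that the two expressions are equivalent can be spot-checked by comparing them on all short words. This is a bounded test written with Python's re module (where the slide's + becomes |), not a proof of equivalence; the function names are hypothetical.

```python
import re
from itertools import product

NOT_DETERMINISTIC = re.compile(r"(a|b)*a")    # (a+b)*a in the slide's notation
DETERMINISTIC = re.compile(r"(b*a)(b*a)*")


def words(max_len):
    """All words over {a, b} of length at most max_len."""
    for length in range(max_len + 1):
        for letters in product("ab", repeat=length):
            yield "".join(letters)


def agree_up_to(max_len):
    return all(
        bool(NOT_DETERMINISTIC.fullmatch(w)) == bool(DETERMINISTIC.fullmatch(w))
        for w in words(max_len)
    )
```

Both expressions describe the words that end with a, so agree_up_to(8) holds.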

  47. Very efficient validation • It suffices to verify for each node a that the word formed by the labels of its children is accepted by the finite-state automaton Aa • Possible to type check the document while scanning it, e.g. with a SAX parser

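  48. Very efficient validation (2) • Document: <a><b><d/><d/></b><c/></a> • DTD: <!ELEMENT a (b, c)> <!ELEMENT b (d+)> • Aa: s -b-> t -c-> u (accept) • Ab: s′ -d-> t′ (accept), t′ -d-> t′ • [Figure: the document tree a(b(d, d), c) and the runs of Aa and Ab on the children of a and b]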
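Validation in a single scan can be sketched with Python's standard xml.sax module, using the document and the two declarations of this slide. The content models for c and d are not given on the slide and are assumed to be empty; regular expressions stand in for the automata Aa and Ab, and the class and function names are hypothetical.

```python
import re
import xml.sax

# Regex over child labels, each followed by ";". c and d are assumed empty.
CONTENT = {
    "a": r"b;c;",
    "b": r"(d;)+",
    "c": r"",
    "d": r"",
}


class Validator(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.stack = []    # one list of child labels per open element
        self.valid = True

    def startElement(self, name, attrs):
        if self.stack:
            self.stack[-1].append(name)  # record this child for the parent
        self.stack.append([])

    def endElement(self, name):
        children = self.stack.pop()
        word = "".join(child + ";" for child in children)
        pattern = CONTENT.get(name)
        if pattern is None or re.fullmatch(pattern, word) is None:
            self.valid = False


def validate(document):
    """Validate a document given as bytes; well-formedness is left to the parser."""
    handler = Validator()
    xml.sax.parseString(document, handler)
    return handler.valid
```

The sketch collects the children of each open element and checks the word when the element closes; running the automaton letter by letter as children arrive would avoid storing the child lists.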

  49. Warning • The previous example can be checked with a simple automaton on words • But not the following one: <!ELEMENT part ( part* ) > • A stack is needed to accept <a>…<a></a>…</a> with n opening tags <a> followed by n closing tags </a>

  50. Some bad news for DTD • Not closed under union • DTD1: … <!ELEMENT used (ad*)> <!ELEMENT ad (year, brand)> • DTD2: … <!ELEMENT new (ad*)> <!ELEMENT ad (brand)> • L(DTD1) ∪ L(DTD2) cannot be described by a DTD but can be described easily by a tree automaton • Problem: the type of ad depends on its parent • Also not closed under complement • Limited expressive power
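A tree automaton for the union can give ad two different states, which is exactly what a DTD cannot do. The sketch below makes several assumptions: used and new are treated as document roots (the slide elides the rest of both DTDs), year and brand are treated as leaves, and states are single characters so that a word of child states is a string.

```python
import re

# label -> list of rules (regular expression over child states, target state).
RULES = {
    "year": [("", "y")],
    "brand": [("", "r")],
    "ad": [("yr", "1"), ("r", "2")],  # two "types" of ad, told apart by content
    "used": [("1*", "u")],            # used cars: only ads of type 1
    "new": [("2*", "n")],             # new cars: only ads of type 2
}
ACCEPTING = {"u", "n"}


def state(tree):
    """State of the root, or None if no rule applies (deterministic here)."""
    label, children = tree
    child_states = [state(child) for child in children]
    if None in child_states:
        return None
    word = "".join(child_states)
    for pattern, target in RULES.get(label, []):
        if re.fullmatch(pattern, word):
            return target
    return None


def accepts(tree):
    return state(tree) in ACCEPTING


AD_WITH_YEAR = ("ad", [("year", []), ("brand", [])])
AD_NO_YEAR = ("ad", [("brand", [])])
```

The automaton accepts ("used", [AD_WITH_YEAR]) and ("new", [AD_NO_YEAR]) but rejects the two mixed combinations, which a single declaration for ad in a DTD cannot express.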
