1 / 42

The correction of XML data

The correction of XML data. Approximation and Edit Distance Testers and Correctors Correcting regular binary trees Applications to XML Practical corrector Relative value of documents. Université Paris II & LRI Michel de Rougemont mdr@lri.fr http://www.lri.fr/~mdr. Approximation.

summer
Download Presentation

The correction of XML data

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. The correction of XML data • Approximation and Edit Distance • Testers and Correctors • Correcting regular binary trees • Applications to XML • Practical corrector • Relative value of documents Université Paris II & LRI Michel de Rougemont mdr@lri.fr http://www.lri.fr/~mdr june 2003

  2. Approximation • Relations Dist (R,S) = # x : if Dist(R,S) < • Edit-distance • Trees: Tree-Edit-Distance Min # Deletions, Insertions Left-deletion Left-insertion june 2003

  3. Classical Tree-Edit-Distance a b Deletion a e c b a Binary trees : p-Distance allows permutation b e c d Insertion f e e c d Dist(T1,T2) =2 p-Dist (T1,T2) =1 Dist (T, L) = Min Dist (T,T’) june 2003

  4. Approximate satisfiability • Satisfiability : Tree |= F • Approximate satisfiability Tree |= F Image on a class K of trees june 2003

  5. Logic, testers, correctors A Tester decides |= for a formula F. A Corrector takes a tree T close to a language L and find T’ in L close to T. This is possible if F follows a simple logic. Theorem. there is linear time corrector for regular binary trees and a constant distance. Given a tree T, k- close to a regular language L, we find in linear time T’ in L, c.k -close to T. General problem: given a language L defined in some Logic, find a corrector. Theorem. (implicit in Alon and al. FOCS2000) There is a linear time corrector for regular words and distance Application to Model-Checking (LICS2002) june 2003

  6. Simple example 1 0 0 0 0 Tester for 0+ 1* 0+ Types of segments: 000000011111110000010000 probably accepted 011110000000110111 rejected with high probability 00000 11111 00011 11100 01100 Corrector for 0+ 1* 0+ 00000001111110000100000 * june 2003

  7. Regular Trees • Tree-automata • Logical definability on trees • Tree grammar • Regular expression r(a,b(a,b(a,b(a,b(a,b(a,b)....) r(a(a,b(a,b(a(a,b),b)....),b) june 2003

  8. Tree automata (q1,q1)q2 (q1,q0)q2 (q2,-) q2 (-,q2) q2 • (q0, q0)  q1 • (q0,q1)  q1 q1 q1 q0 q1 q0 q2 q1 q0 q1 q1 q0 q0 q0 q0 q0 q0 june 2003

  9. Feasible and infeasible subtrees Definition : a subtree t is feasible for L if there are subtrees (for its leaves) which reach states (q1...ql) such that the state of the root q=t(q1...ql) can reach an accepting state (in the automaton for L). A subtree is infeasible if it is not feasible feasible infeasible june 2003

  10. Infeasible subtrees Fact . If then the number of unfeasible subtrees of length a is O(n). Fact. If the distance is small, there are few infeasibles trees. Intuition : make local corrections at the root of the infeasible trees june 2003

  11. Structure of the corrector Phase 1 : (Bottom-up) Marking of * nodes, roots of infeasible subtrees. Phase 2 : (Top-down) Recursive analysis of the * subtrees to make root accept. Phase 3 : Local corrections q0 q1 june 2003

  12. Phase 1 : bottom-up marking • Definitions: • A terminal *-node is the first sink node of a run • A * subtree of a node v is the subtree whose root • is v reaching leaves or *-node • A node v is a *-node if its state is a sink node • when all possible reachable states replace the • *-nodes of its *-subtree. • 4. Compute the size of the subtrees * Runs with all possible reachable states (q,q’) reach a sink. * * O(n) procedure. Lemma 1: If Dist(T,L)<k, there are at most k *-nodes. june 2003

  13. Phase 2 : top-down possible states Let (q,q’) a possible choice at the top *-subtree. Let q’’ a possible state for the *-node of the left *-subtree * * * q’’ instead of * q2 q1 Correction needed. june 2003

  14. Case analysis of the automaton Case 1: One essentially-connected component. Case 2: General case Many components june 2003

  15. Case 1: one component Lemma: if (q1,q2,q’’) are in the same connected component, there is a finite subtree t which can correct. Case a : there is a transition (q,q’) to q’’ with both q,q’ in C: there is a finite tree t1 from q1 to q, a finite tree t2 from q2 to q’ and the correction is: q’’ q’’ q2 q q1 q’ t1 t2 q1 q2 june 2003

  16. Case 1: b and c Case b : there is a transition (q,q’) to q’’ with one of q or q’ being q0: suppose q=q0. The correction uses t2 and cut the left branch. Case c: there is a transition (q0,q0) to q’’. The correction cuts both branches. q’’ q’’ q0 q’ q2 q1 t2 q2 q’’ q’’ q0 q0 q2 q1 june 2003

  17. Correction rules q’’ instead of * q2 q1 june 2003

  18. Case 2 : many components Hypothesis : q1 in Ci q2 in Cj q’’ in Ck Case a: P such that Ci < Ck and Cj < Ck Find t1 and t2 as in case 1.a q’’ q’’ q2 q q1 q’ t1 t2 q1 q2 june 2003

  19. Case 2: b and c Case b,c : P such that Ci >Ck and Cj < Ck Find t2 and let Cp=inf(Ci,Ck). Cut the left branch until Cp. Case d: P such that Ci >Ck and Cj > Ck Let Cp=inf(Ci,Ck). Cut the left branch until Cp. Let Cq=inf(Cj,Ck). Cut the right branch until Cq. q’’ q’’ q’ q2 q1 t2 q2 q’’ q’’ q2 q1 june 2003

  20. Correction rules q’’ instead of * q2 q1 june 2003

  21. Analysis of the corrector Fact 1: finitely many insertions Fact 2: deletions less predictable Lemma: If the cut is large, than the distance must be large. General Corrector: 1. Do the inductive Marking bottom-up. 2. Apply the recursive analysis of compatible states top-down. 3. For each transition (q,q’) -> q’’ apply the correction, compute the distance and select the rule with smallest distance 4. Select the * states with Minimum Dist.. Procedure is O(n), exponential in k and size(Q) june 2003

  22. General result Theorem: If Dist(T,L) <k, the general corrector finds T’ such that Dist(T,T’) <c.k. Proof : # *-nodes < k Case 1: 0 *-node: no correction Case 2: at least 1 *-node. Looking at all possible k-variations will correct the errors in the *-subtree and diminish the *-nodes. june 2003

  23. Unranked trees: XML Labelled trees of large degree. Structure given by a « grammar », or DTD. Generalization of automata: 1. Unranked tree automaton 2. Tree-walking automaton Method: Code an unranked labelled tree with a binary labelled tree. Advantage: the correction table is FINITE. Theorem: If Dist(T,L) <k, the general corrector finds T’ such that Dist(T,T’) <c.k. june 2003

  24. Applications to XML DTD <?xml version='1.0' ?> <!ELEMENT book (chapter*,title,author)> <!ELEMENT chapter (title,para*)> <!ELEMENT title (#PCDATA)> <!ELEMENT para (#PCDATA)> <!ELEMENT author (#PCDATA)> Binary Normal Form l -> l1, a l1 -> c1, t c1 -> c, c1 c1 -> - c -> t, p1 p1 -> p, p1 p1 -> - a -> data t -> data p -> data june 2003

  25. XML tree decomposition XML file transformed into a binary labelled tree. june 2003

  26. XML file with errors june 2003

  27. Corrected XML file No ambiguities on the possible states of q’’ Immediate correction! june 2003

  28. XML Correction rules q’’ instead of * q2 q1 june 2003

  29. Java Implementation Parser: Xerces, Tree structure : DOM Phase 1: look at the parent node of *-node. Propose tags for * (c or f) Phase 2: for each proposal, compute the distance. *=c, distance=1, replacing c with b. *=f, distance=2, replacing c with b and adding an a leaf. Choose the 1st solution. d a * a b DTD: d (c,b,a) or (f,b,a) c (a,b,b) f (a,b,b,a) a b c june 2003

  30. Relative value of documents • Given a DTD, mark the Web documents as follows: • Infinity if there are far • Dist(Document,DTD)=i • Provides a relative valued landscape. Works for boolean combinations • Generalize to • Min{ Dist(D,DTD’) : } june 2003

  31. Distance on words and trees • On words, how can one compute • Dist(w,w’), a P-problem • Is is possible in less than O(n) ? • Yes, STOC 2003 • Dist(w,L) and Dist(L,L’) • Given two trees, how can one compute: • Dist(T,T’) P on ordered trees and NP-complete on unordered trees • p-Dist(T,T’) NP-complete. june 2003

  32. Conclusion • Testers and Correctors • Testers for approximate verification • Correctors • Trees • Regular trees are testable • If T is at distance less than k,then we can correct it. • Theoretical algorithms • Practical algorithms june 2003

  33. Testers, Correctors and formal verification • Two different views of logical verification: • Formal verification. How can we check if a • program satisfies a specification? • Logical proof: theorem proving, • model checking • Design a tester for the specification • (closer to practice: Windows 95 to XP !) • (Blum & Kanan) • Combine the two approaches to approximately • verify a specification (LICS 2002, • Sylvain’s thesis) june 2003

  34. Testers Self-testers and correctors for Linear Algebra Blum & Kanan 1985s Testers for graph properties : k-colorability Goldreich and al. 1995s graph properties have testers Alon and al. 1999 Regular languages have testers Alon and al. 2000s Testers for Regular tree languages (Mdr and Magniez) Corrector for regular trees! june 2003

  35. Blum’s Checker and Tester Checker for f (Blum, Kannan, ~1990) x y P C Correct Buggy A checker is a probabilistic program with an oracle P such that for all x,k : if P=f, C(x,k) = Correct If P(x)!=f(x), Prob[ C(x,k) =Buggy] >1- ½^k june 2003

  36. Self-testing • Distance d(f,g) = | {x : f(x) != g(x)}| / | D| • A self-tester for f is a probabilistic program T(P, ) such that : • If d(P,f)=0, then T(P, )=Correct • If d(P,f) > then T(P, )=Buggy • Corrector. Division (x,y) : Majority { x.r /y.r : r random.} june 2003

  37. Property testing on graphs 2-Colorability G Bipartite 2-colorable H H random subgraph G bipartite  Prob [ H is bipartite] =1 G is -far from bipartite  Prob [ H is non-bipartite] > 2/3 june 2003

  38. Property testing on graphs 3-Colorability G H random subgraph G 3-colorable  Prob [ H is 3-colorable] =1 G is -far from 3-colorable Prob [ H is non 3-colorable] > 2/3 Generalization to k-colorability june 2003

  39. Property testing and descriptive complexity • Which graphs (and matrices) properties have testers? • Alon and al., STOC 99: Sigma 2  testers • Compression. -equivalent june 2003

  40. Property testing on words Word W H random subword F : 0*1* W |= F  Prob [ H |= F’ ] =1 W is -far from F Prob [ H |= not F’] >2/3 june 2003

  41. A testable regular property How can we verify F : 0*1* ? Word W H random subword 0000 1111 0111 ..... F’ distance(w,w’) = Hamming distance W |= F  Prob [ H |= F’ ] =1 W is -far from F Prob [ H |= not F’] >2/3 Many 10 appear in W. Repeating the test will detect it with high probability june 2003

  42. Regular properties are testable Theorem. Regular languages are testable. N. Alon, M. Krivelevich, I. Newman, M. Szegedy FOCS 99. General idea : if a word is far from a regular language, it contains many subwords which are infeasible and can be detected. Theorem. Dyck languages are not testable june 2003

More Related