Schema & Schema Integration

Schema & Schema Integration. Carsten Karl Dennis Schade Thorsten Dollmann. Outline: the XTRACT system for inferring DTDs from a set of XML documents, and incremental validation of XML documents.


  1. Schema & Schema Integration Carsten Karl Dennis Schade Thorsten Dollmann

  2. Outline • XTRACT System for inferring DTDs from a set of XML documents • Incremental validation of XML Documents

  3. Schema & XML Databases • Databases need a Schema • DTDs serve the role of the schema of the document • Efficient storage of XML data • Optimization of XML queries • DTDs are not mandatory!

  4. XTRACT • Goal: Infer DTDs from a set of XML documents

  5. Problem Simplification and Abstraction • Infer a DTD for each tag separately • Separate example sequences for each <e> • Infer a “good” DTD for each <e> • Resulting document DTD is a composition of all inferred “tag”-DTDs

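  6. Example: book <book> <title> </title> <author> <name> </name> <age> </age> </author> <author> <name> </name> </author> <editor> <name> </name> </editor> </book> • Tree: book has the children title, author, author, editor • the first author has the children name, age • the second author and the editor each have the single child name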
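The per-tag separation from slide 5 can be tried on this book document. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the function name `child_sequences` is made up for this illustration and is not part of XTRACT.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# The book document from the slide.
BOOK = (
    "<book> <title> </title> "
    "<author> <name> </name> <age> </age> </author> "
    "<author> <name> </name> </author> "
    "<editor> <name> </name> </editor> </book>"
)

def child_sequences(xml_text):
    """Map each tag to the list of child-tag sequences observed for it."""
    sequences = defaultdict(list)
    # iter() visits the root and every descendant in document order.
    for element in ET.fromstring(xml_text).iter():
        sequences[element.tag].append([child.tag for child in element])
    return dict(sequences)

sequences = child_sequences(BOOK)
for tag, seqs in sequences.items():
    print(tag, seqs)
```

Each tag now has its own set of example sequences (two for author, one for book), which is the input the per-tag inference step works on.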

  18. What is a “good” DTD ? Given the example sequence set I = { ab, abab, ababab } Possible DTDs:
      Candidate DTD        Concise   Precise
      (a|b)*               Yes       No
      (ab|abab|ababab)     No        Yes
      ab|ab(ab|abab)       No        Yes
      (ab)*                Yes       Somewhat
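One rough way to see the tradeoff in this table is to treat each candidate as a Python regular expression, use its text length as a stand-in for conciseness, and count how many strings it accepts as a stand-in for precision. This is only an illustrative sketch; the bound `max_len=6` and the helper name `accepted` are choices made here, not part of the slides.

```python
import re
from itertools import product

EXAMPLES = ["ab", "abab", "ababab"]
CANDIDATES = ["(a|b)*", "(ab|abab|ababab)", "ab|ab(ab|abab)", "(ab)*"]

def accepted(pattern, max_len=6, alphabet="ab"):
    """All strings over `alphabet` of length <= max_len that the pattern fully matches."""
    regex = re.compile(pattern)
    return [
        "".join(chars)
        for n in range(max_len + 1)
        for chars in product(alphabet, repeat=n)
        if regex.fullmatch("".join(chars))
    ]

for pattern in CANDIDATES:
    # Every candidate must at least cover the example sequences.
    assert all(re.fullmatch(pattern, s) for s in EXAMPLES)
    print(f"{pattern:20} length={len(pattern):2} accepts={len(accepted(pattern))}")
```

The short pattern (a|b)* accepts every string over {a, b}, the two enumerating patterns accept exactly the examples, and (ab)* sits in between.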

  19. What is a “good” DTD ? (ctd.) • A good DTD D must satisfy two restrictions • R1: D should be concise • R2: D should be precise • Minimum Description Length quantifies and resolves the tradeoff between R1 and R2

  20. The MDL Principle • MDL principle states: The best theory to infer from a given set of data is the one which minimizes the sum of • The length of the theory in bits • The length of the data, in bits, when encoded with the help of the theory
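Written as a formula, the principle might look as follows. The symbols are notation assumed here for illustration: S is the set of candidate DTDs, I the set of example sequences, L(D) the length of DTD D in bits, and L(I | D) the length of I when encoded with the help of D.

```latex
D^{*} \;=\; \arg\min_{D \in S} \bigl[\, L(D) + L(I \mid D) \,\bigr]
```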

  21. Overview of XTRACT System • Input sequences: I = { ab, abab, ac, ad, bc, bd, bbd, bbbe } • Generalization module: Sg = I ∪ { (ab)*, (a|b)*, b*d, b*e } • Factoring module: Sf = Sg ∪ { (a|b)(c|d), b*(d|e) } • MDL module: inferred DTD (ab)* | (a|b)(c|d) | b*(d|e)
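As a quick sanity check on this overview, the inferred DTD can be read as a Python regular expression and matched against every input sequence. The sketch below only verifies coverage; it does not reproduce how XTRACT arrives at the result, and the variable names are chosen here.

```python
import re

INPUTS = ["ab", "abab", "ac", "ad", "bc", "bd", "bbd", "bbbe"]
BRANCHES = ["(ab)*", "(a|b)(c|d)", "b*(d|e)"]
INFERRED = "|".join(BRANCHES)  # (ab)*|(a|b)(c|d)|b*(d|e)

# For each input sequence, list the branches of the inferred DTD that cover it.
covering = {
    s: [branch for branch in BRANCHES if re.fullmatch(branch, s)]
    for s in INPUTS
}

for s, branches in covering.items():
    print(f"{s:5} covered by {branches}")

all_covered = all(re.fullmatch(INFERRED, s) for s in INPUTS)
print("all inputs covered:", all_covered)
```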

  22. MDL Subsystem • In order to use the MDL principle, we need to • Define theory description length • Define data description length • Solve the resulting minimization problem

  23. MDL Coding scheme • Description length of a DTD: number of characters of the DTD • Cost of encoding the example sequences: • encoding of b in terms of DTD a | b | c is 1 (position of b in the DTD), cost 1 • encoding of bbb in terms of DTD b* is 3 (number of repetitions of b), cost 1 • encoding of b in terms of DTD b is empty, cost 0
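The three encoding examples above can be reproduced with a small cost function. This is a simplified sketch under stated assumptions: a DTD is represented by a hand-built tree of the hypothetical classes `Sym`, `Seq`, `Or` and `Star`, every choice in an alternation costs 1, every repetition count costs 1, and a plain symbol costs 0. The actual XTRACT encoding may differ in detail, and the five sequences used at the end are example data chosen here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Sym:
    name: str       # a single symbol such as "a"

@dataclass(frozen=True)
class Seq:
    parts: tuple    # concatenation, e.g. a b*

@dataclass(frozen=True)
class Or:
    options: tuple  # alternation, e.g. a|b|c

@dataclass(frozen=True)
class Star:
    body: object    # repetition, e.g. b*

def encodings(node, s, i=0):
    """Yield (end, cost) for every way `node` can consume s[i:end]."""
    if isinstance(node, Sym):
        if s[i:i + 1] == node.name:
            yield i + 1, 0                      # nothing to encode
    elif isinstance(node, Seq):
        def walk(k, pos, cost):
            if k == len(node.parts):
                yield pos, cost
                return
            for end, c in encodings(node.parts[k], s, pos):
                yield from walk(k + 1, end, cost + c)
        yield from walk(0, i, 0)
    elif isinstance(node, Or):
        for option in node.options:
            for end, c in encodings(option, s, i):
                yield end, 1 + c                # 1 for the chosen position
    elif isinstance(node, Star):
        def repeat(pos, cost):
            yield pos, cost
            for end, c in encodings(node.body, s, pos):
                if end > pos:                   # guard against empty loops
                    yield from repeat(end, cost + c)
        yield from repeat(i, 1)                 # 1 for the repetition count

def encoding_cost(node, s):
    """Cheapest cost of encoding all of `s` with `node`, or None if no match."""
    costs = [cost for end, cost in encodings(node, s) if end == len(s)]
    return min(costs) if costs else None

def mdl_cost(dtd_text, node, sequences):
    """DTD length in characters plus the cost of encoding every sequence."""
    return len(dtd_text) + sum(encoding_cost(node, s) for s in sequences)

a, b, c = Sym("a"), Sym("b"), Sym("c")
print(encoding_cost(Or((a, b, c)), "b"))    # b in terms of a|b|c
print(encoding_cost(Star(b), "bbb"))        # bbb in terms of b*
print(encoding_cost(b, "b"))                # b in terms of b
print(mdl_cost("ab*", Seq((a, Star(b))), ["ab", "abb", "abbb", "abbbb", "abbbbb"]))
```

Taking the minimum over all possible encodings is a design choice made here so that ambiguous DTDs still get a single cost.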

  24. MDL Subsystem Minimization [Figure, built up over slides 24-28: input sequences such as ab, abb, abbb, abbbb are linked to candidate DTDs such as (a|b)*, ab* and abb. Each link is labeled with the cost of encoding that sequence with that DTD, for example ab under (a|b)* costs 3 = 1* + (1a + 1b), and each sequence costs 1 under ab*. The candidates carry the labels 30 for (a|b)*, 8 for ab* and 3 for abb.]

  30. Generalization Subsystem • Goal: • Infer regular expressions from example sequences • Produce candidate DTDs such as a*bc,(abc)*, (a|b|c)*,((ab)*c)* • Generate more general DTDs • Two heuristics: • DiscoverSeqPattern(s,r): s=abbbbc => ab*c • DiscoverOrPattern(s,d): s=abacbc => (a|b|c)* • Candidate DTDs are generated by calling the above functions for appropriate values of r and d

  31. DiscoverSeqPattern Example • The pattern must occur at least two times: r=2 • s = ababababcabcababc • => (ab)*cabc(ab)*c • => ((ab)*c)*
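The core idea of DiscoverSeqPattern, collapsing a block that repeats at least r times in a row into a starred term, can be sketched as follows. This is a simplified illustration rather than the XTRACT procedure itself: it only collapses consecutive runs, prefers the run with the most repetitions and then the shorter block (a tie-break chosen here), and does not perform the final step that turns the example into ((ab)*c)*.

```python
def discover_seq_pattern(s, r=2):
    """Repeatedly replace the longest run of >= r consecutive copies of a block by (block)*."""
    tokens = list(s)
    while True:
        best = None  # (repetitions, start, block size)
        n = len(tokens)
        for start in range(n):
            # A block can only repeat r times if r copies fit in the remainder.
            for size in range(1, (n - start) // r + 1):
                block = tokens[start:start + size]
                reps = 1
                while tokens[start + reps * size:start + (reps + 1) * size] == block:
                    reps += 1
                if reps < r:
                    continue
                # Prefer more repetitions, then the shorter block.
                if best is None or (reps, -size) > (best[0], -best[2]):
                    best = (reps, start, size)
        if best is None:
            return "".join(tokens)
        reps, start, size = best
        body = "".join(tokens[start:start + size])
        starred = f"{body}*" if len(body) == 1 else f"({body})*"
        # The whole run becomes one auxiliary token.
        tokens[start:start + reps * size] = [starred]

print(discover_seq_pattern("abbbbc"))
print(discover_seq_pattern("ababababcabcababc"))
```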

  32. DiscoverOrPattern Example Given: • the example sequence s=axcxac • distance parameter d=2 a x c x a c

  33. DiscoverOrPattern Example Given: • the example sequence s=axcxac • distance parameter d=2 a x c x a c Step 1: Partition

  39. DiscoverOrPattern Example Given: • the example sequence s=axcxac • distance parameter d=2 a x c x a c Step 2: replace pattern a1…an by (a1|..|an)*

  40. DiscoverOrPattern Example Given: • the example sequence s=axcxac • distance parameter d=2 Step 2: replace pattern a1…an by (a1|..|an)* => a ( x | c ) * a c

  41. DiscoverOrPattern Example Given: • the example sequence s=axcxac • distance parameter d=2 a ( x | c ) * a c • x is an auxiliary symbol introduced by DiscoverSeqPattern: x = ((de)*e)* • Substituting x gives a ( ((de)*e)* | c ) * a c
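The two steps can be sketched in Python. The slides do not spell out the partition rule, so the code below assumes one: two occurrences of the same symbol that are at most d positions apart must end up in the same block, and overlapping blocks are merged. Treat the rule and the function name as assumptions made for this illustration.

```python
def discover_or_pattern(s, d):
    """Partition s by the assumed distance rule, then turn each block into (a1|..|an)*."""
    # Step 1: collect spans between repeated symbols that are at most d apart.
    last_seen = {}
    spans = []
    for j, symbol in enumerate(s):
        i = last_seen.get(symbol)
        if i is not None and j - i <= d:
            spans.append((i, j))
        last_seen[symbol] = j

    # Merge spans that overlap or share a position into blocks.
    blocks = []
    for start, end in sorted(spans):
        if blocks and start <= blocks[-1][1]:
            blocks[-1][1] = max(blocks[-1][1], end)
        else:
            blocks.append([start, end])

    # Step 2: replace each block a1...an by (a1|..|an)* over its distinct symbols.
    out, pos = [], 0
    for start, end in blocks:
        out.append(s[pos:start])
        distinct = list(dict.fromkeys(s[start:end + 1]))  # keeps first-seen order
        out.append("(" + "|".join(distinct) + ")*")
        pos = end + 1
    out.append(s[pos:])
    return "".join(out)

print(discover_or_pattern("axcxac", 2))
print(discover_or_pattern("abacbc", 3))
```

With s=axcxac and d=2 only the two x occurrences are close enough, which yields the block xcx. For the sequence abacbc from slide 30, d=3 is a value picked here that produces a single block.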

  48. Factoring Subsystem • Goal: Combine different candidates to derive more compact, factored DTDs • Example candidate set Sg = { ac, ad, bc, bd } • ac | ad | bc | bd => a(c|d) | b(c|d) => (a|b)(c|d) • Reduces MDL description length of the candidate DTDs • Adoption of factoring algorithms for Boolean expressions • Use heuristic algorithm for selecting subsets of candidate DTDs that give a good factored form
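The two rewriting steps of the example can be mimicked with a few lines of Python. This is a toy sketch, not the Boolean-expression factoring algorithm the slide refers to: it assumes every candidate is a plain string of at least two symbols, factors on the first symbol only, and uses function names invented here.

```python
from collections import defaultdict

def alt(items):
    """Render a list of terms as a single term or as an alternation."""
    return items[0] if len(items) == 1 else "(" + "|".join(items) + ")"

def factor(candidates):
    # Step 1: group by first symbol, e.g. ac, ad -> a(c|d).
    by_prefix = defaultdict(list)
    for candidate in candidates:
        by_prefix[candidate[0]].append(candidate[1:])

    # Step 2: merge prefixes that share the same set of suffixes,
    # e.g. a(c|d) | b(c|d) -> (a|b)(c|d).
    by_suffixes = defaultdict(list)
    for prefix, suffixes in by_prefix.items():
        by_suffixes[tuple(sorted(suffixes))].append(prefix)

    terms = [alt(prefixes) + alt(list(suffixes))
             for suffixes, prefixes in by_suffixes.items()]
    return " | ".join(terms)

print(factor(["ac", "ad", "bc", "bd"]))
print(factor(["ac", "ad", "be"]))
```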

  49. Factoring Subsystem Heuristics • Choose subsets S of candidate DTDs from Sg such that • DTDs in S have a common prefix p or suffix s • the number of DTDs with this common prefix in Sg is high

  50. Factoring Prefixes • Input sequences: abcddd, abceee, abcfff, abcggg • Candidate DTDs: abcd*, abce*, abcf*, abcg* • Factored DTD: abc(d*|e*|f*|g*) • Longer prefixes result in MDL cost reduction • Factored DTD covers all input sequences
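The prefix example can be checked with a short sketch. It assumes the candidates are plain strings whose common prefix does not cut through a parenthesized group, and it uses character counts as a rough stand-in for description length; `factor_common_prefix` is a name chosen for this illustration.

```python
import os.path
import re

def factor_common_prefix(candidates):
    """Pull the longest common prefix out of the candidates: p(r1|r2|...)."""
    prefix = os.path.commonprefix(candidates)  # character-wise common prefix
    rests = [candidate[len(prefix):] for candidate in candidates]
    return f"{prefix}({'|'.join(rests)})"

SEQUENCES = ["abcddd", "abceee", "abcfff", "abcggg"]
CANDIDATES = ["abcd*", "abce*", "abcf*", "abcg*"]

factored = factor_common_prefix(CANDIDATES)
unfactored = "|".join(CANDIDATES)

print(factored, len(factored))
print(unfactored, len(unfactored))

# The factored DTD still covers all input sequences.
covers_all = all(re.fullmatch(factored, s) for s in SEQUENCES)
print("covers all sequences:", covers_all)
```

The factored form writes the shared prefix abc once instead of four times, which is where the shorter description comes from.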
