
Inference of Concise DTDs from XML data

Geert Jan Bex (1), Frank Neven (1), Thomas Schwentick (2), Karl Tuyls (3). (1) Hasselt University and Transnational University of Limburg; (2) Dortmund University; (3) Maastricht University and Transnational University of Limburg.


Presentation Transcript


  1. Inference of Concise DTDs from XML data • Geert Jan Bex (1), Frank Neven (1), Thomas Schwentick (2), Karl Tuyls (3) • (1) Hasselt University and Transnational University of Limburg • (2) Dortmund University • (3) Maastricht University and Transnational University of Limburg

  2. Outline • Goals & motivation • Problem setting • iDTD: Sample → SOA → SORE • CRX: Sample → CHARE • Experiments • Extensions • Conclusions

  3. Aims & requirements • Problem: infer a DTD from an XML corpus (XML → DTD) • Requirements: • Concise: humans can interpret/validate • Works on large data sets • Works on small data sets • Robust to noise

  4. Why DTD inference? • Schema inference • ≈ 50% of XML documents: no schema [Barbosa et al. 2005] • ≈ 66% of DTDs and XSDs: not valid [Bex et al. 2005] • Improving existing schemas • “Noisy” XML documents: ≈ 90% of XHTML docs not valid • Related work • Fails on real-world, large data sets • Results not concise

  5. Why schemas? • Validation: efficiency, security • Optimization: search, processing • Static analysis, type checking (e.g., XQuery) • Software development: modeling, OR-mapping • Integration: (meta-)data sources • Schema matching • Semantics

  6. Outline • Goals & motivation • Problem setting • iDTD: Sample → SOA → SORE • CRX: Sample → CHARE • Experiments • Extensions • Conclusions

  7. Learning a regular expression from a set of strings • XML documents: … book (title editor year isbn) … book (title author author year) … • Inferred content model for book: title (author+ + editor+) year isbn?

  8. Learning automata? • Well studied, but… • Learning automata ≠ learning regular expressions • Example target expression: ((b?(a+c))+ d)+ e

  9. Learning regular languages? • S = { abbb, abbd, acd, ac } • abbb + abbd + acd + ac: most specific regex for S • (a + b + c + d)*: most general regex for S • In between: abbb + abbd + acd + ac < a (b* + c) d? ??? < (a + b + c + d)* • Positive examples only! • Generalization vs. specificity • Impossible… in general
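The tension on this slide can be checked directly. The following is a minimal sketch, assuming the slide's union operator "+" is written as "|" in Python's `re` syntax; the helper name `accepts` and the probe words "ab" and "dcba" are illustrative choices, not part of the talk.

```python
import re

# Sample S from the slide (positive examples only).
S = ["abbb", "abbd", "acd", "ac"]

# The slide's union operator "+" becomes "|" in Python's re syntax.
candidates = {
    "most specific": r"abbb|abbd|acd|ac",
    "in between": r"a(b*|c)d?",
    "most general": r"[abcd]*",
}

def accepts(pattern: str, word: str) -> bool:
    """True if the whole word matches the pattern."""
    return re.fullmatch(pattern, word) is not None

for name, pattern in candidates.items():
    covers_sample = all(accepts(pattern, w) for w in S)
    # "ab" and "dcba" are not in S; they probe how far each candidate generalizes.
    print(name, covers_sample, accepts(pattern, "ab"), accepts(pattern, "dcba"))
```

All three candidates are consistent with S, so positive examples alone cannot say which one was intended.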

  10. Subclasses • Single Occurrence Regular Expressions (SOREs) • 99% of regular expressions in DTDs/XSDs • Infer with iDTD • CHAin Regular Expressions (CHAREs) • 90% of regular expressions in DTDs/XSDs • Infer with CRX

  11. Outline • Goals & motivation • Problem setting • iDTD: Sample → SOA → SORE • CRX: Sample → CHARE • Experiments • Extensions • Conclusions

  12. SOREs • What’s a SORE: • header . protein . organism . reference* . comment* . genetics* . complex* . function* . classification? . keywords? . feature* . summary . sequence • authors . citation . volume? . month? . year . pages? . (title + descr)? . xrefs? • title . (author . affiliation?)+ . abstract • … and what’s not: • title . ((author . affiliation)+ + (editor . affiliation)+) . abstract (duplicate element names)
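The single-occurrence condition is easy to test mechanically. This is a minimal sketch, assuming content models are written as plain strings in the slide's notation; the function names `repeated_names` and `is_sore` are hypothetical, and the check only covers the "each element name occurs at most once" requirement.

```python
import re
from collections import Counter

def repeated_names(expr: str) -> list[str]:
    """Element names that occur more than once in a content-model string."""
    names = re.findall(r"[A-Za-z_][\w-]*", expr)
    return sorted(name for name, n in Counter(names).items() if n > 1)

def is_sore(expr: str) -> bool:
    """Single occurrence: every element name appears at most once."""
    return not repeated_names(expr)

good = "title . (author . affiliation?)+ . abstract"
bad = "title . ((author . affiliation)+ + (editor . affiliation)+) . abstract"
print(is_sore(good), repeated_names(good))
print(is_sore(bad), repeated_names(bad))
```

On the two slide examples, the second expression is rejected because `affiliation` appears twice.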

  13. Sample → SOA • W = {bacacdacde, cbacdbacde, abccaadcde} • 2T-Inf [Garcia & Vidal 1990] • Result: Single Occurrence Automaton (SOA) over the symbols a, b, c, d, e
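The construction can be sketched in a few lines. This is a minimal sketch, assuming the SOA has one state per symbol plus a source and a sink, and one edge per pair of adjacent symbols seen in the sample; the names `two_t_inf`, `accepts`, `SRC` and `SINK` are hypothetical and are not taken from the slides or from Garcia & Vidal.

```python
SRC, SINK = "<src>", "<sink>"  # hypothetical labels for the two extra states

def two_t_inf(sample):
    """Collect every pair of adjacent symbols (2-gram) as an SOA edge."""
    edges = {}
    for word in sample:
        path = [SRC, *word, SINK]
        for left, right in zip(path, path[1:]):
            edges.setdefault(left, set()).add(right)
    return edges

def accepts(edges, word):
    """A word is accepted if each of its 2-grams is an edge of the SOA."""
    path = [SRC, *word, SINK]
    return all(right in edges.get(left, ()) for left, right in zip(path, path[1:]))

W = ["bacacdacde", "cbacdbacde", "abccaadcde"]
soa = two_t_inf(W)
states = set(soa) | {s for targets in soa.values() for s in targets}
print(len(states))                      # five symbols plus source and sink
print(all(accepts(soa, w) for w in W))  # every sample word is accepted
print(accepts(soa, "ade"))              # a word outside W is accepted too
```

The last line shows the generalization step: "ade" is not in W, yet all of its adjacent pairs were observed, so the automaton accepts it.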

  14. Sample → SOA • SOA size • |Σ| + 2 states • O(|Σ|²) transitions • Complexity of algorithm • O(||W||) • streaming • Algorithm sound • W ⊆ L(SOA) • in general: |S| ≪ |L(SOA)|

  15. SOA → SORE: REWRITE • optional: b becomes b? • disjunction: a, c become a+c • concatenation: b?, a+c become b? (a+c) • self-loop: b? (a+c) becomes (b? (a+c))+ • Result: ((b? (a+c))+ d)+ e

  16. REWRITE: properties • Theorem • REWRITE transforms an SOA into an equivalent SORE for sufficient data, and reports failure otherwise (sound & complete) • Complexity: O(|Σ|⁴) • SORE size • |Σ| symbols • O(|Σ|) operators

  17. REWRITE + repairs = iDTD • W = {bacacdacde, cbacdbacde} • No rules apply! • Almost a disjunction: a, c • Fix: enable-disjunction, enable-optional • Result: ((b? (a+c))+ d)+ e

  18. iDTD: properties • Theorem • iDTD transforms an SOA into a SORE such that L(SOA) ⊆ L(SORE) • iDTD can be parameterized for performance

  19. Outline • Goals & motivation • Problem setting • iDTD: Sample → SOA → SORE • CRX: Sample → CHARE • Experiments • Extensions • Conclusions

  20. CHAREs • Definition: a chain regular expression is a sequence of factors f1 … fn such that no alphabet symbol occurs more than once and each factor is one of • (a1 + … + ak) • (a1 + … + ak)? • (a1 + … + ak)+ • (a1 + … + ak)* • CRX (Chain Regular expression eXtraction) derives CHAin Regular Expressions
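The definition maps naturally onto a small data structure. This is a minimal sketch, assuming a CHARE is stored as a list of `(symbols, multiplicity)` pairs and that a sequence of element names is encoded as a string with one trailing space per name; `chare_to_regex` and `matches` are hypothetical helper names.

```python
import re

def chare_to_regex(factors):
    """Compile a list of (symbols, multiplicity) factors into a Python regex."""
    seen, parts = set(), []
    for symbols, mult in factors:
        if mult not in ("", "?", "+", "*"):
            raise ValueError(f"unknown multiplicity: {mult!r}")
        if len(set(symbols)) != len(symbols) or seen & set(symbols):
            raise ValueError("a symbol occurs more than once")
        seen |= set(symbols)
        # Each name is matched together with its trailing space.
        alternatives = "|".join(re.escape(s) + " " for s in symbols)
        parts.append(f"(?:{alternatives}){mult}")
    return re.compile("".join(parts))

def matches(chare, names):
    """Check a sequence of element names against a compiled CHARE."""
    return chare.fullmatch("".join(n + " " for n in names)) is not None

citation = chare_to_regex([
    (("authors",), ""), (("citation",), ""), (("volume",), "?"),
    (("month",), "?"), (("year",), ""), (("pages",), "?"),
    (("title", "descr"), "?"), (("xrefs",), "?"),
])
print(matches(citation, ["authors", "citation", "year", "title", "xrefs"]))
print(matches(citation, ["authors", "year"]))
```

The factor list used here is the second chain example from the talk; a factor list that repeats a symbol is rejected with a `ValueError`.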

  21. CHAREs • What’s a chain: • header . protein . organism . reference* . comment* . genetics* . complex* . function* . classification? . keywords? . feature* . summary . sequence • authors . citation . volume? . month? . year . pages? . (title + descr)? . xrefs? • … and what’s not: • title . (author . affiliation?)+ . abstract (not a factor) • title . ((author . affiliation)+ + (editor . affiliation)+) . abstract (duplicate element names)

  22. Pre-order relation W a b b c c d d e h i c f b d g a e c a a d b f f e e g f h h i CRX run: pre-order relation Sample W a b c c d e c c c a d b f e g b f h i

  23. CRX run: transitive closure • If a →W b and b →W c, then a →W c • Sample W: a b c c d e c c c a d b f e g b f h i [figure: graph over the symbols a … i]

  24. CRX run: transitive closure • If a →W b and b →W a, then a ≈W b • Equivalence class: {a, b, c} • Each symbol occurs in exactly one equivalence class • Nodes: {a,b,c}, d, e, f, g, h, i • Sample W: a b c c d e c c c a d b f e g b f h i
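The first CRX steps (pre-order relation, transitive closure, equivalence classes) can be sketched as follows. This is a minimal sketch under two assumptions: the flattened sample is split into the four words abccde, cccad, bfeg, bfhi (the word boundaries are not preserved in the transcript), and a →W b holds when b directly follows a in some word. The function name `crx_classes` is hypothetical.

```python
from itertools import product

def crx_classes(sample):
    """Equivalence classes of symbols that can reach each other."""
    symbols = sorted({s for word in sample for s in word})
    # reach[a] holds every symbol reachable from a (reflexive by construction).
    reach = {a: {a} for a in symbols}
    for word in sample:
        for a, b in zip(word, word[1:]):
            reach[a].add(b)
    # Warshall-style transitive closure; k is the outermost loop variable.
    for k, a, b in product(symbols, repeat=3):
        if k in reach[a] and b in reach[k]:
            reach[a].add(b)
    # Two symbols are equivalent when each one reaches the other.
    classes = {
        "".join(b for b in symbols if b in reach[a] and a in reach[b])
        for a in symbols
    }
    return sorted(classes)

W = ["abccde", "cccad", "bfeg", "bfhi"]  # assumed word split of the slide's sample
print(crx_classes(W))
```

Under that word split, a, b and c fall into one class and every other symbol stays a singleton, which matches the nodes shown in the talk.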

  25. CRX run: folding partial order →W • Predecessor set: pred(γ) = {γ’ | γ’ →W γ} • Successor set: succ(γ) = {γ’ | γ →W γ’} • Nodes: {a,b,c}, d, e, f, g, h, i • Sample W: a b c c d e c c c a d b f e g b f h i

  26. CRX run: folding partial order →W • pred(γ) = {γ’ | γ’ →W γ} • succ(γ) = {γ’ | γ →W γ’} • Nodes after folding: {a,b,c}, {d,f}, e, g, h, i • →W: partial order • Sample W: a b c c d e c c c a d b f e g b f h i

  27. CRX run: multiplicity & RE • Topological sort: {a,b,c} . {d,f} . e . g . h . i • Multiplicities: {a,b,c}: +, {d,f}: none, e: ?, g: ?, h: ?, i: ? • Chain Regular Expression: (a + b + c)+ (d + f) e? g? h? i? • Sample W: a b c c d e c c c a d b f e g b f h i
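The multiplicity step can be sketched by counting, per word, how many symbols of a factor occur. This is a minimal sketch, not the authors' implementation: it assumes the same four-word split of the sample as above (abccde, cccad, bfeg, bfhi), takes the folded and topologically sorted chain from the slide as given input, and uses the natural min/max counting rule; `multiplicity` and `render` are hypothetical names.

```python
def multiplicity(symbols, sample):
    """Pick '', '?', '+' or '*' from per-word occurrence counts of the factor."""
    counts = [sum(word.count(s) for s in symbols) for word in sample]
    low, high = min(counts), max(counts)
    if high <= 1:
        return "" if low == 1 else "?"
    return "+" if low >= 1 else "*"

def render(chain, sample):
    """Write the chain in the slide's notation, one factor per class."""
    parts = []
    for symbols in chain:
        body = symbols[0] if len(symbols) == 1 else "(" + " + ".join(symbols) + ")"
        parts.append(body + multiplicity(symbols, sample))
    return " ".join(parts)

W = ["abccde", "cccad", "bfeg", "bfhi"]   # assumed word split of the slide's sample
chain = ["abc", "df", "e", "g", "h", "i"]  # folded, topologically sorted classes
print(render(chain, W))
```

With these inputs the sketch prints the chain regular expression shown on the slide, (a + b + c)+ (d + f) e? g? h? i?.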

  28. CRX algorithm: properties • Optimality: if →W is linearly ordered, then for every CHARE r with W ⊆ L(r) and L(r) ⊆ L(rW): rW = r • Performance: O(||W|| + |Σ|³) • Training set size: any CHARE r can be learned from {w | w ∈ L(r) ∧ ∀w’ ∈ L(r): |w| ≤ |w’| + 2}

  29. Outline • Goals & motivation • Problem setting • iDTD: Sample → SOA → SORE • CRX: Sample → CHARE • Experiments • Extensions • Conclusions

  30. Related work • XTRACT [Garofalakis et al. 2000] • Pioneer • More general than iDTD • Focuses on regular expressions that don’t occur in real DTDs → no concise schemas • Trang: roughly equivalent to CRX • Inconsistent results

  31. Data • Real world regular expressions • SOREs • Non SOREs • Real world data when available • Synthetic data otherwise

  32. real world data

  33. real world regexes

  34. Experiments: generalization [plot: CRX, iDTD, no repairs]

  35. Experiments: generalization [plot: CRX, iDTD]

  36. Outline • Goals & motivation • Problem setting • iDTD: Sample → SOA → SORE • CRX: Sample → CHARE • Experiments • Extensions • Conclusions

  37. Extensions • Incremental computation • new data → update internal representation (SOA or partial order) • Noise • Support for element name too small → ignore element • SOA: support for edges too small → delete edges before repair • Numerical predicates • Bookkeeping: minOccurs, maxOccurs • Generating XSDs • Infer data types (integer, double, date, …)
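The noise idea for SOA edges can be illustrated with simple bookkeeping. This is a minimal sketch under a stated assumption: "support" is taken to mean the number of sample words that contain an edge, which the slides do not define. The functions `edge_support` and `prune` and the toy sample are hypothetical and not taken from the talk.

```python
from collections import Counter

def edge_support(sample):
    """Count, for each pair of adjacent symbols, how many words contain it."""
    support = Counter()
    for word in sample:
        # A set ensures each word contributes at most once per edge.
        support.update(set(zip(word, word[1:])))
    return support

def prune(support, min_support):
    """Keep only the edges seen in at least min_support words."""
    return {edge for edge, n in support.items() if n >= min_support}

toy_sample = ["abc", "abc", "abc", "acb"]  # made-up data: the last word plays the noise
support = edge_support(toy_sample)
print(sorted(prune(support, min_support=2)))
```

Because the counter only ever grows, the same structure can also be updated incrementally as new words arrive, in the spirit of the first bullet.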

  38. Outline • Goals & motivation • Problem setting • iDTD: Sample → SOA → SORE • CRX: Sample → CHARE • Experiments • Extensions • Conclusions

  39. Conclusions • iDTD + CRX • learn a robust class of regexes from positive examples • are complete in their target class for sufficient data • deal with insufficient data • perform well on real-world data • run efficiently • Future work: inferring XML Schemas
